Please use this identifier to cite or link to this item: http://localhost:8081/xmlui/handle/123456789/11825
Title: FRAMEWORK FOR WEB DOCUMENT CLASSIFICATION BASED ON NAIVE BAYESIAN CLASSIFIER USING VOTING METHOD
Authors: Rajesh, G.
Keywords: ELECTRONICS AND COMPUTER ENGINEERING;WEB DOCUMENT CLASSIFICATION;NAIVE BAYESIAN CLASSIFIER;VOTING METHOD
Issue Date: 2008
Abstract: Automatic web document classification is the process of assigning a web document to one or more predefined categories. As the amount of information available on the World Wide Web (WWW) keeps increasing, the web page classification problem grows in importance. Because information flows through the WWW at high speed, it needs to be organized so that users can access it easily. Previously, information was generally organized manually, by matching document contents to predefined classes. In that approach a human expert performs the classification task; alternatively, supervised classifiers can classify documents automatically. Supervised classification requires manual effort only to create training data before automatic classification takes place, which reduces human participation. In this dissertation we propose a framework for web document classification that draws on both semantic and structural keyword features. The proposed system is based on a Naive Bayesian (NB) classifier that uses a voting method over two different feature selection methods. The system uses both latent semantic indexing (LSI) and a structure-oriented weighting technique (SWT) for feature selection, and training and classification are performed with the Naive Bayesian classifier. The latent semantic indexing method projects terms and documents into a Boolean term-document matrix to find latent information in the document. In parallel, the structure-oriented weighting technique projects terms and documents into a weighted term-document matrix. The two feature sets are passed to the NB classifier for training and testing. Based on the output of the NB classifier, a voting method determines the suitable class of the web page.
By using the voting method, we take advantage of both the semantic relationships between terms and documents and the structure of the HTML document to improve classifier accuracy. The proposed framework describes training and testing the classifier on two different feature vectors. The methods have been evaluated on web pages from Yahoo directories using three measures: recall, precision and F-measure. The results show that the proposed method performs significantly better than using LSI features or SWT features alone.
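The abstract does not state the exact voting rule, so the following Python sketch is only an illustration under stated assumptions: it assumes each of the two NB classifiers (one trained on LSI features, one on SWT features) outputs a posterior probability per class, and that the vote sums these posteriors and picks the highest total. The function names `vote` and `precision_recall_f_measure`, and the example class labels, are hypothetical and do not come from the dissertation. The second function shows the standard per-class definitions of the three evaluation measures named above.

```python
from typing import Dict, Sequence, Tuple


def vote(lsi_posteriors: Dict[str, float], swt_posteriors: Dict[str, float]) -> str:
    """Assumed soft vote: sum both classifiers' class posteriors, return the top class."""
    classes = set(lsi_posteriors) | set(swt_posteriors)
    combined = {
        c: lsi_posteriors.get(c, 0.0) + swt_posteriors.get(c, 0.0) for c in classes
    }
    # Highest combined score wins; ties are broken alphabetically for determinism.
    return min(combined, key=lambda c: (-combined[c], c))


def precision_recall_f_measure(
    true_labels: Sequence[str], predicted_labels: Sequence[str], target: str
) -> Tuple[float, float, float]:
    """Standard per-class precision, recall and F-measure (harmonic mean)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical posteriors from the LSI-based and SWT-based NB classifiers.
    lsi = {"sports": 0.6, "science": 0.4}
    swt = {"sports": 0.3, "science": 0.7}
    print(vote(lsi, swt))

    truth = ["a", "a", "b", "b"]
    predicted = ["a", "b", "b", "b"]
    print(precision_recall_f_measure(truth, predicted, "b"))
```

Summing posteriors lets a confident classifier outweigh an uncertain one; a majority vote over hard labels would be an equally plausible reading of the abstract.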
URI: http://hdl.handle.net/123456789/11825
Other Identifiers: M.Tech
Research Supervisor/ Guide: Joshi, R. C.
metadata.dc.type: M.Tech Dissertation
Appears in Collections: MASTERS' THESES (E & C)

Files in This Item:
File: ECDG13919.pdf
Size: 3.62 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.