FRAMEWORK FOR WEB DOCUMENT CLASSIFICATION BASED ON NAIVE BAYESIAN CLASSIFIER USING VOTING METHOD

Rajesh, G.

dc.contributor.author	Rajesh, G.
dc.date.accessioned	2014-11-28T06:24:08Z
dc.date.available	2014-11-28T06:24:08Z
dc.date.issued	2008
dc.identifier	M.Tech	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/11825
dc.guide	Joshi, R. C.
dc.description.abstract	Automatic web document classification is the process of assigning a web document to one or more predefined categories. With the continuous increase in the information available on the World Wide Web (WWW), the importance of the web page classification problem grows significantly. Because information flows at high speed on the WWW, it must be organized so that a user can access it easily. Previously, information was generally organized manually, by matching document contents to pre-defined classes. In this approach, a human expert performs the classification task. Alternatively, supervised classifiers can be used to classify documents automatically. Supervised classification still requires manual effort to create training data before automatic classification takes place, but it reduces human participation overall. In this dissertation we propose a framework for web document classification that draws on both semantic and structural keywords. The proposed system is based on a Naive Bayesian (NB) classifier using a voting method over two different feature selection methods. The system uses both latent semantic indexing (LSI) and a structure-oriented weighting technique (SWT) for feature selection, and training and classification are performed using the Naive Bayesian classifier. The latent semantic indexing method projects terms and documents into a Boolean term-document matrix to find latent information in the document. At the same time, we also use the structure-oriented weighting technique, which projects terms and documents into a weighted term-document matrix. These two feature sets are sent to the NB classifier for training and testing. Based on the output of the NB classifier, a voting method is used to determine the suitable class of the web page.
By using the voting method, we take advantage of both the semantic relationship between terms and documents and the structure of the HTML document to improve classifier accuracy. The proposed framework describes training and learning the classifier on two different feature vectors. These methods have been evaluated on web pages from Yahoo directories using three parameters: recall, precision and F-measure. The results show that the proposed method works significantly better than using LSI features or SWT features separately.	en_US
dc.language.iso	en	en_US
dc.subject	ELECTRONICS AND COMPUTER ENGINEERING	en_US
dc.subject	WEB DOCUMENT CLASSIFICATION	en_US
dc.subject	NAIVE BAYESIAN CLASSIFIER	en_US
dc.subject	VOTING METHOD	en_US
dc.title	FRAMEWORK FOR WEB DOCUMENT CLASSIFICATION BASED ON NAIVE BAYESIAN CLASSIFIER USING VOTING METHOD	en_US
dc.type	M.Tech Dissertation	en_US
dc.accession.number	G13919	en_US
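
The abstract describes combining the outputs of two Naive Bayesian classifiers (one trained on LSI features, one on SWT features) by voting, and evaluating with recall, precision and F-measure. The record does not state the exact voting rule, so the sketch below is only an illustration under stated assumptions: it averages the per-class posterior probabilities from the two classifiers and picks the highest, and it computes the balanced F-measure (F1). The function names `vote` and `precision_recall_f1`, the class labels, and the probability values are hypothetical and do not come from the dissertation.

```python
from typing import Dict, List, Tuple


def vote(lsi_probs: Dict[str, float], swt_probs: Dict[str, float]) -> str:
    """Return the class with the highest mean posterior across both classifiers.

    Assumption: averaging posteriors is one plausible voting rule; the
    dissertation's actual rule is not given in this record.
    """
    classes = sorted(set(lsi_probs) | set(swt_probs))
    # A class missing from one classifier's output counts as probability 0.
    combined = {
        c: (lsi_probs.get(c, 0.0) + swt_probs.get(c, 0.0)) / 2.0
        for c in classes
    }
    # max() keeps the first maximal item, so ties break alphabetically.
    return max(classes, key=lambda c: combined[c])


def precision_recall_f1(
    gold: List[str], predicted: List[str], label: str
) -> Tuple[float, float, float]:
    """Compute precision, recall and balanced F-measure for one class label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical posteriors for a single web page.
    lsi = {"sports": 0.6, "science": 0.4}
    swt = {"sports": 0.3, "science": 0.7}
    print(vote(lsi, swt))
```

In this example the LSI-based classifier prefers one class and the SWT-based classifier prefers the other; averaging lets the more confident prediction decide. Other rules, such as weighting one feature set more heavily, would fit the same interface.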