Please use this identifier to cite or link to this item: http://localhost:8081/xmlui/handle/123456789/9886
Title: IMPROVING TEXT CLASSIFICATION ACCURACY USING BACKGROUND KNOWLEDGE
Authors: Bakshi, Neeraj
Keywords: ELECTRONICS AND COMPUTER ENGINEERING;IMPROVING TEXT CLASSIFICATION ACCURACY;BACKGROUND KNOWLEDGE;TEXT CLASSIFICATION
Issue Date: 2005
Abstract: Text classification is the task of classifying documents into a certain number of pre-defined categories or classes. Automatic text categorizers use a corpus of labeled textual strings or documents to assign the correct label to previously unseen strings or documents. Often the given set of labeled examples, or "training set", is insufficient to solve this problem, as text classification learning algorithms require a large number of hand-labeled examples (training examples) to learn accurately. Labeled data are expensive to collect, as a human must take the time and effort to label them. In this dissertation, we present an approach to this problem wherein readily available information is incorporated into the learning process to allow for the creation of more accurate classifiers. This additional information is termed "background knowledge". A framework for the incorporation of background knowledge into three distinct text classification learners is provided. In the first approach, the background knowledge is used as a set of unlabeled examples in a generative model trained with Expectation Maximization (EM). The second approach, co-training with SVMs, builds two classifiers and adds the better prediction of the two to the labeled set to achieve more accurate classification. Lastly, the text classification task is seen as one of information integration using WHIRL, a tool that combines database functionalities with techniques from the information-retrieval literature. The results show that text classification accuracy is improved considerably by using background knowledge. The system runs in a Linux Fedora Core 1 environment on a Pentium IV 2.40 GHz processor with 256 MB RAM. The languages used for code development are C and Perl.
URI: http://hdl.handle.net/123456789/9886
Other Identifiers: M.Tech
Research Supervisor/ Guide: Garg, Kum Kum
metadata.dc.type: M.Tech Dissertation
Appears in Collections: MASTERS' THESES (E & C)

Files in This Item:
File: ECDG12380.pdf
Size: 4.66 MB
Format: Adobe PDF

