Please use this identifier to cite or link to this item: http://localhost:8081/xmlui/handle/123456789/9886
Full metadata record
DC Field	Value	Language
dc.contributor.author	Bakshi, Neeraj	-
dc.date.accessioned	2014-11-21T04:51:21Z	-
dc.date.available	2014-11-21T04:51:21Z	-
dc.date.issued	2005	-
dc.identifier	M.Tech	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/9886	-
dc.guide	Garg, Kum Kum	-
dc.description.abstract	Text classification is the task of classifying documents into a certain number of pre-defined categories or classes. Automatic text categorizers use a corpus of labeled textual strings or documents to assign the correct label to previously unseen strings or documents. Often the given set of labeled examples, or "training set", is insufficient to solve this problem, as text classification learning algorithms require a large number of hand-labeled training examples to learn accurately. Labeled data are expensive to collect, as a human must take the time and effort to label them. In this dissertation, we present an approach to this problem wherein readily available information is incorporated into the learning process to allow for the creation of more accurate classifiers. This additional information is termed "background knowledge". A framework for the incorporation of background knowledge into three distinct text classification learners is provided. In the first approach, the background knowledge is used as a set of unlabeled examples in a generative model trained with Expectation Maximization (EM). The second approach, co-training with SVMs, builds two classifiers and adds the more confident prediction of the two to the labeled set to achieve more accurate classification. Lastly, the text classification task is cast as one of information integration using WHIRL, a tool that combines database functionality with techniques from the information-retrieval literature. The results show that text classification accuracy is improved considerably by using background knowledge. The system runs in a Linux Fedora Core 1 environment with a Pentium IV 2.40 GHz processor and 256 MB RAM. The languages used for code development are C and Perl.	en_US
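The first approach in the abstract (treating background knowledge as unlabeled examples inside an EM-trained generative model) can be sketched as semi-supervised multinomial Naive Bayes: initialize from the labeled set, then alternate an E-step (soft-label the unlabeled documents) with an M-step (re-estimate parameters from hard plus fractional counts). This is a minimal illustration only, not the thesis implementation (which was written in C and Perl); the toy spam/ham data and all function names are invented for the example.

```python
import math

def tokenize(text):
    return text.lower().split()

def m_step(labeled, unlabeled_posts, classes, vocab, alpha=1.0):
    """Re-estimate Naive Bayes parameters from hard labeled counts
    plus fractional counts weighted by the E-step posteriors."""
    cls_w = {c: 0.0 for c in classes}
    word_w = {c: {w: 0.0 for w in vocab} for c in classes}
    for toks, c in labeled:                      # hard counts, weight 1
        cls_w[c] += 1.0
        for w in toks:
            word_w[c][w] += 1.0
    for toks, post in unlabeled_posts:           # fractional counts
        for c in classes:
            cls_w[c] += post[c]
            for w in toks:
                word_w[c][w] += post[c]
    total = sum(cls_w.values())
    prior = {c: (cls_w[c] + alpha) / (total + alpha * len(classes))
             for c in classes}
    cond = {}
    for c in classes:                            # Laplace-smoothed P(w|c)
        denom = sum(word_w[c].values()) + alpha * len(vocab)
        cond[c] = {w: (word_w[c][w] + alpha) / denom for w in vocab}
    return prior, cond

def posterior(tokens, prior, cond, classes):
    """E-step: P(class | doc) under the current model, in log space."""
    logp = {c: math.log(prior[c])
            + sum(math.log(cond[c][w]) for w in tokens if w in cond[c])
            for c in classes}
    m = max(logp.values())
    unnorm = {c: math.exp(logp[c] - m) for c in classes}
    z = sum(unnorm.values())
    return {c: unnorm[c] / z for c in classes}

# Toy data (invented): a tiny labeled set plus unlabeled "background"
# documents that EM folds into the model.
labeled = [(tokenize("buy cheap pills now"), "spam"),
           (tokenize("team meeting at noon"), "ham")]
unlabeled = [tokenize(s) for s in ["cheap pills online",
                                   "meeting agenda for the team"]]
classes = ["spam", "ham"]
vocab = {w for d, _ in labeled for w in d} | {w for d in unlabeled for w in d}

# Initialize from labeled data alone, then iterate E and M steps.
prior, cond = m_step(labeled, [], classes, vocab)
for _ in range(5):
    posts = [(d, posterior(d, prior, cond, classes)) for d in unlabeled]
    prior, cond = m_step(labeled, posts, classes, vocab)

def classify(text):
    p = posterior(tokenize(text), prior, cond, classes)
    return max(p, key=p.get)
```

The unlabeled documents sharpen the word distributions beyond what the two labeled examples alone provide, which is the mechanism by which background knowledge improves accuracy in the generative approach.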
dc.language.iso	en	en_US
dc.subject	ELECTRONICS AND COMPUTER ENGINEERING	en_US
dc.subject	IMPROVING TEXT CLASSIFICATION ACCURACY	en_US
dc.subject	BACKGROUND KNOWLEDGE	en_US
dc.subject	TEXT CLASSIFICATION	en_US
dc.title	IMPROVING TEXT CLASSIFICATION ACCURACY USING BACKGROUND KNOWLEDGE	en_US
dc.type	M.Tech Dissertation	en_US
dc.accession.number	G12380	en_US
Appears in Collections:MASTERS' THESES (E & C)

Files in This Item:
File	Description	Size	Format
ECDG12380.pdf		4.66 MB	Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.