Please use this identifier to cite or link to this item: http://localhost:8081/xmlui/handle/123456789/12501
Full metadata record
DC Field | Value | Language
dc.contributor.author | Agarwal, Nikhil | -
dc.date.accessioned | 2014-12-01T07:29:31Z | -
dc.date.available | 2014-12-01T07:29:31Z | -
dc.date.issued | 2011 | -
dc.identifier | M.Tech | en_US
dc.identifier.uri | http://hdl.handle.net/123456789/12501 | -
dc.guide | Toshniwal, D. | -
dc.description.abstract | Text classification is the task of assigning a given text document to one of a set of predefined categories based on its contents. It has found applications in fields as diverse as medicine, financial markets, and information retrieval. Naive Bayes is one of the most widely used classification algorithms; however, it is slow on large collections because of the volume of computation involved, so there is a need to parallelise it to reduce classification time. The algorithm could be parallelised using grid computing, clusters, CPU threads, or GPUs. Modern Graphics Processing Units (GPUs) have enabled high-performance computing for general-purpose applications, and GPUs are increasingly used as co-processors to achieve high overall throughput. The CUDA programming model provides a C-like API, making it simpler to program for the GPU. In this dissertation, a CUDA-based parallel implementation of Naive Bayes text classification is proposed. The classification step has been parallelised on the GPU using different approaches, each trying to exploit some property of the GPU, for example the use of shared memory rather than global memory, and memory coalescing. The performance of the GPU implementation of Naive Bayes text classification has been compared with an efficient implementation of the same on a CPU. The semantic information in unstructured text can be used to improve classification accuracy; WordNet and POS tagging have been used in this dissertation to capture it. The dataset used for the experiments is Reuters-21578, a collection of news articles that appeared on the Reuters newswire in 1987. The proposed parallel Naive Bayes algorithm has been implemented on an Nvidia GTS 250 card with 128 processors and 512 MB of GDDR3 RAM. The CPU used for the serial implementation is a Pentium 4 processor operating at 3 GHz with 4 GB of DDR3 RAM. Experimental results show that the parallel implementation on GPUs is faster than the serial implementation. | en_US
dc.language.iso | en | en_US
dc.subject | ELECTRONICS AND COMPUTER ENGINEERING | en_US
dc.subject | PARALLELISATION | en_US
dc.subject | NAIVE BAYES | en_US
dc.subject | TEXT DOCUMENTS | en_US
dc.title | PARALLELISATION OF NAIVE BAYES CLASSIFICATION FOR UNSTRUCTURED TEXT DOCUMENTS | en_US
dc.type | M.Tech Dissertation | en_US
dc.accession.number | G21053 | en_US
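The classification step summarised in the abstract reduces, per document, to an argmax over per-class sums of log-probabilities; that sum is the data-parallel core that such a GPU implementation maps onto threads (one score per document/class pair). Below is a minimal NumPy sketch of the same computation, given only as an illustration of the technique: it is not the author's CUDA code, and all function and variable names are assumed.

```python
import numpy as np

def classify(doc_word_counts, log_priors, log_likelihoods):
    """Naive Bayes classification step (illustrative sketch).

    doc_word_counts: (n_docs, vocab) term-count matrix
    log_priors:      (n_classes,) log P(c)
    log_likelihoods: (n_classes, vocab) log P(w|c)

    The matrix product computes, for every (doc, class) pair,
    sum_w count(doc, w) * log P(w|c); adding log P(c) gives the
    class score, and argmax picks the predicted category.
    """
    scores = doc_word_counts @ log_likelihoods.T + log_priors
    return scores.argmax(axis=1)

# Tiny illustrative run: two documents over a two-word vocabulary.
counts = np.array([[3, 0], [0, 2]])
priors = np.log([0.5, 0.5])
likelihoods = np.log([[0.9, 0.1], [0.2, 0.8]])
print(classify(counts, priors, likelihoods))  # [0 1]
```

Each (document, class) score is independent of every other, which is what makes the step a natural fit for one-thread-per-score GPU parallelisation.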
Appears in Collections:MASTERS' THESES (E & C)

Files in This Item:
File | Description | Size | Format
ECDG21053.pdf | | 3.88 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.