Please use this identifier to cite or link to this item: http://localhost:8081/xmlui/handle/123456789/12501
Title: PARALLELISATION OF NAIVE BAYES CLASSIFICATION FOR UNSTRUCTURED TEXT DOCUMENTS
Authors: Agarwal, Nikhil
Keywords: ELECTRONICS AND COMPUTER ENGINEERING;PARALLELISATION;NAIVE BAYES;TEXT DOCUMENTS
Issue Date: 2011
Abstract: Text classification is the task of assigning a given text document to one of several predefined categories based on its contents. It has found immense application in fields as diverse as medicine, financial markets, and information retrieval. Naive Bayes is one of the most widely used classification algorithms. However, it is significantly slow due to the large number of calculations it must perform, so there is a need to parallelise it to reduce the time required for classification. The algorithm could be parallelised using grid computing, clusters, CPU threads, or GPUs. Modern Graphics Processing Units (GPUs) have enabled high-performance computing for general-purpose applications and are used as co-processors to achieve high overall throughput. The CUDA programming model provides a C-like API, making it simpler to program the GPU. In this dissertation, a CUDA-based parallel implementation of Naive Bayes text classification is proposed. The classification step has been parallelised on the GPU using several approaches, each exploiting a different property of the GPU, such as the use of shared memory instead of global memory and memory coalescing. The performance of the GPU implementation of Naive Bayes text classification has been compared with an efficient CPU implementation of the same algorithm. The semantic information in unstructured text can be used to improve classification accuracy; WordNet and POS tagging have been used in this dissertation to capture it. The dataset used for the experiments is Reuters-21578, a collection of news articles that appeared on the Reuters newswire in 1987. The proposed parallel Naive Bayes algorithm has been implemented on an Nvidia GTS 250 card with 128 processors and 512 MB of GDDR3 RAM.
The CPU used for the serial implementation is a Pentium 4 processor operating at 3 GHz with 4 GB of DDR3 RAM. Experimental results show that the parallel implementation on GPUs is faster than the serial implementation.
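To illustrate the classification step the abstract describes, the following is a minimal sketch of multinomial Naive Bayes with Laplace smoothing, in Python rather than the dissertation's actual CUDA code. The key property it makes visible is that each class's score is an independent sum of log-probabilities, which is what makes the step amenable to GPU parallelisation. The toy categories and tokens below are hypothetical, not drawn from the Reuters-21578 experiments.

```python
import math

def train(docs):
    """Estimate log-priors and Laplace-smoothed log-likelihoods.

    docs: list of (category, token_list) pairs. Illustrative sketch only.
    """
    vocab = set()
    class_docs = {}   # category -> number of training documents
    word_counts = {}  # category -> {word: count}
    for cat, tokens in docs:
        class_docs[cat] = class_docs.get(cat, 0) + 1
        counts = word_counts.setdefault(cat, {})
        for t in tokens:
            vocab.add(t)
            counts[t] = counts.get(t, 0) + 1
    n, v = len(docs), len(vocab)
    log_prior = {c: math.log(k / n) for c, k in class_docs.items()}
    log_like = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        # Add-one (Laplace) smoothing so unseen words get nonzero probability.
        log_like[c] = {w: math.log((counts.get(w, 0) + 1) / (total + v))
                       for w in vocab}
    return log_prior, log_like, vocab

def classify(tokens, log_prior, log_like, vocab):
    """Return the class maximising log P(c) + sum_w log P(w|c).

    The per-class sums are independent of one another, so on a GPU each
    (document, class) score could be computed by a separate thread block.
    """
    best, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c] + sum(log_like[c][t] for t in tokens if t in vocab)
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, after training on a few labelled token lists, `classify` picks the category whose smoothed log-probability sum is largest for the new document's tokens.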
URI: http://hdl.handle.net/123456789/12501
Other Identifiers: M.Tech
Research Supervisor/ Guide: Toshniwal, D.
metadata.dc.type: M.Tech Dissertation
Appears in Collections:MASTERS' THESES (E & C)

Files in This Item:
File           Size     Format
ECDG21053.pdf  3.88 MB  Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.