PARALLELIZED CLUSTERING OF UNSTRUCTURED TEXT USING TREE BASED COMPRESSION

Bagga, Atul

dc.contributor.author	Bagga, Atul
dc.date.accessioned	2014-12-01T05:44:30Z
dc.date.available	2014-12-01T05:44:30Z
dc.date.issued	2011
dc.identifier	M.Tech	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/12412
dc.guide	Toshniwal, Durga
dc.description.abstract	Data mining is the extraction of knowledge from large amounts of data. The data can be of various types, such as medical data, sensor data, market basket data, or text. When data mining techniques are applied to text databases, the process is termed text mining. Mining comprises various tasks such as classification, clustering, and rule generation. Text clustering is the process of dividing text documents into groups such that documents in the same group are similar to one another and different from documents in other groups. Clustering can be performed by different approaches, such as partitioning, hierarchical clustering, density-based methods, or grid-based methods. Hierarchical methods have an edge over the others because text naturally exhibits multi-level hierarchies. Recent research in text clustering uses various hierarchical methods, but these methods generally suffer from high time and space complexity, and storing the hierarchy trees is a major problem for large databases of unstructured text. In the present work, we propose a hierarchical clustering algorithm that uses tree-based compression to store the hierarchy of clustered documents, so that only a summary of each cluster is stored. The summary is maintained in such a way that all important information, such as the radius or centroid of the cluster, can be retrieved from the summary itself. Once the tree is formed in memory, an iterative approach is used to assign each document to a cluster. This reduces the space complexity considerably. To address time complexity, we parallelize the algorithm on the CUDA architecture. Because text is high-dimensional, parallelization yields a significant performance gain over the non-parallel version.	en_US
dc.language.iso	en	en_US
dc.subject	ELECTRONICS AND COMPUTER ENGINEERING	en_US
dc.subject	PARALLELIZED CLUSTERING	en_US
dc.subject	UNSTRUCTURED TEXT	en_US
dc.subject	TREE BASED COMPRESSION	en_US
dc.title	PARALLELIZED CLUSTERING OF UNSTRUCTURED TEXT USING TREE BASED COMPRESSION	en_US
dc.type	M.Tech Dissertation	en_US
dc.accession.number	G20987	en_US
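
The abstract states that only a summary of each cluster is stored, and that the centroid and radius can be recovered from that summary. A minimal sketch of one such summary follows, assuming a BIRCH-style clustering feature (point count, per-dimension linear sum, and sum of squared norms). The abstract does not name the exact summary used in the thesis, and the class and method names below are hypothetical.

```python
import math


class ClusterSummary:
    """Hypothetical cluster summary: stores (n, linear_sum, square_sum) only."""

    def __init__(self, dim):
        self.n = 0                        # number of documents absorbed
        self.linear_sum = [0.0] * dim     # per-dimension sum of vectors
        self.square_sum = 0.0             # sum of squared vector norms

    def add(self, vec):
        """Absorb one document vector without keeping the vector itself."""
        self.n += 1
        for i, x in enumerate(vec):
            self.linear_sum[i] += x
        self.square_sum += sum(x * x for x in vec)

    def merge(self, other):
        """Combine two summaries; all three components are additive."""
        self.n += other.n
        for i, x in enumerate(other.linear_sum):
            self.linear_sum[i] += x
        self.square_sum += other.square_sum

    def centroid(self):
        """Centroid = linear_sum / n."""
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square distance of members from the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off


# Usage: two 2-dimensional document vectors summarized in constant space.
summary = ClusterSummary(dim=2)
summary.add([0.0, 0.0])
summary.add([2.0, 0.0])
print(summary.centroid(), summary.radius())
```

Because every component of the summary is additive, a parent node in the tree can be described by merging its children's summaries, which is what makes storing summaries instead of documents possible in this sketch.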