DSpace Repository

PARALLELIZED CLUSTERING OF UNSTRUCTURED TEXT USING TREE BASED COMPRESSION

dc.contributor.author Bagga, Atul
dc.date.accessioned 2014-12-01T05:44:30Z
dc.date.available 2014-12-01T05:44:30Z
dc.date.issued 2011
dc.identifier M.Tech en_US
dc.identifier.uri http://hdl.handle.net/123456789/12412
dc.guide Toshniwal, Durga
dc.description.abstract Data mining is the extraction of knowledge from large amounts of data. The data can be of various types, such as medical data, sensor data, market basket data, or text. When data mining techniques are applied to text databases, the process is termed text mining. Mining comprises various tasks such as classification, clustering, and rule generation. Text clustering is the process of dividing text documents into groups such that documents in the same group are similar to one another and different from documents in other groups. Clustering can be performed using different approaches, such as partitioning, hierarchical clustering, density-based methods, or grid-based methods. Hierarchical methods have an edge over the others because text naturally contains multi-level hierarchies. Recent research in text clustering uses various hierarchical methods, but these methods generally suffer from high time and space complexity, and storing the hierarchy trees is a major problem for large collections of unstructured text. In the present work, we propose a hierarchical clustering algorithm based on tree-based compression: the hierarchy of clustered documents is stored such that only a summary of each cluster is kept. The summary is maintained in such a way that all important information, such as the radius or the centroid of the cluster, can be retrieved from the summary itself. Storing only summaries reduces the space complexity considerably. Once the tree is formed in memory, an iterative approach is used to assign each document to a cluster. To address the time complexity, we parallelize the algorithm on the CUDA architecture. Because text is high-dimensional, parallelization yields a significant performance gain over the non-parallel version. en_US
dc.language.iso en en_US
dc.subject ELECTRONICS AND COMPUTER ENGINEERING en_US
dc.subject PARALLELIZED CLUSTERING en_US
dc.subject UNSTRUCTURED TEXT en_US
dc.subject TREE BASED COMPRESSION en_US
dc.title PARALLELIZED CLUSTERING OF UNSTRUCTURED TEXT USING TREE BASED COMPRESSION en_US
dc.type M.Tech Dissertation en_US
dc.accession.number G20987 en_US
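
The abstract states that the radius and centroid of a cluster can be recovered from its stored summary alone, but it does not say what the summary contains. The sketch below assumes a BIRCH-style clustering feature (document count, per-dimension linear sum, and sum of squared norms); the class name ClusterSummary and its methods are hypothetical illustrations, not the dissertation's implementation.

```python
import math
from typing import List


class ClusterSummary:
    """Hypothetical cluster summary (assumed BIRCH-style clustering feature).

    Stores only the document count, the per-dimension sum of the document
    vectors, and the sum of their squared norms, never the vectors themselves.
    """

    def __init__(self, dim: int) -> None:
        self.n = 0                       # number of documents absorbed
        self.linear_sum = [0.0] * dim    # sum of vectors, per dimension
        self.square_sum = 0.0            # sum of squared vector norms

    def add(self, vec: List[float]) -> None:
        """Absorb one document vector into the summary."""
        self.n += 1
        for i, x in enumerate(vec):
            self.linear_sum[i] += x
        self.square_sum += sum(x * x for x in vec)

    def merge(self, other: "ClusterSummary") -> None:
        """Merge another summary; all three components are additive."""
        self.n += other.n
        for i, x in enumerate(other.linear_sum):
            self.linear_sum[i] += x
        self.square_sum += other.square_sum

    def centroid(self) -> List[float]:
        """Centroid = linear sum divided by the document count."""
        return [x / self.n for x in self.linear_sum]

    def radius(self) -> float:
        """Root-mean-square distance of the members from the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # clamp rounding noise


a = ClusterSummary(dim=2)
a.add([1.0, 0.0])
a.add([3.0, 0.0])

b = ClusterSummary(dim=2)
b.add([2.0, 2.0])

a_centroid, a_radius = a.centroid(), a.radius()
a.merge(b)  # the merged summary now describes all three documents
```

Because every component is a plain sum, two summaries merge in time proportional to the vector dimension and the per-dimension updates are independent of one another, which is the kind of work that maps naturally onto parallel hardware for high-dimensional text vectors.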