dc.description.abstract |
Data mining is the extraction of knowledge from large amounts of data. The data can be of various types, such as medical data, sensor data, market basket data, or text. When data mining techniques are applied to text databases, the process is termed text mining. Mining comprises various tasks, such as classification, clustering, and rule generation. Text clustering is the process of dividing text documents into groups such that documents in the same group are similar to one another and different from documents in other groups.
Clustering can be performed using different approaches, such as partitioning, hierarchical clustering, density-based methods, or grid-based methods. Hierarchical methods have an edge over the others because text naturally exhibits multi-level hierarchies. Recent research in text clustering uses various hierarchical methods, but these methods generally suffer from high time and space complexity. Storing the hierarchy trees is also a major problem for large databases of unstructured text.
In the present work, we propose a hierarchical clustering algorithm. The algorithm uses tree-based compression to store the hierarchy of clustered documents, such that only a summary of each cluster is stored. The summary is maintained in such a way that all important information, such as the radius or the centroid of the cluster, can be retrieved from the summary itself. Once the tree is built in memory, an iterative approach assigns each document to a cluster. This approach reduces the space complexity considerably. To address the time complexity, we parallelize the algorithm on the CUDA architecture. Because of the high dimensionality of text, parallelization yields a significant performance gain over the non-parallel version. |
en_US |