CLUSTERING UNSTRUCTURED TEXT DOCUMENTS USING NAIVE BAYESIAN CONCEPT AND SHAPE PATTERN MATCHING

Roy, Rishiraj Saha

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/11958

Title:	CLUSTERING UNSTRUCTURED TEXT DOCUMENTS USING NAIVE BAYESIAN CONCEPT AND SHAPE PATTERN MATCHING
Authors:	Roy, Rishiraj Saha
Keywords:	ELECTRONICS AND COMPUTER ENGINEERING;TEXT DOCUMENTS;PATTERN MATCHING;NAIVE BAYESIAN CONCEPT
Issue Date:	2009
Abstract:	Text document databases are growing rapidly due to the increasing amount of information available in electronic form, such as research publications, news articles, books, and e-mails. Most text databases are semi-structured in that they are neither fully unstructured nor completely structured. Clustering is performed to organize this text data in an unsupervised fashion. It also acts as a preprocessing step for further mining operations such as indexing and classification. Time series data mining involves applying mining techniques to time sequences, and much work has been done in this field in the past few decades. However, the idea of applying time series data mining techniques to text data mapped to sequences has not yet been explored, and this work addresses that gap. In this dissertation, an algorithm for clustering unstructured text documents using the naive Bayesian concept and shape pattern matching is proposed. The first step is data preprocessing, which includes stop word removal, word stemming, and dimensionality reduction using the locality preserving indexing scheme. The Vector Space Model is used to represent the dataset as a term weight matrix. In any natural language, semantically linked terms tend to occur together in documents. Based on this observation, the co-occurrences of pairs of terms in the term weight matrix are recorded. This information is then used to build an initial term cluster matrix in which each term may belong to one or more clusters. The naive Bayesian concept and cluster conditional independence are then used to assign each term uniquely to a single term cluster. The text documents are assigned to clusters using the simple statistical measure of the arithmetic mean. This completes the first level of clustering in the proposed algorithm. Mapping text documents to vectors based on a list of terms converts them to sequences. Shape pattern-based similarity is a well-established technique in time series data mining. In this work, shape pattern matching is applied to group documents within the broad clusters obtained earlier, thus performing a second level of clustering. The proposed algorithm has been validated using benchmark datasets available on the internet. These include two special datasets, ADA (a marketing application) and SYLVA (an ecology application). The results show that the proposed two-level text clustering scheme has a significantly better running time than traditional algorithms.
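The term weight matrix and the term co-occurrence step described in the abstract can be illustrated with a minimal sketch. This is not the dissertation's implementation: the toy corpus, the use of raw term counts as weights, the definition of co-occurrence as "both terms have nonzero weight in the same document", and the function names term_weight_matrix and cooccurrence_counts are all assumptions made here for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus, assumed to be already stop-word-filtered and stemmed.
docs = [
    ["cluster", "text", "document"],
    ["text", "document", "index"],
    ["time", "series", "shape"],
]

def term_weight_matrix(docs):
    """Return (vocab, matrix); matrix[d][t] is the raw count of term t in document d.

    Raw counts are an assumed weighting; the abstract does not name the scheme.
    """
    vocab = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

def cooccurrence_counts(vocab, matrix):
    """Count, for each pair of terms, the documents in which both have nonzero weight."""
    pairs = Counter()
    for row in matrix:
        present = [vocab[i] for i, weight in enumerate(row) if weight > 0]
        # vocab is sorted, so each pair is emitted in alphabetical order.
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs

vocab, matrix = term_weight_matrix(docs)
pairs = cooccurrence_counts(vocab, matrix)
```

Pairs with high counts, such as ("document", "text") in this toy corpus, would be candidates for sharing an initial term cluster. How the dissertation turns these counts into the initial term cluster matrix is not stated in the abstract and is not modeled here.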
URI:	http://hdl.handle.net/123456789/11958
Other Identifiers:	M.Tech
Research Supervisor/ Guide:	Toshniwal, Durga
Type:	M.Tech Dissertation
Appears in Collections:	MASTERS' THESES (E & C)

Files in This Item:

File	Description	Size	Format
ECDG14451.pdf		9.98 MB	Adobe PDF
