CLUSTERING UNSTRUCTURED TEXT DOCUMENTS USING NAIVE BAYESIAN CONCEPT AND SHAPE PATTERN MATCHING

Roy, Rishiraj Saha

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/11958

Full metadata record

DC Field	Value	Language
dc.contributor.author	Roy, Rishiraj Saha	-
dc.date.accessioned	2014-11-28T10:48:51Z	-
dc.date.available	2014-11-28T10:48:51Z	-
dc.date.issued	2009	-
dc.identifier	M.Tech	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/11958	-
dc.guide	Toshniwal, Durga	-
dc.description.abstract	Text document databases are growing rapidly due to the increasing amount of information available in electronic form, such as research publications, news articles, books, and e-mails. Most text databases are semi-structured, in that they are neither fully unstructured nor completely structured. Clustering organizes this text data in an unsupervised fashion and also acts as a preprocessing step for further mining operations such as indexing and classification. Time series data mining applies mining techniques to time sequences, and much work has been done in this field in the past few decades. However, the idea of applying time series data mining techniques to text data mapped to sequences has not yet been explored; this work addresses that gap. In this dissertation, an algorithm for clustering unstructured text documents using the naive Bayesian concept and shape pattern matching is proposed. The first step is data preprocessing, which includes stop word removal, word stemming, and dimensionality reduction using a locality preserving indexing scheme. The Vector Space Model is used to represent the dataset as a term weight matrix. In any natural language, semantically linked terms tend to occur together in documents. Based on this observation, the co-occurrences of pairs of terms in the term weight matrix are recorded. This information is then used to build an initial term cluster matrix in which each term may belong to one or more clusters. The naive Bayesian concept and cluster conditional independence are then used to assign each term uniquely to a single term cluster. The text documents are assigned to clusters using the simple statistical measure of the arithmetic mean. This completes the first level of clustering in the proposed algorithm. Mapping text documents to vectors based on a list of terms converts them to sequences.
Shape pattern-based similarity is a well-established technique in time series data mining. In this work, shape pattern matching is applied to group documents within the broad clusters obtained earlier, thus performing a second level of clustering. The proposed algorithm has been validated using benchmark datasets available on the internet, including two special datasets, ADA (a marketing application) and SYLVA (an ecology application). The results show that the proposed two-level text clustering scheme has a significantly better running time than traditional algorithms.	en_US
dc.language.iso	en	en_US
dc.subject	ELECTRONICS AND COMPUTER ENGINEERING	en_US
dc.subject	TEXT DOCUMENTS	en_US
dc.subject	PATTERN MATCHING	en_US
dc.subject	NAIVE BAYESIAN CONCEPT	en_US
dc.title	CLUSTERING UNSTRUCTURED TEXT DOCUMENTS USING NAIVE BAYESIAN CONCEPT AND SHAPE PATTERN MATCHING	en_US
dc.type	M.Tech Dissertation	en_US
dc.accession.number	G14451	en_US
Appears in Collections:	MASTERS' THESES (E & C)

Files in This Item:

File	Description	Size	Format
ECDG14451.pdf		9.98 MB	Adobe PDF
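
Illustrative note on the abstract: the record does not give implementation details, so the following is only a minimal sketch of three ideas the abstract mentions: a term weight matrix in the Vector Space Model, term co-occurrence counting, and a shape-based comparison of documents treated as sequences. Everything specific here is an assumption made for illustration: raw term counts stand in for term weights, the stop word list is a toy one, stemming, locality preserving indexing, and the naive Bayesian term assignment are omitted, and "shape" is taken to mean the rise/fall pattern between consecutive values. All function names are hypothetical and are not taken from the dissertation.

```python
from collections import Counter
from itertools import combinations

# Toy stop word list (assumption; the dissertation's list is not given).
STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def term_weight_matrix(docs):
    """Return (terms, matrix); matrix[d][t] is the raw count of term t in doc d."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


def cooccurrence(terms, matrix):
    """For each pair of terms, count the documents in which both appear."""
    pairs = Counter()
    for row in matrix:
        present = [t for t, w in zip(terms, row) if w > 0]
        for a, b in combinations(present, 2):
            pairs[(a, b)] += 1
    return pairs


def shape(sequence):
    """Encode a sequence as rises (+1), falls (-1), and plateaus (0)."""
    return [(b > a) - (b < a) for a, b in zip(sequence, sequence[1:])]


def shape_similarity(seq_a, seq_b):
    """Fraction of steps at which two equal-length sequences move the same way."""
    sa, sb = shape(seq_a), shape(seq_b)
    if not sa:
        return 1.0
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)


docs = [
    "clustering of text documents",
    "text documents and clustering",
    "shape matching in time series",
]
terms, matrix = term_weight_matrix(docs)
pairs = cooccurrence(terms, matrix)

# Each row of the matrix is a document mapped to a sequence over the term list.
print(terms)
print(pairs[("clustering", "text")])
print(shape_similarity(matrix[0], matrix[1]))
print(shape_similarity(matrix[0], matrix[2]))
```

In this toy setup the first two documents use the same terms, so their sequences have identical shapes, while the third document follows a different pattern. A fuller version would add the preprocessing and term cluster steps described in the abstract before any shape comparison.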
