CLUSTERING AND CLASSIFICATION ON COMPLEX DATA FOR MINING APPLICATIONS

Patil, Bankat M.

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/302

Title:	CLUSTERING AND CLASSIFICATION ON COMPLEX DATA FOR MINING APPLICATIONS
Authors:	Patil, Bankat M.
Keywords:	CLUSTERING DATA;MINING APPLICATIONS;CLASSIFICATION ON COMPLEX DATA;GENETIC ALGORITHM
Issue Date:	2008
Abstract:	Data mining is the extraction of trends and patterns from massive quantities of data. It involves tasks such as classification, clustering and association rule mining. The data on which mining operations are performed can be numeric or categorical, simple or complex. A dataset with homogeneous attribute types may be considered simple, whereas one with heterogeneous attribute types may be termed complex. Further, the data may be high-dimensional, noisy and incomplete. Most real-world datasets are high-dimensional, incomplete, noisy and complex in nature, so mining them can be very challenging. Typical examples include medical data, climatic data, transactional data and financial data. In the present study, a generic framework is first proposed for applying data mining techniques. The proposed framework is modular and hence extensible, so it can be applied in different data mining scenarios after some extensions or modifications. The framework comprises a pre-processing module, a clustering module, a classification module and hybrid clustering-classification modules. The knowledge generated can be used for decision making. The pre-processing module may comprise many different tasks, such as missing value imputation, de-noising, normalization and feature selection. In the present work, we have focussed primarily on missing value imputation, as most real-world datasets contain a large percentage of incomplete instances. Two methods have been proposed for missing value imputation. The first proposed method imputes the missing values based on k-means clustering and N nearest neighbours. For each incomplete instance, the closest cluster is identified using Euclidean distance. Within that cluster, the N nearest neighbours of the incomplete instance are identified, and the missing value is filled with the mean of their values.
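The first imputation method described above can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names (dist, kmeans, impute), the choice to cluster only the complete instances, and the use of observed attributes alone when measuring distance for an incomplete instance are all assumptions made for the sketch.

```python
import math
import random

def dist(a, b, idx):
    """Euclidean distance restricted to the attribute positions in idx."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

def kmeans(rows, k, iters=20, seed=0):
    """Plain k-means on complete rows; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    dims = range(len(rows[0]))
    labels = [0] * len(rows)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(r, centroids[c], dims))
                  for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def impute(rows, k=2, n_neighbours=2, seed=0):
    """Fill None entries with the mean of the N nearest same-cluster rows."""
    # Assumption: clusters are built from the complete instances only.
    complete = [r for r in rows if None not in r]
    centroids, labels = kmeans(complete, k, seed=seed)
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        # Assumption: distances use only the attributes that are observed.
        obs = [i for i, v in enumerate(r) if v is not None]
        c = min(range(k), key=lambda j: dist(r, centroids[j], obs))
        # Fall back to all complete rows if the chosen cluster is empty.
        members = [m for m, l in zip(complete, labels) if l == c] or complete
        nearest = sorted(members, key=lambda m: dist(r, m, obs))[:n_neighbours]
        filled.append([
            v if v is not None else sum(m[i] for m in nearest) / len(nearest)
            for i, v in enumerate(r)
        ])
    return filled
```

Calling impute(rows, k=2, n_neighbours=2) on a list of equal-length rows, with None marking each missing value, returns a copy in which every None has been replaced.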
The first method was applied to three benchmark medical datasets taken from the UCI Machine Learning data repository, viz. the Pima Indians Diabetes dataset, the BUPA Liver Disorders dataset and the Wisconsin Breast Cancer dataset. The second proposed method for missing value imputation is based on the Naive Bayes approach. The data was treated as categorical in nature. The missing value of a particular attribute of an incomplete instance was computed from the conditional probabilities of the possible categories for that attribute, given the values of the rest of the attributes. This method was applied to the Pima Indians Diabetes dataset. The next module in the proposed framework is the clustering module. An improved fuzzy subspace clustering (FSC) algorithm called GAFSC has been proposed for clustering the pre-processed complex, high-dimensional data. The FSC algorithm has the limitation that parameters such as the number of clusters k and the fuzzifier a must be specified in advance. The proposed GAFSC addresses this limitation by coupling a genetic algorithm (GA) with FSC to compute optimal values for k and a, hence the name GAFSC. In the hybrid clustering-classification module, four algorithms, namely KMCDTI, KMCSVM, KMCNB and KMCBP, have been proposed. They couple the k-means clustering algorithm with decision tree induction, support vector machine, Naive Bayes and back-propagation classification schemes, respectively, to produce improved hybrid clustering-classification algorithms. In the proposed hybrid algorithms, clustering is done first to remove outliers and misclassified records, and classification is then performed on the remaining data instances.
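The clustering step of the hybrid algorithms might look something like the sketch below. The abstract does not say how misclassified records are identified, so the rule used here (drop a record when its class label differs from the majority label of its k-means cluster) is an assumption, as are the function names kmeans_labels and filter_by_cluster. The records that survive would then be passed to whichever classifier is being coupled with k-means.

```python
import math
import random
from collections import Counter

def kmeans_labels(points, k, iters=20, seed=0):
    """Plain k-means; returns the cluster index assigned to each point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def filter_by_cluster(points, classes, k=2, seed=0):
    """Keep records whose class agrees with their cluster's majority class."""
    clusters = kmeans_labels(points, k, seed=seed)
    majority = {}
    for c in set(clusters):
        votes = Counter(y for y, l in zip(classes, clusters) if l == c)
        majority[c] = votes.most_common(1)[0][0]
    # Assumed rule: a record that disagrees with its cluster is discarded.
    return [(p, y) for p, y, l in zip(points, classes, clusters)
            if y == majority[l]]
```

The returned list of (point, class) pairs is the cleaned training set for the subsequent classification step.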
Measures such as percentage accuracy, sensitivity (TP-rate), specificity, FP-rate, confusion matrix and AUC have been computed for performance evaluation by applying the proposed methods to various benchmark medical datasets from the UCI data repository. Comparisons have also been made by computing all of the above measures for similar algorithms. The results of the comparisons show that the proposed techniques are superior to the existing ones. Finally, the proposed algorithms KMCDTI, KMCNB, KMCSVM and KMCBP have been applied to a real-world dataset of burn patients collected from SRT Medical College & General Hospital, India, which is one of the largest rural hospitals in India.
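For reference, the confusion-matrix measures named above follow the standard definitions for a two-class problem, which can be sketched as below. The function names and the dictionary keys are illustrative assumptions; AUC is left out because it requires ranked classifier scores rather than hard predictions.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn

def evaluate(actual, predicted, positive=1):
    """Percentage accuracy, sensitivity (TP-rate), specificity and FP-rate."""
    tp, tn, fp, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy_pct": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # TP-rate
        "specificity": tn / (tn + fp),
        "fp_rate": fp / (fp + tn),       # equals 1 - specificity
    }
```

The sketch assumes that both classes occur in the actual labels; otherwise one of the denominators is zero.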
URI:	http://hdl.handle.net/123456789/302
Other Identifiers:	Ph.D
Research Supervisor/ Guide:	Toshniwal, Durga
metadata.dc.type:	Doctoral Thesis
Appears in Collections:	DOCTORAL THESES (MMD)

Files in This Item:

File	Description	Size	Format
CLUSTERING AND CLASSIFICATION ON COMPLEX DATA FOR MINING APLICATION.pdf		62.29 MB	Adobe PDF	View/Open
