PARALLEL SVM AND PATTERN BASED DOCUMENT CATEGORIZATION

Sharma, Vaibhav

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/2358

Title:	PARALLEL SVM AND PATTERN BASED DOCUMENT CATEGORIZATION
Authors:	Sharma, Vaibhav
Keywords:	DOCUMENT CATEGORIZATION;SVM;ALGORITHM;ELECTRONICS AND COMPUTER ENGINEERING
Issue Date:	2008
Abstract:	This dissertation proposes two improvements to the categorization of text documents using support vector machines (SVMs). The first half covers the parallelization of SVM training; the second half describes a method that uses text patterns in documents to categorize them. Parallelization of SVM using a cascaded training algorithm has been studied in the past. It simplifies parallelization while maintaining the accuracy of the results. Feedback in cascaded SVM has also been used to improve accuracy. In cascaded SVM, the training data is divided randomly into many subsets, and support vectors are identified for each subset. Support vectors from two or more subsets are then combined to form new subsets for the next stage of the algorithm. In the first half of the dissertation, we describe a more efficient division of the training data using the k-means clustering algorithm, which reduces the number of stages in cascaded SVM, increases accuracy, and decreases run time for sequential as well as parallel execution. We also describe a further improvement to the algorithm that achieves more efficient parallelization using a parallel Sequential Minimal Optimization (SMO) algorithm, which has been shown in the past to perform better than other optimization methods for SVM.
Various data mining and NLP techniques, such as the use of n-grams or maximal patterns, have been studied in the past to improve the accuracy of text categorization. Most such techniques use knowledge of co-dependence between words to retrieve a more accurate meaning of words and thereby improve the accuracy of categorization. The use of maximal patterns and extended patterns is one of the most promising of these techniques and has been used in the past for two-class sentiment detection problems. A pattern represents a consecutive sequence of words having one or more occurrences in documents. Patterns that are more prominent in the documents of one class are taken, and further selections are made manually. Many words occurring in the middle of a pattern do not alter its meaning, but they prevent a direct match with the pattern. Extended patterns are used to overcome this problem. In the second half of this work, we propose automated selection of patterns, a fast pattern search algorithm, and a generalization of pattern-based categorization.
URI:	http://hdl.handle.net/123456789/2358
Other Identifiers:	M.Tech
Research Supervisor/ Guide:	Sarje, A. K.
metadata.dc.type:	M.Tech Dissertation
Appears in Collections:	MASTERS' THESES (E & C)
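
The cascaded training described in the abstract can be illustrated with a small sketch. This is not the dissertation's implementation: the function names (`kmeans_partition`, `toy_support_vectors`, `cascade`) are hypothetical, the data is one-dimensional, and the trainer is a stand-in that is exact only for linearly separable 1-D data with the negative class below the positive class. A real system would train an SVM on each subset instead.

```python
import random
from typing import Callable, List, Tuple

Sample = Tuple[float, int]  # (feature value, label in {-1, +1})
Trainer = Callable[[List[Sample]], List[Sample]]


def kmeans_partition(data: List[Sample], k: int,
                     iters: int = 20, seed: int = 0) -> List[List[Sample]]:
    """Split samples into at most k clusters by feature value (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample([x for x, _ in data], k)
    clusters: List[List[Sample]] = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in data:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append((x, y))
        # Recompute each center as the cluster mean; keep the old center if empty.
        centers = [sum(x for x, _ in c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def toy_support_vectors(subset: List[Sample]) -> List[Sample]:
    """Stand-in trainer (assumption): hard-margin support vectors of separable
    1-D data are the largest negative and the smallest positive sample."""
    neg = [s for s in subset if s[1] == -1]
    pos = [s for s in subset if s[1] == +1]
    if not neg or not pos:
        return list(subset)  # single-class subset: nothing can be discarded yet
    return [max(neg), min(pos)]


def cascade(data: List[Sample], k: int,
            train: Trainer) -> Tuple[List[Sample], int]:
    """Train on each subset, merge surviving support vectors pairwise,
    and repeat until one subset remains. Returns (support vectors, stages)."""
    subsets = kmeans_partition(data, k)
    stages = 0
    while len(subsets) > 1:
        # The per-subset calls are independent, so they could run in parallel.
        svs = [train(s) for s in subsets]
        subsets = [svs[i] + (svs[i + 1] if i + 1 < len(svs) else [])
                   for i in range(0, len(svs), 2)]
        stages += 1
    return train(subsets[0]), stages


data = [(0.5, -1), (1.0, -1), (3.5, -1), (2.0, -1),
        (6.5, 1), (9.0, 1), (7.25, 1), (8.0, 1)]
support_vectors, stages = cascade(data, k=4, train=toy_support_vectors)
```

In this toy setting the cascade returns the same two boundary samples that training on the full data set would. Replacing `kmeans_partition` with a random split gives the baseline cascade that the abstract compares against.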

Files in This Item:

File	Description	Size	Format
ECDG22843.pdf		4.22 MB	Adobe PDF
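
The difference between a plain pattern and an extended pattern, as outlined in the abstract, can be sketched as follows. The function `matches_extended` and its `max_gap` parameter are hypothetical names chosen for illustration; the abstract does not say how the dissertation bounds the number of intervening words, so the gap limit here is an assumption.

```python
from typing import List, Sequence


def matches_extended(tokens: Sequence[str], pattern: Sequence[str],
                     max_gap: int = 0) -> bool:
    """Return True if the pattern words occur in order in tokens, with at most
    max_gap other words between each pair of consecutive pattern words.
    max_gap=0 requires the words to be strictly consecutive."""
    if not pattern:
        return False
    # Positions where the first pattern word can be matched.
    positions = {i for i, t in enumerate(tokens) if t == pattern[0]}
    for word in pattern[1:]:
        # Track every reachable position, not only the earliest one, so a
        # later occurrence can still rescue the match.
        positions = {
            j
            for p in positions
            for j in range(p + 1, min(p + 2 + max_gap, len(tokens)))
            if tokens[j] == word
        }
        if not positions:
            return False
    return bool(positions)


tokens: List[str] = "this is not a very good movie".split()
exact = matches_extended(tokens, ["not", "good"])                # consecutive only
extended = matches_extended(tokens, ["not", "good"], max_gap=2)  # allows "a very"
```

Here the plain pattern fails because "a very" separates the two words, while the extended pattern with a gap of two still matches.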
