Authors: Rajoria, Dinesh Kumar
Issue Date: 2011
Abstract: Speech is the most natural form of communication between humans. In the domain of digital signal processing, speech recognition is an exciting area of application. Within the speech recognition field, spoken word recognition has been approached from many directions. It is a thrust area for developing intelligent voice-enabled or voice-operated machines. A voice-operated model can be used to automate many tasks that require hands-on human interaction. These tasks can be performed by spoken commands, such as turning lights on or off or opening or closing a door. In addition to these simple applications, spoken words demonstrate strengths in industrial, commercial and interdisciplinary applications. As a component of an information system and an element of an authentication system, a spoken word can work as a bioinformatic fingerprint. The spoken word therefore has a significant place in the security world for authentication systems. Many incidents occur in daily life, such as an internet or online transaction password being compromised by other internet users (hackers), or someone forgetting a login ID and password. All these incidents call for a safe and secure, yet simple, bioinformatic authentication system. The need is more obvious in situations where information on a computer is accidentally deleted. In defense establishments and other high-security places, a conventional login ID and password are relatively easy for hackers to compromise. A voice authentication system can authenticate a user at login by exploiting the properties of spoken words, minimizing the threat-related problems of login-ID/password based procedures. In linguistics, words combine into various formations: two, three or four words joined together form a combination of words, or compound word. 
In a combined word, the meanings of the words interrelate (sometimes semantically) in such a way that a new meaning emerges that is very different from the meaning of the words in isolation. All spoken languages contain not only a large number of short words but also various longer and moderate-length words, such as "Chit Chat", "Not Only", and "But Also" in English. For the present work, two-word combinations, called Paired Words (PWs), are considered. These consist of two short words separated by a variable-length gap. Spoken PWs have an edge over short and long words when the human voice fingerprint is considered for voice-enabled systems. In spoken word recognition, short words have a high misrecognition rate, whereas long words require more computation time and more memory for storing speech templates. A sequentially uttered two-word combination is free from these problems: because PWs have moderate length, the computation time and memory they require are lower than for long words. PWs are suggested in place of short and long words because they provide good information content, large variability in pronunciation (due to co-articulation) and a moderate length of utterance. The variable co-articulation gap between the two words is an important phenomenon, as it depends on factors such as language, length of the vocal tract, air pressure from the lungs, gender and age. The characteristics of this gap have been abstractly and logically utilized in this work as part of the information content of the signal. The motivation behind the present work is to explore the properties of PWs for a systematic, intelligent approach to voice-enabled systems. The PWs considered here are from the Indian national language, Hindi. Hindi is spoken by a number of native and non-native speakers in India and around the world. The Hindi language contains different types of PWs. These PWs can play a vital role in real-time applications of Indian-language-based systems. 
The spoken word recognition problem falls in the domain of Automatic Speech Recognition (ASR), which is treated as a pattern recognition problem. Pattern recognition encompasses two fundamental tasks: description (front-end processing) and classification (back-end processing). In the description process, features are extracted from the spoken word utterances; in the classification process, these features are categorized into suitable classes. Against this background, the objectives of the present research work were identified as follows:
- Characterization of Hindi phonemes, with special emphasis on the Spoken Hindi Paired Word (SHPW). The objective has been to generate an SHPW database by collecting spoken word data from speakers with different kinds of variability.
- Evaluation of pre-processing approaches such as endpoint detection, frame blocking and windowing, and application of these algorithms to SHPW utterances to determine their suitability for the present work.
- Evaluation of feature extraction and classification methods in order to select the best one, together with selection of suitable parameters for better recognition of SHPW utterances.
- Development of an intelligent hybrid computing approach for the SHPW recognition system.
The approach starts with the evaluation of different PW structures for spoken word recognition. A database of different SHPWs was developed from speakers who vary in gender, tone, style and other important aspects. This database is used for testing the algorithms developed. In the first phase, a generalized database of 26 paired words (650 utterances) covering different types of paired word was generated. Five speakers, both male and female, repeated the vocabulary five times each to generate the database. The recording was done at a sampling rate of 10 kHz. 
Thereafter, in the second phase, different types of PWs (copulative, reduplicated, partially duplicated, hybrid, adjective, pronoun, verb and numeric) were examined to find the most suitable PW type for the present intelligent approach. Twenty-five words in each of the eight categories of paired word were uttered by five speakers five times each (1000 utterances per speaker). In total, 5000 utterances were recorded in this phase at the same sampling rate as in phase one. Different components of the pre-processing unit (pre-emphasis, endpoint detection, frame blocking and windowing) have been explored with various suitable parameters. These parameters were obtained after diverse examinations and computations with the objective of achieving higher recognition of utterances. In pre-emphasis filtering, dc bias is removed by a first-order high-pass FIR filter. To avoid errors in the modelling of subsequent utterances of the same word, endpoint detection trims the paired word utterance to its tightest limits. Two measures of speech have been evaluated for endpoint detection: short-time energy and zero-crossing rate (ZCR). Frame blocking is required so that the signal can be treated as stationary within each frame. After examining different frame sizes and frame overlaps, a 10 ms frame size with a 5 ms overlap between frames has been selected for the present case. In windowing, each frame is multiplied by a window function. This process minimizes the signal discontinuities at the beginning and at the end of each frame. Furthermore, experimentation increases the correlation of spectral estimates between consecutive frames. Experiments with different windowing techniques such as Hamming, Triangular, Hanning, Gausswin and Kaiser have been carried out to obtain the most suitable window. The Kaiser window has been found to be the most appropriate in the present work from the point of view of PW recognition. 
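
The pre-processing chain described above can be sketched as follows. This is a minimal illustration and not the thesis implementation: the 10 kHz sampling rate, 10 ms frame and 5 ms overlap come from the abstract, while the pre-emphasis coefficient, the energy threshold, the Kaiser coefficient and all function names are assumptions chosen for the example. Endpoint trimming here uses short-time energy only, although the abstract evaluates ZCR as well.

```python
import math

FS = 10_000  # sampling rate stated in the abstract (10 kHz)

def pre_emphasis(x, alpha=0.95):
    """First-order high-pass FIR filter y[n] = x[n] - alpha*x[n-1] (alpha is assumed)."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def frame_blocks(x, frame_ms=10, overlap_ms=5, fs=FS):
    """Split x into 10 ms frames that overlap by 5 ms (100 samples, hop of 50)."""
    size = fs * frame_ms // 1000
    hop = size - fs * overlap_ms // 1000
    return [x[i:i + size] for i in range(0, len(x) - size + 1, hop)]

def short_time_energy(frame):
    """Sum of squared samples in one frame."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def trim_endpoints(frames, energy_threshold):
    """Keep frames from the first to the last one above the (assumed) energy threshold."""
    active = [i for i, f in enumerate(frames) if short_time_energy(f) > energy_threshold]
    return frames[active[0]:active[-1] + 1] if active else []

def bessel_i0(v, terms=25):
    """Zeroth-order modified Bessel function of the first kind, by power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (v / (2 * k)) ** 2
        total += term
    return total

def kaiser_window(length, beta):
    """w[n] = I0(beta * sqrt(1 - (2n/(N-1) - 1)^2)) / I0(beta)."""
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        r = 2 * n / (length - 1) - 1
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1 - r * r))) / denom)
    return window

def apply_window(frame, window):
    """Multiply a frame sample by sample with the window."""
    return [s * w for s, w in zip(frame, window)]

if __name__ == "__main__":
    # Synthetic utterance: silence, a 500 Hz tone, silence (illustrative data only).
    tone = [math.sin(2 * math.pi * 500 * n / FS) for n in range(400)]
    signal = [0.0] * 300 + tone + [0.0] * 300
    frames = frame_blocks(pre_emphasis(signal))
    voiced = trim_endpoints(frames, energy_threshold=0.01)
    window = kaiser_window(len(voiced[0]), beta=5.0)  # beta value is an assumption
    windowed = [apply_window(f, window) for f in voiced]
    print(len(frames), len(voiced), len(windowed[0]))
```

In practice the window would be computed once per frame length and reused for every frame, as in the usage lines above.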
After repeated examination, an appropriate value of β (the coefficient of the Kaiser window) has been chosen for the different approaches. To select a suitable feature extraction method, different feature parameters such as energy, mean, entropy, FFT, STFT and wavelet transform have been computed for each frame of the desired SHPW utterances (after pre-processing) and then fed to a specific classifier. Among these, the discrete wavelet transform gave promising results. Different wavelets with varying order and decomposition level were then evaluated to extract the most suitable features. The Daubechies wavelet of order 10 and f level decomposition gives the best results here. To minimize the number of input feature vectors, various statistical measures have been used. After applying different normalization techniques to these statistical measures, Max normalization has been found to give the best results for the considered dataset. At the end of front-end processing, a feature vector is obtained for each template, which becomes the input to the classifier. In back-end processing, classification of the SHPW feature vectors started with distance classifiers such as Euclidean and Mahalanobis. Because of the poor classification accuracy of these distance classifiers, other classifiers have been tested. Considering that spoken words are probabilistic in nature, probabilistic classifier models have been considered. Classifiers such as the Probabilistic Neural Network (PNN), multiclass Support Vector Machine (SVM) and weighted FPNN have been tested. The suitability of the spread constant, or smoothing parameter, of the PNN has been evaluated by successive computation for the different approaches. The value of the spread constant so obtained gives better recognition for the database under consideration. In addition, different cross-validation techniques have been tested in the training phase of the probabilistic classifier to handle the over-fitting problem of the database. 
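
The front-end idea of wavelet decomposition, statistical summarization and Max normalization can be sketched as below. To stay self-contained the sketch substitutes the Haar wavelet for the order-10 Daubechies wavelet used in the thesis, and the choice of statistics (mean, standard deviation, energy), the decomposition level and the reading of Max normalization as division by the largest absolute value are all assumptions, since the abstract does not specify them.

```python
import math

def haar_dwt(x):
    """One level of the Haar DWT; returns (approximation, detail) coefficients."""
    x = list(x)
    if len(x) % 2:                      # pad odd-length input by repeating the last sample
        x.append(x[-1])
    s = math.sqrt(2)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def wavelet_decompose(x, level):
    """Multi-level decomposition, ordered [cA_level, cD_level, ..., cD_1]."""
    bands = []
    approx = list(x)
    for _ in range(level):
        approx, detail = haar_dwt(approx)
        bands.append(detail)
    bands.append(approx)
    return bands[::-1]

def band_statistics(band):
    """Summarize one sub-band by mean, standard deviation and energy (assumed measures)."""
    n = len(band)
    mean = sum(band) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in band) / n)
    energy = sum(v * v for v in band)
    return [mean, std, energy]

def max_normalize(vector):
    """Divide every element by the largest absolute value (assumed definition)."""
    peak = max(abs(v) for v in vector)
    return [v / peak for v in vector] if peak else list(vector)

def feature_vector(frame, level=3):
    """Concatenate sub-band statistics and normalize them into one feature vector."""
    stats = []
    for band in wavelet_decompose(frame, level):
        stats.extend(band_statistics(band))
    return max_normalize(stats)

if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 5 * n / 64) for n in range(64)]  # illustrative frame
    print(len(feature_vector(frame)))
```

Summarizing each sub-band with a few statistics keeps the classifier input small regardless of the frame length, which is the purpose the abstract gives for the statistical measures.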
The k-fold cross-validation technique with k = 10 gives the best results. Both speaker-dependent and speaker-independent spoken word recognition/classification approaches for voice-enabled systems have been tested on the considered databases. For the first database (the generalized database), 650 utterance templates have been tested with different algorithms and different classifiers. The SHPW utterances scored a 100% recognition rate (with PNN, SVM and weighted FPNN) for the speaker-dependent approach, whereas varying recognition rates were obtained for the speaker-independent approach. For the second database (the categorized database), the reduplicated type of PW gives the lowest recognition rate, whereas the hybrid type of PW shows the highest recognition rate (with weighted FPNN). It has therefore been observed that the weighted FPNN classifier gives the best results on the considered databases. The hybrid paired word is the most suitable type of Hindi paired word for the proposed intelligent hybrid computing approach (wavelet with weighted FPNN) to SHPW recognition. In a nutshell, the hybrid type of paired word has been found to be the most suitable PW for voice-enabled systems based on the Indian native language Hindi. The speaker-dependent PW approach may help reduce hacking and cracking of spoken-word information in real-world applications. The speaker-independent PW approach generalizes the voice-enabled system, which widens its applications in various fields. The work reported in this thesis is an effort to determine the suitability of spoken Hindi paired words for speaker-dependent and speaker-independent voice-enabled approaches.
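
As an illustration of the back-end described in the abstract, the following sketch evaluates a basic Probabilistic Neural Network with k-fold cross-validation. It is not the thesis code: the spread constant is treated as the standard deviation of a Gaussian kernel, the fold assignment and the toy data are invented for the example, and the weighted FPNN variant is not shown because the abstract does not define it.

```python
import math
import random

def pnn_predict(train_x, train_y, x, spread):
    """Pick the class whose training patterns give the largest mean Gaussian kernel response."""
    scores, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))
        scores[yi] = scores.get(yi, 0.0) + math.exp(-d2 / (2 * spread ** 2))
        counts[yi] = counts.get(yi, 0) + 1
    return max(scores, key=lambda c: scores[c] / counts[c])

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, spread):
    """Return the fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(pnn_predict(train_x, train_y, xs[i], spread) == ys[i] for i in fold)
    return correct / len(xs)

if __name__ == "__main__":
    # Two well-separated toy classes standing in for normalized feature vectors.
    class_a = [[0.05 * i, 0.05 * (i % 3)] for i in range(20)]
    class_b = [[5 + 0.05 * i, 5 + 0.05 * (i % 3)] for i in range(20)]
    xs = class_a + class_b
    ys = ["a"] * 20 + ["b"] * 20
    for spread in (0.1, 0.5, 1.0):      # candidate spread constants (assumed values)
        print(spread, cross_validate(xs, ys, k=10, spread=spread))
```

Repeating the cross-validation over several candidate spread values, as in the usage loop, mirrors the successive evaluation of the smoothing parameter mentioned in the abstract.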
Other Identifiers: Ph.D
Appears in Collections: DOCTORAL THESES (Electrical Engg)
