Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/16538
Full metadata record
DC Field | Value | Language
dc.contributor.author | Ohri, Namit | -
dc.date.accessioned | 2025-05-28T15:54:28Z | -
dc.date.available | 2025-05-28T15:54:28Z | -
dc.date.issued | 2017-05 | -
dc.identifier.uri | http://localhost:8081/jspui/handle/123456789/16538 | -
dc.description.abstract | Speech recognition refers to the task of decoding an audio signal to recognize the phonemes or words being spoken. Current state-of-the-art speech recognition systems use Deep Neural Networks (DNNs) for acoustic modelling, with raw features as input. Feature extraction from the audio signal and the post-processing of those features make up the front-end of the speech recognition system. The back-end consists of an acoustic model learned from the features, a pronunciation model and a language model. Auditory models are used in the front-end. They resemble the human auditory system and try to map the short-time spectra of the speech signal to a more accurate feature representation than raw features. These front-end models have been tested with a Gaussian Mixture Model-Hidden Markov Model based back-end, with no significant improvement. This thesis studies the effect of using auditory model based features in the front-end as input to a neural network (NN) based back-end in automatic speech recognition tasks such as phoneme recognition and word recognition. Delta features obtained from raw features are dynamic in nature, and it is common practice to augment the raw features with delta features in speech recognition. Auditory models carry static and dynamic information in the same channel, which calls into question the need for deltas. The Schroeder-Hall (SH) model, an auditory model for short-term adaptation in the inner hair cell, is used. Simulation studies are performed on the TIMIT and Aurora 4 databases, and results with various combinations of front-end and back-end are compared. The performance of a speech recognition system is measured by the error rate of the transcriptions predicted by the system: the lower the error rate, the better the system.
The SH model, which is one of the first models of the auditory system, produces a more compact representation of features, reducing the dimension while maintaining the same amount of information as the deltas. A Bidirectional Long Short-Term Memory (BLSTM) based back-end gives the best performance, and sequence objective functions work much better than other objective functions; however, BLSTM-based networks have a higher computational complexity than DNN-based systems. Features created from the SH model give the best performance on the TIMIT database with an NN-based back-end. The BLSTM-Maximum Mutual Information (MMI) back-end gives the best result with delta features on the Aurora 4 database, but the delta features have double the input dimension of the SH features. Speaker information is not easily available in practical speech recognition tasks. The BLSTM-MMI network leads to state-of-the-art performance on a noisy database like Aurora 4 without using any of the speaker adaptation techniques that are widely used in current state-of-the-art speech recognition systems. | en_US
dc.description.sponsorship | INDIAN INSTITUTE OF TECHNOLOGY, ROORKEE | en_US
dc.language.iso | en | en_US
dc.publisher | I I T ROORKEE | en_US
dc.subject | Deep Neural Networks | en_US
dc.subject | Auditory Models | en_US
dc.subject | Schroeder-Hall | en_US
dc.subject | Gaussian Mixture | en_US
dc.title | A HYBRID SCHROEDER HALL MODEL-NEURAL NETWORK BASED SPEECH RECOGNITION SYSTEM | en_US
dc.type | Other | en_US
Appears in Collections: MASTERS' THESES (E & C)

Files in This Item:
File | Description | Size | Format
G27574.pdf | | 1.36 MB | Adobe PDF