A HYBRID SCHROEDER HALL MODEL-NEURAL NETWORK BASED SPEECH RECOGNITION SYSTEM

Ohri, Namit

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/16538

Full metadata record

DC Field	Value	Language
dc.contributor.author	Ohri, Namit	-
dc.date.accessioned	2025-05-28T15:54:28Z	-
dc.date.available	2025-05-28T15:54:28Z	-
dc.date.issued	2017-05	-
dc.identifier.uri	http://localhost:8081/jspui/handle/123456789/16538	-
dc.description.abstract	Speech recognition refers to the task of decoding an audio signal to recognize the phonemes or words that are being spoken. Current state-of-the-art speech recognition systems use Deep Neural Networks (DNNs) for acoustic modelling with raw features as input. Feature extraction from the audio signal and post-processing of the features make up the front-end of the speech recognition system. The back-end consists of an acoustic model, which is learned from the features, a pronunciation model and a language model. Auditory models are used in the front-end. They resemble the human auditory system and try to map the short-time spectra of the speech signal to a more accurate feature representation than raw features. In the front-end, these models have been tested with a Gaussian Mixture Model-Hidden Markov Model based back-end, with no significant improvement. This thesis studies the effect of using auditory model based features in the front-end as input to an NN-based back-end in automatic speech recognition tasks such as phoneme recognition and word recognition. Delta features obtained from raw features are dynamic in nature, and it is common practice to augment the raw features with delta features in speech recognition. Auditory models carry the static and dynamic information in the same channel, which calls into question the need for deltas. The Schroeder-Hall (SH) model, an auditory model of short-term adaptation in the inner hair cell, is used. Simulation studies are performed on the TIMIT and Aurora 4 databases, and results with various combinations of front-end and back-end are compared. The performance of a speech recognition system is measured by the error rate on the transcriptions predicted by the system: the lower the error rate, the better the system.
The SH model, which is one of the first models of the auditory system, produces a more compact representation of features, reducing the dimension while maintaining the same amount of information as the deltas. A Bidirectional Long Short-Term Memory (BLSTM)-based back-end gives the best performance, and sequence objective functions work much better than other objective functions; however, BLSTM-based networks have a higher computational complexity than DNN-based systems. The features created from the SH model give the best performance on the TIMIT database with an NN-based back-end. The BLSTM-Maximum Mutual Information (MMI) back-end gives the best result with delta features on the Aurora 4 database, but the delta features have double the input dimension of the SH features. Speaker information is not easily available in practical speech recognition tasks. The BLSTM-MMI network leads to state-of-the-art performance on a noisy database like Aurora 4 without using any of the speaker adaptation techniques that are widely used in current state-of-the-art speech recognition systems.	en_US
dc.description.sponsorship	INDIAN INSTITUTE OF TECHNOLOGY, ROORKEE	en_US
dc.language.iso	en	en_US
dc.publisher	I I T ROORKEE	en_US
dc.subject	Deep Neural Networks	en_US
dc.subject	Auditory Models	en_US
dc.subject	Schroeder-Hall	en_US
dc.subject	Gaussian Mixture	en_US
dc.title	A HYBRID SCHROEDER HALL MODEL-NEURAL NETWORK BASED SPEECH RECOGNITION SYSTEM	en_US
dc.type	Other	en_US
Appears in Collections:	MASTERS' THESES (E & C)

Files in This Item:

File	Description	Size	Format
G27574.pdf		1.36 MB	Adobe PDF
