DESIGN AND EVALUATION OF ENHANCEMENT TECHNIQUES FOR SINGLE-CHANNEL SPEECH

Singh, Sachin

Please use this identifier to cite or link to this item: http://localhost:8081/jspui/handle/123456789/13991

Title:	DESIGN AND EVALUATION OF ENHANCEMENT TECHNIQUES FOR SINGLE-CHANNEL SPEECH
Authors:	Singh, Sachin
Keywords:	real-world;decomposition
Issue Date:	Apr-2015
Publisher:	ELECTRICAL ENGINEERING IIT ROORKEE
Abstract:	In real-world applications, a speech signal captured in an uncontrolled environment is usually accompanied by degradation components in addition to the actual speech. These components include background noise, reverberation and multi-talker speech. Such interference not only degrades perceptual speech quality and intelligibility, which creates listening problems for humans, but also reduces the performance of automatic speech processing applications such as speech recognition, speaker recognition and hearing aid systems. De-noising of corrupted single-channel speech has therefore become an important research topic in academia and industry. Currently available single-channel noise reduction methods include spectral subtraction, the Wiener filter, minimum mean square error (MMSE) estimation, p-MMSE, log-MMSE, KLT and PKLT, among others. Each of these methods suits a specific speech environment: some perform better for one particular type of noise, whereas others suit other types. Considering these limitations, different categories of speech signals are treated separately in this work. Accordingly, the objectives of the present research are: (1) design of a suitable method for enhancement of mixed noisy speech at very low (negative) input SNR; (2) design and development of a suitable method for suppression of non-stationary noise in single-channel speech; (3) analysis and development of a suitable method for suppression of the combined effect of background noise and reverberation; and (4) design and implementation of a phase-based single-channel speech enhancement technique.
The objectives have been accomplished as follows. For the first objective, a single-channel speech enhancement method based on a modified Wiener gain function and the Wavelet Packet Transform (WPT) is proposed to suppress noise from multiple sources in both low (negative) and high SNR speech, ranging from -15 dB to +15 dB. The method consists of the following steps: (1) the speech signal is decomposed up to the 3rd level to obtain eight bands; (2) the FFT of these bands is computed to obtain the wavelet packet (WP) soft threshold, which is applied to the FFT output; (3) the WP soft threshold is also used to determine the modified Wiener gain function; (4) the processed output speech is obtained as the IFFT of the product of the modified Wiener gain function and the WP-thresholded FFT output. The overlap-add method is used to reconstruct the final speech signal. The performance of the proposed method is compared with that of existing speech enhancement methods using MSE, SNR, MOS, PESQ and SII. A dataset with mixed noise at input SNRs from -15 dB to +15 dB is used for evaluation. The results show improvement in both speech quality and intelligibility measures. The proposed method gives the highest improvement among the compared single-channel speech enhancement methods at all input SNR levels and for the various noise types.
To overcome the need for the true speech or true noise in binary-mask-based speech enhancement methods, a fuzzy mask is proposed under the second objective. It is based on soft and hard wavelet packet thresholds.
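For reference, soft and hard thresholding can be sketched in their standard textbook form. This is only an illustration on real-valued coefficients: the function names and the threshold value are hypothetical, and the abstract does not state how the thesis derives the threshold or combines the two rules into the fuzzy mask.

```python
import numpy as np

def hard_threshold(coeffs, thr):
    """Keep coefficients whose magnitude exceeds thr; set the rest to zero."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(coeffs) > thr, coeffs, 0.0)

def soft_threshold(coeffs, thr):
    """Zero small coefficients and shrink the remaining ones toward zero by thr."""
    coeffs = np.asarray(coeffs, dtype=float)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)

# Illustrative coefficients and threshold (not taken from the thesis).
example = np.array([-3.0, -0.5, 0.2, 1.5])
hard_out = hard_threshold(example, thr=1.0)  # values: -3.0, 0.0, 0.0, 1.5
soft_out = soft_threshold(example, thr=1.0)  # values: -2.0, 0.0, 0.0, 0.5
```

Hard thresholding leaves the surviving coefficients untouched, while soft thresholding also reduces their magnitude, which generally gives a smoother result at the cost of some attenuation.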
The fuzzy-mask method consists of the following steps: (1) the speech signal is decomposed up to the 3rd level to obtain eight bands; (2) the FFT of these bands is computed to obtain the wavelet packet soft and hard thresholds, which are applied to the FFT output; (3) in the first stage, the modified Wiener gain function, determined as in the first objective, is applied to obtain the denoised speech signal in the frequency domain; (4) in the second stage, the fuzzy mask is applied to the first-stage output for further enhancement; (5) the processed output speech is obtained as the IFFT of the product of the fuzzy mask and the WP soft- and hard-thresholded FFT output. Again, the overlap-add method is used to reconstruct the final speech signal. The performance of this method is compared with that of existing speech enhancement methods using SNR, MOS, PESQ and STOI. A dataset with non-stationary noise at input SNRs from -15 dB to +15 dB is used for evaluation. The results obtained with the proposed method are much better than those of the other existing single-channel speech enhancement methods.
Most of the algorithms above address noise and reverberation separately and do not work effectively when the two are combined (i.e. reverberation with noise). To suppress the combined effect of early and late reverberation with various types of noise, a binary reverberation mask is implemented for the third objective. In this method, the signal-to-reverberant ratio (SRR) is calculated and used as the selection criterion for the ideal reverberant mask (IRM). Amplitudes with SRR greater than a preset threshold (i.e. -5 dB) are used to reconstruct the dereverberated speech, while amplitudes with SRR below the threshold are eliminated.
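The thresholding rule for the binary reverberation mask can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: the SRR is taken per time-frequency bin as the ratio of target energy to reverberant-input energy, the clean target magnitudes are available, and the function names, the `eps` guard and the example values are hypothetical.

```python
import numpy as np

def srr_db(target_mag, reverb_mag, eps=1e-12):
    """Per-bin signal-to-reverberant ratio in dB (assumed definition)."""
    target_mag = np.asarray(target_mag, dtype=float)
    reverb_mag = np.asarray(reverb_mag, dtype=float)
    # eps avoids division by zero and log of zero in silent bins.
    return 10.0 * np.log10((target_mag ** 2 + eps) / (reverb_mag ** 2 + eps))

def binary_reverberation_mask(target_mag, reverb_mag, threshold_db=-5.0):
    """Return 1.0 where SRR exceeds the threshold and 0.0 elsewhere."""
    return (srr_db(target_mag, reverb_mag) > threshold_db).astype(float)

# Toy 2x2 magnitude spectrograms (rows: frames, columns: frequency bins).
target = np.array([[1.0, 0.1], [0.5, 0.0]])
reverb = np.array([[1.0, 1.0], [0.6, 0.8]])

mask = binary_reverberation_mask(target, reverb)  # -5 dB threshold
masked = reverb * mask  # bins below the threshold are zeroed
```

In this toy example the first column has SRR values of 0 dB and about -1.6 dB, so both bins are kept, while the second column falls far below -5 dB and is removed.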
The construction of the SRR criterion assumes a priori knowledge of the input reverberant signal and the target signal. Threshold values from 0 dB to -90 dB are analyzed to select the IRM limit T. Finally, the dereverberated speech signal is constructed by multiplying the noisy speech with the reverberant mask. The proposed reverberant-mask-based method is compared with existing speech enhancement methods in terms of speech quality and intelligibility measures such as PESQ, CD, SNR and MSE. The proposed method gives the maximum improvement in reverberated noisy speech, in terms of speech quality and intelligibility, at all input SNR levels from -25 dB to -5 dB.
Most noise reduction algorithms modify only the amplitude, while the phase is left unchanged or discarded during enhancement. Recently, it has been found that both quality and intelligibility can be improved significantly by using either the phase of the speech signal alone or the phase together with the amplitude. Hence, a signal-phase-ratio-based single-channel speech enhancement method, which takes the phase of the noisy speech signal into account, is implemented for the fourth objective to further improve noisy speech. The method uses the phase ratio of the noisy speech to the noise signal. Two gain functions, G1 and G2, are developed to correct the noisy phase by suppressing the noise arriving from angles between 0 and ±π/2 and between ±π/2 and ±π, respectively. To reconstruct the speech spectrum, the two gains are multiplied together and lower phase values are neglected to obtain the desired speech spectrum.
The results are compared with those of other phase-based methods (such as phase spectrum compensation (PSC), exploiting conjugate symmetry of the short-time Fourier spectrum, and the STFT-phase-based MMSE-optimal method) and are analyzed in terms of speech quality and intelligibility measures (such as SNR, SSNR, SIG, SII, BAK, OVL and PESQ), informal subjective listening tests and spectrogram analysis. The performance measures show that the proposed phase-ratio-based method improves noisy speech more effectively than the other phase-based speech enhancement methods.
The implemented algorithms are evaluated for several languages: Hindi, Kannada, Bengali, Malayalam, Tamil, Telugu and English. The Indian-language data used for evaluation are taken from the IIIT-H Indic Speech Databases, developed at the Speech and Vision Lab, IIIT-Hyderabad, for building speech synthesis systems for Indian languages. The speech data were recorded by native speakers of each language in a studio environment, using a standard headset microphone connected to a Zoom handy recorder. A set of 1000 sentences was selected for each language to cover the 5000 most frequent words in the text corpus of that language. The NOIZEUS database of clean and noisy speech was used for the English sentences. This database contains 30 IEEE sentences produced by three male and three female speakers. The real-world background noise at different SNRs was taken from the AURORA and NOISEX-92 databases and includes suburban train, babble, car, exhibition hall, restaurant, street, airport and train-station noise. In a nutshell, the present work is an effort to determine the suitability of various single-channel speech enhancement techniques for obtaining the maximum speech quality and intelligibility.
URI:	http://hdl.handle.net/123456789/13991
Research Supervisor/ Guide:	Anand, R. S.; Tripathy, Manoj
metadata.dc.type:	Thesis
Appears in Collections:	DOCTORAL THESES (Electrical Engg)

Files in This Item:

File	Description	Size	Format
PhD Thesis Sachin Singh (11918012).pdf		7.75 MB	Adobe PDF
