In real-world applications, a speech signal captured in an uncontrolled environment is often
accompanied by various degradation components along with the actual speech components.
The degradation components include background noise, reverberation and multi-talker
speech. These unwanted interferences not only degrade perceptual speech quality and
intelligibility, creating listening problems for humans, but also impair the performance of
automatic speech processing systems such as speech recognition, speaker recognition and
hearing aids. Therefore, de-noising of corrupted single-channel speech has become a
necessary and important research topic in academia and industry.
The presently available single-channel noise reduction methods include spectral
subtraction, the Wiener filter, minimum mean square error estimation (MMSE), p-MMSE,
log-MMSE, KLT, PKLT, etc. Each of these methods is applicable to a specific speech-signal
environment: some perform better for one particular type of noise whereas others are
suitable for other types. Considering the limitations of these methods, different categories of
speech signals have been treated separately. Based on this, the objectives of the present
research work have been formulated as: (1) design of a suitable method for enhancement of
mixed noisy speech under very low (negative) input SNR conditions; (2) design and
development of a suitable method for suppression of non-stationary noise in single-channel
speech signals; (3) analysis and development of a suitable method for suppression of the
combined effect of background noise and reverberation; and (4) design and implementation
of a phase-based single-channel speech enhancement technique. These objectives have been
accomplished as follows:
For the first objective, single-channel speech enhancement based on a modified Wiener gain
function using the Wavelet Packet Transform (WPT) is proposed for suppression of noise from
multiple sources in both low (negative) and high SNR speech signals ranging from -15 dB
to +15 dB. The method consists of the following steps: (1) the speech signal is decomposed up
to the 3rd level to obtain eight different sub-bands; (2) the FFT of these bands is computed and
the wavelet packet soft threshold is derived and applied to the FFT output; (3) the WP soft
threshold is also used to determine the modified gain function; (4) finally, the IFFT of the
product of the modified Wiener gain function and the WP-thresholded FFT output is computed
to obtain the processed output speech. The overlap-add method is used to obtain the final
reconstructed speech signal.
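To make the processing chain concrete, the following Python sketch (using NumPy and
PyWavelets) illustrates one possible per-frame realization of steps (1)-(4). The
universal-threshold form of the WP soft threshold and the particular Wiener-type gain shown
here are illustrative assumptions, not the exact expressions derived in the thesis; framing
and overlap-add are performed outside this function.

import numpy as np
import pywt

def enhance_frame(frame, wavelet="db4", level=3):
    # (1) 3-level wavelet packet decomposition -> eight sub-bands
    wp = pywt.WaveletPacket(frame, wavelet=wavelet, maxlevel=level)
    for node in wp.get_level(level, order="natural"):
        band = node.data
        # (2) FFT of the sub-band; universal-threshold-style WP soft threshold (assumed form)
        spec = np.fft.rfft(band)
        mag = np.abs(spec)
        sigma = np.median(np.abs(band)) / 0.6745
        thr = sigma * np.sqrt(2.0 * np.log(max(len(band), 2)))
        mag_soft = np.maximum(mag - thr, 0.0)                # soft thresholding of |FFT|
        spec_soft = mag_soft * np.exp(1j * np.angle(spec))   # thresholded spectrum
        # (3) modified Wiener-type gain built from the thresholded magnitude (hypothetical form)
        gain = mag_soft ** 2 / (mag ** 2 + 1e-12)
        # (4) IFFT of the gain times the thresholded spectrum, written back into the tree
        wp[node.path] = np.fft.irfft(gain * spec_soft, n=len(band))
    # Inverse WPT yields the processed frame; overlap-add joins successive frames
    return wp.reconstruct(update=False)[: len(frame)]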
The performance of the proposed method is compared with that of other existing speech
enhancement methods by evaluating performance parameters such as MSE, SNR, MOS,
PESQ and SII. A dataset with mixed noise at SNRs ranging from -15 dB to +15 dB is used
for performance evaluation of the implemented methods. The results show improvement in
terms of speech quality and intelligibility parameters. The proposed method gives the highest
improvement in comparison to the other single-channel speech enhancement methods at all
input SNR levels and for the various noise types.
To overcome the need for the true speech or true noise signal in binary-mask-based speech
enhancement methods, a fuzzy mask based on soft and hard wavelet packet thresholds is
proposed under the second objective. The method consists of the following steps: (1) the
speech signal is decomposed up to the 3rd level to obtain eight different sub-bands; (2) the
FFT of these bands is computed and the wavelet packet soft and hard thresholds are derived
and applied to the FFT output; (3) in the first stage, the modified Wiener gain function,
determined in the same way as above, is applied to obtain the de-noised speech signal in the
frequency domain; (4) in the second stage, the fuzzy mask is applied to the output of the first
stage for further enhancement; (5) finally, the IFFT of the product of the fuzzy mask and the
WP soft- and hard-thresholded FFT output is computed to obtain the processed output speech.
Again, the overlap-add method is used to obtain the final reconstructed speech signal.
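As an illustration of how such a soft-decision mask can replace the 0/1 decisions of a binary
mask, the sketch below builds a fuzzy mask in [0, 1] from soft- and hard-thresholded
magnitudes and applies it to the first-stage output; the sigmoid membership function and the
retained-energy ratio used here are assumed forms, not the ones derived in the thesis.

import numpy as np

def fuzzy_mask(spec, thr):
    # Soft- and hard-thresholded magnitudes of the first-stage spectrum
    mag = np.abs(spec)
    soft = np.maximum(mag - thr, 0.0)
    hard = np.where(mag > thr, mag, 0.0)
    # Degree of speech presence in [0, 1] from the retained-energy ratio,
    # squashed by a sigmoid membership function (assumed shape)
    ratio = (soft + hard) / (2.0 * mag + 1e-12)
    return 1.0 / (1.0 + np.exp(-10.0 * (ratio - 0.5)))

def second_stage(stage1_spec, thr):
    # Apply the fuzzy mask to the first-stage (modified Wiener gain) output;
    # the IFFT and overlap-add then follow as in the first method
    return fuzzy_mask(stage1_spec, thr) * stage1_spec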
Here again, the performance of the proposed method is compared with that of other existing
speech enhancement methods using performance parameters such as SNR, MOS, PESQ and
STOI. A dataset with non-stationary noise at SNRs ranging from -15 dB to +15 dB is used for
performance evaluation of the implemented methods. The results obtained from the proposed
method are much better than those of the other existing single-channel speech enhancement
methods.
Most of the algorithms implemented above address noise and reverberation separately and do
not work effectively when the two are combined (i.e. reverberation together with noise). To
suppress the combined effect of early and late reverberation along with various types of noise,
a binary reverberation mask is implemented here in fulfilment of the third objective.
In the proposed method, the signal-to-reverberant ratio (SRR) is calculated and used as the
limit for an ideal reverberant mask (IRM). Amplitudes with an SRR greater than a preset
threshold (i.e. -5 dB) are retained for reconstruction of the dereverberated speech, while
amplitudes with SRR values smaller than the threshold are eliminated. The construction of the
SRR criterion assumes a priori knowledge of the input reverberant and target signals.
Threshold values varying from 0 dB to -90 dB are analyzed for selection of the IRM limit T.
Finally, the dereverberated speech signal is constructed by multiplying the noisy speech with
the reverberant mask.
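A minimal oracle-style sketch of this mask is given below (NumPy/SciPy); it assumes the
direct (target) and reverberant components are available separately, as the SRR criterion
requires, and the frame length and threshold value are illustrative choices only.

import numpy as np
from scipy.signal import stft, istft

def srr_binary_mask(direct, reverberant, noisy, fs, threshold_db=-5.0, nperseg=512):
    # Short-time spectra of the target (direct/early), reverberant and noisy signals
    _, _, S_dir = stft(direct, fs, nperseg=nperseg)
    _, _, S_rev = stft(reverberant, fs, nperseg=nperseg)
    _, _, S_noisy = stft(noisy, fs, nperseg=nperseg)
    # Signal-to-reverberant ratio per time-frequency unit (in dB)
    srr_db = 10.0 * np.log10((np.abs(S_dir) ** 2 + 1e-12) /
                             (np.abs(S_rev) ** 2 + 1e-12))
    # Keep units whose SRR exceeds the preset limit T, eliminate the rest
    mask = (srr_db > threshold_db).astype(float)
    _, enhanced = istft(mask * S_noisy, fs, nperseg=nperseg)
    return enhanced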
The proposed reverberant-mask-based speech enhancement method is compared with other
existing speech enhancement methods in terms of speech quality and intelligibility measures
such as PESQ, CD, SNR and MSE. The proposed method achieves the maximum improvement
in reverberated noisy speech in terms of speech quality and intelligibility at all input SNR
levels ranging from -25 dB to -5 dB.
Most noise reduction algorithms modify the amplitude only, while the phase remains
unchanged or is discarded in the process of speech enhancement. Recently, it has been found
that both quality and intelligibility can be improved to a significant level by using either the
phase of the speech signal alone or the phase together with the amplitude. Hence, a
signal-phase-ratio-based single-channel speech enhancement method, which takes the phase of
the noisy speech signal into account, is implemented for the fourth objective to further improve
the noisy speech signal. The phase ratio of the noisy speech to the noise signal is used in this
phase-based method. Two gain functions, G1 and G2, are developed for correction of the noisy
phase by suppressing the noise coming from angles between 0 and ±π/2 and between ±π/2 and
±π, respectively. For the reconstruction of the speech spectrum, both gains are multiplied
together and the lower values of the phases are neglected to obtain the desired speech spectrum
(a sketch of this two-gain structure is given after this paragraph). The results are compared
with other phase-based methods (such as phase spectrum compensation (PSC), exploiting the
conjugate symmetry of the short-time Fourier spectrum, and the MMSE-optimal STFT phase)
and are analyzed in terms of speech quality and intelligibility measures (such as SNR, SSNR,
SIG, SII, BAK, OVL and PESQ), informal subjective listening tests and spectrogram analysis.
The performance measures show that the proposed phase-ratio-based method provides a more
effective improvement in noisy speech in comparison to the other phase-based speech
enhancement methods.
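The following sketch shows one way such a two-gain structure can be written down; the
raised-cosine gain shapes and the cut-off used to neglect low-gain components are assumptions
made only for illustration, not the gain functions G1 and G2 derived in the thesis.

import numpy as np

def phase_ratio_enhance(noisy_spec, noise_spec, floor=0.1):
    # Wrapped phase difference between the noisy-speech and noise spectra
    phi = np.angle(noisy_spec * np.conj(noise_spec))
    a = np.abs(phi)
    # G1 attenuates components with |phi| in [0, pi/2]; G2 those in [pi/2, pi]
    # (raised-cosine shapes are assumed for illustration)
    g1 = np.where(a <= np.pi / 2, 0.5 * (1.0 - np.cos(2.0 * a)), 1.0)
    g2 = np.where(a > np.pi / 2, 0.5 * (1.0 - np.cos(2.0 * (np.pi - a))), 1.0)
    g = g1 * g2                      # both gains multiplied together
    g[g < floor] = 0.0               # neglect the lower values before reconstruction
    return g * noisy_spec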
The implemented algorithms are evaluated for various languages, viz. Hindi, Kannada,
Bengali, Malayalam, Tamil, Telugu and English. The Indian-language data used for evaluation
are taken from the IIIT-H Indic Speech Databases, developed at the Speech and Vision Lab,
IIIT-Hyderabad for the purpose of building speech synthesis systems for Indian languages. The
speech data were recorded by native speakers of each language. The recording was done in a
studio environment using a standard headset microphone connected to a Zoom handy recorder.
A set of 1000 sentences was selected for each language; these sentences were chosen to cover
the 5000 most frequent words in the text corpus of the corresponding language. The NOIZEUS
database of clean and noisy speech was used for the English sentences. This database contains
30 IEEE sentences produced by three male and three female speakers. The real-world
background noise sources at different SNRs were taken from the AURORA and NOISEX-92
databases, and include suburban train, babble, car, exhibition hall, restaurant, street, airport
and train-station noise.
In a nutshell, the present work is an effort to determine the suitability of various
single-channel speech enhancement techniques for obtaining the maximum speech quality and
intelligibility.