A single channel speech enhancement technique exploiting human auditory masking properties

To enhance extremely corrupted speech signals, an Improved Psychoacoustically Motivated Spectral Weighting Rule (IPMSWR) is proposed that controls the predefined residual noise level by a time-frequency dependent parameter. Unlike conventional Psychoacoustically Motivated Spectral Weighting Rules (PMSWR), the level of the residual noise is here varied throughout the enhanced speech based on the discrimination between regions with speech presence and speech absence by means of the segmental SNR within critical bands. Controlling the level of the residual noise in the noise-only regions in this way avoids the unpleasant residual noise perceived at very low SNRs. To derive the gain coefficients, the computation of the masking curve and the estimation of the corrupting noise power are required. Since the clean speech is generally not available for a single channel speech enhancement technique, the rough clean speech components needed to compute the masking curve are here obtained using advanced spectral subtraction techniques. To estimate the corrupting noise, a new technique is employed that relies on rapid adaptation and recursive smoothing principles. The performance of the proposed approach is compared objectively and subjectively with that of conventional approaches to highlight the aforementioned improvement.


Introduction
The enhancement of speech degraded by environmental or background noise remains an open topic, although many significant approaches have been presented over the years. The enhancement becomes more complicated for single channel noise reduction techniques, where no additional information about the corrupting noise and the real clean speech is available. Since the background noise is the factor that most degrades the quality and intelligibility of the speech, it should be estimated first using adequate techniques such as (Doblinger, 1995; Martin, 2001; Cohen, 2002, 2003; Rangachar, 2004; Stouten, 2006; Nsabimana, 2009). Thereafter, the estimated noise power is used in the derivation of the gain function for a desired noise reduction technique such as (Ephraim, 1985; Gustafsson, 1998; Virag, 1999; Cohen, 2002; Nsabimana, 2009).
Correspondence to: F. X. Nsabimana (nsabfran@yahoo.fr)
In (Tsoukalas, 1993; Gustafsson, 1998; Virag, 1999; Yi, 2004; Hu, 2004; Nsabimana, 2009), Psychoacoustically Motivated Spectral Weighting Rules (PMSWR), which derive a gain function based on the psychoacoustical properties of the human hearing system, were proposed. Unlike techniques such as the Log Spectral Amplitude (LSA) and the Optimally Modified Log Spectral Amplitude (OMLSA) (Ephraim, 1985; Cohen, 2002), the Psychoacoustically Motivated Spectral Weighting Rules (Gustafsson, 1998; Nsabimana, 2009) do not aim at a complete noise removal. Instead, they preserve a predefined amount of the original corrupting noise throughout the enhanced speech to account for the loss of weak speech components. The gain function is then derived by minimizing the speech and noise power distortions with respect to the masking curve of the rough clean speech estimate (Gustafsson, 1998; Nsabimana, 2009). While the PMSWR approach (Gustafsson, 1998) generally retains a predefined constant amount of the original corrupting noise throughout the enhanced speech, the IPMSWR approach (Nsabimana, 2009) controls the level of the residual noise based on the discrimination between regions with speech presence and speech absence by means of the segmental SNR within critical bands. This paper presents a noise reduction and speech enhancement algorithm relying on the IPMSWR approach (Nsabimana, 2009), where the estimated corrupting noise power is obtained using rapid adaptation and recursive smoothing principles (RARS) (Nsabimana, 2009) instead of the OSMS approach (Martin, 2001; Nsabimana, 2009). The investigations in (Nsabimana, 2009) revealed that the RARS approach adapts faster and provides more accurate mean estimates than the OSMS approach (Martin, 2001), especially for very low SNRs. To motivate further improvements of the IPMSWR approach, the importance of the phase is also emphasized in the experimental results.
Published by Copernicus Publications on behalf of the URSI Landesausschuss in der Bundesrepublik Deutschland e.V.
The outline of the paper is as follows: Sect. 2 presents the proposed technique, while experimental results and conclusions are presented in Sects. 3 and 4 respectively.

The proposed approach
Figure 1 depicts the complete system of the proposed approach. In the analysis stage, the corrupted speech is processed frame by frame with an overlapping rate of 75%. The estimated noise power $R_{\tilde{n}}(k,m)$ is computed using the RARS approach (Nsabimana, 2009), while the rough clean speech estimate $S_1(k,m)$ needed for the computation of the masking threshold $R_T(k,m)$ is obtained using the OMLSA approach (Cohen, 2002). The masking curve $R_T(k,m)$ is computed as described in (Johnston, 1988; Zwicker, 1990; Zölzer, 2005) and summarized in (Virag, 1999). In the following, the derivation of the gain function is detailed.
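The analysis stage described above can be sketched as follows. This is only an illustration under stated assumptions: the frame length of 512 samples is taken from the experimental section, while the Hann window, the function name `stft_frames` and the random test signal are hypothetical choices that the paper does not specify.

```python
import numpy as np

def stft_frames(x, frame_len=512, overlap=0.75):
    """Return short-time DFT coefficients X(k, m) of the signal x.

    frame_len and overlap follow the values quoted in the paper;
    the Hann window is an assumption made for this sketch.
    """
    hop = int(round(frame_len * (1.0 - overlap)))  # 75% overlap -> hop of 128 samples
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[m * hop : m * hop + frame_len] * window
                       for m in range(n_frames)])
    # rfft keeps the non-negative frequency bins; transpose to (bins k, frames m)
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
x = rng.standard_normal(16000)   # stand-in for a corrupted speech signal
X = stft_frames(x)
```

With a hop of 128 samples every input sample is covered by four consecutive frames, which is what the 75% overlapping rate amounts to.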
Let the spectrum of a corrupted speech signal $X(k,m)$ be defined as
$$X(k,m) = S(k,m) + N(k,m), \tag{1}$$
where $S(k,m)$ and $N(k,m)$ are the short-time DFT coefficients at frequency bin $k$ and frame number $m$ for the clean speech and the additive noise respectively. $S(k,m)$ and $N(k,m)$ are also assumed to be statistically independent and zero mean. As a complete noise removal is not intended for psychoacoustically motivated spectral weighting rules, the desired spectrum of the enhanced speech is defined as
$$\tilde{S}(k,m) = S(k,m) + \zeta(k,m)\,N(k,m), \tag{2}$$
where $\zeta(k,m)N(k,m)$ represents the estimated amount of the residual noise. The estimated spectrum of the enhanced speech, however, is given by (see Fig. 1)
$$\hat{S}(k,m) = G(k,m)\,X(k,m). \tag{3}$$
The difference between Eqs. (2) and (3) yields the estimation error
$$E(k,m) = \hat{S}(k,m) - \tilde{S}(k,m) = (G-1)\,S(k,m) + (G-\zeta)\,N(k,m), \tag{4}$$
with the PSD of the error expressed as
$$R_E(k,m) = (G-1)^2 R_s(k,m) + (G-\zeta)^2 R_n(k,m), \tag{5}$$
where the indexes $k$ and $m$ are omitted for $G$ and $\zeta$ only for the sake of simplicity. $R_s(k,m)$ and $R_n(k,m)$ represent here the PSD of the clean speech $s(n)$ and of the corrupting noise $n(n)$ respectively. Equation (5) is thus composed of the speech power distortion $R_{E_s}(k,m) = (G-1)^2 R_s(k,m)$ and the residual noise power distortion $R_{E_n}(k,m) = (G-\zeta)^2 R_n(k,m)$. The optimal $G(k,m)$ can be obtained by computing the minimum of the solid red parabola ($R_E$) of Fig. 2, while $G(k,m)$ for the just noticeable distortion case is derived from the crossing point between the green curve with squares ($R_T$) and the blue curve with triangles ($R_{E_n}$) of Fig. 2. As a complete masking of both distortions, $R_E < R_T$, is practically not possible, only the masking of the residual noise power distortion is taken into account. By masking the residual noise power distortion, the speech power distortion is also assumed to be minimized (Gustafsson, 1998). Equating the noise power distortion $R_{E_n}$ to the masking curve of the rough clean speech $R_T$, the spectral weighting rule is derived as
$$G(k,m) = \zeta(\lambda,m) + \sqrt{\frac{R_T(k,m)}{R_n(k,m)}}, \tag{6}$$
where $\lambda$ denotes a frequency band and $\zeta(\lambda,m)$ is chosen based on the corresponding subband segmental SNR, Eq. (7). As shown in Fig. 3, $\zeta(\lambda,m)$ is computed from a shifted sigmoid function. In Eq. (7), $k_s$ and $k_e$ represent the starting and ending bins of the $i$-th band.
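The weighting rule obtained by equating the residual noise distortion to the masking curve can be illustrated with a short sketch. It assumes that the noise distortion has the form (G − ζ)² R_n, as implied by the error definition above. The function names, the sigmoid constants, the direction of the sigmoid (smaller ζ at low segmental SNR) and the cap of the gain at 1 are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np

def residual_noise_factor(seg_snr_db, zeta_min=0.05, zeta_max=0.3,
                          midpoint_db=0.0, slope=0.5):
    """Shifted sigmoid mapping a subband segmental SNR (dB) to zeta.

    All constants and the increasing direction are assumptions.
    """
    s = 1.0 / (1.0 + np.exp(-slope * (seg_snr_db - midpoint_db)))
    return zeta_min + (zeta_max - zeta_min) * s

def ipmswr_gain(R_T, R_n, zeta, floor=1e-12):
    """Gain that makes the residual noise distortion (G - zeta)^2 R_n equal R_T."""
    G = zeta + np.sqrt(R_T / np.maximum(R_n, floor))
    return np.minimum(G, 1.0)   # unity cap is an added safeguard (assumption)

R_T = np.array([0.01, 0.04])    # masking threshold per bin (toy values)
R_n = np.array([1.0, 1.0])      # noise power per bin (toy values)
zeta = 0.1
G = ipmswr_gain(R_T, R_n, zeta)
noise_distortion = (G - zeta) ** 2 * R_n   # matches R_T where the cap is inactive
```

Where the cap is not active, the residual noise distortion lands exactly on the masking threshold, which is the just noticeable distortion condition described in the text.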
To reduce the spectral outliers in specific frequency bands, the gain function is additionally modified based on the energy of the coefficients within critical bands. The results with this technique are shown in Figs. 5 and 6, where the corrupting noise is properly controlled and the sibilant sounds are preserved.

Noise estimation
As the computation of Eq. (6) requires the knowledge of the corrupting noise power, only an estimated noise power can be used for the single channel case. Therefore, the Rapid Adaptation and Recursive Smoothing (RARS) approach (Nsabimana, 2009), which is depicted in Fig. 4, is applied here.
In the RARS approach, the noise power is first estimated using the Optimal Smoothing and Minimum Statistics (OSMS) approach (Martin, 2001) with a very short window. This yields an overestimation of the noise power. Based on the smoothed a posteriori SNR computed from the OSMS noise power, a VAD index I is derived to compute the speech presence probability P and a smoothing parameter η. This smoothing parameter is finally applied to the unbiased estimated noise power $R_u$ from the OSMS approach to compensate for the overestimation. To improve the adaptation time of the estimated noise power, a condition BC is used to quickly track fast changes in the noise power. Results from (Nsabimana, 2009) reveal that the RARS approach adapts faster and provides more accurate mean estimates than the OSMS approach (Martin, 2001), especially for very low SNRs.
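The recursive smoothing step can be sketched as below. The exact RARS update formulas are not given in the text, so this sketch assumes a common probability-dependent form in which η = α + (1 − α)P, so that the estimate is held during likely speech and follows the unbiased OSMS power otherwise. The function name `smooth_noise_psd` and the constant `alpha` are hypothetical, and the condition BC for fast tracking is not modeled.

```python
import numpy as np

def smooth_noise_psd(R_u, P, alpha=0.85):
    """Recursively smooth the unbiased noise power R_u (bins x frames).

    P is the speech presence probability in [0, 1] with the same shape.
    The mapping from P to eta is an assumption made for this sketch.
    """
    R_u = np.asarray(R_u, dtype=float)
    eta = alpha + (1.0 - alpha) * np.asarray(P, dtype=float)
    R_hat = np.empty_like(R_u)
    R_hat[:, 0] = R_u[:, 0]                 # initialise with the first frame
    for m in range(1, R_u.shape[1]):
        # eta close to 1 (speech likely): hold the previous estimate
        # eta close to alpha (noise only): move towards the new observation
        R_hat[:, m] = eta[:, m] * R_hat[:, m - 1] + (1.0 - eta[:, m]) * R_u[:, m]
    return R_hat

R_u = np.array([[1.0, 2.0, 3.0]])           # toy unbiased noise power, one bin
held = smooth_noise_psd(R_u, np.ones((1, 3)))     # speech present in every frame
tracked = smooth_noise_psd(R_u, np.zeros((1, 3)), alpha=0.5)
```

Holding the estimate while speech is present is what prevents speech energy from leaking into the noise power.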

Experimental results
This section presents the performance evaluation of the proposed enhancement technique using the phase of the corrupted speech on the one hand (see Figs. 5 and 6) and the phase of the clean speech on the other hand (see Fig. 8). To allow a fair comparison, tests were carried out for different SNRs using additive white Gaussian noise. A window length of 512 samples with a hop size of 25% for analysis and synthesis is applied for all approaches. Figures 5 and 6 present a subjective comparison in terms of spectrograms. These results show that the IPMSWR approach preserves sibilants (s-like sounds) even for very low SNRs (5-10 dB).
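A test signal of the kind used here, clean speech plus additive white Gaussian noise at a chosen SNR, can be generated as sketched below. The function name `add_white_noise`, the sampling rate and the sine tone used as a stand-in for speech are assumptions for illustration only.

```python
import numpy as np

def add_white_noise(clean, snr_db, rng):
    """Return clean + white Gaussian noise scaled to the requested global SNR (dB)."""
    noise = rng.standard_normal(len(clean))
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that p_clean / (scale^2 * p_noise) = 10^(snr_db / 10)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
fs = 16000                                              # assumed sampling rate
clean = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)    # toy stand-in for speech
noisy = add_white_noise(clean, 5.0, rng)
achieved_db = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Because the noise is rescaled from its measured power, the achieved global SNR matches the target up to rounding error.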
Figure 7 reproduces the results of the listening test with headphones reported in (Nsabimana, 2009). The fifteen subjects recruited for this test were all working in our lab. For this test, subjects first had to find the hidden reference signal and assign it 100%. The results from the simulated algorithms were then graded relative to the reference signal. The Mean Opinion Score (ITU-T P.862) represents the grades of the three enhancement techniques for three different kinds of noise. Figure 7 reveals that the IPMSWR approach was graded best.

Usefulness of phase information
The importance of the phase information in speech enhancement is currently being investigated (Shannon, 2006; Shi, 2006; Aarabi, 2006; Shi, 2007). To motivate further improvements of the IPMSWR approach, the role of the phase is emphasized here using clean speech degraded with artificial additive white Gaussian noise at different SNRs from 0 to 35 dB (see Fig. 8).
Figure 8 depicts the segmental SNR improvement of the IPMSWR approach when the phase of the disturbed speech is used for the resynthesis on the one hand and when the phase of the clean speech is used on the other hand. The black curve with squares represents the segmental SNR of the disturbed speech, which serves as the reference segmental SNR. The red curve with circles, which depicts the results of the IPMSWR using the phase of the disturbed speech, reveals an overall segmental SNR gain of approximately 5 dB. The dashed blue curve with diamonds, which depicts the results of the PMSWR using the phase of the disturbed speech, remains close to the results of the IPMSWR approach only for SNRs higher than 15 dB, as expected. The dashed green curve with triangles, which depicts the results of the IPMSWR using the phase of the clean speech, reveals instead an overall segmental SNR gain of approximately 8 dB. This clearly outlines the usefulness of the phase information.
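The effect of the resynthesis phase on the segmental SNR can be reproduced in miniature with the sketch below. It is a simplified illustration and not the IPMSWR system: the noisy magnitude is left unprocessed, frames are non-overlapping and rectangular, and the clamping range of the per-frame SNR, the function names and the toy signal are all assumptions.

```python
import numpy as np

def segmental_snr_db(clean, processed, frame_len=512, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi] (assumed range)."""
    vals = []
    for m in range(len(clean) // frame_len):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        err = clean[sl] - processed[sl]
        snr = 10 * np.log10((np.sum(clean[sl] ** 2) + 1e-12) /
                            (np.sum(err ** 2) + 1e-12))
        vals.append(np.clip(snr, lo, hi))
    return float(np.mean(vals))

def with_phase(magnitude, phase_source):
    """Combine a magnitude spectrum with the phase of another spectrum."""
    return magnitude * np.exp(1j * np.angle(phase_source))

def resynthesize(noisy, phase_signal, frame_len=512):
    """Rebuild a signal from the noisy magnitude and the phase of phase_signal."""
    out = np.zeros_like(noisy)
    for m in range(len(noisy) // frame_len):
        sl = slice(m * frame_len, (m + 1) * frame_len)
        mag = np.abs(np.fft.rfft(noisy[sl]))
        spec = with_phase(mag, np.fft.rfft(phase_signal[sl]))
        out[sl] = np.fft.irfft(spec, n=frame_len)
    return out

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.01 * np.arange(8 * 512))   # toy stand-in for speech
noisy = clean + 0.5 * rng.standard_normal(clean.size)

snr_noisy_phase = segmental_snr_db(clean, resynthesize(noisy, noisy))
snr_clean_phase = segmental_snr_db(clean, resynthesize(noisy, clean))
```

Even without any gain function, replacing the noisy phase by the clean phase cannot increase the error in any frequency bin, since | |X| − |S| | ≤ |X − S|. This is consistent with the higher segmental SNR gain reported for the clean-phase curve.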

Conclusions
A speech enhancement technique based on psychoacoustic principles is proposed here. The key components of this approach are a time-frequency dependent control parameter for the residual noise within critical bands and a better estimate of the rough clean speech. As additional information on the corrupting noise is not available, a technique to estimate the corrupting noise power has been presented. Simulation results at different SNRs reveal that the proposed technique performs best among the compared approaches and preserves sibilant sounds even at very low SNRs. To motivate further improvements of the IPMSWR approach, the importance of the phase information has been emphasized in the experimental results. The obtained results show that a higher SNR gain is achieved when the phase of the clean speech is used.
Future work should thus address the estimation of the phase to increase the speech intelligibility. To avoid abrupt jumps, the gain coefficients should also be properly controlled. Parameter optimization remains necessary.

Fig. 2. Error minimization for the derivation of $G(k,m)$. PSD of residual noise distortion $R_{E_n}$, PSD of speech distortion $R_{E_s}$, PSD of estimation error $R_E$ and masking threshold $R_T$.