SPEECH ENHANCEMENT BY FORMANT SHARPENING IN THE CEPSTRAL DOMAIN

David Cole and Sridha Sridharan
Speech Research Laboratory, School of Electrical and Electronic Systems Engineering, Queensland University of Technology

ABSTRACT: This paper presents a method for enhancing speech signals in the root cepstral domain, typically in conjunction with the cepstral subtraction technique. The effect of the processing is similar to that of the spectral sharpening method performed in the time domain, but with better performance and, when combined with cepstral subtraction, a much lower computational requirement. The portions of the signal corresponding to formant peaks are further amplified, while formant valleys are attenuated, with the aims of further reducing noise in those regions and improving speech quality. The cepstral procedure maintains the spectral tilt of every speech segment, unlike the time domain approach, which only attempts to maintain the long-term average spectral tilt. The scheme was devised as part of a system designed for forensic speech enhancement, where speech intelligibility is an important consideration.

INTRODUCTION

Speech enhancement is a common requirement in speech processing systems, either as pre-processing for procedures such as speech or speaker recognition, or for improving the quality or intelligibility of speech for human audition. Its goals therefore vary with the use of the enhanced output: the procedures used as pre-processing for speech or speaker recognition may be quite different from those intended to improve human reception of the speech.

The enhancement technique described in this paper was designed as an addition to the noise reduction system described by Fisher and Sridharan (1994), whose purpose was forensic speech enhancement.
They used a combination of spectral subtraction and cepstral subtraction to produce enhancement results suitable for such an application, which requires that intelligibility not be impaired and preferably be improved. The procedure presented here manipulates the cepstrum of the signal to enhance the formant structure of the speech, at negligible computational cost when the signal cepstrum is already available. It has similar aims to the time domain spectral sharpening technique, but preserves spectral tilt better than that method.

SPECTRAL SHARPENING

The use of spectral sharpening for speech enhancement and noise reduction was suggested by Schaub and Straub (1991). The technique was originally used for adaptive post-filtering in speech coding schemes based on linear prediction (Ramamoorthy and Jayant 1984). The basis is the linear predictive filter defined by

$$A(z) = a_1 z^{-1} + a_2 z^{-2} + \cdots + a_m z^{-m}$$

These coefficients are used for post-filtering of the decoded signal using the transfer function

$$H(z) = \frac{1 - A(z/\beta)}{1 - A(z/\gamma)}$$

with filter parameters $0 < \beta < \gamma < 1$.

Accepted after full review page 244
The result of this post-filtering is a sharpening of the formant structure of the speech signal, with amplified formant peaks and attenuated formant valleys, effectively due to the poles of the linear predictive resynthesis filter being shifted closer to the unit circle. As well as the enhancement application described by Schaub and Straub, this post-filtering technique has also been used to improve the quality of synthesized speech (Dines and Sridharan 2001).

In the enhancement scheme of Schaub and Straub, high-pass filtering is applied to the speech signal before the spectral sharpening filter, in an attempt to compensate for the general spectral tilt of the long-term speech spectrum and so avoid overemphasis of the spectral tilt characteristic. As will be shown here, this is not ideal, as the short-term tilt of the speech spectrum often differs markedly from the long-term characteristic. A block diagram of the scheme is shown in Figure 1.

Figure 1. Spectral sharpening in the time domain using linear prediction. (Block diagram: LP analysis supplying $a_1 \ldots a_m$; high-pass filter $1 - \alpha z^{-1}$; sharpening filter $H(z)$ with parameters $\beta$, $\gamma$.)

Figure 2 illustrates the effect of the time domain spectral sharpening scheme. It shows the smoothed (10th-order LPC) frequency spectrum of the vowel /a/ before (solid line) and after (dashed line) processing. The high-pass pre-emphasis filter parameters were adjusted to maintain approximately the spectral tilt of the original spectrum for this segment.

Figure 2. Smoothed spectrum of phoneme /a/ (a) before time domain processing (solid line) (b) after time domain processing (dashed line)
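The time domain scheme described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the implementation of Schaub and Straub: it assumes the post-filter has the form H(z) = (1 - A(z/β)) / (1 - A(z/γ)), that the LP coefficients come from the autocorrelation method, and that the pre-emphasis is a first-order filter 1 - αz^-1. All function names and the default values of α, β and γ are hypothetical.

```python
import numpy as np

def lpc(frame, order=10):
    """Autocorrelation-method LPC. Returns a_1..a_m such that
    s(n) is predicted by sum_i a_i * s(n - i)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    # Toeplitz normal equations R a = r[1:], lightly regularised (assumption).
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[lags] + 1e-9 * r[0] * np.eye(order)
    return np.linalg.solve(R, r[1:])

def sharpening_coeffs(a, beta, gamma):
    """Numerator and denominator of H(z) = (1 - A(z/beta)) / (1 - A(z/gamma)).
    Replacing z by z/beta scales coefficient a_i by beta**i."""
    i = np.arange(1, len(a) + 1)
    num = np.concatenate(([1.0], -a * beta ** i))
    den = np.concatenate(([1.0], -a * gamma ** i))
    return num, den

def sharpening_response(a, beta, gamma, n_fft=512):
    """Magnitude response of H(z) on n_fft // 2 + 1 bins from 0 to pi."""
    num, den = sharpening_coeffs(a, beta, gamma)
    return np.abs(np.fft.rfft(num, n_fft) / np.fft.rfft(den, n_fft))

def iir_filter(num, den, x):
    """Direct-form difference equation; den[0] is assumed to be 1."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(num)):
            if n - k >= 0:
                acc += num[k] * x[n - k]
        for k in range(1, len(den)):
            if n - k >= 0:
                acc -= den[k] * y[n - k]
        y[n] = acc
    return y

def sharpen_frame(frame, order=10, alpha=0.5, beta=0.5, gamma=0.8):
    """Fixed pre-emphasis followed by the LP-derived sharpening filter.
    alpha, beta and gamma are illustrative values, not taken from the paper."""
    a = lpc(frame, order)
    pre = frame - alpha * np.concatenate(([0.0], frame[:-1]))  # 1 - alpha z^-1
    num, den = sharpening_coeffs(a, beta, gamma)
    return iir_filter(num, den, pre)
```

Because β < γ, the zeros of the numerator lie at β times the LP roots and the poles at γ times the LP roots, so each pole sits closer to the unit circle than its matching zero and the response rises at formant frequencies and dips between them. Note that `alpha` is a single fixed value applied to every frame, which is the property examined in the discussion of Figure 3.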
Figure 3 shows another segment of the same utterance, corresponding to the phoneme /s/. The original spectrum (solid line) shows the rising spectral tilt typical of this phoneme. After processing (with parameters identical to those used for Figure 2), the smoothed spectrum of the processed segment shows the problem inherent in this scheme: the pre-emphasis compensates for an expected falling spectral tilt and this, combined with the rising spectral tilt of the post-filter, produces an overemphasis of the higher frequencies.

Figure 3. Smoothed spectrum of phoneme /s/ (a) before time domain processing (solid line) (b) after time domain processing (dashed line)

ROOT CEPSTRAL PROCESSING OF SPEECH SIGNALS

The use of the root cepstrum in speech processing is generally associated with cepstral subtraction. The techniques of spectral subtraction and cepstral subtraction are both well known and widely used for noise reduction, where the speech signal $s(n)$ is corrupted additively by noise $z(n)$ to produce the observed noisy speech signal $y(n)$:

$$y(n) = s(n) + z(n)$$

where $n$ is the discrete time index. The discrete frequency domain equivalent is

$$Y(k) = S(k) + Z(k)$$

Spectral subtraction involves calculating an estimate $|\tilde{Z}(k)|$ of the noise spectral magnitude and subtracting this from the observed spectral magnitude $|Y(k)|$. The resultant estimate of the clean signal spectral magnitude $|\tilde{S}(k)|$ is recombined with the noisy signal phase and inverse Fourier transformed to yield the estimated clean speech signal $\tilde{s}(n)$:

$$\tilde{s}(n) = F^{-1}\left\{ \left( |Y(k)| - |\tilde{Z}(k)| \right) e^{j \angle Y(k)} \right\}$$
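The spectral subtraction step above can be illustrated for a single frame as follows. This is only a sketch: the function name is hypothetical, the noise magnitude estimate is taken as given, and negative differences are floored at zero, which is an assumption not stated in the text.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag):
    """Magnitude spectral subtraction for one frame.

    noisy_frame : real samples y(n)
    noise_mag   : estimated noise magnitude per rfft bin (scalar or array)
    """
    Y = np.fft.rfft(noisy_frame)
    # Subtract the noise magnitude estimate; flooring at zero is an assumption.
    mag = np.maximum(np.abs(Y) - noise_mag, 0.0)
    # Recombine with the noisy phase and return to the time domain.
    S = mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(S, n=len(noisy_frame))
```

With a zero noise estimate the frame is returned unchanged, since the magnitude and phase of $Y(k)$ are simply recombined.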
Generally, an attempt is made to calculate the noise estimate during speech pauses, although a common alternative is to use an auto-regressive average of the noisy speech signal as the noise estimate.

The original cepstrum calculation used the inverse Fourier transform of the logarithm of the spectral magnitude. This is impractical in low noise conditions, where the undefined log(0) condition can arise. Thus, the root cepstrum is generally used for speech cepstral subtraction. The root cepstrum of the observed signal is:

$$\hat{y}(n) = F^{-1}\left\{ |Y(k)|^{1/p} \right\}$$

with $p$ typically 2 or 4. Cepstral subtraction follows the same form as spectral subtraction, with an accumulated cepstral noise estimate subtracted from the observed cepstrum. Despite its lack of a mathematical basis (since subtraction in the cepstral domain equates to deconvolution, not subtraction, in the time domain), cepstral subtraction is accepted as superior in performance to spectral subtraction for additive noise removal (Wu et al 1991), producing a less distorted output. Fisher and Sridharan (1994) concluded that spectral subtraction outperformed cepstral subtraction at low signal to noise ratios, and so devised a tandem system using both techniques, which outperformed either used singly. The spectral sharpening procedure described below was developed to further improve the output quality of this system.

SPECTRAL SHARPENING IN THE ROOT CEPSTRAL DOMAIN

The use of the cepstrum in speech processing is not confined to speech enhancement. It has also been used widely in the speech and speaker recognition fields because of its ability to separate vocal tract and excitation information. For vocal tract characterisation, the cepstrum is similar to linear prediction coefficients, although the information contained in the two sets of parameters is not identical. Linear prediction provides formant location and bandwidth defined by the roots of the LPC equation.
In contrast, the cepstral coefficients can be used to build up a smoothed spectrum using the sinusoidal basis functions of the Fourier transform. Thus it is common for speech or speaker recognition systems to use either linear predictive coefficients or (low order) cepstral coefficients to parameterise vocal tract characteristics.

The use of the root cepstrum for spectral sharpening utilises the encoding of the smoothed speech spectrum in the low order cepstral coefficients. By increasing the magnitude of selected cepstral coefficients, it is possible to increase the amplitude of the corresponding smoothed spectral envelope of the reconstructed signal. By considering the nature of the Fourier transformation, it is clear that the main contribution to the spectral tilt of a speech segment will be contained in the lowest order cepstral coefficients: $n = 1$ and possibly also $n = 2$. Thus, to maintain the spectral tilt of a speech segment, these lowest order cepstral coefficients are left unmodified (as is the 0th coefficient, which represents the energy level of the segment).

Expressing this scheme mathematically, we operate on the input cepstrum $\hat{y}(n)$ to produce the output cepstrum $\hat{q}(n)$ as follows:

$$\hat{q}(n) = k_n \, \hat{y}(n), \qquad k_n = \begin{cases} b & n_l < n < n_h \\ 1 & \text{otherwise} \end{cases}$$

Typically the boost factor $b$, which is greater than 1, is applied (for a frame size of 256 samples) for $n = 3$ to 20 or thereabouts, which is adequate to represent the speech formant structure. Clearly, the procedure requires very little computational effort when the cepstrum has already been calculated for a cepstral subtraction procedure.
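The cepstral boost described above can be sketched for a single 256-sample frame as follows. The function names and the value b = 1.5 are hypothetical. Two implementation details are assumptions rather than statements from the paper: with a full-length FFT the root cepstrum is symmetric, so the mirror-image coefficients at N - n are boosted together with those at n to keep the reconstructed root spectrum real, and any negative values of that root spectrum are floored at zero before it is raised to the power p.

```python
import numpy as np

def root_cepstrum(frame, p=2):
    """Root cepstrum F^-1{|Y(k)|^(1/p)} and the phase of Y(k)."""
    Y = np.fft.fft(frame)
    c = np.fft.ifft(np.abs(Y) ** (1.0 / p)).real
    return c, np.angle(Y)

def sharpen_cepstrum(c, b=1.5, n_l=2, n_h=21):
    """Scale coefficients with n_l < n < n_h by b; leave the rest unchanged.

    The defaults boost n = 3..20. Coefficients 0, 1 and 2 are untouched,
    so the segment energy and spectral tilt terms are not modified.
    """
    q = c.copy()
    N = len(c)
    idx = np.arange(n_l + 1, n_h)
    q[idx] *= b
    q[N - idx] *= b  # mirror-image coefficients (assumption, see lead-in)
    return q

def resynthesize(q, phase, p=2):
    """Rebuild the time signal from a (modified) root cepstrum and a phase."""
    root_mag = np.maximum(np.fft.fft(q).real, 0.0)  # zero floor is an assumption
    return np.fft.ifft(root_mag ** p * np.exp(1j * phase)).real

def cepstral_sharpen_frame(frame, b=1.5, p=2):
    c, phase = root_cepstrum(frame, p)
    return resynthesize(sharpen_cepstrum(c, b), phase, p)
```

When the cepstrum is already available from a cepstral subtraction stage, the only extra work is the two in-place multiplications in `sharpen_cepstrum`, which is consistent with the low computational cost noted above. Setting `b=1` returns the input frame unchanged.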
Figures 4 and 5 show the performance of the scheme for the same two speech segments used previously. The /a/ segment output is almost identical to that produced by the time domain scheme, and the /s/ segment output is quite clearly superior, maintaining the overall spectral tilt well while enhancing the formant structure.

Figure 4. Smoothed spectrum of phoneme /a/ (a) before cepstral processing (solid line) (b) after cepstral processing (dashed line)

Figure 5. Smoothed spectrum of phoneme /s/ (a) before cepstral processing (solid line) (b) after cepstral processing (dashed line)
OUTPUT QUALITY

The maintenance of spectral tilt for individual segments, as shown in the above figures, results in an audible improvement in the quality of the cepstrally enhanced speech compared with the time domain enhanced version. The need to use fixed pre-emphasis in the time domain scheme means that, in general, the spectral tilt of individual segments will be altered, adversely affecting the quality of the speech output. Usually, the result is a general increase in high frequency content, producing an unpleasant, harsh output. In contrast, the cepstral enhancement method maintains segmental spectral tilt, producing more natural and higher quality speech. These quality comparisons have not yet been quantified, but audio demonstrations that show these differences may be found at http://www.rcsavt.bee.qut.edu.au/pages/demonstrations.html

REFERENCES

Dines, J. & Sridharan, S. (2001). Trainable speech synthesis with hidden Markov models, Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP-2001), 833-836.

Fisher, A. & Sridharan, S. (1994). Speech Enhancement for Forensic Applications, International Conference on Speech Science and Technology, SST-94, 40-45.

Ramamoorthy, V. & Jayant, N. (1984). Enhancement of ADPCM speech by adaptive post-filtering, Technical Journal of AT&T, 63(8), 1465-1475.

Schaub, A. & Straub, P. (1991). Spectral sharpening for speech enhancement / noise reduction, Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP-91), 993-996.

Wu, C.S., Nguyen, V.V., Sabrin, H., Kushner, W. & Damoulakis, J. (1991). Fast self-adapting broadband noise removal in the cepstral domain, Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP-91), 957-960.