University of Washington, Department of Electrical Engineering
EE516 Computer Speech Processing, Winter 2005
Lecture 6 Slides, Jan 31st, 2005

Outline of Today's Lecture
- Cepstral analysis of speech signals
Books & Sources
- Huang, Acero, Hon, Spoken Language Processing, Chapter 6
- Rabiner & Juang, Fundamentals of Speech Recognition
- Deller et al., Discrete-Time Processing of Speech Signals
- Beranek, Acoustics, 1993
- Flanagan, Speech Analysis, Synthesis and Perception
- Clark & Yallop, An Introduction to Phonetics and Phonology
- Ladefoged, A Course in Phonetics
- Lieberman & Blumstein, Speech Physiology, Speech Perception, and Acoustic Phonetics
- K. Stevens, Acoustic Phonetics
- Malmberg, Manual of Phonetics
- Rossing, The Science of Sound
- Linguistics 001, University of Pennsylvania

Background
- Extract the important parts of the speech signal while filtering out the parts that do not matter for intelligibility:
  - speech compression: encode only the essential information, using fewer bits; represent speaker identity separately
  - speech recognition: needs a concise, accurate, robust, speaker-normalized representation of the speech signal
Recall Linear Systems
Suppose we are given a signal s[n] = x1[n] + x2[n], where x1 is a low-frequency signal and x2 is high-frequency noise. The spectrum combines linearly as well: S(jω) = X1(jω) + X2(jω). We can therefore use a linear low-pass filter to (mostly) recover x1.

Production Model
The glottis produces a periodic pulse train, E(jω), which excites the time-varying vocal tract system function Φ(jω), producing the speech signal S(jω) = E(jω) Φ(jω). The information is in Φ, but we cannot use a linear filter to separate these two components: the excitation and the vocal tract response are multiplied in frequency (convolved in time), not added.
Goal: turn convolution into a linear operation (namely, addition).
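A minimal NumPy sketch of this goal (not from the original slides; the short toy sequences standing in for the excitation and the vocal tract response are assumed purely for illustration): convolution in time is multiplication in frequency, and taking the log magnitude turns that multiplication into addition.

```python
import numpy as np

# Toy stand-ins (assumed): a short "excitation" e[n] and a
# decaying-exponential "vocal tract" impulse response phi[n].
e = np.array([1.0, 0.9, -0.3])
phi = 0.5 ** np.arange(16)

# Production model in time: speech is excitation convolved with the tract.
s = np.convolve(e, phi)

# With FFT length >= len(s), circular and linear convolution agree.
N = len(s)
S, E, Phi = (np.fft.rfft(x, N) for x in (s, e, phi))

# Multiplication in the frequency domain ...
assert np.allclose(S, E * Phi)
# ... becomes addition in the log-magnitude domain.
assert np.allclose(np.log(np.abs(S)), np.log(np.abs(E)) + np.log(np.abs(Phi)))
```

The zero-padding to a common FFT length matters: without it, the FFT would implement circular rather than linear convolution and the identity S = E·Φ would not hold exactly.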
Cepstral Processing
Complex cepstrum:
  s^[n] = IDTFT{ log S(e^jω) } = IDTFT{ log|S(e^jω)| + j arg S(e^jω) }
Keeps both the log-magnitude and phase terms, but the phase term is often not needed for speech processing (phase is less important).
Real cepstrum:
  c_s[n] = IDTFT{ log|S(e^jω)| }
Keeps only the log-magnitude term, i.e., assumes zero phase. Used very often in speech processing.
Since s[n] = e[n] * φ[n], the log turns the convolution into a linear combination:
  log|S(e^jω)| = log|E(e^jω)| + log|Φ(e^jω)|
Goal: find a way to separate the glottal excitation from the vocal tract response.

Real Cepstral Processing
log|S(e^jω)| is real and even, so c_s[n] is real and even in n.
For voiced speech analysis, s[n] is (approximately) periodic, so we can use a DFS representation of S(e^jω), with p = pitch period and D(k) the DFS coefficients.
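The definitions above can be checked numerically (a sketch, not from the slides: the `real_cepstrum` helper and the toy voiced-speech model, a pulse train through a decaying exponential, are assumed for illustration):

```python
import numpy as np

def real_cepstrum(frame, nfft=512):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame, nfft)))
    return np.fft.ifft(log_mag).real

# Toy voiced-speech model (assumed): a period-50 pulse train excitation
# convolved with a decaying-exponential "vocal tract" response, sized so
# the linear convolution fills the 512-point FFT exactly.
excitation = np.zeros(449)
excitation[::50] = 1.0
tract = 0.9 ** np.arange(64)
s = np.convolve(excitation, tract)          # length 449 + 64 - 1 = 512

c_s = real_cepstrum(s)
# The convolution s = e * phi becomes an addition of cepstra.
assert np.allclose(c_s, real_cepstrum(excitation) + real_cepstrum(tract))
# c_s[n] is real by construction and even (symmetric modulo the FFT length).
assert np.allclose(c_s[1:], c_s[:0:-1])
```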
Real Cepstral Processing
Since the glottal excitation is approximated by an impulse train, the DFS coefficients essentially sample the underlying vocal tract response.

Caveats
- The glottal pulse is not a true impulse: e(t) = g(t) * i(t), a glottal pulse shape g(t) convolved with an impulse train i(t).
- We typically window the speech with a length-L window. Multiplication by the window in time is convolution in frequency, which amounts to windowing the glottal pulse train and weighting it by the vocal tract response.
- We will make the approximation that the windowed speech spectrum still factors as an excitation term times the vocal tract response.
Real Cepstral Processing
Applying the real cepstrum to s[n] = e[n] * φ[n] gives
  c_s[n] = c_e[n] + c_φ[n]
a linear combination of what originally was a convolution.
Important points about C_s(ω) = log|S(e^jω)|:
- C_s(ω) is periodic with period 2π (because S(e^jω) is periodic with period 2π)
- C_s(ω) is real
- C_s(ω) is even (since s[n] is real, |S(e^jω)| is even)
DFS (line spectral coefficients): from the definition of the cepstrum (taking the IDTFT), and using the fact that C_s(ω) is real and even, the complex exponentials reduce to cosines. So we have gone from the IDTFT to a DCT-like transform, and the cosine coefficients α_n satisfy α_n = c_s(n), since c_s(n) is symmetric and even.

Note that C_s(ω) is in the frequency domain (it is a log spectrum), but when we take the IDTFT it lands in a "funny" time domain; at the same time we are looking at the spectrum of a spectrum (due to the DFS and DCT interpretation). So, what is it: time, frequency, or both? This ambiguity motivates new terminology:

  Frequency domain                     Quefrency domain
  1. Frequency domain              →   1. Quefrency domain
  2. Spectrum                      →   2. Cepstrum
  3. Frequency axis                →   3. Quefrency axis
  4. Harmonics                     →   4. Rahmonics
  5. Filtering (removing components) → 5. Liftering

The hope is that in the quefrency domain the two summands occupy different parts of the quefrency axis: the excitation is the quickly varying part of the log spectrum (high quefrency), and the vocal tract response is the slowly varying part (low quefrency). If so, we can separate them by liftering.
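The real/even/DCT claims above can be verified directly (a sketch, not from the slides; the toy random frame is assumed, since any real frame exhibits these symmetries):

```python
import numpy as np

# Any real frame will do; a toy random one is assumed here purely to
# check the symmetry and DCT-like structure of the real cepstrum.
rng = np.random.default_rng(1)
frame = rng.standard_normal(256)

N = 256
log_mag = np.log(np.abs(np.fft.fft(frame)))   # C_s(w): real and even in k

# The full inverse DFT gives the real cepstrum ...
c = np.fft.ifft(log_mag).real

# ... which, because C_s is real and even, reduces to a cosine sum
# (DCT-like): c[n] = (1/N) * sum_k C_s[k] * cos(2*pi*k*n/N).
k = np.arange(N)
n = np.arange(N)
c_cos = (np.cos(2 * np.pi * np.outer(n, k) / N) @ log_mag) / N
assert np.allclose(c, c_cos)

# And c[n] itself is real and even (symmetric modulo N).
assert np.allclose(c[1:], c[:0:-1])
```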
Real Cepstral Processing
Taking the log magnitude of the DTFT and then the IDTFT converts the signal to the quefrency domain. We can then do liftering to obtain a smoothed spectrum: we use a low-time lifter (analogous to a low-pass filter in frequency) to remove c_e(n), the glottal source, while retaining c_φ(n), the component containing the communicative information.
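A low-time lifter can be sketched as follows (not from the slides; the toy pulse-train-through-exponential frame and the cutoff of 70 quefrency samples, chosen below the assumed pitch period of 101, are illustrative choices):

```python
import numpy as np

# Toy voiced frame (assumed): 9 glottal pulses of period 101 samples
# through a decaying-exponential "vocal tract".
period = 101
excitation = np.zeros(8 * period + 1)
excitation[::period] = 1.0
s = np.convolve(excitation, 0.9 ** np.arange(64))

N = 1024
log_mag = np.log(np.abs(np.fft.fft(s, N)))  # raw log spectrum, with pitch ripple
c = np.fft.ifft(log_mag).real               # real cepstrum

# Low-time lifter: keep quefrencies below the pitch period (cutoff assumed
# at 70 < 101 samples), zeroing out the excitation's high-quefrency part.
cutoff = 70
lifter = np.zeros(N)
lifter[:cutoff] = 1.0
lifter[-(cutoff - 1):] = 1.0                # keep the mirrored negative side

smoothed = np.fft.fft(c * lifter).real      # liftered (smoothed) log spectrum

# The smoothed log spectrum varies far less bin-to-bin than the raw one,
# since the rapidly varying pitch structure has been liftered out.
assert np.mean(np.abs(np.diff(smoothed))) < np.mean(np.abs(np.diff(log_mag)))
```

Note that the lifter is kept symmetric (both the low positive and the mirrored "negative" quefrencies), so that the liftered log spectrum stays real.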
Real Cepstral Processing
We cannot recover φ[n] itself, since all we have is log|Φ(ω)|. We can get |Φ(ω)|, the magnitude response (but this is ok).
Minimum-phase assumption: write Φ = Φ_mp · Φ_ap, where Φ_mp has all its poles and zeros inside the unit circle, and Φ_ap is an all-pass system that reflects the zeros of Φ lying outside the unit circle to inside the unit circle, adding poles outside the unit circle.
So if φ(ω) is minimum phase, we can recover it from just the magnitude |Φ(ω)|. And even |Φ(ω)| alone contains the information-bearing element of speech.
Note that we do this for every window of speech, obtaining a series of speech vectors which will be the input to a speech recognition system using windowed, frame-based processing (later lecture). But we can do other things as well, including:

Application: Pitch Estimation
Another application of this procedure: since the pitch period appears as a clear peak in the cepstrum, it is possible to find the periodicity in the upper (high-quefrency) cepstral coefficients and use that as the pitch value. In other words, since both the spectral envelope and the pitch are distinct in this representation, either can be extracted relatively easily.
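Cepstral pitch estimation can be sketched like this (not from the slides; the sample rate, toy frame, and the 60-400 Hz search range are assumed for illustration):

```python
import numpy as np

fs = 8000                        # assumed sampling rate (Hz)
period = 101                     # true pitch period in samples (~79 Hz)

# Toy voiced frame (assumed): 9 glottal pulses through a decaying
# exponential standing in for the vocal tract.
excitation = np.zeros(8 * period + 1)
excitation[::period] = 1.0
s = np.convolve(excitation, 0.9 ** np.arange(64))

# Real cepstrum of the frame.
c = np.fft.ifft(np.log(np.abs(np.fft.fft(s, 1024)))).real

# The excitation shows up as a sharp cepstral peak at the pitch period;
# search only a plausible pitch range (60-400 Hz here).
lo, hi = fs // 400, fs // 60
est = lo + int(np.argmax(c[lo:hi]))
print("estimated pitch: %.1f Hz" % (fs / est))
assert abs(est - period) <= 1
```

Restricting the search to a plausible quefrency range is what keeps the low-quefrency vocal tract coefficients from being mistaken for the pitch peak.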