Paper Review Seminar
Research Issues in Speech Recognition
Bartosz Ziolko
Computer speech recognition system

Acoustic signal -> automatic speech recognition system -> sequence of symbols

(Pictured: Alexander Graham Bell, 1870)
Definition & classification

Speech recognition allows computers equipped with a microphone to interpret human speech, e.g. for transcription. It is an alternative method of interacting with a computer.

Classification criteria:
- the system requires the user to "train" it to recognise their speech patterns, or does not,
- the system is trained for one user only, or is speaker independent,
- the system recognises continuous speech, or discrete words only,
- the system is intended for clean speech material (no distorted speech, background noise or another speaker talking simultaneously), or not,
- the vocabulary is small or large.
Applications

Computer users can create and edit documents and interact with the computer more quickly, because people can speak faster than anyone can type. Poor typists (especially people with sight disabilities) can increase their productivity enormously. Speaking to a computer is much faster and easier than typing!
Is speech recognition more than 100 years old?

1. 1870 - Alexander Graham Bell - phonautograph
2. 1916 - the Swiss linguist Ferdinand de Saussure - Course in General Linguistics
3. 1920 - Radio Rex (a voice-activated toy)
Approaches

Isolated word recognition constrains the recognised phrases to a small set of possible responses.
(L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.)

Dictation transcribes speech word by word; it does not require semantic understanding, the goal being to identify the exact words.

Natural language recognition allows the speaker to use natural, sentence-length patterns.
(S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.)
Scheme of the speech recognition system

Time-frequency analysis -> speech segmentation -> segment parameterization -> fitting the nearest basis element -> transcription and building the words -> lexical decoding -> syntactic analysis -> semantic analysis

(L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.)
Pronunciation

English:  Afghanistan, agency, heighten
Polish:   Afganistan, agencja, wzmagać
German:   Afganistan, Agentur, steigen

Many words in English sound alike (e.g. night and knight). "I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".

Phonemes are context dependent: phonemes with different left and right contexts have different realizations. A general solution requires human knowledge and experience as well as advanced pattern recognition and artificial intelligence.
Difficulties

- Co-articulation of phonemes and words makes the task of speech recognition difficult.
- Intonation and sentence stress play an important role in interpretation: the utterances "go!", "go?" and "go." are clearly distinguished by a human but are difficult for a computer.
- In naturally spoken language there are no pauses between words, so it is difficult for a computer to decide where word boundaries lie.
Speech audibility

(Figure: acoustic pressure [dB] versus frequency [kHz]; the speech area lies between the stimulation threshold and the pain threshold, roughly 0-140 dB over 0.1-10 kHz.)

Tadeusiewicz R., Sygnał mowy (Speech Signal), Wydawnictwa Komunikacji i Łączności, Warszawa, Poland, 1988.
Fourier spectrum

Jean Baptiste Joseph Fourier, On the Propagation of Heat in Solid Bodies, 1807.

ŝ(f) = ∫ s(t) exp(−2πjft) dt

(Figure: amplitude spectrum of a speech frame, frequency 0-5.5 kHz.)
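As an illustrative sketch (not part of the original slides), the Fourier spectrum above can be approximated numerically with the FFT; the 1 kHz test tone and the 8 kHz sampling rate are arbitrary choices:

```python
import numpy as np

# Discrete approximation of s_hat(f) = integral s(t) exp(-2*pi*j*f*t) dt
fs = 8000                            # sampling frequency [Hz] (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
s = np.sin(2 * np.pi * 1000 * t)     # pure 1 kHz test tone

spectrum = np.fft.rfft(s)            # one-sided spectrum of the real signal
freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)
peak_hz = freqs[np.argmax(np.abs(spectrum))]
print(peak_hz)                       # dominant component at 1000 Hz
```

With one second of signal the frequency resolution is exactly 1 Hz, so the peak lands precisely on the tone frequency.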
Nonlinear scale

f_mel = 1000 · log₂(1 + f_Hz / 1000)

(Figure: spectrograms of the same utterance on a linear frequency scale [Hz] and on the mel scale.)
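A minimal sketch of the mel-scale mapping given above (the function name is ours, not from the slides):

```python
import math

def hz_to_mel(f_hz):
    """Mel scale as given on the slide: f_mel = 1000 * log2(1 + f_hz / 1000)."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

# 1000 Hz maps to exactly 1000 mel; higher frequencies are compressed.
print(hz_to_mel(1000))   # 1000.0
print(hz_to_mel(4000))   # ~2321.9 mel
```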
Cepstrum

The term cepstrum was introduced by Bogert et al. and has come to be accepted terminology for the (inverse) Fourier transform of the logarithm of the power spectrum of a signal (L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978).

Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first four letters.

A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. There are a complex cepstrum and a real cepstrum.

The cepstrum was defined in a 1963 paper: Bogert, B. P., M. J. R. Healy and J. W. Tukey, "The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking", in Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, ed.), ch. 15, pp. 209-243, New York: Wiley.
Cepstrum

Verbally: the cepstrum is the FT of the log of the power spectrum:

signal -> FFT -> frequency spectrum -> squaring -> power spectrum -> logarithm (and smoothing) -> FFT -> cepstrum

Many texts incorrectly state that the process is FT -> log -> IFT, i.e. that the cepstrum is the "inverse Fourier transform of the log of the spectrum".
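A minimal sketch of the pipeline above, following the slide's definition (the smoothing step is omitted, and the test signal is an artificial impulse train standing in for voiced excitation):

```python
import numpy as np

def real_cepstrum(signal):
    """Cepstrum per the slide: the FT of the log of the power spectrum.
    The smoothing step is omitted in this minimal sketch."""
    power = np.abs(np.fft.fft(signal)) ** 2
    log_power = np.log(power + 1e-12)      # small floor avoids log(0)
    return np.real(np.fft.fft(log_power))

# A periodic excitation shows up as a cepstral peak at its period.
fs = 8000                                  # assumed sampling rate [Hz]
s = np.zeros(400)
s[::80] = 1.0                              # impulse train, period 80 samples = 100 Hz
c = real_cepstrum(s)
period = 40 + np.argmax(c[40:120])         # search around the expected quefrency
print(period / fs)                         # 0.01 s, the 100 Hz pitch period
```

This is exactly why the cepstrum is useful for speech: the pitch period separates cleanly from the vocal-tract envelope on the quefrency axis.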
Mel-Frequency Cepstrum Coefficients

S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 4, 1980.

c_i = Σ_{k=1}^{M} X_k · cos(i · (k − 1/2) · π / M),   i = 1, 2, ..., 12

where M is the number of mel filterbank channels and X_k (k = 1, 2, ..., M) is the log-energy output of the k-th filter.

S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine, vol. 13, no. 5, pp. 45-57, 1996.
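A sketch of the DCT step above (the filterbank log-energies are assumed given; the synthetic input values are ours, purely for illustration):

```python
import math

def mfcc_from_filterbank(log_energies, n_coeffs=12):
    """DCT step of MFCC extraction:
    c_i = sum_{k=1}^{M} X_k * cos(i * (k - 1/2) * pi / M), i = 1..n_coeffs,
    where X_k is the log-energy of the k-th mel filter (0-based k below)."""
    M = len(log_energies)
    return [
        sum(x * math.cos(i * (k + 0.5) * math.pi / M)
            for k, x in enumerate(log_energies))
        for i in range(1, n_coeffs + 1)
    ]

# Example: 20 synthetic log filterbank energies.
X = [math.sin(k / 3.0) for k in range(20)]
coeffs = mfcc_from_filterbank(X)
print(len(coeffs))   # 12
```

Note that a perfectly flat filterbank output yields all-zero coefficients, since c_0 (the overall level) is deliberately excluded.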
Other parameters

D. Zhu and K. K. Paliwal, "Product of Power Spectrum and Group Delay Function for Speech Recognition", Proceedings of ICASSP 2004, pp. I-125-8. Mel-frequency Product Spectrum Cepstral Coefficients exploit phase spectrum information.

K. Ishizuka and N. Miyazaki, "Speech Feature Extraction Method Representing Periodicity and Aperiodicity in Sub Bands for Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-141-4. It focuses on feature extraction that represents the aperiodicity of speech; the method is based on gammatone filter banks, framing, autocorrelation and comb filters.

H. Misra, S. Ikbal, H. Bourlard and H. Hermansky, "Spectral Entropy Based Feature for Robust ASR", Proceedings of ICASSP 2004, pp. I-193-6. Normalizing a spectrum into a function like a probability mass function (PMF) makes it possible to calculate its entropy.

H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738-52, 1990.

Yoshizawa, N. Hayasaka, N. Wada and Y. Miyanaga, "Cepstral Gain Normalization for Noise Robust Speech Recognition", Proceedings of ICASSP 2004, pp. I-209-12.
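The spectral entropy idea (Misra et al.) can be sketched as follows; the example spectra are synthetic, chosen only to show the contrast between noise-like and peaky frames:

```python
import math

def spectral_entropy(power_spectrum):
    """Normalize the power spectrum to a PMF, then take Shannon entropy.
    A flat (noise-like) spectrum gives high entropy; a peaky (voiced)
    spectrum gives low entropy."""
    total = sum(power_spectrum)
    pmf = [p / total for p in power_spectrum]
    return -sum(p * math.log2(p) for p in pmf if p > 0)

flat = [1.0] * 64                # noise-like spectrum
peaky = [1.0] + [1e-6] * 63      # a single dominant spectral peak
print(spectral_entropy(flat))    # 6.0 bits (log2 of 64)
print(spectral_entropy(peaky))   # close to 0 bits
```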
Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical model in which the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable ones. The extracted model parameters can then be used for further analysis, for example in speech recognition applications. Speech recognition systems are generally based on HMMs or on hybrid solutions with artificial neural networks.

The statistical model gives the probability of a word sequence given the observed acoustic data by Bayes' rule:

P(word | acoustic) = p(acoustic | word) · P(word) / p(acoustic)

L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February 1989.

P(mushroom soup) > P(much rooms hope)

The same approach applies at the level of phonemes, words, syntax and semantics.
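The acoustic term p(acoustic | word) in Bayes' rule is evaluated with the HMM forward algorithm. A minimal sketch with a toy two-state model (all probabilities are invented illustrative numbers, not real acoustic values):

```python
def forward(observations, init, trans, emit):
    """Forward algorithm: total likelihood of an observation sequence
    under an HMM. States and observation symbols are integer indices."""
    n_states = len(init)
    # Initialise with the first observation.
    alpha = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    # Propagate through the remaining observations.
    for obs in observations[1:]:
        alpha = [
            emit[s][obs] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)

# Toy 2-state model (think: two phoneme states), 2 observation symbols.
init = [0.6, 0.4]                       # initial state distribution
trans = [[0.7, 0.3], [0.4, 0.6]]        # state transition matrix
emit = [[0.9, 0.1], [0.2, 0.8]]         # emission probabilities
p = forward([0, 1, 0], init, trans, emit)
print(p)   # likelihood of observing the symbol sequence 0, 1, 0
```

In a real recogniser this likelihood, computed per word model, is combined with the language-model prior P(word) exactly as the Bayes formula above prescribes.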
Wavelet spectra

(Figures: Daubechies φ and ψ of order 12 with their Fourier transforms; an STFT spectrogram versus continuous and discrete wavelet spectra of a speech signal — scale a against time b, with discrete translations 2^(−m) n at resolution m.)

I. Daubechies, "Orthonormal bases of compactly supported wavelets", Commun. Pure Appl. Math., pp. 909-996, 1988.
O. Rioul and M. Vetterli, "Wavelets and signal processing", IEEE Signal Processing Mag., vol. 8, pp. 14-38, October 1991.
O. Farooq and S. Datta, "Wavelet based robust subband features for phoneme recognition", IEE Proceedings: Vision, Image & Signal Processing, vol. 151, no. 3, pp. 187-93, 2004.
(Figure: the speech signal of the word "Andrzej" and its discrete wavelet transform — amplitude against time, scales shown in reverse order.)
The frequency band splitting

Sampling frequency f_0 = 11025 Hz; Δt denotes the discretization step, Δt ≈ 90.7 μs.

Decomposition level | Frequency band [Hz] | Discretization density
D1 | 2756-5512 |  2Δt
D2 | 1378-2756 |  4Δt
D3 |  689-1378 |  8Δt
D4 |  345-689  | 16Δt
D5 |  172-345  | 32Δt
D6 |   86-172  | 64Δt = 5.805 ms
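The table follows directly from dyadic halving of the band at each decomposition level; a short sketch (the function name is ours) reproduces it for f_0 = 11025 Hz:

```python
def dwt_bands(f_sampling, levels):
    """Frequency bands of a dyadic DWT: detail band D_j covers
    (f_s / 2**(j+1), f_s / 2**j], with time step 2**j * dt."""
    dt = 1.0 / f_sampling
    bands = []
    for j in range(1, levels + 1):
        low = f_sampling / 2 ** (j + 1)
        high = f_sampling / 2 ** j
        bands.append((f"D{j}", round(low), round(high), 2 ** j * dt))
    return bands

for name, low, high, step in dwt_bands(11025, 6):
    print(f"{name}: {low}-{high} Hz, step {step * 1e6:.1f} us")
```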
Other topics in speech recognition

R. Sarikaya and J. H. L. Hansen, "High Resolution Speech Feature Parametrization for Monophone-Based Stressed Speech Recognition", IEEE Signal Processing Letters, vol. 7, no. 7, pp. 182-5, July 2000. It studies the impact of stress (neutral, angry, loud, Lombard) on monophone speech recognition accuracy, comparing sets of parameters: MFCC, Wavelet Packet Parameters (continuous time) and SBC (subband-based cepstral).

M. Wester, J. Frankel and S. King, "Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian Networks", Proc. IEICE Beyond HMM Workshop, Kyoto, December 2004. Waveforms are parameterised as 12 MFCCs and energy, with 1st and 2nd derivatives appended. The articulatory features here are: manner, place, voicing, rounding, front-back and static.
Other topics in speech recognition

M. Bacchiani and B. Roark, "Metadata Conditional Language Modeling", Proceedings of ICASSP 2004, pp. I-241-4. It describes an algorithm that uses metadata, such as the calling phone number, to recognise the speaker and adapt the ASR system to the user.

G. Evermann, H. Y. Chan, M. J. F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang and P. C. Woodland, "Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System", Proceedings of ICASSP 2004, pp. I-249-52. HTK is the most recognised academic toolkit for automatic speech recognition, based on HMMs and MFCCs. It has been designed at the University of Cambridge by the Machine Intelligence Laboratory. http://htk.eng.cam.ac.uk/

H. Van hamme, "Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks", Proceedings of ICASSP 2004, pp. I-213-6. It describes Missing Data Techniques and an improved Missing Data Detector (MDD), which can compute missing-data masks from the noisy signal using harmonic decomposition, without long-term noise averaging.
Open issues and research topics

- Large vocabularies
- Semantic analysis
- Phoneme segmentation
- Different languages
- Dialect support
Segmentation

(Figure: the speech signal of the word "Andrzej" divided into entire segments.)
Thank you for your attention