Paper Review Seminar Research Issues in Speech Recognition. Bartosz Ziolko

1 Paper Review Seminar: Research Issues in Speech Recognition. Bartosz Ziolko

2 Computer speech recognition system
Acoustic signal → Automatic speech recognition system → Sequence of symbols
1870, Alexander Graham Bell

3 Definition & classification
Speech recognition allows computers equipped with a microphone to interpret human speech, e.g. for transcription. It is an alternative method of interacting with a computer.
Classification:
- the system does or does not require the user to "train" it to recognise speech patterns,
- the system is trained for one user only or is speaker-independent,
- the system can recognise continuous speech or discrete words only,
- the system is intended for clear speech material (no distorted speech, background noise or other speakers talking simultaneously) or not,
- the vocabulary is small or large.

4 Applications
Computer users can create and edit documents and interact with a computer more quickly, because people can speak faster than they can type.
Poor typists, and especially people with visual impairments, can greatly increase their productivity.
Speaking to a computer is much faster and easier than typing!

5 Is speech recognition more than 100 years old?
1. Alexander Graham Bell - phonautograph
2. The Swiss linguist Ferdinand de Saussure, Course in General Linguistics (1916)
3. Radio Rex

6 Approaches
Isolated word recognition constrains the recognized phrases to a small set of possible responses.
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February
Dictation transcribes speech word by word; it does not require semantic understanding, and the goal is to identify the exact words.
Natural language recognition allows the speaker to provide natural, sentence-length patterns.
S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine 13(5): 45-57, (1996).

7 Scheme of the speech recognition system
[Diagram blocks: time-frequency analysis, lexical decoding, speech segmentation, segment parameterization, fitting the nearest basis element, transcription and building the words, syntactic analysis, semantic analysis.]
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February

8 Pronunciation
English language: Afghanistan, agency, heighten
Polish language: Afganistan, agencja, wzmagać
German language: Afghanistan, Agentur, steigen
Many words in English sound alike (e.g. "night" and "knight"). "I helped Apple wreck a nice beach" sounds like "I helped Apple recognize speech".
Context dependency of phonemes: phonemes with different left and right contexts have different realizations.
A general solution requires human knowledge and experience as well as advanced pattern recognition and artificial intelligence.

9 Difficulties
- Co-articulation of phonemes and words makes the task of speech recognition difficult.
- Intonation and sentence stress play an important role in interpretation. The utterances "go!", "go?" and "go." are clearly distinguished by a human but are difficult for a computer.
- In naturally spoken language there are no pauses between words, so it is difficult for a computer to decide where word boundaries lie.

10 Speech audibility
[Figure: the speech area lying between the stimulation threshold and the pain threshold; axes: frequency [kHz] versus acoustic pressure [dB].]
Tadeusiewicz R., Sygnał mowy (Speech Signal), Wydawnictwa Komunikacji i Łączności, Warszawa, Poland

11 Fourier spectrum
Jean Baptiste Joseph Fourier, On the Propagation of Heat in Solid Bodies, 1807
$$\hat{s}(f) = \int_{-\infty}^{\infty} s(t) \exp(-2\pi j f t)\, dt$$
[Figure: amplitude versus frequency [kHz].]
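For sampled signals the Fourier integral is approximated by the discrete Fourier transform. The following is a minimal sketch using NumPy; the sampling frequency and the 440 Hz test tone are assumptions chosen only for illustration and do not come from the slides.

```python
import numpy as np

fs = 8000                            # sampling frequency in Hz (assumed)
t = np.arange(fs) / fs               # one second of sample times
s = np.sin(2 * np.pi * 440.0 * t)    # 440 Hz test tone (assumed)

# Discrete counterpart of the Fourier spectrum for a real-valued signal.
spectrum = np.fft.rfft(s)
freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)

# The amplitude spectrum peaks at the frequency of the tone.
peak_freq = freqs[np.argmax(np.abs(spectrum))]
print(peak_freq)
```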

12 Nonlinear scale
$$f_{mel} = 1000 \log_2\left(1 + \frac{f_{Hz}}{1000}\right)$$
[Figure: two time-frequency plots, time [s] versus frequency [Hz] and time [s] versus frequency [mel].]
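The nonlinear frequency scale can be sketched in code. This example assumes the mel scale in the form f_mel = 1000 log2(1 + f_Hz / 1000); other variants of the mel formula exist, and the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to mel (assumed form: 1000 * log2(1 + f / 1000))."""
    return 1000.0 * math.log2(1.0 + f_hz / 1000.0)

def mel_to_hz(f_mel):
    """Inverse mapping, from mel back to Hz."""
    return 1000.0 * (2.0 ** (f_mel / 1000.0) - 1.0)

# Equal steps in Hz shrink on the mel scale as frequency grows.
for f in (500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(hz_to_mel(f), 1))
```

With this form, 1000 Hz maps to 1000 mel, and frequencies above it are progressively compressed.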

13 Cepstrum
The term cepstrum was introduced by Bogert et al. and has come to be accepted terminology for the (inverse) Fourier transform of the logarithm of the power spectrum of a signal. (L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, NJ, 1978)
Etymology: "cepstrum" is an anagram of "spectrum", formed by reversing the first four letters.
A cepstrum is the result of taking the Fourier transform of the decibel spectrum as if it were a signal. There is a complex cepstrum and a real cepstrum.
The cepstrum was defined in a 1963 paper: B. P. Bogert, M. J. R. Healy and J. W. Tukey, "The quefrency alanysis of time series for echoes: cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe-cracking", Proceedings of the Symposium on Time Series Analysis (M. Rosenblatt, Ed.), Chapter 15, New York: Wiley.

14 Cepstrum
Verbally: the cepstrum is the FT of the log of the power spectrum.
Signal → FFT → frequency spectrum → squaring → power spectrum → smoothing → logarithm → FFT → cepstrum
Many texts incorrectly state that the process is FT → log → IFT, i.e. that the cepstrum is the "inverse Fourier transform of the log of the spectrum".
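The processing chain on this slide can be sketched with NumPy. This is a simplified illustration, not the presenter's implementation: the smoothing step is omitted, the small constant `eps` is an added guard against taking the log of zero, and the echo test signal is an assumption chosen because the cepstrum was originally introduced for detecting echoes.

```python
import numpy as np

def cepstrum(signal, eps=1e-12):
    """FT of the log of the power spectrum (smoothing step omitted)."""
    spectrum = np.fft.fft(signal)          # frequency spectrum
    power = np.abs(spectrum) ** 2          # squaring -> power spectrum
    log_power = np.log(power + eps)        # logarithm
    # The log power spectrum of a real signal is real and symmetric,
    # so its Fourier transform is real; keep the real part.
    return np.fft.fft(log_power).real

# Test signal: noise plus a circular echo delayed by 100 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
delay = 100
echoed = x + 0.8 * np.roll(x, delay)

c = cepstrum(echoed)
half = len(c) // 2
# Skip quefrency 0, which only carries the mean of the log spectrum.
peak_quefrency = int(np.argmax(np.abs(c[1:half]))) + 1
print(peak_quefrency)
```

The echo appears as a peak at the quefrency equal to its delay, which is what makes the cepstrum useful for separating periodic structure from the spectral envelope.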

15 Mel-Frequency Cepstrum Coefficients
S.B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. on Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 4
M is the number of cepstrum coefficients; X_k (k = 1, 2, ..., 12) represents the log-energy output of the kth filter.
S. Young, "Large Vocabulary Continuous Speech Recognition", IEEE Signal Processing Magazine 13(5): 45-57, (1996).
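The formula itself is missing from this slide. As a sketch, the cosine transform given by Davis and Mermelstein, MFCC_i = sum over k of X_k cos(i (k - 1/2) pi / K) for i = 1, ..., M, can be written as below. The function name and the generalisation to an arbitrary number of filters K are assumptions made for illustration.

```python
import numpy as np

def mfcc_from_log_energies(log_energies, num_coeffs):
    """Cosine transform of mel filter-bank log-energies X_1..X_K.

    Returns MFCC_1..MFCC_M with M = num_coeffs (hypothetical helper).
    """
    X = np.asarray(log_energies, dtype=float)
    K = len(X)
    k = np.arange(1, K + 1)                     # filter index
    i = np.arange(1, num_coeffs + 1)[:, None]   # coefficient index
    basis = np.cos(i * (k - 0.5) * np.pi / K)   # shape (M, K)
    return basis @ X

# A flat log-energy profile carries no spectral shape,
# so every coefficient with i >= 1 vanishes.
flat = np.ones(12)
coeffs = mfcc_from_log_energies(flat, 6)
print(np.round(coeffs, 6))
```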

16 Other parameters
D. Zhu, K.K. Paliwal, "Product of Power Spectrum and Group Delay Function For Speech Recognition", Proceedings of ICASSP 2004, pp.i
Mel-frequency Product Spectrum Cepstral Coefficients; phase spectrum information.
K. Ishizuka and N. Miyazaki, "Speech Feature Extraction Method Representing Periodicity and Aperiodicity in Sub Bands for Robust Speech Recognition", Proceedings of ICASSP 2004, pp.i
It focuses on feature extraction that represents the aperiodicity of speech. The method is based on Gammatone filter banks, framing, autocorrelation and comb filters.
H. Misra, S. Ikbal, H. Bourlard, H. Hermansky, "Spectral Entropy Based Feature for Robust ASR", Proceedings of ICASSP 2004, pp.i
H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. of Acoust. Soc. Amer., vol. 87, no. 4, 1990
Normalizing a spectrum into a function resembling a probability mass function (PMF) makes it possible to calculate its entropy.
Yoshizawa, N. Hayasaka, N. Wada and Y. Miyanaga, "Cepstral Gain Normalization For Noise Robust Speech Recognition", Proceedings of ICASSP 2004, pp.i
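The spectral entropy idea can be illustrated in a few lines. This is a minimal sketch of the normalisation step only, not the feature set of the cited paper; the function name and the `eps` guard are assumptions, and the spectrum is assumed to have a positive total.

```python
import numpy as np

def spectral_entropy(power_spectrum, eps=1e-12):
    """Entropy in bits of a power spectrum normalised to sum to one."""
    p = np.asarray(power_spectrum, dtype=float)
    pmf = p / p.sum()                  # treat the spectrum as a PMF
    return float(-np.sum(pmf * np.log2(pmf + eps)))

flat = spectral_entropy(np.ones(8))          # noise-like, flat spectrum
peaked = spectral_entropy([0.0, 0.0, 1.0, 0.0])  # single spectral peak
print(flat, peaked)
```

A flat spectrum gives the maximum entropy (log2 of the number of bins), while a spectrum concentrated in one bin gives an entropy close to zero.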

17 Hidden Markov Model
A Hidden Markov Model (HMM) is a statistical model in which the system being modelled is assumed to be a Markov process with unknown parameters; the challenge is to determine the hidden parameters from the observable ones, based on this assumption. The extracted model parameters can then be used for further analysis, for example in speech recognition applications.
Speech recognition systems are generally based on HMMs or on hybrid solutions with artificial neural networks.
The statistical model relates the probability of a word to an observed sequence of acoustic data through Bayes' rule:
$$P(\text{word} \mid \text{acoustic}) = \frac{p(\text{acoustic} \mid \text{word})\, P(\text{word})}{p(\text{acoustic})}$$
L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proceedings of the IEEE, vol. 77, no. 2, February
P(mushroom soup) > P(much rooms hope)
It can be applied similarly to phonemes, words, syntax and semantics.
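As an illustration of how an HMM and Bayes' rule combine, the sketch below computes p(acoustic | word) with the forward algorithm described in Rabiner's tutorial and then normalises by the word priors. The two word models, their parameters, the priors and the discrete observation symbols are all invented toy values, not taken from the slides.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """p(obs | model) for a discrete HMM via the forward algorithm.

    pi: initial state probabilities, A: state transition matrix,
    B: emission matrix (rows are states, columns are symbols).
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

def word_posteriors(models, priors, obs):
    """Bayes' rule: P(word | obs) = p(obs | word) P(word) / p(obs)."""
    joint = {w: forward_likelihood(*models[w], obs) * priors[w] for w in models}
    evidence = sum(joint.values())
    return {w: j / evidence for w, j in joint.items()}

# Toy left-to-right models with two states and two symbols (assumed).
start = np.array([1.0, 0.0])
trans = np.array([[0.7, 0.3], [0.0, 1.0]])
models = {
    "yes": (start, trans, np.array([[0.9, 0.1], [0.2, 0.8]])),
    "no": (start, trans, np.array([[0.1, 0.9], [0.8, 0.2]])),
}
priors = {"yes": 0.5, "no": 0.5}

posteriors = word_posteriors(models, priors, [0, 0, 1])
print(max(posteriors, key=posteriors.get))
```

The recognised word is the one with the highest posterior; with equal priors this reduces to picking the model with the highest likelihood.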

18 STFT versus continuous and discrete wavelet spectrum
[Figure: wavelet spectra (time [s] versus frequency [Hz]; time b versus scale a; time 2^-m n versus resolution m), with the Daubechies phi and psi functions of order 12 and their Fourier transforms F(d12_phi(w)) and F(d12_psi(w)).]
I. Daubechies, "Orthonormal bases of compactly supported wavelets", Commun. Pure Appl. Math., 1988
O. Rioul, M. Vetterli, "Wavelets and signal processing", IEEE Signal Processing Mag., vol. 8, October
O. Farooq, S. Datta, "Wavelet based robust subband features for phoneme recognition", IEE Proceedings: Vision, Image & Signal Processing, vol. 151, no. 3

19 Speech signal and its discrete wavelet transform
[Figure: speech signal labelled "Andrzej" (amplitude versus time) and its discrete wavelet transform (reverse scale versus time).]

20 The frequency band splitting
[Table columns: decomposition level, frequency [Hz], discretization density Δt. Surviving values: Δt = 90.7 μs, set by the sampling frequency f [Hz], and Δt = 5.805 ms.]
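The relation between decomposition level, frequency band and discretization density can be sketched for a dyadic discrete wavelet transform, where each level halves the band and doubles the time step. The sampling frequency of 11025 Hz is an assumption inferred from the sampling period of 90.7 μs, and the six levels are inferred from 5.805 ms being 64 times that period; the function name is hypothetical.

```python
def band_splitting(fs, levels):
    """Detail band and time step for each level of a dyadic DWT.

    Returns rows of (level, f_low_hz, f_high_hz, dt_seconds).
    """
    rows = []
    for m in range(1, levels + 1):
        f_high = fs / 2 ** m          # upper edge of the detail band
        f_low = fs / 2 ** (m + 1)     # lower edge of the detail band
        dt = 2 ** m / fs              # spacing of coefficients in time
        rows.append((m, f_low, f_high, dt))
    return rows

fs = 11025.0   # assumed sampling frequency in Hz (period of about 90.7 us)
table = band_splitting(fs, 6)
for m, f_low, f_high, dt in table:
    print(m, round(f_low, 1), round(f_high, 1), round(dt * 1e3, 3), "ms")
```

Under these assumptions the sixth level has a time step of about 5.805 ms, matching the largest value on the slide.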

21 Other topics in speech recognition
R. Sarikaya, J.H.L. Hansen, "High Resolution Speech Feature Parametrization for Monophone Based Stressed Speech Recognition", IEEE Signal Processing Letters, vol. 7, no. 7, July
Impact of stress (neutral, angry, loud, Lombard) on monophone speech recognition accuracy. The paper compares sets of parameters: MFCC, Wavelet Packet Parameters (continuous time), SBC (subband-based cepstral).
M. Wester, J. Frankel, S. King, "Asynchronous Articulatory Feature Recognition Using Dynamic Bayesian Networks", Proc. IEICE Beyond HMM Workshop, Kyoto, December
Waveforms are parameterised as 12 MFCCs and energy, with 1st and 2nd derivatives appended. The articulatory features are: manner, place, voicing, rounding, front-back, static.
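The parameterisation of 12 MFCCs plus energy with first and second derivatives yields 39 values per frame. The sketch below shows the bookkeeping using a simple finite difference along the time axis; this is an assumption for illustration, since the cited paper and common toolkits may use a regression-based delta formula instead. The random input frames are placeholder data.

```python
import numpy as np

def append_deltas(static):
    """Append 1st and 2nd time derivatives to a (frames, features) array."""
    static = np.asarray(static, dtype=float)
    delta = np.gradient(static, axis=0)    # 1st derivative over frames
    delta2 = np.gradient(delta, axis=0)    # 2nd derivative over frames
    return np.hstack([static, delta, delta2])

# 50 frames of 12 MFCCs plus energy (placeholder values).
frames = np.random.default_rng(0).standard_normal((50, 13))
features = append_deltas(frames)
print(features.shape)
```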

22 Other topics in speech recognition
M. Bacchiani and B. Roark, "Metadata Conditional Language Modeling", Proceedings of ICASSP 2004, pp.i
It describes an algorithm that uses metadata, such as the calling phone number, to recognise the speaker and adapt the ASR system to the user.
G. Evermann, H.Y. Chan, M.J.F. Gales, T. Hain, X. Liu, D. Mrva, L. Wang, P.C. Woodland, "Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System", Proceedings of ICASSP 2004, pp.i
HTK is the most widely recognized academic toolkit for automatic speech recognition, based on HMM and MFCC. It was designed at the University of Cambridge by the Machine Intelligence Laboratory.
H. Van hamme, "Robust Speech Recognition using Cepstral Domain Missing Data Techniques and Noisy Masks", Proceedings of ICASSP 2004, pp.i
It describes Missing Data Techniques and an improved Missing Data Detector (MDD). The MDD can compute missing data masks from the noisy signal using harmonic decomposition, without long-term noise averaging.

23 Open issues and research topics
- Large vocabulary
- Semantic analysis
- Phoneme segmentation
- Different languages
- Dialect support

24 Segmentation
[Figure: segmentation of the signal labelled "Andrzej" into entire segments.]

25 Thank you for your attention
