Automatic speech recognition

Contents:
  - Introduction
  - Speech production recap
  - Phonetics recap

A few useful books
  - Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice-Hall, Upper Saddle River, NJ, USA, 1993
  - Dan Jurafsky, James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2009

Software resources
  - Hidden Markov Model Toolkit
  - CMU Sphinx-II Speech Recognizer
  - NIST Speech Recognition Scoring Utilities
  - SRI Language Model Toolkit
  - CMU / Cambridge Language Model Toolkit

Speech recognition
  Goal: convert an acoustic signal X into a word sequence W, independent of speaker and environment
  Implementation: several types of recognizers
  - Isolated word recognition: each word is surrounded by silence
  - Word spotting: detect a word in the presence of surrounding words
  - Connected-word recognition: word sequences constrained by a fixed grammar (e.g., telephone numbers)
  - Continuous speech recognition: fluent, uninterrupted speech

Components of a speech recognizer
  - Acoustic model
      - Knowledge of acoustics and phonetics
      - Microphone and environment differences
      - Speaker differences
  - Pronunciation dictionary
      - How words are formed from their constituent sounds
  - Language model
      - What constitutes a word
      - What words are likely to occur and in what sequence

Challenges in speech recognition
  - Word boundaries in continuous speech are unclear
  - Continuous speech is less articulated
  - Coarticulation and phonetic context within and across words
  - Variability
      - Speaker-independent vs. speaker-dependent system
      - Inter- and intra-speaker variability: stress, emotion, speaking rate
      - Environment variability: stationary/nonstationary noise, microphone vs. telephone vs. cell phone
      - Context variability

Speech production
  - Vocal tract cavities: pharyngeal and oral
  - Articulators: components of the vocal tract that move to produce various speech sounds (velum, lips, tongue, teeth)
  - Source-filter representation of speech production
      - Speech production is an acoustic filtering operation
      - Larynx and lungs provide the source excitation
      - Vocal and nasal tracts act as a filter that shapes the spectrum of the signal
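
The source-filter idea above can be illustrated with a very small sketch. It is an assumption-laden toy, not a model of real speech: the source is an idealized impulse train standing in for the glottal excitation, and the filter is a single two-pole resonator standing in for one vocal tract resonance. The names `impulse_train` and `resonator`, the 8000 Hz sample rate, the 100 Hz pitch and the 500 Hz resonance are all illustrative choices that do not come from the slides.

```python
import math

def impulse_train(n_samples, period):
    """Idealized voiced source: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole resonator, a stand-in for a single vocal tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0  # previous two output samples
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

SAMPLE_RATE = 8000                                   # assumed, in Hz
source = impulse_train(800, period=80)               # 100 Hz pitch (assumed)
speech_like = resonator(source, 500.0, 80.0, SAMPLE_RATE)
```

Changing `period` changes the pitch of the source without touching the filter, while changing `freq_hz` moves the spectral peak without touching the source. That separation is the point of the source-filter representation.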

Source-filter model of speech production

Phonetics
  - Phoneme: abstract unit that can be used for writing a language down in a systematic and unambiguous way
  - Phoneme categories
      - Vowels: air passes freely through the resonators
      - Consonants: air is partially or totally obstructed in one or more places as it passes through the resonators

Classification of speech sounds
  - Voiced / Voiceless
      - Voiced if the vocal cords vibrate
  - Nasal / Oral
      - Nasal if air travels through the nasal cavity and the oral cavity is closed
  - Consonants / Vowels
      - Consonants: the air stream is obstructed above the glottis (glottis = space between the vocal cords); characterized by place and manner of articulation and by voicing
      - Vowels: characterized by lip position, tongue height and tongue advancement
  - Lateral / Non-lateral
      - Non-lateral when the air stream passes through the middle of the oral cavity (not along the sides)
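
As a rough sketch of how the classification above can be held in a data structure, the snippet below stores a few English consonants with their voicing, nasality, laterality, place and manner. The `Consonant` class, the `differing_features` helper and the choice of six sounds are illustrative assumptions, and the symbols follow ARPAbet-style labels rather than any notation given on the slides.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Consonant:
    voiced: bool
    nasal: bool
    lateral: bool
    place: str    # place of articulation
    manner: str   # manner of articulation

# Hand-written illustrative table, not taken from the slides.
CONSONANTS = {
    "P": Consonant(False, False, False, "bilabial", "stop"),
    "B": Consonant(True, False, False, "bilabial", "stop"),
    "M": Consonant(True, True, False, "bilabial", "nasal"),
    "S": Consonant(False, False, False, "alveolar", "fricative"),
    "Z": Consonant(True, False, False, "alveolar", "fricative"),
    "L": Consonant(True, False, True, "alveolar", "approximant"),
}

def differing_features(a, b):
    """Return the names of the features on which two consonants differ."""
    fa, fb = asdict(CONSONANTS[a]), asdict(CONSONANTS[b])
    return [name for name in fa if fa[name] != fb[name]]

voicing_pair = differing_features("P", "B")   # differ only in voicing
nasal_pair = differing_features("B", "M")     # differ in nasality and manner
```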

Phonetic alphabets
  - Describe the sounds of a language
      - Each language has its own unique set of phonemes
      - Words are represented by sequences of phonemes
      - Useful representation for speech recognition!
  - IPA
      - Phonetic representation standard, used for most world languages
      - Character set is difficult to manipulate on a computer
  - ARPAbet
      - ASCII representation for English

CMU dictionary
  - CMU Sphinx phonetic symbols
  - Based on ARPAbet, more appropriate to our purpose
  - http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Vowels
  [Figure: vowels IY, IH, EH, AE, AX, UH, u, AA, UW, AO]

Coarticulation
  [Figure: F AY, TH AY, S AY, SH AY]
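
To make "words are represented by sequences of phonemes" concrete, here is a minimal sketch of a pronunciation lexicon lookup. The four entries are hand-written for illustration, using the phoneme pairs from the coarticulation slide. Reading those pairs as the words "fie", "thigh", "sigh" and "shy" is an assumption, the entries carry no stress markers, and nothing here reads the actual CMU dictionary file. `LEXICON` and `to_phonemes` are hypothetical names.

```python
# Hypothetical mini-lexicon in ARPAbet-style symbols (not the real CMU dictionary).
LEXICON = {
    "fie": ["F", "AY"],
    "thigh": ["TH", "AY"],
    "sigh": ["S", "AY"],
    "shy": ["SH", "AY"],
}

def to_phonemes(sentence):
    """Map a space-separated word string to a flat phoneme sequence.

    Raises KeyError for words that are missing from the lexicon.
    """
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes

example = to_phonemes("Shy sigh")
```

A real recognizer also needs a policy for out-of-vocabulary words and for words with several pronunciations. This sketch simply raises an error in the first case and ignores the second.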

Given a sequence of observations (evidence) from an audio signal
  $O = o_1 o_2 \ldots o_T$
determine the underlying word sequence
  $W = w_1 w_2 \ldots w_m$
  - The number of words m is unknown and the observation sequence is variable in length
  - The goal is to minimize the classification error rate
  - Solution: maximize the posterior probability
      $\hat{W} = \arg\max_W P(W \mid O) = \arg\max_W \frac{P(O \mid W)\, P(W)}{P(O)}$
  - This requires optimization over all possible word strings
  - P(O) does not impact the optimization, therefore:
      $\hat{W} = \arg\max_W P(O \mid W)\, P(W)$

Assuming that words can be represented by a sequence of states S, the optimization problem becomes:
      $\hat{W} = \arg\max_W P(W) \sum_S P(O \mid S)\, P(S \mid W)$
  - Words are composed of phonemes; phonemes are represented by states
  - Implementation:
      - $O$: observation (feature) sequence
      - $P(O \mid S)$: acoustic model
      - $P(S \mid W)$: pronunciation model (lexicon)
      - $P(W)$: language model
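
The decision rule can be illustrated with a toy example that scores two candidate word strings in the log domain and picks the larger sum of acoustic and language model scores. All numbers below are invented for illustration, the candidate list is hypothetical, and a real recognizer never enumerates candidates this way.

```python
# Invented log-scores for two hypothetical candidate word strings.
# log_acoustic stands for log P(O | W), log_prior for log P(W).
CANDIDATES = {
    "recognize speech": {"log_acoustic": -12.0, "log_prior": -4.0},
    "wreck a nice beach": {"log_acoustic": -11.5, "log_prior": -9.0},
}

def decode(candidates):
    """Return the word string maximizing log P(O | W) + log P(W).

    log P(O) is the same for every candidate, so it is left out.
    """
    def score(words):
        entry = candidates[words]
        return entry["log_acoustic"] + entry["log_prior"]
    return max(candidates, key=score)

best = decode(CANDIDATES)
```

In this made-up case the second candidate fits the audio slightly better, but the language model prior makes the first one win (-16.0 against -20.5).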

Optimization
  - Searches for the most likely word sequence given the observations (features)
  - We cannot evaluate all possible word sequences
  - We need to define:
      - a representation for modeling the states: HMMs
      - a method for approximately searching for the best sequence given the evidence: the Viterbi algorithm
      - ways to make it fast

Hidden Markov Model representation of speech
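
A minimal sketch of Viterbi decoding on a discrete HMM is given below, working in the log domain to avoid underflow. The two-state left-to-right model (states "S" and "AY"), the observation symbols "hiss" and "vowel", and every probability are invented for illustration. Real acoustic models use continuous feature vectors and many more states.

```python
import math

def log(p):
    """Log of a probability, with log(0) mapped to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the single best state sequence."""
    # delta[s]: log-probability of the best path ending in state s at the current time
    delta = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: delta[p] + log_trans[p][s])
            new_delta[s] = delta[best_prev] + log_trans[best_prev][s] + log_emit[s][obs]
            pointers[s] = best_prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return delta[last], path

# Toy left-to-right model; all values are assumptions.
STATES = ["S", "AY"]
LOG_START = {"S": log(1.0), "AY": log(0.0)}
LOG_TRANS = {
    "S": {"S": log(0.6), "AY": log(0.4)},
    "AY": {"S": log(0.0), "AY": log(1.0)},
}
LOG_EMIT = {
    "S": {"hiss": log(0.9), "vowel": log(0.1)},
    "AY": {"hiss": log(0.2), "vowel": log(0.8)},
}

best_logp, best_path = viterbi(
    ["hiss", "hiss", "vowel", "vowel"], STATES, LOG_START, LOG_TRANS, LOG_EMIT
)
```

The cost is proportional to the number of frames times the square of the number of states, instead of growing exponentially with the number of frames. Note that it scores only the single best state sequence rather than summing over all of them.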