Plasticity in Systems for Automatic Speech Recognition: A Review
Roger K. Moore & Stuart P. Cunningham

Overview
- Automatic Speech Recognition (ASR): breakthroughs, key components, training / recognition
- Practical challenges: user characteristics, user environment, user behaviour
- Plasticity in ASR: flexibility / robustness, learning / adaptation
ASR in the 1950s & 60s
- store of reference templates (training switch)
- subset finite-state syntax
- pre-processor, end-point detector, comparator, best match

ASR in the 1970s
[Diagram: bottom-up and top-down processing through a component hierarchy: semantic interpreter, semantic knowledge-base, grammar, syntactic parser, phonetic rule-base, lexical access, lexicon, phonetic decoder, segmenter, feature extractor, pre-processor]
Breakthroughs in ASR
- Integrated search: dynamic time warping (DTW)
- Stochastic modelling: hidden Markov models (HMM)
- Sub-word representations: context-dependent phones (triphones)

Contemporary ASR
- target vocabulary -> pronouncing dictionary -> phonetic transcription -> model selection from an inventory of sub-word models
- HMM re-estimation (with word-boundary modifications) from training corpora
- noise models -> model combination
- language model re-estimation
- input signal -> front-end signal processing -> integrated network of HMM states -> Viterbi search -> most probable path & lattice of alternatives
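The "integrated search" idea behind DTW can be sketched as a minimal dynamic-programming alignment between two feature sequences. This is a generic textbook formulation with made-up toy data, not an implementation from the slides:

```python
# Minimal dynamic time warping (DTW): align two feature sequences by
# finding the minimum-cost monotonic path through a local-distance grid.

def dtw(seq_a, seq_b):
    """Return the cumulative alignment cost between two number sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(seq_a[i - 1] - seq_b[j - 1])  # local distance
            # allowed moves: diagonal match, or stretch either axis
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# A template matches a time-stretched copy of itself (cost 0) better
# than a different pattern of the same length.
template  = [1.0, 3.0, 4.0, 2.0]
stretched = [1.0, 1.0, 3.0, 3.0, 4.0, 2.0, 2.0]
other     = [4.0, 1.0, 1.0, 3.0]
assert dtw(template, stretched) < dtw(template, other)
```

Real systems apply this to vectors of spectral features per frame rather than single numbers, but the dynamic-programming recursion is the same.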
Key ASR Components
- Language Model
- Acoustic Model
- Noise Model
- Pronunciation Model

ASR Components
- Language model: n-grams (bigram, trigram), e.g. "The cat sat on the ..."
- Acoustic model: context-dependent phones, e.g. triphones: "str" -> /t:s_r/
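The n-gram idea on the slide ("The cat sat on the ...") can be illustrated with a toy bigram model estimated from counts. The mini-corpus below is assumed for illustration only:

```python
from collections import Counter

# Toy bigram language model: P(w2 | w1) estimated from counts.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    """Maximum-likelihood bigram probability P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "the" is followed by "cat" twice and "mat" once in this corpus,
# so P(cat | the) = 2/3 and P(mat | the) = 1/3.
print(p_bigram("the", "cat"))
print(p_bigram("the", "mat"))
```

Production language models smooth these counts (to handle unseen n-grams) and extend the history to trigrams or longer, but the estimation principle is the same.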
ASR Components
- Noise model: Noise HMM + Speech HMM = combined model
- Pronunciation model: dictionary of citation forms + variants, e.g. (Dutch):
  /n/-deletion: /reiz@n/ -> /reiz@/
  /r/-deletion: /Amst@rdAm/ -> /Amst@dAm/
  /t/-deletion: /rextstre:ks/ -> /rexstre:ks/
  /@/-insertion: /delft/ -> /del@ft/

How HMM-based ASR Works
A very quick tutorial (with no maths)
Markov Model

Markov Model Alignment
Markov Model

Hidden Markov Model
HMM Alignment

HMMs for Speech
- Whole-word HMMs
- Sub-word HMMs
- Context-dependent sub-word HMMs, e.g. "seven": s:#_e  e:s_v  v:e_@  @:v_n  n:@_#
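The triphone labelling shown for "seven" can be generated mechanically from a phone string. A small sketch using the slide's phone:left_right notation, with "#" for the word boundary:

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent labels of the
    form phone:left_right, with '#' marking the word boundary."""
    padded = ["#"] + list(phones) + ["#"]
    return ["{}:{}_{}".format(p, l, r)
            for l, p, r in zip(padded, padded[1:], padded[2:])]

# "seven" with the slide's broad transcription
print(to_triphones(["s", "e", "v", "@", "n"]))
# -> ['s:#_e', 'e:s_v', 'v:e_@', '@:v_n', 'n:@_#']
```

Since the number of possible triphones is huge, real systems also cluster acoustically similar contexts so that rare triphones share parameters.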
ASR
- Target vocabulary: one, two, three
- Phonetic transcription: /wvn/, /tu/, /Tri/
- Hidden Markov Models built from sub-word triphones:
  one:   (w:#_v)(v:w_n)(n:v_#)
  two:   (t:#_u)(u:t_#)
  three: (T:#_r)(r:T_i)(i:r_#)
- HMM network over {one, two, three}: search for max Pr, e.g. recognised sequence "one one three two"
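The "max Pr" search through the HMM network is usually the Viterbi algorithm. A minimal sketch of the standard textbook recursion, on a toy two-state HMM with made-up parameters (not the slides' models):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, best state path) for an observation sequence."""
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {s: (math.log(start_p[s]) + math.log(emit_p[s][obs[0]]), [s])
            for s in states}
    for o in obs[1:]:
        # extend every surviving path by one observation, keeping only
        # the best predecessor for each state (dynamic programming)
        best = {s: max(
                    ((lp + math.log(trans_p[prev][s]) + math.log(emit_p[s][o]),
                      path + [s])
                     for prev, (lp, path) in best.items()),
                    key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])

# Toy 2-state HMM with invented probabilities.
states = ["q1", "q2"]
start_p = {"q1": 0.9, "q2": 0.1}
trans_p = {"q1": {"q1": 0.6, "q2": 0.4}, "q2": {"q1": 0.1, "q2": 0.9}}
emit_p = {"q1": {"a": 0.8, "b": 0.2}, "q2": {"a": 0.3, "b": 0.7}}

logp, path = viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p)
print(path)  # -> ['q1', 'q1', 'q2']
```

In a full recogniser the states are the concatenated triphone HMM states of the network, the observations are acoustic feature vectors, and beam pruning keeps the search tractable; the recursion itself is unchanged.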
Practical Challenges
[Diagram: USER interacting with an APPLICATION via speech input/output, keyboard input, text output, pen-pad input, camera input, graphical output and mouse input, mediated by a Linguistic Interpreter/Generator, a Spatio-Temporal Interpreter/Generator and a DIALOGUE MANAGER, with feedback; the user is also subject to other tasks, distractions, noise, vibration and acceleration]

Plasticity in ASR
- Practical ASR systems have to be able to adapt / learn in order to be flexible / robust, but the compilation of priors into an integrated network tends to lead to a static data structure
- Plasticity can be achieved by:
  - re-compilation of the network
  - adaptation of the model parameters
  - modification of the input representation
Concepts from Machine Learning
- Supervised learning (training): maximum likelihood (ML), expectation-maximisation (EM), maximum a-posteriori (MAP), maximum mutual information (MMI)
- Unsupervised learning (adaptation)

Acoustic Model Adaptation
[Figure: recognition rate vs amount of adaptation data; adapted models start from speaker-independent performance and approach speaker-dependent performance as adaptation data increase]
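The MAP criterion above can be illustrated for a single Gaussian mean: the adapted mean interpolates between the speaker-independent prior and the sample mean of the adaptation data, so the prior dominates when data are scarce. This is the standard textbook formula; the prior weight tau is an assumed illustrative value:

```python
def map_adapt_mean(prior_mean, data, tau=10.0):
    """MAP estimate of a Gaussian mean:
       mu = (tau * prior_mean + sum(data)) / (tau + len(data)).
    With little data the estimate stays near the prior; with lots of
    data it approaches the sample mean of the adaptation data."""
    n = len(data)
    return (tau * prior_mean + sum(data)) / (tau + n)

prior = 0.0           # speaker-independent mean
few = [2.0] * 2       # 2 adaptation frames at 2.0
many = [2.0] * 1000   # 1000 adaptation frames at 2.0

print(map_adapt_mean(prior, few))   # stays near the prior (0.33...)
print(map_adapt_mean(prior, many))  # approaches the data mean (1.98...)
```

This is why the recognition-rate curve sketched above rises smoothly with the amount of adaptation data instead of jumping.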
Acoustic Model Adaptation
- Model set selection
- Maximum likelihood linear regression (MLLR)
- Eigenvoices
- Vocal tract length normalisation (VTLN)
[Figure: response magnitude (dB) vs frequency (kHz), 0-6 kHz]

Environment Compensation
- Spectral subtraction (SS)
- Cepstral mean normalisation (CMN)
- Relative spectral processing (RASTA)
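Cepstral mean normalisation is simple enough to sketch in full: a stationary convolutional channel becomes an additive offset in the cepstral domain, so subtracting the per-utterance mean of each coefficient removes it. A minimal sketch on plain Python lists:

```python
def cmn(frames):
    """Cepstral mean normalisation: subtract the per-coefficient mean
    computed over the utterance, removing a stationary channel offset."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

# A constant channel offset added to every frame disappears after CMN,
# so the normalised "noisy" utterance equals the normalised clean one.
clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
offset = [10.0, -5.0]
noisy = [[c + o for c, o in zip(frame, offset)] for frame in clean]

assert cmn(noisy) == cmn(clean)
```

Spectral subtraction and RASTA attack additive noise and slowly varying channels respectively, and are more involved than this per-utterance mean removal.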
Language Model Adaptation
- Off-line: model interpolation, constraint specification
- On-line: dynamic cache, trigger models
[Diagram: background text corpus + task-specific adaptation text corpus -> model merging -> task-related model] [Bellegarda, 2004]

Pronunciation Adaptation
- ABI: Accents of the British Isles
Pronunciation Adaptation
- AM adaptation vs extended dictionary:
  Japanese speaking English: table: /teibl/ -> /teiburu/
  Italian speaking English: team: /ti:m/ -> /ti:m@/; linked: /linkt/ -> /link@t/
- Adapt the dictionary using a phone recogniser, e.g. an English phone recogniser on German:
  aktuelles: /?aktu:?el@s/ -> /{ktwel@us/, /{ktw3:m@z/, /ktw3:l@s/, /{ktwel@uz/, /{tkw3:r@s/, /@kwe@res/ [Goronzy et al, 2004]

Pronunciation Adaptation
[Figure: word error rate (%) for native vs non-native speakers under MLLR, ExtDict and MLLR+ExtDict conditions] [Goronzy et al, 2004]
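The extended-dictionary idea can be sketched as a lexicon that maps each word to its citation form plus accent variants, so the recogniser scores every listed pronunciation rather than only the canonical one. The structure below is hypothetical; the pronunciations are the slide's own examples:

```python
# Extended pronouncing dictionary: citation form first, variants after.
ext_dict = {
    "table":  ["teibl", "teiburu"],    # Japanese-accented variant
    "team":   ["ti:m", "ti:m@"],       # Italian-accented variant
    "linked": ["linkt", "link@t"],     # Italian-accented variant
}

def matches(word, heard):
    """True if a recognised phone string matches any listed variant."""
    return heard in ext_dict.get(word, [])

assert matches("table", "teiburu")   # accepted via the extended entry
assert not matches("team", "timu")   # not a listed variant
```

The results figure above suggests the two strategies are complementary: MLLR moves the acoustic models towards the accented speech, while the extended dictionary admits systematically different phone sequences.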
Summary
- Contemporary ASR changes dynamically to accommodate: new speakers, unexpected user behaviour, real acoustic environments
- The prime purpose of such plasticity is to improve recognition accuracy

Discussion Points
- The computational techniques employed by ASR for adaptation and learning may (or may not) give insights into plasticity in human speech perception
- Future progress in ASR may (or may not) be determined by insights gained at this workshop
Thank you
Any questions?