Phonological Models in Automatic Speech Recognition


Phonological Models in Automatic Speech Recognition. Karen Livescu, Toyota Technological Institute at Chicago. June 19, 2008.

What can automatic speech recognition (ASR) do? NIST benchmark evaluation results, 1988-2007. WER = (#subs + #ins + #del) / #ref. Approximate word error rates by task:
- meetings: ~25-40%
- telephone conversations: ~20%
- broadcast news: ~10%
- WSJ dictation: ~5-10%
- digits: <1%
[figure from Fiscus et al. 07, The Rich Transcription 2007 Meeting Recognition Evaluation, http://www.nist.gov/speech/publications/papers/]
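
To make the metric concrete, here is a minimal sketch (not from the talk) of the WER computation as a Levenshtein alignment over words; the example word strings are invented:

```python
# Minimal word error rate: WER = (#subs + #ins + #del) / #ref,
# computed via dynamic-programming edit distance over words.
def wer(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat".split(), "the bat sat down".split()))  # 1 sub + 1 ins -> 2/3
```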

What is so difficult about conversational speech?
- Non-speech events (e.g., laughter, sighs)
- Variable speaking rate
- Disfluencies (e.g., partial words, hesitations, repeated syllables)
- Extreme pronunciation variation

Pronunciation variation in conversational speech: Examples [data from Greenberg et al. 96]

word | baseform | surface forms (count)
probably | p r aa b ax b l iy | (2) p r aa b iy; (1) p r ay; (1) p r aw l uh; (1) p r ah b iy; (1) p r aa l iy; (1) p r aa b uw; (1) p ow ih; (1) p aa iy; (1) p aa b uh b l iy; (1) p aa ah iy
sense | s eh n s | (1) s eh n t s; (1) s ih t s
everybody | eh v r iy b ah d iy | (1) eh v r ax b ax d iy; (1) eh v er b ah d iy; (1) eh ux b ax iy; (1) eh r uw ay; (1) eh b ah iy
don't | d ow n t | (37) d ow n; (16) d ow; (6) ow n; (4) d ow n t; (3) d ow t; (3) d ah n; (3) ow; (3) n ax; (2) d ax n; (2) ax; (1) n uw

[figure: # pronunciations per word vs. minimum # of occurrences]

Effect of pronunciation variation on ASR performance
- Words pronounced non-canonically are more likely to be misrecognized [Fosler-Lussier 99]
- Deletions are especially difficult to account for [Jurafsky et al. 01]
- Conversational speech is recognized at almost twice the error rate of read speech [Weintraub et al. 96]:
  style | word error rate (%): spontaneous conversation 52.6 | read conversational 37.6 | read dictation 28.8
- Simulated-data experiments show the potential benefit of a good pronunciation model [McAllaster et al. 98]:
  test data | word error rate (%): real 48.8 | simulated from dictionary 1.8 | simulated from transcription 43.9

Overview
- Preliminaries: Automatic speech recognition (ASR)
- Phone-based pronunciation models
- Non-phonetic alternatives
- Ongoing/future work

Speech recognition: The generative statistical setting
- Language model P(w): w = some words
- Pronunciation model P(q|w): q = [ s s s ah ah ah m m m m m ]
- Observation model P(a|q): a = [acoustic observations, e.g., a spectrogram]
Recognition: w* = argmax_w P(w|a) = argmax_w P(a|w) P(w) = argmax_w P(w) Σ_q P(q|w) P(a|q)
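
As a toy illustration of this decomposition, the sketch below scores two hypothetical words with made-up model tables; only the shapes of P(w), P(q|w), and P(a|q) matter here, not the numbers:

```python
# Toy noisy-channel recognition: w* = argmax_w P(w) * sum_q P(q|w) * P(a|q).
# All words, pronunciations, and probabilities are invented for illustration.
P_w = {"some": 0.6, "sum": 0.4}                      # language model P(w)
P_q_given_w = {                                      # pronunciation model P(q|w)
    "some": {("s", "ah", "m"): 1.0},
    "sum":  {("s", "ah", "m"): 0.9, ("s", "uh", "m"): 0.1},
}
def P_a_given_q(q):                                  # observation model P(a|q), stubbed
    return {("s", "ah", "m"): 0.01, ("s", "uh", "m"): 0.002}[q]

def recognize():
    scores = {w: P_w[w] * sum(p * P_a_given_q(q)
                              for q, p in P_q_given_w[w].items())
              for w in P_w}
    return max(scores, key=scores.get)

print(recognize())  # "some": 0.006 beats "sum": 0.00368
```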

Speech recognition: The generative statistical setting (continued)
- Language model P(w), e.g., an n-gram: P(w = w_1, w_2, ..., w_k) = Π_i P(w_i | w_{i-1}, w_{i-2}, ..., w_{i-(n-1)})
- Pronunciation model P(q|w), e.g., a network for "either": states 1-2-3-4, with parallel initial arcs iy/.5 and ay/.5 followed by dh and er
- Observation model P(a_i | q_i), e.g., per-phone output distributions for q = iy, q = ay, q = dh
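
A minimal bigram instance of the n-gram factorization, with invented probabilities:

```python
# Bigram language model: P(w_1..w_k) = prod_i P(w_i | w_{i-1}).
# The probability table below is fabricated for illustration.
import math

bigram = {("<s>", "either"): 0.1, ("either", "way"): 0.3, ("way", "</s>"): 0.2}

def logprob(words):
    w = ["<s>"] + words + ["</s>"]
    return sum(math.log(bigram[(a, b)]) for a, b in zip(w, w[1:]))

print(math.exp(logprob(["either", "way"])))  # 0.1 * 0.3 * 0.2 = 0.006
```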

Overview
- Preliminaries: Automatic speech recognition (ASR)
- Phone-based pronunciation models
- Non-phonetic alternatives
- Ongoing/future work

Phone-based pronunciation modeling
The lexicon is expanded with substitution, deletion, and insertion rules, as in derivational phonology [Chomsky & Halle 68]:
  sense: dictionary /s eh n s/ → [t]-insertion rule → [s eh n t s]
Transformation rules are of the form u → s / u_L _ u_R ; p, e.g.:
- Epenthetic stop insertion: Ø → t / n _ s ; 0.5
- Flapping: t → dx / V _ V ; 0.7
Rules are derived from:
- Linguistic knowledge [Zue et al. 75, Cohen 89, Tajchman et al. 95, Finke & Waibel 97, Hazen et al. 02, Seneff & Wang 05]
- Data [Chen 90, Riley & Ljolje 95, Byrne et al. 97, Riley et al. 99, Fosler-Lussier 99]
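
The sketch below applies one optional insertion rule of this form to a baseform and enumerates the resulting surface forms with probabilities; the rule encoding is my own simplification, not taken from any of the cited systems:

```python
# Apply the epenthetic-stop rule (0 -> t / n _ s ; p = 0.5) to a baseform,
# enumerating surface pronunciations and their probabilities.
def apply_insertion(phones, left, right, ins, p):
    """Return [(surface_form, prob), ...] for one optional insertion rule."""
    outs = [([], 1.0)]
    for prev, cur in zip([None] + phones, phones + [None]):
        if cur is None:
            break
        new = []
        for seq, pr in outs:
            if prev == left and cur == right:
                new.append((seq + [ins, cur], pr * p))      # rule fires
                new.append((seq + [cur], pr * (1 - p)))     # rule does not fire
            else:
                new.append((seq + [cur], pr))
        outs = new
    return [(" ".join(s), pr) for s, pr in outs]

print(apply_insertion("s eh n s".split(), "n", "s", "t", 0.5))
# [('s eh n t s', 0.5), ('s eh n s', 0.5)]
```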

Learning phonological rules from data
Pipeline: training waveform → phonetic recognition / manual transcription (e.g., p o r ax l eh n) → alignment against the baseline pronunciation graph (p o r t l ax n d) → probability estimation.
Example decision tree for underlying /t/:
- Previous phone a stressed vowel? If no: Ø .5, tcl .3, t .15, dx .05
- If yes: next phone an unstressed vowel? If yes: dx .4, t .4, tcl .1, Ø .1; if no: t .4, tcl .3, dx .2, Ø .1
[diagram: aligned pronunciation graph with arc distributions, e.g., dx .8, t .1, tcl .05, Ø .05 and Ø .4, tcl .3, t .2, dx .1]
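
The probability-estimation step amounts to counting surface realizations in each context defined by the tree's questions; a minimal sketch with fabricated alignment data:

```python
# Estimate P(surface realization of /t/ | context question answer) by counting
# aligned (context, surface) pairs. The aligned data here is invented.
from collections import Counter, defaultdict

aligned = [  # (next phone is an unstressed vowel?, surface realization of /t/)
    (True, "dx"), (True, "dx"), (True, "t"), (True, "-"),
    (False, "t"), (False, "tcl"), (False, "-"), (False, "t"),
]

counts = defaultdict(Counter)
for unstressed_next, surface in aligned:
    counts[unstressed_next][surface] += 1

for answer, c in counts.items():
    total = sum(c.values())
    dist = {s: n / total for s, n in c.items()}
    print(f"next phone unstressed vowel = {answer}: {dist}")
```

A full decision-tree learner would also choose which context question to split on, typically the one that most increases the likelihood of the aligned data.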

Finite-state representation of phonological rules
Rewrite rules of the form u → s / u_L _ u_R can be represented as finite-state transducers (FSTs) [Johnson 72].
Example: the /t/-flapping rule, t → dx / V _ V.
Multiple ordered rules F_1, F_2, ... can be combined into a single FST via composition: F_1 ∘ F_2 ∘ ...
[figure from Jurafsky & Martin, Speech and Language Processing]
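
As a concrete, if simplified, instance of such a transducer, the following hand-coded three-state machine implements obligatory flapping; real systems compile rules into FSTs automatically, but the key trick of buffering /t/ until the right context is seen is the same:

```python
# A three-state transducer for obligatory flapping, t -> dx / V _ V.
# The states and the (tiny) phone set are my own simplification.
VOWELS = {"aa", "ae", "ah", "iy", "ih", "er", "ax"}

def flap(phones):
    out, state = [], 0          # 0: start, 1: after vowel, 2: after vowel + t
    for p in phones:
        if state == 2:          # decide what the buffered /t/ was
            out.append("dx" if p in VOWELS else "t")
        if p == "t" and state == 1:
            state = 2           # buffer the /t/ until the next symbol arrives
        else:
            out.append(p)
            state = 1 if p in VOWELS else 0
    if state == 2:
        out.append("t")         # word-final /t/ is never flapped
    return out

print(flap("b ae t er".split()))   # ['b', 'ae', 'dx', 'er']
print(flap("b ae t".split()))      # ['b', 'ae', 't']
```

Chaining a second rule as rule2(flap(phones)) mirrors the FST composition F_1 ∘ F_2 described above.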

Phone-based pronunciation modeling: Some results

model | task | impact on WER (%)
Rule learning from manual transcriptions + retraining [Riley et al. 99] | Broadcast news | 12.7 → 10.0
Rule learning from manual transcriptions + retraining [Riley et al. 99] | Switchboard | 44.7 → 43.8
Decision trees + dynamic lexicon [Fosler-Lussier 99] | Broadcast news | 21.4 → 20.4
Knowledge-based rules + FST weight learning [Hazen et al. 02] | Weather queries | 12.1 → 11.0

- Roughly 1-3% WER improvement across tasks: significant improvements on difficult tasks, but not as large as expected
- Implicit pronunciation modeling with one pronunciation per word [Hain 02]: the observation model accounts for the remaining variability, with similar performance to multi-pronunciation dictionaries
- The state of the art uses one, or a few, fixed pronunciations per word

Overview
- Preliminaries: Automatic speech recognition (ASR)
- Phone-based pronunciation models
- Non-phonetic alternatives
- Ongoing/future work

The argument against the phone
- Pronunciation changes are gradual
- If had [h ae d] → [h eh d], then had is confusable with head
- Is [ae] → [eh] really happening? No:
[figure from Saraclar & Khudanpur, Speech Communication, 04]

Automatically derived units & syllables
- Automatically derived subword units [Holter & Svendsen 97, Bacchiani & Ostendorf 99, Varadarajan et al. 08]
  - Learned by segmentation + clustering of the acoustics
  - Lexicon built by aligning word segments with the learned units
- Syllable units [Ganapathiraju et al. 01, Sethy & Narayanan 03]
  - Motivation: reduction phenomena are reported to occur within syllable boundaries, and human transcribers label syllables more easily than phones [Fosler-Lussier et al. 99]
  - States are not shared across syllables: had and head are always different
- Both approaches show impressive results on small-vocabulary tasks (~1/3 reduction in WER), but are not directly applicable to infrequent words/syllables

Two paths toward progress
- Adapt syllable/automatic-unit models for larger vocabularies
- Look to phonology again, this time autosegmental/articulatory phonology

Articulatory features as subword units
Inspired by ideas in phonology:
- Autosegmental phonology [Goldsmith 76]: the phonetic representation consists of multiple tiers of segments, with some constraints ("associations") among them
- Articulatory phonology [Browman & Goldstein 92]: tiers consist of articulatory gestures, with phasing relations
Surface realizations stray from the dictionary via (1) asynchrony and (2) gesture reduction.

feature | values
LIP-LOC | protruded, labial, dental
LIP-OP | closed, critical, narrow, wide
TT-LOC | dental, alveolar, palato-alveolar, retroflex
TB-LOC | palatal, velar, uvular, pharyngeal
TT-OP, TB-OP | closed, critical, narrow, mid-narrow, mid, wide
GLO | closed (glottal stop), critical (voiced), open (voiceless)
VEL | closed (non-nasal), open (nasal)

[figure: midsagittal vocal tract annotated with LIP-LOC, LIP-OP, TT-LOC, TT-OP, TB-LOC, TB-OP, VELUM, GLOTTIS]
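
For concreteness, the feature inventory can be written down as a lookup table; the example target configuration for /n/ below is my own reading of the chart, not taken from the talk:

```python
# The slide's articulatory feature inventory as a lookup table.
FEATURES = {
    "LIP-LOC": ["protruded", "labial", "dental"],
    "LIP-OP":  ["closed", "critical", "narrow", "wide"],
    "TT-LOC":  ["dental", "alveolar", "palato-alveolar", "retroflex"],
    "TB-LOC":  ["palatal", "velar", "uvular", "pharyngeal"],
    "TT-OP":   ["closed", "critical", "narrow", "mid-narrow", "mid", "wide"],
    "TB-OP":   ["closed", "critical", "narrow", "mid-narrow", "mid", "wide"],
    "GLO":     ["closed", "critical", "open"],   # glottal stop / voiced / voiceless
    "VEL":     ["closed", "open"],               # non-nasal / nasal
}

# Hypothetical target configuration for /n/: alveolar closure, open velum, voicing.
N_TARGETS = {"TT-LOC": "alveolar", "TT-OP": "closed", "VEL": "open", "GLO": "critical"}

assert all(v in FEATURES[f] for f, v in N_TARGETS.items())
```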

The argument against the phone (2)
[X-ray video from the Speech Communication Group, MIT]

The argument against the phone (3)
- sense → [s eh n t s]: phone insertion?
- wants → [w aa n t s]: phone deletion??
- sense → [s ih n t s]: phone deletion + substitution??
- several → [s eh r v ax l]: exchange of two phones?!?!?
- Texas Instruments → [t eh k s ih n s ch em ih n n s]
- everybody → [eh r uw ay]

The argument against the phone (4)
- Even humans have difficulty with phonetic transcription [Ostendorf 99, Fosler-Lussier et al. 99]
- Deleted phones are sometimes still perceived
- Inter-transcriber disagreement is high (~25% string error) [Saraclar 04]
- Feature-level transcription may be more reliable [Livescu et al. 07]
[figure: time in agreement (kappa statistic) for feature-based, hybrid, and phone-based transcription, per feature (pl1, pl2, dg1, dg2, nas, glo, rd, vow) and on average]

Revisiting sense [s eh n t s], [s ih n t s]

Dictionary form [s eh n s], feature values per phone:
phone | s | eh | n | s
GLO | open | critical | critical | open
VEL | closed | closed | open | closed
TB | mid/uvular | mid/palatal | mid/uvular | mid/uvular
TT | critical/alveolar | mid/alveolar | closed/alveolar | critical/alveolar

Surface variant #1 [s eh n t s]: the velum closes before the tongue tip releases its closure, so a stretch with VEL closed and TT still closed/alveolar surfaces as an inserted [t].

Surface variant #2 [s ih n t s]: in addition, the tongue body target is reduced (mid-narrow/palatal rather than mid/palatal), so the vowel surfaces as [ih].

Articulatory feature models: Main ideas [Livescu & Glass 04, 05]
Baseform dictionary entry for everybody (excerpt):
index | 0 | 1 | 2 | 3
phone | eh | v | r | iy
GLO | crit | crit | crit | crit
LIPS | wide | crit | nar | wide
+ asynchrony: each feature stream keeps its own index into the baseform, e.g.
  index GLO:  1 1 1 2 2 2 2 2
  index LIPS: 1 1 1 1 1 2 2 2
+ feature substitutions: the actual value may stray from the target, e.g.
  target LIPS: W W W W C C C C C N N N
  actual LIPS: W W N N N C C C C N N N

Articulatory feature models: Initial approaches
- Finite-state models with a product state space [Erler & Freeman 96; Deng et al. 97; Richardson & Bilmes 03]
  - Each state is a vector of feature values
  - Asynchrony among features is allowed between target articulations
  [figure from Richardson & Bilmes, Speech Communication, 03]
- Two-pass models [Huckvale 94, Blackburn 96, Reetz 98]
  - 1st pass: feature classification
  - 2nd pass: decoding the word sequence from the features
- A modeling problem:
  - Finite-state models don't take advantage of known independence properties
  - Two-pass models assume too much independence

Articulatory feature models: Recent work
Articulatory approaches require more flexible probability models. One solution: dynamic Bayesian networks, which
- allow the factorization of the state into multiple variables,
- can represent independence assumptions exactly,
- have recently been gaining popularity in ASR [Zweig 98, Bilmes 99, JHU WS01/04/06],
- and have at least one ASR-oriented toolkit available (GMTK) [Bilmes 02].

Aside: Bayesian networks
- Bayesian network (BN): a directed-graph representation of a distribution over a set of variables
  - Graph node = variable + its distribution given its parents
  - (Lack of) graph edges = independencies
  - Joint distribution = product of the local distributions
  - Example: lip rounding r ~ p(r), tongue height h ~ p(h), F1 ~ p(f | r, h)
- Dynamic Bayesian network (DBN): a BN with a repeating structure
  - Example: the hidden Markov model (HMM), with state s_i and observation o_i in each frame, local distributions p(s_i | s_{i-1}) and p(o_i | s_i), and joint p(o_{1:L}, s_{1:L}) = p(s_1) p(o_1 | s_1) Π_{i=2..L} p(s_i | s_{i-1}) p(o_i | s_i)
- Uniform algorithms exist for (among other things):
  - Finding the most likely values of some variables given the rest (analogous to the Viterbi algorithm for HMMs)
  - Learning model parameters via expectation maximization (EM)
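
A minimal numeric check of the HMM joint, with invented parameters:

```python
# HMM joint from the slide:
# p(o_1:L, s_1:L) = p(s_1) p(o_1|s_1) * prod_{i>=2} p(s_i|s_{i-1}) p(o_i|s_i).
# Two-state toy model; all numbers are made up.
init = {"A": 0.7, "B": 0.3}
trans = {("A", "A"): 0.9, ("A", "B"): 0.1, ("B", "A"): 0.4, ("B", "B"): 0.6}
emit = {("A", "x"): 0.5, ("A", "y"): 0.5, ("B", "x"): 0.2, ("B", "y"): 0.8}

def joint(states, obs):
    p = init[states[0]] * emit[(states[0], obs[0])]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[(prev, cur)] * emit[(cur, o)]
    return p

print(joint(["A", "A", "B"], ["x", "y", "y"]))  # 0.7*0.5*0.9*0.5*0.1*0.8 = 0.0126
```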

Approach: Main ideas [Livescu & Glass 04, 05]
As in the everybody example above, but with both variation mechanisms made probabilistic:
- Asynchrony is governed by P(index_GLO | index_LIPS), e.g., P(index_GLO | index_LIPS = 1):
  index GLO:  1 1 1 2 2 2 2 2
  index LIPS: 1 1 1 1 1 2 2 2
- Feature substitutions are governed by P(actual | target), e.g., P(actual = N | target = W):
  target LIPS: W W W W C C C C C N N N
  actual LIPS: W W N N N C C C C N N N
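
A generative sketch of these two mechanisms: each feature stream advances through the baseform indices, one stream may lag another (asynchrony), and surface values are sampled from substitution distributions (reduction). The value sets and probabilities below are invented for illustration:

```python
import random

# Baseform feature targets for four segments (invented).
TARGETS = {"LIPS": ["W", "C", "N", "W"], "GLO": ["op", "cr", "cr", "op"]}
P_LAG = 0.3                      # chance the GLO stream lags one segment behind
P_SUB = {"W": {"W": 0.8, "N": 0.2}, "C": {"C": 0.9, "N": 0.1},
         "N": {"N": 1.0}, "op": {"op": 1.0}, "cr": {"cr": 1.0}}

def sample(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value

def sample_surface(frames_per_index=3):
    lips_idx = [i for i in range(4) for _ in range(frames_per_index)]
    lag = 1 if random.random() < P_LAG else 0
    glo_idx = [max(0, i - lag) for i in lips_idx]            # bounded asynchrony
    lips = [sample(P_SUB[TARGETS["LIPS"][i]]) for i in lips_idx]  # substitution
    glo = [sample(P_SUB[TARGETS["GLO"][i]]) for i in glo_idx]
    return " ".join(lips), " ".join(glo)

print(sample_surface())
```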

Dynamic Bayesian network-based articulatory model
Per frame, the DBN contains: the word; for each feature stream (e.g., lip opening, tongue, voicing), an index into the word's baseform, the target value that index determines, and the actual (surface) value; and synchronization variables (sync_lips,tongue; sync_tongue,voc.) constraining how far apart the index streams may drift.

Baseform table for everybody (excerpt):
index | phone | voc. | lip op.
0 | eh | C | W
1 | v | C | C
2 | r | C | N (in a second variant, W .5 / N .5)
3 | iy | C | W

Substitution CPT excerpt, P(actual lip op. | target): target CL → CL .7, C .2, N .1; target C → C .7, N .2, M .1; target N → N .7, M .2, O .1.

Example frame-level streams:
index voc.:     1 1 1 2 2 2 2 2
index lip op.:  1 1 1 1 1 2 2 2
target lip op.: W W W W C C C C C W W W
actual lip op.: W W N N N C C C C W W W

Model parameters
- Phone-to-feature mapping, e.g.:
  phone | GLOT | VEL | LIP-OPEN | TT-OPEN
  aa | V (1) | CL (1) | WI (1) | WI (1)
  m | V (1) | OP (1) | CL (1) | CL (.2), CR (.2), NA (.2), M-N (.2)
- Soft synchrony constraints P(async_{A;B})
- Feature substitution probabilities, e.g. for LIP-OPEN (target u, surface s):
  u \ s | CL | CR | NA | WI
  CL | .8 | .15 | .05 |
  CR | | .8 | .2 |
  NA | | | .8 | .2
  WI | | | | 1
  and for GLOT: V → V (1); VL → VL (1)
- Transition probabilities: in each frame, the probability of transitioning to the next state in the word
Maximum-likelihood parameter values are learned via expectation maximization.

Where will the data for parameter learning come from?
- Manual transcriptions
- Articulatory measurements (EMA, X-ray microbeam, MRI, ...)
- Nowhere!

Lexical access experiments
Task: recognition of Switchboard words, given the manual phonetic transcription.
- Phone-based model: 66% coverage, 54% accuracy
- Feature-based model: 75% coverage, 61% accuracy
Example: everybody → [eh r uw ay]
[figure: input transcription, hypothesized feature targets, and hypothesized state sequence]
What works? Vowel nasalization & rounding; nasal + stop → nasal; some schwa deletions.
What doesn't work? Some deletions; vowel retroflexion; alveolar + [y] → palatal.

Overview
- Preliminaries: Automatic speech recognition (ASR)
- Phone-based pronunciation models
- Non-phonetic alternatives
- Ongoing/future work

Ongoing work
This is not yet a complete recognizer: we still need an observation model P(a|q), where q = the hidden variables.
- Gaussian mixture distribution conditioned on all features [Livescu et al. 07]
- Separate observation model per feature: P(a | voicing), P(a | lips), ... [Livescu et al. 03, 07]
- Posterior-based models: P(voicing | a), P(lips | a), ... [Hasegawa-Johnson et al. 05, Cetin et al. 07]
- Applied to lipreading, this improves accuracy over viseme-based models [Saenko et al. 05, 06]
Additional ongoing work: cross-word modeling, audio-visual speech recognition [Hasegawa-Johnson et al. 07]

Concluding remarks
- Speech recognition has borrowed much from phonology:
  - Derivational phonology → phonetic rule-based pronunciation modeling
  - Autosegmental/articulatory phonology → feature-based modeling
- The best subword representation is unlikely to be the phone: syllables, acoustically defined units, articulatory features
- This is a time of transition for pronunciation modeling: new approaches may require new statistical/machine learning tools, and graphical models provide a natural framework

Concluding questions
Can we use speech recognition models to learn something about speech?
- How much reduction can occur? e.g., instruments → [ih s ch em ih n s]
- How do these phenomena depend on the speaker, dialect, language impairment, ...?
- How do model scores relate to human perceptual judgments?
[figure: alignment of the transcription with underlying (U) and surface (S) values for VEL and TT-LOC]