Human Speech Recognition. Julia Hirschberg, CS4706 (thanks to Francis Ganong and John-Paul Hosom for some slides)


Linguistic View of Speech Perception
- Speech is a sequence of articulatory gestures
- Many parallel levels of description: phonetic, phonological, prosodic, lexical, syntactic, semantic, pragmatic
- Human listeners make use of all these levels in speech perception
- Multiple cues and strategies are used in different contexts

ASR Paradigm
- Given an acoustic observation, what is the most likely sequence of words to explain the input?
- Uses an Acoustic Model and a Language Model
- Two problems: how to score hypotheses (modeling), and how to pick hypotheses to score (search)
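In symbols, this is the standard noisy-channel formulation the slide implies (not printed on the slide itself):

```latex
% Pick the word sequence W that best explains the acoustics A:
% the acoustic model scores P(A|W), the language model scores P(W),
% and search finds the argmax over hypotheses.
\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid A)
        = \operatorname*{arg\,max}_{W}
          \underbrace{P(A \mid W)}_{\text{acoustic model}} \,
          \underbrace{P(W)}_{\text{language model}}
```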

So... What's Human about State-of-the-Art ASR?
Pipeline: Input Wave → Front End → Acoustic Features → Acoustic Models → Search, drawing on the Lexicon and Language Models

Front End: MFCC
(Pipeline: Input Wave → Front End → Acoustic Features → Acoustic Models → Search → Postprocessing, with Lexicon and Language Models)
Front-end steps:
- Sampling, windowing
- Fast Fourier Transform
- Mel filter bank; cosine transform, keeping the first 8-12 coefficients
- Stacking, computation of deltas; normalizations: filtering, etc.
- Linear transformations: dimensionality reduction
→ Acoustic Features
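A minimal sketch of this front end, assuming librosa is available; the file name and frame parameters are illustrative choices, not values from the slides:

```python
# MFCC front end: windowed FFT -> mel filter bank -> log -> DCT,
# then deltas and a simple normalization, as outlined above.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)   # sampling

# 25 ms windows with a 10 ms hop; keep 13 cepstral coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# "Stacking, computation of deltas"
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, delta2])

# Per-utterance cepstral mean normalization (one common normalization)
features = features - features.mean(axis=1, keepdims=True)
print(features.shape)  # (39, n_frames)
```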


Basic Lexicon
(Architecture diagram repeated: Input Wave → Front End → Acoustic Features → Acoustic Models → Search, with Lexicon and Language Models)
- A list of spellings and pronunciations: canonical pronunciations, and a few others
- Limited to 64k entries
- Supports simple stems and suffixes
- Linguistically naïve: no phonological rewrites; doesn't support all languages
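A toy illustration of such a lexicon; the words and ARPAbet-style pronunciations are hypothetical examples, not entries from any real system:

```python
# Spellings mapped to a canonical pronunciation plus a few alternates,
# with no phonological rewrite rules ("linguistically naive").
LEXICON = {
    "tomato": [
        ["T", "AH", "M", "EY", "T", "OW"],   # canonical
        ["T", "AH", "M", "AA", "T", "OW"],   # alternate
    ],
    "the": [
        ["DH", "AH"],
        ["DH", "IY"],   # prevocalic variant
    ],
}

def pronunciations(word):
    """Return all listed pronunciations; unknown words get nothing."""
    return LEXICON.get(word.lower(), [])

print(pronunciations("tomato"))
```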

Lexical Access
- Frequency-sensitive, like ASR: we access high-frequency words faster and more accurately, with less information, than low-frequency words
- Access is in parallel, like ASR: we access multiple hypotheses simultaneously, based on multiple cues
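A minimal sketch of frequency-sensitive, parallel access; the acoustic scores and frequencies below are invented, and the log-frequency term plays the role of the prior (the language model, in ASR terms):

```python
# All candidate words are scored "in parallel"; word frequency biases
# the winner exactly as a prior would.
import math

acoustic_score = {"ship": -4.2, "sheep": -4.0, "chip": -5.1}  # log P(A|w), invented
word_frequency = {"ship": 120, "sheep": 15, "chip": 60}       # per million, invented

def rank_hypotheses(scores, freqs):
    return sorted(scores,
                  key=lambda w: scores[w] + math.log(freqs[w]),
                  reverse=True)

print(rank_hypotheses(acoustic_score, word_frequency))  # ['ship', ...]
```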

How Does Human Perception Differ from ASR? Could ASR systems benefit by modeling any of these differences?

How Do Humans Identify Speech Sounds?
- Perceptual critical point
- Perceptual Compensation Model
- Phoneme restoration effect
- Perceptual confusability
- Non-auditory cues
- Cultural dependence
- Categorical vs. continuous perception

How Much Information Do We Need to Identify Phones?
- Furui (1986) truncated CV syllables from the beginning, the end, or both, and measured human perception of the truncated syllables
- Identified the perceptual critical point: the truncation position at which recognition is 80% correct
- Findings:
  - The 10 msec around the point of greatest spectral transition is most critical for CV identification
  - Crucial information for both C and V is in this region
  - C can be perceived mainly from the spectral transition into the following V
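A sketch of how such a critical point could be read off identification data; the accuracy figures below are invented for illustration, not Furui's results:

```python
# Locate the truncation position where identification accuracy
# crosses the 80%-correct criterion, by linear interpolation.
import numpy as np

truncation_ms = np.array([0, 10, 20, 30, 40, 50])     # amount truncated
percent_correct = np.array([98, 95, 88, 79, 55, 30])  # hypothetical data

def critical_point(x, y, threshold=80.0):
    # np.interp needs increasing x, so interpolate accuracy -> truncation
    return np.interp(threshold, y[::-1], x[::-1])

print(f"critical point ≈ {critical_point(truncation_ms, percent_correct):.1f} ms")
```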

Can this help ASR?

Target Undershoot
- Vowels may or may not reach their target formants, due to coarticulation
- The amount of undershoot depends on syllable duration, speaking style, ...
- How do people compensate in recognition?
- Lindblom & Studdert-Kennedy (1967): synthetic stimuli in wVw and yVy contexts, with the vowel's F2 varying from high (/ih/) to low (/uh/) and with different transition slopes from consonant to vowel
- Subjects were asked to judge /ih/ or /uh/

- The boundary for perceiving /ih/ vs. /uh/ (given the varying F2 values) differed between the wVw and yVy contexts
- In yVy contexts, mid-level values of F2 were heard as /uh/; in wVw contexts, mid-level values of F2 were heard as /ih/
(Figure: /w ih w/ vs. /y uh y/ stimuli)

Perceptual Compensation Model
- Conclusion: subjects rely on the direction and slope of formant transitions to classify vowels
- Lindblom's PCM: normalize formant frequencies based on the formants of the surrounding consonants, canonical vowel targets, and syllable duration
- Application to ASR? Determining the locations of consonants and vowels is non-trivial
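A deliberately simplified sketch of the compensation idea, not Lindblom's actual model: extrapolate the observed formant toward where the transition was heading, scaled by syllable duration. All numbers, including the time constant, are invented:

```python
# Short syllables leave less time to reach the vowel target, so the
# observed F2 is pulled toward the consonant's F2; we invert that pull.
import math

def compensated_f2(f2_consonant, f2_vowel_midpoint, syllable_dur_ms,
                   time_constant_ms=80.0):
    """Estimate the intended F2 target from the observed midpoint value."""
    # Fraction of the distance to the target actually covered (0..1)
    reached = 1.0 - math.exp(-syllable_dur_ms / time_constant_ms)
    return f2_consonant + (f2_vowel_midpoint - f2_consonant) / reached

# /w/ has low F2 (~800 Hz); an observed mid-level 1500 Hz in a short
# syllable implies a higher intended target, pushing the percept toward /ih/.
print(round(compensated_f2(800, 1500, 120)))
```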

Can this help ASR?

Phoneme Restoration Effect
- Warren (1970) presented subjects with "The state governors met with their respective legislatures convening in the capital city."
- The [s] in "legislatures" was replaced with a cough
- Task: find any missing sounds
- Result: 19/20 reported no missing sounds (the remaining subject thought a different sound was missing)
- Conclusion: much speech processing is top-down rather than bottom-up

Perceptual Confusability Studies
- Hypothesis: confusable consonants are confused in production because they are perceptually similar, e.g. [dh/z/d] and [th/f/v]
- Experiment: embed syllables beginning with the target consonants in noise, ask listeners to identify them, and examine the confusion matrix

Is there confusion between voiced and voiceless sounds? Shepard's similarity metric:

S_{ij} = \frac{P_{ij} + P_{ji}}{P_{ii} + P_{jj}}
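Computed from a confusion matrix, the metric looks like this; the matrix values are invented for illustration:

```python
# Shepard's similarity from a confusion matrix: P[i, j] is the proportion
# of times stimulus i was identified as response j.
import numpy as np

def shepard_similarity(P, i, j):
    """S_ij = (P_ij + P_ji) / (P_ii + P_jj)."""
    return (P[i, j] + P[j, i]) / (P[i, i] + P[j, j])

# Rows/columns: hypothetical [th], [f] confusions in noise
P = np.array([[0.55, 0.30],
              [0.25, 0.60]])
print(shepard_similarity(P, 0, 1))  # higher value -> more perceptually similar
```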

Can this help ASR?

Speech and Visual Information
- How does visual observation of articulation affect speech perception?
- McGurk effect (McGurk & MacDonald 1976): subjects heard simple syllables while watching video of speakers producing phonetically different syllables (demo), e.g. hearing [ba] while watching [ga]
- What do they perceive?
- Conclusion: humans have a perceptual map of place of articulation distinct from the auditory one

Can this help ASR?

Speech/Somatosensory Connection
- Ito et al. (2008) showed that stretching the mouth can influence speech perception
- Subjects heard "head", "had", or something on a continuum in between
- A robotic device stretched the mouth up, down, or backward
- An upward stretch led to "head" judgments and a downward stretch to "had", but only when the timing of the stretch imitated production of the vowel
- What does this mean about our perceptual maps?

Can this help ASR?

Is Speech Perception Culture-Dependent?
- Mandarin tones: high, falling, rising, dipping (the dipping tone is usually not fully realized)
- Tone sandhi: dipping + dipping → rising + dipping (see the sketch below)
- Why? It is easier to say, and the dipping and rising tones are perceptually similar, so rising is an appropriate substitute
- Comparison of native and non-native speakers' tone perception (Huang 2001)
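A toy rendering of the sandhi rule just described (real third-tone sandhi over longer tone sequences is more involved):

```python
# Third-tone sandhi: a dipping tone [214] followed by another dipping
# tone surfaces as rising [35].
def apply_tone_sandhi(tones):
    """tones: list of tone categories, e.g. ['dipping', 'dipping', 'high']."""
    out = list(tones)
    for k in range(len(out) - 1):
        if out[k] == "dipping" and out[k + 1] == "dipping":
            out[k] = "rising"
    return out

print(apply_tone_sandhi(["dipping", "dipping"]))  # ['rising', 'dipping']
```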

Determining the perceptual maps of Mandarin and American English subjects:
- Discrimination task, measuring reaction time: two syllables compared, differing only in tone; task: same or different?
- Reaction times for correct "different" answers were averaged
- Perceptual distance is taken as 1/RT (faster discrimination = greater distance)

(Table: averaged reaction times, in msec, for each pair of the four tones — High [55], Rising [35], Dipping [214], Falling [51] — for Mandarin and for American English listeners.)
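Converting reaction times to perceptual distances as described above; the RT values below are invented for illustration, not Huang's (2001) data:

```python
# distance = 1/RT: the slower two tones are to tell apart,
# the closer they sit in the perceptual map.
rt_ms = {  # averaged reaction times for correct "different" answers
    ("high", "rising"): 600.0,
    ("high", "dipping"): 550.0,
    ("rising", "dipping"): 700.0,   # slowest -> hardest to tell apart
}

distance = {pair: 1.0 / rt for pair, rt in rt_ms.items()}

# The pair with the longest RT has the smallest perceptual distance
closest = min(distance, key=distance.get)
print(closest, distance[closest])  # ('rising', 'dipping') ...
```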

Can this help ASR?

Is Human Speech Perception Categorical or Continuous?
- Do we hear discrete symbols, or a continuum of sounds? What evidence should we look for?
- Categorical perception predicts a range of stimuli yielding no perceptual difference, a boundary where perception changes, and then another range with no perceptual difference
- Example: voice onset time (VOT)
  - If VOT is long, people hear unvoiced plosives; if VOT is short, voiced plosives
  - But people don't hear ambiguous plosives at the boundary between short and long (~30 msec)
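A sketch of the pattern this predicts: identification flips sharply at a ~30 msec boundary, while steps within a category barely matter. The logistic form and steepness are illustrative assumptions, not fitted values:

```python
# Identification function for a voiced/voiceless plosive continuum.
import math

def p_voiceless(vot_ms, boundary_ms=30.0, steepness=0.5):
    """Probability of hearing a voiceless plosive (e.g. /p/ vs. /b/)."""
    return 1.0 / (1.0 + math.exp(-steepness * (vot_ms - boundary_ms)))

for vot in (10, 25, 30, 35, 50):
    print(f"VOT {vot:2d} ms -> P(voiceless) = {p_voiceless(vot):.2f}")
# Within-category steps (10 -> 25 ms) change the percept little;
# the 25 -> 35 ms step crosses the boundary and flips it.
```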

Non-categorical, sort of
- Barclay (1972) presented subjects with a range of stimuli between /b/, /d/, and /g/, and asked them to respond only with /b/ or /g/
- If perception were completely categorical, responses to the /d/ stimuli should have been random, but they were systematic
- Perception may be continuous but have sharp category boundaries

Can this help ASR?

Where Is ASR Going Today?
- 3 → 5: triphones → quinphones; trigrams → 5-grams
- Bigger acoustic models: more parameters, more mixtures
- Bigger lexicons: 65k → 256k entries

- Bigger language models: more data, more parameters
- Bigger acoustic models: more sharing
- Bigger language models: better backoffs
- More kinds of adaptation, e.g. feature-space adaptation
- Discriminative training instead of MLE, to penalize error-producing parameter settings
- ROVER: combinations of recognizers
- Finite-state machine architecture to flatten knowledge into a uniform structure
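As one concrete instance of n-gram backoff (a stand-in for the "better backoffs" point, not the slides' specific method), here is stupid-backoff-style scoring in the spirit of Brants et al. (2007); the corpus and alpha constant are illustrative:

```python
# Fall back to a lower-order estimate when the higher-order count is unseen.
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def score(prev, word, alpha=0.4):
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]   # seen bigram
    return alpha * unigrams[word] / total               # back off to unigram

print(score("the", "cat"))  # seen bigram
print(score("mat", "ran"))  # unseen bigram -> backed-off unigram estimate
```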

But not...
- Perceptual Linear Prediction: modifying cepstral coefficients according to psychophysical findings
- Use of articulatory constraints
- Modeling features instead of specific phonemes
- Neural nets, SVMs / kernel methods, example-based recognition, segmental models (frames → segments), graphical models (merging graph theory and probability theory)
- Parsing

"No Data Like More Data" Still Winning
- Standard statistical problems: curse of dimensionality, long tails, desirability of priors
- Quite sophisticated statistical models
- Advances due to increased size and sophistication of models
- Like Moore's law: no breakthroughs, dozens of small incremental advances
- Tiny impact of linguistic theory/experiments

Next Class
- Newer tasks for recognition