Automatic speech recognition

Few useful books
- Lawrence Rabiner, Biing-Hwang Juang, Fundamentals of Speech Recognition, Prentice-Hall, Upper Saddle River, NJ, USA, 1993
- Dan Jurafsky, James H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice Hall, 2009

Contents
- Introduction
- Speech production recap
- Phonetics recap
- Software resources

Software resources
- Hidden Markov Model Toolkit
- CMU Sphinx-II Speech Recognizer
- NIST Speech Recognition Scoring Utilities
- SRI Language Model Toolkit
- CMU / Cambridge Language Model Toolkit

Speech recognition
- Goal: convert an acoustic signal X into a word sequence W, independent of speaker and environment
- Implementation: several types of recognizers
  - Isolated word recognition: each word is surrounded by silence
  - Word spotting: detect a word in the presence of surrounding words
  - Connected-word recognition: word sequences constrained by a fixed grammar (e.g., telephone numbers)
  - Continuous speech recognition: fluent, uninterrupted speech

Components of a speech recognizer
- Acoustic model: knowledge of acoustics and phonetics
- Pronunciation dictionary: how words are formed from their constituent sounds
- Language model: what constitutes a word; what words are likely to occur and in what sequence
(A minimal sketch of these three knowledge sources is given below.)

Challenges in speech recognition
- Microphone and environment differences
- Speaker differences
- Word boundaries in continuous speech are unclear
- Continuous speech is less articulated
- Coarticulation and phonetic context cause variability within and across words
- Speaker-independent vs. speaker-dependent systems
- Inter- and intra-speaker variability: stress, emotion, speaking rate
- Environment variability: stationary/nonstationary noise, microphone vs. telephone vs. cell phone
- Context variability

Speech production
- Vocal tract cavities: pharyngeal and oral
- Articulators: components of the vocal tract that move to produce various speech sounds: velum, lips, tongue, teeth
- Source-filter representation of speech production: speech production is an acoustic filtering operation; the larynx and lungs provide the source excitation, while the vocal and nasal tracts act as a filter that shapes the spectrum of the signal
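To make these three knowledge sources concrete, here is a minimal Python sketch; the words, phone transcriptions and probabilities are illustrative toy values (not taken from the CMU dictionary or any trained model), and the acoustic scorer is only a placeholder.

from typing import Dict, List, Tuple

# Pronunciation dictionary (lexicon): word -> phone sequence (ARPAbet-style symbols).
LEXICON: Dict[str, List[str]] = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

# Bigram language model: P(word | previous word), toy probabilities.
BIGRAM_LM: Dict[Tuple[str, str], float] = {
    ("<s>", "speech"): 0.6,
    ("speech", "recognition"): 0.4,
}

def acoustic_log_score(phone: str, frame_features: List[float]) -> float:
    """Placeholder acoustic model: log P(features | phone).
    A real system would evaluate a GMM or neural network for this phone/state;
    here a dummy constant keeps the sketch self-contained."""
    return 0.0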

Source-filter model of speech production

Phonetics
- Phoneme: abstract unit that can be used for writing a language down in a systematic and unambiguous way
- Phoneme categories:
  - Vowels: air passes freely through the resonators
  - Consonants: air is partially or totally obstructed in one or more places as it passes through the resonators

Classification of speech sounds
- Voiced / voiceless: voiced if the vocal cords vibrate
- Nasal / oral: nasal if air travels through the nasal cavity and the oral cavity is closed
- Consonants / vowels: consonants involve obstruction of the air stream above the glottis (the glottis is the space between the vocal cords) and are characterized by place and manner of articulation and by voicing; vowels are characterized by lip position, tongue height and tongue advancement
- Lateral / non-lateral: non-lateral when the air stream passes through the middle of the oral cavity rather than along its sides
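As a small illustration of these binary classes, a lookup table could tag ARPAbet phones with voicing, nasality and vowel/consonant status; the feature assignments below follow standard phonetics, but the table itself is just a hypothetical sketch, not part of any recognizer.

# Hypothetical sketch: binary phonetic classes for a few ARPAbet phones.
PHONE_CLASSES = {
    "AA": {"voiced": True,  "nasal": False, "vowel": True},   # as in "odd"
    "IY": {"voiced": True,  "nasal": False, "vowel": True},   # as in "eat"
    "M":  {"voiced": True,  "nasal": True,  "vowel": False},  # nasal consonant
    "S":  {"voiced": False, "nasal": False, "vowel": False},  # voiceless fricative
    "Z":  {"voiced": True,  "nasal": False, "vowel": False},  # voiced fricative
    "L":  {"voiced": True,  "nasal": False, "vowel": False},  # lateral consonant
}

def is_voiced(phone: str) -> bool:
    """Look up whether a phone is produced with vocal cord vibration."""
    return PHONE_CLASSES[phone]["voiced"]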

Phonetic alphabets
- Describe the sounds of a language; each language has its own unique set of phonemes
- Words are represented by sequences of phonemes: a useful representation for speech recognition!
- IPA: the standard phonetic representation, used for most world languages; its character set is difficult to manipulate on a computer
- ARPAbet: an ASCII representation for English
- CMU Sphinx phonetic symbols: based on ARPAbet, more appropriate to our purpose

CMU dictionary
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict (a small parsing sketch is given below)

Vowels
[Figure: vowel chart with the symbols IY, IH, EH, AE, AX, UH, AA, UW, AO]

Coarticulation
[Figure: spectrograms of the sequences F AY, TH AY, S AY, SH AY]
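A minimal parsing sketch for a locally downloaded copy of the CMU pronouncing dictionary, assuming the common plain-text distribution format (';;;' comment lines, entries of the form WORD followed by ARPAbet phones with stress digits, and alternate pronunciations marked WORD(2), WORD(3), ...); the file name in the usage comment is hypothetical.

import re
from typing import Dict, List

def load_cmudict(path: str) -> Dict[str, List[List[str]]]:
    """Parse a cmudict-style file into {word: [pronunciations]}.
    Stress digits on vowels (e.g. IY1) are stripped to bare phone symbols."""
    lexicon: Dict[str, List[List[str]]] = {}
    # The distributed file is ASCII/Latin-1; adjust the encoding if needed.
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;;"):
                continue
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word).lower()       # drop variant marker
            phones = [re.sub(r"\d$", "", p) for p in phones]   # drop stress digits
            lexicon.setdefault(word, []).append(phones)
    return lexicon

# Usage (hypothetical local file name):
# lexicon = load_cmudict("cmudict-0.7b")
# print(lexicon["speech"])   # e.g. [['S', 'P', 'IY', 'CH']]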

Given a sequence of observations (evidence) from an audio signal, O = o1 o2 ... oT, determine the underlying word sequence W = w1 w2 ... wm. The number of words m is unknown and the observation sequence is variable in length. The goal is to minimize the classification error rate.

Solution: maximize the posterior probability

    W* = argmax_W P(W | O)

This requires optimization over all possible word strings. By Bayes' rule, P(W | O) = P(O | W) P(W) / P(O), and P(O) does not impact the optimization, therefore:

    W* = argmax_W P(O | W) P(W)

Assuming that words can be represented by a sequence of states S (words are composed of phonemes, and phonemes are represented by states), the optimization problem becomes:

    W* = argmax_W Σ_S P(O | S) P(S | W) P(W)

where

    O          observation (feature) sequence
    P(O | S)   acoustic model
    P(S | W)   pronunciation model (lexicon)
    P(W)       language model
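As a toy numerical illustration of this decoding rule, the sketch below picks the word sequence with the highest combined log score; all scores are invented for the example (a real recognizer evaluates HMM state sequences over a huge hypothesis space).

import math

# Hypothetical log scores for two candidate transcriptions of the same audio.
# In log space the product P(O|S) * P(S|W) * P(W) becomes a sum.
candidates = {
    ("recognize", "speech"):         {"acoustic": -120.0, "pronunciation": -2.3, "lm": -4.1},
    ("wreck", "a", "nice", "beach"): {"acoustic": -118.5, "pronunciation": -3.0, "lm": -9.7},
}

def total_log_score(scores):
    """log P(O|S) + log P(S|W) + log P(W) for one hypothesis."""
    return scores["acoustic"] + scores["pronunciation"] + scores["lm"]

best = max(candidates, key=lambda w: total_log_score(candidates[w]))
print(best)  # ('recognize', 'speech'): the language model outweighs the
             # slightly better acoustic score of the competing hypothesis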

Optimization
- Searches for the most likely word sequence given the observations (features); we cannot evaluate all possible word sequences, so we need:
  - a representation for modeling the states: hidden Markov models (HMMs)
  - a method for approximately searching the best sequence given the evidence: the Viterbi algorithm (sketched below)
  - ways to make it fast

Hidden Markov Model representation of speech
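As an illustration of how such a search works, here is a minimal Viterbi sketch over a toy discrete-observation HMM; the state names, observation symbols and probabilities are invented for the example (real acoustic models use continuous densities over feature vectors, and decoding runs over word/phone networks rather than two states).

import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence and its log probability
    for a discrete-observation HMM (toy example, not a full decoder)."""
    # delta[t][s]: best log probability of any path ending in state s at time t
    delta = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    backptr = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        backptr.append({})
        for s in states:
            best_prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans_p[p][s]))
            delta[t][s] = (delta[t - 1][best_prev] + math.log(trans_p[best_prev][s])
                           + math.log(emit_p[s][obs[t]]))
            backptr[t][s] = best_prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.insert(0, backptr[t][path[0]])
    return path, delta[-1][last]

# Toy HMM: two phone-like states emitting quantized "low"/"high" energy symbols.
states = ["sil", "vowel"]
start_p = {"sil": 0.8, "vowel": 0.2}
trans_p = {"sil": {"sil": 0.6, "vowel": 0.4}, "vowel": {"sil": 0.3, "vowel": 0.7}}
emit_p = {"sil": {"low": 0.9, "high": 0.1}, "vowel": {"low": 0.2, "high": 0.8}}
print(viterbi(["low", "high", "high", "low"], states, start_p, trans_p, emit_p))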