CS 545 Lecture XI: Speech (some slides courtesy Jurafsky&Martin)

Benjamin Snyder

Announcements
- Office hours change for today and next week: 1pm - 1:45pm, or by appointment -- but please schedule ahead
- HW 4/5 will be out soon

Speech in a Slide
- Frequency gives pitch; amplitude gives volume
  [Figure: waveform of the phrase "speech lab", amplitude over time, labeled with its phones: s p ee ch l a b]
- Frequencies at each time slice are processed into observation vectors
  [Figure: spectrogram (frequency over time) converted into a sequence of observation vectors a12 a13 a12 a14 a14 ...]

The Noisy-Channel Model
- Language model: P(w)
- Acoustic model: P(o|w)

ASR System Components
[Diagram: a source emits w with probability P(w) (the language model); the channel transforms w into the observed o with probability P(o|w) (the acoustic model); the decoder recovers the best w from o]
- The decoder solves (see the sketch below):

  \hat{w} = \arg\max_w P(w \mid o) = \arg\max_w P(o \mid w)\,P(w)

Phoneme Inventories
- Phoneme: a sound used as a building block in words
- Some phonemes occur in most languages (b, p, m, n, s)
- But substantial variation occurs in the size and scope of phoneme inventories across languages
- Consonants are characterized by:
  1. place of articulation
  2. manner of articulation
  3. voicing
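Aside: a minimal Python sketch of the decoder objective above. The `acoustic_logprob` and `lm_logprob` functions are hypothetical stand-ins for the two models, and a real decoder searches the candidate space with dynamic programming rather than enumerating it.

```python
def decode(o, candidates, acoustic_logprob, lm_logprob):
    """Return argmax_w [log P(o|w) + log P(w)] over candidate word sequences.

    o: the observation sequence; candidates: an iterable of word sequences;
    acoustic_logprob(o, w) and lm_logprob(w): toy scoring functions standing
    in for the acoustic model and language model."""
    return max(candidates,
               key=lambda w: acoustic_logprob(o, w) + lm_logprob(w))
```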

Vowels

Phonotactics
- Languages exhibit phonotactics: some phoneme sequences are favored, others are forbidden
- Phonotactics are largely language-specific...
- But often shared within language families
- And some sound sequences are anatomically difficult for everyone: kgvrsatr

Speech Recognition

Applications of Speech Recognition (ASR)
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker identification
- Language identification
- Second language ('L2') (accent reduction)
- Audio archive searching

LVCSR
- Large Vocabulary Continuous Speech Recognition
- ~20,000-64,000 words
- Speaker independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)

Current error rates
Ballpark numbers; exact numbers depend very much on the specific corpus.

  Task                       Vocabulary   Error rate (%)
  Digits                     11           0.5
  WSJ read speech            5K           3
  WSJ read speech            20K          3
  Broadcast news             64,000+      10
  Conversational telephone   64,000+      20

HSR versus ASR

  Task               Vocab   ASR   Human SR
  Continuous digits  11      .5    .009
  WSJ 1995 clean     5K      3     0.9
  WSJ 1995 w/noise   5K      9     1.1
  SWBD 2004          65K     20    4

Conclusions:
- Machines are about 5 times worse than humans
- The gap increases with noisy speech
- These numbers are rough; take them with a grain of salt

Issues
- Pronunciation: error is 3-4 times higher for native Spanish and Japanese speakers
- Car noise: error is 2-4 times higher
- Multiple speakers

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search

Speech Recognition Architecture

Architecture: Five easy pieces (only 3-4 for today)
- HMMs, lexicons, and pronunciation
- Feature extraction
- Acoustic modeling
- Decoding
- Language modeling (seen this already)

Noisy Channel Part I: Words to Phonemes (transitions in HMM)

Lexicon
- A list of words, each one with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
- CMU dictionary: 127K words
  http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM (see the sketch below)

HMMs for speech: the word "six"

Phones are not homogeneous!
[Figure: spectrogram (0-5000 Hz) of the phones "ay" and "k" between roughly 0.48 s and 0.94 s, showing that the signal varies within a single phone]
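A minimal sketch of the lexicon idea: a mapping from words to phone sequences, which then serve as the state backbone of each word's HMM. The entries below are illustrative (written in CMUdict-style phone symbols), not pulled from the actual dictionary file.

```python
# Toy pronunciation lexicon: word -> phone sequence.
lexicon = {
    "six":    ["s", "ih", "k", "s"],
    "speech": ["s", "p", "iy", "ch"],
    "lab":    ["l", "ae", "b"],
}

def word_to_states(word):
    """Expand a word into the linear chain of HMM phone states."""
    return lexicon[word]

# word_to_states("six") -> ["s", "ih", "k", "s"]
```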

Each phone has 3 subphones
[Figure: a phone HMM with beginning, middle, and end subphone states]

Resulting HMM word model for "six" with their subphones
[Figure: the word HMM for "six", each phone expanded into its three subphone states]

Noisy Channel Part II: Phonemes to Sounds (emissions in HMM)

[Figure: George Miller figure on human acoustic perception]
And also, human acoustic perception...

We care about the filter, not the source
- Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
- What we care about is the filter: the exact position of the articulators in the oral tract
- So we want a way to separate these, and use only the filter function

Mel-scale
- Human hearing is not equally sensitive to all frequency bands
- Less sensitive at higher frequencies, roughly > 1000 Hz
- I.e., human perception of frequency is non-linear
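The slides don't give the mapping explicitly, but the standard mel-scale formula is m = 2595 · log10(1 + f/700): roughly linear below ~1000 Hz and logarithmic above, matching the non-linear sensitivity just described.

```python
import math

def hz_to_mel(f_hz):
    """Convert frequency in Hz to mels (standard O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# hz_to_mel(1000) ~ 1000.0, but hz_to_mel(8000) ~ 2840: equal steps in Hz
# correspond to smaller and smaller perceptual steps at high frequency.
```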

MFCC: Mel-Frequency Cepstral Coefficients

Final Feature Vector
- 39 features per 10 ms frame (assembled as in the sketch below):
  - 12 MFCC features
  - 12 delta MFCC features
  - 12 delta-delta MFCC features
  - 1 (log) frame energy
  - 1 delta (log) frame energy
  - 1 delta-delta (log) frame energy
- So each frame is represented by a 39D vector

Acoustic Modeling (= Phone detection)
- Given a 39-dimensional vector corresponding to the observation of one frame, o_i
- And given a phone q we want to detect
- Compute p(o_i|q)
- Most popular method: GMM (Gaussian mixture models)
- Other methods: neural nets, CRFs, SVMs, etc.
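A minimal numpy sketch of how the 39-dimensional vector is stacked, assuming `mfcc` and `log_energy` come from an earlier feature-extraction step. Real systems compute deltas with a regression over a window of frames; simple frame differences are used here to keep the sketch short.

```python
import numpy as np

def deltas(x):
    """First-order frame-to-frame differences (a stand-in for the usual
    regression-based delta features). x: (T, D) array."""
    d = np.zeros_like(x)
    d[1:] = x[1:] - x[:-1]
    return d

def frame_features(mfcc, log_energy):
    """Stack 12 MFCCs + log energy with deltas and delta-deltas into
    the 39-dimensional per-frame vector.

    mfcc: (T, 12) cepstral coefficients; log_energy: (T, 1) log energies."""
    base = np.hstack([mfcc, log_energy])   # (T, 13) static features
    d1 = deltas(base)                      # (T, 13) deltas
    d2 = deltas(d1)                        # (T, 13) delta-deltas
    return np.hstack([base, d1, d2])       # (T, 39)
```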

Gaussian Mixture Models
- Also called fully-continuous HMMs
- P(o|q) computed by a Gaussian:

  p(o \mid q) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(o-\mu)^2}{2\sigma^2}\right)

Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance
[Figure: Gaussian curves with different means; P(o|q) is highest at the mean and low for o far from the mean]

Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- And imagine that we were computing just one single spectral value (a real-valued number) as our acoustic observation
- We could just compute the mean and variance from the data:

  \mu_i = \frac{1}{T} \sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is phone } i

  \sigma_i^2 = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is phone } i
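A minimal sketch of these maximum-likelihood estimates: the sample mean and variance of the frames labeled with each phone (toy data shapes; `frames` holds one spectral value per frame).

```python
import numpy as np

def train_gaussians(frames, labels):
    """Fit one (mean, variance) pair per phone, exactly as in the
    formulas above.

    frames: (T,) array of single spectral values
    labels: length-T sequence of phone labels, one per frame"""
    labels = np.asarray(labels)
    params = {}
    for phone in set(labels):
        obs = frames[labels == phone]
        params[phone] = (obs.mean(), obs.var())
    return params
```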

But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians, one variance per dimension (see the sketch below):

  p(\vec{o} \mid q) = \frac{1}{(2\pi)^{D/2} \prod_{d=1}^{D} \sigma[d]} \exp\!\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(o[d]-\mu[d])^2}{\sigma^2[d]}\right)

Gaussian Intuitions: Size of Σ
[Figure: three 2D Gaussians, all with µ = [0 0], with Σ = I, Σ = 0.6I, and Σ = 2I. Text and figures from Andrew Ng's CS229 lecture notes.]
- As Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, more compressed

Actually, a mixture of Gaussians
[Figure: densities for Phone A and Phone B, each a sum of several Gaussians]
- Each phone is modeled by a sum of different Gaussians
- Hence able to model complex facts about the data
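A minimal numpy sketch of this diagonal-covariance density, computed in log space for numerical stability (the product and exponent above become sums of logs):

```python
import numpy as np

def diag_gaussian_logpdf(o, mu, var):
    """log p(o|q) for the diagonal-covariance Gaussian above.
    o, mu, var: (D,) arrays; var holds one variance per dimension."""
    D = o.shape[0]
    return (-0.5 * D * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum((o - mu) ** 2 / var))
```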

Gaussians for acoustic modeling: summary
- Each phone is represented by a GMM parameterized by:
  - M mixture weights
  - M mean vectors
  - M covariance matrices
- Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature (see the sketch below)

HMMs for speech
[Figure: HMM for a digit recognition task]
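A minimal sketch of the resulting GMM likelihood, p(o|q) = sum_m w_m N(o; mu_m, var_m), with the M components combined stably in log space via scipy's logsumexp:

```python
import numpy as np
from scipy.special import logsumexp

def gmm_logpdf(o, weights, means, variances):
    """log p(o|q) for a diagonal-covariance GMM with M components.
    o: (D,); weights: (M,); means, variances: (M, D) arrays."""
    D = o.shape[0]
    # Per-component diagonal Gaussian log-densities (vectorized over M).
    comp = (-0.5 * D * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(variances), axis=1)
            - 0.5 * np.sum((o - means) ** 2 / variances, axis=1))
    # log sum_m exp(log w_m + log N_m): the mixture, in log space.
    return logsumexp(np.log(weights) + comp)
```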

Training and Decoding
- Training
  - Would be easy if the phones were observed (maximum likelihood)
  - But they are not... and neither are the mixture weights
  - Use the EM algorithm (Expectation-Maximization)
- Decoding
  - Basic idea: the Viterbi algorithm from last time
  - But many little details...

Summary: ASR Architecture
Five easy pieces (the ASR noisy-channel architecture):
1) Feature extraction: 39 MFCC features
2) Acoustic model: Gaussians for computing p(o|q)
3) Lexicon/pronunciation model HMM: what phones can follow each other
4) Language model: N-grams for computing p(w_i | w_{i-1})
5) Decoder: the Viterbi algorithm, dynamic programming for combining all of these to get the word sequence from the speech (see the sketch below)!
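To close, a minimal log-space Viterbi sketch over a generic HMM. The initial, transition, and emission scores are toy stand-ins; a real LVCSR decoder composes the lexicon, subphone HMMs, and language model, and adds beam pruning and the other "little details" above.

```python
def viterbi(observations, states, log_init, log_trans, log_emit):
    """Most likely state sequence for an observation sequence.

    log_init[s]:      log P(s at t=0)
    log_trans[s][s2]: log P(s2 | s)
    log_emit(s, o):   log p(o | s)  (e.g. a GMM log-density per state)"""
    V = [{s: log_init[s] + log_emit(s, observations[0]) for s in states}]
    backptr = []
    for o in observations[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            col[s] = V[-1][prev] + log_trans[prev][s] + log_emit(s, o)
            ptr[s] = prev
        V.append(col)
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```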