Speech and Language Processing Chapter 9 of SLP Automatic Speech Recognition (I)

Outline for ASR
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system:
  1) Language Model
  2) Lexicon/Pronunciation Model (HMM)
  3) Feature Extraction
  4) Acoustic Model
  5) Decoder
- Training
- Evaluation

Speech Recognition
Applications of Speech Recognition (ASR):
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker identification
- Language identification
- Second language ('L2') learning (accent reduction)
- Audio archive searching

LVCSR
Large Vocabulary Continuous Speech Recognition:
- ~20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)

Current error rates
Ballpark numbers; exact numbers depend very much on the specific corpus.

Task                       Vocabulary   Error rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64,000+      10
Conversational telephone   64,000+      20

HSR versus ASR

Task                Vocab   ASR    Human SR
Continuous digits   11      0.5    0.009
WSJ 1995 clean      5K      3      0.9
WSJ 1995 w/noise    5K      9      1.1
SWBD 2004           65K     20     4

Conclusions:
- Machines are about 5 times worse than humans
- The gap increases with noisy speech
- These numbers are rough; take them with a grain of salt

Why is conversational speech harder?
- A piece of an utterance without context
- The same utterance with more context

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: supervised machine learning + search

Speech Recognition Architecture

The Noisy Channel Model
- Search through the space of all possible sentences
- Pick the one that is most probable given the waveform

The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t
- Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n

Noisy Channel Model (III)
Probabilistic implication: pick the sentence W with the highest probability:

    Ŵ = argmax_{W ∈ L} P(W | O)

We can use Bayes' rule to rewrite this:

    Ŵ = argmax_{W ∈ L} P(O | W) P(W) / P(O)

Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

    Ŵ = argmax_{W ∈ L} P(O | W) P(W)

Noisy Channel Model
likelihood: P(O | W); prior: P(W)

    Ŵ = argmax_{W ∈ L} P(O | W) P(W)

The Noisy Channel Model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal | Source)
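To make the argmax concrete, here is a minimal sketch (not from the slides) of how a rescoring step might combine the two factors in log space; the candidate list and the scoring functions acoustic_log_likelihood and language_model_log_prob are hypothetical placeholders standing in for the acoustic and language models.

```python
import math

def pick_best_sentence(observations, candidate_sentences,
                       acoustic_log_likelihood, language_model_log_prob):
    """Noisy-channel decision rule: argmax_W P(O|W) P(W), computed in log space.
    acoustic_log_likelihood(O, W) plays the role of log P(O|W);
    language_model_log_prob(W) plays the role of log P(W)."""
    best_sentence, best_score = None, -math.inf
    for W in candidate_sentences:
        score = acoustic_log_likelihood(observations, W) + language_model_log_prob(W)
        if score > best_score:
            best_sentence, best_score = W, score
    return best_sentence
```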

Speech Architecture meets Noisy Channel

Architecture: Five Easy Pieces (only 3-4 for today)
- HMMs, lexicons, and pronunciation
- Feature extraction
- Acoustic modeling
- Decoding
- Language modeling (seen this already)

Lexicon
- A list of words, each with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
- CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM

HMMs for speech: the word "six"

Phones are not homogeneous!
(Spectrogram of the phones [ay] and [k]; frequency axis 0-5000 Hz, time axis in seconds.)

Each phone has 3 subphones

Resulting HMM word model for "six" with its subphones

HMM for the digit recognition task

Detecting Phones
Two stages:
- Feature extraction: basically a slice of a spectrogram
- Phone classification: using a GMM classifier

Discrete Representation of Signal
- Represent the continuous signal in discrete form.
(Thanks to Bryan Pellom for this slide)

Digitizing the Signal (A-to-D)
Sampling: measuring the amplitude of the signal at time t
- 16,000 Hz (samples/sec) for microphone ("wideband") speech
- 8,000 Hz (samples/sec) for telephone speech
Why these rates?
- Need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate
- Human speech is < 10,000 Hz, so we need at most 20 kHz
- Telephone speech is filtered at 4 kHz, so 8 kHz is enough
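As a quick illustration of the "half the sampling rate" point, the sketch below (my own example, not part of the slides) samples a 3 kHz tone at 16 kHz and at 4 kHz; only the first rate represents the tone faithfully, since 3 kHz exceeds the 2 kHz Nyquist limit of the second.

```python
import numpy as np

def dominant_frequency(tone_hz=3000, sample_rate=16000, duration=0.1):
    """Sample a pure tone and report the strongest frequency in its spectrum."""
    t = np.arange(0, duration, 1.0 / sample_rate)
    x = np.sin(2 * np.pi * tone_hz * t)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

print(dominant_frequency(3000, 16000))  # ~3000 Hz: below the 8 kHz Nyquist limit
print(dominant_frequency(3000, 4000))   # ~1000 Hz: aliased, since 3 kHz > 2 kHz Nyquist limit
```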

Digitizing Speech (II)
Quantization: representing the real value of each amplitude as an integer
- 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats:
- 16-bit PCM
- 8-bit mu-law (log compression)
Byte order: LSB (Intel) vs. MSB (Sun, Apple)
Headers:
- Raw (no header)
- Microsoft wav
- Sun .au (40-byte header)
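For the 8-bit mu-law case, a hedged sketch of the standard mu-law companding curve (mu = 255, as used in telephony); this illustrates the log-compression idea only, not any particular file format's exact byte layout.

```python
import numpy as np

def mulaw_compress(x, mu=255):
    """Map samples in [-1, 1] through the mu-law curve, then quantize to 8 bits.
    Log compression gives small amplitudes finer resolution than large ones."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return np.round((y + 1) / 2 * 255).astype(np.uint8)        # 8-bit code

def linear16_to_float(samples_int16):
    """Scale 16-bit PCM integers (-32768..32767) to floats in [-1, 1]."""
    return samples_int16.astype(np.float32) / 32768.0
```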

Discrete Representation of Signals
- Byte swapping: little-endian vs. big-endian
- Some audio formats have headers; headers contain meta-information such as sampling rate and recording conditions (examples: Microsoft wav, NIST sphere)
- A "raw" file refers to 'no header'
- Nice sound manipulation tool: sox
  - change sampling rate
  - convert speech formats

MFCC: Mel-Frequency Cepstral Coefficients

Pre-Emphasis
Pre-emphasis: boosting the energy in the high frequencies
Q: Why do this?
A: The spectrum for voiced segments has more energy at lower frequencies than at higher frequencies. This is called spectral tilt.
- Spectral tilt is caused by the nature of the glottal pulse
- Boosting the high-frequency energy gives more information to the acoustic model
- Improves phone recognition performance
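A minimal pre-emphasis sketch: a first-order high-pass filter y[n] = x[n] - a*x[n-1], with a = 0.97 as on the "Typical MFCC features" slide later; the exact coefficient and boundary handling vary by toolkit.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```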

Example of Pre-emphasis
- Spectral slice from the vowel [aa], before and after pre-emphasis

MFCC process: windowing

Windowing
Why divide the speech signal into successive overlapping frames?
- Speech is not a stationary signal; we want information about a region small enough that the spectral information is a useful cue.
Frames:
- Frame size: typically 10-25 ms
- Frame shift: the length of time between successive frames, typically 5-10 ms

MFCC process: windowing

Common window shapes
- Rectangular window
- Hamming window
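Putting the previous two slides together, a small sketch that slices a signal into overlapping frames and applies a Hamming window to each; the 25 ms / 10 ms sizes match the typical values quoted later, and np.hamming supplies the window shape.

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping frames and apply a Hamming window to each."""
    frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples at 16 kHz
    frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples at 16 kHz
    n_frames = max(1 + (len(signal) - frame_len) // frame_shift, 0)
    window = np.hamming(frame_len)                    # 0.54 - 0.46*cos(2*pi*n/(N-1))
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * frame_shift
        frames[i] = signal[start:start + frame_len] * window
    return frames
```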

Discrete Fourier Transform (DFT)
- Input: the windowed signal x[n] ... x[m]
- Output: for each of N discrete frequency bands, a complex number X[k] representing the magnitude and phase of that frequency component in the original signal
- Standard algorithm for computing the DFT: the Fast Fourier Transform (FFT), with complexity N*log(N)
- In general, choose N = 512 or 1024
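Following the slide, a sketch that computes an N = 512 FFT on each windowed frame and keeps the power spectrum (squared magnitude); np.fft.rfft returns only the N/2 + 1 non-redundant complex bins for real input.

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    """Squared-magnitude spectrum of each windowed frame, shape (n_frames, n_fft//2 + 1)."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)   # complex X[k]: magnitude and phase
    return np.abs(spectrum) ** 2                      # phase is discarded (not useful for speech)
```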

Discrete Fourier Transform: computing a spectrum
- A 25 ms Hamming-windowed signal from [iy], and its spectrum as computed by the DFT (plus other smoothing)

Mel-scale
- Human hearing is not equally sensitive to all frequency bands
- Less sensitive at higher frequencies, roughly > 1000 Hz
- I.e., human perception of frequency is non-linear

Mel-scale
- A mel is a unit of pitch
- Pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels
- The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz
- Definition: see the sketch below
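The formula itself appeared as an image on the original slide; one commonly used form (equivalent, up to log base, to the natural-log version 1127 ln(1 + f/700)) is mel(f) = 2595 log10(1 + f/700), sketched here:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: approximately linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, useful for placing filter-bank center frequencies."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```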

Mel Filter Bank Processing
- Mel filter bank: filters uniformly spaced below 1 kHz, logarithmically spaced above 1 kHz

Log Energy Computation
- Take the log of the square magnitude of the output of the mel filter bank
Why log?
- The logarithm compresses the dynamic range of values
- Human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes
- Makes frequency estimates less sensitive to slight variations in input (e.g., power variation due to the speaker's mouth moving closer to the microphone)
Why square?
- Phase information is not helpful in speech
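Combining the filter-bank and log steps, a rough sketch under common-practice assumptions (26 triangular filters, centers equally spaced on the mel scale, which approximates the linear-below-1-kHz / log-above-1-kHz layout the slide describes); it is not a specific toolkit's recipe.

```python
import numpy as np

def mel_filterbank_log_energies(power_spec, sr=16000, n_fft=512, n_filters=26):
    """Apply triangular mel filters to each frame's power spectrum and take logs."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Filter edge/center frequencies, evenly spaced on the mel scale
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                       # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                      # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)

    filter_energies = power_spec @ fbank.T                  # (n_frames, n_filters)
    return np.log(filter_energies + 1e-10)                  # log compresses dynamic range
```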

The Cepstrum
One way to think about this: separating the source and the filter
- The speech waveform is created by a glottal source waveform that passes through a vocal tract which, because of its shape, has a particular filtering characteristic
Articulatory facts:
- The vocal cord vibrations create harmonics
- The mouth is an amplifier
- Depending on the shape of the oral cavity, some harmonics are amplified more than others

Vocal Fold Vibration (UCLA Phonetics Lab demo)

George Miller figure

We care about the filter, not the source
- Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
- What we care about is the filter: the exact position of the articulators in the oral tract
- So we want a way to separate these, and use only the filter function

The Cepstrum
- The spectrum of the log of the spectrum
- (Figures: spectrum, log spectrum, spectrum of the log spectrum)

Thinking about the Cepstrum
(Pictures from John Coleman, 2005)

Mel Frequency Cepstrum
- The cepstrum requires Fourier analysis
- But we're going from frequency space back to time, so we actually apply the inverse DFT
- Detail for signal processing gurus: since the log power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
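A sketch of this final transform, assuming scipy is available: a type-II DCT over the log filter-bank energies, keeping the first 12 coefficients (the 0th is dropped here because a separate energy term is added later, as on the "Typical MFCC features" slide).

```python
import numpy as np
from scipy.fftpack import dct

def log_energies_to_mfcc(log_mel_energies, n_ceps=12):
    """Type-II DCT over the filter-bank axis; keep coefficients 1..n_ceps.
    Because the log power spectrum is real and symmetric, this DCT stands in
    for the inverse DFT mentioned on the slide."""
    cepstra = dct(log_mel_energies, type=2, axis=1, norm='ortho')
    return cepstra[:, 1:n_ceps + 1]
```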

Another advantage of the Cepstrum
- The DCT produces highly uncorrelated features
- We'll see when we get to acoustic modeling that these are much easier to model than the spectrum: they can be modelled simply by linear combinations of Gaussian density functions with diagonal covariance matrices
- In general we'll just use the first 12 cepstral coefficients (we don't want the later ones, which contain e.g. the F0 spike)

Dynamic Cepstral Coefficients
- The cepstral coefficients do not capture energy, so we add an energy feature
- Also, the speech signal is not constant (slope of formants, change from stop burst to release), so we want to add the changes in features (the slopes). We call these delta features
- We also add double-delta (acceleration) features

Typical MFCC features
- Window size: 25 ms
- Window shift: 10 ms
- Pre-emphasis coefficient: 0.97
- Features per frame:
  12 MFCC (mel frequency cepstral coefficients)
  1 energy feature
  12 delta MFCC features
  12 double-delta MFCC features
  1 delta energy feature
  1 double-delta energy feature
- Total: 39-dimensional feature vector
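To assemble the 39-dimensional vector, a minimal sketch: a simple first-difference delta (real systems usually fit a regression over a few neighboring frames) applied to the 13 static features (12 cepstra + 1 log energy), then the same again for double-deltas.

```python
import numpy as np

def simple_delta(features):
    """First difference across frames, as a stand-in for the usual regression deltas."""
    padded = np.vstack([features[:1], features])         # repeat the first frame
    return padded[1:] - padded[:-1]

def build_39d_features(mfcc_12, log_energy):
    """Stack 12 MFCC + energy with their deltas and double-deltas: 39 dims per frame."""
    static = np.hstack([mfcc_12, log_energy[:, None]])    # (n_frames, 13)
    delta = simple_delta(static)                          # (n_frames, 13)
    delta2 = simple_delta(delta)                          # (n_frames, 13)
    return np.hstack([static, delta, delta2])             # (n_frames, 39)
```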

Why is MFCC so popular?
- Efficient to compute
- Incorporates a perceptual mel frequency scale
- Separates the source and the filter
- The IDFT (DCT) decorrelates the features, which improves the diagonal-covariance assumption in HMM modeling
- Alternative: PLP

Next Time: Acoustic Modeling (= Phone Detection)
- Given a 39-dimensional vector corresponding to the observation of one frame, o_i, and given a phone q we want to detect, compute p(o_i | q)
- Most popular method: GMMs (Gaussian mixture models)
- Other methods: neural nets, CRFs, SVMs, etc.
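As a preview of p(o_i | q), a hedged sketch of the log-likelihood of a single 39-dimensional frame under a diagonal-covariance Gaussian mixture; the weights, means, and variances here would come from training, which the next lecture covers.

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log p(o | q) under a diagonal-covariance GMM.
    o: (39,) frame; weights: (M,); means, variances: (M, 39)."""
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)    # per-component normalizer
    log_exp = -0.5 * np.sum((o - means) ** 2 / variances, axis=1)      # per-component exponent
    component_logs = np.log(weights) + log_norm + log_exp
    m = np.max(component_logs)                                         # log-sum-exp for stability
    return m + np.log(np.sum(np.exp(component_logs - m)))
```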

Summary
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system:
  1) Language Model
  2) Lexicon/Pronunciation Model (HMM)
  3) Feature Extraction
  4) Acoustic Model
  5) Decoder
- Training
- Evaluation