L12: Template matching

Outline: Introduction to ASR; Pattern matching; Dynamic time warping; Refinements to DTW

This lecture is based on [Holmes, 2001, ch. 8].

Introduction to Speech Processing. Ricardo Gutierrez-Osuna, CSE@TAMU.

Introduction to ASR

What is automatic speech recognition? The goal of ASR is to convert a speech signal, accurately and efficiently, into a text transcription of the spoken words. This process should be independent of:
- the device used to record the speech (i.e., the microphone)
- the speaker's characteristics (i.e., age, gender, accent)
- the acoustic environment (i.e., quiet office vs. noisy room, outdoors)

The ultimate goal, which has yet to be achieved, is to perform as well as a human listener would.

History of ASR

Rule-based methods. Early work, starting in the 1950s, focused on developing rules based on (1) acoustic-phonetic knowledge (e.g., formants) or (2) ad-hoc measurements of properties of the speech signal. These systems could recognize digits and isolated words, but performed poorly because of (1) coarticulation effects and (2) the inflexibility of rule-based hard decisions.

Template matching. During the 1960s and 1970s, work on ASR benefited from developments in pattern-matching techniques, particularly dynamic time warping. These systems worked reasonably well at isolated-word recognition, but did not properly use information about variability in the speech signal.

Statistical modeling. During the 1970s and 1980s, research on ASR moved towards statistical methods for acoustic and language modeling. These methods (e.g., hidden Markov models, n-grams) have now been almost universally adopted for ASR.

Milestones in speech and multimodal technology research [Juang and Rabiner, 2004]

Overall architecture of a modern ASR system [Rabiner and Schafer, 2007]

Lecture plan

In this lecture, we review early work on pattern matching and the use of dynamic time warping (DTW). Though DTW has been largely superseded, the algorithm still finds uses in other areas of speech research.
- The second lecture on ASR will focus on hidden Markov models (HMMs): their basic structure and learning algorithms
- The third lecture will cover refinements for HMMs, including issues of robustness and speaker independence
- The fourth lecture will discuss large-vocabulary ASR, including acoustic and language modeling, and decoding
- The fifth lecture will introduce HTK, the most widely used software for research and development in ASR

Basis for the approach

Pattern matching. As we have previously seen, the relationship between acoustic patterns and their linguistic content is quite complex, partly due to coarticulation. However, if the same person repeats the same isolated word on separate occasions, the pattern is likely to be similar, particularly when looking at spectrograms. This suggests one potential approach for ASR:
- Store examples of acoustic patterns (call them templates) for all the words to be recognized
- For an incoming word, compare it with each of the stored patterns and assign it to the closest match

Distance metrics

Assume for now that the words are spoken in isolation at exactly the same speed, and that the start/end points can be easily detected (we will soon see how to relax these unrealistic assumptions). In this case, the distance between two words may be computed as follows:
- Divide each word into short-time frames (e.g., 10-20 ms)
- Compute a feature vector for each frame (e.g., from the DFT)
- Calculate the Euclidean distance for each pair of frames
- Sum up the distances across frames
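These four steps can be sketched in a few lines, assuming the frames have already been converted to feature vectors (Python rather than the course's MATLAB; the array names and toy values are illustrative, not from the lecture):

```python
import numpy as np

def word_distance(word_a, word_b):
    """Distance between two equal-length words.

    Each word is a (frames x features) array; the distance is the sum of
    frame-by-frame Euclidean distances, as described above.
    """
    assert word_a.shape == word_b.shape, "words must have the same number of frames"
    return float(np.sum(np.linalg.norm(word_a - word_b, axis=1)))

# Toy example: three 3-frame "words" with 2-dimensional feature vectors.
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # identical to a
c = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 1.0]])  # differs in frame 1
```

Here `word_distance(a, b)` is 0 and `word_distance(a, c)` is 5 (the Euclidean distance between the first frames); the incoming word would be assigned to whichever template gives the smallest total.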

What makes good feature vectors for ASR?

With the exception of tonal languages (e.g., Mandarin), pitch generally does not carry much phonetic information. Thus, the feature vector should ignore harmonic structure and instead focus on the spectral envelope of the signal. A reasonable approach is to perform a filter-bank analysis following an auditory frequency scale (e.g., Mel, critical bands), and then compute the log-power at each channel. The log-power ensures that weaker formants (which may be of linguistic significance) are properly weighted in the distance measure. We may also subtract the average log-power of each word to cancel differences in vocal effort or distance to the microphone.
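A minimal sketch of such a filter-bank front end, in Python rather than the MATLAB used elsewhere in the course (the sampling rate, FFT size, filter count, and function names are all illustrative assumptions, not the lecture's implementation):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def logpower_filterbank(frame, sr=8000, n_filters=10, n_fft=256):
    """Log-power in mel-spaced triangular filters for one frame."""
    spec = np.abs(np.fft.rfft(frame, n_fft)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)             # bin center frequencies
    # Filter edges equally spaced on the mel scale from 0 Hz to Nyquist
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    feats = np.empty(n_filters)
    for k in range(n_filters):
        lo, mid, hi = edges[k], edges[k + 1], edges[k + 2]
        up = np.clip((freqs - lo) / (mid - lo), 0.0, 1.0)    # rising slope
        down = np.clip((hi - freqs) / (hi - mid), 0.0, 1.0)  # falling slope
        weights = np.minimum(up, down)                        # triangular filter
        feats[k] = np.log(np.sum(weights * spec) + 1e-10)     # log-power per channel
    # Subtracting the mean log-power cancels overall level
    # (vocal effort, distance to the microphone)
    return feats - feats.mean()

frame = np.sin(2 * np.pi * 440 * np.arange(160) / 8000.0)  # 20 ms of a 440 Hz tone
feats = logpower_filterbank(frame)
```

The 10-filter default mirrors the 10-channel analysis shown on the next slide; a real front end would also apply a window and frame overlap.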

10-channel filter-bank analysis of the words "three", "eight", and "eight" [Holmes, 2001, ch. 8]

Euclidean distance between words: similar vs. different pairs [Holmes, 2001, ch. 8]

End-point detection

Our earlier approach assumes that the start and end points of each word are known or can be easily found. One may detect end-points by means of a simple level threshold. This approach, however, may fail when words start or end with weak sounds (e.g., [f]). Words may also have periods of silence within them (e.g., the stop closure in "containing"). Other problems include speaker artifacts (lip clicks, exhalation, etc.). Improvements in end-point detection can be achieved by accounting for the spectral properties of the background noise. However, despite its apparent simplicity, end-point detection is often unreliable.
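The naive level-threshold detector can be sketched as follows (a Python sketch; the frame length and threshold are illustrative, and, as noted above, this scheme fails on weak word-initial or word-final sounds):

```python
import numpy as np

def find_endpoints(signal, frame_len=160, threshold=0.01):
    """Level-threshold end-point detector (illustrative sketch).

    Returns (start, end) indices of the first/last frame whose mean
    energy exceeds the threshold, or None if no frame exceeds it.
    """
    n_frames = len(signal) // frame_len
    energy = np.array([np.mean(signal[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    active = np.where(energy > threshold)[0]
    if active.size == 0:
        return None
    return int(active[0]), int(active[-1])

# Silence -- tone -- silence: the detector should bracket the tone,
# i.e., frames 5 through 14 of this 20-frame signal.
sig = np.concatenate([np.zeros(800),
                      0.5 * np.sin(2 * np.pi * 440 * np.arange(1600) / 8000.0),
                      np.zeros(800)])
start, end = find_endpoints(sig)
```

Note that a word-internal stop closure would split the active region in two; a practical detector also needs hangover rules and a noise-adaptive threshold.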

Allowing for timescale variation

Up to now we have also assumed that the words to be compared are of the same length and that corresponding frames represent the same phonetic features. In practice, speakers use different speaking rates, and these rates are non-uniform (see the earlier slide with two analyses of the word "eight"). Fortunately, these issues can be addressed by aligning the two words with a mathematical technique known as dynamic time warping.

Problem formulation

Dynamic time warping. Assume that an incoming speech pattern and a template pattern are to be compared, having n and N frames respectively, and that some metric has been used to calculate the distance d(i, j) between frame i of the incoming speech and frame j of the template. Our goal is to find a path from (1, 1) to (n, N) such that the sum of the distances between frames along the path is minimum.

One approach is to evaluate every possible path between both points and select the path with the lowest overall distance. As you can imagine, this will only work for very small values of n and N. To solve the problem efficiently, we use a mathematical technique known as dynamic programming (DP). For DP to be applicable, the problem must exhibit two properties:
- Overlapping subproblems: the problem can be broken down into subproblems whose solution method can be reused over and over
- Optimal substructure: the solution can be obtained by combining optimal solutions to its subproblems

Alignment path between the incoming pattern and the template [Holmes, 2001, ch. 8]

Consider the problem illustrated in the previous figure, and assume that:
- The path always goes forward in time (i.e., has a non-negative slope)
- We cannot skip individual frames of either pattern (i.e., jumps are not allowed)

Consider a point (i, j) in the middle of both patterns, and let D(i, j) denote the cumulative distance along the optimum path from (1, 1) to (i, j):

    D(i, j) = sum of d(x, y) over all points (x, y) on the optimal path from (1, 1) to (i, j)

If (i, j) is on the optimal path, then the optimal path must also pass through one of its three neighboring cells: (i−1, j), (i, j−1), or (i−1, j−1). Therefore, the cumulative distance can be computed as:

    D(i, j) = min[ D(i, j−1), D(i−1, j), D(i−1, j−1) ] + d(i, j)

In other words, the best way to get to (i, j) is to get to one of its immediately preceding points by the best way, and then take the appropriate step to (i, j). Thus, a simple procedure may be used to fill in the matrix D:
- Initialization: D(1, 1) = d(1, 1)
- Cells along the left-hand side can only be reached in one direction (vertically), so starting from D(1, 1), the values D(1, j) can be calculated for increasing values of j
- Once the left column is completed, the second column can be computed from D(i, j) = min[ D(i, j−1), D(i−1, j), D(i−1, j−1) ] + d(i, j), and so forth

The value obtained for D(n, N) is the score for the best way of matching the two words. If you are also interested in finding the optimal path itself, additional book-keeping is necessary to backtrack from D(n, N) to D(1, 1).
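The fill-and-backtrack procedure above can be sketched as follows (Python rather than the MATLAB of ex12p1.m; 0-based indices replace the slides' 1-based (1,1)…(n,N) notation):

```python
import numpy as np

def dtw(d):
    """Fill the cumulative-distance matrix D for a local distance
    matrix d (shape n x N) and backtrack the optimal path."""
    n, N = d.shape
    D = np.full((n, N), np.inf)
    D[0, 0] = d[0, 0]                       # initialization: D(1,1) = d(1,1)
    for j in range(1, N):                   # left-hand side: one direction only
        D[0, j] = D[0, j - 1] + d[0, j]
    for i in range(1, n):                   # then column by column
        D[i, 0] = D[i - 1, 0] + d[i, 0]
        for j in range(1, N):
            D[i, j] = min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1]) + d[i, j]
    # Book-keeping: backtrack from (n,N) to (1,1) to recover the optimal path
    path = [(n - 1, N - 1)]
    i, j = n - 1, N - 1
    while (i, j) != (0, 0):
        steps = [(i, j - 1), (i - 1, j), (i - 1, j - 1)]
        i, j = min((s for s in steps if s[0] >= 0 and s[1] >= 0),
                   key=lambda s: D[s])
        path.append((i, j))
    return D[n - 1, N - 1], path[::-1]

# Toy 3x3 distance matrix: the diagonal is the obvious best alignment.
d = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
score, path = dtw(d)
```

For this toy matrix the best score is 0 along the diagonal path (0,0) → (1,1) → (2,2); only D(n, N) is needed for recognition, so the backtracking step can be dropped when the path itself is not required.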

ex12p1.m: Aligning utterances with DTW

Refinements to DTW

Penalties for time-scale distortions. DTW works best when both words have similar lengths; otherwise the path will contain a large number of vertical and horizontal segments. Although the presence of these segments implicitly penalizes the overall cost of the path, it is sometimes advisable to penalize them explicitly:

    D(i, j) = min[ D(i−1, j) + d(i, j) + hdp,
                   D(i−1, j−1) + 2·d(i, j),
                   D(i, j−1) + d(i, j) + vdp ]

where the values of the horizontal and vertical distortion penalties (hdp, vdp) must be determined empirically.

Length normalization. The cumulative distance depends on the lengths of the example and the template, so the best-match decision will favor short templates. To remove this bias, one can divide the total distance by the length of the template.
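Both refinements can be folded into the basic recurrence as follows (a sketch; the default penalty values and the treatment of the first row/column are illustrative choices that would have to be tuned empirically, as noted above):

```python
import numpy as np

def dtw_penalized(d, hdp=1.0, vdp=1.0):
    """DTW score with explicit horizontal/vertical distortion penalties
    and length normalization by the template length N."""
    n, N = d.shape
    D = np.full((n, N), np.inf)
    D[0, 0] = d[0, 0]
    for j in range(1, N):                        # purely vertical start
        D[0, j] = D[0, j - 1] + d[0, j] + vdp
    for i in range(1, n):
        D[i, 0] = D[i - 1, 0] + d[i, 0] + hdp    # purely horizontal start
        for j in range(1, N):
            D[i, j] = min(D[i - 1, j] + d[i, j] + hdp,      # horizontal step
                          D[i - 1, j - 1] + 2 * d[i, j],    # diagonal step
                          D[i, j - 1] + d[i, j] + vdp)      # vertical step
    return D[n - 1, N - 1] / N   # divide by template length to remove bias

diag = np.array([[0., 1., 2.],
                 [1., 0., 1.],
                 [2., 1., 0.]])
norm_score = dtw_penalized(diag)   # pure diagonal path: no penalty accrues
```

The diagonal step costs 2·d(i, j) so that, for equal-length well-matched words, diagonal and stepped paths are weighted comparably; the normalized score can now be compared fairly across templates of different lengths.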

Score pruning

DP provides very significant savings compared to evaluating every possible path, but remains computationally intensive when there is a large number of templates to be matched. Additional savings can be obtained by not allowing paths with relatively bad scores to propagate forward. For example, a cell whose cumulative distance D(i, j) far exceeds the minimum in its column is unlikely to be part of the optimal path, so one can prune any paths continuing from that cell. In doing so, we trade off optimality for potentially significant savings (i.e., a speedup by a factor of 5-10). Note, however, that most circumstances in which pruning eliminates the optimal path arise when the two words are different, in which case overestimating the cumulative distance will not affect recognition rates.
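A per-column pruning step can be grafted onto the basic recurrence as follows (a sketch; the beam width is illustrative, and a beam that is too tight can prune away every path, returning an infinite score):

```python
import numpy as np

def dtw_pruned(d, beam=2.0):
    """DTW with score pruning: after each incoming frame i is processed,
    cells whose cumulative distance exceeds that column's minimum by more
    than `beam` are set to infinity, so no path may continue through them."""
    n, N = d.shape
    D = np.full((n, N), np.inf)
    D[0, 0] = d[0, 0]
    for i in range(n):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            best = min(D[i, j - 1] if j > 0 else np.inf,
                       D[i - 1, j] if i > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = best + d[i, j]
        col_min = D[i].min()
        D[i, D[i] > col_min + beam] = np.inf   # prune unpromising cells

    return D[n - 1, N - 1]

d = np.array([[0., 1., 2.],
              [1., 0., 1.],
              [2., 1., 0.]])
score = dtw_pruned(d)   # pruning leaves the diagonal path intact here
```

On this matrix the off-diagonal corners get pruned but the optimal (diagonal) path survives, illustrating the trade-off: a well-matched pair keeps its best path, while a poorly matched pair may only lose paths it would not have won with anyway.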

- DTW alignment between two examples of the word "eight" with score pruning but no time-scale distortion penalty
- DTW alignment between two examples of the word "eight" with score pruning AND a small time-scale distortion penalty; notice the more plausible matching of the two timescales
- DTW alignment between two dissimilar words ("three" and "eight") with a time-scale distortion penalty; score pruning was removed, as the resulting path would have been seriously suboptimal

[Holmes, 2001, ch. 8]