Segment-Based Speech Recognition


Segment-Based Speech Recognition
Lecture # 16, Session 2003
6.345 Automatic Speech Recognition

Outline:
- Introduction
- Searching graph-based observation spaces
- Anti-phone modelling
- Near-miss modelling
- Modelling landmarks
- Phonological modelling

Waveform Segment-Based Speech Recognition
- Frame-based measurements (every 5 ms)
- Segment network created by interconnecting spectral landmarks
  (Figure: waveform, segment network, and hypothesized phone labels for the utterance "computers that talk")
- Probabilistic search finds the most likely phone and word strings

Segment-Based Speech Recognition
- Acoustic modelling is performed over an entire segment
- Segments typically correspond to phonetic-like units
- Potential advantages:
  - Improved joint modelling of time/spectral structure
  - Segment- or landmark-based acoustic measurements
- Potential disadvantages:
  - Significant increase in model and search computation
  - Difficulty in robustly training model parameters

Hierarchical Acoustic-Phonetic Modelling
- Homogeneous measurements can compromise performance:
  - Nasal consonants are classified better with a longer analysis window
  - Stop consonants are classified better with a shorter analysis window
  (Figure: % classification error vs. window duration, 10-30 ms, for nasal and stop consonants)
- Class-specific information extraction can reduce error

Committee-Based Phonetic Classification
- Change of temporal basis affects within-class error:
  - Smoothly varying cosine basis better for vowels and nasals
  - Piecewise-constant basis better for fricatives and stops
  (Figure: % error for S1, 5 averages, vs. S3, 5 cosines, over the overall, vowel, nasal, weak fricative, and stop classes)
- Combining information sources can reduce error
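
A minimal sketch of the committee idea above: several classifiers score the same segment, and their log scores are summed (equivalently, their likelihoods multiplied under an independence assumption) before picking the best class. The class set and all score values here are invented for illustration.

```python
import math

# Hypothetical per-class log-likelihood scores from two committee members for
# one segment: S1 (piecewise-constant averages) and S3 (cosine basis).
scores_s1 = {"vowel": -1.2, "nasal": -2.5, "stop": -0.9}
scores_s3 = {"vowel": -0.8, "nasal": -2.9, "stop": -1.6}

def combine(*members):
    """Sum log scores across members (multiply likelihoods, assuming the
    members are independent) and renormalize to log posteriors."""
    classes = members[0].keys()
    summed = {c: sum(m[c] for m in members) for c in classes}
    norm = math.log(sum(math.exp(v) for v in summed.values()))
    return {c: v - norm for c, v in summed.items()}

combined = combine(scores_s1, scores_s3)
best = max(combined, key=combined.get)  # class chosen by the committee
```

Note how the committee can overturn a single member: S1 alone prefers "stop", but the combined evidence favors "vowel".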

Phonetic Classification Experiments (A. Halberstadt, 1998)
- TIMIT acoustic-phonetic corpus
  - Context-independent classification only
  - 462-speaker training corpus, 24-speaker core test set
  - Standard evaluation methodology, 39 common phonetic classes
- Several different acoustic representations incorporated:
  - Various time-frequency resolutions (Hamming windows of 10-30 ms)
  - Different spectral representations (MFCCs, PLPCCs, etc.)
  - Cosine transform vs. piecewise-constant basis functions
- Evaluated MAP hierarchy and committee-based methods:

  Method                       % Error
  Baseline                     21.6
  MAP Hierarchy                21.0
  Committee of 8 Classifiers   18.5*
  Committee with Hierarchy     18.3

  * Development set performance

Statistical Approach to ASR
(Figure: block diagram in which speech enters a signal processor to produce acoustic observations A; a linguistic decoder combines the acoustic model P(A|W) and the language model P(W) to output the words W*)
- Given acoustic observations, A, choose the word sequence, W*, which maximizes the a posteriori probability, P(W|A):

  W* = argmax_W P(W|A)

- Bayes' rule is typically used to decompose P(W|A) into acoustic and linguistic terms:

  P(W|A) = P(A|W) P(W) / P(A)
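
The MAP decision rule above can be sketched in a few lines. Since P(A) is the same for every hypothesis, it drops out of the argmax; the word hypotheses and log-probability values below are invented for the example.

```python
import math

# Toy illustration of W* = argmax_W P(A|W) P(W), working in log space.
hypotheses = {
    # W: (log P(A|W), log P(W)) -- invented values
    "computers that talk": (-120.0, math.log(0.02)),
    "commuters that talk": (-118.0, math.log(0.001)),
    "computers at dock":   (-125.0, math.log(0.005)),
}

def map_decode(hyps):
    # P(A) is common to all hypotheses, so it can be ignored.
    return max(hyps, key=lambda w: hyps[w][0] + hyps[w][1])

w_star = map_decode(hypotheses)
```

The second hypothesis has the best acoustic score, but its low language-model probability lets the first hypothesis win, which is exactly the trade-off the decomposition is meant to capture.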

ASR Search Considerations
- A full search considers all possible segmentations, S, and units, U, for each hypothesized word sequence, W:

  W* = argmax_W P(W|A) = argmax_W Σ_{S,U} P(W U S|A)

- Can seek the best path to simplify the search, using dynamic programming (e.g., Viterbi) or graph searches (e.g., A*):

  (W, U, S)* = argmax_{W,U,S} P(W U S|A)

- The modified Bayes decomposition has four terms:

  P(W U S|A) = P(A|S U W) P(S|U W) P(U|W) P(W) / P(A)

- In HMMs these correspond to the acoustic, state, and language model probabilities or likelihoods
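
The best-path simplification can be illustrated with a small dynamic program over a segment graph. This is a generic Viterbi-style sketch, not the actual decoder: nodes are hypothesized boundary times, edges carry a unit label and an invented log score.

```python
# Toy segment graph: edges are (start_time, end_time, unit, log score).
edges = [
    (0, 2, "k",  -1.0), (0, 3, "k",  -2.5),
    (2, 5, "ax", -1.5), (3, 5, "ax", -0.8),
    (5, 8, "m",  -1.2),
]

def viterbi(edges, start, end):
    """Best-path DP: edges always go forward in time, so processing them in
    end-time order visits every predecessor node first."""
    best = {start: (0.0, None)}        # node -> (best log score, back-pointer)
    for e in sorted(edges, key=lambda e: e[1]):
        s, t, unit, lp = e
        if s in best and (t not in best or best[s][0] + lp > best[t][0]):
            best[t] = (best[s][0] + lp, e)
    # Backtrace from the end node to recover the best unit sequence.
    path, node = [], end
    while best[node][1] is not None:
        e = best[node][1]
        path.append(e[2])
        node = e[0]
    return best[end][0], path[::-1]

score, units = viterbi(edges, 0, 8)
```

Only the best-scoring path to each node survives, which is what collapses the sum over all (U, S) into a tractable search.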

Examples of Segment-Based Approaches
- HMMs:
  - Variable frame rate (Ponting et al., 1991; Alwan et al., 2000)
  - Segment-based HMM (Marcus, 1993)
  - Segmental HMM (Russell et al., 1993)
- Trajectory modelling:
  - Stochastic segment models (Ostendorf et al., 1989)
  - Parametric trajectory models (Ng, 1993)
  - Statistical trajectory models (Goldenthal, 1994)
- Feature-based:
  - FEATURE (Cole et al., 1983)
  - SUMMIT (Zue et al., 1989)
  - LAFF (Stevens et al., 1992)

Segment-Based Modelling at MIT
- Baseline segment-based modelling incorporates:
  - Averages and derivatives of spectral coefficients (e.g., MFCCs)
  - Dimensionality normalization via principal component analysis
  - PDF estimation via Gaussian mixtures
- Example acoustic-phonetic modelling investigations:
  - Alternative probabilistic classifiers (e.g., Leung, Meng)
  - Automatically learned feature measurements (e.g., Phillips, Muzumdar)
  - Statistical trajectory models (Goldenthal)
  - Hierarchical probabilistic features (e.g., Chun, Halberstadt)
  - Near-miss modelling (Chang)
  - Probabilistic segmentation (Chang, Lee)
  - Committee-based classifiers (Halberstadt)

SUMMIT Segment-Based ASR
- SUMMIT speech recognition is based on phonetic segments:
  - Explicit phone start and end times are hypothesized during search
  - Differs from conventional frame-based methods (e.g., HMMs)
  - Enables segment-based acoustic-phonetic modelling
  - Measurements can be extracted over landmarks and segments
  (Figure: segment network with hypothesized phone labels)
- Recognition is achieved by searching a phonetic graph:
  - Graph can be computed via acoustic criteria or probabilistic models
  - Competing segmentations make use of different observation spaces
  - Probabilistic decoding must account for the graph-based observation space

Frame-Based Speech Recognition
- Observation space, A, corresponds to a temporal sequence of acoustic frames (e.g., spectral slices): A = {a1 a2 a3}
- Each hypothesized segment, si, is represented by the series of frames computed between the segment start and end times
- The acoustic likelihood, P(A|S W), is derived from the same observation space for all word hypotheses: every competing segmentation scores P(a1 a2 a3 |S W)

Feature-Based Speech Recognition
- Each segment, si, is represented by a single feature vector, ai
- Example: A = {a1 a2 a3 a4 a5}, with two competing segmentations
  X = {a1 a3 a5}, Y = {a2 a4}
  X = {a1 a2 a4 a5}, Y = {a3}
- Given a particular segmentation, S, A consists of X, the feature vectors associated with S, as well as Y, the feature vectors associated with segments not in S: A = X ∪ Y
- To compare different segmentations it is necessary to predict the likelihood of both X and Y:

  P(A|S W) = P(X Y|S W)

  e.g., P(a1 a3 a5, a2 a4 |S W) vs. P(a1 a2 a4 a5, a3 |S W)

Searching Graph-Based Observation Spaces: The Anti-Phone Model
- Create a unit, α, to model segments that are not phones
- For a segmentation, S, assign the anti-phone to the extra segments
- All segments are accounted for in the phonetic graph, so alternative paths through the graph can be legitimately compared
  (Figure: phonetic graph with the anti-phone α assigned to every off-path segment)
- Path likelihoods can be decomposed into two terms:
  1. The likelihood of all segments produced by the anti-phone (a constant)
  2. The ratio of phone to anti-phone likelihoods for all path segments
- MAP formulation for the most likely word sequence, W*, is given by:

  W* = argmax_{W,S} Π_{i=1}^{N_S} [P(x_i|u_i) / P(x_i|α)] P(s_i|u_i) P(U|W) P(W)

Modelling Non-Lexical Units: The Anti-Phone
- Given a particular segmentation, S, A consists of X, the segments associated with S, as well as Y, the segments not associated with S: P(A|S U) = P(X Y|S U)
- Given segmentation S, assign the feature vectors in X to valid units, and all others in Y to the anti-phone
- Since P(X Y|α) is a constant, K, we can write P(X Y|S U), assuming independence between X and Y, as:

  P(X Y|S U) = P(X|U) P(Y|α) = P(X|U) P(Y|α) P(X|α) / P(X|α) = K P(X|U) / P(X|α)

- We need consider only the segments in S during the search:

  W* = argmax_{W,U,S} Π_{i=1}^{N_S} [P(x_i|u_i) / P(x_i|α)] P(s_i|u_i) P(U|W) P(W)
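
The practical consequence of the derivation above is that a path can be scored by summing, over its on-path segments only, the log-ratio of the phone model to the anti-phone model; the constant K from the off-path segments cancels when paths are compared. A minimal sketch, with all density values invented:

```python
# Each on-path segment contributes log P(x_i|u_i) - log P(x_i|alpha); the
# off-path segments (all modelled by the anti-phone) contribute a constant
# that is identical for every path and can be ignored during search.
def path_score(segments):
    """segments: list of (log P(x|u), log P(x|alpha)) for on-path segments."""
    return sum(lp_unit - lp_anti for lp_unit, lp_anti in segments)

good_path = [(-2.0, -4.0), (-1.5, -3.0)]   # phone models beat the anti-phone
poor_path = [(-5.0, -2.5), (-4.0, -3.5)]   # anti-phone fits better
```

This normalization is what gives the sign convention noted later in the lecture: segments the phone models explain well score positive, poor segments score negative, which makes the scores directly usable for pruning and rejection.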

SUMMIT Segment-Based ASR
(Figure: SUMMIT segment-based recognition example)

Anti-Phone Framework Properties
- Models the entire observation space, using both positive and negative examples
- Log-likelihood scores are normalized by the anti-phone:
  - Good scores are positive, bad scores are negative
  - Poor segments all have negative scores
  - Useful for pruning and/or rejection
- Anti-phone is not used for lexical access
- No prior or posterior probabilities used during search:
  - Allows computation on demand and/or fast-match
  - Subsets of data can be used for training
- Context-independent or context-dependent models can be used
- Useful for general pattern-matching problems with graph-based observation spaces

Beyond Anti-Phones: Near-Miss Modelling
- Anti-phone modelling partitions the observation space into two parts (i.e., on or not on a hypothesized segmentation)
- Near-miss modelling partitions the observation space into a set of mutually exclusive, collectively exhaustive subsets:
  - One near-miss subset is pre-computed for each segment in a graph
  - A temporal criterion can guarantee proper near-miss subset generation (e.g., segment A is a near-miss of B iff A's mid-point is spanned by B)
  (Figure: segment graph marking the near-miss segments A of a hypothesized segment B)
- During recognition, observations in a near-miss subset are mapped to the near-miss model of the hypothesized phone
- Near-miss models can be just an anti-phone, but can potentially be more sophisticated (e.g., phone-dependent)

Creating Near-Miss Subsets
- Near-miss subsets, A_i, associated with any segmentation, S, must be mutually exclusive and exhaustive:

  A = ∪_{i∈S} A_i

- The temporal criterion guarantees proper near-miss subsets:
  - Abutting segments in S account for all times exactly once
  - Finding all segments spanning a time creates the near-miss subsets
- Example: for segments a1 ... a5,

  A1 = {a1 a2}   A2 = {a1 a2 a3}   A3 = {a3}   A4 = {a3 a4 a5}   A5 = {a4 a5}

  so that A = ∪_{i∈S} A_i for each segmentation S in {{a1 a3 a5}, {a1 a4}, {a2 a5}}
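
The mid-point criterion is easy to state in code: the near-miss subset of a segment b is every segment whose mid-point falls inside b's time span. The segment times below are invented (so the resulting subsets differ from the slide's example), but the exhaustiveness property is the same: for any segmentation whose segments abut, the union of its near-miss subsets is the whole segment set.

```python
# Hypothetical segment graph: name -> (start time, end time).
segments = {
    "a1": (0.0, 1.0), "a2": (0.0, 2.0), "a3": (1.0, 3.0),
    "a4": (2.0, 4.0), "a5": (3.0, 4.0),
}

def near_miss_subsets(segments):
    """Segment a is a near-miss of b iff a's mid-point lies in [b_start, b_end).
    The half-open interval ensures a mid-point on a boundary is counted once."""
    mid = {name: (s + e) / 2 for name, (s, e) in segments.items()}
    return {b: {a for a in segments if bs <= mid[a] < be}
            for b, (bs, be) in segments.items()}

subsets = near_miss_subsets(segments)

# Abutting segmentation covering 0-4 exactly once: a1 (0-1), a3 (1-3), a5 (3-4).
segmentation = ["a1", "a3", "a5"]
covered = set().union(*(subsets[b] for b in segmentation))
```

Because each mid-point lies in exactly one segment of an abutting segmentation, the subsets picked out by any such segmentation are automatically disjoint and cover every segment in the graph.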

Modelling Landmarks
- We can also incorporate additional feature vectors computed at hypothesized landmarks or phone boundaries
  (Figure: landmarks overlaid on the segment network)
- Every segmentation accounts for every landmark:
  - Some landmarks will be transitions between lexical units
  - Other landmarks will be considered internal to a unit
- Both context-independent and context-dependent units are possible:
  - Effectively model transitions between phones (i.e., diphones)
- Frame-based models can be used to generate the segment graph

Modelling Landmarks
- Frame-based measurements:
  - Computed every 5 milliseconds
  - Feature vector of 14 Mel-scale cepstral coefficients (MFCCs)
- Landmark-based measurements:
  - Compute the average of the MFCCs over 8 regions around each landmark
  - 8 regions x 14 MFCC averages = 112-dimension vector
  - 112 dimensions reduced to 50 using principal component analysis
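
The landmark measurement pipeline above can be sketched as follows. The region boundaries, landmark positions, and random stand-in "MFCC" frames are all invented for illustration; only the dimensions (14 coefficients, 8 regions, 112 to 50 via PCA) come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((400, 14))   # 400 frames of 14 MFCCs, 5 ms apart

def landmark_vector(frames, lm, bounds=(-35, -20, -10, -5, 0, 5, 10, 20, 35)):
    """Average the MFCCs over 8 abutting frame regions around landmark lm
    (region boundaries are an assumption, not the actual SUMMIT layout)."""
    regions = [frames[lm + a:lm + b].mean(axis=0)
               for a, b in zip(bounds, bounds[1:])]
    return np.concatenate(regions)        # 8 regions x 14 coeffs = 112 dims

landmarks = list(range(40, 360, 5))       # 64 hypothesized landmarks
X = np.stack([landmark_vector(frames, lm) for lm in landmarks])

# PCA via SVD of the mean-centred data: project onto the top 50 components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:50].T
```

In a real system the PCA basis would be estimated once on training data and then applied to every landmark vector at recognition time.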

Probabilistic Segmentation
- Uses a forward Viterbi search in the first pass to find the best path
  (Figure: lattice of lexical nodes, e.g., /a/, /r/, /z/, /m/, /h#/, against time frames t0 through t8)
- Relative and absolute thresholds are used to speed up the search

Probabilistic Segmentation (cont.)
- The second pass uses a backwards A* search to find the N-best paths
- The Viterbi backtrace is used as the future estimate for path scores
  (Figure: backwards A* search over the same lexical-node/time lattice)
- Block processing enables pipelined computation
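
The two-pass scheme above can be sketched on a toy lattice: the forward Viterbi pass records the best score to every node, and the backwards A* pass uses those scores as its future estimate. Because that estimate is exact, complete paths pop off the priority queue in best-first order. The lattice (nodes are times, edges are (start, end, label, log score)) is invented for the example.

```python
import heapq

edges = [
    (0, 1, "m", -1.0), (0, 1, "n", -1.4),
    (1, 2, "a", -0.5), (1, 2, "o", -0.9),
    (2, 3, "z", -0.7),
]

def nbest(edges, start, end, n):
    # Pass 1, forward Viterbi: best log score from start to each node
    # (edges go forward in time, so end-time order is a valid DP order).
    fwd = {start: 0.0}
    for s, t, _, lp in sorted(edges, key=lambda e: e[1]):
        if s in fwd:
            fwd[t] = max(fwd.get(t, float("-inf")), fwd[s] + lp)
    # Pass 2, backwards A*: priority = partial backward score + fwd[node].
    heap = [(-fwd[end], 0.0, end, [])]
    results = []
    while heap and len(results) < n:
        _, g, node, labels = heapq.heappop(heap)
        if node == start:
            results.append((g, labels[::-1]))   # a complete path
            continue
        for s, t, lab, lp in edges:
            if t == node and s in fwd:
                heapq.heappush(heap, (-(g + lp + fwd[s]), g + lp, s, [*labels, lab]))
    return results

paths = nbest(edges, 0, 3, n=3)
```

Successive pops yield the second-best, third-best, and later paths without re-searching the lattice, which is what makes the backwards pass cheap relative to the forward pass.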

Phonetic Recognition Experiments
- TIMIT acoustic-phonetic corpus
  - 462-speaker training corpus, 24-speaker core test set
  - Standard evaluation methodology, 39 common phonetic classes
- Segment and landmark representations based on averages and derivatives of 14 MFCCs, energy, and duration
- PCA used for data normalization and reduction
- Acoustic models based on aggregated Gaussian mixtures
- Language model based on a phone bigram
- Probabilistic segmentation computed from diphone models

  Method                                  % Error
  Triphone CDHMM                          27.1
  Recurrent Neural Network                26.1
  Bayesian Triphone HMM                   25.6
  Anti-phone, Heterogeneous classifiers   24.4

Phonological Modelling
- Words are described by phonemic baseforms
- Phonological rules expand the baseforms into a graph, e.g.:
  - Deletion of stop bursts in syllable coda (e.g., "laptop")
  - Deletion of /t/ in various environments (e.g., "intersection", "destination", "crafts")
  - Gemination of fricatives and nasals (e.g., "this side", "in Nome")
  - Place assimilation (e.g., "did you" (/d ih jh uw/))
- Arc probabilities, P(U|W), can be trained
- Most HMMs do not have a phonological component
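
A toy sketch of rule-based baseform expansion: a single hypothetical rule turns a word-final /t/ into its alternate surface realizations, and the cross-product of per-position alternatives enumerates the pronunciation graph's paths. The phone symbols and rule are illustrative, not the actual SUMMIT label set or rule inventory.

```python
from itertools import product

def expand(baseform):
    """Expand a phonemic baseform into all alternate pronunciations,
    applying one toy rule: word-final /t/ may surface as released,
    unreleased, glottal stop, or flap."""
    alternatives = []
    for i, phone in enumerate(baseform):
        if phone == "t" and i == len(baseform) - 1:
            alternatives.append(["t", "t_unreleased", "q", "dx"])
        else:
            alternatives.append([phone])
    return [list(p) for p in product(*alternatives)]

prons = expand(["w", "ah", "t"])   # alternate pronunciations of "what"
```

A real system stores the alternatives as a shared graph rather than an enumerated list, with trainable probabilities P(U|W) on the arcs.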

Phonological Example
- Expansion of "what you" in the SUMMIT recognizer
- The final /t/ in "what" can be realized as released, unreleased, palatalized, glottal stop, or flap
  (Figure: pronunciation graph for "what you")

Word Recognition Experiments
- Jupiter telephone-based, weather-queries corpus
  - 50,000-utterance training set, 1,806 in-domain utterance test set
- Acoustic models based on Gaussian mixtures
  - Segment and landmark representations based on averages and derivatives of 14 MFCCs, energy, and duration
  - PCA used for data normalization and reduction
  - 715 context-dependent boundary classes
  - 935 triphone, 1160 diphone context-dependent segment classes
- Pronunciation graph incorporates pronunciation probabilities
- Language model based on class bigram and trigram
- Best performance achieved by combining models:

  Method            % Error
  Boundary models   7.6
  Segment models    9.6
  Combined          6.1

Summary
- Some segment-based speech recognition techniques transform the observation space from frames to graphs
- Graph-based observation spaces allow for a wide variety of alternative modelling methods compared to frame-based approaches
- Anti-phone and near-miss modelling frameworks provide a mechanism for searching graph-based observation spaces
- Good results have been achieved for phonetic recognition
- Much work remains to be done!

References
- J. Glass, "A Probabilistic Framework for Segment-Based Speech Recognition," to appear in Computer, Speech & Language, 2003.
- A. Halberstadt, "Heterogeneous Acoustic Measurements and Multiple Classifiers for Speech Recognition," Ph.D. thesis, MIT, 1998.
- M. Ostendorf et al., "From HMMs to Segment Models: A Unified View of Stochastic Modeling for Speech Recognition," IEEE Trans. Speech & Audio Processing, 4(5), 1996.