Resource Optimized Speech Recognition using Kullback-Leibler Divergence based HMM


Ramya Rasipuram, David Imseng, Marzieh Razavi, Mathew Magimai Doss, Hervé Bourlard
24 October 2014

Automatic Speech Recognition (ASR)

Speech signal → feature extraction → acoustic features → acoustic model (likelihood/probabilities of sounds) → decoder → "Good Morning"

Pronunciation lexicon: good /g/ /uh/ /d/; god /g/ /aa/ /d/; cat /k/ /aa/ /t/
Acoustic probabilities (example): /g/ 0.9, /uh/ 0.7, /d/ 0.8, /k/ 0.01, /aa/ 0.1, /t/ 0.15
Language model (example): good 0.7, god 0.01, cat 0.001, plus entries such as "morning"
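The decoding idea on this slide can be sketched in a few lines: score each lexicon word by combining the acoustic probabilities of its phones with its language model prior, then pick the best word. The probabilities are the slide's illustrative numbers; `word_score` is a helper name of ours, not something from the talk.

```python
# Toy sketch of the ASR pipeline's final step: combine the pronunciation
# lexicon, acoustic probabilities and language model to pick a word.
# All numbers are the slide's illustrative examples.
import math

lexicon = {
    "good": ["/g/", "/uh/", "/d/"],
    "god":  ["/g/", "/aa/", "/d/"],
    "cat":  ["/k/", "/aa/", "/t/"],
}
acoustic = {"/g/": 0.9, "/uh/": 0.7, "/d/": 0.8,
            "/k/": 0.01, "/aa/": 0.1, "/t/": 0.15}
language = {"good": 0.7, "god": 0.01, "cat": 0.001}

def word_score(word):
    # Log-domain sum of the per-phone acoustic scores plus the LM prior.
    return sum(math.log(acoustic[p]) for p in lexicon[word]) + math.log(language[word])

best = max(lexicon, key=word_score)
print(best)  # "good": its phones match the acoustic evidence best
```

A real decoder searches over whole word sequences and their time alignments; this toy collapses the search to a single isolated word.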

Hidden Markov Models (HMMs) for ASR

Hypothesis word sequence ("Is it?"): IS IT, from the language model
Pronunciation lexicon (deterministic): subwords /ih/ /z/ /ih/ /t/
Context-dependent subwords: /sil ih+z/ /ih z+ih/ /z ih+t/ /ih t+sil/
HMM states (e.g. s_ih_2), connected to the acoustic features through the acoustic model

Standard HMM-based ASR

Hierarchy: word sequence W → pronunciation lexicon → subwords → (lexical model) → HMM states → (acoustic model) → acoustic features
Acoustic model: 1 GMM, giving HMM/GMM; 2 ANN, giving hybrid HMM/ANN
Lexical model: 1 deterministic, based on decision trees

Resources for ASR

The same pipeline, annotated with the resource each component needs:
- Pronunciation lexicon (good /g/ /uh/ /d/, god /g/ /aa/ /d/, cat /k/ /aa/ /t/): written by a linguist
- Acoustic model (likelihood/probabilities of sounds): trained on transcribed speech data
- Language model (good 0.7, god 0.01, cat 0.001): trained on text data

ASR for Under-Resourced Languages

- Limited or no transcribed speech
- Linguistic expertise may not be available
- Limited or no text resources

Limited Transcribed Speech Data

- Borrow resources: the sounds of languages, i.e. phonemes, can be shared across languages
- The pronunciation lexicon is important

Example from the slide: the grapheme "a" maps to different phonemes across languages (English ei; German a, a:; French A, a, E; Greek α → a), and similarly for "b" (English b i:; German b e:; Greek β → v).

Conventional Approaches

- MAP adaptation: an HMM/GMM with language-independent (LI) decision trees; the GMMs are trained on LI data and adapted on language-dependent (LD) data
- Tandem: an ANN trained on LI data; its log posteriors, after PCA, become the features v_t of an HMM/GMM with LD trees and LD GMMs

LI: language-independent data from resource-rich language(s)
LD: language-dependent data from the under-resourced language

No Pronunciation Lexicon

1 Pay a linguist: expensive, time consuming
2 Graphemes as subword units: easy, but not optimal

Example word "read": phone pronunciations r eh d / r iy d; grapheme pronunciation R E A D
Context-dependent graphemes are clustered with decision trees, and the clustered context-dependent graphemes are modeled with GMMs.
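The "easy" half of this trade-off fits in two lines: a grapheme lexicon is derived directly from spelling. `grapheme_lexicon` is our illustrative helper, not code from the talk.

```python
# Sketch of why graphemes are "easy": the lexicon comes straight from
# spelling, with no linguist involved.
def grapheme_lexicon(words):
    # Each word's "pronunciation" is simply its letter sequence.
    return {w: list(w.upper()) for w in words}

lex = grapheme_lexicon(["read", "thing", "that"])
print(lex["read"])  # ['R', 'E', 'A', 'D']
```

The "not optimal" half is visible too: "read" gets a single grapheme entry R E A D even though it has two phone pronunciations (r eh d, r iy d), so the grapheme units have to absorb that ambiguity.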

Limited Transcribed Speech and No Pronunciations

Multilingual graphemes? Worse than monolingual grapheme-based ASR.
The Latin-script languages on the slide (English, Spanish, Italian, German, French) share the graphemes a and b, but Greek writes α and β, so its graphemes cannot simply be borrowed.

Probabilistic Lexical Modeling

Instead of a deterministic mapping, each HMM state l_i holds a probability distribution over acoustic units:
0 < P(a^d | l_i) < 1, with sum_{d=1}^{D} P(a^d | l_i) = 1
Lexical model parameters: theta_l = {y_i}_{i=1}^{I}, where y_i = [y_i^1, ..., y_i^D]^T and y_i^d = P(a^d | l_i)
theta_l is estimated by training an HMM: the Kullback-Leibler divergence based HMM (KL-HMM) [1]

[1] Aradilla G., "Acoustic Models for Posterior Features in Speech Recognition", EPFL PhD Thesis, 2008

KL-HMM System

[Slide figure: a left-to-right HMM with states l_1, l_2, l_3 and transition probabilities a_01, a_11, a_12, a_22, a_23, a_33, a_34; each state l_i carries lexical parameters y_i = [y_i^1, ..., y_i^D]^T. An ANN takes the acoustic observation sequence x_1, ..., x_T (PLP features, a window of ±4 frames) and outputs, per frame, an acoustic unit probability vector z_t = [z_t^1, ..., z_t^D]^T with z_t^d = p(a^d | x_t); D is the number of acoustic units a^1, ..., a^D.]

KL-HMM

Features: posterior probability estimates of acoustic units
z_t = [z_t^1, ..., z_t^d, ..., z_t^D]^T, z_t^d = p(a^d | x_t)
State distribution: categorical distribution
y_i = [y_i^1, ..., y_i^d, ..., y_i^D]^T, y_i^d = P(a^d | l_i)
Local score: Kullback-Leibler (KL) divergence
S(z_t, y_i) = sum_{d=1}^{D} z_t^d log(z_t^d / y_i^d)
Parameter estimation: Viterbi Expectation Maximization algorithm with a cost function based on the KL divergence
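As a small sanity check, the local score above can be computed directly; the posterior vectors below are made-up values, not numbers from the talk, and the state distributions are assumed to have no zero entries.

```python
# KL-HMM local score: KL divergence between a frame's ANN posterior
# vector z_t and a state's categorical distribution y_i.
import math

def kl_local_score(z_t, y_i):
    # S(z_t, y_i) = sum_d z_t[d] * log(z_t[d] / y_i[d]),
    # with the convention 0 * log(0/y) = 0 (handled by the z > 0 filter);
    # assumes y_i has no zero entries.
    return sum(z * math.log(z / y) for z, y in zip(z_t, y_i) if z > 0)

z_t = [0.7, 0.2, 0.1]      # acoustic-unit posteriors for one frame
y_good = [0.6, 0.3, 0.1]   # state whose distribution matches the frame
y_bad = [0.1, 0.1, 0.8]    # mismatched state
print(kl_local_score(z_t, y_good) < kl_local_score(z_t, y_bad))  # True
```

A lower score means the frame's posteriors match the state's distribution better, which is why both training and decoding minimize the summed KL divergence along the state sequence.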

Decoding

The hypothesis word sequence W passes through the language model and pronunciation lexicon to the lexical model, which gives each state a distribution over the acoustic units a^1, ..., a^D; the decoder scores the match between this lexical evidence and the acoustic evidence (ANN posteriors over a^1, ..., a^D) with the KL divergence.
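A hedged sketch of how such matching can be carried out for a single word model: Viterbi dynamic programming over a left-to-right state sequence, with the KL divergence as the local cost. The two-state model and three frames are toy values; the slide shows only the matching idea, not this exact algorithm.

```python
# Viterbi alignment of posterior frames to a left-to-right KL-HMM:
# at each frame, stay in the current state or advance to the next,
# minimizing the summed KL local cost. Toy values, assumed y > 0.
import math

def kl(z, y):
    # KL divergence with the 0 * log(0/y) = 0 convention.
    return sum(zd * math.log(zd / yd) for zd, yd in zip(z, y) if zd > 0)

def viterbi_align(frames, states):
    INF = float("inf")
    cost = [[INF] * len(states) for _ in frames]
    cost[0][0] = kl(frames[0], states[0])
    for t in range(1, len(frames)):
        for i in range(len(states)):
            # Best predecessor: same state (self-loop) or previous state.
            prev = min(cost[t - 1][i], cost[t - 1][i - 1] if i > 0 else INF)
            if prev < INF:
                cost[t][i] = prev + kl(frames[t], states[i])
    return cost[-1][-1]  # best total cost ending in the final state

states = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]                     # 2-state word model
frames = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]    # 3 posterior frames
print(round(viterbi_align(frames, states), 3))
```

The frames here drift from the first state's distribution to the second's, so the best alignment assigns the first two frames to state 1 and the last to state 2; a full decoder would run this competition over all word models.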

Advantage 1: Resource Optimization

The architecture splits cleanly by resource: the acoustic model (probabilities of sounds, e.g. /r/ 0.9, /eh/ 0.8, /d/ 0.8, /er/ 0.09, /iy/ 0.1, /t/ 0.15) can be trained on language-independent data, while the lexical model, pronunciation lexicon (read R E A D, thing T H I N G, that T H A T) and language model are trained on language-dependent data. Example decoder output: "I read that book".

Advantage 2: Grapheme Subword Units

No linguist is needed: the pronunciation lexicon is built from spelling (read R E A D, thing T H I N G, that T H A T), and the probabilistic lexical model learns the grapheme-to-sound relationship from data.

Task

Build a speech recognition system for Greek, but with:
- Limited transcribed speech data
- No pronunciation lexicon
Borrow resources from French, German, Italian, Spanish and English: the language-independent (LI) data.

Systems

- KL-HMM: Greek lexical model (over acoustic units a^1, ..., a^D) on top of an ANN trained on LI data
- Tandem: the LI ANN's log posteriors, after PCA, become features v_t for an HMM/GMM with Greek trees and Greek GMMs
- MAP adaptation: HMM/GMM with LI trees, trained on LI data and adapted on Greek
- HMM/GMM: Greek trees and Greek GMMs (monolingual baseline)

Results

[Slide figure: four panels (KL-HMM, Tandem, MAP adaptation, HMM/GMM) plotting word accuracy in % (50-85) against the amount of Greek training data in minutes (5, 9, 18, 37, 75, 150, 300, 800), each with a phone-based and a grapheme-based curve.]

Advantage 3: Pronunciation Variability Modeling

Train data: native speech. Test data: native and non-native speech.
Systems compared: HMM/GMM (decision trees + GMMs), hybrid HMM/ANN (decision trees + ANN), and KL-HMM (probabilistic lexical model over a^1, ..., a^D + ANN).

Results

[Slide figure: overall word accuracy in % (60-78) against the number of clustered CD (context-dependent) units (57 monophone, 195, 385, 549, 759, 1101, 3000) for HMM/GMM, hybrid HMM/ANN and KL-HMM.]

Conclusions

The KL-HMM approach to speech recognition:
1 Shares resources efficiently
2 Suits both grapheme- and phone-based pronunciation lexicons
3 Suits tasks constrained in both transcribed speech and pronunciation resources
4 Performs better than, or comparably to, conventional systems in well-resourced conditions

Thank you for your attention. Questions?