Plasticity in Systems for Automatic Speech Recognition: A Review
Roger K Moore & Stuart P Cunningham


Overview
  Automatic Speech Recognition (ASR)
    breakthroughs
    key components
    training / recognition
  Practical Challenges
    user characteristics
    user environment
    user behaviour
  Plasticity in ASR
    flexibility / robustness
    learning / adaptation

ASR in the 1950s & 60s
  [Figure labels: pre-processor, end-point detector, comparator, store of reference templates, training switch, subset, finite state syntax, best match]

ASR in the 1970s
  [Figure labels, between "bottom-up" and "top-down": pre-processor, feature extractor, segmentor, phonetic decoder, lexical access, syntactic parser, semantic interpreter; knowledge sources: phonetic rule-base, lexicon, grammar, semantic knowledge-base]

Breakthroughs in ASR
  Integrated search: dynamic time warping (DTW)
  Stochastic modelling: hidden Markov models (HMM)
  Sub-word representations: context-dependent phones (triphones)

Contemporary ASR
  [Figure labels: target vocabulary, pronouncing dictionary, phonetic transcription, model selection, inventory of sub-word models, HMM re-estimation, word-boundary modifications, training corpora, noise models, model combination, language model, language model re-estimation, input signal, front-end signal processing, integrated network of HMM states, Viterbi search, most probable path & lattice of alternatives]
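The integrated-search idea behind DTW can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: it assumes one-dimensional feature values and an absolute-difference local cost, and the names `dtw_distance`, `recognise` and the template dictionary are hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best time alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b
                cost[i][j - 1],      # stretch a
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognise(utterance, templates):
    """Return the label of the reference template with the lowest DTW cost."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

For example, `recognise([1, 2, 2, 3], {"one": [1, 2, 3], "two": [3, 2, 1]})` picks `"one"`, because the repeated value is absorbed by the warping at no cost.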

Key ASR Components
  Language Model
  Acoustic Model
  Noise Model
  Pronunciation Model

ASR Components
  Language model: n-grams
    "The cat sat on the ..." (bigram, trigram)
  Acoustic model: context-dependent phones, e.g. tri-phones
    "str": /t:s_r/
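As an illustration of the n-gram language model, the sketch below estimates bigram probabilities by relative frequency. It is a minimal sketch under stated assumptions: no smoothing is applied, the `<s>` and `</s>` sentence markers are a common convention rather than something the slides specify, and the function names are hypothetical.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count histories and bigrams over whitespace-tokenised sentences."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        histories.update(words[:-1])               # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))
    return histories, bigrams


def bigram_prob(prev, word, histories, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen histories."""
    if histories[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / histories[prev]
```

Trained on a toy corpus such as "the cat sat on the mat", the model assigns a probability to each candidate word that could follow "the"; a real system would add smoothing so that unseen word pairs do not receive zero probability.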

ASR Components
  Noise model
    Noise HMM + Speech HMM = [figure]
  Pronunciation model: dictionary citation forms + variants
    /n/-deletion: /reiz@n/ -> /reiz@/
    /r/-deletion: /Amst@rdAm/ -> /Amst@dAm/
    /t/-deletion: /rextstre:ks/ -> /rexstre:ks/
    /@/-insertion: /delft/ -> /del@ft/

How HMM-based ASR Works
  A very quick tutorial (with no maths)
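The "citation forms + variants" idea can be sketched as a set of rewrite rules applied to dictionary entries. The rule contexts below are assumptions chosen only so that they reproduce the four examples on the slide; the slides do not state the actual contexts, and `RULES` and `pronunciation_variants` are hypothetical names.

```python
import re

# (rule name, regex over the phone string, replacement); contexts are assumed
RULES = [
    ("/n/-deletion", r"@n$", "@"),       # word-final /n/ after schwa
    ("/r/-deletion", r"@r(?=d)", "@"),   # /r/ between schwa and /d/
    ("/t/-deletion", r"xts", "xs"),      # /t/ between /x/ and /s/
    ("/@/-insertion", r"lf", "l@f"),     # schwa between /l/ and /f/
]


def pronunciation_variants(citation):
    """Apply each rule on its own and return the variants that differ."""
    variants = []
    for name, pattern, replacement in RULES:
        variant = re.sub(pattern, replacement, citation)
        if variant != citation:
            variants.append((name, variant))
    return variants
```

Each variant would be added to the dictionary alongside the citation form, so the recogniser can match either pronunciation.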

Markov Model
Markov Model Alignment

Markov Model
Hidden Markov Model

HMM Alignment

HMMs for Speech
  Whole-word HMMs
  Sub-word HMMs
  Context-dependent sub-word HMMs
    "Seven": s:#_e  e:s_v  v:e_@  @:v_n  n:@_#
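The triphone notation used for "Seven" (phone:left_right, with # marking a word boundary) can be generated mechanically from a phone sequence. The following is a minimal sketch; the function name `to_triphones` is hypothetical, and cross-word context is ignored.

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent labels 'p:left_right'."""
    padded = ["#"] + list(phones) + ["#"]  # '#' marks the word boundary
    return [
        f"{padded[i]}:{padded[i - 1]}_{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]
```

Calling `to_triphones(["s", "e", "v", "@", "n"])` yields the five labels listed for "Seven".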

ASR
  Target Vocabulary: one, two, three
  Phonetic Transcription: /wvn/, /tu/, /Tri/
  Sub-Word Triphones:
    one: (w:#_v)(v:w_n)(n:v_#)
    two: (t:#_u)(u:t_#)
    three: (T:#_r)(r:T_i)(i:r_#)
  Hidden Markov Models
  HMM Network

ASR
  [Figure: network of the word models one, two, three; max Pr; example word sequence: one one three two]
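The "max Pr" step, finding the most probable path through the HMM network, is the Viterbi search named earlier in the deck. The sketch below is a minimal log-domain version for a discrete-observation HMM; the dictionary-based model layout and the function name are assumptions made for illustration, and real systems use continuous acoustic likelihoods and pruning.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best state path, its log probability) for an observation list."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(obs[0], 0)) for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r].get(s, 0)))
            scores[s] = best[-1][prev] + log(trans_p[prev].get(s, 0)) + log(emit_p[s].get(o, 0))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]
```

With a left-to-right model, the returned path is the alignment of frames to states; in a word network, the words visited along that path give the recognised sequence.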

Practical Challenges
  [Figure labels: USER; speech input, keyboard input, pen-pad input, camera, mouse input; speech output, text output, graphical output; feedback; Linguistic Interpreter / Generator; Spatio-Temporal Interpreter / Generator; DIALOGUE MANAGER; APPLICATION; other tasks, distractions, noise, vibration, acceleration]

Plasticity in ASR
  Practical ASR systems have to be able to adapt / learn in order to be flexible / robust, but the compilation of priors into an integrated network tends to lead to a static data structure.
  Plasticity can be achieved by:
    re-compilation of the network
    adaptation of the model parameters
    modification of the input representation

Concepts from Machine Learning
  Supervised learning (training)
    maximum likelihood (ML)
    expectation-maximisation (EM)
    maximum a-posteriori (MAP)
    maximum mutual information (MMI)
  Unsupervised learning (adaptation)

Acoustic Model Adaptation
  [Figure: Recognition Rate against Amount of Adaptation Data, with curves labelled speaker-independent and speaker-dependent]
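The behaviour suggested by the adaptation figure, staying close to the speaker-independent model with little data and approaching a speaker-dependent model as data accumulates, can be illustrated with MAP estimation of a single Gaussian mean. The slides do not give a formula; the sketch uses the standard MAP mean update, and the prior weight `tau` and the function name are assumptions.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean from a prior mean and adaptation frames.

    tau (assumed value) controls how much data is needed before the
    estimate moves away from the speaker-independent prior.
    """
    n = len(frames)
    return (tau * prior_mean + sum(frames)) / (tau + n)
```

With no adaptation data the estimate equals the prior mean; with `tau` frames it lies halfway between the prior and the sample mean; with many frames it converges on the sample mean.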

Acoustic Model Adaptation
  Model set selection
  Maximum likelihood linear regression (MLLR)
  Eigen-voices
  Vocal tract length normalisation (VTLN)
  [Figure: Response magnitude (dB) against Frequency (kHz), 0 to 6 kHz]

Environment Compensation
  Spectral subtraction (SS)
  Cepstral mean normalisation (CMN)
  Relative spectral processing (RASTA)
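Two of the environment compensation techniques modify the input representation with very simple arithmetic, sketched below. This is a minimal illustration under stated assumptions: frames are plain lists of floats, the noise estimate is given rather than tracked, the floor value is arbitrary, and both function names are hypothetical.

```python
def cepstral_mean_normalise(frames):
    """Subtract the per-coefficient mean over the utterance from every frame."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n for d in range(dims)]
    return [[frame[d] - means[d] for d in range(dims)] for frame in frames]


def spectral_subtract(power_spectrum, noise_estimate, floor=0.0):
    """Subtract a noise power estimate per frequency bin, clamped at a floor."""
    return [max(p - n, floor) for p, n in zip(power_spectrum, noise_estimate)]
```

A constant offset added to every frame, as a fixed channel would add in the cepstral domain, leaves the output of `cepstral_mean_normalise` unchanged, which is the reason the technique helps with microphone and channel mismatch.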

Language Model Adaptation
  Off-Line
    model interpolation
    constraint specification
  On-Line
    dynamic cache
    trigger models
  [Figure labels: task-related background text corpus, task-specific adaptation text corpus, model merging] [Bellegarda, 2004]

Pronunciation Adaptation
  ABI: Accents of the British Isles
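Model interpolation and the dynamic cache can both be sketched as a weighted mixture of word distributions. The example below is a minimal sketch that uses unigram distributions stored as dictionaries; the interpolation weight `lam` and the function names are assumptions, and practical systems interpolate full n-gram models.

```python
from collections import Counter


def interpolate(task_lm, background_lm, lam):
    """Linear interpolation: lam * P_task(w) + (1 - lam) * P_background(w)."""
    vocab = set(task_lm) | set(background_lm)
    return {
        w: lam * task_lm.get(w, 0.0) + (1 - lam) * background_lm.get(w, 0.0)
        for w in vocab
    }


def cache_lm(recent_words):
    """Unigram distribution over the words seen in the recent history."""
    counts = Counter(recent_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}
```

Off-line adaptation would call `interpolate` once with a model trained on the task-specific text; on-line adaptation would rebuild `cache_lm` from the recognised words as the dialogue proceeds and interpolate it with the static model.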

Pronunciation Adaptation
  AM adaptation vs extended dictionary
  Japanese speaking English
    table: /teibl/ -> /teiburu/
  Italian speaking English
    team: /ti:m/ -> /ti:m@/
    linked: /linkt/ -> /link@t/
  Adapt dictionary using phone recogniser
    English phone recogniser on German "aktuelles": /?aktu:?el@s/ -> /{ktwel@us/, /{ktw3:m@z/, /ktw3:l@s/, /{ktwel@uz/, /{tkw3:r@s/, /@kwe@res/
  [Goronzy et al, 2004]

Pronunciation Adaptation
  [Figure: Word Error Rate (%), axis 0 to 30, for Native, Non-Native, MLLR, ExtDict and MLLR+ExtDict] [Goronzy et al, 2004]

Summary
  Contemporary ASR changes dynamically to accommodate:
    new speakers
    unexpected user behaviour
    real acoustic environments
  The prime purpose of such plasticity is to improve recognition accuracy.

Discussion Points
  The computational techniques employed by ASR for adaptation and learning may (or may not) give insights into plasticity in human speech perception.
  Future progress in ASR may (or may not) be determined by insights gained at this workshop.

Thank you. Any questions?