
Articulatory features for word recognition using dynamic Bayesian networks

Centre for Speech Technology Research, University of Edinburgh
10th April 2007

Talk outline

- Why not phones?
- Articulatory features
- Articulatory feature recognition
  - Data
  - Models
  - AF results
- Pronunciation model
- 6-state word models
- Phone-based word models
- Articulatory feature-based word models

What is wrong with phones? Spontaneous speech effects

- Modelling words as sequences of non-overlapping phone segments (the "beads-on-a-string" paradigm) is unrealistic and creates many problems.
- It is difficult to model the variation present in spontaneous, conversational speech.
- Variation arises from the overlapping, asynchronous nature of speech production.
- Standard solution: context-dependent phone models, though these can only deal with certain effects, and necessitate parameter tying to alleviate problems of data sparsity.

What is wrong with phones? Language universality

- A universal phone set has to be large (e.g. the IPA) and will contain many rarely-used symbols.
- It is not at all clear that the same IPA symbol is actually pronounced the same in different languages anyway.
- A large phone set is problematic for modelling, just like trying to do large-vocabulary ASR using whole-word models.
- One solution: decompose/factorise phones into a small set of symbols/factors.

Articulatory features (AFs): linguistic motivation

We are building a recognition system in which articulatory features, not phones, mediate between words and acoustic observations.

- AFs are multi-levelled features such as place and manner of articulation.
- They provide a compact encoding of the variation present in natural speech.
- They allow simple accounts of spontaneous speech effects.
- It should be easier to specify a language-universal feature set.
- This is an articulatory-inspired representation: we are not trying to do articulatory inversion, which aims to recover precise articulator positions.

Articulatory features (AFs): machine-learning motivation

- AFs are a distributed (factorial) representation.
- Potential to make better use of limited training data: effectively, we train a number of low-cardinality classifiers.
- Fewer classes: less likely to suffer data sparsity.

Feature specification

feature     values                                                          cardinality
manner      approximant, fricative, nasal, stop, vowel, silence             6
place       labiodental, dental, alveolar, velar, high, mid, low, silence   8
voicing     voiced, voiceless, silence                                      3
rounding    rounded, unrounded, nil, silence                                4
front-back  front, central, back, nil, silence                              5
static      static, dynamic, silence                                        3
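
The inventory is small enough to write down directly; a minimal sketch in Python, transcribing the table above and computing the size of the joint value space (which the combination counts later in the talk can be compared against):

```python
# Articulatory feature inventory, transcribed from the table above.
FEATURES = {
    "manner":     ["approximant", "fricative", "nasal", "stop", "vowel", "silence"],
    "place":      ["labiodental", "dental", "alveolar", "velar",
                   "high", "mid", "low", "silence"],
    "voicing":    ["voiced", "voiceless", "silence"],
    "rounding":   ["rounded", "unrounded", "nil", "silence"],
    "front-back": ["front", "central", "back", "nil", "silence"],
    "static":     ["static", "dynamic", "silence"],
}

# Each classifier separates only 3-8 classes, but the joint space is
# the product of the cardinalities: 6 * 8 * 3 * 4 * 5 * 3 = 8640.
n_combinations = 1
for values in FEATURES.values():
    n_combinations *= len(values)
print(n_combinations)  # 8640
```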

OGI Numbers

- OGI Numbers, 30-word subset.
- A little over 6 hours of training data and 2 hours of test data.
- AF labels generated by mapping from time-aligned phone labels, using diacritics where appropriate:

Worldbet  example  manner     place    voice   front  round   static
f         five     fricative  labdent  -voice  nil    nil     static
I         six      vowel      high     +voice  front  -round  static

- 39-dimensional acoustic observation vector: 12 Mel-frequency cepstral coefficients and energy, plus 1st and 2nd derivatives.
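
A minimal sketch of how such a phone-to-AF mapping can be applied to time-aligned phone labels; only the two Worldbet entries shown above are filled in, and the function name and alignment format are illustrative assumptions, not the deck's actual tooling:

```python
# Phone-to-AF mapping table; only the two Worldbet phones shown on the
# slide are filled in here (a real table covers the whole phone set).
PHONE_TO_AF = {
    "f": {"manner": "fricative", "place": "labdent", "voicing": "-voice",
          "front-back": "nil", "rounding": "nil", "static": "static"},
    "I": {"manner": "vowel", "place": "high", "voicing": "+voice",
          "front-back": "front", "rounding": "-round", "static": "static"},
}

def phones_to_af_labels(aligned_phones):
    """Map (phone, start_frame, end_frame) alignments to per-frame
    labels, one stream per articulatory feature."""
    streams = {feat: [] for feat in PHONE_TO_AF["f"]}
    for phone, start, end in aligned_phones:
        for feat, value in PHONE_TO_AF[phone].items():
            streams[feat].extend([value] * (end - start))
    return streams

# e.g. "f" for 3 frames then "I" for 2 -> manner: fricative x3, vowel x2
print(phones_to_af_labels([("f", 0, 3), ("I", 3, 5)])["manner"])
```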

Word segmentations

- Word segmentations are derived from phonetic transcriptions.
- Output from Fiona's semi-automatic dictionary-generating procedure.
- Timing information is used to train word models.

Evaluating performance

No ideal metric with which to evaluate:

- Framewise accuracy: comparison with phone-derived feature labels penalizes asynchrony.
- Recognition accuracy: 100 * (n(correct) - n(insertions)) / n(total labels). More useful, though it has the capacity to penalize events we would like to capture, e.g. where assimilation should lead to the deletion of a feature value.
- Word error rates (WERs) make it possible to compare the effect of phones and AFs directly.
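
The recognition-accuracy formula above in runnable form (a trivial sketch, included only to pin down the arithmetic):

```python
def recognition_accuracy(n_correct, n_insertions, n_total):
    """100 * (n(correct) - n(insertions)) / n(total labels),
    as defined on the slide."""
    return 100.0 * (n_correct - n_insertions) / n_total

# e.g. 830 correct labels and 25 insertions against 1000 reference
# labels gives 80.5 (insertions are penalized, unlike plain % correct)
print(recognition_accuracy(830, 25, 1000))  # 80.5
```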

ANN/HMMs without inter-feature dependencies

[Diagram: six parallel feature chains (m, v, p, f, s, r) linking time t-1 to time t, with no arcs between features; ANN outputs provide the observation process.]

GMM/DBNs with inter-feature dependencies

[Diagram: the six feature chains m, v, p, f, s, r from time t-1 to t with arcs between features, each feature generating its own GMM-modelled observation y.]

ANN/DBNs with inter-feature dependencies

[Diagram: as for the GMM/DBN, the six coupled feature chains m, v, p, f, s, r across times t-1 and t, but with ANN posteriors as the observation process.]
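
To make "inter-feature dependencies" concrete, here is a rough sketch of how such a DBN's transition model factorises, under an assumed dependency structure in which each feature also conditions on the current manner value (the arcs in the actual models may differ):

```python
# Rough sketch under an *assumed* dependency structure: the joint
# transition factorises into one conditional table per feature, and
# coupling means a feature's table has another stream as an extra
# parent (here, the current manner value).
FEATS = ["manner", "voicing", "place", "front-back", "static", "rounding"]

def joint_transition_prob(prev, curr, cpts):
    # manner conditions only on its own previous value...
    p = cpts["manner"][prev["manner"]][curr["manner"]]
    for feat in FEATS[1:]:
        # ...while the other features also condition on current manner
        p *= cpts[feat][(prev[feat], curr["manner"])][curr[feat]]
    return p
```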

Summary of AF results

model     avg. correct  correct together  accuracy  # combinations
ANN/HMM   86.7%         71.7%             83.5%     3751
GMM/DBN   86.2%         79.4%             83.4%     117
ANN/DBN   89.1%         84.6%             87.8%     54

- Shown that DBNs can match ANN accuracy.
- State-level coupling of features is indeed beneficial.
- Reduced our dependence on phone-derived feature labels, and learned a set of asynchronous changes.
- An order of magnitude fewer feature combinations may be a suitable operating point between:
  - all possible feature value combinations (linguistically implausible), and
  - only combinations which correspond to canonical phonemes (back to the beads-on-a-string problem).

Towards a word model

We have the observation process in place: the AF recognizer.

[Diagram: word w selects templates t1-t6; each template drives a feature stream f1-f6; the features generate the acoustic observation y.]

Now we simply add on the rest to build a word recognizer.

Incorporating a pronunciation model

- Complete integration of the word-feature layer.
- This component will form the observation process.
- Generate a word by choosing a template for each feature group, where a template gives a sequence of feature values, but not timings.

Example: "four" [f ao r]
  manner template (i),  p = 0.6: fricative, vowel, approximant   [f ao r]
  manner template (ii), p = 0.4: fricative, vowel                [f ao]
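
A minimal sketch of the generative story on this slide, using the two manner templates for "four" from the example (template inventories for the other feature groups work the same way):

```python
import random

# Manner templates for "four", from the slide: each template is a
# sequence of feature values, without timings, plus a probability.
MANNER_TEMPLATES = {
    "four": [
        (0.6, ["fricative", "vowel", "approximant"]),  # [f ao r]
        (0.4, ["fricative", "vowel"]),                 # [f ao]
    ],
}

def sample_template(word, templates=MANNER_TEMPLATES):
    """Draw one template for a feature group according to p(template | word)."""
    probs, seqs = zip(*templates[word])
    return random.choices(seqs, weights=probs, k=1)[0]

print(sample_template("four"))  # usually ['fricative', 'vowel', 'approximant']
```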

Unfortunately, it's not straightforward to add the word recognition layer to the observation process. So, back to basics...

Word recognition DBN structure

[Diagram, built up over successive slides: word counter, word, word transition, word position, phone transition, phone, and acoustic observation variables, plus an end-of-utterance observation and a lexical variant variable.]
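
The deterministic control variables in this structure can be described procedurally; the following is a rough sketch of their assumed semantics (the deck implements this as DBN nodes, not code):

```python
# Rough sketch of the assumed control logic: a phone transition
# advances the word position; a transition out of the final position
# fires a word transition, incrementing the word counter and
# resetting the position for the next word.
def step(word_counter, position, phone_transition, last_position):
    if not phone_transition:
        return word_counter, position
    if position == last_position:
        return word_counter + 1, 0   # word transition
    return word_counter, position + 1

print(step(0, 1, phone_transition=True, last_position=1))  # (1, 0)
```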

6-state word models

- 6 states per word.
- 31 words (30 words + silence).
- No pronunciation model.
- 13 iterations of splitting and vanishing scheme.

Result: 7.1% WER

Phone-based word model

- 3 states per phone.
- 31 words (30 words + silence).
- No explicit pronunciation variation model: top 1 variant in the training data for each word.
- 13 iterations of splitting and vanishing scheme.

Result: 6.9% WER

Articulatory feature-based word model

feature     # templates
manner      232
place       312
voicing     48
rounding    137
front-back  223
static      62

- CPT for p(lexical variant | word), with AFs observed.
- However: too many zero-probability utterances, and memory allocation problems.
- So: 1 variant per word; add in pronunciation variation later.
- Still working on this...
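
One simple way such a p(lexical variant | word) CPT could be populated is relative-frequency estimation over the variants seen in training; the deck does not say how the CPT is trained, so the helper below is purely illustrative:

```python
from collections import Counter

def estimate_variant_cpt(training_variants):
    """Relative-frequency estimate of p(variant | word).
    training_variants: iterable of (word, variant_id) pairs."""
    pairs = list(training_variants)
    counts = Counter(pairs)
    totals = Counter(word for word, _ in pairs)
    return {(w, v): c / totals[w] for (w, v), c in counts.items()}

cpt = estimate_variant_cpt([("four", 0), ("four", 0), ("four", 1)])
print(cpt[("four", 0)])  # 0.666...
```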

Summary

WERs for the 6-state word models and phone-based word models look good. Watch this space for AF results.