On the Formation of Phoneme Categories in DNN Acoustic Models

Tasha Nagamine
Department of Electrical Engineering, Columbia University

Motivation
Large performance gap between humans and state-of-the-art ASR systems
Computational principles of DNNs remain elusive; they are analytically intractable
Improving these models requires a better understanding of their transformations

Introduction to acoustic models
[Figure: acoustic model]
Dahl et al., IEEE Transactions on Audio, Speech, and Language Processing 2012

Phonemes
Smallest contrastive unit in language, e.g., k vs. b in cat/bat
~40-60 in English
Output target in acoustic modeling

Phonetic Features
Manner of articulation: /k/, /g/, /p/, /b/ share the same manner (plosive)
Place of articulation: /p/, /b/, /m/ share the same place (labial)
Voicing: /s/ = unvoiced, /z/ = voiced; same manner (fricative), same place (alveolar)

Phonetic Features: Distinctive Features
Chomsky, Halle, Stevens

Phonemes and phones
Phoneme: smallest contrastive unit in language. Abstract idea.
Phone: instance of a phoneme in an actual utterance. Physical segment.
Example: pat vs. bat contain 4 phonemes (p, b, ae, t) and 6 phones
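The pat/bat count can be checked with a short sketch. It assumes ARPAbet-style transcriptions (p ae t, b ae t), which the slide does not spell out; phones are counted as segment tokens and phonemes as distinct categories.

```python
# Assumed ARPAbet-style transcriptions; the slide only gives the two words.
utterances = {"pat": ["p", "ae", "t"], "bat": ["b", "ae", "t"]}

# Phones: every physical segment that occurs (tokens).
phones = [seg for segs in utterances.values() for seg in segs]

# Phonemes: the distinct contrastive categories behind those segments (types).
phonemes = sorted(set(phones))

print(len(phonemes), "phonemes:", phonemes)
print(len(phones), "phones")
```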

[Figure: the DNN as a feed-forward series of nonlinear transformations. Input: log Mel filterbank coefficients with their 1st and 2nd temporal derivatives. Hidden layers. Output: predicted label vs. actual label for each frame (example label sequence: t uw ah dh er k ey s ih z sil ao). Output label set: aa ae ah ao aw ax ay b ch d dh eh er ey f g hh ih iy jh k l m n ng ow oy p r s sh sil t th uh uw v w y z zh]

DNN Architecture
Input layer: 11 frames of 24-dimensional log Mel filter bank coefficients + deltas
5 sigmoid hidden layers: 256 nodes each; fully connected feed-forward
Softmax output layer: 41 nodes for 40 phonemes and silence; context independent
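A minimal sketch of a forward pass through a network of this shape may help fix the dimensions. It assumes that "+ deltas" means the 1st and 2nd temporal derivatives shown in the figure, giving 11 x 24 x 3 = 792 inputs; the weights are random placeholders rather than trained parameters, and all function and variable names are hypothetical.

```python
import math
import random

N_FRAMES, N_MEL, N_STREAMS = 11, 24, 3  # 3 = static + 1st + 2nd derivative (assumed)
N_HIDDEN_LAYERS, N_HIDDEN, N_OUT = 5, 256, 41
INPUT_DIM = N_FRAMES * N_MEL * N_STREAMS  # 792 under the assumption above


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; stand-ins for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(weights, biases, x):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, x):
    """Return the activations of every hidden layer and the output posteriors."""
    hidden = []
    for weights, biases in layers[:-1]:
        x = sigmoid(affine(weights, biases, x))
        hidden.append(x)
    out_w, out_b = layers[-1]
    return hidden, softmax(affine(out_w, out_b, x))


rng = random.Random(0)
sizes = [INPUT_DIM] + [N_HIDDEN] * N_HIDDEN_LAYERS + [N_OUT]
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

frame = [rng.gauss(0.0, 1.0) for _ in range(INPUT_DIM)]  # stand-in for real features
hidden, posteriors = forward(layers, frame)
print(len(hidden), len(hidden[0]), len(posteriors))
```

The list `hidden` holds one activation vector per hidden layer, which is the kind of quantity the later slides inspect node by node.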

Speech stimuli & DNN activations
[Figure: layer activations for one utterance, shown for the input, hidden layers 1-3, hidden layer 4, hidden layer 5, and the output (label), above the actual labels t uw ah dh er k ey s ih z sil ao. Highlighted examples: response to t, response to z]
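One simple way to summarize such responses is to average each node's activation over all frames that carry the same phoneme label. The sketch below is an illustration under that assumption, not the analysis used in the talk; the function name and the toy data are made up.

```python
from collections import defaultdict


def mean_response_per_phoneme(activations, labels):
    """Average each node's activation over all frames sharing a phoneme label.

    activations: list of frames, each a list of node activations for one layer.
    labels: one phoneme label per frame (same length as activations).
    Returns {phoneme: [mean activation of node 0, node 1, ...]}.
    """
    sums = {}
    counts = defaultdict(int)
    for frame, label in zip(activations, labels):
        if label not in sums:
            sums[label] = [0.0] * len(frame)
        for i, a in enumerate(frame):
            sums[label][i] += a
        counts[label] += 1
    return {ph: [s / counts[ph] for s in total] for ph, total in sums.items()}


# Toy data (made up): 2 nodes, 5 frames.
activations = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.0, 1.0], [0.5, 0.5]]
labels = ["t", "t", "z", "z", "sil"]
means = mean_response_per_phoneme(activations, labels)
print(means["t"], means["z"])
```

In this toy example node 0 responds more strongly to t and node 1 to z, mirroring the "response to t" and "response to z" highlights in the figure.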

Summary of findings
1. Nodes are selective to phonetic features at the individual and population level

[Figure: example node responses, aligned to phoneme onset]
Manner of articulation (closure): ch, jh, g, k, b, p, d, t
Manner of articulation (closure) + unvoiced: ch, k, p, t
Place of articulation (labial): f, v, b, p, m

Phoneme Selectivity Index (PSI)
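The slides name the PSI without defining it. The sketch below follows the definition commonly attributed to Mesgarani et al. (Science 2014) for STG electrodes: for one node, the PSI of a phoneme is the number of other phonemes whose response distributions differ significantly from it under a Wilcoxon rank-sum test. The two-sided test, the 0.01 threshold, the normal approximation, and the toy data are all assumptions here, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, average ranks for ties)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def psi_vector(responses, alpha=0.01):
    """responses: {phoneme: [one node's activation for each instance of that phoneme]}.

    Returns {phoneme: number of other phonemes with a significantly different response}.
    """
    return {
        ph: sum(
            1
            for other in responses
            if other != ph and rank_sum_p(responses[ph], responses[other]) < alpha
        )
        for ph in responses
    }


def ramp(start):
    """Ten evenly spaced toy activations (made-up data)."""
    return [start + 0.01 * i for i in range(10)]


# A toy node that responds strongly to t and weakly, and similarly, to s, z, m.
responses = {"t": ramp(0.80), "s": ramp(0.100), "z": ramp(0.103), "m": ramp(0.106)}
psi = psi_vector(responses)
print(psi)
```

Stacking one such PSI vector per node gives a nodes-by-phonemes matrix for each layer, which is the kind of object that can then be compared across layers or clustered.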

[Figure: PSI by node and by phoneme for Hidden Layer 1, followed by a comparison of Hidden Layer 1 and Hidden Layer 5]

Neural responses to speech in human superior temporal gyrus (STG)
Mesgarani et al., Science 2014

Examples of average phoneme responses in STG
Phoneme groups: plosives, fricatives, low vowels, high vowels, nasals
Phoneme selectivity index
Diversity of responses: strong preference at various STG sites for specific phoneme groups with shared attributes
Mesgarani et al., Science 2014

Clustering the PSI vectors
Global structures (population)
Local structures (single electrode)
[Figure labels: Place, Manner]
Mesgarani et al., Science 2014

Summary of findings
1. Single nodes and populations of nodes in a layer are selective to phonetic features
2. Node selectivity to phonetic features becomes more explicit in deeper layers
3. Network invariance is learned through explicit representation of sources of variability

[Figure: example selectivity of three nodes (N1, N2, N3) for the phoneme t; phoneme instances of t and nodes are each grouped by clustering]

Summary of findings
1. Single nodes and populations of nodes in a layer are selective to phonetic features
2. Node selectivity to phonetic features becomes more explicit in deeper layers
3. Network invariance is learned through explicit representation of sources of variability

Questions?