A Hybrid Neural Network/Hidden Markov Model Method for Automatic Speech Recognition

A Hybrid Neural Network/Hidden Markov Model Method for Automatic Speech Recognition
Hongbing Hu
Advisor: Stephen A. Zahorian
Department of Electrical and Computer Engineering, Binghamton University
03/18/2008

Introduction
Automatic Speech Recognition (ASR): translate speech into text; the most investigated research topic in speech processing.
Applications: speech user interfaces for computers (Microsoft speech recognition in Windows), telephone queries (operator/touch-tone replacement), voice dialing (for cell phones).
Difficulties of Automatic Speech Recognition: speaker variability (pronunciation, rate, overlaps), acoustic variability (noise, reverb, talker movement), style variability (reading vs. conversational speech).

Speech Recognition Architecture
Speech Waveform -> Feature Extraction -> Speech Features -> Classification (Recognition) by the Recognizer (HMM/NN) -> Phonemes (e.g., i n i: d sil) -> Words ("I need a")

Hidden Markov Models (HMM)
A Hidden Markov Model (HMM) is a stochastic process for determining the probability of an observable sequence.
It is a finite-state machine: at each time t that a state j is entered, an observation o_t is emitted with probability density b_j(o_t), and the transition from state i to state j is modeled with probability a_ij.
The slide diagram shows a left-to-right model with states S_1 through S_5, transition probabilities a_12, a_22, a_23, a_33, a_34, a_44, a_45 and emission densities b_2, b_3, b_4.
a_ij: transition probability; b_i(o_j): emission probability.
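
To make these definitions concrete, the short sketch below samples an observation sequence from a toy left-to-right HMM; the number of states, the transition matrix and the Gaussian emission parameters are invented for illustration and are not taken from the talk.

```python
import numpy as np

# Toy left-to-right HMM with Gaussian emissions; all parameters are illustrative.
a = np.array([[0.6, 0.4, 0.0],     # a[i, j]: transition probability from state i to state j
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means = np.array([0.0, 2.0, 4.0])  # per-state emission mean
stds  = np.array([1.0, 0.5, 1.5])  # per-state emission standard deviation

def sample_sequence(length, rng=np.random.default_rng(0)):
    """At each time step, emit o_t from the current state's density b_j, then move via a."""
    state, states, obs = 0, [], []
    for _ in range(length):
        states.append(state)
        obs.append(rng.normal(means[state], stds[state]))
        state = rng.choice(len(a), p=a[state])
    return states, obs

print(sample_sequence(8))
```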

HMMs in Speech Recognition
HMMs are the most popular approach for continuous recognition: an HMM is used to model a phoneme or a word.
The observable sequence is associated with the speech feature vectors O_1 ... O_T.
The probability of a particular feature sequence over an HMM model is computed to determine the recognition decision, via the forward recursion
    α_j(t) = [ Σ_i α_i(t-1) a_ij ] b_j(o_t)
where the sum runs over the states i, a_ij is the transition probability and b_j(o_t) the emission probability.
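
A minimal NumPy sketch of this forward recursion, assuming the per-frame emission likelihoods b_j(o_t) have already been evaluated; all numbers in the toy example are made up.

```python
import numpy as np

def forward_probability(a, b, pi):
    """
    Forward algorithm: total probability of an observation sequence under an HMM.
    a  : (N, N) transition probabilities a[i, j]
    b  : (T, N) emission likelihoods b[t, j] = b_j(o_t), precomputed per frame
    pi : (N,)   initial state probabilities
    """
    T, N = b.shape
    alpha = pi * b[0]                   # alpha_j(1)
    for t in range(1, T):
        alpha = (alpha @ a) * b[t]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_j(o_t)
    return alpha.sum()                  # P(O | model)

# Toy example: 2 states, 3 frames
a  = np.array([[0.7, 0.3], [0.0, 1.0]])
b  = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])
pi = np.array([1.0, 0.0])
print(forward_probability(a, b, pi))
```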

Neural Network (NN)
Neural networks are inspired by biological nervous systems (such as our brain).
Node (artificial neuron): the basic unit in a neural network; its output is determined from the weighted sum of its inputs and an activation function.
For inputs x_1 ... x_d with weights w_1 ... w_d, bias w_0 and activation function f, the node output is
    y = f( Σ_{i=1..d} x_i w_i + w_0 ),  where f(net) = 1 if net >= 0 and -1 if net < 0 (threshold activation).
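
A small sketch of a single node with this threshold activation; the input values and weights are arbitrary and only illustrate the computation.

```python
import numpy as np

def node_output(x, w, w0):
    """One artificial neuron: threshold activation on the weighted sum of its inputs."""
    net = np.dot(x, w) + w0            # net = sum_i x_i * w_i + w_0
    return 1.0 if net >= 0 else -1.0   # f(net): step/threshold activation

# Example: a 3-input node
print(node_output(np.array([0.5, -1.2, 0.3]), np.array([0.8, 0.4, -0.6]), w0=0.1))
```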

Neural Networks in Speech Recognition
A neural network consists of multiple layers of nodes: an input layer, hidden layers and an output layer.
The input layer is enlarged to accept the speech feature vector, and the recognition decision is made from the output layer.
Node weights need to be trained for the desired output.

NN for Feature Dimensionality Reduction
Difficulties in practical speech recognition: the large dimensionality of acoustic feature spaces imposes a significant load in model training ("curse of dimensionality").
Nonlinear Principal Components Analysis (NLPCA): neural network based feature dimensionality reduction.
φ(.) is a neural network mapping from the D-dimensional feature space R^D to the M-dimensional feature space R^M, and φ(x) is the transformed feature of the data point x; the mapping is used to obtain more linear features.

Nonlinear Principal Components Analysis
Bottleneck neural network: the input data is passed through a narrow middle (bottleneck) layer, whose output is the dimensionality reduced data.
The dimensionality reduced data has a more effective representation.
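
A minimal sketch of taking the bottleneck (middle-layer) output as the reduced feature. The layer sizes and tanh activations are assumptions, since the talk does not give the exact topology; the 91-to-13 reduction simply mirrors the dimensions used later in the experiments.

```python
import numpy as np

def bottleneck_features(x, W1, b1, W2, b2):
    """
    Forward pass through the encoder half of a bottleneck network.
    x  : (D,) input speech feature vector
    W1 : (H, D), b1 : (H,)   weights/bias of the hidden layer
    W2 : (M, H), b2 : (M,)   weights/bias of the bottleneck layer, M << D
    Returns the M-dimensional reduced feature read off the bottleneck layer.
    """
    h = np.tanh(W1 @ x + b1)
    return np.tanh(W2 @ h + b2)

# Toy dimensions: 91-dim input reduced to 13 dims through a 40-unit hidden layer
rng = np.random.default_rng(0)
D, H, M = 91, 40, 13
W1, b1 = 0.1 * rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = 0.1 * rng.standard_normal((M, H)), np.zeros(M)
print(bottleneck_features(rng.standard_normal(D), W1, b1, W2, b2).shape)  # (13,)
```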

Limitations of HMMs and NNs
HMMs: poor discriminative power because of the maximum likelihood training criterion; the first-order Markov assumption is only an approximation, leading to reduced performance.
Neural networks: lack of ability to account for temporal variations in speech; lack of a mathematical framework for combining phonetic models, and thus a poor representation for continuous speech.

Hybrid NN/HMM Method
Neural networks: feature dimensionality reduction ability and nonlinear transformation ability.
HMMs: modeling of long-term dependencies and continuous speech recognition; easily combined with a language model.
Neural Network + Hidden Markov Model -> Hybrid Recognition Method: improved flexibility and recognition performance.

Hybrid NN/HMM Method Architecture
The neural network is used for feature transformation: the middle layer of the bottleneck neural network outputs the dimensionality reduced feature, a low-dimensional but efficient representation of the speech feature.
HMM recognizer: each HMM corresponds to a phoneme, and the dimensionality reduced features are recognized using phonetic feature detectors.
Pipeline: Speech Feature -> Pre-process (NN) -> Dimensionality Reduced Feature -> HMM Recognizer.
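
A toy sketch of how the two stages fit together, reduced to scoring a single phoneme segment; the callables and their names are hypothetical placeholders rather than the author's implementation, and continuous decoding with a language model is more involved than this.

```python
import numpy as np

def recognize_phoneme(frames, bottleneck_nn, phoneme_scorers):
    """
    Hybrid NN/HMM recognition sketch for one phoneme segment.
    frames          : list of high-dimensional speech feature vectors
    bottleneck_nn   : callable mapping one frame to its dimensionality reduced feature
    phoneme_scorers : dict phoneme -> callable returning the HMM log-likelihood of a
                      reduced-feature sequence (e.g., computed with the forward algorithm)
    """
    reduced = np.stack([bottleneck_nn(f) for f in frames])   # NN pre-processing
    return max(phoneme_scorers, key=lambda ph: phoneme_scorers[ph](reduced))
```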

Neural Network Training
The NN is trained from labeled training data with the back-propagation algorithm.
The training target data for the output layer, corresponding to each phoneme, is generated using phoneme-specific binary codes.
(The slide figure illustrates the speech feature training space, the training targets and the transformed feature space.)
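
The talk does not spell out the binary coding scheme; the sketch below assumes simple one-of-N binary codes, purely to illustrate how per-frame training targets could be built from phoneme labels.

```python
import numpy as np

PHONEMES = ["sil", "ih", "n", "iy", "d"]     # tiny illustrative phoneme inventory

def make_targets(phoneme_labels, phonemes=PHONEMES):
    """Build the NN training target matrix: one binary code row per labeled frame."""
    codes = np.eye(len(phonemes))            # assumed one-of-N binary code per phoneme
    index = {ph: i for i, ph in enumerate(phonemes)}
    return np.stack([codes[index[ph]] for ph in phoneme_labels])

print(make_targets(["sil", "ih", "n", "iy", "d", "sil"]))
```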

Experiments
TIMIT database: a total of 6300 sentences, about 400 minutes; 10 sentences spoken by each of 630 speakers from 8 major dialect regions of the U.S.; time-aligned phonetic transcription is provided; 4680 sentences for training, 1620 for test.
HMM models (HTK toolkit ver. 3.4): 39 phonemes (39 HMMs), 5 states and 3 mixtures, bigram language model.

Feature Comparison
MFCC (Mel-Frequency Cepstrum Coefficients): the standard in speech recognition, but with limited feature dimensionality.
DCTC (Discrete Cosine Transform Coefficients): high-dimension dynamic feature.
Results in accuracy (HMM recognition only), where Accuracy = (N - D - I - H)/N, N: number of phonemes, D: deletion errors, I: insertion errors, H: substitution errors.

Num. of features   DCTC      MFCC
13                 51.13%    50.96%  (MFCC_E)
26                 59.15%    51.81%
39                 62.86%    64.77%  (MFCC_E_D_A)
91                 62.16%    ---
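
The accuracy definition above in code, with made-up error counts just to show the arithmetic.

```python
def phoneme_accuracy(n_ref, deletions, insertions, substitutions):
    """Recognition accuracy as defined on the slide: (N - D - I - H) / N."""
    return (n_ref - deletions - insertions - substitutions) / n_ref

# Example: 1000 reference phonemes, 80 deletions, 60 insertions, 250 substitutions
print(f"{phoneme_accuracy(1000, 80, 60, 250):.2%}")   # 61.00%
```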

Experimental Results for Hybrid Method 70 Reco ognition Acc curacy [%] 68 66 64 62 60 58 56 54 52 50 Training (91 Dim.) Test (91 Dim.) Training Data Test Data 50 30 25 20 15 13 10 6 4 Num of Dimensions in Reduced Feature Space Dimensionality reduced features yield higher accuracy than original 91 features 16

Conclusions
A hybrid Neural Network/Hidden Markov Model method is proposed.
Using the nonlinear transformation ability of neural networks, the hybrid method yields better performance.
Future work: exploring training target settings of the neural network for more effective feature dimensionality reduction, and global optimization of the neural network and Hidden Markov Model.