Phoneme Recognition Using Deep Neural Networks

CS229 Final Project Report, Stanford University
John Labiak
December 16, 2011

1 Introduction

Deep architectures, such as multilayer neural networks, can be used to learn highly complex, highly nonlinear functions by mapping inputs to outputs through multiple layers of nonlinear transformations. Problems in artificial intelligence (AI) are filled with very complex, poorly understood processes, and deep architectures have shown promise when applied to a variety of AI problems, such as visual object recognition [1] and natural language processing [2]. However, very little has been done to explore the benefits of deep architectures for automatic speech recognition (ASR). In a typical speech recognition system, a hidden Markov model (HMM) is used to model the sequential structure of speech, and Gaussian mixture models (GMMs) are used as density models of acoustic feature vectors to estimate the state-dependent probability distributions of the HMM. Recently, researchers have begun exploring ways to leverage the modeling capacity of deep neural networks (DNNs) for ASR. For example, it is possible to replace GMMs with DNNs for acoustic modeling within the HMM framework [3]. DNNs have also been applied within a new paradigm for ASR that replaces the traditionally used HMM with segmental conditional random fields (SCRFs); within this framework, DNNs have been used to construct phoneme recognizers whose outputs are fed as an additional feature to the SCRF model.

Deep neural networks are typically constructed by stacking multilayer neural networks, such as denoising autoencoders. Previous work has suggested the importance of context for recognition performance [4]. One approach to generating a large context window within a stacked architecture is to concatenate the posterior outputs of a classifier over a window of frames, and then use these posterior features as inputs to a second classifier. In this work, we test the effect of larger context on phoneme recognition for a softmax classifier. In particular, we construct a phoneme recognizer by stacking softmax classifiers, using concatenated posterior outputs from one softmax classifier as posterior features for a second softmax classifier.

The contribution of this work is twofold. First, we gain insight into the importance of using a large context for phoneme recognition; in particular, we test the idea that a large context improves phoneme recognition by enabling a softmax classifier to learn temporal patterns and phonotactics from the training set. Second, the work done here sets the stage for future work in deep learning: the insight gained into the role of context can be used to build a better phone detector, which can then be used as an additional feature for an ASR system based on the SCRF model.

The remainder of this paper is organized as follows. First, we describe the methods used in our work. Next, we present the results of our experiments. Finally, we conclude with a discussion of the results and future directions of our work.

2 Methods

2.1 Softmax Regression

Softmax regression generalizes logistic regression to classification problems in which there are more than two classes. Suppose we have k classes (i.e., y^{(i)} \in \{1, 2, \ldots, k\}). Then, for a given feature vector x^{(i)} \in \mathbb{R}^{n+1} (where x_0 = 1), the softmax classifier outputs a vector of posterior probabilities:

    h_\theta(x^{(i)}) =
    \begin{bmatrix}
        p(y^{(i)} = 1 \mid x^{(i)}; \theta) \\
        p(y^{(i)} = 2 \mid x^{(i)}; \theta) \\
        \vdots \\
        p(y^{(i)} = k \mid x^{(i)}; \theta)
    \end{bmatrix}
    =
    \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x^{(i)}}}
    \begin{bmatrix}
        e^{\theta_1^T x^{(i)}} \\
        e^{\theta_2^T x^{(i)}} \\
        \vdots \\
        e^{\theta_k^T x^{(i)}}
    \end{bmatrix}

where \theta_1, \theta_2, \ldots, \theta_k \in \mathbb{R}^{n+1} are chosen to minimize the regularized cross-entropy cost

    J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y^{(i)} = j\} \log p(y^{(i)} = j \mid x^{(i)}; \theta)
                + \frac{\lambda}{2} \sum_{i=1}^{k} \sum_{j=0}^{n} \theta_{ij}^2

over a set of training examples \{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}. The softmax classifier then assigns x^{(i)} to the class with the largest posterior probability; that is, it sets \hat{y}^{(i)} = \arg\max_j p(y^{(i)} = j \mid x^{(i)}; \theta). Figure 1(a) shows a diagram of the softmax classifier, with n + 1 input nodes and k output nodes.

[Figure 1: Overview of the architectures used in our experiments. (a) Softmax classifier; (b) stacked softmax classifier with 23-frame posterior feature vectors.]

2.2 Stacked Softmax Classifier

For the stacked softmax classifier, the posterior probability vectors output by a first softmax classifier (Softmax 1) are concatenated over a window of speech frames and used as input to a second softmax classifier (Softmax 2). Figure 1(b) shows an overview of this architecture for the case in which a 23-frame window is used to construct the posterior features. Note that we use two sets of context frames within this stacked architecture: 21-frame MFCC feature vectors as input to the first softmax classifier, and 23-frame posterior features as input to the second. The details of the construction of these two feature sets are given below.
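To make the two pieces above concrete, here is a minimal NumPy sketch of the softmax posterior computation (Section 2.1) and the posterior-window stacking (Section 2.2). The function names, array shapes, and the edge-replication padding are our illustrative assumptions; the report does not specify an implementation.

```python
import numpy as np

def softmax_posteriors(Theta, X):
    """Posterior probability vectors h_theta(x) for each row of X.

    Theta: (k, n+1) weight matrix, one row theta_j per class.
    X:     (m, n+1) feature matrix whose first column is the bias x_0 = 1.
    Returns an (m, k) matrix of posteriors; each row sums to 1.
    """
    scores = X @ Theta.T                          # theta_j^T x^(i) for all i, j
    scores -= scores.max(axis=1, keepdims=True)   # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def stack_window(P, context):
    """Concatenate frame posteriors over an odd-sized context window.

    P: (T, k) posteriors from the first classifier, one row per frame.
    Returns (T, context * k) posterior features for the second classifier.
    Utterance edges are handled by repeating the first/last frame
    (our assumption; the report does not say how edges were treated).
    """
    half = context // 2
    padded = np.vstack([P[:1]] * half + [P] + [P[-1:]] * half)
    return np.hstack([padded[i:i + len(P)] for i in range(context)])
```

For the 42-class label set used below, a 23-frame window gives the second classifier 23 × 42 = 966-dimensional inputs (before the bias term is appended).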

2.3 Experimental Setup

Our experiments are based on phonetic classification of frames of speech from the Broadcast News database (approximately 430 hours of data). For the first softmax classifier, we construct acoustic feature vectors by concatenating 13-dimensional Mel-frequency cepstral coefficients (MFCCs) over a window of 21 speech frames, and we preprocess the data using PCA to whiten the features, yielding 139-dimensional feature vectors. For the second softmax classifier, we construct posterior feature vectors by concatenating the posterior outputs of the first softmax classifier over a window of speech frames; we perform experiments with 3-, 13-, and 23-frame context windows. We use force-aligned outputs of a speech recognizer as ground truth for the phoneme labels. The set of phonetic labels contains 42 classes: 41 non-silence classes and 1 silence class, where silence, background noise, and voiced noise are all mapped to a single silence token.

We divide the data into subsets for training and testing the performance of the classifiers. We train both softmax classifiers using mini-batch L-BFGS, with batch sizes of 20 files and 5 files (each file contains approximately one hour of speech) for the first and second softmax classifiers, respectively. For both classifiers, we regularize the cost function using a weight decay parameter of \lambda = 10^{-4}. Furthermore, for each classifier, we initialize the parameters for each batch using the average parameter values across all previous batches, and train on each batch for 20 iterations. After training is complete, we evaluate the classifiers by recording frame-level accuracies for the phonetic classification task on the held-out test set.
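The batch-wise optimization just described (L-BFGS on one batch at a time, each batch warm-started from the running average of the previous solutions) can be sketched as follows. The cost/gradient callback, the zero initialization of the first batch, and the use of SciPy's L-BFGS-B routine are our assumptions, not details from the report.

```python
import numpy as np
from scipy.optimize import minimize

def train_batchwise(cost_and_grad, batches, dim, iters=20):
    """Mini-batch L-BFGS training with running-average warm starts (a sketch).

    cost_and_grad(theta, batch) -> (J, dJ/dtheta) for the regularized
    softmax cost; `batches` is an iterable of (X, y) file batches.
    Each batch is optimized for `iters` L-BFGS iterations, starting from
    the average of the solutions found on all previous batches.
    """
    solutions = []
    theta = np.zeros(dim)                      # first batch starts from zero (assumption)
    for batch in batches:
        res = minimize(cost_and_grad, theta, args=(batch,),
                       jac=True, method="L-BFGS-B",
                       options={"maxiter": iters})
        solutions.append(res.x)
        theta = np.mean(solutions, axis=0)     # warm start for the next batch
    return theta
```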

3 Results

Table 1 displays the frame-level accuracies for our phoneme recognition task on the Broadcast News data. Using a large context improves the performance of our softmax classifier, with the largest gains seen as the context window grows from 3 to 13 frames, and the gains leveling off thereafter.

    Classifier                             Accuracy (%)
    -------------------------------------  ------------
    Softmax                                    37.94
    Stacked-softmax w/ 3-frame context         40.77
    Stacked-softmax w/ 13-frame context        44.34
    Stacked-softmax w/ 23-frame context        45.49

    Table 1: Frame-level accuracies for phoneme recognition on the Broadcast News data.

4 Discussion

In this work, we tested the effect of context on phoneme recognition for a softmax classifier. In particular, we constructed a simple neural network by stacking softmax classifiers, using the concatenated posterior outputs of one softmax classifier as inputs to a second. By stacking softmax classifiers in this manner, we gained insight into the importance of context for the classification of phonetically labeled speech frames: stacking improves frame-level accuracy over a single softmax classifier, and accuracy improves with increasing context, in the range of 3 to 23 context frames for the posterior features.

[Figure 2: Neural network architectures. (a) Stacked softmax classifier with 3-frame context; (b) two-layer neural network; (c) deep neural network.]

The work considered here is part of a larger project to construct a phoneme recognizer that can be fed as a detector stream to an ASR system based on the SCRF model; it therefore provides insight into which neural network architectures work well for phoneme recognition. The stacked softmax classifier creates a large context by using a window of concatenated posterior feature vectors. Figure 2(a) shows a stacked softmax classifier in which a 3-frame context window is used to construct the posterior features. This architecture is a special case of a two-layer neural network (Figure 2(b)). Future work will consider alternative architectures for constructing a phone recognizer. For example, we might consider deeper architectures, such as that in Figure 2(c), which can be constructed by stacking many multilayer neural networks. Deeper networks include additional layers of nonlinearity, and experiments with such architectures will give us insight into the effect of these additional layers; in particular, there is great interest in comparing the effects of context versus additional layers of nonlinearity for phoneme recognition. We might also consider alternative approaches to generating a large context, such as constructing acoustic features by concatenating a much larger window of MFCC features. Finally, future work may examine the effects of context and additional nonlinearity together, by considering deep architectures that generate large context using concatenated posterior features.
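To make the "special case of a two-layer neural network" observation concrete: a stacked softmax classifier with a (2c + 1)-frame posterior window (c is our notation, not the report's) computes

    h(x_t) = \operatorname{softmax}\big(\Theta_2 \, [\, h_1(x_{t-c}); \ldots; h_1(x_{t+c}) \,]\big),
    \quad \text{where} \quad h_1(x) = \operatorname{softmax}(\Theta_1 x),

i.e., a two-layer network whose hidden layer uses a softmax nonlinearity and whose first-layer weights \Theta_1 are shared across all frames of the window.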

References

[1] M. A. Ranzato, F. J. Huang, Y.-L. Boureau, and Y. LeCun, "Unsupervised learning of invariant feature hierarchies with applications to object recognition," in Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-07), 2007, pp. 1–8.

[2] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proceedings of the 25th International Conference on Machine Learning (ICML-08), 2008, pp. 160–167.

[3] G. Zweig, P. Nguyen, D. Van Compernolle, K. Demuynck, L. Atlas, P. Clark, G. Sell, F. Sha, M. Wang, A. Jansen, H. Hermansky, D. Karakos, S. Thomas, G. S. V. S. Sivaram, K. Kintzley, S. Bowman, and J. Kao, "Speech Recognition with Segmental Conditional Random Fields: Final Report from the 2010 JHU Summer Workshop," Tech. Rep. MSR-TR-2010-173, November 2010.

[4] S. Thomas, P. Nguyen, G. Zweig, and H. Hermansky, "MLP Based Phoneme Detectors for Automatic Speech Recognition," Microsoft Research, 2010.