Deep Neural Networks for Acoustic Modelling. Bajibabu Bollepalli, Hieu Nguyen, Rakshith Shetty, Pieter Smit (Mentor)


Introduction: Automatic speech recognition. Pipeline (from the slide figure): Speech signal -> Feature extraction -> Acoustic modelling -> Decoder (+ Language modelling) -> Recognized text.

Introduction: Acoustic modelling using deep neural networks. Same pipeline as above, with the acoustic-modelling block highlighted: Speech signal -> Feature extraction -> Acoustic modelling (DNN) -> Decoder (+ Language modelling) -> Recognized text.

Background
- HMM-GMMs have prevailed in ASR for the last four decades; it has been difficult for any new method to outperform them for acoustic modelling.
- Can GMMs capture all the information in acoustic features? No: they are inefficient at modelling data that lie on or near a nonlinear manifold in the data space.
- Need for better models: artificial neural networks (ANNs) are known to capture nonlinearities in the data, so it is natural to think of ANNs as an alternative to GMMs.

Background
- ANNs are not new to speech recognition: two decades back, researchers already employed ANNs for ASR, but they were unable to outperform GMMs.
- Hardware and learning algorithms restricted the capacity of those ANNs.
- Advances in hardware as well as in machine-learning algorithms now allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs).
- DNNs outperform GMMs (finally ;) )

Deep Neural Networks (DNNs)
Feed-forward ANNs with more than one hidden layer.

Our task
- Frame-based phoneme recognition using simple DNNs.
- Experiments with various input features.
- Compare the results with GMMs.
- Try more complex DNNs (if time permits): deep belief networks (DBNs), recurrent neural networks (RNNs).

Database
- Training data: 151 Finnish speech sentences (~15 min)
- Development data: 135 sentences (~11 min)
- Evaluation data: 100 sentences (~8 min)

Simple DNN
- Similar to a multi-layer perceptron (MLP)
- Hidden layers: [300, 300]
- Activations: sigmoid
- Optimization: stochastic gradient descent (SGD)
- Error criterion: categorical cross-entropy
- Software tool: Keras
- Input: 39-dimensional MFCC features
- Output: 24 Finnish phonemes
- Normalization: mean-variance
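
A minimal sketch of such a model in Keras (the tool named above). Layer sizes, activations, loss, and optimizer follow the slide; the learning rate, batch size, and epoch count are assumptions, and the code is written against the current tf.keras API rather than the original Keras/Theano setup.

    # Minimal MLP acoustic-model sketch; assumed hyperparameters are marked below.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_PHONEMES = 24   # output classes (Finnish phonemes)
    INPUT_DIM = 39      # 39-dimensional MFCC vector for a single frame

    model = keras.Sequential([
        layers.Dense(300, activation="sigmoid", input_shape=(INPUT_DIM,)),
        layers.Dense(300, activation="sigmoid"),
        layers.Dense(NUM_PHONEMES, activation="softmax"),
    ])

    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.01),  # learning rate is an assumption
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )

    # x_train: mean-variance normalised MFCC frames, y_train: one-hot phoneme labels
    # model.fit(x_train, y_train, epochs=20, batch_size=256)   # epochs/batch size assumed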

Performance of simple DNN (MLP)

Input feature                           Frame-wise accuracy (%)
Single frame [t]                        63.81
Three frames [t-1, t, t+1]              67.59
Five frames [t-2, t-1, t, t+1, t+2]     67.22
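
The multi-frame inputs in the table are simply neighbouring MFCC vectors concatenated around the current frame. A small NumPy sketch of that splicing (padding the edges by repeating the first/last frame is one common choice and an assumption here, not something stated on the slides):

    import numpy as np

    def splice_frames(mfcc, context):
        """Stack each frame with `context` frames on either side.
        mfcc: (num_frames, 39) array -> (num_frames, 39 * (2*context + 1))."""
        padded = np.pad(mfcc, ((context, context), (0, 0)), mode="edge")
        windows = [padded[i : i + len(mfcc)] for i in range(2 * context + 1)]
        return np.concatenate(windows, axis=1)

    frames = np.random.randn(1000, 39)        # stand-in for real MFCC features
    print(splice_frames(frames, 1).shape)     # (1000, 117): [t-1, t, t+1]
    print(splice_frames(frames, 2).shape)     # (1000, 195): [t-2 .. t+2]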

DBN

Deep Belief Network (DBN)
- This network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of being only randomly initialized. After the model is pre-trained, the weights are fine-tuned; this fine-tuning is the same process as training a plain MLP.
- The pre-training step is unsupervised (the true target labels are not used): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.
- Fine-tuning is a supervised training step, in which we try to maximize prediction accuracy on the labelled data points.

Restricted Boltzmann Machine (RBM)
- A type of generative neural network: the idea is to define an energy surface that induces a probability density over the data.
- Energy: E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j
- Probability density: p(v, h) = e^{-E(v, h)} / Z, with Z = \sum_{v, h} e^{-E(v, h)}
- Optimize the log-likelihood via \partial \log p(v) / \partial w_{ij} = <v_i h_j>_{data} - <v_i h_j>_{model}; use Gibbs sampling for <.>_{model}.
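
A compact NumPy sketch of one contrastive-divergence (CD-1) update, which is the usual way the <.>_{model} term is approximated with a single Gibbs step. This uses a Bernoulli RBM for simplicity (real-valued MFCC features would normally use a Gaussian-Bernoulli variant); sizes and the learning rate are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, b_vis, b_hid, lr=0.01):
        """One CD-1 update of RBM parameters on a batch of visible vectors v0."""
        # Positive phase: sample hidden units given the data.
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # One Gibbs step: reconstruct visibles, then recompute hidden probabilities.
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        # Approximate <v h>_data - <v h>_model and update the parameters.
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
        return W, b_vis, b_hid

    # Illustrative sizes: 117-dim spliced input, 300 hidden units.
    W = 0.01 * rng.standard_normal((117, 300))
    b_vis, b_hid = np.zeros(117), np.zeros(300)
    batch = rng.random((64, 117))                 # stand-in data in [0, 1]
    W, b_vis, b_hid = cd1_step(batch, W, b_vis, b_hid)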

DBN pre-training
- Stack of RBMs: each pair of consecutive layers is trained as an RBM, with the lower layer acting as the visible layer and the upper layer as the hidden layer.
- The process is done bottom-up.
- Iterate for multiple epochs.
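
The bottom-up stacking can be illustrated with scikit-learn's BernoulliRBM: each RBM is trained on the activations produced by the one below it, and the learned weights would then initialise the corresponding MLP layers before fine-tuning. This is only a sketch, not the project's Theano code; the layer sizes and iteration count are assumptions (the 1e-5 learning rate matches the value chosen on the Experiments slide).

    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    layer_sizes = [300, 300]        # hidden-layer sizes to pre-train, bottom-up
    X = np.random.rand(1000, 117)   # stand-in for (normalised) spliced MFCC frames

    rbms = []
    layer_input = X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=1e-5, n_iter=10)
        rbm.fit(layer_input)                       # unsupervised: no phoneme labels used
        rbms.append(rbm)
        layer_input = rbm.transform(layer_input)   # hidden activations feed the next RBM

    # rbm.components_ (shape: n_hidden x n_visible) would initialise the MLP weights
    # before supervised fine-tuning with the phoneme labels.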

Setups
- Using Theano-based tutorial code from deeplearning.net.
- Hidden layers use the sigmoid activation function; the prediction (top) layer is a softmax layer.
- The loss function is categorical cross-entropy.
- The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities).
- Each input is MFCCs in a context of 3 frames (triphone).

Experiments
- Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen.
- Train with and without pre-training to compare.
- The number of hidden layers varies from 1 to 3.
- The size of each hidden layer varies from 100 to 600 (some configurations with sizes 500 and 600 were not trained).
- Experiments with some 3-hidden-layer hourglass models did not show real improvement.

DBN Results
- The best model is the non-pretrained 500_500 network; its accuracy on the validation set is 66.82%.
- The table shows prediction accuracy (%) of the trained models on the validation set (entries marked "-" were not trained).

Model size      Pre-trained   Iterations   Non-pretrained   Iterations
100             60.188        48934        60.344           39830
200             61.235        44382        62.792           48934
300             61.387        39830        62.721           39830
400             61.284        42106        63.561           37554
100_100         61.641        48934        62.638           44382
200_200         63.106        47796        64.266           39830
300_300         63.808        46658        64.716           37554
400_400         63.741        51210        64.634           33002
500_500         -             -            66.820           33002
600_600         -             -            65.327           30726
100_100_100     62.237        55762        62.926           46658
200_200_200     63.589        53486        64.19            40968
300_300_300     63.572        44382        63.73            33002
400_400_400     63.106        44382        64.941           35278

Recurrent Networks

Recurrent Neural Networks
- The output of a recurrent network at time t depends on the input at time t as well as on the state of the network at time t-1.
- Recurrent networks are thus ideal for modelling sequences, as time dependencies can be learnt in the recurrent weights.
- For phoneme classification it is now easy to include an arbitrary amount of context, i.e. previous frames within a window.
- Infinitely deep, in a sense.
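
Written out, a standard (Elman-style) recurrence makes this dependence explicit; the exact parameterisation used in the project is not given on the slides, so this is just the textbook form:

    h_t = \sigma(W_{xh} x_t + W_{hh} h_{t-1} + b_h)
    y_t = \mathrm{softmax}(W_{hy} h_t + b_y)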

Our Model
- We use a fixed context size, with frames from t-context up to t fed into the RNN.
- The hidden state of the RNN at time t is then used to predict the class of the frame at time t.

Learning in recurrent nets
- We can compute the error at time t (cross-entropy error) and backpropagate the gradients through time, similar to backpropagation in an MLP.
- The problem is that these gradients can die out or blow up if the sequence is very long.
- One solution for exploding gradients is to truncate the depth in time to which you backpropagate.
- Another solution is to use more complex recurrent units such as LSTMs.

LSTM Cell
- Consists of a memory unit and 3 gates.
- Each gate is affected by the current input and the previous output state of the cell.
- The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell.
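
For reference, the standard LSTM equations (the common form without peephole connections, as in e.g. the Graves et al. paper cited in the references). The memory unit is c_t and the three gates are the input gate i_t, forget gate f_t, and output gate o_t:

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
    h_t = o_t \odot \tanh(c_t)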

Learning Details and Regularization
- We use the RMSprop learning algorithm, a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients.
- Regularize using dropout: for each training sample some units are randomly switched off. This forces each unit to learn something useful and not to co-depend too much on the others.
- Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to the recurrent connections.
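
Putting this and the two previous slides together, a minimal Keras sketch of the fixed-context LSTM frame classifier with RMSprop and dropout on the non-recurrent connections. The unit count, context size, and dropout rate match the experiments on the next slide; the exact layer arrangement (a dense "embedding" layer before the LSTM), batch size, and optimizer defaults are assumptions rather than the project's code.

    # Sketch of a fixed-context LSTM frame classifier (tf.keras API; assumed details noted).
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    CONTEXT = 10        # frames t-CONTEXT .. t fed to the RNN
    INPUT_DIM = 39      # MFCC dimension per frame
    NUM_PHONEMES = 24

    model = keras.Sequential([
        # "Embedding" layer applied frame-wise, with dropout after it (assumed form).
        layers.Dense(200, activation="sigmoid", input_shape=(CONTEXT + 1, INPUT_DIM)),
        layers.Dropout(0.3),
        layers.LSTM(200),            # final hidden state = state at time t
        layers.Dropout(0.3),         # dropout on the output side, not on recurrent connections
        layers.Dense(NUM_PHONEMES, activation="softmax"),
    ])

    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    # x: (num_frames, CONTEXT + 1, 39) spliced sequences, y: one-hot phoneme labels
    # model.fit(x, y, epochs=20, batch_size=256)   # epochs/batch size assumed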

Results with RNNs - Accuracies

Context 10, 200 units, dropout 0.3:
Type of unit        simple   lstm
Accuracy on Eval    66.43    67.76

LSTM, context 10, dropout 0.3:
Size of network     50       100      200
Accuracy on Eval    67.79    68.11    67.76

LSTM, 200 units, dropout 0.3:
Context window      5        10       20
Accuracy on Eval    68.11    67.76    68.76

LSTM, context 10, 200 units:
Dropout prob        0.0      0.3      0.5      0.7
Accuracy on Eval    66.47    67.76    68.21    68.19

Summary Results: All Models

Model                   MLP      DBN      RNN
Accuracy on Eval (%)    67.59    66.82    68.76

(The MLP figure is the three-frame context-window result, the DBN figure the best 500_500 model, and the RNN figure the context-20 LSTM.)

Source code is available on GitHub: https://github.com/rakshithshetty/dnn-speech

References
- George E. Dahl, Abdel-rahman Mohamed, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, Volume 20, Issue 1, 2012.
- Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. ArXiv, abs/1303.5778, 2013.
- Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.
- The DBN implementation code is taken and modified from the tutorial on deeplearning.net.

Questions?