Deep Neural Networks for Acoustic Modelling
Bajibabu Bollepalli, Hieu Nguyen, Rakshith Shetty, Pieter Smit (Mentor)
Introduction
Automatic speech recognition pipeline: Speech signal → Feature Extraction → Acoustic Modelling + Language Modelling → Decoder → Recognized text
Introduction
Acoustic modelling using deep neural networks: the DNN replaces the Acoustic Modelling block in the same pipeline (Speech signal → Feature Extraction → Acoustic Modelling → Decoder → Recognized text, with Language Modelling feeding the decoder).
Background
- HMM-GMMs have prevailed in ASR for the last four decades; it has been difficult for any new method to outperform them for acoustic modelling.
- Can GMMs capture all the information in acoustic features? No: they are inefficient at modelling data that lie on or near a nonlinear manifold in the data space.
- Need for better models: artificial neural networks (ANNs) are known to capture nonlinearities in the data, so it is natural to consider ANNs as an alternative to GMMs.
Background
- ANNs are not new to speech recognition: two decades ago researchers already applied ANNs to ASR, but they were unable to outperform GMMs.
- Hardware and learning algorithms at the time restricted the capacity of ANNs.
- Advances in hardware as well as in machine learning algorithms now allow us to train large multilayer (deep) ANNs, called Deep Neural Networks (DNNs).
- DNNs outperform GMMs (finally ;) )
Deep Neural Networks (DNNs)
Feed-forward ANNs with more than one hidden layer.
Our task
- Frame-based phoneme recognition using simple DNNs
- Experiments with various input features
- Compare the results with GMMs
- Try complex DNNs (if time permits): deep belief networks (DBNs), recurrent neural networks (RNNs)
Database
- Training data: 151 Finnish speech sentences (~15 min)
- Development data: 135 sentences (~11 min)
- Evaluation data: 100 sentences (~8 min)
Simple DNN
- Similar to a multi-layer perceptron (MLP)
- Hidden layers: [300, 300]
- Activations: sigmoid
- Optimization: stochastic gradient descent (SGD)
- Error criterion: categorical cross-entropy
- Software tool: Keras
- Input: 39-dimensional MFCC features
- Output: 24 Finnish phonemes
- Normalization: mean-variance
A minimal Keras sketch of this setup follows below.
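A sketch of this network using the modern Keras API; the learning rate, batch size, epoch count, and placeholder data are assumptions, not values from the slides.

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Two sigmoid hidden layers of 300 units, softmax over 24 phonemes,
# SGD with categorical cross-entropy, as listed on this slide.
model = Sequential()
model.add(Dense(300, input_dim=39, activation='sigmoid'))
model.add(Dense(300, activation='sigmoid'))
model.add(Dense(24, activation='softmax'))
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Mean-variance normalization of the MFCC features (X: frames x 39).
X = np.random.randn(1000, 39)                         # placeholder data
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
y = np.eye(24)[np.random.randint(0, 24, size=1000)]   # placeholder one-hot labels
model.fit(X, y, epochs=10, batch_size=256)
```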
Performance of simple DNN (MLP)

Input feature                        | Frame-wise accuracy (%)
Single frame [t]                     | 63.81
Three frames [t-1, t, t+1]           | 67.59
Five frames [t-2, t-1, t, t+1, t+2]  | 67.22
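The multi-frame inputs above are formed by concatenating each frame with its neighbours. A small sketch of that stacking; the helper name is illustrative, not from the project code.

```python
import numpy as np

def stack_frames(mfcc, context=1):
    """Concatenate each frame with `context` neighbours on both sides."""
    T = mfcc.shape[0]
    padded = np.pad(mfcc, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

mfcc = np.random.randn(500, 39)        # placeholder utterance: 500 frames of 39-dim MFCCs
three_frame = stack_frames(mfcc, 1)    # [t-1, t, t+1]  -> 117-dim inputs
five_frame = stack_frames(mfcc, 2)     # [t-2 .. t+2]   -> 195-dim inputs
```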
DBN
Deep Belief Network (DBN)
- This network is similar to an MLP, but the weights are pre-trained using a stack of Restricted Boltzmann Machines (RBMs) instead of relying only on random initialization.
- After pre-training, the weights are fine-tuned; this step is the same as training a plain MLP.
- Pre-training is unsupervised (it does not use the true target labels): we try to regenerate the input x from the hidden representation induced by x. The knowledge learned is encoded in the values of the weights.
- Fine-tuning is the supervised training step, where we maximize prediction accuracy on the labelled data points.
Restricted Boltzmann Machine (RBM)
- A type of generative neural network: the idea is to define an energy surface, i.e. a probability density over the data.
- Energy: E(v, h) = -a^T v - b^T h - v^T W h
- Probability density: p(v, h) = exp(-E(v, h)) / Z, where Z is the partition function
- Optimize the log-likelihood; the gradient for W is <v h^T>_data - <v h^T>_model
- Use Gibbs sampling to estimate <.>_model
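A toy contrastive-divergence (CD-1) update for a binary-unit RBM, illustrating the <v h>_data - <v h>_model gradient above. Real-valued MFCC inputs would normally use Gaussian visible units; all names and values here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr=1e-5):
    """One CD-1 update on a mini-batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden units driven by the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (np.random.rand(*p_h0.shape) < p_h0).astype(v0.dtype)
    # Negative phase: one Gibbs step back down and up again.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    # Gradient estimate: <v h>_data - <v h>_model.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
```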
DBN pre-training
- Stack of RBMs: two consecutive layers are trained as an RBM, with the lower layer acting as the visible layer and the upper layer as the hidden layer.
- The process is done bottom-up.
- Iterate for multiple epochs.
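A compact illustration of the bottom-up stacking using scikit-learn's BernoulliRBM as a stand-in for the project's Theano code; sizes and data are placeholders.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

X = np.random.rand(1000, 117)          # placeholder 3-frame MFCC inputs, scaled to [0, 1]
layer_sizes = [300, 300]

rbms, data = [], X
for n_hidden in layer_sizes:
    # Train an RBM on the current representation, then feed its hidden
    # activations upward as the visible data for the next RBM.
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=1e-5, n_iter=15)
    data = rbm.fit_transform(data)
    rbms.append(rbm)
# The learned RBM weights would then initialize the DNN before supervised fine-tuning.
```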
Setup
- Using Theano-based tutorial code from deeplearning.net
- Hidden layers use the sigmoid activation function.
- The prediction (top) layer is a softmax layer.
- The loss function is categorical cross-entropy.
- The output is either the predicted label (one of 24 phonemes) or the probabilities of the 24 phonemes (the predicted label is the argmax of the probabilities).
- Each input is an MFCC feature vector with a context of 3 frames.
Experiments
- Pre-training is tricky; after some rough estimates, a pre-training learning rate of 1e-5 was chosen.
- Models are trained with and without pre-training for comparison.
- The number of hidden layers varies from 1 to 3.
- The size of each hidden layer varies from 100 to 600 (some configurations with sizes 500 and 600 were not trained).
- Experiments with some 3-hidden-layer hourglass models did not show real improvement.
DBN results
The best model is the non-pretrained 500_500 network; its accuracy on the validation set is 66.82%. The table shows the prediction accuracy (%) of the trained models on the validation set.

Model size   | Pre-trained | Iterations | Non-pretrained | Iterations
100          | 60.188      | 48934      | 60.344         | 39830
200          | 61.235      | 44382      | 62.792         | 48934
300          | 61.387      | 39830      | 62.721         | 39830
400          | 61.284      | 42106      | 63.561         | 37554
100_100      | 61.641      | 48934      | 62.638         | 44382
200_200      | 63.106      | 47796      | 64.266         | 39830
300_300      | 63.808      | 46658      | 64.716         | 37554
400_400      | 63.741      | 51210      | 64.634         | 33002
500_500      | -           | -          | 66.820         | 33002
600_600      | -           | -          | 65.327         | 30726
100_100_100  | 62.237      | 55762      | 62.926         | 46658
200_200_200  | 63.589      | 53486      | 64.190         | 40968
300_300_300  | 63.572      | 44382      | 63.730         | 33002
400_400_400  | 63.106      | 44382      | 64.941         | 35278
Recurrent Networks
Recurrent Neural Networks
- The output of a recurrent network at time t depends on the input at time t as well as the state of the network at time t-1.
- They are thus ideal for modelling sequences, since time dependencies can be learnt in the recurrent weights.
- For phoneme classification it is now easy to include an arbitrary amount of context, i.e. previous frames within a window.
- Infinitely deep in a sense.
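In symbols, the state update of a simple recurrent network is (standard notation, not taken from the slides):

```latex
h_t = \sigma\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \qquad
y_t = \operatorname{softmax}\left(W_{hy}\, h_t + b_y\right)
```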
Our model
- We use a fixed context size: frames from time t-context up to t are fed into the RNN.
- The hidden state of the RNN at time t is then used to predict the class of the frame at time t.
Learning in recurrent nets
- We compute the error at time t (cross-entropy) and backpropagate the gradients through time, similarly to backpropagation in an MLP.
- The problem is that these gradients can die out or blow up if the sequence is very long.
- One solution for exploding gradients is to truncate the depth in time through which you propagate.
- Another solution is to use more complex recurrent units such as LSTMs.
LSTM cell
- Consists of a memory unit and 3 gates.
- Each gate is affected by the current input and the previous output state of the cell.
- The 3 gates control data flow into the memory, retention of the memory, and activation of the output from the cell.
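For reference, the standard LSTM equations (without peephole connections) behind this description; the notation is the conventional one, not taken from the slides:

```latex
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad \text{(input gate)} \\
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad \text{(forget gate)} \\
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad \text{(output gate)} \\
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t = o_t \odot \tanh(c_t)
```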
Learning details and regularization
- We use the RMSprop learning algorithm, a form of gradient descent where the learning rate is automatically scaled by the RMS value of the most recent gradients.
- We regularize using dropout: for each training sample some units are randomly switched off. This forces each unit to learn something useful and not co-depend too much on the others.
- Dropout is applied only in the embedding and output layers; it is a bad idea to apply it to the recurrent connections (see the sketch below).
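A minimal Keras sketch of such a model, assuming inputs shaped (context, 39) and dropout only on the non-recurrent output path. Hyperparameter values follow the slides where given; the rest are assumptions.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

context = 10   # frames t-9 .. t fed to the recurrent net

model = Sequential()
# The hidden state after the last frame summarizes the whole context window.
model.add(LSTM(200, input_shape=(context, 39)))
model.add(Dropout(0.3))                      # dropout only outside the recurrence
model.add(Dense(24, activation='softmax'))   # one of 24 Finnish phonemes
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_windows, y_onehot, ...) with X_windows of shape (N, context, 39)
```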
Results with RNNs: frame-wise accuracies (%)

Unit type (context 10, 200 units, dropout 0.3):
Type of unit      | simple | LSTM
Accuracy on eval  | 66.43  | 67.76

Network size (LSTM, context 10, dropout 0.3):
Size of network   | 50    | 100   | 200
Accuracy on eval  | 67.79 | 68.11 | 67.76

Context window (LSTM, 200 units, dropout 0.3):
Context window    | 5     | 10    | 20
Accuracy on eval  | 68.11 | 67.76 | 68.76

Dropout (LSTM, context 10, 200 units):
Dropout prob      | 0.0   | 0.3   | 0.5   | 0.7
Accuracy on eval  | 66.47 | 67.76 | 68.21 | 68.19
Summary: results for all models

Model                 | Context-window MLP | DBN   | RNN
Accuracy on eval (%)  | 67.59              | 66.82 | 68.76
Source code is available on GitHub: https://github.com/rakshithshetty/dnn-speech
References
George E. Dahl, Abdel-rahman Mohamed, and Geoffrey Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 2012.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. arXiv:1303.5778, 2013.
Some figures are taken from Prof. Juha Karhunen's slides for the course Machine Learning and Neural Networks.
The DBN implementation code is taken and modified from the tutorial on deeplearning.net.
Questions?