Speech Recognition
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amnon Drory & Matan Karo
19/12/2017
Overview

Automatic Speech Recognition
100,000,000 (H.M.U = Hundred Million Users)
The Task: Good Speech Recognition

Traditional Speech Recognition (ASR)

Traditional ASR + Deep Learning
Baidu's Approach: End-to-End Neural Net
Speed up
Training Data: Annotated Audio
Thousands of hours of annotated speech for training, in both English and Mandarin.
Training Data: Raw Text
Use large amounts of raw text to learn about the language. This helps in understanding speech: which words are common, and which word is plausible in the current context.
Lecture Plan
Overview
Input
Output + CTC
Model Architecture
Results
Input
ASR
A complete speech application:
Speech transcription
Word spotting / trigger word detection
Speaker identification / verification
Audio Input
Input: raw audio, a 1D signal
Pre-processing: spectrogram
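To make the pre-processing step concrete, here is a minimal sketch of spectrogram extraction using scipy. The window and step sizes are illustrative assumptions, not the paper's exact values.

```python
import numpy as np
from scipy import signal

def spectrogram(audio, sample_rate=16000, window_ms=20, step_ms=10):
    """Convert a 1-D audio signal into a log-magnitude spectrogram."""
    nperseg = int(sample_rate * window_ms / 1000)          # samples per window
    noverlap = nperseg - int(sample_rate * step_ms / 1000) # window overlap
    freqs, times, spec = signal.spectrogram(audio, fs=sample_rate,
                                            nperseg=nperseg, noverlap=noverlap)
    return np.log(spec + 1e-10)  # log compression tames the dynamic range

# Usage: one second of random "audio"
audio = np.random.randn(16000).astype(np.float32)
features = spectrogram(audio)
print(features.shape)  # (frequency_bins, time_frames)
```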
Preprocessing: SortaGrad
Dealing with utterances of different lengths: in the first training epoch, iterate through minibatches in increasing order of utterance length, keeping similar-length utterances together (LibriSpeech clean data).
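A minimal sketch of SortaGrad-style batching, assuming `dataset` is a list of (audio, transcript) pairs: the first epoch is length-ordered, later epochs are shuffled as usual.

```python
import random

def sortagrad_batches(dataset, batch_size, epoch):
    # sort utterance indices by audio length (shortest first)
    order = sorted(range(len(dataset)), key=lambda i: len(dataset[i][0]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if epoch > 0:               # only the first epoch is curriculum-ordered
        random.shuffle(batches)
    return [[dataset[i] for i in batch] for batch in batches]
```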
Preprocessing: Data Augmentation
Additive noise:
increases robustness to noisy speech
increases the data set: 10k hours of raw audio -> 100k hours
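A minimal sketch of additive-noise augmentation; the SNR parameterization is an assumption for illustration, the idea being to mix noise recordings into clean speech.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at a given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)        # loop/crop the noise to fit
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # scale the noise so that 10*log10(speech_power / scaled_noise_power) = snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```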
Output
The Goal
Create a neural network (RNN) from which we can extract the transcription y.
Train from labeled pairs (x, y).
Connectionist Temporal Classification (CTC)
The network is also called the acoustic model. Its main issue: length(x) ≠ length(y).
Solution: divide the transcription task into steps:
The RNN output neurons encode a distribution over symbols: x → c
Define a mapping β from symbol sequences to text: y = β(c)
In training: sum over all symbol sequences c that map to y
In testing: maximum-likelihood decoding using beam search
Connectionist Temporal Classification (CTC)
The RNN produces probability vectors (distributions) using a softmax.
For a grapheme-based model: c_i ∈ {A, B, C, D, ..., blank, space}
Independence assumption: P(c|x) = ∏_{i=1}^{N} P(c_i|x)
Training With CTC
Mapping: given a character sequence c, merge duplicates and then remove blanks.
Therefore P(y|x) is the summation over all possible c with the same mapping:
P(y|x) = Σ_{c : β(c) = y} P(c|x)
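A minimal sketch of the mapping β: merge repeated symbols, then remove blanks (here '_' stands for the blank).

```python
from itertools import groupby

def beta(c, blank='_'):
    """Collapse a CTC symbol sequence: merge repeats, then drop blanks."""
    return ''.join(sym for sym, _ in groupby(c) if sym != blank)

print(beta('_aaaa_bbbb_'))  # -> 'ab'
print(beta('aa_ab'))        # -> 'aab' (a blank separates repeated symbols)
```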
Training With CTC
Update network parameters θ to maximize the likelihood of the correct labels y:
θ* = argmax_θ Σ_i log P(y^(i) | x^(i)) = argmax_θ Σ_i log Σ_{c : β(c) = y^(i)} P(c | x^(i))
There is an efficient dynamic programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
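One such open-source implementation is PyTorch's nn.CTCLoss; the sketch below shows the training-side computation with illustrative shapes (29 classes = 28 symbols + blank is an assumption).

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 29, 10   # time steps, batch, classes (incl. blank), target length
# (T, N, C) per-step log-probabilities, as a stand-in for the network output
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients for the argmax_theta update
print(loss.item())
```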
Decoding
The network outputs P(c|x); we want P(y|x).
Simple naive solution, max decoding: β(argmax_c P(c|x))
Example: c = _ a a a a b b b b _ → β(c) = ab
Max Decoding
Doesn't work well in practice, but is good for diagnostics.
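A minimal sketch of max decoding: per-step argmax followed by β. The alphabet here is an illustrative assumption.

```python
import numpy as np
from itertools import groupby

ALPHABET = ['_'] + list('abcdefghijklmnopqrstuvwxyz ')   # blank first (assumption)

def max_decode(probs):
    """probs: (time_steps, alphabet_size) array of per-step distributions."""
    best = probs.argmax(axis=1)                  # argmax_c P(c|x), per time step
    symbols = [ALPHABET[i] for i in best]
    return ''.join(s for s, _ in groupby(symbols) if s != '_')   # apply beta
```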
Language Model: n-gram
A probabilistic Markov model: P(x_i | x_{i-n+1}, ..., x_{i-1})
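A minimal sketch of an n-gram model with n = 2 (a bigram model), estimated by counting over a toy corpus invented for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """P(x_i | x_{i-1}) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p('cat', 'the'))  # 2/3: 'the' is followed by 'cat' twice and 'mat' once
```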
Language Model: n-gram
Examples from the Google n-gram corpus
Decoding with LM
Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.
Solution: combine a language model!
argmax_y log{ P(y|x) · P_LM(y)^α · word_count(y)^β }
α weights the LM against the CTC network
β encourages more words in the transcription
Use beam search to find the transcript y (a simplified sketch follows the beam-search slides).
Decoding with LM
Decoding with LM: Beam Search (the naive approach)
Decoding with LM: Beam Search
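Below is a simplified sketch of beam search with the combined score from the previous slides. It is not the full CTC prefix beam search (it ignores the blank/merge bookkeeping), and `lm_logprob` is a hypothetical language-model scoring function supplied by the caller.

```python
import math

def beam_search(probs, lm_logprob, alpha=0.5, beta=1.0, beam_width=8):
    """probs: list of per-time-step dicts {symbol: P(symbol | x)}.
    lm_logprob: maps a text prefix to a language-model log-probability."""
    def total_score(prefix, acoustic):
        # acoustic score + alpha * LM score + beta * word count
        return acoustic + alpha * lm_logprob(prefix) + beta * len(prefix.split())

    beams = {'': 0.0}                              # prefix -> acoustic log-prob
    for step in probs:
        candidates = {}
        for prefix, acoustic in beams.items():
            for sym, p in step.items():
                new = prefix + sym
                score = acoustic + math.log(p)
                if score > candidates.get(new, -math.inf):
                    candidates[new] = score
        ranked = sorted(candidates.items(),
                        key=lambda kv: total_score(*kv), reverse=True)
        beams = dict(ranked[:beam_width])          # keep only the best prefixes
    return max(beams.items(), key=lambda kv: total_score(*kv))[0]

# Toy usage with a uniform (constant) LM
steps = [{'a': 0.6, 'b': 0.4}, {'a': 0.5, 'b': 0.5}]
print(beam_search(steps, lm_logprob=lambda y: 0.0))
```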
Architecture
Model Architecture
11 layers. The chosen architecture: 3 × 2D convolution, 7 × RNN, 1 × fully connected (FC).
Batch normalization throughout the network.
RNN as state machine

RNN as grid: Forward Pass
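A minimal numpy sketch of the forward pass through the unrolled grid, h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); all dimensions are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """xs: (T, input_dim); returns the hidden states, (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                        # one column of the grid per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return np.stack(states)

T, D, H = 5, 3, 4
rng = np.random.default_rng(0)
hs = rnn_forward(rng.normal(size=(T, D)),
                 rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
print(hs.shape)  # (5, 4)
```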
Back Propagation

Weight Update

Bi-Directional RNN
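A minimal sketch of a bidirectional RNN in PyTorch: the sequence is processed in both directions and the two hidden-state sequences are concatenated (all sizes are illustrative).

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=161, hidden_size=128, bidirectional=True, batch_first=True)
x = torch.randn(4, 50, 161)   # (batch, time, spectrogram bins) - illustrative
out, h_n = rnn(x)
print(out.shape)              # (4, 50, 256): forward and backward states concatenated
```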
RNN with limited future context
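A minimal sketch of one way to get limited future context: a lookahead ("row") convolution that mixes each hidden state with the next tau future states, instead of running a full backward RNN. Weights and sizes are illustrative.

```python
import numpy as np

def lookahead(h, W):
    """h: (T, hidden); W: (tau + 1, hidden) per-feature mixing weights."""
    T, d = h.shape
    tau = W.shape[0] - 1
    padded = np.vstack([h, np.zeros((tau, d))])    # zero-pad the future edge
    # each output mixes the current state with the next tau states
    return np.stack([(W * padded[t:t + tau + 1]).sum(axis=0) for t in range(T)])

h = np.random.randn(20, 8)
print(lookahead(h, np.random.randn(3, 8)).shape)   # (20, 8), tau = 2
```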
RNN vs. LSTM vs. GRU
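A quick comparison sketch via PyTorch's built-in modules: all three share the same interface but differ in gating, and hence in parameter count (sizes illustrative).

```python
import torch.nn as nn

for cell in (nn.RNN, nn.LSTM, nn.GRU):
    layer = cell(input_size=64, hidden_size=128, batch_first=True)
    n_params = sum(p.numel() for p in layer.parameters())
    print(cell.__name__, n_params)  # GRU has ~3x, LSTM ~4x the plain RNN's weights
```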
Model Architecture (recap)
11 layers: 3 × 2D convolution, 7 × RNN, 1 × fully connected (FC), with batch normalization throughout the network.
Convolutional Layers: Images
Convolutional Layers: Audio (convolving over frequency × time)
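A rough sketch of the 11-layer architecture (3 × 2D conv over frequency and time, 7 × RNN, 1 × FC, with batch norm) in PyTorch. All hyperparameters here (channels, kernel sizes, strides, the choice of GRU cells) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    def __init__(self, freq_bins=161, hidden=256, num_symbols=29):
        super().__init__()
        # 3 x 2D convolutions over (frequency, time), each with batch norm
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=11, stride=(2, 2), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=11, stride=(2, 1), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=11, stride=(2, 1), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        # infer the flattened feature size after the conv stack
        with torch.no_grad():
            dummy = self.conv(torch.zeros(1, 1, freq_bins, 16))
        rnn_in = dummy.shape[1] * dummy.shape[2]   # channels * reduced freq bins
        # 7 recurrent layers (bidirectional GRU here)
        self.rnn = nn.GRU(rnn_in, hidden, num_layers=7,
                          bidirectional=True, batch_first=True)
        # 1 fully connected layer producing per-step symbol scores for CTC
        self.fc = nn.Linear(2 * hidden, num_symbols)

    def forward(self, spec):
        """spec: (batch, freq_bins, time) log-spectrograms."""
        z = self.conv(spec.unsqueeze(1))           # (batch, ch, freq', time')
        b, ch, f, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b, t, ch * f)
        z, _ = self.rnn(z)
        return self.fc(z).log_softmax(dim=-1)      # feed to the CTC loss / decoder

model = DeepSpeech2Sketch()
out = model(torch.randn(2, 161, 100))
print(out.shape)                                   # (2, reduced_time, 29)
```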
Results

Test Sets
Results: Sometimes Better than Humans
Questions?

The End