Speech Recognition
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
Amnon Drory & Matan Karo
19/12/2017
Overview

Automatic Speech Recognition
100,000,000 (H.M.U = Hundred Million Users)
The Task: Good Speech Recognition

Traditional Speech Recognition (ASR)

Traditional ASR + Deep Learning
Baidu's Approach: End-to-End Neural Net
Speed up
Training Data: Annotated Audio
Thousands of hours of annotated speech for training, in both English and Mandarin.
Training Data: Raw Text
Use large amounts of raw text to learn about the language. This helps in understanding speech: which words are common, and which word is plausible in the current context.
Lecture Plan
Overview
Input
Output + CTC
Model Architecture
Results
Input
ASR
A complete speech application:
Speech transcription
Word spotting / trigger word detection
Speaker identification / verification
Audio Input
Input: raw audio, a 1D signal
Pre-processing: spectrogram
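To make the pre-processing step concrete, here is a minimal sketch of spectrogram extraction using scipy. The window and step sizes are illustrative assumptions, not the paper's exact values.

```python
import numpy as np
from scipy import signal

def spectrogram(audio, sample_rate=16000, window_ms=20, step_ms=10):
    """Convert a 1-D audio signal into a log-magnitude spectrogram."""
    nperseg = int(sample_rate * window_ms / 1000)          # samples per window
    noverlap = nperseg - int(sample_rate * step_ms / 1000) # window overlap
    freqs, times, spec = signal.spectrogram(audio, fs=sample_rate,
                                            nperseg=nperseg, noverlap=noverlap)
    return np.log(spec + 1e-10)  # log compression tames the dynamic range

# Usage: one second of random "audio"
audio = np.random.randn(16000).astype(np.float32)
features = spectrogram(audio)
print(features.shape)  # (frequency_bins, time_frames)
```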
Preprocessing: SortaGrad
Dealing with utterances of different lengths: in the first training epoch, iterate through minibatches in increasing order of utterance length, keeping similar-length utterances together (LibriSpeech clean data).
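A minimal sketch of SortaGrad-style batching, assuming `dataset` is a list of (audio, transcript) pairs: the first epoch is length-ordered, later epochs are shuffled as usual.

```python
import random

def sortagrad_batches(dataset, batch_size, epoch):
    # sort utterance indices by audio length (shortest first)
    order = sorted(range(len(dataset)), key=lambda i: len(dataset[i][0]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if epoch > 0:               # only the first epoch is curriculum-ordered
        random.shuffle(batches)
    return [[dataset[i] for i in batch] for batch in batches]
```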
Preprocessing: Data Augmentation
Additive noise:
increases robustness to noisy speech
increases the data set: 10k hours of raw audio -> 100k hours
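A minimal sketch of additive-noise augmentation; the SNR parameterization is an assumption for illustration, the idea being to mix noise recordings into clean speech.

```python
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at a given signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)        # loop/crop the noise to fit
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # scale the noise so that 10*log10(speech_power / scaled_noise_power) = snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```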
Output
The Goal
Create a neural network (RNN) from which we can extract the transcription y.
Train from labeled pairs (x, y).
Connectionist Temporal Classification (CTC)
The network is also called the acoustic model. Its main issue: length(x) ≠ length(y).
Solution: divide the transcription task into steps:
The RNN output neurons encode a distribution over symbols: x → c
Define a mapping β from symbol sequences to text: y = β(c)
In training: sum over all symbol sequences c that map to y
In testing: maximum-likelihood decoding using beam search
Connectionist Temporal Classification (CTC)
The RNN produces probability vectors (distributions) using a softmax.
For a grapheme-based model: c_i ∈ {A, B, C, D, ..., blank, space}
Independence assumption: P(c|x) = ∏_{i=1}^{N} P(c_i|x)
Training With CTC
Mapping: given a character sequence c, merge duplicates and then remove blanks.
Therefore P(y|x) is the summation over all possible c with the same mapping:
P(y|x) = Σ_{c : β(c) = y} P(c|x)
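A minimal sketch of the mapping β: merge repeated symbols, then remove blanks (here '_' stands for the blank).

```python
from itertools import groupby

def beta(c, blank='_'):
    """Collapse a CTC symbol sequence: merge repeats, then drop blanks."""
    return ''.join(sym for sym, _ in groupby(c) if sym != blank)

print(beta('_aaaa_bbbb_'))  # -> 'ab'
print(beta('aa_ab'))        # -> 'aab' (a blank separates repeated symbols)
```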
Training With CTC
Update network parameters θ to maximize the likelihood of the correct labels y:
θ* = argmax_θ Σ_i log P(y^(i) | x^(i)) = argmax_θ Σ_i log Σ_{c : β(c) = y^(i)} P(c | x^(i))
There is an efficient dynamic programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
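One such open-source implementation is PyTorch's nn.CTCLoss; the sketch below shows the training-side computation with illustrative shapes (29 classes = 28 symbols + blank is an assumption).

```python
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 29, 10   # time steps, batch, classes (incl. blank), target length
# (T, N, C) per-step log-probabilities, as a stand-in for the network output
log_probs = torch.randn(T, N, C).log_softmax(2).detach().requires_grad_()
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()              # gradients for the argmax_theta update
print(loss.item())
```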
Decoding
The network outputs P(c|x); we want P(y|x).
Simple naive solution, max decoding: β(argmax_c P(c|x))
Example: c = _ a a a a b b b b _ → β(c) = ab
Max Decoding
Doesn't work well in practice, but is good for diagnostics.
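A minimal sketch of max decoding: per-step argmax followed by β. The alphabet here is an illustrative assumption.

```python
import numpy as np
from itertools import groupby

ALPHABET = ['_'] + list('abcdefghijklmnopqrstuvwxyz ')   # blank first (assumption)

def max_decode(probs):
    """probs: (time_steps, alphabet_size) array of per-step distributions."""
    best = probs.argmax(axis=1)                  # argmax_c P(c|x), per time step
    symbols = [ALPHABET[i] for i in best]
    return ''.join(s for s, _ in groupby(symbols) if s != '_')   # apply beta
```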
Language Model: n-gram
A probabilistic Markov model: P(x_i | x_{i-n+1}, ..., x_{i-1})
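A minimal sketch of an n-gram model with n = 2 (a bigram model), estimated by counting over a toy corpus invented for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    """P(x_i | x_{i-1}) by maximum likelihood (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p('cat', 'the'))  # 2/3: 'the' is followed by 'cat' twice and 'mat' once
```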
Language Model: n-gram
Examples from the Google n-gram corpus
Decoding with LM
Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.
Solution: combine a language model!
argmax_y log{ P(y|x) · P_LM(y)^α · word_count(y)^β }
α weights the LM against the CTC network
β encourages more words in the transcription
Use beam search to find the transcript y (a simplified sketch follows the beam-search slides).
Decoding with LM
Decoding with LM: Beam Search (the naive approach)
Decoding with LM: Beam Search
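Below is a simplified sketch of beam search with the combined score from the previous slides. It is not the full CTC prefix beam search (it ignores the blank/merge bookkeeping), and `lm_logprob` is a hypothetical language-model scoring function supplied by the caller.

```python
import math

def beam_search(probs, lm_logprob, alpha=0.5, beta=1.0, beam_width=8):
    """probs: list of per-time-step dicts {symbol: P(symbol | x)}.
    lm_logprob: maps a text prefix to a language-model log-probability."""
    def total_score(prefix, acoustic):
        # acoustic score + alpha * LM score + beta * word count
        return acoustic + alpha * lm_logprob(prefix) + beta * len(prefix.split())

    beams = {'': 0.0}                              # prefix -> acoustic log-prob
    for step in probs:
        candidates = {}
        for prefix, acoustic in beams.items():
            for sym, p in step.items():
                new = prefix + sym
                score = acoustic + math.log(p)
                if score > candidates.get(new, -math.inf):
                    candidates[new] = score
        ranked = sorted(candidates.items(),
                        key=lambda kv: total_score(*kv), reverse=True)
        beams = dict(ranked[:beam_width])          # keep only the best prefixes
    return max(beams.items(), key=lambda kv: total_score(*kv))[0]

# Toy usage with a uniform (constant) LM
steps = [{'a': 0.6, 'b': 0.4}, {'a': 0.5, 'b': 0.5}]
print(beam_search(steps, lm_logprob=lambda y: 0.0))
```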
Architecture
Model Architecture
11 layers. The chosen architecture: 3 × 2D convolution, 7 × RNN, 1 × fully connected (FC).
Batch normalization throughout the network.
RNN as state machine

RNN as grid: Forward Pass
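A minimal numpy sketch of the forward pass through the unrolled grid, h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); all dimensions are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b):
    """xs: (T, input_dim); returns the hidden states, (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                        # one column of the grid per time step
        h = np.tanh(W_xh @ x + W_hh @ h + b)
        states.append(h)
    return np.stack(states)

T, D, H = 5, 3, 4
rng = np.random.default_rng(0)
hs = rnn_forward(rng.normal(size=(T, D)),
                 rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H))
print(hs.shape)  # (5, 4)
```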
Back Propagation

Weight Update

Bi-Directional RNN
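A minimal sketch of a bidirectional RNN in PyTorch: the sequence is processed in both directions and the two hidden-state sequences are concatenated (all sizes are illustrative).

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=161, hidden_size=128, bidirectional=True, batch_first=True)
x = torch.randn(4, 50, 161)   # (batch, time, spectrogram bins) - illustrative
out, h_n = rnn(x)
print(out.shape)              # (4, 50, 256): forward and backward states concatenated
```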
RNN with limited future context
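A minimal sketch of one way to get limited future context: a lookahead ("row") convolution that mixes each hidden state with the next tau future states, instead of running a full backward RNN. Weights and sizes are illustrative.

```python
import numpy as np

def lookahead(h, W):
    """h: (T, hidden); W: (tau + 1, hidden) per-feature mixing weights."""
    T, d = h.shape
    tau = W.shape[0] - 1
    padded = np.vstack([h, np.zeros((tau, d))])    # zero-pad the future edge
    # each output mixes the current state with the next tau states
    return np.stack([(W * padded[t:t + tau + 1]).sum(axis=0) for t in range(T)])

h = np.random.randn(20, 8)
print(lookahead(h, np.random.randn(3, 8)).shape)   # (20, 8), tau = 2
```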
RNN vs. LSTM vs. GRU
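A quick comparison sketch via PyTorch's built-in modules: all three share the same interface but differ in gating, and hence in parameter count (sizes illustrative).

```python
import torch.nn as nn

for cell in (nn.RNN, nn.LSTM, nn.GRU):
    layer = cell(input_size=64, hidden_size=128, batch_first=True)
    n_params = sum(p.numel() for p in layer.parameters())
    print(cell.__name__, n_params)  # GRU has ~3x, LSTM ~4x the plain RNN's weights
```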
Model Architecture (recap)
11 layers: 3 × 2D convolution, 7 × RNN, 1 × fully connected (FC), with batch normalization throughout the network.
Convolutional Layers: Images
Convolutional Layers: Audio (convolving over frequency × time)
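A rough sketch of the 11-layer architecture (3 × 2D conv over frequency and time, 7 × RNN, 1 × FC, with batch norm) in PyTorch. All hyperparameters here (channels, kernel sizes, strides, the choice of GRU cells) are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DeepSpeech2Sketch(nn.Module):
    def __init__(self, freq_bins=161, hidden=256, num_symbols=29):
        super().__init__()
        # 3 x 2D convolutions over (frequency, time), each with batch norm
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=11, stride=(2, 2), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=11, stride=(2, 1), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=11, stride=(2, 1), padding=5),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        # infer the flattened feature size after the conv stack
        with torch.no_grad():
            dummy = self.conv(torch.zeros(1, 1, freq_bins, 16))
        rnn_in = dummy.shape[1] * dummy.shape[2]   # channels * reduced freq bins
        # 7 recurrent layers (bidirectional GRU here)
        self.rnn = nn.GRU(rnn_in, hidden, num_layers=7,
                          bidirectional=True, batch_first=True)
        # 1 fully connected layer producing per-step symbol scores for CTC
        self.fc = nn.Linear(2 * hidden, num_symbols)

    def forward(self, spec):
        """spec: (batch, freq_bins, time) log-spectrograms."""
        z = self.conv(spec.unsqueeze(1))           # (batch, ch, freq', time')
        b, ch, f, t = z.shape
        z = z.permute(0, 3, 1, 2).reshape(b, t, ch * f)
        z, _ = self.rnn(z)
        return self.fc(z).log_softmax(dim=-1)      # feed to the CTC loss / decoder

model = DeepSpeech2Sketch()
out = model(torch.randn(2, 161, 100))
print(out.shape)                                   # (2, reduced_time, 29)
```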
Results

Test Sets
Results: Sometimes Better than Humans
Questions?

The End