Speech Recognition Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

1 Speech Recognition Deep Speech 2: End-to-End Speech Recognition in English and Mandarin Amnon Drory & Matan Karo 19/12/2017 Deep Speech 1

2 Overview

3 Automatic Speech Recognition

7 100,000,000 H.M.U = Hundred Million Users

8 The Task: Good Speech Recognition

9 Traditional Speech Recognition (ASR)

10 Traditional ASR + Deep Learning

11 Baidu's Approach: End-to-End Neural Net 100,000,000

12 Speed up

13 Training Data: Annotated Audio
- Thousands of hours of annotated speech for training, in English and Mandarin.

14 Training Data: Raw Text
- Use raw text to learn about the language itself. This helps in understanding speech:
  - which words are common
  - which word is reasonable in the current context

15 Lecture Plan
- Overview
- Input
- Output + CTC
- Model Architecture
- Results

16 Input

17 ASR
A complete speech application:
- Speech transcription
- Word spotting / trigger word
- Speaker identification / verification

18 Audio Input
- Input: raw audio, a 1D signal
- Pre-processing: spectrogram
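The pre-processing step on this slide can be sketched as a short-time Fourier transform. This is a minimal illustration, not the pipeline used in the paper: the 16 kHz sample rate, the 20 ms window (`frame_len=320`), the 10 ms hop and the test tone are all assumptions chosen for the example.

```python
import numpy as np

def spectrogram(signal, frame_len=320, hop=160):
    """Log-power spectrogram of a 1D signal via a short-time Fourier transform."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each one.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Small constant avoids log(0); result has shape (time, frequency).
    return np.log(power + 1e-10)

# Assumed example input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(audio)
```

Each row of `spec` is one time step and each column one frequency bin, which is the 2D time-frequency input that the convolutional layers later operate on.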

19 Preprocessing: SortaGrad
- Deals with utterances of different lengths
- Try to keep utterances of similar length together
- (LibriSpeech clean data.)
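A minimal sketch of the batching idea, assuming the curriculum described in the Deep Speech 2 paper: minibatches are presented shortest-first in the first epoch and in random order afterwards. The function name `sortagrad_batches`, the `(length, transcript)` tuples and the lengths themselves are hypothetical. Grouping by length in every epoch follows the slide's "keep similar length utterances together".

```python
import random

def sortagrad_batches(utterances, batch_size, epoch, seed=0):
    """Return minibatches of (audio_length, transcript) pairs for one epoch."""
    # Sorting by length keeps utterances of similar length in the same batch,
    # which also limits the padding needed inside each batch.
    ordered = sorted(utterances, key=lambda u: u[0])
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    # First epoch: shortest batches first. Later epochs: random batch order.
    if epoch > 0:
        random.Random(seed + epoch).shuffle(batches)
    return batches

# Hypothetical data: (length in frames, transcript).
utterances = [(520, "hello world"), (90, "yes"), (310, "good morning"),
              (100, "no"), (480, "see you later"), (300, "thank you")]
first_epoch = sortagrad_batches(utterances, batch_size=2, epoch=0)
later_epoch = sortagrad_batches(utterances, batch_size=2, epoch=1)
```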

20 Preprocessing: Data Augmentation
- Additive noise increases robustness to noisy speech
- It also enlarges the data set: 10k hours of raw audio -> 100k hours
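Additive-noise augmentation can be sketched as mixing a noise clip into a clean utterance at a chosen signal-to-noise ratio. The helper `add_noise`, the SNR values and the synthetic signals below are assumptions for illustration; the slides do not say which noise sources or levels were used.

```python
import numpy as np

def add_noise(clean, noise, snr_db):
    """Mix `noise` into `clean` at the requested signal-to-noise ratio (in dB)."""
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that clean_power / scaled_noise_power == 10^(snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Synthetic stand-ins for a clean utterance and a noise recording.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

# One clean utterance yields several training examples at different noise levels.
noisy_versions = [add_noise(clean, noise, snr_db) for snr_db in (20, 10, 5)]
```

Producing several noisy copies per clean utterance is one way a data set can grow by a large factor without new recordings.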

21 Output

22 The Goal
- Create a neural network (RNN) from which we can extract the transcription y.
- Train from labeled pairs (x, y).

23 Connectionist Temporal Classification (CTC)
- The network is also called the acoustic model.
- Main issue for the acoustic model: length(x) != length(y)
- Solution: divide the transcription task into steps:
  - The RNN output neurons c encode a distribution over symbols. Encode: x -> c
  - Define a mapping (β) from distribution to text: f(c) -> y
  - Find the function f that achieves y:
    - In training: sum over all sequences c that map to y
    - In testing: maximum likelihood (ML) using beam search

24 Connectionist Temporal Classification (CTC)
- The RNN produces probability vectors (distributions) using softmax.
- For a grapheme-based model: $c_i \in \{A, B, C, D, \ldots, \text{blank}, \text{space}\}$
- Independence assumption: $P(c \mid x) = \prod_{i=1}^{N} P(c_i \mid x)$

25 Training With CTC
- Mapping β: given a character sequence c, merge repeated characters, then remove blanks.
- Therefore $P(y \mid x)$ is the sum over all possible c with the same mapping: $P(y \mid x) = \sum_{c:\, \beta(c) = y} P(c \mid x)$
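The mapping and the sum over paths can be made concrete with a brute-force sketch. It enumerates every symbol sequence c, so it is only usable for toy inputs. The three-symbol alphabet and the probability table are invented for illustration, and each path probability is the per-frame product given by the independence assumption.

```python
from itertools import groupby, product

BLANK = "_"

def collapse(path):
    """The CTC mapping: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != BLANK)

def brute_force_prob(probs, alphabet, target):
    """P(y|x): sum of P(c|x) over every path c that collapses to `target`."""
    total = 0.0
    for path in product(range(len(alphabet)), repeat=len(probs)):
        if collapse([alphabet[k] for k in path]) == target:
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]  # independence assumption: product over frames
            total += p
    return total

# Toy softmax outputs for 3 frames over the alphabet (_, a, b).
alphabet = [BLANK, "a", "b"]
probs = [[0.1, 0.6, 0.3],
         [0.2, 0.5, 0.3],
         [0.6, 0.1, 0.3]]
p_ab = brute_force_prob(probs, alphabet, "ab")
```

Five of the 27 possible paths collapse to "ab" here (`aab`, `abb`, `a_b`, `_ab`, `ab_`). Note that a blank between two identical symbols is what lets the mapping produce a doubled letter: `a_a` collapses to "aa", while `aa` collapses to "a".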

26 Training With CTC
- Update the network parameters θ to maximize the likelihood of the correct label $y^{(i)}$:
  $\theta^* = \arg\max_\theta \sum_i \log P(y^{(i)} \mid x^{(i)}; \theta) = \arg\max_\theta \sum_i \log \sum_{c:\, \beta(c) = y^{(i)}} P(c \mid x^{(i)}; \theta)$
- There is an efficient dynamic programming algorithm to compute the inner summation and its gradient (also implemented in open-source packages).
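A minimal sketch of the dynamic programming idea, covering the forward pass only (no gradient). It uses the standard CTC forward recursion over the target interleaved with blanks. The function name and the toy probabilities are assumptions, and a practical implementation would work in log space for numerical stability.

```python
import math

def ctc_log_likelihood(probs, target, blank=0):
    """log P(y|x) via the CTC forward recursion.

    probs[t][k] is the softmax output for symbol index k at frame t;
    target is a list of symbol indices that contains no blanks.
    """
    # Interleave blanks: [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for k in target:
        ext += [k, blank]
    S, T = len(ext), len(probs)

    # alpha[s]: total probability of all partial paths ending at ext[s].
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        prev, alpha = alpha, [0.0] * S
        for s in range(S):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    # Valid paths end on the last symbol or on the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if S > 1 else 0.0)
    return math.log(likelihood) if likelihood > 0 else float("-inf")

# Toy softmax outputs: 3 frames over the symbols (blank, a, b).
probs = [[0.1, 0.6, 0.3],
         [0.2, 0.5, 0.3],
         [0.6, 0.1, 0.3]]
log_p_ab = ctc_log_likelihood(probs, [1, 2])   # target "ab"
```

The recursion costs on the order of T x len(y) operations instead of enumerating every path, which is what makes training on long utterances feasible.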

27 Decoding
- The network outputs $P(c \mid x)$; we want $P(y \mid x)$.
- Simple, naive solution: max decoding, $\beta(\arg\max_c P(c \mid x))$
- Example: c c _ a a a a b b b b _ maps to "cab"
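Max decoding can be sketched in a few lines. Because of the independence assumption, the most likely path is simply the per-frame argmax. The alphabet and the probability values below are made up so that the best path reproduces the example sequence from the slide.

```python
from itertools import groupby

def max_decode(probs, alphabet, blank="_"):
    """Pick the most likely symbol in each frame, then apply the CTC mapping."""
    best_path = [alphabet[max(range(len(frame)), key=frame.__getitem__)]
                 for frame in probs]
    merged = [symbol for symbol, _ in groupby(best_path)]   # merge repeats
    return "".join(s for s in merged if s != blank)         # drop blanks

# Build toy frames whose argmax path is "cc_aaaabbbb_".
alphabet = ["_", "a", "b", "c"]
path = "cc_aaaabbbb_"
probs = [[0.7 if symbol == best else 0.1 for symbol in alphabet] for best in path]
decoded = max_decode(probs, alphabet)
```

This returns the transcript of the single best path, which is not necessarily the most probable transcript, since many different paths can map to the same y.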

28 Max Decoding
- Doesn't work in practice, but is good for diagnostics

29 Language model: n-gram
- A probabilistic Markov model: $P(x_i \mid x_{i-(n-1)}, \ldots, x_{i-1})$
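For n = 2 the model reduces to $P(x_i \mid x_{i-1})$, which can be estimated from raw text by counting. The sketch below is a maximum-likelihood bigram model without smoothing; the three-sentence corpus and the `<s>` start marker are illustrative assumptions, and a production language model would be trained on far more text with smoothing for unseen word pairs.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function prob(word, prev) estimating P(word | prev) from counts."""
    context_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()   # <s> marks the sentence start
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0                        # unseen context, no smoothing
        return pair_counts[(prev, word)] / context_counts[prev]

    return prob

# Toy corpus, invented for this example.
corpus = ["recognize speech", "recognize speech well", "wreck a nice beach"]
p = train_bigram(corpus)
p_speech = p("speech", "recognize")   # "speech" always follows "recognize" here
p_beach = p("beach", "recognize")     # never observed in the corpus
```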

30 Language model: n-gram
- Examples from the Google n-gram corpus

31 Decoding with LM
- Even with better decoding schemes, the CTC model tends to make spelling and linguistic errors.
- Solution: combine it with a language model: $\arg\max_y \log\{ P(y \mid x) \, P(y)^{\alpha} \, \mathrm{word\_count}(y)^{\beta} \}$
- α weights the LM against the CTC network
- β (here a weight, not the CTC mapping) encourages more words in the transcription
- Use beam search to find the transcript y
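The objective on this slide can be sketched as a scoring function applied to candidate transcripts. The candidate list, its probabilities and the values of `alpha` and `beta` are hypothetical. In a real decoder the candidates would come from beam search over the network output rather than from a fixed list.

```python
import math

def combined_score(ctc_prob, lm_prob, transcript, alpha, beta):
    """log( P(y|x) * P(y)^alpha * word_count(y)^beta ), expanded into a sum of logs.

    Assumes a non-empty transcript and strictly positive probabilities.
    """
    word_count = len(transcript.split())
    return (math.log(ctc_prob)
            + alpha * math.log(lm_prob)
            + beta * math.log(word_count))

# Hypothetical candidates: (transcript, P(y|x) from the CTC network, P(y) from the LM).
candidates = [
    ("wreck a nice beach", 0.30, 1e-9),
    ("recognize speech", 0.25, 1e-6),
]

def best_transcript(alpha, beta):
    return max(candidates,
               key=lambda c: combined_score(c[1], c[2], c[0], alpha, beta))[0]

with_lm = best_transcript(alpha=1.0, beta=1.0)
without_lm = best_transcript(alpha=0.0, beta=1.0)
```

With these made-up numbers the acoustically preferred candidate wins when the language model is switched off (`alpha=0`), and the linguistically plausible one wins once the language model term is included.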

32 Decoding with LM

33 Decoding with LM: Beam Search
- The naive approach

34 Decoding with LM: Beam Search

35 Architecture

36 Model Architecture
- 11 layers
- The chosen architecture: 3 x 2D conv, 7 x RNN, 1 x FC
- Batch normalization throughout the DNN

37 RNN as state machine

38 RNN as grid: Forward Pass

39 Back Propagation

40 Weight Update

41 Bi-Directional RNN

42 RNN with limited future context

43 RNN vs. LSTM vs. GRU

44 Model Architecture
- 11 layers
- The chosen architecture: 3 x 2D conv, 7 x RNN, 1 x FC
- Batch normalization throughout the DNN

45 Convolutional Layers: Images

46 Convolutional Layers: Audio (axes: time, frequency)

47 Results

48 Test Sets

49-51 Results: Sometimes better than Humans

52 Questions?

53 The End
