
SEQUENCE-TO-SEQUENCE ASR OPTIMIZATION VIA REINFORCEMENT LEARNING

Andros Tjandra 1, Sakriani Sakti 1,2, Satoshi Nakamura 1,2
1 Graduate School of Information Science, Nara Institute of Science and Technology, Japan
2 RIKEN, Center for Advanced Intelligence Project AIP, Japan
{andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp

arXiv:1710.10774v2 [cs.CL] 28 Feb 2018

ABSTRACT

Despite the success of sequence-to-sequence approaches in automatic speech recognition (ASR) systems, the models still suffer from several problems, mainly due to the mismatch between the training and inference conditions. In the sequence-to-sequence architecture, the model is trained to predict the grapheme of the current time-step given the input speech signal and the ground-truth grapheme history of the previous time-steps. However, it remains unclear how well the model approximates real-world speech during inference. Thus, generating the whole transcription from scratch based on previous predictions is complicated and errors can propagate over time. Furthermore, the model is optimized to maximize the likelihood of training data instead of error rate evaluation metrics that actually quantify recognition quality. This paper presents an alternative strategy for training sequence-to-sequence ASR models by adopting the idea of reinforcement learning (RL). Unlike the standard training scheme with maximum likelihood estimation, our proposed approach utilizes the policy gradient algorithm. We can (1) sample the whole transcription based on the model's prediction in the training process and (2) directly optimize the model with negative Levenshtein distance as the reward. Experimental results demonstrate that we significantly improved the performance compared to a model trained only with maximum likelihood estimation.

Index Terms: End-to-end speech recognition, reinforcement learning, policy gradient optimization

1. INTRODUCTION

Sequence-to-sequence models have recently been shown to be very effective for many tasks such as machine translation [1, 2], image captioning [3, 4], and speech recognition [5].
With these models, we are able to learn a direct mapping between variable-length source and target sequences, whose lengths are often not known a priori, using only a single neural network architecture. This way, many complicated hand-engineered models can also be simplified by letting DNNs find their way to map from input to output spaces [5, 6, 7]. Therefore, we can eliminate the need to construct separate components, i.e., a feature extractor, an acoustic model, a lexicon model, or a language model, as is commonly required in conventional ASR systems such as hidden Markov model-Gaussian mixture model (HMM-GMM)-based or hybrid HMM-DNN systems.

A generic sequence-to-sequence model commonly consists of three modules: (1) an encoder module for representing source data information, (2) a decoder module for generating transcription output, and (3) an attention module for extracting related information from an encoder representation based on the current decoder state. Decoding follows a left-to-right procedure. In the training stage, given the current input of the speech signal, the decoder produces a grapheme in the current time-step with maximal probability conditioned on the ground-truth grapheme history of the previous time-steps. This training scheme is usually referred to as the teacher-forcing method [8]. However, in the inference stage, since the ground-truth transcription is not known, the model must produce the grapheme in the current time-step based on an approximation of the correct grapheme in previous time-steps. Therefore, an incorrect decision in an earlier time-step may propagate through subsequent time-steps.

Another drawback is the difference between the objective functions used in the training and evaluation schemes. In the training stage, the model is mostly optimized by combining the teacher-forcing approach with maximum likelihood estimation (MLE) for each frame. On the other hand, the recognition accuracy is evaluated by calculating the minimum string edit-distance (Levenshtein distance) between the correct transcription and the recognition output.
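As a concrete illustration of the evaluation side of this mismatch, the edit-distance can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not code from this work: the function names `edit_distance` and `character_error_rate` are hypothetical, and normalizing by the reference length is the usual convention for an error rate, assumed here.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (strings or lists of symbols)."""
    # prev[j] holds the distance between the hyp prefix processed so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # matching hyp[:i] against an empty reference costs i edits
        for j, r in enumerate(ref, start=1):
            substitution = prev[j - 1] + (0 if h == r else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def character_error_rate(hyp, ref):
    """Edit distance normalized by the reference length, in percent (assumed convention)."""
    return 100.0 * edit_distance(hyp, ref) / len(ref)


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(character_error_rate("the cut sat", "the cat sat"))
```

Because the minimum over edit operations is not differentiable with respect to the model parameters, it cannot be plugged into gradient-based MLE training directly, which is the gap the rest of the paper addresses.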
Such differences between the training objective and the evaluation metric may result in suboptimal performance [9]. Optimizing the model parameters with the appropriate objective function is crucial to achieve good model performance; in other words, direct optimization with respect to the evaluation metrics might be necessary.

In this paper, we propose an alternative strategy for training a sequence-to-sequence ASR by adopting an idea from RL. Specifically, we utilize a policy gradient algorithm (REINFORCE) [10] to simultaneously alleviate both of the above problems. By treating our decoder as a policy network or an agent, we are able to (1) sample the whole transcription based on the model's prediction in the training process and (2) directly optimize the model with negative Levenshtein distance as the reward. Our model thus integrates the power of the sequence-to-sequence approach to learn the mapping between the speech signal and the text transcription with the strength of reinforcement learning to optimize the model with the ASR performance metric directly.

2. SEQUENCE-TO-SEQUENCE ASR

A sequence-to-sequence model is a type of neural network model that directly models the conditional probability $P(y|x)$, where $x = [x_1, ..., x_S]$ is the source sequence with length $S$, and $y = [y_1, ..., y_T]$ is the target sequence with length $T$. The most common input $x$ is a sequence of feature vectors such as Mel-spectral filterbank and/or MFCC. Therefore, $x \in \mathbb{R}^{S \times F}$, where $F$ is the number of features and $S$ is the total frame length for an utterance. Output $y$, which is a speech transcription sequence, can be either a phoneme or a grapheme (character) sequence.

Figure 1 shows the overall structure of the attention-based encoder-decoder model that consists of encoder, decoder, and attention modules. The encoder task processes input sequence $x$ and outputs representative information $h^E = [h^E_1, ..., h^E_S]$ for the decoder. The attention module is an extension scheme that helps the decoder find relevant information on the encoder side based on the current decoder hidden states [2]. An attention module produces context information $c_t$ at time $t$ based on the encoder and decoder hidden states with the following equations:

\[ c_t = \sum_{s=1}^{S} a_t(s) \, h^E_s \tag{1} \]

\[ a_t(s) = \mathrm{Align}(h^E_s, h^D_t) = \frac{\exp(\mathrm{Score}(h^E_s, h^D_t))}{\sum_{s=1}^{S} \exp(\mathrm{Score}(h^E_s, h^D_t))} \tag{2} \]

There are several variations for the score function:

\[ \mathrm{Score}(h^E_s, h^D_t) = \begin{cases} \langle h^E_s, h^D_t \rangle, & \text{dot product} \\ h^{E\top}_s W_s h^D_t, & \text{bilinear} \\ V_s^{\top} \tanh(W_s [h^E_s, h^D_t]), & \text{MLP} \end{cases} \tag{3} \]

where $\mathrm{Score} : (\mathbb{R}^M \times \mathbb{R}^N) \rightarrow \mathbb{R}$, $M$ is the number of hidden units for the encoder, and $N$ is the number of hidden units for the decoder. Finally, the decoder task, which predicts the target sequence probability at time $t$ based on the previous output and context information $c_t$, can be formulated as:

\[ \log P(y|x; \theta) = \sum_{t=1}^{T} \log P(y_t | h^D_t, c_t; \theta) \tag{4} \]

where $h^D_t$ is the last decoder layer that contains summarized information from all previous inputs $y_{<t}$ and $\theta$ is our model parameters.

Fig. 1. Attention-based encoder-decoder architecture.

3. SEQUENCE-TO-SEQUENCE OPTIMIZATION WITH REINFORCEMENT LEARNING

In this section, we introduce our proposed approach that integrates policy optimization with the standard encoder-decoder ASR model. We start by describing the policy gradient method, followed by the reward construction for our ASR agent.

3.1. Policy Gradient

Policy gradient is a type of reinforcement learning algorithm for optimizing the expected rewards with respect to the parameterized policy [11]. To apply the idea from the policy gradient method, we need to establish a connection between our ASR model and the reinforcement learning formulation. For reinforcement learning, we reformulate our system as a Markov Decision Process (MDP) $= (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the set of possible actions, $\mathcal{T}$ is the transition probability, and $\mathcal{R}$ is the reward function. Here, our task is to generate a text transcription given the input speech waveform, and the encoder-decoder neural network (Section 2) will act as an agent.
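The agent view can be made concrete with a small rollout loop: the policy is queried for a distribution over actions, one action is sampled, and the episode ends when an end-of-sentence symbol is drawn. The sketch below is only an illustration under loud assumptions: `toy_policy` is a hypothetical stand-in for the decoder that ignores the speech input entirely, the action set (three graphemes plus `<eos>`) is a toy one, and `sample_episode` is not a function from this work.

```python
import math
import random

EOS = "<eos>"
ACTIONS = ["a", "b", "c", EOS]  # toy action set: a few graphemes plus end-of-sentence


def toy_policy(history):
    """Hypothetical stand-in for the decoder: returns P(action | history).

    A real agent would also condition on the encoded speech signal.
    """
    p_eos = min(0.9, 0.1 + 0.2 * len(history))  # stopping becomes likelier over time
    p_other = (1.0 - p_eos) / (len(ACTIONS) - 1)
    return [p_other] * (len(ACTIONS) - 1) + [p_eos]


def sample_episode(policy, rng, max_len=20):
    """Sample actions from the policy until <eos> is drawn or max_len is reached.

    Returns the sampled graphemes (without <eos>) and the log-probability of
    every sampled action, including the final <eos> if one was drawn.
    """
    history, log_probs = [], []
    for _ in range(max_len):
        probs = policy(history)
        idx = rng.choices(range(len(ACTIONS)), weights=probs, k=1)[0]
        log_probs.append(math.log(probs[idx]))
        if ACTIONS[idx] == EOS:
            break
        history.append(ACTIONS[idx])
    return history, log_probs


if __name__ == "__main__":
    graphemes, log_probs = sample_episode(toy_policy, random.Random(0))
    print("".join(graphemes), sum(log_probs))
```

The per-action log-probabilities are kept because the policy gradient estimator weights their gradients by the rewards that the sampled transcription receives.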
For each time-step $t = 1, 2, 3, ..., T$, we can define state $s_t \in \mathcal{S}$ as $s_t = [h^D_t, c_t]$, which is the concatenation of the decoder hidden state and the context information at time $t$. Action $a_t \in \mathcal{A}$ equals $a_t = y_t$, where action space $\mathcal{A}$ contains all possible graphemes plus the end-of-sentence (eos) symbol in our dataset. Reward function $\mathcal{R}$ for our ASR task will be explained later in Section 3.2. Given a pair of speech and transcription $(x^{(n)}, y^{(n)})$ at the n-th index, $R^{(n)}$ is the reward for transcription $y$ compared to ground-truth $y^{(n)}$. Our optimization target is to maximize expected reward $E_y[R^{(n)} | \pi_\theta]$ with respect to $\theta$, our neural network parameters, where $\pi_\theta(a_t | s_t) = P(y_t | h^{D(n)}_t, c^{(n)}_t; \theta) = P(y_t | y_{<t}, x^{(n)}; \theta)$. To use a first-order optimization method (e.g., stochastic gradient ascent / descent), we need to calculate the gradient of the expected reward:

\[ \begin{aligned} \nabla_\theta E_y\left[ R^{(n)} \mid \pi_\theta \right] &= \nabla_\theta \int P(y | x^{(n)}; \theta) \, R^{(n)} \, dy \\ &= \int \nabla_\theta P(y | x^{(n)}; \theta) \, R^{(n)} \, dy \\ &= \int P(y | x^{(n)}; \theta) \, \nabla_\theta \log P(y | x^{(n)}; \theta) \, R^{(n)} \, dy \\ &= E_y\left[ \nabla_\theta \log P(y | x^{(n)}; \theta) \, R^{(n)} \right]. \end{aligned} \tag{5} \]

In Eq. 5, we derived an equation similar to the gradient of the Minimum Risk Training objective [12]. However, instead of using only final reward $R^{(n)}$ and distributing it equally to every time-step, we replace $R^{(n)}$ with the time-distributed reward $R^{(n)}_t = \sum_{i=t}^{T} \gamma^{i-t} r^{(n)}_i$ and provide a more informative reward for each time-step of every sample. Therefore, we rewrite Eq. 5 to utilize the temporal structure $t = [1, .., T]$:

\[ \nabla_\theta E_y\left[ R^{(n)} \mid \pi_\theta \right] = E_y\left[ \sum_{t=1}^{T} R^{(n)}_t \, \nabla_\theta \log P(y_t | y_{<t}, x^{(n)}; \theta) \right] \tag{6} \]

\[ \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T^{(m)}} R^{(n,m)}_t \, \nabla_\theta \log P(y^{(n,m)}_t | y^{(n,m)}_{<t}, x^{(n)}; \theta) \tag{7} \]

where $T$ is the length of transcription $y$, $R^{(n)}_t = \sum_{i=t}^{T} \gamma^{i-t} r^{(n)}_i$ is the generalized equation for the accumulated future reward based on the current state and action at time $t$, and $\gamma$ is the discount factor to reduce the effect of future rewards. For Eq. 7, $R^{(n,m)}_t$ is the reward for the m-th sample based on the n-th utterance and time-step $t$, and $T^{(m)}$ is the length of sample $y^{(n,m)}$.

In the real world, it is impractical to integrate over all possible transcriptions $y$ to calculate the gradient of the expected reward in Eq. 6. Therefore, we utilize Monte Carlo sampling to sample $M$ transcription sequences $y^{(n,m)} \sim P(y | x^{(n)}; \theta)$ from our model to calculate the gradient with the empirical expectation in Eq. 7. Since the REINFORCE gradient estimator is usually too noisy and might hinder our learning process, there are several tricks to reduce the variance [13, 14]. In this paper, we normalize reward $R_t = (R_t - \mu_t) / \sigma_t$, where $\mu_t$ and $\sigma_t$ are the moving average and standard deviation for time-step $t$. For the final reward $R^{(n)}$ in Eq. 5, we normalize the reward across $M$ samples.

To summarize our explanation, we provide an illustration in Fig. 2 that compares the teacher-forcing and policy gradient methods for training the sequence-to-sequence model. Teacher-forcing is optimized by trying to maximize the MLE objective function:

\[ \mathrm{MLE}(y^{(n)}_t, p(y_t)) = \sum_{c} \mathbb{1}\{ y^{(n)}_t = c \} \log p(y_t = c), \tag{8} \]

which is calculated per time-step based on ground-truth label $y^{(n)}_t$. In the policy gradient, first we sample $M$ sequences via Monte Carlo sampling and stop after we get an eos symbol. Then we calculate discounted reward $R^{(n,m)}_t$ for each time-step based on the future rewards.

Fig. 2. Comparison between teacher-forcing and policy gradient training processes. In the training stage, teacher-forcing sets the model to be conditioned on the ground-truth from the dataset. Meanwhile, the policy gradient method sets the model to be conditioned on its own prediction from the previous time-step to predict the current time-step output probability.

3.2. Reward Construction for ASR Tasks

Most ASR systems are evaluated based on the edit-distance or Levenshtein distance algorithm. Therefore, we also construct our reward function $R(y, y^{(n)}, t)$ by utilizing the edit-distance algorithm. We define the reward $r^{(n)}_t$ as:

\[ r^{(n)}_t = \begin{cases} -\left( \mathrm{ED}(y_{1:t}, y^{(n)}) - \mathrm{ED}(y_{1:t-1}, y^{(n)}) \right) & \text{if } t > 1 \\ -\left( \mathrm{ED}(y_{1:t}, y^{(n)}) - |y^{(n)}| \right) & \text{if } t = 1 \end{cases} \tag{9} \]

where $\mathrm{ED}(\cdot, \cdot)$ is the edit-distance function between two transcriptions, $y_{1:t}$ is the substring of $y$ from index 1 to $t$, and $|y^{(n)}|$ is the ground-truth length. Intuitively, we calculate whether the new transcription at time $t$ decreases the edit-distance compared to the previous transcription, and we multiply the difference by $-1$ so that the reward is positive if our new edit-distance at time $t$ is smaller than the previous edit-distance at time $t-1$.

4. EXPERIMENT

4.1. Speech Dataset and Feature Extraction

In this study, we investigated the performance of our proposed method on WSJ [15] with identical definitions of training, development, and test sets as the Kaldi s5 recipe [16]. We separated WSJ into two experiments using WSJ-SI84 only and WSJ-SI284 data for training. We used dev_93 for our validation set and eval_92 for our test set. We used the character sequence as our decoder target and followed the preprocessing steps proposed by [17]. The text from all the utterances was mapped into a 32-character set: 26 (a-z) letters of the alphabet, apostrophes, periods, dashes, space, noise, and eos. In all experiments, we extracted the 40 dims + Δ + ΔΔ (total 120 dimensions) log Mel-spectrogram features from our speech and normalized every dimension to zero mean and unit variance.

4.2. Model Architecture

On the encoder side, we fed our input features into a linear layer with 512 hidden units followed by the LeakyReLU [18] activation function. We used three bidirectional LSTMs (Bi-LSTM) for our encoder with 256 hidden units for each LSTM (total 512 hidden units for Bi-LSTM). To improve the running time and reduce the memory consumption, we used hierarchical subsampling [19, 5] on the top two Bi-LSTM layers and reduced the number of encoder time-steps by a factor of 4. On the decoder side, we used a 128-dimensional embedding matrix to transform the input graphemes into a continuous vector, followed by one unidirectional LSTM with 512 hidden units. For the scorer function inside the attention module, we used the MLP scorer (Eq. 3) with 256 hidden units. We trained with the Adam [20] optimizer with a learning rate of 5e-4.

In the training phase, we started to train our model with MLE (Eq. 8) until convergence. After that, we continued training by adding an RL-based objective until our model stopped improving. For our RL-based objective, we tried four scenarios: the time-distributed reward with three different discount factors $\gamma = \{0, 0.5, 0.95\}$, and only the global reward $R$ (Eq. 5). To calculate the gradient based on Eq. 7, we sampled up to $M = 15$ sequences for each utterance. In the decoding phase, we extracted our transcription with a beam search strategy (beam size = 5) and normalized log-likelihood $\log P(Y|X; \theta)$ by dividing it by the transcription length to prevent the decoder from favoring shorter transcriptions. We did not use any language model or lexicon dictionary in this work. All of our models were implemented on the PyTorch framework 1.

1 PyTorch https://github.com/pytorch/pytorch/

5. RESULTS AND DISCUSSION

Table 1. Character error rate (CER) results from baseline and proposed models on the WSJ-SI84 and WSJ-SI284 datasets. All results were produced without a language model or lexicon dictionary.

Model                                   WSJ-SI84 CER (%)   WSJ-SI284 CER (%)
------------------------------------------------------------------------------
MLE
  CTC [21]                              20.34              8.97
  Att Enc-Dec Content [21]              20.06              11.08
  Att Enc-Dec Location [21]             17.01              8.17
  Joint CTC+Att (MTL) [21]              14.53              7.36
  Att Enc-Dec (ours)                    17.68              7.69
MLE + RL
  Att Enc-Dec (final reward R)          15.46              7.26
  Att Enc-Dec (time reward R_t, γ = 0)      15.99          6.64
  Att Enc-Dec (time reward R_t, γ = 0.5)    15.05          6.37
  Att Enc-Dec (time reward R_t, γ = 0.95)   13.90          6.10

Table 1 shows all the experiment results from the WSJ-SI84 and WSJ-SI284 datasets. We compared our results with several published models such as CTC, Attention Encoder-Decoder, and Joint CTC-Attention models trained with the MLE objective. We also created our own baseline model with Attention Encoder-Decoder, trained only with the MLE objective. The difference in our Attention Encoder-Decoder ("Att Enc-Dec (ours)") is that our decoder calculates the attention probability and context vector based on the current hidden state instead of the previous hidden state. We also reused the previous context vector by concatenating it with the input embedding vector. We explored several configurations: using only final reward $R$, and using time-distributed reward $R_t$ with different values $\gamma = [0, 0.5, 0.95]$. Our results show that combining teacher forcing with the policy gradient approach improved our model performance significantly compared to a system trained with the teacher-forcing method only.
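To make the role of the discount factor in these configurations concrete, the following sketch computes the per-step edit-distance reward of Section 3.2 and the time-distributed reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$ for the three values of $\gamma$ on a toy pair of strings. It is an illustration under stated assumptions rather than the implementation used in the experiments: the helper names `edit_distance`, `step_rewards`, and `discounted_returns` are hypothetical, the eos symbol is omitted, and the strings are invented.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j - 1] + (0 if h == r else 1),  # substitution / match
                            prev[j] + 1,                          # deletion
                            curr[j - 1] + 1))                     # insertion
        prev = curr
    return prev[-1]


def step_rewards(hyp, ref):
    """Per-step reward: negative change in edit distance when the prefix grows by one symbol.

    The t = 1 case compares against |ref|, which equals the edit distance
    between the empty prefix and the reference.
    """
    rewards = []
    prev = len(ref)
    for t in range(1, len(hyp) + 1):
        curr = edit_distance(hyp[:t], ref)
        rewards.append(-(curr - prev))
        prev = curr
    return rewards


def discounted_returns(rewards, gamma):
    """R_t = sum_{i >= t} gamma^(i - t) * r_i, computed right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    reference, sample = "cat", "cut"  # toy strings, not data from the experiments
    r = step_rewards(sample, reference)
    print("step rewards:", r)
    for gamma in (0.0, 0.5, 0.95):
        print("gamma =", gamma, "->", discounted_returns(r, gamma))
```

With $\gamma = 0$ each time-step only sees its own immediate reward, whereas larger values let a step share credit for later improvements. Summed without discounting, the step rewards telescope to $|y^{(n)}| - \mathrm{ED}(y, y^{(n)})$, so the sentence-level edit distance is still recoverable from them.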
Furthermore, we also found that discount factor $\gamma = 0.95$ gives the best performance on both datasets.

6. RELATED WORK

Reinforcement learning is a subfield of machine learning that creates an agent that interacts with its environment and learns how to maximize the rewards using some feedback signal. Many reinforcement learning applications exist, including building an agent that can learn how to play a game without any explicit knowledge [22, 23], control tasks in robotics [24], and dialogue system agents [25, 26]. Beyond these areas, reinforcement learning has also been adopted for improving sequence-based neural network models. Ranzato et al. [27] proposed an idea that combined REINFORCE with an MLE objective for training, called MIXER. In the early stage of training, the first $s$ steps are trained with MLE and the remaining $T - s$ steps with REINFORCE. They decrease $s$ as the training progresses over time. By using REINFORCE, they trained the model using non-differentiable task-related rewards (e.g., BLEU for machine translation). In this paper, we did not need to deal with any scheduling or mix any sampling with teacher-forcing ground-truth. Furthermore, MIXER did not sample multiple sequences based on the REINFORCE Monte Carlo approximation, and it was not investigated on an ASR system.

In a machine translation task, Shen et al. [12] improved the neural machine translation (NMT) model using Minimum Risk Training (MRT). The Google NMT [9] system combined MLE and MRT objectives to achieve better results. For the ASR task, Shannon [28] performed WER optimization by sampling paths from the lattices used during sMBR training, which might be similar to the REINFORCE algorithm. However, that work was only applied to a CTC-based model. From the probabilistic perspective, the MRT formulation resembles the expected reward formulation used in reinforcement learning. Here, the MRT formulation equally distributes the sentence-level loss into all of the time-steps in the sample.
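The distinction between a sentence-level weight shared by every time-step and a separate weight per time-step can be written down as two surrogate objectives. The sketch below is an assumption-laden illustration, not the authors' code: the function names are hypothetical, the log-probabilities and rewards are invented numbers, and `normalize` uses plain statistics over the given values rather than a moving average. Differentiating such a surrogate with respect to the model parameters, with the rewards treated as constants, yields a gradient of the score-function (REINFORCE) form.

```python
import math


def normalize(values, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation (variance reduction)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def sentence_level_objective(log_probs, final_reward):
    """Every time-step's log-probability is weighted by the same sentence-level reward."""
    return sum(final_reward * lp for lp in log_probs)


def time_distributed_objective(log_probs, returns):
    """Each time-step t is weighted by its own reward R_t."""
    return sum(r * lp for r, lp in zip(returns, log_probs))


def monte_carlo_average(per_sample_objectives):
    """Empirical expectation over the M sampled transcriptions."""
    return sum(per_sample_objectives) / len(per_sample_objectives)


if __name__ == "__main__":
    # Two hypothetical sampled transcriptions of one utterance.
    log_probs = [[math.log(0.6), math.log(0.5), math.log(0.9)],
                 [math.log(0.3), math.log(0.4)]]
    final_rewards = normalize([2.0, 0.0])        # one scalar per sample
    returns = [[1.9, 0.95, 1.0], [0.0, 0.0]]     # one value per time-step
    shared = monte_carlo_average(
        [sentence_level_objective(lp, r) for lp, r in zip(log_probs, final_rewards)])
    per_step = monte_carlo_average(
        [time_distributed_objective(lp, rt) for lp, rt in zip(log_probs, returns)])
    print(shared, per_step)
```

When every entry of `returns` equals the final reward, the two objectives coincide, which is the sense in which the sentence-level formulation is a special case of the per-step one.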
In contrast to MRT, we applied the RL strategy to an ASR task and found that using final reward $R$ is not an effective method for training our system because the loss diverged and produced a worse result. Therefore, we proposed a temporal structure and applied time-distributed reward $R_t$. Our results demonstrate that we improved our performance significantly compared to the baseline system.

7. CONCLUSION

We introduced an alternative strategy for training sequence-to-sequence ASR models by integrating the idea from reinforcement learning. Our proposed method integrates the power of sequence-to-sequence approaches to learn the mapping between the speech signal and the text transcription with the strength of reinforcement learning to optimize the model with the ASR performance metric directly. We also explored several different scenarios for training with the RL-based objective. Our results show that by combining the RL-based objective with the MLE objective, we significantly improved our model performance compared to the model trained only with the MLE objective. The best system achieved up to 6.10% CER on WSJ-SI284 using time-distributed reward settings and discount factor $\gamma = 0.95$.

8. ACKNOWLEDGEMENTS

Part of this work was supported by JSPS KAKENHI Grant Numbers JP17H06101 and JP17K00237.

9. REFERENCES

[1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104-3112.
[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[3] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048-2057.
[4] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and tell: A neural image caption generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156-3164.
[5] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in Proc. ICASSP, 2016. IEEE, 2016, pp. 4945-4949.
[6] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960-4964.
[7] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura, "Attention-based wav2text with feature transfer learning," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017, 2017, pp. 309-315.
[8] Ronald J Williams and David Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270-280, 1989.
[9] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," arXiv preprint arXiv:1609.08144, 2016.
[10] Ronald J Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3-4, pp. 229-256, 1992.
[11] Richard S. Sutton and Andrew G. Barto, Introduction to Reinforcement Learning, MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[12] Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu, "Minimum risk training for neural machine translation," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016.
[13] Evan Greensmith, Peter L Bartlett, and Jonathan Baxter, "Variance reduction techniques for gradient estimates in reinforcement learning," Journal of Machine Learning Research, vol. 5, no. Nov, pp. 1471-1530, 2004.
[14] Andriy Mnih and Karol Gregor, "Neural variational inference and learning in belief networks," in Proceedings of the 31st International Conference on International Conference on Machine Learning-Volume 32. JMLR.org, 2014, pp. II-1791.
[15] Douglas B. Paul and Janet M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language, Stroudsburg, PA, USA, 1992, HLT '91, pp. 357-362, Association for Computational Linguistics.
[16] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. Dec. 2011, IEEE Signal Processing Society, IEEE Catalog No.: CFP11SRW-USB.
[17] Awni Y Hannun, Andrew L Maas, Daniel Jurafsky, and Andrew Y Ng, "First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs," arXiv preprint arXiv:1408.2873, 2014.
[18] Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li, "Empirical evaluation of rectified activations in convolutional network," arXiv preprint arXiv:1505.00853, 2015.
[19] Alex Graves et al., Supervised sequence labelling with recurrent neural networks, vol. 385, Springer, 2012.
[20] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[21] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017.
[22] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529-533, 02 2015.
[23] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[24] Jens Kober and Jan Peters, "Reinforcement learning in robotics: A survey," in Reinforcement Learning, pp. 579-610. Springer, 2012.
[25] Satinder P Singh, Michael J Kearns, Diane J Litman, and Marilyn A Walker, "Reinforcement learning for spoken dialogue systems," in Advances in Neural Information Processing Systems, 2000, pp. 956-962.
[26] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky, "Deep reinforcement learning for dialogue generation," arXiv preprint arXiv:1606.01541, 2016.
[27] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba, "Sequence level training with recurrent neural networks," arXiv preprint arXiv:1511.06732, 2015.
[28] Matt Shannon, "Optimizing expected word error rate via sampling for speech recognition," arXiv preprint arXiv:1706.02776, 2017.
