Channel Mapping using Bidirectional Long Short-Term Memory for Dereverberation in Hands-Free Voice Controlled Devices


Zixing Zhang, Joel Pinto, Christian Plahl, Björn Schuller, Member, IEEE, and Daniel Willett

Abstract: In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learned amount of temporal context. The main objective of this technique is to minimize the mismatch between the distant talk (reverberant/distorted) speech and the close talk (clean) speech. To achieve this, the network is trained to map the cepstral feature space of the distant talk channel to its close talk counterpart frame by frame, as a regression task. The method has been evaluated on a realistically recorded reverberant French corpus through a large set of experiments that compare a variety of network architectures, investigate different network training targets (differential or absolute), and combine the approach with common adaptation techniques. In addition, the robustness of the technique is assessed by cross-room evaluation on both a simulated French corpus and a realistic English corpus. Experimental results show that the proposed BLSTM dereverberation models trained on differential targets reduce the word error rate (WER) by 16% relatively on the French corpus (intra-room scenario) and by 8% relatively on the English corpus (inter-room scenario).

Index Terms: Hands-Free Voice Controlled Devices, Bidirectional Long Short-Term Memory, Indirect Feature Enhancement, Dereverberation.
I. INTRODUCTION

Human computer interaction via voice is increasingly being used and accepted in consumer electronics because of the advantages of hands-free operation: simplicity, mobility, customizability, etc. For some personal computing devices such as notebooks and smart phones, the user is close to the microphones due to the inherent nature of these devices and their applications (e.g., a personal assistant). In many other applications such as digital television [1], set-top boxes, home automation [2], car navigation systems [3], and human robot interaction, the ultimate user experience is the ability to communicate hands-free from a distance of typically a few meters. In this case, however, the distant talk speech can undergo significant distortion due to room reverberation, echo from loudspeakers, and additive noise sources, which leads to a high word error rate in speech recognition and consequently results in a poor user experience.

The research leading to these results was sponsored by Nuance Communications, Inc., where Zixing Zhang pursued his internship from August 2013 to December. Z. Zhang is with the Machine Intelligence & Signal Processing Group, Institute for Human-Machine Communication, Technische Universität München, München, 80333, Germany (e-mail: zixing.zhang@tum.de). J. Pinto is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: joel.pinto@nuance.com). C. Plahl is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: christian.plahl@nuance.com). B. Schuller is with the Machine Intelligence & Signal Processing Group, Institute for Human-Machine Communication, Technische Universität München, München, 80333, Germany (e-mail: schuller@tum.de). He is also with the Department of Computing, Imperial College London, London, SW7 2AZ, United Kingdom (e-mail: bjoern.schuller@imperial.ac.uk). D. Willett is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: daniel.willett@nuance.com).

Contributed Paper. Manuscript received 07/09/14. Current version published 09/23/14. Electronic version published 09/23/14.
Reverberation is an undesired acoustic phenomenon in the context of speech recognition: the speech signal from the user reaches the microphone with different time delays and amplitude attenuations, caused by reflections off the various surfaces of the acoustic enclosure, such as a living room. The speech signal acquired by a microphone is a sum of three components: (a) the direct path signal, whose power is inversely proportional to the square of the distance from the speaker [4]; (b) the early reflections from the walls, floor, ceiling, etc., which depend on the position of the speaker; and (c) the late reverberation, which depends mainly on the size of the room and the reflective properties of the room surfaces, and is considered to be less dependent on the position of the speaker [4], [5].

In the past decades, extensive research has been carried out to handle such harmful effects. Based on what is addressed, the approaches can broadly be sorted into three categories: signal-, feature-, and model-based approaches. The signal-based approaches enhance the reverberant signal using temporal or spectral information. Typical methods include blind deconvolution by inverse filtering [6] and beamforming (e.g., the delay-and-sum method), which is based on multiple microphones [5]. The feature-based approaches attempt to remove the influence of reverberation directly from the corrupted feature vectors. Well-known techniques involve feature normalization like cepstral mean normalization (CMN), which is effective for mitigating early reverberation [7], extraction of expert-crafted features like RASTA-PLP [8], and so on. Both signal- and feature-based approaches are located in the front-end of an ASR system according to ETSI standard ES. The model-based approaches are applied in the

IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014

back-end of an ASR system, adjusting the parameters of the acoustic model to the statistical properties of the reverberant feature vectors or tailoring the decoder to them. One or more adaptation techniques are applied, for example, maximum a posteriori (MAP) adaptation [9], maximum likelihood linear regression (MLLR) [10], and feature-space MLLR (fMLLR, or CMLLR) [11], to reduce the mismatch between Hidden Markov Models (HMMs) trained on clean speech and the reverberant speech. A prominent recent technique is to train deep neural networks (DNNs) [12] on a wide variety of reverberated data sources; the key objective is to transform the original speech features into a high-level representation. Its potential for noise-robust automatic speech recognition (ASR) has been demonstrated previously [13], [14].

Another approach which has lately received increasing attention is to use neural networks for feature enhancement. It aims to remove the reverberation-specific information from the distant talk speech by learning a mapping rule from the distant talk feature space to its close talk counterpart. The main advantage of this approach is that it leaves the feature extraction and the back-end untouched, as the mapping is performed after feature extraction and prior to decoding. Therefore, the technique can easily be integrated with any existing ASR system. This idea was first realized by employing a multi-layer perceptron (MLP) to map multi-channel array speech to clean speech. It was then extended by using recurrent neural networks (RNNs) [16] for the 2nd CHiME challenge [17], where reductions in word error rates were observed. Long short-term memory recurrent neural networks (LSTM-RNNs) [18], a more sophisticated form of RNNs, have nowadays been successfully applied to a variety of pattern recognition tasks, especially sequential tasks, e.g., handwriting recognition [19], continuous speech recognition [20], and driver distraction detection [21].
Compared with classic RNNs, LSTM neural networks adopt memory blocks in place of the individual artificial neurons. These networks can therefore learn an optimized range of contextual information, aiming to overcome the vanishing gradient problem of conventional RNNs [18], [22]. The superiority of LSTM neural networks (especially the bidirectional type, BLSTM) over DNNs and conventional RNNs has been empirically confirmed in several recent comparative studies [20], [23]. Moreover, in 2013 the effectiveness of LSTM networks in handling non-stationary noisy speech was first demonstrated [24] and later extended to enhancing reverberated noisy speech [25].

In this paper, BLSTM-RNNs are explored to learn the nonlinear feature mapping rule. In comparison with the previous work [25], this work contributes by (1) evaluating the BLSTM dereverberation approach through extensive experiments on realistic and synthesized reverberated speech, and comparing the approach with traditional network structures like the MLP and the (B)RNN in order to exploit the potential value of memory networks; (2) proposing the differential feature vectors between the distant talk (reverberant/distorted) speech and the close talk (clean) speech as training targets, which differs from the previous work [25], where only the absolute feature vectors of the close talk speech were adopted as training targets; (3) comparing and integrating our feature enhancement methods with the widely used adaptation algorithms MLLR and CMLLR; and (4) assessing the robustness of the techniques in scenarios with mismatched recording environments between training and evaluation sets.

The remainder of this paper is organized as follows. Section II describes the framework of a feature dereverberation system based on neural networks, which are trained on either absolute or differential targets, as presented successively. The details of the BLSTM structure are then presented in Section III.
Section IV mainly focuses on investigating the effectiveness of our methods by conducting large-scale experiments in various scenarios, after a short description of our databases and experimental setups. Finally, conclusions are drawn and possible future directions are pointed out in Section V.

II. FEATURE DEREVERBERATION BY NEURAL NETWORK

A. System Overview

The framework of BLSTM models for dereverberation in distant talk ASR is illustrated in Fig. 1. The clean talk signal $s(t)$ is corrupted by the convolutional noise $r(t)$ and the additive noise $n(t)$ when transmitting through the spatial channel. The observed distant talk signal $\hat{s}(t)$ at the microphone can thus be written as

$$\hat{s}(t) = s(t) * r(t) + n(t). \quad (1)$$

For the sake of simplicity, additive noise is ignored in this article. Thus, equation (1) becomes

$$\hat{s}(t) = s(t) * r(t). \quad (2)$$

The total length of the room impulse response (RIR) can be denoted as $T_{60}$, which represents the time taken for the energy in the impulse response to decay by 60 dB relative to the direct sound. The RIR $r(t)$ can be divided into two portions: the early reflection $r_e(t)$, which includes several strong reflections, and the late reverberation $r_l(t)$, which consists of a series of numerous indistinguishable reflections. That is,

$$r(t) = r_e(t) + r_l(t), \quad (3)$$

where

$$r_e(t) = \begin{cases} r(t) & 0 \le t < T \\ 0 & \text{otherwise,} \end{cases} \qquad r_l(t) = \begin{cases} r(t) & T \le t < T_{60} \\ 0 & \text{otherwise,} \end{cases} \quad (4)$$

and $T$ is the length of the spectral analysis window (20-30 ms). Thus, equation (2) can be rewritten as

$$\hat{s}(t) = s(t) * r_e(t) + s(t - T) * r_l(t). \quad (5)$$

When the length $T_{60}$ of the RIR is much shorter than the analysis window size $T$, $r(t)$ equals $r_e(t)$, which only affects the speech signals within a frame (analysis window). This linear distortion in the spectral
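The additive decomposition in (3)-(5) is easy to verify numerically. The sketch below (illustrative only; the helper name `split_rir` and the toy exponentially decaying RIR are assumptions, not from the paper) splits a toy RIR at the analysis-window boundary and checks, by linearity of convolution, that convolving with the two parts separately reproduces the full reverberant signal:

```python
import numpy as np

def split_rir(rir, sr, early_ms=25.0):
    """Split a room impulse response into early and late parts, per eqs. (3)-(4):
    the early part covers the first analysis-window length T (early_ms),
    the late part the remaining tail. Hypothetical helper, not the authors' code."""
    n_early = int(sr * early_ms / 1000.0)
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:n_early] = rir[:n_early]
    late[n_early:] = rir[n_early:]
    return early, late

# Toy setup: a decaying-exponential RIR at a reduced sample rate
sr = 1000
rir = np.exp(-np.arange(int(0.4 * sr)) / (0.1 * sr))  # ~400 ms tail
r_e, r_l = split_rir(rir, sr)

rng = np.random.default_rng(0)
s = rng.standard_normal(sr)                          # clean signal s(t)
s_hat = np.convolve(s, r_e) + np.convolve(s, r_l)    # additive split of r(t)
assert np.allclose(s_hat, np.convolve(s, rir))       # linearity: r = r_e + r_l
```

Only the late part, whose support extends beyond one analysis window, smears energy across subsequent frames; this is the term the network is later trained to compensate.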

domain can be effectively mitigated by conventional techniques like CMN [7].

Fig. 1. Framework of BLSTM models for dereverberation in distant talk ASR.

For most applications (e.g., in typical office and home environments), however, the reverberation time $T_{60}$ ranges from 200 to 1000 ms [26], which is much longer than the analysis window size, resulting in an undesirable influence on the following speech frames. For example, if the duration of an RIR is 1 s ($T_{60}$) and a feature frame is extracted every 10 ms, one RIR would smear across the following 100 frames. Therefore, this distorted speech, after applying the short-time discrete Fourier transform (STDFT), can be formulated as

$$\hat{S}(t,f) = S(t,f)\,R_e(t,f) + \sum_{d=1}^{D-1} S(t-d,f)\,R_l(d,f), \quad (6)$$

where $R_l(d,f)$ denotes the part of $R(f)$ (i.e., the STDFT of the RIR $r(t)$) corresponding to frame delay $d$. In this case, the channel distortion is no longer of multiplicative nature in the linear spectral domain; rather, it is convolutional. Assuming for simplicity that the phases of different frames are uncorrelated, the power spectrum of (6) can be approximated as

$$|\hat{S}(t,f)|^2 \approx |S(t,f)|^2\,|R_e(t,f)|^2 + \sum_{d=1}^{D-1} |S(t-d,f)|^2\,|R_l(d,f)|^2. \quad (7)$$

To extract the standardized feature vectors in the cepstral domain for ASR, logarithms and the discrete cosine transform (DCT) are applied to the above spectral signals. So,

$$D\big(\ln|\hat{S}(t,f)|^2\big) = D\big(\ln|S(t,f)|^2\big) + D\big(\ln|R_e(t,f)|^2\big) + D\big(\ln M(t,f)\big), \quad (8)$$

where $D$ denotes the discrete cosine transformation matrix, and

$$M(t,f) = \frac{|S(t,f)|^2\,|R_e(t,f)|^2 + \sum_{d=1}^{D-1} |S(t-d,f)|^2\,|R_l(d,f)|^2}{|S(t,f)|^2\,|R_e(t,f)|^2} = \frac{|\hat{S}(t,f)|^2}{|S(t,f)|^2\,|R_e(t,f)|^2}. \quad (9)$$

If the speech transmission channel is invariant within the sentence period, the second term $D(\ln|R_e(t,f)|^2)$ in (8) can be treated as a constant and can theoretically be removed simply by subtracting the cepstral mean over each utterance [7].
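Equation (8) implies that a time-invariant early-reflection channel appears as an additive constant in the cepstral domain, which per-utterance CMN removes exactly. A minimal sketch (the function name is an assumption, not from the paper):

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Per-utterance CMN: subtract the mean cepstral vector over time.

    cepstra: (T, D) array of cepstral feature vectors. Removes any additive
    constant in the cepstral domain, such as the early-reflection term
    D(ln |R_e(t,f)|^2) in eq. (8)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A time-invariant channel adds a constant offset per cepstral dimension;
# after CMN, the clean and channel-distorted utterances coincide.
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 13))
offset = rng.standard_normal((1, 13))          # constant channel term
assert np.allclose(cepstral_mean_normalization(clean + offset),
                   cepstral_mean_normalization(clean))
```

The late-reverberation term $D(\ln M(t,f))$, by contrast, varies from frame to frame and survives CMN, which motivates the learned mapping below.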
Therefore, the objective of our strategy is to get rid of the third term, $D(\ln M(t,f))$, where $M(t,f)$ is the ratio between the power spectrum of the whole observed distorted speech and that of the speech distorted only by the early reverberation (cf. (9)). The specific way this strategy is realized in this article is to apply neural networks to map the feature vectors $x^d_t$, extracted from the distant talk speech signals $\hat{s}(t)$, to the target ones frame by frame. Finally, the enhanced feature vectors $\hat{x}_t$ are fed into the ASR decoder.

B. Differential vs. Absolute Targets for Training Neural Networks

From (8) and (9), one can observe that the term $M(t,f)$ is not only related to the early reflection, but also convolved with the late reverberation of previous speech signals. Such a highly nonlinear and non-stationary characteristic makes dereverberation an extremely challenging task [5], [26]. To this end, using a nonlinear system to predict this term is a potentially promising approach. On the other hand, the close relationship of $M(t,f)$ with the numerous previous speech frames also implies the possibility of compensating for the late reverberation by leveraging long-term acoustic context. That is, exploiting the sequence of reverberant feature vectors preceding the current one might also be beneficial for mitigating the late reverberation. The traditional way to capture such contextual information is to use triphone HMMs, which has empirically proven insufficient for this task [17]. Motivated by these analyses, an approach is explored based on a nonlinear neural network with a more efficient context-learning ability [18], the BLSTM-RNN, to remove such convolved late reverberation in the cepstral domain.
More specifically, according to (8), two mappings could be applied to transform the distorted feature vectors $x^d_t$ from the distant speech signal $\hat{s}(t)$ into:

1) the corresponding absolute (clean) vectors $x^c_t$ from the close talk speech signals $s(t)$, by minimizing the mean squared error (MSE) objective function

$$J(\theta) = \sum_{n=1}^{N} (\hat{x}^c_n - x^c_n)^2, \quad (10)$$

where $\hat{x}^c$ is the predicted close talk feature and $N$ is the dimensionality of the feature vector. This direct channel mapping strategy has already been investigated previously [24], [25].

2) the corresponding differential (delta) vectors $x^\Delta_t$, which stem from the late reverberation term $M(t,f)$ (cf. (9)). Before training the neural network, the differential vectors are calculated by subtracting the feature vectors of the distant talk, $x^d_t$, from those of the corresponding close talk, $x^c_t$. When training the neural networks, the parameters are optimized by minimizing

$$J(\theta) = \sum_{n=1}^{N} (\hat{x}^\Delta_n - x^\Delta_n)^2, \quad (11)$$

where $\hat{x}^\Delta$ is the predicted differential feature. After that, the mapped differential vectors are added to the original distant talk feature vectors $x^d_t$ frame by frame, so as to compensate for the distortion caused by reverberation. This indirect channel mapping strategy is first proposed and investigated in this work.

III. BIDIRECTIONAL LONG SHORT-TERM MEMORY NEURAL NETWORK

As discussed in Section II, a nonlinear system with the capability of learning long-term contextual information is preferred to tackle the nonlinear, non-stationary, and highly convolved late reverberation. The conventional MLP propagates the input signals unidirectionally layer by layer with sigmoid activations and without any recurrent connections, and needs to stack several successive feature vectors as input. Nevertheless, its capability of capturing contextual information is still limited by the chosen context [27]. Another method to address this problem is to employ RNNs, where the output of a previous time step is looped back and used as additional input. However, research shows that standard RNNs cannot access long-range context, since the backpropagated error either blows up or decays over time (the vanishing gradient problem) [22].

Fig. 2. LSTM memory block. The symbols $f_g$, $f_i$, and $f_o$ denote the logistic sigmoid, tanh, and tanh activation functions, respectively; $i_t$, $o_t$, and $f_t$ are the activations of the input, output, and forget gates at time $t$, respectively; $x_t$, $h_t$, and $c_t$ represent the input, output, and cell values of the memory block at time $t$, respectively; $b$ is a bias.

To overcome this limitation, [18] introduced LSTM networks, which are able to store information in memory cells over a long period of time. LSTM networks can be interpreted as RNNs in which the traditional neurons are replaced by memory blocks (shown in Fig. 2). Similar to the cyclic connections in RNNs, these memory blocks are recurrently connected. Every memory block consists of self-connected linear memory cells and three multiplicative gate units: the input, output, and forget gates.
The input and output gates scale the input and output of the cell, while the forget gate scales the internal state. In other words, the three gates are responsible for writing, reading, and resetting the memory cell values, respectively. For example, if the forget gate is open and the input gate is closed (i.e., the input gate activation is close to zero), the activation of the cell will not be overwritten by new inputs, and the information from previous time steps can therefore be accessed at arbitrary later time steps by opening the output gate. (Please refer to [18] and [19] for more details.)

In particular, for a memory block, the activation of the input gate $i_t$ is composed of four components:

$$i_t = f_g(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \quad (12)$$

where $f_g$ denotes the logistic sigmoid function, the $W$ are the weight matrices of the corresponding connections, $x_t$ is the input vector, $h_t$ is the hidden vector, and $b_i$ is the gate bias. The activation of the forget gate $f_t$ follows the same principle and can be written as

$$f_t = f_g(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f). \quad (13)$$

The memory cell value $c_t$ sums the transformed input at time step $t$ and the previous cell activation multiplied by the forget gate activation, and is updated by

$$c_t = i_t\, f_i(W_{xc} x_t + W_{hc} h_{t-1} + b_c) + f_t\, c_{t-1}, \quad (14)$$

where $f_i$ is the tanh activation function. Finally, the output of the memory cell is controlled by the output gate activation

$$o_t = f_g(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \quad (15)$$

and delivered as

$$h_t = o_t\, f_o(c_t), \quad (16)$$

where $f_o$ is also a tanh activation function. Note that each memory block can be regarded as a separate, independent unit. Therefore, if each memory block includes one memory cell, the activation vectors $i_t$, $o_t$, $f_t$, and $c_t$ are all of the same size as $h_t$, i.e., the number of memory blocks in the hidden layer.
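Equations (12)-(16) condense into a single forward step per memory block. The sketch below is an illustrative NumPy implementation; the parameter names and the diagonal (vector-valued) treatment of the peephole weights $W_{ci}$, $W_{cf}$, $W_{co}$ follow the common Graves-style formulation and are assumptions, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM memory-block step following eqs. (12)-(16), with peephole
    connections (W_ci, W_cf act on c_{t-1}; W_co on the current c_t).
    p is a dict of weights/biases; a sketch under stated assumptions."""
    i = sigmoid(p['Wxi'] @ x + p['Whi'] @ h_prev + p['Wci'] * c_prev + p['bi'])  # (12)
    f = sigmoid(p['Wxf'] @ x + p['Whf'] @ h_prev + p['Wcf'] * c_prev + p['bf'])  # (13)
    c = i * np.tanh(p['Wxc'] @ x + p['Whc'] @ h_prev + p['bc']) + f * c_prev     # (14)
    o = sigmoid(p['Wxo'] @ x + p['Who'] @ h_prev + p['Wco'] * c + p['bo'])       # (15)
    h = o * np.tanh(c)                                                           # (16)
    return h, c

# Tiny example: 4 inputs, 3 memory blocks; peepholes are diagonal (vectors).
rng = np.random.default_rng(1)
n_in, n_h = 4, 3
p = {k: rng.standard_normal((n_h, n_in)) * 0.1 for k in ('Wxi', 'Wxf', 'Wxc', 'Wxo')}
p.update({k: rng.standard_normal((n_h, n_h)) * 0.1 for k in ('Whi', 'Whf', 'Whc', 'Who')})
p.update({k: rng.standard_normal(n_h) * 0.1 for k in ('Wci', 'Wcf', 'Wco', 'bi', 'bf', 'bc', 'bo')})

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
assert h.shape == (n_h,) and np.all(np.abs(h) < 1.0)  # h = o * tanh(c), so |h| < 1
```

Note that, matching eq. (15), the output gate reads the current cell state $c_t$, whereas the input and forget gates read $c_{t-1}$.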
From the formulas given above, it can be seen that the values of all memory cells and block outputs at the previous time step $t-1$ affect the activations of all input gates, output gates, and forget gates, and even the input units, at the current time step in the same layer. The only exception is the connection between the memory cell and the output gate: it is the current state of the memory cell, $c_t$, rather than the state from the previous time step that contributes to the output gate activation. Overall, the LSTM memory cell can store and access information over a long temporal range and thus avoids the vanishing gradient problem [18]. LSTM could therefore also be regarded as a natural extension of DNNs to temporal sequence data, where the deepness comes from layers through time.

Standard RNNs have access to past but not to future context. To exploit both past and future context, RNNs can be extended to bidirectional RNNs, where two separate recurrent hidden layers scan the input sequences in opposite directions [28]. The network calculates its forward hidden layer activations $h^f_t$ from the beginning to the end of the sequence, and its backward hidden layer activations $h^b_t$ from the end to the beginning of the sequence, and then updates the output layer by

$$y_t = W_{fy} h^f_t + W_{by} h^b_t + b_y, \quad (17)$$

where $W_{fy}$ and $W_{by}$ are the forward and backward weight matrices, and $b_y$ is the output bias vector. The forward and backward directed layers are connected to the same output layer, which can therefore access the whole context.

IV. EXPERIMENTS AND RESULTS

A. Databases

To demonstrate the effectiveness of the proposed methods, two databases, a French and an English corpus, were recorded beforehand in realistic acoustic environments. Both databases were collected for a speech controlled TV application, designed to enable the user to change the TV controls (volume, brightness, etc.) or browse the programs using her voice. Table I shows the statistics of the two databases.

The French corpus was recorded in a living room with furniture, where one microphone near the mouth records the close talk and a microphone array consisting of 16 channels records the distant talk. 22 native French speakers (11 female) were asked to speak naturally so as to control the TV as they wished, e.g., "je veux un film avec Cameron Diaz" (I want a movie with Cameron Diaz). In total, 8.3 h of recordings were obtained, comprising about 7 k sentences and 45 k words. The distant talk data obtained from the 16-channel microphone array are grouped into four disjoint sets (channels 1-4, 5-8, 9-12, and 13-16). The four-channel speech in each set is beamformed and noise-reduced to obtain a single speech signal. As a result, the amount of distant talk training/test data is four times its close talk counterpart. The whole database was then divided speaker-independently and equally into training and test sets.

TABLE I
DISTRIBUTION OF SPEAKERS, SENTENCES, WORDS, AND RECORDING TIME OF CLOSE TALK PER PARTITION OF THE FRENCH AND ENGLISH CORPORA.
                   French               English
                   train     test      train     test
# speakers (f/m)   11 (5/6)  11 (6/5)  9 (5/4)   11 (5/6)
# sentences
# words
time (hours)

Likewise, 6.3 h of recordings were captured for the English corpus, which comprises 20 speakers (10 female) and approximately 3 k sentences with 18 k words in total. For French, the training and test data sets were recorded in the same room; for English, these data sets were recorded in different rooms. The details of the French and English corpora are shown in Table I. In the following, the proposed techniques are mainly evaluated on the French corpus. The English database is used to study the impact of a mismatch in acoustic (room) environments between training and testing conditions.

B. Experimental Setup

The stereo training (close talk and distant talk) feature vectors are time-aligned such that the Pearson product-moment correlation coefficient (PCC) between the MFCC-0 time series is maximized. Training utterances with a maximum PCC lower than 0.9 were dropped to avoid utterances with severe channel distortions. The mapping techniques were evaluated on standard MFCCs. The 12-dimensional static MFCCs were appended with their first, second, and third order regression coefficients, resulting in a feature vector of size 48. The feature vectors $x^c_t$ and $x^d_t$ are extracted from the close and distant talk signals, respectively, every 10 ms using a window size of 25 ms. The differential feature vectors $x^\Delta_t$ are then obtained as $x^c_t - x^d_t$. Furthermore, before training the neural networks, global means and variances are calculated over the close talk, distant talk, and differential feature vectors of the whole neural network training set. Mean and variance normalization is then performed on the network inputs and targets (i.e., the absolute or the differential feature vectors) using the means and variances from the corresponding sets.

For the neural networks, both the input and output node numbers equal the dimension of the feature vector (48 in our case), except when stacked frames are used as input. One hidden layer with 200 neurons is chosen.
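The stereo data preparation described above, i.e., PCC-based filtering on the MFCC-0 series, construction of the differential targets $x^\Delta_t = x^c_t - x^d_t$, and global mean/variance normalization, can be sketched as follows (the function names are hypothetical, not from the paper):

```python
import numpy as np

def pearson(a, b):
    """Pearson product-moment correlation coefficient of two 1-D series."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prepare_targets(close_feats, distant_feats, pcc_min=0.9):
    """For each time-aligned stereo utterance pair (T x 48 MFCC arrays),
    keep it only if the PCC between the MFCC-0 series reaches pcc_min,
    and build the differential targets x_delta = x_c - x_d."""
    kept = []
    for x_c, x_d in zip(close_feats, distant_feats):
        if pearson(x_c[:, 0], x_d[:, 0]) >= pcc_min:
            kept.append((x_d, x_c - x_d))      # (network input, target)
    return kept

def global_mvn(frames_list):
    """Global mean/variance normalization over a list of (T x D) arrays."""
    stacked = np.vstack(frames_list)
    mu, sd = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
    return [(x - mu) / sd for x in frames_list], mu, sd

# Toy stereo pair: distant talk is a mildly corrupted copy of close talk
rng = np.random.default_rng(2)
x_c = rng.standard_normal((200, 48))
x_d = x_c + 0.1 * rng.standard_normal((200, 48))
pairs = prepare_targets([x_c], [x_d])
assert len(pairs) == 1                         # high PCC, so the pair is kept
inputs, mu, sd = global_mvn([p[0] for p in pairs])
assert np.allclose(inputs[0].mean(axis=0), 0.0, atol=1e-6)
```

At test time, the predicted (normalized) differential vectors would be denormalized with the stored statistics and added back onto $x^d_t$.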
In particular, for the LSTM memory blocks, the input and output gates adopt hyperbolic tangent (tanh) activation functions, and the forget gates take logistic activation functions. During network training, gradient descent is implemented with a learning rate of $10^{-5}$ and a momentum of 0.9. Zero-mean Gaussian noise with standard deviation 0.1 is added to the input activations in the training phase in order to improve generalization. All weights are randomly initialized in the range from -0.1 to 0.1. Finally, an early stopping strategy is used: training ends once no improvement of the MSE on the validation set has been observed for 20 epochs.

C. Speech Recognition Evaluation

The effectiveness of the different mapping strategies and neural network configurations was evaluated on an off-the-shelf research ASR system. The acoustic models were trained on mobile data collected on hand-held devices. The performance of the ASR is measured and compared in terms of the word error rate (WER) and its relative reduction (WERR), and the baselines for the close talk and distant talk of the French corpus are 11.8% and 19.41% WER, respectively.

1) Neural Network Architectures: Table II compares the performance of BLSTM networks with other networks such as the MLP and recurrent networks without memory (RNNs) or with memory (i.e., LSTM). Note that, according to empirical experience, the best performance for training the MLP and (B)RNN was achieved with a learning rate of $10^{-6}$, as opposed to $10^{-5}$ for the (B)LSTM networks. From Table II, it can be seen that when no context is used at the input of the MLP, there is an increase in WER compared to the baseline, whereas the recurrent neural networks (standard RNN and the more sophisticated LSTM) show lower WERs. This is because of their ability to capture contextual information implicitly. When the temporal

context at the input of the MLP is increased, there is a steady decrease in WER; with 600 hidden nodes and a context of 7 frames, a WERR of 7% is delivered over the baseline system.

TABLE II
PERFORMANCE OF THE BASELINE RECOGNIZER AND DEREVERBERANT SYSTEMS ADOPTING VARIOUS NEURAL NETWORK ARCHITECTURES (MLP, RNN, BRNN, LSTM, AND BLSTM) WITH DIFFERENT NUMBERS OF HIDDEN NEURONS AND STACKED FEATURE FRAMES. FR: FRAMES; WGT: WEIGHTS.

network                      # neurons   # fr   # wgt   WER[%]   WERR[%]
w/o mapping (close talk)
w/o mapping (distant talk)
MLP
RNN
BRNN
LSTM
BLSTM

RNN and LSTM models capture only past information. For dereverberation, however, it is important to learn the temporal smearing into future frames, because the distant talk signal is a delayed (future) and attenuated version of the close talk signal (cf. Section II-A). The bidirectional RNN and LSTM yield significant (one-sided z-test, p < 0.001) reductions in WER compared to the corresponding unidirectional models capturing only past information. It can also be seen that both uni- and bidirectional LSTM models give lower WERs than the simple RNN models. This can be attributed to the sophisticated architecture of the individual neurons compared to a simple neuron: previous acoustic information can be stored in the memory cell until the input gates and the forget gates allow it to be (partly) changed (cf. Section III). Moreover, when seven successive frames are simultaneously fed into the BLSTM networks, no improvement is observed (see Table II). Hence, the BLSTM seems to learn context better if feature frames are presented one by one, and the increased size of the input layer rather harms recognition performance. In addition, when going from one hidden layer with 200 neurons to three hidden layers, the performance improvement is not obvious. In the following experiments, one hidden layer with 200 neurons is kept as the BLSTM network architecture on the French corpus.
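The WERR figures quoted throughout are plain relative reductions with respect to a baseline WER. For example, using the French distant talk baseline of 19.41% WER and the mapped WER of 16.43% reported in Section IV-C3:

```python
def werr(wer_baseline, wer_system):
    """Relative word error rate reduction (WERR) in percent."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

# French distant talk: 19.41% baseline WER, 16.43% WER with BLSTM mapping
print(round(werr(19.41, 16.43), 1))  # → 15.4
```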
To visualize the mapping learned by the BLSTM model, the trajectories of MFCC-0 for two randomly selected utterances are plotted in Fig. 3. The figure shows three trajectories: close talk (red), distant talk (green), and mapped (or estimated) close talk (blue). It can be seen that the MFCC-0 curves of the mapped close talk speech (produced by the BLSTM networks) are closer to the original close talk curves than those of the distant talk speech during the speaking periods, and are smoother during the silence periods. This indicates that the reverberant signals and channel noise are successfully suppressed.

Fig. 3. The scaled MFCC-0 (0-255) of a close talk utterance (red), a distant talk one (green), and a mapped close talk one (blue) for two examples.

Such feature enhancement can be further confirmed over the entire training set and the whole feature vectors. Fig. 4 presents the PCCs of the 48 MFCCs between distant talk utterances (hollow circles and dotted line)/mapped utterances (solid circles and line) and close talk utterances over the whole training set. The PCCs are clearly boosted after the reverberated features are enhanced, which supports the observed ASR performance improvement from using a BLSTM dereverberation model.

Fig. 4. Pearson product-moment correlation coefficient (PCC) of the 48 MFCCs between distant talk utterances (hollow circles and dotted line)/mapped close talk utterances (solid circles and line) and close talk utterances over the whole training set.

2) Training on Differential Targets: As discussed in Subsection II-B, there are two ways to obtain the enhanced features from distant talk: either directly (training networks with absolute targets) or indirectly (training networks with differential targets). Table III compares the ASR performance of the two mapping approaches. Across three BLSTM network structures, the dereverberation models trained on differential targets perform better than the models trained on absolute targets when the network structure is simpler.
A gain of about 3% relative WERR (at the 0.05 significance level in a one-sided z-test) is achieved when only 144 neurons are used in a single hidden layer, compared to using absolute targets.

TABLE III
PERFORMANCE COMPARISON USING ABSOLUTE TARGETS AND DIFFERENTIAL TARGETS.

targets    # neurons   WER [%]   WERR [%]
abs
diff

To find the rationale behind this phenomenon, the distributions of the globally normalized log energies (MFCC-0) of the absolute targets (a) and the differential targets (b) over the whole French corpus are plotted in Fig. 5. The differential targets clearly have a symmetrical unimodal distribution centered around zero. In contrast, the absolute targets have a bimodal distribution, which could be harder to learn. Therefore, the simpler the neural network, the higher the gain obtained from training on the differential targets. This superiority of differential-target-based learning is further verified in Subsections IV-C3 and IV-C4.

Fig. 5. Distribution of the normalized log energy (MFCC-0) of the absolute targets (a) and the differential targets (b).

3) Incorporating CMLLR and MLLR: As the distant talk passes through the BLSTM dereverberation models, its feature vectors are transformed (almost) to the clean target on which most preexisting acoustic models are trained. Thus, this technique could also be considered a form of feature adaptation. It is interesting to see whether incorporating back-end adaptation techniques like CMLLR and MLLR can further enhance the ASR performance. As expected, without our mapping technique the WERs for distant talk decrease from 19.41%, over 19.01%, to 17.19% with no adaptation, CMLLR, and CMLLR + MLLR, respectively (as shown in Table IV). The WERs drop further to 16.43%, 16.34%, and 15.68% when integrating our suggested mapping technique, which results in 15.4%, 13.8%, and 7.8% relative WERR, respectively (all improvements are significant in a one-sided z-test).
Overall, the best result is achieved by combining both the mapping and the adaptation (CMLLR + MLLR) techniques, with 8.8% and 19.2% improvement in WERR in comparison with the adaptation techniques only and the baseline (without adaptation and mapping), respectively. Additionally, Table IV also shows that if close talk is falsely detected as distant talk and fed into the mapping and adaptation systems, the WER increases by about 10% relatively.

TABLE IV
ASR EVALUATION ON THE DISTANT TALK AND CLOSE TALK SETS, COMBINING BLSTM DEREVERBERATION AND ADAPTATION (CMLLR AND MLLR) TECHNIQUES. ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS.

[%]                             targets   distant talk      close talk
                                          WER     WERR      WER     WERR
w/o adaptation   w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff
w/ CMLLR         w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff
w/ CMLLR+MLLR    w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff

4) Inter-Room Evaluation: In the above experiments, the data set used for training the dereverberation model was recorded in the same room as the evaluation set. In real-life applications, however, the evaluation scenarios are unpredictable. That is, the acoustic environments (e.g., room size and type) for creating the training data normally mismatch the evaluation scenarios. To study this problem, several artificially reverberant corpora were synthesized from the close talk set of the French corpus by convolving it with various RIRs and adding a little noise. The rooms used to create these RIRs differ from the ones in which the French corpus was recorded. When generating the simulated corpora, three elements were taken into account: the variation of the speakers' positions w.r.t. the microphones, the weight of the reverberant signal, and the weight of the noise signal. The first column of Table V shows the four scenarios of simulated speech. The remaining columns give the WER and WERR for each simulated corpus without mapping, with mapping to the absolute targets, and with mapping to the differential targets, respectively.
As observed from the table, the BLSTM dereverberant ASR systems prevail over the systems without dereverberation, leading overall to a relative WER reduction of 3.3% with the absolute targets and of 6.6% with the differential targets.

TABLE V
ASR EVALUATION ON THE ARTIFICIAL DISTANT TALK SET USING THE BLSTM DEREVERBERATION MODELS TRAINED ON THE NATURAL DISTANT TALK SET. POS: POSITION OF SPEAKERS W.R.T. MICROPHONES; R/N: REVERBERANT/NOISY SIGNAL WEIGHTS (DB); W/O: WITHOUT MAPPING; ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS. [%]
(scenario: Pos-1/Pos-2, R: -100/-30, N | w/o: WER | abs.: WER, WERR | diff.: WER, WERR)

IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014

In addition, the experiments were repeated on a realistic English corpus, for which the training and test sets were recorded in totally different rooms (cf. Section IV-A). The baselines for the distant talk of the English corpus are WERs of 18.30% and % for the training and test sets, both of which almost double the close-talk baselines (WERs of 9.27% and 9.48% for the training and test sets). As expected, a high gain is obtained on the training set when applying channel mapping. Nevertheless, such a high gain is not observed on the test set. Only when the differential targets are used to train the neural networks is a gain obtained on the mismatched test set, of 5.5% WERR, which is enlarged to 7.7% WERR when utterance-level CMN is applied [7]. In this experiment it can also be noticed that the indirect mapping (using differential targets for network training) significantly outperforms the direct mapping (using absolute targets).

TABLE VI
ASR EVALUATION ON THE TRAINING AND TEST SETS OF THE ENGLISH CORPUS BY USING THE BLSTM (ONE HIDDEN LAYER WITH 128 NEURONS) FEATURE DEREVERBERATION MODEL TRAINED ON THE TRAINING SET. ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS. CMN (UTT.): UTTERANCE-LEVEL CEPSTRAL MEAN NORMALIZATION. [%]
(system: w/o mapping (close talk), w/o mapping (distant talk), BLSTM, BLSTM+CMN(utt.) | targets: abs/diff | training set: WER, WERR | test set: WER, WERR)

From the above two experiments, the results imply that the inter-room scenario is more challenging than the intra-room scenario of Subsections IV-C1 to IV-C3. On the one hand, the performance improvement on both the training and test sets indicates that different rooms share some common reverberation information, which can be learned by the BLSTM networks. On the other hand, the different gains obtained on the training and test sets suggest that the networks probably learn too much information from a specific acoustic environment.

V. CONCLUSIONS

In this study, a feature-based dereverberation method was proposed and investigated for realistic hands-free voice-controlled devices.
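As an aside, the utterance-level cepstral mean normalization (CMN) that enlarged the test-set gain in Table VI is a one-line operation: subtracting each coefficient's mean computed over the utterance, which removes a stationary convolutive channel offset in the cepstral domain. A minimal sketch with a made-up feature matrix:

```python
import numpy as np

def utterance_cmn(features):
    """Utterance-level CMN: subtract the per-coefficient mean computed
    over all frames of the utterance (features: frames x coefficients)."""
    return features - features.mean(axis=0, keepdims=True)

# Made-up cepstral features with a constant channel offset of 3.0.
rng = np.random.default_rng(2)
feats = rng.normal(loc=3.0, size=(50, 13))
normalized = utterance_cmn(feats)
# Every coefficient now has (numerically) zero mean over the utterance.
```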
The basic idea is to use bidirectional long short-term memory (BLSTM) neural networks for channel mapping from the distant-talk cepstral feature space to its close-talk counterpart. In such an application scenario, the speech signal at each frame impacts the subsequent frames over the long term. This consequently requires a learning algorithm that can not only access long-term context information but also make use of future information. The bidirectional structure (past and future context) of LSTM neural networks is capable of dealing with both requirements. The experimental results on a French corpus show a word error rate reduction (WERR) of more than 16% for ASR, which significantly outperforms conventional Multilayer Perceptron (MLP) networks (one-sided z-test, p < 0.001) and bidirectional recurrent neural networks (BRNNs) (one-sided z-test, p < 0.05). The effectiveness of our feature mapping method is further confirmed by integrating the widely used adaptation techniques maximum likelihood linear regression (MLLR) and/or constrained MLLR (CMLLR), which yields the best performance of about 20% WERR. It is also confirmed in the inter-room evaluation scenario, where evaluation sets with mismatched acoustic environments also gain from channel mapping with BLSTM. This study also presents an indirect way of channel mapping: the differential feature vectors (between the distant-talk speech and the close-talk speech) serve as network targets, and the estimated differential feature vectors are then added to the original distant-talk features. The results of a large number of experiments show that this indirect mapping strategy can compete with the previously used direct mapping strategy, particularly in cases such as simple network structures and mismatched evaluation sets. All these cases are highly relevant for real-life applications. Due to the gap in gain between matched and mismatched evaluation cases, future work will focus on further exploiting joint acoustic information across different rooms with the goal of blind dereverberation.
One way to achieve this is to train the networks on a vast amount of reverberant speech collected in a variety of rooms. Further, one can add generalization terms such as weight decay to the objective function. In addition, it also seems beneficial to develop a way of selecting predefined mapping models for different room categories, in order to ultimately exploit the advantages of room-specific models.

REFERENCES
[1] K. Fujita, H. Kuwano, T. Tsuzuki, Y. Ono, and T. Ishihara, "A new digital TV interface employing speech recognition," IEEE Trans. Consum. Electron., vol. 49, no. 3.
[2] T. Giannakopoulos, N.-A. Tatlas, T. Ganchev, and I. Potamitis, "A practical, real-time speech-driven home automation front-end," IEEE Trans. Consum. Electron., vol. 51, no. 2.
[3] P. Ding, L. He, X. Yan, R. Zhao, and J. Hao, "Robust Mandarin speech recognition in car environments for embedded navigation system," IEEE Trans. Consum. Electron., vol. 54, no. 2.
[4] T. Virtanen, R. Singh, and B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition. New York, NY: John Wiley & Sons.
[5] P. A. Naylor and N. D. Gaubitch, Speech Dereverberation. Berlin: Springer.
[6] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2.
[7] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Process., vol. 29, no. 2.
[8] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4.
[9] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2.
[10] C. J. Leggetter and P. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, no. 2, 1995.

[11] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, no. 2.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6.
[13] T. Yoshioka, X. Chen, and M. J. Gales, "Impact of single-microphone dereverberation on DNN-based meeting transcription systems," in Proc. International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 2014.
[14] M. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[15] W. Li, J. Dines, and M. Magimai-Doss, "Robust overlapping speech recognition based on neural networks," Martigny, Switzerland, Tech. Rep. IDIAP-RR.
[16] A. L. Maas, T. M. O'Neil, A. Y. Hannun, and A. Y. Ng, "Recurrent neural network feature enhancement: The 2nd CHiME challenge," in Proc. CHiME Workshop, Vancouver, Canada, 2013.
[17] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: Datasets, tasks, baselines," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8.
[19] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5.
[20] C. Plahl, M. Kozielski, R. Schlüter, and H. Ney, "Feature combination and stacking of recurrent and non-recurrent neural networks for LVCSR," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[21] M. Wöllmer, C. Blaschke, T. Schindl, B. Schuller, B. Farber, S. Mayer, and B. Trefflich, "Online driver distraction detection using long short-term memory," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2.
[22] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. New York, NY: IEEE Press, 2001.
[23] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint.
[24] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[25] F. Weninger, J. Geiger, M. Wöllmer, B. Schuller, and G. Rigoll, "The Munich feature enhancement approach to the 2nd CHiME challenge using BLSTM recurrent neural networks," in Proc. CHiME Workshop, Vancouver, Canada, 2013.
[26] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, "Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition," IEEE Signal Process. Mag., vol. 29, no. 6.
[27] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in Proc. International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012.
[28] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11.

BIOGRAPHIES

Zixing Zhang received his master's degree (2010) in telecommunications from Beijing University of Posts and Telecommunications, Beijing, China. He is currently pursuing his Ph.D. degree as a researcher in the Machine Intelligence & Signal Processing (MISP) Group at the Institute for MMK at Technische Universität München (TUM) in Munich, Germany. His current research focuses on efficient machine learning algorithms for robust automatic speech recognition and computational paralinguistics.

Joel Pinto is a Research Manager at Nuance Communications in Aachen, Germany. He holds a Ph.D. from the École Polytechnique Fédérale de Lausanne, Switzerland (2010) and a Master in Engineering degree from the Indian Institute of Science, India (2003), both in Electrical Engineering. During his doctoral studies, he was with the Idiap Research Institute, Switzerland, working on neural network based acoustic modeling for automatic speech recognition. He was also with Hewlett-Packard Labs India, working in the area of speech and language technology.

Christian Plahl received the diploma degree in computer science from the University of Bielefeld, Bielefeld, Germany, in 2005 and the Ph.D. degree in Computer Science from RWTH Aachen University, Aachen, Germany. Since 2013 he has been working as a Research Scientist at Nuance Communications, Inc., Aachen, Germany. His research interests cover speech recognition, signal analysis, deep learning, and artificial neural networks.

Björn Schuller (M'05) received his diploma in 1999, his doctoral degree in 2006, and his habilitation in 2012, all in electrical engineering and information technology from TUM. He has been the tenured head of the MISP Group at TUM since 2006, and is a Senior Lecturer at Imperial College London's Department of Computing in the UK, CEO of audEERING UG (limited), Visiting Professor of HIT in Harbin, China, and an Associate of CISA in Geneva, Switzerland, and of Joanneum Research in Graz, Austria. He is best known for his work advancing machine intelligence for speech analysis.
He is the president of the Association for the Advancement of Affective Computing (AAAC), and a member of the IEEE and its Speech and Language Processing Technical Committee (SLTC), the ACM, and ISCA. He has (co-)authored five books and more than 400 publications in the field, leading to more than 6000 citations; his current h-index equals 39.

Daniel Willett received his diploma in Computer Science from the Technical University in Darmstadt in 1994 and his Ph.D. in Electrical Engineering from Duisburg University in 2000, both in Germany. He was a postdoc at NTT Communication Science Laboratories in Kyoto, Japan, from 2000 to 2002. In 2002, he joined Harman-Becker in Ulm, Germany, where he worked on in-car speech recognition until 2006, when he joined Nuance Communications in Aachen, Germany. There he has since been working on large vocabulary automatic speech recognition and nowadays runs a research team that focuses on cloud-based speech recognition. With nearly 20 years in the speech recognition area, he has contributed numerous publications to the field as well as innovations to the real-world speech recognition systems widely deployed today.


More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Deploying Agile Practices in Organizations: A Case Study

Deploying Agile Practices in Organizations: A Case Study Copyright: EuroSPI 2005, Will be presented at 9-11 November, Budapest, Hungary Deploying Agile Practices in Organizations: A Case Study Minna Pikkarainen 1, Outi Salo 1, and Jari Still 2 1 VTT Technical

More information

Reduce the Failure Rate of the Screwing Process with Six Sigma Approach

Reduce the Failure Rate of the Screwing Process with Six Sigma Approach Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Reduce the Failure Rate of the Screwing Process with Six Sigma Approach

More information

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of Technology Lincoln Laboratory {es,dar}@ll.mit.edu ABSTRACT

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS Pranay Dighe Afsaneh Asaei Hervé Bourlard Idiap Research Institute, Martigny, Switzerland École Polytechnique Fédérale de Lausanne (EPFL),

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

Model Ensemble for Click Prediction in Bing Search Ads

Model Ensemble for Click Prediction in Bing Search Ads Model Ensemble for Click Prediction in Bing Search Ads Xiaoliang Ling Microsoft Bing xiaoling@microsoft.com Hucheng Zhou Microsoft Research huzho@microsoft.com Weiwei Deng Microsoft Bing dedeng@microsoft.com

More information

Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang 1, Xin Lei 1, Wen Wang 2, Takahiro Shinozaki 1 1 Univ. of Washington, Dept. of Electrical Engineering, Seattle, WA 98195 USA 2

More information

Speaker Identification by Comparison of Smart Methods. Abstract

Speaker Identification by Comparison of Smart Methods. Abstract Journal of mathematics and computer science 10 (2014), 61-71 Speaker Identification by Comparison of Smart Methods Ali Mahdavi Meimand Amin Asadi Majid Mohamadi Department of Electrical Department of Computer

More information

First Grade Standards

First Grade Standards These are the standards for what is taught throughout the year in First Grade. It is the expectation that these skills will be reinforced after they have been taught. Mathematical Practice Standards Taught

More information

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention Damien Teney 1, Peter Anderson 2*, David Golub 4*, Po-Sen Huang 3, Lei Zhang 3, Xiaodong He 3, Anton van den Hengel 1 1

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen TRANSFER LEARNING OF WEAKLY LABELLED AUDIO Aleksandr Diment, Tuomas Virtanen Tampere University of Technology Laboratory of Signal Processing Korkeakoulunkatu 1, 33720, Tampere, Finland firstname.lastname@tut.fi

More information

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department

More information

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon

More information

Affective Classification of Generic Audio Clips using Regression Models

Affective Classification of Generic Audio Clips using Regression Models Affective Classification of Generic Audio Clips using Regression Models Nikolaos Malandrakis 1, Shiva Sundaram, Alexandros Potamianos 3 1 Signal Analysis and Interpretation Laboratory (SAIL), USC, Los

More information

ACTIVITY: Comparing Combination Locks

ACTIVITY: Comparing Combination Locks 5.4 Compound Events outcomes of one or more events? ow can you find the number of possible ACIVIY: Comparing Combination Locks Work with a partner. You are buying a combination lock. You have three choices.

More information

The Evolution of Random Phenomena

The Evolution of Random Phenomena The Evolution of Random Phenomena A Look at Markov Chains Glen Wang glenw@uchicago.edu Splash! Chicago: Winter Cascade 2012 Lecture 1: What is Randomness? What is randomness? Can you think of some examples

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 Ranniery Maia 1,2, Jinfu Ni 1,2, Shinsuke Sakai 1,2, Tomoki Toda 1,3, Keiichi Tokuda 1,4 Tohru Shimizu 1,2, Satoshi Nakamura 1,2 1 National

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках

Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках Тарасов Д. С. (dtarasov3@gmail.com) Интернет-портал reviewdot.ru, Казань,

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information