Channel Mapping using Bidirectional Long Short-Term Memory for Dereverberation in Hands-Free Voice Controlled Devices


Zixing Zhang, Joel Pinto, Christian Plahl, Björn Schuller, Member, IEEE, and Daniel Willett

Abstract: In this article, the reverberation problem for hands-free voice controlled devices is addressed by employing Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks. Such networks use memory blocks in the hidden units, enabling them to exploit a self-learned amount of temporal context. The main objective of this technique is to minimize the mismatch between the distant talk (reverberant/distorted) speech and the close talk (clean) speech. To achieve this, the network is trained to map the cepstral feature space of the distant talk channel to its close talk counterpart frame by frame, as a regression task. The method has been evaluated on a realistically recorded reverberant French corpus through a large set of experiments that compare a variety of network architectures, investigate different network training targets (differential or absolute), and combine the approach with common adaptation techniques. In addition, the robustness of the technique is assessed by cross-room evaluation on both a simulated French corpus and a realistic English corpus. Experimental results show that the proposed BLSTM dereverberation models trained on differential targets reduce the word error rate (WER) by 16% relatively on the French corpus (intra-room scenario) and by 8% relatively on the English corpus (inter-room scenario).

Index Terms: Hands-Free Voice Controlled Devices, Bidirectional Long Short-Term Memory, Indirect Feature Enhancement, Dereverberation.
I. INTRODUCTION

Human computer interaction via voice is increasingly being used and accepted in consumer electronics because of the advantages of hands-free operation: simplicity, mobility, customizability, etc. For some personal computing devices such as notebooks and smart phones, the user is close to the microphones due to the inherent nature of these devices and their applications (e.g., a personal assistant). In many other applications such as digital television [1], set-top boxes, home automation [2], car navigation systems [3], and human robot interaction, the ultimate user experience is the ability to communicate hands-free from a distance of typically a few meters. In this case, however, the distant talk speech can undergo significant distortion due to room reverberation, echo from loudspeakers, and additive noise sources, which leads to a high word error rate in speech recognition and consequently results in a poor user experience.

The research leading to these results was sponsored by Nuance Communications, Inc., where Zixing Zhang pursued his internship from August 2013 to December. Z. Zhang is with the Machine Intelligence & Signal Processing Group, Institute for Human-Machine Communication, Technische Universität München, München, 80333, Germany (e-mail: zixing.zhang@tum.de). J. Pinto is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: joel.pinto@nuance.com). C. Plahl is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: christian.plahl@nuance.com). B. Schuller is with the Machine Intelligence & Signal Processing Group, Institute for Human-Machine Communication, Technische Universität München, München, 80333, Germany (e-mail: schuller@tum.de). He is also with the Department of Computing, Imperial College London, London, SW7 2AZ, United Kingdom (e-mail: bjoern.schuller@imperial.ac.uk). D. Willett is with Nuance Communications, Inc., Aachen, 52072, Germany (e-mail: daniel.willett@nuance.com).

Contributed Paper. Manuscript received 07/09/14. Current version published 09/23/14. Electronic version published 09/23/14.
Reverberation is an undesired acoustic phenomenon in the context of speech recognition: the speech signal from the user reaches the microphone with different time delays and amplitude attenuations, caused by reflections off the various surfaces of the acoustic enclosure, such as a living room. The speech signal acquired by a microphone is a sum of three components: (a) the direct path signal, whose power is inversely proportional to the square of the distance from the speaker [4]; (b) the early reflections from the walls, floor, ceiling, etc., which depend on the position of the speaker; and (c) the late reverberation, which depends mainly on the size of the room and the reflective properties of the room surfaces, and is considered to be less dependent on the position of the speaker [4], [5].

In the past decades, extensive research has been carried out to handle such harmful effects. Based on what is addressed, the approaches can broadly be sorted into three categories: signal-, feature-, and model-based approaches. The signal-based approaches enhance the reverberant signal using temporal or spectral information. Typical methods include blind deconvolution by inverse filtering [6] and beamforming (e.g., the delay-and-sum method), which is based on multiple microphones [5]. The feature-based approaches attempt to remove the influence of reverberation directly from the corrupted feature vectors. Well-known techniques involve feature normalization like cepstral mean normalization (CMN), which is effective for mitigating early reverberation [7], extraction of expert-crafted features like RASTA-PLP [8], and so on. Both signal- and feature-based approaches are located in the front-end of an ASR system according to ETSI standard ES. The model-based approaches are applied in the

IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014

back-end of an ASR system, adjusting the parameters of the acoustic model to the statistical properties of the reverberant feature vectors or tailoring the decoder to them. One or more adaptation techniques are applied, for example, maximum a posteriori (MAP) adaptation [9], maximum likelihood linear regression (MLLR) [10], and feature-space MLLR (fMLLR, or CMLLR) [11], to reduce the mismatch between Hidden Markov Models (HMMs) trained on clean speech and the reverberant speech. A prominent recent technique is to train deep neural networks (DNNs) [12] on a wide variety of reverberated data sources; the key objective is to transform the original speech features into a high-level representation. Its potential for noise-robust automatic speech recognition (ASR) has been demonstrated previously [13], [14].

Another approach which has lately received increasing attention is to use neural networks for feature enhancement. It aims to remove the reverberation-specific information from the distant talk speech by learning a mapping rule from the distant talk feature space to its close talk counterpart. The main advantage of this approach is that it leaves the feature extraction and the back-end untouched, as the mapping is performed after feature extraction and prior to decoding. Therefore, the technique can easily be integrated with any existing ASR system. This idea was first realized by employing a multi-layer perceptron (MLP) to map multi-channel array speech to clean speech. It was then extended by using recurrent neural networks (RNNs) [16] for the 2nd CHiME challenge [17], where reductions in word error rates were observed. Long short-term memory recurrent neural networks (LSTM-RNNs) [18], a more sophisticated form of RNNs, have nowadays been successfully applied to a variety of pattern recognition tasks, especially sequential tasks, e.g., handwriting recognition [19], continuous speech recognition [20], and driver distraction detection [21].
Compared with classic RNNs, LSTM neural networks adopt memory blocks in place of the individual artificial neurons. These networks can therefore learn an optimized range of contextual information, aiming to overcome the vanishing gradient problem of conventional RNNs [18], [22]. The superiority of LSTM neural networks (especially the bidirectional type, BLSTM) over DNNs and conventional RNNs has been empirically confirmed in several recent comparative studies [20], [23]. Moreover, in 2013 the effectiveness of LSTM networks in handling non-stationary noisy speech was first demonstrated [24] and later extended to enhancing reverberated noisy speech [25].

In this paper, BLSTM-RNNs are explored to learn the nonlinear feature mapping rule. In comparison with the previous work [25], this work contributes by (1) evaluating the BLSTM dereverberation approach through extensive experiments on realistic and synthesized reverberated speech, and comparing the approach with traditional network structures like the MLP and the (B)RNN in order to exploit the potential value of memory networks; (2) proposing the differential feature vectors between the distant talk (reverberant/distorted) speech and the close talk (clean) speech as training targets, which differs from the previous work [25], where only the absolute feature vectors of the close talk speech were adopted as training targets; (3) comparing and integrating our feature enhancement methods with the widely used adaptation algorithms MLLR and CMLLR; and (4) assessing the robustness of the techniques in scenarios with mismatched recording environments between training and evaluation sets.

The remainder of this paper is organized as follows. Section II describes the framework of a feature dereverberation system based on neural networks, which are trained on either absolute or differential targets, as presented successively. The details of the BLSTM structure are then presented in Section III.
Section IV mainly focuses on investigating the effectiveness of our methods by conducting large-scale experiments in various scenarios, after a short description of our databases and experimental setups. Finally, conclusions are drawn and possible future directions are pointed out in Section V.

II. FEATURE DEREVERBERATION BY NEURAL NETWORK

A. System Overview

The framework of BLSTM models for dereverberation in distant talk ASR is illustrated in Fig. 1. The clean talk signal $s(t)$ is corrupted by the convolutional noise $r(t)$ and the additive noise $n(t)$ when transmitting through the spatial channel. The observed distant talk signal $\hat{s}(t)$ at the microphone can thus be written as

$$\hat{s}(t) = s(t) * r(t) + n(t). \quad (1)$$

For the sake of simplicity, additive noise is ignored in this article. Thus, equation (1) becomes

$$\hat{s}(t) = s(t) * r(t). \quad (2)$$

The total length of the room impulse response (RIR) can be denoted as $T_{60}$, which represents the time taken for the energy in the impulse response to decay by 60 dB relative to the direct sound. The RIR $r(t)$ can be divided into two portions: the early reflection $r_e(t)$, which includes several strong reflections, and the late reverberation $r_l(t)$, which consists of a series of numerous indistinguishable reflections. That is,

$$r(t) = r_e(t) + r_l(t), \quad (3)$$

where

$$r_e(t) = \begin{cases} r(t) & 0 \le t < T \\ 0 & \text{otherwise,} \end{cases} \qquad r_l(t) = \begin{cases} r(t) & T \le t < T_{60} \\ 0 & \text{otherwise,} \end{cases} \quad (4)$$

and $T$ is the length of the spectral analysis window (20-30 ms). Thus, equation (2) can be rewritten as

$$\hat{s}(t) = s(t) * r_e(t) + s(t - T) * r_l(t). \quad (5)$$

When the length $T_{60}$ of the RIR is much shorter than the analysis window size $T$, $r(t)$ equals $r_e(t)$, which only affects the speech signals within a frame (analysis window). This linear distortion in the spectral
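The additive decomposition in (3)-(5) is easy to verify numerically. The sketch below (illustrative only; the helper name `split_rir` and the toy exponentially decaying RIR are assumptions, not from the paper) splits a toy RIR at the analysis-window boundary and checks, by linearity of convolution, that convolving with the two parts separately reproduces the full reverberant signal:

```python
import numpy as np

def split_rir(rir, sr, early_ms=25.0):
    """Split a room impulse response into early and late parts, per eqs. (3)-(4):
    the early part covers the first analysis-window length T (early_ms),
    the late part the remaining tail. Hypothetical helper, not the authors' code."""
    n_early = int(sr * early_ms / 1000.0)
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:n_early] = rir[:n_early]
    late[n_early:] = rir[n_early:]
    return early, late

# Toy setup: a decaying-exponential RIR at a reduced sample rate
sr = 1000
rir = np.exp(-np.arange(int(0.4 * sr)) / (0.1 * sr))  # ~400 ms tail
r_e, r_l = split_rir(rir, sr)

rng = np.random.default_rng(0)
s = rng.standard_normal(sr)                          # clean signal s(t)
s_hat = np.convolve(s, r_e) + np.convolve(s, r_l)    # additive split of r(t)
assert np.allclose(s_hat, np.convolve(s, rir))       # linearity: r = r_e + r_l
```

Only the late part, whose support extends beyond one analysis window, smears energy across subsequent frames; this is the term the network is later trained to compensate.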

domain can be effectively mitigated by conventional techniques like CMN [7].

Fig. 1. Framework of BLSTM models for dereverberation in distant talk ASR.

For most applications (e.g., in typical office and home environments), however, the reverberation time $T_{60}$ ranges from 200 to 1000 ms [26], which is much longer than the analysis window size, resulting in an undesirable influence on the following speech frames. For example, if the duration of an RIR is 1 s ($T_{60}$) and a feature frame is extracted every 10 ms, one RIR would smear across the following 100 frames. Therefore, this distorted speech, after applying the short-time discrete Fourier transform (STDFT), can be formulated as

$$\hat{S}(t,f) = S(t,f)\,R_e(t,f) + \sum_{d=1}^{D-1} S(t-d,f)\,R_l(d,f), \quad (6)$$

where $R_l(d,f)$ denotes the part of $R(f)$ (i.e., the STDFT of the RIR $r(t)$) corresponding to frame delay $d$. In this case, the channel distortion is no longer of multiplicative nature in the linear spectral domain; rather, it is convolutional. Assuming for simplicity that the phases of different frames are uncorrelated, the power spectrum of (6) can be approximated as

$$|\hat{S}(t,f)|^2 \approx |S(t,f)|^2\,|R_e(t,f)|^2 + \sum_{d=1}^{D-1} |S(t-d,f)|^2\,|R_l(d,f)|^2. \quad (7)$$

To extract the standardized feature vectors in the cepstral domain for ASR, logarithms and the discrete cosine transform (DCT) are applied to the above spectral signals. So,

$$D\big(\ln|\hat{S}(t,f)|^2\big) = D\big(\ln|S(t,f)|^2\big) + D\big(\ln|R_e(t,f)|^2\big) + D\big(\ln M(t,f)\big), \quad (8)$$

where $D$ denotes the discrete cosine transformation matrix, and

$$M(t,f) = \frac{|S(t,f)|^2\,|R_e(t,f)|^2 + \sum_{d=1}^{D-1} |S(t-d,f)|^2\,|R_l(d,f)|^2}{|S(t,f)|^2\,|R_e(t,f)|^2} = \frac{|\hat{S}(t,f)|^2}{|S(t,f)|^2\,|R_e(t,f)|^2}. \quad (9)$$

If the speech transmission channel is invariant within the sentence period, the second term $D(\ln|R_e(t,f)|^2)$ in (8) can be treated as a constant and can theoretically be removed simply by subtracting the cepstral mean over each utterance [7].
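Equation (8) implies that a time-invariant early-reflection channel appears as an additive constant in the cepstral domain, which per-utterance CMN removes exactly. A minimal sketch (the function name is an assumption, not from the paper):

```python
import numpy as np

def cepstral_mean_normalization(cepstra):
    """Per-utterance CMN: subtract the mean cepstral vector over time.

    cepstra: (T, D) array of cepstral feature vectors. Removes any additive
    constant in the cepstral domain, such as the early-reflection term
    D(ln |R_e(t,f)|^2) in eq. (8)."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A time-invariant channel adds a constant offset per cepstral dimension;
# after CMN, the clean and channel-distorted utterances coincide.
rng = np.random.default_rng(0)
clean = rng.standard_normal((100, 13))
offset = rng.standard_normal((1, 13))          # constant channel term
assert np.allclose(cepstral_mean_normalization(clean + offset),
                   cepstral_mean_normalization(clean))
```

The late-reverberation term $D(\ln M(t,f))$, by contrast, varies from frame to frame and survives CMN, which motivates the learned mapping below.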
Therefore, the objective of our strategy is to get rid of the third term, $D(\ln M(t,f))$, where $M(t,f)$ is the ratio between the power spectrum of the whole observed distorted speech and that of the speech distorted only by the early reverberation (cf. (9)). The specific way this strategy is realized in this article is to apply neural networks to map the feature vectors $x^d_t$, extracted from the distant talk speech signals $\hat{s}(t)$, to the target ones frame by frame. Finally, the enhanced feature vectors $\hat{x}_t$ are fed into the ASR decoder.

B. Differential vs. Absolute Targets for Training Neural Networks

From (8) and (9), one can observe that the term $M(t,f)$ is not only related to the early reflection, but also convolved with the late reverberation of previous speech signals. Such a highly nonlinear and non-stationary characteristic makes dereverberation an extremely challenging task [5], [26]. To this end, using a nonlinear system to predict this term is a potentially promising approach. On the other hand, the close relationship of $M(t,f)$ with the numerous previous speech frames also implies the possibility of compensating for the late reverberation by leveraging long-term acoustic context. That is, exploiting the sequence of reverberant feature vectors preceding the current one might also be beneficial for mitigating the late reverberation. The traditional way to capture such contextual information is to use triphone HMMs, which has empirically proven insufficient for this task [17]. Motivated by these analyses, an approach is explored based on a nonlinear neural network with a more efficient context-learning ability [18], the BLSTM-RNN, to remove such convolved late reverberation in the cepstral domain.
More specifically, according to (8), two mappings could be applied to transform the distorted feature vectors $x^d_t$ from the distant speech signal $\hat{s}(t)$ into:

1) the corresponding absolute (clean) vectors $x^c_t$ from the close talk speech signals $s(t)$, by minimizing the mean squared error (MSE) objective function

$$J(\theta) = \sum_{n=1}^{N} (\hat{x}^c_n - x^c_n)^2, \quad (10)$$

where $\hat{x}^c$ is the predicted close talk feature and $N$ is the dimensionality of the feature vector. This direct channel mapping strategy has already been investigated previously [24], [25].

2) the corresponding differential (delta) vectors $x^\Delta_t$, which stem from the late reverberation term $M(t,f)$ (cf. (9)). Before training the neural network, the differential vectors are calculated by subtracting the feature vectors of the distant talk, $x^d_t$, from those of the corresponding close talk, $x^c_t$. When training the neural networks, the parameters are optimized by minimizing

$$J(\theta) = \sum_{n=1}^{N} (\hat{x}^\Delta_n - x^\Delta_n)^2, \quad (11)$$

where $\hat{x}^\Delta$ is the predicted differential feature. After that, the mapped differential vectors are added to the original distant talk feature vectors $x^d_t$ frame by frame, so as to compensate for the distortion caused by reverberation. This indirect channel mapping strategy is first proposed and investigated in this work.

III. BIDIRECTIONAL LONG SHORT-TERM MEMORY NEURAL NETWORK

As discussed in Section II, a nonlinear system with the capability of learning long-term contextual information is preferred to tackle the nonlinear, non-stationary, and highly convolved late reverberation. The conventional MLP propagates the input signals unidirectionally layer by layer with sigmoid activations and without any recurrent connections, and needs to stack several successive feature vectors as input. Nevertheless, its capability of capturing contextual information is still limited by the chosen context [27]. Another method to address this problem is to employ RNNs, where the output of a previous time step is looped back and used as additional input. However, research shows that standard RNNs cannot access long-range context, since the backpropagated error either blows up or decays over time (the vanishing gradient problem) [22].

Fig. 2. LSTM memory block. The symbols $f_g$, $f_i$, and $f_o$ denote the logistic sigmoid, tanh, and tanh activation functions, respectively; $i_t$, $o_t$, and $f_t$ are the activations of the input, output, and forget gates at time $t$, respectively; $x_t$, $h_t$, and $c_t$ represent the input, output, and cell values of the memory block at time $t$, respectively; $b$ is a bias.

To overcome this limitation, [18] introduced LSTM networks, which are able to store information in memory cells over a long period of time. LSTM networks can be interpreted as RNNs in which the traditional neurons are replaced by memory blocks (shown in Fig. 2). Similar to the cyclic connections in RNNs, these memory blocks are recurrently connected. Every memory block consists of self-connected linear memory cells and three multiplicative gate units: the input, output, and forget gates.
The input and output gates scale the input and output of the cell, while the forget gate scales the internal state. In other words, the three gates are responsible for writing, reading, and resetting the memory cell values, respectively. For example, if the forget gate is open and the input gate is closed (i.e., the input gate activation is close to zero), the activation of the cell will not be overwritten by new inputs, and the information from previous time steps can therefore be accessed at arbitrary later time steps by opening the output gate. (Please refer to [18] and [19] for more details.)

In particular, for a memory block, the activation of the input gate $i_t$ is composed of four components:

$$i_t = f_g(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i), \quad (12)$$

where $f_g$ denotes the logistic sigmoid function, the $W$ are the weight matrices of the corresponding connections, $x_t$ is the input vector, $h_t$ is the hidden vector, and $b_i$ is the gate bias. The activation of the forget gate $f_t$ follows the same principle and can be written as

$$f_t = f_g(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f). \quad (13)$$

The memory cell value $c_t$ sums the transformed input at time step $t$ and the previous cell activation multiplied by the forget gate activation, and is updated by

$$c_t = i_t\, f_i(W_{xc} x_t + W_{hc} h_{t-1} + b_c) + f_t\, c_{t-1}, \quad (14)$$

where $f_i$ is the tanh activation function. Finally, the output of the memory cell is controlled by the output gate activation

$$o_t = f_g(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o), \quad (15)$$

and delivered as

$$h_t = o_t\, f_o(c_t), \quad (16)$$

where $f_o$ is also a tanh activation function. Note that each memory block can be regarded as a separate, independent unit. Therefore, if each memory block includes one memory cell, the activation vectors $i_t$, $o_t$, $f_t$, and $c_t$ are all of the same size as $h_t$, i.e., the number of memory blocks in the hidden layer.
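Equations (12)-(16) condense into a single forward step per memory block. The sketch below is an illustrative NumPy implementation; the parameter names and the diagonal (vector-valued) treatment of the peephole weights $W_{ci}$, $W_{cf}$, $W_{co}$ follow the common Graves-style formulation and are assumptions, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM memory-block step following eqs. (12)-(16), with peephole
    connections (W_ci, W_cf act on c_{t-1}; W_co on the current c_t).
    p is a dict of weights/biases; a sketch under stated assumptions."""
    i = sigmoid(p['Wxi'] @ x + p['Whi'] @ h_prev + p['Wci'] * c_prev + p['bi'])  # (12)
    f = sigmoid(p['Wxf'] @ x + p['Whf'] @ h_prev + p['Wcf'] * c_prev + p['bf'])  # (13)
    c = i * np.tanh(p['Wxc'] @ x + p['Whc'] @ h_prev + p['bc']) + f * c_prev     # (14)
    o = sigmoid(p['Wxo'] @ x + p['Who'] @ h_prev + p['Wco'] * c + p['bo'])       # (15)
    h = o * np.tanh(c)                                                           # (16)
    return h, c

# Tiny example: 4 inputs, 3 memory blocks; peepholes are diagonal (vectors).
rng = np.random.default_rng(1)
n_in, n_h = 4, 3
p = {k: rng.standard_normal((n_h, n_in)) * 0.1 for k in ('Wxi', 'Wxf', 'Wxc', 'Wxo')}
p.update({k: rng.standard_normal((n_h, n_h)) * 0.1 for k in ('Whi', 'Whf', 'Whc', 'Who')})
p.update({k: rng.standard_normal(n_h) * 0.1 for k in ('Wci', 'Wcf', 'Wco', 'bi', 'bf', 'bc', 'bo')})

h, c = np.zeros(n_h), np.zeros(n_h)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
assert h.shape == (n_h,) and np.all(np.abs(h) < 1.0)  # h = o * tanh(c), so |h| < 1
```

Note that, matching eq. (15), the output gate reads the current cell state $c_t$, whereas the input and forget gates read $c_{t-1}$.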
From the formulas given above, it can be seen that the values of all memory cells and block outputs at the previous time step $t-1$ affect the activations of all input gates, output gates, and forget gates, and even the input units, at the current time step in the same layer. The only exception is the connection between the memory cell and the output gate: it is the current state of the memory cell, $c_t$, rather than the state from the previous time step that contributes to the output gate activation. Overall, the LSTM memory cell can store and access information over a long temporal range and thus avoids the vanishing gradient problem [18]. LSTM could therefore also be regarded as a natural extension of DNNs to temporal sequence data, where the deepness comes from layers through time.

Standard RNNs have access to past but not to future context. To exploit both past and future context, RNNs can be extended to bidirectional RNNs, where two separate recurrent hidden layers scan the input sequences in opposite directions [28]. The network calculates its forward hidden layer activations $h^f_t$ from the beginning to the end of the sequence, and its backward hidden layer activations $h^b_t$ from the end to the beginning of the sequence, and then updates the output layer by

$$y_t = W_{fy} h^f_t + W_{by} h^b_t + b_y, \quad (17)$$

where $W_{fy}$ and $W_{by}$ are the forward and backward weight matrices, and $b_y$ is the output bias vector. The forward and backward directed layers are connected to the same output layer, which can therefore access the whole context.

IV. EXPERIMENTS AND RESULTS

A. Databases

To demonstrate the effectiveness of the proposed methods, two databases, a French and an English corpus, were recorded beforehand in realistic acoustic environments. Both databases were collected for a speech controlled TV application, designed to enable the user to change the TV controls (volume, brightness, etc.) or browse the programs using her voice. Table I shows the statistics of the two databases.

The French corpus was recorded in a living room with furniture, where one microphone near the mouth records the close talk and a microphone array consisting of 16 channels records the distant talk. 22 native French speakers (11 female) were asked to speak naturally so as to control the TV as they wished, e.g., "je veux un film avec Cameron Diaz" (I want a movie with Cameron Diaz). In total, 8.3 h of recordings were obtained, comprising about 7 k sentences and 45 k words. The distant talk data obtained from the 16-channel microphone array are grouped into four disjoint sets (channels 1-4, 5-8, 9-12, and 13-16). The four-channel speech in each set is beamformed and noise-reduced to obtain a single speech signal. As a result, the amount of distant talk training/test data is four times its close talk counterpart. The whole database was then divided speaker-independently and equally into training and test sets.

TABLE I
DISTRIBUTION OF SPEAKERS, SENTENCES, WORDS, AND RECORDING TIME OF CLOSE TALK PER PARTITION OF THE FRENCH AND ENGLISH CORPORA.
                   French               English
                   train     test      train     test
# speakers (f/m)   11 (5/6)  11 (6/5)  9 (5/4)   11 (5/6)
# sentences
# words
time (hours)

Likewise, 6.3 h of recordings were captured for the English corpus, which comprises 20 speakers (10 female) and approximately 3 k sentences with 18 k words in total. For French, the training and test data sets were recorded in the same room; for English, these data sets were recorded in different rooms. The details of the French and English corpora are shown in Table I. In the following, the proposed techniques are mainly evaluated on the French corpus. The English database is used to study the impact of a mismatch in acoustic (room) environments between training and testing conditions.

B. Experimental Setup

The stereo training (close talk and distant talk) feature vectors are time-aligned such that the Pearson product-moment correlation coefficient (PCC) between the MFCC-0 time series is maximized. Training utterances with a maximum PCC lower than 0.9 were dropped to avoid utterances with severe channel distortions. The mapping techniques were evaluated on standard MFCCs. The 12-dimensional static MFCCs were appended with their first, second, and third order regression coefficients, resulting in a feature vector of size 48. The feature vectors $x^c_t$ and $x^d_t$ are extracted from the close and distant talk signals, respectively, every 10 ms using a window size of 25 ms. The differential feature vectors $x^\Delta_t$ are then obtained as $x^c_t - x^d_t$. Furthermore, before training the neural networks, global means and variances are calculated over the close talk, distant talk, and differential feature vectors of the whole neural network training set. Mean and variance normalization is then performed on the network inputs and targets (i.e., the absolute or the differential feature vectors) using the means and variances from the corresponding sets.

For the neural networks, both the input and output node numbers equal the dimension of the feature vector (48 in our case), except when stacked frames are used as input. One hidden layer with 200 neurons is chosen.
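The stereo data preparation described above, i.e., PCC-based filtering on the MFCC-0 series, construction of the differential targets $x^\Delta_t = x^c_t - x^d_t$, and global mean/variance normalization, can be sketched as follows (the function names are hypothetical, not from the paper):

```python
import numpy as np

def pearson(a, b):
    """Pearson product-moment correlation coefficient of two 1-D series."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def prepare_targets(close_feats, distant_feats, pcc_min=0.9):
    """For each time-aligned stereo utterance pair (T x 48 MFCC arrays),
    keep it only if the PCC between the MFCC-0 series reaches pcc_min,
    and build the differential targets x_delta = x_c - x_d."""
    kept = []
    for x_c, x_d in zip(close_feats, distant_feats):
        if pearson(x_c[:, 0], x_d[:, 0]) >= pcc_min:
            kept.append((x_d, x_c - x_d))      # (network input, target)
    return kept

def global_mvn(frames_list):
    """Global mean/variance normalization over a list of (T x D) arrays."""
    stacked = np.vstack(frames_list)
    mu, sd = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
    return [(x - mu) / sd for x in frames_list], mu, sd

# Toy stereo pair: distant talk is a mildly corrupted copy of close talk
rng = np.random.default_rng(2)
x_c = rng.standard_normal((200, 48))
x_d = x_c + 0.1 * rng.standard_normal((200, 48))
pairs = prepare_targets([x_c], [x_d])
assert len(pairs) == 1                         # high PCC, so the pair is kept
inputs, mu, sd = global_mvn([p[0] for p in pairs])
assert np.allclose(inputs[0].mean(axis=0), 0.0, atol=1e-6)
```

At test time, the predicted (normalized) differential vectors would be denormalized with the stored statistics and added back onto $x^d_t$.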
In particular, for the LSTM memory blocks, the input and output gates adopt hyperbolic tangent (tanh) activation functions, and the forget gates take logistic activation functions. During network training, gradient descent is implemented with a learning rate of $10^{-5}$ and a momentum of 0.9. Zero-mean Gaussian noise with standard deviation 0.1 is added to the input activations in the training phase in order to improve generalization. All weights are randomly initialized in the range from -0.1 to 0.1. Finally, an early stopping strategy is used: training ends once no improvement of the MSE on the validation set has been observed for 20 epochs.

C. Speech Recognition Evaluation

The effectiveness of the different mapping strategies and neural network configurations was evaluated on an off-the-shelf research ASR system. The acoustic models were trained on mobile data collected on hand-held devices. The performance of the ASR is measured and compared in terms of the word error rate (WER) and its relative reduction (WERR), and the baselines for the close talk and distant talk of the French corpus are 11.8% and 19.41% WER, respectively.

1) Neural Network Architectures: Table II compares the performance of BLSTM networks with other networks such as the MLP and recurrent networks without memory (RNNs) or with memory (i.e., LSTM). Note that, according to empirical experience, the best performance for training the MLP and (B)RNN was achieved with a learning rate of $10^{-6}$, as opposed to $10^{-5}$ for the (B)LSTM networks. From Table II, it can be seen that when no context is used at the input of the MLP, there is an increase in WER compared to the baseline, whereas the recurrent neural networks (standard RNN and the more sophisticated LSTM) show lower WERs. This is because of their ability to capture contextual information implicitly. When the temporal

context at the input of the MLP is increased, there is a steady decrease in WER; with 600 hidden nodes and a context of 7 frames, a WERR of 7% is delivered over the baseline system.

TABLE II
PERFORMANCE OF THE BASELINE RECOGNIZER AND DEREVERBERANT SYSTEMS ADOPTING VARIOUS NEURAL NETWORK ARCHITECTURES (MLP, RNN, BRNN, LSTM, AND BLSTM) WITH DIFFERENT NUMBERS OF HIDDEN NEURONS AND STACKED FEATURE FRAMES. FR: FRAMES; WGT: WEIGHTS.

network                      # neurons   # fr   # wgt   WER[%]   WERR[%]
w/o mapping (close talk)
w/o mapping (distant talk)
MLP
RNN
BRNN
LSTM
BLSTM

RNN and LSTM models capture only past information. For dereverberation, however, it is important to learn the temporal smearing into future frames, because the distant talk signal is a delayed (future) and attenuated version of the close talk signal (cf. Section II-A). The bidirectional RNN and LSTM yield significant (one-sided z-test, p < 0.001) reductions in WER compared to the corresponding unidirectional models capturing only past information. It can also be seen that both uni- and bidirectional LSTM models give lower WERs than the simple RNN models. This can be attributed to the sophisticated architecture of the individual neurons compared to a simple neuron: previous acoustic information can be stored in the memory cell until the input gates and the forget gates allow it to be (partly) changed (cf. Section III). Moreover, when seven successive frames are simultaneously fed into the BLSTM networks, no improvement is observed (see Table II). Hence, the BLSTM seems to learn context better if feature frames are presented one by one, and the increased size of the input layer rather harms recognition performance. In addition, when going from one hidden layer with 200 neurons to three hidden layers, the performance improvement is not obvious. In the following experiments, one hidden layer with 200 neurons is kept as the BLSTM network architecture on the French corpus.
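The WERR figures quoted throughout are plain relative reductions with respect to a baseline WER. For example, using the French distant talk baseline of 19.41% WER and the mapped WER of 16.43% reported in Section IV-C3:

```python
def werr(wer_baseline, wer_system):
    """Relative word error rate reduction (WERR) in percent."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

# French distant talk: 19.41% baseline WER, 16.43% WER with BLSTM mapping
print(round(werr(19.41, 16.43), 1))  # → 15.4
```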
To visualize the mapping learned by the BLSTM model, the trajectories of MFCC-0 for two randomly selected utterances are plotted in Fig. 3. The figure shows three trajectories: close talk (red), distant talk (green), and mapped (or estimated) close talk (blue). It can be seen that the MFCC-0 curves of the mapped close talk speech (produced by the BLSTM networks) are closer to the original close talk curves than those of the distant talk speech during the speaking periods, and are smoother during the silence periods. This indicates that the reverberant signals and channel noise are successfully suppressed.

Fig. 3. The scaled MFCC-0 (0-255) of a close talk utterance (red), a distant talk one (green), and a mapped close talk one (blue) for two examples.

Such feature enhancement can be further confirmed over the entire training set and the whole feature vectors. Fig. 4 presents the PCCs of the 48 MFCCs between distant talk utterances (hollow circles and dotted line)/mapped utterances (solid circles and line) and close talk utterances over the whole training set. The PCCs are clearly boosted after the reverberated features are enhanced, which supports the observed ASR performance improvement from using a BLSTM dereverberation model.

Fig. 4. Pearson product-moment correlation coefficient (PCC) of the 48 MFCCs between distant talk utterances (hollow circles and dotted line)/mapped close talk utterances (solid circles and line) and close talk utterances over the whole training set.

2) Training on Differential Targets: As discussed in Subsection II-B, there are two ways to obtain the enhanced features from distant talk: either directly (training networks with absolute targets) or indirectly (training networks with differential targets). Table III compares the ASR performance of the two mapping approaches. Across three BLSTM network structures, the dereverberation models trained on differential targets perform better than the models trained on absolute targets when the network structure is simpler.
A gain of about 3% relative WERR (at the 0.05 significance level in a one-sided z-test) is achieved when only 144 neurons are used in a single hidden layer, compared to using absolute targets.

TABLE III
PERFORMANCE COMPARISON USING ABSOLUTE TARGETS AND DIFFERENTIAL TARGETS.

targets    # neurons   WER [%]   WERR [%]
abs
diff

To find the rationale behind this phenomenon, the distributions of the globally normalized log energies (MFCC-0) of the absolute targets (a) and the differential targets (b) over the whole French corpus are plotted in Fig. 5. The differential targets clearly have a symmetrical unimodal distribution centered around zero. In contrast, the absolute targets have a bimodal distribution, which could be harder to learn. Therefore, the simpler the neural network, the higher the gain obtained from training on the differential targets. This superiority of differential-target-based learning is further verified in Subsections IV-C3 and IV-C4.

Fig. 5. Distribution of the normalized log energy (MFCC-0) of the absolute targets (a) and the differential targets (b).

3) Incorporating CMLLR and MLLR: As the distant talk passes through the BLSTM dereverberation models, its feature vectors are transformed (almost) to the clean target on which most preexisting acoustic models are trained. Thus, this technique could also be considered a form of feature adaptation. It is interesting to see whether incorporating back-end adaptation techniques like CMLLR and MLLR can further enhance the ASR performance. As expected, without our mapping technique the WERs for distant talk decrease from 19.41%, over 19.01%, to 17.19% with no adaptation, CMLLR, and CMLLR + MLLR, respectively (as shown in Table IV). The WERs drop further to 16.43%, 16.34%, and 15.68% when integrating our suggested mapping technique, which results in 15.4%, 13.8%, and 7.8% relative WERR, respectively (all improvements are significant in a one-sided z-test).
Overall, the best result is achieved by combining both the mapping and the adaptation (CMLLR + MLLR) techniques, with 8.8% and 19.2% improvement in WERR in comparison with the adaptation techniques only and the baseline (without adaptation and mapping), respectively. Additionally, Table IV also shows that if close talk is falsely detected as distant talk and fed into the mapping and adaptation systems, the WER increases by about 10% relatively.

TABLE IV
ASR EVALUATION ON THE DISTANT TALK AND CLOSE TALK SETS, COMBINING BLSTM DEREVERBERATION AND ADAPTATION (CMLLR AND MLLR) TECHNIQUES. ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS.

[%]                             targets   distant talk      close talk
                                          WER     WERR      WER     WERR
w/o adaptation   w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff
w/ CMLLR         w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff
w/ CMLLR+MLLR    w/o mapping
                 w/ mapping     abs
                 w/ mapping     diff

4) Inter-Room Evaluation: In the above experiments, the data set used for training the dereverberation model was recorded in the same room as the evaluation set. In real-life applications, however, the evaluation scenarios are unpredictable. That is, the acoustic environments (e.g., room size and type) for creating the training data normally mismatch the evaluation scenarios. To study this problem, several artificially reverberant corpora were synthesized from the close talk set of the French corpus by convolving it with various RIRs and adding a little noise. The rooms used to create these RIRs differ from the ones in which the French corpus was recorded. When generating the simulated corpora, three elements were taken into account: the variation of the speakers' positions w.r.t. the microphones, the weight of the reverberant signal, and the weight of the noise signal. The first column of Table V shows the four scenarios of simulated speech. The remaining columns give the WER and WERR for each simulated corpus without mapping, with mapping to the absolute targets, and with mapping to the differential targets, respectively.
As observed from the table, the BLSTM dereverberant ASR systems prevail over the systems without dereverberation, leading overall to a relative WER reduction of 3.3% with the absolute targets and of 6.6% with the differential targets.

TABLE V
ASR EVALUATION ON THE ARTIFICIAL DISTANT TALK SET USING THE BLSTM DEREVERBERATION MODELS TRAINED ON THE NATURAL DISTANT TALK SET. POS: POSITION OF SPEAKERS W.R.T. MICROPHONES; R/N: REVERBERANT/NOISY SIGNAL WEIGHTS (DB); W/O: WITHOUT MAPPING; ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS. [%]
(scenario: Pos-1/Pos-2, R: -100/-30, N | w/o: WER | abs.: WER, WERR | diff.: WER, WERR)

IEEE Transactions on Consumer Electronics, Vol. 60, No. 3, August 2014

In addition, the experiments were repeated on a realistic English corpus, for which the training and test sets were recorded in totally different rooms (cf. Section IV-A). The baselines for the distant talk of the English corpus are WERs of 18.30% and % for the training and test sets, both of which almost double the close-talk baselines (WERs of 9.27% and 9.48% for the training and test sets). As expected, a high gain is obtained on the training set when applying channel mapping. Nevertheless, such a high gain is not observed on the test set. Only when the differential targets are used to train the neural networks is a gain obtained on the mismatched test set, of 5.5% WERR, which is enlarged to 7.7% WERR when utterance-level CMN is applied [7]. In this experiment it can also be noticed that the indirect mapping (using differential targets for network training) significantly outperforms the direct mapping (using absolute targets).

TABLE VI
ASR EVALUATION ON THE TRAINING AND TEST SETS OF THE ENGLISH CORPUS BY USING THE BLSTM (ONE HIDDEN LAYER WITH 128 NEURONS) FEATURE DEREVERBERATION MODEL TRAINED ON THE TRAINING SET. ABS./DIFF.: ABSOLUTE/DIFFERENTIAL TARGETS. CMN (UTT.): UTTERANCE-LEVEL CEPSTRAL MEAN NORMALIZATION. [%]
(system: w/o mapping (close talk), w/o mapping (distant talk), BLSTM, BLSTM+CMN(utt.) | targets: abs/diff | training set: WER, WERR | test set: WER, WERR)

From the above two experiments, the results imply that the inter-room scenario is more challenging than the intra-room scenario of Subsections IV-C1 to IV-C3. On the one hand, the performance improvement on both the training and test sets indicates that different rooms share some common reverberation information, which can be learned by the BLSTM networks. On the other hand, the different gains obtained on the training and test sets suggest that the networks probably learn too much information from a specific acoustic environment.

V. CONCLUSIONS

In this study, a feature-based dereverberation method was proposed and investigated for realistic hands-free voice-controlled devices.
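As an aside, the utterance-level cepstral mean normalization (CMN) that enlarged the test-set gain in Table VI is a one-line operation: subtracting each coefficient's mean computed over the utterance, which removes a stationary convolutive channel offset in the cepstral domain. A minimal sketch with a made-up feature matrix:

```python
import numpy as np

def utterance_cmn(features):
    """Utterance-level CMN: subtract the per-coefficient mean computed
    over all frames of the utterance (features: frames x coefficients)."""
    return features - features.mean(axis=0, keepdims=True)

# Made-up cepstral features with a constant channel offset of 3.0.
rng = np.random.default_rng(2)
feats = rng.normal(loc=3.0, size=(50, 13))
normalized = utterance_cmn(feats)
# Every coefficient now has (numerically) zero mean over the utterance.
```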
The basic idea is to use bidirectional long short-term memory (BLSTM) neural networks for channel mapping from the distant-talk cepstral feature space to its close-talk counterpart. In such an application scenario, the speech signal at each frame impacts the subsequent frames over the long term. This consequently requires a learning algorithm that can not only access long-term context information but also make use of future information. The bidirectional structure (past and future context) of LSTM neural networks is capable of dealing with both requirements. The experimental results on a French corpus show a word error rate reduction (WERR) of more than 16% for ASR, which significantly outperforms conventional Multilayer Perceptron (MLP) networks (one-sided z-test, p < 0.001) and bidirectional recurrent neural networks (BRNNs) (one-sided z-test, p < 0.05). The effectiveness of our feature mapping method is further confirmed by integrating the widely used adaptation techniques maximum likelihood linear regression (MLLR) and/or constrained MLLR (CMLLR), which yields the best performance of about 20% WERR. It is also confirmed in the inter-room evaluation scenario, where evaluation sets with mismatched acoustic environments also gain from channel mapping with BLSTM. This study also presents an indirect way of channel mapping: the differential feature vectors (between the distant-talk speech and the close-talk speech) serve as network targets, and the estimated differential feature vectors are then added to the original distant-talk features. The results of a large number of experiments show that this indirect mapping strategy can compete with the previously used direct mapping strategy, particularly in cases such as simple network structures and mismatched evaluation sets. All these cases are highly relevant for real-life applications. Due to the gap in gain between matched and mismatched evaluation cases, future work will focus on further exploiting joint acoustic information across different rooms with the goal of blind dereverberation.
One way to achieve this is to train the networks on a vast amount of reverberant speech collected in a variety of rooms. Further, one can add generalization terms such as weight decay to the objective function. In addition, it also seems beneficial to develop a way of selecting predefined mapping models for different room categories, in order to ultimately exploit the advantages of room-specific models.

REFERENCES
[1] K. Fujita, H. Kuwano, T. Tsuzuki, Y. Ono, and T. Ishihara, "A new digital TV interface employing speech recognition," IEEE Trans. Consum. Electron., vol. 49, no. 3.
[2] T. Giannakopoulos, N.-A. Tatlas, T. Ganchev, and I. Potamitis, "A practical, real-time speech-driven home automation front-end," IEEE Trans. Consum. Electron., vol. 51, no. 2.
[3] P. Ding, L. He, X. Yan, R. Zhao, and J. Hao, "Robust Mandarin speech recognition in car environments for embedded navigation system," IEEE Trans. Consum. Electron., vol. 54, no. 2.
[4] T. Virtanen, R. Singh, and B. Raj, Techniques for Noise Robustness in Automatic Speech Recognition. New York, NY: John Wiley & Sons.
[5] P. A. Naylor and N. D. Gaubitch, Speech Dereverberation. Berlin: Springer.
[6] M. Miyoshi and Y. Kaneda, "Inverse filtering of room acoustics," IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 2.
[7] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Process., vol. 29, no. 2.
[8] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4.
[9] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2.
[10] C. J. Leggetter and P. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, no. 2, 1995.

[11] M. J. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, no. 2.
[12] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, and T. N. Sainath, "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6.
[13] T. Yoshioka, X. Chen, and M. J. Gales, "Impact of single-microphone dereverberation on DNN-based meeting transcription systems," in Proc. International Conference on Acoustics, Speech and Signal Processing, Florence, Italy, 2014.
[14] M. Seltzer, D. Yu, and Y. Wang, "An investigation of deep neural networks for noise robust speech recognition," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[15] W. Li, J. Dines, and M. Magimai-Doss, "Robust overlapping speech recognition based on neural networks," Martigny, Switzerland, Tech. Rep. IDIAP-RR.
[16] A. L. Maas, T. M. O'Neil, A. Y. Hannun, and A. Y. Ng, "Recurrent neural network feature enhancement: The 2nd CHiME challenge," in Proc. CHiME Workshop, Vancouver, Canada, 2013.
[17] E. Vincent, J. Barker, S. Watanabe, J. Le Roux, F. Nesta, and M. Matassoni, "The second CHiME speech separation and recognition challenge: Datasets, tasks, baselines," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8.
[19] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for unconstrained handwriting recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5.
[20] C. Plahl, M. Kozielski, R. Schlüter, and H. Ney, "Feature combination and stacking of recurrent and non-recurrent neural networks for LVCSR," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[21] M. Wöllmer, C. Blaschke, T. Schindl, B. Schuller, B. Farber, S. Mayer, and B. Trefflich, "Online driver distraction detection using long short-term memory," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2.
[22] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: The difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. New York, NY: IEEE Press, 2001.
[23] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint.
[24] M. Wöllmer, Z. Zhang, F. Weninger, B. Schuller, and G. Rigoll, "Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly non-stationary noise," in Proc. International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, 2013.
[25] F. Weninger, J. Geiger, M. Wöllmer, B. Schuller, and G. Rigoll, "The Munich feature enhancement approach to the 2nd CHiME challenge using BLSTM recurrent neural networks," in Proc. CHiME Workshop, Vancouver, Canada, 2013.
[26] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, and W. Kellermann, "Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition," IEEE Signal Process. Mag., vol. 29, no. 6.
[27] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in Proc. International Conference on Acoustics, Speech and Signal Processing, Kyoto, Japan, 2012.
[28] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Trans. Signal Process., vol. 45, no. 11.

BIOGRAPHIES

Zixing Zhang received his master's degree (2010) in telecommunications from Beijing University of Posts and Telecommunications, Beijing, China. He is currently pursuing his Ph.D. degree as a researcher in the Machine Intelligence & Signal Processing (MISP) Group at the Institute for MMK at Technische Universität München (TUM) in Munich, Germany. His current research focuses on efficient machine learning algorithms for robust automatic speech recognition and computational paralinguistics.

Joel Pinto is a Research Manager at Nuance Communications in Aachen, Germany. He holds a Ph.D. from the École Polytechnique Fédérale de Lausanne, Switzerland (2010) and a Master in Engineering degree from the Indian Institute of Science, India (2003), both in Electrical Engineering. During his doctoral studies, he was with the Idiap Research Institute, Switzerland, working on neural network based acoustic modeling for automatic speech recognition. He was also with Hewlett-Packard Labs India, working in the area of speech and language technology.

Christian Plahl received the diploma degree in computer science from the University of Bielefeld, Bielefeld, Germany, in 2005 and the Ph.D. degree in Computer Science from RWTH Aachen University, Aachen, Germany. Since 2013 he has been working as a Research Scientist at Nuance Communications, Inc., Aachen, Germany. His research interests cover speech recognition, signal analysis, deep learning, and artificial neural networks.

Björn Schuller (M'05) received his diploma in 1999, his doctoral degree in 2006, and his habilitation in 2012, all in electrical engineering and information technology from TUM. He has been the tenured head of the MISP Group at TUM since 2006, and is a Senior Lecturer at Imperial College London's Department of Computing in the UK, CEO of audEERING UG (limited), Visiting Professor of HIT in Harbin, China, and an Associate of CISA in Geneva, Switzerland, and of Joanneum Research in Graz, Austria. He is best known for his work advancing machine intelligence for speech analysis.
He is the president of the Association for the Advancement of Affective Computing (AAAC), and a member of the IEEE and its Speech and Language Processing Technical Committee (SLTC), the ACM, and ISCA. He has (co-)authored five books and more than 400 publications in the field, leading to more than 6000 citations; his current h-index equals 39.

Daniel Willett received his diploma in Computer Science from the Technical University in Darmstadt in 1994 and his Ph.D. in Electrical Engineering from Duisburg University in 2000, both in Germany. He was a postdoc at NTT Communication Science Laboratories in Kyoto, Japan, from 2000 to 2002. In 2002, he joined Harman-Becker in Ulm, Germany, where he worked on in-car speech recognition until 2006, when he joined Nuance Communications in Aachen, Germany. There he has since been working on large vocabulary automatic speech recognition and nowadays runs a research team that focuses on cloud-based speech recognition. With nearly 20 years in the speech recognition area, he has contributed numerous publications to the field as well as innovations to the real-world speech recognition systems widely deployed today.


More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,

have to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words, A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994

More information

Assignment 1: Predicting Amazon Review Ratings

Assignment 1: Predicting Amazon Review Ratings Assignment 1: Predicting Amazon Review Ratings 1 Dataset Analysis Richard Park r2park@acsmail.ucsd.edu February 23, 2015 The dataset selected for this assignment comes from the set of Amazon reviews for

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X

The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,

More information

Deploying Agile Practices in Organizations: A Case Study

Deploying Agile Practices in Organizations: A Case Study Copyright: EuroSPI 2005, Will be presented at 9-11 November, Budapest, Hungary Deploying Agile Practices in Organizations: A Case Study Minna Pikkarainen 1, Outi Salo 1, and Jari Still 2 1 VTT Technical

More information

Reduce the Failure Rate of the Screwing Process with Six Sigma Approach

Reduce the Failure Rate of the Screwing Process with Six Sigma Approach Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Reduce the Failure Rate of the Screwing Process with Six Sigma Approach

More information

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of Technology Lincoln Laboratory {es,dar}@ll.mit.edu ABSTRACT

More information

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks

Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com

More information

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur

Module 12. Machine Learning. Version 2 CSE IIT, Kharagpur Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should

More information

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS

OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS

LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS LOW-RANK AND SPARSE SOFT TARGETS TO LEARN BETTER DNN ACOUSTIC MODELS Pranay Dighe Afsaneh Asaei Hervé Bourlard Idiap Research Institute, Martigny, Switzerland École Polytechnique Fédérale de Lausanne (EPFL),

More information

Transfer Learning Action Models by Measuring the Similarity of Different Domains

Transfer Learning Action Models by Measuring the Similarity of Different Domains Transfer Learning Action Models by Measuring the Similarity of Different Domains Hankui Zhuo 1, Qiang Yang 2, and Lei Li 1 1 Software Research Institute, Sun Yat-sen University, Guangzhou, China. zhuohank@gmail.com,lnslilei@mail.sysu.edu.cn

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

Model Ensemble for Click Prediction in Bing Search Ads

Model Ensemble for Click Prediction in Bing Search Ads Model Ensemble for Click Prediction in Bing Search Ads Xiaoliang Ling Microsoft Bing xiaoling@microsoft.com Hucheng Zhou Microsoft Research huzho@microsoft.com Weiwei Deng Microsoft Bing dedeng@microsoft.com

More information

Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang 1, Xin Lei 1, Wen Wang 2, Takahiro Shinozaki 1 1 Univ. of Washington, Dept. of Electrical Engineering, Seattle, WA 98195 USA 2

More information

Speaker Identification by Comparison of Smart Methods. Abstract

Speaker Identification by Comparison of Smart Methods. Abstract Journal of mathematics and computer science 10 (2014), 61-71 Speaker Identification by Comparison of Smart Methods Ali Mahdavi Meimand Amin Asadi Majid Mohamadi Department of Electrical Department of Computer

More information

First Grade Standards

First Grade Standards These are the standards for what is taught throughout the year in First Grade. It is the expectation that these skills will be reinforced after they have been taught. Mathematical Practice Standards Taught

More information

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention Damien Teney 1, Peter Anderson 2*, David Golub 4*, Po-Sen Huang 3, Lei Zhang 3, Xiaodong He 3, Anton van den Hengel 1 1

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen TRANSFER LEARNING OF WEAKLY LABELLED AUDIO Aleksandr Diment, Tuomas Virtanen Tampere University of Technology Laboratory of Signal Processing Korkeakoulunkatu 1, 33720, Tampere, Finland firstname.lastname@tut.fi

More information

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH

CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department

More information

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology

ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon

More information

Affective Classification of Generic Audio Clips using Regression Models

Affective Classification of Generic Audio Clips using Regression Models Affective Classification of Generic Audio Clips using Regression Models Nikolaos Malandrakis 1, Shiva Sundaram, Alexandros Potamianos 3 1 Signal Analysis and Interpretation Laboratory (SAIL), USC, Los

More information

ACTIVITY: Comparing Combination Locks

ACTIVITY: Comparing Combination Locks 5.4 Compound Events outcomes of one or more events? ow can you find the number of possible ACIVIY: Comparing Combination Locks Work with a partner. You are buying a combination lock. You have three choices.

More information

The Evolution of Random Phenomena

The Evolution of Random Phenomena The Evolution of Random Phenomena A Look at Markov Chains Glen Wang glenw@uchicago.edu Splash! Chicago: Winter Cascade 2012 Lecture 1: What is Randomness? What is randomness? Can you think of some examples

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Australian Journal of Basic and Applied Sciences

Australian Journal of Basic and Applied Sciences AENSI Journals Australian Journal of Basic and Applied Sciences ISSN:1991-8178 Journal home page: www.ajbasweb.com Feature Selection Technique Using Principal Component Analysis For Improving Fuzzy C-Mean

More information

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008

The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 The NICT/ATR speech synthesis system for the Blizzard Challenge 2008 Ranniery Maia 1,2, Jinfu Ni 1,2, Shinsuke Sakai 1,2, Tomoki Toda 1,3, Keiichi Tokuda 1,4 Tohru Shimizu 1,2, Satoshi Nakamura 1,2 1 National

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

A Neural Network GUI Tested on Text-To-Phoneme Mapping

A Neural Network GUI Tested on Text-To-Phoneme Mapping A Neural Network GUI Tested on Text-To-Phoneme Mapping MAARTEN TROMPPER Universiteit Utrecht m.f.a.trompper@students.uu.nl Abstract Text-to-phoneme (T2P) mapping is a necessary step in any speech synthesis

More information

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language

A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language A Comparison of DHMM and DTW for Isolated Digits Recognition System of Arabic Language Z.HACHKAR 1,3, A. FARCHI 2, B.MOUNIR 1, J. EL ABBADI 3 1 Ecole Supérieure de Technologie, Safi, Morocco. zhachkar2000@yahoo.fr.

More information

Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках

Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках Глубокие рекуррентные нейронные сети для аспектно-ориентированного анализа тональности отзывов пользователей на различных языках Тарасов Д. С. (dtarasov3@gmail.com) Интернет-портал reviewdot.ru, Казань,

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription

Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Analysis of Speech Recognition Models for Real Time Captioning and Post Lecture Transcription Wilny Wilson.P M.Tech Computer Science Student Thejus Engineering College Thrissur, India. Sindhu.S Computer

More information

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present

More information

Evolutive Neural Net Fuzzy Filtering: Basic Description

Evolutive Neural Net Fuzzy Filtering: Basic Description Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

SARDNET: A Self-Organizing Feature Map for Sequences

SARDNET: A Self-Organizing Feature Map for Sequences SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu

More information

On-Line Data Analytics

On-Line Data Analytics International Journal of Computer Applications in Engineering Sciences [VOL I, ISSUE III, SEPTEMBER 2011] [ISSN: 2231-4946] On-Line Data Analytics Yugandhar Vemulapalli #, Devarapalli Raghu *, Raja Jacob

More information