Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training


Yanmin Qian, Member, IEEE, Xuankai Chang, Student Member, IEEE, and Dong Yu, Senior Member, IEEE

arXiv v1 [cs.sd] 19 Jul 2017

(Yanmin Qian and Xuankai Chang are with the Computer Science and Engineering Department, Shanghai Jiao Tong University, Shanghai, P. R. China ({yanminqian,xuank}@sjtu.edu.cn). Dong Yu is with Tencent AI Lab, Seattle, USA (dyu@tencent.com).)

Abstract: Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module with the minimum mean square error (MSE) criterion and a back-end recognition module with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that our proposed architectures can cut the word error rate (WER) by 45.0% and 25.0% relative against the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous large vocabulary continuous speech task.

Keywords: permutation invariant training, multi-talker mixed speech recognition, feature separation, joint optimization

I. INTRODUCTION

Thanks to the significant progress made in recent years [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], ASR systems have now surpassed the threshold for adoption in many real-world scenarios and have enabled services such as Microsoft Cortana, Apple's Siri and Google Now, where close-talk microphones are commonly used. However, current ASR systems still perform poorly when far-field microphones are used, because many difficulties hidden by close-talk microphones surface under distant recognition scenarios. For example, the signal-to-noise ratio (SNR) between the target speaker and the interfering noises is much lower than when close-talk microphones are used. As a result, interfering signals such as background noise, reverberation, and speech from other talkers become so prominent that they can no longer be ignored.

In this paper, we aim at solving the speech recognition problem when multiple talkers speak at the same time and only a single channel of mixed speech is available. Many attempts have been made to attack this problem. Before the deep learning era, the most famous and effective model was the factorial GMM-HMM [21], which outperformed humans in the 2006 monaural speech separation and recognition challenge [22]. The factorial GMM-HMM, however, requires the test speakers to be seen during training so that the interactions between them can be properly modeled.
Recently, several deep learning based techniques have been proposed to solve this problem [19], [20], [23], [24], [25], [26]. The core issue that these techniques try to address is the label ambiguity or permutation problem (refer to Section III for details). In Weng et al. [23] a deep learning model was developed to recognize the mixed speech directly. To solve the label ambiguity problem, Weng et al. assigned the senone labels of the talker with the higher instantaneous energy to output one and those of the other talker to output two. Although this addresses the label ambiguity problem, it causes frequent speaker switching across frames. To deal with the speaker switching problem, a two-speaker joint decoder with a speaker switching penalty was used to trace speakers. This approach has two limitations. First, energy, which is manually picked, may not be the best information for assigning labels under all conditions. Second, the frame switching problem introduces a burden to the decoder.

In Hershey et al. [24], [25] the multi-talker mixed speech is first separated into multiple streams. An ASR engine is then applied to these streams independently to recognize the speech. To separate the speech streams, they proposed a technique called deep clustering (DPCL). They assume that each time-frequency bin belongs to only one speaker and can be mapped into a shared embedding space. The model is optimized so that, in the embedding space, time-frequency bins belonging to the same speaker are closer and those of different speakers are farther away. During evaluation, a clustering algorithm is first applied to the embeddings to generate a partition of the time-frequency bins; separated audio streams are then reconstructed based on that partition. In this approach, speech separation and recognition are usually two separate components. Chen et al. [26] proposed a similar technique called deep attractor network (DANet). Following DPCL, their approach also learns a high-dimensional embedding of the acoustic signals. Different from DPCL, however, it creates cluster centers, called attractor points, in the embedding space to pull together the time-frequency bins corresponding to the same source. The main limitation of DANet is the requirement to estimate the attractor points at evaluation time and to form the frequency-bin clusters based on these points.

In Yu et al. [19] and Kolbæk et al. [20], a simpler yet equally effective technique named permutation invariant training (PIT)¹ was proposed to attack the speaker-independent multi-talker speech separation problem. In PIT, the source targets are treated as a set (i.e., order is irrelevant). During training, PIT first determines the output-target assignment with the minimum error at the utterance level based on the forward-pass result. It then minimizes the error given that assignment. This strategy elegantly solves the label permutation problem. However, in these original works PIT was used to separate speech streams from mixed speech. For this reason, a frequency-bin mask was first estimated and then used to reconstruct each stream. The minimum mean square error (MMSE) between the true and reconstructed speech streams was used as the criterion to optimize the model parameters.

¹In [24], a similar permutation-free technique, which is equivalent to PIT when there are exactly two speakers, was evaluated with negative results and conclusion.

Moreover, most previous work on multi-talker speech still focuses on speech separation [19], [20], [24], [25], [26]. In contrast, multi-talker speech recognition is much harder and the related work is scarcer. There have been some attempts, but the related tasks are relatively simple. For example, the 2006 monaural speech separation and recognition challenge [21], [22], [23], [27], [28] was defined on a speaker-dependent, small vocabulary, constrained language model setup, while in [25] a small vocabulary read-style corpus was used. We are not aware of any extensive research work on the more realistic, speaker-independent, spontaneous large vocabulary continuous speech recognition (LVCSR) of multi-talker mixed speech before our work.

In this paper, we attack the multi-talker mixed speech recognition problem with a focus on the speaker-independent setup given just a single channel of the mixed speech. Different from [19], [20], here we extend and redefine PIT over log filter bank features and/or senone posteriors. In some architectures PIT is defined upon the minimum mean square error (MSE) between the true and estimated individual speaker features, to separate speech at the feature level (called PIT-MSE from now on). In other architectures, PIT is defined upon the cross entropy (CE) between the true and estimated senone posterior probabilities, to recognize multiple streams of speech directly (called PIT-CE from now on). Moreover, the PIT-MSE based front-end feature separation can be combined with the PIT-CE based back-end recognition in a joint optimization architecture. We evaluate our architectures on artificially generated AMI data with both two- and three-talker mixed speech. The experimental results demonstrate that our proposed architectures are very promising.

The rest of the paper is organized as follows. In Section II we describe the speaker-independent multi-talker mixed speech recognition problem. In Section III we propose several PIT-based architectures to recognize multiple streams of speech. We report experimental results in Section IV and conclude the paper in Section V.

II. SINGLE-CHANNEL MULTI-TALKER SPEECH RECOGNITION

In this paper, we assume that a linearly mixed single-microphone signal y[n] = Σ_{s=1}^{S} x_s[n] is known, where x_s[n], s = 1, ..., S, are S streams of speech sources from different speakers. Our goal is to separate these streams and recognize every one of them. In other words, the model needs to generate S output streams, one for each source, at every time step.
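For concreteness, the observation model above can be written out in a few lines of NumPy (an illustrative sketch of our own; the sample rate and the random stand-in signals are assumptions, not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 16000                            # e.g., one second at 16 kHz (assumed)
    x1 = rng.standard_normal(n)          # source stream of speaker 1
    x2 = rng.standard_normal(n)          # source stream of speaker 2

    # Linearly mixed single-microphone signal: y[n] = sum_s x_s[n].
    y = x1 + x2

    # The mixture is symmetric in its sources (x1 + x2 == x2 + x1), so the
    # mixture itself carries no clue about which source should be "output one".
    assert np.allclose(x1 + x2, x2 + x1)

This symmetry is exactly what creates the label ambiguity discussed next.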
However, given only the mixed speech y[n], the problem of recognizing all streams is under-determined, because there are an infinite number of possible x_s[n] combinations (and thus recognition results) that lead to the same y[n]. Fortunately, speech is not a random signal. It has patterns that we may learn from a training set of pairs y and l_s, s = 1, ..., S, where l_s is the senone label sequence for stream s. In the single-speaker case, i.e., S = 1, the learning problem is significantly simplified because there is only one possible recognition result, and thus it can be cast as a simple supervised optimization problem. Given the input to the model, which is some feature representation of y, the output is simply the senone posterior probability conditioned on the input. As in most classification problems, the model can be optimized by minimizing the cross entropy between the senone label and the estimated posterior probability.

When S is greater than 1, however, the problem is no longer as simple and direct as in the single-talker case, and the label ambiguity or permutation becomes a problem during training. In the case of two speakers, because the speech sources are symmetric given the mixture (i.e., x_1 + x_2 equals x_2 + x_1, and both x_1 and x_2 have the same characteristics), there is no predetermined way to assign the correct target to the corresponding output layer. Interested readers can find additional information in [19], [20] on how training leads nowhere when the conventional supervised approach is used for multi-talker speech separation.

III. PERMUTATION INVARIANT TRAINING FOR MULTI-TALKER SPEECH RECOGNITION

To address the label ambiguity problem, we propose several architectures based on permutation invariant training (PIT) [19], [20] for multi-talker mixed speech recognition. For simplicity and without loss of generality, we always assume there are two talkers in the mixed speech when describing our architectures in this section. Note that DPCL [24], [25] and DANet [26] are alternative solutions to the label ambiguity problem when the goal is speech source separation. However, these two techniques cannot be easily applied to the direct recognition (i.e., without first separating the speech) of multiple streams of speech, because of the clustering step required during separation and the assumption that each time-frequency bin belongs to only one speaker (which is false when the CE criterion is used).

A. Feature Separation with Direct Supervision

To recognize the multi-talker mixed speech, one straightforward approach is to estimate the features of each speech source given the mixed speech features and then recognize them one by one using a normal single-talker LVCSR system.

This idea is depicted in Figure 1, where we learn a model to recover the filter bank (FBANK) features from the mixed FBANK features and then feed each stream of the recovered FBANK features to a conventional LVCSR system for recognition.

Fig. 1: Feature separation architectures for multi-talker mixed speech recognition. (a) Arch#1: feature separation with the fixed reference assignment. (b) Arch#2: feature separation with permutation invariant training.

In the simplest architecture, denoted as Arch#1 and illustrated in Figure 1(a), feature separation can be considered as a multi-class regression problem, similar to many previous works [29], [30], [31], [32], [33], [34]. In this architecture, Y, the features of the mixed speech, are used as the input to some deep learning model, such as a deep neural network (DNN), a convolutional neural network (CNN), or a long short-term memory (LSTM) recurrent neural network (RNN), to estimate a feature representation for each individual talker. If we use the bidirectional LSTM-RNN model, the model will compute

    H_0 = Y                                           (1)
    H_i^f = RNN_i^f(H_{i-1}),  i = 1, ..., N          (2)
    H_i^b = RNN_i^b(H_{i-1}),  i = 1, ..., N          (3)
    H_i = Stack(H_i^f, H_i^b),  i = 1, ..., N         (4)
    X̂_s = Linear_s(H_N),  s = 1, ..., S               (5)

where H_0 is the input, N is the number of hidden layers, H_i is the i-th hidden layer, RNN_i^f and RNN_i^b are the forward and backward RNNs at hidden layer i, respectively, and X̂_s, s = 1, ..., S, are the separated features estimated from the output layer segment for each speech stream s.

During training, we need to provide the correct reference (or target) features X_s, s = 1, ..., S, of all speakers in the mixed speech to the corresponding output layers for supervision. The model parameters can be optimized to minimize the mean square error (MSE) between the estimated separated features X̂_s and the original reference features X_s,

    J = (1/S) Σ_{s=1}^{S} ||X_s - X̂_s||²              (6)

where S is the number of mixed speakers. In this architecture, it is assumed that the reference features are organized in a given order and assigned to the output layer segments accordingly. Once trained, this feature separation module can be used as the front-end to process the mixed speech. The separated feature streams are then fed into a normal single-speaker LVCSR system for decoding.

B. Feature Separation with Permutation Invariant Training

The architecture depicted in Figure 1(a) is easy to implement but has obvious drawbacks. Since the model has multiple output layer segments (one for each stream) and they all depend on the same input mixture, assigning the references is actually difficult. The fixed reference order used in this architecture is not quite right, since the source speech streams are symmetric and there is no clear clue on how to order them in advance. This is referred to as the label ambiguity (or label permutation) problem in [19], [23], [24]. As a result, this architecture may work well in the speaker-dependent setup, where the target speaker is known (and thus can be assigned to a specific output segment) during training, but it cannot generalize well to the speaker-independent case.

The label ambiguity problem in multi-talker mixed speech recognition was addressed with limited success in [23], where Weng et al. assigned reference features depending on the energy level of each speech source.

In the architecture illustrated in Figure 1(b), named Arch#2, permutation invariant training (PIT) [19], [20] is utilized to estimate the individual feature streams. In this architecture, the reference feature sources are given as a set instead of an ordered list. The output-reference assignment is determined dynamically based on the current model. More specifically, PIT first computes the MSE for each possible assignment between the references X_s and the estimated sources X̂_s, and picks the one with the minimum MSE. In other words, the training criterion is

    J = (1/S) min_{s' ∈ permu(S)} Σ_{s=1}^{S} ||X_{s'(s)} - X̂_s||²     (7)

where permu(S) denotes the set of permutations of 1, ..., S, and s'(s) is the reference index that permutation s' assigns to output s. We note two important ingredients in this objective function. First, it automatically finds the appropriate assignment no matter how the labels are ordered. Second, the MSE is computed over the whole sequence for each assignment. This forces all the frames of the same speaker to be aligned with the same output segment, which can be regarded as performing feature-level tracing implicitly. With this new objective function, we can simultaneously perform label assignment and error evaluation at the feature level. It is expected that the feature streams separated with PIT (Figure 1(b)) have higher quality than those separated with a fixed reference order (Figure 1(a)). As a result, the recognition errors on these feature streams should also be lower. Note that the computational cost associated with the permutation is negligible compared to the network forward computation during training, and no permutation (and thus no extra cost) is needed during evaluation.

C. Direct Multi-Talker Mixed Speech Recognition with PIT

In the previous two architectures, mixed speech features are first separated explicitly and then recognized independently with a conventional single-talker LVCSR system. Since the feature separation is not perfect, there is a mismatch between the separated features and the normal features used to train the conventional LVCSR system. In addition, the objective of minimizing the MSE between the estimated and reference features is not directly related to recognition performance. In this section, we propose an end-to-end architecture that directly recognizes the mixed speech of multiple speakers. In this architecture, denoted as Arch#3, we apply PIT to the CE between the reference and estimated senone posterior probability distributions, as shown in Figure 2(a). Given some feature representation Y of the mixed speech y, this model will compute

    H_0 = Y                                           (8)
    H_i^f = RNN_i^f(H_{i-1}),  i = 1, ..., N          (9)
    H_i^b = RNN_i^b(H_{i-1}),  i = 1, ..., N          (10)
    H_i = Stack(H_i^f, H_i^b),  i = 1, ..., N         (11)
    H_o^s = Linear_s(H_N),  s = 1, ..., S             (12)
    O_s = Softmax(H_o^s),  s = 1, ..., S              (13)

using a deep bidirectional RNN, where Equations (8)-(11) are the same as Equations (1)-(4). H_o^s, s = 1, ..., S, is the excitation at the output layer for each speech stream s, and O_s, s = 1, ..., S, is the output segment for stream s. Different from the architectures discussed in the previous sections, in this architecture each output segment represents the estimated senone posterior probability for a speech stream. No additional feature separation, clustering or speaker tracing is needed. Although various neural network structures can be used, in this study we focus on bidirectional LSTM-RNNs.

In this direct multi-talker mixed speech recognition architecture, we minimize the objective function

    J = (1/S) min_{s' ∈ permu(S)} Σ_{s=1}^{S} CE(l_{s'(s)}, O_s)       (14)

In other words, we minimize the minimum average CE over every possible output-label assignment. All the frames of the same speaker are forced to be aligned with the same output segment by computing the CE over the whole sequence for each assignment.
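A minimal NumPy sketch of this utterance-level PIT criterion, covering both the PIT-MSE of Equation (7) and the PIT-CE of Equation (14), is given below (an illustration under our own naming and shape conventions; the paper's actual implementation is in CNTK):

    import itertools
    import numpy as np

    def pit_loss(refs, ests, stream_loss):
        """Utterance-level PIT: the loss of the best reference permutation.

        refs, ests: length-S sequences of per-stream references/estimates.
        stream_loss: maps (reference stream, estimated stream) -> scalar,
        accumulated over the WHOLE utterance before the min is taken.
        """
        S = len(refs)
        best_loss, best_perm = float("inf"), None
        for perm in itertools.permutations(range(S)):
            loss = np.mean([stream_loss(refs[p], ests[s])
                            for s, p in enumerate(perm)])
            if loss < best_loss:
                best_loss, best_perm = loss, perm
        return best_loss, best_perm

    def mse(ref, est):                     # PIT-MSE, Eq. (7)
        return np.mean((ref - est) ** 2)

    def ce(labels, posteriors):            # PIT-CE, Eq. (14)
        # labels: (T,) senone ids; posteriors: (T, D) softmax outputs.
        idx = np.arange(len(labels))
        return -np.mean(np.log(posteriors[idx, labels] + 1e-10))

    # Toy check with S=2 streams, T=5 frames, D=3 dims/senones.
    rng = np.random.default_rng(0)
    print(pit_loss(rng.standard_normal((2, 5, 3)),
                   rng.standard_normal((2, 5, 3)), mse))
    print(pit_loss(rng.integers(0, 3, size=(2, 5)),
                   rng.dirichlet(np.ones(3), size=(2, 5)), ce))

Because the per-stream losses are accumulated over the whole utterance before the minimum is taken, all frames of a speaker stay tied to one output segment, which is the implicit tracing behavior noted above.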
This strategy allows direct multi-talker mixed speech recognition without explicit separation. It is a simpler and more compact architecture for multi-talker speech recognition.

D. Joint Optimization of PIT-based Feature Separation and Recognition

As mentioned above, the main drawback of the feature separation architectures is the mismatch between the distorted separation results and the features used to train the single-talker LVCSR system. The direct multi-talker mixed speech recognition with PIT, which bypasses the feature separation step, is one solution to this problem. Here we propose another architecture, named joint optimization of PIT-based feature separation and recognition, which is denoted as Arch#4 and shown in Figure 2(b). This architecture contains two PIT components: the front-end feature separation module with PIT-MSE and the back-end recognition module with PIT-CE. Different from the architecture in Figure 1(b), in this architecture a new LVCSR system is trained upon the output of the feature separation module with PIT-CE. The whole model is trained progressively: the front-end feature separation module is first optimized with PIT-MSE; then the parameters in the back-end recognition module are optimized with PIT-CE while keeping the parameters in the feature separation module fixed; finally, the parameters of both modules are jointly refined with PIT-CE using a small learning rate. Note that the reference assignment in the recognition (PIT-CE) step is the same as that in the separation (PIT-MSE) step.

    J_1 = (1/S) min_{s' ∈ permu(S)} Σ_{s=1}^{S} ||X_{s'(s)} - X̂_s||²     (15)

    J_2 = (1/S) min_{s' ∈ permu(S)} Σ_{s=1}^{S} CE(l_{s'(s)}, O_s)        (16)
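The progressive schedule can be outlined in a PyTorch-style sketch (our illustration only; the paper's implementation used CNTK, and the module sizes, optimizer choice, and learning rates here are assumptions):

    import torch
    import torch.nn as nn

    class BLSTMFrontEnd(nn.Module):
        """Eqs. (1)-(5): shared BLSTM trunk with one linear head per stream."""
        def __init__(self, feat_dim=40, hidden=768, layers=3, n_spk=2):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                 bidirectional=True, batch_first=True)
            self.heads = nn.ModuleList(
                [nn.Linear(2 * hidden, feat_dim) for _ in range(n_spk)])

        def forward(self, y):                         # y: (B, T, feat_dim)
            h, _ = self.blstm(y)
            return [head(h) for head in self.heads]   # S feature streams

    class BLSTMBackEnd(nn.Module):
        """Recognition module producing per-frame senone logits; applying it
        to each separated stream is our assumption about the architecture."""
        def __init__(self, feat_dim=40, hidden=768, layers=3, n_senones=4000):
            super().__init__()
            self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                                 bidirectional=True, batch_first=True)
            self.out = nn.Linear(2 * hidden, n_senones)

        def forward(self, x):
            h, _ = self.blstm(x)
            return self.out(h)

    front, back = BLSTMFrontEnd(), BLSTMBackEnd()

    # Stage 1: optimize the front-end alone with PIT-MSE (Eq. 15).
    opt1 = torch.optim.Adam(front.parameters(), lr=1e-4)

    # Stage 2: freeze the front-end, optimize the back-end with PIT-CE (Eq. 16).
    for p in front.parameters():
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(back.parameters(), lr=1e-4)

    # Stage 3: unfreeze everything and jointly refine with a small learning rate.
    for p in front.parameters():
        p.requires_grad_(True)
    opt3 = torch.optim.Adam(list(front.parameters()) + list(back.parameters()),
                            lr=1e-5)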

Fig. 2: Advanced architectures for multi-talker mixed speech recognition. (a) Arch#3: direct multi-talker mixed speech recognition with PIT. (b) Arch#4: joint optimization of PIT-based feature separation and recognition.

During decoding, the mixed speech features are fed into the architecture, and the final posterior streams are used for decoding as normal.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed architectures, we conducted a series of experiments on artificially generated two- and three-talker mixed speech datasets based on the AMI corpus [35]. There are four reasons for us to use AMI: 1) AMI is a speaker-independent spontaneous LVCSR corpus. Compared to the small vocabulary, speaker-dependent, read English datasets used in most previous studies [22], [23], [27], [28], observations made and conclusions drawn on AMI are more likely to generalize to other real-world scenarios. 2) AMI is a genuinely hard task with different kinds of noises, truly spontaneous meeting-style speech, and strong accents. It reflects the true ability of LVCSR when the training set size is around 100 hours. The state-of-the-art word error rate (WER) on AMI is around 25.0% for the close-talk condition [36] and more than 45.0% for the far-field condition with a single microphone [36], [37]. These WERs are much higher than those on other corpora, such as Switchboard [38], on which the WER is now below 10.0% [18], [36], [39], [40]. 3) Although the close-talk data (AMI IHM) was used to generate the mixed speech in this work, the existence of parallel far-field data (AMI SDM/MDM) allows us to evaluate our architectures on far-field data in the future. 4) AMI is a public corpus, so using AMI allows interested readers to reproduce our results more easily.

The AMI IHM (close-talk) dataset contains about 80 hours and 8 hours of speech in the training and evaluation sets, respectively [35], [41]. Using AMI IHM, we generated a two-talker (IHM-2mix) and a three-talker (IHM-3mix) mixed speech dataset. To artificially synthesize IHM-2mix, we randomly select two speakers and then randomly select one utterance from each speaker to form a mixed-speech utterance. For ease of explanation, the high energy (High E) speaker in the mixed speech is always chosen as the target speaker, and the low energy (Low E) speaker is considered the interfering speaker. We synthesized mixed speech for five different SNR conditions (0dB, 5dB, 10dB, 15dB, 20dB) based on the energy ratio of the two talkers (sketched in code below). To eliminate easy cases we force the lengths of the selected source utterances to be comparable, so that at least half of the mixed speech contains overlapping speech. When the two source utterances have different lengths, the shorter one is padded with low-level noise at the front and end. The same procedure is used for preparing both the training and testing data. We generated in total 400 hours of two-talker mixed speech, 80 hours per SNR condition, as the training set. A subset of 80 hours from this 400-hour training set was used for fast model training and evaluation. For evaluation, a total of 40 hours of two-talker mixed speech, 8 hours per SNR condition, was generated and used.

The IHM-3mix dataset was generated similarly. The relative energy of the three speakers in each mixed utterance varies randomly in the training set. Different from the training set, all the speakers in the same mixed utterance have equal energy in the testing set. We generated in total 400 hours and 8 hours of three-talker mixed speech as the training and testing sets, respectively.

Figure 3 compares the spectrogram of a single-talker clean utterance and the corresponding 0dB two-talker mixed utterance in the IHM-2mix dataset.
Clearly, it is very hard to separate the mixed spectrogram and reconstruct the source utterances by visual inspection.
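The mixing procedure referenced above can be sketched in NumPy as follows (an illustrative reconstruction from the description in the text; the exact scaling convention and the padding noise level are our assumptions):

    import numpy as np

    def pad_with_noise(x, n, rng, scale=1e-4):
        """Pad x with n samples of low-level noise, split front and back."""
        noise = rng.standard_normal(n) * scale
        front = n // 2
        return np.concatenate([noise[:front], x, noise[front:]])

    def mix_at_snr(target, interf, snr_db, rng):
        """Mix so that 10*log10(E_target / E_interf) equals snr_db."""
        if len(target) > len(interf):
            interf = pad_with_noise(interf, len(target) - len(interf), rng)
        elif len(interf) > len(target):
            target = pad_with_noise(target, len(interf) - len(target), rng)
        e_t, e_i = np.sum(target ** 2), np.sum(interf ** 2)
        gain = np.sqrt(e_t / (e_i * 10.0 ** (snr_db / 10.0)))
        return target + gain * interf

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(16000), rng.standard_normal(12000)
    y = mix_at_snr(a, b, snr_db=0.0, rng=rng)   # a 0dB two-talker mixture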

Fig. 3: Spectrogram comparison between the original single-talker clean speech and the 0dB two-talker mixed speech in the IHM-2mix dataset.

A. Single-speaker Recognition Baseline

In this work, all the neural networks were built using the latest Microsoft Cognitive Toolkit (CNTK) [42], and the decoding systems were built based on Kaldi [43]. We first followed the officially released Kaldi recipe to build an LDA-MLLT-SAT GMM-HMM model. This model uses 39-dimensional MFCC features and has roughly 4K tied states and 80K Gaussians. We then used this acoustic model to generate the senone alignments for neural network training. We trained the DNN and BLSTM-RNN baseline systems with the original AMI IHM data. 80-dimensional log filter bank (LFBK) features with CMVN were used to train the baselines. The DNN has 6 hidden layers, each of which contains 2048 sigmoid neurons. The input feature for the DNN contains a window of 11 frames. The BLSTM-RNN has 3 bidirectional LSTM layers followed by the softmax layer. Each BLSTM layer has 512 memory cells. The input to the BLSTM-RNN is a single acoustic frame. All the models explored here are optimized with the cross-entropy criterion. The DNN is optimized using SGD with a minibatch size of 256, and the BLSTM-RNN is trained using SGD with 4 full-length utterances in each minibatch. For decoding, we used a 50K-word dictionary and a trigram language model interpolated from the ones created using the AMI transcripts and the Fisher English corpus.

The performance of these two baselines on the original single-speaker AMI corpus is presented in Table I. These results are comparable with those reported by others [41], even though we did not use adapted fMLLR features. It is noted that adding more BLSTM layers did not show meaningful WER reduction in the baseline.

TABLE I: WER (%) of the baseline systems on the original AMI IHM single-talker corpus

    Model    WER
    DNN
    BLSTM

To test the normal single-speaker model on the two-talker mixed speech, the baseline BLSTM-RNN model above is used to decode the mixed speech directly. During scoring we compare the decoding output (only one output) with the reference of each source utterance to obtain the WER for the corresponding source utterance. Table II summarizes the recognition results. It is clear from the table that the single-speaker model performs very poorly on the multi-talker mixed speech, as indicated by the huge WER degradation of the high energy speaker when the SNR decreases. Furthermore, in all conditions the WERs for the low energy speaker are above 100.0%. These results demonstrate the great challenge in multi-talker mixed speech recognition.

TABLE II: WER (%) of the baseline BLSTM-RNN single-speaker system on the IHM-2mix dataset

    SNR Condition    High E Spk    Low E Spk
    0dB
    5dB
    10dB
    15dB
    20dB

B. Evaluation of Two-talker Speech Recognition Architectures

The four proposed architectures for two-talker speech recognition are evaluated here. For the first two approaches (Arch#1 and Arch#2), which contain an explicit feature separation stage (one without and one with PIT-MSE), a 3-layer BLSTM is used in the feature separation module. The separated feature streams are fed into a normal 3-layer BLSTM LVCSR system, trained with single-talker speech, for decoding.

The whole system thus contains six BLSTM layers in total. For the other two approaches (Arch#3 and Arch#4), in which PIT-CE is used, 6-layer BLSTM models are used so that the number of parameters is comparable to the other two architectures. In all these architectures the input is the 40-dimensional LFBK feature and each layer contains 768 memory cells.

To train the latter two architectures, which exploit PIT-CE, we need to prepare the alignments for the mixed speech. The senone alignments for the two talkers in each mixed speech utterance are taken from the single-speaker baseline alignments. The alignment of the shorter utterance within the mixed speech is padded with the silence state at the front and the end. All the models were trained with a minibatch of 8 utterances. The gradient was clipped to guarantee training stability. To obtain the results reported in this section we used the 80-hour mixed speech training subset.

The recognition results on both speakers are evaluated. For scoring, we evaluate the two hypotheses, obtained from the two output sections, against the two references, and pick the assignment with the better WER to compute the final WER (a code sketch of this scoring is given below). The results for the 0dB SNR condition are shown in Table III. Compared to the 0dB condition in Table II, all the proposed multi-talker speech recognition architectures obtain an obvious improvement on both speakers. Of the two architectures with an explicit feature separation stage, the one with PIT-MSE is significantly better than the baseline feature separation architecture. These results confirm that the label permutation problem can be well alleviated by PIT-MSE at the feature level. We can also observe that applying PIT-CE in the recognition module (Arch#3 & Arch#4) can further reduce the WER by 10.0% absolute. This is because these two architectures significantly reduce the mismatch between the separated features and the features used to train the LVCSR model, and also because cross entropy is more directly related to recognition accuracy. Comparing Arch#3 and Arch#4, we can see that the architecture with joint optimization of PIT-based feature separation and recognition slightly outperforms the direct PIT-CE based model. Since Arch#3 and Arch#4 achieve comparable results, and the model architecture and training process of Arch#3 are much simpler than those of Arch#4, the further evaluations reported in the following sections are based on Arch#3. For clarity, Arch#3 is named direct PIT-CE-ASR from now on.

TABLE III: WER (%) of the proposed multi-talker mixed speech recognition architectures on the IHM-2mix dataset under the 0dB SNR condition (using the 80hr training subset). Arch#1-#4 indicate the proposed architectures described in Sections III.A-D, respectively

    Arch    Front-end            Back-end          High E WER    Low E WER
    #1      Feat-Sep-baseline    Single-Spk-ASR
    #2      Feat-Sep-PIT-MSE     Single-Spk-ASR
    #3      -                    PIT-CE
    #4      Feat-Sep-PIT-MSE     PIT-CE

C. Evaluation of the Direct PIT-CE-ASR Model on a Large Dataset

We evaluated the direct PIT-CE-ASR architecture on the full IHM-2mix corpus. All the 400 hours of mixed data under the different SNR conditions are pooled together for training. The direct PIT-CE-ASR model is still composed of 6 BLSTM layers with 768 memory cells in each layer. All other configurations are the same as in the experiments conducted on the subset.

The results under different SNR conditions are shown in Table IV. The direct PIT-CE-ASR model achieved significant improvements on both talkers compared to the baseline results in Table II for all SNR conditions. Compared to the results in Table III, achieved with the 80-hour training subset, we observe that an additional absolute 10.0% WER improvement on both speakers can be obtained using the large training set.
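For reference, the assignment-with-better-WER scoring described in Section IV-B above can be sketched as follows (our illustration; wer() is a plain edit-distance routine, and all names are ours):

    import itertools

    def wer(ref, hyp):
        """Word error rate via Levenshtein distance over words."""
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                d[i][j] = min(d[i-1][j-1] + (r[i-1] != h[j-1]),
                              d[i-1][j] + 1, d[i][j-1] + 1)
        return d[len(r)][len(h)] / max(len(r), 1)

    def best_assignment_wer(refs, hyps):
        """Average WER of the best hypothesis-to-reference assignment."""
        return min(
            sum(wer(refs[p], hyps[s]) for s, p in enumerate(perm)) / len(refs)
            for perm in itertools.permutations(range(len(refs))))

    print(best_assignment_wer(["hello world", "good morning"],
                              ["good morning", "hello word"]))   # 0.25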
We also observe that the WER increases only slowly as the SNR becomes smaller for the high energy speaker, and that the WER improvement is very significant for the low energy speaker across all conditions. In the 0dB SNR scenario, the WERs of the two speakers are very close and are 45.0% lower than those achieved with the single-talker ASR system, for both the high and low energy speakers. At 20dB SNR, the WER of the high energy speaker is still significantly better than the baseline, and it approaches the single-talker recognition result reported in Table I.

TABLE IV: WER (%) of the proposed direct PIT-CE-ASR model on the IHM-2mix dataset with the full training set

    SNR Condition    High E WER    Low E WER
    0dB
    5dB
    10dB
    15dB
    20dB

D. Permutation Invariant Training with Alternative Deep Learning Models

We investigated the direct PIT-CE-ASR model with alternative deep learning models. The first model we evaluated is a 6-layer feed-forward DNN in which each layer contains 2048 sigmoid units. The input to the DNN is a window of 11 frames, each with a 40-dimensional LFBK feature. The results of the DNN-based PIT-CE-ASR model are reported at the top of Table V. Although it still obtains an obvious improvement over the baseline single-speaker model, the gain is much smaller, with nearly a 20.0% WER gap in every condition, than that from the BLSTM-based PIT-CE-ASR model. The difference between the DNN and BLSTM models can be partially attributed to the stronger modeling power of BLSTM models and partially to the better tracing ability of RNNs.

We also compared BLSTM models with 4, 6, and 8 layers, as shown in Table V. It is observed that deeper BLSTM models perform better. This is different from the single-speaker ASR model, whose performance peaks at 4 BLSTM layers [37]. This is because the direct PIT-CE-ASR architecture needs to conduct two tasks, separation and recognition, and thus requires additional modeling power.

Fig. 4: Decoding results of the baseline single-speaker BLSTM-RNN system on a 0dB two-talker mixed speech sample.

Fig. 5: Decoding results of the proposed direct PIT-CE-ASR model on a 0dB two-talker mixed speech sample.

TABLE V: WER (%) of the direct PIT-CE-ASR model using different deep learning models on the IHM-2mix dataset

    Model       SNR Condition    High E WER    Low E WER
    6L-DNN      0dB
                5dB
                10dB
                15dB
                20dB
    4L-BLSTM    0dB
                5dB
                10dB
                15dB
                20dB
    6L-BLSTM    0dB
                5dB
                10dB
                15dB
                20dB
    8L-BLSTM    0dB
                5dB
                10dB
                15dB
                20dB

E. Analysis of the Multi-Talker Speech Recognition Results

To better understand the results of multi-talker speech recognition, we computed the WER separately for speech mixed from same-gender and opposite-gender speakers. The results are shown in Table VI. It is observed that same-gender mixed speech is much more difficult to recognize than opposite-gender mixed speech, and the gap is even larger when the energy ratio of the two speakers is closer to 1. It is also observed that the mixed speech of two male speakers is harder to recognize than that of two female speakers. These results suggest that effective exploitation of gender information may help to further improve the multi-talker speech recognition system. We will explore this in our future work.

TABLE VI: WER (%) comparison of the 6-layer-BLSTM direct PIT-CE-ASR model on mixed speech generated from two male speakers (M + M), two female speakers (F + F), and a male and a female speaker (M + F)

    Genders    SNR Condition    High E WER    Low E WER
    M + M
    F + F
    M + F

To further understand our model, we examined recognition results with and without the direct PIT-CE-ASR. An example of these results on a 0dB two-talker mixed speech utterance is shown in Figure 4 (using the single-speaker baseline system) and Figure 5 (with direct PIT-CE-ASR). It is clearly seen that the results are erroneous when the single-speaker baseline system is used to recognize the two-talker mixed speech. In contrast, many more words are recognized correctly with the proposed direct PIT-CE-ASR model.

F. Three-Talker Speech Recognition with Direct PIT-CE-ASR

In this subsection, we further extend and evaluate the proposed direct PIT-CE-ASR model on three-talker mixed speech using the IHM-3mix dataset. The three-talker direct PIT-CE-ASR model is also a 6-layer BLSTM model. The training and testing configurations are the same as those for two-talker speech recognition.

The direct PIT-CE-ASR training processes, as measured by CE on both the two- and three-talker mixed speech training and validation sets, are illustrated in Figure 6. It is observed that the direct PIT-CE-ASR model with this specific configuration converges slowly, and that the CE improvement on the training and validation sets progresses almost identically. The training progress on three-talker mixed speech is similar to that on two-talker mixed speech, but with a clearly higher CE value. This indicates the huge challenge of recognizing speech mixed from more than two talkers. Note that in this set of experiments we used the same model configuration as that used in two-talker mixed speech recognition. Since three-talker mixed speech recognition is much harder, using deeper and wider models may help to improve the performance. Due to resource limitations, we did not search for the best configuration for this task.

Fig. 6: CE values over epochs on both the IHM-2mix and IHM-3mix training and validation sets with the proposed direct PIT-CE-ASR model (curves: 6L-BLSTM-2SPK-Train, 6L-BLSTM-3SPK-Train, 6L-BLSTM-2SPK-Val, 6L-BLSTM-3SPK-Val).

The three-talker mixed speech recognition WERs are reported in Table VII, along with the WERs for the different gender combinations. The WERs achieved with the single-speaker model are listed in the first line of Table VII. Compared to the results on IHM-2mix, the results on IHM-3mix are significantly worse with the conventional single-speaker model. Under this extremely hard setup, the proposed direct PIT-CE-ASR architecture still demonstrates its ability to separate, trace and recognize the mixed speech, achieving a 25.0% relative WER reduction across all three speakers. Although the performance gap from two-talker to three-talker is obvious, the result is still very promising for this speaker-independent three-talker LVCSR task. Not surprisingly, mixed speech of different genders is relatively easier to recognize than that of the same gender.

TABLE VII: WER (%) comparison of the baseline single-speaker BLSTM-RNN system and the proposed direct PIT-CE-ASR model on the IHM-3mix dataset. "Diff" indicates the mixed speech is from different genders, and "Same" indicates the mixed speech is from the same gender

    Model                Genders      Speaker1    Speaker2    Speaker3    All
    BLSTM-RNN            All
    direct PIT-CE-ASR    All
                         Different
                         Same

Moreover, we conducted another interesting experiment: we used the three-talker PIT-CE-ASR model to recognize the two-talker mixed speech. The results are shown in Table VIII. Surprisingly, the results are almost identical to those obtained with the 6-layer BLSTM based two-talker model (shown in Table IV). This demonstrates the good generalization ability of our proposed direct PIT-CE-ASR model over a variable number of mixed speakers, and it suggests that a single PIT model may be able to recognize mixed speech of a varying number of speakers without knowing or estimating the number of speakers.

TABLE VIII: WER (%) of using the three-talker direct PIT-CE-ASR model to recognize the two-talker mixed IHM-2mix speech

    Model                      SNR Condition    High E WER    Low E WER
    Three-Talker PIT-CE-ASR    0dB
                               5dB
                               10dB
                               15dB
                               20dB

V. CONCLUSION

In this paper, we proposed several architectures for recognizing multi-talker mixed speech given only a single channel of the mixed signal. Our technique is based on permutation invariant training, which was originally developed for the separation of multiple speech streams. PIT can be performed on the front-end feature separation module to obtain better separated feature streams, or it can be extended to the back-end recognition module to predict the separated senone posterior probabilities directly.
Moreover, PIT can be applied to both the front-end and the back-end in a joint optimization architecture. When using PIT to optimize a model, the criterion is computed over all frames of the whole utterance for each possible output-target assignment, and the assignment with the minimum loss is picked for parameter optimization. PIT thus addresses the label permutation problem well and performs speaker separation and tracing in one shot. In particular, with the proposed direct PIT-CE based recognition model, multi-talker mixed speech recognition can be conducted directly, without an explicit separation stage.

The proposed architectures were evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that the proposed architectures are very promising. Our models obtain relative WER reductions of 45.0% and 25.0% against the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively.

Another interesting observation is that there is no degradation even when the proposed three-talker model is used to recognize the two-talker mixed speech directly. This suggests that we may construct a single model to recognize speech mixed from a variable number of speakers, without knowing or estimating the number of speakers in the mixed speech. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous LVCSR task.

ACKNOWLEDGMENT

This work was supported by the Shanghai Sailing Program No. 16YF, the China NSFC projects, the Interdisciplinary Program (14JCZ03) of Shanghai Jiao Tong University in China, and the Tencent-Shanghai Jiao Tong University joint project. Experiments were carried out on the PI supercomputer at Shanghai Jiao Tong University.

REFERENCES

[1] D. Yu and L. Deng, Automatic Speech Recognition: A Deep Learning Approach, ser. Signals and Communication Technology. Springer London. [Online]. Available: https://books.google.com/books?id=rubtbqaaqbaj
[2] D. Yu, L. Deng, and G. E. Dahl, "Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
[3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 20, 2012.
[4] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2011.
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine (SPM), vol. 29, 2012.
[6] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, "Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[7] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, 2014.
[8] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term memory, fully connected deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[9] M. Bi, Y. Qian, and K. Yu, "Very deep convolutional neural networks for LVCSR," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[10] Y. Qian, M. Bi, T. Tan, and K. Yu, "Very deep convolutional neural networks for noise robust speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 12, 2016.
[11] Y. Qian and P. C. Woodland, "Very deep convolutional neural networks for robust speech recognition," in IEEE Spoken Language Technology Workshop (SLT), 2016.
[12] V. Mitra and H. Franco, "Time-frequency convolutional networks for robust speech recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.
[13] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[14] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, "Very deep multilingual convolutional neural networks for LVCSR," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[15] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning (ICML), 2016.
[16] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. Dai, "Compact feedforward sequential memory networks for large vocabulary continuous speech recognition," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016.
[17] D. Yu, W. Xiong, J. Droppo, A. Stolcke, G. Ye, J. Li, and G. Zweig, "Deep convolutional neural networks with layer-wise context expansion and attention," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016.
[18] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig, "The Microsoft 2016 conversational speech recognition system," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[19] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[20] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, "Multi-talker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), accepted, 2017.
[21] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Machine Learning (MLJ), vol. 29, no. 2-3, 1997.
[22] M. Cooke, J. R. Hershey, and S. J. Rennie, "Monaural speech separation and recognition challenge," Computer Speech and Language (CSL), vol. 24, pp. 1-15, 2010.
[23] C. Weng, D. Yu, M. L. Seltzer, and J. Droppo, "Deep neural networks for single-channel multi-talker speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 10, 2015.
[24] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[25] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016.
[26] Z. Chen, Y. Luo, and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[27] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Kristjansson, "Super-human multi-talker speech recognition: A graphical modeling approach," Computer Speech and Language (CSL), vol. 24, 2010.
[28] S. J. Rennie, J. R. Hershey, and P. A. Olsen, "Single-channel multitalker speech recognition," IEEE Signal Processing Magazine (SPM), vol. 27, 2010.

[29] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[30] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA). Springer-Verlag New York, Inc., 2015.
[31] Y. Wang, A. Narayanan, and D. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, 2014.
[32] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters (SPL), vol. 21, 2014.
[33] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, Dec. 2015.
[34] J. Du, Y. Tu, L. R. Dai, and C. H. Lee, "A regression approach to single-channel speech separation via high-resolution deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 24, Aug. 2016.
[35] T. Hain, L. Burget, J. Dines, P. N. Garner, F. Grézl, A. E. Hannani, M. Huijbregts, M. Karafiat, M. Lincoln, and V. Wan, "Transcribing meetings with the AMIDA systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 20, no. 2, 2012.
[36] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016.
[37] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, "Highway long short-term memory RNNs for distant speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
[38] J. J. Godfrey and E. Holliman, "Switchboard-1 Release 2," Linguistic Data Consortium, Philadelphia, 1997.
[39] T. Sercu and V. Goel, "Dense prediction on sequences with time-dilated convolutions for speech recognition," arXiv preprint, 2016.
[40] G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, "The IBM 2016 English conversational telephone speech recognition system," in Annual Conference of International Speech Communication Association (INTERSPEECH), 2016.
[41] P. Swietojanski, A. Ghoshal, and S. Renals, "Hybrid acoustic models for distant and multichannel large vocabulary speech recognition," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2013.
[42] D. Yu, A. Eversole, M. Seltzer, K. Yao, Z. Huang, B. Guenter, O. Kuchaiev, Y. Zhang, F. Seide, H. Wang et al., "An introduction to computational networks and the Computational Network Toolkit," Microsoft Technical Report MSR-TR, 2014.
[43] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), no. EPFL-CONF, 2011.


More information

DNN ACOUSTIC MODELING WITH MODULAR MULTI-LINGUAL FEATURE EXTRACTION NETWORKS

DNN ACOUSTIC MODELING WITH MODULAR MULTI-LINGUAL FEATURE EXTRACTION NETWORKS DNN ACOUSTIC MODELING WITH MODULAR MULTI-LINGUAL FEATURE EXTRACTION NETWORKS Jonas Gehring 1 Quoc Bao Nguyen 1 Florian Metze 2 Alex Waibel 1,2 1 Interactive Systems Lab, Karlsruhe Institute of Technology;

More information

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren

A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK. Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren A NOVEL SCHEME FOR SPEAKER RECOGNITION USING A PHONETICALLY-AWARE DEEP NEURAL NETWORK Yun Lei Nicolas Scheffer Luciana Ferrer Mitchell McLaren Speech Technology and Research Laboratory, SRI International,

More information

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition Hua Zhang, Yun Tang, Wenju Liu and Bo Xu National Laboratory of Pattern Recognition Institute of Automation, Chinese

More information

SPEECH RECOGNITION CHALLENGE IN THE WILD: ARABIC MGB-3

SPEECH RECOGNITION CHALLENGE IN THE WILD: ARABIC MGB-3 SPEECH RECOGNITION CHALLENGE IN THE WILD: ARABIC MGB-3 Ahmed Ali 1,2, Stephan Vogel 1, Steve Renals 2 1 Qatar Computing Research Institute, HBKU, Doha, Qatar 2 Centre for Speech Technology Research, University

More information

Calibration of Confidence Measures in Speech Recognition

Calibration of Confidence Measures in Speech Recognition Submitted to IEEE Trans on Audio, Speech, and Language, July 2010 1 Calibration of Confidence Measures in Speech Recognition Dong Yu, Senior Member, IEEE, Jinyu Li, Member, IEEE, Li Deng, Fellow, IEEE

More information

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak

UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS. Heiga Zen, Haşim Sak UNIDIRECTIONAL LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK WITH RECURRENT OUTPUT LAYER FOR LOW-LATENCY SPEECH SYNTHESIS Heiga Zen, Haşim Sak Google fheigazen,hasimg@google.com ABSTRACT Long short-term

More information

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks

System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering

More information

On the Formation of Phoneme Categories in DNN Acoustic Models

On the Formation of Phoneme Categories in DNN Acoustic Models On the Formation of Phoneme Categories in DNN Acoustic Models Tasha Nagamine Department of Electrical Engineering, Columbia University T. Nagamine Motivation Large performance gap between humans and state-

More information

Vowel mispronunciation detection using DNN acoustic models with cross-lingual training

Vowel mispronunciation detection using DNN acoustic models with cross-lingual training INTERSPEECH 2015 Vowel mispronunciation detection using DNN acoustic models with cross-lingual training Shrikant Joshi, Nachiket Deo, Preeti Rao Department of Electrical Engineering, Indian Institute of

More information

Speech Recognition at ICSI: Broadcast News and beyond

Speech Recognition at ICSI: Broadcast News and beyond Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI

More information

Human Emotion Recognition From Speech

Human Emotion Recognition From Speech RESEARCH ARTICLE OPEN ACCESS Human Emotion Recognition From Speech Miss. Aparna P. Wanare*, Prof. Shankar N. Dandare *(Department of Electronics & Telecommunication Engineering, Sant Gadge Baba Amravati

More information

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model

Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Unsupervised Learning of Word Semantic Embedding using the Deep Structured Semantic Model Xinying Song, Xiaodong He, Jianfeng Gao, Li Deng Microsoft Research, One Microsoft Way, Redmond, WA 98052, U.S.A.

More information

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important

More information

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen

TRANSFER LEARNING OF WEAKLY LABELLED AUDIO. Aleksandr Diment, Tuomas Virtanen TRANSFER LEARNING OF WEAKLY LABELLED AUDIO Aleksandr Diment, Tuomas Virtanen Tampere University of Technology Laboratory of Signal Processing Korkeakoulunkatu 1, 33720, Tampere, Finland firstname.lastname@tut.fi

More information

Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method

Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method Sanket S. Kalamkar and Adrish Banerjee Department of Electrical Engineering

More information

International Journal of Computational Intelligence and Informatics, Vol. 1 : No. 4, January - March 2012

International Journal of Computational Intelligence and Informatics, Vol. 1 : No. 4, January - March 2012 Text-independent Mono and Cross-lingual Speaker Identification with the Constraint of Limited Data Nagaraja B G and H S Jayanna Department of Information Science and Engineering Siddaganga Institute of

More information

Lecture 1: Machine Learning Basics

Lecture 1: Machine Learning Basics 1/69 Lecture 1: Machine Learning Basics Ali Harakeh University of Waterloo WAVE Lab ali.harakeh@uwaterloo.ca May 1, 2017 2/69 Overview 1 Learning Algorithms 2 Capacity, Overfitting, and Underfitting 3

More information

A Review: Speech Recognition with Deep Learning Methods

A Review: Speech Recognition with Deep Learning Methods Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 5, May 2015, pg.1017

More information

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass

BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION. Han Shu, I. Lee Hetherington, and James Glass BAUM-WELCH TRAINING FOR SEGMENT-BASED SPEECH RECOGNITION Han Shu, I. Lee Hetherington, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge,

More information

Dropout improves Recurrent Neural Networks for Handwriting Recognition

Dropout improves Recurrent Neural Networks for Handwriting Recognition 2014 14th International Conference on Frontiers in Handwriting Recognition Dropout improves Recurrent Neural Networks for Handwriting Recognition Vu Pham,Théodore Bluche, Christopher Kermorvant, and Jérôme

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 17, NO. 3, MARCH 2009 423 Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition George

More information

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System

QuickStroke: An Incremental On-line Chinese Handwriting Recognition System QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents

More information

Word Segmentation of Off-line Handwritten Documents

Word Segmentation of Off-line Handwritten Documents Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department

More information

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano. Graduate School of Information Science, Nara Institute of Science & Technology ISCA Archive SUBJECTIVE EVALUATION FOR HMM-BASED SPEECH-TO-LIP MOVEMENT SYNTHESIS Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano Graduate School of Information Science, Nara Institute of Science & Technology

More information

Investigation on Mandarin Broadcast News Speech Recognition

Investigation on Mandarin Broadcast News Speech Recognition Investigation on Mandarin Broadcast News Speech Recognition Mei-Yuh Hwang 1, Xin Lei 1, Wen Wang 2, Takahiro Shinozaki 1 1 Univ. of Washington, Dept. of Electrical Engineering, Seattle, WA 98195 USA 2

More information

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition

Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Likelihood-Maximizing Beamforming for Robust Hands-Free Speech Recognition Seltzer, M.L.; Raj, B.; Stern, R.M. TR2004-088 December 2004 Abstract

More information

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za

More information

Generative models and adversarial training

Generative models and adversarial training Day 4 Lecture 1 Generative models and adversarial training Kevin McGuinness kevin.mcguinness@dcu.ie Research Fellow Insight Centre for Data Analytics Dublin City University What is a generative model?

More information

The Use of Statistical, Computational and Modelling Tools in Higher Learning Institutions: A Case Study of the University of Dodoma

The Use of Statistical, Computational and Modelling Tools in Higher Learning Institutions: A Case Study of the University of Dodoma International Journal of Computer Applications (975 8887) The Use of Statistical, Computational and Modelling Tools in Higher Learning Institutions: A Case Study of the University of Dodoma Gilbert M.

More information

Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions

Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions 26 24th European Signal Processing Conference (EUSIPCO) Noise-Adaptive Perceptual Weighting in the AMR-WB Encoder for Increased Speech Loudness in Adverse Far-End Noise Conditions Emma Jokinen Department

More information

Model Ensemble for Click Prediction in Bing Search Ads

Model Ensemble for Click Prediction in Bing Search Ads Model Ensemble for Click Prediction in Bing Search Ads Xiaoliang Ling Microsoft Bing xiaoling@microsoft.com Hucheng Zhou Microsoft Research huzho@microsoft.com Weiwei Deng Microsoft Bing dedeng@microsoft.com

More information

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS

ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS ACOUSTIC EVENT DETECTION IN REAL LIFE RECORDINGS Annamaria Mesaros 1, Toni Heittola 1, Antti Eronen 2, Tuomas Virtanen 1 1 Department of Signal Processing Tampere University of Technology Korkeakoulunkatu

More information

Soft Computing based Learning for Cognitive Radio

Soft Computing based Learning for Cognitive Radio Int. J. on Recent Trends in Engineering and Technology, Vol. 10, No. 1, Jan 2014 Soft Computing based Learning for Cognitive Radio Ms.Mithra Venkatesan 1, Dr.A.V.Kulkarni 2 1 Research Scholar, JSPM s RSCOE,Pune,India

More information

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project

Phonetic- and Speaker-Discriminant Features for Speaker Recognition. Research Project Phonetic- and Speaker-Discriminant Features for Speaker Recognition by Lara Stoll Research Project Submitted to the Department of Electrical Engineering and Computer Sciences, University of California

More information

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski

Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski Training a Neural Network to Answer 8th Grade Science Questions Steven Hewitt, An Ju, Katherine Stasaski Problem Statement and Background Given a collection of 8th grade science questions, possible answer

More information

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics

Machine Learning from Garden Path Sentences: The Application of Computational Linguistics Machine Learning from Garden Path Sentences: The Application of Computational Linguistics http://dx.doi.org/10.3991/ijet.v9i6.4109 J.L. Du 1, P.F. Yu 1 and M.L. Li 2 1 Guangdong University of Foreign Studies,

More information

Cultivating DNN Diversity for Large Scale Video Labelling

Cultivating DNN Diversity for Large Scale Video Labelling Cultivating DNN Diversity for Large Scale Video Labelling Mikel Bober-Irizar mikel@mxbi.net Sameed Husain sameed.husain@surrey.ac.uk Miroslaw Bober m.bober@surrey.ac.uk Eng-Jon Ong e.ong@surrey.ac.uk Abstract

More information

arxiv: v1 [cs.lg] 15 Jun 2015

arxiv: v1 [cs.lg] 15 Jun 2015 Dual Memory Architectures for Fast Deep Learning of Stream Data via an Online-Incremental-Transfer Strategy arxiv:1506.04477v1 [cs.lg] 15 Jun 2015 Sang-Woo Lee Min-Oh Heo School of Computer Science and

More information

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention

A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention A Simple VQA Model with a Few Tricks and Image Features from Bottom-up Attention Damien Teney 1, Peter Anderson 2*, David Golub 4*, Po-Sen Huang 3, Lei Zhang 3, Xiaodong He 3, Anton van den Hengel 1 1

More information

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds

DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS. Elliot Singer and Douglas Reynolds DOMAIN MISMATCH COMPENSATION FOR SPEAKER RECOGNITION USING A LIBRARY OF WHITENERS Elliot Singer and Douglas Reynolds Massachusetts Institute of Technology Lincoln Laboratory {es,dar}@ll.mit.edu ABSTRACT

More information

Rule Learning With Negation: Issues Regarding Effectiveness

Rule Learning With Negation: Issues Regarding Effectiveness Rule Learning With Negation: Issues Regarding Effectiveness S. Chua, F. Coenen, G. Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX Liverpool, United

More information

Affective Classification of Generic Audio Clips using Regression Models

Affective Classification of Generic Audio Clips using Regression Models Affective Classification of Generic Audio Clips using Regression Models Nikolaos Malandrakis 1, Shiva Sundaram, Alexandros Potamianos 3 1 Signal Analysis and Interpretation Laboratory (SAIL), USC, Los

More information

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH

STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH STUDIES WITH FABRICATED SWITCHBOARD DATA: EXPLORING SOURCES OF MODEL-DATA MISMATCH Don McAllaster, Larry Gillick, Francesco Scattone, Mike Newman Dragon Systems, Inc. 320 Nevada Street Newton, MA 02160

More information

The 2014 KIT IWSLT Speech-to-Text Systems for English, German and Italian

The 2014 KIT IWSLT Speech-to-Text Systems for English, German and Italian The 2014 KIT IWSLT Speech-to-Text Systems for English, German and Italian Kevin Kilgour, Michael Heck, Markus Müller, Matthias Sperber, Sebastian Stüker and Alex Waibel Institute for Anthropomatics Karlsruhe

More information

OCR for Arabic using SIFT Descriptors With Online Failure Prediction

OCR for Arabic using SIFT Descriptors With Online Failure Prediction OCR for Arabic using SIFT Descriptors With Online Failure Prediction Andrey Stolyarenko, Nachum Dershowitz The Blavatnik School of Computer Science Tel Aviv University Tel Aviv, Israel Email: stloyare@tau.ac.il,

More information

Using EEG to Improve Massive Open Online Courses Feedback Interaction

Using EEG to Improve Massive Open Online Courses Feedback Interaction Using EEG to Improve Massive Open Online Courses Feedback Interaction Haohan Wang, Yiwei Li, Xiaobo Hu, Yucong Yang, Zhu Meng, Kai-min Chang Language Technologies Institute School of Computer Science Carnegie

More information

Speech Translation for Triage of Emergency Phonecalls in Minority Languages

Speech Translation for Triage of Emergency Phonecalls in Minority Languages Speech Translation for Triage of Emergency Phonecalls in Minority Languages Udhyakumar Nallasamy, Alan W Black, Tanja Schultz, Robert Frederking Language Technologies Institute Carnegie Mellon University

More information

Reducing Features to Improve Bug Prediction

Reducing Features to Improve Bug Prediction Reducing Features to Improve Bug Prediction Shivkumar Shivaji, E. James Whitehead, Jr., Ram Akella University of California Santa Cruz {shiv,ejw,ram}@soe.ucsc.edu Sunghun Kim Hong Kong University of Science

More information

Learning Methods for Fuzzy Systems

Learning Methods for Fuzzy Systems Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8

More information

Artificial Neural Networks written examination

Artificial Neural Networks written examination 1 (8) Institutionen för informationsteknologi Olle Gällmo Universitetsadjunkt Adress: Lägerhyddsvägen 2 Box 337 751 05 Uppsala Artificial Neural Networks written examination Monday, May 15, 2006 9 00-14

More information

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification Tomi Kinnunen and Ismo Kärkkäinen University of Joensuu, Department of Computer Science, P.O. Box 111, 80101 JOENSUU,

More information

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments Vijayshri Ramkrishna Ingale PG Student, Department of Computer Engineering JSPM s Imperial College of Engineering &

More information

Xinyu Tang. Education. Research Interests. Honors and Awards. Professional Experience

Xinyu Tang. Education. Research Interests. Honors and Awards. Professional Experience Xinyu Tang Parasol Laboratory Department of Computer Science Texas A&M University, TAMU 3112 College Station, TX 77843-3112 phone:(979)847-8835 fax: (979)458-0425 email: xinyut@tamu.edu url: http://parasol.tamu.edu/people/xinyut

More information

An Online Handwriting Recognition System For Turkish

An Online Handwriting Recognition System For Turkish An Online Handwriting Recognition System For Turkish Esra Vural, Hakan Erdogan, Kemal Oflazer, Berrin Yanikoglu Sabanci University, Tuzla, Istanbul, Turkey 34956 ABSTRACT Despite recent developments in

More information

arxiv: v4 [cs.cl] 28 Mar 2016

arxiv: v4 [cs.cl] 28 Mar 2016 LSTM-BASED DEEP LEARNING MODELS FOR NON- FACTOID ANSWER SELECTION Ming Tan, Cicero dos Santos, Bing Xiang & Bowen Zhou IBM Watson Core Technologies Yorktown Heights, NY, USA {mingtan,cicerons,bingxia,zhou}@us.ibm.com

More information

Constructing Parallel Corpus from Movie Subtitles

Constructing Parallel Corpus from Movie Subtitles Constructing Parallel Corpus from Movie Subtitles Han Xiao 1 and Xiaojie Wang 2 1 School of Information Engineering, Beijing University of Post and Telecommunications artex.xh@gmail.com 2 CISTR, Beijing

More information

On the Combined Behavior of Autonomous Resource Management Agents

On the Combined Behavior of Autonomous Resource Management Agents On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science

More information

LEARNING A SEMANTIC PARSER FROM SPOKEN UTTERANCES. Judith Gaspers and Philipp Cimiano

LEARNING A SEMANTIC PARSER FROM SPOKEN UTTERANCES. Judith Gaspers and Philipp Cimiano LEARNING A SEMANTIC PARSER FROM SPOKEN UTTERANCES Judith Gaspers and Philipp Cimiano Semantic Computing Group, CITEC, Bielefeld University {jgaspers cimiano}@cit-ec.uni-bielefeld.de ABSTRACT Semantic parsers

More information

Application of Multimedia Technology in Vocabulary Learning for Engineering Students

Application of Multimedia Technology in Vocabulary Learning for Engineering Students Application of Multimedia Technology in Vocabulary Learning for Engineering Students https://doi.org/10.3991/ijet.v12i01.6153 Xue Shi Luoyang Institute of Science and Technology, Luoyang, China xuewonder@aliyun.com

More information

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.

Semi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17. Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link

More information

Cross Language Information Retrieval

Cross Language Information Retrieval Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................

More information

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools

Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Listening and Speaking Skills of English Language of Adolescents of Government and Private Schools Dr. Amardeep Kaur Professor, Babe Ke College of Education, Mudki, Ferozepur, Punjab Abstract The present

More information

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria

FUZZY EXPERT. Dr. Kasim M. Al-Aubidy. Philadelphia University. Computer Eng. Dept February 2002 University of Damascus-Syria FUZZY EXPERT SYSTEMS 16-18 18 February 2002 University of Damascus-Syria Dr. Kasim M. Al-Aubidy Computer Eng. Dept. Philadelphia University What is Expert Systems? ES are computer programs that emulate

More information

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines

Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Speech Segmentation Using Probabilistic Phonetic Feature Hierarchy and Support Vector Machines Amit Juneja and Carol Espy-Wilson Department of Electrical and Computer Engineering University of Maryland,

More information

Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures

Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures Alex Graves and Jürgen Schmidhuber IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland TU Munich, Boltzmannstr.

More information

Rule Learning with Negation: Issues Regarding Effectiveness

Rule Learning with Negation: Issues Regarding Effectiveness Rule Learning with Negation: Issues Regarding Effectiveness Stephanie Chua, Frans Coenen, and Grant Malcolm University of Liverpool Department of Computer Science, Ashton Building, Ashton Street, L69 3BX

More information

Speaker Identification by Comparison of Smart Methods. Abstract

Speaker Identification by Comparison of Smart Methods. Abstract Journal of mathematics and computer science 10 (2014), 61-71 Speaker Identification by Comparison of Smart Methods Ali Mahdavi Meimand Amin Asadi Majid Mohamadi Department of Electrical Department of Computer

More information

Segregation of Unvoiced Speech from Nonspeech Interference

Segregation of Unvoiced Speech from Nonspeech Interference Technical Report OSU-CISRC-8/7-TR63 Department of Computer Science and Engineering The Ohio State University Columbus, OH 4321-1277 FTP site: ftp.cse.ohio-state.edu Login: anonymous Directory: pub/tech-report/27

More information