Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

Yanmin Qian, Member, IEEE, Xuankai Chang, Student Member, IEEE, and Dong Yu, Senior Member, IEEE

arXiv:1707.06527v1 [cs.SD] 19 Jul 2017

Yanmin Qian and Xuankai Chang are with the Computer Science and Engineering Department, Shanghai Jiao Tong University, Shanghai, 200240 P. R. China ({yanminqian,xuank}@sjtu.edu.cn). Dong Yu is with Tencent AI Lab, Seattle, USA (dyu@tencent.com).

Abstract: Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module with the minimum mean square error (MSE) criterion and a back-end recognition module with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep learning based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that our proposed architectures can cut the word error rate (WER) by 45.0% and 25.0% relative against the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition on the challenging speaker-independent spontaneous large vocabulary continuous speech task.

Keywords: permutation invariant training, multi-talker mixed speech recognition, feature separation, joint optimization

I. INTRODUCTION

Thanks to the significant progress made in recent years [1]-[20], ASR systems have now surpassed the threshold for adoption in many real-world scenarios and enabled services such as Microsoft Cortana, Apple's Siri and Google Now, where close-talk microphones are commonly used. However, current ASR systems still perform poorly when far-field microphones are used. This is because many difficulties hidden by close-talk microphones surface under distant recognition scenarios. For example, the signal-to-noise ratio (SNR) between the target speaker and the interfering noises is much lower than when close-talk microphones are used. As a result, the interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.

In this paper, we aim to solve the speech recognition problem when multiple talkers speak at the same time and only a single channel of mixed speech is available. Many attempts have been made to attack this problem. Before the deep learning era, the most famous and effective model was the factorial GMM-HMM [21], which outperformed humans in the 2006 monaural speech separation and recognition challenge [22]. The factorial GMM-HMM, however, requires the test speakers to be seen during training so that the interactions between them can be properly modeled.
Recently, several deep learning based techniques have been proposed to solve this problem [19], [20], [23], [24], [25], [26]. The core issue that these techniques try to address is the label ambiguity or permutation problem (refer to Section III for details). In Weng et al. [23] a deep learning model was developed to recognize the mixed speech directly. To solve the label ambiguity problem, Weng et al. assigned the senone labels of the talker with the higher instantaneous energy to output one and the other to output two. Although this addresses the label ambiguity problem, it causes frequent speaker switches across frames. To deal with the speaker-switch problem, a two-speaker joint decoder with a speaker switching penalty was used to trace speakers. This approach has two limitations. First, energy, which is manually picked, may not be the best information for assigning labels under all conditions. Second, the frame-switching problem introduces a burden on the decoder.

In Hershey et al. [24], [25] the multi-talker mixed speech is first separated into multiple streams. An ASR engine is then applied to these streams independently to recognize the speech. To separate the speech streams, they proposed a technique called deep clustering (DPCL). They assume that each time-frequency bin belongs to only one speaker and can be mapped into a shared embedding space. The model is optimized so that, in the embedding space, the time-frequency bins belonging to the same speaker are closer and those of different speakers are farther away. During evaluation, a clustering algorithm is first applied to the embeddings to generate a partition of the time-frequency bins, and the separated audio streams are then reconstructed based on the partition. In this approach, speech separation and recognition are usually two separate components.

Chen et al. [26] proposed a similar technique called the deep attractor network (DANet). Following DPCL, their approach also learns a high-dimensional embedding of the acoustic signals. Different from DPCL, however, it creates cluster centers, called attractor points, in the embedding space to pull together the time-frequency bins corresponding to the same source. The main limitation of DANet is the requirement to estimate the attractor points at evaluation time and to form frequency-bin clusters based on these points.

In Yu et al. [19] and Kolbæk et al. [20], a simpler yet equally effective technique named permutation invariant training (PIT) was proposed to attack the speaker-independent multi-talker speech separation problem (see Footnote 1). In PIT, the source targets are treated as a set (i.e., order is irrelevant). During training, PIT first determines the output-target assignment with the minimum error at the utterance level based on the forward-pass result. It then minimizes the error given that assignment. This strategy elegantly solves the label permutation problem. However, in these original works PIT was used to separate speech streams from mixed speech. For this reason, a frequency-bin mask was first estimated and then used to reconstruct each stream. The minimum mean square error (MMSE) between the true and reconstructed speech streams was used as the criterion to optimize the model parameters.

Moreover, most previous work on multi-talker speech still focuses on speech separation [19], [20], [24], [25], [26]. In contrast, multi-talker speech recognition is much harder and the related work is sparser. There have been some attempts, but the related tasks are relatively simple. For example, the 2006 monaural speech separation and recognition challenge [21], [22], [23], [27], [28] was defined on a speaker-dependent, small vocabulary, constrained language model setup, while in [25] a small vocabulary read-style corpus was used. We are not aware of any extensive research on the more realistic, speaker-independent, spontaneous large vocabulary continuous speech recognition (LVCSR) of multi-talker mixed speech before our work.

In this paper, we attack the multi-talker mixed speech recognition problem with a focus on the speaker-independent setup given just a single channel of the mixed speech. Different from [19], [20], here we extend and redefine PIT over log filter bank features and/or senone posteriors. In some architectures PIT is defined upon the minimum mean square error (MSE) between the true and estimated individual speaker features to separate speech at the feature level (called PIT-MSE from now on). In other architectures, PIT is defined upon the cross entropy (CE) between the true and estimated senone posterior probabilities to recognize multiple streams of speech directly (called PIT-CE from now on). Moreover, the PIT-MSE based front-end feature separation can be combined with the PIT-CE based back-end recognition in a joint optimization architecture. We evaluate our architectures on artificially generated AMI data with both two- and three-talker mixed speech. The experimental results demonstrate that our proposed architectures are very promising.

The rest of the paper is organized as follows. In Section II we describe the speaker-independent multi-talker mixed speech recognition problem. In Section III we propose several PIT-based architectures to recognize multiple streams of speech. We report experimental results in Section IV and conclude the paper in Section V.

Footnote 1: In [24], a similar permutation-free technique, which is equivalent to PIT when there are exactly two speakers, was evaluated with negative results and conclusions.

II. SINGLE-CHANNEL MULTI-TALKER SPEECH RECOGNITION

In this paper, we assume that a linearly mixed single-microphone signal y[n] = \sum_{s=1}^{S} x_s[n] is observed, where x_s[n], s = 1, ..., S are the S streams of speech sources from different speakers. Our goal is to separate these streams and recognize every one of them. In other words, the model needs to generate S output streams, one for each source, at every time step.

However, given only the mixed speech y[n], the problem of recognizing all streams is under-determined, because there are an infinite number of possible x_s[n] combinations (and thus recognition results) that lead to the same y[n]. Fortunately, speech is not a random signal. It has patterns that we may learn from a training set of pairs y and l_s, s = 1, ..., S, where l_s is the senone label sequence for stream s.

In the single-speaker case, i.e., S = 1, the learning problem is significantly simplified because there is only one possible recognition result, and it can thus be cast as a simple supervised optimization problem. Given the input to the model, which is some feature representation of y, the output is simply the senone posterior probability conditioned on the input. As in most classification problems, the model can be optimized by minimizing the cross entropy between the senone label and the estimated posterior probability.

When S is greater than 1, however, the problem is no longer as simple and direct as in the single-talker case, and the label ambiguity, or permutation, becomes a problem in training. In the case of two speakers, because the speech sources are symmetric given the mixture (i.e., x_1 + x_2 equals x_2 + x_1, and both x_1 and x_2 have the same characteristics), there is no predetermined way to assign the correct target to the corresponding output layer. Interested readers can find additional information in [19], [20] on how training goes nowhere when the conventional supervised approach is used for multi-talker speech separation.

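As a tiny worked illustration of this ambiguity (a NumPy sketch with made-up vectors, not taken from the paper): if the network recovers both sources exactly but emits them in the opposite order from the reference list, a fixed assignment reports a large error while the minimum over the two possible assignments is zero.

```python
import numpy as np

# References and outputs: the outputs equal the references, but swapped in order.
x1, x2 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 2.0, 0.0])
o1, o2 = x2.copy(), x1.copy()

mse = lambda a, b: np.mean((a - b) ** 2)
fixed = (mse(o1, x1) + mse(o2, x2)) / 2          # fixed reference order
best = min((mse(o1, x1) + mse(o2, x2)) / 2,
           (mse(o1, x2) + mse(o2, x1)) / 2)      # minimum over assignments
print(fixed, best)                                # prints: 2.0 0.0
```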

III. PERMUTATION INVARIANT TRAINING FOR MULTI-TALKER SPEECH RECOGNITION

To address the label ambiguity problem, we propose several architectures based on permutation invariant training (PIT) [19], [20] for multi-talker mixed speech recognition. For simplicity and without loss of generality, we always assume there are two talkers in the mixed speech when describing our architectures in this section. Note that DPCL [24], [25] and DANet [26] are alternative solutions to the label ambiguity problem when the goal is speech source separation. However, these two techniques cannot be easily applied to direct recognition (i.e., without first separating the speech) of multiple streams of speech, because of the clustering step required during separation and the assumption that each time-frequency bin belongs to only one speaker (which is false when the CE criterion is used).

A. Feature Separation with Direct Supervision

To recognize the multi-talker mixed speech, one straightforward approach is to estimate the features of each speech source given the mixed speech features, and then recognize them one by one using a normal single-talker LVCSR system. This idea is depicted in Figure 1, where we learn a model to recover the filter bank (FBANK) features from the mixed FBANK features and then feed each stream of the recovered FBANK features to a conventional LVCSR system for recognition.

Fig. 1: Feature separation architectures for multi-talker mixed speech recognition. (a) Arch#1: feature separation with the fixed reference assignment. (b) Arch#2: feature separation with permutation invariant training.

In the simplest architecture, which is denoted as Arch#1 and illustrated in Figure 1(a), feature separation can be considered a multi-output regression problem, similar to many previous works [29], [30], [31], [32], [33], [34]. In this architecture, Y, the features of the mixed speech, are used as the input to a deep learning model, such as a deep neural network (DNN), a convolutional neural network (CNN), or a long short-term memory (LSTM) recurrent neural network (RNN), to estimate the feature representation of each individual talker. If we use the bidirectional LSTM-RNN model, the model will compute

    H_0 = Y                                                   (1)
    H_i^f = RNN_i^f(H_{i-1}),  i = 1, ..., N                  (2)
    H_i^b = RNN_i^b(H_{i-1}),  i = 1, ..., N                  (3)
    H_i = Stack(H_i^f, H_i^b),  i = 1, ..., N                 (4)
    \hat{X}_s = Linear(H_N),  s = 1, ..., S                   (5)

where H_0 is the input, N is the number of hidden layers, H_i is the i-th hidden layer, RNN_i^f and RNN_i^b are the forward and backward RNNs at hidden layer i, respectively, and \hat{X}_s, s = 1, ..., S are the estimated separated features from the output layers for each speech stream s.

During training, we need to provide the correct reference (or target) features X_s, s = 1, ..., S of all speakers in the mixed speech to the corresponding output layers for supervision. The model parameters can be optimized to minimize the mean square error (MSE) between the estimated separated features \hat{X}_s and the original reference features X_s,

    J = \frac{1}{S} \sum_{s=1}^{S} \| X_s - \hat{X}_s \|^2    (6)

where S is the number of mixed speakers. In this architecture, it is assumed that the reference features are organized in a given order and assigned to the output layer segments accordingly. Once trained, this feature separation module can be used as the front-end to process the mixed speech. The separated feature streams are then fed into a normal single-speaker LVCSR system for decoding.

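As a concrete illustration, below is a minimal sketch of this front-end, assuming PyTorch: a stacked bidirectional LSTM with one linear output head per speaker (Equations (1)-(5)) and the fixed-assignment MSE objective of Equation (6). The default sizes loosely mirror the configuration used later in Section IV.B, but all names and values here are illustrative, not the actual CNTK implementation.

```python
import torch
import torch.nn as nn

class FeatureSeparationBLSTM(nn.Module):
    """Stacked bidirectional LSTM mapping mixed-speech features to one
    estimated feature stream per speaker (Equations (1)-(5))."""
    def __init__(self, feat_dim=40, hidden=768, num_layers=3, num_speakers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=num_layers,
                             bidirectional=True, batch_first=True)
        # One linear output head per speech stream (Eq. (5)).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, feat_dim) for _ in range(num_speakers)])

    def forward(self, mix):                       # mix: (batch, time, feat_dim)
        h, _ = self.blstm(mix)                    # (batch, time, 2 * hidden)
        return [head(h) for head in self.heads]   # S streams, same shape as mix


def fixed_assignment_mse(estimates, references):
    """Arch#1 objective (Eq. (6)): average MSE with a fixed reference order."""
    return sum(torch.mean((est - ref) ** 2)
               for est, ref in zip(estimates, references)) / len(estimates)
```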

B. Feature Separation with Permutation Invariant Training

The architecture depicted in Figure 1(a) is easy to implement but has obvious drawbacks. Since the model has multiple output layer segments (one for each stream) that all depend on the same input mixture, assigning the references is actually difficult. The fixed reference order used in this architecture is not quite right, since the source speech streams are symmetric and there is no clear clue on how to order them in advance. This is referred to as the label ambiguity (or label permutation) problem in [19], [23], [24]. As a result, this architecture may work well in the speaker-dependent setup, where the target speaker is known (and thus can be assigned to a specific output segment) during training, but it cannot generalize well to the speaker-independent case.

The label ambiguity problem in multi-talker mixed speech recognition was addressed with limited success in [23], where Weng et al. assigned reference features depending on the energy level of each speech source. In the architecture illustrated in Figure 1(b), named Arch#2, permutation invariant training (PIT) [19], [20] is utilized to estimate the individual feature streams. In this architecture, the reference feature sources are given as a set instead of an ordered list. The output-reference assignment is determined dynamically based on the current model. More specifically, we first compute the MSE for each possible assignment between the references X_s and the estimated sources \hat{X}_s, and pick the one with the minimum MSE. In other words, the training criterion is

    J = \frac{1}{S} \min_{s' \in permu(S)} \sum_{s=1}^{S} \| X_{s'(s)} - \hat{X}_s \|^2    (7)

where permu(S) is the set of all permutations of 1, ..., S. We note two important ingredients in this objective function. First, it automatically finds the appropriate assignment no matter how the labels are ordered. Second, the MSE is computed over the whole sequence for each assignment. This forces all the frames of the same speaker to be aligned with the same output segment, which can be regarded as performing feature-level tracing implicitly.

With this new objective function, we can simultaneously perform label assignment and error evaluation at the feature level. It is expected that the feature streams separated with PIT (Figure 1(b)) have higher quality than those separated with the fixed reference order (Figure 1(a)). As a result, the recognition errors on these feature streams should also be lower. Note that the computational cost associated with the permutation is negligible compared to the network forward computation during training, and no permutation (and thus no extra cost) is needed during evaluation.

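A minimal sketch of this utterance-level PIT criterion is shown below, assuming PyTorch. The same helper covers both the feature-level PIT-MSE of Equation (7) and the PIT-CE objective introduced in the next subsection (Equation (14)), since only the per-stream loss changes; names and shapes are illustrative.

```python
from itertools import permutations

import torch
import torch.nn.functional as F

def pit_loss(outputs, targets, stream_loss):
    """Utterance-level permutation invariant loss (Eqs. (7) and (14)).

    outputs, targets: lists of S tensors, each covering the whole utterance.
    stream_loss(out, tgt): loss averaged over all frames of one stream.
    Returns the minimum average loss over all output-target assignments,
    together with the winning permutation."""
    S = len(outputs)
    best_loss, best_perm = None, None
    for perm in permutations(range(S)):
        loss = sum(stream_loss(outputs[s], targets[perm[s]]) for s in range(S)) / S
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

# PIT-MSE over separated feature streams (Eq. (7)):
#   loss, _ = pit_loss(est_feats, ref_feats, lambda o, t: torch.mean((o - t) ** 2))
# PIT-CE over senone outputs (Eq. (14)), with logits and frame-level label ids:
#   loss, _ = pit_loss(logits, labels, lambda o, t: F.cross_entropy(
#       o.reshape(-1, o.shape[-1]), t.reshape(-1)))
```
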
C. Direct Multi-Talker Mixed Speech Recognition with PIT

In the previous two architectures, the mixed speech features are first separated explicitly and then recognized independently with a conventional single-talker LVCSR system. Since the feature separation is not perfect, there is a mismatch between the separated features and the normal features used to train the conventional LVCSR system. In addition, the objective of minimizing the MSE between the estimated and reference features is not directly related to the recognition performance.

In this section, we propose an end-to-end architecture that directly recognizes the mixed speech of multiple speakers. In this architecture, denoted as Arch#3, we apply PIT to the CE between the reference and estimated senone posterior probability distributions, as shown in Figure 2(a). Given some feature representation Y of the mixed speech y, this model computes

    H_0 = Y                                                   (8)
    H_i^f = RNN_i^f(H_{i-1}),  i = 1, ..., N                  (9)
    H_i^b = RNN_i^b(H_{i-1}),  i = 1, ..., N                  (10)
    H_i = Stack(H_i^f, H_i^b),  i = 1, ..., N                 (11)
    H_o^s = Linear(H_N),  s = 1, ..., S                       (12)
    O_s = Softmax(H_o^s),  s = 1, ..., S                      (13)

using a deep bidirectional RNN, where Equations (8)-(11) are similar to Equations (1)-(4). H_o^s, s = 1, ..., S is the excitation at the output layer for speech stream s, and O_s, s = 1, ..., S is the output segment for stream s. Different from the architectures discussed in the previous sections, in this architecture each output segment represents the estimated senone posterior probabilities of one speech stream. No additional feature separation, clustering or speaker tracing is needed. Although various neural network structures can be used, in this study we focus on bidirectional LSTM-RNNs.

In this direct multi-talker mixed speech recognition architecture, we minimize the objective function

    J = \frac{1}{S} \min_{s' \in permu(S)} \sum_{s=1}^{S} CE(l_{s'(s)}, O_s)    (14)

In other words, we minimize the minimum average CE over every possible output-label assignment. All the frames of the same speaker are forced to be aligned with the same output segment by computing the CE over the whole sequence for each assignment. This strategy allows for direct multi-talker mixed speech recognition without explicit separation. It is a simpler and more compact architecture for multi-talker speech recognition.

D. Joint Optimization of PIT-based Feature Separation and Recognition

As mentioned above, the main drawback of the feature separation architectures is the mismatch between the distorted separation result and the features used to train the single-talker LVCSR system. The direct multi-talker mixed speech recognition with PIT, which bypasses the feature separation step, is one solution to this problem. Here we propose another architecture, named joint optimization of PIT-based feature separation and recognition, which is denoted as Arch#4 and shown in Figure 2(b).

This architecture contains two PIT components: the front-end feature separation module with PIT-MSE and the back-end recognition module with PIT-CE. Different from the architecture in Figure 1(b), in this architecture a new LVCSR system is trained upon the output of the feature separation module with PIT-CE. The whole model is trained progressively: the front-end feature separation module is first optimized with PIT-MSE; then the parameters in the back-end recognition module are optimized with PIT-CE while keeping the parameters in the feature separation module fixed; finally, the parameters in both modules are jointly refined with PIT-CE using a small learning rate. Note that the reference assignment in the recognition (PIT-CE) step is the same as that in the separation (PIT-MSE) step.

    J_1 = \frac{1}{S} \min_{s' \in permu(S)} \sum_{s=1}^{S} \| X_{s'(s)} - \hat{X}_s \|^2    (15)

    J_2 = \frac{1}{S} \min_{s' \in permu(S)} \sum_{s=1}^{S} CE(l_{s'(s)}, O_s)               (16)

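The sketch below outlines this progressive schedule, assuming PyTorch and reusing the `pit_loss` helper from the earlier sketch. Learning rates, epoch counts, the batch format and the choice of applying a shared back-end network to each separated stream independently are assumptions of this sketch, not details taken from the paper; following the note above, the output-label assignment for the CE term reuses the permutation picked by PIT-MSE on the separated features.

```python
import torch
import torch.nn.functional as F

mse = lambda o, t: torch.mean((o - t) ** 2)

def pit_ce_shared_perm(sep_feats, back_end, ref_feats, labels):
    """PIT-CE for Arch#4, reusing the assignment chosen by PIT-MSE."""
    _, perm = pit_loss(sep_feats, ref_feats, mse)   # helper from the earlier sketch
    logits = [back_end(f) for f in sep_feats]       # shared back-end per stream (assumption)
    return sum(F.cross_entropy(logits[s].reshape(-1, logits[s].shape[-1]),
                               labels[perm[s]].reshape(-1))
               for s in range(len(logits))) / len(logits)

def train_arch4(front_end, back_end, batches, epochs=1):
    """Progressive schedule: (1) front-end with PIT-MSE, (2) back-end with
    PIT-CE while the front-end is frozen, (3) joint refinement with a small
    learning rate. Optimizer, rates and epoch counts are placeholders."""
    opt = torch.optim.SGD(front_end.parameters(), lr=1e-3)
    for _ in range(epochs):                          # Stage 1: PIT-MSE
        for mix, ref_feats, labels in batches:
            loss, _ = pit_loss(front_end(mix), ref_feats, mse)
            opt.zero_grad(); loss.backward(); opt.step()

    opt = torch.optim.SGD(back_end.parameters(), lr=1e-3)
    for _ in range(epochs):                          # Stage 2: PIT-CE, front-end fixed
        for mix, ref_feats, labels in batches:
            with torch.no_grad():
                sep_feats = front_end(mix)
            loss = pit_ce_shared_perm(sep_feats, back_end, ref_feats, labels)
            opt.zero_grad(); loss.backward(); opt.step()

    params = list(front_end.parameters()) + list(back_end.parameters())
    opt = torch.optim.SGD(params, lr=1e-4)
    for _ in range(epochs):                          # Stage 3: joint fine-tuning
        for mix, ref_feats, labels in batches:
            loss = pit_ce_shared_perm(front_end(mix), back_end, ref_feats, labels)
            opt.zero_grad(); loss.backward(); opt.step()
```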

Fig. 2: Advanced architectures for multi-talker mixed speech recognition. (a) Arch#3: direct multi-talker mixed speech recognition with PIT. (b) Arch#4: joint optimization of PIT-based feature separation and recognition.

During decoding, the mixed speech features are fed into this architecture, and the final posterior streams are used for decoding as normal.

IV. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed architectures, we conducted a series of experiments on an artificially generated two- and three-talker mixed speech dataset based on the AMI corpus [35]. There are four reasons for us to use AMI: 1) AMI is a speaker-independent spontaneous LVCSR corpus. Compared to the small vocabulary, speaker-dependent, read English datasets used in most of the previous studies [22], [23], [27], [28], observations made and conclusions drawn on AMI are more likely to generalize to other real-world scenarios. 2) AMI is a genuinely hard task with different kinds of noises, truly spontaneous meeting-style speech, and strong accents. It reflects the true ability of LVCSR when the training set size is around 100 hours. The state-of-the-art word error rate (WER) on AMI is around 25.0% for the close-talk condition [36] and more than 45.0% for the far-field condition with a single microphone [36], [37]. These WERs are much higher than those on other corpora, such as Switchboard [38], on which the WER is now below 10.0% [18], [36], [39], [40]. 3) Although the close-talk data (AMI IHM) was used to generate the mixed speech in this work, the existence of parallel far-field data (AMI SDM/MDM) allows us to evaluate our architectures on far-field data in the future. 4) AMI is a public corpus, and using AMI allows interested readers to reproduce our results more easily.

The AMI IHM (close-talk) dataset contains about 80 hours and 8 hours of speech in the training and evaluation sets, respectively [35], [41]. Using AMI IHM, we generated a two-talker (IHM-2mix) and a three-talker (IHM-3mix) mixed speech dataset. To artificially synthesize IHM-2mix, we randomly select two speakers and then randomly select an utterance for each speaker to form a mixed-speech utterance. For easier explanation, the high-energy (High E) speaker in the mixed speech is always chosen as the target speaker and the low-energy (Low E) speaker is considered the interference speaker. We synthesized mixed speech for five different SNR conditions (i.e., 0 dB, 5 dB, 10 dB, 15 dB, 20 dB) based on the energy ratio of the two talkers. To eliminate easy cases, we force the lengths of the selected source utterances to be comparable so that at least half of the mixed speech contains overlapping speech. When the two source utterances have different lengths, the shorter one is padded with small noise at the front and end. The same procedure is used for preparing both the training and testing data. We generated in total 400 hours of two-talker mixed speech, 80 hours per SNR condition, as the training set. A subset of 80 hours of speech from this 400-hour training set was used for fast model training and evaluation. For evaluation, a total of 40 hours of two-talker mixed speech, 8 hours per SNR condition, was generated and used.

The IHM-3mix dataset was generated similarly. The relative energy of the three speakers in each mixed utterance varies randomly in the training set. Different from the training set, all the speakers in the same mixed utterance have equal energy in the testing set. We generated in total 400 hours and 8 hours of three-talker mixed speech as the training and testing sets, respectively.

Figure 3 compares the spectrogram of a single-talker clean utterance and the corresponding 0 dB two-talker mixed utterance in the IHM-2mix dataset.

Fig. 3: Spectrogram comparison between the original single-talker clean speech and the 0 dB two-talker mixed speech in the IHM-2mix dataset.

Clearly, it is very hard to separate the spectrogram and reconstruct the source utterances by visual inspection alone.
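For reference, below is a minimal NumPy sketch of this style of two-talker synthesis: the shorter waveform is padded with low-level noise and the interfering source is rescaled so that the talker energy ratio matches the target SNR. The function name, padding level, and random front/end split are illustrative assumptions, not the actual mixing script used for IHM-2mix.

```python
import numpy as np

def mix_pair(x1, x2, snr_db, pad_scale=1e-4, seed=0):
    """Create a two-talker mixture at a target energy ratio (illustrative sketch).

    x1 is treated as the high-energy (target) source and x2 as the interference;
    the shorter waveform is padded with low-level noise at the front and end."""
    rng = np.random.default_rng(seed)
    n = max(len(x1), len(x2))

    def pad(x):
        extra = n - len(x)
        front = rng.integers(0, extra + 1) if extra > 0 else 0
        return np.concatenate([pad_scale * rng.standard_normal(front), x,
                               pad_scale * rng.standard_normal(extra - front)])

    x1, x2 = pad(x1), pad(x2)
    # Scale the interfering source so that 10 * log10(E1 / E2) equals snr_db.
    e1, e2 = np.sum(x1 ** 2), np.sum(x2 ** 2)
    x2 = x2 * np.sqrt(e1 / (e2 * 10.0 ** (snr_db / 10.0)))
    return x1 + x2, x1, x2    # mixture plus the padded / rescaled references
```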

A. Single-speaker Recognition Baseline

In this work, all the neural networks were built using the latest Microsoft Cognitive Toolkit (CNTK) [42], and the decoding systems were built based on Kaldi [43]. We first followed the officially released Kaldi recipe to build an LDA-MLLT-SAT GMM-HMM model. This model uses 39-dimensional MFCC features and has roughly 4K tied states and 80K Gaussians. We then used this acoustic model to generate the senone alignments for neural network training. We trained the DNN and BLSTM-RNN baseline systems with the original AMI IHM data. 80-dimensional log filter bank (LFBK) features with CMVN were used to train the baselines. The DNN has 6 hidden layers, each of which contains 2048 sigmoid neurons. The input feature for the DNN contains a window of 11 frames. The BLSTM-RNN has 3 bidirectional LSTM layers, which are followed by the softmax layer. Each BLSTM layer has 512 memory cells. The input to the BLSTM-RNN is a single acoustic frame. All the models explored here are optimized with the cross-entropy criterion. The DNN is optimized using SGD with a minibatch size of 256, and the BLSTM-RNN is trained using SGD with 4 full-length utterances in each minibatch. For decoding, we used a 50K-word dictionary and a trigram language model interpolated from the ones created using the AMI transcripts and the Fisher English corpus.

The performance of these two baselines on the original single-speaker AMI corpus is presented in Table I. These results are comparable with those reported by others [41], even though we did not use adapted fMLLR features. It is noted that adding more BLSTM layers did not show meaningful WER reductions in the baseline.

TABLE I: WER (%) of the baseline systems on the original AMI IHM single-talker corpus

  Model    WER
  DNN      28.0
  BLSTM    26.6

To test the normal single-speaker model on the two-talker mixed speech, the above baseline BLSTM-RNN model is used to decode the mixed speech directly. During scoring, we compare the decoding output (only one output) with the reference of each source utterance to obtain the WER for the corresponding source utterance. Table II summarizes the recognition results. It is clear from the table that the single-speaker model performs very poorly on the multi-talker mixed speech, as indicated by the huge WER degradation of the high-energy speaker when the SNR decreases. Furthermore, in all conditions, the WERs for the low-energy speaker are above 100.0%. These results demonstrate the great challenge in multi-talker mixed speech recognition.

TABLE II: WER (%) of the baseline BLSTM-RNN single-speaker system on the IHM-2mix dataset

  SNR Condition   High E Spk   Low E Spk
  0 dB            85.0         100.5
  5 dB            68.8         110.2
  10 dB           51.9         114.9
  15 dB           39.3         117.6
  20 dB           32.1         118.7

B. Evaluation of Two-talker Speech Recognition Architectures

The four proposed architectures for two-talker speech recognition are evaluated here. For the first two approaches (Arch#1 and Arch#2), which contain an explicit feature separation stage (without and with PIT-MSE, respectively), a 3-layer BLSTM is used in the feature separation module. The separated feature streams are fed into a normal 3-layer BLSTM LVCSR system, trained with single-talker speech, for decoding. The whole system thus contains six BLSTM layers in total. For the other two approaches (Arch#3 and Arch#4), in which PIT-CE is used, 6-layer BLSTM models are used so that the number of parameters is comparable to the other two architectures. In all these architectures the input is the 40-dimensional LFBK feature and each layer contains 768 memory cells.

To train the latter two architectures, which exploit PIT-CE, we need to prepare the alignments for the mixed speech. The senone alignments for the two talkers in each mixed speech utterance come from the single-speaker baseline alignment. The alignment of the shorter utterance within the mixed speech is padded with the silence state at the front and the end. All the models were trained with a minibatch of 8 utterances. The gradient was clipped to 0.0003 to guarantee training stability. To obtain the results reported in this section we used the 80-hour mixed speech training subset.

The recognition results for both speakers are evaluated. For scoring, we evaluate the two hypotheses, obtained from the two output sections, against the two references, and pick the assignment with the better WER to compute the final WER. The results for the 0 dB SNR condition are shown in Table III. Compared to the 0 dB condition in Table II, all the proposed multi-talker speech recognition architectures obtain obvious improvements for both speakers. Within the two architectures with the explicit feature separation stage, the architecture with PIT-MSE is significantly better than the baseline feature separation architecture. These results confirm that the label permutation problem can be well alleviated by PIT-MSE at the feature level. We can also observe that applying PIT-CE in the recognition module (Arch#3 and Arch#4) further reduces the WER by about 10.0% absolute. This is because these two architectures significantly reduce the mismatch between the separated features and the features used to train the LVCSR model. It is also because cross entropy is more directly related to recognition accuracy. Comparing Arch#3 and Arch#4, we can see that the architecture with joint optimization of PIT-based feature separation and recognition slightly outperforms the direct PIT-CE based model. Since Arch#3 and Arch#4 achieve comparable results, and the model architecture and training process of Arch#3 are much simpler than those of Arch#4, our further evaluations reported in the following sections are based on Arch#3. For clarity, Arch#3 is called the direct PIT-CE-ASR model from now on.

TABLE III: WER (%) of the proposed multi-talker mixed speech recognition architectures on the IHM-2mix dataset under the 0 dB SNR condition (using the 80-hour training subset). Arch#1-#4 are the architectures described in Sections III.A-D, respectively

  Arch   Front-end           Back-end         High E WER   Low E WER
  #1     Feat-Sep-baseline   Single-Spk-ASR   72.58        79.61
  #2     Feat-Sep-PIT-MSE    Single-Spk-ASR   68.88        75.62
  #3     -                   PIT-CE           59.72        66.96
  #4     Feat-Sep-PIT-MSE    PIT-CE           58.68        66.25
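The plain-Python sketch below illustrates the scoring rule used above for the two-talker case: every hypothesis-to-reference assignment is evaluated and the one with the fewest errors is kept. The helper names are illustrative, and WER is computed here from a standard word-level edit distance rather than the actual scoring tool.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            cur[j] = min(prev[j] + 1,                # deletion
                         cur[j - 1] + 1,             # insertion
                         prev[j - 1] + (r != h))     # substitution / match
        prev = cur
    return prev[-1]

def best_permutation_wer(refs, hyps):
    """Evaluate every hypothesis-to-reference assignment and report the WER
    of the assignment with the fewest errors."""
    n_ref = sum(len(r) for r in refs)
    errors = min(sum(edit_distance(r, hyps[i]) for r, i in zip(refs, perm))
                 for perm in permutations(range(len(hyps))))
    return 100.0 * errors / max(n_ref, 1)

# Example:
#   refs = ["the cat sat".split(), "dogs bark loudly".split()]
#   hyps = ["dogs bark".split(), "the cat sat down".split()]
#   best_permutation_wer(refs, hyps)   # pairs each hypothesis with its best reference
```
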
C. Evaluation of the Direct PIT-CE-ASR Model on a Larger Dataset

We evaluated the direct PIT-CE-ASR architecture on the full IHM-2mix corpus. All 400 hours of mixed data under the different SNR conditions are pooled together for training. The direct PIT-CE-ASR model is still composed of 6 BLSTM layers with 768 memory cells in each layer. All other configurations are the same as in the experiments conducted on the subset.

The results under the different SNR conditions are shown in Table IV. The direct PIT-CE-ASR model achieves significant improvements for both talkers compared to the baseline results in Table II under all SNR conditions. Compared to the results in Table III, achieved with the 80-hour training subset, we observe that an additional 10.0% absolute WER improvement for both speakers can be obtained using the large training set. We also observe that the WER increases slowly as the SNR decreases for the high-energy speaker, and that the WER improvement is very significant for the low-energy speaker across all conditions. In the 0 dB SNR scenario, the WERs of the two speakers are very close and are 45.0% lower (relative) than those achieved with the single-talker ASR system for both the high- and low-energy speakers. At 20 dB SNR, the WER of the high-energy speaker is still significantly better than the baseline and approaches the single-talker recognition result reported in Table I.

TABLE IV: WER (%) of the proposed direct PIT-CE-ASR model on the IHM-2mix dataset with the full training set

  SNR Condition   High E WER   Low E WER
  0 dB            47.77        54.89
  5 dB            39.25        59.24
  10 dB           33.83        64.14
  15 dB           30.54        71.75
  20 dB           28.75        79.88

D. Permutation Invariant Training with Alternative Deep Learning Models

We investigated the direct PIT-CE-ASR model with alternative deep learning models. The first model we evaluated is a 6-layer feed-forward DNN in which each layer contains 2048 sigmoid units. The input to the DNN is a window of 11 frames, each with a 40-dimensional LFBK feature. The results of the DNN-based PIT-CE-ASR model are reported at the top of Table V. Although it still obtains an obvious improvement over the baseline single-speaker model, the gain is much smaller, with a gap of nearly 20.0% WER in every condition compared to the BLSTM-based PIT-CE-ASR model. The difference between the DNN and BLSTM models is partially attributable to the stronger modeling power of BLSTM models and partially to the better tracing ability of RNNs.

We also compared BLSTM models with 4, 6, and 8 layers, as shown in Table V. It is observed that deeper BLSTM models perform better. This is different from the single-speaker ASR model, whose performance peaks at 4 BLSTM layers [37]. This is because the direct PIT-CE-ASR architecture needs to conduct two tasks, separation and recognition, and thus requires additional modeling power.

TABLE V: WER (%) of the direct PIT-CE-ASR model using different deep learning models on the IHM-2mix dataset

  Model      SNR Condition   High E WER   Low E WER
  6L-DNN     0 dB            72.95        80.29
             5 dB            65.42        84.44
             10 dB           55.27        86.55
             15 dB           47.12        89.21
             20 dB           40.31        92.45
  4L-BLSTM   0 dB            49.74        56.88
             5 dB            40.31        60.31
             10 dB           34.38        65.52
             15 dB           31.24        73.04
             20 dB           29.68        80.83
  6L-BLSTM   0 dB            47.77        54.89
             5 dB            39.25        59.24
             10 dB           33.83        64.14
             15 dB           30.54        71.75
             20 dB           28.75        79.88
  8L-BLSTM   0 dB            46.91        53.89
             5 dB            39.14        59.00
             10 dB           33.47        63.91
             15 dB           30.09        71.14
             20 dB           28.61        79.34

E. Analysis of Multi-Talker Speech Recognition Results

To better understand the multi-talker speech recognition results, we computed the WER separately for speech mixed from same-gender and opposite-gender speakers. The results are shown in Table VI. It is observed that same-gender mixed speech is much more difficult to recognize than opposite-gender mixed speech, and the gap is even larger when the energy ratio of the two speakers is closer to 1. It is also observed that the mixed speech of two male speakers is harder to recognize than that of two female speakers. These results suggest that effective exploitation of gender information may help to further improve the multi-talker speech recognition system. We will explore this in our future work.

TABLE VI: WER (%) comparison of the 6-layer BLSTM direct PIT-CE-ASR model on the mixed speech generated from two male speakers (M + M), two female speakers (F + F), and a male and a female speaker (M + F)

  Genders   SNR Condition   High E WER   Low E WER
  M + M     0 dB            52.18        59.32
            5 dB            42.64        61.77
            10 dB           36.10        63.94
  F + F     0 dB            49.90        57.59
            5 dB            40.02        60.92
            10 dB           32.47        65.15
  M + F     0 dB            44.89        51.72
            5 dB            37.34        57.43
            10 dB           33.22        63.86

To further understand our model, we examined the recognition results with and without the direct PIT-CE-ASR model. An example of these results on a 0 dB two-talker mixed speech utterance is shown in Figure 4 (using the single-speaker baseline system) and Figure 5 (with the direct PIT-CE-ASR model). It is clearly seen that the results are erroneous when the single-speaker baseline system is used to recognize the two-talker mixed speech. In contrast, many more words are recognized correctly with the proposed direct PIT-CE-ASR model.

Fig. 4: Decoding results of the baseline single-speaker BLSTM-RNN system on a 0 dB two-talker mixed speech sample.

Fig. 5: Decoding results of the proposed direct PIT-CE-ASR model on a 0 dB two-talker mixed speech sample.

F. Three-Talker Speech Recognition with Direct PIT-CE-ASR

In this subsection, we further extend and evaluate the proposed direct PIT-CE-ASR model on three-talker mixed speech using the IHM-3mix dataset. The three-talker direct PIT-CE-ASR model is also a 6-layer BLSTM model. The training and testing configurations are the same as those for two-talker speech recognition.

The direct PIT-CE-ASR training processes, as measured by the CE on both the two- and three-talker mixed speech training and validation sets, are illustrated in Figure 6. It is observed that the direct PIT-CE-ASR model with this specific configuration converges slowly, and that the CE improvement progresses almost identically on the training and validation sets. The training progress on three-talker mixed speech is similar to that on two-talker mixed speech, but with a clearly higher CE value. This indicates the huge challenge of recognizing speech mixed from more than two talkers. Note that in this set of experiments we used the same model configuration as that used in two-talker mixed speech recognition. Since three-talker mixed speech recognition is much harder, using deeper and wider models may help to improve performance. Due to resource limitations, we did not search for the best configuration for the task.

Fig. 6: CE values over epochs on both the IHM-2mix and IHM-3mix training and validation sets with the proposed direct PIT-CE-ASR model.

The three-talker mixed speech recognition WERs are reported in Table VII. The WERs for different gender combinations are also provided. The WERs achieved with the single-speaker model are listed in the first line of Table VII. Compared to the results on IHM-2mix, the results on IHM-3mix are significantly worse using the conventional single-speaker model. Under this extremely hard setup, the proposed direct PIT-CE-ASR architecture still demonstrates its powerful ability to separate, trace and recognize the mixed speech, and achieves a 25.0% relative WER reduction across all three speakers. Although the performance gap from two-talker to three-talker is obvious, this is still very promising for this speaker-independent three-talker LVCSR task. Not surprisingly, the mixed speech of different genders is relatively easier to recognize than that of the same gender.

TABLE VII: WER (%) comparison of the baseline single-speaker BLSTM-RNN system and the proposed direct PIT-CE-ASR model on the IHM-3mix dataset. "Different" indicates the mixed speech is from different genders, and "Same" indicates the mixed speech is from the same gender

  Genders     Model               Speaker1   Speaker2   Speaker3
  All         BLSTM-RNN           91.0       90.5       90.8
  All         direct PIT-CE-ASR   69.54      67.35      66.01
  Different   direct PIT-CE-ASR   69.36      65.84      64.80
  Same        direct PIT-CE-ASR   72.21      70.11      69.78

Moreover, we conducted another interesting experiment: we used the three-talker PIT-CE-ASR model to recognize the two-talker mixed speech. The results are shown in Table VIII. Surprisingly, the results are almost identical to those obtained using the 6-layer BLSTM based two-talker model (shown in Table IV). This demonstrates the good generalization ability of our proposed direct PIT-CE-ASR model over a variable number of mixed speakers. It suggests that a single PIT model may be able to recognize mixed speech with different numbers of speakers without knowing or estimating the number of speakers.

TABLE VIII: WER (%) of using the three-talker direct PIT-CE-ASR model to recognize the two-talker mixed IHM-2mix speech

  Model                     SNR Condition   High E WER   Low E WER
  Three-Talker PIT-CE-ASR   0 dB            46.63        54.59
                            5 dB            39.47        59.78
                            10 dB           34.50        64.55
                            15 dB           32.03        72.88
                            20 dB           30.66        81.63

V. CONCLUSION

In this paper, we proposed several architectures for recognizing multi-talker mixed speech given only a single channel of the mixed signal. Our technique is based on permutation invariant training, which was originally developed for the separation of multiple speech streams.
PIT can be applied to the front-end feature separation module to obtain better separated feature streams, or extended to the back-end recognition module to predict the separated senone posterior probabilities directly. Moreover, PIT can be implemented on both the front-end and the back-end with a joint optimization architecture. When using PIT to optimize a model, the criterion is computed over all frames in the whole utterance for each possible output-target assignment, and the one with the minimum loss is picked for parameter optimization. Thus PIT addresses the label permutation problem well and conducts speaker separation and tracing in one shot. In particular, with the proposed architecture based on the direct PIT-CE recognition model, multi-talker mixed speech recognition can be conducted directly without an explicit separation stage.

The proposed architectures were evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that the proposed architectures are very promising.

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 10 all speakers when heir energies are comparable, for wo- and hree-alker mixed speech, respecively. Anoher ineresing observaion is ha here is even no degradaion when using proposed hree-alker model o recognize he wo-alker mixed speech direcly. This suggess ha we can consruc one model o recognize speech mixed wih variable number of speakers wihou knowing or esimaing he number of speakers in he mixed speech. To our knowledge, his is he firs work on he muli-alker mixed speech recogniion on he challenging speaker-independen sponaneous LVCSR ask. ACKNOWLEDGMENT This work was suppored by he Shanghai Sailing Program No. 16YF1405300, he China NSFC projecs (No. 61573241 and No. 61603252), he Inerdisciplinary Program (14JCZ03) of Shanghai Jiao Tong Universiy in China, and he Tencen- Shanghai Jiao Tong Universiy join projec. Experimens have been carried ou on he PI supercompuer a Shanghai Jiao Tong Universiy. REFERENCES [1] D. Yu and L. Deng, Auomaic Speech Recogniion: A Deep Learning Approach, ser. Signals and Communicaion Technology. Springer London, 2014. [Online]. Available: hps://books.google.com/books?id=rubtbqaaqbaj [2] D. Yu, L. Deng, and G. E. Dahl, Roles of pre-raining and fine-uning in conex-dependen DBN-HMMs for real-world speech recogniion, in NIPS Workshop on Deep Learning and Unsupervised Feaure Learning, 2010. [3] G. E. Dahl, D. Yu, L. Deng, and A. Acero, Conex-dependen prerained deep neural neworks for large-vocabulary speech recogniion, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 20, pp. 30 42, 2012. [4] F. Seide, G. Li, and D. Yu, Conversaional speech ranscripion using conex-dependen deep neural neworks. in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2011, pp. 437 440. [5] G. Hinon, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaily, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainah e al., Deep neural neworks for acousic modeling in speech recogniion: The shared views of four research groups, IEEE Signal Processing Magazine (SPM), vol. 29, pp. 82 97, 2012. [6] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, and G. Penn, Applying convoluional neural neworks conceps o hybrid NN-HMM model for speech recogniion, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2012, pp. 4277 4280. [7] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, Convoluional neural neworks for speech recogniion, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 22, pp. 1533 1545, 2014. [8] T. N. Sainah, O. Vinyals, A. Senior, and H. Sak, Convoluional, long shor-erm memory, fully conneced deep neural neworks, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2015, pp. 4580 4584. [9] M. Bi, Y. Qian, and K. Yu, Very deep convoluional neural neworks for LVCSR, in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2015, pp. 3259 3263. [10] Y. Qian, M. Bi, T. Tan, and K. Yu, Very deep convoluional neural neworks for noise robus speech recogniion, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 24, no. 12, pp. 2263 2276, 2016. [11] Y. Qian and P. C. Woodland, Very deep convoluional neural neworks for robus speech recogniion, in IEEE Spoken Language Technology Workshop (SLT), 2016, pp. 481 488. [12] V. Mira and H. 
Franco, Time-frequency convoluional neworks for robus speech recogniion, in IEEE Workshop on Auomaic Speech Recogniion and Undersanding (ASRU), 2015, pp. 317 323. [13] V. Peddini, D. Povey, and S. Khudanpur, A ime delay neural nework archiecure for efficien modeling of long emporal conexs, in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2015, pp. 3214 3218. [14] T. Sercu, C. Puhrsch, B. Kingsbury, and Y. LeCun, Very deep mulilingual convoluional neural neworks for LVCSR, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2016, pp. 4955 4959. [15] D. Amodei, R. Anubhai, E. Baenberg, C. Case, J. Casper, B. Caanzaro, J. Chen, M. Chrzanowski, A. Coaes, G. Diamos e al., Deep speech 2: End-o-end speech recogniion in English and Mandarin, in Inernaional Conference on Machine Learning (ICML), 2016. [16] S. Zhang, H. Jiang, S. Xiong, S. Wei, and L. Dai, Compac feedforward sequenial memory neworks for large vocabulary coninuous speech recogniion, in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2016, pp. 3389 3393. [17] D. Yu, W. Xiong, J. Droppo, A. Solcke, G. Ye, J. Li, and G. Zweig, Deep convoluional neural neworks wih layer-wise conex expansion and aenion. in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2016, pp. 17 21. [18] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Selzer, A. Solcke, D. Yu, and G. Zweig, The Microsof 2016 conversaional speech recogniion sysem, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2017, pp. 5255 5259. [19] D. Yu, M. Kolbk, Z.-H. Tan, and J. Jensen, Permuaion invarian raining of deep models for speaker-independen muli-alker speech separaion, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2017, pp. 241 245. [20] M. Kolbk, D. Yu, Z.-H. Tan, and J. Jensen, Muli-alker speech separaion wih uerance-level permuaion invarian raining of deep recurren neural neworks, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), acceped, 2017. [21] Z. Ghahramani and M. I. Jordan, Facorial hidden Markov models, Machine learning (MLJ), vol. 29, no. 2-3, pp. 245 273, 1997. [22] M. Cooke, J. R. Hershey, and S. J. Rennie, Monaural speech separaion and recogniion challenge, Compuer Speech and Language (CSL), vol. 24, pp. 1 15, 2010. [23] C. Weng, D. Yu, M. L. Selzer, and J. Droppo, Deep neural neworks for single-channel muli-alker speech recogniion, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 10, pp. 1670 1679, 2015. [24] J. R. Hershey, Z. Chen, J. L. Roux, and S. Waanabe, Deep clusering: Discriminaive embeddings for segmenaion and separaion, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2016, pp. 31 35. [25] Y. Isik, J. L. Roux, Z. Chen, S. Waanabe, and J. R. Hershey, Singlechannel muli-speaker separaion using deep clusering, in Annual Conference of Inernaional Speech Communicaion Associaion (IN- TERSPEECH), 2016, pp. 545 549. [26] Z. Chen, Y. Luo, and N. Mesgarani, Deep aracor nework for singlemicrophone speaker separaion, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2017, pp. 246 250. [27] J. R. Hershey, S. J. Rennie, P. A. Olsen, and T. T. Krisjansson, Super-human muli-alker speech recogniion: A graphical modeling approach, Compuer Speech and Language (CSL), vol. 24, pp. 45 66, 2010. [28] S. J. Rennie, J. R. Hershey, and P. A. 
Olsen, Single-channel mulialker speech recogniion, IEEE Signal Processing Magazine (SPM), vol. 27, pp. 66 80, 2010.

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 11 [29] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, Deep learning for monaural speech separaion, in IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), 2014, pp. 1562 1566. [30] F. Weninger, H. Erdogan, S. Waanabe, E. Vincen, J. Roux, J. R. Hershey, and B. Schuller, Speech enhancemen wih LSTM recurren neural neworks and is applicaion o noise-robus ASR, in Inernaional Conference on Laen Variable Analysis and Signal Separaion (LVA/ICA). Springer-Verlag New York, Inc., 2015, pp. 91 99. [31] Y. Wang, A. Narayanan, and D. Wang, On raining arges for supervised speech separaion, IEEE/ACM Transacions on Audio, Speech and Language Processing (TASLP), vol. 22, pp. 1849 1858, 2014. [32] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, An experimenal sudy on speech enhancemen based on deep neural neworks, IEEE Signal Processing Leers (SPL), vol. 21, pp. 65 68, 2014. [33] P. S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, Join opimizaion of masks and deep recurren neural neworks for monaural source separaion, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 23, pp. 2136 2147, Dec 2015. [34] J. Du, Y. Tu, L. R. Dai, and C. H. Lee, A regression approach o singlechannel speech separaion via high-resoluion deep neural neworks, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 24, pp. 1424 1437, Aug 2016. [35] T. Hain, L. Burge, J. Dines, P. N. Garner, F. Grézl, A. E. Hannani, M. Huijbregs, M. Karafia, M. Lincoln, and V. Wan, Transcribing meeings wih he AMIDA sysems, IEEE/ACM Transacions on Audio, Speech, and Language Processing (TASLP), vol. 20, no. 2, pp. 486 498, 2012. [36] D. Povey, V. Peddini, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, Purely sequence-rained neural neworks for ASR based on laice-free MMI, in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2016, pp. 2751 2755. [37] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass, Highway long shor-erm memory RNNs for disan speech recogniion, IEEE Inernaional Conference on Acousics, Speech and Signal Processing (ICASSP), pp. 5755 5759, 2016. [38] J. J. Godfrey and E. Holliman, Swichboard-1 release 2, Linguisic Daa Consorium, Philadelphia, 1997. [39] T. Sercu and V. Goel, Dense predicion on sequences wih ime-dilaed convoluions for speech recogniion, arxiv preprin arxiv:1611.09288, 2016. [40] G. Saon, T. Sercu, S. Rennie, and H.-K. J. Kuo, The IBM 2016 english conversaional elephone speech recogniion sysem, in Annual Conference of Inernaional Speech Communicaion Associaion (INTERSPEECH), 2016, pp. 7 11. [41] P. Swieojanski, A. Ghoshal, and S. Renals, Hybrid acousic models for disan and mulichannel large vocabulary speech recogniion, in IEEE Workshop on Auomaic Speech Recogniion and Undersanding (ASRU), 2013, pp. 285 290. [42] D. Yu, A. Eversole, M. Selzer, K. Yao, Z. Huang, B. Guener, O. Kuchaiev, Y. Zhang, F. Seide, H. Wang e al., An inroducion o compuaional neworks and he compuaional nework oolki, Microsof Technical Repor MSR-TR-2014 112, 2014. [43] D. Povey, A. Ghoshal, G. Boulianne, L. Burge, O. Glembek, N. Goel, M. Hannemann, P. Molicek, Y. Qian, P. Schwarz e al., The kaldi speech recogniion oolki, in IEEE Workshop on Auomaic Speech Recogniion and Undersanding (ASRU), no. EPFL-CONF-192584, 2011.