DEEP LEARNING FOR MONAURAL SPEECH SEPARATION

Po-Sen Huang, Minje Kim, Mark Hasegawa-Johnson, Paris Smaragdis

Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, USA
Department of Computer Science, University of Illinois at Urbana-Champaign, USA
Adobe Research, USA
{huang146, minje, jhasegaw, paris}@illinois.edu

(This research was supported by U.S. ARL and ARO under grant number W911NF-09-1-0383.)

ABSTRACT

Monaural source separation is useful for many real-world applications, though it remains a challenging problem. In this paper, we study deep learning for monaural speech separation. We propose the joint optimization of deep learning models (deep neural networks and recurrent neural networks) with an extra masking layer, which enforces a reconstruction constraint. Moreover, we explore a discriminative training criterion for the neural networks to further enhance the separation performance. We evaluate our approaches using the TIMIT speech corpus for a monaural speech separation task. Our proposed models achieve about 3.8–4.9 dB SIR gain compared to NMF models, while maintaining better SDRs and SARs.

Index Terms— Monaural Source Separation, Time-Frequency Masking, Deep Learning

1. INTRODUCTION

Source separation of audio signals is important for several real-world applications. For example, separating noise from speech signals improves the accuracy of automatic speech recognition (ASR) [1, 2], and separating singing voices from music improves the accuracy of chord recognition [3]. Current separation results are, however, still far behind human capability. Monaural source separation is even more difficult because only a single-channel signal is available.

Recently, several approaches have been proposed to address the monaural source separation problem [4, 5, 6, 7]. The widely used non-negative matrix factorization (NMF) [4] and probabilistic latent semantic indexing (PLSI) [5, 6] factorize time-frequency spectral representations by learning non-negative reconstruction bases and weights. NMF and PLSI are linear models with non-negativity constraints; each can be viewed as a linear neural network with non-negative weights and coefficients. Moreover, NMF and PLSI usually operate directly in the spectral domain. In this paper, in order to enhance model expressibility, we study source separation based on nonlinear models, specifically deep neural networks (DNNs) and recurrent neural networks (RNNs) [8, 9, 10]. Instead of using a spectral representation for separation directly, the networks can be viewed as learning optimal hidden representations through several layers of nonlinearity, and the output layer reconstructs the spectral-domain signals from the learnt hidden representations.

In this paper, we explore the use of a DNN and of an RNN for monaural speech separation in a supervised setting. We propose the joint optimization of the network with a soft masking function, and we also explore a discriminative training objective. The proposed framework is shown in Figure 1.

[Fig. 1: Proposed framework — the mixture signal is transformed by an STFT/log-mel front end, passed through a DNN/RNN and a time-frequency masking stage, and resynthesized by an ISTFT into Source 1 and Source 2 for evaluation.]

The organization of this paper is as follows: Section 2 discusses the relation to previous work. Section 3 introduces the proposed methods, including the joint optimization of deep learning models with a soft time-frequency masking function, and a discriminative training objective. Section 4 presents the experimental settings and results on the TIMIT speech corpus, and Section 5 concludes the paper.
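To make the flow of Figure 1 concrete, the following is a minimal sketch (not the authors' implementation) of the analysis-separation-resynthesis chain using SciPy's STFT/ISTFT. The `separation_network` placeholder stands in for the trained DNN/RNN of Section 3, and the sampling rate, FFT size, and epsilon guard are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def separation_network(mag_frames):
    """Stand-in for the trained DNN/RNN of Section 3 (assumption).

    A real model returns two non-negative spectral estimates; here the
    mixture magnitude is simply split in half so that the sketch runs.
    """
    return 0.5 * mag_frames, 0.5 * mag_frames

def separate(mixture, fs=16000, n_fft=1024):
    # Analysis: complex spectrogram of the mixture (Fig. 1, "STFT").
    _, _, X = stft(mixture, fs=fs, nperseg=n_fft, noverlap=n_fft // 2)
    X = X.T                                   # (frames, frequency bins)

    # Network predictions for the two sources (Section 3.1).
    y1_hat, y2_hat = separation_network(np.abs(X))

    # Soft time-frequency mask applied to the mixture (Section 3.2).
    mask = y1_hat / (y1_hat + y2_hat + 1e-8)
    S1, S2 = mask * X, (1.0 - mask) * X

    # Resynthesis: back to the time domain (Fig. 1, "ISTFT").
    _, s1 = istft(S1.T, fs=fs, nperseg=n_fft, noverlap=n_fft // 2)
    _, s2 = istft(S2.T, fs=fs, nperseg=n_fft, noverlap=n_fft // 2)
    return s1, s2
```

Replacing `separation_network` with the jointly trained network and mask layer of Section 3 yields the proposed system.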
2. RELATION TO PREVIOUS WORK

Deep learning approaches have yielded many state-of-the-art results by representing different levels of abstraction with multiple nonlinear layers [8, 11, 12]. Recently, deep learning techniques have been applied to related tasks such as speech enhancement and ideal binary mask estimation [2, 13, 14]. A two-stage framework for predicting an ideal binary mask using deep neural networks was proposed by Narayanan and Wang [13] and by Wang and Wang [14].

In that framework, K neural networks are first trained to predict each feature dimension separately, where K is the feature dimension, and another classifier (a one-layer perceptron [13] or an SVM [14]) is then trained on the neighboring time-frequency predictions from the first stage. The approach of training one DNN per output dimension is not scalable when the output dimension is high. For example, if we want to use spectra as targets, we would have 513 dimensions for a 1024-point FFT; training such a large number of neural networks is often impractical. In addition, there is much redundancy between the neural networks of neighboring frequencies. In our approach, we propose a general framework that jointly trains all feature dimensions at the same time using a single neural network, and we also propose a method to jointly train the masking function with the network directly.

Maas et al. [2] proposed using an RNN for speech noise reduction in robust automatic speech recognition: given the noisy signal x, an RNN is trained to predict the clean speech y. In the source separation scenario, we found that directly modeling only one target source in this denoising framework is suboptimal compared to a framework that models all sources. In addition, modeling all sources allows us to use the information and constraints from the different prediction outputs to further perform masking and discriminative training.

3. PROPOSED METHODS

3.1. Architecture

We explore using a deep neural network and a recurrent neural network to learn hidden representations that are optimal for reconstructing the target spectra. Figure 2 presents an example of the proposed framework using an RNN. At time t, the training input x_t of the network is the concatenation of features (spectral or log-mel filterbank features) from the mixture within a window. The output predictions ŷ_1t and ŷ_2t of the network are the spectra of the two sources. In an RNN, the l-th hidden layer, l > 1, is calculated from the current input x_t and the hidden activation of the previous time step:

$h_l(x_t) = f\big(W_l\, h_{l-1}(x_t) + U_l\, h_l(x_{t-1}) + b_l\big)$  (1)

where W_l and U_l are weight matrices and b_l is a bias vector. For a DNN, the temporal weight matrix U_l is zero. The first hidden layer is computed as $h_1(x_t) = f(W_1 x_t + b_1)$. The function f(·) is a nonlinearity; in this work we use the rectified linear unit $f(x) = \max(0, x)$. The output layer is a linear layer:

$\hat{y}_t = W_l\, h_{l-1}(x_t) + c$  (2)

where c is a bias vector and ŷ_t is the concatenation of the two predicted sources ŷ_1t and ŷ_2t.

[Fig. 2: An example of the proposed architecture using a recurrent neural network — an input layer x_t, recurrent hidden layers h_1 and h_2 connected across time steps (h_{t-1}, h_{t+1}), and an output layer producing the two source predictions y_1t and y_2t.]
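For concreteness, here is a minimal numpy sketch of the forward pass of Eqs. (1) and (2) for a single recurrent hidden layer; the weight names and single-layer simplification are assumptions, and setting the temporal matrix to zero recovers the DNN case.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)  # f(x) = max(0, x), used in Eq. (1)

def rnn_forward(x_seq, W1, b1, U1, Wout, c):
    """Forward pass of a one-hidden-layer RNN following Eqs. (1)-(2).

    x_seq : array of shape (T, d_in), one input frame per time step.
    Returns an array of shape (T, 2 * d_out) holding the concatenated
    source predictions [y1_hat_t, y2_hat_t] for each frame.
    """
    h_prev = np.zeros(b1.shape)
    outputs = []
    for t in range(x_seq.shape[0]):
        # Eq. (1): h_t = f(W x_t + U h_{t-1} + b); U1 = 0 gives a DNN.
        h = relu(W1 @ x_seq[t] + U1 @ h_prev + b1)
        # Eq. (2): linear output layer, both sources concatenated.
        outputs.append(Wout @ h + c)
        h_prev = h
    return np.stack(outputs)
```

A second hidden layer, as used in the experiments, follows directly by applying Eq. (1) again on top of h.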
3.2. Time-Frequency Masking

Directly training the networks described above does not enforce the constraint that the sum of the predictions equals the original mixture. One way to enforce this constraint is time-frequency masking of the original mixture. Two commonly used masking functions are explored in this paper: binary (hard) and soft time-frequency masking. Given a mixture x_t, we obtain the output predictions ŷ_1t and ŷ_2t through the network. The binary time-frequency mask M_b is defined as

$M_b(f) = \begin{cases} 1 & \text{if } \hat{y}_{1t}(f) > \hat{y}_{2t}(f) \\ 0 & \text{otherwise} \end{cases}$  (3)

and the soft time-frequency mask M_s as

$M_s(f) = \dfrac{\hat{y}_{1t}(f)}{\hat{y}_{1t}(f) + \hat{y}_{2t}(f)}$  (4)

Once a time-frequency mask M (either M_b or M_s) is computed, it is applied to the spectra X_t of the mixture x_t to obtain the estimated separation spectra ŝ_1t and ŝ_2t, corresponding to sources 1 and 2:

$\hat{s}_{1t}(f) = M(f)\, X_t(f), \qquad \hat{s}_{2t}(f) = \big(1 - M(f)\big)\, X_t(f)$  (5)
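The masks of Eqs. (3)–(5) translate directly into array operations; the following sketch is illustrative (the epsilon guard against division by zero is an added assumption).

```python
import numpy as np

def binary_mask(y1_hat, y2_hat):
    # Eq. (3): 1 where the source-1 prediction dominates, 0 otherwise.
    return (y1_hat > y2_hat).astype(float)

def soft_mask(y1_hat, y2_hat, eps=1e-8):
    # Eq. (4): ratio of the source-1 prediction to the sum of both.
    return y1_hat / (y1_hat + y2_hat + eps)

def apply_mask(M, X_t):
    # Eq. (5): estimated separation spectra for the two sources.
    return M * X_t, (1.0 - M) * X_t
```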

Moreover, instead of taking the outputs from the network and computing the masking results separately, we can integrate the masking function into the neural network directly. Since the binary masking function is not smooth, we integrate the soft time-frequency masking function, adding an extra layer to the original output of the neural network:

$\tilde{y}_{1t} = \dfrac{\hat{y}_{1t}}{\hat{y}_{1t} + \hat{y}_{2t}} \odot X_t, \qquad \tilde{y}_{2t} = \dfrac{\hat{y}_{2t}}{\hat{y}_{1t} + \hat{y}_{2t}} \odot X_t$  (6)

where the operator ⊙ denotes element-wise multiplication (the Hadamard product). In this way, the constraint is integrated into the network, and the network and the masking function are optimized jointly. Note that although this extra layer is deterministic, the network weights are optimized with respect to the error between ỹ_1t, ỹ_2t and y_1t, y_2t, using back-propagation. To further smooth the predictions, we can apply the masking functions of Eqs. (3)–(5) to ỹ_1t and ỹ_2t to obtain the estimated separation spectra s̃_1t and s̃_2t. The time-domain signals are reconstructed by the inverse short-time Fourier transform (ISTFT) of the estimated spectra.

3.3. Discriminative Training

Given the output predictions ŷ_1t and ŷ_2t (or ỹ_1t and ỹ_2t) of the original sources y_1t and y_2t, we can optimize the neural network parameters by minimizing the squared error

$\|\hat{y}_{1t} - y_{1t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2$  (7)

where ‖·‖_2 is the l2 norm between two vectors. Minimizing Eq. (7) is equivalent to increasing the similarity between each prediction and its target. For a source separation problem, however, one of the goals is a high signal-to-interference ratio (SIR); that is, we do not want signals from other sources to appear in the current source prediction. We therefore propose a discriminative objective function that takes into account not only the similarity between each prediction and its target, but also the similarity between the prediction and the other source:

$\|\hat{y}_{1t} - y_{1t}\|_2^2 - \gamma\,\|\hat{y}_{1t} - y_{2t}\|_2^2 + \|\hat{y}_{2t} - y_{2t}\|_2^2 - \gamma\,\|\hat{y}_{2t} - y_{1t}\|_2^2$  (8)

where γ is a constant chosen by performance on the development set.
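A compact numpy sketch of the extra masking layer of Eq. (6) and the discriminative objective of Eq. (8) follows; the function names, epsilon term, and default γ value are illustrative assumptions, and in practice the loss would be back-propagated through the masking layer by the training framework.

```python
import numpy as np

def mask_layer(y1_hat, y2_hat, X_t, eps=1e-8):
    # Eq. (6): deterministic layer that renormalizes the two predictions
    # and multiplies element-wise (Hadamard product) with the mixture.
    denom = y1_hat + y2_hat + eps
    return (y1_hat / denom) * X_t, (y2_hat / denom) * X_t

def discriminative_loss(y1_pred, y2_pred, y1, y2, gamma=0.1):
    # Eq. (8): reward similarity to the target source and penalize
    # similarity to the competing source; gamma is tuned on a dev set.
    def sq(a, b):
        return np.sum((a - b) ** 2)
    return (sq(y1_pred, y1) - gamma * sq(y1_pred, y2)
            + sq(y2_pred, y2) - gamma * sq(y2_pred, y1))
```

Setting gamma to zero recovers the plain squared-error objective of Eq. (7).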
4. EXPERIMENTS

4.1. Setting

We evaluate the performance of the proposed approaches for monaural speech separation using the TIMIT corpus. Eight TIMIT sentences from a male and a female speaker, respectively, are used for training. Of the remaining sentences, one sentence from the male and one from the female speaker are used as the development set, and the others are used as the test set. Test sentences are added together to form mixed signals at 0 dB SNR. For neural network training, in order to increase the variety of training samples, we circularly shift (in the time domain) the signals of the male speaker and mix them with the utterances of the female speaker.

4.1.1. Features

In the experiments, we explore two different input features: spectral and log-mel filterbank features. The spectral representation is extracted using a 1024-point short-time Fourier transform (STFT) with 50% overlap. In the speech recognition literature [15], the log-mel filterbank has been found to provide better results than mel-frequency cepstral coefficients (MFCC) and log FFT bins. The 40-dimensional log-mel representation and its first- and second-order derivative features are also explored in the experiments. Empirically, we found that using a 32 ms window with a 16 ms frame shift performs best. The input frame rate is matched to that of the output spectra.

4.1.2. Metric

The source separation performance is evaluated with three quantitative measures: Source to Interference Ratio (SIR), Source to Artifacts Ratio (SAR), and Source to Distortion Ratio (SDR), according to the BSS-EVAL metrics [16]. Higher values of SDR, SAR, and SIR indicate better separation quality: the suppression of interference is reflected in SIR, the artifacts introduced by the separation process are reflected in SAR, and the overall performance is reflected in SDR.

4.2. Experimental Results

We use standard NMF with the generalized KL-divergence metric, computed with two different STFT window sizes, as our baselines. We first train sets of basis vectors W_m and W_f from the male and female training data, respectively. After solving for the coefficients H_m and H_f, the binary and soft time-frequency masking functions are applied to the predicted magnitude spectrograms. Figure 3 shows the NMF results for different numbers of basis vectors and different STFT window sizes, using binary and soft masks; the results are averaged across different random initializations.

[Fig. 3: NMF results for the two STFT window sizes and three basis vector sizes, using binary and soft time-frequency masking.]

For our proposed neural networks, we optimize the models by back-propagating the gradients with respect to the training objectives. The limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm is used to train the models from random initialization. We train the models with two hidden layers of 150 hidden units each.
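As a rough illustration of this training setup (not the authors' code), the model parameters can be flattened into a single vector and optimized with SciPy's L-BFGS interface; `loss_and_grad_fn` is an assumed callback that back-propagates Eq. (7) or Eq. (8) through the network and, for joint training, the masking layer.

```python
import numpy as np
from scipy.optimize import minimize

def train_lbfgs(params0, loss_and_grad_fn, data, max_iter=400):
    """Train a model with L-BFGS from a random initialization.

    params0          : 1-D array of randomly initialized weights.
    loss_and_grad_fn : callable(params, data) -> (loss, gradient),
                       assumed to implement back-propagation for the
                       chosen objective, Eq. (7) or Eq. (8).
    """
    result = minimize(
        fun=lambda p: loss_and_grad_fn(p, data),
        x0=params0,
        jac=True,                  # fun returns (loss, gradient)
        method="L-BFGS-B",
        options={"maxiter": max_iter},
    )
    return result.x                # trained parameter vector
```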

To understand the strengths of the models, we compare the experimental results in several respects. To examine the effect of including neighboring frames in the input, we report results with and without context frames in Figures 4 and 5, respectively; the differences between the two cases are not significant.

[Fig. 4: Neural network results with a context window of 3 (one neighboring frame concatenated on each side of the input), with binary masking (top row) and soft masking (bottom row). Models: 1. DNN+spectra, 2. RNN+spectra, 3. RNN+spectra+discrim, 4. RNN+logmel, 5. RNN+logmel+discrim, 6. RNN+spectra+joint, 7. RNN+spectra+joint+discrim, 8. RNN+logmel+joint, 9. RNN+logmel+joint+discrim, where "joint" indicates joint training of the network with the soft masking function and "discrim" indicates training with the discriminative objective.]

[Fig. 5: Neural network results without concatenating neighboring frames in the input, with binary masking (top row) and soft masking (bottom row); the same nine models as in Fig. 4.]

The top and bottom rows of Figures 4 and 5 show the results with binary and soft time-frequency masking, respectively. As with NMF (Figure 3), the binary mask makes hard decisions that enforce the separation and hence yields higher SIRs, but it also introduces artifacts that lower the SARs. The soft mask, conversely, achieves better SDRs and SARs, but lower SIRs.

In the first two columns, we compare the DNN and the RNN using spectra as features. The differences between the DNN and the RNN are small, and the differences when using other features or other training criteria are likewise insignificant; due to space limits, we report only the RNN results for the remaining configurations. Comparing columns 2, 3, 6, and 7 against columns 4, 5, 8, and 9, we contrast spectral and log-mel filterbank input features. Without joint training (columns 2, 3, 4, and 5), spectral features perform better than log-mel filterbank features; in the joint training cases (columns 6, 7, 8, and 9), log-mel filterbank features achieve better results. Comparing columns 2 and 3, columns 4 and 5, columns 6 and 7, and columns 8 and 9, we examine the effect of the discriminative training criterion, i.e., γ > 0 in Eq. (8). In most cases SIRs are improved, which matches the intent of the objective function, although the criterion also introduces some artifacts that slightly lower the SARs in some cases. Empirically, a γ value in the range of 0.05–0.2 achieves SIR improvements while maintaining SAR and SDR.
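The SIR, SAR, and SDR values compared throughout this section follow the BSS-EVAL definitions of Section 4.1.2 [16]. As an illustrative sketch (an assumption, not the toolkit used by the authors), the same quantities can be computed for a two-source separation with the mir_eval Python package:

```python
import numpy as np
from mir_eval.separation import bss_eval_sources

def evaluate_separation(ref_src1, ref_src2, est_src1, est_src2):
    """Compute BSS-EVAL metrics for a two-source separation.

    Each argument is a 1-D time-domain signal of equal length; higher
    SDR, SIR, and SAR values indicate better separation quality.
    """
    references = np.vstack([ref_src1, ref_src2])
    estimates = np.vstack([est_src1, est_src2])
    sdr, sir, sar, perm = bss_eval_sources(references, estimates)
    return sdr, sir, sar
```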
Comparing columns 2, 3, 4, and 5 with columns 6, 7, 8, and 9, we observe that jointly training the network with the masking function yields large improvements. Finally, since the standard NMF baseline is trained without concatenating neighboring features, we compare the NMF results with the results in Figure 5. Our best model achieves 3.8–4.8 dB and 3.9–4.9 dB SIR gain with binary and soft time-frequency masking, respectively, while achieving better SDRs and SARs than the NMF baseline. Sound examples and further details of this work are available online.¹

5. CONCLUSION

In this paper, we propose using deep learning models for monaural speech separation. Specifically, we propose the joint optimization of a soft masking function and deep learning models (DNNs and RNNs). With the proposed discriminative training criterion, we further improve the SIR. Overall, our proposed models achieve a 3.8–4.9 dB SIR gain compared to the NMF baseline, while maintaining better SDRs and SARs. For future work, it is important to explore longer temporal information with neural networks. Our proposed models can also be applied to many other applications, such as robust ASR.

¹ https://sites.google.com/site/deeplearningsourceseparation/

6. REFERENCES

[1] O. Vinyals, S. V. Ravuri, and D. Povey, "Revisiting recurrent neural networks for robust ASR," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4085–4088.

[2] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng, "Recurrent neural networks for noise reduction in robust ASR," in INTERSPEECH, 2012.

[3] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 57–60.

[4] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[5] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1999, pp. 50–57.

[6] P. Smaragdis, B. Raj, and M. Shashanka, "A probabilistic latent variable model for acoustic modeling," Advances in Models for Acoustic Processing, NIPS, vol. 148, 2006.

[7] R. J. Weiss, Underdetermined Source Separation Using Speaker Subspace Models, Ph.D. thesis, Columbia University, 2009.

[8] G. Hinton and R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.

[9] S. Parveen and P. Green, "Speech enhancement with missing data techniques using recurrent neural networks," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2004, vol. 1, pp. I-733.

[10] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.

[11] P.-S. Huang, L. Deng, M. Hasegawa-Johnson, and X. He, "Random features for kernel deep convex network," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

[12] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, "Learning deep structured semantic models for web search using clickthrough data," in ACM International Conference on Information and Knowledge Management (CIKM), 2013.

[13] A. Narayanan and D. Wang, "Ideal ratio mask estimation using deep neural networks for robust speech recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2013.

[14] Y. Wang and D. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.

[15] J. Li, D. Yu, J.-T. Huang, and Y. Gong, "Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM," in IEEE Spoken Language Technology Workshop (SLT). IEEE, 2012, pp. 131–136.

[16] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, July 2006.