Speech Enhancement with Convolutional-Recurrent Networks


Speech Enhancement with Convolutional-Recurrent Networks. Han Zhao (1), Shuayb Zarar (2), Ivan Tashev (2) and Chin-Hui Lee (3). Apr. 19th. (1) Machine Learning Department, Carnegie Mellon University; (2) Microsoft Research; (3) School of Electrical Engineering, Georgia Institute of Technology

Speech Enhancement Motivation. ASR system, training phase: clean speech -> black-box ASR -> text stream

Speech Enhancement Motivation. ASR system, inference phase: noisy speech -> fixed black-box ASR -> text stream

Speech Enhancement Motivation. Distribution mismatch: the noisy speech seen at inference differs from the clean speech the ASR system was trained on. Similar issues arise in rendering and perception; clean speech is preferred for playback

Speech Enhancement Motivation. Speech enhancement maps noisy speech to clean speech: noisy speech -> speech enhancement -> clean speech

Outline: Background; Data-driven Approach; Convolutional-Recurrent Network for Speech Enhancement; Conclusion

Background. Problem setup: noisy signal y = x + n, where x is the clean signal and n is the (unknown) noise. Typical assumptions on the noise: stationarity (the statistics of n are independent of time) and noise type (n has specific, known characteristics). Classic methods: spectral subtraction (Boll, 1979), minimum mean-squared error estimator (Ephraim et al., 1984), subspace approach (Ephraim et al., 1995)

Background. Classic methods are based on statistical assumptions about the noise. Pros: simple and computationally efficient; optimal under the proper assumptions; interpretable. Cons: limited to stationary noise; restricted to noise with specific characteristics
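As context for the classic methods above, a minimal spectral-subtraction sketch in NumPy (the array shapes, the noise-only leading frames, and the `floor` parameter are illustrative assumptions, not the original implementation):

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_est, floor=0.02):
    """Classic spectral subtraction (Boll, 1979): subtract a noise
    magnitude estimate from every noisy frame, flooring the result to
    avoid negative magnitudes.

    noisy_mag : (frames, bins) magnitude spectrogram of noisy speech
    noise_est : (bins,) noise magnitude estimate, e.g. averaged over
                leading frames assumed to contain noise only
    """
    enhanced = noisy_mag - noise_est[None, :]
    # Spectral floor: keep a small fraction of the noisy magnitude
    # instead of clipping to zero, which reduces "musical noise".
    return np.maximum(enhanced, floor * noisy_mag)

# Toy usage: treat the first 10 frames as noise-only
rng = np.random.default_rng(0)
noisy = np.abs(rng.normal(size=(100, 257)))
noise_est = noisy[:10].mean(axis=0)
clean_est = spectral_subtraction(noisy, noise_est)
```

The stationarity assumption shows up directly here: a single time-independent `noise_est` is subtracted from every frame, which is exactly why such methods degrade on non-stationary noise.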

Data-driven Approach. What if we could collect large datasets of paired (noisy, clean) signals?

Data-driven Approach. What if we could collect large datasets of paired signals? Given: paired signals {(y_i, x_i)}. Goal: build a function approximator f such that f(y_i) ≈ x_i. In short: a regression-based approach, usually minimizing mean-squared error

Data-driven Approach. Parametric regression using neural networks: flexible for representation learning; scales linearly with the size of the input; a natural paradigm for multi-task learning by sharing common representations. Figure from Lu et al., Interspeech 2013

Data-driven Approach. Related work on speech enhancement: recurrent network for noise reduction (Maas et al., ISCA 2012); deep denoising auto-encoder (Lu et al., Interspeech 2013); weighted denoising auto-encoder (Xia et al., Interspeech 2013); DNN with symmetric context window (Xu et al., IEEE SPL 2014); hybrid of DNN and suppression rule (Mirsamadi et al., Interspeech 2016)

Data-driven Approach. Speech enhancement pipeline: (1) apply the short-time Fourier transform (STFT) to obtain the time-frequency signal; (2) build neural networks to approximate the filter function f such that f(Y) ≈ X (the focus of this talk); (3) apply the inverse STFT (ISTFT) to reconstruct the sound wave
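The three-stage pipeline above can be sketched with SciPy's STFT/ISTFT; the enhancement step here is an identity placeholder standing in for the trained network, and the frame length is an illustrative choice:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.default_rng(0).normal(size=fs)  # 1 s of noise as a stand-in signal

# (1) Analysis: complex time-frequency representation
f, t, Y = stft(x, fs=fs, nperseg=512)

# (2) Enhancement: a trained network would map |Y| to a cleaned
# magnitude here; this placeholder keeps |Y| and reuses the noisy phase.
mag, phase = np.abs(Y), np.angle(Y)
Y_hat = mag * np.exp(1j * phase)

# (3) Synthesis: back to a waveform
_, x_hat = istft(Y_hat, fs=fs, nperseg=512)
```

With the identity placeholder the round trip reconstructs the input (up to floating-point error), which is a useful sanity check before plugging in a real model.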

Convolutional-Recurrent Networks for SE. Problem setup: given time-frequency spectrogram pairs (Y, X), where each utterance has T frames and F frequency bins

Convolutional-Recurrent Networks for SE. Observations: existing DNN-based approaches do not fully exploit the structure of speech signals. Frame-based DNN regression does not use the temporal locality of the spectrogram: use recurrent neural networks. Fully-connected DNN regression does not exploit the continuity of consecutive frequency bins in the spectrogram: use convolutional neural networks

Convolutional-Recurrent Networks for SE. Proposed: convolution + bi-LSTM + linear regression. Objective: mean-squared error between the predicted and clean spectrograms

Convolutional-Recurrent Networks for SE. Proposed: convolution + bi-LSTM + linear regression. At a high level, why will this model work? Continuity of the signal in both time and frequency domains; convolution kernels act as linear filters that match local patterns; the bi-LSTM provides a symmetric context window with adaptive window size; end-to-end learning without additional assumptions on the noise type

Convolutional-Recurrent Networks for SE. Convolution: a zero-padded spectrogram of size (t, f), convolved with a kernel of size (b, w), yields a feature map of size (t, f')

Convolutional-Recurrent Networks for SE. Concatenation of feature maps: k feature maps, each of size (t, f'), are concatenated along the frequency axis into one feature map of size (t, k*f')
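The convolution and feature-map concatenation steps can be sketched in NumPy/SciPy. The kernel count and sizes are illustrative, and "same" zero padding is assumed so both axes are preserved (the talk's exact padding scheme may differ):

```python
import numpy as np
from scipy.signal import convolve2d

t, f = 100, 256  # frames x frequency bins
spec = np.random.default_rng(0).normal(size=(t, f))

# k kernels of size (b, w); "same" zero padding keeps the axes, so
# every time frame still has a full feature vector.
k, b, w = 8, 5, 5
kernels = [np.random.default_rng(i + 1).normal(size=(b, w)) for i in range(k)]
feature_maps = [convolve2d(spec, kern, mode="same") for kern in kernels]

# Concatenate the k maps along frequency into a single (t, k*f) map
# that a bi-LSTM can consume frame by frame.
merged = np.concatenate(feature_maps, axis=1)
```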

Convolutional-Recurrent Networks for SE. Bi-directional LSTM. State transition function of the LSTM cell: i_t = σ(W_i x_t + U_i h_{t-1} + b_i); f_t = σ(W_f x_t + U_f h_{t-1} + b_f); o_t = σ(W_o x_t + U_o h_{t-1} + b_o); g_t = tanh(W_g x_t + U_g h_{t-1} + b_g); c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t; h_t = o_t ⊙ tanh(c_t)
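A minimal NumPy sketch of the standard LSTM state transition (the stacked weight layout and toy dimensions are assumptions for illustration, not the talk's configuration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One LSTM state transition. Gate pre-activations are stacked as
    [input gate, forget gate, cell candidate, output gate] inside
    W (4H x D), U (4H x H) and b (4H,)."""
    H = h.size
    z = W @ x + U @ h + b
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    g = np.tanh(z[2 * H:3 * H])  # candidate cell update
    o = sigmoid(z[3 * H:])       # output gate
    c_new = f * c + i * g        # new cell state
    h_new = o * np.tanh(c_new)   # new hidden (output) state
    return h_new, c_new

# Toy forward pass; a bi-directional LSTM runs one pass forward and one
# backward over the frames and concatenates the two hidden states per frame.
rng = np.random.default_rng(0)
D, H, T = 16, 8, 5
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for _ in range(T):
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
```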

Convolutional-Recurrent Networks for SE. Linear regression with projection: at each time step t, the prediction is W h_t + b, where h_t is the output state of the bi-LSTM at time step t. Objective function: MSE between predicted and clean spectrograms. Optimization algorithm: AdaDelta
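A sketch of the AdaDelta update rule (Zeiler, 2012) named as the optimizer above, applied to a toy one-parameter MSE objective; the hyperparameters rho and eps are the usual defaults, not values reported in the talk:

```python
import numpy as np

def adadelta_update(grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta step: a per-parameter step size computed from running
    averages of squared gradients (Eg2) and squared updates (Edx2)."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return dx, Eg2, Edx2

# Toy usage: minimize the one-parameter MSE objective 0.5 * (w - 3)^2
w, Eg2, Edx2 = 0.0, 0.0, 0.0
for _ in range(2000):
    g = w - 3.0  # gradient of the objective at w
    dx, Eg2, Edx2 = adadelta_update(g, Eg2, Edx2)
    w += dx
```

AdaDelta requires no hand-tuned global learning rate, which is convenient when the same training recipe is applied across several model families, as in the comparisons below.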

Experiments. Dataset: single-channel, Microsoft-internal data; Cortana utterances (male, female and children); sampling rate 16 kHz; storage format 24-bit precision; each utterance 5 to 9 seconds. Noise: subset of the MS noise collection, 377 files with 25 types; 48 room impulse responses from the MS RIR collection. Number of utterances: training 7,500; validation 1,500; test (seen noise) 1,500; test (unseen noise) 1,500

Experiments. Evaluation metrics: signal-to-noise ratio (SNR, dB); log-spectral distance (LSD); mean-squared error in the time domain (MSE); word error rate (WER); perceptual evaluation of speech quality, ITU-T P.862 (PESQ)
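The SNR and LSD metrics can be sketched as follows. The formula conventions (frame-averaged RMS log-spectrum difference for LSD, energy-ratio SNR against a clean reference) are common choices and may differ in detail from the exact definitions used in the talk:

```python
import numpy as np

def snr_db(clean, estimate):
    """Signal-to-noise ratio in dB of an estimate against a clean reference."""
    noise = clean - estimate
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def log_spectral_distance(ref_mag, est_mag, eps=1e-10):
    """Log-spectral distance in dB between two magnitude spectrograms of
    shape (frames, bins): per-frame RMS of the log-spectrum difference,
    averaged over frames (eps guards the logarithm)."""
    diff = 20.0 * np.log10((ref_mag + eps) / (est_mag + eps))
    return np.mean(np.sqrt(np.mean(diff ** 2, axis=1)))
```

For example, adding a constant 0.1 offset to a unit-amplitude signal of 100 samples gives clean energy 100 against noise energy 1, i.e. an SNR of 20 dB.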

Experiments. Comparison with state-of-the-art methods: classic noise suppressor; DNN-Symmetric (Xu et al. 2015), a multilayer perceptron with 3 hidden layers (2048x3) and an 11-frame context window; DNN-Causal (Tashev et al. 2016), a multilayer perceptron with 3 hidden layers (2048x3) and a 7-frame causal window; Deep-RNN (Maas et al. 2012), recurrent autoencoders with 3 hidden layers (500x3) and a 3-frame context window. All models are trained using AdaDelta

Experiments. Comparison with state-of-the-art methods (seen noise):

            SNR    LSD    MSE      WER    PESQ
Noisy data  15.18  23.07  0.04399  15.40  2.26
Classic NS  18.82  22.24  0.03985  14.77  2.40
DNN-s       44.51  19.89  0.03436  55.38  2.20
DNN-c       40.70  20.09  0.03485  54.92  2.17
RNN         41.08  17.49  0.03533  44.93  2.19
Ours        49.79  15.17  0.03399  14.64  2.86
Clean data  57.31   1.01  0.00000   2.19  4.48

Experiments. Comparison with state-of-the-art methods (unseen noise):

            SNR    LSD    MSE      WER    PESQ
Noisy data  14.78  23.76  0.04786  18.40  2.09
Classic NS  19.73  22.82  0.04201  15.54  2.26
DNN-s       40.47  21.07  0.03741  54.77  2.16
DNN-c       38.70  21.38  0.03718  54.13  2.13
RNN         44.60  18.81  0.03665  52.05  2.06
Ours        39.70  17.06  0.04721  16.71  2.73
Clean data  58.35   1.15  0.00000   1.83  4.48

Experiments. Case studies (audio demos): a noisy MS-Cortana utterance and its clean reference, compared against enhanced versions from the DNN, the RNN, and our model

Conclusion. Convolutions help capture local patterns; recurrence helps model sequential structure. Our model improves SNR by 35 dB and PESQ by 0.6; with a fixed ASR system, it improves WER by 1%; and it generalizes well to unseen noise

Thanks!