Multistream recognition of speech

Multistream recognition of speech. Hynek Hermansky, Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, USA, and FIT VUT Brno, Czech Republic

Maxwell's demon: HIGH ENTROPY to LOW ENTROPY. The demon closes the door when a slow air molecule comes and lets the fast air molecules go through. The demon must KNOW which molecule is fast and which is slow! Knowledge comes from
- magic
- measurements
When decreasing entropy, one should use knowledge!

Speech signal: > 50 kb/s (C = W log2(S/N + 1), W = 5 kHz, S/N + 1 > 10^3)
Machine message: who is speaking, emotions, accent, acoustic environment, ...
Linguistic message: < 50 b/s (< 3 bits/phoneme, < 15 phonemes/s)
Information rate (entropy) reduction requires knowing what to leave out and how.
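
The rates on this slide can be checked directly with the Shannon capacity formula; a minimal sketch, using only the bandwidth and S/N values given above:

```python
import math

def channel_capacity(bandwidth_hz, snr_linear):
    """Shannon capacity C = W * log2(S/N + 1) in bits per second."""
    return bandwidth_hz * math.log2(snr_linear + 1)

# Acoustic channel from the slide: W = 5 kHz, S/N + 1 around 10^3
acoustic_bps = channel_capacity(5000, 999)   # just under 50 kb/s
# Linguistic message: < 3 bits/phoneme at < 15 phonemes/s
linguistic_bps = 3 * 15                      # < 50 b/s
print(f"acoustic: {acoustic_bps:.0f} b/s, linguistic: {linguistic_bps} b/s")
```

The three-orders-of-magnitude gap between the two rates is exactly what the knowledge-driven entropy reduction has to bridge.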

KNOWLEDGE comes from
- magic
- experts, beliefs, previous experience (hardwired)
- measurements (data)
HARDWIRED: reusable permanent knowledge, no need to re-learn known facts, but experts and beliefs can be wrong.
DATA: no knowledge is better than wrong knowledge, and data do not lie, but transcribed data are expensive.
REUSABLE AND HARDWIRABLE KNOWLEDGE FROM DATA!

Acoustic processing in ASR: signal, features, probability estimator (classifier). The features (signal processing) encode what we already know (general knowledge) and alleviate unwanted information; wanted information that is left out is gone forever. The classifier (machine learning) handles what we do not yet know (task-specific knowledge) and is typically stochastic (trained on data); unwanted information that is kept requires more complex classifiers, trained on more data.

Data-driven approaches dominate the ASR field. Artificial neural networks are discriminative nonlinear classifiers, introduced to ASR in the late eighties of the 20th century, with fewer restrictions on the form of the input features. Current hardware advances allow for new revolutionary approaches to ASR: BIG DATA, deep neural net, information.

BIG DATA, deep neural net, information. Deep neural net: hierarchical convolutional long-short-term-memory highway-connected attention-based bi-directional-gated pyramidal temporal-classifying recurrent DNN. New DNN structures and their parameters offer new opportunities to verify existing knowledge and to learn new things. Data-derived knowledge should be hardwired into future designs!

Deep neural network based ASR from the raw speech signal (Tüske, Golik, Schlüter and Ney 2015): convolutions with the input speech signal yield a power-spectrum-like representation, and the remaining fully connected hidden layers of the deep neural network produce posterior probabilities of generalized tied triphones.

Data-driven two-stage acoustic processing of the raw speech signal, yielding the spectrum and time-frequency cortical-like filters (Golik, Tüske, Schlüter and Ney 2015): convolutions with the input speech signal produce a power spectrum, convolutions with the time trajectories of the power spectra produce a time-frequency speech representation, and the remaining fully connected hidden layers of the deep neural network produce posterior probabilities of generalized tied triphones.

[Figure: the cochlea, labeled from APEX to BASE]

[Figure: some examples of mammalian auditory cortical receptive fields (Patil et al. 2012); axes: frequency [kHz] vs. time [ms]]

Spectral (simultaneous) masking: the signal is represented as spectral energies in critical bands, from low to high frequencies. Detection of a signal in one critical band is not influenced by a signal in another critical band (Fletcher 1933).

Any change in the vocal tract shape (a constriction anywhere between the lips and the glottis, e.g. by the tongue) is reflected at ALL FREQUENCIES of the speech spectrum! [Figure: formant frequencies F1-F4 as a function of the distance of the constriction from the lips, 0-16 cm]

Articulatory bands (French and Steinberg 1949): 250-375-505-654-795-995-1130-1315-1515-1720-1930-2140-2355-2600-2900-3255-3680-4200-4860-5720-7000 Hz, i.e. 20 frequency bands in the speech spectral region, each contributing about equally to human speech recognition. Any 10 bands are sufficient for 70% correct recognition of nonsense syllables, which corresponds to better than 95% correct recognition of meaningful sentences [Fletcher and Steinberg 1929].

2^7 - 1 = 127 streams: form all non-empty combinations of seven band-limited streams (sub-bands 1-7 of the signal, each processed by an MLP), evaluate the word error for the 127 different stream combinations in hierarchical MLP structures, and find the reliable streams (Hermansky et al. 1996).
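
The 127 combinations are simply the non-empty subsets of the seven sub-bands; a quick Python check of the count:

```python
from itertools import combinations

SUB_BANDS = range(1, 8)  # seven band-limited streams

# All non-empty combinations of the seven sub-band streams, each of which
# would be scored by its own hierarchical MLP structure.
stream_combinations = [
    combo
    for r in range(1, len(SUB_BANDS) + 1)
    for combo in combinations(SUB_BANDS, r)
]
print(len(stream_combinations))  # 2**7 - 1 = 127
```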

Human recognition strategy (and eventually also the machines'): divide et impera. The signal spectrum S(frequency) is processed in parallel band-limited streams (processing 1-6). Colored noise can be seen as close to white noise in the individual bands, and corrupted frequency bands can be left out from further processing.

Speech is converted to an auditory spectrum and split into five spectral streams (1-3, 4-6, 7-11, 12-15, 16-19 Bark) feeding DNN1-DNN5, whose outputs are combined by a fusing DNN followed by the search for the word string.
Word error rates of the DNN recognizer on Aurora noisy data (relative change in brackets):
  auditory spectrum: 12.6
  spectral streams:  11.0 (-12.8%)
Sri Harish Mallidi, JHU PhD thesis, in preparation

Some of the streams may carry garbage, so train the fusing DNN on inputs which carry no information: during training, randomly set some stream outputs to all-zero, using a random mask over DNN1-DNN5 (e.g. [1] [1] [0] [1] [1]) against the target. This is similar to feature dropping, but here whole organized sets of features representing streams are dropped at any given time.
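
A minimal NumPy sketch of this stream-level dropout; the drop probability is an illustrative assumption, and real training would draw a fresh mask for every minibatch:

```python
import numpy as np

def drop_whole_streams(stream_outputs, p_drop=0.2, rng=None):
    """Zero out entire stream outputs at random: one Bernoulli draw per
    stream (not per feature), so the fusing net learns to cope with
    missing streams rather than missing individual features."""
    rng = rng or np.random.default_rng()
    keep = rng.random(len(stream_outputs)) >= p_drop
    masked = [out if k else np.zeros_like(out)
              for out, k in zip(stream_outputs, keep)]
    return masked, keep

# Five stream outputs (stand-ins for the posteriors from DNN1..DNN5)
streams = [np.ones(8) for _ in range(5)]
masked, keep = drop_whole_streams(streams, p_drop=0.5,
                                  rng=np.random.default_rng(0))
```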

Speech is converted to an auditory spectrum and split into five spectral streams (1-3, 4-6, 7-11, 12-15, 16-19 Bark) feeding DNN1-DNN5, combined by a fusing DNN trained with band dropouts, followed by the search for the word string.
Word error rates of the DNN recognizer on Aurora noisy data (relative change in brackets):
  auditory spectrum: 12.6
  spectral streams:  11.0 (-12.8%)
  stream dropping:    9.9 (-10.1%)
Sri Harish Mallidi, JHU PhD thesis, in preparation

Performance monitoring: knowing when the probability estimate is in error would allow for selecting the best performing stream combination. The architecture adds a performance monitoring module that drives stream selection at the input of the fusing DNN (trained with band dropouts), before the search for the word string. Performance monitoring requires estimating the performance of a classifier without knowing what the correct result is.

[Figures: a good posteriogram, derived from speech data similar to the training data, and a bad posteriogram, derived from corrupted speech data]

How clean is a posteriogram?
M(Δt) = (1 / (N - Δt)) Σ_{i=0}^{N-Δt} D(p_i, p_{i+Δt})
where Δt is the time delay and D(·,·) is the symmetric KL divergence.
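
The measure can be sketched in a few lines of NumPy; the epsilon smoothing inside `symmetric_kl` is an implementation choice, not from the slide:

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL divergence D(p, q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def posteriogram_cleanliness(posteriors, dt):
    """M(dt): average symmetric KL divergence between posterior vectors
    dt frames apart.  Crisp, changing posteriors score high; mushy,
    near-uniform posteriors from corrupted speech score near zero."""
    n = len(posteriors)
    return sum(symmetric_kl(posteriors[i], posteriors[i + dt])
               for i in range(n - dt)) / (n - dt)
```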

[Figure (from Bernd T. Meyer): quality of the speech signal from a microphone array, with a speaker and a noise source at different angles and a performance monitoring module selecting among the beams]

How similar is the estimator's performance on its training data and in the test? (Mesgarani et al. 2011). A DNN auto-encoder is trained on the output of the estimator when applied to its own training data. Training of the probability estimator: training data yield posteriors p, with targets from labels. Training of the performance monitor: training data yield p, and the monitor's reconstruction p' is trained to minimize (p - p')^2. Performance monitor in use: test data yield p_test and p'_test, and (p_test - p'_test)^2 is evaluated.
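
A minimal sketch of this idea, substituting a linear auto-encoder (PCA via scikit-learn) for the DNN auto-encoder; the posterior generator and the class split are invented purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

def crisp_posteriors(classes, n, dim=10, seed=0):
    """Near-one-hot posterior vectors peaked on the given classes."""
    rng = np.random.default_rng(seed)
    out = np.full((n, dim), 0.1 / (dim - 1))
    out[np.arange(n), rng.choice(classes, size=n)] = 0.9
    return out

# Monitor trained on the estimator's outputs for its own training data.
train_p = crisp_posteriors(classes=[0, 1, 2], n=500)
monitor = PCA(n_components=2).fit(train_p)

def reconstruction_error(p):
    """Mean (p - p')^2; large values flag test data unlike the training data."""
    p_hat = monitor.inverse_transform(monitor.transform(p))
    return float(np.mean((p - p_hat) ** 2))

matched = crisp_posteriors(classes=[0, 1, 2], n=100, seed=1)      # like training
mismatched = crisp_posteriors(classes=[6, 7, 8, 9], n=100, seed=1)  # unlike it
```

Posteriors resembling the training data reconstruct almost perfectly, while posteriors peaked on unseen classes produce a large reconstruction error, which is the signal the stream-selection logic needs.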

Speech is converted to an auditory spectrum and split into five spectral streams (1-3, 4-6, 7-11, 12-15, 16-19 Bark) feeding DNN1-DNN5, combined by a fusing DNN trained with band dropouts, with performance monitoring driving stream selection before the search for the word string.
Word error rates of the DNN recognizer on Aurora noisy data (relative change in brackets):
  auditory spectrum:       12.6
  spectral streams:        11.0 (-12.8%)
  stream dropping:          9.9 (-10.1%)
  performance monitoring:   9.6 (-2.8%)
Sri Harish Mallidi, JHU PhD thesis, in preparation

Picking the stream combination which yields the lowest word error rate (cheating), in the same architecture of five Bark-band streams, DNN1-DNN5 and a fusing DNN trained with band dropouts.
Word error rates of the DNN recognizer on Aurora noisy data (relative change in brackets):
  auditory spectrum:       12.6
  spectral streams:        11.0 (-12.8%)
  stream dropping:          9.9 (-10.1%)
  performance monitoring:   9.6 (-2.8%)
  oracle band selection:    7.9 (-18.0%)
Sri Harish Mallidi, JHU PhD thesis, in preparation

Multiple parallel noise-specific streams: speech is processed by streams trained on clean, car, crowd, ship1 and ship2 conditions, and the performance monitor picks the best stream.
Phoneme error rates on noisy TIMIT (train / test):
                        clean  car   crowd  ship1  ship2
  multi-style           23.0   24.9  39.4   42.0   43.0
  matched               20.7   22.8  37.0   38.1   37.6
  oracle (cheating)     18.4   20.5  34.7   34.5   31.8
  multi-stream with
  performance monitor   20.9   22.9  36.8   36.6   36.8
Mallidi et al., ASRU 2015

Many ways of seeing the signal. [Figure: auditory pathway from APEX to BASE; the number of neurons goes from 100 M to 100 K and the speed of firing from 10 Hz to 1 kHz]

Concept of multi-stream recognition: the SIGNAL feeds the forming of different streams (modalities, frequency bands, spectral and temporal resolutions, levels of prior knowledge); fusion, guided by performance monitoring and stream selection, yields the EXTRACTED INFORMATION.

THANKS: Sri Harish Mallidi, Nima Mesgarani, Tetsuji Ogawa, Samuel Thomas, Feipeng Li, Ehsan Variani, Vijay Peddinti, Bernd T Meyer, Phani Nidadavolu

Regarding the database:
- The training set consists of 14 hours of multi-condition data, sampled at 16 kHz: 7137 utterances from 83 speakers in total.
- Half of the utterances were recorded by the primary Sennheiser microphone and the other half using one of a number of different secondary microphones.
- Both halves include a combination of clean speech and speech corrupted by one of six different noises (street traffic, train station, car, babble, restaurant, airport) at 10-20 dB signal-to-noise ratio.
- The test set consists of 14 conditions, with 330 utterances for each condition: a clean set recorded with the primary Sennheiser microphone, a clean set with a secondary microphone, 6 additive noise conditions (airport, babble, car, restaurant, street and train noise) at 5-15 dB signal-to-noise ratio (SNR), and 6 conditions with the combination of additive and channel noise.
Regarding the features:
- From the signal, extract 63 Mel filterbank energies.
- At a given frame, take an 11-frame context (-5, +5).
- In each subband, project the 11-frame context onto 6 DCT bases.
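
The feature steps above can be sketched in NumPy/SciPy, assuming the log Mel filterbank energies are already computed; the function name and the edge-padding choice are mine, not from the slides:

```python
import numpy as np
from scipy.fft import dct

def subband_dct_features(fbank, context=5, n_dct=6):
    """63 Mel filterbank energies per frame -> per-frame features:
    for each band, take the 11-frame (-5..+5) time trajectory and keep
    its first 6 DCT coefficients.
    fbank: (n_frames, 63) -> (n_frames, 63 * 6)."""
    n_frames, n_bands = fbank.shape
    padded = np.pad(fbank, ((context, context), (0, 0)), mode="edge")
    feats = np.empty((n_frames, n_bands * n_dct))
    for t in range(n_frames):
        window = padded[t:t + 2 * context + 1]              # (11, n_bands)
        coeffs = dct(window, axis=0, norm="ortho")[:n_dct]  # (6, n_bands)
        feats[t] = coeffs.T.ravel()                         # band-major layout
    return feats

feats = subband_dct_features(np.random.rand(20, 63))
print(feats.shape)  # (20, 378)
```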