Artificial Neural Nets for Deriving Speech Features


Autoregressive model of the Hilbert envelope of the signal: the signal is decomposed into an AM component (temporal envelope) and an FM component (carrier), and a channel vocoder is built from either the AM or the FM components.

Conventional artificial neural net: up to 100 ms of stacked frames of short-term features, covering all available frequency components, feed a multilayer perceptron that outputs (transformed) posterior probabilities of speech sounds over time.

DEEP: Some Hierarchical Nets
- Serial hierarchy (with Joel Pinto): a first net turns ~90 ms of PLP features into posteriors; a second net, spanning ~230 ms of those posteriors, produces better posteriors.
- Serio-parallel hierarchy (with Fabio Valente): PLP features, high-modulation MRASTA components, and low-modulation MRASTA components (spanning up to 1000 ms) are classified in parallel, and their posteriors are merged into better posteriors.

Artificial Neural Nets: the signal goes through pre-processing (an auditory-like spectrogram over frequency bands) into a neural network estimating posteriors of speech sounds, yielding a posteriogram.

TANDEM: the same pre-processing and network are used, but instead of the posteriors (e.g., the posterior of phoneme /ae/), the pre-softmax outputs are taken and decorrelated by principal component projection before being handed to the HMM. Histograms of one typical element of the feature vectors and correlation matrices of the whole feature vectors illustrate the effect of each stage (e.g., the 4th principal component for /ae/).
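The TANDEM step described above — decorrelating a net's pre-softmax outputs with a principal component projection before handing them to an HMM — can be sketched as follows. This is a minimal numpy illustration, not the actual system: the random "pre-softmax outputs", the frame count, and the choice of 24 retained dimensions are all assumptions made for the example.

```python
import numpy as np

# Hypothetical pre-softmax outputs of a phoneme-classifying MLP:
# one row per 10 ms frame, one column per speech sound.
rng = np.random.default_rng(0)
pre_softmax = rng.normal(size=(500, 40))  # 500 frames, 40 sound classes

# TANDEM: decorrelate the outputs with PCA so that the
# diagonal-covariance Gaussians of a conventional HMM fit them well.
mean = pre_softmax.mean(axis=0)
centered = pre_softmax - mean

# Principal components from the covariance of the whole feature vectors.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # strongest components first
projection = eigvecs[:, order[:24]]      # keep, say, 24 dimensions

tandem_features = centered @ projection  # features handed to the HMM
print(tandem_features.shape)
```

After the projection, the covariance of the features is (numerically) diagonal, which is exactly what a diagonal-covariance Gaussian mixture in the HMM assumes.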

Unknown Unknowns

(Figure contrasting "red man knowledge" with "white man knowledge".) The problem is not what you do not know; the problem is what you do not know that you do not know.

Machine Learning: create models of the world
1. from labeled (annotated) training data
2. from prior knowledge of what is possible and how likely it is
Find the model that best accounts for the observed data.
Assumption: the future is the same as the past — both the training and the test data are independently and identically distributed samples from the same probability distribution.
Unexpected events are hard to deal with because they are
1. not seen in training
2. assigned low (zero) prior probability
Successfully surviving natural systems attend well to the unexpected.

Power of Priors (Language Model) — example recognizer output: "although some sort of the computer can either way hopefully cin-cin o-bi computer connected with"

Unexpected Noise

(Figure: along the auditory pathway, slower, more complex events engage more neurons — roughly 100,000 spiking neurons at 1 ms inter-spike intervals for faster, simpler events; 1,000,000 at 10 ms; and 10,000,000 at 100 ms for slower, complex events.)

Deep: many layers. Long: a cortical event every 100 ms or so. Wide: many possible descriptions of an event in auditory cortex.

Deep, Long, and Wide Neural Nets

DEEP: Information in the signal should be extracted in stages, from a description of signal features to a description of phonetic events.
LONG: Information about the underlying speech sounds is spread in time over more than 200 ms.
WIDE: There are many ways to form parallel processing streams using different signal projections and different prior assumptions. Not all processing streams are always corrupted, and we need ways to find the uncorrupted ones.

Information in speech is coded hierarchically (deep), in temporal dynamics (long), and in many redundant dimensions (wide). In the architecture, N streams each extract information from up to 1000 ms of the time-frequency plane through many processing layers; smart fusion then yields (transformed) posterior probabilities of the speech sounds in the center of the window.

Longer is Better

(Figure: phonetic classifier accuracy as a function of the time span of the analysis keeps rising out to about 400 ms; Fanty, Cole, Roginski, NIPS 1992.)

LONG: Classifying TempoRAl Patterns (TRAP) of spectral energies, with Sangita Sharma, Pratibha Jain, Honza Cernocky, Pavel Matejka, Petr Schwartz. A conventional classifier sees ~100 ms of the whole spectrum at once; TRAP instead classifies ~1000 ms temporal patterns of individual frequency bands with per-band classifiers, whose outputs feed a merging classifier. Each temporal pattern contains most of the coarticulation span (> 200 ms) of the speech sound in its center.
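The TRAP data flow — one long temporal trajectory per spectral band, a classifier per band, and a merger on top — can be sketched as below. This is a toy illustration under loud assumptions: the spectrogram is random, and the per-band "MLPs" and the merging step are stand-ins (a fixed random linear map with softmax, and log-domain averaging) chosen only to show the shapes and the flow, not the trained networks of the actual system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical auditory spectrogram: 10 ms frames x 15 critical bands.
n_frames, n_bands = 1000, 15
spectrogram = rng.normal(size=(n_frames, n_bands))

half_span = 50   # 50 frames each side -> ~1 s temporal pattern
n_phonemes = 40

def band_classifier(pattern, band):
    """Stand-in for a per-band MLP: maps a ~1 s temporal trajectory of
    one spectral band to phoneme posteriors (fixed random linear map
    plus softmax, just to make the data flow concrete)."""
    w = np.random.default_rng(band).normal(size=(pattern.size, n_phonemes))
    logits = pattern @ w
    e = np.exp(logits - logits.max())
    return e / e.sum()

def trap_posteriors(t):
    """TRAP: classify each band's temporal pattern separately, then
    merge the per-band evidence (log-domain averaging here, standing
    in for the merging classifier)."""
    window = spectrogram[t - half_span:t + half_span + 1]  # (101, 15)
    per_band = np.stack([band_classifier(window[:, b], b)
                         for b in range(n_bands)])
    merged = np.exp(np.log(per_band + 1e-12).mean(axis=0))
    return merged / merged.sum()

p = trap_posteriors(500)
print(p.shape)
```

The key design point survives the toy stand-ins: each band classifier sees a narrow slice of frequency but a very long slice of time, so a corrupted band degrades only its own stream.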

WIDE: Multi-stream Processing — fusion of streams of different carrier frequencies. Information in speech is coded in many redundant dimensions, and not all dimensions get corrupted at the same time. The signal feeds parallel information-providing streams, each carrying different redundant dimensions of a given target; fusion of their outputs yields the decision. This requires a strategy for comparing the streams and a strategy for selecting the reliable ones.
- Stream formation: different perceptual modalities; different processing channels within each modality; bottom-up and top-down dominated channels.
- Comparing the streams: various correlation (distance) measures.
- Selecting reliable streams: still an open question.

Early Attempts at Multi-Stream Recognition, with Sangita Sharma and Misha Pavel: the signal is split into seven sub-bands, all nonempty combinations of the band-limited streams are formed, and reliable streams are found using SNR in the sub-bands, classifier confidence, majority vote, or supervised adaptation.

Monitoring Performance (Fletcher et al., Boothroyd and Nittrouer, Allen): for independent bands the error probabilities multiply, P(ε) = Π_i P(ε_i). With two detectors of hit probabilities P_1 and P_2, a miss requires both to miss, so P_miss = (1 − P_1)(1 − P_2); for a real observer, where both false positives and false negatives are possible, the observed P_miss still tracks (1 − P_1)(1 − P_2). Do listeners know when they know? How do we make a machine know when it knows? Compare its behavior in test with its behavior on training data, and modify accordingly.
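The product-of-errors rule quoted on this slide has a striking consequence worth making numeric: because per-band error probabilities multiply, a single clean band rescues the whole ensemble. A minimal illustration (the error values are made up):

```python
# Fletcher-style product-of-errors rule over independent frequency bands:
# P(error) = prod_i P(error_i), so one reliable band dominates.

def product_of_errors(band_errors):
    """Overall error probability for independent band-limited streams."""
    p = 1.0
    for e in band_errors:
        p *= e
    return p

# Two badly corrupted bands and one fairly clean one (hypothetical numbers):
# each band alone errs often, but jointly the error rate is only 3 %.
print(product_of_errors([0.5, 0.6, 0.1]))  # -> 0.03
```

This is exactly the motivation for the multi-stream work: the system should be no worse than its best uncorrupted stream, provided it can find that stream.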

Finding Reliable Streams: those which yield the best performance on the test data. A classifier can never work better than it does on the data on which it was trained, so compare its performance in test against its performance on the training data and choose the best stream combinations.

Evaluating Performance. How often do sound classes occur, and how often do they get confused?

AC = (1/N) Σ_{i=1}^{N} (p_i)^r ((p_i)^T)^r

where p_i is the vector of sound posteriors at the i-th time instant, N is the time interval of the evaluation, and (·)^r denotes the element-by-element r-th power (currently r = 0.1).

How much do sound classes differ, and how fast do they change?

M(Δ) = Σ_{i=0}^{N−Δ} D(p_i, p_{i+Δ}) / (N − Δ)

where Δ is the time delay and D(·,·) the symmetric KL divergence. (Figure: M(Δ) curves separate clean data from noisy data.)
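The M(Δ) measure from this slide is simple to compute from a posteriogram. A minimal sketch, with assumptions flagged: the two posteriograms are synthetic Dirichlet samples meant only to mimic a confidently varying stream versus a stream stuck near uniform posteriors (as on noise); the actual measure runs on real network outputs.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def m_measure(posteriors, delta):
    """M(delta): mean symmetric KL divergence between posterior vectors
    delta frames apart. Large values mean the outputs are peaky and
    changing -- behavior typical of a stream working on clean speech."""
    n = len(posteriors)
    d = [sym_kl(posteriors[i], posteriors[i + delta])
         for i in range(n - delta)]
    return sum(d) / (n - delta)

# Hypothetical posteriograms: a "confident" stream that jumps between
# sounds vs. one stuck near the uniform distribution (as on noise).
rng = np.random.default_rng(2)
confident = rng.dirichlet(np.full(10, 0.1), size=200)   # peaky posteriors
confused = rng.dirichlet(np.full(10, 50.0), size=200)   # near-uniform
print(m_measure(confident, 10) > m_measure(confused, 10))  # -> True
```

In use, M(Δ) is computed per stream on test data and compared with its value on training data, without needing transcriptions of the test speech.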

Multi-stream speech recognition: the speech signal passes through a filterbank into sub-bands 1–5, each with its own ANN; all nonempty sub-band combinations form 31 processing streams; a performance monitor selects the N best streams, whose posteriors are averaged and passed to the Viterbi decoder to produce the phone sequence.

Phoneme recognition error rates:

environment                                           | conventional | proposed | best by hand
clean (matched training and test)                     | 31 %         | 28 %     | 25 %
TIMIT with car noise at 0 dB SNR (training on clean)  | 54 %         | 38 %     | 35 %
RATS data (channel E, matched training and test)      | 70 %         | 57 %     | 49 %

Towards Increasing Error Rates
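The selection-and-fusion step of the recognizer above can be sketched compactly. This is a toy under stated assumptions: the stream posteriograms are random Dirichlet samples, and the performance monitor is replaced by a simple mean-confidence score (the real monitor compares test-time behavior against behavior on training data, as in the M-measure slide).

```python
import numpy as np

rng = np.random.default_rng(3)
n_frames, n_classes = 100, 10

# Hypothetical per-stream posteriograms (one per sub-band combination).
streams = {name: rng.dirichlet(np.ones(n_classes), size=n_frames)
           for name in ["s1", "s2", "s3", "s4", "s5"]}

def monitor_score(posteriors):
    """Stand-in performance monitor: mean per-frame confidence
    (max posterior). Illustrative only."""
    return float(posteriors.max(axis=1).mean())

def fuse_best(streams, n_best=3):
    """Rank streams by the monitor and average the N best posteriograms;
    the fused posteriogram then goes to the Viterbi decoder."""
    ranked = sorted(streams, key=lambda s: monitor_score(streams[s]),
                    reverse=True)
    chosen = ranked[:n_best]
    fused = np.mean([streams[s] for s in chosen], axis=0)
    return chosen, fused

chosen, fused = fuse_best(streams)
print(len(chosen), fused.shape)
```

Averaging posterior distributions keeps the result a valid distribution per frame, so the decoder's interface is unchanged no matter which streams were selected.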

Signal processing, information theory, machine learning. (Diagram: signal → signal processing → pattern classification → decoder → message.) Why rock the boat? We have a good thing going. (Figure: error rates.)

Difficulty (error rate) grows from the easy end — a single motivated speaker, well-articulated native speech, a quiet environment, a closed-set small vocabulary, only speech expected — to the real world: many (possibly hostile) speakers, casual conversations in realistic environments, unexpected words together with other sounds; repetition, fillers, hesitations, interruptions, unfinished and non-grammatical sentences, new words, dialects, emotions, ... This is where current DARPA and IARPA programs, the research agenda of the JHU CoE HLT, and industrial efforts (Google, Microsoft, IBM, Amazon, ...) operate. It calls for signal processing, information theory, and machine learning together with neural information processing, psychophysics, physiology, cognitive science, phonetics and linguistics, ... — Engineering and Life Sciences together!

How to Get There?

Fred Jelinek: speech recognition as a problem of maximum likelihood decoding — information and communication theory, machine learning, large data, ...
Roman Jakobson: "We speak, in order to be heard, in order to be understood" — human communication, speech production and perception, neuroscience, cognitive science, ...
Gordon Moore: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year" — tools.
John Pierce: "... devise clear, simple, definitive experiments. So a science of speech can grow, certain step by certain step." However, also John Pierce: speech recognition is so far (1969) a field of "mad inventors or untrustworthy engineers", because the machine needs "intelligence and knowledge of language comparable to those of a native speaker. ... Should people continue work towards speech recognition by machine? Perhaps it is for people in the field to decide."

Why Am I Working on Machine Recognition of Speech?

"Why did I climb Mt. Everest? Because it is there!" — Sir Edmund Hillary

Spoken language is one of the most amazing accomplishments of the human race. Implement "intelligence and knowledge of language comparable to those of a native speaker"!

Don't Follow Leaders, Watch the Parking Meters