Learning words from sights and sounds: a computational model. Deb K. Roy, and Alex P. Pentland Presented by Xiaoxu Wang.


Introduction
Infants understand their surroundings by using a combination of evolved innate structures and powerful learning abilities. Roy and Pentland developed a computational model called Cross-channel Early Lexical Learning (CELL), which acquires words from multimodal sensory input and learns by statistically modeling the structure of that input.

Infant-directed speech
Experiments: caregiver-infant pairs were asked to engage in play centered around toy objects. The infants could not yet produce single words, and the caregivers reported varying levels of limited word comprehension.

Problems of early lexical acquisition
Three questions of early lexical acquisition:
How to discover speech segments which correspond to the words of the language?
How to learn perceptually grounded semantic categories?
How to learn to associate linguistic units with appropriate semantic categories?

Speech Segmentation
Let us do an experiment: I am going to say three sentences in Chinese. Could you tell me how many words are in the first sentence? What is the word corresponding to this object?

Speech Segmentation
The three sentences: I am holding a pencil. This is my pencil. Pencils are useful.

Background
Existing speech segmentation models may be divided into two classes:
Models based on local sound sequence patterns or statistics. One such model was trained by giving it a lexicon of valid words of the language; to segment utterances, it detects all trigrams which did not occur word-internally during training, achieving 37% word boundary detection.
Models based on minimum description length (MDL).

Spoken utterances
Spoken utterances are represented as arrays of phoneme probabilities. The acoustic input is passed through a filter called Relative Spectral-Perceptual Linear Prediction (RASTA-PLP), which is designed to attenuate non-speech components of an acoustic signal by suppressing spectral components that change either faster or slower than speech. The filtered signal is expanded using an exponential transformation, and each power band is scaled to simulate the laws of loudness perception in humans. A 12-parameter representation of the smoothed spectrum is estimated from a 20 ms window of input; the window is moved in 10 ms increments, yielding 12 RASTA-PLP coefficients at a rate of 100 Hz.
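The 20 ms window / 10 ms hop arrangement above can be sketched as a simple framing step; this is a minimal illustration of the windowing arithmetic, not the paper's feature extractor (the 16 kHz sample rate is an assumed value for the example).

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=20, hop_ms=10):
    """Slice a waveform into overlapping analysis windows.

    Mirrors the slide's setup: a 20 ms window advanced in 10 ms steps,
    so one feature vector is produced every 10 ms (a 100 Hz rate).
    """
    win = int(sample_rate * win_ms / 1000)   # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - win) // hop)
    return np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])

# One second of audio -> 99 full frames at the 100 Hz frame rate
frames = frame_signal(np.zeros(16000))
print(frames.shape)  # (99, 320)
```

Each frame would then be mapped to its 12 RASTA-PLP coefficients by the (omitted) spectral analysis stage.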

Recurrent Neural Network
A recurrent neural network analyzes the RASTA-PLP coefficients to estimate phoneme and speech/silence probabilities. The RNN has 12 input units, 176 hidden units, and 40 output units. The hidden units are fed back through a time delay and concatenated with the RASTA-PLP input coefficients; these time-delay units give the network the capacity to remember aspects of past input and combine those representations with fresh data.
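The recurrence described above (delayed hidden state concatenated with the current input) is the Elman pattern. A minimal sketch with the slide's dimensions follows; the weights here are random placeholders, since the real network was trained on labeled speech.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions from the slide: 12 RASTA-PLP inputs, 176 hidden, 40 outputs.
N_IN, N_HID, N_OUT = 12, 176, 40
W_in  = rng.normal(0, 0.1, (N_HID, N_IN + N_HID))  # sees input + delayed hidden
b_in  = np.zeros(N_HID)
W_out = rng.normal(0, 0.1, (N_OUT, N_HID))
b_out = np.zeros(N_OUT)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_rnn(features):
    """Elman-style recurrence: at each frame, the previous hidden state
    is concatenated with the current RASTA-PLP vector."""
    h = np.zeros(N_HID)
    outputs = []
    for x in features:
        h = np.tanh(W_in @ np.concatenate([x, h]) + b_in)
        outputs.append(softmax(W_out @ h + b_out))  # per-frame probabilities
    return np.array(outputs)

probs = run_rnn(rng.normal(size=(100, N_IN)))  # 1 s of input at 100 Hz
print(probs.shape)  # (100, 40)
```

Each output row is a distribution over the 40 classes, which is exactly the "phoneme probability array" the later HMM stage consumes.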

[Figure: sample output from the recurrent neural network for the utterance "Oh, you can make it bounce too!"]
[Figure: the performance of the RNN]

Speech Segmentation
The RNN outputs are treated as state emission probabilities in a Hidden Markov Model (HMM) framework. A Viterbi dynamic programming search is used to obtain the most likely phoneme sequence for a given phoneme probability array. The system thus obtains both the most likely sequence of phonemes concatenated to form the utterance and the location of each phoneme boundary in that sequence.

Any subsequence within an utterance that terminates at phoneme boundaries is used to form word hypotheses. Additionally, any word candidate is required to contain at least one vowel; this constraint prevents the model from hypothesizing consonant clusters as word candidates. A segment containing at least one vowel is referred to as a legal segment.
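The Viterbi search over per-frame emission probabilities can be sketched as follows. This is the standard algorithm, not code from CELL; the tiny two-state example at the end is purely illustrative.

```python
import numpy as np

def viterbi(emission_logprobs, trans_logprobs, init_logprobs):
    """Most likely state (phoneme) sequence given per-frame emissions.

    emission_logprobs: (T, N) log P(frame_t | state)
    trans_logprobs:    (N, N) log P(state_j | state_i)
    init_logprobs:     (N,)   log P(state at t=0)
    """
    T, N = emission_logprobs.shape
    delta = init_logprobs + emission_logprobs[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + trans_logprobs        # (N, N) path scores
        back[t] = scores.argmax(axis=0)                 # best predecessor
        delta = scores.max(axis=0) + emission_logprobs[t]
    path = [int(delta.argmax())]                        # trace back best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

em = np.log([[0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
tr = np.log([[0.8, 0.2], [0.2, 0.8]])
pi = np.log([0.5, 0.5])
print(viterbi(em, tr, pi))  # [0, 1, 1]
```

The state changes along the returned path are where the phoneme boundaries fall, which is what the word-hypothesis stage needs.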

Comparing words
It is possible to treat the phoneme sequence of each speech segment as a string and use string comparison techniques. A limitation of this method is that it relies on only the single most likely phoneme sequence, whereas the RNN output specifies the probability of every phoneme at each time instant. To make use of this additional information, the authors developed the following distance metric: two segments α_i and α_j are decoded into phoneme sequences Q_i and Q_j, from which HMMs λ_i and λ_j are generated. The metric then tests the hypothesis that λ_i generated α_j (and, symmetrically, that λ_j generated α_i). Empirically, the resulting metric was found to return small values for words which humans would judge as phonetically similar.
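One plausible form of such a cross-likelihood distance is sketched below. The `loglik(segment, hmm)` function is an assumed interface (e.g. a forward-algorithm pass); the normalization by each segment's own model is a hedged reading of the slide, not the paper's exact formula.

```python
def segment_distance(loglik, a_i, a_j, lam_i, lam_j):
    """Symmetric likelihood-ratio distance between two speech segments.

    loglik(seg, hmm) is assumed to return log P(seg | hmm). Each
    cross-likelihood is normalized by the segment's score under its own
    model, so a distance near 0 means each model explains the other
    segment about as well as its own.
    """
    d_ij = loglik(a_j, lam_j) - loglik(a_j, lam_i)  # penalty of fitting a_j with lam_i
    d_ji = loglik(a_i, lam_i) - loglik(a_i, lam_j)
    return 0.5 * (d_ij + d_ji)
```

With any sensible `loglik`, identical segment/model pairs give a distance of 0, and the distance grows as the models diverge.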

Visual Input
As with speech input, the ability to represent and compare shapes is built into CELL. Three-dimensional objects are represented using a view-based approach in which two-dimensional images of an object, captured from multiple viewpoints, collectively form a visual model of the object. Object shapes are represented in terms of histograms of features derived from the locations of object edges.

Comparing Visual Input
Using multidimensional histograms to represent object shapes allows direct comparison of object models using information-theoretic or statistical divergence functions. In practice, an effective metric for shape classification is the χ²-divergence.

[Figure: the structure of CELL]
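A common form of the χ²-divergence for normalized histograms is shown below (some references include a factor of 1/2; the unscaled sum is used here).

```python
import numpy as np

def chi2_divergence(h1, h2, eps=1e-12):
    """Chi-squared divergence between two histograms, summed over bins:
    sum (h1 - h2)^2 / (h1 + h2). eps guards against empty bins."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))

# Identical shape histograms -> 0; fully disjoint histograms -> 2
same = chi2_divergence([0.5, 0.5], [0.5, 0.5])
far  = chi2_divergence([1.0, 0.0], [0.0, 1.0])
```

Because it compares bin-by-bin, two views of the same object (similar edge statistics) score low, while different objects score high.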

Word Learning
Objective: each utterance may consist of one or more words, and each context may be an instance of many possible shape categories. Given a pool of utterance-context pairs, the learner must infer speech-to-shape mappings (lexical items) which best fit the data. A short-term memory (STM) passes pairs (prototypes) with high local recurrence on to long-term memory (LTM); for example, recurring sound patterns such as "dog", "the", and "ball". The LTM then creates lexical items by consolidating audio-visual (AV) prototypes based on a mutual information criterion. This consolidation process identifies clusters of AV-prototypes which may be merged together to model consistent intermodal patterns across multiple observations.

Mutual Information
Define binary indicator variables over prototype pairs: A = 1 iff the acoustic distance distance_A(x, y) ≤ r_A, and V = 1 iff the visual models are similarly close. The probabilities needed to compute the mutual information between A and V are estimated using relative frequencies over all n prototypes in LTM.

Mapping
The prototype yeah-dog found little support from other AV-prototypes in LTM, indicated by its low, flat mutual information surface. In contrast, in the example on the right, the word "dog" was correctly paired with a dog shape.
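Estimating mutual information from relative frequencies of the binary (A, V) outcomes amounts to counting co-occurrences, as sketched here; the input pairs are hypothetical indicator values, not data from the paper.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information I(A; V) in bits between two binary indicator
    variables, estimated from relative frequencies of observed (a, v)
    pairs, as the slide describes for the prototypes in LTM."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pv = Counter(v for _, v in pairs)
    mi = 0.0
    for (a, v), c in joint.items():
        p_av = c / n
        mi += p_av * math.log2(p_av / ((pa[a] / n) * (pv[v] / n)))
    return mi

# Acoustic and visual matches that always co-occur carry 1 bit of MI
print(mutual_information([(1, 1), (1, 1), (0, 0), (0, 0)]))  # 1.0
```

A prototype like yeah-dog would yield near-independent indicators and hence near-zero mutual information, which is why it fails the consolidation criterion.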

Evaluation measures
Lexical items obtained from the speaker data sets are evaluated by:
Segmentation accuracy and word discovery: for the target "dog", segments such as /dɔg/, /ɔg/, and /ðʌdɔg/ ("the dog") were accepted, while /dɔgɪz/ ("dog is") was rejected.
Semantic accuracy: the best choice of the meaning of a prototype is whichever context co-occurred with it.

[Figure: results]