
[Figure labels: sensors; utterance-context pair; utterance, with linguistic units 1 to M; context, with semantic categories 1 to N.]

[Figure labels: utterance, with linguistic unit prototypes; context, with semantic category prototypes; Lexicon, pairing each linguistic unit with a semantic category.]

[Figure: processing layers over time, from input to memory: input sensor signals -> feature extraction -> linguistic channels and contextual channels -> event detection -> linguistic events (L-events) and semantic events (S-events) -> co-occurrence filter -> linguistic-semantic events (LS-events) in Short Term Memory (STM) -> recurrence filter -> lexical candidates in Mid Term Memory (MTM) -> mutual information filter -> lexical items in Long Term Memory (LTM).]

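[Figure: input signals from the sensors pass through feature analyzers, which output linguistic channels and contextual channels over time.]

[Figure: a linguistic event detector extracts L-events from the linguistic channels, and a semantic event detector extracts S-events from the contextual channels, over time.]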

[Figure: event segmenter. An L-event or S-event is composed of multiple channels (channels 1 to 3 shown). The event is divided into an array along channel and time segment boundaries (segments 1 to n).]

[Figure panels: an event divided along time segments and channels; some potential subevents.]

[Figure: linguistic events (L-events) and semantic events (S-events) on a shared time axis. Co-occurring L-events and S-events are paired to form LS-events. Short term memory (STM) contains recent LS-events; old LS-events are forgotten.]

/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LS_i and LS_j {
    set L_match = FALSE and S_match = FALSE

    /* Compare each pair of L-subevents in LS_i and LS_j */
    for each L-subevent in LS_i, L_i {
        for each L-subevent in LS_j, L_j {
            if d_L(L_i, L_j) < t_L then set L_match = TRUE
        }
    }

    /* Compare each pair of S-subevents in LS_i and LS_j */
    for each S-subevent in LS_i, S_i {
        for each S-subevent in LS_j, S_j {
            if d_S(S_i, S_j) < t_S then set S_match = TRUE
        }
    }

    /* Check for matches of L-subevents and co-occurring S-subevents */
    if L_match = TRUE and S_match = TRUE then recurrent match found
}
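The recurrence check in the pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the function name `find_recurrent_pairs`, the dictionary layout with keys "L" and "S", and the scalar subevents used in the demo are all assumptions made for the example. The distance functions d_L and d_S and the thresholds t_L and t_S are passed in, since the text does not define them here.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Sequence, Tuple

# Hypothetical representation: an LS-event is a dict holding its
# L-subevents under "L" and its S-subevents under "S".
LSEvent = Dict[str, Sequence[Any]]


def find_recurrent_pairs(
    stm: Sequence[LSEvent],
    d_L: Callable[[Any, Any], float],
    d_S: Callable[[Any, Any], float],
    t_L: float,
    t_S: float,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of LS-events in STM that have both a
    matching pair of L-subevents and a matching pair of S-subevents."""
    matches = []
    for (i, ls_i), (j, ls_j) in combinations(enumerate(stm), 2):
        # Any pair of L-subevents closer than t_L counts as an L match.
        l_match = any(d_L(a, b) < t_L for a in ls_i["L"] for b in ls_j["L"])
        # Any pair of S-subevents closer than t_S counts as an S match.
        s_match = any(d_S(a, b) < t_S for a in ls_i["S"] for b in ls_j["S"])
        if l_match and s_match:
            matches.append((i, j))
    return matches


if __name__ == "__main__":
    # Toy data: subevents are plain numbers, distance is absolute difference.
    stm = [
        {"L": [0.0, 5.0], "S": [1.0]},
        {"L": [5.2], "S": [1.1]},
        {"L": [9.0], "S": [1.05]},
    ]
    dist = lambda a, b: abs(a - b)
    print(find_recurrent_pairs(stm, dist, dist, t_L=0.5, t_S=0.2))  # [(0, 1)]
```

As in the pseudocode, the L match and the S match are tested independently, so a pair of LS-events is reported as soon as each side has at least one close pair of subevents.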

[Figure: LS-events in Short Term Memory (STM). Filled regions indicate recurrent L-subevents and S-subevents. Labels: linguistic unit prototype; semantic prototype; Mid Term Memory (MTM); lexical candidate.]

[Figure labels: L-unit, shown as an L-prototype with an L-radius in the linguistic feature space; S-category, shown as an S-prototype with an S-radius in the contextual feature space.]

[Figure: scatter plots of lettered points (a to z) for small, medium and large radii. Mutual information values shown: I(S;L) = 0.013 bits, I(S;L) = 0.29 bits, I(S;L) = 0.0 bits.]
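The figure reports I(S;L) in bits. As a hedged illustration of how such a value can be computed, the sketch below estimates the mutual information between two discrete variables from paired observations, using the standard definition I(S;L) = sum over (l, s) of p(l, s) log2 [p(l, s) / (p(l) p(s))]. Treating L and S as binary indicators (for example, whether an observation falls inside a given radius) is an assumption made for the example, and the function name `mutual_information_bits` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def mutual_information_bits(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Estimate I(S;L) in bits from observed (l, s) pairs.

    Probabilities are relative frequencies; no smoothing is applied.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    count_l = Counter(l for l, _ in pairs)
    count_s = Counter(s for _, s in pairs)
    mi = 0.0
    for (l, s), c in joint.items():
        # p(l,s) / (p(l) p(s)) simplifies to c * n / (count_l * count_s).
        mi += (c / n) * math.log2(c * n / (count_l[l] * count_s[s]))
    return mi


if __name__ == "__main__":
    # Perfectly coupled indicators carry one bit of information.
    print(mutual_information_bits([(0, 0), (1, 1), (0, 0), (1, 1)]))  # 1.0
    # Independent indicators carry none.
    print(mutual_information_bits([(0, 0), (0, 1), (1, 0), (1, 1)]))  # 0.0
```

Only observed combinations enter the sum, so zero-count cells never reach the logarithm.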

Mid Term Memory (MTM) Mutual Information Filter Lexical Item

[Figure: two flows using a lexical item in LTM. First flow: sensors -> feature analysis -> L-event detection -> event segmentation -> L-event; an L-subevent matches the L-unit in a lexical item, giving the S-category of the recognized L-unit. Second flow: sensors -> feature analysis -> S-event detection -> event segmentation -> S-event; an S-subevent matches the S-category in a lexical item, giving the L-unit of the recognized S-category.]

[Figure: sensors -> feature analysis -> event detection -> event segmentation -> STM -> recurrence filter -> MTM -> mutual information filter -> LTM. Additional labels: LS-prototype hypotheses; lexical search; matched hypotheses explained away.]

[Figure: lexical items i and j, each with an L-unit and an S-category. The L-units and S-categories overlap. Matching lexical items are clustered to form a conglomerate lexical item.]

[Figure: two cases for lexical items i and j. In one, S-prototype i matches S-category j. In the other, L-prototype j matches L-unit i.]

Linguistic Units Semantic Categories

[Figure labels: goals; MTM; mutual information filter; LTM; action selection; environment; feedback; threshold adjustment; lexical item confidence adjustment.]

[Figure: the architecture with speech and vision inputs, from sensors to long term memory:
Sensors: microphone; camera.
Feature analysis: phoneme analysis; object shape analysis; object color analysis.
Linguistic channel: phoneme probabilities. Contextual channels: object shape and color.
Event detection: spoken utterance detection gives L-events (spoken utterances); object detection gives S-events (object view-sets).
Event unpacking: L-event unpacking gives L-subevents (speech segments); S-event unpacking gives S-subevents (shape / color view-sets).
Co-occurrence filter: LS-events {spoken utterance, object view-set}, held in Short Term Memory.
Recurrence filter: lexical candidates {spoken word prototype, color/shape prototype}, held in Mid Term Memory.
Mutual information filter: lexical items {spoken word model, color/shape category}, held in Long Term Memory.]

[Figure labels: CCD camera; color image; foreground segmentation; foreground bitmap; connected regions analysis; object mask; masked color image; mask-edge spatial derivative analysis; Context Channel 1: Shape; Context Channel 2: Color.]

[Figure: original RGB image with its color histogram (axes: normalized red, normalized green); object mask with its shape histogram (axes: normalized distance, relative angle).]

[Figure: color CCD camera on a robot with five degrees of freedom. DOF 1: base rotation. DOF 2: base elevation. DOF 3: neck rotation. DOF 4: neck elevation. DOF 5: object turntable rotation.]

[Figure: recurrent neural network. RASTA-PLP spectral analysis feeds 12 input units; two layers of 176 units are shown, linked by a time delay; 40 output units give the linguistic channel (phoneme probabilities).]

Phoneme labels (40 units): aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z

state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms

for each RNN output vector, l(t) {
    state 1: SILENCE
        if SIL != 1 {
            utteranceStartIndex = t
            state = 2
        } else {
            state = 1
        }
    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 {
            count_2 = 0
            state = 1
        } else if (count_2 > UTTERANCE_START_DELAY) {
            state = 3
        }
    state 3: UTTERANCE
        if SIL = 1 {
            state = 4
        } else {
            count_3 = count_3 + 1
            state = 3
        }
    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 {
            count_3 = count_3 + count_4
            count_4 = 0
            state = 3
        } else if (count_4 > UTTERANCE_END_DELAY) {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0; count_3 = 0; count_4 = 0
            state = 1
        }
}
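The four-state utterance detector above can be sketched as runnable Python. This is an illustrative version under stated assumptions: the input is a per-frame boolean that is true when silence is the detected class (how SIL is derived from the RNN output is not specified here), and the two delays are given in frames because the frame rate is not stated. The name `detect_utterances` is hypothetical, and the utterance-length counter count_3 is omitted because it does not affect the detected boundaries.

```python
from typing import Iterable, List, Tuple

# State numbers follow the pseudocode.
SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4


def detect_utterances(
    silence_flags: Iterable[bool],
    start_delay: int,
    end_delay: int,
) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) frame pairs for detected utterances.

    silence_flags[t] is True when frame t is classified as silence.
    start_delay and end_delay are counted in frames (an assumption).
    """
    state = SILENCE
    count_2 = 0  # frames spent in POSSIBLE_START
    count_4 = 0  # frames spent in POSSIBLE_END
    start = 0
    utterances = []
    for t, sil in enumerate(silence_flags):
        if state == SILENCE:
            if not sil:
                start = t
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:
                # Too short to be speech: fall back to silence.
                count_2 = 0
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
        else:  # POSSIBLE_END
            count_4 += 1
            if not sil:
                # Short pause inside the utterance: keep going.
                count_4 = 0
                state = UTTERANCE
            elif count_4 > end_delay:
                # The utterance ended on the last frame before the silence run.
                utterances.append((start, t - count_4 - 1))
                count_2 = 0
                count_4 = 0
                state = SILENCE
    return utterances


if __name__ == "__main__":
    # 3 silent frames, 7 speech frames, 10 silent frames.
    flags = [True] * 3 + [False] * 7 + [True] * 10
    print(detect_utterances(flags, start_delay=2, end_delay=3))  # [(3, 9)]
```

The two delays act as hysteresis: a burst shorter than the start delay is discarded, and a pause shorter than the end delay does not split an utterance.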

[Figure panels: (a) utterance start, null; (b) null, utterance end, silence.]

[Figure: the RNN output phoneme probabilities (one row per phoneme label, aa to z) are decoded by the Viterbi algorithm with a Hidden Markov Model into the most likely phoneme sequence, in this example / b aa l /.]

[Figure: mutual information as a function of L-radius and S-radius, shown for "yeah" and "dog".]

[Figures: histograms of normalized bin occupancy against distance between view-sets (horizontal axis 0 to 40).]

[Figures: three bar charts comparing CELL with Acoustic Recurrency (vertical axes in percent). Values shown: CELL 28%, Acoustic Recurrency 7%; CELL 72%, Acoustic Recurrency 31%; CELL 57%, Acoustic Recurrency 13%.]

[Figure labels: spoken commands; CELL; user-dependent acoustic & semantic model; task semantics.]

Scene 1: User points to three colors in the rainbow and names them (lexical acquisition).
Scene 2: User selects a part from the "Tree of Life" by pointing to the part.
Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1.
Scene 4: User must select position for new body part using gesture, confirm with speech.
Scene 5: A successfully placed part.
Scene 6: After two more cycles of Scenes 2-5 the mate is complete and Toco looks on in new-found love.