Sensors; Utterance-Context Pair. Utterance: linguistic unit 1, linguistic unit 2, ..., linguistic unit M. Context: semantic category 1, semantic category 2, ..., semantic category N.

utterance linguistic unit prototype linguistic unit prototype linguistic unit prototype context semantic category prototype semantic category prototype semantic category prototype

linguistic unit prototype linguistic unit prototype linguistic unit prototype semantic category prototype semantic category prototype semantic category prototype semantic category prototype semantic category prototype linguistic unit prototype linguistic unit prototype linguistic unit prototype semantic category prototype semantic category prototype semantic category prototype Lexicon semantic category linguistic unit semantic category linguistic unit linguistic unit prototype linguistic unit prototype

Processing stages over time, from input to lexicon: Input Sensor Signals; Feature Extraction; Linguistic Channels and Contextual Channels; Event Detection; Linguistic Events (L-events) and Semantic Events (S-events); Co-occurrence filter; Linguistic-Semantic Events (LS-events) in Short Term Memory (STM); Recurrence filter; Lexical Candidates in Mid Term Memory (MTM); Mutual Information filter; Lexical Items in Long Term Memory (LTM).

feature analyzers sensors input signals Linguistic channels time Contextual channels time

linguistic channels linguistic event detector L-events contextual channels semantic event detector S-events time time

An L-event or S-event is composed of multiple channels (channel 1, channel 2, channel 3). Event Segmenter: the event is divided into an array along channel and time segment boundaries (segment 1, segment 2, ..., segment n).

An event divided along time segments and channels Some potential subevents

Linguistic events (L-events) and semantic events (S-events) arrive over time. Co-occurring L-events and S-events are paired to form LS-events. Short term memory (STM) contains recent LS-events; old LS-events are forgotten.
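The pairing of co-occurring L-events and S-events into a bounded short term memory might be sketched as follows. This is only an illustration: the source does not define the co-occurrence test or the STM size, so the interval-overlap criterion, the `(start, end)` event representation, and the `capacity` parameter are all assumptions, and the function names are hypothetical.

```python
from collections import deque

def overlaps(a, b):
    """True when two (start, end) intervals share some time (assumed co-occurrence test)."""
    return a[0] < b[1] and b[0] < a[1]

def pair_cooccurring(l_events, s_events, capacity):
    """Pair overlapping L-events and S-events into LS-events.

    The deque keeps only the `capacity` most recent LS-events, so older
    ones are forgotten. `l_events` is assumed to be sorted by start time.
    """
    stm = deque(maxlen=capacity)
    for l_event in l_events:
        for s_event in s_events:
            if overlaps(l_event, s_event):
                stm.append((l_event, s_event))
    return stm

# Hypothetical event times in seconds.
l_events = [(0, 2), (5, 6), (10, 12)]
s_events = [(1, 4), (9, 13), (20, 21)]
stm_large = list(pair_cooccurring(l_events, s_events, capacity=2))
stm_small = list(pair_cooccurring(l_events, s_events, capacity=1))
```

With a capacity of 1, only the most recent LS-event survives, which mirrors the forgetting of old LS-events.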

/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LS_i and LS_j {
    set L_match = FALSE
    set S_match = FALSE
    /* Compare each pair of L-subevents in LS_i and LS_j */
    for each L-subevent in LS_i, L_i {
        for each L-subevent in LS_j, L_j {
            if d_L(L_i, L_j) < t_L then set L_match = TRUE
        }
    }
    /* Compare each pair of S-subevents in LS_i and LS_j */
    for each S-subevent in LS_i, S_i {
        for each S-subevent in LS_j, S_j {
            if d_S(S_i, S_j) < t_S then set S_match = TRUE
        }
    }
    /* Check for matches of L-subevents and co-occurring S-subevents */
    if L_match = TRUE and S_match = TRUE then recurrent match found
}
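A minimal runnable sketch of the recurrence search above, under stated assumptions: each LS-event is represented as a dictionary with lists of L-subevents and S-subevents, and the distance functions `d_l` and `d_s` and thresholds `t_l` and `t_s` are supplied by the caller. The scalar subevents and absolute-difference distance in the example are placeholders, not the actual feature representations.

```python
from itertools import combinations

def recurrent_matches(stm, d_l, d_s, t_l, t_s):
    """Return index pairs (i, j) of LS-events in `stm` that recur.

    A pair recurs when some L-subevent of one is within t_l of an
    L-subevent of the other AND some S-subevent pair is within t_s.
    """
    matches = []
    for (i, ls_i), (j, ls_j) in combinations(enumerate(stm), 2):
        l_match = any(d_l(a, b) < t_l for a in ls_i["L"] for b in ls_j["L"])
        s_match = any(d_s(a, b) < t_s for a in ls_i["S"] for b in ls_j["S"])
        if l_match and s_match:
            matches.append((i, j))
    return matches

def abs_diff(a, b):
    """Placeholder distance for scalar stand-in subevents."""
    return abs(a - b)

# Hypothetical STM contents: three LS-events with scalar subevents.
stm = [
    {"L": [1.0, 5.0], "S": [10.0]},
    {"L": [1.2], "S": [10.5, 30.0]},
    {"L": [9.0], "S": [10.1]},
]
pairs = recurrent_matches(stm, abs_diff, abs_diff, t_l=0.5, t_s=1.0)
```

Here only the first two LS-events match on both the linguistic and the semantic side; the third matches the first semantically but not linguistically, so it is not reported.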

LS-event Short Term Memory (STM) LS-event LS-event Filled regions indicate recurrent L-subevents and S-subevents linguistic unit prototype ( ) semantic prototype ( ) Mid Term Memory (MTM) Lexical Candidate

Linguistic feature space: L-unit = {L-prototype, L-radius}. Contextual feature space: S-category = {S-prototype, S-radius}.

Figure: labelled points (a to z) shown in panels marked small, medium, and large, with mutual information values I(S;L) = 0.013 bits, I(S;L) = 0.29 bits, and I(S;L) = 0.0 bits.

Mid Term Memory (MTM) Mutual Information Filter Lexical Item
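As an illustration of the quantity I(S;L) that the filter relies on, the sketch below estimates the mutual information between two binary indicators from co-occurrence counts. Treating L and S as binary match indicators, and estimating probabilities by simple relative frequency, are assumptions made here for the example; the function name and count arguments are hypothetical.

```python
import math

def mutual_information_bits(n11, n10, n01, n00):
    """Mutual information (in bits) of two binary variables L and S.

    n11: count with L=1 and S=1, n10: L=1 and S=0,
    n01: L=0 and S=1, n00: L=0 and S=0.
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("at least one observation is required")
    joint = {
        (1, 1): n11 / total, (1, 0): n10 / total,
        (0, 1): n01 / total, (0, 0): n00 / total,
    }
    p_l = {1: joint[(1, 1)] + joint[(1, 0)], 0: joint[(0, 1)] + joint[(0, 0)]}
    p_s = {1: joint[(1, 1)] + joint[(0, 1)], 0: joint[(1, 0)] + joint[(0, 0)]}
    mi = 0.0
    for (l, s), p in joint.items():
        if p > 0:  # zero-probability cells contribute nothing
            mi += p * math.log2(p / (p_l[l] * p_s[s]))
    return mi

independent = mutual_information_bits(25, 25, 25, 25)
dependent = mutual_information_bits(50, 0, 0, 50)
```

Independent indicators give zero bits, while perfectly coupled indicators with equal class frequencies give one bit, the maximum for a binary variable.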

LTM Sensors Feature analysis L-event detection Event segmentation L-event L-subevent matches L-unit in lexical item S-category of recognized L-unit LTM Sensors Feature analysis S-event detection Event segmentation S-event S-subevent matches S-category in lexical item L-unit of recognized S-category

Sensors Feature analysis Event detection Event segmentation STM LS-prototype hypotheses Lexical search Recurrence filter MTM Mutual information filter LTM matched hypotheses explained away

Lexical item i Lexical item j L-unit S-category L-units and S-categories overlap Matching lexical items are clustered to form a conglomerate lexical item

Lexical item i Lexical item j L-unit S-category S-prototype i matches S-category j Lexical item i Lexical item j S-category L-unit L-prototype j matches L-unit i

Linguistic Units Semantic Categories

Goals MTM mutual information filter LTM Action selection Environment threshold adjustment lexical item confidence adjustment Feedback

Long Term Memory; Lexical items: {spoken word model, color/shape category}; Mutual information filter; Lexical candidates: {spoken word prototype, color/shape prototype}; Mid Term Memory; Recurrence filter; Short Term Memory; LS-events: {spoken utterance, object view-set}; Co-occurrence filter. Linguistic side: L-subevents: speech segments; L-event unpacking; L-events: spoken utterances; Spoken utterance detection; Linguistic channel: phoneme probabilities; Phoneme analysis; Microphone. Contextual side: S-subevents: shape / color view-sets; S-event unpacking; S-events: object view-sets; Object detection; Contextual channels: object shape & color; Object shape analysis; Object color analysis; Camera.

CCD camera color image foreground segmentation Context Channel 1: Shape mask-edge spatial derivative analysis Context Channel 2: Color foreground bitmap connected regions analysis object mask masked color image

Original RGB image; color histogram (axes: normalized red, normalized green). Object mask; shape histogram (axes: normalized distance, relative angle).
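A color histogram over normalized red and normalized green could be computed roughly as below. This is a sketch under assumptions not stated in the source: pixels are `(r, g, b)` tuples, the object mask is a parallel list of booleans, normalization is `r / (r + g + b)` and `g / (r + g + b)`, and the bin count is arbitrary. The function name is hypothetical.

```python
def rg_histogram(pixels, mask, bins=8):
    """2-D histogram of normalized red and green over masked pixels.

    Returns a bins x bins list of lists whose entries sum to 1
    (or all zeros if no pixel is selected).
    """
    hist = [[0.0] * bins for _ in range(bins)]
    count = 0
    for (r, g, b), keep in zip(pixels, mask):
        total = r + g + b
        if not keep or total == 0:
            continue  # skip background and pure-black pixels
        r_norm = r / total
        g_norm = g / total
        i = min(int(r_norm * bins), bins - 1)  # clamp the value 1.0 into the last bin
        j = min(int(g_norm * bins), bins - 1)
        hist[i][j] += 1
        count += 1
    if count:
        hist = [[value / count for value in row] for row in hist]
    return hist

# Hypothetical four-pixel image; the third pixel is background.
pixels = [(255, 0, 0), (0, 255, 0), (10, 10, 10), (255, 0, 0)]
mask = [True, True, False, True]
hist = rg_histogram(pixels, mask, bins=2)
```

Dividing by the channel sum discards overall brightness, which is the usual motivation for working in normalized color coordinates.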

Color CCD Camera. DOF 1: Base rotation; DOF 2: Base elevation; DOF 3: Neck rotation; DOF 4: Neck elevation; DOF 5: Object turntable rotation.

RASTA-PLP spectral analysis; Recurrent Neural Network (diagram labels: 12 units, 176 units, 176 units, 40 units, time delay); outputs: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z; Linguistic channel: phoneme probabilities.

Phoneme labels (repeated on each of three panels): aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z

state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms
for each RNN output vector, l(t) {
    state 1: SILENCE
        if SIL != 1 { utteranceStartIndex = t; state = 2 }
        else { state = 1 }
    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 { count_2 = 0; state = 1 }
        else if (count_2 > UTTERANCE_START_DELAY) { state = 3 }
    state 3: UTTERANCE
        if SIL = 1 { state = 4 }
        else { count_3 = count_3 + 1; state = 3 }
    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 { count_3 = count_3 + count_4; count_4 = 0; state = 3 }
        else if (count_4 > UTTERANCE_END_DELAY) {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0; count_3 = 0; count_4 = 0
            state = 1
        }
}
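The four-state utterance detector above can be written as a small runnable function. This sketch makes several assumptions: the input is a list of per-frame booleans (True when the frame is classified as silence), the two delays are given in frames rather than milliseconds because the frame rate is not stated here, and detected utterances are collected in a list instead of being passed to `ProcessUtterance`. The `count_3` length counter is omitted since it does not affect the reported boundaries.

```python
SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4

def detect_utterances(silence_flags, start_delay, end_delay):
    """Return (start_index, end_index) frame pairs for detected utterances."""
    state = SILENCE
    count_2 = 0  # frames spent in POSSIBLE_START
    count_4 = 0  # frames spent in POSSIBLE_END
    start = 0
    utterances = []
    for t, sil in enumerate(silence_flags):
        if state == SILENCE:
            if not sil:
                start = t
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:  # too short to be speech: discard
                count_2 = 0
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
        else:  # POSSIBLE_END
            count_4 += 1
            if not sil:  # brief pause inside the utterance
                count_4 = 0
                state = UTTERANCE
            elif count_4 > end_delay:
                utterances.append((start, t - count_4 - 1))
                count_2 = 0
                count_4 = 0
                state = SILENCE
    return utterances

# Hypothetical frame labels: True = silence, False = speech.
flags = [True, True, False, False, False, False, True, True, True, True, True]
found = detect_utterances(flags, start_delay=1, end_delay=2)
blip = detect_utterances([True, False, True, True], start_delay=1, end_delay=2)
```

The start delay rejects short non-silent blips, and the end delay keeps brief pauses from splitting one utterance in two.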

a utterance start null b null utterance end silence

RNN output phoneme probabilities (aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z); Viterbi algorithm with Hidden Markov Model; most likely phoneme sequence: / b aa l /
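To illustrate how a Viterbi search can turn frame-level phoneme probabilities into a most likely phoneme sequence, here is a toy sketch. The three-state left-to-right model, its transition probabilities, the frame probabilities, and the direct use of per-frame probabilities as emission scores are all assumptions chosen for the example; the actual Hidden Markov Model topology and parameters are not given here.

```python
import math

NEG_INF = float("-inf")

def safe_log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF

def viterbi(frames, states, start, trans):
    """Most likely state path given per-frame state probabilities.

    frames: list of {state: probability}; start: {state: probability};
    trans: {previous_state: {next_state: probability}}.
    """
    score = {s: safe_log(start.get(s, 0)) + safe_log(frames[0].get(s, 0))
             for s in states}
    back = []
    for frame in frames[1:]:
        new_score, pointers = {}, {}
        for s in states:
            best = max(states, key=lambda p: score[p] + safe_log(trans[p].get(s, 0)))
            new_score[s] = (score[best] + safe_log(trans[best].get(s, 0))
                            + safe_log(frame.get(s, 0)))
            pointers[s] = best
        back.append(pointers)
        score = new_score
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):  # follow back-pointers to the first frame
        path.append(pointers[path[-1]])
    path.reverse()
    return path

def collapse(path):
    """Merge consecutive repeats into a phoneme sequence."""
    return [s for k, s in enumerate(path) if k == 0 or path[k - 1] != s]

states = ["b", "aa", "l"]
start = {"b": 1.0}
trans = {"b": {"b": 0.5, "aa": 0.5}, "aa": {"aa": 0.5, "l": 0.5}, "l": {"l": 1.0}}
frames = [
    {"b": 0.8, "aa": 0.1, "l": 0.1},
    {"b": 0.6, "aa": 0.3, "l": 0.1},
    {"b": 0.1, "aa": 0.8, "l": 0.1},
    {"b": 0.1, "aa": 0.7, "l": 0.2},
    {"b": 0.1, "aa": 0.2, "l": 0.7},
]
path = viterbi(frames, states, start, trans)
sequence = collapse(path)
```

On these toy frames the best path spends two frames in `b`, two in `aa`, and one in `l`, which collapses to the sequence b aa l.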

"yeah" "dog" Mutual Information Mutual Information L-radius S-radius L-radius S-radius

Figure: histogram bin occupancy (normalized) versus distance between view-sets (horizontal axis 0 to 40).

Figure (two panels): histogram bin occupancy (normalized) versus distance between view-sets (horizontal axis 0 to 40).

Bar charts comparing CELL with Acoustic Recurrency. Panel 1 (vertical axis 0 to 40): CELL 28%, Acoustic Recurrency 7%. Panel 2 (vertical axis 0 to 100): CELL 72%, Acoustic Recurrency 31%.

Bar chart comparing CELL with Acoustic Recurrency (vertical axis 0 to 80): CELL 57%, Acoustic Recurrency 13%.

Spoken commands CELL User-dependent acoustic & semantic model Task semantics

Scene 1: User points to three colors in the rainbow and names them (lexical acquisition).
Scene 2: User selects a part from the "Tree of Life" by pointing to the part.
Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1.
Scene 4: User must select position for new body part using gesture, confirm with speech.
Scene 5: A successfully placed part.
Scene 6: After two more cycles of Scenes 2-5, the mate is complete and Toco looks on in new-found love.