[Figure: Sensors produce an utterance-context pair. The utterance contains linguistic units 1, 2, ..., M; the context contains semantic categories 1, 2, ..., N.]
[Figure labels: utterance -> linguistic unit prototypes; context -> semantic category prototypes; Lexicon: pairs of (semantic category, linguistic unit).]
[Figure: Processing levels over time, bottom to top. Input sensor signals -> feature extraction -> linguistic channels and contextual channels -> event detection -> linguistic events (L-events) and semantic events (S-events) -> co-occurrence filter -> linguistic-semantic events (LS-events) in Short Term Memory (STM) -> recurrence filter -> lexical candidates in Mid Term Memory (MTM) -> mutual information filter -> lexical items in Long Term Memory (LTM).]
[Figure: Input signals from the sensors pass through feature analyzers, producing linguistic channels and contextual channels over time.]
[Figure: The linguistic event detector extracts L-events from the linguistic channels; the semantic event detector extracts S-events from the contextual channels. Both are shown against time.]
[Figure: An L-event or S-event is composed of multiple channels (channel 1, channel 2, channel 3). The event segmenter divides the event into an array along channel and time segment boundaries (segment 1, segment 2, ..., segment n).]
An event divided along time segments and channels Some potential subevents
[Figure: Linguistic events (L-events) and semantic events (S-events) on a shared time axis. Co-occurring L-events and S-events are paired to form LS-events. Short term memory (STM) contains recent LS-events; old LS-events are forgotten.]
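The pairing of co-occurring events and the forgetting of old LS-events can be sketched in Python. This is a minimal illustration under stated assumptions, not the original implementation: the `Event` record, the interval-overlap test used for "co-occurring", and the STM capacity of 5 are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str    # hypothetical identifier for the event
    start: float  # start time (assumed to be in seconds)
    end: float    # end time


def overlaps(a: Event, b: Event) -> bool:
    """Assumed co-occurrence test: the two time intervals intersect."""
    return a.start < b.end and b.start < a.end


def pair_events(l_events, s_events):
    """Pair every L-event with each S-event that overlaps it in time."""
    return [(l, s) for l in l_events for s in s_events if overlaps(l, s)]


class ShortTermMemory:
    """Fixed-capacity buffer: adding beyond capacity forgets the oldest LS-event."""

    def __init__(self, capacity: int = 5):
        self.events = deque(maxlen=capacity)

    def add(self, ls_event):
        self.events.append(ls_event)
```

A bounded `deque` is used so that forgetting needs no explicit bookkeeping: appending to a full buffer silently drops the oldest entry.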
/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LS_i and LS_j {
    set L_match = FALSE
    set S_match = FALSE
    /* Compare each pair of L-subevents in LS_i and LS_j */
    for each L-subevent in LS_i, L_i {
        for each L-subevent in LS_j, L_j {
            if d_L(L_i, L_j) < t_L then set L_match = TRUE
        }
    }
    /* Compare each pair of S-subevents in LS_i and LS_j */
    for each S-subevent in LS_i, S_i {
        for each S-subevent in LS_j, S_j {
            if d_S(S_i, S_j) < t_S then set S_match = TRUE
        }
    }
    /* Check for matches of L-subevents and co-occurring S-subevents */
    if L_match = TRUE and S_match = TRUE then recurrent match found
}
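The recurrence test in the listing above can be written as a short runnable Python sketch. The data layout (each LS-event as a tuple of an L-subevent list and an S-subevent list) and the function name `find_recurrent_pairs` are assumptions for illustration; the distance functions and thresholds are passed in as parameters because the listing does not define them.

```python
from itertools import combinations


def find_recurrent_pairs(stm, d_l, d_s, t_l, t_s):
    """Return index pairs (i, j) of LS-events that share a recurrent match.

    stm: list of (l_subevents, s_subevents) tuples (assumed layout)
    d_l, d_s: distance functions for L-subevents and S-subevents
    t_l, t_s: distance thresholds below which two subevents match
    """
    matches = []
    for (i, (l_i, s_i)), (j, (l_j, s_j)) in combinations(enumerate(stm), 2):
        # A pair matches only if some L-subevents AND some S-subevents agree.
        l_match = any(d_l(a, b) < t_l for a in l_i for b in l_j)
        s_match = any(d_s(a, b) < t_s for a in s_i for b in s_j)
        if l_match and s_match:
            matches.append((i, j))
    return matches
```

With scalar subevents and absolute difference as a stand-in distance, `find_recurrent_pairs([([1.0, 5.0], [10.0]), ([1.2], [10.1, 30.0]), ([9.0], [10.0])], dist, dist, 0.5, 0.5)` reports only the first two LS-events as a recurrent pair.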
[Figure: LS-events in Short Term Memory (STM). Filled regions indicate recurrent L-subevents and S-subevents. A linguistic unit prototype and a semantic prototype are stored together in Mid Term Memory (MTM) as a lexical candidate.]
[Figure: An L-unit = {L-prototype, L-radius} is a region of the linguistic feature space; an S-category = {S-prototype, S-radius} is a region of the contextual feature space.]
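Reading the figure legend as a prototype plus a radius in each feature space, membership in an L-unit or S-category reduces to a distance test. The sketch below assumes Euclidean distance over fixed-length feature vectors purely for illustration; the distance metrics actually used in each feature space are not given here, and the `Region` class is a hypothetical name.

```python
import math
from dataclasses import dataclass


@dataclass
class Region:
    """A prototype point and a radius in some feature space."""
    prototype: tuple
    radius: float

    def contains(self, x) -> bool:
        # Assumption: Euclidean distance stands in for the real metric.
        return math.dist(self.prototype, x) <= self.radius
```

The same class serves for both an L-unit and an S-category, since each is described by one prototype and one radius.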
[Figure: Scatter of points labelled a to z under three radius settings. Panel labels (order as extracted): medium, large, small. Values shown (order as extracted): I(S;L) = 0.013 bits, I(S;L) = 0.29 bits, I(S;L) = 0.0 bits.]
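The quantity I(S;L) reported in the figure is a mutual information in bits. A minimal sketch of how it could be estimated from counts follows, assuming each observation is reduced to two binary indicators (inside or outside the L-unit, inside or outside the S-category); that reduction and the function name are assumptions for the example.

```python
import math
from collections import Counter


def mutual_information_bits(l_flags, s_flags):
    """Estimate I(S;L) in bits from paired indicator sequences."""
    n = len(l_flags)
    joint = Counter(zip(l_flags, s_flags))
    count_l = Counter(l_flags)
    count_s = Counter(s_flags)
    mi = 0.0
    for (l, s), c in joint.items():
        p_joint = c / n
        # p(l, s) / (p(l) * p(s)), written with raw counts
        ratio = p_joint * n * n / (count_l[l] * count_s[s])
        mi += p_joint * math.log2(ratio)
    return mi
```

Two perfectly aligned, evenly split indicators give 1 bit, and independent indicators give 0 bits, which matches the intuition that a radius pair is useful only when membership in the L-unit predicts membership in the S-category.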
[Figure: Mid Term Memory (MTM) -> mutual information filter -> lexical item.]
[Figure, two panels.
(1) Sensors -> feature analysis -> L-event detection -> event segmentation -> L-event. An L-subevent matches the L-unit in a lexical item in LTM, yielding the S-category of the recognized L-unit.
(2) Sensors -> feature analysis -> S-event detection -> event segmentation -> S-event. An S-subevent matches the S-category in a lexical item in LTM, yielding the L-unit of the recognized S-category.]
[Figure: Sensors -> feature analysis -> event detection -> event segmentation -> STM -> recurrence filter -> MTM -> mutual information filter -> LTM. LS-prototype hypotheses go through a lexical search; matched hypotheses are explained away.]
[Figure: Lexical item i and lexical item j, each with an L-unit and an S-category. The L-units and S-categories overlap. Matching lexical items are clustered to form a conglomerate lexical item.]
[Figure, two panels, each showing the L-unit and S-category of lexical items i and j.
(1) S-prototype i matches S-category j.
(2) L-prototype j matches L-unit i.]
Linguistic Units Semantic Categories
[Figure labels: goals; MTM; mutual information filter; LTM; action selection; environment; feedback; threshold adjustment; lexical item confidence adjustment.]
[Figure: Implementation of the architecture, bottom to top.
Sensors: microphone; camera.
Feature analysis: phoneme analysis; object shape analysis; object color analysis.
Linguistic channel: phoneme probabilities. Contextual channels: object shape & color.
Event detection: spoken utterance detection gives L-events (spoken utterances); object detection gives S-events (object view-sets).
Unpacking: L-event unpacking gives L-subevents (speech segments); S-event unpacking gives S-subevents (shape / color view-sets).
Co-occurrence filter -> Short Term Memory, LS-events: {spoken utterance, object view-set}.
Recurrence filter -> Mid Term Memory, lexical candidates: {spoken word prototype, color/shape prototype}.
Mutual information filter -> Long Term Memory, lexical items: {spoken word model, color/shape category}.]
[Figure: CCD camera -> color image -> foreground segmentation -> foreground bitmap -> connected regions analysis -> object mask. Context Channel 1 (Shape): mask-edge spatial derivative analysis. Context Channel 2 (Color): masked color image.]
[Figure: Original RGB image with its color histogram (axes: normalized red, normalized green); object mask with its shape histogram (axes: normalized distance, relative angle).]
[Figure: Color CCD camera platform with five degrees of freedom.
DOF 1: Base rotation
DOF 2: Base elevation
DOF 3: Neck rotation
DOF 4: Neck elevation
DOF 5: Object turntable rotation]
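[Figure: Recurrent neural network. RASTA-PLP spectral analysis feeds 12 units; a layer of 176 units is connected through a time delay to a second block of 176 units; 40 output units, one per phoneme class: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z. Output is the linguistic channel: phoneme probabilities.]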
[Figure: Three panels, each with rows labelled by phoneme class: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z.]
state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms
for each RNN output vector, l(t) {
    state 1: SILENCE
        if SIL != 1 {
            utteranceStartIndex = t
            state = 2
        } else {
            state = 1
        }
    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 {
            count_2 = 0
            state = 1
        } else if (count_2 > UTTERANCE_START_DELAY) {
            state = 3
        }
    state 3: UTTERANCE
        if SIL = 1 {
            state = 4
        } else {
            count_3 = count_3 + 1
            state = 3
        }
    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 {
            count_3 = count_3 + count_4
            count_4 = 0
            state = 3
        } else if (count_4 > UTTERANCE_END_DELAY) {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0
            count_3 = 0
            count_4 = 0
            state = 1
        }
}
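The four-state detector in the listing above translates directly into Python. The sketch below makes several assumptions: the input is a sequence of per-frame booleans (true when the silence unit wins), the two delays are expressed in frames rather than milliseconds, and detected utterances are returned as (start, end) index pairs instead of being passed to a `ProcessUtterance` call. The `count_3` counter is omitted because the listing never reads it.

```python
def detect_utterances(silence_flags, start_delay=5, end_delay=30):
    """Return (start_index, end_index) pairs for detected utterances.

    silence_flags: iterable of bools, True where the frame is silence
    start_delay, end_delay: hysteresis thresholds in frames (assumed unit)
    """
    SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4
    state = SILENCE
    start = 0
    count_2 = count_4 = 0
    utterances = []
    for t, sil in enumerate(silence_flags):
        if state == SILENCE:
            if not sil:
                start = t
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:  # too short to be speech: discard
                count_2 = 0
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
        else:  # POSSIBLE_END
            count_4 += 1
            if not sil:  # a short pause inside the utterance
                count_4 = 0
                state = UTTERANCE
            elif count_4 > end_delay:
                utterances.append((start, t - count_4 - 1))
                count_2 = count_4 = 0
                state = SILENCE
    return utterances
```

The two delays give the detector hysteresis: brief noise bursts shorter than `start_delay` never open an utterance, and pauses shorter than `end_delay` never close one.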
a utterance start null b null utterance end silence
[Figure: RNN output phoneme probabilities (rows: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z) are decoded by the Viterbi algorithm against a Hidden Markov Model for / b aa l /. Most likely phoneme sequence: b aa l.]
[Figure: Two panels, "yeah" and "dog". Each plots mutual information against L-radius and S-radius.]
[Figure: Histogram. x-axis: distance between view-sets (0 to 40); y-axis: histogram bin occupancy (normalized).]
[Figure: Two histograms. x-axis: distance between view-sets (0 to 40); y-axis: histogram bin occupancy (normalized).]
[Figure: Two bar charts comparing CELL with Acoustic Recurrency. First chart (y-axis 0 to 40): CELL 28%, Acoustic Recurrency 7%. Second chart (y-axis 0 to 100): CELL 72%, Acoustic Recurrency 31%.]
[Figure: Bar chart comparing CELL with Acoustic Recurrency (y-axis 0 to 80): CELL 57%, Acoustic Recurrency 13%.]
Spoken commands CELL User-dependent acoustic & semantic model Task semantics
Scene 1: User points to three colors in the rainbow and names them (lexical acquisition)
Scene 2: User selects a part from the "Tree of Life" by pointing to the part
Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1
Scene 4: User must select position for new body part using gesture, confirm with speech
Scene 5: A successfully placed part
Scene 6: After two more cycles of Scenes 2-5 the mate is complete and Toco looks on in new-found love