Figure (p. 21): Sensors produce an utterance-context pair. The utterance is divided into linguistic units 1 to M, and the context is divided into semantic categories 1 to N.

Figure (p. 22): An utterance with its linguistic unit prototypes, and a context with its semantic category prototypes.

Figure (p. 23): Linguistic unit prototypes and semantic category prototypes are combined into a lexicon. Each lexicon entry pairs a linguistic unit with a semantic category.

Figure (p. 50): Processing stages over time, from input to output: input sensor signals, feature extraction, linguistic channels and contextual channels, event detection, linguistic events (L-events) and semantic events (S-events), co-occurrence filter, linguistic-semantic events (LS-events) in Short Term Memory (STM), recurrence filter, lexical candidates in Mid Term Memory (MTM), mutual information filter, lexical items in Long Term Memory (LTM).

Figure (p. 55): Sensors produce input signals. Feature analyzers convert these signals into linguistic channels and contextual channels, both plotted against time.

Figure (p. 57): The linguistic event detector extracts L-events from the linguistic channels, and the semantic event detector extracts S-events from the contextual channels. Both are plotted against time.

Figure (p. 58): An L-event or S-event is composed of multiple channels (channels 1 to 3 shown). The event segmenter divides the event into an array along channel and time-segment boundaries (segments 1 to n).

Figure (p. 59): An event divided along time segments and channels, with some potential subevents.

Figure (p. 60): Linguistic events (L-events) and semantic events (S-events) plotted against time. Co-occurring L-events and S-events are paired to form LS-events. Short term memory (STM) contains recent LS-events; old LS-events are forgotten.
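
The pairing and forgetting behavior in this figure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the original implementation: events are represented as hypothetical `(start, end)` time intervals, "co-occurring" is taken to mean that the intervals overlap, and STM is modeled as a fixed-capacity queue. The names `overlaps`, `pair_cooccurring`, and `stm_capacity` are invented for the example.

```python
from collections import deque

def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share any time."""
    return a[0] <= b[1] and b[0] <= a[1]

def pair_cooccurring(l_events, s_events, stm_capacity):
    """Pair overlapping L-events and S-events into LS-events.

    Assumption: STM is a bounded queue, so once it is full the oldest
    LS-event is dropped ("forgotten") when a new one arrives.
    """
    stm = deque(maxlen=stm_capacity)
    for l_event in l_events:
        for s_event in s_events:
            if overlaps(l_event, s_event):
                stm.append((l_event, s_event))
    return stm

l_events = [(0, 2), (5, 7), (10, 12)]
s_events = [(1, 3), (6, 8), (11, 13)]
stm = pair_cooccurring(l_events, s_events, stm_capacity=2)
```

With a capacity of 2, the first LS-event is forgotten and only the two most recent pairs remain in `stm`.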

Pseudocode (p. 63): search for recurrent matches in short term memory

/* Consider all pairs of LS-events in short term memory */
for each pair of LS-events in STM, LS_i and LS_j {
    L_match = FALSE
    S_match = FALSE

    /* Compare each pair of L-subevents in LS_i and LS_j */
    for each L-subevent in LS_i, L_i {
        for each L-subevent in LS_j, L_j {
            if d_L(L_i, L_j) < t_L then set L_match = TRUE
        }
    }

    /* Compare each pair of S-subevents in LS_i and LS_j */
    for each S-subevent in LS_i, S_i {
        for each S-subevent in LS_j, S_j {
            if d_S(S_i, S_j) < t_S then set S_match = TRUE
        }
    }

    /* Check for matches of L-subevents and co-occurring S-subevents */
    if L_match = TRUE and S_match = TRUE then recurrent match found
}
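
The pairwise search above might look as follows in runnable form. This is a sketch under assumptions: each LS-event is represented as a hypothetical tuple `(L-subevents, S-subevents)`, and the distance functions `d_l`, `d_s` and thresholds `t_l`, `t_s` are supplied by the caller, since the source does not define them here. The toy data uses plain numbers with absolute difference as the distance purely for illustration.

```python
from itertools import combinations

def recurrent_pairs(stm, d_l, d_s, t_l, t_s):
    """Return index pairs (i, j) of LS-events in STM that recur.

    A pair recurs when some L-subevent of LS_i lies within t_l of some
    L-subevent of LS_j, and likewise some S-subevents lie within t_s.
    """
    matches = []
    for (i, (l_i, s_i)), (j, (l_j, s_j)) in combinations(enumerate(stm), 2):
        l_match = any(d_l(a, b) < t_l for a in l_i for b in l_j)
        s_match = any(d_s(a, b) < t_s for a in s_i for b in s_j)
        if l_match and s_match:
            matches.append((i, j))
    return matches

def abs_diff(a, b):
    # Toy distance for the example; a real system would compare
    # speech segments and view-sets with domain-specific metrics.
    return abs(a - b)

stm = [
    ([0.0, 5.0], [10.0]),        # LS-event 0
    ([0.2], [10.1, 30.0]),       # LS-event 1: close to event 0 in both L and S
    ([9.0], [10.0]),             # LS-event 2: S matches event 0, L does not
]
found = recurrent_pairs(stm, abs_diff, abs_diff, t_l=0.5, t_s=0.5)
```

Only events 0 and 1 match on both the linguistic and the semantic side, so `found` contains the single pair `(0, 1)`.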

Figure (p. 64): LS-events in Short Term Memory (STM). Filled regions indicate recurrent L-subevents and S-subevents. A recurrent pair yields a linguistic unit prototype and a semantic prototype, which are stored together in Mid Term Memory (MTM) as a lexical candidate.

Figure (p. 65): An L-unit = {L-prototype, L-radius} is a region of the linguistic feature space, and an S-category = {S-prototype, S-radius} is a region of the contextual feature space.

Figure (p. 70): Three panels (small, medium, large) showing points labeled a to z. Mutual information values shown, in extraction order: I(S;L) = [value missing] bits, I(S;L) = 0.29 bits, I(S;L) = 0.0 bits.

Figure (p. 71): Lexical candidates in Mid Term Memory (MTM) pass through the mutual information filter to become lexical items.
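
The quantity I(S;L) used by the filter is the standard mutual information between two discrete variables, measured in bits. As a small sketch, assume L and S are binary indicators (whether an observation fell inside the L-unit and inside the S-category, respectively); that encoding and the function name `mutual_information` are assumptions made for illustration, not details taken from the source.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information I(S;L) in bits, estimated from (l, s) observations.

    I(S;L) = sum over (l, s) of p(l, s) * log2( p(l, s) / (p(l) * p(s)) ),
    with probabilities estimated by relative frequency.
    """
    n = len(pairs)
    joint = Counter(pairs)
    l_counts = Counter(l for l, _ in pairs)
    s_counts = Counter(s for _, s in pairs)
    mi = 0.0
    for (l, s), count in joint.items():
        p_ls = count / n
        # p(l, s) / (p(l) * p(s)) simplifies to count * n / (count_l * count_s)
        mi += p_ls * math.log2(count * n / (l_counts[l] * s_counts[s]))
    return mi

# L and S always agree: knowing one fully determines the other.
dependent = [(1, 1), (0, 0), (1, 1), (0, 0)]
# L and S vary independently: knowing one says nothing about the other.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]

mi_dependent = mutual_information(dependent)
mi_independent = mutual_information(independent)
```

A candidate whose linguistic and semantic matches always coincide scores 1 bit here, while an uninformative pairing scores 0 bits, which is the kind of contrast a threshold on I(S;L) can separate.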

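Figure (p. 74): Two ways of using lexical items in Long Term Memory (LTM).
First path: sensors, feature analysis, L-event detection, event segmentation. When an L-subevent of the L-event matches the L-unit in a lexical item, the output is the S-category of the recognized L-unit.
Second path: sensors, feature analysis, S-event detection, event segmentation. When an S-subevent of the S-event matches the S-category in a lexical item, the output is the L-unit of the recognized S-category.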

Figure (p. 75): Sensors, feature analysis, event detection, event segmentation, STM, recurrence filter, MTM, mutual information filter, LTM. LS-prototype hypotheses from STM go through a lexical search; matched hypotheses are explained away.

Figure (p. 77): Lexical item i and lexical item j, each with an L-unit and an S-category. When the L-units and S-categories overlap, the matching lexical items are clustered to form a conglomerate lexical item.

Figure (p. 78): Two matching cases for lexical item i and lexical item j. In the first, S-prototype i matches S-category j. In the second, L-prototype j matches L-unit i.

Figure (p. 79): Linguistic units and semantic categories.

Figure (p. 81): Feedback loop. MTM, mutual information filter, LTM, action selection (driven by goals), environment. Feedback from the environment drives threshold adjustment and lexical item confidence adjustment.

Figure (p. 85): The processing stages as implemented for spoken words and object shape and color.
Long Term Memory: lexical items {spoken word model, color/shape category}
Mutual information filter
Mid Term Memory: lexical candidates {spoken word prototype, color/shape prototype}
Recurrence filter
Short Term Memory: LS-events {spoken utterance, object view-set}
Co-occurrence filter
Linguistic side: microphone, phoneme analysis, linguistic channel (phoneme probabilities), spoken utterance detection, L-events (spoken utterances), L-event unpacking, L-subevents (speech segments)
Contextual side: camera, object shape analysis and object color analysis, contextual channels (object shape and color), object detection, S-events (object view-sets), S-event unpacking, S-subevents (shape/color view-sets)

Figure (p. 88): Visual processing. CCD camera, color image, foreground segmentation, foreground bitmap, connected regions analysis, object mask. Context Channel 1 (shape): mask-edge spatial derivative analysis. Context Channel 2 (color): masked color image.

Figure (p. 92): Original RGB image with its color histogram (axes: normalized red, normalized green), and object mask with its shape histogram (axes: normalized distance, relative angle).

Figure (p. 93): Color CCD camera mount and its degrees of freedom (DOF). DOF 1: base rotation. DOF 2: base elevation. DOF 3: neck rotation. DOF 4: neck elevation. DOF 5: object turntable rotation.

Figure (p. 96): Phoneme recognizer. RASTA-PLP spectral analysis feeds a recurrent neural network with time-delay connections; the layer sizes shown are 12, 176, 176, and 40 units. The output is the linguistic channel of phoneme probabilities over the classes aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z.

Figure (p. 100): Three panels, each with an axis labeled by the 40 phoneme classes: aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z.

Pseudocode (p. 103): utterance detection state machine

state = 1; count_2 = 0; count_3 = 0; count_4 = 0
UTTERANCE_START_DELAY = 50ms; UTTERANCE_END_DELAY = 300ms

for each RNN output vector, l(t) {
    state 1: SILENCE
        if SIL != 1 {
            utteranceStartIndex = t
            state = 2
        } else {
            state = 1
        }
    state 2: POSSIBLE_START_OF_UTTERANCE
        count_2 = count_2 + 1
        if SIL = 1 {
            count_2 = 0
            state = 1
        } else if (count_2 > UTTERANCE_START_DELAY) {
            state = 3
        }
    state 3: UTTERANCE
        if SIL = 1 {
            state = 4
        } else {
            count_3 = count_3 + 1
            state = 3
        }
    state 4: POSSIBLE_END_OF_UTTERANCE
        count_4 = count_4 + 1
        if SIL != 1 {
            count_3 = count_3 + count_4
            count_4 = 0
            state = 3
        } else if (count_4 > UTTERANCE_END_DELAY) {
            utteranceEndIndex = t - count_4 - 1
            ProcessUtterance(utteranceStartIndex, utteranceEndIndex)
            count_2 = 0
            count_3 = 0
            count_4 = 0
            state = 1
        }
}
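
A runnable rendering of this state machine can make the index arithmetic easier to check. The sketch below makes several assumptions not stated in the source: the input is reduced to one boolean per frame (`True` when the frame is silence), the two delays are given in frames rather than milliseconds, and detected utterances are returned as `(start, end)` frame indices instead of being passed to `ProcessUtterance`. The `count_3` counter is omitted because it does not affect the detected boundaries.

```python
SILENCE, POSSIBLE_START, UTTERANCE, POSSIBLE_END = 1, 2, 3, 4

def detect_utterances(is_silence, start_delay, end_delay):
    """Return (start_index, end_index) frame pairs for each detected utterance."""
    state = SILENCE
    count_2 = count_4 = 0
    start = 0
    found = []
    for t, sil in enumerate(is_silence):
        if state == SILENCE:
            if not sil:
                start = t                  # tentative utterance start
                state = POSSIBLE_START
        elif state == POSSIBLE_START:
            count_2 += 1
            if sil:
                count_2 = 0                # too short: treat as noise
                state = SILENCE
            elif count_2 > start_delay:
                state = UTTERANCE
        elif state == UTTERANCE:
            if sil:
                state = POSSIBLE_END
        else:  # POSSIBLE_END
            count_4 += 1
            if not sil:
                count_4 = 0                # short pause inside the utterance
                state = UTTERANCE
            elif count_4 > end_delay:
                found.append((start, t - count_4 - 1))
                count_2 = count_4 = 0
                state = SILENCE
    return found

# Two silent frames, five speech frames (indices 2..6), then silence.
frames = [True] * 2 + [False] * 5 + [True] * 4
utterances = detect_utterances(frames, start_delay=2, end_delay=2)
```

The end index is `t - count_4 - 1` because the frame that first moved the machine into the possible-end state is itself silent but is not counted in `count_4`; in the example the utterance is reported as frames 2 through 6.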

Figure (p. 105): Panel (a): utterance start. Panel (b): utterance end. Other labels: null, silence.

Figure (p. 107): The RNN output phoneme probabilities (over the classes aa ae ah aw ay b ch d dh dx eh er ey f g hh ih iy jh k l m n ng ow oy p q r s sh sil t th uh uw v w y z) are decoded by the Viterbi algorithm with a Hidden Markov Model to give the most likely phoneme sequence, for example / b aa l /.
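
As a rough illustration of this decoding step, the following sketch runs the textbook Viterbi recursion over per-frame phoneme probabilities and then collapses repeated labels into a phoneme sequence. It is not the original decoder: the three-phoneme inventory, the transition matrix, the uniform initial distribution, and the use of the per-frame probabilities directly as observation scores are all assumptions chosen to keep the example small. All probabilities must be strictly positive because the code works with logarithms.

```python
import math

def viterbi(frame_probs, trans, init):
    """Most likely state path given per-frame state scores and an HMM.

    frame_probs[t][s]: score of state s at frame t (assumed > 0)
    trans[r][s]:       probability of moving from state r to state s
    init[s]:           probability of starting in state s
    """
    n_states = len(init)
    score = [math.log(init[s]) + math.log(frame_probs[0][s]) for s in range(n_states)]
    back = []
    for obs in frame_probs[1:]:
        prev = score
        pointers, score = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + math.log(obs[s]))
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def collapse(path, labels):
    """Merge runs of the same state into a single phoneme label."""
    out = []
    for s in path:
        if not out or out[-1] != labels[s]:
            out.append(labels[s])
    return out

labels = ["b", "aa", "l"]                      # hypothetical tiny inventory
init = [1 / 3, 1 / 3, 1 / 3]
trans = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
frame_probs = [
    [0.90, 0.05, 0.05],
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
path = viterbi(frame_probs, trans, init)
phonemes = collapse(path, labels)
```

Here the decoded path stays in each phoneme for as long as the frame scores favor it, and collapsing the path yields the sequence b, aa, l.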

Figure (p. 112): Two panels, "yeah" and "dog", each plotting mutual information against L-radius and S-radius.

Figure (p. 123): Histogram of distances between view-sets. Axes: distance between view-sets, histogram bin occupancy (normalized).

Figure (p. 124): Two histograms of distances between view-sets. Axes in both: distance between view-sets, histogram bin occupancy (normalized).

Figure (p. 135): Two charts comparing CELL with Acoustic Recurrency. Percentages shown, in extraction order: [value missing]%, 7%, 72%, 31%.

Figure (p. 136): Chart comparing CELL with Acoustic Recurrency. Percentages shown, in extraction order: [value missing]%, 13%.

Figure (p. 146): Spoken commands are processed by CELL, which uses a user-dependent acoustic and semantic model together with task semantics.

Figure (p. 149): Interaction scenes.
Scene 1: User points to three colors in the rainbow and names them (lexical acquisition).
Scene 2: User selects a part from the "Tree of Life" by pointing to the part.
Scene 3: Part is colored by speech using one of the three lexical items learned in Scene 1.
Scene 4: User must select a position for the new body part using gesture, then confirm with speech.
Scene 5: A successfully placed part.
Scene 6: After two more cycles of Scenes 2-5, the mate is complete and Toco looks on in new-found love.
