End-to-End Memory Networks for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen (http://vivianchen.idv.tw), with Hakkani-Tur, Tur, Gao, Deng
Outline
- Introduction: Spoken Dialogue System; Spoken/Natural Language Understanding (SLU/NLU)
- Contextual Spoken Language Understanding: Model Architecture; End-to-End Training
- Experiments
- Conclusion & Future Work
Spoken Dialogue System (SDS)
Spoken dialogue systems are intelligent agents that help users finish tasks more efficiently via spoken interactions. They are being incorporated into various devices (smartphones, smart TVs, in-car navigation systems, etc.).
Examples: JARVIS (Iron Man's personal assistant), Baymax (a personal healthcare companion).
Good intelligent assistants help users organize and access information conveniently.
Dialogue System Pipeline
- Speech signal -> ASR hypothesis: "are there any action movies to see this weekend" (text input: "Are there any action movies to see this weekend?")
- Language Understanding (LU): user intent detection + slot filling -> semantic frame (intents, slots): request_movie, genre=action, date=this weekend
- Dialogue Management (DM): dialogue state tracking + policy decision -> system action: request_location
- Output generation -> text response: "Where are you located?" (screen display: "location?")
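The pipeline's data flow can be sketched in a few lines. This is a toy illustration of how a semantic frame from LU drives the DM policy decision; the keyword rules and the `understand`/`decide` helper names are invented for the sketch, not the system from the talk.

```python
def understand(utterance):
    """Toy LU: map an ASR hypothesis to a semantic frame (intent + slots)."""
    frame = {"intent": None, "slots": {}}
    text = utterance.lower()
    if "movie" in text:
        frame["intent"] = "request_movie"
    if "action" in text.split():
        frame["slots"]["genre"] = "action"
    if "this weekend" in text:
        frame["slots"]["date"] = "this weekend"
    return frame

def decide(frame, state):
    """Toy DM policy: request the first slot that is still missing."""
    required = {"request_movie": ["genre", "date", "location"]}
    for slot in required.get(frame["intent"], []):
        if slot not in frame["slots"] and slot not in state:
            return "request_" + slot
    return "inform_result"

frame = understand("are there any action movies to see this weekend")
action = decide(frame, state={})  # location is missing, so the system asks for it
```

Genre and date are already filled by LU, so the policy's next action is to request the user's location, which the output generator then verbalizes.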
End-to-End Memory Networks for Multi-Turn Spoken Language Understanding
Yun-Nung (Vivian) Chen

LU Importance
[Figure: learning curve of system performance. Success rate (0-1) vs. simulation epoch (0-495); curves: upper bound, RL (DQN) agent w/o LU errors, rule agent w/o LU errors.]
LU Importance
[Figure: same learning curve with the error-prone agents added: RL (DQN) agent w/ 5% LU errors and rule agent w/ 5% LU errors, each showing a >5% performance drop.]
The system performance is sensitive to LU errors, for both rule-based and reinforcement-learning agents.
Dialogue System Pipeline (revisited)
Same pipeline: speech signal -> ASR hypothesis -> Language Understanding (LU: intent detection, slot filling) -> semantic frame (request_movie, genre=action, date=this weekend) -> Dialogue Management (state tracking, policy decision) -> system action (request_location) -> output generation ("Where are you located?").
LU is the current bottleneck: its errors propagate through the rest of the pipeline.
SLU usually focuses on understanding single-turn utterances, yet the understanding result is usually influenced by 1) local observations and 2) global knowledge.
Spoken Language Understanding
Three tasks: domain identification (D), intent prediction (I), slot filling (S).

Single-turn example (domain: communication):
U: just sent email to bob about fishing this weekend
S: O    O    O     O  B-contact_name O B-subject I-subject I-subject
-> send_email(contact_name="bob", subject="fishing this weekend")

Multi-turn example:
U1: send email to bob -> O O O B-contact_name -> send_email(contact_name="bob")
U2: are we going to fish this weekend -> B-message I-message I-message I-message I-message I-message I-message -> send_email(message="are we going to fish this weekend")
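Decoding IOB tags into slot/value pairs, as in the examples above, is mechanical. A minimal sketch (the function name `iob_to_slots` is mine):

```python
def iob_to_slots(words, tags):
    """Collect IOB slot tags into (slot_name, value) pairs.

    B-<slot> starts a new slot value; I-<slot> continues the current
    one; O (or a mismatched I- tag) closes whatever slot was open.
    """
    slots, current, value = [], None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current is not None:
                slots.append((current, " ".join(value)))
            current, value = tag[2:], [word]
        elif tag.startswith("I-") and current == tag[2:]:
            value.append(word)
        else:
            if current is not None:
                slots.append((current, " ".join(value)))
            current, value = None, []
    if current is not None:
        slots.append((current, " ".join(value)))
    return slots

words = "just sent email to bob about fishing this weekend".split()
tags = ["O", "O", "O", "O", "B-contact_name",
        "O", "B-subject", "I-subject", "I-subject"]
slots = iob_to_slots(words, tags)
# -> [("contact_name", "bob"), ("subject", "fishing this weekend")]
```

These pairs then fill the arguments of the semantic frame, e.g. send_email(contact_name="bob", subject="fishing this weekend").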
Outline
- Introduction: Spoken Dialogue System; Spoken/Natural Language Understanding (SLU/NLU)
- Contextual Spoken Language Understanding: Model Architecture; End-to-End Training
- Experiments
- Conclusion & Future Work
MODEL ARCHITECTURE
Idea: additionally incorporate contextual knowledge during slot tagging.
1. Sentence Encoding: an RNN sentence encoder (RNN_mem) maps each history utterance x_i to a memory vector m_i; a second RNN encoder (RNN_in) maps the current utterance c to a vector u.
2. Knowledge Attention: the inner product of u with each m_i, normalized, yields the knowledge attention distribution p_i over the history.
3. Knowledge Encoding: the weighted sum h = sum_i p_i m_i is projected through W_kg into a knowledge-encoding representation that conditions the RNN tagger producing the slot-tagging sequence y.
[Diagram: the RNN tagger unrolled over time, with the knowledge encoding injected at each step.]
Chen et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," Interspeech 2016.
MODEL ARCHITECTURE (cont.)
Same three components (sentence encoding, knowledge attention, knowledge encoding); the sentence encoders for the history utterances and the current utterance can be either RNNs or CNNs.
Chen et al., "End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding," Interspeech 2016.
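The attention-and-carryover step above can be sketched with plain numpy, abstracting the sentence encoders away as precomputed vectors. The exact way the paper combines the memory summary with the current-utterance encoding may differ; here h and u are simply summed before the W_kg projection, as an illustrative assumption.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def knowledge_attention(memories, u, W_kg):
    """One memory-network hop over the dialogue history.

    memories: (n_history, d) encoded history utterances m_i
    u:        (d,) encoding of the current utterance
    W_kg:     (d, d) knowledge-encoding projection
    """
    p = softmax(memories @ u)    # p_i = softmax(u . m_i): attention over history
    h = p @ memories             # h = sum_i p_i * m_i: weighted memory summary
    o = W_kg @ (h + u)           # knowledge encoding fed to the RNN tagger
    return p, o

rng = np.random.default_rng(0)
memories = rng.normal(size=(3, 4))   # 3 history utterances, dim 4
u = rng.normal(size=4)               # current utterance encoding
W_kg = rng.normal(size=(4, 4))
p, o = knowledge_attention(memories, u, W_kg)
```

Because p is a softmax over the history, the model can softly select which earlier turns to carry knowledge over from, and that selection is learned end to end.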
END-TO-END TRAINING
Tagging objective: the whole model is trained on the slot-tag sequence, given the contextual utterances and the current utterance.
[Diagram: the RNN tagger unrolled over steps t-1, t, t+1, with weights U, W, V, M and the knowledge encoding o injected at each step.]
The model automatically figures out the attention distribution without explicit supervision.
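The tagging objective is the standard per-token cross-entropy summed over the sequence; gradients from this single loss flow back through the tagger, the knowledge encoding, and the attention, which is why the attention needs no direct supervision. A minimal numpy sketch (the helper name is mine):

```python
import numpy as np

def sequence_tagging_loss(tag_probs, gold_tags):
    """Summed per-token cross-entropy: -sum_t log p(y_t = gold_t).

    tag_probs: (T, n_tags) predicted tag distributions per token
    gold_tags: length-T list of gold tag indices
    """
    return -sum(np.log(tag_probs[t, g]) for t, g in enumerate(gold_tags))

# toy example: 3 tokens, 2 possible tags
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3]])
gold = [0, 1, 0]
loss = sequence_tagging_loss(probs, gold)
```

In practice the per-token distributions come from a softmax over the tagger's output layer, and the loss is minimized with a stochastic optimizer such as Adam.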
Outline
- Introduction: Spoken Dialogue System; Spoken/Natural Language Understanding (SLU/NLU)
- Contextual Spoken Language Understanding: Model Architecture; End-to-End Training
- Experiments
- Conclusion & Future Work
EXPERIMENTS
Dataset: Cortana communication session data. Setup: GRU for all RNNs, Adam optimizer, embedding dim = 150, hidden units = 100, dropout = 0.5.

Model      | Training set | Knowledge encoding | Sentence encoder | First turn | Other | Overall
RNN Tagger | single-turn  | x                  | x                | 60.6       | 16.2  | 25.5

The model trained on single-turn data performs worse on non-first turns due to mismatched training data.
EXPERIMENTS (cont.)

Model      | Training set | Knowledge encoding | Sentence encoder | First turn | Other | Overall
RNN Tagger | single-turn  | x                  | x                | 60.6       | 16.2  | 25.5
RNN Tagger | multi-turn   | x                  | x                | 55.9       | 45.7  | 47.4

Treating multi-turn data as single-turn for training performs reasonably.
EXPERIMENTS (cont.)

Model          | Training set | Knowledge encoding       | Sentence encoder | First turn | Other | Overall
RNN Tagger     | single-turn  | x                        | x                | 60.6       | 16.2  | 25.5
RNN Tagger     | multi-turn   | x                        | x                | 55.9       | 45.7  | 47.4
Encoder-Tagger | multi-turn   | current utt (c)          | RNN              | 57.6       | 56.0  | 56.3
Encoder-Tagger | multi-turn   | history + current (x, c) | RNN              | 69.9       | 60.8  | 62.5

Encoding current and history utterances improves the performance but increases the training time.
EXPERIMENTS (cont.)

Model          | Training set | Knowledge encoding       | Sentence encoder | First turn | Other | Overall
RNN Tagger     | single-turn  | x                        | x                | 60.6       | 16.2  | 25.5
RNN Tagger     | multi-turn   | x                        | x                | 55.9       | 45.7  | 47.4
Encoder-Tagger | multi-turn   | current utt (c)          | RNN              | 57.6       | 56.0  | 56.3
Encoder-Tagger | multi-turn   | history + current (x, c) | RNN              | 69.9       | 60.8  | 62.5
Proposed       | multi-turn   | history + current (x, c) | RNN              | 73.2       | 65.7  | 67.1

Applying memory networks significantly outperforms all other approaches with much less training time.
EXPERIMENTS (cont.) - NEW! NOT IN THE PAPER!

Model          | Training set | Knowledge encoding       | Sentence encoder | First turn | Other | Overall
RNN Tagger     | single-turn  | x                        | x                | 60.6       | 16.2  | 25.5
RNN Tagger     | multi-turn   | x                        | x                | 55.9       | 45.7  | 47.4
Encoder-Tagger | multi-turn   | current utt (c)          | RNN              | 57.6       | 56.0  | 56.3
Encoder-Tagger | multi-turn   | history + current (x, c) | RNN              | 69.9       | 60.8  | 62.5
Proposed       | multi-turn   | history + current (x, c) | RNN              | 73.2       | 65.7  | 67.1
Proposed       | multi-turn   | history + current (x, c) | CNN              | 73.8       | 66.5  | 68.0

A CNN produces comparable results for sentence encoding with shorter training time.
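The CNN sentence encoder in the last row can be pictured as a 1-D convolution over word windows followed by max pooling over time. The details below (window width, ReLU, single filter bank) are a common generic configuration, not the exact one behind these numbers.

```python
import numpy as np

def cnn_sentence_encoder(word_vecs, filters, width=2):
    """Encode a sentence as a fixed-size vector.

    word_vecs: (T, d) word embeddings
    filters:   (width * d, n_filters) convolution filters
    Returns a (n_filters,) sentence vector: convolve each width-word
    window, apply ReLU, then max-pool over time.
    """
    T, d = word_vecs.shape
    windows = np.stack([word_vecs[t:t + width].ravel()
                        for t in range(T - width + 1)])
    feature_maps = np.maximum(windows @ filters, 0.0)  # ReLU
    return feature_maps.max(axis=0)                    # max over time

rng = np.random.default_rng(0)
sent = rng.normal(size=(6, 5))      # 6 words, dim-5 embeddings
filters = rng.normal(size=(10, 8))  # width-2 windows -> 8 feature maps
vec = cnn_sentence_encoder(sent, filters)
```

Because every window is processed independently, the convolutions parallelize across the sentence, which is one plausible reason for the shorter training time compared with a sequential RNN encoder.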
Outline
- Introduction: Spoken Dialogue System; Spoken/Natural Language Understanding (SLU/NLU)
- Contextual Spoken Language Understanding: Model Architecture; End-to-End Training
- Experiments
- Conclusion & Future Work
Conclusion
- The proposed end-to-end memory networks store contextual knowledge, which can be exploited dynamically through an attention model to carry knowledge over for multi-turn understanding.
- The end-to-end model performs the tagging task instead of classification.
- The experiments show the feasibility and robustness of modeling knowledge carryover through memory networks.
Future Work
- Leverage not only local observations but also global knowledge for better language understanding.
- Syntax or semantics can serve as global knowledge to guide the understanding model.
- "Knowledge as a Teacher: Knowledge-Guided Structural Attention Networks," arXiv preprint arXiv:1609.03286.
Q & A
Thanks for your attention!
The code will be available at https://github.com/yvchen/contextualslu