Reinforcement Learning for NLP

Reinforcement Learning for NLP. Caiming Xiong, Salesforce Research. CS224N/Ling284.

Outline: Introduction to Reinforcement Learning; Policy-based Deep RL; Value-based Deep RL; Examples of RL for NLP.

Many Faces of RL By David Silver

What is RL? RL is a general-purpose framework for sequential decision-making. It is usually described as an agent interacting with an unknown environment. Goal: select actions to maximize future cumulative reward. The agent sends an action a to the environment; the environment returns a reward r and an observation o to the agent.

Motor Control. Observations: images from a camera, joint angles. Actions: joint torques. Rewards: navigate to the target location, serve and protect humans.

Business Management. Observations: current inventory levels and sales history. Actions: number of units of each product to purchase. Rewards: future profit. Similar formulations cover resource allocation and routing problems.

Games

State. Experience is a sequence of observations, actions, and rewards. The state is a summary of experience.

RL Agent. Major components: Policy: the agent's behavior function. Value function: how good each state and/or action is. Model: the agent's prediction/representation of the environment.

Policy. A function that maps from state to action. Deterministic policy: $a = \pi(s)$. Stochastic policy: $\pi(a \mid s) = P[a \mid s]$.

Value Function. The Q-value function $Q^\pi(s, a)$ gives the expected future total reward from state $s$ and action $a$ under policy $\pi$ with discount factor $\gamma \in (0, 1)$; it shows how good the current policy is. Value functions can be defined using the Bellman equation, via the Bellman backup operator: $B^\pi Q(s, a) = \mathbb{E}_{s', a'}[\, r + \gamma Q^\pi(s', a') \mid s, a \,]$.

Value Function. For the optimal Q-value function $Q^*(s, a) = \max_\pi Q^\pi(s, a)$, the policy is deterministic and the Bellman equation becomes: $B Q^*(s, a) = \mathbb{E}_{s'}[\, r + \gamma \max_{a'} Q^*(s', a') \mid s, a \,]$.
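
As a concrete instance of the Bellman backup, here is a minimal tabular Q-learning sketch (not from the slides). It assumes a Gym-style environment `env` with discrete states and actions; `n_states`, `n_actions`, and the hyperparameters are illustrative.

```python
import numpy as np

# Tabular Q-learning sketch: each update moves Q(s, a) toward the
# Bellman target r + gamma * max_a' Q(s', a').
# Assumption: `env` follows the classic Gym interface (reset/step)
# with integer states and actions.

def q_learning(env, n_states, n_actions, episodes=1000,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Bellman backup toward r + gamma * max_a' Q(s', a')
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```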

What is Deep RL? Use a deep neural network to approximate the policy, the value function, and/or the model, optimized by SGD.

Approaches: Policy-based Deep RL; Value-based Deep RL; Model-based Deep RL.

Deep Policy Network. Represent the policy by a deep neural network and optimize $\max_\theta \mathbb{E}_{a \sim \pi(a \mid s, \theta)}[\, r(a) \mid \theta, s \,]$. Ideas: given a batch of trajectories, make the good trajectories/actions more probable and push the actions towards good actions.

Policy Gradient. How to make high-reward actions more likely: $\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}[\, \nabla_\theta \log \pi_\theta(a \mid s)\, r(a) \,]$.

Let's say $r(a)$ measures how good the sample is. Moving in the direction of this gradient pushes up the probability of the sample, in proportion to how good it is.
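
A minimal REINFORCE-style sketch of this idea, assuming PyTorch; the `PolicyNet` architecture and the trajectory format are illustrative assumptions, not from the lecture.

```python
import torch
import torch.nn as nn

# REINFORCE sketch: increase the log-probability of sampled actions in
# proportion to the (discounted) reward they received.

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """trajectory: list of (obs, action, reward) tuples for one episode."""
    returns, G = [], 0.0
    for _, _, r in reversed(trajectory):        # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    loss = 0.0
    for (obs, a, _), G in zip(trajectory, returns):
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        # push up log-prob of the action, in proportion to how good it was
        loss = loss - dist.log_prob(torch.tensor(a)) * G
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```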

Deep Q-Learning. Represent the value function by a Q-network $Q(s, a; \theta)$.

Deep Q-Learning. Optimal Q-values should obey the Bellman equation. Treat the right-hand side $r + \gamma \max_{a'} Q(s', a'; \theta^-)$ as the target; given a transition $(s, a, r, s')$, optimize the MSE loss via SGD. This converges to the optimal $Q^*$ when using a table-lookup representation.

Deep Q-Learning. But it can diverge when using neural networks, due to: correlations between samples; non-stationary targets.

Deep Q-Learning. Experience replay: to remove correlations, build a dataset from the agent's own experience; sample experiences from the dataset and apply updates. To deal with non-stationarity, the target network parameters are held fixed (and only periodically updated).
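
A minimal sketch of these two fixes, experience replay plus a periodically synced target network, assuming PyTorch; the network sizes (4-dimensional states, 2 actions, CartPole-like) and hyperparameters are illustrative.

```python
import random
from collections import deque
import torch
import torch.nn as nn

# Sketch of the two DQN stabilizers: a replay buffer breaks sample
# correlations, and a frozen target network keeps the regression
# target stationary between periodic syncs.
# Buffer entries are assumed to be (state_list, action_int, reward,
# next_state_list, done_bool).

buffer = deque(maxlen=100_000)

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())   # start identical
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def train_step(batch_size=32, gamma=0.99):
    if len(buffer) < batch_size:
        return
    s, a, r, s2, done = map(torch.as_tensor, zip(*random.sample(buffer, batch_size)))
    s, s2, r = s.float(), s2.float(), r.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # target net is held fixed
        target = r + gamma * target_net(s2).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# every N environment steps: target_net.load_state_dict(q_net.state_dict())
```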

Deep Q-Learning in Atari. Network architecture and hyperparameters are fixed across all games. (By David Silver)

(Figure by David Silver)

If you want to know more about RL, we suggest reading: Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, Second Edition (in progress), MIT Press, Cambridge, MA, 2017.

RL in NLP Article summarization Question answering Dialogue generation Dialogue System Knowledge-based QA Machine Translation Text generation

Article Summarization. Text summarization is the process of automatically generating natural language summaries from an input document while retaining the important points. Two flavors: extractive summarization and abstractive summarization.

A Deep Reinforced Model for Abstractive Summarization. Let $x = \{x_1, x_2, \ldots, x_n\}$ be the sequence of input (article) tokens and $y = \{y_1, y_2, \ldots, y_m\}$ the sequence of output (summary) tokens. At each step the decoder either copies a word from the input or generates a word. (Paulus et al.)

A Deep Reinforced Model for Abstractive Summarization. The maximum-likelihood training objective: $L_{ml} = -\sum_{t=1}^{m} \log p(y^*_t \mid y^*_1, \ldots, y^*_{t-1}, x)$, trained with the teacher-forcing algorithm. (Paulus et al.)

A Deep Reinforced Model for Abstractive Summarization. There is a discrepancy between training and test performance because of: exposure bias; the existence of multiple potentially valid summaries; and the difference between the training loss and the evaluation metric. (Paulus et al.)

A Deep Reinforced Model for Abstractive Summarization. Using the reinforcement learning framework, learn a policy that maximizes a specific discrete metric. Action: $u_t$ (copy or generate) and the word $\hat{y}_t$. State: hidden states of the encoder and the previous outputs. Reward: ROUGE score. Here $p(\hat{y}_t \mid \hat{y}_1, \ldots, \hat{y}_{t-1}, x) = p(u_t = \text{copy})\, p(\hat{y}_t \mid \hat{y}_1, \ldots, \hat{y}_{t-1}, x, u_t = \text{copy}) + p(u_t = \text{generate})\, p(\hat{y}_t \mid \hat{y}_1, \ldots, \hat{y}_{t-1}, x, u_t = \text{generate})$. (Paulus et al.)
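
A hedged sketch of how such a sequence-level reward can be optimized with a self-critical policy gradient; this is an illustration, not the authors' exact implementation. `model.sample`, `model.greedy`, and `rouge_score` are hypothetical helpers.

```python
import torch

# Self-critical policy-gradient loss sketch for summarization:
# reward a sampled summary relative to the greedy-decoded baseline,
# scaled by the log-probability of the sampled tokens.
# `model.sample`, `model.greedy`, and `rouge_score` are hypothetical.

def rl_loss(model, article, reference, rouge_score):
    sampled, log_probs = model.sample(article)   # sampled summary and its token log-probs
    with torch.no_grad():
        baseline = model.greedy(article)         # greedy-decoded baseline summary
    # advantage: how much better the sample is than the baseline
    advantage = rouge_score(sampled, reference) - rouge_score(baseline, reference)
    # push up the sample's log-probability in proportion to the advantage
    return -advantage * log_probs.sum()

# A mixed objective can then interpolate with maximum likelihood:
# loss = gamma * rl_loss(...) + (1 - gamma) * ml_loss(...)
```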

A Deep Reinforced Model for Abstractive Summarization. (Paulus et al.)

A Deep Reinforced Model for Abstractive Summarization. Human readability scores on a random subset of the CNN/Daily Mail test dataset. (Paulus et al.)

RL in NLP Article summarization Question answering Dialogue generation Dialogue System Knowledge-based QA Machine Translation Text generation

Text Question Answering. Example from the SQuAD dataset.

Text Question Answering. Typical architecture: encoder layers for the passage P and the question Q (LSTM, GRU); an attention layer (coattention, biattention, self-attention); a decoder (LSTM + MLP, GRU + MLP, pointer); and a loss function layer (cross entropy).

DCN+: MIXED OBJECTIVE AND DEEP RESIDUAL COATTENTION FOR QUESTION ANSWERING. Limitations of the cross-entropy loss. P: "Some believe that the Golden State Warriors team of 2017 is one of the greatest teams in NBA history." Q: "Which team is considered to be one of the greatest teams in NBA history?" GT: "the Golden State Warriors team of 2017." Ans1: "Warriors." Ans2: "history." Cross entropy penalizes both answers equally, even though Ans1 overlaps heavily with the ground truth while Ans2 does not. (Xiong et al.)

DCN+: MIXED OBJECTIVE AND DEEP RESIDUAL COATTENTION FOR QUESTION ANSWERING. To address this, the F1 score is introduced as an extra objective, combined with the traditional cross-entropy loss; an exact match of the variable-length span is no longer necessary. (Xiong et al.)
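
A hedged sketch of such a mixed objective for span prediction, assuming the model emits start/end logits for a single example; `f1_score` over token spans and the sampling setup are assumptions, not the DCN+ code.

```python
import torch

# Mixed objective sketch for extractive QA: cross entropy over the
# ground-truth span plus a policy-gradient term whose reward is the
# word-overlap F1 of a sampled span. `f1_score` is a hypothetical
# helper comparing a predicted span against the gold span.

def mixed_loss(start_logits, end_logits, gt_start, gt_end,
               f1_score, passage_tokens, lam=0.5):
    # standard cross entropy over the ground-truth start/end positions
    ce = (torch.nn.functional.cross_entropy(start_logits.unsqueeze(0), gt_start.view(1))
          + torch.nn.functional.cross_entropy(end_logits.unsqueeze(0), gt_end.view(1)))

    # sample a span and reward it by its word-overlap F1 with the gold span
    start_dist = torch.distributions.Categorical(logits=start_logits)
    end_dist = torch.distributions.Categorical(logits=end_logits)
    s, e = start_dist.sample(), end_dist.sample()
    reward = f1_score(passage_tokens, (s.item(), e.item()),
                      (gt_start.item(), gt_end.item()))
    rl = -reward * (start_dist.log_prob(s) + end_dist.log_prob(e))
    return lam * rl + (1 - lam) * ce
```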

RL in NLP Article summarization Question answering Dialogue generation Dialogue System Knowledge-based QA Machine Translation Text generation

Deep Reinforcement Learning for Dialogue Generation. Goal: generate responses for conversational agents. The LSTM sequence-to-sequence (SEQ2SEQ) model is one type of neural generation model that maximizes the probability of a response given the previous dialogue turn. However, SEQ2SEQ models tend to generate highly generic responses and can get stuck in an infinite loop of repetitive responses. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. To solve these issues, the model needs to: integrate developer-defined rewards that better mimic the true goal of chatbot development; and model the long-term influence of a generated response in an ongoing dialogue. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Definitions: Action: the action space is infinite, since arbitrary-length sequences can be generated. State: a state is denoted by the previous two dialogue turns $[p_i, q_i]$. Reward: ease of answering, information flow, and semantic coherence. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Ease of answering: avoid utterances that are likely to be answered with a dull response. S is a list of dull responses such as "I don't know what you are talking about", "I have no idea", etc. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Information flow: penalize semantic similarity between consecutive turns from the same agent, where $h_{p_i}$ and $h_{p_{i+1}}$ denote the encoder representations of two consecutive turns $p_i$ and $p_{i+1}$. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Semantic coherence: avoid situations in which the generated replies are highly rewarded but ungrammatical or incoherent. The final reward for an action a is a weighted sum of these rewards. (Li et al.)
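
A minimal sketch of combining the three components into one scalar reward; the dull-response list, the `log_seq2seq` scorer, and the weights are illustrative assumptions rather than the paper's exact formulas.

```python
import numpy as np

# Sketch of a combined dialogue reward: ease of answering (r1),
# information flow (r2), and semantic coherence (r3), mixed with
# weights lambda1..3. `log_seq2seq(target, source)` is a hypothetical
# length-normalized log-likelihood from a pretrained seq2seq model.

DULL = ["i don't know what you are talking about", "i have no idea"]

def dialogue_reward(p_prev, q_prev, action, h_prev, h_curr,
                    log_seq2seq, lambdas=(0.25, 0.25, 0.5)):
    # r1: penalize actions that are likely to be answered with a dull reply
    r1 = -np.mean([log_seq2seq(d, action) for d in DULL])
    # r2: penalize semantic similarity between consecutive turns (cosine)
    cos = np.dot(h_prev, h_curr) / (np.linalg.norm(h_prev) * np.linalg.norm(h_curr))
    r2 = -np.log(max(cos, 1e-8))                  # clamp to keep the log defined
    # r3: reward coherence with the previous turns in both directions
    r3 = log_seq2seq(action, q_prev + " " + p_prev) + log_seq2seq(q_prev, action)
    l1, l2, l3 = lambdas
    return l1 * r1 + l2 * r2 + l3 * r3
```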

Deep Reinforcement Learning for Dialogue Generation. Simulate two agents taking turns, exploring the state-action space and learning a policy. Steps: supervised learning for Seq2Seq models; mutual information for pretraining the policy model; dialogue simulation between two agents. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Mutual information between the previous sequence S and the response T. MMI objective: $\hat{T} = \arg\max_T \{\, \log p(T \mid S) - \lambda \log p(T) \,\}$, where $\lambda$ controls the penalization of generic responses. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Considering S as $(q_i, p_i)$ and T as the action $a$, the MMI objective can be rewritten in terms of the dialogue state and action, and used as a reward for pretraining the policy model. (Li et al.)
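
A small sketch of using the MMI score $\log p(T \mid S) - \lambda \log p(T)$ to rerank candidate responses; `log_p_t_given_s` and `log_p_t` are hypothetical scorers from a forward seq2seq model and a language model.

```python
# MMI-style reranking sketch: prefer responses with high mutual
# information with the source, i.e. log p(T|S) - lambda * log p(T).
# `log_p_t_given_s` and `log_p_t` are hypothetical model scorers.

def mmi_rerank(source, candidates, log_p_t_given_s, log_p_t, lam=0.5):
    scored = [(log_p_t_given_s(t, source) - lam * log_p_t(t), t)
              for t in candidates]
    scored.sort(reverse=True)          # highest MMI score first
    return scored[0][1]
```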

Deep Reinforcement Learning for Dialogue Generation. Simulation: supervised learning for Seq2Seq models; mutual information for pretraining the policy model; dialogue simulation between two agents. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. Dialogue simulation between two agents: using the simulated turns and rewards, maximize the expected future reward. Training trick: curriculum learning. (Li et al.)

Deep Reinforcement Learning for Dialogue Generation. (Li et al.)

Summary. Introduction to reinforcement learning: deep policy learning and deep Q-learning. Applications in NLP: article summarization, question answering, and dialogue generation.