Deep Q-learning for Active Recognition of GERMS: Baseline performance on a standardized dataset for active learning

Deep Q-learning for Active Recognition of GERMS: Baseline performance on a standardized dataset for active learning. Mohsen Malmir, Karan Sikka, Deborah Forster, Javier Movellan, and Garrison W. Cottrell. Presented by Ruohan Zhang, The University of Texas at Austin, April 13, 2016.

Outline: 1 Introduction. 2 The GERMS Dataset. 3 Deep Q-learning for Active Object Recognition (a very brief introduction to reinforcement learning; the deep Q-learning). 4 Results. 5 Conclusions. 6 Discussions.

The Active Object Recognition (AOR) Problem. The recognition module: what is this? The control module: where to look? Goal: find a sequence of sensor control commands that maximizes recognition accuracy and speed. Figure: The AOR problem for the RUBI robot [Malmir et al.].

Motivation. A benchmark dataset for AOR research that is more difficult than previous ones, e.g. [Nayar et al., 1996], and that does not require access to a physical robot. A baseline method and its performance, combining deep learning and reinforcement learning: deep Q-learning.

Data Collection. The RUBI project at the UCSD Machine Perception Lab. Six configurations for each object: two arms and three axes. RUBI brings the object to its center of view and rotates it by 180°.

Data Statistics. Data format: [image][capture time][joint angles]. Joint angles: 2-DOF head, two 7-DOF arms. 136 objects, 1365 videos, 30 fps, 8.9 s on average. Bounding boxes are annotated manually.

Examples. Figure: Left: a collage of all 136 objects. Right: some ambiguous objects that require rotation to disambiguate.

Example Videos. The videos for the left arm and for the right arm.

The Reinforcement Learning Problem. The goal: what to do in a state? Figure: The agent-environment interaction and the Markov decision process (MDP).

Markov Decision Process (MDP). Definition: a tuple $(S, A, P, R, \gamma)$, where $S$ is a finite set of states; $A$ is a finite set of actions; $P$ is the state transition probability matrix, $P^a_{ss'} = P[s' \mid s, a]$; $R$ is a reward function, $R^a_s = E[r \mid s, a]$; and $\gamma \in [0, 1)$ is a discount factor.
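To make the notation concrete, here is a minimal container for a finite MDP; this is only an illustrative sketch of the definition above, not code from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    n_states: int      # |S|
    n_actions: int     # |A|
    P: np.ndarray      # transition probabilities, shape (S, A, S): P[s, a, s'] = P[s' | s, a]
    R: np.ndarray      # expected rewards, shape (S, A): R[s, a] = E[r | s, a]
    gamma: float       # discount factor in [0, 1)
```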

Policy and Value Function. Policy: agent behavior is fully specified by $\pi(s, a) = P[a \mid s]$; one can directly optimize this by trying to maximize expected reward. Action-value function: $Q^\pi(s, a) = E_\pi[v_t \mid s_t = s, a_t = a]$, the expected return starting from state $s$, taking action $a$, and then following policy $\pi$. Goal of reinforcement learning: find the optimal policy $\pi^*(s, a) = 1$ if $a = \arg\max_{a' \in A} Q(s, a')$ and $0$ otherwise. Therefore, if we know $Q(s, a)$, we have the optimal policy.
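As a small illustration of extracting the greedy policy from a learned Q-table (a generic utility, not code from the paper):

```python
import numpy as np

def greedy_policy(Q):
    """Given a Q-table of shape (n_states, n_actions), return the greedy action per state."""
    return np.argmax(Q, axis=1)

# example: 3 states, 2 actions
Q = np.array([[0.1, 0.5], [0.7, 0.2], [0.0, 0.0]])
print(greedy_policy(Q))   # -> [1 0 0]
```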

Bellman Equations. Action-value function recursive decomposition: $Q^\pi(s, a) = E_\pi[r_{t+1} + \gamma Q^\pi(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a]$. Dynamic programming can solve the MDP under the assumption that the environment model $(P, R)$ is fully known.

Model-free Reinforcement Learning: Q-learning. The Q-learning algorithm [Sutton and Barto, 1998]:
  Initialize Q(s, a) arbitrarily
  Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
      Choose a from s (using a policy derived from Q, e.g. ε-greedy)
      Take action a, observe r, s'
      Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
      s ← s'
    until s is terminal
Remark: r + γ max_{a'} Q(s', a') can be seen as a supervised learning target, but it keeps changing as Q is updated.
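A runnable sketch of the same tabular update in Python; the environment interface (reset/step) is hypothetical and the hyperparameters are illustrative:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))       # Q-table, initialized arbitrarily (here: zeros)
    for _ in range(episodes):
        s = env.reset()                        # initialize s
        done = False
        while not done:
            # epsilon-greedy action selection from the current Q estimates
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)      # take action a, observe r and s'
            # TD target: r + gamma * max_a' Q(s', a'); no bootstrap at terminal states
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```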

Deep Reinforcement Learning? Basic Q-learning assumes discrete states and actions (a lookup Q-table) and a manually defined state space. Deep Q-learning instead uses a deep neural network to approximate the Q function.

The Network Architecture. Figure: The deep network architecture in [Malmir et al.].

The MDP in this Paper. State $B_t$: the output of the softmax layer of the CNN at time $t$, i.e., the belief vector over object labels; not the raw input image at time step $t$, as in [Mnih et al., 2013]. Naive Bayes is used to accumulate belief over the history of observations. Figure: The state space representation in [Malmir et al.].
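A rough illustration of Naive-Bayes-style belief accumulation; the paper's exact update rule may differ, and `cnn_forward` below is a hypothetical placeholder for the CNN's softmax output:

```python
import numpy as np

def update_belief(belief, softmax_probs, eps=1e-8):
    """Fuse a new per-frame softmax output into the running belief.

    Under a Naive Bayes assumption (frames conditionally independent given the
    object label), the posterior is proportional to the product of per-frame
    likelihoods, so we multiply and renormalize.
    """
    fused = belief * (softmax_probs + eps)
    return fused / fused.sum()

# usage: start from a uniform belief over the 136 GERMS classes
belief = np.full(136, 1.0 / 136)
# softmax_probs = cnn_forward(image)            # hypothetical CNN call
# belief = update_belief(belief, softmax_probs)
```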

The MDP in this Paper (continued). Actions $a_t$: ten rotation commands $\{\pm\pi/64, \pm\pi/32, \pm\pi/16, \pm\pi/8, \pm\pi/4\}$. $P$: the transition matrix is unknown (the reason they use model-free Q-learning). $R$: +10 for a correct classification, −10 otherwise. $\gamma$: unknown (not reported).

The Training Algorithm. Exactly the Q-learning update: $Q(B_t, a_t) \leftarrow Q(B_t, a_t) + \alpha [r_t + \gamma \max_a Q(B_{t+1}, a) - Q(B_t, a_t)]$. For the network weights, use stochastic gradient descent on the TD error: $W \leftarrow W + \lambda [r_t + \gamma \max_a Q(B_{t+1}, a) - Q(B_t, a_t)] \nabla_W Q(B_t, a_t)$, applied as mini-batch updates. Mini-batching is a key trick to stabilize the deep RL network; otherwise the learning target changes too rapidly and training will not converge.
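A minimal sketch of this update for a single transition, assuming PyTorch and a small fully connected Q-network over the belief vector; the layer sizes, optimizer settings, and discount factor are illustrative assumptions, not the paper's exact configuration (mini-batching would stack several transitions per step):

```python
import math
import torch
import torch.nn as nn

N_CLASSES = 136                        # GERMS object classes (belief vector length)
ACTIONS = [s * a for a in (math.pi/64, math.pi/32, math.pi/16, math.pi/8, math.pi/4)
           for s in (+1, -1)]          # the ten rotation commands
GAMMA = 0.95                           # discount factor (value not reported in the slides)

# Small MLP mapping a belief vector B_t to one Q-value per rotation action.
q_net = nn.Sequential(
    nn.Linear(N_CLASSES, 64), nn.ReLU(),
    nn.Linear(64, len(ACTIONS)),
)
optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3)

def q_update(belief, action_idx, reward, next_belief, done):
    """One semi-gradient Q-learning step on a single (B_t, a_t, r_t, B_{t+1}) transition."""
    q_sa = q_net(belief)[action_idx]                        # Q(B_t, a_t)
    with torch.no_grad():                                   # the target is treated as a constant
        target = reward if done else reward + GAMMA * q_net(next_belief).max()
    loss = (target - q_sa) ** 2                             # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```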

Results. Figure: Experimental results on classification accuracy [Malmir et al.].

Results. Figure: The number of steps required to reach a given classification accuracy, for different algorithms [Malmir et al.].

Conclusions. The GERMS dataset. Deep Q-learning for AOR; however, much room for improvement remains, both in raw performance and because only a very basic version of deep Q-learning is used.

Discussions. The right arm outperforms the left arm. The objects are uncommon for robotic tasks. Manual bounding-box annotation is labor-intensive. State representation (the belief vector). What is the most representative frame? Are there other similar datasets? Extension: use an RNN to combine the two modules (control and recognition), e.g., recurrent models of visual attention [Mnih et al., 2014].

References.
Malmir, M., Sikka, K., Forster, D., Movellan, J., and Cottrell, G. W. Deep Q-learning for active recognition of GERMS: Baseline performance on a standardized dataset for active learning. In Proceedings of the British Machine Vision Conference (BMVC).
Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204-2212.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Nayar, S., Nene, S., and Murase, H. (1996). Columbia Object Image Library (COIL-100). Department of Computer Science, Columbia University, Tech. Rep. CUCS-006-96.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.