Lecture 6: Applications

Michael L. Littman
Rutgers University, Department of Computer Science
Rutgers Laboratory for Real-Life Reinforcement Learning

What is RL?
A branch of machine learning concerned with sequential behavior. It tries to remove human activities from the inner loop of the learning process, and it makes systems that improve a performance metric via interaction with their environment. It has much in common with the goals of autonomic computing.

Reinforcement-Learning Hypothesis
Intelligent behavior arises from the actions of an individual seeking to maximize its received reward signals in a complex and changing world. Research program: identify where reward signals come from, and develop algorithms that search the space of behaviors to maximize reward signals.

Example: Find The Ball
Learn which way to turn to minimize the number of steps needed to see the goal (the ball), given camera input and experience.

Localization: The Garden Path
Teaching the robot which way to turn is easier if the robot knows where it is, so teach the robot to recognize where it is facing: facing the east wall, facing the NW corner, etc. From an RL perspective, we've shot ourselves in the foot. This needs labels and training data, so the task is no longer autonomously learnable. Human input is required.

Counterintuitive Alternative?
Instead, don't tell the robot where it is. Give the robot two things: the ability to recognize when the goal is achieved, and a measure of cost en route (time, in this case). Now the robot can define locations implicitly: how do they relate to the goal? This is a less direct learning problem, but no human intervention is needed during the learning process. It is an ideal setting for RL.
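The formulation above needs only a goal detector and a per-step time cost. A minimal Python sketch of that reward signal follows. The ring of eight headings, the `ball_visible` detector, and both policies are hypothetical stand-ins chosen for illustration, not the robot setup from the lecture.

```python
N_HEADINGS = 8      # assumption: the robot faces one of 8 discrete headings
GOAL_HEADING = 0    # assumption: the ball is in view only at heading 0


def ball_visible(heading):
    """Goal detector: the only task knowledge the robot is given."""
    return heading == GOAL_HEADING


def episode_return(start, policy, max_steps=100):
    """Run one episode; every turn costs one time step (reward -1)."""
    heading, total = start, 0
    for _ in range(max_steps):
        if ball_visible(heading):
            break
        heading = (heading + policy(heading)) % N_HEADINGS
        total -= 1
    return total


def always_right(heading):
    """Naive policy: keep turning right (+1) until the ball appears."""
    return +1


def shortest_turn(heading):
    """Turn left (-1) when the goal is closer that way, else right (+1)."""
    return -1 if heading <= N_HEADINGS // 2 else +1
```

Starting from heading 3, `always_right` needs five turns and `shortest_turn` needs three, so the returns alone are enough to rank the two policies. Here the heading index stands in for whatever state the camera input provides.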

Formulation is Key
RL agents can be either a big win or a nonstarter, depending on the problem formulation. I'll describe several attempts I've been involved with, good and bad.

Network Repair: Diagnosis
There's a failure in the network. If the computer can identify the problem, it should be easier to repair. Learn a mapping from symptoms to diagnosis. Again, this needs training with labeled examples, and it uses our notion of an ontology of problems.

Network Repair: Full
Connectivity repair (Littman, Ravi, Fenson, Howard 04). Recover from a corrupted network interface configuration, minimizing time to repair.
Information-gathering actions include PluggedIn, PingIp, PingLhost, PingGateway, DnsLookup.
Repair actions: RenewLease, UseCachedIP, FixIP.
Additional information helps to make the right choice. Extra code was needed for detecting restored connectivity (doable) and keeping time (easy).

Learned Policy
Recovery from a corrupted network interface configuration. Java/Windows XP: minimize time to repair.

Spam Filtering
Machine learning is crucial in the development of commercial-grade spam filters.
Problem: the input is a bag of words and other features; the output is the likelihood that the message is spam.
Learning: lots of data, always changing. A human is already in the loop, but we don't get feedback on suppressed messages.

Adaptive Filtering
A version of spam filtering amenable to RL: a reward for delivering a non-spam message (+1), a punishment for delivering spam (-10), and learning from (sparse) human feedback.
Pitfalls: if the spam/non-spam distinction is easy, the system is pushed toward the right behavior by opportunity costs. If the distinction is hard, it will deliver either all messages or none (depending on how common spam is). Smart exploration must be encouraged early on so the system has a good chance to learn the distinction.
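The all-or-nothing pitfall follows from a short expected-reward calculation. The Python sketch below assumes +1 for a delivered legitimate message, -10 for delivered spam, and 0 for a withheld message. The zero reward for withholding and all function names are assumptions made for illustration.

```python
R_DELIVER_HAM = 1.0      # assumed reward for delivering a legitimate message
R_DELIVER_SPAM = -10.0   # assumed penalty for delivering spam
R_WITHHOLD = 0.0         # assumption: a suppressed message yields no feedback


def expected_deliver_reward(p_spam):
    """Expected reward for delivering a message with spam probability p_spam."""
    return (1.0 - p_spam) * R_DELIVER_HAM + p_spam * R_DELIVER_SPAM


def should_deliver(p_spam):
    """Deliver only when delivering beats withholding in expectation."""
    return expected_deliver_reward(p_spam) > R_WITHHOLD


def policy_without_features(spam_rate, n_messages):
    """If the filter cannot tell messages apart, every message gets the
    same estimate (the base rate), so the decision is all-or-nothing."""
    return [should_deliver(spam_rate) for _ in range(n_messages)]
```

Under these assumed rewards the break-even point is a spam probability of 1/11. A filter that has not yet learned the distinction delivers everything when the spam rate is below that and nothing when it is above, which is why early exploration matters.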

Spam Tagging as RL
Messages arrive at a server. The server has a set of filter programs, and a message is spam if it fails any filter in the set.
Cost: computation time to process a message. Try to run cheap, likely-to-fail filters first. Non-spam has a fixed cost, but spam can be tagged quickly. The output is always the same!
The same idea applies to sorting and SAT (Lagoudakis, Littman, Parr).

Other Relevant Applications
Deadlock detection interval selection: how often should we check for deadlock to balance overhead and wasted time? [Earlier talk]
Network routing in changing conditions: how do we decide when to find new routes?
Wireless network rate selection: rate adjustment depends on whether delays are due to congestion or noise.
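The spam-tagging advice to run cheap, likely-to-fail filters first can be checked with a small expected-cost calculation. The sketch below is not the RL approach from the lecture. It is a closed-form ordering that assumes filter failures are independent and that each filter's cost and failure probability are already known. The filter names and numbers are hypothetical.

```python
from itertools import permutations


def expected_cost(order):
    """Expected time to process a spam message when filters run in `order`
    and processing stops at the first filter the message fails.
    Each filter is (name, cost, p_fail); failures are assumed independent."""
    total, p_reach = 0.0, 1.0
    for _name, cost, p_fail in order:
        total += p_reach * cost      # pay this cost only if we get this far
        p_reach *= 1.0 - p_fail      # continue only if the filter passes
    return total


def greedy_order(filters):
    """Cheap and likely-to-fail first: sort by cost per unit of failure
    probability (requires p_fail > 0 for every filter)."""
    return sorted(filters, key=lambda f: f[1] / f[2])


FILTERS = [  # hypothetical filters: (name, cost, P(fail | spam))
    ("bayes", 10.0, 0.9),
    ("regex", 2.0, 0.3),
    ("blacklist", 1.0, 0.5),
]

# Brute-force check over all orderings of this small example.
best = min(permutations(FILTERS), key=expected_cost)
```

For these three filters, the greedy order (blacklist, regex, bayes) matches the brute-force optimum. A learning system would have to discover the same ordering from observed processing times, since the costs and failure rates are not given in advance.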

Sticky One: Network Security
Recognize intrusions. Prevent intrusion symptoms. It is hard to define rewards here. The system needs to see both sides of the tradeoff so it doesn't solve security problems by turning off the network: +1 for legitimate use, -1 for unauthorized use. Computing the rewards (not just the policy) seems to require intrusion detection!

Algorithms
We have discussed problems that are better or worse suited to RL. Let's say we have a problem we're ready to attack: what algorithms are appropriate?

Families of RL Approaches
Policy search: learn π, mapping s to a. More direct use, less direct learning.
Value-function based: learn Q, mapping (s, a) to v. Search for the action that maximizes value.
Model based: learn T and R, mapping (s, a) to (s', r). Solve the Bellman equations. More direct learning, less direct use.

Some Algorithms
Model-based: estimate T, R; solve the approximate MDP. Prioritized sweeping, Dyna.
Value-function-based: use observed transitions to modify Q itself. Q-learning, SARSA.
Policy search: try out different policies to find the best. Policy gradient, genetic approaches.
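For the value-function-based family, the two named algorithms differ only in how they bootstrap from the next state. The sketch below shows the standard tabular update rules. The dictionary-based table, the default step size and discount, and the function names are illustrative choices rather than anything specified in the lecture.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Q-learning (off-policy): bootstrap from the best action in s_next."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """SARSA (on-policy): bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def greedy_action(Q, s, actions):
    """Using Q: search for the action that maximizes value in state s."""
    return max(actions, key=lambda b: Q[(s, b)])


# Example table: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
Q[(1, "L")] = 2.0
Q[(1, "R")] = 4.0
```

With this table, a transition from state 0 via action "R" with reward -1 into state 1 moves Q(0, "R") toward -1 + 0.9 * 4 under Q-learning, and toward -1 + 0.9 * 2 under SARSA if the next action taken is "L".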

Mixed Bag
Of the three, model-based approaches appear to be the most data efficient. Model-based approaches still have the problem of solving the model. In some cases, it is useful to cast the model-solving problem as an RL problem!
Backgammon (Tesauro): model known, value-function-based learning used to solve it.
Helicopter (Ng et al.): model acquired via expert experience, policy search used to solve it.

Summary Thoughts
An RL formulation requires computable rewards, for example time to goal, if the goal is detectable. Future work: how do we do RL when the reward function must be learned autonomically?
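The model-based recipe (estimate T and R, then solve the approximate MDP) can be sketched in a few lines. This is plain counting plus value iteration, not prioritized sweeping or Dyna. The sample format `(s, a, r, s_next)` and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate T and R from (s, a, r, s_next) samples by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    T, R = {}, {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        T[sa] = {s2: c / n for s2, c in nexts.items()}  # empirical P(s'|s,a)
        R[sa] = reward_sum[sa] / n                      # mean observed reward
    return T, R


def q_value(T, R, V, s, a, gamma):
    """One-step lookahead on the estimated model."""
    return R[(s, a)] + gamma * sum(p * V[n] for n, p in T[(s, a)].items())


def value_iteration(T, R, gamma=0.9, sweeps=200):
    """Solve the Bellman optimality equations on the estimated MDP.
    States with no observed actions (e.g. terminal ones) keep value 0."""
    V = defaultdict(float)
    actions = defaultdict(list)
    for s, a in T:
        actions[s].append(a)
    for _ in range(sweeps):
        for s, acts in actions.items():
            V[s] = max(q_value(T, R, V, s, a, gamma) for a in acts)
    return V


def greedy_policy(T, R, V, gamma=0.9):
    """Pick, for each observed state, the action with the best lookahead."""
    best = {}
    for s, a in T:
        q = q_value(T, R, V, s, a, gamma)
        if s not in best or q > best[s][0]:
            best[s] = (q, a)
    return {s: a for s, (q, a) in best.items()}
```

Every sample updates the model, which is one intuition for the data efficiency noted above. The price is the planning step, which here is a fixed number of full sweeps.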

Some Robot Videos!
Helicopter: Ng, Abbeel
Navigation #1: Nouri

Navigation #2: Nouri
Creative Learning: Walsh

Terrain Learning #2: Leffler, Mansley, Edmunds

Multiagent Reinforcement Learning
Pinky and The Brain

The RL Way
Reward optimization is a black box. If you want to influence the learning process, do it by manipulating the reward function! Examples:
shaping rewards (give hints about the optimal policy) (Ng, Harada, Russell 99)
intrinsic motivation (rewards associated with the learning process itself, like learning new things) (Barto, Singh, Chentanez 04)
exploration bonus (encourage exploration via rewards for uncertainty) (Brafman & Tennenholtz 02)

Evolutionary Perspective
Chapman Cohen (1868-1954): "Human life, in line with animal life in general, has to develop not merely a dislike for such things as threaten life, but also a liking for their opposite. The development of this capacity means that in the long run the actions which promote pleasure, and those which preserve life, roughly coincide."

Multiagent RL
What is there to talk about?
Nothing: it'll just work itself out (other agents are a complex part of the environment).
A bit: without a boost, learning to work with other agents is just too hard.
A lot: it must be treated directly because it is fundamentally different from other learning.
Claim: multiagent problems can be addressed via specialized shaping rewards.

Shaping Rewards
We're smart, but evolution doesn't trust us to plan all that far ahead. Evolution programs us to want things likely to bring about what we need: taste/nutrition, pleasure/procreation, eye contact/care, generosity/cooperation.

Shaping Rewards in RL
Real task: escape. One definition of the reward function: -1 for each step, +100 for escape. Learning is too slow. If survival depended on escape, the learner would not survive. Alternative: an additional +10 for pushing any button. We call these shaping rewards.

Pros and Cons of Shaping
Can be really helpful. Shaping rewards are not really the main task, but they serve to encourage learning of pertinent parts of the model. Example: babies like standing up.
Somewhat risky. They can distract the learner so it spends all its time gathering easy-to-find but task-irrelevant rewards. The learner can't tell a real reward from a shaping reward.
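The distraction risk shows up directly in the arithmetic of the escape example. The sketch below uses the reward values from the slide. The five-step escape path, the 20-step horizon, the undiscounted sum, and the idea that a button can be pushed on every step are all assumptions made for illustration.

```python
STEP_REWARD = -1      # from the slide: -1 for each step
ESCAPE_REWARD = 100   # from the slide: +100 for escape
BUTTON_BONUS = 10     # from the slide: additional +10 for pushing any button


def return_escape(steps_to_escape):
    """Undiscounted return for heading straight to the exit."""
    return steps_to_escape * STEP_REWARD + ESCAPE_REWARD


def return_button_masher(horizon):
    """Undiscounted return for pushing a button on every step
    and never escaping (assumes buttons always pay out)."""
    return horizon * (STEP_REWARD + BUTTON_BONUS)
```

Under these assumptions, escaping in 5 steps earns 95, while pushing buttons for 20 steps earns 180. A learner that only sees the summed reward has no way to tell that the second behavior misses the real task.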

Why Have Social Rewards?
There are big advantages to (safe) cooperation. For reciprocal altruism, a species needs: repeated interactions; the ability to recognize conspecifics and discriminate against defectors; and an incentive towards long-term over short-term gain. These are necessary, but not sufficient: individuals must learn how.

Drives Linked with Altruism
To lead individuals to reap the benefits of reciprocal altruism, it's critical that they want to be around others, feel obligated to return favors, and feel obligated to punish a defector. There is evidence that the reward centers of our brains urge precisely this behavior.

Does Rejection Hurt?
(Eisenberger et al. 03) In the snubbing condition, brain centers associated with physical pain become active. Pain is evident even when subjects are barred from participation by technical difficulties. From Time Magazine.

Is Cooperation Pleasurable?
fMRI during repeated Prisoner's Dilemma (Rilling et al. 02). Payoffs: $3 (tempt), $2 (coop), $1 (defect), $0 (sucker). Mutual cooperation is most common (rational). Activation in the reward center (an area known to respond to desserts, pictures of pretty faces, money, cocaine) is brighter for the $2 (cooperative) payoff than for the $3 (cheating) payoff.

Is Revenge Sweet?

Getting Even: Ultimatum Game
Proposer is given $10. Proposer offers x ∈ X to Responder. Responder can take it or leave it.
Take it: Responder gets x, Proposer gets $10 - x.
Leave it: both get nothing.
X = {2,8} or {2,5} or {2,2} or {2,0}

What Should Responder Do?
Fraction of time accepting x=2:

X        one-shot   repeated   human
{2,8}    100%       33%        70%
{2,5}    100%       0%         55%
{2,2}    100%       100%       80%
{2,0}    100%       100%       90%

Repeated game analysis (Littman & Stone 03). Human results (Falk et al. 03).

Ultimatum: Discussion
Human results are not rational (in the sense of maximizing utility). They share common elements with maximizing utility under a repeated setting, but not quite. This suggests other motivations or influences: a reward for revenge.
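The one-shot column of the table follows directly from the payoff rule of the Ultimatum Game. The minimal sketch below covers only that single-round case and makes no attempt to reproduce the repeated-game analysis. The function names are hypothetical.

```python
POT = 10  # from the slide: Proposer is given $10


def payoffs(x, accept):
    """Return (responder, proposer) payoffs for an offer of x dollars."""
    return (x, POT - x) if accept else (0, 0)


def one_shot_best_response(x):
    """A utility-maximizing Responder in a single round accepts any offer
    that beats the $0 it would get by refusing."""
    return payoffs(x, True)[0] > payoffs(x, False)[0]
```

For x = 2 the Responder gets $2 by accepting and $0 by refusing, so a pure utility maximizer accepts every time, whatever the Proposer's alternative offer was. The human acceptance rates below 100% are what call for an extra reward term.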

Other Reward Functions
Evidence that we have internal reward functions for some specific human-nature events appears in the popular press about once a month. Some recent ones:
Love at First Sight
Cuteness: images of adorable kids and animals activate the reward center.
Schadenfreude

Eye Contact
Love at first sight. A research team led by Knut Kampe of the Institute of Cognitive Neuroscience at University College, London, has determined that eye contact with a pretty face (one judged to be attractive by the viewer [on variables such as radiance, empathy, cheerfulness, motherliness, and conventional beauty]) activates a pleasure center of the brain called the ventral striatum. Kampe's research, published in the journal Nature (2001), found that the brain-imaged pleasure response (which appears in a matter of seconds after viewing the face) only shows when mutual eye-contact is established, and does not show when looking into an attractive face whose eyes are averted or turned away.

Ha ha
Tania Singer at University College London and her colleagues, who published a schadenfreude paper in Nature, were not actually searching for schadenfreude when they used functional magnetic resonance imaging to watch the brains of subjects in action. Their primary interest was variation in levels of empathy, which can be detected by the activity in "pain-related areas" like the "fronto-insular and anterior cingulate cortices" of the brain when a person is watching someone else in pain. The empathy circuits lighted up in both men and women when bad things happened to good people. When bad things happened to bad people, the women in the study were still empathic. But not the men. Not only did they show less empathy toward bad people, but the reward center in the left nucleus accumbens lighted up. All that translates as "Serves him right!"

Evolutionary RL
(Ackley & Littman 90) Evolution valued health positively, predators negatively. Tree senility: agents value trees positively (defense against predators), with negative long-term effects (no food). Sophisticated intelligence is needed for rewards (emotions!).

Ackley's Video