Lecture 6: Applications
Michael L. Littman
Rutgers University, Department of Computer Science
Rutgers Laboratory for Real-Life Reinforcement Learning

What is RL?
Branch of machine learning concerned with sequential behavior:
- tries to remove human activities from the inner loop of the learning process.
- makes systems that improve a performance metric via interaction with their environment.
Much in common with goals of autonomic computing.
Reinforcement-Learning Hypothesis
Intelligent behavior arises from the actions of an individual seeking to maximize its received reward signals in a complex and changing world.
Research program:
- identify where reward signals come from,
- develop algorithms that search the space of behaviors to maximize reward signals.

Example: Find The Ball
Learn:
- which way to turn
- to minimize steps to see goal (ball)
- from camera input
- given experience.
Localization: The Garden Path
To teach the robot which way to turn, easier if the robot knows where it is.
Teach robot to recognize where it is facing: facing east wall, facing NW corner, etc.
From an RL perspective, we've shot ourselves in the foot.
- Need labels, training data.
- No longer autonomously learnable.
- Human input required.

Counterintuitive Alternative?
Instead, don't tell robot where it is. Give robot two things:
- Ability to recognize when goal is achieved.
- Measure of cost en route (time, in this case).
Now, robot can define locations implicitly: how do they relate to the goal?
Less direct learning problem. But, no human intervention needed during learning process. Ideal setting for RL.
Formulation is Key
RL agents can either be a big win or a nonstarter depending on the problem formulation.
I'll describe several attempts I've been involved with, good and bad.

Network Repair: Diagnosis
There's a failure in the network. If the computer can identify the problem, it should be easier to repair.
- Learn mapping from symptoms to diagnosis.
- Again, need to train with labeled examples.
- Uses our notion of an ontology of problems.
Network Repair: Full
Connectivity repair (Littman, Ravi, Fenson, Howard 04).
Recover from corrupted network interface config. Minimize time to repair.
- Info. gathering actions: PluggedIn, PingIp, PingLhost, PingGateway, DnsLookup
- Repair actions: RenewLease, UseCachedIP, FixIP.
Additional information helps to make the right choice.
Needed extra code for:
- detecting restored connectivity (doable)
- keeping time (easy)

Learned Policy
Recovery from corrupted network interface configuration.
Java/Windows XP: Minimize time to repair.
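The "minimize time to repair" objective can be read as a reward of minus the elapsed time until connectivity is detected as restored. Below is a minimal sketch of that bookkeeping. The per-action durations and the choice of which action fixes the fault are hypothetical placeholders, not measurements from the paper, and the sketch does not model how an information-gathering action selects the repair.

```python
# Hypothetical durations in seconds for some action names from the slide.
DURATIONS = {
    "PingGateway": 1.0,
    "RenewLease": 8.0,
    "UseCachedIP": 3.0,
    "FixIP": 5.0,
}


def time_to_repair(action_sequence, fixing_action, durations=DURATIONS):
    """Seconds spent until the action that restores connectivity has run.

    Returns infinity if the sequence never restores connectivity.
    """
    elapsed = 0.0
    for action in action_sequence:
        elapsed += durations[action]      # "keeping time"
        if action == fixing_action:       # "detecting restored connectivity"
            return elapsed
    return float("inf")


def episode_reward(action_sequence, fixing_action, durations=DURATIONS):
    """Reward for one episode: less time to repair means higher reward."""
    return -time_to_repair(action_sequence, fixing_action, durations)


# Trying repairs blindly versus spending one second on a cheap check first
# (assuming, for illustration, that the check points at FixIP).
blind = ["RenewLease", "UseCachedIP", "FixIP"]
informed = ["PingGateway", "FixIP"]
```

Under these made-up numbers the informed sequence finishes in 6 seconds and the blind one in 16, which is the sense in which additional information can pay for itself.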
Spam Filtering
Machine learning crucial in development of commercial-grade spam filters.
Problem:
- Input: bag of words and other features
- Output: likelihood the message is spam
Learning:
- lots of data, always changing
- human already in the loop (don't get feedback on suppressed messages)

Adaptive Filtering
A version of spam filtering amenable to RL.
- reward for delivering non-spam message (−10)
- punishment for delivering spam (+1)
- learn from (sparse) human feedback
Pitfalls:
- If spam/non-spam distinction is easy, encouraged toward the right behavior by opportunity costs.
- If distinction is hard, either deliver all or no messages (depending on how common spam is).
- Must encourage smart exploration early on so the system has a good chance to learn the distinction.
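The "deliver all or no messages" pitfall follows from a simple expected-value comparison. The sketch below assumes that suppressing a message yields zero reward and uses placeholder magnitudes (`ham_reward`, `spam_penalty`) chosen only for illustration; they are not the values from the slide.

```python
def expected_delivery_value(p_spam, ham_reward=1.0, spam_penalty=10.0):
    """Expected reward for delivering a message that is spam with probability p_spam."""
    return (1.0 - p_spam) * ham_reward - p_spam * spam_penalty


def should_deliver(p_spam, ham_reward=1.0, spam_penalty=10.0):
    """Deliver only if that beats suppressing, which is assumed to be worth 0."""
    return expected_delivery_value(p_spam, ham_reward, spam_penalty) > 0.0


def delivery_threshold(ham_reward=1.0, spam_penalty=10.0):
    """Spam probability at which delivering and suppressing are equally good."""
    return ham_reward / (ham_reward + spam_penalty)
```

If the learner cannot tell messages apart, every message gets the same `p_spam` (the base rate of spam), so the decision is identical for all of them: deliver everything when the base rate is below the threshold, nothing otherwise.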
Spam Tagging as RL
Messages arrive at a server. Server has a set of filter programs.
Message is spam if it fails any filter in the set.
Cost: computation time to process message.
- Try to run cheap / likely-to-fail filters first.
- Non-spam has a fixed cost; spam can be tagged quickly.
- Output always the same!
Sorting and SAT, also (Lagoudakis, Littman, Parr).

Other Relevant Applications
- Deadlock detection interval selection: How often should we check for deadlock to balance overhead and wasted time? [Earlier talk]
- Network routing in changing conditions: How do we decide when to find new routes?
- Wireless network rate selection: Rate adjustment depends on whether delays are due to congestion or noise.
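For the spam-tagging formulation above, the "cheap / likely-to-fail first" heuristic can be made concrete under one simplifying assumption: filters fail independently of each other. In that case, ordering filters by cost divided by failure probability minimizes expected processing time. This is a static sketch, not the learned policy from the talk, and the `(cost, p_fail)` pairs are made-up numbers.

```python
from itertools import permutations


def expected_cost(order):
    """Expected time to process one message.

    order: sequence of (cost, p_fail) pairs; processing stops at the first
    filter the message fails, since it can then be tagged as spam.
    """
    total, p_reached = 0.0, 1.0
    for cost, p_fail in order:
        total += p_reached * cost      # pay only if earlier filters all passed
        p_reached *= 1.0 - p_fail
    return total


def cheap_likely_first(filters):
    """Sort by cost / p_fail; filters that never fail go last."""
    def ratio(f):
        cost, p_fail = f
        return cost / p_fail if p_fail > 0 else float("inf")
    return sorted(filters, key=ratio)


# Hypothetical filters: (seconds of computation, probability the message fails).
filters = [(5.0, 0.5), (1.0, 0.2), (2.0, 0.8)]
best = cheap_likely_first(filters)
brute_force = min(expected_cost(p) for p in permutations(filters))
```

A message that passes every filter (non-spam) always pays the full sum of costs regardless of order, which matches the "fixed cost" remark; the ordering only matters for how quickly spam is caught.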
Sticky One: Network Security
Recognize intrusions. Prevent intrusion symptoms.
Hard to define rewards here.
- System needs to see both sides of the tradeoff so it doesn't solve security problems by turning off the network...
- +1 for legitimate use, −1 for unauthorized use
- Rewards (not just the policy) seem to require intrusion detection!

Algorithms
We have discussed problems that are better or worse suited to RL. Let's say we have a problem we're ready to attack; what algorithms are appropriate?
Families of RL Approaches
- policy search: π maps s to a. More direct use, less direct learning.
- value-function based: Q maps (s, a) to v. Search for action that maximizes value.
- model based: T, R map (s, a) to (s', r). Solve Bellman equations. More direct learning, less direct use.

Some Algorithms
- Model-based: Estimate T, R; solve approximate MDP. Prioritized sweeping, Dyna.
- Value-function-based: Use observed transitions to modify Q itself. Q-learning, SARSA.
- Policy search: Try out different policies to find the best. Policy gradient, genetic approaches.
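As one instance of the value-function-based family, here is a minimal sketch of the standard tabular Q-learning update, which uses each observed transition (s, a, r, s') to modify Q directly. The step size `alpha`, discount `gamma`, and the tiny two-state example are illustrative choices, not settings from the lecture.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions,
                      alpha=0.5, gamma=0.9, done=False):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    # No future value is bootstrapped from a terminal transition.
    best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]


def greedy_action(Q, s, actions):
    """Search for the action that maximizes the current value estimate."""
    return max(actions, key=lambda a: Q[(s, a)])


# Tiny example: from state 0, "right" leads to state 1; from state 1,
# "right" reaches the goal (+100). Each non-goal step costs 1.
actions = ["left", "right"]
Q = defaultdict(float)
q_learning_update(Q, 0, "right", -1.0, 1, actions)
q_learning_update(Q, 1, "right", 100.0, None, actions, done=True)
q_learning_update(Q, 0, "right", -1.0, 1, actions)
```

After the third update the goal reward has propagated back one step, so the greedy action in state 0 is "right".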
Mixed Bag
Of the three, model-based approaches appear to be most data efficient.
Model-based approaches still have the problem of solving the model.
In some cases, useful to cast the model-solving problem as an RL problem!
- Backgammon (Tesauro): Model known, value-function-based learning used to solve it.
- Helicopter (Ng et al.): Model acquired via expert experience, policy search used to solve it.

Summary Thoughts
RL formulation requires computable rewards.
- time to goal, if goal detectable
Future work: How to do RL when reward function must be learned autonomically?
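The two halves of a model-based approach mentioned above (estimate T and R, then solve the resulting approximate MDP) can be sketched as follows. This assumes a small discrete problem, uses maximum-likelihood counts for the model and plain value iteration for the solver, and treats states with no recorded outgoing transitions as terminal with value 0. It is not prioritized sweeping or Dyna.

```python
from collections import defaultdict


def estimate_model(transitions):
    """Estimate T and R from (s, a, r, s_next) samples by counting."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    T, R = {}, {}
    for sa, nexts in counts.items():
        n = sum(nexts.values())
        T[sa] = {s2: c / n for s2, c in nexts.items()}
        R[sa] = reward_sum[sa] / n
    return T, R


def value_iteration(T, R, gamma=0.9, sweeps=200):
    """Solve the estimated MDP with repeated Bellman backups."""
    actions_in = defaultdict(list)
    for s, a in T:
        actions_in[s].append(a)
    V = defaultdict(float)  # unseen (terminal) states keep value 0
    for _ in range(sweeps):
        for s, acts in actions_in.items():
            V[s] = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in acts
            )
    return dict(V)


# Time-to-goal rewards: every step costs 1 until the goal is reached.
samples = [(0, "go", -1.0, 1), (1, "go", -1.0, "goal"), (0, "stay", -1.0, 0)]
T, R = estimate_model(samples)
V = value_iteration(T, R)
```

The solver here is deliberately the simplest one; the point of the slide is that this "solve the model" step can itself be expensive.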
Some Robot Videos!
- Helicopter (Ng, Abbeel)
- Navigation #1 (Nouri)
- Navigation #2 (Nouri)
- Creative Learning (Walsh)
- Terrain Learning #2 (Leffler, Mansley, Edmunds)

Multiagent Reinforcement Learning
Pinky and The Brain
The RL Way
Reward optimization is a black box. If you want to influence the learning process, do it by manipulating the reward function!
Examples:
- shaping rewards (give hints about optimal policy) (Ng, Harada, Russell 99)
- intrinsic motivation (rewards associated with the learning process itself, like learning new things) (Barto, Singh, Chentanez 04)
- exploration bonus (encourage exploration via rewards for uncertainty) (Brafman & Tennenholtz 02)

Evolutionary Perspective
Chapman Cohen (1868-1954): "Human life, in line with animal life in general, has to develop not merely a dislike for such things as threaten life, but also a liking for their opposite. The development of this capacity means that in the long run the actions which promote pleasure, and those which preserve life, roughly coincide."
Multiagent RL
What is there to talk about?
- Nothing: It'll just work itself out (other agents are a complex part of the environment).
- A bit: Without a boost, learning to work with other agents is just too hard.
- A lot: Must be treated directly because it is fundamentally different from other learning.
Claim: Multiagent problems addressed via specialized shaping rewards.

Shaping Rewards
We're smart, but evolution doesn't trust us to plan all that far ahead.
Evolution programs us to want things likely to bring about what we need:
- taste/nutrition
- pleasure/procreation
- eye contact/care
- generosity/cooperation
Shaping Rewards in RL
Real task: Escape.
One definition of reward function: -1 for each step, +100 for escape.
Learning is too slow. If survival depends on escape, would not survive.
Alternative: Additional +10 for pushing any button. We call these shaping rewards.

Pros and Cons of Shaping
Can be really helpful.
- Not really the main task, but serve to encourage learning of pertinent parts of the model.
- Example: Babies like standing up.
Somewhat risky.
- Can distract the learner so it spends all its time gathering easy-to-find, but task-irrelevant rewards.
- Learner can't tell a real reward from a shaping reward.
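The "distraction" risk can be seen by plugging the escape example's numbers (-1 per step, +100 for escape, +10 per button push) into a quick comparison. The sketch assumes undiscounted returns, a fixed episode length, and that a button can be pushed on every step; none of these details are specified on the slides.

```python
def escape_return(steps_to_escape, step_cost=-1.0, escape_reward=100.0):
    """Return for a policy that escapes after the given number of steps."""
    return steps_to_escape * step_cost + escape_reward


def button_farming_return(horizon, step_cost=-1.0, button_bonus=10.0):
    """Return for a policy that pushes a button on every step and never escapes."""
    return horizon * (step_cost + button_bonus)


# With a 20-step episode, escaping on the last step earns 80,
# while pushing buttons the whole time earns 180.
escape_20 = escape_return(20)
farming_20 = button_farming_return(20)
```

Because the learner only sees the summed reward, under these assumptions it would prefer pushing buttons to escaping, which is exactly the failure the slide warns about.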
Why Have Social Rewards?
Big advantages for (safe) cooperation.
For reciprocal altruism, a species needs:
- repeated interactions
- ability to recognize conspecifics and discriminate against defectors
- incentive towards long-term over short-term gain
Necessary, but not sufficient: Must learn how.

Drives Linked with Altruism
To lead individuals to reap the benefits of reciprocal altruism, it's critical to:
- want to be around others,
- feel obligated to return favors,
- feel obligated to punish a defector.
Evidence that the reward centers of our brains urge precisely this behavior.
Does Rejection Hurt?
(Eisenberger et al. 03)
In snubbing condition, brain centers associated with physical pain become active.
Pain evident even when subjects barred from participation by technical difficulties.
From Time Magazine
Is Cooperation Pleasurable?
fMRI during repeated Prisoner's Dilemma (Rilling et al. 02).
Payoffs: $3 (tempt), $2 (coop), $1 (defect), $0 (sucker)
Mutual cooperation most common (rational).
Activation in reward center (area known to respond to desserts, pictures of pretty faces, money, cocaine) brighter for $2 (cooperative) payoff than for $3 (cheating) payoff.

Is Revenge Sweet?

Getting Even: Ultimatum Game
Proposer is given $10. Proposer offers x ∈ X to Responder.
Responder can take it or leave it.
- Take it: Responder gets x, Proposer gets $10-x
- Leave it: Both get nothing.
X = {2,8} or {2,5} or {2,2} or {2,0}
What Should Responder Do?
Fraction of time accepting x=2:

X        one-shot   repeated   human
{2,8}    100%       33%        70%
{2,5}    100%       0%         55%
{2,2}    100%       100%       80%
{2,0}    100%       100%       90%

Repeated game analysis (Littman & Stone 03)
Human results (Falk et al. 03)

Ultimatum: Discussion
Human results not rational (in the sense of maximizing utility).
Common elements with maximizing utility assuming a repeated setting. But, not quite.
Suggests other motivations/influences: reward for revenge.
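The one-shot column of the table follows directly from the payoff rules of the Ultimatum Game: a Responder who only maximizes its own dollars accepts any positive offer, whatever else the Proposer could have offered. The sketch below encodes just that one-shot reasoning; it does not reproduce the repeated-game analysis or the human data.

```python
TOTAL = 10  # dollars given to the Proposer


def payoffs(offer, accept):
    """Return (proposer, responder) payoffs in dollars."""
    return (TOTAL - offer, offer) if accept else (0, 0)


def one_shot_responder_accepts(offer):
    """A one-shot utility maximizer accepts iff taking beats leaving."""
    return payoffs(offer, True)[1] > payoffs(offer, False)[1]


# The alternative offer in X never enters the one-shot decision,
# so x = 2 is accepted for every X in the table.
accepts_two = one_shot_responder_accepts(2)
```

The gap between this prediction and the human column is what the discussion slide attributes to other motivations such as a reward for revenge.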
Other Reward Functions
Evidence that we have internal reward functions for some specific human-nature events appears in the popular press about once a month. Some recent ones:
- Love at First Sight
- Cuteness: Images of adorable kids and animals activate reward center.
- Schadenfreude

Eye Contact
Love at first sight. A research team led by Knut Kampe of the Institute of Cognitive Neuroscience at University College, London, has determined that eye contact with a pretty face (one judged to be attractive by the viewer [on variables such as radiance, empathy, cheerfulness, motherliness, and conventional beauty]) activates a pleasure center of the brain called the ventral striatum. Kampe's research, published in the journal Nature (2001), found that the brain-imaged pleasure response (which appears in a matter of seconds after viewing the face) only shows when mutual eye-contact is established, and does not show when looking into an attractive face whose eyes are averted or turned away.
Ha ha
Tania Singer at University College London and her colleagues, who published a schadenfreude paper in Nature, were not actually searching for schadenfreude when they used functional magnetic resonance imaging to watch the brains of subjects in action. Their primary interest was variation in levels of empathy, which can be detected by the activity in "pain-related areas" like the "fronto-insular and anterior cingulate cortices" of the brain when a person is watching someone else in pain. The empathy circuits lighted up in both men and women when bad things happened to good people. When bad things happened to bad people, the women in the study were still empathic. But not the men. Not only did they show less empathy toward bad people, but the reward center in the left nucleus accumbens lighted up. All that translates as "Serves him right!"

Evolutionary RL (Ackley & Littman 90)
Evolution valued health positively, predators negatively.
Tree senility: Value trees positively (defense against predators), negative long-term effects (no food).
Need sophisticated intelligence for rewards (emotions!)
Ackley's Video