Reinforcement Learning of Artificial Intelligence B659. Class meets Tu & Thur 2:30pm - 3:45pm in BH 330

Reinforcement Learning of Artificial Intelligence B659. Class meets Tu & Thur 2:30pm - 3:45pm in BH 330. Course webpage is on Canvas: schedule, slides, assignment submission, info about projects (later). Instructor: Adam White. You can find me in Lindley 201I; website: adamwhite.ca; email: adam@adamwhite.ca. Contact me via email, not Canvas. Your AIs are Matt and Su; contact info is on Canvas.

Text & other useful resources. Text: Reinforcement Learning: An Introduction (1998). We will use the second edition of Sutton and Barto exclusively: minor rearrangement of topics, changes of notation, new topics; free online: https://www.dropbox.com/s/b3psxv2r0ccmf80/book2015oct.pdf?dl=0. Csaba Szepesvari's book, Algorithms for Reinforcement Learning (2010): more theory, a few additional topics covered; free online: http://www.ualberta.ca/~szepesva/rlbook.html

Grading. 50% from 5 assignments: mostly questions from Sutton & Barto, plus one or more programming questions (a code framework will be provided). 5% from a short midterm quiz: it gives you an idea of how well you are tracking the course contents and shows what type of questions I will use on the final. 35% from a final project report or final exam: PhD students, and MS students with permission, can do a project; the rest must do a final exam.

Thought questions: 10% of your mark. The idea is to show you have read and thought about the reading material. You must ask a question! You must provide at least one possible answer! The answer cannot be one found in the textbook or lecture slides. You are showing me that you have read the text!

Thought questions/statements. Good example: "Setting parameters (e.g., the learning rate) in ML often involves cross-validation and a testing/training split; however, in RL data is produced interactively. This seems much more challenging in RL. What is a fair way to compare learning algorithms in RL? <QUESTION> One idea is to discretize the parameter set for each algorithm and report the performance with the best parameter setting. <ANSWER> Another idea would be to use some meta-learning algorithm to automatically tune the parameters of each method. <ANSWER>"

Thought questions/statements. If you submit thought questions in the correct form, i.e., a question with an answer, you get half the marks. To get more than half marks you have to ask a good question. Bad questions: "I don't understand Sarsa, can you explain it again?"; "There is a typo on page 7"; "Chapter 3 unrealistically assumes access to the model of the MDP. I think we should skip this chapter"; "How does reinforcement learning avoid overfitting?" <No answer>

Assignments. Assignments are one of the best ways to learn about RL. All assignments will be individual work. You can talk to your friends, but only at the ideas level: no details, no writing on the whiteboard, etc. You cannot share written answers or code. You cannot submit code you found online and modified; you must write your own from scratch. You cannot use ML packages like RLtoolkit, Python ML packages, TensorFlow, etc. All programming will be done in C.

Academic integrity. If you are caught cheating, copying, working together, or plagiarizing: you will be reported to the university; you may get a zero on the assignment/exam/project; you may get a fail in the course; you may be expelled. We have problems with this every single year. Last year people plagiarized and copied assignments. Don't let it happen to you!!

What is artificial intelligence? Get out some paper and write down a definition. What is machine learning? How do they differ?

What is artificial intelligence? "Intelligence is the most powerful phenomena in the universe" (Ray Kurzweil); the phenomenon is that there are systems in the universe that are well thought of as goal-seeking systems. "A science of mind... when people finally come to understand the principles of intelligence, what it is and how it works, well enough to design and create beings as intelligent as ourselves" (Sutton). "It is the science and engineering of making intelligent machines, especially intelligent computer programs. Intelligence is the computational part of the ability to achieve goals in the world." (John McCarthy)

What is machine learning? "A branch of computational statistics, with a specific focus on efficiency and scalability" (me). "Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel). Often AI and ML are interchanged. One way to keep it simple: AI defines a problem, a research goal, while ML defines a set of tools and a computational perspective; these tools can be applied to a variety of applications and can be used to help understand and replicate the principles of human intelligence.

What is reinforcement learning?

Goals and ambitions for the course. Learn the methods and foundational ideas of RL. Outcome: prepared to apply RL to a novel application. Outcome: prepared to do research in RL. Learn some new ways of thinking about AI research: the agent perspective; the interaction between learning and decision making; experience as an unending, temporally correlated stream; the agent takes actions and lives a life.

Things we will not get into: deep reinforcement learning, neural networks, nonlinear representations of state, large applications of RL. When we are done, you will be able to learn these topics on your own. This course will help demystify modern RL.

What is Reinforcement Learning? An approach to Artificial Intelligence: it is a problem specification, a class of methods, and a field of study. Learning from interaction. Goal-oriented learning. Learning about, from, and while interacting with an external & unknown environment. Learning what to do, how to map situations to actions, so as to maximize a numerical reward signal.

Key Features of RL. The learner is not told which actions to take: no teacher, no labels (not supervised learning). Trial-and-error search (learn by doing). Possibility of delayed reward: sacrifice short-term gains for greater long-term gains, e.g., in games like backgammon and most interesting problems. The need to trade off exploration and exploitation. Considers the whole problem of a goal-directed agent interacting with an uncertain environment.

Supervised, Unsupervised, & RL. In SL the system is told what the correct response should have been: given a set of labelled training examples (the teacher provides labels), the objective is to do well on unlabelled examples (generalize well to new examples). Unsupervised learning is about learning/uncovering the structure of unlabelled examples, e.g., finding a lower dimensional representation of the data; this certainly could be useful to an RL agent, but does not address the key problem of maximizing reward.

RL is influenced by and is influencing many fields (diagram after David Silver): computer science (machine learning), engineering (optimal control), mathematics (operations research), economics (bounded rationality), neuroscience (the reward system), and psychology (classical/operant conditioning), with reinforcement learning at their intersection.

Agent-environment interaction stream. Interaction produces a temporal stream of data: continual learning, acting, and planning. The objective is to affect the environment; the environment is stochastic and uncertain. [Diagram: the agent sends an action (response, control) to the environment (the world), which returns a state (stimulus, situation) and a reward (gain, payoff, cost).]

Example: Hajime Kimura's RL robots (slide from Sutton). [Videos: a robot learning to move backward, shown before and after learning; a new robot, same algorithm.]

RL + Deep Learning: performance on Atari games. [Plots: learning performance on Space Invaders, Breakout, and Enduro.]

RL + Deep Learning, applied to classic Atari games (Google DeepMind 2015, Bowling et al. 2012). Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone. Maps raw screen pixels to predictions of final score for each of 18 joystick actions. Learned to play better than all previous algorithms, and at human level for more than half the games. The same learning algorithm was applied to all 49 games, without human tuning. [Figure 1 from the Nature letter: schematic illustration of the convolutional neural network; the input is an 84 x 84 x 4 image produced by the preprocessing map, followed by three convolutional layers and two fully connected layers, with a single output for each valid action; each hidden layer is followed by a rectifier nonlinearity, max(0, x).]

Classic Examples of Reinforcement Learning. Elevator control (Crites & Barto): (probably) the world's best down-peak elevator controller. Helicopter control (Ng & Abbeel, https://www.youtube.com/watch?v=vcdxqn0fcne#t=22): can perform maneuvers that no human operator can; model-based RL. TD-Gammon and Jellyfish (Tesauro, Dahl): the world's best backgammon players.

More Examples of RL. Robot learning to walk (Schuitema, https://www.youtube.com/watch?v=sbf5efeiw, 00:45): uses methods you will learn about in this class. Octopus arm simulator (Engel, http://videolectures.net/icml07_engel_demo/, 9:00, 10:55): Bayesian temporal difference learning. Keepaway soccer (Sutton & Stone, http://www.cs.utexas.edu/~austinvilla/sim/keepaway/swf/learn360.swf): higher dimensional control using RL and tile coding. Go, Hearts, and other games.

Elements of RL. Policy: what to do; a mapping from situations to actions. Reward: what is good, immediately. Value: what is good because it predicts reward, over the longer term. Model: what follows what.

Rewards. A single scalar number, provided by you, the designer (not really that hard). The agent seeks to maximize total future reward; next-step reward optimization is usually suboptimal. The agent cannot change how rewards are generated; they are outside the agent's control. In general rewards are stochastic functions of the state of the environment. A good example of reward is physical pain and pleasure (negative and positive reward, respectively).

Value functions. The value of a state specifies how good the state is in terms of total future reward: the value function predicts long-term reward. Reward tells us the immediate goodness of moving from one state to another; values specify the long-term desirability of states, taking into account the states that usually follow. We might incur short-term negative reward to achieve higher value in the long term. The environment produces rewards in response to the agent's actions; the agent constructs an estimate of the value function to help select actions that yield high long-term reward.
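As a rough formalization (using Sutton & Barto's notation, and ignoring discounting and termination details that the course introduces later), the value of a state under a policy is the expected total future reward from that state, whereas the reward only measures the next step:

```latex
v_\pi(s) \;\approx\; \mathbb{E}_\pi\!\left[\, R_{t+1} + R_{t+2} + R_{t+3} + \cdots \;\middle|\; S_t = s \,\right]
```

The tic-tac-toe example later in this lecture estimates exactly this kind of quantity, with "probability of winning" playing the role of total future reward.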

Policies. A mapping from perceived states to actions, encoding how the agent behaves over time. Policies may be stochastic. They may be represented as a table, or involve some complex optimization.

Models. A model mimics the environment. A model might predict the next state and reward, given the current state and action: a prediction of what would happen in the environment if the agent took an action in some state; a simulation. Models are often used for planning: deciding a course of action by, for example, simulating different courses of action and choosing among them. In this course we will consider an approach to planning that is strongly connected with value function learning, very different from the classic approaches considered in GOFAI.

RLtoolkit demo

Limitations of course contents. The methods we will study assume there exists some state, with certain properties; almost all RL theory assumes this, and the algorithms work very well when we only have limited or incomplete access to state. Most methods covered will estimate value functions, a large chunk of modern RL; other, non-value-function methods such as evolutionary methods are possible but outside the scope. We will cover policy gradient methods that learn a parameterized policy and a value function.

An Extended Example: Tic-Tac-Toe. A two-player, turn-based game; win if you get three in a row. Assume a draw and a loss are equally bad. We assume we do not know anything about the opponent's strategy, and he/she is not perfect; instead we learn from experience generated by playing many games. Can we build an agent to exploit our opponent and maximize the chance of winning?

Possible solutions. Specify or learn a model of our opponent: is their strategy stationary? Evolutionary search: search the space of policies, maintain a population, with fitness measured via the probability of winning. Reinforcement learning: we are player X.

One way to approach this as an RL task: create a value function, a table of numbers, one for each state of the game, giving the probability of winning from that state (this takes into account what the policy does in the future). All states with three X's in a row have value 1.0; all states with three O's in a row have value 0.0; all draw states have value 0.0; initially set the rest to 0.5. The policy: to select a move, examine the possible next states from the current one (look ahead); most of the time pick the one with the largest value (greedy, or exploitive); occasionally pick randomly (exploratory).

An RL Approach to Tic-Tac-Toe. 1. Make a table with one entry per state, V(s), the estimated probability of winning from state s: 1 for a win, 0 for a loss or a draw, and 0.5 (a guess) for everything else. 2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states and pick the one with the highest estimated probability of winning, i.e., the largest V(s): a greedy move. But 10% of the time pick a move at random: an exploratory move.
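To make this slide concrete, here is a minimal C sketch of the value table and the one-step lookahead move selection. This is an illustration only, not the course's code framework; the board encoding and the helper names (state_index, winner, select_move) are assumptions made here.

```c
/* Minimal sketch of the tic-tac-toe value table and move selection.
   Illustration only, not the course framework. Board: 9 cells, each
   0 = empty, 1 = X (us), 2 = O; a state is the base-3 encoding of the
   cells, so the table has 3^9 = 19683 entries. */
#include <stdlib.h>

#define NUM_STATES 19683   /* 3^9 */
#define EPSILON    0.10    /* fraction of exploratory moves */

static double V[NUM_STATES];   /* V(s): estimated probability of winning */

/* base-3 index of a board */
static int state_index(const int board[9]) {
    int s = 0;
    for (int i = 8; i >= 0; i--) s = 3 * s + board[i];
    return s;
}

/* 1 if X has three in a row, 2 if O does, 0 otherwise */
static int winner(const int board[9]) {
    static const int lines[8][3] = {
        {0,1,2},{3,4,5},{6,7,8},{0,3,6},{1,4,7},{2,5,8},{0,4,8},{2,4,6}};
    for (int k = 0; k < 8; k++) {
        int a = board[lines[k][0]];
        if (a != 0 && a == board[lines[k][1]] && a == board[lines[k][2]])
            return a;
    }
    return 0;
}

/* wins get 1.0, losses and draws get 0.0, everything else starts at 0.5 */
static void init_values(void) {
    for (int s = 0; s < NUM_STATES; s++) {
        int b[9], t = s, empties = 0;
        for (int i = 0; i < 9; i++) { b[i] = t % 3; t /= 3; empties += (b[i] == 0); }
        int w = winner(b);
        if (w == 1)            V[s] = 1.0;   /* three X's in a row: win     */
        else if (w == 2)       V[s] = 0.0;   /* three O's in a row: loss    */
        else if (empties == 0) V[s] = 0.0;   /* full board, no winner: draw */
        else                   V[s] = 0.5;   /* unknown: initial guess      */
    }
}

/* one-step lookahead: usually greedy, exploratory with probability EPSILON;
   assumes the board has at least one empty cell */
static int select_move(const int board[9]) {
    int moves[9], n = 0;
    for (int i = 0; i < 9; i++) if (board[i] == 0) moves[n++] = i;
    if ((double)rand() / RAND_MAX < EPSILON)
        return moves[rand() % n];            /* exploratory move */
    int best = moves[0];
    double best_v = -1.0;
    for (int k = 0; k < n; k++) {
        int b[9];
        for (int i = 0; i < 9; i++) b[i] = board[i];
        b[moves[k]] = 1;                     /* try playing X in this cell */
        double v = V[state_index(b)];
        if (v > best_v) { best_v = v; best = moves[k]; }
    }
    return best;                             /* greedy move */
}
```

Any real implementation could use a different state encoding or exploration rate; the point is only that the "table plus one-step lookahead" idea fits in a few dozen lines.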

RL Learning Rule for Tic-Tac-Toe. [Figure from Sutton & Barto: a sequence of game positions alternating between the opponent's moves and our moves, with an exploratory move marked by * and backup arrows along the greedy moves.] Let s be the state before our greedy move and s' the state after our greedy move. We increment each V(s) toward V(s'), a backup: V(s) ← V(s) + α[V(s') - V(s)], where α is the step-size parameter, a small positive fraction, e.g., α = 0.1.
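Continuing the sketch above (again an illustration, not the course framework), the backup on this slide is a one-line update to the value table:

```c
/* Back up the value of the state before our greedy move (s) toward the
   value of the state after it (s_prime):
       V(s) <- V(s) + alpha * [ V(s') - V(s) ]
   Uses the V[] table from the previous sketch; ALPHA is the step size. */
#define ALPHA 0.1

static void backup(int s, int s_prime) {
    V[s] += ALPHA * (V[s_prime] - V[s]);
}
```

During play one would call backup(state_index(before), state_index(after)) after each greedy move; a later slide discusses what changes if exploratory moves are also backed up.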

Learning rule. If we reduce the step size over time, this approach converges, giving the optimal moves against a fixed opponent. If we keep the step size small but constant, this approach tracks, and plays well against, opponents that change their strategy over time. Where is the reward?
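For reference (standard stochastic-approximation background, not something this slide proves), "reducing the step size over time" is usually taken to mean a schedule satisfying the conditions below, for example α_t = 1/t:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty
```

A constant step size violates the second condition, which is exactly why it keeps adapting (tracking) rather than converging.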

Attributes of this simple task. Learning while interacting: learning affects how we play, which requires learning the values of the new policy, which changes how we play, which... A clear goal. Delayed consequences of actions. Sophisticated behavior without a model of the opponent or a search over action sequences: just a value function + a one-step model. RL methods can be applied when no model is available. In the beginning the agent didn't know anything about its action consequences; how could we inject prior knowledge?

How can we improve this T.T.T. player? Do we need random moves? Why? Do we always need a full 10%? Can we learn from random moves? Can we learn offline? Pretraining from self play? Using learned models of opponent?...

What would happen if we learned from exploratory moves? Not doing so means we learn the probability of winning under optimal play: the probability of winning from the current state if we choose some action and then play optimally from then on. Learning from exploratory moves means we learn the probability of winning under the policy that includes exploration: the estimates take into account that we sometimes explore, and this will likely result in different moves (sometimes safer moves). If we continue to explore forever, this second approach may end up being better: winning more games.

How is Tic-Tac-Toe too easy? A finite, small number of states: backgammon, for example, has on the order of 10^20, and Go has more unique configurations than atoms in the universe! One-step lookahead is always possible in TTT. The state is completely observable...

The Course. Part I: The Problem (Introduction; Evaluative Feedback; The Reinforcement Learning Problem). Part II: Elementary Solution Methods (Dynamic Programming; Monte Carlo Methods; Temporal Difference Learning). Part III: A Unified View (Eligibility Traces; Generalization and Function Approximation; Planning and Learning). Advanced topics (see Canvas): RL in psychology and animal learning; Case Studies.

Next Class (Thursday): Read Chapter 2 of Sutton & Barto (2016); you can skip 2.7. I find the history section at the end particularly interesting! Two thought questions about chapters 1 & 2 are due Monday the 16th (the day before class, @ 11:59pm) so that we can discuss them in class. Assignment #1 will be released today; we will talk about it next time.