An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto. Department of Computer Science, University of Massachusetts Amherst


1 An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING Andrew G. Barto Department of Computer Science, University of Massachusetts Amherst UPF Lecture 2 Autonomous Learning Laboratory, Department of Computer Science

2 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL

3 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning

4 Lecture 2, Part 1: Dynamic Programming Objectives of this part: Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP) Show how DP can be used to compute value functions, and hence, optimal policies Discuss efficiency and utility of DP

5 Policy Evaluation Policy Evaluation: for a given policy π, compute the state-value function $V^\pi$. Recall the state-value function for policy π: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$. Bellman equation for $V^\pi$: $V^\pi(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$, a system of $|S|$ simultaneous linear equations

6 Iterative Methods $V_0 \to V_1 \to \cdots \to V_k \to V_{k+1} \to \cdots \to V^\pi$ (each arrow is a sweep). A sweep consists of applying a backup operation to each state. A full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]$
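A minimal tabular sketch of this policy-evaluation sweep (not from the slides), assuming the model is given as P[s][a] = list of (probability, next_state, reward) triples; all names are illustrative.

```python
import numpy as np

def policy_evaluation_sweep(V, policy, P, gamma):
    """One synchronous sweep of the full policy-evaluation backup:
    V_{k+1}(s) = sum_a pi(s,a) sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma * V_k(s')].
    P[s][a] is assumed to be a list of (prob, next_state, reward) triples."""
    n_states = len(V)
    V_new = np.zeros_like(V)
    for s in range(n_states):
        for a, pi_sa in enumerate(policy[s]):          # pi(s, a)
            for prob, s_next, reward in P[s][a]:       # P^a_{ss'}, R^a_{ss'}
                V_new[s] += pi_sa * prob * (reward + gamma * V[s_next])
    return V_new

def evaluate_policy(policy, P, n_states, gamma=1.0, theta=1e-8):
    """Repeat sweeps until the value function stops changing (V_k -> V^pi)."""
    V = np.zeros(n_states)
    while True:
        V_new = policy_evaluation_sweep(V, policy, P, gamma)
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```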

7 A Small Gridworld An undiscounted episodic task Nonterminal states: 1, 2,..., 14; one terminal state (shown twice as shaded squares) Actions that would take the agent off the grid leave the state unchanged Reward is -1 on all transitions until the terminal state is reached

8 Iterative Policy Eval for the Small Gridworld π = random (uniform) action choices

9 Policy Improvement Suppose we have computed $V^\pi$ for a deterministic policy π. For a given state s, would it be better to do an action $a \neq \pi(s)$? The value of doing a in state s is: $Q^\pi(s,a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$. It is better to switch to action a for state s if and only if $Q^\pi(s,a) > V^\pi(s)$

10 Policy Improvement Cont. Do this for all states to get a new policy $\pi'$ that is greedy with respect to $V^\pi$: $\pi'(s) = \arg\max_a Q^\pi(s,a) = \arg\max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]$. Then $V^{\pi'} \geq V^\pi$
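Continuing with the same assumed P[s][a] layout as in the policy-evaluation sketch, a minimal sketch of the greedy policy-improvement step.

```python
import numpy as np

def action_values(V, P, s, gamma):
    """Q^pi(s,a) = sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma * V^pi(s')] for every action a."""
    return np.array([
        sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])
        for a in range(len(P[s]))
    ])

def greedy_policy(V, P, gamma):
    """pi'(s) = argmax_a Q^pi(s, a): the deterministic policy greedy w.r.t. V^pi."""
    return [int(np.argmax(action_values(V, P, s, gamma))) for s in range(len(P))]
```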

11 Policy Improvement Cont. What if $V^{\pi'} = V^\pi$? i.e., for all $s \in S$, $V^{\pi'}(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^{\pi'}(s')]$? But this is the Bellman Optimality Equation. So $V^{\pi'} = V^*$ and both $\pi$ and $\pi'$ are optimal policies.

12 Policy Iteration $\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^* \to \pi^*$, alternating policy evaluation with policy improvement ("greedification")

13 Value Iteration Recall the full policy-evaluation backup: $V_{k+1}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]$. Here is the full value-iteration backup: $V_{k+1}(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]$
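And a corresponding value-iteration sketch under the same assumed model layout (a sketch, not the slides' own code).

```python
import numpy as np

def value_iteration(P, n_states, gamma=1.0, theta=1e-8):
    """Full value-iteration backups:
    V_{k+1}(s) = max_a sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma * V_k(s')]."""
    V = np.zeros(n_states)
    while True:
        V_new = np.array([
            max(sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < theta:   # stop when the backups change nothing
            return V_new
        V = V_new
```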

14 Asynchronous DP All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: Repeat until a convergence criterion is met: Pick a state at random and apply the appropriate backup Still needs lots of computation, but does not get locked into hopelessly long sweeps Can you select states to back up intelligently? YES: an agent's experience can act as a guide.

15 Efficiency of DP Finding an optimal policy is polynomial in the number of states BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called the "curse of dimensionality"). In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation. It is surprisingly easy to come up with MDPs for which DP methods are not practical.

16 Summary Policy evaluation: backups without a max Policy improvement: form a greedy policy, if only locally Policy iteration: alternate the above two processes Value iteration: backups with a max Full backups (to be contrasted later with sample backups) Asynchronous DP: a way to avoid exhaustive sweeps Bootstrapping: updating estimates based on other estimates

17 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL

18 Lecture 2, Part 2: Simple Monte Carlo Methods Simple Monte Carlo methods learn from complete sample returns Only defined for episodic tasks Simple Monte Carlo methods learn directly from experience On-line: No model necessary Simulated: No need for a full model

19 (First-visit) Monte Carlo policy evaluation
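The algorithm box for this slide did not survive transcription. Below is a minimal Python sketch of first-visit Monte Carlo policy evaluation, assuming a hypothetical generate_episode() helper (not from the slides) that returns one episode under π as a list of (state, reward) pairs, with reward the immediate reward received after leaving that state.

```python
from collections import defaultdict

def first_visit_mc_evaluation(generate_episode, n_episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation: average, for each state,
    the returns that follow the state's first visit in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(n_episodes):
        episode = generate_episode()            # [(s_0, r_1), (s_1, r_2), ...]
        first_visit = {}                        # state -> index of its first visit
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # Compute the return G_t for every time step, working backwards.
        returns = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns[t]
            returns_count[s] += 1
            V[s] = returns_sum[s] / returns_count[s]
    return V
```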

20 Backup diagram for Simple Monte Carlo Entire episode included Only one choice at each state (unlike DP) MC does not bootstrap Time required to estimate one state does not depend on the total number of states

21 Monte Carlo Estimation of Action Values (Q) Monte Carlo is most useful when a model is not available $Q^\pi(s,a)$: average return starting from state s and action a, following π Also converges asymptotically if every state-action pair is visited infinitely often We are really interested in estimates of $V^*$ and $Q^*$, i.e., Monte Carlo Control

22 Learning about π while following $\pi'$

23 Summary MC has several advantages over DP: Can learn directly from interaction with the environment No need for full models No need to learn about ALL states Less harmed by violations of the Markov property (later in book) MC methods provide an alternate policy evaluation process One issue to watch for: maintaining sufficient exploration exploring starts, soft policies No bootstrapping (as opposed to DP) Estimating values for one policy while behaving according to another policy: importance sampling

24 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Simple Monte Carlo methods Dynamic Programming Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL

25 Lecture 2, Part 3: Temporal Difference Learning Objectives of this part: Introduce Temporal Difference (TD) learning Focus first on policy evaluation, or prediction, methods Then extend to control methods

26 TD Prediction Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function $V^\pi$. Recall the simple (every-visit) Monte Carlo method: $V(s_t) \leftarrow V(s_t) + \alpha [R_t - V(s_t)]$, target: the actual return after time t. The simplest TD method, TD(0): $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$, target: an estimate of the return

27 Simple Monte Carlo $V(s_t) \leftarrow V(s_t) + \alpha [R_t - V(s_t)]$ (constant-α MC), where $R_t$ is the actual return following state $s_t$. [Backup diagram: the entire sampled episode from $s_t$ down to the terminal state T]

28 cf. Dynamic Programming $V(s_t) \leftarrow E_\pi\{r_{t+1} + \gamma V(s_{t+1})\}$ [Backup diagram: one step, taking an expectation over all possible $r_{t+1}, s_{t+1}$]

29 Simplest TD Method $V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$ [Backup diagram: one sampled transition $s_t \to r_{t+1}, s_{t+1}$]
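A minimal sketch of this TD(0) update loop, assuming hypothetical env_reset/env_step/policy interfaces (not from the slides).

```python
from collections import defaultdict

def td0_prediction(env_reset, env_step, policy, n_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma*V(s_{t+1}) - V(s_t)].
    Assumed interfaces: env_reset() -> initial state;
    env_step(s, a) -> (next_state, reward, done); policy(s) -> action."""
    V = defaultdict(float)            # unseen states (including terminals) default to 0
    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```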

30 TD Bootstraps and Samples Bootstrapping: update involves an estimate MC does not bootstrap DP bootstraps TD bootstraps Sampling: update does not involve an expected value MC samples DP does not sample TD samples

31 Advantages of TD Learning TD methods do not require a model of the environment, only experience TD, but not MC, methods can be fully incremental You can learn before knowing the final outcome Less memory Less peak computation You can learn without the final outcome, from incomplete sequences Both MC and TD converge (under certain assumptions), but which is faster/better?

32 Random Walk Example equiprobable transitions, α = 0.1 Values learned by TD(0) after various numbers of episodes

33 TD and MC on the Random Walk Data averaged over 100 sequences of episodes

34 You are the Predictor Suppose you observe the following 8 episodes: A, 0, B, 0; B, 1; B, 1; B, 1; B, 1; B, 1; B, 1; B, 0. V(A)? V(B)?

35 You are the Predictor V(A)?

36 You are the Predictor The prediction that best matches the training data is V(A) = 0 This minimizes the mean-squared error on the training set This is what a Monte Carlo method gets If we consider the sequentiality of the problem, then we would set V(A) = 0.75 This is correct for the maximum-likelihood estimate of a Markov model generating the data i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?) This is called the certainty-equivalence estimate This is what TD(0) gets

37 Learning An Action-Value Function Estimate $Q^\pi$ for the current behavior policy π. After every transition from a nonterminal state $s_t$, do this: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$. If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$.

38 Sarsa: On-Policy TD Control Turn this into a control method by always updating the policy to be greedy with respect to the current estimate. The update uses the quintuple $(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$, which gives Sarsa its name.

39 Q-Learning: Off-Policy TD Control One-step Q-learning: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$
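A hedged sketch of one-step Q-learning with an ε-greedy behavior policy; env_reset, env_step, and actions are assumed interfaces, not part of the slides.

```python
import random
from collections import defaultdict

def q_learning(env_reset, env_step, actions, n_episodes,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    """One-step Q-learning (off-policy TD control):
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    Q = defaultdict(float)                        # Q[(s, a)], 0 by default

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env_reset()
        done = False
        while not done:
            a = epsilon_greedy(s)                 # behavior policy: epsilon-greedy
            s_next, r, done = env_step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next                            # target policy: greedy (the max above)
    return Q
```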

40 Cliffwalking ε-greedy, ε = 0.1

41 Actor-Critic Architecture [Block diagram with components: Environment, Actions, Situations or States, Primary Critic, Primary Rewards, Adaptive Critic, Effective Rewards (involves values), Actor]

42 Actor-Critic Methods Explicit representation of policy as well as value function Minimal computation to select actions Can learn an explicit stochastic policy Can put constraints on policies Appealing as psychological and neural models

43 Actor-Critic Details The TD error is used to evaluate actions: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. If actions are determined by preferences, p(s,a), as follows: $\pi_t(s,a) = \Pr\{a_t = a \mid s_t = s\} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$, then you can update the preferences like this: $p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$
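A minimal sketch of one actor-critic update and the Gibbs/softmax policy above, assuming tabular NumPy arrays V (critic) and p (actor preferences); names and step sizes are illustrative.

```python
import numpy as np

def actor_critic_step(V, p, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=1.0):
    """One actor-critic update. V[s] are state values; p[s, a] are preferences."""
    # Critic: TD error  delta_t = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha * delta
    # Actor: nudge the preference of the action just taken by the TD error
    p[s, a] += beta * delta
    return delta

def softmax_policy(p, s):
    """pi(s,a) = exp(p(s,a)) / sum_b exp(p(s,b)) over preferences."""
    prefs = p[s] - np.max(p[s])        # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()
```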

44 Afterstates Usually, a state-value function evaluates states in which the agent can take an action. But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful? What is this in general?

45 Summary TD prediction Introduced one-step tabular model-free TD methods Extend prediction to control by employing some form of GPI On-policy control: Sarsa Off-policy control: Q-learning These methods bootstrap and sample, combining aspects of DP and MC methods

46 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL

47 Lecture 2, Part 4: Unified Perspective

48 n-step TD Prediction Idea: Look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)

49 Mathematics of n-step TD Prediction Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$. TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$, using V to estimate the remaining return. n-step TD: 2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$; n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
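A small sketch (not from the slides) that computes the n-step return $R_t^{(n)}$ for one recorded episode, under the assumed conventions rewards[k] = r_{k+1} and values[k] = V_t(s_k) for the nonterminal states of the episode.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """R_t^{(n)} = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * V_t(s_{t+n}).
    If the episode ends before step t+n, the return is just the remaining
    (Monte Carlo) sum of rewards, with no bootstrap term."""
    T = len(rewards)                       # episode length
    G = 0.0
    steps = min(n, T - t)
    for k in range(steps):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                          # s_{t+n} is nonterminal: bootstrap from V
        G += (gamma ** n) * values[t + n]
    return G
```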

50 Learning with n-step Backups Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha [R_t^{(n)} - V_t(s_t)]$. Error reduction property of n-step returns: $\max_s |E_\pi\{R_t^{(n)} \mid s_t = s\} - V^\pi(s)| \leq \gamma^n \max_s |V(s) - V^\pi(s)|$ (maximum error using the n-step return vs. maximum error using V). Using this, you can show that n-step methods converge

51 Random Walk Examples How does 2-step TD work here? How about 3-step TD?

52 A Larger Example Task: 19 state random walk Do you think there is an optimal n (for everything)?

53 Averaging n-step Returns n-step methods were introduced mainly to help with understanding TD(λ) Idea: backup an average of several returns, e.g., backup half of the 2-step and half of the 4-step return: $R_t^{avg} = \frac{1}{2} R_t^{(2)} + \frac{1}{2} R_t^{(4)}$ One backup of this kind is called a complex backup: draw each component and label it with the weight for that component

54 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, weighted by $\lambda^{n-1}$ (time since visitation). λ-return: $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$. Backup using the λ-return: $\Delta V_t(s_t) = \alpha [R_t^\lambda - V_t(s_t)]$
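A small sketch of the λ-return for every step of a finished episode, using the equivalent backward recursion $R_t^\lambda = r_{t+1} + \gamma[(1-\lambda)V(s_{t+1}) + \lambda R_{t+1}^\lambda]$ rather than the explicit weighted sum; the rewards/values conventions are the same assumptions as in the n-step sketch above.

```python
def lambda_returns(rewards, values, lam, gamma=1.0):
    """Compute R_t^lambda for every t of one episode via the recursion
    R_t^lambda = r_{t+1} + gamma * [(1-lambda)*V(s_{t+1}) + lambda*R_{t+1}^lambda],
    which equals the weighted sum (1-lambda) * sum_n lambda^(n-1) R_t^{(n)}.
    rewards[k] is r_{k+1}; values[k] is V(s_k); the terminal state is worth 0."""
    T = len(rewards)
    G = [0.0] * T
    G_next = 0.0                           # lambda-return beyond the terminal state is 0
    for t in reversed(range(T)):
        v_next = values[t + 1] if t + 1 < T else 0.0
        G[t] = rewards[t] + gamma * ((1 - lam) * v_next + lam * G_next)
        G_next = G[t]
    return G
```

With lam = 0 each G[t] reduces to the TD(0) target, and with lam = 1 it reduces to the Monte Carlo return, matching the relations on the slides below.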

55 λ-return Weighting Function

56 Relation to TD(0) and MC The λ-return can be rewritten as: $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$ (truncated n-step returns until termination, plus the full return after termination). If λ = 1, you get MC: $R_t^\lambda = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$. If λ = 0, you get TD(0): $R_t^\lambda = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$

57 Forward View of TD(λ) Look forward from each state to determine the update from future states and rewards:

58 λ-return on the Random Walk Same 19 state random walk as before Why do you think intermediate values of λ are best?

59 Backward View of TD(λ) The forward view was for theory; the backward view is for mechanism New variable called the eligibility trace: $e_t(s) \in \mathbb{R}^+$ On each step, decay all traces by γλ and increment the trace for the current state by 1 Accumulating trace: $e_t(s) = \gamma\lambda e_{t-1}(s)$ if $s \neq s_t$; $e_t(s) = \gamma\lambda e_{t-1}(s) + 1$ if $s = s_t$

60 Backward View $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$ Shout $\delta_t$ backwards over time The strength of your voice decreases with temporal distance by γλ

61 On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
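A minimal Python rendering of the tabular TD(λ) algorithm above with accumulating traces; env_reset, env_step, and policy are assumed interfaces, not from the slides.

```python
from collections import defaultdict

def td_lambda(env_reset, env_step, policy, n_episodes,
              alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) with accumulating eligibility traces.
    Assumed interfaces: env_step(s, a) -> (next_state, reward, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)               # eligibility traces, reset each episode
        s = env_reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(s, a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                      # accumulating trace for the current state
            for state in list(e.keys()):     # all states with a nonzero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam      # decay every trace by gamma*lambda
            s = s_next
    return V
```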

62 Relation of Backwards View to MC & TD(0) Using the update rule $\Delta V_t(s) = \alpha \delta_t e_t(s)$: As before, if you set λ to 0, you get TD(0) If you set λ to 1, you get MC but in a better way: you can apply TD(1) to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode)

63 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating The book shows that, summed over an episode, the backward updates equal the forward updates: $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k = \sum_{t=0}^{T-1} \Delta V_t^\lambda(s_t) I_{s s_t}$ (algebra shown in book) On-line updating with small α is similar

64 Control: Sarsa(λ) Save eligibility traces for state-action pairs instead of just states: $e_t(s,a) = \gamma\lambda e_{t-1}(s,a) + 1$ if $s = s_t$ and $a = a_t$, otherwise $\gamma\lambda e_{t-1}(s,a)$. Then $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$ with $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

65 Sarsa(λ) Algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using the policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    Until s is terminal
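A minimal Python rendering of the Sarsa(λ) algorithm above, again with assumed env_reset/env_step/actions interfaces and an ε-greedy policy (illustrative names only).

```python
import random
from collections import defaultdict

def sarsa_lambda(env_reset, env_step, actions, n_episodes,
                 alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating eligibility traces."""
    Q = defaultdict(float)                         # Q[(s, a)]

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        e = defaultdict(float)                     # e[(s, a)], reset each episode
        s = env_reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env_step(s, a)
            a_next = epsilon_greedy(s_next)
            q_next = 0.0 if done else Q[(s_next, a_next)]
            delta = r + gamma * q_next - Q[(s, a)]
            e[(s, a)] += 1.0
            for key in list(e.keys()):             # all (s, a) with a nonzero trace
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s_next, a_next
    return Q
```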

66 Sarsa(λ) Gridworld Example With one trial, the agent has much more information about how to get to the goal (though not necessarily by the best route) Can considerably accelerate learning

67 Conclusions Eligibility traces provide an efficient, incremental way to combine MC and TD Includes advantages of MC (can deal with lack of the Markov property) Includes advantages of TD (uses the TD error, bootstrapping) Can significantly speed learning Does have a cost in computation

68 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL

69 TD-error $\delta_t = r_t + V_t - V_{t-1}$ [Figure: the value prediction V and TD error δ when a cue regularly predicts the reward r over an interval, shown early in learning, after learning is complete, and when the reward is omitted]

70 Dopamine Neurons and TD Error W. Schultz et al., Université de Fribourg

71 Dopamine Modulated Synaptic Plasticity

72 Basal Ganglia as Adaptive Critic Architecture Houk, Adams, & Barto, 1995

73 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
