An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING. Andrew G. Barto. Department of Computer Science, University of Massachusetts Amherst
1 An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING Andrew G. Barto Department of Computer Science University of Massachusetts Amherst UPF Lecture 2 Autonomous Learning Laboratory, Department of Computer Science
2 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL A. G. Barto, Barcelona Lectures, April. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press.
3 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning
4 Lecture 2, Part 1: Dynamic Programming Objectives of this part: Overview of a collection of classical solution methods for MDPs known as Dynamic Programming (DP) Show how DP can be used to compute value functions, and hence, optimal policies Discuss efficiency and utility of DP
5 Policy Evaluation Policy Evaluation: for a given policy π, compute the state-value function V^π. Recall the state-value function for policy π:

V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Bellman equation for V^π:

V^π(s) = Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

a system of |S| simultaneous linear equations
6 Iterative Methods

V_0 → V_1 → ... → V_k → V_{k+1} → ... → V^π

A sweep consists of applying a backup operation to each state. A full policy-evaluation backup:

V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
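As a concrete illustration, iterative policy evaluation can be sketched in a few lines of Python. The two-state MDP and uniform policy below are hypothetical, invented just for this example; the sweep applies the full backup to each state in turn and stops when the largest change falls below a threshold.

```python
gamma = 0.9

# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
# Uniform random policy pi(s, a).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}

V = {s: 0.0 for s in P}
for sweep in range(1000):               # repeat sweeps until convergence
    delta = 0.0
    for s in P:                         # one full backup per state = a sweep
        v = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                               for p, s2, r in P[s][a])
                for a in P[s])
        delta = max(delta, abs(v - V[s]))
        V[s] = v
    if delta < 1e-10:
        break
```

Updating V in place during the sweep (rather than from a frozen copy of V_k) also converges to V^π and is how DP sweeps are usually implemented.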
7 A Small Gridworld An undiscounted episodic task Nonterminal states: 1, 2, ..., 14; one terminal state (shown twice as shaded squares) Actions that would take the agent off the grid leave the state unchanged Reward is -1 on all transitions until the terminal state is reached
8 Iterative Policy Evaluation for the Small Gridworld π = equiprobable random action choices
9 Policy Improvement Suppose we have computed V^π for a deterministic policy π. For a given state s, would it be better to take an action a ≠ π(s)? The value of taking a in state s is:

Q^π(s,a) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

It is better to switch to action a for state s if and only if Q^π(s,a) > V^π(s)
10 Policy Improvement (cont.) Do this for all states to get a new policy π' that is greedy with respect to V^π:

π'(s) = argmax_a Q^π(s,a) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]

Then V^{π'} ≥ V^π
11 Policy Improvement (cont.) What if V^{π'} = V^π? That is, for all s ∈ S:

V^{π'}(s) = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^{π'}(s') ]

But this is the Bellman optimality equation. So V^{π'} = V*, and both π and π' are optimal policies.
12 Policy Iteration

π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*

alternating policy evaluation with policy improvement (greedification)
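The evaluate-then-greedify alternation can be sketched directly in Python. The deterministic two-state MDP below is hypothetical; evaluation runs the iterative backup for the fixed policy, and improvement greedifies with respect to the resulting V, stopping when the policy no longer changes.

```python
gamma = 0.9

# Hypothetical deterministic 2-state, 2-action MDP: P[s][a] = (next_state, reward).
P = {0: {0: (1, 0.0), 1: (0, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}

policy = {0: 0, 1: 0}
while True:
    # Policy evaluation: iterate the backup for the current fixed policy.
    V = {s: 0.0 for s in P}
    for _ in range(2000):
        for s in P:
            s2, r = P[s][policy[s]]
            V[s] = r + gamma * V[s2]
    # Policy improvement: greedify with respect to V.
    stable = True
    for s in P:
        best = max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
        if best != policy[s]:
            policy[s] = best
            stable = False
    if stable:
        break
```

Because the policy strictly improves until it is greedy with respect to its own value function, the loop terminates after finitely many rounds on any finite MDP.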
13 Value Iteration Recall the full policy-evaluation backup:

V_{k+1}(s) ← Σ_a π(s,a) Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]

Here is the full value-iteration backup:

V_{k+1}(s) ← max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V_k(s') ]
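The value-iteration backup replaces the average over the policy with a max over actions. A minimal sketch, again on a hypothetical deterministic two-state MDP:

```python
gamma = 0.9

# Hypothetical deterministic 2-state, 2-action MDP: P[s][a] = (next_state, reward).
P = {0: {0: (1, 0.0), 1: (0, 1.0)},
     1: {0: (0, 0.0), 1: (1, 2.0)}}

V = {s: 0.0 for s in P}
for _ in range(1000):
    for s in P:
        # Full value-iteration backup: max over actions instead of
        # averaging under a fixed policy.
        V[s] = max(r + gamma * V[s2] for s2, r in P[s].values())
```

A greedy policy read off the converged V is optimal; no separate evaluation phase is needed.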
14 Asynchronous DP All the DP methods described so far require exhaustive sweeps of the entire state set. Asynchronous DP does not use sweeps. Instead it works like this: repeat until a convergence criterion is met: pick a state at random and apply the appropriate backup. Still needs lots of computation, but does not get locked into hopelessly long sweeps. Can you select states to back up intelligently? YES: an agent's experience can act as a guide.
15 Efficiency of DP Finding an optimal policy is polynomial in the number of states. BUT the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called the "curse of dimensionality"). In practice, classical DP can be applied to problems with a few million states. Asynchronous DP can be applied to larger problems, and is appropriate for parallel computation. It is surprisingly easy to come up with MDPs for which DP methods are not practical.
16 Summary Policy evaluation: backups without a max Policy improvement: form a greedy policy, if only locally Policy iteration: alternate the above two processes Value iteration: backups with a max Full backups (to be contrasted later with sample backups) Asynchronous DP: a way to avoid exhaustive sweeps Bootstrapping: updating estimates based on other estimates
17 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
18 Lecture 2, Part 2: Simple Monte Carlo Methods Simple Monte Carlo methods learn from complete sample returns Only defined for episodic tasks Simple Monte Carlo methods learn directly from experience On-line: No model necessary Simulated: No need for a full model
19 (First-visit) Monte Carlo policy evaluation
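First-visit MC policy evaluation can be sketched in a few lines, assuming a hypothetical corridor task invented for the example: states 0 and 1 nonterminal, 2 terminal; from 1, right terminates and left goes to 0; from 0, right goes to 1 and left stays put; reward -1 per step, undiscounted, uniform random policy. Traversing each episode backwards and letting earlier visits overwrite later ones keeps exactly the first-visit return for each state.

```python
import random
random.seed(0)

def episode():
    # Generate one episode under the uniform random policy.
    s, traj = 1, []
    while s != 2:
        traj.append(s)
        s = max(0, s + random.choice([-1, 1]))
    return traj

returns = {0: [], 1: []}
for _ in range(5000):
    traj = episode()
    G, first = 0, {}
    for s in reversed(traj):
        G -= 1                  # reward is -1 on every step, gamma = 1
        first[s] = G            # earlier visits overwrite: keeps the first-visit G
    for s, G in first.items():
        returns[s].append(G)

V = {s: sum(g) / len(g) for s, g in returns.items() if g}
```

For this chain the true values are V(1) = -4 and V(0) = -6, and the sample averages converge there as episodes accumulate.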
20 Backup diagram for Simple Monte Carlo Entire episode included Only one choice at each state (unlike DP) MC does not bootstrap Time required to estimate one state does not depend on the total number of states
21 Monte Carlo Estimation of Action Values (Q) Monte Carlo is most useful when a model is not available Q^π(s,a) - average return starting from state s and action a following π Also converges asymptotically if every state-action pair is visited infinitely often We are really interested in estimates of V* and Q*, i.e., Monte Carlo Control
22 Learning about π while following π'
23 Summary MC has several advantages over DP: Can learn directly from interaction with environment No need for full models No need to learn about ALL states Less harm by Markovian violations (later in book) MC methods provide an alternate policy evaluation process One issue to watch for: maintaining sufficient exploration exploring starts, soft policies No bootstrapping (as opposed to DP) Estimating values for one policy while behaving according to another policy: importance sampling
24 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Simple Monte Carlo methods Dynamic Programming Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
25 Lecture 2, Part 3: Temporal Difference Learning Objectives of this part: Introduce Temporal Difference (TD) learning Focus first on policy evaluation, or prediction, methods Then extend to control methods
26 TD Prediction Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π. Recall the simple (every-visit) Monte Carlo method:

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

target: the actual return after time t. The simplest TD method, TD(0):

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

target: an estimate of the return
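The TD(0) update can be applied on-line, one transition at a time. A minimal sketch, assuming the same kind of hypothetical corridor task (states 0 and 1 nonterminal, 2 terminal; from 1, right terminates and left goes to 0; from 0, right goes to 1 and left stays put; reward -1 per step, gamma = 1, uniform random policy):

```python
import random
random.seed(1)

alpha, gamma = 0.01, 1.0
V = {0: 0.0, 1: 0.0, 2: 0.0}            # V(terminal) stays 0

for _ in range(50000):
    s = 1
    while s != 2:
        s2 = max(0, s + random.choice([-1, 1]))
        V[s] += alpha * (-1 + gamma * V[s2] - V[s])   # TD(0) backup
        s = s2
```

Unlike MC, each update uses only the immediate reward and the current estimate of the next state's value, so learning proceeds before the episode ends; the estimates settle near the true values V(1) = -4 and V(0) = -6.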
27 Simple Monte Carlo

V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]   (constant-α MC)

where R_t is the actual return following state s_t.

[backup diagram: the update backs up the entire remainder of the sampled episode, all the way to termination at T]
28 cf. Dynamic Programming

V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }

[backup diagram: one full backup over all possible actions and next states s_{t+1}, with reward r_{t+1}]
29 Simplest TD Method

V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[backup diagram: one sampled transition from s_t to s_{t+1} with reward r_{t+1}]
30 TD Bootstraps and Samples Bootstrapping: update involves an estimate MC does not bootstrap DP bootstraps TD bootstraps Sampling: update does not involve an expected value MC samples DP does not sample TD samples
31 Advantages of TD Learning TD methods do not require a model of the environment, only experience TD, but not MC, methods can be fully incremental You can learn before knowing the final outcome Less memory Less peak computation You can learn without the final outcome From incomplete sequences Both MC and TD converge (under certain assumptions), but which is faster/better?
32 Random Walk Example equiprobable transitions, α = 0.1 Values learned by TD(0) after various numbers of episodes
33 TD and MC on the Random Walk Data averaged over 100 sequences of episodes
34 You are the Predictor Suppose you observe the following 8 episodes: A, 0, B, 0 B, 1 B, 1 B, 1 B, 1 B, 1 B, 1 B, 0 V(A)? V(B)?
35 You are the Predictor V(A)?
36 You are the Predictor The prediction that best matches the training data is V(A) = 0. This minimizes the mean-square error on the training set. This is what a Monte Carlo method gets. If we consider the sequentiality of the problem, then we would set V(A) = 0.75. This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?). This is called the certainty-equivalence estimate. This is what TD(0) gets.
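The two answers can be checked numerically from the eight episodes listed on the previous slide:

```python
# The eight observed episodes, as (state, reward) step sequences.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Monte Carlo / batch estimate: average the observed returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}

# Certainty-equivalence estimate: fit the maximum-likelihood Markov model
# (A goes to B with reward 0 every time; B terminates with reward 1 in six
# of eight cases) and compute what that model predicts.
V_ce = {"B": 6 / 8}
V_ce["A"] = 0 + V_ce["A" and "B"]
```

The MC estimate gives V(A) = 0 (A's single observed return), while the certainty-equivalence estimate gives V(A) = 0.75 because A always led to B, whose estimated value is 6/8.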
37 Learning an Action-Value Function Estimate Q^π for the current behavior policy π. After every transition from a nonterminal state s_t, do this:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
38 Sarsa: On-Policy TD Control Turn this into a control method by always updating the policy to be greedy with respect to the current estimate. The quintuple of events (s, a, r, s', a') used by the update gives the algorithm its name.
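A minimal Sarsa sketch, assuming a hypothetical chain task invented for the example: states 0..3, state 3 terminal; entering state 3 yields +1, and every other step costs -0.1. The behavior policy is ε-greedy with respect to the current Q, and the update uses the action actually taken next (on-policy).

```python
import random
random.seed(2)

alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

def eps_greedy(s):
    if random.random() < eps:
        return random.choice([-1, 1])
    return max((-1, 1), key=lambda a: Q[(s, a)])

for _ in range(500):
    s = 0
    a = eps_greedy(s)
    while s != 3:
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        if s2 == 3:
            # Q(terminal, .) = 0, so the target is just r
            Q[(s, a)] += alpha * (r - Q[(s, a)])
            s = s2
        else:
            a2 = eps_greedy(s2)
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2
```

After a few hundred episodes the greedy action everywhere is "right", and Q(2, right) settles at the terminal reward of 1.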
39 Q-Learning: Off-Policy TD Control One-step Q-learning:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
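A one-step Q-learning sketch on the same kind of hypothetical chain (states 0..3, state 3 terminal with +1 on entry, -0.1 per other step). Behavior is ε-greedy, but the target uses the max over next actions regardless of what is actually done next, which is what makes it off-policy:

```python
import random
random.seed(3)

alpha, gamma, eps = 0.5, 1.0, 0.1
Q = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

for _ in range(500):
    s = 0
    while s != 3:
        if random.random() < eps:
            a = random.choice([-1, 1])
        else:
            a = max((-1, 1), key=lambda x: Q[(s, x)])
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        max_next = 0.0 if s2 == 3 else max(Q[(s2, -1)], Q[(s2, 1)])
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
        s = s2
```

Because the target ignores exploratory next actions, the estimates converge to the optimal action values: 1.0, 0.9, and 0.8 for moving right from states 2, 1, and 0 respectively.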
40 Cliffwalking ε-greedy, ε = 0.1
41 Actor-Critic Architecture [diagram: the Actor emits Actions to the Environment; the Environment returns Situations or States; a Primary Critic provides Primary Rewards; an Adaptive Critic converts these into Effective Rewards (involving values) for the Actor]
42 Actor-Critic Methods Explicit representation of policy as well as value function Minimal computation to select actions Can learn an explicit stochastic policy Can put constraints on policies Appealing as psychological and neural models
43 Actor-Critic Details The TD error is used to evaluate actions:

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s, a) as follows:

π_t(s, a) = Pr{ a_t = a | s_t = s } = e^{p(s,a)} / Σ_b e^{p(s,b)}

then you can update the preferences like this:

p(s_t, a_t) ← p(s_t, a_t) + β δ_t
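These two updates can be sketched together, assuming a hypothetical chain task (states 0..3, state 3 terminal with +1 on entry, -0.1 per other step). The critic learns V by TD(0); the actor adjusts softmax action preferences p(s, a) by the same TD error, as in the preference update above:

```python
import math, random
random.seed(4)

alpha_v, beta, gamma = 0.1, 0.1, 1.0
V = {s: 0.0 for s in range(3)}
p = {(s, a): 0.0 for s in range(3) for a in (-1, 1)}

def sample_action(s):
    # Softmax (Gibbs) policy over the two actions' preferences.
    w = {a: math.exp(p[(s, a)]) for a in (-1, 1)}
    z = w[-1] + w[1]
    return -1 if random.random() < w[-1] / z else 1

for _ in range(3000):
    s = 0
    while s != 3:
        a = sample_action(s)
        s2 = min(3, max(0, s + a))
        r = 1.0 if s2 == 3 else -0.1
        v_next = 0.0 if s2 == 3 else V[s2]
        delta = r + gamma * v_next - V[s]   # TD error
        V[s] += alpha_v * delta             # critic update
        p[(s, a)] += beta * delta           # actor update
        s = s2
```

Actions that lead to better-than-expected outcomes get positive TD errors and so rising preferences; here the rightward action ends up preferred in every state.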
44 Afterstates Usually, a state-value function evaluates states in which the agent can take an action. But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful? What is this in general?
45 Summary TD prediction Introduced one-step tabular model-free TD methods Extend prediction to control by employing some form of GPI On-policy control: Sarsa Off-policy control: Q-learning These methods bootstrap and sample, combining aspects of DP and MC methods
46 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
47 Lecture 2, Part 4: Unified Perspective
48 n-step TD Prediction Idea: Look farther into the future when you do a TD backup (1, 2, 3, ..., n steps)
49 Mathematics of n-step TD Prediction Monte Carlo:

R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{T−t−1} r_T

TD (one-step return):

R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1})

Use V to estimate the remaining return. n-step TD. The 2-step return:

R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2})

The n-step return:

R_t^{(n)} = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
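The family of returns above can be computed from a recorded trajectory. The rewards and value estimates below are hypothetical numbers invented for the example; past the end of the episode the n-step return is just the complete (Monte Carlo) return.

```python
gamma = 0.5

# A recorded 3-step episode fragment: rewards r_{t+1}, r_{t+2}, r_{t+3}
# after time t, and current estimates V_t(s_{t+1}), V_t(s_{t+2}), V_t(s_{t+3}).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.25, 0.125]

def n_step_return(n):
    T = len(rewards)
    n = min(n, T)                       # truncate at the end of the episode
    G = sum(gamma**k * rewards[k] for k in range(n))
    if n < T:
        G += gamma**n * values[n - 1]   # bootstrap from V_t(s_{t+n})
    return G
```

With these numbers, the 1-step return is 1.25, the 2-step return 1.0625, and from n = 3 on the return equals the full MC return 1.5.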
50 Learning with n-step Backups Backup (on-line or off-line):

ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ]

Error reduction property of n-step returns:

max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) | ≤ γ^n max_s | V(s) − V^π(s) |

The left side is the maximum error using the n-step return; the right side is γ^n times the maximum error using V. Using this, you can show that n-step methods converge.
51 Random Walk Examples How does 2-step TD work here? How about 3-step TD?
52 A Larger Example Task: 19-state random walk Do you think there is an optimal n (for everything)?
53 Averaging n-step Returns n-step methods were introduced mainly as a way to understand TD(λ). Idea: back up an average of several returns, e.g., half of the 2-step return and half of the 4-step return:

R_t^{avg} = (1/2) R_t^{(2)} + (1/2) R_t^{(4)}

This is one backup, called a complex backup: draw each component and label it with the weight for that component.
54 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, weighting the n-step return by λ^{n−1}. The λ-return:

R_t^λ = (1 − λ) Σ_{n=1}^∞ λ^{n−1} R_t^{(n)}

Backup using the λ-return:

ΔV_t(s_t) = α [ R_t^λ − V_t(s_t) ]
55 λ-return Weighting Function
56 Relation to TD(0) and MC The λ-return can be rewritten as:

R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t

where the sum covers the returns until termination and the last term carries all the weight after termination. If λ = 1, you get MC:

R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t

If λ = 0, you get TD(0):

R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}
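The finite rewriting, and both limiting cases, can be checked numerically on a hypothetical recorded episode fragment (the same kind of invented rewards and value estimates used for n-step returns):

```python
gamma, T = 0.5, 3

rewards = [1.0, 0.0, 2.0]               # r_{t+1}, r_{t+2}, r_{t+3}
values = [0.5, 0.25, 0.125]             # V_t(s_{t+1}), V_t(s_{t+2}), V_t(s_{t+3})

def n_step_return(n):
    n = min(n, T)
    G = sum(gamma**k * rewards[k] for k in range(n))
    if n < T:
        G += gamma**n * values[n - 1]
    return G

def lambda_return(lam):
    # Finite sum of weighted n-step returns until termination, plus the
    # weight lam^(T-t-1) on the complete return after termination.
    G = (1 - lam) * sum(lam**(n - 1) * n_step_return(n)
                        for n in range(1, T))
    return G + lam**(T - 1) * n_step_return(T)
```

Setting λ = 0 recovers the one-step TD target, λ = 1 recovers the Monte Carlo return, and intermediate λ blends all the n-step returns.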
57 Forward View of TD(λ) Look forward from each state to determine its update from future states and rewards:
58 λ-return on the Random Walk Same 19-state random walk as before Why do you think intermediate values of λ are best?
59 Backward View of TD(λ) The forward view was for theory; the backward view is for mechanism. A new variable called the eligibility trace: e_t(s) ∈ ℝ⁺. On each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace):

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
60 Backward View

δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)

Shout δ_t backwards over time. The strength of your voice decreases with temporal distance by γλ.
61 On-line Tabular TD(λ)

Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γ V(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s'
    Until s is terminal
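The tabular algorithm translates almost line for line into Python. This sketch assumes a hypothetical 5-state random walk invented for the example: states 0..4, with 0 and 4 terminal, reward +1 only for exiting right, gamma = 1, equiprobable steps, start in state 2.

```python
import random
random.seed(5)

alpha, gamma, lam = 0.02, 1.0, 0.9
V = {s: 0.0 for s in range(5)}

for _ in range(20000):
    e = {s: 0.0 for s in range(5)}      # traces start at zero each episode
    s = 2
    while s not in (0, 4):
        s2 = s + random.choice([-1, 1])
        r = 1.0 if s2 == 4 else 0.0
        v_next = 0.0 if s2 in (0, 4) else V[s2]
        delta = r + gamma * v_next - V[s]
        e[s] += 1.0                     # accumulate the visited state's trace
        for x in range(5):
            V[x] += alpha * delta * e[x]
            e[x] *= gamma * lam         # decay all traces by gamma * lambda
        s = s2
```

Each TD error is broadcast to all recently visited states in proportion to their traces; the estimates settle near the true exit-right probabilities 0.25, 0.5, 0.75.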
62 Relation of the Backward View to MC and TD(0) Using the update rule

ΔV_t(s) = α δ_t e_t(s)

as before, if you set λ to 0 you get TD(0); if you set λ to 1 you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).
63 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows that the total updates over an episode are equal:

Σ_{t=0}^{T−1} ΔV_t^{TD}(s) = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t}

(backward updates on the left, forward updates on the right, with I_{s s_t} = 1 if s = s_t and 0 otherwise; the algebra is shown in the book). On-line updating with small α is similar.
64 Control: Sarsa(λ) Save eligibility traces for state-action pairs instead of just states:

e_t(s, a) = γλ e_{t−1}(s, a) + 1   if s = s_t and a = a_t
e_t(s, a) = γλ e_{t−1}(s, a)       otherwise

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)

δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t)
65 Sarsa(λ) Algorithm

Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using the policy derived from Q (e.g., ε-greedy)
        δ ← r + γ Q(s', a') − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            e(s, a) ← γλ e(s, a)
        s ← s'; a ← a'
    Until s is terminal
66 Sarsa(λ) Gridworld Example With one trial, the agent has much more information about how to get to the goal (not necessarily the best way) Can considerably accelerate learning
67 Conclusions Eligibility traces provide an efficient, incremental way to combine MC and TD Includes advantages of MC (can deal with lack of Markov property) Includes advantages of TD (using TD error, bootstrapping) Can significantly speed learning Does have a cost in computation
68 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
69 TD Error

δ_t = r_t + γ V_t − V_{t−1}

[figure: for regular predictors of reward over an interval, the value V and TD error δ early in learning, after learning is complete, and when the reward r is omitted]
70 Dopamine Neurons and TD Error W. Schultz et al., Université de Fribourg
71 Dopamine-Modulated Synaptic Plasticity
72 Basal Ganglia as Adaptive Critic Architecture Houk, Adams, & Barto, 1995
73 The Overall Plan Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes Lecture 2: Dynamic Programming Simple Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience Lecture 3: Function approximation Model-based methods Abstraction and hierarchy Intrinsically motivated RL
More informationAn OO Framework for building Intelligence and Learning properties in Software Agents
An OO Framework for building Intelligence and Learning properties in Software Agents José A. R. P. Sardinha, Ruy L. Milidiú, Carlos J. P. Lucena, Patrick Paranhos Abstract Software agents are defined as
More informationAre You Ready? Simplify Fractions
SKILL 10 Simplify Fractions Teaching Skill 10 Objective Write a fraction in simplest form. Review the definition of simplest form with students. Ask: Is 3 written in simplest form? Why 7 or why not? (Yes,
More informationReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology
ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon
More informationIntroduction to Simulation
Introduction to Simulation Spring 2010 Dr. Louis Luangkesorn University of Pittsburgh January 19, 2010 Dr. Louis Luangkesorn ( University of Pittsburgh ) Introduction to Simulation January 19, 2010 1 /
More informationLearning Methods for Fuzzy Systems
Learning Methods for Fuzzy Systems Rudolf Kruse and Andreas Nürnberger Department of Computer Science, University of Magdeburg Universitätsplatz, D-396 Magdeburg, Germany Phone : +49.39.67.876, Fax : +49.39.67.8
More informationLahore University of Management Sciences. FINN 321 Econometrics Fall Semester 2017
Instructor Syed Zahid Ali Room No. 247 Economics Wing First Floor Office Hours Email szahid@lums.edu.pk Telephone Ext. 8074 Secretary/TA TA Office Hours Course URL (if any) Suraj.lums.edu.pk FINN 321 Econometrics
More informationAI Agent for Ice Hockey Atari 2600
AI Agent for Ice Hockey Atari 2600 Emman Kabaghe (emmank@stanford.edu) Rajarshi Roy (rroy@stanford.edu) 1 Introduction In the reinforcement learning (RL) problem an agent autonomously learns a behavior
More informationWhile you are waiting... socrative.com, room number SIMLANG2016
While you are waiting... socrative.com, room number SIMLANG2016 Simulating Language Lecture 4: When will optimal signalling evolve? Simon Kirby simon@ling.ed.ac.uk T H E U N I V E R S I T Y O H F R G E
More informationQuickStroke: An Incremental On-line Chinese Handwriting Recognition System
QuickStroke: An Incremental On-line Chinese Handwriting Recognition System Nada P. Matić John C. Platt Λ Tony Wang y Synaptics, Inc. 2381 Bering Drive San Jose, CA 95131, USA Abstract This paper presents
More informationHow long did... Who did... Where was... When did... How did... Which did...
(Past Tense) Who did... Where was... How long did... When did... How did... 1 2 How were... What did... Which did... What time did... Where did... What were... Where were... Why did... Who was... How many
More informationMeasurement. When Smaller Is Better. Activity:
Measurement Activity: TEKS: When Smaller Is Better (6.8) Measurement. The student solves application problems involving estimation and measurement of length, area, time, temperature, volume, weight, and
More informationOn the Combined Behavior of Autonomous Resource Management Agents
On the Combined Behavior of Autonomous Resource Management Agents Siri Fagernes 1 and Alva L. Couch 2 1 Faculty of Engineering Oslo University College Oslo, Norway siri.fagernes@iu.hio.no 2 Computer Science
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationAgents and environments. Intelligent Agents. Reminders. Vacuum-cleaner world. Outline. A vacuum-cleaner agent. Chapter 2 Actuators
s and environments Percepts Intelligent s? Chapter 2 Actions s include humans, robots, softbots, thermostats, etc. The agent function maps from percept histories to actions: f : P A The agent program runs
More informationRover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes
Rover Races Grades: 3-5 Prep Time: ~45 Minutes Lesson Time: ~105 minutes WHAT STUDENTS DO: Establishing Communication Procedures Following Curiosity on Mars often means roving to places with interesting
More informationRadius STEM Readiness TM
Curriculum Guide Radius STEM Readiness TM While today s teens are surrounded by technology, we face a stark and imminent shortage of graduates pursuing careers in Science, Technology, Engineering, and
More informationCOMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR
COMPUTATIONAL COMPLEXITY OF LEFT-ASSOCIATIVE GRAMMAR ROLAND HAUSSER Institut für Deutsche Philologie Ludwig-Maximilians Universität München München, West Germany 1. CHOICE OF A PRIMITIVE OPERATION The
More informationJulia Smith. Effective Classroom Approaches to.
Julia Smith @tessmaths Effective Classroom Approaches to GCSE Maths resits julia.smith@writtle.ac.uk Agenda The context of GCSE resit in a post-16 setting An overview of the new GCSE Key features of a
More informationChapter 2. Intelligent Agents. Outline. Agents and environments. Rationality. PEAS (Performance measure, Environment, Actuators, Sensors)
Intelligent Agents Chapter 2 1 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Agent types 2 Agents and environments sensors environment percepts
More informationBMBF Project ROBUKOM: Robust Communication Networks
BMBF Project ROBUKOM: Robust Communication Networks Arie M.C.A. Koster Christoph Helmberg Andreas Bley Martin Grötschel Thomas Bauschert supported by BMBF grant 03MS616A: ROBUKOM Robust Communication Networks,
More informationInformatics 2A: Language Complexity and the. Inf2A: Chomsky Hierarchy
Informatics 2A: Language Complexity and the Chomsky Hierarchy September 28, 2010 Starter 1 Is there a finite state machine that recognises all those strings s from the alphabet {a, b} where the difference
More informationACTL5103 Stochastic Modelling For Actuaries. Course Outline Semester 2, 2014
UNSW Australia Business School School of Risk and Actuarial Studies ACTL5103 Stochastic Modelling For Actuaries Course Outline Semester 2, 2014 Part A: Course-Specific Information Please consult Part B
More informationLearning to Schedule Straight-Line Code
Learning to Schedule Straight-Line Code Eliot Moss, Paul Utgoff, John Cavazos Doina Precup, Darko Stefanović Dept. of Comp. Sci., Univ. of Mass. Amherst, MA 01003 Carla Brodley, David Scheeff Sch. of Elec.
More informationOPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS
OPTIMIZATINON OF TRAINING SETS FOR HEBBIAN-LEARNING- BASED CLASSIFIERS Václav Kocian, Eva Volná, Michal Janošek, Martin Kotyrba University of Ostrava Department of Informatics and Computers Dvořákova 7,
More informationLaboratorio di Intelligenza Artificiale e Robotica
Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning
More informationAdaptive Generation in Dialogue Systems Using Dynamic User Modeling
Adaptive Generation in Dialogue Systems Using Dynamic User Modeling Srinivasan Janarthanam Heriot-Watt University Oliver Lemon Heriot-Watt University We address the problem of dynamically modeling and
More informationStopping rules for sequential trials in high-dimensional data
Stopping rules for sequential trials in high-dimensional data Sonja Zehetmayer, Alexandra Graf, and Martin Posch Center for Medical Statistics, Informatics and Intelligent Systems Medical University of
More informationSystem Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks
System Implementation for SemEval-2017 Task 4 Subtask A Based on Interpolated Deep Neural Networks 1 Tzu-Hsuan Yang, 2 Tzu-Hsuan Tseng, and 3 Chia-Ping Chen Department of Computer Science and Engineering
More informationAP Calculus AB. Nevada Academic Standards that are assessable at the local level only.
Calculus AB Priority Keys Aligned with Nevada Standards MA I MI L S MA represents a Major content area. Any concept labeled MA is something of central importance to the entire class/curriculum; it is a
More informationEvolutive Neural Net Fuzzy Filtering: Basic Description
Journal of Intelligent Learning Systems and Applications, 2010, 2: 12-18 doi:10.4236/jilsa.2010.21002 Published Online February 2010 (http://www.scirp.org/journal/jilsa) Evolutive Neural Net Fuzzy Filtering:
More information(Sub)Gradient Descent
(Sub)Gradient Descent CMSC 422 MARINE CARPUAT marine@cs.umd.edu Figures credit: Piyush Rai Logistics Midterm is on Thursday 3/24 during class time closed book/internet/etc, one page of notes. will include
More informationENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering
ENME 605 Advanced Control Systems, Fall 2015 Department of Mechanical Engineering Lecture Details Instructor Course Objectives Tuesday and Thursday, 4:00 pm to 5:15 pm Information Technology and Engineering
More informationStatewide Framework Document for:
Statewide Framework Document for: 270301 Standards may be added to this document prior to submission, but may not be removed from the framework to meet state credit equivalency requirements. Performance
More informationThe Evolution of Random Phenomena
The Evolution of Random Phenomena A Look at Markov Chains Glen Wang glenw@uchicago.edu Splash! Chicago: Winter Cascade 2012 Lecture 1: What is Randomness? What is randomness? Can you think of some examples
More informationA Process-Model Account of Task Interruption and Resumption: When Does Encoding of the Problem State Occur?
A Process-Model Account of Task Interruption and Resumption: When Does Encoding of the Problem State Occur? Dario D. Salvucci Drexel University Philadelphia, PA Christopher A. Monk George Mason University
More informationAcquiring Competence from Performance Data
Acquiring Competence from Performance Data Online learnability of OT and HG with simulated annealing Tamás Biró ACLC, University of Amsterdam (UvA) Computational Linguistics in the Netherlands, February
More informationIntelligent Agents. Chapter 2. Chapter 2 1
Intelligent Agents Chapter 2 Chapter 2 1 Outline Agents and environments Rationality PEAS (Performance measure, Environment, Actuators, Sensors) Environment types The structure of agents Chapter 2 2 Agents
More informationSelf Study Report Computer Science
Computer Science undergraduate students have access to undergraduate teaching, and general computing facilities in three buildings. Two large classrooms are housed in the Davis Centre, which hold about
More informationNCEO Technical Report 27
Home About Publications Special Topics Presentations State Policies Accommodations Bibliography Teleconferences Tools Related Sites Interpreting Trends in the Performance of Special Education Students
More informationUniversity of Victoria School of Exercise Science, Physical and Health Education EPHE 245 MOTOR LEARNING. Calendar Description Units: 1.
University of Victoria School of Exercise Science, Physical and Health Education EPHE 245 MOTOR LEARNING Calendar Description Units: 1.5 Hours: 3-2 Neural and cognitive processes underlying human skilled
More informationDublin City Schools Mathematics Graded Course of Study GRADE 4
I. Content Standard: Number, Number Sense and Operations Standard Students demonstrate number sense, including an understanding of number systems and reasonable estimates using paper and pencil, technology-supported
More informationModeling user preferences and norms in context-aware systems
Modeling user preferences and norms in context-aware systems Jonas Nilsson, Cecilia Lindmark Jonas Nilsson, Cecilia Lindmark VT 2016 Bachelor's thesis for Computer Science, 15 hp Supervisor: Juan Carlos
More informationCorrective Feedback and Persistent Learning for Information Extraction
Corrective Feedback and Persistent Learning for Information Extraction Aron Culotta a, Trausti Kristjansson b, Andrew McCallum a, Paul Viola c a Dept. of Computer Science, University of Massachusetts,
More informationMathematics subject curriculum
Mathematics subject curriculum Dette er ei omsetjing av den fastsette læreplanteksten. Læreplanen er fastsett på Nynorsk Established as a Regulation by the Ministry of Education and Research on 24 June
More informationAGENDA LEARNING THEORIES LEARNING THEORIES. Advanced Learning Theories 2/22/2016
AGENDA Advanced Learning Theories Alejandra J. Magana, Ph.D. admagana@purdue.edu Introduction to Learning Theories Role of Learning Theories and Frameworks Learning Design Research Design Dual Coding Theory
More informationErkki Mäkinen State change languages as homomorphic images of Szilard languages
Erkki Mäkinen State change languages as homomorphic images of Szilard languages UNIVERSITY OF TAMPERE SCHOOL OF INFORMATION SCIENCES REPORTS IN INFORMATION SCIENCES 48 TAMPERE 2016 UNIVERSITY OF TAMPERE
More informationExtending Place Value with Whole Numbers to 1,000,000
Grade 4 Mathematics, Quarter 1, Unit 1.1 Extending Place Value with Whole Numbers to 1,000,000 Overview Number of Instructional Days: 10 (1 day = 45 minutes) Content to Be Learned Recognize that a digit
More informationAlgebra 1, Quarter 3, Unit 3.1. Line of Best Fit. Overview
Algebra 1, Quarter 3, Unit 3.1 Line of Best Fit Overview Number of instructional days 6 (1 day assessment) (1 day = 45 minutes) Content to be learned Analyze scatter plots and construct the line of best
More informationLearning and Transferring Relational Instance-Based Policies
Learning and Transferring Relational Instance-Based Policies Rocío García-Durán, Fernando Fernández y Daniel Borrajo Universidad Carlos III de Madrid Avda de la Universidad 30, 28911-Leganés (Madrid),
More informationBAYESIAN ANALYSIS OF INTERLEAVED LEARNING AND RESPONSE BIAS IN BEHAVIORAL EXPERIMENTS
Page 1 of 42 Articles in PresS. J Neurophysiol (December 20, 2006). doi:10.1152/jn.00946.2006 BAYESIAN ANALYSIS OF INTERLEAVED LEARNING AND RESPONSE BIAS IN BEHAVIORAL EXPERIMENTS Anne C. Smith 1*, Sylvia
More informationAccelerated Learning Online. Course Outline
Accelerated Learning Online Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies
More informationAGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS
AGS THE GREAT REVIEW GAME FOR PRE-ALGEBRA (CD) CORRELATED TO CALIFORNIA CONTENT STANDARDS 1 CALIFORNIA CONTENT STANDARDS: Chapter 1 ALGEBRA AND WHOLE NUMBERS Algebra and Functions 1.4 Students use algebraic
More informationSARDNET: A Self-Organizing Feature Map for Sequences
SARDNET: A Self-Organizing Feature Map for Sequences Daniel L. James and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX 78712 dljames,risto~cs.utexas.edu
More informationXXII BrainStorming Day
UNIVERSITA DEGLI STUDI DI CATANIA FACOLTA DI INGEGNERIA PhD course in Electronics, Automation and Control of Complex Systems - XXV Cycle DIPARTIMENTO DI INGEGNERIA ELETTRICA ELETTRONICA E INFORMATICA XXII
More informationarxiv: v1 [math.at] 10 Jan 2016
THE ALGEBRAIC ATIYAH-HIRZEBRUCH SPECTRAL SEQUENCE OF REAL PROJECTIVE SPECTRA arxiv:1601.02185v1 [math.at] 10 Jan 2016 GUOZHEN WANG AND ZHOULI XU Abstract. In this note, we use Curtis s algorithm and the
More informationA Stochastic Model for the Vocabulary Explosion
Words Known A Stochastic Model for the Vocabulary Explosion Colleen C. Mitchell (colleen-mitchell@uiowa.edu) Department of Mathematics, 225E MLH Iowa City, IA 52242 USA Bob McMurray (bob-mcmurray@uiowa.edu)
More informationSurprise-Based Learning for Autonomous Systems
Surprise-Based Learning for Autonomous Systems Nadeesha Ranasinghe and Wei-Min Shen ABSTRACT Dealing with unexpected situations is a key challenge faced by autonomous robots. This paper describes a promising
More informationVisit us at:
White Paper Integrating Six Sigma and Software Testing Process for Removal of Wastage & Optimizing Resource Utilization 24 October 2013 With resources working for extended hours and in a pressurized environment,
More informationExecutive Guide to Simulation for Health
Executive Guide to Simulation for Health Simulation is used by Healthcare and Human Service organizations across the World to improve their systems of care and reduce costs. Simulation offers evidence
More information1. Answer the questions below on the Lesson Planning Response Document.
Module for Lateral Entry Teachers Lesson Planning Introductory Information about Understanding by Design (UbD) (Sources: Wiggins, G. & McTighte, J. (2005). Understanding by design. Alexandria, VA: ASCD.;
More informationLaboratorio di Intelligenza Artificiale e Robotica
Laboratorio di Intelligenza Artificiale e Robotica A.A. 2008-2009 Outline 2 Machine Learning Unsupervised Learning Supervised Learning Reinforcement Learning Genetic Algorithms Genetics-Based Machine Learning
More informationAn investigation of imitation learning algorithms for structured prediction
JMLR: Workshop and Conference Proceedings 24:143 153, 2012 10th European Workshop on Reinforcement Learning An investigation of imitation learning algorithms for structured prediction Andreas Vlachos Computer
More informationTesting A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA
Testing A Moving Target: How Do We Test Machine Learning Systems? Peter Varhol Technology Strategy Research, USA Testing a Moving Target How Do We Test Machine Learning Systems? Peter Varhol, Technology
More informationAlgebra 2- Semester 2 Review
Name Block Date Algebra 2- Semester 2 Review Non-Calculator 5.4 1. Consider the function f x 1 x 2. a) Describe the transformation of the graph of y 1 x. b) Identify the asymptotes. c) What is the domain
More informationDiscriminative Learning of Beam-Search Heuristics for Planning
Discriminative Learning of Beam-Search Heuristics for Planning Yuehua Xu School of EECS Oregon State University Corvallis,OR 97331 xuyu@eecs.oregonstate.edu Alan Fern School of EECS Oregon State University
More informationEVOLVING POLICIES TO SOLVE THE RUBIK S CUBE: EXPERIMENTS WITH IDEAL AND APPROXIMATE PERFORMANCE FUNCTIONS
EVOLVING POLICIES TO SOLVE THE RUBIK S CUBE: EXPERIMENTS WITH IDEAL AND APPROXIMATE PERFORMANCE FUNCTIONS by Robert Smith Submitted in partial fulfillment of the requirements for the degree of Master of
More informationChallenges in Deep Reinforcement Learning. Sergey Levine UC Berkeley
Challenges in Deep Reinforcement Learning Sergey Levine UC Berkeley Discuss some recent work in deep reinforcement learning Present a few major challenges Show some of our recent work toward tackling
More informationAccelerated Learning Course Outline
Accelerated Learning Course Outline Course Description The purpose of this course is to make the advances in the field of brain research more accessible to educators. The techniques and strategies of Accelerated
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Thomas Hofmann Presentation by Ioannis Pavlopoulos & Andreas Damianou for the course of Data Mining & Exploration 1 Outline Latent Semantic Analysis o Need o Overview
More informationIntroduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition
Introduction to Ensemble Learning Featuring Successes in the Netflix Prize Competition Todd Holloway Two Lecture Series for B551 November 20 & 27, 2007 Indiana University Outline Introduction Bias and
More informationCurriculum Design Project with Virtual Manipulatives. Gwenanne Salkind. George Mason University EDCI 856. Dr. Patricia Moyer-Packenham
Curriculum Design Project with Virtual Manipulatives Gwenanne Salkind George Mason University EDCI 856 Dr. Patricia Moyer-Packenham Spring 2006 Curriculum Design Project with Virtual Manipulatives Table
More informationPaper Reference. Edexcel GCSE Mathematics (Linear) 1380 Paper 1 (Non-Calculator) Foundation Tier. Monday 6 June 2011 Afternoon Time: 1 hour 30 minutes
Centre No. Candidate No. Paper Reference 1 3 8 0 1 F Paper Reference(s) 1380/1F Edexcel GCSE Mathematics (Linear) 1380 Paper 1 (Non-Calculator) Foundation Tier Monday 6 June 2011 Afternoon Time: 1 hour
More information