CMPUT 609/499: Reinforcement Learning for Artificial Intelligence
Instructor: Rich Sutton, Dept of Computing Science, richsutton.com

What is Reinforcement Learning?
- Agent-oriented learning: learning by interacting with an environment to achieve a goal
  - more realistic and ambitious than other kinds of machine learning
- Learning by trial and error, with only delayed evaluative feedback (reward)
  - the kind of machine learning most like natural learning
  - learning that can tell for itself when it is right or wrong
- The beginnings of a science of mind that is neither natural science nor applications technology

[Figure: Reinforcement Learning at the intersection of many fields (David Silver, 2015). Each field is paired with its related area: Computer Science (Machine Learning), Engineering (Optimal Control), Mathematics (Operations Research), Economics (Bounded Rationality), Psychology (Classical/Operant Conditioning), Neuroscience (Reward System).]

Example: Hajime Kimura's RL Robots
[Video panels: Before, After, Backward, New Robot (same algorithm)]

The RL Interface
[Diagram: the Agent and the Environment (world) form a loop. The environment sends the agent a State (Stimulus, Situation) and a Reward (Gain, Payoff, Cost); the agent sends the environment an Action (Response, Control).]
- Environment may be unknown, nonlinear, stochastic and complex
- Agent learns a policy mapping states to actions
- Seeking to maximize its cumulative reward in the long run
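The interface above can be sketched as a simple interaction loop. This is only an illustrative sketch: the `TwoStateEnv` environment, its reward rule, and the two policies are hypothetical toys invented here, not anything from the course materials.

```python
import random


class TwoStateEnv:
    """Hypothetical toy environment with two states and two actions."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Taking action 1 while in state 1 pays reward 1; everything else pays 0.
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        # Toy dynamics: the chosen action becomes the next state.
        self.state = action
        return self.state, reward


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def always_one(state, rng):
    """A fixed policy that always selects action 1."""
    return 1


def run(env, policy, num_steps, seed=0):
    """Run the agent-environment loop and return the cumulative reward."""
    rng = random.Random(seed)
    state = env.state
    total_reward = 0.0
    for _ in range(num_steps):
        action = policy(state, rng)        # agent: state -> action
        state, reward = env.step(action)   # environment: action -> state, reward
        total_reward += reward
    return total_reward
```

Here the agent never sees the environment's rules; it only observes states and rewards, which is the sense in which the environment "may be unknown".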

Signature challenges of RL
- Evaluative feedback (reward)
- Sequentiality, delayed consequences
- Need for trial and error, to explore as well as exploit
- Non-stationarity
- The fleeting nature of time and online data
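The need "to explore as well as exploit" can be illustrated with an epsilon-greedy rule on a multi-armed bandit, a standard textbook device rather than something taken from these slides. In this sketch the Gaussian reward model, the function name, and all parameter values are assumptions chosen for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, num_steps=2000, seed=0):
    """Sample-average action-value estimates with epsilon-greedy selection.

    true_means are the (hypothetical) expected rewards of each arm; the
    learner never sees them and only observes noisy rewards.
    """
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # current estimate of each arm's value
    counts = [0] * k        # number of times each arm was pulled
    total = 0.0
    for _ in range(num_steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)                            # explore
        else:
            action = max(range(k), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[action], 1.0)              # evaluative feedback
        counts[action] += 1
        # Incremental sample average: new = old + (target - old) / n
        estimates[action] += (reward - estimates[action]) / counts[action]
        total += reward
    return estimates, counts, total / num_steps
```

With `epsilon=0` the learner can lock onto whichever arm looked good first; a small positive `epsilon` keeps sampling the others, at the cost of some reward.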

Some RL Successes
- Learned the world's best player of Backgammon (Tesauro 1995)
- Learned acrobatic helicopter autopilots (Ng, Abbeel, Coates et al. 2006+)
- Widely used in the placement and selection of advertisements and pages on the web (e.g., A-B tests)
- Used to make strategic decisions in Jeopardy! (IBM's Watson 2011)
- Achieved human-level performance on Atari games from pixel-level visual input, in conjunction with deep learning (Google DeepMind 2015)
- In all these cases, performance was better than could be obtained by any other method, and was obtained without human instruction

Example: TD-Gammon (Tesauro, 1992-1995)
[Diagram: a backgammon board (points 1-24 plus the two bars) is fed to a network with weights w, which outputs the estimated state value (prob of winning)]
- Action selection by a shallow search
- Start with a random network
- Play millions of games against itself
- Learn a value function from this simulated experience
- Six weeks later it's the best player of backgammon in the world
- Originally used expert handcrafted features, later repeated with raw board positions
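The idea of learning a value function from simulated experience can be sketched on a far smaller problem. The code below is not Tesauro's method (TD-Gammon used a neural network and self-play); it is tabular TD(0) on a hypothetical five-state random walk, where the value of a state is the probability of exiting on the right. The function name, step size, and episode count are assumptions for illustration.

```python
import random


def td0_random_walk(num_states=5, num_episodes=5000, alpha=0.02, seed=0):
    """Tabular TD(0) value estimation on a symmetric random walk.

    Episodes start in the middle state and move left or right with equal
    probability. Exiting on the right gives reward 1; exiting on the left
    gives reward 0. No discounting is used.
    """
    rng = random.Random(seed)
    values = [0.5] * num_states  # initial guess for every state
    for _ in range(num_episodes):
        state = num_states // 2
        while True:
            next_state = state + rng.choice([-1, 1])
            if next_state < 0:                    # terminated on the left
                reward, next_value, done = 0.0, 0.0, True
            elif next_state >= num_states:        # terminated on the right
                reward, next_value, done = 1.0, 0.0, True
            else:
                reward, next_value, done = 0.0, values[next_state], False
            # TD(0): move the estimate toward reward + value of the next state
            values[state] += alpha * (reward + next_value - values[state])
            if done:
                break
            state = next_state
    return values
```

For this walk the true values are 1/6, 2/6, ..., 5/6, so the learned estimates can be checked against a known answer.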


RL + Deep Learning: Performance on Atari Games
[Videos: Space Invaders, Breakout, Enduro]

RL + Deep Learning, applied to Classic Atari Games
Google DeepMind 2015, Bowling et al. 2012
- Learned to play 49 games for the Atari 2600 game console, without labels or human input, from self-play and the score alone
- [Diagram: network (Convolution, Convolution, Fully connected, Fully connected) mapping raw screen pixels to predictions of final score for each of 18 joystick actions]
- Learned to play better than all previous algorithms and at human level for more than half the games
- Same learning algorithm applied to all 49 games, without human tuning
Figure 1: Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (note: snaking blue line symbolizes sliding of each filter across input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).


Intelligence is the ability to achieve goals
- "Intelligence is the most powerful phenomenon in the universe" (Ray Kurzweil, c. 2000)
  - The phenomenon is that there are systems in the universe that are well thought of as goal-seeking systems
- What is a goal-seeking system?
  - "Constant ends from variable means is the hallmark of mind" (William James, c. 1890)
  - a system that is better understood in terms of outcomes than in terms of mechanisms

The coming of artificial intelligence
- When people finally come to understand the principles of intelligence (what it is and how it works) well enough to design and create beings as intelligent as ourselves
- A fundamental goal for science, engineering, the humanities, for all mankind
- It will change the way we work and play, our sense of self, life, and death, the goals we set for ourselves and for our societies
- But it is also of significance beyond our species, beyond history
- It will lead to new beings and new ways of being, things inevitably much more powerful than our current selves

Milestones in the development of life on Earth
[Table: a timeline in two eras, with columns "year" and "Milestone".
Eras: The Age of Replicators (self-replicated things most prominent); The Age of Design (designed things most prominent).
Years, in order: 14Bya, 4.5Bya, 3.7Bya, 1.1Bya, 1Mya, 100Kya, 10Kya, 5Kya, 200ya, 70ya.
Milestones, in order: Big bang; formation of the earth and solar system; origin of life on earth (formation of first replicators); DNA and RNA; sexual reproduction; multi-cellular organisms; nervous systems; humans; culture; language; agriculture, metal tools; written language; industrial revolution; technology; computers; nanotechnology?; artificial intelligence; super-intelligence.]

AI is a great scientific prize
- cf. the discovery of DNA, the digital code of life, by Watson and Crick (1953)
- cf. Darwin's discovery of evolution, how people are descendants of earlier forms of life (1860)
- cf. the splitting of the atom, by Hahn (1938), leading to both atomic power and atomic bombs

Socrative.com, Room 568225
When will we understand the principles of intelligence well enough to create, using technology, artificial minds that rival our own in skill and generality?
Which of the following best represents your current views?
A. Never
B. Not during your lifetime
C. During your lifetime, but not before 2045
D. Before 2045
E. Before 2035

Is human-level AI possible?
- If people are biological machines, then eventually we will reverse engineer them, and understand their workings
- Then, surely we can make improvements
  - with materials and technology not available to evolution
  - how could there not be something we can improve?
  - design can overcome local minima, make great strides, try things much faster than biology
- Yes

If AI is possible, then will it eventually, inevitably happen?
- No. Not if we destroy ourselves first
- If that doesn't happen, then there will be strong, multi-incremental economic incentives pushing inexorably towards human and super-human AI
- It seems unlikely that they could be resisted or successfully forbidden or controlled
  - there is too much value, too many independent actors
- Very probably, say 90%

When will human-level AI first be created?
- No one knows, of course; we can make an educated guess about the probability distribution:
  - 25% chance by 2030
  - 50% chance by 2040
  - 10% chance never
- Certainly a significant chance within all of our expected lifetimes
- We should take the possibility into account in our career plans

Corporate investment in AI is way up
- Google's prescient AI buying spree: Boston Dynamics, Nest, DeepMind Technologies
- New AI research labs at Facebook (Yann LeCun), Baidu (Andrew Ng), Allen Institute (Oren Etzioni), Vicarious, Maluuba
- Also enlarged corporate AI labs: Microsoft, Amazon, Adobe
- Yahoo makes major investment in CMU machine learning department
- Many new AI startups getting venture capital

The 2nd industrial revolution
- The 1st industrial revolution was the physical power of machines substituting for that of people
- The 2nd industrial revolution is the computational power of machines substituting for that of people
  - Computation for perception, motor control, prediction, decision making, optimization, search
- Until now, people have been our cheapest source of computation
- But now our machines are starting to provide greater, cheaper computation

The computational revolution
[Chart: growth of machine computation over time, with an annotation marking the computational power of the human brain by 2025]

Advances in AI abilities are coming faster; in the last 5 years:
- IBM's Watson beats the best human players of Jeopardy! (2011)
- Deep neural networks greatly improve the state of the art in speech recognition and computer vision (2012-)
- Google's self-driving car becomes a plausible reality (~2013)
- DeepMind's DQN learns to play Atari games at the human level, from pixels, with no game-specific knowledge (~2014, Nature)
- University of Alberta's Cepheus solves Poker (2015, Science)
- Google DeepMind's AlphaGo defeats the world Go champion, vastly improving over all previous programs (2016)


Cheap computation power drives progress in AI
- Deep learning algorithms are essentially the same as what was used in the '80s
  - only now with larger computers (GPUs) and larger data sets
  - enabling today's vastly improved speech recognition
- Similar impacts of computer power can be seen in recent years, and throughout AI's history, in natural language processing, computer vision, and computer chess, Go, and other games

Algorithmic advances are also essential
- Algorithmic advances such as backpropagation, MCTS, policy-gradient reinforcement learning, and LSTM were necessary but not sufficient
- They were invented early, then waited for the computational power needed for them to shine
  - other algorithms are still waiting for more, cheaper computation
- Algorithmic advances are slower, less reliable
- But they will accelerate with more computation, more focused effort

AI is not like other sciences
- AI has Moore's law, an enabling technology racing alongside it, making the present special
- Moore's law is a slow fuse, leading to the greatest scientific and economic prize of all time
- So slow, so inevitable, yet so uncertain in timing
- The present is a special time for humanity, as we prepare for, wait for, and strive to create strong AI

Algorithmic advances in Alberta
- World's best computer games group for decades (see Bowling's talk), including solving Poker
- Created the Atari games environment that our alumni, at DeepMind, used to show learning of human-level play
- Trained the AlphaGo team that beat the world Go champion
- World's leading university in reinforcement learning algorithms, theory, and applications, including TD, MCTS
- 20 faculty members in AI

Course Overview
Main Topics:
- Learning (by trial and error)
- Planning (search, reason, thought, cognition)
- Prediction (evaluation functions, knowledge)
- Control (action selection, decision making)
Recurring issues:
- Demystifying the illusion of intelligence
- Purpose (goals, reward) vs Mechanism

Model-based RL: GridWorld Example

CMPUT 609: Provisional Schedule of Classes and Assignments
Class | Date | Lecture topic | Reading assignment (in advance) | Assignment due
1 | Thu, Sep 1, 2016 | The Magic of Artificial Intelligence; reasons for taking the course | |
2 | Tue, Sep 6, 2016 | Bandit problems | Sutton & Barto Chapters 1 and 2; read section 1 of the Wikipedia entry for the technological singularity; see also Vinge2010 (http://www-rohan.sdsu.edu/faculty/vinge/misc/iaai10/) and Moravec1998 (http://www.transhumanist.com/volume1/moravec.htm) |
3 | Thu, Sep 8, 2016 | Bandit problems plus RL examples | Sutton & Barto Chapter 2 (including Section 2.7) |
4 | Tue, Sep 13, 2016 | Defining Intelligent Systems | Read the definition given for artificial intelligence in Wikipedia and in the Nilsson book on p13; google for and read John McCarthy basic questions, and the intentional stance (dictionary of philosophy of mind) |
5 | Thu, Sep 15, 2016 | Markov decision problems | Sutton & Barto Chapter 3 thru Section 3.5 |
6 | Tue, Sep 20, 2016 | Returns, value functions | Rest of Sutton & Barto Chapter 3 |
7 | Thu, Sep 22, 2016 | Bellman Equations | Sutton & Barto Summary of Notation, Sutton & Barto Section 4.1 | W2
8 | Tue, Sep 27, 2016 | Dynamic programming (planning) | Sutton & Barto Rest of Chapter 4 |
9 | Thu, Sep 29, 2016 | Monte Carlo Learning | Sutton & Barto Chapter 5 |
10 | Tue, Oct 4, 2016 | More Monte Carlo Learning | Sutton & Barto Chapter 5 | W3
11 | Thu, Oct 6, 2016 | Temporal-difference learning | Sutton & Barto Chapter 6 thru Section 6.3 |
12 | Tue, Oct 11, 2016 | Temporal-difference learning | Sutton & Barto rest of Chapter 6 |
13 | Thu, Oct 13, 2016 | Multi-step bootstrapping | Sutton & Barto Chapter 7 | W4
14 | Tue, Oct 18, 2016 | Models and planning | Sutton & Barto Chapter 8 thru Section 8.3 |
15 | Thu, Oct 20, 2016 | Models and planning | Sutton & Barto rest of Chapter 8 | W1
16 | Tue, Oct 25, 2016 | Review | Sutton & Barto Chapters 2-8 | W5
17 | Thu, Oct 27, 2016 | Midterm Exam | No new reading |
18 | Tue, Nov 1, 2016 | Function Approximation; Online linear supervised learning | Nilsson Sec. 2.2.1 and Nilsson Ch. 4; Sutton & Barto Chapter 9 thru 9.4 |
19 | Thu, Nov 3, 2016 | Prediction with linear approximation, Tile coding | Sutton & Barto rest of Chapter 9 |
20 | Tue, Nov 15, 2016 | Control with approximation, Average reward, off-policy problems | Sutton & Barto Chapter 10 | P1

Help
- Probability refresher: Monday Sept 5, 5pm, NRE 1-001
- Homework labs with TAs, subsequent Mondays
- Office hours

Course Information
- Course Moodle page
  - some official information
  - discussion list!
- Course Dropbox (see Moodle page for link)
  - schedule, assignments, slides, projects
- Lab is on Monday, 5-7:50
  - a good place to do your assignments

Textbooks
Readings will be from web sources plus the following two textbooks (both of which are available online, electronically and open-access):
- Reinforcement Learning: An Introduction, by R Sutton and A Barto, MIT Press
  - we will use the in-progress, online 2nd edition
  - printed copies available at next class, $28 exact
- The Quest for AI, by N Nilsson, Cambridge, 2010 (pdf)

Evaluation
- 1 assignment per week, due at the beginning of class
  - 5 written assignments (5)
  - 3 programming projects (4) (later in the course)
- Midterm (4)
- Project (4)

Prerequisites
- Some comfort or interest in thinking abstractly and with mathematics
- Elementary statistics, probability theory
  - conditional expectations of random variables
  - there will be a lab session devoted to a tutorial review of basic probability
- Basic linear algebra: vectors, vector equations, gradients
- Basic programming skills (Python)
  - If Python is a problem, choose a partner who is already comfortable with Python

for next time... Read Chapters 1 & 2 of Sutton & Barto text (online)

Policies on Integrity
- Do not cheat on assignments:
  - Discuss only general approaches to the problem
  - Do not take written notes on others' work
- Respect the lab environment. Do not:
  - Interfere with operation of the computing system
  - Interfere with others' files
  - Change another's password
  - Copy another's program
  - etc.
- Cheating is reported to the university, whereupon it is out of our hands
- Possible consequences:
  - A mark of 0 for the assignment
  - A mark of 0 for the course
  - A permanent note on the student record
  - Suspension / expulsion from the university

Academic Integrity
The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behavior (online at www.ualberta.ca/secretariat/appeals.htm) and avoid any behavior which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University.

AI Seminar!!!
- http://www.cs.ualberta.ca/~ai/cal/
- Friday noons, CSC 3-33
- FREE PIZZA!
- Neat topics, great speakers