Bandits and Reinforcement Learning


Bandits and Reinforcement Learning
COMS E6998.001, Fall 2017, Columbia University
Alekh Agarwal and Alex Slivkins, Microsoft Research NYC

What is the course about?
- Algorithms for sequential decisions and interactive ML under uncertainty: the algorithm interacts with an environment and learns over time.
- Loop: observe state, make a decision, observe reward/feedback.
- Studied in machine learning, theoretical CS, AI, operations research, and economics since 1933; very active in the past decade.
- Focus on bandits (no state) and contextual bandits (state does not depend on past actions).
- Focus on theory (design & analysis of algorithms) using tools from probability, with lots of examples & discussion of motivations & applications.

This lecture
- course organization
- intro to the problem space
- short break
- review of concentration inequalities (necessary basics)

Prerequisites
- Algorithm design & mathematical proofs: exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231). Graduate ML course (COMS 4771) or current enrollment therein. If you do not meet these, please email the instructors.
- Probability & statistics: I will review concentration inequalities later today; a review of basic probability will be posted on the course webpage. Deeper familiarity would help, but is not required.
- Programming: familiarity with programming is not required; however, your project may involve simulations/experiments if you so choose.

Logistics
- Instructors: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)
- Schedule: Wednesdays 4:10-6:40pm, 404 Intl Affairs Bldg
- Office hours: after each class, and online (TBD)
- Course webpage: http://alekhagarwal.net/bandits_and_rl/index.html
- Q&A and announcements: we will use Piazza, please sign up! https://piazza.com/columbia/fall2017/comse6998001/home
- Contact: bandits-fa17-instr@microsoft.com (but please use Piazza if appropriate)
- Waitlist: let's see how it goes; sign up for Piazza!

Coursework and assessment
- Project: reading, coding, and/or original research on a course-related topic; written report: a short academic-style paper
- Grading: letter grade based on the project
- Homeworks: 2-3 problem sets throughout the course, not graded; meant to check/solidify your understanding of the material. We'll distribute hints/solutions, and we'll be available to discuss.

Projects
- Default: reading several papers and making sense of a given topic; simulations and/or research if you feel brave and inspired
- Specific topic suggestions will be posted soon; topic proposals due Oct 20 (tentatively)
- Feedback / discussion: we'll aim to be available before and after the proposal
- Output: written report, plus a short presentation in the last class
- We can only handle 10-12 projects, so students will need to bunch up, especially on reading projects

Related courses at Columbia
- Daniel Russo @ Business School, Fall '17: Dynamic Programming and Reinforcement Learning (B9140-001)
- Shipra Agrawal @ IEOR department, Spring '18: Reinforcement Learning
Our course focuses more heavily on contextual bandits and off-policy evaluation than either of these, and is complementary to these other offerings.

Intro to the problem space

(Informal & very stylized) running examples
- News site. When a new user arrives, the site picks a news header to show and observes whether the user clicks. Goal: maximize #clicks.
- Dynamic pricing. A store is selling a digital good (e.g., an app or a song). When a new customer arrives, the store picks a price. The customer buys (or not) and leaves forever. Goal: maximize total profit.
- Personalized health advice. A health app gives you health/lifestyle recommendations and tracks how well you follow them. Goal: maximize #adopted recommendations (weighted by their usefulness).
- Chatbot for task completion. You arrive with a specific task in mind (e.g., a tech support issue or buying a ticket), and a chatbot assists you. Goal: maximize #completed tasks.

Basic model
- A fixed set of K actions ("arms")
- In each round t = 1, ..., T: the algorithm observes a context/state x_t, chooses an arm a_t, and observes the reward r_t for the chosen arm
- "Bandit feedback": no other rewards are observed!
- IID rewards: the reward for each arm is drawn independently from a fixed distribution that depends on the arm and the context, but not on the round t
- No contexts => multi-armed bandits; contexts do not depend on past actions => contextual bandits; contexts/state depends on past actions => reinforcement learning
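To make the protocol concrete, here is a minimal sketch of the interaction loop in Python for the no-context (multi-armed bandit) case with IID 0/1 rewards. The environment, the hidden arm means, and the uniform-random "algorithm" are hypothetical placeholders for illustration only, not material from the course.

```python
import random

# Minimal sketch of the bandit protocol above (no contexts, IID 0/1 rewards).
# The hidden arm means and the uniform-random "algorithm" are hypothetical.
K, T = 3, 1000
means = [0.2, 0.5, 0.7]            # hidden mean reward of each arm

total_reward = 0.0
for t in range(T):
    a_t = random.randrange(K)      # the algorithm chooses an arm
    r_t = 1.0 if random.random() < means[a_t] else 0.0
    total_reward += r_t            # bandit feedback: only r_t for the chosen arm is observed
print("total reward:", total_reward)
```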

Examples

Example         | Context               | Action                | Reward
News site       | User location         | an article to display | 1 if clicked, 0 otherwise
Dynamic pricing | Buyer's profile       | a price p             | p if sale, 0 otherwise
Health advice   | User health profile   | what to recommend     | 1 if adopted, 0 otherwise
Chatbot         | Stage of conversation | what to say           | 1 if task completed, 0 otherwise

News site and dynamic pricing make sense even without context: bandits or contextual bandits.
For health advice, context is essential: contextual bandits.
For the chatbot, context is essential and depends on the past actions: reinforcement learning.

Exploration-exploitation tradeoff
- Bandit feedback => need to try different arms to acquire new info (if the algorithm always chooses arm 1, how would it know whether arm 2 is better?)
- Fundamental tradeoff between acquiring info about rewards (exploration) and making optimal decisions based on the available info (exploitation)
- Multi-armed bandits is a simple model for studying this tradeoff (a simple illustration follows below)
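As one concrete (and deliberately simple) illustration of the tradeoff, here is a sketch of the standard epsilon-greedy heuristic: with small probability it explores a random arm, otherwise it exploits the empirically best arm so far. This is not an algorithm analyzed in this lecture and is not claimed to be optimal; the constants and the reward simulator are hypothetical.

```python
import random

# Epsilon-greedy: explore a uniformly random arm with probability eps,
# otherwise exploit the arm with the best empirical mean so far.
# The constants and the reward simulator are hypothetical.
K, T, eps = 3, 1000, 0.1
means = [0.2, 0.5, 0.7]                   # hidden mean rewards
counts = [0] * K                          # number of pulls per arm
estimates = [0.0] * K                     # empirical mean reward per arm

for t in range(T):
    if random.random() < eps:
        a = random.randrange(K)                            # explore
    else:
        a = max(range(K), key=lambda i: estimates[i])      # exploit
    r = 1.0 if random.random() < means[a] else 0.0         # bandit feedback
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]         # running-mean update
```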

Rich problem space
- Bandits vs. contextual bandits vs. reinforcement learning
- Many other distinctions, even for bandits

Distinction #1: which problem to solve?
- Explore-exploit problem: we control the choice of actions and want to maximize cumulative reward
- Offline evaluation: some algorithm collects data, and we use this data to answer counterfactuals: what if we had run this policy (a mapping from contexts to actions) instead? (an illustrative estimator is sketched below)
  - Off-policy: we do not have control over data collection
  - On-policy: we design the data collection (the "exploration policy")
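For the offline-evaluation question, one standard estimator, shown here purely as an illustration (the course's own treatment may differ), is inverse propensity scoring: reweight the logged rewards by the probability with which the logging policy chose the logged action. The log format and function names below are hypothetical.

```python
# Inverse propensity scoring (IPS), a standard off-policy estimator.
# Assumed (hypothetical) log format: each entry is a tuple
# (context x, logged action a, observed reward r, probability p with which
# the logging policy chose a in that round).

def ips_estimate(logged_data, target_policy):
    """Estimate the average per-round reward of target_policy (a map: context -> action)."""
    total = 0.0
    for x, a, r, p in logged_data:
        if target_policy(x) == a:     # only rounds where the two policies agree contribute,
            total += r / p            # reweighted by the logging probability
    return total / len(logged_data)

# Usage sketch: evaluate a policy that always shows article 0.
# logged_data = [("us_user", 0, 1.0, 0.5), ("eu_user", 1, 0.0, 0.5)]
# print(ips_estimate(logged_data, lambda x: 0))
```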

Distinction #2: where do rewards come from?
- IID rewards: the reward for each arm is drawn independently from a fixed distribution that depends on the arm but not on the round t
- Adversarial rewards: rewards are chosen by an adversary
- Constrained adversary: rewards are chosen by an adversary with known constraints, e.g.:
  - the reward of each arm can change by at most ε from one round to another
  - the reward of each arm can change at most once
- Stochastic rewards (beyond IID): the reward of each arm evolves over time as a random process, e.g. a random walk that changes by ±ε in each round (the IID and random-walk models are sketched below)
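To make two of these reward models concrete, here is a small sketch (with hypothetical parameters) contrasting IID rewards, drawn fresh each round from one fixed distribution, with a beyond-IID model in which the arm's mean follows a ±ε random walk.

```python
import random

# Two hypothetical reward simulators for a single arm: the IID model vs. a
# slowly drifting mean that follows a +/- eps random walk.
T, eps = 100, 0.01

def iid_rewards(mean=0.5):
    """IID model: each round's reward comes from the same fixed distribution."""
    return [1.0 if random.random() < mean else 0.0 for _ in range(T)]

def random_walk_rewards(mean=0.5):
    """Beyond-IID model: the arm's mean changes by +/- eps in each round."""
    rewards = []
    for _ in range(T):
        mean = min(1.0, max(0.0, mean + random.choice([-eps, eps])))
        rewards.append(1.0 if random.random() < mean else 0.0)
    return rewards
```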

Distinction #3: extra feedback
- Bandit feedback (most of this course): the reward for the chosen arm and no other info. News site: a click on a news article.
- Partial feedback:
  - News site: time spent reading an article?
  - Health advice: how diligently was each recommendation followed?
  - Dynamic pricing: a sale at price p => a sale at any smaller price
  - Still, no full counterfactual answer (what could have happened)
- Full feedback: rewards for all arms are revealed. Dynamic pricing: choosing the minimum acceptable price at an auction (the reserve price); given the bids, this tells you what would have happened at any other reserve price.

Other distinctions
- Bayesian prior? The problem instance comes from a known distribution (the "prior"); optimize in expectation over this distribution
- Global constraints, e.g.: a limited #items to sell
- Complex decisions: a news site picks a slate of articles; health advice consists of multiple specific recommendations
- Structured rewards: rewards may have a known structure, e.g.: arms are points in [0,1]^d and in each round the reward is a linear / concave / Lipschitz function of the chosen arm (see the sketch below)
- Policy sets: compare to a restricted set of policies (mappings from contexts to arms), e.g.: linear policies or decision trees of bounded width and depth
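As a concrete reading of the structured-rewards bullet (the weight vector and noise model are my own illustrative choices, not notation from the slides): when arms are points in [0,1]^d, a linear reward structure means the expected reward is an inner product between the chosen arm and a fixed unknown vector.

```python
import random

# Sketch of a linear reward structure over arms in [0,1]^d. The weight vector
# theta and the uniform noise are hypothetical.
d = 3
theta = [0.2, -0.1, 0.5]                       # weights unknown to the algorithm

def expected_reward(arm):
    """Expected reward is linear in the chosen arm: <theta, arm>."""
    return sum(w * x for w, x in zip(theta, arm))

def observed_reward(arm, noise=0.1):
    """Noisy bandit feedback for the chosen arm only."""
    return expected_reward(arm) + random.uniform(-noise, noise)

arm = [random.random() for _ in range(d)]      # an arm is a point in [0,1]^d
print(observed_reward(arm))
```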

Course outline
- Multi-armed bandits (4 lectures: Alex)
  - Bandits with IID rewards
  - Adversarial rewards, full feedback
  - Adversarial bandits
  - Impossibility results (no algorithm can do better than ...)
- Contextual bandits (4 lectures: Alekh)
- Reinforcement learning (2 lectures: Alekh)
- Back to bandits, topic TBD (1 lecture: Alex)
- Final class: project presentations

Some philosophy
- Reality can be complicated, so we often study simpler models.
- A good model:
  - captures some essential issues present in multiple applications
  - allows for clean solutions with good performance and/or clean, strong impossibility results
  - provides intuition/suggestions for more realistic models
- Even a good model typically does not fully capture any one application.
- The problem space is very rich => why work on problems with shaky motivation?

More examples

Example               | Action                                 | Rewards / costs
medical trials        | drug to give                           | health outcomes
internet ads          | which ad to display                    | bid value if clicked, 0 otherwise
content optimization  | e.g.: font color or page layout        | #clicks
sales optimization    | which products to sell at which prices | $$$
recommender systems   | suggest a movie, restaurants, etc.     | #users that followed suggestions
computer systems      | which server(s) to route the job to    | job completion time
crowdsourcing systems | which tasks to give to which workers   | quality of completed work
crowdsourcing systems | which price to offer?                  | #completed tasks
wireless networking   | which frequency to use?                | #successful transmissions
robot control         | a strategy for a given state & task    | #tasks successfully completed
game playing          | an action for a given game state       | #games won