Learning from Observations (Chapter 18, Sections 1-3)


Outline
- Learning agents
- Inductive learning
- Decision tree learning
- Measuring learning performance

Learning
Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down.
Learning modifies the agent's decision mechanisms to improve performance.

Learning agents
[Figure: architecture of a learning agent. The performance element senses the environment through sensors and acts through effectors; a critic compares behaviour against a performance standard and sends feedback and learning goals to the learning element; the learning element makes changes to the performance element's knowledge; a problem generator suggests experiments.]
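Read as code, the diagram is essentially a feedback loop around the performance element. A minimal runnable sketch in Python (all class, method, percept, and action names here are illustrative placeholders, not from the chapter; the problem generator is omitted for brevity):

class PerformanceElement:
    def __init__(self):
        self.knowledge = {}                        # e.g., percept -> action rules
    def choose_action(self, percept):
        return self.knowledge.get(percept, "default-action")

class Critic:
    def __init__(self, performance_standard):
        self.standard = performance_standard       # percept -> desired action
    def evaluate(self, percept, action):
        return "ok" if action == self.standard.get(percept) else self.standard.get(percept)

class LearningElement:
    def update(self, performance_element, percept, feedback):
        if feedback != "ok":                        # change the performance element's knowledge
            performance_element.knowledge[percept] = feedback

class LearningAgent:
    def __init__(self):
        self.performance_element = PerformanceElement()
        self.critic = Critic({"dirty": "suck", "clean": "move"})
        self.learning_element = LearningElement()
    def step(self, percept):
        action = self.performance_element.choose_action(percept)
        feedback = self.critic.evaluate(percept, action)
        self.learning_element.update(self.performance_element, percept, feedback)
        return action

agent = LearningAgent()
for percept in ["dirty", "dirty", "clean", "clean"]:
    print(percept, "->", agent.step(percept))       # behaviour improves after feedback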

Learning element
Design of the learning element is dictated by:
- what type of performance element is used
- which functional component is to be learned
- how that functional component is represented
- what kind of feedback is available

Example scenarios:

Performance element  | Component          | Representation           | Feedback
Alpha-beta search    | Evaluation fn.     | Weighted linear function | Win/loss
Logical agent        | Transition model   | Successor-state axioms   | Outcome
Utility-based agent  | Transition model   | Dynamic Bayes net        | Outcome
Simple reflex agent  | Percept-action fn. | Neural net               | Correct action

Supervised learning: correct answers for each instance.
Reinforcement learning: occasional rewards.

Inductive learning (a.k.a. Science)
Simplest form: learn a function from examples (tabula rasa).
f is the target function; an example is a pair (x, f(x)), e.g., a tic-tac-toe board position x labelled +1.
Problem: find a hypothesis h such that h ≈ f, given a training set of examples.
(This is a highly simplified model of real learning: it ignores prior knowledge, assumes a deterministic, observable environment, assumes examples are given, and assumes that the agent wants to learn f. Why?)

Inductive learning method
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples).
E.g., curve fitting:
[Figure: training points plotted as f(x) vs. x.]

Inductive learning method (continued)
Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples).
[Figures: the same training points fitted by several different curves of increasing complexity; several hypotheses can be consistent with the same training set.]
Ockham's razor: maximize a combination of consistency and simplicity.
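As a concrete illustration of the consistency-vs-simplicity trade-off, here is a small sketch (assuming NumPy is available; the data points are made up for illustration, not the curve from the slides) that fits polynomials of increasing degree to the same training set and compares error on the training points to error on held-out points:

import numpy as np

# Hypothetical training data: noisy samples of an underlying smooth function.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.standard_normal(8)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x_train, y_train, degree)   # h = polynomial of the given degree
    h = np.poly1d(coeffs)
    train_err = np.mean((h(x_train) - y_train) ** 2)
    test_err = np.mean((h(x_test) - y_test) ** 2)
    # Higher degree -> better consistency with the training set,
    # but not necessarily better generalization (Ockham's razor).
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")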

Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous, etc.)
E.g., situations where I will/won't wait for a table:

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait (Target)
X1      | T   | F   | F   | T   | Some | $$$   | F    | T   | French  | 0-10  | T
X2      | T   | F   | F   | T   | Full | $     | F    | F   | Thai    | 30-60 | F
X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger  | 0-10  | T
X4      | T   | F   | T   | T   | Full | $     | T    | F   | Thai    | 10-30 | T
X5      | T   | F   | T   | F   | Full | $$$   | F    | T   | French  | >60   | F
X6      | F   | T   | F   | T   | Some | $$    | T    | T   | Italian | 0-10  | T
X7      | F   | T   | F   | F   | None | $     | T    | F   | Burger  | 0-10  | F
X8      | F   | F   | F   | T   | Some | $$    | T    | T   | Thai    | 0-10  | T
X9      | F   | T   | T   | F   | Full | $     | T    | F   | Burger  | >60   | F
X10     | T   | T   | T   | T   | Full | $$$   | F    | T   | Italian | 10-30 | F
X11     | F   | F   | F   | F   | None | $     | F    | F   | Thai    | 0-10  | F
X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger  | 30-60 | T

Classification of examples is positive (T) or negative (F).

Decision trees
One possible representation for hypotheses.
E.g., here is the "true" tree for deciding whether to wait:
[Figure: root test Patrons? (None, Some, Full); then WaitEstimate? (>60, 30-60, 10-30, 0-10); with further tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, Raining?]

Expressiveness
Decision trees can express any Boolean function of the input attributes.
E.g., for Boolean attributes, each truth table row corresponds to a path from root to leaf:
[Figure: truth table for A xor B next to the equivalent decision tree that tests A, then B.]
Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
Prefer to find more compact decision trees.
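To make the truth-table-row-to-path correspondence concrete, here is a tiny sketch of the XOR function written as the nested tests of its decision tree (illustrative code, not from the slides):

def xor_tree(a: bool, b: bool) -> bool:
    # Decision tree for A xor B: test A at the root, then B on each branch.
    # Each truth-table row (A, B) follows exactly one root-to-leaf path.
    if a:
        return not b   # A=T branch: leaf T when B=F, leaf F when B=T
    else:
        return b       # A=F branch: leaf F when B=F, leaf T when B=T

# Enumerate the truth table.
for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))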

Hypothesis spaces
How many distinct decision trees are there with n Boolean attributes?
= number of distinct truth tables with 2^n rows
= 2^(2^n)
E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
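A quick check of these counts in plain Python (the 3^n count of conjunctive hypotheses is the subject of the next slide):

n = 6
num_truth_table_rows = 2 ** n            # 64 rows
num_boolean_functions = 2 ** (2 ** n)    # at least one tree per Boolean function
num_conjunctive = 3 ** n                 # each attribute: positive, negative, or absent
print(num_boolean_functions)             # 18446744073709551616
print(num_conjunctive)                   # 729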

Hypothesis spaces (continued)
How many purely conjunctive hypotheses are there (e.g., Hungry ∧ ¬Rain)?
Each attribute can be in (positive), in (negative), or out, so there are 3^n distinct conjunctive hypotheses.
A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
- may therefore give worse predictions

Decision tree learning
Aim: find a small tree consistent with the training examples.
Idea: (recursively) choose the most significant attribute as the root of the (sub)tree.

function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return Mode(examples)
    else
        best ← Choose-Attribute(attributes, examples)
        tree ← a new decision tree with root test best
        for each value v_i of best do
            examples_i ← {elements of examples with best = v_i}
            subtree ← DTL(examples_i, attributes − best, Mode(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
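A compact executable version of this pseudocode, as a sketch: examples are assumed to be dicts of attribute values plus a classification under a target key, and the attribute selector is a plug-in (information gain is defined on the following slides; any function with this signature works here). It branches only over attribute values that actually occur in the examples, a small simplification.

from collections import Counter

def mode(examples, target):
    # Most common classification among the examples.
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, target, choose_attribute):
    if not examples:
        return default
    classifications = {e[target] for e in examples}
    if len(classifications) == 1:
        return classifications.pop()
    if not attributes:
        return mode(examples, target)
    best = choose_attribute(attributes, examples, target)
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        subtree = dtl(subset, [a for a in attributes if a != best],
                      mode(examples, target), target, choose_attribute)
        tree[best][value] = subtree
    return tree

# Tiny demonstration with a placeholder selector (substitute an
# information-gain-based Choose-Attribute later).
first_attribute = lambda attrs, exs, target: attrs[0]
examples = [{"Pat": "Some", "Hun": "T", "WillWait": "T"},
            {"Pat": "None", "Hun": "F", "WillWait": "F"}]
print(dtl(examples, ["Pat", "Hun"], "F", "WillWait", first_attribute))
# {'Pat': {'Some': 'T', 'None': 'F'}}  (key order may vary)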

Choosing an attribute
Idea: a good attribute splits the examples into subsets that are (ideally) all positive or all negative.
[Figure: splitting the 12 restaurant examples by Patrons? (None, Some, Full) versus by Type? (French, Italian, Thai, Burger).]

Information Theory
Consider communicating two messages (T and F) between two parties.
Bits are used to measure message size.
If P(T) = 1 and P(F) = 0, how many bits are needed?
If P(T) = 0.5 and P(F) = 0.5, how many bits are needed?

Information: I(P(v_1), ..., P(v_n)) = - Σ_{i=1..n} P(v_i) log2 P(v_i)
I(1, 0) = 0 bits
I(0.5, 0.5) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1 bit

I measures the information content of the communication (or the uncertainty in what is already known).
The more one knows, the less there is to communicate, and the smaller I is.
The less one knows, the more there is to communicate, and the larger I is.
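The information formula translates directly into code. A small sketch (the function name is mine, not the slides'):

import math

def information(probabilities):
    # I(P(v1), ..., P(vn)) = -sum_i P(vi) * log2 P(vi), with 0 * log2(0) taken as 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(information([1.0, 0.0]))   # 0.0 bits: the answer is already known
print(information([0.5, 0.5]))   # 1.0 bit: maximal uncertainty for two outcomes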

Using Information Theory
(P(pos), P(neg)): probabilities of the positive (T) message and the negative (F) message.
Attribute color: black (1, 0), white (0, 1)
Attribute size: large (.5, .5), small (.5, .5)

Before adding an attribute
How much uncertainty/confusion is there before adding an attribute (e.g., color)?
p = number of positive examples, n = number of negative examples
Estimating probabilities: P(pos) = p / (p + n), P(neg) = n / (p + n)
Before() = I(P(pos), P(neg))

After adding an attribute
How much uncertainty/confusion remains after adding an attribute (e.g., color)?
p_i = number of positive examples with value i (e.g., black), n_i = number of negative ones
Estimating probabilities for value i: P_i(pos) = p_i / (p_i + n_i), P_i(neg) = n_i / (p_i + n_i)
Uncertainty from value i: I(P_i(pos), P_i(neg))
But attribute A has v values (e.g., 2 for color); how do we combine the uncertainty from the different attribute values?
Remainder(A) = After(A) = Σ_{i=1..v} (p_i + n_i) / (p + n) · I(P_i(pos), P_i(neg))   [expected uncertainty]
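The color/size illustration above can be run through these formulas directly. A sketch in which the example counts are an assumption chosen to match the stated probabilities (say two black positives, two white negatives, with one positive and one negative of each size):

import math

def information(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def remainder(value_counts):
    # value_counts: list of (p_i, n_i) pairs, one per attribute value.
    p = sum(pi for pi, ni in value_counts)
    n = sum(ni for pi, ni in value_counts)
    return sum((pi + ni) / (p + n) * information([pi / (pi + ni), ni / (pi + ni)])
               for pi, ni in value_counts if pi + ni > 0)

# color: black -> (2 pos, 0 neg), white -> (0 pos, 2 neg)
print(remainder([(2, 0), (0, 2)]))   # 0.0: color removes all uncertainty
# size: large -> (1 pos, 1 neg), small -> (1 pos, 1 neg)
print(remainder([(1, 1), (1, 1)]))   # 1.0: size tells us nothing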

Choosing an attribute: information gain
Information gain (reduction in uncertainty):
Gain(A) = Before() - After(A)
Why Before() - After(A), and not After(A) - Before()?
Before() should have more uncertainty.
Choose the attribute A with the largest Gain(A).

Example contd.
Decision tree learned from the 12 examples:
[Figure: learned tree with root Patrons? (None, Some, Full), then Hungry? (Yes, No), then Type? (French, Italian, Thai, Burger), then Fri/Sat?]
Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by a small amount of data.
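For the 12 restaurant examples, the per-value positive/negative counts for Patrons and Type can be read off the table earlier in the notes, and the gain computation looks like this (a sketch; the helpers repeat the formulas from the previous slides):

import math

def information(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gain(value_counts):
    # value_counts: list of (p_i, n_i) per attribute value; Gain = Before - Remainder.
    p = sum(pi for pi, ni in value_counts)
    n = sum(ni for pi, ni in value_counts)
    before = information([p / (p + n), n / (p + n)])
    after = sum((pi + ni) / (p + n) * information([pi / (pi + ni), ni / (pi + ni)])
                for pi, ni in value_counts if pi + ni > 0)
    return before - after

# Patrons: None -> 0 pos / 2 neg, Some -> 4 / 0, Full -> 2 / 4
print(gain([(0, 2), (4, 0), (2, 4)]))              # about 0.541 bits
# Type: French -> 1 / 1, Italian -> 1 / 1, Thai -> 2 / 2, Burger -> 2 / 2
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))      # 0.0 bits

Patrons has the largest gain of any attribute on this data, which is why it ends up at the root of the learned tree.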

Performance measurement
How do we know that h ≈ f?
How about measuring the accuracy of h on the examples that were used to learn h?

How do we know that h ≈ f? (Hume's Problem of Induction)
1. Use theorems of computational/statistical learning theory
2. Try h on a new test set of examples
   - use the same distribution over the example space as for the training set
   - divide the data into two disjoint subsets: a training set and a test set
   - prediction accuracy: accuracy on the (unseen) test set

Learning curve = % correct on the test set as a function of training set size
[Figure: learning curve for the restaurant data; % correct on the test set climbs from roughly 0.4 toward 1.0 as the training set size grows from 0 to 100.]

The learning curve depends on realizability: realizable (the hypothesis space can express the target function) vs. non-realizable.
Non-realizability can be due to missing attributes and/or a restricted hypothesis class (e.g., a thresholded linear function).
Redundant expressiveness (e.g., loads of irrelevant attributes) also degrades the curve.
[Figure: idealized learning curves, % correct vs. number of examples, for the realizable, redundant, and nonrealizable cases.]
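A learning curve of this kind is straightforward to produce for any classifier. A sketch assuming scikit-learn is available, with its decision tree learner standing in for DTL (the synthetic dataset and the training set sizes are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic data standing in for the restaurant examples.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (10, 25, 50, 100, 200, len(X_train)):
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    acc = clf.score(X_test, y_test)            # % correct on the (unseen) test set
    print(f"training set size {m:4d}: test accuracy {acc:.2f}")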

Irrelevant attributes
Consider adding the attribute Date (month and day). How can it affect the learned tree?

Overfitting
More attributes → a larger hypothesis space.
A larger hypothesis space can lead to more hypotheses that represent meaningless regularities/patterns.
Overfitting: high accuracy on the training set but low accuracy on the test set, i.e., low prediction accuracy.
Select the attribute with the largest information gain; however, is the gain significant? → (statistical) significance test.
Pruning: do not include an attribute if its information gain is not statistically significant.
Potentially less than 100% accuracy on the training set (why?), but improved prediction accuracy on the test set.

Significance test
Null hypothesis (in statistics): the attribute is irrelevant (the gain is not significant).
Alternative hypothesis: the attribute is relevant.
Calculating the deviation from the expected counts:
expected p̂_i = p · (p_i + n_i) / (p + n)
expected n̂_i = n · (p_i + n_i) / (p + n)
Deviation (from expected): D = Σ_{i=1..v} [ (p_i - p̂_i)^2 / p̂_i + (n_i - n̂_i)^2 / n̂_i ]
D is χ² (chi-squared) distributed with v - 1 degrees of freedom.
χ² test in statistics: with a confidence level (e.g., 95%), if D > the χ² critical value, the attribute is relevant (the null hypothesis is rejected).

Additional issues
- Missing attribute values.
- Gain() is biased toward attributes with more values.
- Continuous-valued (numeric) attributes have an infinite number of values.
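A sketch of this significance test in code, assuming SciPy is available for the χ² critical value (the counts passed in are the per-value positive/negative counts for the candidate attribute):

from scipy.stats import chi2

def attribute_is_relevant(value_counts, confidence=0.95):
    # value_counts: list of (p_i, n_i) per attribute value.
    p = sum(pi for pi, ni in value_counts)
    n = sum(ni for pi, ni in value_counts)
    deviation = 0.0
    for pi, ni in value_counts:
        expected_p = p * (pi + ni) / (p + n)
        expected_n = n * (pi + ni) / (p + n)
        if expected_p > 0:
            deviation += (pi - expected_p) ** 2 / expected_p
        if expected_n > 0:
            deviation += (ni - expected_n) ** 2 / expected_n
    critical = chi2.ppf(confidence, df=len(value_counts) - 1)
    return deviation > critical   # reject the null hypothesis: attribute is relevant

# Patrons on the 12 restaurant examples vs. an attribute that splits 50/50.
print(attribute_is_relevant([(0, 2), (4, 0), (2, 4)]))   # True: D ≈ 6.7 > 5.99
print(attribute_is_relevant([(3, 3), (3, 3)]))           # False: D = 0, gain not significant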

Learning as search
What is the state space in learning decision trees?
State-space formulation:
- State: a decision tree
- Initial state: an empty decision tree
- Action: add an attribute to the tree
- Goal test: all examples in each leaf have the same classification
What kind of search is DTL?

Summary
Learning is needed for unknown environments (and for lazy designers).
Learning agent = performance element + learning element.
The learning method depends on the type of performance element, the available feedback, the type of component to be improved, and its representation.
For supervised learning, the aim is to find a simple hypothesis that is approximately consistent with the training examples.
Decision tree learning uses information gain.
Learning performance = prediction accuracy measured on a test set.