V. Lesser CS683 F2004


Lecture 17: Learning - 1
Victor Lesser, CMPSCI 683, Fall 2004

Today's Lecture
- The structure of a learning agent
- Basic problems: bias, Ockham's razor, expressiveness
- Decision-tree algorithms

Commonsense Definition
- Learning is change within a system that improves its performance.
- This admits a lot of different behaviors, but identifies the basic preconditions of learning:
  - Learning systems must be capable of change.
  - Learning systems must do something differently as a result of the change.

Why Should Systems Learn?
- A viable alternative to problem solving: learning can simplify the complexity of problem solving by replacing procedural knowledge, inferencing, and search with learned functions and policies.
- Learning increases the efficiency, robustness, survivability, and autonomy of a system; it is key to operating in open environments.
- A learning program can become better than its teacher.

Characterizing Learning Systems
- What changes as a result of learning?
- How does the system find out that change is needed?
- How does the system localize the problem to find out what changes are necessary?
- What is the mechanism of change?

Available Feedback
- Supervised learning: is told by a teacher what action is best in a specific situation.
- Reinforcement learning: gets feedback about the consequences of a specific sequence of actions in a certain situation. Can also be thought of as supervised learning with a less informative feedback signal.
- Unsupervised learning: no feedback about actions. Learns to predict future percepts given its previous percepts. Can't learn what to do unless it already has a utility function.

A Model of Learning Agents
[Figure: sensors and effectors connect the agent to the environment; a critic compares percepts against a fixed performance standard and sends feedback to the learning element; the learning element makes changes to the performance element, receives knowledge from it, and sets learning goals for the problem generator.]

Model of Learning Agent
- The learning element modifies the performance element in response to feedback.
- The critic tells the learning element how well the agent is doing, relative to a fixed standard of performance.
- The problem generator suggests actions that will lead to new and informative experiences (related to the decision to acquire information).
A minimal code sketch of this loop follows below.
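To make the architecture concrete, here is a hedged sketch of that loop in Python. The class and method names (LearningAgent, Critic.evaluate, and so on) are illustrative assumptions, not part of the course material.

# Minimal sketch of the learning-agent loop described above.
# All class and method names are illustrative assumptions, not from the slides.
class LearningAgent:
    def __init__(self, performance_element, critic, learning_element, problem_generator):
        self.performance_element = performance_element  # maps percepts to actions
        self.critic = critic                            # scores behavior against a fixed standard
        self.learning_element = learning_element        # modifies the performance element
        self.problem_generator = problem_generator      # proposes exploratory actions

    def step(self, percept):
        # The critic compares the percept with the fixed performance standard.
        feedback = self.critic.evaluate(percept)
        # The learning element changes the performance element in response to feedback.
        self.learning_element.update(self.performance_element, feedback)
        # Usually act with the performance element; sometimes take an exploratory action.
        action = self.performance_element.decide(percept)
        exploratory = self.problem_generator.suggest(percept)
        return exploratory if exploratory is not None else action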

Design of the Learning Element
- Goals: learn better actions; speed up the performance element.
- Which components of the performance element are to be improved.
- What representation is used for those components.
- What feedback is available.
- What prior information is available.

Types of Learned Knowledge
- A direct mapping from conditions on the current state to actions.
- Weighting of the parameters of a multiattribute decision process.
- A means to infer relevant properties of the world from the percept sequence.
- Information about the way the world evolves (allows prediction of future events).

Applicability of Learned Knowledge cont.
- Information about the results of possible actions the agent can take.
- Utility information indicating the desirability of world states.
- Action-value information indicating the desirability of particular actions in particular states.
- Goals that describe classes of states whose achievement maximizes the agent's utility.

Dimensions of Learning
- The type of training instances: the beginning data for the learning task.
- The language used to represent knowledge. Specific training instances must be translated into this representation language; in some programs the training instances are in the same language as the internal knowledge base and this step is unnecessary.
- A set of operations on representations. Typical operations generalize or specialize existing knowledge, combine units of knowledge, or otherwise modify the program's existing knowledge or the representation of the training instances.

Dimensions of Learning cont.
- The concept space: the operations define a space of possible knowledge structures that is searched to find the appropriate characterization of the training instances and similar problems.
- The learning algorithms and heuristics employed to search the concept space: the order of the search and the use of heuristics to guide it.

Types of Knowledge Representations for Learning
- numerical parameters
- decision trees
- formal grammars
- production rules
- logical theories
- graphs and networks
- frames and schemas
- computer programs (procedural encoding)

Learning Functions
- All learning can be seen as learning the representation of a function.
- Choice of representation of a function: a trade-off between expressiveness and efficiency. Is what you want representable? Is what you want learnable (number of examples, cost of search)?
- Choice of training data: correctly reflects past experiences; correctly predicts future experiences.
- How to judge the goodness of the learned function.

Some Additional Thoughts
- Importance of prior knowledge: prior knowledge can significantly speed up the learning process (EBL: explanation-based learning).
- Learning as a search process: finding the best function.
- Incremental (on-line) vs. off-line learning.

Inductive (Supervised) Learning
- Let an example be (x, f(x)).
- Given a collection of examples of f, return a function h that approximates f. This function h is called a hypothesis.
- Feedback is the relation between f(x) and h(x).

Problems
- (x, f(x)) may be only approximately correct: noise, missing components.
- Many hypotheses h are approximately consistent with the training set (curve fitting).
- A preference for one hypothesis over another, beyond consistency, is called bias.

Ockham's Razor
- Simple hypotheses that are consistent with the data are preferred.
- We want to maximize some metric of consistency and simplicity in the choice of the most appropriate function. (A small curve-fitting sketch follows below.)

Learning Classification Decision Trees
- Restricted representation of logical sentences: Boolean functions.
- Takes as input a situation described by a set of properties and outputs a yes/no decision.
- A tree of property-value tests; terminals are decisions.
- Example task: learn, based on conditions of the situation, whether to wait at a restaurant for a table.
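To illustrate the bias and Ockham's razor points above, here is a small sketch (an illustration only; the synthetic data, the chosen degrees, and the use of NumPy are assumptions, not from the lecture). It fits polynomial hypotheses of increasing degree to noisy samples of a simple function: the high-degree hypothesis matches the training examples more closely but usually generalizes worse than the simpler one.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: 2.0 * x + 1.0                        # the "true" function f (assumed for the demo)
x_train = np.linspace(0, 1, 10)
y_train = f(x_train) + rng.normal(0, 0.1, 10)      # examples (x, f(x)) corrupted by noise
x_test = np.linspace(0, 1, 100)

for degree in (1, 9):                              # simple vs. complex hypothesis h
    h = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_err = np.mean((h(x_train) - y_train) ** 2)
    test_err = np.mean((h(x_test) - f(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")
# The degree-9 hypothesis is (nearly) consistent with the training set but
# typically has higher test error: a bias toward simpler hypotheses pays off.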

Decision Trees
- A (classification) decision tree takes as input a situation described by a set of attributes and returns a decision.
- It can express any Boolean function of the input attributes.
- How do we choose between equally consistent trees?

Example: Waiting for a Table
Attributes: Alternate; Bar; Fri/Sat; Hungry; Patrons (None, Some, Full); Price ($, $$, $$$); Raining; Reservation; Type (French, Italian, Thai, Burger); WaitEstimate (0-10, 10-30, 30-60, >60).

Inducing Decision Trees from Examples
Constructing the Decision Tree
Construct a root node that includes all the examples, then for each node (a Python sketch of this recursion follows below):
1. If there are both positive and negative examples, choose the best attribute to split them.
2. If all the examples are positive (negative), answer yes (no).
3. If there are no examples for a case (no observed examples), choose a default based on the majority classification at the parent (e.g., the case of Raining under Hungry = yes, Alternate = yes).
4. If there are no attributes left but we still have both positive and negative examples, the selected features are not sufficient for classification or there is error in the examples (can use a majority vote).
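The following is a minimal sketch of that recursion. It is not the course's code: the representation (examples as dicts mapping attribute to value, Boolean labels), the function names, and the externally supplied choose_attribute scorer (for instance, the information-gain measure defined later) are assumptions made for illustration.

from collections import Counter

def majority(labels):
    """Most common label; used for defaults and majority votes."""
    return Counter(labels).most_common(1)[0][0]

def learn_tree(examples, labels, attributes, domains, default, choose_attribute):
    # Case 3: no observed examples for this branch -> use the parent's majority class.
    if not examples:
        return default
    # Case 2: all examples are positive (or all negative) -> answer yes (no).
    if len(set(labels)) == 1:
        return labels[0]
    # Case 4: no attributes left but labels are still mixed -> majority vote.
    if not attributes:
        return majority(labels)
    # Case 1: both classes present -> split on the best attribute and recurse.
    best = choose_attribute(examples, labels, attributes)
    remaining = [a for a in attributes if a != best]
    tree = {"split_on": best, "branches": {}}
    for value in domains[best]:
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        tree["branches"][value] = learn_tree(
            [examples[i] for i in idx], [labels[i] for i in idx],
            remaining, domains, majority(labels), choose_attribute)
    return tree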

Splitting the Examples
A perfect attribute divides the examples into sets that are all positive or all negative.
Full example set: positive X1, X3, X4, X6, X8, X12; negative X2, X5, X7, X9, X10, X11.

Splitting on Patrons?
- None: negative X7, X11
- Some: positive X1, X3, X6, X8
- Full: positive X4, X12; negative X2, X5, X9, X10

Splitting on Type?
- French: positive X1; negative X5
- Italian: positive X6; negative X10
- Thai: positive X4, X8; negative X2, X11
- Burger: positive X3, X12; negative X7, X9

Splitting Examples cont.
After splitting on Patrons, the None branch is labeled No, the Some branch is labeled Yes, and the mixed Full branch is split again on Hungry?
- Hungry = Yes: positive X4, X12; negative X2, X10 (split further)
- Hungry = No: negative X5, X9 (labeled No)

Decision Tree Algorithm
- Basic idea: build the tree greedily. Decisions once made are not revised (no search).
- Choose the most significant attribute to be the root, then split the data set on that attribute and recurse on each subset.
- Define significance using information theory (information gain / entropy).
- Finding the smallest decision tree is an intractable problem.

Expressiveness of Decision Trees
- Any Boolean function can be written as a decision tree, e.g.
  $\forall r\; Patrons(r, Full) \land WaitEstimate(r, 10\text{-}30) \land Hungry(r, N) \Rightarrow WillWait(r)$
- Each row of the truth table corresponds to a path in the decision tree.
- With $n$ literals there are $2^n$ rows and $2^{2^n}$ functions.

Limits on Expressibility
- A decision tree cannot represent tests that refer to two or more different objects, e.g.
  $\exists r_2\; Nearby(r_2, r) \land Price(r, p) \land Price(r_2, p_2) \land Cheaper(p_2, p)$
- A new Boolean attribute (CheaperRestaurantNearby) could encode this, but it is intractable to add all such attributes.
- Some truth tables cannot be compactly represented as decision trees:
  - The parity function, which returns 1 if and only if an even number of inputs are 1, needs an exponentially large decision tree.
  - The majority function, which returns 1 if more than half of its inputs are 1.

Choosing the Best Attribute Based on Information Theory
- Use the expected amount of information provided by an attribute (similar to the concept of the value of perfect information).
- Information content of a set of examples, where the $v_i$ are the possible answers:
  $I(P(v_1), \dots, P(v_n)) = -\sum_{i=1}^{n} P(v_i)\,\log_2 P(v_i)$
- With $p$ positive and $n$ negative examples:
  $I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) = -\tfrac{p}{p+n}\log_2\tfrac{p}{p+n} - \tfrac{n}{p+n}\log_2\tfrac{n}{p+n}$
- Example: 12 cases, 6 positive and 6 negative; the information content is 1 bit.
- Expected information remaining after testing attribute $A$ with $v$ values (a small implementation follows below):
  $remainder(A) = \sum_{i=1}^{v} \tfrac{p_i+n_i}{p+n}\, I\!\left(\tfrac{p_i}{p_i+n_i}, \tfrac{n_i}{p_i+n_i}\right)$
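Here is a small sketch of these two quantities (a hedged illustration; the function names and the count-based interface are assumptions, not course code):

import math

def information(p, n):
    """I(p/(p+n), n/(p+n)) in bits, for p positive and n negative examples."""
    total = p + n
    bits = 0.0
    for k in (p, n):
        if k:  # 0 * log2(0) is taken to be 0
            bits -= (k / total) * math.log2(k / total)
    return bits

def remainder(branch_counts, p, n):
    """Expected information left after a split; branch_counts is [(p_i, n_i), ...]."""
    return sum((pi + ni) / (p + n) * information(pi, ni) for pi, ni in branch_counts)

# Sanity check from the slide: 12 cases, 6 positive and 6 negative -> 1 bit.
print(information(6, 6))   # 1.0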

Choosing the Best Attribute Based on Information Theory cont.
- $Gain(A) = I\!\left(\tfrac{p}{p+n}, \tfrac{n}{p+n}\right) - remainder(A)$
- $Gain(Patrons) = 1 - \left[\tfrac{2}{12} I(0,1) + \tfrac{4}{12} I(1,0) + \tfrac{6}{12} I\!\left(\tfrac{2}{6}, \tfrac{4}{6}\right)\right] \approx 0.541$ bits
- $Gain(Type) = 1 - \left[\tfrac{2}{12} I\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{2}{12} I\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right) + \tfrac{4}{12} I\!\left(\tfrac{2}{4}, \tfrac{2}{4}\right)\right] = 0$ bits

Example (Quinlan 83)
CLASS  HEIGHT  HAIR   EYES
-      SHORT   BLOND  BROWN
-      TALL    DARK   BROWN
+      TALL    BLOND  BLUE
-      TALL    DARK   BLUE
-      SHORT   DARK   BLUE
+      TALL    RED    BLUE
-      TALL    BLOND  BROWN
+      SHORT   BLOND  BLUE

Class counts (positive/negative) per attribute value:
- HEIGHT: SHORT 1/2, TALL 2/3
- HAIR: BLOND 2/2, DARK 0/3, RED 1/0
- EYES: BROWN 0/3, BLUE 3/2
Partitioning on HAIR gives the least impurity. (The script below reproduces these numbers.)
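A short script that reproduces the gains quoted above from the per-branch positive/negative counts (an illustration; the function names are assumptions, and the information measure is restated so the snippet is self-contained):

import math

def information(p, n):
    total = p + n
    return -sum((k / total) * math.log2(k / total) for k in (p, n) if k)

def gain(branches, p, n):
    """Information gain of a split; branches is a list of (p_i, n_i) counts."""
    rem = sum((pi + ni) / (p + n) * information(pi, ni) for pi, ni in branches)
    return information(p, n) - rem

# Restaurant example: 6 positive, 6 negative.
print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))            # Patrons: ~0.541 bits
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))    # Type: 0.0 bits

# Quinlan (1983) example: 3 positive, 5 negative.
print(gain([(2, 2), (0, 3), (1, 0)], 3, 5))            # HAIR:   ~0.454 bits (best attribute)
print(gain([(1, 2), (2, 3)], 3, 5))                    # HEIGHT: ~0.003 bits
print(gain([(0, 3), (3, 2)], 3, 5))                    # EYES:   ~0.348 bits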

Assessing the Performance of the Learning Algorithm Full Learned Decision Tree Randomly divide available examples into test and training set A learning curve for the decision tree algorithm on 100 randomly generated examples in the restaurant domain. The graph summarizes 20 trials. 37 How correct is this? Can we even judge this idea? Not all attributes used How does the number of examples seen relate to the likelihood of correctness? 38 Noise and Overfitting Broadening the applicability - Missing Data Finding meaningless regularities in the data. With enough attributes, you re likely to find one which captures some of the noise in your data. One solution is to prune the tree. Collapse subtrees which provide only minor improvements Using information gain as a criteria Handling examples with missing data Add new attribute value - unknown Instantiated example with all possible values of missing attribute but assign weights to each instance based on likelihood of missing value being a particular value given the distribution of examples in the parent node Modify decision tree algorithm to take into account weighting 39 40

Broadening the Applicability - Multivalued Attributes
- Handling multivalued (large-domain) attributes in classification needs another measure of information gain: the plain information-gain measure gives an inappropriate indication of attribute usefulness because of the likelihood of singleton values.
- Gain ratio: gain divided by the intrinsic information content of the split (see the sketch at the end of these notes).

Broadening the Applicability - Continuous-Valued Attributes
- Discretize continuous-valued attributes (example: $, $$, $$$).
- Preprocess to find out which ranges give the most useful information for classification purposes.
- Incremental construction.

Next Lecture
- The version space algorithm
- Neural networks
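To make the gain-ratio and discretization ideas from the "Broadening the Applicability" slides concrete, here is a small sketch (illustrative only; the function names and the threshold-search strategy are assumptions, not course material). Gain ratio divides the gain by the split's intrinsic information, and a continuous attribute is discretized by scoring candidate cut points.

import math

def entropy(labels):
    """Information content (bits) of a list of class labels."""
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

def gain_and_ratio(values, labels):
    """Information gain and gain ratio of splitting `labels` by attribute `values`."""
    total = len(labels)
    branches = {}
    for v, c in zip(values, labels):
        branches.setdefault(v, []).append(c)
    rem = sum(len(b) / total * entropy(b) for b in branches.values())
    gain = entropy(labels) - rem
    # Intrinsic information of the split itself; many near-singleton values make it large,
    # which is what penalizes large-domain attributes.
    intrinsic = -sum(len(b) / total * math.log2(len(b) / total) for b in branches.values())
    return gain, (gain / intrinsic if intrinsic else 0.0)

def best_threshold(x, labels):
    """Discretize a continuous attribute: pick the cut point with the highest gain."""
    candidates = sorted(set(x))
    return max(
        ((t, gain_and_ratio(["<=" if xi <= t else ">" for xi in x], labels)[0])
         for t in candidates[:-1]),
        key=lambda pair: pair[1])

# Example: a raw price attribute, discretized at the most informative cut point.
prices = [8, 9, 15, 16, 30, 40]
wait = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(prices, wait))   # -> (15, 1.0): a perfect split at price <= 15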