Decision Trees. Doug Downey, EECS 348, Spring 2012, with slides from Pedro Domingos and Bryan Pardo

Outline
- Classical AI limitations: knowledge acquisition bottleneck, brittleness
- Modern directions: situatedness, embodiment
- Learning from data (machine learning)
- Probability

Recall: example
Learn a function from x = (x_1, ..., x_d) to f(x) in {0, 1}, given labeled examples (x, f(x)).
[Plot: labeled points in the (x_1, x_2) plane]

Instances
E.g., days, in terms of weather:

Sky    Temp  Humid   Wind    Water  Forecast
sunny  warm  normal  strong  warm   same
sunny  warm  high    strong  warm   same
rainy  cold  high    strong  warm   change
sunny  warm  high    strong  cool   change

Functions
Days on which my friend Aldo enjoys his favorite water sport:

INPUT                                          OUTPUT
Sky    Temp  Humid   Wind    Water  Forecast   f(x)
sunny  warm  normal  strong  warm   same       1
sunny  warm  high    strong  warm   same       1
rainy  cold  high    strong  warm   change     0
sunny  warm  high    strong  cool   change     1

Machine Learning!
Predict the output for a new instance:

INPUT                                          OUTPUT
Sky    Temp  Humid   Wind    Water  Forecast   f(x)
sunny  warm  normal  strong  warm   same       1
sunny  warm  high    strong  warm   same       1
rainy  cold  high    strong  warm   change     0
sunny  warm  high    strong  cool   change     1
rainy  warm  high    strong  cool   change     ?

General Machine Learning Task
DEFINE: a set X of instances (n-tuples x = <x_1, ..., x_n>), e.g., days described by attributes (or features): Sky, Temp, Humidity, Wind, Water, Forecast; and a target function f, e.g., EnjoySport: X -> Y = {0, 1}.
GIVEN: training examples D, i.e., examples <x, f(x)> of the target function.
FIND: a hypothesis h such that h(x) approximates f(x).
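To make the DEFINE/GIVEN/FIND pattern concrete, here is a minimal Python sketch (the names D and h are illustrative, not from the slides) encoding Aldo's four training examples and checking one hand-written hypothesis against them:

```python
# Training examples D: pairs <x, f(x)> for the EnjoySport target function.
# Each instance x is a 6-tuple: (Sky, Temp, Humid, Wind, Water, Forecast).
D = [
    (("sunny", "warm", "normal", "strong", "warm", "same"), 1),
    (("sunny", "warm", "high", "strong", "warm", "same"), 1),
    (("rainy", "cold", "high", "strong", "warm", "change"), 0),
    (("sunny", "warm", "high", "strong", "cool", "change"), 1),
]

def h(x):
    """A candidate hypothesis: predict 1 exactly when Sky is sunny."""
    return 1 if x[0] == "sunny" else 0

# FIND: h should approximate f; here we only measure agreement on D.
accuracy = sum(h(x) == y for x, y in D) / len(D)
print(accuracy)  # 1.0 -- h fits all four training examples
```

The sketch only measures agreement on the training data; the slides on overfitting below explain why that is not the whole story.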

Examples
- Credit risk analysis. X: properties of customer and proposed purchase. f(x): approve (1) or disapprove (0).
- Disease diagnosis. X: properties of patient (symptoms, lab tests). f(x): disease (if any).
- Face recognition. X: bitmap image. f(x): name of person.
- Automatic steering. X: bitmap picture of the road surface in front of the car. f(x): degrees to turn the steering wheel.

Appropriate applications
Situations in which:
- There is no human expert
- Humans can perform the task but can't describe how
- The desired function changes frequently
- Each user needs a customized f

Task: Will I wait for a table?

Hypothesis Spaces
A hypothesis space H is a subset of all functions g: X -> Y, e.g.:
- Linear separators
- Conjunctions of constraints on attributes (humidity must be low, and outlook != rain)
- Etc.
In machine learning, we restrict ourselves to H.
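As one concrete hypothesis space, here is a minimal sketch (names are illustrative, not from the slides) of conjunctions of equality constraints on the weather attributes:

```python
# One hypothesis space H: conjunctions of attribute = value constraints,
# as in "Sky = sunny AND Temp = warm".
ATTRS = ("Sky", "Temp", "Humid", "Wind", "Water", "Forecast")

def make_conjunction(**constraints):
    """Return h(x) = 1 iff x satisfies every attribute = value constraint."""
    idx = {name: ATTRS.index(name) for name in constraints}
    def h(x):
        return 1 if all(x[idx[a]] == v for a, v in constraints.items()) else 0
    return h

h = make_conjunction(Sky="sunny", Temp="warm")
print(h(("sunny", "warm", "high", "strong", "cool", "change")))  # 1
print(h(("rainy", "cold", "high", "strong", "warm", "change")))  # 0
```

Restricting H to such conjunctions is exactly the selection bias discussed on the Inductive Bias slide below.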

Decision Trees!
[Figure: an example decision tree]

Expressiveness of D-Trees

A learned decision tree

Inductive Bias
To learn, we must prefer some functions to others.
- Selection bias: use a restricted hypothesis space, e.g., linear separators or 2-level decision trees.
- Preference bias: use the whole function space, but state a preference over concepts, e.g., the lowest-degree polynomial that separates the data, or the shortest decision tree that fits the data.

Decision Tree Learning (ID3)
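The slide's algorithm listing isn't in the transcript; below is a minimal self-contained Python sketch of ID3 as usually described (greedily split on the attribute with the highest information gain, recurse on each branch). All names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a multiset of labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, a):
    """Expected reduction in entropy from splitting on attribute index a."""
    groups = {}
    for x, y in examples:
        groups.setdefault(x[a], []).append(y)
    remainder = sum(len(g) / len(examples) * entropy(g)
                    for g in groups.values())
    return entropy([y for _, y in examples]) - remainder

def id3(examples, attrs):
    """Grow a tree recursively; leaves are class labels."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:            # pure node: predict its label
        return labels[0]
    if not attrs:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))
    tree = {"attr": best, "branches": {}}
    for v in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == v]
        tree["branches"][v] = id3(subset, [a for a in attrs if a != best])
    return tree

# On Aldo's examples: id3(D, list(range(6))) returns
# {'attr': 0, 'branches': {'sunny': 1, 'rainy': 0}} -- one split on Sky.
```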

Recap
Machine learning:
- Goal: generate a hypothesis (a function from instances described by attributes to an output) using training examples.
- Requires inductive bias: a restricted hypothesis space, or preferences over hypotheses.
Decision trees:
- Simple representation of hypotheses, recursive learning algorithm.
- Prefer smaller trees!

Choosing an attribute

Information

Entropy
The entropy H(V) of a Boolean random variable V, as the probability of V = 0 varies from 0 to 1.
[Plot: H(V) versus P(V = 0)]
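For reference, the curve on that slide is the standard Boolean entropy (the formula itself is not in the transcript; this is the usual definition):

```latex
H(V) = -p \log_2 p \;-\; (1 - p) \log_2 (1 - p), \qquad p = P(V = 0)
```

It is 0 when p is 0 or 1 (no uncertainty) and peaks at 1 bit when p = 1/2.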

Using Information
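The slide body isn't in the transcript; as an illustration of how information gain is used to choose an attribute, this continuation of the sketches above (reusing D and info_gain) ranks the EnjoySport attributes:

```python
# Rank attributes by information gain; ID3 splits on the argmax.
# Indices 0..5 correspond to (Sky, Temp, Humid, Wind, Water, Forecast).
gains = {a: round(info_gain(D, a), 3) for a in range(6)}
print(gains)
# {0: 0.811, 1: 0.811, 2: 0.123, 3: 0.0, 4: 0.123, 5: 0.311}
# Sky and Temp each fully separate the 0s from the 1s on these four rows;
# Wind is "strong" everywhere, so splitting on it gains nothing.
```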

Measuring Performance

What the learning curve tells us

Rule #2 of Machine Learning
The best hypothesis almost never achieves 100% accuracy on the training data.
(Rule #1 was: you can't learn anything without inductive bias.)

Overfitting

Overfitting is due to noise
Sources of noise:
- Erroneous training data: concept variable incorrect (annotator error), attributes mis-measured
- Much more significant: irrelevant attributes; target function not deterministic in the attributes

Irrelevant attributes
If many attributes are noisy, information gains can be spurious. E.g., with 20 noisy attributes and 10 training examples, the expected number of different depth-3 trees that split the training data perfectly using only noisy attributes is 13.4.
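A small simulation (illustrative; it reuses entropy() from the ID3 sketch above) shows the effect: with purely random binary attributes and few examples, the best observed gain is usually well above zero.

```python
import random

def binary_gain(xs, ys):
    """Information gain of a single binary attribute xs for labels ys."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    remainder = sum(len(g) / len(ys) * entropy(g) for g in groups.values())
    return entropy(ys) - remainder

random.seed(0)
ys = [random.randint(0, 1) for _ in range(10)]        # 10 random labels
noisy = [[random.randint(0, 1) for _ in range(10)]    # 20 noisy attributes
         for _ in range(20)]
print(max(binary_gain(a, ys) for a in noisy))  # typically well above 0
```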

Non-determinism
In general, we can't measure all the variables we need for perfect prediction, so the target function is not uniquely determined by the attribute values.

Non-determinism: Example

Humidity  EnjoySport
0.90      0
0.87      1
0.80      0
0.75      0
0.70      1
0.69      1
0.65      1
0.63      1

Decent hypothesis:
- Humidity > 0.70 -> No
- Otherwise -> Yes

Overfit hypothesis:
- Humidity > 0.89 -> No
- 0.80 < Humidity <= 0.89 -> Yes
- 0.70 < Humidity <= 0.80 -> No
- Humidity <= 0.70 -> Yes
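The two hypotheses can be checked directly (the data is from the slide; the code is an illustrative sketch):

```python
# Training rows from the slide: (Humidity, EnjoySport).
data = [(0.90, 0), (0.87, 1), (0.80, 0), (0.75, 0),
        (0.70, 1), (0.69, 1), (0.65, 1), (0.63, 1)]

def decent(h):                      # Humidity > 0.70 -> No, else Yes
    return 0 if h > 0.70 else 1

def overfit(h):                     # memorizes the interval pattern
    if h > 0.89: return 0
    if h > 0.80: return 1
    if h > 0.70: return 0
    return 1

for name, f in (("decent", decent), ("overfit", overfit)):
    acc = sum(f(h) == y for h, y in data) / len(data)
    print(name, acc)  # decent 0.875, overfit 1.0
```

The overfit hypothesis wins on the training data, but only because it has carved intervals around individual noisy rows.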

Avoiding Overfitting
Approaches:
- Stop splitting when information gain is low or when the split is not statistically significant.
- Grow the full tree, then prune it when done.
How to pick the best tree?
- Performance on training data?
- Performance on validation data?
- Complexity penalty?

Effect of Reduced Error Pruning

C4.5 Algorithm
- Builds a decision tree from labeled training data
- Also by Ross Quinlan
- Generalizes ID3 by: allowing continuous-valued attributes, allowing missing attributes in examples, and pruning the tree after building to improve generality

Rule post-pruning
Used in C4.5. Steps:
1. Build the decision tree.
2. Convert it to a set of logical rules.
3. Prune each rule independently.
4. Sort the rules into the desired sequence for use.
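The slide lists the steps without code; as an illustrative sketch of step 2, root-to-leaf paths of a tree in the id3() format above can be read off as rules:

```python
def tree_to_rules(tree, conditions=()):
    """Turn each root-to-leaf path into a (conditions, prediction) rule."""
    if not isinstance(tree, dict):             # leaf: emit one finished rule
        return [(conditions, tree)]
    rules = []
    for value, subtree in tree["branches"].items():
        rules += tree_to_rules(subtree,
                               conditions + ((tree["attr"], value),))
    return rules

# e.g. tree_to_rules(id3(D, list(range(6)))) yields rules such as
# (((0, 'sunny'),), 1) -- "IF Sky = sunny THEN EnjoySport = 1".
```

Each rule can then be pruned independently (step 3) by dropping conditions whose removal does not hurt estimated accuracy.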

Decision Tree Boundaries

Decision Trees: Inductive Bias
How to solve 2-bit parity:
- Two-step look-ahead, or
- Split on pairs of attributes at once
For k-bit parity, why not just do k-step look-ahead, or split on k attribute values? Parity functions are the victims of the decision tree's inductive bias.
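A short check (reusing entropy() from the ID3 sketch above) shows why greedy, one-step splitting is blind to parity: on 2-bit XOR, each attribute on its own has zero information gain.

```python
# 2-bit parity (XOR): f(x1, x2) = x1 XOR x2.
xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for a in (0, 1):
    groups = {}
    for x, y in xor:
        groups.setdefault(x[a], []).append(y)
    remainder = sum(len(g) / len(xor) * entropy(g) for g in groups.values())
    print(a, entropy([y for _, y in xor]) - remainder)  # 0.0 for both
```

Splitting on the pair (x1, x2) at once, as the slide suggests, removes all the uncertainty in one step.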

Take-away about decision trees
- Used as classifiers
- Supervised learning algorithms (ID3, C4.5)
- (Mostly) batch processing
- Good for situations where: the classification categories are finite, and the data can be represented as vectors of attributes