Machine Learning. June 22, 2006. CS 486/686, University of Waterloo. Lecture slides (c) 2006 K. Larson and P. Poupart.


Outline
- Inductive learning
- Decision trees
Reading: R&N Ch. 18.1-18.3

What is Machine Learning? Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. [T. Mitchell, 1997]

Examples
- Backgammon (reinforcement learning). T: playing backgammon; P: percent of games won against an opponent; E: playing practice games against itself.
- Handwriting recognition (supervised learning). T: recognize handwritten words within images; P: percent of words correctly recognized; E: database of handwritten words with given classifications.
- Customer profiling (unsupervised learning). T: cluster customers based on transaction patterns; P: homogeneity of clusters; E: database of customer transactions.

Representation: The representation of the learned information is important, since it determines how the learning algorithm will work. Common representations: linear weighted polynomials (the special case used for neural nets), propositional logic (today's lecture), first-order logic, and Bayes nets.

Inductive learning (aka concept learning). Induction: given a training set of examples of the form (x, f(x)), where x is the input and f(x) is the output, return a function h that approximates f. h is called the hypothesis.

Classification. Training set (x = the attributes, f(x) = EnjoySport):

Sky    Humidity  Wind    Water  Forecast  EnjoySport
Sunny  Normal    Strong  Warm   Same      Yes
Sunny  High      Strong  Warm   Same      Yes
Sunny  High      Strong  Warm   Change    No
Sunny  High      Strong  Cool   Change    Yes

Possible hypotheses:
h1: Sky=Sunny → EnjoySport=Yes
h2: Water=Cool or Forecast=Same → EnjoySport=Yes
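As an illustration (my own sketch, not from the slides), the following Python snippet encodes this training set and checks which of the two hypotheses is consistent with it; the attribute names are spelled out and the rule encodings are my reading of h1 and h2.

```python
# Training set from the EnjoySport example: (attributes, label)
examples = [
    ({"Sky": "Sunny", "Humidity": "Normal", "Wind": "Strong", "Water": "Warm", "Forecast": "Same"}, "Yes"),
    ({"Sky": "Sunny", "Humidity": "High",   "Wind": "Strong", "Water": "Warm", "Forecast": "Same"}, "Yes"),
    ({"Sky": "Sunny", "Humidity": "High",   "Wind": "Strong", "Water": "Warm", "Forecast": "Change"}, "No"),
    ({"Sky": "Sunny", "Humidity": "High",   "Wind": "Strong", "Water": "Cool", "Forecast": "Change"}, "Yes"),
]

def h1(x):  # Sky = Sunny  =>  EnjoySport = Yes
    return "Yes" if x["Sky"] == "Sunny" else "No"

def h2(x):  # Water = Cool or Forecast = Same  =>  EnjoySport = Yes
    return "Yes" if x["Water"] == "Cool" or x["Forecast"] == "Same" else "No"

def consistent(h, data):
    """A hypothesis is consistent if it agrees with f on every training example."""
    return all(h(x) == y for x, y in data)

print("h1 consistent:", consistent(h1, examples))  # False: example 3 is Sunny but labelled No
print("h2 consistent:", consistent(h2, examples))  # True: h2 agrees with all four examples
```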

Regression: Find a function h that fits f at the given instances x.

Regression: Find a function h that fits f at the given instances x. (Figure: two candidate fits, h1 and h2, through the same points.)

Hypothesis space: The hypothesis space H is the set of all hypotheses h that the learner may consider. Learning is a search through hypothesis space. Objective: find a hypothesis that agrees with the training examples. But what about unseen examples?

Generalization: A good hypothesis will generalize well, i.e. predict unseen examples correctly. Usually, any hypothesis h found to approximate the target function f well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples.

Inductive learning: Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting: a sequence of slides shows fits of increasing complexity to the same data points (figures omitted). Ockham's razor: prefer the simplest hypothesis consistent with the data.
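A minimal sketch (my own, not from the slides) of the curve-fitting idea using NumPy: polynomials of increasing degree fit the training points ever more closely, but Ockham's razor says to prefer the simplest one that is good enough. The sample data and degrees are illustrative assumptions.

```python
import numpy as np

# Illustrative training data: five (x, f(x)) pairs
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1])

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)                     # least-squares polynomial fit
    residual = np.abs(np.polyval(coeffs, x) - y).max()    # worst error on the training points
    print(f"degree {degree}: max training error = {residual:.3f}")

# A degree-4 polynomial passes through all five points (it is consistent),
# but the degree-1 fit is nearly as good and far simpler; Ockham's razor
# suggests preferring it unless the data justify the extra complexity.
```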

Inductive learning: Finding a consistent hypothesis depends on the hypothesis space. For example, it is not possible to learn f(x) = ax + b + x sin(x) exactly when H is the space of polynomials of finite degree. A learning problem is realizable if the hypothesis space contains the true function; otherwise it is unrealizable. It is difficult to determine whether a learning problem is realizable, since the true function is not known.

Inductive learning: It is possible to use a very large hypothesis space, for example H = the class of all Turing machines. But there is a tradeoff between the expressiveness of a hypothesis class and the complexity of finding a simple, consistent hypothesis within the space: fitting straight lines is easy, fitting high-degree polynomials is hard, and fitting Turing machines is very hard!

Decision trees: Decision tree classification. Nodes are labeled with attributes, edges with attribute values, and leaves with classes. Classify an instance by starting at the root, testing the attribute specified by the root, then moving down the branch corresponding to the instance's value for that attribute. Continue until you reach a leaf, then return its class.

Decision tree (playing tennis):
- Outlook = Sunny → test Humidity: High → No, Normal → Yes
- Outlook = Overcast → Yes
- Outlook = Rain → test Wind: Strong → No, Weak → Yes
The instance <Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong> is classified as No.
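The following Python sketch (my own encoding, not from the slides) represents this tree as nested tuples and dictionaries and classifies the example instance; the data layout is just one convenient choice.

```python
# Decision tree for "playing tennis": internal nodes test an attribute,
# branches are attribute values, leaves are class labels ("Yes"/"No").
tennis_tree = ("Outlook", {
    "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, instance):
    """Walk from the root to a leaf, following the instance's attribute values."""
    while not isinstance(tree, str):        # a string is a leaf label
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree

instance = {"Outlook": "Sunny", "Temp": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tennis_tree, instance))      # -> "No"
```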

Decision tree representation: Decision trees can represent disjunctions of conjunctions of constraints on attribute values. The playing-tennis tree above corresponds to (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak).

Decision tree representation: Decision trees are fully expressive within the class of propositional languages. Any Boolean function can be written as a decision tree, trivially, by letting each row of the truth table correspond to a path in the tree. One can often use small trees, but some functions require exponentially large trees (e.g., the majority and parity functions). However, there is no representation that is efficient for all functions.

Inducing a decision tree. Aim: find a small tree consistent with the training examples. Idea: (recursively) choose the "most significant" attribute as the root of each (sub)tree.

Decision Tree Learning (the DTL algorithm; pseudocode figure not preserved in this transcription).
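Since the DTL pseudocode itself is not preserved here, below is a minimal recursive sketch in Python of the standard algorithm the later slides refer to (choose the most significant attribute, split, recurse). `choose_attribute` is left abstract; the information-gain version defined on the following slides can be plugged in. This is my own sketch, not the slide's figure.

```python
from collections import Counter

def dtl(examples, attributes, default, choose_attribute):
    """Recursive decision-tree learning.

    examples:   list of (attribute_dict, label) pairs
    attributes: attributes still available for testing
    default:    label to return when there are no examples
    choose_attribute: function picking the "most significant" attribute
    """
    if not examples:
        return default
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                     # all examples have the same class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                            # no attributes left to test
        return majority
    best = choose_attribute(attributes, examples)
    branches = {}
    for value in {x[best] for x, _ in examples}:  # one subtree per observed value
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = dtl(subset, remaining, majority, choose_attribute)
    return (best, branches)   # same (attribute, branches) format as the tree sketched above
```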

Choosing attribute tests: The central choice is deciding which attribute to test at each node. We want to choose the attribute that is most useful for classifying examples.

Example: the restaurant dataset (12 examples; the attribute table is omitted in this transcription).

Choosing an attribute. Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative". Patrons? is a better choice.

Using information theory: To implement Choose-Attribute in the DTL algorithm, measure uncertainty with entropy:

I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)

For a training set containing p positive examples and n negative examples:

I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}
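As a quick sanity check (my own sketch, not part of the slides), the two-class entropy above can be computed directly from the positive/negative counts:

```python
from math import log2

def entropy(p, n):
    """Entropy I(p/(p+n), n/(p+n)) of a set with p positive and n negative examples."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:                      # 0 * log2(0) is taken to be 0
            q = count / total
            result -= q * log2(q)
    return result

print(entropy(6, 6))   # 1.0 bit  (the evenly split restaurant training set)
print(entropy(4, 0))   # 0.0 bits (a pure subset)
```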

Information gain: A chosen attribute A with v distinct values divides the training set E into subsets E_1, ..., E_v according to the examples' values for A:

remainder(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} \, I\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

The information gain (IG), or reduction in uncertainty, from the attribute test is:

IG(A) = I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - remainder(A)

Choose the attribute with the largest IG.

Information gain: For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit. Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 - [ (2/12) I(0,1) + (4/12) I(1,0) + (6/12) I(2/6, 4/6) ] ≈ 0.541 bits

IG(Type) = 1 - [ (2/12) I(1/2, 1/2) + (2/12) I(1/2, 1/2) + (4/12) I(2/4, 2/4) + (4/12) I(2/4, 2/4) ] = 0 bits

Patrons has the highest IG of all the attributes and so is chosen by the DTL algorithm as the root.
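A small follow-up sketch (mine, reusing the `entropy` function defined above) that reproduces these two numbers from the subset counts on the slide:

```python
def remainder(subsets):
    """subsets: list of (p_i, n_i) counts produced by splitting on an attribute."""
    total = sum(p + n for p, n in subsets)
    return sum((p + n) / total * entropy(p, n) for p, n in subsets)

# Splitting the 12 restaurant examples (6 positive, 6 negative):
patrons_split = [(0, 2), (4, 0), (2, 4)]              # None, Some, Full
type_split    = [(1, 1), (1, 1), (2, 2), (2, 2)]      # French, Italian, Thai, Burger

print(entropy(6, 6) - remainder(patrons_split))   # ~0.541 bits
print(entropy(6, 6) - remainder(type_split))      # 0.0 bits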

Example: the decision tree learned from the 12 examples (figure omitted) is substantially simpler than the true tree; a more complex hypothesis isn't justified by such a small amount of data.

Performance of a learning algorithm: A learning algorithm is good if it produces a hypothesis that does a good job of predicting the classifications of unseen examples. Verify performance with a test set:
1. Collect a large set of examples.
2. Divide it into two disjoint sets: a training set and a test set.
3. Learn a hypothesis h from the training set.
4. Measure the percentage of test-set examples correctly classified by h.
5. Repeat steps 2-4 for different randomly selected training sets of varying sizes.
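A minimal sketch of steps 2-4 (my own, not from the slides); it assumes the `dtl` and `classify` helpers sketched earlier and a user-supplied `choose_attribute` function.

```python
import random
from collections import Counter

def evaluate(examples, train_fraction, choose_attribute):
    """Split the data, learn a tree on the training part, report test-set accuracy."""
    data = examples[:]
    random.shuffle(data)
    split = int(train_fraction * len(data))
    train, test = data[:split], data[split:]

    majority = Counter(y for _, y in train).most_common(1)[0][0]
    attributes = list(train[0][0].keys())
    tree = dtl(train, attributes, majority, choose_attribute)

    correct = sum(classify(tree, x) == y for x, y in test)
    return correct / len(test)
```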

Learning curves (figure): % correct vs. tree size. Accuracy on the training set keeps rising as the tree grows, while accuracy on the test set eventually drops: overfitting!

Overfitting: The decision tree grows until all training examples are perfectly classified. But what if the data is noisy, or the training set is too small to give a representative sample of the target function? This may lead to overfitting, a common problem with most learning algorithms.

Overfitting. Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances. Overfitting has been found to decrease the accuracy of decision trees by 10-25%.

Avoiding overfitting. Two popular techniques:
1. Prune statistically irrelevant nodes, measuring irrelevance with a χ² test.
2. Stop growing the tree when test-set performance starts decreasing, using cross-validation.
(Figure: % correct vs. tree size for the training and test sets, with the best tree at the test-set peak.)

Cross-validation: Split the data into two parts, one for training and one for testing the accuracy of a hypothesis. K-fold cross-validation means you run k experiments, each time setting aside a different 1/k of the data to test on.
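A short sketch of k-fold cross-validation (mine, not from the slides); `learn` and `accuracy` stand in for any learner and evaluation function.

```python
import random

def k_fold_cross_validation(examples, k, learn, accuracy):
    """Run k experiments, each holding out a different 1/k of the data for testing."""
    data = examples[:]
    random.shuffle(data)
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        hypothesis = learn(train)
        scores.append(accuracy(hypothesis, test))
    return sum(scores) / k    # average test accuracy across the k folds
```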

Next class: midterm (bring a non-programmable calculator). Following class: statistical learning (Russell and Norvig, Chapter 20).