Machine Learning 10-601 B, Fall 2016

Machine Learning 10-601 B, Fall 2016. Decision Trees (Summary). Lecture 2, 08/31/2016. Maria-Florina (Nina) Balcan.

Learning Decision Trees: Supervised Classification. Useful readings: Mitchell, Chapter 3; Bishop, Chapter 14.4. Decision tree (DT) learning: a method for learning discrete-valued target functions in which the function to be learned is represented by a decision tree.

Supervised Classification: Decision Tree Learning. Example: learn the concept PlayTennis (i.e., decide whether our friend will play tennis on a given day). Simple training data set: each row (Day, Outlook, Temperature, Humidity, Wind) is an example, and the PlayTennis column is its label.

Supervised Classification: Decision Tree Learning. Each internal node tests one (discrete-valued) attribute X_i; each branch from a node corresponds to one possible value of X_i; each leaf node predicts a value of Y. Example: a decision tree for f: <Outlook, Temperature, Humidity, Wind> → PlayTennis?. E.g., for x = (Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=High), f(x) = Yes.
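
As a small illustration (not part of the original slides), such a tree can be written as nested attribute tests in Python; the branch structure below follows the Outlook/Humidity/Wind tree that appears with the ID3 slide further down, and the function name and dict representation are choices made here.

```python
# A minimal sketch of the PlayTennis decision tree: instances are dicts
# mapping attribute names to values, and the tree is nested attribute tests.

def play_tennis_tree(x):
    """Hypothesis h: maps an instance x to 'Yes' or 'No'."""
    if x["Outlook"] == "Sunny":
        # Sunny days: the decision depends on Humidity
        return "Yes" if x["Humidity"] == "Normal" else "No"
    elif x["Outlook"] == "Overcast":
        # Overcast days are predicted 'Yes'
        return "Yes"
    else:  # Outlook == "Rain"
        # Rainy days: the decision depends on Wind
        return "Yes" if x["Wind"] == "Weak" else "No"

# The example instance from the slide:
x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "Normal", "Wind": "High"}
print(play_tennis_tree(x))  # -> 'Yes'
```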

Supervised Classification: Problem Setting.
Input: labeled training examples {(x^(i), y^(i))} of an unknown target function f. Examples are described by their values on some set of features or attributes (e.g., 4 attributes: Humidity, Wind, Outlook, Temperature; a single example might be <Humidity=High, Wind=Weak, Outlook=Rain, Temp=Mild>). The set of possible instances X is the instance space; the unknown target function is f : X → Y, where Y is the label space (e.g., Y = {0,1}, 1 if we play tennis on this day, else 0).
Output: a hypothesis h ∈ H that (best) approximates the target function f, where H = {h | h : X → Y} is the set of function hypotheses; here each hypothesis h is a decision tree.
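
A minimal sketch of this setting, using two rows (D1 and D9) from the PlayTennis data that appears with the ID3 slide below; the hypothesis shown is an arbitrary stand-in chosen for illustration, not the learned tree.

```python
# Sketch of the problem setting: labeled examples (x, y), a label space
# Y = {"Yes", "No"}, and a hypothesis h mapping instances to labels.

# Two labeled examples from the PlayTennis data (rows D1 and D9).
training_examples = [
    ({"Outlook": "Sunny", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"}, "No"),
    ({"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"}, "Yes"),
]

def h(x):
    """An arbitrary hypothesis h: X -> Y (here: predict from Humidity alone)."""
    return "Yes" if x["Humidity"] == "Normal" else "No"

# Fraction of training examples this hypothesis gets right.
accuracy = sum(h(x) == y for x, y in training_examples) / len(training_examples)
print(accuracy)  # 1.0 on these two examples
```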

Core Aspects in Decision Tree & Supervised Learning.
How do we automatically find a good hypothesis for the training data? This is an algorithmic question, the main topic of computer science.
When do we generalize and do well on unseen data? Learning theory quantifies the ability to generalize as a function of the amount of training data and the hypothesis space. Occam's razor: use the simplest hypothesis consistent with the data! There are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence, whereas it is highly probable that a sufficiently complex hypothesis will fit the data by chance.

Core Aspects in Decision Tree & Supervised Learning.
How do we automatically find a good hypothesis for the training data? This is an algorithmic question, the main topic of computer science.
When do we generalize and do well on unseen data? Occam's razor: use the simplest hypothesis consistent with the data!
Decision trees: if we can find a small decision tree that explains the data well, then we get good generalization guarantees. However, finding the smallest consistent tree is NP-hard [Hyafil-Rivest '76], so we are unlikely to have a polynomial-time algorithm; instead there are very nice practical heuristics: top-down algorithms, e.g., ID3.

Top-Down Induction of Decision Trees [ID3, C4.5, Quinlan]. ID3: a natural greedy approach to growing a decision tree top-down (from the root to the leaves, by repeatedly replacing an existing leaf with an internal node). Algorithm: pick the best attribute to split at the root based on the training data; recurse on children that are impure (e.g., have both Yes and No examples).
[Figure: Outlook is chosen at the root (over Humidity, Temp, and Wind). Training examples reaching the Sunny branch:
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
Training examples reaching the Rain branch:
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D10 Rain Mild Normal Weak Yes
D14 Rain Mild High Strong No
The Overcast branch is already pure (Yes). The Sunny branch is further split on Humidity (High → No, Normal → Yes) and the Rain branch on Wind (Strong → No, Weak → Yes), giving the final tree.]
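
A minimal Python sketch of this greedy top-down procedure, using the information-gain criterion described on the next slide; the function names and the tuple-based tree representation are choices made here, not part of the slides.

```python
# Sketch of ID3-style top-down tree induction. Examples are (dict, label)
# pairs; a tree is either a label (leaf) or (attribute, {value: subtree}).
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attr):
    """Expected reduction in label entropy from splitting on attr."""
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    """Grow a tree: pick the best attribute, recurse on impure children."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:       # pure node: stop
        return labels[0]
    if not attributes:              # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    children = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        children[value] = id3(subset, [a for a in attributes if a != best])
    return (best, children)

# Usage on a few of the PlayTennis rows shown above (D1, D9, D6, D10):
data = [
    ({"Outlook": "Sunny", "Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak"},   "No"),
    ({"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
    ({"Outlook": "Rain",  "Temperature": "Cool", "Humidity": "Normal", "Wind": "Strong"}, "No"),
    ({"Outlook": "Rain",  "Temperature": "Mild", "Humidity": "Normal", "Wind": "Weak"},   "Yes"),
]
tree = id3(data, ["Outlook", "Temperature", "Humidity", "Wind"])
```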

Top-Down Induction of Decision Trees [ID3, C4.5, Quinlan]. ID3: a natural greedy approach to growing a decision tree top-down. Algorithm: pick the best attribute to split at the root based on the training data; recurse on children that are impure (e.g., have both Yes and No examples). Key question: which attribute is best? ID3 uses a statistical measure called information gain (how well a given attribute separates the training examples according to the target classification). The information gain of A is the expected reduction in the entropy of the target variable Y for the data sample S, due to sorting on variable A.
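
Written out, the standard definitions behind this verbal description (S is the data sample, Y the set of label values, and S_v the subset of S on which attribute A takes value v):

\[
\mathrm{Entropy}(S) \;=\; -\sum_{y \in Y} p_y \log_2 p_y,
\qquad
\mathrm{Gain}(S, A) \;=\; \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v),
\]

where p_y is the fraction of examples in S whose label is y.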

Properties of ID3. ID3 performs a heuristic search through the space of decision trees. It tends to have the right bias (it outputs short decision trees), but it can still overfit; it might be beneficial to prune the tree using a validation dataset.

Properties of ID3. Overfitting could occur because of noisy data and because ID3 is not guaranteed to output a small hypothesis even if one exists. Consider a hypothesis h and its error rate over the training data, error_train(h), and its true error rate over all data, error_true(h). We say h overfits the training data if error_true(h) > error_train(h); the amount of overfitting is error_true(h) − error_train(h).
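
In symbols, under the usual definitions (the training sample S and the underlying data distribution D are left implicit on the slide):

\[
\mathrm{error}_{\mathrm{train}}(h) \;=\; \frac{1}{|S|} \sum_{(x,y) \in S} \mathbf{1}\{h(x) \neq y\},
\qquad
\mathrm{error}_{\mathrm{true}}(h) \;=\; \Pr_{(x,y) \sim D}\bigl[h(x) \neq y\bigr].
\]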

Example: the task of learning which medical patients have a form of diabetes.

Key Issues in Machine Learning.
How can we gauge the accuracy of a hypothesis on unseen data? Occam's razor: use the simplest hypothesis consistent with the data! This will help us avoid overfitting. Learning theory will help us quantify our ability to generalize as a function of the amount of training data and the hypothesis space.
How do we find the best hypothesis? This is an algorithmic question, the main topic of computer science.
How do we choose a hypothesis space? Often we use prior knowledge to guide this choice.
How do we model applications as machine learning problems? (An engineering challenge.)