Machine Learning, Decision Trees, Overfitting
Reading: Mitchell, Chapter 3
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 14, 2008

Machine Learning 10-601
Instructors: William Cohen, Tom Mitchell
TAs: Andrew Arnold, Mary McGlohon
Course assistant: Sharon Cavlovich
See the webpage for office hours, grading policy, final exam date, late homework policy, and syllabus details.
Webpage: www.cs.cmu.edu/~tom/10601

Machine Learning: the study of algorithms that improve their performance P at some task T with experience E. A well-defined learning task is a triple <P, T, E>.

Learning to predict emergency C-sections [Sims et al., 2000]: 9,714 patient records, each with 215 features.

Learning to detect objects in images (Prof. H. Schneiderman): example training images for each orientation.

Learning to classify text documents: company home page vs. personal home page vs. university home page vs. ...

Reading a noun (vs. verb) [Rustandi et al., 2005]

Machine Learning - Practice
Applications: speech recognition, mining databases, text analysis, control learning, object recognition, ...
Techniques: supervised learning, Bayesian networks, hidden Markov models, unsupervised clustering, reinforcement learning, ...

Machine Learning - Theory
PAC learning theory (supervised concept learning) relates:
- number of examples (m)
- error rate (ε)
- representational complexity (H)
- failure probability (δ)
Other theories cover reinforcement skill learning, semi-supervised learning, active student querying, ...
These also relate: the number of mistakes made during learning, the learner's query strategy, convergence rate, asymptotic performance, bias, and variance.
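For reference, the standard sample-complexity bound tying m, ε, H, and δ together for a consistent learner over a finite hypothesis space; the slide's own equation did not survive transcription, so this is the textbook bound from Mitchell, Ch. 7:

```latex
% With probability at least 1 - \delta, a hypothesis consistent with the
% training data has true error at most \epsilon, provided
m \;\ge\; \frac{1}{\epsilon}\left(\ln\lvert H\rvert + \ln\frac{1}{\delta}\right)
```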

Growth of Machine Learning
Machine learning is already the preferred approach to:
- speech recognition, natural language processing
- computer vision
- medical outcomes analysis
- robot control
This niche (ML applications within all software applications) is growing, driven by:
- improved machine learning algorithms
- increased data capture and networking
- software too complex to write by hand
- new sensors / IO devices
- demand for self-customization to user and environment

Function Approximation and Decision Tree Learning

Function approximation
Setting:
- Set of possible instances X
- Unknown target function f: X → Y
- Set of function hypotheses H = { h | h: X → Y }
Given: training examples {<x_i, y_i>} of the unknown target function f
Determine: the hypothesis h ∈ H that best approximates f
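To make "learning as a search/optimization problem over H" concrete, here is a toy sketch that enumerates a small finite hypothesis space and returns the hypothesis with the lowest training error; the hypothesis space and data are invented for illustration:

```python
# Toy illustration of learning as search over a finite hypothesis space H.
# (Hypotheses and data invented for illustration only.)

# Training examples {<x_i, y_i>}: x is real-valued, y is a boolean label.
train = [(0.5, False), (1.5, False), (2.5, True), (3.5, True)]

# A tiny hypothesis space H of threshold functions h(x) = (x > t).
H = [lambda x, t=t: x > t for t in (0.0, 1.0, 2.0, 3.0)]

def training_error(h, data):
    """0-1 loss: the fraction of training examples that h misclassifies."""
    return sum(h(x) != y for x, y in data) / len(data)

# Learning = search over H: pick the hypothesis minimizing training error.
best_h = min(H, key=lambda h: training_error(h, train))
print(training_error(best_h, train))  # 0.0, achieved by the threshold t = 2.0
```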

Decision tree representation
How would you represent A ∧ B ∨ C ∧ D(¬E)?
- Each internal node: tests one attribute X_i
- Each branch from a node: selects one value for X_i
- Each leaf node: predicts Y (or P(Y | X ∈ leaf))
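One minimal way to encode such a tree in code is sketched below; the nested-dict node layout is my own choice, not the lecture's. The tree encodes (A ∧ B) ∨ (C ∧ D ∧ ¬E):

```python
# Sketch: a decision tree as nested dicts. Each internal node tests one
# boolean attribute; leaves carry the prediction. (Layout is illustrative.)

def leaf(y):
    return {"leaf": y}

def node(attr, if_true, if_false):
    return {"attr": attr, True: if_true, False: if_false}

# Subtree for C and D and (not E)
cde = node("C",
           node("D", node("E", leaf(False), leaf(True)), leaf(False)),
           leaf(False))

# Tree for (A and B) or (C and D and (not E))
tree = node("A", node("B", leaf(True), cde), cde)

def predict(t, x):
    """Walk from the root: test one attribute per internal node, follow the
    branch matching its value, and return the leaf's prediction."""
    while "leaf" not in t:
        t = t[x[t["attr"]]]
    return t["leaf"]

print(predict(tree, {"A": True,  "B": True,  "C": False, "D": False, "E": False}))  # True
print(predict(tree, {"A": False, "B": False, "C": True,  "D": True,  "E": False}))  # True
print(predict(tree, {"A": False, "B": False, "C": True,  "D": True,  "E": True}))   # False
```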

Top-down induction of decision trees [ID3, C4.5, ...]:
node = Root
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to the leaf nodes
5. If training examples are perfectly classified, stop; else iterate over the new leaf nodes
(A runnable sketch of this loop appears after the information-gain slide below.)

Entropy
Entropy H(X) of a random variable X:
$H(X) = -\sum_{i=1}^{n} P(X=i)\log_2 P(X=i)$, where n is the number of possible values for X.
H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code).
Why? Information theory: the most efficient code assigns $-\log_2 P(X=i)$ bits to encode the message X = i. So the expected number of bits to code one random X is $\sum_i P(X=i)\,(-\log_2 P(X=i))$.
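A direct translation of this definition into code (a sketch; the names are my own):

```python
import math

def entropy(probs):
    """H(X) = -sum_i P(X=i) * log2 P(X=i): the expected number of bits
    needed to encode a randomly drawn value of X under an optimal code."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin flip
print(entropy([0.9, 0.1]))  # ~0.47 bits: a biased coin is more predictable
```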

Entropy (continued)
Entropy of X: $H(X) = -\sum_i P(X=i)\log_2 P(X=i)$
Specific conditional entropy of X given Y = v: $H(X \mid Y=v) = -\sum_i P(X=i \mid Y=v)\log_2 P(X=i \mid Y=v)$
Conditional entropy of X given Y: $H(X \mid Y) = \sum_v P(Y=v)\,H(X \mid Y=v)$
Mutual information (aka information gain) of X and Y: $I(X,Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
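The same quantities computed from empirical counts over paired observations (a sketch; the helper names are my own):

```python
from collections import Counter
import math

def entropy_of(values):
    """Empirical entropy H(X) from a list of observed values of X."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(X | Y) = sum_v P(Y=v) * H(X | Y=v), from paired observations."""
    n = len(ys)
    return sum((ys.count(v) / n) *
               entropy_of([x for x, y in zip(xs, ys) if y == v])
               for v in set(ys))

def mutual_information(xs, ys):
    """I(X, Y) = H(X) - H(X | Y), aka information gain."""
    return entropy_of(xs) - conditional_entropy(xs, ys)

# X fully determined by Y: I(X, Y) = H(X) = 1 bit.
print(mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"]))  # 1.0
```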

Sample entropy
S is a sample of training examples; $p_+$ is the proportion of positive examples in S and $p_-$ the proportion of negative examples. Entropy measures the impurity of S: $H(S) = -p_+\log_2 p_+ - p_-\log_2 p_-$.

Information gain: Gain(S, A) is the mutual information between attribute A and the target class variable over sample S:
$Gain(S,A) = H(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$
where $S_v$ is the subset of S for which A = v.
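Putting the pieces together, below is the compact ID3-style learner promised after the main-loop slide above: grow the tree greedily by always splitting on the attribute with the highest Gain(S, A). The representation (examples as dicts, class label under the key "y") is my own sketch, not the lecture's:

```python
from collections import Counter
import math

def sample_entropy(examples):
    """H(S): entropy of the class label over the sample."""
    n = len(examples)
    counts = Counter(ex["y"] for ex in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain(examples, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(examples)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        s_v = [ex for ex in examples if ex[attr] == v]
        remainder += (len(s_v) / n) * sample_entropy(s_v)
    return sample_entropy(examples) - remainder

def id3(examples, attrs):
    """Greedy top-down induction: split on the highest-gain attribute,
    recurse on each branch, stop on pure nodes or exhausted attributes."""
    labels = [ex["y"] for ex in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return {"leaf": majority}
    best = max(attrs, key=lambda a: gain(examples, a))
    branches = {v: id3([ex for ex in examples if ex[best] == v],
                       [a for a in attrs if a != best])
                for v in set(ex[best] for ex in examples)}
    return {"attr": best, "branches": branches, "default": majority}
```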

Decision Tree Learning Applet: http://www.cs.ualberta.ca/%7eaixplore/learning/decisiontrees/applet/decisiontreeapplet.html

Which tree should we output? ID3 performs a heuristic search through the space of decision trees and stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data.

Why prefer short hypotheses? (Occam's razor)
Argument in favor: there are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence, whereas it is highly probable that some sufficiently complex hypothesis will fit the data by chance.
Argument opposed: there are also fewer hypotheses with a prime number of nodes and attributes beginning with "Z". What's so special about short hypotheses?

Avoiding overfitting (reduced-error pruning, Mitchell Ch. 3): split the data into a training set and a validation set, and create a tree that classifies the training set correctly. Then, until further pruning is harmful:
1. Evaluate the impact on the validation set of pruning each possible node.
2. Greedily remove the node whose removal most improves validation-set accuracy.
(A sketch of this procedure follows.)
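A sketch of reduced-error pruning over the dict-based trees produced by the ID3 sketch above (again my own representation; the greedy loop follows the procedure just described):

```python
def classify(t, ex):
    """Classify one example; unseen attribute values fall back to the
    majority label stored at each internal node."""
    while "leaf" not in t:
        t = t["branches"].get(ex[t["attr"]], {"leaf": t["default"]})
    return t["leaf"]

def accuracy(tree, examples):
    return sum(classify(tree, ex) == ex["y"] for ex in examples) / len(examples)

def internal_nodes(t, found=None):
    """Collect every internal node (dicts carrying an 'attr' key)."""
    found = [] if found is None else found
    if "leaf" not in t:
        found.append(t)
        for child in t["branches"].values():
            internal_nodes(child, found)
    return found

def prune(tree, validation):
    """Reduced-error pruning: while it helps, greedily collapse the internal
    node whose replacement by a majority-label leaf most improves
    validation-set accuracy."""
    while True:
        best_node, best_acc = None, accuracy(tree, validation)
        for n in internal_nodes(tree):
            saved = dict(n)                          # remember the subtree
            n.clear(); n["leaf"] = saved["default"]  # tentatively collapse
            acc = accuracy(tree, validation)
            n.clear(); n.update(saved)               # revert
            if acc > best_acc:
                best_node, best_acc = n, acc
        if best_node is None:
            return tree
        default = best_node["default"]
        best_node.clear()
        best_node["leaf"] = default                  # commit the best pruning
```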

What you should know:
Well-posed function approximation problems:
- Instance space, X
- Sample of labeled training data { <x_i, y_i> }
- Hypothesis space, H = { h | h: X → Y }
Learning is a search/optimization problem over H, with various objective functions:
- minimize training error (0-1 loss)
- among hypotheses that minimize training error, select the shortest
Decision tree learning:
- Greedy top-down learning of decision trees (ID3, C4.5, ...)
- Overfitting and tree/rule post-pruning
- Extensions

Questions to think about (1): Why use information gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?

Questions to think about (2): ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (3): Consider the target function f: <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?