Machine Learning
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 11, 2011


Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 11, 2011

Today:
- What is machine learning?
- Decision tree learning
- Course logistics

Readings:
- "The Discipline of ML"
- Mitchell, Chapter 3
- Bishop, Chapter 14.4

Machine Learning: the study of algorithms that improve their performance P at some task T with experience E. A well-defined learning task is a triple <P, T, E>.

Learning to predict emergency C-sections [Sims et al., 2000]: 9714 patient records, each with 215 features.

Learning to detect objects in images (Prof. H. Schneiderman): example training images for each orientation.

Learning to classify text documents: company home page vs. personal home page vs. university home page vs. ...

Learning to classify brain activity: reading a noun (vs. a verb) [Rustandi et al., 2005].

Machine Learning - Practice

Applications: speech recognition, mining databases, text analysis, control learning, object recognition, ...
Methods: supervised learning, Bayesian networks, hidden Markov models, unsupervised clustering, reinforcement learning, ...

Machine Learning - Theory

PAC learning theory (supervised concept learning) relates the number of examples m, the error rate ε, the representational complexity of the hypothesis space H, and the failure probability δ. Other theories address reinforcement skill learning, semi-supervised learning, and active student querying, relating quantities such as the number of mistakes made during learning, the learner's query strategy, the convergence rate, asymptotic performance, and bias and variance.
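For concreteness, a minimal sketch of how these PAC quantities trade off, assuming the standard sample-complexity bound for a consistent learner over a finite hypothesis space H (an assumption of this write-up, not a formula from the slides):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln|H| + ln(1/delta)) examples suffice for a
    consistent learner over a finite H to reach true error <= epsilon
    with probability >= 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# e.g. all boolean functions of 6 boolean attributes: |H| = 2**(2**6)
print(pac_sample_bound(h_size=2**(2**6), epsilon=0.1, delta=0.05))  # 474
```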

Machine learning draws on and contributes to many fields: computer science, statistics, economics and organizational behavior, evolution, animal learning (cognitive science, psychology, neuroscience), and adaptive control theory.

Machine Learning in Computer Science

Machine learning is already the preferred approach to speech recognition, natural language processing, computer vision, medical outcomes analysis, and robot control. ML applications are a niche within all software applications, and this niche is growing. Why?

This ML niche is growing because of:
- improved machine learning algorithms
- increased data capture, networking, and new sensors
- software too complex to write by hand
- demand for self-customization to the user and environment

Function Approximation and Decision Tree Learning

Function Approximation

Problem setting:
- a set of possible instances X
- an unknown target function f : X → Y
- a set of function hypotheses H = { h | h : X → Y }

Input: training examples {<x(i), y(i)>} of the unknown target function f (the superscript (i) indexes the i-th training example)
Output: a hypothesis h ∈ H that best approximates the target function f

A decision tree for f : <Outlook, Humidity, Wind, Temp> → PlayTennis?
- each internal node tests one attribute Xi
- each branch from a node selects one value for Xi
- each leaf node predicts Y (or P(Y | x reaching that leaf))

Decision Tree Learning

Problem setting:
- a set of possible instances X, where each instance x in X is a feature vector x = <x1, x2, ..., xn>, e.g., <Humidity=low, Wind=weak, Outlook=rain, Temp=hot>
- an unknown target function f : X → Y, where Y is discrete-valued
- a set of function hypotheses H = { h | h : X → Y }, where each hypothesis h is a decision tree: the tree sorts x to a leaf, which assigns y

Input: training examples {<x(i), y(i)>} of the unknown target function f
Output: a hypothesis h ∈ H that best approximates the target function f

Decision Trees

Suppose X = <X1, ..., Xn>, where the Xi are boolean variables. How would you represent Y = X2 ∧ X5? Y = X2 ∨ X5? How would you represent (X2 ∧ X5) ∨ (X3 ∧ X4 ∧ ¬X1)? (One encoding of the first case is sketched below.)
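Any such boolean function can be written as a tree that tests one variable per internal node. A minimal sketch to make the representation concrete (an illustration, not code from the lecture; the nested-dict encoding and the predict helper are assumptions of the sketch):

```python
def predict(tree, x):
    """Sort instance x (a dict of attribute -> value) down to a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[x[attribute]]
    return tree  # a leaf holds the predicted label

# Y = X2 AND X5: test X2 first; X5 only matters on the X2 = 1 branch.
tree_and = {"X2": {0: 0, 1: {"X5": {0: 0, 1: 1}}}}

print(predict(tree_and, {"X2": 1, "X5": 1}))  # -> 1
print(predict(tree_and, {"X2": 1, "X5": 0}))  # -> 0
print(predict(tree_and, {"X2": 0, "X5": 1}))  # -> 0 (X5 never tested)
```

For Y = X2 ∨ X5, the tree returns 1 immediately on the X2 = 1 branch and tests X5 otherwise; the longer formula simply chains more tests.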

Top-Down Induction of Decision Trees [ID3, C4.5, Quinlan]

node = Root
Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, stop; otherwise iterate over the new leaf nodes

Entropy

Entropy H(X) of a random variable X with n possible values:

H(X) = -\sum_{i=1}^{n} P(X=i) \log_2 P(X=i)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code). Why? Information theory: the most efficient code assigns -\log_2 P(X=i) bits to encode the message X=i, so the expected number of bits to code one random X is exactly H(X).
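As a quick numeric check of the formula, a minimal sketch (an assumption of this write-up, not lecture code) that computes H(X) from a distribution:

```python
import math

def entropy(probabilities):
    """H(X) = -sum_i P(X=i) * log2 P(X=i), taking 0 * log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 bit: a fair coin needs one full bit
print(entropy([0.9, 0.1]))  # ~0.47 bits: a biased coin is cheaper to encode
print(entropy([1.0]))       # 0.0 bits: a certain outcome needs no bits
```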

Sample Entropy

For a sample S of training examples, let p_+ be the proportion of positive examples in S and p_- the proportion of negative examples; then

H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

Entropy

Entropy H(X) of a random variable X:
H(X) = -\sum_i P(X=i) \log_2 P(X=i)

Specific conditional entropy H(X|Y=v) of X given Y=v:
H(X|Y=v) = -\sum_i P(X=i|Y=v) \log_2 P(X=i|Y=v)

Conditional entropy H(X|Y) of X given Y:
H(X|Y) = \sum_v P(Y=v) H(X|Y=v)

Mutual information (aka Information Gain) of X and Y:
I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Information Gain is the mutual information between input attribute A and target variable Y: the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:

Gain(S, A) = H_S(Y) - H_S(Y|A)
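Putting the last two slides together, a minimal sketch (an assumption, with toy data; not lecture code) that computes Gain(S, A) as the empirical label entropy minus the label entropy after splitting on A:

```python
import math
from collections import Counter

def label_entropy(labels):
    """Empirical entropy H_S(Y) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = H_S(Y) - sum_v P(A=v) * H_S(Y | A=v)."""
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += (len(subset) / n) * label_entropy(subset)
    return label_entropy(labels) - remainder

examples = [{"Wind": "weak"}, {"Wind": "weak"},
            {"Wind": "strong"}, {"Wind": "strong"}]
labels = ["yes", "yes", "yes", "no"]
print(information_gain(examples, labels, "Wind"))  # ~0.311 bits
```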


Decision Tree Learning Applet:
http://www.cs.ualberta.ca/%7eaixplore/learning/DecisionTrees/Applet/DecisionTreeApplet.html

Which Tree Should We Output?

ID3 performs a heuristic search through the space of decision trees and stops at the smallest acceptable tree (a sketch of the search appears below). Why stop there? Occam's razor: prefer the simplest hypothesis that fits the data.
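A compact sketch of that greedy top-down search (an assumption of this write-up: it repeats the gain helpers above so the block is self-contained, and it handles only discrete attributes, with no pruning):

```python
import math
from collections import Counter

def label_entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    n = len(labels)
    remainder = 0.0
    for value in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == value]
        remainder += (len(subset) / n) * label_entropy(subset)
    return label_entropy(labels) - remainder

def id3(examples, labels, attributes):
    """Greedily pick the highest-gain attribute, split, and recurse."""
    if len(set(labels)) == 1:          # pure node: predict its label
        return labels[0]
    if not attributes:                 # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    branches = {}
    for value in set(x[best] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        branches[value] = id3([examples[i] for i in idx],
                              [labels[i] for i in idx],
                              [a for a in attributes if a != best])
    return {best: branches}

examples = [{"Outlook": "sunny", "Wind": "weak"},
            {"Outlook": "sunny", "Wind": "strong"},
            {"Outlook": "rain",  "Wind": "weak"},
            {"Outlook": "rain",  "Wind": "strong"}]
labels = ["yes", "yes", "yes", "no"]
print(id3(examples, labels, ["Outlook", "Wind"]))
# e.g. {'Outlook': {'sunny': 'yes', 'rain': {'Wind': {'weak': 'yes', 'strong': 'no'}}}}
```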

Why Prefer Short Hypotheses? (Occam's Razor)

Argument in favor: there are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence, whereas it is highly probable that a sufficiently complex hypothesis will fit the data by chance.

Argument opposed: there are also fewer hypotheses with a prime number of nodes and attributes beginning with "Z", so what's so special about short hypotheses?


Avoiding Overfitting: Reduced-Error Pruning

Split the data into a training set and a validation set, and create a tree that classifies the training set correctly; then prune the tree back as long as pruning does not hurt accuracy on the validation set (a sketch follows below).
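A minimal sketch of reduced-error pruning over the dict-based trees used above (an assumption of this write-up, not code from the lecture; it also assumes every attribute value seen during validation appears among that node's branches). Pruning a node only changes predictions for the validation examples that reach it, so comparing accuracy on just those examples is equivalent to comparing whole-tree validation accuracy:

```python
from collections import Counter

def predict(tree, x):
    """Sort instance x down to a leaf."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[x[attribute]]
    return tree

def prune(tree, train_x, train_y, val_x, val_y):
    """Bottom-up: replace a subtree with the majority training label reaching
    it whenever validation accuracy does not drop."""
    if not isinstance(tree, dict) or not train_y:
        return tree
    attribute, branches = next(iter(tree.items()))
    for value in list(branches):
        t = [i for i, x in enumerate(train_x) if x[attribute] == value]
        v = [i for i, x in enumerate(val_x) if x[attribute] == value]
        branches[value] = prune(branches[value],
                                [train_x[i] for i in t], [train_y[i] for i in t],
                                [val_x[i] for i in v], [val_y[i] for i in v])
    majority = Counter(train_y).most_common(1)[0][0]
    keep = sum(predict(tree, x) == y for x, y in zip(val_x, val_y))
    cut = sum(majority == y for y in val_y)
    return majority if cut >= keep else tree

# Toy usage: the noisy "hot -> no" branch (and then the whole tree) collapses
# to the majority leaf "yes", because the leaf does at least as well on the
# validation set as the full tree.
tree = {"Wind": {"weak": "yes",
                 "strong": {"Temp": {"hot": "no", "cool": "yes"}}}}
train_x = [{"Wind": "weak", "Temp": "cool"}, {"Wind": "strong", "Temp": "hot"},
           {"Wind": "strong", "Temp": "cool"}, {"Wind": "strong", "Temp": "cool"}]
train_y = ["yes", "no", "yes", "yes"]
val_x = [{"Wind": "weak", "Temp": "hot"}, {"Wind": "strong", "Temp": "hot"},
         {"Wind": "strong", "Temp": "cool"}]
val_y = ["yes", "yes", "yes"]
print(prune(tree, train_x, train_y, val_x, val_y))  # -> 'yes'
```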


What You Should Know

Well-posed function approximation problems:
- instance space X
- sample of labeled training data {<x(i), y(i)>}
- hypothesis space H = { f : X → Y }

Learning is a search/optimization problem over H, with various objective functions:
- minimize training error (0-1 loss)
- among hypotheses that minimize training error, select the smallest (?)

Decision tree learning:
- greedy top-down learning of decision trees (ID3, C4.5, ...)
- overfitting and tree/rule post-pruning
- extensions