Machine Learning 10-601. Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University. January 12, 2015


Today: What is machine learning? Decision tree learning. Course logistics.

Readings: "The Discipline of ML"; Mitchell, Chapter 3; Bishop, Chapter 14.4.

Machine Learning: the study of algorithms that improve their performance P at some task T with experience E. A well-defined learning task is a triple <P, T, E>.

Learning to predict emergency C-sections [Sims et al., 2000]: 9,714 patient records, each with 215 features.

Learning to classify text documents: spam vs. not spam.

Learning to detect objects in images (Prof. H. Schneiderman), with example training images for each orientation.

Learning to classify the word a person is thinking about, based on fMRI brain activity.

Learning prosthetic control from a neural implant [R. Kass, L. Castellanos, A. Schwartz].

Machine Learning in Practice: speech recognition, mining databases, text analysis, control learning, object recognition, using methods such as support vector machines, Bayesian networks, hidden Markov models, deep neural networks, reinforcement learning, ...

Machine Learning in Theory. PAC learning theory (supervised concept learning) relates the number of training examples (m), the error rate (ε), the representational complexity of the hypothesis space (H), and the failure probability (δ). Other theories cover reinforcement skill learning, semi-supervised learning, and active student querying, also relating the number of mistakes made during learning, the learner's query strategy, convergence rate, asymptotic performance, and bias and variance.

Machine Learning in Computer Science. Machine learning is already the preferred approach to speech recognition, natural language processing, computer vision, medical outcomes analysis, and robot control. [Diagram: ML applications as a niche within all software applications.] This ML niche is growing. Why?
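For supervised concept learning over a finite hypothesis space, the standard PAC sample-complexity bound tying these four quantities together (stated here as background from Mitchell's PAC-learning chapter, not from the slides) is

    m ≥ (1/ε) (ln |H| + ln (1/δ))

meaning that with probability at least 1 - δ, every hypothesis in H that is consistent with m such training examples has true error at most ε.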

Why is the niche growing? Improved machine learning algorithms, increased volume of online data, and increased demand for self-customizing software. Tom's prediction: ML will be the fastest-growing part of CS this century.

[Diagram: machine learning draws on and contributes to many fields, including computer science, statistics, economics and organizational behavior, evolution, animal learning (cognitive science, psychology, neuroscience), and adaptive control theory.]

What You'll Learn in This Course

The primary machine learning algorithms: logistic regression, Bayesian methods, HMMs, SVMs, reinforcement learning, decision tree learning, boosting, unsupervised clustering, ...

How to use them on real data: text, images, structured data, and your own project.

The underlying statistical and computational theory: enough to read and understand ML research papers.

Course logistics.

Machine Learning 10-601. Website: www.cs.cmu.edu/~ninamf/courses/601sp15. Faculty: Maria Balcan and Tom Mitchell. TAs: Travis Dick, Kirstin Early, Ahmed Hefny, Micol Marchetti-Bowick, Willie Neiswanger, Abu Saparov. Course assistant: Sharon Cavlovich. See the webpage for office hours, syllabus details, recitation sessions, grading policy, honesty policy, late-homework policy, Piazza pointers, ...

Highlights of course logistics:
On the wait list? Hang in there for the first few weeks.
Homework 1 is available now, due Friday.
Grading: 30% homeworks (about 5-6), 20% course project, 25% first midterm (March 2), 25% final midterm (April 29).
Academic integrity: cheating → fail the class, be expelled from CMU.
Late homework: full credit when due, half credit for the next 48 hours, zero credit after that; we'll delete your lowest homework score, and you must turn in at least n-1 of the n homeworks, even if late.
Being present at exams: you must be there, so plan now. Two in-class exams, no other final.

Maria-Florina Balcan (Nina): foundations for modern machine learning, e.g., interactive, distributed, and life-long learning; theoretical computer science, especially connections between machine learning theory and other fields: approximation algorithms, control theory, game theory, mechanism design, discrete optimization, and matroid theory.

Travis Dick: When can we learn many concepts from mostly unlabeled data by exploiting relationships between concepts? Currently: geometric relationships.

Kirstin Early: analyzing and predicting energy consumption, to reduce costs and usage and to help people make informed decisions; predicting energy costs from features of the home and occupant behavior; energy disaggregation, i.e., decomposing the total electric signal into individual appliances.

Ahmed Hefny: How can we learn to track and predict the state of a dynamical system only from noisy observations? Can we exploit supervised learning methods to devise a flexible, local-minima-free approach? [Figure: observations of an oscillating pendulum and the extracted 2D state trajectory.]

Micol Marchetti-Bowick: How can we use machine learning for biological and medical research? Using genotype data to build personalized models that can predict clinical outcomes; integrating data from multiple sources to perform cancer subtype analysis; structured sparse regression models for genome-wide association studies. [Figure: gene expression data with dendrogram; per-task regression models x → y, with sample weights given by genetic relatedness.]

Willie Neiswanger: If we want to apply machine learning algorithms to BIG datasets, how can we develop parallel, low-communication machine learning algorithms, such as embarrassingly parallel algorithms, where machines work independently, without communication?

Abu Saparov: How can knowledge about the world help computers understand natural language? What kinds of machine learning tools are needed to understand sentences? Example: for "Carolyn ate the cake with a fork," the frame person_eats_food has consumer = Carolyn, food = cake, instrument = fork; for "Carolyn ate the cake with vanilla," it has consumer = Carolyn, food = cake, topping = vanilla.

Tom Mitchell: How can we build never-ending learners? Case study: the never-ending language learner (NELL) runs 24x7 to learn to read the web; see http://rtw.ml.cmu.edu. [Plots: number of beliefs vs. time and reading accuracy (mean average precision over the top 1000 beliefs) vs. time, over 5 years.]

Function Approximation and Decision Tree Learning

Function approximation problem setting:
Set of possible instances X.
Unknown target function f : X → Y.
Set of function hypotheses H = { h | h : X → Y }.
Input: training examples {<x(i), y(i)>} of the unknown target function f, where the superscript (i) indexes the i-th training example.
Output: a hypothesis h ∈ H that best approximates the target function f.

Simple Training Data Set (the rows below are the canonical PlayTennis examples from Mitchell's textbook, Table 3.2):

Day  Outlook   Temperature  Humidity  Wind    PlayTennis?
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

A decision tree for f : <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node tests one discrete-valued attribute Xi.
Each branch from a node selects one value for Xi.
Each leaf node predicts Y (or P(Y | x reaching that leaf)).

Decision Tree Learning

Problem setting:
Set of possible instances X, where each instance x in X is a feature vector x = <x1, x2, ..., xn>, e.g., <Humidity = low, Wind = weak, Outlook = rain, Temp = hot>.
Unknown target function f : X → Y, where Y is discrete-valued (e.g., Y = 1 if we play tennis on this day, else 0).
Set of function hypotheses H = { h | h : X → Y }, where each hypothesis h is a decision tree: the tree sorts x to a leaf, and the leaf assigns y.
Input: training examples {<x(i), y(i)>} of the unknown target function f.
Output: a hypothesis h ∈ H that best approximates the target function f.
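To make "the tree sorts x to a leaf" concrete, here is a minimal sketch (an illustration added for these notes, not part of the lecture) that represents a decision tree as nested Python dicts and classifies one instance; the tree shown is the classic PlayTennis tree.

# An internal node is a dict mapping an attribute name to
# {attribute_value: subtree}; a leaf is a plain label string.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, x):
    """Sort instance x (a dict of attribute -> value) down to a leaf."""
    while isinstance(node, dict):             # still at an internal node
        attribute = next(iter(node))          # the attribute this node tests
        node = node[attribute][x[attribute]]  # follow the branch for x's value
    return node                               # leaf: the predicted label

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # -> No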

Decision Trees
Suppose X = <X1, ..., Xn>, where the Xi are boolean-valued variables.
How would you represent Y = X2 ∧ X5? Y = X2 ∨ X5?
How would you represent Y = X2 X5 ∨ X3 X4 (¬X1)?
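In the nested-dict sketch above, the first two functions look like this (a sketch, assuming the standard reading of Y = X2 ∧ X5 and Y = X2 ∨ X5):

# Y = X2 AND X5: test X2 first; X5 only matters when X2 = 1.
tree_and = {"X2": {0: 0, 1: {"X5": {0: 0, 1: 1}}}}

# Y = X2 OR X5: test X2 first; X5 only matters when X2 = 0.
tree_or = {"X2": {1: 1, 0: {"X5": {0: 0, 1: 1}}}}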

Top-down induction of decision trees [ID3, C4.5; Quinlan]:
node = Root.
Main loop:
1. A ← the "best" decision attribute for the next node.
2. Assign A as the decision attribute for node.
3. For each value of A, create a new descendant of node.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, stop; else iterate over the new leaf nodes.

Sample entropy: for a sample S containing positive and negative examples, H(S) = -p⁺ log2 p⁺ - p⁻ log2 p⁻, where p⁺ and p⁻ are the fractions of positive and negative examples in S.

Entropy

Entropy H(X) of a random variable X with n possible values:
H(X) = -Σ_{i=1..n} P(X = i) log2 P(X = i)

H(X) is the expected number of bits needed to encode a randomly drawn value of X (under the most efficient code). Why? Information theory: the most efficient possible code assigns -log2 P(X = i) bits to encode the message X = i, so the expected number of bits to code one random X is Σ_i P(X = i) (-log2 P(X = i)), which is exactly H(X).

Specific conditional entropy of X given Y = v:
H(X | Y = v) = -Σ_i P(X = i | Y = v) log2 P(X = i | Y = v)

Conditional entropy of X given Y, with V the set of values Y can take:
H(X | Y) = Σ_{v in V} P(Y = v) H(X | Y = v)

Mutual information (aka information gain) of X and Y:
I(X, Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)
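As a minimal sketch in code (an illustration for these notes; the function names are my own), the same quantities estimated from a list of examples, each a dict of attribute -> value:

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the empirical distribution of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target="PlayTennis"):
    """Gain(S, A) = H(Y) - H(Y | A), estimated from the sample."""
    labels = [e[target] for e in examples]
    h_conditional = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        h_conditional += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - h_conditional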

Information gain is the mutual information between the input attribute A and the target variable Y; equivalently, it is the expected reduction in entropy of the target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = H_S(Y) - H_S(Y | A)

(Recall the simple PlayTennis training data set above: Day, Outlook, Temperature, Humidity, Wind, PlayTennis?)
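Running the sketch above on the 14 PlayTennis rows reproduces the gains reported in Mitchell's textbook, which is why ID3 puts Outlook at the root (assuming data holds the table rows as dicts):

# data = [{"Day": "D1", "Outlook": "Sunny", "Temperature": "Hot",
#          "Humidity": "High", "Wind": "Weak", "PlayTennis": "No"}, ...]
for a in ("Outlook", "Humidity", "Wind", "Temperature"):
    print(a, round(information_gain(data, a), 3))
# Outlook 0.246, Humidity 0.151, Wind 0.048, Temperature 0.029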


Final decision tree for f : <Outlook, Temperature, Humidity, Wind> → PlayTennis?
Each internal node tests one discrete-valued attribute Xi; each branch from a node selects one value for Xi; each leaf node predicts Y.

Which Tree Should We Output?
ID3 performs a heuristic search through the space of decision trees, and it stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data.
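Pulling the pieces together, a compact recursive ID3 sketch (again an illustration, reusing entropy and information_gain from above; it assumes discrete attributes, each used at most once per path):

from collections import Counter

def id3(examples, attributes, target="PlayTennis"):
    """Greedy top-down tree growing, returning the nested-dict trees used earlier."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:            # perfectly classified: make a leaf
        return labels[0]
    if not attributes:                   # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    branches = {}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        branches[value] = id3(subset, remaining, target)
    return {best: branches}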

Why Prefer Short Hypotheses? (Occam's Razor)

Argument in favor: there are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence; by contrast, it is highly probable that some sufficiently complex hypothesis will fit the data by chance.

Argument opposed: there are also fewer hypotheses with a prime number of nodes and attributes beginning with "Z", so what's so special about short hypotheses?

Overfitting

Consider a hypothesis h, its error rate over the training data, error_train(h), and its true error rate over all data, error_true(h).

We say h overfits the training data if error_true(h) > error_train(h).

Amount of overfitting = error_true(h) - error_train(h). For example, a tree with error_train(h) = 0.05 but error_true(h) = 0.20 overfits by 0.15.
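The gap is easy to see empirically with the sketches above (illustrative; assumes data, id3, and classify as defined earlier):

def accuracy(tree, examples, target="PlayTennis"):
    return sum(classify(tree, e) == e[target] for e in examples) / len(examples)

train, valid = data[:10], data[10:]
h = id3(train, ["Outlook", "Temperature", "Humidity", "Wind"])
print(accuracy(h, train))   # 1.0 on this consistent training split
print(accuracy(h, valid))   # typically lower on held-out data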


Reduced-error pruning: split the data into a training set and a validation set, and create a tree that classifies the training set correctly. Then, while further pruning helps, evaluate the impact on validation-set accuracy of pruning each possible node, and greedily remove the node whose removal most improves validation-set accuracy.
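A bottom-up variant of reduced-error pruning is sketched below (my own simplification of the greedy procedure above, reusing classify and accuracy; it assumes every validation value has a branch in the tree): each subtree is replaced by a majority-vote leaf whenever that does not hurt accuracy on the validation examples reaching it.

from collections import Counter

def prune(subtree, validation, target="PlayTennis"):
    """Replace subtrees with majority-vote leaves when validation accuracy allows."""
    if not isinstance(subtree, dict) or not validation:
        return subtree
    attribute = next(iter(subtree))
    pruned = {attribute: {}}
    for value, child in subtree[attribute].items():
        reaching = [e for e in validation if e[attribute] == value]
        pruned[attribute][value] = prune(child, reaching, target)
    majority = Counter(e[target] for e in validation).most_common(1)[0][0]
    leaf_accuracy = sum(e[target] == majority for e in validation) / len(validation)
    return majority if leaf_accuracy >= accuracy(pruned, validation, target) else pruned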


You should know:

Well-posed function approximation problems: instance space X; a sample of labeled training data {<x(i), y(i)>}; hypothesis space H = { f : X → Y }.

Learning is a search/optimization problem over H, with various objective functions: minimize training error (0-1 loss); among hypotheses that minimize training error, select the smallest (?).

Decision tree learning: greedy top-down learning of decision trees (ID3, C4.5, ...); overfitting and tree/rule post-pruning; extensions.

Questions to think about (1): ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (2): Consider a target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?

Questions to think about (3): Why use information gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?

Questions to think about (4): What is the relationship between learning decision trees and learning IF-THEN rules?