Introduction to Machine Learning, 2nd Edition

Lecture Slides for INTRODUCTION TO MACHINE LEARNING, 2nd Edition, by ETHEM ALPAYDIN (The MIT Press, 2010), modified by Leonardo Bobadilla, with some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/. Contact: alpaydin@boun.edu.tr, http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

CHAPTER 2: Supervised Learning

Outline. Last class (Ch 2, Supervised Learning, Sec 2.1-2.4): learning a class from examples, VC dimension, PAC learning, noise. This class: learning multiple classes, regression, model selection and generalization, dimensions of a supervised learning algorithm.

Multiple Classes. In the general case there are K classes, e.g., family, sports, and luxury cars. Classes can overlap. We can use a different hypothesis class for each class, or the same one for all. What if an instance falls into two classes, or into none? Sometimes it is worth rejecting (refusing to classify) such instances.

Multiple classes, $C_i$, $i = 1, \dots, K$. The training set is
$$\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}, \qquad r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$$
Train K hypotheses $h_i(x)$, $i = 1, \dots, K$, such that
$$h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \neq i \end{cases}$$
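A minimal sketch of this one-hypothesis-per-class (one-vs-rest) setup. The slides do not prescribe a particular base learner, so scikit-learn's LogisticRegression is assumed here purely for illustration; the data and names are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_rest(X, y, K):
    """Train one binary hypothesis h_i per class: r_i = 1 if x in C_i, else 0."""
    hypotheses = []
    for i in range(K):
        r_i = (y == i).astype(int)           # relabel: class i vs. the rest
        hypotheses.append(LogisticRegression().fit(X, r_i))
    return hypotheses

def predict(hypotheses, X):
    """Pick the class whose hypothesis is most confident; one could also
    reject when no h_i (or more than one) fires, as the slide suggests."""
    scores = np.column_stack([h.predict_proba(X)[:, 1] for h in hypotheses])
    return scores.argmax(axis=1)

# toy usage: 3 classes (e.g., family / sports / luxury cars), 2 features
X = np.random.randn(90, 2)
y = np.repeat([0, 1, 2], 30)
hs = train_one_vs_rest(X, y, K=3)
print(predict(hs, X[:5]))
```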

Regression. The output is not a Boolean (yes/no) label but a numeric value. We are given a training set of examples (x^t, r^t). Interpolation: fit a function (e.g., a polynomial) through the training points. Extrapolation: predict the output for an x outside the range of the training data. In regression there is added noise, r = f(x) + ε, where the noise is assumed to account for hidden variables we cannot observe. We approximate the output by a model g(x).

Examples of interpolation and extrapolation (figures from http://en.wikipedia.org).

Regression. Empirical error on the training set: $E(g \mid \mathcal{X}) = \frac{1}{N}\sum_{t=1}^{N}\left[r^t - g(x^t)\right]^2$. If the hypothesis space is the set of linear functions, $g(x) = w_1 x + w_0$, we can calculate the best parameters that minimize this error by taking partial derivatives with respect to $w_1$ and $w_0$ and setting them to zero.
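As a concrete illustration, a minimal NumPy sketch of fitting this linear hypothesis by the closed-form least-squares solution; the data and variable names are ours, not from the slides:

```python
import numpy as np

# toy data: r = 2x + 1 plus noise standing in for hidden variables
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
r = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# g(x) = w1*x + w0; setting the partial derivatives of the squared error
# to zero gives the normal equations, solved here with lstsq
A = np.column_stack([x, np.ones_like(x)])
(w1, w0), *_ = np.linalg.lstsq(A, r, rcond=None)

empirical_error = np.mean((r - (w1 * x + w0)) ** 2)
print(f"w1={w1:.2f}, w0={w0:.2f}, training error={empirical_error:.3f}")
```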

Example: a line fit to the training data (figure).

Example: a more complex model fit to the same data (figure).

Higher-order polynomials fit to the data (figure).

Model Selection & Generalization. Consider learning Boolean functions. If there are d binary inputs, there are at most 2^d distinct examples. Each example can be labeled 0 or 1, therefore there are 2^(2^d) possible Boolean functions of d variables.
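For example, with d = 2 binary inputs the counting works out as:

$$2^{d} = 2^{2} = 4 \text{ possible examples}, \qquad 2^{2^{d}} = 2^{4} = 16 \text{ possible Boolean functions}.$$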

Model Selection & Generalization. Each training example removes half of the remaining hypotheses (those that disagree with its label). Learning can thus be viewed as removing the hypotheses that are inconsistent with the data. But we would need to see all 2^d examples to identify the target function uniquely.

Model Selection & Generalization. Learning is an ill-posed problem: the data alone are not sufficient to find a unique solution. Each sample removes the hypotheses that are inconsistent with it. Hence the need for an inductive bias, i.e., assumptions about the hypothesis class H (e.g., rectangles in our example). But each hypothesis class can only learn certain functions.

Model Selection & Generalization. Learning needs an inductive bias. Model selection: how do we choose the right bias? Each sample removes hypotheses inconsistent with it, but we want the model to generalize: to predict new data well, not merely to fit the training set. Generalization: how well a model performs on new data.

Model Selection & Generalization. The best generalization requires matching the complexity of the hypothesis class H to the complexity of the function underlying the data. Overfitting: H is more complex than C or f, e.g., fitting two rectangles to data sampled from one rectangle, or fitting a sixth-order polynomial to noisy data from a third-order polynomial. Underfitting: H is less complex than C or f, e.g., fitting a line to data sampled from a third-order polynomial.

Triple Trade-Off. There is a trade-off between three factors (Dietterich, 2003): 1. the complexity of H, c(H); 2. the training set size, N; 3. the generalization error, E, on new data. As N increases, E decreases. As c(H) increases, E first decreases and then increases. Why? Because when H is too simple we underfit, and when it is too complex we start fitting the noise in the training set.

Cross-Validation. To estimate the generalization error, we need data unseen during training. We split the data into a training set (50%) to train the models, a validation set (25%) to select a model (e.g., the degree of the polynomial), and a test (publication) set (25%) to estimate the error of the chosen model and report its performance. Use resampling (e.g., cross-validation) when there is little data.
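A minimal sketch of this protocol, assuming a one-dimensional polynomial regression problem and NumPy only; the split proportions follow the slide, everything else (data, names, degree range) is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
r = np.sin(3 * x) + rng.normal(scale=0.2, size=200)   # noisy target

# 50% train / 25% validation / 25% test split
idx = rng.permutation(200)
tr, va, te = idx[:100], idx[100:150], idx[150:]

def mse(deg, fit_idx, eval_idx):
    """Fit a degree-`deg` polynomial on fit_idx, return squared error on eval_idx."""
    coeffs = np.polyfit(x[fit_idx], r[fit_idx], deg)
    pred = np.polyval(coeffs, x[eval_idx])
    return np.mean((r[eval_idx] - pred) ** 2)

# model selection: pick the degree with the lowest validation error
best_deg = min(range(1, 10), key=lambda d: mse(d, tr, va))

# report the generalization estimate on the untouched test set
print("chosen degree:", best_deg, "test error:", round(mse(best_deg, tr, te), 4))
```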

Dimensions of a Supervised Learner. Let us now recapitulate and generalize. We have a sample $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$. The sample is independent and identically distributed (i.i.d.), drawn from the same joint distribution. The label $r^t$ is 0/1 for classification, a K-dimensional binary vector for multiclass classification, and a real value in regression. Goal: build a good and useful approximation to $r^t$ using the model $g(x^t \mid \theta)$.

Dimensions of a Supervised Learner. We must make three decisions. 1. Model, $g(x \mid \theta)$: $x$ is the input and $\theta$ are the parameters. The model defines the hypothesis class H, and a particular value of $\theta$ instantiates one hypothesis $h \in H$. E.g., in classification the rectangle is the model and its four coordinates are the parameters; in regression the model is a linear function of the input, and the slope and intercept are the parameters.

Dimensions of a Supervised Learner. 2. Loss function, $L(\cdot)$: measures the difference between the desired output and the approximation produced with the current parameters,
$$E(\theta \mid \mathcal{X}) = \sum_{t} L\!\left(r^t, g(x^t \mid \theta)\right).$$
In classification the loss is typically 0/1; in regression it is a numerical difference such as the squared error.
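As a small illustration, a sketch of these two loss functions and the total empirical error E(θ|X) they induce; plain Python, and the function names are ours:

```python
def zero_one_loss(r, g):
    """0/1 loss for classification: 1 if the prediction is wrong, else 0."""
    return 0.0 if r == g else 1.0

def squared_loss(r, g):
    """Squared error for regression."""
    return (r - g) ** 2

def empirical_error(loss, labels, predictions):
    """E(theta | X) = sum over the sample of L(r^t, g(x^t | theta))."""
    return sum(loss(r, g) for r, g in zip(labels, predictions))

# usage: total error of a classifier and a regressor on toy outputs
print(empirical_error(zero_one_loss, [1, 0, 1], [1, 1, 1]))   # -> 1.0
print(empirical_error(squared_loss, [1.0, 2.0], [0.5, 2.5]))  # -> 0.5
```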

Dimensions of a Supervised Learner. 3. Optimization procedure: find $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$, the value of the parameters that minimizes the total error. It can be found analytically, as in linear regression, or through more complex optimization methods for more complicated models.
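When no analytical solution is available, a generic choice is gradient descent. A minimal sketch for the squared-error, linear-regression objective used earlier; the step size and iteration count are arbitrary illustrative values:

```python
import numpy as np

def gradient_descent(x, r, lr=0.1, steps=2000):
    """Minimize E(w | X) for g(x) = w1*x + w0 under squared error."""
    w1, w0 = 0.0, 0.0
    for _ in range(steps):
        err = r - (w1 * x + w0)
        # partial derivatives of the mean squared error w.r.t. w1 and w0
        grad_w1 = -2.0 * np.mean(err * x)
        grad_w0 = -2.0 * np.mean(err)
        w1 -= lr * grad_w1
        w0 -= lr * grad_w0
    return w1, w0

# toy usage on noiseless data: the estimates should approach (3.0, -1.0)
x = np.linspace(0.0, 1.0, 50)
r = 3.0 * x - 1.0
print(gradient_descent(x, r))
```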

Dimensions of a Supervised Learner. For learning to succeed, the following conditions should be satisfied: 1) the hypothesis class of g(·) must be large enough, i.e., contain a good approximation of the underlying function; 2) there must be enough training data to pinpoint the best (or a good enough) hypothesis in that class; 3) we need a good optimization procedure to find that hypothesis. Different machine learning algorithms differ in their model, loss function, or optimization procedure.