Supervised Learning: The Setup. Machine Learning Fall 2017


Last lecture. We saw: What is learning? Learning as generalization. The badges game.

This lecture. More badges. Formalizing supervised learning: instance space and features, label space, hypothesis space. Some slides based on lectures from Tom Dietterich and Dan Roth.

The badges game

Let's play.
Name: label
Claire Cardie: -
Peter Bartlett: +
Eric Baum: +
Haym Hirsh: +
Shai Ben-David: +
Michael I. Jordan: -
(Full data on the class website; you can stare at it longer if you want.)

What is the label for Peyton Manning? What about Eli Manning?

How were the labels generated?

How were the labels generated? If the length of the first name is <= 5, then +, else -.
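
Written as code, the rule is tiny. A minimal Python sketch (the function name is ours; the names are the six from the table above):

```python
# Hypothetical sketch of the badges rule: '+' if the first name
# has at most 5 letters, '-' otherwise.
def badge_label(name: str) -> str:
    first_name = name.split()[0]
    return "+" if len(first_name) <= 5 else "-"

for name in ["Claire Cardie", "Peter Bartlett", "Eric Baum",
             "Haym Hirsh", "Shai Ben-David", "Michael I. Jordan"]:
    print(name, badge_label(name))  # matches the labels above
```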

Questions:
1. Are you sure you got the correct function?
2. How did you arrive at it?
3. Learning issues: Is this prediction, or just modeling the data? How did you know that you should look at the letters? (All words have a length: background knowledge.) What learning algorithm did you use?

What is supervised learning?

Instances and Labels. Running example: automatically tag news articles.

An instance: a news article that needs to be classified.

A label: Sports.

The instance (a news article) is mapped by the classifier to one of the labels: Sports, Business, Politics, Entertainment. Instance Space: all possible news articles. Label Space: all possible labels.

X: Instance Space. The set of examples that need to be classified. Eg: the set of all possible names, documents, sentences, images, emails, etc.

Y: Label Space. The set of all possible labels. Eg: {Spam, Not-Spam}, {+, -}, etc.

The target function y = f(x) maps the instance space to the label space.

The goal of learning: find this target function. Learning is search over functions.

Supervised learning. X: Instance Space (the set of examples). Target function y = f(x). Y: Label Space (the set of all possible labels). The learning algorithm only sees examples of the function f in action.

Supervised learning: Training. The learning algorithm sees labeled training data: (x_1, f(x_1)), (x_2, f(x_2)), (x_3, f(x_3)), …, (x_N, f(x_N)).

The labeled training data is the input to a learning algorithm.

The learning algorithm produces a learned function g: X → Y.

Can you think of other training protocols?
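
To make this protocol concrete, here is a minimal Python sketch of the training interface. The memorize-and-guess learner is a hypothetical stand-in, not an algorithm from the lecture; it only illustrates the shape of training: labeled pairs in, a function g out.

```python
# Sketch: a learner consumes pairs (x, f(x)) and returns g: X -> Y.
# This toy learner memorizes the training pairs and falls back to the
# most common training label on unseen inputs.
from collections import Counter

def train(labeled_data):
    memory = dict(labeled_data)
    default = Counter(y for _, y in labeled_data).most_common(1)[0][0]
    def g(x):
        return memory.get(x, default)
    return g

data = [("Claire", "-"), ("Peter", "+"), ("Eric", "+"), ("Haym", "+")]
g = train(data)
print(g("Peter"), g("Peyton"))  # '+' for a seen name, majority '+' for an unseen one
```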

Supervised learning: Evaluation. X: Instance Space. Target function y = f(x). Learned function y = g(x). Y: Label Space.

Draw a test example x ∈ X and compute f(x) and g(x). Are they different?

Apply the model to many test examples and compare its predictions to the target's.

Can you use test examples during training?
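
A minimal sketch of this evaluation protocol; the target f, the learned g, and the test names are hypothetical placeholders:

```python
# Sketch: compare a learned function g against the target f on held-out
# test examples and report the fraction of disagreements (error rate).
def error_rate(f, g, test_examples):
    disagreements = sum(1 for x in test_examples if f(x) != g(x))
    return disagreements / len(test_examples)

f = lambda name: "+" if len(name) <= 5 else "-"      # hypothetical target
g = lambda name: "+" if name[0] in "AEIOU" else "-"  # hypothetical learned guess
print(error_rate(f, g, ["Peter", "Eli", "Irene", "Sebastian"]))  # 0.25
```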

Supervised learning: General setting. Given: training examples of the form (x, f(x)), where f is an unknown function. Typically the input x is represented in a feature space, for example x ∈ {0,1}^n or x ∈ ℝ^n: a deterministic mapping from instances in your problem (eg: news articles) to features. For a training example x, the value of f(x) is called its label. Goal: find a good approximation of f. The label determines the kind of problem we have. Binary classification: f(x) ∈ {-1, 1}. Multiclass classification: f(x) ∈ {1, 2, 3, …, K}. Regression: f(x) ∈ ℝ. Questions?

Nature of applications. There is no human expert (eg: identifying DNA binding sites). Humans can perform a task, but can't describe how they do it (eg: object detection in images). The desired function is hard to obtain in closed form (eg: the stock market).

Binary classification: where the label space consists of two elements. Spam filtering: is an email spam or not? Recommendation systems: given a user's movie preferences, will she like a new movie? Malware detection: is an Android app malicious? Time series prediction: will the future value of a stock increase or decrease with respect to its current value?

On using supervised learning. We should be able to decide:
1. What is our instance space? What are the inputs to the problem? What are the features?
2. What is our label space? What is the prediction task?
3. What is our hypothesis space? What functions should the learning algorithm search over?
4. What is our learning algorithm? How do we learn from the labeled data?
5. What is our loss function or evaluation metric? What is success?

1. The Instance Space X. Recall the picture: X is the instance space (the set of examples, eg: all possible names, documents, sentences, images, emails, etc.); y = f(x) is the target function; Y is the label space (eg: {Spam, Not-Spam}, {+, -}, etc.). The goal of learning is to find this target function; learning is search over functions.

1. The Instance Space X. Designing an appropriate feature representation of the instance space is crucial. Instances x ∈ X are defined by features/attributes. Example, Boolean features: does the email contain the word "free"? Example, real-valued features: what is the height of the person? What was the stock price yesterday?

1. The Instance Space X. Let's brainstorm some features for the badges game.

Instances as feature vectors. A feature function maps an input to the problem (eg: emails, names, images) to a feature vector.

Feature functions, a.k.a. feature extractors: deterministic (for the most part); they convert the examples to a collection of attributes; very often it is easy to think of them as vectors. They are an important part of the design of a learning-based solution.

Instances as feature vectors. Feature functions convert inputs to vectors via a fixed mapping. The instance space X is an N-dimensional vector space (e.g. ℝ^N or {0,1}^N). Each dimension is one feature. Each x ∈ X is a feature vector; each x = [x_1, x_2, …, x_N] is a point in the vector space.

(Figure: examples plotted as points in a two-dimensional feature space with axes x_1 and x_2.)

Feature functions produce feature vectors. When designing feature functions, think of them as templates. Feature: the second letter of the name. Naoki → a → [1 0 0 0 …]; Abe → b → [0 1 0 0 …]; Manning → a → [1 0 0 0 …]; Scrooge → c → [0 0 1 0 …]. Feature: the length of the name. Naoki → 5; Abe → 3.

Question: What is the length of this feature vector?

Answer: 26 (one dimension per letter).

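
Both templates can be written as a short sketch. The 26-dimensional one-hot vector and the length feature follow the slide; the function names and the handling of non-letters are our own assumptions:

```python
# Sketch: two feature templates from the slide, applied to a name.
# Template 1: one-hot encoding of the second letter (26 dims, one per letter).
# Template 2: the length of the name (one numeric feature).
import string

def second_letter_onehot(name: str) -> list:
    vec = [0] * 26
    idx = string.ascii_lowercase.find(name[1].lower())
    if idx >= 0:  # leave the vector all-zero for non-letter characters
        vec[idx] = 1
    return vec

def features(name: str) -> list:
    return second_letter_onehot(name) + [len(name)]

print(features("Naoki"))  # 1 in the 'a' slot, then length 5
print(features("Abe"))    # 1 in the 'b' slot, then length 3
```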

Good features are essential. Good features decide how well a task can be learned. Eg: a bad feature function for the badges game: is there a day of the week that begins with the last letter of the first name? (Sketched below.) Much effort goes into designing features, or maybe learning them. We will touch upon general principles for designing good features, but feature definition is largely domain specific and comes with experience.
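
For concreteness, here is one hypothetical rendering of that bad feature; the set of day-of-week initials is the only domain fact it uses:

```python
# Sketch: the deliberately bad badge feature. Days of the week begin with
# {m, t, w, f, s}, so this Boolean feature ignores name length entirely
# and carries almost no signal for the true rule.
def bad_feature(first_name: str) -> int:
    day_initials = {"m", "t", "w", "f", "s"}  # Monday..Sunday initials
    return int(first_name[-1].lower() in day_initials)

print(bad_feature("Claire"))  # ends in 'e': 0
print(bad_feature("Haym"))    # ends in 'm': 1
```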

On using supervised learning.
✓ What is our instance space? What are the inputs to the problem? What are the features?
2. What is our label space? What is the learning task?
3. What is our hypothesis space? What functions should the learning algorithm search over?
4. What is our learning algorithm? How do we learn from the labeled data?
5. What is our loss function or evaluation metric? What is success?

2. The Label Space Y. Recall the picture: X is the instance space (the set of examples, eg: all possible names, sentences, images, emails, etc.); y = f(x) is the target function; Y is the label space (the set of all possible labels, eg: {Spam, Not-Spam}, {+, -}, etc.).


2. The Label Space Y. Classification: the outputs are categorical. Binary classification: two possible labels; we will see a lot of this. Multiclass classification: K possible labels; we may see a bit of this. Structured classification: graph-valued outputs; a different class. Classification is the primary focus of this class.

2. The Label Space Y. The output space can also be numerical. Regression: Y is the set (or a subset) of real numbers. Ranking: labels are ordinal, that is, there is an ordering over the labels. Eg: a Yelp 5-star review is only slightly different from a 4-star review, but very different from a 1-star review.
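
A small sketch (not from the lecture) of why ordinal structure matters for evaluation: absolute error respects the ordering over star ratings, while 0-1 loss treats all mistakes as equally bad:

```python
# Sketch: two ways of scoring a predicted star rating against the truth.
def zero_one_loss(y_true: int, y_pred: int) -> int:
    return int(y_true != y_pred)

def absolute_loss(y_true: int, y_pred: int) -> int:
    return abs(y_true - y_pred)

print(zero_one_loss(5, 4), zero_one_loss(5, 1))  # 1 1: same penalty
print(absolute_loss(5, 4), absolute_loss(5, 1))  # 1 4: a 4-star miss is milder
```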

On using supervised learning.
✓ What is our instance space? What are the inputs to the problem? What are the features?
✓ What is our label space? What is the learning task?
3. What is our hypothesis space? What functions should the learning algorithm search over?
4. What is our learning algorithm? How do we learn from the labeled data?
5. What is our loss function or evaluation metric? What is success?

3. The Hypothesis Space. Recall the picture: the target function y = f(x) maps the instance space X to the label space Y. The goal of learning is to find this target function; learning is search over functions.

The hypothesis space is the set of functions we consider for this search.

Example of search over functions. An unknown function y = f(x_1, x_2) has been observed on four inputs:
x_1 x_2 y
0 0 0
0 1 0
1 0 0
1 1 1
Can you learn this function? What is it?
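
Learning as search can be made literal here. A minimal sketch that enumerates all 16 Boolean functions of two inputs and keeps the ones consistent with the table (encoding a function by its four output bits is our choice):

```python
# Sketch: brute-force search over all Boolean functions of two inputs.
# A function is encoded by its output bits on the inputs 00, 01, 10, 11.
from itertools import product

observed = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
order = [(0, 0), (0, 1), (1, 0), (1, 1)]

consistent = [outs for outs in product([0, 1], repeat=4)
              if all(outs[order.index(x)] == y for x, y in observed.items())]
print(consistent)  # [(0, 0, 0, 1)]: only AND survives
```

With all four rows labeled, a single function remains; remove a row from `observed` and several functions survive, which is exactly the problem the next slide raises.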

The fundamental problem: machine learning is ill-posed! An unknown function y = f(x_1, x_2, x_3, x_4) of four Boolean inputs. (The original slide shows a table of seven labeled training examples, not reproduced in this transcript.) Can you learn this function? What is it?

Is learning possible at all? There are 2^16 = 65536 possible Boolean functions over 4 inputs. Why? There are 16 possible inputs, and each way to fill in these 16 output slots is a different function, giving 2^16 functions. We have seen only 7 outputs. How could we possibly know the rest without seeing every label? Think of an adversary filling in the labels every time you make a guess at the function.


How could we possibly learn anything?
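
A back-of-the-envelope sketch of the adversary's room to maneuver: with 7 of the 16 outputs seen, every way of filling the remaining 9 slots is a distinct consistent function:

```python
# Sketch: count the Boolean functions on 4 inputs that remain consistent
# after observing the outputs on 7 of the 16 possible inputs.
total_inputs = 2 ** 4              # 16 rows in the truth table
seen = 7                           # labels observed in training
print(2 ** total_inputs)           # 65536 functions in total
print(2 ** (total_inputs - seen))  # 512 still consistent with the data
```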

Solution: restrict the search space. A hypothesis space is the set of possible functions we consider. We were looking at the space of all Boolean functions. Instead, choose a hypothesis space that is smaller than the space of all functions: only simple conjunctions (with four variables, there are only 16 conjunctions without negations); simple disjunctions; m-of-n rules (fix a set of n variables; at least m of them must be true); linear functions; …

Example hypothesis space 1: Simple conjunctions. There are only 16 simple conjunctive rules of the form g(x) = x_i ∧ x_j ∧ x_k.


The 16 rules, with a counter-example from the training data for each:
Rule: counter-example
Always False: 1001
x_1: 1100
x_2: 0100
x_3: 0110
x_4: 0101
x_1 ∧ x_2: 1100
x_1 ∧ x_3: 0011
x_1 ∧ x_4: 0011
x_2 ∧ x_3: 0011
x_2 ∧ x_4: 0011
x_3 ∧ x_4: 1001
x_1 ∧ x_2 ∧ x_3: 0011
x_1 ∧ x_2 ∧ x_4: 0011
x_1 ∧ x_3 ∧ x_4: 0011
x_2 ∧ x_3 ∧ x_4: 0011
x_1 ∧ x_2 ∧ x_3 ∧ x_4: 0011

Exercise: how many simple conjunctions are possible when there are n inputs instead of 4?

Is there a consistent hypothesis in this space?


No simple conjunction explains the data! Our hypothesis space is too small.
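
This conclusion can be checked mechanically. The sketch below tests all 16 conjunctions against labeled examples reconstructed from the counter-example column above (0011 and 1001 positive; 1100, 0100, 0110, 0101 negative); this is a partial, hypothetical stand-in for the lecture's full seven-example table:

```python
# Sketch: test every simple (negation-free) conjunction over 4 variables.
# The data below is reconstructed from the counter-example column; it is
# a hypothetical subset of the lecture's training table.
from itertools import combinations

data = {"0011": 1, "1001": 1, "1100": 0, "0100": 0, "0110": 0, "0101": 0}

def conjunction(indices):
    # The empty conjunction plays the role of "Always False" here.
    return lambda x: int(bool(indices) and all(x[i] == "1" for i in indices))

rules = [c for r in range(5) for c in combinations(range(4), r)]  # 16 rules
consistent = [r for r in rules
              if all(conjunction(r)(x) == y for x, y in data.items())]
print(consistent)  # []: no simple conjunction fits the data
```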

Solution: restrict the search space, revisited. How do we pick a hypothesis space? Using some prior knowledge (or by guessing). What if the hypothesis space is so small that nothing in it agrees with the data? We need a hypothesis space that is flexible enough.

Example hypothesis space 2: m-of-n rules. Pick a subset of n variables; the output is 1 if at least m of them are 1. Example: if at least 2 of {x_1, x_3, x_4} are 1, then the output is 1; otherwise, the output is 0. Is there a consistent hypothesis in this space? Try to check if there is one. First, how many m-of-n rules are there for four variables? (See the sketch below.)
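
A sketch answering both questions under the same reconstructed data as before (again a hypothetical stand-in for the full table): it counts the m-of-n rules over four variables and searches for consistent ones.

```python
# Sketch: enumerate m-of-n rules over 4 Boolean variables and test each
# against the partially reconstructed training data.
from itertools import combinations

data = {"0011": 1, "1001": 1, "1100": 0, "0100": 0, "0110": 0, "0101": 0}

def m_of_n(m, subset):
    return lambda x: int(sum(x[i] == "1" for i in subset) >= m)

rules = [(m, s) for n in range(1, 5) for s in combinations(range(4), n)
         for m in range(1, n + 1)]
print(len(rules))  # 32 m-of-n rules over four variables
consistent = [(m, s) for m, s in rules
              if all(m_of_n(m, s)(x) == y for x, y in data.items())]
print(consistent)  # includes (2, (0, 2, 3)): at least 2 of {x_1, x_3, x_4}
```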

Restricting the hypothesis space. Our guess of the hypothesis space may be incorrect. General strategy: pick an expressive hypothesis space for expressing concepts (concept = the target classifier that is hidden from us; sometimes we may even call it the oracle). Example hypothesis spaces: m-of-n functions, decision trees, linear functions, grammars, multi-layer deep networks, etc. Develop algorithms that find an element of the hypothesis space that fits the data well (or well enough), and hope that it generalizes.

Views of learning. Learning is the removal of remaining uncertainty: if we knew that the unknown function is a simple conjunction, we could use the training data to figure out which one it is. This requires guessing a good, small hypothesis class, and we could be wrong: we could find a consistent hypothesis and still be incorrect on a new example!

On using supervised learning.
✓ What is our instance space? What are the inputs to the problem? What are the features?
✓ What is our label space? What is the learning task?
✓ What is our hypothesis space? What functions should the learning algorithm search over?
4. What is our learning algorithm? How do we learn from the labeled data?
5. What is our loss function or evaluation metric? What is success?