Machine Learning - Introduction


Machine Learning - Introduction CSE 4309 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

What is Machine Learning? Quote by Tom M. Mitchell: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." To define a machine learning problem, we need to specify: The experience (usually known as training data). The task (classification, regression, ...). The performance measure (classification accuracy, squared error, ...).

Types of Machine Learning (source: Wikipedia) Supervised Learning. The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to outputs. Unsupervised Learning. No example outputs are given to the learning algorithm, leaving it on its own to find structure in its input. Reinforcement Learning. A computer program interacts with a dynamic environment and must achieve a certain goal (such as driving a car or playing chess). The program is provided feedback in the form of rewards and punishments.

Supervised Learning The computer is presented with example inputs and their desired outputs, given by a "teacher". Goal: learn a general function that maps inputs to outputs.

Supervised Learning Example: recognizing the digits of zip codes. The training set consists of images of digits and the names of those digits. [Figure: example input images of digits with their desired outputs (class labels such as "one", "two", ..., "zero").]

Supervised Learning Example: face recognition. The training set consists of images of faces and the IDs of those faces. [Figure: example input images of faces with their desired outputs (class labels such as "Person 534", "Person 789").]

Regression, Classification, Pattern Recognition When the desired output belongs to one of a finite number of categories, then the supervised learning problem is called a classification problem. When the desired output contains one or more values from a continuous space, then the supervised learning problem is called a regression problem.

Unsupervised Learning No example outputs are given to the learning algorithm, leaving it on its own to find structure in its input. Example: figure out how many different types of digits appear in this set: [Figure: a set of unlabeled digit images.]

Unsupervised Learning No example outputs are given to the learning algorithm, leaving it on its own to find structure in its input. Example: figure out how many different people appear in this set of face photos: [Figure: a set of unlabeled face photos.]

Applications of Unsupervised Learning Clustering. E.g., categorize living organisms into hierarchical groups. Source: https://en.wikipedia.org/wiki/phylogenetic_tree
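The slides illustrate clustering only by example. As a concrete (hypothetical) illustration, here is a minimal sketch of one of the simplest clustering algorithms, k-means (Lloyd's algorithm), restricted to one-dimensional points; the function name and the 1-D restriction are my own choices, not taken from the slides.

```python
import random

def kmeans(points, k, iters=20, rng=random):
    """Lloyd's algorithm for 1-D points: alternate nearest-center
    assignment and recomputing each center as the mean of its cluster."""
    centers = rng.sample(points, k)            # initialize centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Empty clusters keep their old center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)
```

On well-separated data the centers converge to the cluster means after a few iterations, regardless of the random initialization.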

Applications of Unsupervised Learning Anomaly detection. Figure out if someone at an airport is behaving abnormally, which may be a sign of danger. Figure out if an engine is behaving abnormally, which may be a sign of malfunction/damage. This can also be treated as a supervised learning problem, if someone provides training examples that are labeled as "anomalies". If it is treated as an unsupervised learning problem, then an anomaly model must be built without such training examples.

Reinforcement Learning Learn what actions to take so as to maximize reward. Correct pairs of input/output are not presented to the system. The system needs to explore different actions in different situations, to see what rewards it gets. However, the system also needs to exploit its knowledge so as to maximize rewards. Problem: what is the optimal balance between exploration and exploitation?
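One standard way to balance exploration and exploitation (not specified on the slide, but widely used) is the epsilon-greedy strategy. A minimal sketch, assuming a simple multi-armed bandit setting with one running reward estimate per action; the function names are my own:

```python
import random

def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimated reward)."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda i: estimates[i])

def update(estimates, counts, action, reward):
    """Running-average update of the reward estimate for the chosen action."""
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]
```

With epsilon = 0 the agent never explores; with epsilon = 1 it never exploits. Tuning epsilon is exactly the exploration/exploitation trade-off the slide describes.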

Applications of Reinforcement Learning A robot learning how to move a robotic arm, or how to walk on two legs. A car learning how to drive itself. A computer program learning how to play a board game, like chess, tic-tac-toe, etc.

Machine Learning and Pattern Recognition Machine learning and pattern recognition are not the same thing. This is a point that confuses many people. You can use machine learning to learn things that are not classifiers. For example: Learn how to walk on two feet. Learn how to grasp a medical tool. You can construct classifiers without machine learning. You can hardcode a bunch of rules that the classifier applies to each pattern in order to estimate its class. However, machine learning and pattern recognition are heavily related. A big part of machine learning research focuses on pattern recognition. Modern pattern recognition systems are, in most cases, based exclusively on machine learning.

Topics for This Semester Main emphasis: supervised learning. We will study several different approaches: Bayesian classifiers. Neural networks. Kernel methods and support vector machines. Nearest neighbors. Boosting. Decision trees. Graphical models. Towards the end, we will briefly study unsupervised learning and reinforcement learning.

A Simple Learning Task This is a toy regression example. (Source: S. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach".) Here, the input is a single real number. The output is also a real number. So, our target function F_true is a function from the reals to the reals. Usually patterns are much more complex; in this example it is easy to visualize training examples and learned functions.

A Simple Learning Task Each training example is denoted as (x_n, t_n), where: x_n is the example input. t_n is the desired output (also called the target output). Each example (x_n, t_n) is marked on the figure: x_n corresponds to the x-axis, and t_n corresponds to the y-axis. Based on the figure, what do you think F_true looks like?

A Simple Learning Task Different people may give different answers as to what F_true may look like. That shows the challenge in supervised learning: we can find some plausible functions, but: How do we know which one of them is correct? Given many choices for the function, how can we evaluate each choice?

A Simple Learning Task Here is one possible function F. Can anyone guess how it was obtained?

A Simple Learning Task Here is one possible function F. Can anyone guess how it was obtained? It was obtained by fitting a line to the training data.
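Fitting a line to training data has a well-known closed-form least-squares solution. A minimal sketch for a single input variable; the function name is my own:

```python
def fit_line(xs, ts):
    """Least-squares fit of t ≈ w0 + w1 * x (closed form, one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    # Slope: covariance of (x, t) divided by variance of x.
    w1 = (sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_t - w1 * mean_x
    return w0, w1
```

For example, fit_line([0, 1, 2], [1, 3, 5]) recovers the exact line t = 1 + 2x, since those points are collinear.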

A Simple Learning Task Here we see another possible function F, shown in green. It looks like a quadratic function (second-degree polynomial). It fits all the data perfectly, except for one point.

A Simple Learning Task Here we see a third possible function F, shown in blue. It looks like a cubic (third-degree) polynomial. It fits all the data perfectly.

A Simple Learning Task Here we see a fourth possible function F, shown in orange. It zig-zags a lot. It fits all the data perfectly.

The Model Selection Problem Overall, we can come up with an infinite number of possible functions here. The question is, how do we choose which one is best? Or, an easier version: how do we choose a good one? This is called the model selection problem: out of an infinite number of possible models for our data, we must choose one.

The Model Selection Problem An easier version of the model selection problem: given a model (i.e., a function modeling our data), how can we measure how good this model is? What are your thoughts on this?

A Simple Learning Task One naïve solution is to evaluate functions based on training error. For any function F, its training error can be measured as the sum of squared errors over the training patterns: E(F) = Σ_n (t_n - F(x_n))^2. What are the pitfalls of choosing the best function based on training error?
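The sum-of-squared-errors formula above translates directly into code. A one-line sketch; the name training_error is my own:

```python
def training_error(F, examples):
    """Sum of squared errors of function F over (x_n, t_n) training pairs:
    E(F) = sum_n (t_n - F(x_n))**2."""
    return sum((t - F(x)) ** 2 for x, t in examples)
```

For instance, F(x) = 2x on the pairs (1, 2) and (2, 5) has training error (2 - 2)^2 + (5 - 4)^2 = 1, while any function that passes through every training point has training error exactly zero.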

A Simple Learning Task What are the pitfalls of choosing the best function based on training error? The zig-zagging orange function comes out as "perfect": its training error is zero. As a human, which would you find more reasonable: the orange function, or the blue function (cubic polynomial)? They both have zero training error.

A Simple Learning Task What are the pitfalls of choosing the best function based on training error? The zig-zagging orange function comes out as "perfect": its training error is zero. As a human, which would you find more reasonable: the orange function, or the blue function (cubic polynomial)? They both have zero training error. However, the zig-zagging function looks pretty arbitrary.

A Simple Learning Task Ockham's razor: given two equally good explanations, choose the simpler one. This is an old philosophical principle (Ockham lived in the 14th century). Based on that, we prefer the cubic polynomial over the crazy zig-zagging function: it is simpler, and they both have zero training error.

A Simple Learning Task However, real life is more complicated. What if none of the functions have zero training error? How do we weigh simplicity versus training error?

A Simple Learning Task However, real life is more complicated. What if none of the functions have zero training error? How do we weigh simplicity versus training error? There is no standard or straightforward solution to this. There exist many machine learning algorithms. Each corresponds to a different approach for resolving the trade-off between simplicity and training error.

Another Example The data here was generated as follows: Given x_n: t_n = sin(2πx_n) + noise. The noise was randomly sampled from a Gaussian distribution. The green curve shows f(x) = sin(2πx), without noise. The blue circles show the actual training examples, which are not exactly on the curve because of the added noise.
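The data-generation process described above can be sketched as follows; the helper name and the choice of evenly spaced inputs are my own assumptions for illustration:

```python
import math
import random

def make_data(n, noise_std=0.3, rng=random):
    """n training pairs: x evenly spaced in [0, 1],
    t = sin(2*pi*x) plus zero-mean Gaussian noise."""
    xs = [i / (n - 1) for i in range(n)]
    ts = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise_std) for x in xs]
    return xs, ts
```

Because of the noise term, the returned t values scatter around the green sin(2πx) curve instead of lying exactly on it.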

Polynomial Fitting Given the training data, if we know that the generating function is sin(2πx), or sin(cx) for some unknown c, the learning task is trivial. However, we typically do not know the underlying function. One common approach, which we also saw in the previous example, is to try to model the function as a polynomial. We estimate the parameters of the polynomial based on the training data.

Polynomial Fitting Here are estimated polynomials of degrees 0, 1, 3, 9.
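Such fits can be reproduced, for example, with NumPy's polyfit. A sketch, assuming synthetic data generated as on the previous slide; the specific seed, noise level, and number of points are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

# Fit polynomials of degrees 0, 1, 3, 9 and compare their training errors.
fits = {d: np.polyfit(x, t, d) for d in (0, 1, 3, 9)}
train_err = {d: float(np.sum((np.polyval(w, x) - t) ** 2))
             for d, w in fits.items()}
```

Since the models are nested, training error can only decrease as the degree grows; with 10 points, the degree-9 polynomial interpolates the data and its training error is essentially zero. That is exactly the overfitting behavior discussed next.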

Polynomial Fitting Notice the overfitting problem with the 9th-degree polynomial.

Overfitting Overfitting is a huge problem in machine learning. Overfitting means that the learned function fits the training data very well (or perfectly), but works very poorly on test data. Sometimes, when our models have too many parameters (like a 9th-degree polynomial), those parameters get tuned to match the noise in the data.

More Training, Less Overfitting Increasing the amount of training data (from 10 examples to 15, and then to 100) reduces overfitting.

Regularization These are the parameters for estimated polynomials of degrees 1, 3, and 9.

Degree   w0     w1       w2        w3        w4          w5         w6           w7          w8          w9
1        0.82   -1.27
3        0.31   7.99     -25.43    17.37
9        0.35   232.37   -5321.83  48568.31  -231639.30  640042.26  -1061800.52  1042400.18  -557682.99  125201.43

Regularization These are the parameters for some estimated polynomials.

        Degree 1   Degree 3   Degree 9
w0      0.82       0.31       0.35
w1      -1.27      7.99       232.37
w2                 -25.43     -5321.83
w3                 17.37      48568.31
w4                            -231639.30
w5                            640042.26
w6                            -1061800.52
w7                            1042400.18
w8                            -557682.99
w9                            125201.43

Regularization Overfitting leads to very large magnitudes of parameters, as the degree-9 column of the table above shows.

Regularization If we are confident that large magnitudes of polynomial parameters are due to overfitting, we can penalize them in the error function: E(F) = Σ_n (t_n - F(x_n))^2 + λ ||w||^2. The first term is the sum-of-squares error that we saw before. The second term is what is called a regularization term. ||w||^2 is the sum of squares of the parameters w_i. λ is a parameter that you have to specify; it controls how much you penalize large ||w||^2 values.
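This regularized least-squares ("ridge") fit has a closed-form solution, w = (Φ^T Φ + λI)^(-1) Φ^T t, where Φ is the matrix of polynomial features. A sketch; the synthetic data, seed, and function name are my own choices:

```python
import numpy as np

def ridge_polyfit(x, t, degree, lam):
    """Minimize sum_n (t_n - F(x_n))**2 + lam * ||w||**2 in closed form,
    where F is a polynomial with coefficient vector w (w0 first)."""
    Phi = np.vander(x, degree + 1, increasing=True)   # features 1, x, ..., x^degree
    A = Phi.T @ Phi + lam * np.eye(degree + 1)
    return np.linalg.solve(A, Phi.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.shape)

w_unreg = ridge_polyfit(x, t, 9, 0.0)        # lam = 0: plain least squares
w_reg = ridge_polyfit(x, t, 9, np.exp(-18))  # small lam shrinks the coefficients
```

Even a tiny λ dramatically reduces the coefficient magnitudes of the degree-9 fit, mirroring the table on the next slide.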

Regularization

        λ = 0          λ = e^-18
w0      0.35           0.35
w1      232.37         4.74
w2      -5321.83       -0.77
w3      48568.31       -31.97
w4      -231639.30     -3.89
w5      640042.26      55.28
w6      -1061800.52    41.32
w7      1042400.18     -45.95
w8      -557682.99     -91.53
w9      125201.43      72.68

A small λ (here e^-18) solves the overfitting problem in this case.

Using a Validation Set How can we choose a good value for λ? A standard approach is to use a validation set. Like the training set, the validation set is a set of example inputs and associated outputs. However, the objects in the validation set should not appear in the training set. We use the training set to fit polynomials using many different values for λ. We choose the λ that gives the best results on the validation set.
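The procedure described above — train once per candidate λ, then keep the λ with the lowest validation error — can be sketched as follows. The data generation, the candidate grid, and the function names are my own choices:

```python
import numpy as np

def ridge_polyfit(x, t, degree, lam):
    """Regularized polynomial fit in closed form (coefficients w0 first)."""
    Phi = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(degree + 1), Phi.T @ t)

def sse(w, x, t):
    """Sum of squared errors of the polynomial with coefficients w on (x, t)."""
    return float(np.sum((np.polyval(w[::-1], x) - t) ** 2))

rng = np.random.default_rng(0)

def make_set(n):
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

x_train, t_train = make_set(10)   # used to fit, once per candidate lambda
x_val, t_val = make_set(10)       # disjoint from the training set

lambdas = [np.exp(k) for k in range(-30, 3, 3)]
best_lam = min(lambdas,
               key=lambda lam: sse(ridge_polyfit(x_train, t_train, 9, lam),
                                   x_val, t_val))
```

Note that only the validation error, never the training error, is used to pick λ: the training error would always favor the smallest λ.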

Using a Validation Set Strictly speaking, choosing a good value for λ is part of the training task. Oftentimes, we have a general method for solving a problem, which requires that we choose some parameters. Typically, the training set is used to solve our problem multiple times, with different choices of those parameters. The validation set is used to decide which choice of parameters works best.

Using a Test Set If we want to evaluate one or more methods, to see how well they work, we use a test set. Test examples should not appear either in the training set or in the validation set. Error rates on the test set are a reliable estimate of how well a function generalizes to data outside training. Error rates on the training set are not reliable for that task. Error rates on the validation set are still not quite reliable, as the validation set was used to choose some parameters.

Recap: Training, Validation, Test Sets Training set: used to learn the function that maps inputs to outputs. Validation set: used to evaluate different values of parameters (like λ for regularization) that need to be hardcoded during training. Train with different values, and then see how well each resulting function works on the validation set. Test set: used to evaluate the final product (after the choice of parameters has been finalized).
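A common way to obtain the three disjoint sets is to shuffle the data once and then partition it. A sketch; the split fractions and the function name are my own choices:

```python
import random

def split_data(examples, val_frac=0.2, test_frac=0.2, rng=random):
    """Shuffle, then partition into disjoint training/validation/test sets."""
    examples = list(examples)
    rng.shuffle(examples)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test
```

Shuffling before splitting matters: if the examples are ordered (e.g., by class), a contiguous split would give the three sets very different distributions.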