
Parallel & Scalable Machine Learning: Introduction to Machine Learning Algorithms
Dr.-Ing. Morris Riedel, Adjunct Associated Professor, School of Engineering and Natural Sciences, University of Iceland; Research Group Leader, Juelich Supercomputing Centre, Germany
LECTURE 5: Supervised Classification and Learning Theory Basics, January 16th, 2018, JSC, Germany

Review of Lecture 4: Unsupervised Clustering. K-Means & K-Median; DBSCAN very effective. Applications in context: parameter changes (minPoints & epsilon), point cloud datasets (3D/4D laser scans of cities, buildings, etc.), the Bremen small & big datasets, and Big Data at the scale of whole countries (e.g. the Netherlands).

Outline

Outline of the Course
1. Introduction to Machine Learning Fundamentals
2. PRACE and Parallel Computing Basics
3. Unsupervised Clustering and Applications
4. Unsupervised Clustering Challenges & Solutions
5. Supervised Classification and Learning Theory Basics
6. Classification Applications, Challenges, and Solutions
7. Support Vector Machines and Kernel Methods
8. Practicals with SVMs
9. Validation and Regularization Techniques
10. Practicals with Validation and Regularization
11. Parallelization Benefits
12. Cross-Validation Practicals
(Day One: beginner; Day Two: moderate; Day Three: expert)

Outline
Supervised Classification Approach: Formalization of Machine Learning; Mathematical Building Blocks; Feasibility of Learning; Hypothesis Set & Final Hypothesis; Learning Models & Linear Example
Learning Theory Basics: Union Bound & Problematic Factor M; Theory of Generalization; Linear Perceptron Example in Context; Model Complexity & VC Dimension; Problem of Overfitting

Supervised Classification Approach

Learning Approaches: Supervised Learning Revisited. Example of a very simple linear supervised learning model: the Perceptron. [Figure: scatter plot of Iris-setosa and Iris-virginica samples (N = 100) over petal length vs. petal width (in cm), with a linear decision boundary separating the two classes and an unlabelled query point marked "?".]

Learning Approaches: Supervised Learning Formalization. Each observation of the predictor measurement(s) has an associated response measurement: input and output data together form the training examples (historical records, ground-truth data, examples). Goal: fit a model that relates the response to the predictors. Prediction aims at accurately predicting the response for future observations; inference aims at better understanding the relationship between the response and the predictors. Supervised learning fits a model that relates the response to the predictors; it is used in classification algorithms such as SVMs; it works with data = [input, correct output]. [1] An Introduction to Statistical Learning

Feasibility of Learning. Statistical Learning Theory deals with the problem of finding a predictive function based on data; it is the theoretical framework underlying practical learning algorithms, e.g. Support Vector Machines (SVMs), and is best understood for supervised learning. [2] Wikipedia on statistical learning theory. This theoretical background is used to solve a learning problem: inferring one target function that maps between input and output, so that the learned function can be used to predict output from future input (fitting existing data alone is not enough). The unknown target function is the ideal function.

Mathematical Building Blocks (1). Unknown target function (ideal function): an element we do not (and need not) know exactly. Training examples (historical records, ground-truth data, examples): elements we must and/or should have, and that might raise huge demands for storage. The remaining building blocks are elements that we derive from our skillset; deriving some of them can be computationally intensive.

Mathematical Building Blocks (1): Our Linear Example. Unknown target function (ideal function): 1. some pattern exists; 2. there is no exact mathematical formula (i.e. no known target function); 3. data exists. Training examples (historical records, ground-truth data, examples). (If we knew the exact target function, we would not need machine learning; it would make no sense.) The decision boundaries depend on f: classify as Iris-virginica if the weighted sum of the features exceeds the threshold, and as Iris-setosa otherwise (the weights w_i and the threshold are still unknown to us); we search for a function similar to the target function. A minimal sketch of this hypothesis follows below.
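A minimal sketch of this linear hypothesis in Python, assuming two features (petal length and petal width) and hand-picked, purely illustrative values for the weights w_i and the threshold; in practice a learning algorithm has to find these from the training examples:

```python
import numpy as np

def perceptron_hypothesis(x, w, threshold):
    """Linear hypothesis: +1 (Iris-virginica) if the weighted sum of the
    features exceeds the threshold, -1 (Iris-setosa) otherwise."""
    return 1 if np.dot(w, x) > threshold else -1

# Illustrative values only (an assumption, not trained weights): the
# weights w_i and the threshold are exactly the unknowns a learning
# algorithm must find from the training examples.
w = np.array([0.8, 1.2])              # weights for petal length & width
threshold = 4.5
x = np.array([5.1, 1.9])              # one observation (length, width in cm)
print(perceptron_hypothesis(x, w, threshold))  # -> 1 (Iris-virginica)
```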

Feasibility of Learning: Hypothesis Set & Final Hypothesis. The ideal function will remain unknown in learning: it is impossible to know it and learn it exactly from data; if it were known, a straightforward implementation would be better than learning; e.g. hidden features/attributes of the data may be unknown or missing from the data. But a (function) approximation of the target function is possible: use the training examples to learn and approximate it. The hypothesis set consists of M different hypotheses (candidate functions); we select the one function that best approximates the target: the final hypothesis.

Mathematical Building Blocks (2). Unknown target function (ideal function): an element we do not (and need not) know exactly. Training examples (historical records, ground-truth data, examples): elements we must and/or should have, and that might raise huge demands for storage. Final hypothesis and hypothesis set (set of candidate formulas): elements that we derive from our skillset; deriving them can be computationally intensive.

Mathematical Building Blocks (2): Our Linear Example. The decision boundaries depend on f; we search for a function similar to the target function. Hypothesis set: the Perceptron model (a linear model). Final hypothesis: the trained perceptron model, i.e. our selected final hypothesis.

The Learning Model: Hypothesis Set & Learning Algorithm. The solution tools, together the learning model: 1. the hypothesis set, a set of candidate formulas/models; 2. the learning algorithm, which trains a system using known algorithms. The training examples feed the learning algorithm ("train a system"), which selects the final hypothesis from the hypothesis set. Our linear example: 1. the Perceptron model; 2. the Perceptron Learning Algorithm (PLA); a sketch of the PLA follows below.
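A compact sketch of the Perceptron Learning Algorithm, assuming the usual convention of absorbing the threshold into the weight vector as a bias weight with constant input x_0 = 1; the slides do not prescribe an implementation, so this is one plausible rendering on a tiny hypothetical dataset:

```python
import numpy as np

def pla_train(X, y, max_iter=1000):
    """Perceptron Learning Algorithm: repeatedly pick one misclassified
    training point and move the weights towards classifying it correctly.
    Terminates only if the data is linearly separable."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 (bias)
    w = np.zeros(Xb.shape[1])
    for _ in range(max_iter):
        misclassified = np.where(np.sign(Xb @ w) != y)[0]
        if misclassified.size == 0:
            return w                                 # final hypothesis g
        i = misclassified[0]                         # pick one bad point
        w += y[i] * Xb[i]                            # PLA update rule
    raise RuntimeError("not linearly separable within max_iter updates")

# Hypothetical, linearly separable toy data (labels in {-1, +1}):
X = np.array([[1.4, 0.2], [1.3, 0.3], [5.1, 1.9], [4.8, 1.8]])
y = np.array([-1, -1, 1, 1])
print(pla_train(X, y))
```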

Mathematical Building Blocks (3). Unknown target function (ideal function): an element we do not (and need not) know exactly. Training examples (historical records, ground-truth data, examples): elements we must and/or should have, and that might raise huge demands for storage. Learning algorithm ("train a system", from the set of known algorithms), final hypothesis, and hypothesis set (set of candidate formulas): elements that we derive from our skillset; deriving them can be computationally intensive.

Mathematical Building Blocks (3): Our Linear Example. Unknown target function (ideal function); training data: the training examples (historical records, ground-truth data, examples). The learning algorithm ("train a system", here the Perceptron Learning Algorithm) uses the training dataset: in the training phase it finds the weights w_i and the threshold that fit the data. Hypothesis set: the Perceptron model (a linear model). Final hypothesis: the trained perceptron model, our selected final hypothesis.

[Video] Towards Multi-Layer Perceptrons. [3] YouTube Video, Neural Networks: A Simple Explanation

Learning Theory Basics

Feasibility of Learning: Probability Distribution. We must predict output from future input (fitting existing data is not enough): it is possible that 1000 in-sample points fit well while the out-of-sample points (the 1001st and beyond) do not fit well at all. Learning an arbitrary target function is not feasible (it can be anything), so we need an assumption about future input: a statement about data outside the in-sample data is possible because all samples (also future ones) are derived from the same unknown probability distribution. Building blocks so far: unknown target function, training examples, and the probability distribution (the exact probability distribution is not important, but the data should not be completely random). Statistical Learning Theory assumes an unknown probability distribution over the input space X.

Feasibility of Learning: In-Sample vs. Out-of-Sample. Given the unknown probability distribution and a large sample of size N, there is a probability of picking one point or another. The error on the in-sample data is a known quantity (using the labelled data): E_in(h). The error on out-of-sample data is an unknown quantity: E_out(h). The in-sample frequency is likely close to the out-of-sample frequency, i.e. E_in tracks E_out, depending on which hypothesis h out of the M different ones is considered. In sample: the data we use; out of sample: what we want to predict. We use E_in(h) as a proxy for E_out(h), thus arguing the other way around in learning. This is the part of Statistical Learning Theory that makes learning feasible in a probabilistic sense (via the distribution P on X).
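For a single, fixed hypothesis h, the statement that E_in(h) tracks E_out(h) is Hoeffding's Inequality; a standard form, consistent with the tolerance Є used on these slides, is:

```latex
\mathbb{P}\left[\,\left|E_{\mathrm{in}}(h) - E_{\mathrm{out}}(h)\right| > \epsilon\,\right] \;\le\; 2\,e^{-2\epsilon^{2}N}
```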

Feasibility of Learning: Union Bound & Factor M. The union bound means that, for any countable set of M events, the probability that at least one of the events happens is not greater than the sum of the probabilities of the M individual events. Assuming no overlaps in the hypothesis set, we apply this mathematical rule (note the usage of g instead of h: to select the final hypothesis we need to visit all M hypotheses). Call it a "bad event" if E_in deviates from E_out by more than the tolerance Є; the union bound then sums the bad-event probabilities over all M different hypotheses, each a fixed quantity obtained from Hoeffding's Inequality. Problematic: if M is too big, we lose the link between the in-sample and the out-of-sample error.

Feasibility of Learning: Modified Hoeffding's Inequality. Errors in-sample track errors out-of-sample; the statement is made in a "Probably Approximately Correct" (PAC) sense. Given M as the number of hypotheses in the hypothesis set and the tolerance parameter Є in learning [4] (Valiant, A Theory of the Learnable, 1984), this is mathematically established via a modified Hoeffding's Inequality (the original Hoeffding's Inequality does not apply to multiple hypotheses): the probability that E_in deviates from E_out by more than the tolerance Є is a small quantity depending on M and N ("approximately" from Є, "probably" from the bound). Theoretical Big Data impact: more N means better learning; the more samples N, the more reliably E_in will track E_out (but the quality of the samples also matters, not only their number). This is the part of Statistical Learning Theory describing Probably Approximately Correct (PAC) learning; a worked sketch follows below.
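The modified inequality for M hypotheses reads P[|E_in(g) - E_out(g)| > Є] ≤ 2M·e^(-2Є²N). To make "more N means better learning" concrete, a small sketch with made-up values for M and Є:

```python
import math

def hoeffding_bound(M, epsilon, N):
    """Upper bound on P[|E_in - E_out| > epsilon] with M hypotheses."""
    return 2 * M * math.exp(-2 * epsilon ** 2 * N)

for N in (100, 1000, 10000):
    print(N, hoeffding_bound(M=10, epsilon=0.1, N=N))
# N=100 gives a bound > 1 (meaningless); N=1000 is already ~4e-8.
# The bound shrinks exponentially in N but grows linearly in M,
# which is exactly why a too-big M breaks the in/out-of-sample link.
```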

Mathematical Building Blocks (4). Unknown target function (ideal function) and probability distribution: elements we do not (and need not) know exactly; they act as constants in learning. Training examples (historical records, ground-truth data, examples): elements we must and/or should have, and that might raise huge demands for storage. Learning algorithm ("train a system", from the set of known algorithms), final hypothesis, and hypothesis set (set of candidate formulas): elements that we derive from our skillset; deriving them can be computationally intensive.

Mathematical Building Blocks (4): Our Linear Example. With continuous parameters there are infinitely many decision boundaries (infinite M) depending on f. Probability distribution P: is a given point very likely from the same distribution, or just noise? We assume future points are drawn from the same probability distribution as the points in our training examples. (We help ourselves here with the assumption about the samples; we do not solve the M problem yet. A counter-example would be a random number generator: impossible to learn from!)

Statistical Learning Theory: Error Measure & Noisy Targets. Question: how can we learn a function from (noisy) data? Error measures quantify our progress towards the goal; they are often user-defined, otherwise often the squared error, e.g. as a point-wise error measure (think of a movie rated now and again in 10 years). The (noisy) target is not a (deterministic) function: getting the same y out for the same x in is not always given in practice. Problem: noise in the data hinders us from learning. Idea: use a target distribution instead of a target function, e.g. credit approval (yes/no). Statistical Learning Theory thus refines the learning problem to learning an unknown target distribution.
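Written out (a standard form, consistent with the slides' point-wise error idea), the squared point-wise error and the in-sample error it induces are:

```latex
e\big(h(\mathbf{x}), f(\mathbf{x})\big) = \big(h(\mathbf{x}) - f(\mathbf{x})\big)^{2},
\qquad
E_{\mathrm{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} e\big(h(\mathbf{x}_{n}), y_{n}\big)
```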

Mathematical Building Blocks (5). Unknown target distribution, i.e. the target function plus noise (replacing the ideal function), and probability distribution: elements we do not (and need not) know exactly; they act as constants in learning. Training examples (historical records, ground-truth data, examples): elements we must and/or should have, and that might raise huge demands for storage. Error measure, learning algorithm ("train a system", from the set of known algorithms), final hypothesis (final formula), and hypothesis set (set of candidate formulas): elements that we derive from our skillset; deriving them can be computationally intensive.

Mathematical Building Blocks (5): Our Linear Example. The PLA is an iterative method using the (labelled) training data; one point at a time is picked. 1. Pick one misclassified training point, i.e. a point where the current hypothesis disagrees with the label y. 2. Update the weight vector: (a) if y = +1, the update w + yx adds the input vector to w; (b) if y = -1, it subtracts the input vector from w (y_n is either +1 or -1, so both cases are the same rule). The algorithm terminates when there are no misclassified points left; it converges only with linearly separable data. The unified update rule is written out below.
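Cases (a) and (b) collapse into a single update once the label's sign is used: for a misclassified point (x(t), y(t)), i.e. sign(w(t)ᵀx(t)) ≠ y(t),

```latex
\mathbf{w}(t+1) = \mathbf{w}(t) + y(t)\,\mathbf{x}(t)
```

Adding y(t)·x(t) rotates the decision boundary towards correctly classifying the picked point, which is the geometric picture sketched on the slide.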

Training and Testing: Influence on Learning. Mathematical notation: testing evaluates a single, fixed final hypothesis (the hypothesis is clear), while training still searches over the hypothesis set (hypothesis search). Practice on training examples: create two disjoint datasets, one used for training only (the training set) and another used for testing only (the test set); a minimal split sketch follows below. (Think of a student exam: training on example questions to get E_in down, then testing via the real exam.) Training and testing are different phases in the learning process, and the concrete number of samples in each set often influences learning.
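A minimal sketch of creating the two disjoint sets, assuming plain NumPy arrays of labelled samples (no particular library convention is implied by the slides):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.3, seed=0):
    """Shuffle once, then cut into two disjoint sets: one used for
    training only (train set), one used for testing only (test set)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```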

Exercises: Check Indian Pines Dataset (Training vs. Testing)

Theory of Generalization: Initial Generalization & Limits. Learning is feasible in a probabilistic sense: we report the final hypothesis g within a generalization window, expecting the out-of-sample performance E_out(g) to track the in-sample performance E_in(g); in this approach E_in(g) acts as a proxy for E_out(g). This is not yet full learning, rather "good generalization", since the quantity E_out(g) is unknown. Reasoning: the condition above is not the final goal; what we ultimately want is more like E_out(g) approximating 0 (the out-of-sample error is close to 0 if g approximates f well), because E_out(g) measures how far away the hypothesis is from the target function. This is problematic because E_out(g) is an unknown quantity and cannot be used directly. The learning process thus requires two general core building blocks.

Theory of Generalization: Learning Process Reviewed. "Learning well" means two core building blocks together achieve E_out(g) approximately 0. First core building block: a theoretical result using Hoeffding's Inequality, needed because using E_out(g) directly is not possible (it is an unknown quantity); it guarantees E_out(g) stays close to E_in(g). Second core building block: a practical result, using tools and techniques to get the in-sample error low, e.g. linear models with the Perceptron Learning Algorithm (PLA); using E_in(g) is possible because it is a known quantity, so let's get it small. Lesson learned from practice: in many situations getting E_in close to 0 is impossible, e.g. the remote sensing use case of land cover classification. Full learning means that we can make sure that E_out(g) is close enough to E_in(g) [from theory] and that E_in(g) is small enough [from practical techniques].

Complexity of the Hypothesis Set: Infinite Spaces Problem. There is a tradeoff between the tolerance Є, the number of hypotheses M, and the complexity of the hypothesis space H; a key contribution of detailed learning theory is understanding the factor M (the number of elements of the hypothesis set) and finding a way to deal with infinite hypothesis spaces. The bound is OK if N gets big, but problematic if M gets big: the bound becomes meaningless. E.g. classification models like the perceptron or support vector machines have continuous parameters, and consequently infinite hypothesis spaces. Approach: despite their size, such models still have limited expressive power. Many elements of the hypothesis set H have continuous parameters, leading to infinite hypothesis spaces (infinite M).

Factor M from the Union Bound & Hypothesis Overlaps. The union bound assumes no overlaps, i.e. that all bad events happen disjointly, and so takes no overlaps between the M hypotheses into account: it is a poor bound that ignores the correlation between hypotheses. Overlaps are common: for two similar hypotheses h_1 and h_2, ΔE_out corresponds to a change in decision areas (unimportant), while ΔE_in corresponds to a change in the labels of actual data points (important); such label changes happen far less often, an indicator that M can effectively be reduced. Statistical Learning Theory provides a quantity able to characterize the overlaps for a better bound; the bound being replaced is written out below.
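Written out, with B_m denoting the bad event that E_in(h_m) deviates from E_out(h_m) by more than Є, the union bound used above is:

```latex
\mathbb{P}\left[\,B_{1} \;\text{or}\; B_{2} \;\text{or}\; \dots \;\text{or}\; B_{M}\,\right] \;\le\; \sum_{m=1}^{M} \mathbb{P}\left[B_{m}\right]
```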

Replacing M & Large Overlaps. The chain goes from Hoeffding's Inequality (valid for 1 hypothesis) via the union bound (valid for M hypotheses, the worst case) towards the Vapnik-Chervonenkis bound (valid with the growth function m_H(N) in place of M). Characterizing the overlaps is the idea of the growth function: it counts the number of dichotomies, i.e. hypotheses restricted to a finite number N of points. There is much redundancy: many hypotheses report the very same dichotomy. The mathematical proof that m_H(N) can replace M is a key part of the theory of generalization; the definition follows below.
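The growth function counts the maximum number of dichotomies H can produce on any N points; since each point is labelled ±1, it can never exceed 2^N:

```latex
m_{\mathcal{H}}(N) = \max_{\mathbf{x}_{1},\dots,\mathbf{x}_{N} \in \mathcal{X}} \left|\,\mathcal{H}(\mathbf{x}_{1},\dots,\mathbf{x}_{N})\,\right| \;\le\; 2^{N}
```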

Complexity of the Hypothesis Set: VC Inequality. The Vapnik-Chervonenkis (VC) Inequality is the result of the mathematical proof obtained when replacing M with the growth function; the growth function is evaluated at 2N because the argument uses a second sample ("2 times x, no y"), and the inequality characterizes generalization. In short, finally: we are able to learn and can generalize out-of-sample. The Vapnik-Chervonenkis Inequality is the most important result in machine learning theory; the proof establishes that M can be replaced by the growth function (no infinity anymore). The inequality is written out below.
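In the form given in the Caltech lecture series this course follows, the VC Inequality reads:

```latex
\mathbb{P}\left[\,\left|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)\right| > \epsilon\,\right] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^{2}N}
```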

Complexity of the Hypothesis Set: VC Dimension. The Vapnik-Chervonenkis (VC) dimension d_VC is defined over the instance space X; it yields a generalization bound that holds for all possible target functions. Issue: E_out is unknown and cannot be computed directly; VC solved this using the growth function on two different samples, with the idea that the first-sample frequency is close to the second-sample frequency. The familiar picture: the generalization error is bounded by the training error (first sample, in-sample) plus a model-complexity term that grows with d_VC, while the training error itself falls as d_VC grows; the best out-of-sample error is reached at an intermediate d*_VC. The complexity of the hypothesis set H can be measured by the VC dimension d_VC; ignoring the model complexity d_VC leads to situations where E_in(g) goes down while E_out(g) goes up. The corresponding bound is written out below.
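The "training error plus model complexity" picture corresponds to the VC generalization bound: with probability at least 1 - δ,

```latex
E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}
```

The square-root term grows with the richness of H (through m_H, and hence d_VC), making explicit why driving E_in down with an overly complex model can drive E_out up.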

[Video] Prevent Overfitting for Better Out-of-Sample Generalization. [5] Stop Overfitting, YouTube

Lecture Bibliography

Lecture Bibliography
[1] An Introduction to Statistical Learning with Applications in R, Online: http://www-bcf.usc.edu/~gareth/isl/index.html
[2] Wikipedia on Statistical Learning Theory, Online: http://en.wikipedia.org/wiki/statistical_learning_theory
[3] YouTube Video, Neural Networks: A Simple Explanation, Online: http://www.youtube.com/watch?v=dctutpjn42s
[4] Leslie G. Valiant, "A Theory of the Learnable", Communications of the ACM 27(11):1134-1142, 1984, Online: https://people.mpi-inf.mpg.de/~mehlhorn/seminarevolvability/valiantlearnable.pdf
[5] Udacity, "Overfitting", Online: https://www.youtube.com/watch?v=cxaxrcv9woa
Acknowledgements and more information: Yaser Abu-Mostafa, Caltech Lecture series, YouTube

Slides available at http://www.morrisriedel.de/talks