The Fundamentals of Machine Learning


The Fundamentals of Machine Learning
Willie Brink (Stellenbosch University, South Africa) and Nyalleng Moorosi (Council for Scientific and Industrial Research, South Africa)
Deep Learning Indaba 2017

About this tutorial
We'll provide a very gentle and intuition-friendly introduction to the basics of Machine Learning. We might brush over (and most likely skip) a few important issues, but luckily a lot of these will be revisited over the next few days.
Overview:
- the what, why and where of ML
- a few typical ML tasks
- optimization and probability (two often overlooked but key aspects of ML!)
- practical considerations

What is ML?
The aim of Machine Learning is to give a computer the ability to find patterns and underlying structure in data, in order to understand what is happening and to predict what will happen.
Many ML problems boil down to constructing a mathematical relationship between some input and output, given many examples, such that new examples will also be explained well. (The principle of generalization.)

Why is ML a thing?
Hard problems in high dimensions, like many modern CV or NLP problems, require complex models that would be difficult (if not impossible) for a human to engineer and hard-code. Machines can discover hidden, non-obvious patterns.
The proliferation of data, as well as the rise of cloud-based storage and computing, has opened up unprecedented possibilities for discovery and advancement in just about every scientific discipline.
Quick experiment: how many disciplines are in this room?

ML as data-driven modelling
- Get data.
- Decide what the machine should learn from the data, and which aspects (features) of the data are useful.
- Make assumptions, pick a suitable model/algorithm.
- Train the model. This is where the machine learns.
- Test how well the model performs (generalizes to unseen data).
Like any mathematical modelling process, this one can be iterative.

Common learning paradigms
- Supervised learning: learn a model from a given set of input-output pairs, in order to predict the output of new inputs.
- Unsupervised learning: discover patterns and learn the structure of unlabelled data.
- Reinforcement learning: learn what actions to take in a given situation, based on rewards and penalties.
- Semi-supervised learning: learn from a partially labelled dataset.
- Deep learning: any of these, but with very complex models and lots of data!

A typical ML task: regression
Fit a continuous functional relationship between input-output pairs.
(Figure: scatter plots of y against x, each with a fitted curve.)
Example applications:
- weather forecasting
- house price prediction
- epidemiology

A typical ML task: classification
Learn decision boundaries between the different classes in a dataset.
(Image: image-net.org)
Example applications:
- handwritten character recognition
- e-mail spam filtering
- cancer diagnosis

A typical ML task: clustering
Group similar datapoints into meaningful clusters.
(Image: pixolution.org/lab/palm)
Example applications:
- identify clients with similar insurance risk profiles
- discover links between symptoms and diseases
- build visual dictionaries

A typical ML task: reinforcement learning
Learn a winning strategy through constant feedback.
Example applications:
- autonomous navigation
- evaluating trading strategies
- learning to fly a helicopter

A typical ML task: make a recommendation
Generate personalized recommendations through collaborative filtering of sparse user ratings.
(Table: a sparse user-book ratings matrix in which each of users A-D has rated only some of six books.)
Example applications:
- which movies or books might a particular user enjoy
- which news articles are relevant for a particular user
- which song should be played next

Characteristics of an ML model
Consider the ML task of polynomial regression: given training data $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, fit an $n$th-degree polynomial of the form
$y = a_n x^n + a_{n-1} x^{n-1} + \ldots + a_1 x + a_0$.
The degree of the polynomial is a hyperparameter of the model. Its value should be chosen prior to training (akin to model selection).
The coefficients $a_n, a_{n-1}, \ldots, a_1, a_0$ are parameters for which optimal values are learned during training (model fitting).
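To make the distinction concrete, here is a minimal sketch in NumPy (the sine-plus-noise data is invented for illustration): the degree is fixed before fitting, while the coefficients are learned from the data.

```python
import numpy as np

# Invented training data: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 20)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

degree = 3                          # hyperparameter: chosen before training
coeffs = np.polyfit(x, y, degree)   # parameters a_n, ..., a_0: learned during training

y_new = np.polyval(coeffs, 1.5)     # predict the output for a new input
```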

Learning as an optimization problem
Consider a model with parameter set $\theta$. Training involves finding values for $\theta$ that let the model fit the data. The idea is to minimize an error, or difference between model predictions and true outputs in the training set.
Define an error/loss/cost function $E$ of the model parameters. A popular minimization technique is gradient descent:
$\theta_{t+1} = \theta_t - \eta \frac{\partial E}{\partial \theta}$
The learning rate $\eta$ can be updated (by some policy) to aid convergence.
Common issues to look out for!
- Convergence to a local, non-global minimum.
- Convergence to a non-optimum (if the learning rate becomes too small too quickly).
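The update rule translates directly into a loop. A minimal sketch, using a made-up one-parameter loss $E(\theta) = (\theta - 3)^2$ purely for illustration:

```python
def grad_E(theta):
    # Gradient of the toy loss E(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial guess
eta = 0.1     # learning rate

for t in range(100):
    theta = theta - eta * grad_E(theta)   # theta_{t+1} = theta_t - eta * dE/dtheta

print(theta)  # approaches the minimizer theta = 3
```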

Probability
Thinking about ML in terms of probability distributions can be extremely useful and powerful, especially for reasoning under uncertainty.
Your training data are samples from some underlying distribution, and a crucial underlying assumption is that the data in the test set are sampled from the same distribution.
Discriminative modelling: predict the most likely label $y$ given a sample $x$, i.e. solve $\arg\max_y p(y \mid x)$. Find the boundary between classes, and use it to classify new data.
Generative modelling: find $y$ by modelling the joint $p(x, y)$, i.e. solve $\arg\max_y p(x \mid y)\, p(y)$. Model the distribution of each class, and match new data to these distributions.
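As a small sketch of the generative route (all numbers invented): fit one Gaussian per class, then label a new point with the class that maximizes $p(x \mid y)\, p(y)$.

```python
import numpy as np
from scipy.stats import norm

# Invented 1D training samples for two classes.
x0 = np.array([1.0, 1.2, 0.8, 1.1])   # class 0
x1 = np.array([3.0, 2.8, 3.2, 3.1])   # class 1

# Generative model: a Gaussian per class, plus class priors p(y).
mu = [x0.mean(), x1.mean()]
sigma = [x0.std(), x1.std()]
prior = [0.5, 0.5]   # equal class sizes here

def classify(x):
    # argmax over y of p(x | y) p(y)
    scores = [norm.pdf(x, mu[y], sigma[y]) * prior[y] for y in (0, 1)]
    return int(np.argmax(scores))

print(classify(1.5), classify(2.9))   # -> 0 1
```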

Practical 1
- Implement a simple linear classifier.
- Minimize cross-entropy through gradient descent.
- Run the linear classifier on nonlinear data with a swiss-roll structure.
- Introduce a nonlinear activation function in the scores, thereby implementing a nonlinear classifier.
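The practical comes with its own materials; purely as an unofficial sketch of the first two steps, a softmax linear classifier trained by gradient descent on cross-entropy might look like this (the random blob data stands in for the swiss roll):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))          # 100 samples, 2 features (stand-in data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # 2 classes

W, b, eta = np.zeros((2, 2)), np.zeros(2), 0.5

for epoch in range(200):
    scores = X @ W + b                            # linear scores
    scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)     # softmax probabilities

    dscores = probs.copy()                        # gradient of mean cross-entropy
    dscores[np.arange(len(y)), y] -= 1.0
    dscores /= len(y)

    W -= eta * X.T @ dscores                      # gradient descent step
    b -= eta * dscores.sum(axis=0)

print((probs.argmax(axis=1) == y).mean())         # training accuracy
```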

Practical considerations
Let's take a closer look at a few practical aspects of...
- data collection and features
- models and training
- improving your model

Practical considerations: data collection and features

Data exploration and cleanup
- Explore your data: print out a few frames, plot stuff, eyeball potential correlations between inputs and outputs or between input dimensions, ...
- Deal with dirty data, missing data, outliers, etc. (domain knowledge is crucial).
- Preprocess or filter the data (e.g. de-noise)? Maybe not! We don't want to throw out useful information!
- Most learning algorithms accept only numerical data, so some encoding may be necessary, e.g. one-hot encoding of categorical data (see the sketch below):

  ID  Gender             ID  male  female  not specified
  0   female             0   0     1       0
  1   male               1   1     0       0
  2   not specified      2   0     0       1
  3   not specified      3   0     0       1
  4   female             4   0     1       0

Next, design the inputs: turn raw data into relevant data (features).
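A minimal sketch of this encoding with pandas (the Gender column is taken from the table above):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["female", "male", "not specified",
                              "not specified", "female"]})

# One-hot encode the categorical column: each category becomes its own indicator column.
one_hot = pd.get_dummies(df["Gender"])
print(one_hot)
```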

Feature engineering
Until recently, this has been viewed as a very important step. Now, since the surge of DL, it's becoming almost taboo in certain circles! With enough data, the machine should learn which features are relevant.
But in the absence of a lot of data, careful feature engineering can make all the difference (good features > good algorithms).
Features should be informative and discriminative. Domain knowledge can be crucially useful here. Which features might cause the required output?
A few feature selection strategies (the low-variance one is sketched below):
- remove features one-by-one and measure performance
- remove features with low variance (not discriminative)
- remove features that are highly correlated with others (not informative)
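An illustrative sketch of the low-variance strategy, using scikit-learn (the data and the threshold value are arbitrary):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 0.1],
              [0.0, 2.0, 0.1],
              [0.1, 3.0, 0.1],
              [0.0, 4.0, 0.1]])   # third feature is constant, first is nearly so

selector = VarianceThreshold(threshold=0.01)   # drop features with variance <= 0.01
X_reduced = selector.fit_transform(X)
print(X_reduced.shape)   # (4, 1): only the middle feature survives
```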

Feature transformations
Feature normalization can help remove bias in situations where different features are scaled differently (see the sketch after this slide).

  House  Size (m^2)  Bedrooms        House  Size    Bedrooms
  0      208         3               0      -0.956  -0.387
  1      460         5               1       1.190   1.162
  2      240         2               2      -0.684  -1.162
  3      373         4               3       0.449   0.387

Dimensionality reduction (e.g. PCA).
(Figure: points in the (x_0, x_1) plane are projected onto their principal axis, reducing two dimensions to one.)
Potential downside: dimensions lose their intuitive meaning.
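A minimal sketch of the standardization above: subtract each feature's mean and divide by its sample standard deviation, which reproduces the table's numbers.

```python
import numpy as np

X = np.array([[208, 3],
              [460, 5],
              [240, 2],
              [373, 4]], dtype=float)   # size (m^2), bedrooms

# Standardize each feature: zero mean, unit sample standard deviation (ddof=1).
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.round(X_norm, 3))   # matches the normalized table
```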

Practical considerations: models and training

Choosing a model
The choice depends on the data and what we want to do with it...
- Is the training data labelled or unlabelled?
- Do we want categorical/ordinal/continuous output?
- How complex do we think the relationship between inputs and outputs is?
Many models to choose from!
- Regression: linear, nonlinear, ridge, lasso, ...
- Classification: naive Bayes, logistic regression, SVM, neural nets, ...
- Clustering: k-means, mean-shift, EM, ...
No one model is inherently better than any other. It really depends on the situation. (The no-free-lunch theorem.)
Choosing the appropriate model and tuning its hyperparameters can be tricky, and may have to become part of the training phase.

Measuring performance
The value of the loss function after training (or how well training labels are reproduced) may give insight into how well the model assumptions fit the data, but it is not an indication of the model's ability to generalize!
Withhold part of the data from the training phase: the test set.
Can we use our test set to select the best model, or to find good values for the hyperparameters? No! That'd be like training on the test set! The performance measure on the test set should be unbiased.
Instead, extract a subset of the training set: the validation set. Train on the training set, and use the validation set for an indication of generalizability and to optimize over model structure and hyperparameters.
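A minimal sketch of such a three-way split with scikit-learn (the 60/20/20 proportions are an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)

# Split off a held-out test set first, then carve a validation set
# out of the remaining training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 30 10 10
```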

K-fold cross validation
The training/validation split reduces the data available for training.
Idea: partition the training set into K folds, and give every fold a chance to act as the validation set. Train on the other K-1 folds, record the validation error $E_i$ of the $i$th iteration, and average the results for a less biased indication of how our model might generalize:
$E = \frac{1}{K} \sum_{i=1}^{K} E_i$
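A minimal sketch with scikit-learn's KFold (the ridge model, K = 5, and the synthetic data are all illustrative choices):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    errors.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

print(np.mean(errors))   # E = (1/K) * sum of the per-fold errors E_i
```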

Performance metrics
How exactly do we measure the performance of a trained model (when validating or testing)? Many options!
Remember: it is very important to be clear about which metric you use. A number by itself doesn't mean much.
- Regression: RMSE, correlation coefficients, ...
- Classification: confusion matrix, precision and recall, ... (sketched below)
- Clustering: validity indices, inter-cluster density, ...
- Recommender systems: rank accuracy, click-through rates, ...
Note: most of these compare model output to a ground truth. Based on the application, which kinds of errors are more tolerable?
It might also be insightful to ponder the statistical significance of your performance measurements.
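A small sketch of the classification metrics (the labels are invented):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows: true class, columns: predicted class
print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 3/4
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4
```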

Practical considerations: improving your model

Underfitting and overfitting
We want models that generalize well to new (previously unseen) inputs. Reasons for under-performance include underfitting and overfitting.
Example: polynomial regression.
(Figure: three fits of y against x, labelled underfitting, a good fit, and overfitting.)
Underfitting (high model bias): the model is too simple for the underlying structure in the data.
Overfitting (high model variance): the model is too complex, and learns the noise in the data instead of ignoring it.

The bias-variance tradeoff
How can we find the sweet spot between underfitting and overfitting? Use the validation set!

                                   Underfit   Good fit   Overfit
  performance on training set      bad        good       very good
  performance on validation set    bad        good       bad

We want to keep training (keep decreasing the training error) only while the validation error is also decreasing.
(Figure: prediction error vs. training epoch; the training curve keeps falling while the validation curve levels off and rises, marking where to stop training.)
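A minimal sketch of this early-stopping logic. The per-epoch errors are simulated here; in practice they would come from evaluating your model on the training and validation sets.

```python
import math

best_val, best_epoch, patience = float("inf"), 0, 5

for epoch in range(100):
    train_err = math.exp(-0.1 * epoch)                    # keeps decreasing
    val_err = math.exp(-0.1 * epoch) + 0.0005 * epoch**2  # eventually rises again

    if val_err < best_val:
        best_val, best_epoch = val_err, epoch   # still improving: keep training
    elif epoch - best_epoch >= patience:
        print(f"stop training at epoch {epoch}")
        break
```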

A few more potential remedies
To avoid overfitting, try...
- a simpler model structure
- a larger training set (makes the model less sensitive to noise)
- regularization (penalize model complexity during optimization; sketched below)
- bagging (train different models on random subsets of the data, and aggregate)
To avoid underfitting, try...
- a more complex model structure
- more features (more information about the underlying structure)
- boosting (sequentially train models on random subsets of the data, focussing attention on areas where the previous model struggled)
Note: it seems as if, no matter what, more data is better!
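As a sketch of the regularization remedy: ridge regression adds a penalty on coefficient size to the loss, shrinking the weights towards zero (the data and the alpha value are invented for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))              # few samples, many features
y = X[:, 0] + 0.1 * rng.standard_normal(20)    # only the first feature matters

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)            # penalize large coefficients

# The regularized coefficients are shrunk towards zero, making the
# model less prone to fitting noise.
print(np.abs(plain.coef_).sum(), np.abs(ridge.coef_).sum())
```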

Summary
We touched on what ML is and why it exists, and listed a few examples of typical ML tasks.
We viewed the learning process as an optimization problem, and mentioned the importance of considering ML in terms of probability distributions.
We ran through a few practical considerations concerning data cleanup, feature selection, model selection and training, measuring performance, and improving performance.

Take-home messages
It can be extremely useful to think in terms of probability distributions, and to remember that training is essentially an optimization process.
Components of an ML solution, in decreasing order of importance:
- data
- features
- the model/algorithm
Of course, deep learning blurs the boundary between data and features.