Lecture 10 Summary and reflections


Lecture 10: Summary and reflections. Niklas Wahlström, Division of Systems and Control, Department of Information Technology, Uppsala University. Email: niklas.wahlstrom@it.uu.se

Contents of Lecture 10:
1. Summary of Lecture 9
2. Summary of the laboratory work
3. Summary of the whole course
4. Outlook: a few words about things that we have not covered
5. New course!

Summary of Lecture 9 (I/IV): Convolutional layer. Consider a hidden layer with 6 × 6 hidden units.
Dense layer: each hidden unit is connected to all pixels, and each pixel-hidden-unit pair has its own unique parameter.
Convolutional layer: each hidden unit is connected to a region of pixels via a set of parameters, a so-called kernel. Different hidden units share the same set of parameters.
[Figure: a 6 × 6 grid of input variables x_{1,1}, ..., x_{6,6}, plus a constant input, connected to a 6 × 6 grid of hidden units σ; in the convolutional case the connections are given by kernel parameters such as β^(1)_{1,3} and β^(1)_{3,3} together with an offset β^(1)_0.]

Summary of Lecture 9 (II/IV): Convolutional neural network (CNN). A full CNN usually consists of multiple convolutional layers (here three), followed by a few final dense layers (here two).
Input variables: size 28 × 28 × 1.
Layer 1 (hidden, convolutional): size 28 × 28 × 4, kernel 5 × 5, stride [1, 1].
Layer 2 (hidden, convolutional): size 14 × 14 × 8, kernel 5 × 5, stride [2, 2].
Layer 3 (hidden, convolutional): size 7 × 7 × 12, kernel 4 × 4, stride [2, 2].
Layer 4 (hidden, dense): size 200.
Layer 5 (output, dense): size 10, the predicted class probabilities p(y = 1 | x; θ), ..., p(y = 10 | x; θ).
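
The layer specification above translates almost line-by-line into code. Below is a minimal tf.keras sketch of this architecture, not the lab's original code; the ReLU activations and the 'same' padding are assumptions on my part, since the slide only specifies sizes, kernels and strides.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    # Layer 1: 5x5 kernel, stride 1; 'same' padding keeps the 28x28 size -> 28x28x4
    tf.keras.layers.Conv2D(4, kernel_size=5, strides=1, padding="same", activation="relu"),
    # Layer 2: 5x5 kernel, stride 2 halves the resolution -> 14x14x8
    tf.keras.layers.Conv2D(8, kernel_size=5, strides=2, padding="same", activation="relu"),
    # Layer 3: 4x4 kernel, stride 2 -> 7x7x12
    tf.keras.layers.Conv2D(12, kernel_size=4, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    # Layer 4: dense hidden layer with 200 units
    tf.keras.layers.Dense(200, activation="relu"),
    # Layer 5: output layer with 10 class probabilities
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```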

Summary of Lecture 9 (III/IV): Training a neural network. We train a network by considering the optimization problem

θ̂ = arg min_θ J(θ),  where  J(θ) = (1/n) ∑_{i=1}^n L(x_i, y_i, θ),

with θ all parameters of the network, θ̂ the estimated parameters, {x_i, y_i}_{i=1}^n the training data, L(x_i, y_i, θ) the loss function (for example cross-entropy), and J(θ) the cost function.

Summary of Lecture 9 (IV/IV): Stochastic gradient descent. At each optimization step we need to compute the gradient

g_t = ∇_θ J(θ_t) = (1/n) ∑_{i=1}^n ∇_θ L(x_i, y_i, θ_t).

Challenge: n is big, so the gradient is expensive to compute. Solution: at each iteration we only use a small random batch of the data set, a mini-batch, to compute the gradient g_t. This procedure is called stochastic gradient descent.
[Figure: the 20 training pairs (x_1, y_1), ..., (x_20, y_20) reshuffled and divided into mini-batches; the illustration shows iteration 6, epoch 2.]
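
To make the procedure concrete, here is a minimal NumPy sketch of mini-batch SGD for a one-layer network (softmax/logistic regression) trained with cross-entropy loss. It is an illustration only: the function name sgd_logistic and the data arrays X, Y are placeholders, not part of the course material.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)       # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sgd_logistic(X, Y, lr=0.5, batch_size=100, epochs=10):
    """Mini-batch SGD for softmax regression.
    X: (n, p) inputs, Y: (n, K) one-hot labels."""
    rng = np.random.default_rng(0)
    n, p = X.shape
    K = Y.shape[1]
    W = np.zeros((p, K))
    b = np.zeros(K)
    for epoch in range(epochs):
        order = rng.permutation(n)              # reshuffle the training data each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            P = softmax(Xb @ W + b)             # predicted class probabilities
            # Gradient of the average cross-entropy over the mini-batch
            grad_W = Xb.T @ (P - Yb) / len(idx)
            grad_b = (P - Yb).mean(axis=0)
            W -= lr * grad_W                    # gradient step with learning rate gamma
            b -= lr * grad_b
    return W, b
```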

Summary of laboratory work: one-layer neural network (logistic regression). Trained for 10 000 iterations, SGD with learning rate γ = 0.5.

Summary of laboratory work: two-layer neural network with sigmoid activation function. Trained for 10 000 iterations, SGD with learning rate γ = 0.5. Significantly better performance.

Summary of laboratory work: five-layer neural network with sigmoid activation function. Trained for 10 000 iterations, SGD with learning rate γ = 0.5. Convergence is slow; not yet converged.

Summary of laboratory work: five-layer neural network with ReLU activation function. Trained for 10 000 iterations, SGD with learning rate γ = 0.5. It trains much faster!

Summary of laboratory work: five-layer neural network with ReLU activation function. Trained for 10 000 iterations, Adam with learning rate γ = 0.002. Not a big difference with the Adam optimizer here (but it is important in the CNN part!).

Summary of laboratory work: CNN with three convolutional layers and two dense layers. Channels/units: 4-8-12-200-10; kernels: 5×5 stride 1, 5×5 stride 2, 4×4 stride 2. Adam with learning rate γ = 0.002. The CNN increases performance! The cost function oscillates, so decrease the learning rate.

Summary of laboratory work, extras! CNN with three convolutional layers and two dense layers. Channels/units: 4-8-12-200-10; kernels: 5×5 stride 1, 5×5 stride 2, 4×4 stride 2. Adam with learning rate decaying from γ = 0.003 to γ = 0.0001. And now we start to overfit... Regularize!

Summary of laboratory work, extras! CNN with three convolutional layers and two dense layers. Channels/units: 4-8-12-200-10; kernels: 5×5 stride 1, 5×5 stride 2, 4×4 stride 2. Adam with learning rate decaying from γ = 0.003 to γ = 0.0001. Dropout with p = 0.75 on the units in the last hidden layer. Better cross-entropy, and now also an improvement in accuracy!

Summary of laboratory work, extras! CNN with three convolutional layers and two dense layers. Channels/units: 6-12-24-200-10; kernels: 6×6 stride 1, 5×5 stride 2, 4×4 stride 2. Adam with learning rate decaying from γ = 0.003 to γ = 0.0001. Dropout with p = 0.75 on the units in the last hidden layer. This was the best I could get. Did you get any better?
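
For reference, a hedged tf.keras sketch of the two "extras" above: a decaying learning rate and dropout on the last hidden layer. The decay_steps and decay_rate values are placeholders chosen to move roughly from 0.003 towards 0.0001, and the slide's p = 0.75 is interpreted here as a keep probability (TensorFlow 1-style), i.e. a drop rate of 0.25 in Keras; the lab may use a different convention.

```python
import tensorflow as tf

# Exponentially decaying learning rate, roughly from 0.003 towards 0.0001
# (decay_steps and decay_rate are placeholders, not values from the lab).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.003, decay_steps=2000, decay_rate=0.3)

# Dense head on top of the last convolutional feature maps (7x7x24 for the
# configuration above). Keras' Dropout takes the probability of *dropping*
# a unit, so a keep probability of p = 0.75 corresponds to rate = 0.25.
head = tf.keras.Sequential([
    tf.keras.Input(shape=(7, 7, 24)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(200, activation="relu"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(10, activation="softmax"),
])
head.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
             loss="categorical_crossentropy", metrics=["accuracy"])
```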

This course. Machine learning gives computers the ability to solve problems without being explicitly programmed for the task at hand. This is done by learning from examples, i.e. from training data. Data on its own is typically useless; it is only when we can extract knowledge from the data that it becomes useful. Specifically, we have studied supervised learning methods, in which we build a model of the relationship between an input variable x and an output variable y.

Supervised machine learning: learning a model from labeled data. [Figure: training data together with labels (e.g. mat, mirror, boat, ...) are fed into a learning algorithm, which produces a model.]

Supervised machine learning: using the learned model on new, previously unseen data. The model must generalize to new unseen data. [Figure: unseen data is fed to the model, which outputs a prediction; illustrated with example test images from two disease classes, highlighting the difficulty of discerning malignant from benign cases.]

Inputs and outputs. The input x is composed of all the available variables which are believed to be relevant for predicting the value of the output y. We have considered the case where we have p input variables, x = (x_1, ..., x_p), and one output variable y. Both the inputs x_j and the output y can be either quantitative (can be ordered) or qualitative (takes values in an unordered set).

Regression and classification.
Regression: the output y is quantitative, the inputs x_j can be quantitative or qualitative, and the (conceptual) model is y = f(x) + ε.
Classification: the output y is qualitative, the inputs x_j can be quantitative or qualitative, and the (conceptual) model is p(y = k | x), k = 1, ..., K.

Bias-variance. E_new: how well a method will perform on unseen data. Bias: the inability of a model to describe the training data. Variance: how sensitive a model is to the training data.

E_new = bias² + variance + irreducible error

[Figure: error versus model complexity. The bias² term decreases and the variance term increases with model complexity; their sum Ē_new has a minimum between the underfitting and overfitting regimes and is bounded below by the irreducible error.]
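
To see the decomposition in action, here is a small NumPy sketch (not course code) that estimates bias² and variance for polynomial regression of increasing degree by refitting the model on many simulated training sets; the true function sin(2πx) and the noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # true function (arbitrary choice)
sigma = 0.3                                  # noise standard deviation
x_test = np.linspace(0, 1, 50)

def fit_predict(degree, n_train=20):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + sigma * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    return np.polyval(coeffs, x_test)

for degree in [1, 3, 9]:
    preds = np.array([fit_predict(degree) for _ in range(500)])   # 500 training sets
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    # E_new is approximately bias^2 + variance + irreducible error (sigma^2)
    print(f"degree {degree}: bias2 = {bias2:.3f}, variance = {variance:.3f}, "
          f"E_new approx {bias2 + variance + sigma**2:.3f}")
```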

Cross validation. To estimate E_new, we can use cross-validation. [Figure: the data set is split into c parts; in the 1st iteration the first part is used as validation data and the rest as training data, in the 2nd iteration the second part is held out, and so on up to the c-th iteration.] When using cross-validation to select, e.g., inputs and hyperparameters, there is a risk of overfitting! (But it can still be the best available option...)
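
A minimal scikit-learn sketch of c-fold cross-validation (here c = 5); the simulated data and the k-NN classifier are hypothetical choices, just to show the mechanics:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: 200 points in 2D with a simple labelling rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = KNeighborsClassifier(n_neighbors=5)
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds)   # one accuracy estimate per fold
print("estimated E_new (error rate):", 1 - scores.mean())
```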

Regularization. Regularization offers a way to decrease the model complexity (and hence the risk of overfitting).
Ridge regression: add a penalty term λ‖β‖₂².
LASSO: add a penalty term λ‖β‖₁; can result in sparse solutions.
Select λ, e.g. by cross-validation!
There are also other ways to change the model complexity: increase k in k-NN, or use bagging.
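
As an illustration, here is a scikit-learn sketch (with simulated data, not course code) of ridge regression and LASSO where λ, called alpha in scikit-learn, is selected by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# Simulated data: 100 observations, 20 inputs, only 3 of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + 0.5 * rng.normal(size=100)

alphas = np.logspace(-3, 2, 30)                  # candidate values of lambda
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = LassoCV(alphas=alphas, cv=5).fit(X, y)   # L1 penalty: can set some to zero

print("ridge lambda:", ridge.alpha_, "nonzero coefficients:", np.sum(ridge.coef_ != 0))
print("lasso lambda:", lasso.alpha_, "nonzero coefficients:", np.sum(lasso.coef_ != 0))
```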

Parametric vs. nonparametric models.
Parametric models: parameterized by a finite-dimensional parameter θ; training/learning the model amounts to estimating θ; once θ is estimated, the predictions depend only on θ (not on the training data). Examples: linear regression, LDA, QDA, neural networks.
Nonparametric models: the model flexibility is allowed to grow with the amount of available data; predictions depend directly on the training data; can be viewed as having an infinite number of parameters. Examples: k-NN, CART.

Ensemble methods. Ensemble methods are a type of meta-algorithm: construct one powerful model from multiple base models (the ensemble members), each of which may perform poorly on its own! We have encountered two such approaches: 1. Bagging: reduce the variance of low-bias/high-variance models by bootstrap aggregation. 2. Boosting: construct weak base models sequentially, so that each model tries to correct the mistakes of the previous one.
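
A hedged scikit-learn sketch contrasting the two approaches on a small simulated data set (the data set, tree settings and numbers of ensemble members are arbitrary illustrations):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=0)

# Bagging: average many deep (low-bias, high-variance) trees grown on bootstrap samples.
bagging = BaggingClassifier(DecisionTreeClassifier(max_depth=None),
                            n_estimators=100, random_state=0)

# Boosting: fit many shallow (weak) trees sequentially, each correcting the previous ones.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: cross-validated accuracy {acc:.3f}")
```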

A toolbox of methods. [Figure: a map of the methods covered in the course, organized along the axes regression/classification and parametric/non-parametric/ensemble: linear regression, logistic regression, LDA, QDA, k-NN, CART, random forests, AdaBoost, and (deep) neural networks.]

Summary for the exam (in one slide):
- Classification and regression problem formulations
- Parametric and non-parametric models
- Inputs and outputs / quantitative and qualitative variables
- Decision boundaries / linear vs. nonlinear classifiers
- Cross-validation (the purpose!) and model testing
- Bias-variance trade-off / model flexibility / over-fitting
- Regularization / ridge regression and LASSO
- The different methods discussed throughout the course

Summary for life. What should you remember from statistical machine learning?
- The problem formulations: regression and classification
- The existence of different types of methods
- The bias-variance trade-off and cross validation
- The possibilities: machine learning can be used for an extremely wide range of applications and data types
- The TSTF principle: Try Simple Things First!

Outlook: Unsupervised learning. Regression and classification are supervised learning problems: the models are trained using both inputs x and outputs y. Unsupervised learning methods instead try to find patterns in unlabeled data, i.e. we train the models from just the x. Examples: dimensionality reduction / manifold learning, cluster analysis, generative model learning, blind source separation.

Outlook: Reinforcement learning. A reinforcement learning system is asked to take actions that influence its environment in order to maximize a reward. Contrary to supervised learning, the correct input/output pairs are not revealed; learning has to be carried out based on the reward feedback, and there is often a focus on online performance (the exploration-exploitation trade-off).

New course! Advanced probabilistic machine learning. Contents (very brief): probabilistic/Bayesian modeling, Bayesian linear regression, graphical models, Gaussian processes, variational inference, Monte Carlo methods, unsupervised learning, variational autoencoders. Examination: mini-project, lab, oral exam. When: period 1, running every year starting this fall. Info: http://www.it.uu.se/edu/course/homepage/apml/

Our machine learning research, ultrabrief: Monte Carlo methods (especially sequential Monte Carlo), deep learning, Gaussian processes, and the use of probabilistic programming. Applications: autonomous driving (with Autoliv), digital pathology (with Sectra), etc. We take a particular interest in nonlinear dynamical systems. [Figure: an object-detection example from autonomous driving, showing the estimated road surface (white) and a detected obstacle (purple), here a curb, a speed bump or a traffic cone.]

Thank you! Machine learning gives computers the ability to solve problems without being explicitly programmed for the task at hand. Thank you for your attention and good luck in the future!