Lecture 10 Summary and reflections

Niklas Wahlström
Division of Systems and Control, Department of Information Technology, Uppsala University
Email: niklas.wahlstrom@it.uu.se
SML - Lecture 10

Contents, Lecture 10
1. Summary of Lecture 9
2. Summary of the laboratory work
3. Summary of the whole course
4. Outlook: a few words about things that we have not covered
5. New course!

Summary of Lecture 9 (I/IV): Convolutional layer
Consider a hidden layer with 6 × 6 hidden units.
- Dense layer: Each hidden unit is connected to all pixels. Each pixel/hidden-unit pair has its own unique parameter.
- Convolutional layer: Each hidden unit is connected to a region of pixels via a set of parameters, a so-called kernel. Different hidden units share the same set of parameters.
[Figure: a 6 × 6 grid of input variables $x_{1,1}, \dots, x_{6,6}$ and a 6 × 6 grid of hidden units $\sigma$, with the kernel parameters $\beta^{(1)}_{1,3}$, $\beta^{(1)}_{3,3}$ and the offset $\beta^{(1)}_0$ marked.]
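
To make the difference in parameter count concrete, here is a minimal NumPy sketch of both layer types on a 6 × 6 input. The 3 × 3 kernel size, the zero padding at the image border, the stride of 1 and all function names are assumptions made for this illustration; the slide itself only states that each hidden unit sees a region of pixels through shared parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_layer(x, W, b):
    """Dense layer: every hidden unit has its own weight for every pixel.

    x: (6, 6) image, W: (36, 36) weights, b: (36,) offsets.
    """
    return sigmoid(W @ x.ravel() + b).reshape(x.shape)

def conv_layer(x, kernel, offset):
    """Convolutional layer: all hidden units share one kernel and one offset.

    Assumes stride 1 and zero padding, so the output has the input's size.
    """
    k = kernel.shape[0]
    xp = np.pad(x, k // 2)  # zero padding around the image
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            region = xp[i:i + k, j:j + k]  # pixels seen by hidden unit (i, j)
            out[i, j] = np.sum(region * kernel) + offset
    return sigmoid(out)

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
W = rng.normal(size=(36, 36))
b = rng.normal(size=36)
kernel = rng.normal(size=(3, 3))  # hypothetical 3 x 3 kernel
offset = 0.1

h_dense = dense_layer(x, W, b)
h_conv = conv_layer(x, kernel, offset)

n_dense_params = W.size + b.size  # 36 * 36 weights + 36 offsets
n_conv_params = kernel.size + 1   # 9 shared weights + 1 shared offset
```

Under these assumptions both layers produce 6 × 6 hidden units, but the convolutional layer does so with far fewer parameters because the kernel is reused at every position.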

Summary of Lecture 9 (II/IV): Convolutional neural network (CNN)
A full CNN usually consists of multiple convolutional layers (here three) and a few final dense layers (here two).

Layer               Type            Size           Kernel (rows × columns)   Stride
Input variables     -               28 × 28 × 1    -                         -
Layer 1 (hidden)    Convolutional   28 × 28 × 4    5 × 5                     [1, 1]
Layer 2 (hidden)    Convolutional   14 × 14 × 8    5 × 5                     [2, 2]
Layer 3 (hidden)    Convolutional   7 × 7 × 12     4 × 4                     [2, 2]
Layer 4 (hidden)    Dense           200            -                         -
Layer 5 (output)    Dense           10             -                         -

The output is a vector of predicted class probabilities of size 10: $p(y = k \mid x; \theta)$ for $k = 1, \dots, 10$.
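
The layer sizes of the architecture above can be checked with a few lines of Python. This is a sketch under two assumptions that the slide does not state: zero padding such that the output size is the input size divided by the stride (rounded up), and one offset parameter per output channel or dense unit. The helper names are hypothetical.

```python
import math

def conv_output_size(size, stride):
    # Assumption: zero padding chosen so that output = ceil(input / stride).
    return math.ceil(size / stride)

def conv_params(kernel, in_channels, out_channels):
    # One kernel per (input channel, output channel) pair, plus one offset per output channel.
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    # One weight per input/output pair, plus one offset per output unit.
    return n_in * n_out + n_out

conv_layers = [(5, 1, 4), (5, 2, 8), (4, 2, 12)]  # (kernel, stride, channels)

size, channels = 28, 1  # input image: 28 x 28 x 1
shapes, counts = [], []
for kernel, stride, out_channels in conv_layers:
    counts.append(conv_params(kernel, channels, out_channels))
    size, channels = conv_output_size(size, stride), out_channels
    shapes.append((size, size, channels))

n_flat = size * size * channels  # units fed into the first dense layer
counts.append(dense_params(n_flat, 200))
counts.append(dense_params(200, 10))
total_params = sum(counts)
```

With these assumptions the computed shapes match the sizes listed for layers 1 to 3, and most of the parameters sit in the first dense layer rather than in the convolutional layers.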

Summary of Lecture 9 (III/IV): Training a neural network
We train a network by considering the optimization problem
$$\hat{\theta} = \arg\min_{\theta} J(\theta), \qquad J(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(x_i, y_i, \theta),$$
where
- $\theta$: all parameters of the network
- $\hat{\theta}$: the estimated parameters
- $\{x_i, y_i\}_{i=1}^{n}$: the training data
- $L(x_i, y_i, \theta)$: the loss function (for example cross-entropy)
- $J(\theta)$: the cost function
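
As an illustration of the cost function, the sketch below evaluates $J(\theta)$ with the cross-entropy loss for a classifier with a softmax output. It assumes that the network output is available as a matrix of logits (one row per data point) and that the classes are coded 0, ..., K-1; the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy_cost(logits, y):
    """J = (1/n) * sum_i L_i, with L_i = -log p(y_i | x_i; theta)."""
    n = logits.shape[0]
    probs = softmax(logits)
    losses = -np.log(probs[np.arange(n), y])  # loss of each data point
    return losses.mean()

# Four data points, ten classes, all logits equal: the model predicts
# probability 1/10 for every class, so each loss equals log(10).
logits = np.zeros((4, 10))
y = np.array([0, 3, 5, 9])
J = cross_entropy_cost(logits, y)
```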

Summary of Lecture 9 (IV/IV): Stochastic gradient descent
At each optimization step we need to compute the gradient
$$g_t = \nabla_{\theta} J(\theta_t) = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta} L(x_i, y_i, \theta_t).$$
Challenge: when $n$ is big, the gradient is expensive to compute.
Solution: in each iteration, we use only a small random subset of the data set, a mini-batch, to compute the gradient $g_t$. This procedure is called stochastic gradient descent.
[Figure: the 20 training data points $(x_i, y_i)$ in reshuffled order, shown at iteration 6, epoch 2.]
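
The mini-batch procedure can be sketched as follows. To keep the example short, a linear model with squared-error loss stands in for the neural network and cross-entropy; the batch size of 5, the learning rate and the number of epochs are assumptions chosen for this toy problem, not values from the lecture.

```python
import numpy as np

def sgd(X, y, theta, lr=0.1, batch_size=5, n_epochs=200, seed=0):
    """Mini-batch SGD for the squared-error cost of a linear model."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for epoch in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the data at the start of each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of the current mini-batch
            residual = X[idx] @ theta - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient on the mini-batch only
            theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))  # n = 20 data points, as in the figure
theta_true = np.array([1.5, -0.5])
y = X @ theta_true  # noise-free outputs, so the minimizer is theta_true
theta_hat = sgd(X, y, np.zeros(2))
```

One epoch is one pass through all mini-batches, so with 20 data points and a batch size of 5 each epoch consists of four iterations.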

Summary of laboratory work
Each network was trained for 10 000 iterations.
- One-layer neural network (logistic regression), SGD with learning rate γ = 0.5.
- Two-layer neural network with sigmoid activation function, SGD with learning rate γ = 0.5. Significantly better performance.
- Five-layer neural network with sigmoid activation function, SGD with learning rate γ = 0.5. Convergence is slow; the training has not yet converged.
- Five-layer neural network with ReLU activation function, SGD with learning rate γ = 0.5. It trains much faster!
- Five-layer neural network with ReLU activation function, Adam with learning rate γ = 0.002. Not a big difference with the Adam optimizer (but it is important in the CNN part!).

Summary of laboratory work: CNN
All configurations use three convolutional layers and two dense layers.
- Channels/units: 4-8-12-200-10, kernels: 5x5 stride 1, 5x5 stride 2, 4x4 stride 2. Adam with learning rate γ = 0.002. The CNN increases performance! The cost function oscillates, so decrease the learning rate.
- Extras! Same architecture, Adam with learning rate decaying from γ = 0.003 to γ = 0.0001. And now we start to overfit... Regularize!
- Extras! Same architecture and decaying learning rate, plus dropout with p = 0.75 on the units in the last hidden layer. Better cross-entropy, and now also an improvement in accuracy!
- Extras! Channels/units: 6-12-24-200-10, kernels: 6x6 stride 1, 5x5 stride 2, 4x4 stride 2. Adam with learning rate decaying from γ = 0.003 to γ = 0.0001. Dropout with p = 0.75 on the units in the last hidden layer. This was the best I could get. Did you get any better?
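
The two extras, a decaying learning rate and dropout, could look roughly like the sketch below. Both functions rest on assumptions: the slides do not say which decay schedule was used (an exponential interpolation between the two stated values is assumed here), and p = 0.75 is interpreted as the probability of keeping a unit, which may differ from the convention used in the lab. The function names are hypothetical.

```python
import numpy as np

def decayed_learning_rate(t, n_iter, lr_start=0.003, lr_end=0.0001):
    """Assumed schedule: exponential decay from lr_start (t = 0) to lr_end (t = n_iter)."""
    return lr_start * (lr_end / lr_start) ** (t / n_iter)

def dropout(h, keep_prob, rng, training=True):
    """Inverted dropout: keep each unit with probability keep_prob during training.

    Kept units are scaled by 1 / keep_prob so that the expected activation
    is unchanged; at prediction time the layer is left untouched.
    """
    if not training:
        return h
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 200))  # activations of a hidden layer with 200 units
h_train = dropout(h, 0.75, rng)
h_test = dropout(h, 0.75, rng, training=False)

lr_first = decayed_learning_rate(0, 10_000)
lr_last = decayed_learning_rate(10_000, 10_000)
```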

This course
Machine learning gives computers the ability to solve problems without being explicitly programmed for the task at hand. This is done by learning from examples, i.e. from training data.
Data on its own is typically useless; it is only when we can extract knowledge from the data that it becomes useful.
Specifically, we have studied supervised learning methods, in which we build a model of the relationship between an input variable x and an output variable y.

Supervised Machine Learning
Learning a model from labeled data.
[Figure: training data and labels (e.g. mat, mirror, boat, ...) are fed to a learning algorithm, which produces a model.]

Supervised Machine Learning
Using the learned model on new, previously unseen data. The model must generalize to new unseen data.
[Figure: unseen data is fed to the model, which outputs a prediction; the example shows test images from two disease classes (malignant versus benign).]

Inputs and outputs
The input x is composed of all the available variables which are believed to be relevant for predicting the value of the output y.
We have considered the case where we have p input variables, $x = (x_j)_{j=1}^{p}$, and one output variable y.
Both the inputs $x_j$ and the output y can be either
- quantitative (can be ordered), or
- qualitative (takes values in an unordered set).

Regression and classification

                      Regression                     Classification
Output, y             quantitative                   qualitative
Inputs, x_j           quantitative or qualitative    quantitative or qualitative
Model (conceptual)    y = f(x) + ε                   p(y = k | x), k = 1, ..., K

Bias-variance
- $E_{\text{new}}$: How well a method will perform on unseen data.
- Bias: The inability of a model to describe the training data.
- Variance: How sensitive a model is to the training data.
$$E_{\text{new}} = \text{bias}^2 + \text{variance} + \text{irreducible error}$$
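
The decomposition can be explored numerically by training the same model on many simulated data sets and looking at the spread of its predictions at one test input. The sketch below does this for polynomial regression; the true function, noise level, sample size, test point and polynomial degrees are all hypothetical choices made for the illustration.

```python
import numpy as np

def f(x):
    # Hypothetical true function, used only to generate the simulated data.
    return np.sin(2 * np.pi * x)

def decompose(degree, x0=0.3, n=30, sigma=0.3, n_datasets=500, seed=0):
    """Monte Carlo estimate of bias^2, variance and irreducible error at the input x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_datasets)
    for m in range(n_datasets):
        x = rng.uniform(0, 1, n)                   # a fresh training data set
        y = f(x) + sigma * rng.normal(size=n)
        coef = np.polyfit(x, y, degree)            # fit a polynomial of the given degree
        preds[m] = np.polyval(coef, x0)            # prediction at the test input
    bias2 = (preds.mean() - f(x0)) ** 2            # squared bias of the average prediction
    variance = preds.var()                         # spread over training data sets
    return bias2, variance, sigma ** 2             # sigma^2 is the irreducible error

results = {degree: decompose(degree) for degree in (1, 3, 7)}
for degree, (bias2, variance, irreducible) in results.items():
    print(degree, bias2, variance, bias2 + variance + irreducible)
```

In this setup the straight line (degree 1) cannot follow the sine curve and therefore has a large squared bias, whereas the more flexible polynomials trade that bias for a higher sensitivity to the particular training data.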

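Bias-variance
[Figure: error versus model complexity, with curves for $\bar{E}_{\text{new}}$, bias$^2$ and variance and the level of the irreducible error; the underfit region is at low model complexity and the overfit region at high model complexity.]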

Cross validation
To estimate $E_{\text{new}}$, we can use cross-validation.
[Figure: the data is split into c parts; in the 1st, 2nd, ..., cth iteration a different part is used as validation data and the rest as training data.]
When using cross-validation to select, e.g., inputs and hyperparameters, there is a risk of overfitting! (But it can still be the best available option...)
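
A minimal sketch of c-fold cross-validation is given below, here applied to linear regression with squared error as the error measure. The function names, the choice c = 5 and the simulated data are assumptions for the example; any pair of fit and predict functions could be plugged in.

```python
import numpy as np

def k_fold_cv(X, y, fit, predict, c=5, seed=0):
    """Estimate E_new by c-fold cross-validation with squared error."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), c)  # c disjoint parts of the data
    errors = []
    for i in range(c):
        val = folds[i]                             # validation data in iteration i
        train = np.concatenate([folds[j] for j in range(c) if j != i])
        model = fit(X[train], y[train])            # train on the remaining c - 1 parts
        errors.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return float(np.mean(errors))                  # average validation error

def fit_least_squares(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict_linear(theta, X):
    return X @ theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.normal(size=100)
e_cv = k_fold_cv(X, y, fit_least_squares, predict_linear, c=5)
```

Since the simulated noise has variance 0.25 and the model class is correct, the estimate should land close to that value.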

Regularization
Regularization offers a way to decrease the model complexity (and hence the risk of overfitting).
- Ridge regression: add a penalty term $\lambda \|\beta\|_2^2$
- LASSO: add a penalty term $\lambda \|\beta\|_1$, which can result in sparse solutions
Select $\lambda$, e.g. by cross-validation!
There are also other ways to change the model complexity:
- Increase k in k-NN
- Bagging
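
For ridge regression the regularized problem has a closed-form solution, which makes the effect of $\lambda$ easy to inspect. The sketch below assumes the cost $\|X\beta - y\|_2^2 + \lambda \|\beta\|_2^2$ without an intercept and without a $1/n$ scaling, which may differ from the convention used in the course; the data are simulated. LASSO has no such closed form and is not covered by this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X beta - y||_2^2 + lam * ||beta||_2^2 (assumed form of the cost)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

lambdas = (0.0, 1.0, 10.0, 100.0)
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
```

The norm of the estimate shrinks as $\lambda$ grows, which is the sense in which the penalty reduces model complexity.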

Parametric vs. nonparametric models
Parametric models
- Parameterized by a finite-dimensional parameter θ
- Training/learning the model = estimating θ
- Once θ is estimated, the predictions depend only on θ (not on the training data)
- Examples: linear regression, LDA, QDA, neural networks
Nonparametric models
- The model flexibility is allowed to grow with the amount of available data
- Predictions depend directly on the training data
- Can be viewed as having an infinite number of parameters
- Examples: k-NN, CART

Ensemble methods
Ensemble methods are a type of meta-algorithm: construct one powerful model from multiple base models (= ensemble members), each of which may perform poorly on its own!
We have encountered two such approaches:
1. Bagging: Reduce the variance of low-bias/high-variance models by bootstrap aggregation
2. Boosting: Construct weak base models sequentially, so that each model tries to correct the mistakes of the previous one
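
Bagging can be sketched in a few lines. In the example below a 1-nearest-neighbour regressor is used as a hypothetical low-bias/high-variance base model (in the course, trees play this role); the data, the number of ensemble members and the function names are likewise assumptions.

```python
import numpy as np

def one_nn_predict(x_train, y_train, x_test):
    """1-nearest-neighbour regression for scalar inputs."""
    idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

def bagged_predict(x_train, y_train, x_test, n_models=100, seed=0):
    """Bootstrap aggregation: average the predictions of models trained on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.zeros((n_models, len(x_test)))
    for b in range(n_models):
        boot = rng.integers(0, n, size=n)  # draw n data points with replacement
        preds[b] = one_nn_predict(x_train[boot], y_train[boot], x_test)
    return preds.mean(axis=0)

def f(x):
    return np.sin(2 * np.pi * x)  # hypothetical true function

rng = np.random.default_rng(1)
x_train = rng.uniform(0, 1, 200)
y_train = f(x_train) + 0.5 * rng.normal(size=200)
x_test = np.linspace(0.05, 0.95, 200)

mse_single = np.mean((one_nn_predict(x_train, y_train, x_test) - f(x_test)) ** 2)
mse_bagged = np.mean((bagged_predict(x_train, y_train, x_test) - f(x_test)) ** 2)
```

Averaging over bootstrap samples smooths the noisy single-neighbour predictions, so the bagged ensemble should be closer to the true function than the single base model in this example.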

A toolbox of methods
Categories: regression, classification; non-parametric, parametric, ensemble.
Methods: linear regression, logistic regression, LDA, QDA, k-NN, CART, random forests, AdaBoost, (deep) neural nets.

Summary for the exam (in one slide)
- Classification and regression problem formulations
- Parametric and non-parametric models
- Inputs and outputs / quantitative and qualitative variables
- Decision boundaries / linear vs. nonlinear classifiers
- Cross-validation (the purpose!) and model testing
- Bias-variance trade-off / model flexibility / over-fitting
- Regularization / ridge regression and LASSO
- The different methods discussed throughout the course

Summary for life
What should you remember from statistical machine learning?
- The problem formulations: regression and classification
- The existence of different types of methods
- The bias-variance trade-off and cross-validation
- The possibilities: machine learning can be used for an extremely wide range of applications and data types
- The TSTF principle: Try simple things first!

Outlook: Unsupervised learning
Regression and classification are supervised learning problems: the models are trained using both inputs x and outputs y.
Unsupervised learning methods try to find patterns in unlabeled data, i.e. we train the models from just the x.
- Dimensionality reduction / manifold learning
- Cluster analysis
- Generative model learning
- Blind source separation

Outlook: Reinforcement learning
A reinforcement learning system is asked to take actions that influence its environment in order to maximize a reward.
Contrary to supervised learning,
- the correct input/output pair is not revealed,
- learning has to be carried out based on the reward feedback,
- there is often a focus on online performance (exploration-exploitation trade-off).

New course!! Advanced probabilistic machine learning
Contents (very brief):
- Probabilistic/Bayesian modeling
- Bayesian linear regression
- Graphical models
- Gaussian processes
- Variational inference
- Monte Carlo methods
- Unsupervised learning
- Variational autoencoders
Examination: Mini-project, lab, oral exam.
When: Period 1, running every year starting this fall.
Info: http://www.it.uu.se/edu/course/homepage/apml/

Our Machine Learning research (ultrabrief)
- Monte Carlo methods (especially sequential Monte Carlo)
- Deep learning
- Gaussian processes
- The use of probabilistic programming
Applications: autonomous driving (with Autoliv), digital pathology (with Sectra), etc.
We take a particular interest in nonlinear dynamical systems.

Thank you!
Machine learning gives computers the ability to solve problems without being explicitly programmed for the task at hand.
Thank you for your attention and good luck in the future!!!