DATA11001 INTRODUCTION TO DATA SCIENCE EPISODE 6: MACHINE LEARNING
TODAY'S MENU 1. WHAT IS ML? 2. CLASSIFICATION AND REGRESSION 3. EVALUATING PERFORMANCE & OVERFITTING
WHAT IS MACHINE LEARNING? Definition: machine = computer, computer program (in this course); learning = improving performance on a given task, based on experience/examples. In other words, instead of the programmer writing explicit rules for how to solve a given problem, the programmer instructs the computer how to learn from examples. In many cases the computer program can even become better at the task than the programmer is!
EXAMPLE: SPAM FILTER Method #1: The programmer writes rules: "If it contains 'viagra', then it is spam." (difficult, not user-adaptive) Method #2: The user marks which mails are spam and which are legit, and an ML algorithm is used to construct a classifier. Examples: From: medshop@spam.com / Subject: viagra / "cheap meds..." → spam; From: my.professor@helsinki.fi / Subject: important information / "here's how to ace the exam..." → non-spam; From: mike@example.org / Subject: you need to see this / "how to win $1,000,000..." → spam or non-spam?
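Method #2 can be sketched in a few lines: a toy word-count classifier in the style of naive Bayes. This is an illustrative sketch, not a production filter; the example emails, the word-splitting, and the Laplace smoothing scheme are all assumptions for demonstration.

```python
from collections import Counter

def train(examples):
    """examples: list of (text, label) pairs, label in {"spam", "ham"}."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score each class by smoothed word frequencies; pick the higher score."""
    scores = {}
    for label, cnt in counts.items():
        total = sum(cnt.values())
        score = 1.0
        for w in text.lower().split():
            score *= (cnt[w] + 1) / (total + 2)  # Laplace smoothing
        scores[label] = score
    return max(scores, key=scores.get)

# made-up training mails, in the spirit of the slide's examples
examples = [
    ("cheap viagra meds", "spam"),
    ("win money now", "spam"),
    ("important information about the exam", "ham"),
    ("lecture notes attached", "ham"),
]
model = train(examples)
print(classify(model, "cheap meds now"))   # scored against the learned counts
```

The point is the shift of work: the programmer writes the *learning* procedure once, and the rules themselves come from the user's labeled examples.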
MACHINE LEARNING SETTING One definition of machine learning: A computer program improves its performance on a given task with experience (i.e. examples, data). So we need to separate Task: What is the problem that the program is solving? Performance measure: How is the performance of the program (when solving the given task) evaluated? Experience: What is the data (examples) that the program is using to improve its performance?
NEIGHBORING DISCIPLINES Artificial Intelligence (AI): machine learning can be seen as one approach towards implementing intelligent machines. Neural networks, deep learning: inspired by and trying to mimic the function of biological brains, in order to make computers that learn from experience. Modern machine learning really grew out of the neural networks boom of the 1980s and early 1990s. Pattern recognition: recognizing objects and identifying people in controlled or uncontrolled settings, from images, audio, etc. Such tasks typically require machine learning techniques.
NEIGHBORING DISCIPLINES Statistics: historically, introductory courses on statistics tend to focus on hypothesis testing and some other basic problems such as linear regression. There's a lot more to statistics than hypothesis testing, and there is a lot of interaction between research in machine learning and statistics.
NEIGHBORING DISCIPLINES image from: machinelearners.wordpress.com
KINDS OF MACHINE LEARNING Supervised machine learning: the task is to predict the correct (or good) response y given an input x, e.g.: + classify samples as normal or abnormal + classify emails as spam or legit ("ham") + predict movie profits based on director, actors, ... + generate text descriptions of images. Unsupervised machine learning: the task is to create models or summaries of the input x (no y): + clustering (users, products, text documents by topic, ...) + building dependency graphs (Bayesian networks, ...) + reducing dimensionality to the essentials + visualization (dimension reduction to 2D/3D)
KINDS OF MACHINE LEARNING Other kinds exist as well: semi-supervised learning: a supervised learning task, but only some of the training data is labeled; reinforcement learning: like supervised learning, but with no direct feedback about the goodness of individual choices; instead there is a delayed reward/penalty (e.g., win/lose a game, reach the destination successfully/not, ...). We'll mostly focus on supervised and unsupervised learning. The goal here is to learn to identify a machine learning problem and choose the right approach, rather than to learn the details.
KINDS OF MACHINE LEARNING Case: Bank loan application Training data: 10000 customer background questionnaires & info about paid-on-time/not Task: predict whether a new customer will pay back on time or not based on their background questionnaire ML approach: SUPERVISED LEARNING!
KINDS OF MACHINE LEARNING Case: Autonomous car Training data: Control data from Tesla drivers driving around & info about crash/no-crash Task: Self-driving car ML approach: SUPERVISED LEARNING (for learning how to mimic human drivers) + REINFORCEMENT LEARNING (for learning to drive even better!)
KINDS OF MACHINE LEARNING Case: Customer segmentation Training data: Shopping basket data from 1 000 000 purchases Task: Group customers into different groups to tailor product placement and marketing ML approach: UNSUPERVISED LEARNING (clustering)
KINDS OF MACHINE LEARNING Case: Product pricing Training data: Sales data (product descriptions, final price) from on-line marketplace (swap.com, huuto.net) Task: Choose appropriate price for new products based on description ML approach: SUPERVISED LEARNING (but remember "game-theoretic aspect")
LOSS FUNCTIONS The key problem in supervised learning (classification and regression) is to maximize the predictive performance. (Of course, computational complexity is also crucial in big-data scenarios.) Performance is measured using a loss function: predictor f: X → Y (maps an input x ∈ X to an output ŷ ∈ Y); loss L: Y² → ℝ (maps the predicted output ŷ and the correct output y to a score measuring "cost" or "error"). Training loss: average of L(f(x), y) over (x, y) in the training data set. Test loss: average of L(f(x), y) over (x, y) in the test data set.
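The training and test losses defined above are just averages over the respective data sets, and can be computed directly. The toy predictor and the tiny data sets below are illustrative assumptions.

```python
def average_loss(predictor, data, loss):
    """Average L(f(x), y) over a data set of (x, y) pairs."""
    return sum(loss(predictor(x), y) for x, y in data) / len(data)

f = lambda x: 2 * x                       # a toy predictor
sq = lambda y_hat, y: (y_hat - y) ** 2    # squared-error loss

train_data = [(1, 2), (2, 4)]             # f fits these exactly
test_data = [(3, 7)]                      # f(3) = 6, so one unit of squared error

print(average_loss(f, train_data, sq))    # 0.0
print(average_loss(f, test_data, sq))     # 1.0
```

The same `average_loss` works for any predictor and any loss; only the data set passed in decides whether it is a training or a test loss.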
LOSS FUNCTIONS Example loss functions: squared error (regression): L(ŷ, y) = (ŷ − y)²; zero-one loss (classification): L(ŷ, y) = 1 if ŷ ≠ y, 0 if ŷ = y; log-loss (probabilistic classification): L(p̂, y) = −log p̂(y), where p̂(y) is the predicted probability of y. NB: In the last case, the predictor outputs a probability distribution over the outcomes. It is important to understand what the real "cost" or utility in the practical application is: minimizing one thing can be far from optimal in terms of another.
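The three example losses can be written out in a few lines. The sample predictions below are made up; note how log-loss takes a probability distribution rather than a single predicted label.

```python
import math

def squared_error(y_hat, y):
    """Squared error, for regression."""
    return (y_hat - y) ** 2

def zero_one_loss(y_hat, y):
    """1 for a wrong prediction, 0 for a correct one."""
    return 0 if y_hat == y else 1

def log_loss(p_hat, y):
    """p_hat maps each outcome to its predicted probability of y."""
    return -math.log(p_hat[y])

print(squared_error(2.5, 3.0))                        # 0.25
print(zero_one_loss("spam", "ham"))                   # 1
print(log_loss({"spam": 0.9, "ham": 0.1}, "spam"))    # -log(0.9) ≈ 0.105
```

Log-loss illustrates the last remark on the slide: a classifier that is usually right but wildly overconfident when wrong can have low zero-one loss yet terrible log-loss.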
OVERFITTING Training loss can be low because: (a) the problem is simple and good predictions are easy to find, or (b) we have tried a huge number of different predictors and some of them just happen to fit the training data! The second alternative is called overfitting. In the case of overfitting, the training error is small but the test error is big.
OVERFITTING The overfitting problem is closely related to the complexity of the models being fitted There are fewer simple models than complex models Therefore, fitting a simple model leads to a lower risk of overfitting than fitting a complex model Classic example: polynomial fitting
OVERFITTING [Figure: Left: data source (black line), data (circles), and three regression models of increasing complexity; Right: training error (blue) and test error (red, mean squared error) of the three models]
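The classic polynomial-fitting example can be reproduced in a few lines with NumPy. The sine data source, noise level, and candidate degrees below are illustrative assumptions; the degree-9 polynomial interpolates the 10 training points almost exactly, yet typically has a much larger test error.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # the hidden data source

x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0.03, 0.97, 10)
y_train = true_f(x_train) + rng.normal(0, 0.2, x_train.size)
y_test = true_f(x_test) + rng.normal(0, 0.2, x_test.size)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(train_mse, 4), round(test_mse, 4))
```

Degree 1 underfits, degree 3 roughly matches the source, and degree 9 is complex enough to chase the noise: low training error, high test error.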
VALIDATION A separate validation data set can be used to reduce the risk of overfitting. [Diagram: the available data is split into a training part and a validation part.] Fit models of varying complexity on the training data, e.g.: regression with different covariate subsets (feature selection); decision trees with a variable number of nodes; support vector machines with different regularization parameters. Choose the subset / number of nodes / regularization based on performance on the validation set.
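Model selection with a train/validation split can be sketched as follows, using polynomial degree as the complexity knob. The synthetic near-linear data and the candidate degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(0, 0.1, x.size)   # nearly linear data

# split the available data into training and validation parts
x_tr, y_tr = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def val_error(degree):
    """Fit on the training part, measure MSE on the validation part."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best_degree = min(range(1, 10), key=val_error)
print(best_degree)   # a low degree should win on this near-linear data
```

Crucially, the validation data plays no role in fitting the coefficients; it is used only to compare the already-fitted candidates.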
CROSS-VALIDATION To get more reliable statistics than a single split provides, use K-fold cross-validation (see Exercise 1.3.c): 1. Divide the data into K equal-sized subsets. [Diagram: the available data is split into subsets 1, 2, ..., K.] 2. For j from 1 to K: 2.1 Train the model(s) using all data except that of subset j. 2.2 Compute the resulting validation error on subset j. 3. Average the K results. When K = N (i.e., each data point is a separate subset), this is known as leave-one-out cross-validation.
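The three-step procedure above translates directly into code. The mean-predicting toy model, the squared-error loss, and the data below are illustrative assumptions; the sketch also assumes, for simplicity, that the data size is divisible by K.

```python
def k_fold_cv(data, k, train_fn, loss_fn):
    """Average validation loss over k folds (assumes len(data) divisible by k)."""
    fold_size = len(data) // k
    fold_losses = []
    for j in range(k):
        # step 2.1: train on everything except subset j
        val = data[j * fold_size:(j + 1) * fold_size]
        train = data[:j * fold_size] + data[(j + 1) * fold_size:]
        model = train_fn(train)
        # step 2.2: validation error on subset j
        fold_losses.append(sum(loss_fn(model(x), y) for x, y in val) / len(val))
    # step 3: average the k results
    return sum(fold_losses) / k

# toy "model": always predict the mean of the training targets
def fit_mean(train):
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

squared = lambda y_hat, y: (y_hat - y) ** 2
data = [(i, float(i)) for i in range(10)]
print(k_fold_cv(data, 5, fit_mean, squared))    # 5-fold CV error
print(k_fold_cv(data, 10, fit_mean, squared))   # leave-one-out, since K = N
```

Every data point is used for validation exactly once, which is why the K-fold estimate is more reliable than a single train/validation split of the same data.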