MACHINE LEARNING FOR DEVELOPERS A SHORT INTRODUCTION. Gregor Roth / 1&1 Mail & Media Development & Technology GmbH


2 Software Engineer vs. Data Engineer vs. Data Scientist
Software Engineer: "builds applications and systems"
Data Engineer: "builds systems that consolidate, store, and retrieve data from the various applications and systems [ ]"
Data Scientist: "builds analysis on top of data. This may come in the form of [ ] a machine learning algorithm that is then implemented into the code base by software engineers and data engineers"
Technologies and topics shown around the three roles: PHP, Swift, AngularJs, Spring, WebServices, C/C++, Java, Kafka, Scala, Python, Hadoop, Business Intelligence, Machine Learning, Jupyter, Spark ML, Data Mining, ETL, Data Warehouse, MatLAB, Hive, R
Definitions taken from http://101.datascience.

3 AlphaGo
The original AlphaGo first learned from studying 30 million moves of expert human play. By contrast, AlphaGo Zero never saw humans play. Instead, it began by knowing only the rules of the game.

9 Supervised machine learning
Regression: predict a continuous numeric-valued output. Example: a house's feature vector x goes into Predict $h_\theta(x)$, which returns the expected price (e.g. 411,000 or 542,000).
Classification: predict one of a discrete number of category values. Example: a mail's feature vector x goes into Predict $h_\theta(x)$, which returns the mail type (Order Confirmation, Newsletter, or Billing).

10 Features: the input data
The input of a prediction is a feature vector. A feature is "an individual measurable property or characteristic of a phenomenon being observed" (taken from Wikipedia). The challenge is to identify and extract the relevant features.

Example: houses. Key features:
num | size (m²) | rooms | age
1 | 90 | 2 | 23
2 | 101 | 3 | 3

Example: a mail with the text "Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell provides a more modern". Key features:
num | size (KiB) | #attachm. | dkim? | ... | text?

11 Vectorizing text
In most cases text will be preprocessed, e.g. by tokenizing, removing stop words, lower-casing, normalizing URLs/addresses, stemming, ...
Example text: "Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell provides a more modern"
Normalized: two definition machine more modern
Usually, a vocabulary list of the most important words is used to build the feature vector. The vocabulary list may be generated based on the training data, e.g. by using the TF-IDF approach.
Vocabulary list: able, about, above, ...
Key features: num | size (KiB) | #attachm. | dkim? | able? | about? | ...
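The preprocessing and vocabulary lookup described above could be sketched as follows. This is only an illustration, not code from the slides: the class name `TextVectorizer`, the tiny stop-word list, and the binary "word present" encoding are assumptions, and stemming and TF-IDF weighting are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TextVectorizer {
    // Hypothetical stop-word list; real lists are much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "are", "as", "is", "it", "of", "the", "this", "to"));

    // Lower-cases the text, splits it on non-letters, and drops stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // One element per vocabulary word: 1.0 if the word occurs in the text, else 0.0.
    static double[] vectorize(String text, List<String> vocabulary) {
        Set<String> present = new HashSet<>(tokenize(text));
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            vector[i] = present.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return vector;
    }
}
```

With the vocabulary `able, about, machine, two`, the first sentence of the example text would map to a vector whose last two elements are set. In a real feature vector these elements would be appended to the other key features such as size and number of attachments.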

12 Prediction
Predict maps a feature vector to a predicted value.
Regression (price): (120, 3, ...) → 411,000; (155, 6, ...) → 542,000; (90, 2, ...) → 249,000
Classification (mail type): a mail's feature vector, e.g. (21, 0, 1, ...), is mapped to Order Confirmation, Newsletter, or Billing

15 Prediction function
Essentially, a prediction function is a function which takes the feature vector (x) and returns the prediction value (y). It is also called the target or hypothesis function.

18 Which machine learning algorithm to use?
Some supervised learning algorithms:

Algorithm | Problem type | Easy to explain? | Average predictive accuracy | Training speed | Prediction speed | Parameter tuning needed? | Works with small num. of observations | Handles lots of irrelevant features well
KNN | Either | Yes | Lower | Fast | Depends on n | Minimal | No | No
Linear regression | Regression | Yes | Lower | Fast | Fast | None | Yes | No
Logistic regression | Classification | Somewhat | Lower | Fast | Fast | None | Yes | No
Naive Bayes | Classification | Somewhat | Lower | Fast | Fast | Some | Yes | Yes
Decision trees | Either | Somewhat | Lower | Fast | Fast | Some | No | No
AdaBoost | Either | No | Higher | Slow | Fast | Some | No | Yes
Neural networks | Either | No | Higher | Slow | Fast | Lots | No | Yes

21 Linear Regression
Linear regression models the relationship between the input feature vector (x) and the response label (y):
$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n$
Thetas are used within a learning process to adapt the regression function based on the training data.
Simple example: $h_\theta(x) = \theta_0 \cdot 1 + \theta_1 \cdot \text{size}$

22 Process the prediction function
Create a new instance of the regression function with the theta vector. The theta vector is the result of a previous training process:
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1$
... and predict the house price based on a house size of 155 m². The first element of the feature vector ($x_0$) has to be 1 for computational reasons:
$h_\theta(x) = \theta_0 \cdot 1 + \theta_1 \cdot 155$
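A minimal Java sketch of such a regression function is shown below. It is an illustration rather than the code from the slide: the class name `LinearRegressionFunction` is chosen for this example, and the theta values in the usage note are hypothetical placeholders, not the trained values from the talk.

```java
import java.util.Arrays;
import java.util.function.Function;

class LinearRegressionFunction implements Function<double[], Double> {
    private final double[] thetaVector;

    LinearRegressionFunction(double[] thetaVector) {
        // Defensive copy so later changes to the caller's array have no effect.
        this.thetaVector = Arrays.copyOf(thetaVector, thetaVector.length);
    }

    // Computes theta_0 * x_0 + theta_1 * x_1 + ... + theta_n * x_n.
    @Override
    public Double apply(double[] featureVector) {
        if (featureVector.length != thetaVector.length || featureVector[0] != 1.0) {
            throw new IllegalArgumentException(
                    "feature vector must have the theta vector's length and start with 1.0");
        }
        double prediction = 0.0;
        for (int j = 0; j < thetaVector.length; j++) {
            prediction += thetaVector[j] * featureVector[j];
        }
        return prediction;
    }

    double[] getThetas() {
        return Arrays.copyOf(thetaVector, thetaVector.length);
    }
}
```

Usage would look like `new LinearRegressionFunction(new double[] {1000.0, 3500.0}).apply(new double[] {1.0, 155.0})`, which evaluates 1000 · 1 + 3500 · 155 for a house of 155 m².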

23 Prediction graph incl. real price-size pairs
[Figure: the prediction function plotted against the real price-size pairs]
How do you know that the theta values used are the best fit?

26 Evaluate the prediction function
Evaluate the prediction functions to identify the theta vector which produces the best-fitting prediction, e.g. by comparing several candidates of the form $h_\theta(x) = \theta_0 \cdot 1 + \theta_1 \cdot \text{size}$ that differ only in their theta values.
This requires test data including labels (which represent the "right answer").

29 Linear Regression - Cost function
To identify the best-fitting theta parameter vector, you need a cost function, which evaluates how well the prediction function performs.
Simple example:
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
The sum runs over each of the $m$ test examples; $h_\theta(x^{(i)})$ is the predicted result and $y^{(i)}$ is the real result.
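As a sketch, a cost function of this kind can be written in a few lines of Java. The version below assumes the halved mean squared error, J(θ) = 1/(2m) · Σ (predicted − real)²; the class and parameter names are made up for this example.

```java
class CostFunction {
    // Halved mean squared error over all examples.
    // features[i] is the feature vector of example i (first element 1.0),
    // labels[i] is the real result for that example.
    static double cost(double[] thetas, double[][] features, double[] labels) {
        int m = features.length;
        double sumSquaredErrors = 0.0;
        for (int i = 0; i < m; i++) {
            double prediction = 0.0;
            for (int j = 0; j < thetas.length; j++) {
                prediction += thetas[j] * features[i][j];
            }
            double error = prediction - labels[i];
            sumSquaredErrors += error * error;
        }
        return sumSquaredErrors / (2.0 * m);
    }
}
```

A theta vector that reproduces every label exactly yields a cost of 0; the larger the deviations, the higher the cost, so candidate theta vectors can be ranked by this single number.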

31 Evaluate the prediction function - examples
Candidate functions: $h_\theta(x) = 1.001391 \cdot 1 + 2.058826 \cdot \text{size}$; $h_\theta(x) = 1.003745 \cdot 1 + 3.\ldots$; and a third $h_\theta(x)$.
Costs shown on the slide: 1,551, ,769 69,829
How to get the best-fitting theta vector?

34 How to get the best fitting prediction function (theta parameters)?
The learner takes the algorithm (linear regression, $h_\theta(x) = \theta^T x$) and the labelled train data (feature vectors with known prices such as 411,000) as input. Its output is the prediction function $h_\theta(x)$ in which the previously unknown theta parameters are filled in.

36 Minimizing the cost function: Gradient descent
Gradient descent minimizes the cost function, meaning that it is used to find the theta combination that produces the lowest cost $J(\theta)$ based on the training data.
$\theta_n := \theta_n - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_n^{(i)}$
On the left is the new n-th element of the theta vector, on the right the current n-th element. $\alpha$ is the learning rate, $h_\theta(x^{(i)})$ the predicted result, $y^{(i)}$ the real result, and $x_n^{(i)}$ the n-th element of the feature vector of train data record $i$; $m$ is the number of train data records.
Within each iteration a new value is computed for each theta parameter ($\theta_0$, $\theta_1$, ..., $\theta_n$) in parallel. This potentially requires a lot of computing power.

37 Gradient descent: a simple Java-based implementation
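The code shown on this slide is not part of the transcript. The following is a minimal sketch of batch gradient descent for linear regression, written for this text under the assumption that the update rule is θ_n := θ_n − α · (1/m) · Σ (predicted − real) · x_n. Class and method names, the starting point (all thetas zero), and the fixed iteration count are illustrative choices.

```java
class GradientDescent {
    // Linear prediction: theta_0 * x_0 + ... + theta_n * x_n.
    static double predict(double[] thetas, double[] featureVector) {
        double prediction = 0.0;
        for (int j = 0; j < thetas.length; j++) {
            prediction += thetas[j] * featureVector[j];
        }
        return prediction;
    }

    // One iteration. All new thetas are computed from the old theta vector,
    // so the update is simultaneous for theta_0 .. theta_n.
    static double[] step(double[] thetas, double[][] features, double[] labels, double alpha) {
        int m = features.length;
        double[] newThetas = new double[thetas.length];
        for (int n = 0; n < thetas.length; n++) {
            double gradient = 0.0;
            for (int i = 0; i < m; i++) {
                gradient += (predict(thetas, features[i]) - labels[i]) * features[i][n];
            }
            newThetas[n] = thetas[n] - alpha * gradient / m;
        }
        return newThetas;
    }

    // Repeats the step a fixed number of times, starting from all-zero thetas.
    static double[] train(double[][] features, double[] labels, double alpha, int iterations) {
        double[] thetas = new double[features[0].length];
        for (int k = 0; k < iterations; k++) {
            thetas = step(thetas, features, labels, alpha);
        }
        return thetas;
    }
}
```

A learning rate that is too large makes the thetas diverge, one that is too small makes training slow, so in practice the cost is usually tracked per iteration to check that it keeps falling.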

38 Train
Train the regression function (graphs)

39 Underfitting
Underfitting occurs when the machine learning algorithm cannot capture the underlying trend of the data. Underfitting is often due to an excessively simple model.
A common way to correct underfitting is to add more features or to add polynomial features.
Adding more features often requires additional feature scaling, which standardizes the range of the independent variables.
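Feature scaling can be done in several ways. The sketch below assumes mean normalization, (x − mean) / (max − min), applied to one feature column at a time; the class name `FeatureScaler` is made up for this example.

```java
class FeatureScaler {
    // Scales one feature column so that its values are centred on 0
    // and span a range of at most 1. A constant column maps to zeros.
    static double[] meanNormalize(double[] column) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double value : column) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
        }
        double mean = sum / column.length;
        double range = max - min;
        double[] scaled = new double[column.length];
        for (int i = 0; i < column.length; i++) {
            scaled[i] = (range == 0.0) ? 0.0 : (column[i] - mean) / range;
        }
        return scaled;
    }
}
```

The leading 1.0 element of each feature vector should be left unscaled, since a constant column would otherwise collapse to zero. The same mean and range computed on the training data would also have to be applied to any vector passed to the prediction function later.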

40 Playing with the number of parameters
Example:
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot \text{size}$
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot \text{size} + \theta_2 \cdot \text{size}^2$
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot \text{size} + \theta_2 \cdot \text{size}^2 + \dots + \theta_n \cdot \text{size}^n$
If you add too many features, you could end up with a prediction function that is overfitting. Overfitting occurs when the function fits the training data too well, by capturing noise or random fluctuations in the training data.
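Polynomial features of this kind can be generated from the single raw feature before training. A small sketch, with the hypothetical helper name `PolynomialFeatures.expand`:

```java
class PolynomialFeatures {
    // Builds the feature vector {1, size, size^2, ..., size^degree}.
    // The leading 1.0 is the constant element that pairs with theta_0.
    static double[] expand(double size, int degree) {
        double[] features = new double[degree + 1];
        features[0] = 1.0;
        for (int p = 1; p <= degree; p++) {
            features[p] = features[p - 1] * size;
        }
        return features;
    }
}
```

Because size, size² and higher powers differ by orders of magnitude, such expanded vectors are a typical case where feature scaling becomes necessary before running gradient descent.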

43 Detecting Overfitting
Holdout method: use e.g. 60% of the labelled data (train data) to train models. Use the remaining untouched labelled data (test/validation data) for cross-validation and final tests.
Examples: compare the cost with train examples against the cost with untouched examples. For a well-fitting function the two costs are close together; for an overfitting function the cost with untouched examples is clearly higher than the cost with train examples.
Possible options to avoid overfitting:
Use a larger set of training data.
Use an improved machine learning algorithm by considering regularization.
Use fewer features.
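The holdout split itself is simple to implement. The sketch below shuffles the row indices with a fixed seed and returns 60% (or any other fraction) as train indices and the rest as untouched indices; the class name, the index-based return value, and the seed parameter are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class HoldoutSplit {
    // Returns two index arrays: [0] holds the train rows, [1] the held-out rows.
    // The seed makes the shuffle reproducible.
    static int[][] split(int rowCount, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        int trainCount = (int) Math.round(rowCount * trainFraction);
        int[] train = new int[trainCount];
        int[] heldOut = new int[rowCount - trainCount];
        for (int i = 0; i < rowCount; i++) {
            if (i < trainCount) {
                train[i] = indices.get(i);
            } else {
                heldOut[i - trainCount] = indices.get(i);
            }
        }
        return new int[][] {train, heldOut};
    }
}
```

The model would then be trained only on the rows in the first array, and the cost would be computed separately on both arrays to see whether the two values drift apart.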

47 Putting all together
Learning phase: the learner combines the algorithm ($h_\theta(x) = \theta^T x$) with the labelled train data (feature vectors with known prices, e.g. 411,000 and 542,000) and produces a prediction function $h_\theta(x)$ with concrete theta values. The prediction function is evaluated against labelled test data and then released.
Prediction phase: the released prediction function predicts the price for a new feature vector, e.g. (1, 90, 3) yields 249,000.

48 Machine learning libraries and tools
In practice, you will likely rely on machine learning frameworks, libraries, and tools. Some examples:

Software | Creator | Written in | Interface
Torch | Ronan Collobert, Koray Kavukcuoglu, Clement Farabet | C, Lua | Lua, LuaJIT, C, utility library for C++/OpenCL
Caffe2 | Facebook | C++, Python | Python, MATLAB
Scikit-learn | David Cournapeau | C++, Python | Python
Microsoft Cognitive Toolkit | Microsoft Research | C++ | Python, C++, Command line, BrainScript
TensorFlow | Google Brain team | C++, Python | Python, Java, C/C++, Go, R
Spark ML | Apache Software Foundation | Scala | Python, Java, Scala
Deeplearning4j | Skymind engineering team; Deeplearning4j community | C++, Java | Python, Java, Scala, Clojure
Weka | University of Waikato | Java | Java

49 Literature
Andrew Ng's Machine Learning course (~11 weeks, for free)
Udacity's Intro to Machine Learning (~10 weeks, for free)
