MACHINE LEARNING FOR DEVELOPERS A SHORT INTRODUCTION. Gregor Roth / 1&1 Mail & Media Development & Technology GmbH


Software Engineer vs. Data Engineer vs. Data Scientist
Software Engineer: "builds applications and systems"
Data Engineer: "builds systems that consolidate, store, and retrieve data from the various applications and systems [...]"
Data Scientist: "builds analysis on top of data. This may come in the form of [...] a machine learning algorithm that is then implemented into the code base by software engineers and data engineers"
Technologies shown across the three roles: PHP, Swift, AngularJs, Spring, WebServices, C/C++, Java, Kafka, Scala, Python, Hadoop, Business Intelligence, Machine Learning, Jupyter, Spark ML, Data Mining, ETL, Data Warehouse, MatLAB, Hive, R
Definitions taken from http://101.datascience.community/2016/11/28/data-scientists-data-engineers-software-engineers-the-difference-according-to-linkedin/
13.12.2017 1&1 Mail & Media Development & Technology GmbH

AlphaGo
The original AlphaGo first learned from studying 30 million moves of expert human play. By contrast, AlphaGo Zero never saw humans play. Instead, it began by knowing only the rules of the game.
Source: https://theconversation.com/googles-new-go-playing-ai-learns-fast-and-even-thrashed-its-former-self-85979

Supervised machine learning
Regression: predict a continuous numeric-valued output
house -> Predict $h_\theta(x)$ -> expected price: 411,000 / 542,000 / 249,000
Classification: predict one of a discrete number of category values
mail -> Predict $h_\theta(x)$ -> mail type: Order Confirmation / Newsletter / Billing

Features: the input data
The input of a prediction is a feature vector.
A feature is "an individual measurable property or characteristic of a phenomenon being observed" (taken from Wikipedia).
The challenge is to identify and extract the relevant features.

House example, key features:
num | size (m²) | rooms | age
1 | 90 | 2 | 23
2 | 101 | 3 | 3
..
19754 | 1330 | 11 | 12

Mail example, mail text: Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell provides a more modern
Mail example, key features:
num | size (KiB) | #attachm. | dkim? | ..?text?..
1 | 21 | 0 | 1 | ?
2 | 421 | 3 | 0 | ?
..

Vectorizing text
In most cases text will be preprocessed, e.g. by tokenizing, removing stop words, lower-casing, normalizing URLs/email addresses, and stemming.
Mail text: Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. Tom Mitchell provides a more modern
Normalized: two definition machine more modern
Usually, a vocabulary list of the most important words is used to build the feature vector. The vocabulary list may be generated based on the training data, e.g. by using the TF-IDF approach.
Vocabulary list: ... able ... about ... above ..
Feature vector: 1 0 0 0 ...

Key features:
num | size (KiB) | #attachm. | dkim? | able? | about?
1 | 21 | 0 | 1 | 1 | 0
2 | 421 | 3 | 0 | 0 | 0
..
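The preprocessing and vocabulary lookup described on this slide can be sketched in a few lines of Java. This is only a minimal illustration under stated assumptions: the class name `TextVectorizer`, the tiny stop-word list, and the four-word vocabulary are made up for the example, stemming and URL normalization are left out, and the vocabulary is given by hand instead of being derived via TF-IDF.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Minimal bag-of-words sketch: preprocess a text and map it onto a fixed vocabulary list. */
class TextVectorizer {

    // Hypothetical stop-word list; a real one would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "are", "as", "is", "it", "of", "the", "this", "to"));

    /** Lower-cases the text, splits it on non-letter characters and drops stop words. */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Returns one element per vocabulary word: 1.0 if the word occurs in the text, else 0.0. */
    static double[] vectorize(String text, List<String> vocabulary) {
        Set<String> tokens = tokenize(text);
        double[] features = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            features[i] = tokens.contains(vocabulary.get(i)) ? 1.0 : 0.0;
        }
        return features;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary list, normally generated from the training data
        List<String> vocabulary = Arrays.asList("able", "about", "above", "machine");
        String mail = "Two definitions of Machine Learning are offered.";
        System.out.println(Arrays.toString(vectorize(mail, vocabulary)));
    }
}
```

The resulting 0/1 values would be appended to the other mail features (size, number of attachments, DKIM flag) to form the complete feature vector.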

Prediction
feature vector -> Predict $h_\theta$ -> prediction
(120, 3, ...) -> 411,000
(155, 6, ...) -> 542,000
(90, 2, ...) -> 249,000
(21, 0, 1, ...) -> Order Confirmation
(421, 3, 0, ...) -> Newsletter
(34, 1, 1, ...) -> Billing

Prediction function
Essentially, a prediction function is a function which takes the feature vector (x) and returns the prediction value (y). It is also called the target or hypothesis function.

Which machine learning algorithm to use?
Some supervised learning algorithms:
Algorithm | Problem type | Easy to explain? | Average predictive accuracy | Training speed | Prediction speed | Parameter tuning needed? | Works with small num. of observations? | Handles lots of irrelevant features well?
KNN | Either | Yes | Lower | Fast | Depends on n | Minimal | No | No
Linear regression | Regression | Yes | Lower | Fast | Fast | None | Yes | No
Logistic regression | Classification | Somewhat | Lower | Fast | Fast | None | Yes | No
Naive Bayes | Classification | Somewhat | Lower | Fast | Fast | Some | Yes | Yes
Decision trees | Either | Somewhat | Lower | Fast | Fast | Some | No | No
AdaBoost | Either | No | Higher | Slow | Fast | Some | No | Yes
Neural networks | Either | No | Higher | Slow | Fast | Lots | No | Yes
Taken from http://www.dataschool.io/comparing-supervised-learning-algorithms/

Linear Regression
Linear regression models the relationship between the input feature vector (x) and the response label (y) as $h_\theta(x) = \theta^T x$. The thetas are adapted within a learning process to fit the regression function to the training data.

Process the prediction function
Create a new instance of the regression function with the theta vector. The theta vector is the result of a previous training process:
$h_\theta(x) = 1.004579 \cdot 1 + 5.286822 \cdot x_1$
Then predict the house price based on a house size of 155 m². The first element of the feature vector has to be 1 for computational reasons:
feature vector: (1, 155)
$h_\theta(x) = 1.004579 \cdot 1 + 5.286822 \cdot 155.0$ -> 542,000
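A regression function that is instantiated with a theta vector and applied to a feature vector could look like the following Java sketch. The class name `LinearRegressionFunction` and its constructor are assumptions made for this example; only the theta values and the house size of 155 m² come from the slide.

```java
import java.util.function.Function;

/** Sketch of a linear regression prediction function h_theta(x) = theta^T x. */
class LinearRegressionFunction implements Function<double[], Double> {

    private final double[] thetaVector;

    LinearRegressionFunction(double[] thetaVector) {
        this.thetaVector = thetaVector.clone();
    }

    /** Computes the dot product of the theta vector and the feature vector. */
    @Override
    public Double apply(double[] featureVector) {
        if (featureVector.length != thetaVector.length) {
            throw new IllegalArgumentException("feature vector and theta vector differ in length");
        }
        double prediction = 0.0;
        for (int j = 0; j < thetaVector.length; j++) {
            prediction += thetaVector[j] * featureVector[j];
        }
        return prediction;
    }

    public static void main(String[] args) {
        // Theta values from the slide; they are the output of an earlier training run
        LinearRegressionFunction targetFunction =
                new LinearRegressionFunction(new double[] {1.004579, 5.286822});
        // x0 = 1 (for computational reasons), x1 = house size in m^2
        double[] featureVector = {1.0, 155.0};
        System.out.println(targetFunction.apply(featureVector));
    }
}
```

The sketch prints the raw dot product of the two vectors; how that value maps onto the price scale shown on the slide is not stated in the deck.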

Prediction graph incl. real price-size pairs
[Graph: prediction line plotted over the real price-size pairs]
How do you know that the theta values { 1.004579, 5.286822 } used here are the best fit?

Evaluate the prediction function
Evaluate the prediction functions to identify the theta vector which produces the best-fitting prediction. E.g.:
$h_\theta(x) = 1.001391 \cdot 1 + 2.058826 \cdot size$
$h_\theta(x) = 1.003745 \cdot 1 + 3.912451 \cdot size$
$h_\theta(x) = 1.004579 \cdot 1 + 5.286822 \cdot size$
This requires test data including labels (which represent the "right answer"), e.g. (120, 3, ...) -> 411,000 and (155, 6, ...) -> 542,000.

Linear Regression - Cost function
To identify the best-fitting theta parameter vector, you need a cost function, which evaluates how well the prediction function performs. For each test example, the cost function compares the predicted result with the real result.
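A cost function of this kind can be sketched as follows. The concrete formula is an assumption, since it is not spelled out in the text: the sketch uses the common halved mean of squared errors, J(θ) = 1/(2m) · Σ (predicted − real)², and the class name `CostFunctionSketch` and the toy data are made up for illustration.

```java
/** Sketch of a squared-error cost function for a linear prediction function. */
class CostFunctionSketch {

    /** h_theta(x) = theta^T x */
    static double predict(double[] theta, double[] x) {
        double sum = 0.0;
        for (int j = 0; j < theta.length; j++) {
            sum += theta[j] * x[j];
        }
        return sum;
    }

    /** J(theta) = 1/(2m) * sum over all examples of (predicted result - real result)^2 */
    static double cost(double[] theta, double[][] features, double[] labels) {
        int m = features.length;
        double sumSquaredErrors = 0.0;
        for (int i = 0; i < m; i++) {
            double error = predict(theta, features[i]) - labels[i];
            sumSquaredErrors += error * error;
        }
        return sumSquaredErrors / (2.0 * m);
    }

    public static void main(String[] args) {
        // Made-up toy data: x0 = 1, x1 = size; the labels follow y = 2 * size
        double[][] features = {{1.0, 1.0}, {1.0, 2.0}, {1.0, 3.0}};
        double[] labels = {2.0, 4.0, 6.0};
        System.out.println(cost(new double[] {0.0, 2.0}, features, labels)); // perfect fit
        System.out.println(cost(new double[] {0.0, 1.0}, features, labels)); // worse fit, higher cost
    }
}
```

The lower the returned cost, the better the theta vector fits the labelled examples.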

Evaluate the prediction function - examples
prediction function | cost
$h_\theta(x) = 1.001391 \cdot 1 + 2.058826 \cdot size$ | 1,551,418
$h_\theta(x) = 1.003745 \cdot 1 + 3.912451 \cdot size$ | 341,769
$h_\theta(x) = 1.004579 \cdot 1 + 5.286822 \cdot size$ | 69,829
How do you get the best-fitting theta vector?

How to get the best-fitting prediction function (theta parameters)?
Algorithm: linear regression, $h_\theta(x) = \theta^T x$
Labelled train data: (1, 120) -> 411,000; (1, 155) -> 542,000
Learner: takes the algorithm and the labelled train data and produces the prediction function
Prediction function before training (thetas unknown): $h_\theta(x) = ? \cdot x_0 + ? \cdot x_1$
Prediction function after training: $h_\theta(x) = 1.004579 \cdot x_0 + 5.286822 \cdot x_1$

Minimizing the cost function: gradient descent
Gradient descent minimizes the cost function, meaning that it is used to find the theta combination that produces the lowest cost J(θ) based on the training data.
The update formula combines the n-th element of the theta vector, the learning rate, the difference between the predicted result and the real result, and the n-th element of the feature vector (of a train data record). Its result is the new n-th element of the theta vector.
Within each iteration a new value is computed for each theta parameter $\theta_0, \theta_1, \ldots, \theta_n$ in parallel. This potentially requires high computing power.
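Written out, the update step named on this slide takes roughly the following form. The terms match the slide's labels (learning rate α, predicted result, real result, n-th element of the feature vector); the averaging over all m train records is an assumption that follows the common batch formulation.

```latex
\theta_n^{\text{new}} = \theta_n - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m}
    \left( h_\theta\big(x^{(i)}\big) - y^{(i)} \right) x_n^{(i)}
```

Here $x^{(i)}$ is the feature vector of the i-th train data record and $y^{(i)}$ its label. All $\theta_n$ are updated from the same old theta vector, which is what computing them "in parallel" refers to.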

Gradient descent: a simple Java-based implementation
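The code shown on this slide did not survive in the text, so the following is not the author's implementation but a minimal sketch of batch gradient descent for linear regression. The class name `GradientDescentSketch`, the toy data, the learning rate, and the iteration count are all assumptions chosen for illustration.

```java
import java.util.Arrays;

/** Sketch of batch gradient descent for linear regression. */
class GradientDescentSketch {

    /** h_theta(x) = theta^T x */
    static double predict(double[] theta, double[] x) {
        double sum = 0.0;
        for (int j = 0; j < theta.length; j++) {
            sum += theta[j] * x[j];
        }
        return sum;
    }

    /** One iteration: every theta element is updated from the same old theta vector. */
    static double[] step(double[] theta, double[][] features, double[] labels, double alpha) {
        int m = features.length;
        double[] newTheta = new double[theta.length];
        for (int j = 0; j < theta.length; j++) {
            double gradient = 0.0;
            for (int i = 0; i < m; i++) {
                // (predicted result - real result) * j-th element of the feature vector
                gradient += (predict(theta, features[i]) - labels[i]) * features[i][j];
            }
            newTheta[j] = theta[j] - alpha * gradient / m;
        }
        return newTheta;
    }

    /** Starts with an all-zero theta vector and applies a fixed number of iterations. */
    static double[] train(double[][] features, double[] labels, double alpha, int iterations) {
        double[] theta = new double[features[0].length];
        for (int k = 0; k < iterations; k++) {
            theta = step(theta, features, labels, alpha);
        }
        return theta;
    }

    public static void main(String[] args) {
        // Made-up toy data following y = 1 + 2 * x1, with x0 = 1
        double[][] features = {{1.0, 0.0}, {1.0, 1.0}, {1.0, 2.0}, {1.0, 3.0}};
        double[] labels = {1.0, 3.0, 5.0, 7.0};
        double[] theta = train(features, labels, 0.1, 5000);
        System.out.println(Arrays.toString(theta)); // approaches [1.0, 2.0]
    }
}
```

With a learning rate that is too large the thetas diverge instead of converging, so in practice the cost is usually monitored over the iterations.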

Train
Train the regression function
[Graphs]

Underfitting
Underfitting occurs when the machine learning algorithm cannot capture the underlying trend of the data. Underfitting is often due to an excessively simple model.
A common way to correct underfitting is to add more features or to add polynomial features.
Adding more features often requires additional feature scaling, which standardizes the range of the independent variables.
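Feature scaling as mentioned above can be sketched like this. The variant shown, standardization to mean 0 and standard deviation 1 per column, is one common choice and an assumption here, as are the class name `FeatureScalingSketch` and the sample values.

```java
import java.util.Arrays;

/** Sketch of feature scaling by standardization (mean 0, standard deviation 1 per column). */
class FeatureScalingSketch {

    /** Returns a scaled copy of the feature matrix; the input is left unchanged. */
    static double[][] standardize(double[][] features) {
        int m = features.length;
        int n = features[0].length;
        double[][] scaled = new double[m][n];
        for (int j = 0; j < n; j++) {
            double mean = 0.0;
            for (double[] row : features) {
                mean += row[j];
            }
            mean /= m;
            double variance = 0.0;
            for (double[] row : features) {
                variance += (row[j] - mean) * (row[j] - mean);
            }
            double std = Math.sqrt(variance / m);
            for (int i = 0; i < m; i++) {
                // A constant column (std == 0) is mapped to 0 to avoid dividing by zero
                scaled[i][j] = std == 0.0 ? 0.0 : (features[i][j] - mean) / std;
            }
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Columns: size and size^2, which live on very different ranges
        double[][] features = {{90.0, 8100.0}, {101.0, 10201.0}, {1330.0, 1768900.0}};
        for (double[] row : standardize(features)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The constant first element x0 = 1 should be added after scaling, and the mean and standard deviation computed on the training data have to be reused when scaling a feature vector for prediction.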

Playing with the number of parameters
Example:
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot size$
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot size + \theta_2 \cdot size^2$
$h(x) = \theta_0 \cdot 1 + \theta_1 \cdot size + \theta_2 \cdot size^2 + \ldots + \theta_n \cdot size^n$
If you add too many features, you could end up with a prediction function that is overfitting. Overfitting occurs when the function fits the training data too well, by capturing noise or random fluctuations in the training data.
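Adding polynomial features amounts to extending the feature vector with powers of an existing feature. A minimal sketch, assuming a single input feature (the house size) and a hypothetical helper named `polynomialFeatures`:

```java
import java.util.Arrays;

/** Sketch: expand a single feature into polynomial features. */
class PolynomialFeaturesSketch {

    /** Builds the feature vector (1, size, size^2, ..., size^degree). */
    static double[] polynomialFeatures(double size, int degree) {
        double[] features = new double[degree + 1];
        features[0] = 1.0; // x0 = 1
        for (int p = 1; p <= degree; p++) {
            features[p] = features[p - 1] * size;
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(polynomialFeatures(3.0, 3))); // [1.0, 3.0, 9.0, 27.0]
    }
}
```

Each additional power adds one theta parameter to the prediction function, so a higher degree makes the model more flexible and more prone to overfitting.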

Detecting Overfitting
Holdout method: use e.g. 60% of the labelled data to train models. Use the remaining untouched labelled data for cross-validation and final tests.
Labelled data = train data + test/validation data
Examples:
Well-fitting: the cost with train examples and the cost with untouched examples are similar.
Overfitting: the cost with train examples is low, but the cost with untouched examples is much higher.
Possible options to avoid overfitting:
Use a larger set of training data.
Use an improved machine learning algorithm by considering regularization.
Use fewer features.
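The holdout method can be sketched as a shuffled 60/40 split followed by a cost comparison on both parts. Everything named here is an assumption for illustration: the class `HoldoutSketch`, the fixed random seed, the toy data, the squared-error cost, and the theta vector, which merely stands in for one that would have been trained on the train part only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Sketch of the holdout method: split labelled data, then compare the cost on both parts. */
class HoldoutSketch {

    /** Shuffles the example indices and returns two lists: train indices, then untouched indices. */
    static List<List<Integer>> split(int numExamples, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numExamples; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        int trainSize = (int) Math.round(numExamples * trainFraction);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, trainSize)));
        parts.add(new ArrayList<>(indices.subList(trainSize, numExamples)));
        return parts;
    }

    /** Squared-error cost of theta, computed on the selected examples only. */
    static double cost(double[] theta, double[][] features, double[] labels, List<Integer> selected) {
        double sum = 0.0;
        for (int i : selected) {
            double prediction = 0.0;
            for (int j = 0; j < theta.length; j++) {
                prediction += theta[j] * features[i][j];
            }
            double error = prediction - labels[i];
            sum += error * error;
        }
        return sum / (2.0 * selected.size());
    }

    public static void main(String[] args) {
        // Made-up labelled data: x0 = 1, x1 = i, label = 3 * i
        int m = 10;
        double[][] features = new double[m][];
        double[] labels = new double[m];
        for (int i = 0; i < m; i++) {
            features[i] = new double[] {1.0, i};
            labels[i] = 3.0 * i;
        }
        List<List<Integer>> parts = split(m, 0.6, 42L);
        double[] theta = {0.0, 3.0}; // stands in for a theta vector trained on the train part
        System.out.println("cost with train examples:     " + cost(theta, features, labels, parts.get(0)));
        System.out.println("cost with untouched examples: " + cost(theta, features, labels, parts.get(1)));
    }
}
```

A cost on the untouched examples that is much higher than the cost on the train examples hints at overfitting.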

Putting all together
Learning phase:
Algorithm: $h_\theta(x) = \theta^T x$
Labelled train data: (1, 120, 4) -> 411,000
Learner -> prediction function: $h_\theta(x) = 1991.61538 \cdot x_0 + 9817.58845 \cdot x_1 + 2665.32209 \cdot x_2$
Evaluate with labelled test data: (1, 155, 6) -> 542,000
Release
Prediction phase:
(1, 90, 3) -> Predict $h_\theta(x) = 1991.61538 \cdot x_0 + 9817.58845 \cdot x_1 + 2665.32209 \cdot x_2$ -> 249,000

Machine learning libraries and tools
In practice, you will likely rely on machine learning frameworks, libraries, and tools. Some examples:
Software | Creator | Written in | Interface
Torch | Ronan Collobert, Koray Kavukcuoglu, Clement Farabet | C, Lua | Lua, LuaJIT, C, utility library for C++/OpenCL
Caffe2 | Facebook | C++, Python | Python, MATLAB
Scikit-learn | David Cournapeau | C++, Python | Python
Microsoft Cognitive Toolkit | Microsoft Research | C++ | Python, C++, Command line, BrainScript
TensorFlow | Google Brain team | C++, Python | Python, Java, C/C++, Go, R
Spark ML | Apache Software Foundation | Scala | Python, Java, Scala
Deeplearning4j | Skymind engineering team; Deeplearning4j community | C++, Java | Python, Java, Scala, Clojure
Weka | University of Waikato | Java | Java
Parts taken from https://en.wikipedia.org/wiki/comparison_of_deep_learning_software

Literature
Andrew Ng's Machine Learning course (~11 weeks, for free)
Udacity's Intro to Machine Learning (~10 weeks, for free)