Machine Learning :: Introduction. Konstantin Tretyakov


Machine Learning :: Introduction Konstantin Tretyakov (kt@ut.ee) MTAT.03.183 Data Mining November 5, 2009

So far Data mining as knowledge discovery: frequent itemsets, descriptive analysis, clustering, seriation, DWH/OLAP/BI. 2

Coming up next Machine learning: terminology, foundations, general framework. Supervised machine learning: basic ideas, algorithms & toy examples. Statistical challenges: p-values, significance, consistency, stability. State-of-the-art techniques: SVMs, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, … 3

A Dear Child has Many Names 4 Data mining, data analysis, statistical analysis, pattern discovery, statistical learning, machine learning, predictive analytics, business intelligence, data-driven statistics, inductive reasoning, pattern analysis, knowledge discovery from databases, analytical processing, …

Machine Learning … is mainly about methods for modeling data and mining patterns: to gain knowledge (bioinformatics, LHC physics, web analytics, …); to infer intelligent behaviour from data (spam filtering, automated recommendations, OCR, robotics, fraud detection, …); to automatically organize data (data summarization, compression, noise reduction, …). 5

Typical approaches 6

Typical approaches Clustering ( Unsupervised learning ) 7

Typical approaches Regression, classification ( Supervised learning ) 8

Typical approaches Outlier detection 9

Typical approaches Frequent pattern mining 10

Typical approaches Specific pattern mining 11

Machine learning: How? The approach depends strongly on the application. The general principle is the same, though: 1. Define a set of patterns of interest. 2. Define a measure of goodness for the patterns. 3. Find the best pattern in the data. 12

Machine learning: How? The approach depends strongly on the application. The general principle is the same, though: 1. Define a set of patterns of interest. 2. Define a measure of goodness for the patterns. 3. Find the best pattern in the data. Hence, heavy use of statistics and optimization. (In other words, heavy maths.) 13
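The three steps can be made concrete in a few lines of code. This is a toy sketch with hypothetical data: the pattern family is threshold rules "predict 1 when x > t", the measure of goodness is accuracy, and the search over patterns is exhaustive.

```python
# A minimal sketch of the three-step recipe (hypothetical data):
# patterns = threshold rules "x > t", goodness = accuracy, search = exhaustive.

data = [(1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

def accuracy(threshold):
    """Goodness of the pattern 'predict 1 when x > threshold'."""
    return sum((x > threshold) == bool(y) for x, y in data) / len(data)

# Step 3: find the best pattern by trying every candidate threshold.
candidates = [x for x, _ in data]
best = max(candidates, key=accuracy)
print(best, accuracy(best))  # → 2.0 1.0
```

Real algorithms differ only in scale: richer pattern families, statistically motivated goodness measures, and cleverer search than brute force.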

Supervised learning 14
  Observation                Outcome
  Summer of 2003 was cold    Winter of 2003 was warm
  Summer of 2004 was cold    Winter of 2004 was cold
  Summer of 2005 was cold    Winter of 2005 was cold
  Summer of 2006 was hot     Winter of 2006 was warm
  Summer of 2007 was cold    Winter of 2007 was cold
  Summer of 2008 was warm    Winter of 2008 was warm
  Summer of 2009 was warm    Winter of 2009 will be?

Supervised learning 15
  Observation                  Outcome
  Study=hard,  Professor=      I get a C
  Study=slack, Professor=      I get an A
  Study=hard,  Professor=      I get an A
  Study=slack, Professor=      I get a D
  Study=slack, Professor=      I get an A
  Study=slack, Professor=      I get an A
  Study=hard,  Professor=      I get an A
  Study=slack, Professor=      I get a B
  ?                            I get an A

Supervised learning 16
  Day  Observation                           Outcome
  Mon  I was not using magnetic bracelet™    In the evening I had a headache
  Tue  I was using magnetic bracelet™        In the evening I had less headache
  Wed  I was using magnetic bracelet™        In the evening, no headache!
  Thu  I was using magnetic bracelet™        The headache is gone!!
  Fri  I was not using magnetic bracelet™    No headache!!

Supervised learning 17  ⇒ Magnetic bracelet™ cures headache

Supervised learning 18–22 (figure-only slides)

Supervised learning Formally, … 23 (the formal definition is given as a figure)

Regression 24

Classification 25

The Dumb User Perspective Weka, RapidMiner, MS SSAS, Clementine, SPSS, R, … 26

The Dumb User Perspective Validation 27

Classification demo: Iris dataset 150 measurements, 4 attributes, 3 classes 28

Classification demo: Iris dataset 29
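The slides run the demo in Weka; its flavour can be reproduced in plain Python with a 1-nearest-neighbour classifier on a small hand-picked subset of the Iris measurements (a sketch, not the classifier used in the slides; the six labelled rows below are from the standard Iris dataset).

```python
# A 1-nearest-neighbour sketch of the Iris demo in plain Python.
# The six labelled rows are from the standard Iris dataset
# (sepal length, sepal width, petal length, petal width).
train = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Iris-virginica"),
]

def classify(x):
    """Predict the class of x as the class of its nearest training point."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(train, key=lambda row: dist(row[0], x))[1]

print(classify((6.1, 2.8, 4.0, 1.3)))  # → Iris-versicolor
```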

Validation 30
  a  b  c   <-- classified as
  50 0  0 | a = Iris-setosa
  0 49  1 | b = Iris-versicolor
  0  2 48 | c = Iris-virginica

  Correctly Classified Instances      147      98 %
  Incorrectly Classified Instances      3       2 %
  Kappa statistic                       0.97
  Mean absolute error                   0.0233
  Root mean squared error               0.108
  Relative absolute error               5.2482 %
  Root relative squared error          22.9089 %
  Total Number of Instances           150

Validation 31–32
  a  b  c   <-- classified as
  50 0  0 | a = Iris-setosa
  0 49  1 | b = Iris-versicolor
  0  2 48 | c = Iris-virginica

  Class      setosa  versic.  virg.   Avg
  TP Rate    1       0.98     0.96    0.98
  FP Rate    0       0.02     0.01    0.01
  Precision  1       0.961    0.98    0.98
  Recall     1       0.98     0.96    0.98
  F-Measure  1       0.97     0.97    0.98
  ROC Area   1       0.99     0.99    0.99

Validation 33
  a  b  c   <-- classified as
  50 0  0 | a = Iris-setosa
  0 49  1 | b = Iris-versicolor
  0  2 48 | c = Iris-virginica

             setosa  versic.
  TP Rate    1       0.98     = TP / positive examples
  FP Rate    0       0.02     = FP / negative examples
  Precision  1       0.961    = TP / predicted positives
  Recall     1       0.98     = TP / positive examples
  F-Measure  1       0.97     = 2·P·R / (P + R)
  ROC Area   1       0.99     ≈ Pr(s(false) < s(true))

  (The diagonal entries are each class's true positives; the off-diagonal entries in its column are its false positives.)

Classification summary 34
                    Actual = Yes                              Actual = No
  Predicted = Yes   True positives (TP)                       False positives (FP) (Type I, α-error)
  Predicted = No    False negatives (FN) (Type II, β-error)   True negatives (TN)

Classification summary 35
                  Predicted = Yes                           Predicted = No
  Actual = Yes    True positives (TP)                       False negatives (FN) (Type II, β-error)
  Actual = No     False positives (FP) (Type I, α-error)    True negatives (TN)

  Recall    = TP / (TP + FN)
  Precision = TP / (TP + FP)
  Accuracy  = (TP + TN) / (TP + FP + FN + TN)
  F-measure = harmonic_mean(Precision, Recall)
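Applying these definitions to the versicolor column of the Iris confusion matrix above (TP = 49, FN = 1, FP = 2, TN = 98) reproduces the numbers Weka reported:

```python
# Precision, recall and F-measure for the versicolor class of the
# Iris confusion matrix shown earlier: TP = 49, FN = 1, FP = 2, TN = 98.
TP, FN, FP, TN = 49, 1, 2, 98

recall    = TP / (TP + FN)                    # TP / positive examples
precision = TP / (TP + FP)                    # TP / predicted positives
accuracy  = (TP + TN) / (TP + FP + FN + TN)
f_measure = 2 * precision * recall / (precision + recall)

print(round(recall, 3), round(precision, 3), round(f_measure, 3))
# → 0.98 0.961 0.97
```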

Training classifiers on data Thus, a good classifier is one with good Accuracy/Precision/Recall. Hence, machine learning boils down to finding a function that optimizes these measures for the given data. 36

Training classifiers on data Thus, a good classifier is one with good Accuracy/Precision/Recall. Hence, machine learning boils down to finding a function that optimizes these measures for the given data. Yet, there's a catch… 37

Training classifiers on data We want our algorithm to perform well on unseen data! This makes algorithms and theory way more complicated. This makes validation somewhat more complicated. 38
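Why optimizing on the training data alone is not enough can be seen with a deliberately silly "memorizing" classifier (a hypothetical sketch on synthetic data): it is perfect on the examples it was trained on, yet near chance level on unseen ones.

```python
# A classifier that memorizes its training data: perfect training
# accuracy, chance-level accuracy on unseen points (synthetic data).
import random
random.seed(0)

# Points on a line; the true rule is "label = 1 iff x >= 50".
data = [(x, int(x >= 50)) for x in range(100)]
random.shuffle(data)
train, test = data[:50], data[50:]

memory = dict(train)                                       # "training" = rote memorization
predict = lambda x: memory.get(x, random.choice([0, 1]))   # guess if unseen

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc  = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc)   # → 1.0
print(test_acc)    # roughly 0.5: memorization does not generalize
```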

Proper validation You may not test your algorithm on the same data that you used to train it! 39


Proper validation :: Holdout Split the data into a training set and a testing set; fit on the training set, validate on the testing set. 41

Proper validation What are sufficient sizes for the test and training sets, and why? What if data is scarce? Cross-validation: K-fold cross-validation, leave-one-out cross-validation, bootstrap, .632+. 42
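K-fold cross-validation can be sketched in a few lines. Here a trivial mean-threshold "classifier" stands in for a real learning algorithm, and the data is hypothetical: each of the K folds is held out once, a model is trained on the rest, and the held-out accuracies are averaged.

```python
# K-fold cross-validation, sketched with a trivial mean-threshold
# "classifier" standing in for a real learner (hypothetical data).
data = [(x, int(x >= 5)) for x in range(10)]   # true rule: label = 1 iff x >= 5
K = 5

def train(rows):
    """'Learn' a threshold: the midpoint between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones  = [x for x, y in rows if y == 1]
    t = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x >= t)

accs = []
for k in range(K):
    test_fold = data[k::K]                     # every K-th example held out
    train_fold = [r for r in data if r not in test_fold]
    model = train(train_fold)
    accs.append(sum(model(x) == y for x, y in test_fold) / len(test_fold))

print(sum(accs) / K)   # → 0.9
```

Leave-one-out cross-validation is the special case K = number of examples.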

Intermediate summary Supervised learning = predicting f(x) well. For classification, "well" = high accuracy/precision/recall on unseen data. To achieve that, most training algorithms try to optimize accuracy/precision/recall on the training data. We can then validate how good they are on separate test data. 43

Next: three examples of approaches 44
  Ad hoc: decision tree induction
  Probabilistic modeling: Naïve Bayes classifier
  Objective function optimization: linear least-squares regression

Decision Tree Induction :: ID3 Iterative Dichotomiser 3: a simple yet popular decision tree induction algorithm. Builds a decision tree top-down, starting at the root. (Ross Quinlan) 45

ID3 46

ID3 :: First split Which split is the most informative? 47

Information gain of a split Before the split: p_no = 5/14, p_yes = 9/14, H(p) = 0.94. After the split on outlook, the branches have entropies H = 0.97, H = 0 and H = 0.97, with weighted average (5/14)·0.97 + (4/14)·0 + (5/14)·0.97 = 0.69. Information gain = 0.94 − 0.69 = 0.25. 49
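The numbers can be reproduced directly (a sketch; the 5/4/5 branch sizes and 2–3/4–0/3–2 class counts are those of the standard 14-example weather data this slide is based on):

```python
# Entropy and information gain for the split on "outlook"
# (14 examples: 9 yes / 5 no; branches of size 5, 4 and 5).
from math import log2

def entropy(counts):
    """Shannon entropy of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])                               # ≈ 0.94
branches = [([2, 3], 5), ([4, 0], 4), ([3, 2], 5)]     # (class counts, size)
after = sum(size / 14 * entropy(c) for c, size in branches)

print(round(before, 2), round(after, 2), round(before - after, 2))
# → 0.94 0.69 0.25
```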

ID3 50
  1. Start with a single node.
  2. Find the attribute with the largest information gain.
  3. Split the node according to this attribute.
  4. Repeat recursively on the subnodes.
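The four steps translate almost line for line into code. This is a minimal sketch on a tiny hypothetical dataset (categorical attributes only, hard-coded "play" target, no pruning), not a production ID3:

```python
# Minimal recursive ID3 sketch: categorical attributes, no pruning.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Entropy reduction from splitting rows on attribute attr."""
    labels = [r["play"] for r in rows]
    after = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r["play"] for r in rows if r[attr] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - after

def id3(rows, attrs):
    labels = [r["play"] for r in rows]
    if len(set(labels)) == 1 or not attrs:               # pure node, or no attrs left
        return Counter(labels).most_common(1)[0][0]      # leaf: majority label
    best = max(attrs, key=lambda a: info_gain(rows, a))  # step 2
    return {                                             # steps 3-4: split, recurse
        (best, v): id3([r for r in rows if r[best] == v], attrs - {best})
        for v in {r[best] for r in rows}
    }

# A tiny hypothetical weather-style dataset.
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]
print(id3(rows, {"outlook", "windy"}))
```

On this data the root split is on outlook (it has the larger information gain), with a further split on windy only in the rainy branch.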

C4.5 C4.5 is an extension of ID3: supports continuous attributes, missing values and pruning. There is also C5.0, a commercial version with additional bells & whistles. 51

Decision trees The goods: easy & efficient; interpretable and pretty. The bads: rather ad hoc; can overfit unless properly pruned; not the best model for every classification task. 52

Next: three examples of approaches 53
  Ad hoc: decision tree induction
  Probabilistic modeling: Naïve Bayes classifier
  Objective function optimization: linear least-squares regression

Next: Naïve Bayes Classifier To be continued 54

Questions? 55