Machine Learning :: Introduction. Konstantin Tretyakov


1 Machine Learning :: Introduction Konstantin Tretyakov MTAT Data Mining November 5, 2009

2 So far: Data mining as knowledge discovery; Frequent itemsets; Descriptive analysis; Clustering; Seriation; DWH/OLAP/BI

3 Coming up next
Machine learning: terminology, foundations, general framework.
Supervised machine learning: basic ideas, algorithms & toy examples.
Statistical challenges: p-values, significance, consistency, stability.
State-of-the-art techniques: SVM, kernel methods, graphical models, latent variable models, boosting, bagging, LASSO, on-line learning, deep learning, reinforcement learning, ...

4 A Dear Child has Many Names
Data mining, Data analysis, Statistical analysis, Pattern discovery, Statistical learning, Machine learning, Predictive analytics, Business intelligence, Data-driven statistics, Inductive reasoning, Pattern analysis, Knowledge discovery from databases, Analytical processing, ...

5 Machine Learning... is mainly about methods for modeling data and mining patterns.
To gain knowledge: bioinformatics, LHC physics, Web analytics, ...
To infer intelligent behaviour from data: spam filtering, automated recommendations, OCR, robotics, fraud detection, ...
To automatically organize data: data summarization, compression, noise reduction, ...

6 Typical approaches 6

7 Typical approaches: Clustering (unsupervised learning)

8 Typical approaches: Regression, classification (supervised learning)

9 Typical approaches Outlier detection 9

10 Typical approaches Frequent pattern mining 10

11 Typical approaches Specific pattern mining 11

13 Machine learning: How?
The approach depends strongly on the application. The general principle is the same, though:
1. Define a set of patterns of interest.
2. Define a measure of goodness for the patterns.
3. Find the best pattern in the data.
Hence the heavy use of statistics and optimization (in other words, heavy maths).
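
The three-step principle can be made concrete with a very small sketch. Everything in it is an illustrative assumption rather than material from the slides: the toy data, the choice of threshold rules as the "set of patterns", accuracy as the "measure of goodness", and exhaustive search as the way to find the best pattern.

```python
# Toy data: (feature value, label). Hypothetical, for illustration only.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]


# Step 1: the set of patterns = threshold rules "predict 1 if x > t".
def make_rule(t):
    return lambda x: 1 if x > t else 0


candidate_thresholds = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]


# Step 2: the measure of goodness = fraction of correctly predicted labels.
def accuracy(rule, samples):
    return sum(rule(x) == y for x, y in samples) / len(samples)


# Step 3: find the best pattern, here by trying every candidate.
best_threshold = max(candidate_thresholds,
                     key=lambda t: accuracy(make_rule(t), data))
best_accuracy = accuracy(make_rule(best_threshold), data)
```

Real methods differ mainly in how rich the pattern set is and in how cleverly step 3 is carried out, which is where the statistics and optimization come in.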

14 Supervised learning
Observation | Outcome
Summer of 2003 was cold | Winter of 2003 was warm
Summer of 2004 was cold | Winter of 2004 was cold
Summer of 2005 was cold | Winter of 2005 was cold
Summer of 2006 was hot | Winter of 2006 was warm
Summer of 2007 was cold | Winter of 2007 was cold
Summer of 2008 was warm | Winter of 2008 was warm
Summer of 2009 was warm | Winter of 2009 will be?

15 Supervised learning Observation Outcome Study=hard, Professor= I get a C Study=slack, Professor= I get an A Study=hard, Professor= I get an A Study=slack, Professor= I get a D Study=slack, Professor= I get an A Study=slack, Professor= I get an A Study=hard, Professor= I get an A Study=slack, Professor= I get a B? I get an A 15

16-17 Supervised learning
Day | Observation | Outcome
Mon | I was not using magnetic bracelet™ | In the evening I had a headache
Tue | I was using magnetic bracelet™ | In the evening I had less headache
Wed | I was using magnetic bracelet™ | In the evening no headache!
Thu | I was using magnetic bracelet™ | The headache is gone!!
Fri | I was not using magnetic bracelet™ | No headache!!
Magnetic bracelet™ cures headache

18-22 Supervised learning (figure slides)

23 Supervised learning Formally, 23

24 Regression 24

25 Classification 25

26 The Dumb User Perspective: Weka, RapidMiner, MS SSAS, Clementine, SPSS, R, ...

27 The Dumb User Perspective Validation 27

28 Classification demo: Iris dataset 150 measurements, 4 attributes, 3 classes 28

29 Classification demo: Iris dataset 29

30 Validation
  a  b  c   <-- classified as
 50  0  0 | a = Iris-setosa
  0 49  1 | b = Iris-versicolor
  0  2 48 | c = Iris-virginica
Correctly Classified Instances: 147 (98%)
Incorrectly Classified Instances: 3 (2%)
Kappa statistic: 0.97
Total Number of Instances: 150
Mean absolute error, root mean squared error, relative absolute error and root relative squared error are also reported.

31-32 Validation
Per-class statistics are reported for setosa, versicolor, virginica and their average: TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area.

33 Validation
  a  b  c   <-- classified as
 50  0  0 | a = Iris-setosa
  0 49  1 | b = Iris-versicolor
  0  2 48 | c = Iris-virginica
For each class, the true positives are the diagonal entry and the false positives are the other entries in that class's column.
TP Rate = TP / positive examples
FP Rate = FP / negative examples
Precision = TP / predicted positives
Recall = TP / positive examples
F-Measure = 2*P*R/(P + R)
ROC Area ~ Pr(s(false) < s(true))
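
The definitions above can be checked directly against the Iris confusion matrix from the Validation slides (rows 50 0 0, 0 49 1, 0 2 48). The following is a minimal sketch; the function and variable names are illustrative and are not part of Weka's output. Each class is treated in turn as the "positive" class, with the other two as negatives.

```python
CLASSES = ["setosa", "versicolor", "virginica"]

# Rows = actual class, columns = predicted class.
CONFUSION = [
    [50, 0, 0],
    [0, 49, 1],
    [0, 2, 48],
]


def per_class_metrics(matrix, k):
    """One-vs-rest statistics for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                  # actual k, predicted something else
    fp = sum(row[k] for row in matrix) - tp   # predicted k, actually something else
    tn = total - tp - fn - fp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tp_rate": recall,                    # TP / positive examples
        "fp_rate": fp / (fp + tn),            # FP / negative examples
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


metrics = {name: per_class_metrics(CONFUSION, k)
           for k, name in enumerate(CLASSES)}
accuracy = (sum(CONFUSION[k][k] for k in range(len(CLASSES)))
            / sum(sum(row) for row in CONFUSION))
```

The sketch does not guard against a class that is never predicted (division by zero in the precision); that case does not occur in this matrix.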

34 Classification summary
 | Actual = Yes | Actual = No
Positives (Predicted = Yes) | True positives (TP) | False positives (FP) (Type I, α-error)
Negatives (Predicted = No) | False negatives (FN) (Type II, β-error) | True negatives (TN)

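35 Classification summary
 | Positives (Predicted = Yes) | Negatives (Predicted = No)
Actual = Yes | True positives (TP) | False negatives (FN) (Type II, β-error)
Actual = No | False positives (FP) (Type I, α-error) | True negatives (TN)
Recall = TP / (TP + FN), computed over the Actual = Yes row.
Precision = TP / (TP + FP), computed over the Predicted = Yes column.
Accuracy = (TP + TN) / (TP + FP + FN + TN), computed over the whole table.
F-measure = harmonic_mean(Precision, Recall)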

37 Training classifiers on data
Thus, a good classifier is one that has good accuracy, precision and recall. Hence, machine learning boils down to finding a function that optimizes these measures for the given data. Yet there's a catch...

38 Training classifiers on data We want our algorithm to perform well on unseen data! This makes algorithms and theory way more complicated. This makes validation somewhat more complicated. 38

39 Proper validation You may not test your algorithm on the same data that you used to train it! 39

41 Proper validation :: Holdout
Split the data into a training set and a testing set. Train on the training set only, then validate on the testing set.
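
A holdout split can be sketched in a few lines. The data, the 30% test fraction and the 1-nearest-neighbour classifier below are assumptions chosen for illustration, not taken from the slides. The classifier simply memorises its training points, which makes the danger of testing on the training data easy to see: its accuracy on the training set is perfect by construction, so only the held-out score says anything about unseen data.

```python
import random


def holdout_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the data and cut it into (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def nearest_neighbour_predict(train, x):
    """1-NN: return the label of the training point closest to x."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def accuracy(train, samples):
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in samples)
    return hits / len(samples)


# Hypothetical noisy data: the label is 1 for x >= 10, with three labels flipped.
flipped = {3, 11, 16}
data = [(float(x), int(x >= 10) ^ int(x in flipped)) for x in range(20)]

train, test = holdout_split(data)
train_accuracy = accuracy(train, train)  # 1.0: every point is its own nearest neighbour
test_accuracy = accuracy(train, test)    # the estimate that matters
```

The fixed seed only makes the split reproducible; any shuffling scheme that keeps the two sets disjoint serves the same purpose.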

42 Proper validation
What are sufficient sizes for the test and training sets, and why?
What if the data is scarce?
Cross-validation: K-fold cross-validation, leave-one-out cross-validation
Bootstrap
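
When data is scarce, K-fold cross-validation lets every example serve as test data exactly once. Below is a minimal sketch; the helper names, the majority-class "model" and the toy data are illustrative assumptions. Leave-one-out cross-validation is the special case where K equals the number of examples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(samples, k, fit, score):
    """Average the test score over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx])
        scores.append(score(model, [samples[i] for i in test_idx]))
    return sum(scores) / len(scores)


# A deliberately simple "model": always predict the majority training label.
def fit_majority(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)


def score_accuracy(model, test):
    return sum(model == y for _, y in test) / len(test)


data = [(x, int(x >= 3)) for x in range(10)]  # hypothetical: 3 zeros, 7 ones
cv_accuracy = cross_validate(data, k=5, fit=fit_majority, score=score_accuracy)
loo_accuracy = cross_validate(data, k=len(data), fit=fit_majority,
                              score=score_accuracy)
```

The folds here are contiguous blocks in the original order; with real data one would normally shuffle first so that each fold is representative.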

43 Intermediate summary Supervised learning = predicting f(x) well. For classification, well = high accuracy/precision/recall on unseen data. To achieve that, most training algorithms will try to optimize their accuracy/precision/recall on training data. We can then validate how good they are on test data. 43

44 Next: three examples of approaches
Ad-hoc: decision tree induction
Probabilistic modeling: Naïve Bayes classifier
Objective function optimization: linear least squares regression

45 Decision Tree Induction :: ID3
ID3 (Iterative Dichotomizer 3, by Ross Quinlan) is a simple yet popular decision tree induction algorithm. It builds a decision tree top-down, starting at the root.

46 ID3 46

47 ID3 :: First split Which split is the most informative? 47

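49 Information gain of a split
Before the split: $p_{no} = 5/14$, $p_{yes} = 9/14$, $H(p) = 0.94$.
After the split on outlook, the three branches have entropies $H = 0.97$, $H = 0$ and $H = 0.97$; weighted by branch size, the average entropy is $0.69$.
Information gain $= 0.94 - 0.69 = 0.25$.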
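
The numbers on this slide can be reproduced with a short sketch. The class counts before the split (9 yes, 5 no) come from the slide. The per-branch counts are an assumption: they are the values of the classic "play tennis" example, which match the branch entropies 0.97, 0 and 0.97 shown above. The function names are illustrative.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, branch_counts):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(branch) / total * entropy(branch)
                for branch in branch_counts)
    return entropy(parent_counts) - after


parent = [9, 5]  # (yes, no) before the split, as on the slide

# ASSUMPTION: (yes, no) counts in the three outlook branches.
outlook_branches = [[2, 3], [4, 0], [3, 2]]

h_before = entropy(parent)                         # about 0.940
gain = information_gain(parent, outlook_branches)  # about 0.247
```

With unrounded intermediate values the gain is about 0.247, which agrees with the slide's 0.94 - 0.69 = 0.25 up to rounding.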

50 ID3
1. Start with a single node.
2. Find the attribute with the largest information gain.
3. Split the node according to this attribute.
4. Repeat recursively on subnodes.
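
The four steps translate into a short recursive function. This is a minimal sketch for categorical attributes only, not Quinlan's original implementation. The tree representation (nested tuples and dicts), the helper names and the six-row toy dataset are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    after = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[attribute] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - after


def id3(rows, labels, attributes):
    """Return a leaf label or a node (attribute, {value: subtree})."""
    # Stop when the node is pure or no attributes remain: emit the majority label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    # Steps 3 and 4: split on it and recurse into each subnode.
    branches = {}
    for value in set(row[best] for row in rows):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*pairs)
        branches[value] = id3(list(sub_rows), list(sub_labels), remaining)
    return (best, branches)


def predict(tree, row):
    """Walk from the root to a leaf; leaves are plain labels, not tuples."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[row[attribute]]
    return tree


# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "overcast", "windy": False},
    {"outlook": "overcast", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes", "yes", "no"]

tree = id3(rows, labels, ["outlook", "windy"])
```

On this data the root splits on outlook, and only the rainy branch needs a second split on windy. The sketch has no pruning and `predict` raises a `KeyError` for an attribute value that never appeared in training.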

51 C4.5
C4.5 is an extension of ID3 that supports continuous attributes, missing values and pruning.
There is also C5.0, a commercial version with additional bells & whistles.

52 Decision trees
The goods: easy and efficient; interpretable and pretty.
The bads: rather ad-hoc; can overfit unless properly pruned; not the best model for all classification tasks.

54 Next: Naïve Bayes Classifier To be continued 54

55 Questions? 55
