Introduction to Machine Learning


Introduction to Machine Learning
D. De Cao, R. Basili
Web Mining e Retrieval course, a.y. 2008-09
April 6, 2009

Outline
Introduction to Machine Learning
Decision Tree
Naive Bayes
K-nearest neighbor

Introduction to Machine Learning
Machine learning resembles human learning from past experiences: a computer system learns from data, which represent past experiences of an application domain.
Our focus: learn a target function that can be used to predict the values of a discrete class attribute. This task is commonly called supervised learning, or classification.

Example
You need to write a program that:
given the level hierarchy of a company,
given an employee described through some attributes (the number of attributes can be very high),
assigns the employee the correct level in the hierarchy.
How many if statements would be necessary to select the correct level? How much time would it take to study the relations between the hierarchy and the attributes?
Solution: learn the function that links each employee to the correct level.

Supervised Learning: Data and Goal
Data: a set of data records (also called examples, instances, or cases) described by:
k attributes: A_1, A_2, ..., A_k
a class: each example is labelled with a pre-defined class.
In the previous example, the data can be obtained from an existing database.
Goal: learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.
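As an illustration (our own minimal sketch, not from the slides), such labelled records can be written down directly. The attribute names anticipate the loan example used later in the lecture; the values here are invented:

```python
# Each record: k attribute values plus a pre-defined class label.
# Attribute names follow the later loan example; values are hypothetical.
records = [
    {"Age": "young",  "Has_job": False, "Own_house": False, "Credit": "fair", "Class": "no"},
    {"Age": "middle", "Has_job": True,  "Own_house": True,  "Credit": "good", "Class": "yes"},
]
```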

Supervised vs. Unsupervised Learning
Supervised learning needs supervision: the data (observations, measurements, etc.) are labeled with pre-defined classes, as if a "teacher" had given the classes. New (test) data are then classified into these classes too.
Unsupervised learning: the class labels of the data are unknown. Given a set of data, the task is to establish the existence of classes or clusters in the data.

Supervised Learning process: two steps
Learning (training): learn a model using the training data.
Testing: test the model using unseen test data to assess the model accuracy. A minimal sketch of both steps follows below.
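Concretely, assuming scikit-learn is available (the library and the iris dataset are illustrative choices of ours, not the lecture's):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # any labelled dataset works here

# Hold out unseen test data before training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier()   # Step 1: learning (training)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)     # Step 2: testing on unseen data
print("accuracy:", accuracy_score(y_test, y_pred))
```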

Learning Algorithms
Boolean functions (decision trees)
Probabilistic functions (Bayesian classifiers)
Functions that partition the vector space:
Non-linear: KNN, neural networks, ...
Linear: support vector machines, perceptron, ...

Decision Tree: Domain Example
The class to learn is: approve a loan.

Decision Tree
[Figure: decision tree for the loan problem]

Is the decision tree unique?
No. Here is a simpler tree. We want a tree that is both small and accurate: it is easier to understand and tends to perform better.
Finding the best tree is NP-hard, so all current tree-building algorithms are heuristic.
A decision tree can be converted to a set of rules.

From a decision tree to a set of rules
Each path from the root to a leaf is a rule.
Rules:
Own_house = true → Class = yes
Own_house = false, Has_job = true → Class = yes
Own_house = false, Has_job = false → Class = no
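These rules translate directly into conditionals; a minimal Python sketch (our encoding of the slide's attributes as booleans):

```python
def classify(own_house: bool, has_job: bool) -> str:
    # Own_house = true -> Class = yes
    if own_house:
        return "yes"
    # Own_house = false, Has_job = true -> Class = yes
    if has_job:
        return "yes"
    # Own_house = false, Has_job = false -> Class = no
    return "no"

print(classify(own_house=False, has_job=True))  # yes
```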

Algorithm for decision tree learning
Basic algorithm (a greedy divide-and-conquer algorithm):
Assume attributes are categorical for now (continuous attributes can be handled too).
The tree is constructed in a top-down recursive manner.
At the start, all the training examples are at the root.
Examples are partitioned recursively based on selected attributes.
Attributes are selected on the basis of an impurity function (e.g., information gain).
Conditions for stopping the partitioning (see the sketch below):
All examples for a given node belong to the same class.
There are no remaining attributes for further partitioning: the majority class becomes the leaf.
There are no examples left.
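A minimal sketch of this greedy recursion (our own illustrative code; the attribute selector is left pluggable because the impurity-based choice, information gain, is only defined on the following slides):

```python
from collections import Counter

def build_tree(rows, labels, attributes, choose_attribute):
    # rows: list of dicts mapping attribute -> categorical value
    # labels: parallel list of class labels
    # choose_attribute: impurity-based selector (e.g., information gain)
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:   # stop: all examples in the same class
        return labels[0]
    if not attributes:          # stop: no attributes left -> majority class leaf
        return majority
    best = choose_attribute(rows, labels, attributes)
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):
        # Partition the examples by the value of the chosen attribute.
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        # "No examples left" cannot occur here, since values come from rows.
        node["branches"][value] = build_tree(sub_rows, sub_labels, remaining, choose_attribute)
    return node
```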

Choose an attribute to partition the data
How do we choose the best attribute?
The objective is to reduce the impurity or uncertainty in the data as much as possible. A subset of the data is pure if all its instances belong to the same class.
The heuristic is to choose the attribute with the maximum information gain or gain ratio, based on information theory.

Information Gain
Entropy of D: given a set of examples D, the entropy of the dataset is

H[D] = -\sum_{j=1}^{|C|} P(c_j) \log_2 P(c_j)

where C is the set of classes.

Entropy of an attribute A_i: if we make attribute A_i, with v values, the root of the current tree, it partitions D into v subsets D_1, D_2, ..., D_v. The expected entropy if A_i is used as the current root is

H_{A_i}[D] = \sum_{j=1}^{v} \frac{|D_j|}{|D|} H[D_j]
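A direct transcription of the two formulas into Python (a minimal sketch working from class counts; the helper names are ours):

```python
import math

def entropy(counts):
    # H[D] = -sum_j P(c_j) * log2 P(c_j), computed from class counts
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def expected_entropy(partitions):
    # H_{A_i}[D] = sum_j |D_j|/|D| * H[D_j]; each partition is a list of class counts
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * entropy(p) for p in partitions)
```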

Information Gain
The information gained by selecting attribute A_i to branch or partition the data is the difference between the prior entropy and the entropy after the split:

gain(D, A_i) = H[D] - H_{A_i}[D]

We choose the attribute with the highest gain to branch/split the current tree.
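Continuing the sketch above, the gain is just this difference:

```python
def gain(class_counts, partitions):
    # gain(D, A_i) = H[D] - H_{A_i}[D]
    return entropy(class_counts) - expected_entropy(partitions)
```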

Example
For the loan dataset, with class counts 6 and 9 out of 15 examples:

H[D] = -\frac{6}{15}\log_2\frac{6}{15} - \frac{9}{15}\log_2\frac{9}{15} = 0.971

Splitting on Own_house partitions D into D_1 (6 examples, all of the same class) and D_2 (9 examples):

H_{Own\_house}[D] = \frac{6}{15} H[D_1] + \frac{9}{15} H[D_2] = \frac{6}{15} \cdot 0 + \frac{9}{15} \cdot 0.918 = 0.551

The gains for the four attributes are:

gain(D, Age) = 0.971 - 0.888 = 0.083
gain(D, Own_house) = 0.971 - 0.551 = 0.420
gain(D, Has_job) = 0.971 - 0.647 = 0.324
gain(D, Credit) = 0.971 - 0.608 = 0.363

Own_house has the highest gain, so it is chosen to split the root.
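These numbers can be checked with the helpers sketched earlier; note that an entropy of 0.918 for D_2 corresponds to a 3-versus-6 class split (our inference from the slide's figures):

```python
print(f"H[D]               = {entropy([6, 9]):.3f}")                  # 0.971
print(f"H_Own_house[D]     = {expected_entropy([[6], [3, 6]]):.3f}")  # 0.551
print(f"gain(D, Own_house) = {gain([6, 9], [[6], [3, 6]]):.3f}")      # 0.420
```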
