Machine Learning November 19, 2015

Components of an Agent
(Figure: architecture of a learning agent, showing its components: performance standard, critic, sensors, feedback, learning goals, learning element, knowledge changes, performance element, problem generator, effectors, agent and environment.)

Learning from observations
- The design of the performance element is affected by four factors:
  - which components must be improved;
  - which representation is used for the components;
  - what kind of feedback is available;
  - what background information is known.

Learning from observations
- Components of a performance element:
  - a direct mapping from conditions on the current state to actions;
  - means to infer relevant properties of the environment;
  - information about how the environment changes;
  - information about the results of possible actions;
  - utility information;
  - action-value information: values that indicate the preference for a given action in a given state;
  - goals that describe sets of states whose achievement maximizes the utility.

Learning from observations
- Representation of components: can use any kind of knowledge or data representation (tables, rules, sets, data structures, database tables etc.).
- Feedback:
  - supervised learning: inputs and outputs are known. The agent makes predictions about the outputs given the inputs (not always perfect predictions). The output is known as the class, target variable, ground truth or gold standard.
  - reinforcement learning: the agent receives some evaluation (positive or negative) of each action, but it is not told the correct one.
  - unsupervised learning: learning patterns without any information about the outputs (classes are not known a priori).
- Background knowledge: needed to improve learning.

Inductive Learning
- The learning element knows the correct or approximate value of the class variable. In other words, in y = f(x), it knows the feature vector x and its class y; f is not known. The objective is to learn f.
- Induction: given a set of observations (examples) of f, return a function h (hypothesis) that approximates f.
- Bias: a preference for one hypothesis over another.
- f can be a regression, a Support Vector Machine (SVM), a neural network, a Bayesian network, a Decision Tree, a Random Forest, a Markov Logic Network, propositional rules, first-order rules, etc.

Inductive Learning
Different hypotheses can be learned from the same set of observations (for example, (a) and (b) are distinct hypotheses for the same data set; likewise (c) and (d)).
(Figure: four plots (a)-(d) of f(x) against x, each showing a different hypothesis fitted to the same data points.)
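
To make the picture concrete, here is a small Python sketch (not part of the lecture; the data points and polynomial degrees are invented for illustration) that fits two different hypotheses to the same observations of f:

    import numpy as np

    # A handful of observations (x, y = f(x)); the values are made up for illustration.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])

    # Hypothesis h1: a straight line (degree-1 polynomial).
    h1 = np.polyfit(x, y, deg=1)
    # Hypothesis h2: a degree-5 polynomial that can pass through every point.
    h2 = np.polyfit(x, y, deg=5)

    # Both hypotheses approximate f on the training data...
    print("h1 predictions:", np.polyval(h1, x))
    print("h2 predictions:", np.polyval(h2, x))
    # ...but they disagree on unseen inputs, e.g. x = 6.
    print("h1(6) =", np.polyval(h1, 6.0), " h2(6) =", np.polyval(h2, 6.0))

Preferring the simpler of two hypotheses that fit the data equally well is one example of the bias mentioned above.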

Inductive Learning

global examples ← {}

function REFLEX-PERFORMANCE-ELEMENT(percept) returns an action
    if (percept, a) in examples then return a
    else
        h ← INDUCE(examples)
        return h(percept)

procedure REFLEX-LEARNING-ELEMENT(percept, action)
    inputs: percept, feedback percept
            action, feedback action
    examples ← examples ∪ {(percept, action)}

Inductive Learning
- The algorithm updates a global variable examples, a list of (percept, action) pairs.
- A percept can be a board position in a chess match.
- An action can be the best move according to a chess master.
- If the agent sees a situation it has seen before, it executes the corresponding action.
- Otherwise it runs the machine learning algorithm INDUCE over the examples seen so far to find a new hypothesis.
- INDUCE returns a hypothesis h, which it uses to choose the best action.
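
The pseudocode above can be sketched in Python roughly as follows; INDUCE is assumed to be any learner, and a trivial majority-action stand-in is used here purely for illustration:

    from collections import Counter

    examples = {}  # global store of (percept, action) pairs, kept as a dict percept -> action

    def induce(examples):
        """Stand-in INDUCE: returns a hypothesis h that always predicts the most
        common action seen so far (a real learner would generalize from percepts)."""
        if not examples:
            return lambda percept: None
        most_common_action, _ = Counter(examples.values()).most_common(1)[0]
        return lambda percept: most_common_action

    def reflex_performance_element(percept):
        """Return the stored action for a known percept, otherwise induce h and use it."""
        if percept in examples:
            return examples[percept]
        h = induce(examples)
        return h(percept)

    def reflex_learning_element(percept, action):
        """Feedback: record the (percept, action) pair supplied by the critic/teacher."""
        examples[percept] = action

    # Usage: the teacher labels two positions, then the agent acts.
    reflex_learning_element("position-A", "move-1")
    reflex_learning_element("position-B", "move-2")
    print(reflex_performance_element("position-A"))   # known percept -> stored action
    print(reflex_performance_element("position-C"))   # unseen percept -> h's prediction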

Inductive Learning
- Incremental learning: the agent tries to update its prior hypotheses whenever a new example appears, without inducing over all the examples again.
- The agent can receive feedback about the quality of the chosen actions.
- Hypothesis representation: unrestricted.
- Examples of machine learning representations: propositional, first-order logic, graphical, equations etc.
- Problem: how do we know whether a learning algorithm is producing a good hypothesis?

Decision Trees
- Simple and easy to implement.
- Given a set of observations that includes a class variable, the learned classifier executes rules of the form: if ?? then class = y, where ?? is a set of test conditions.
- In its simplest form a decision tree represents boolean functions.
- Example: whether or not to wait for a table in a restaurant.
- Objective: to learn the predicate WillWait, with its definition represented as a decision tree.

Decision Trees
- Observed variables:
  - Alternative (Alt): is there an alternative restaurant nearby?
  - Bar: does the restaurant have a waiting area (bar)?
  - Fri/Sat: True if it is Friday or Saturday.
  - Hungry: is the customer hungry?
  - Patrons: number of people in the restaurant (None, Some, Full).
  - Price: $, $$, $$$.
  - Rain: True if it is raining.
  - Reservation: True if we have a reservation.
  - Type: French, Italian etc.
  - WaitingTime: 0-10 min, 10-30, 30-60, >60.

Decision Trees

Example  Alt  Bar  Fri  Hun  Pat    Price  Rain  Res  Type     Est    WillWait
X1       Yes  No   No   Yes  Some   $$$    No    Yes  French   0-10   Yes
X2       Yes  No   No   Yes  Full   $      No    No   Thai     30-60  No
X3       No   Yes  No   No   Some   $      No    No   Burger   0-10   Yes
X4       Yes  No   Yes  Yes  Full   $      No    No   Thai     10-30  Yes
X5       Yes  No   Yes  No   Full   $$$    No    Yes  French   >60    No
X6       No   Yes  No   Yes  Some   $$     Yes   Yes  Italian  0-10   Yes
X7       No   Yes  No   No   None   $      Yes   No   Burger   0-10   No
X8       No   No   No   Yes  Some   $$     Yes   Yes  Thai     0-10   Yes
X9       No   Yes  Yes  No   Full   $      Yes   No   Burger   >60    No
X10      Yes  Yes  Yes  Yes  Full   $$$    No    Yes  Italian  10-30  No
X11      No   No   No   No   None   $      No    No   Thai     0-10   No
X12      Yes  Yes  Yes  Yes  Full   $      No    No   Burger   30-60  Yes

Decision Tree for the restaurant example

Patrons?
  None  -> No
  Some  -> Yes
  Full  -> WaitEstimate?
             >60    -> No
             30-60  -> Alternate?
                         No  -> Reservation?
                                  No  -> Bar?  (No -> No, Yes -> Yes)
                                  Yes -> Yes
                         Yes -> Fri/Sat?  (No -> No, Yes -> Yes)
             10-30  -> Hungry?
                         No  -> Yes
                         Yes -> Alternate?
                                  No  -> Yes
                                  Yes -> Raining?  (No -> No, Yes -> Yes)
             0-10   -> Yes

Decision Trees
- In logic: ∀r Patrons(r, Full) ∧ WaitingTime(r, 10-30) ∧ Hungry(r, N) ⇒ WillWait(r)
- In its simplest form, a decision tree cannot represent tests over two or more different objects (every object needs to be ground).
- Limitations in representation.
- Any boolean function can be represented by a decision tree.
- The representation of a decision tree must be compact, because truth tables grow exponentially.

Decision Trees
- Example: attribute values plus class value (feature vector).
- Classification of an example: the predicted value of the class variable for a given example.
- When the class value is true the example is positive, otherwise it is negative.
- The full set of examples is the training set.

Decision Trees
- How to induce a decision tree from examples?
- Each example could be a different path in the tree...
- ...but then the classifier cannot extract any pattern beyond the ones used to build the tree.
- To extract a pattern is to describe a large number of cases in a concise way.
- General principle of inductive learning: Ockham's razor. The most probable hypothesis is the simplest one consistent with all (or most) observations.
- Finding a minimal decision tree is an intractable problem.
- Heuristics can help.

Decision Trees
- Basic idea of the algorithm: test the most important attributes first.
- What makes an attribute important?
- Example: 12 observations, separated into positive and negative sets.
- Patrons is an important attribute: if its value is None or Some, the predicate always has a definite value, No or Yes.
- Type is a poor attribute.
- The algorithm chooses the strongest attribute and places it at the root of the (sub)tree.

Decision Trees
Choice between two attributes, Type and Patrons. Patrons is chosen because it separates positive (WillWait = Yes) and negative (WillWait = No) examples better.
(Figure: (a) splitting the 12 examples on Type leaves every branch (French, Italian, Thai, Burger) with a mix of positives and negatives; (b) splitting on Patrons gives pure branches for None (all negative) and Some (all positive), and only the Full branch, further split on Hungry, still mixes the classes.)

Decision Trees
- There are still subsets of examples not yet classified, so the algorithm is applied recursively. There are four possible cases (see the sketch after this list):
  - If there are both positive and negative examples left to classify, select the best attribute to split them.
  - If all remaining examples are positive (or negative), create a leaf answering Yes (or No) and return.
  - If there are no examples left, there is no observation for that path; return Yes or No according to the majority class of the parent node.
  - If there are no attributes left but there are remaining examples, those examples have exactly the same description but different classifications. Simple solution: return the majority class of these examples.
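
A compact Python sketch of this recursive procedure, under the assumption that each example is a dict of attribute values plus a "class" key and that the best attribute is chosen by information gain (defined formally in the Information Theory section below); names such as dtl and plurality_value are illustrative, not the lecture's:

    import math
    from collections import Counter

    def plurality_value(examples):
        """Majority class among the examples (ties broken arbitrarily)."""
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def entropy(examples):
        counts = Counter(e["class"] for e in examples)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def gain(attribute, examples):
        """Information gain of splitting the examples on the attribute."""
        remainder = 0.0
        for value in {e[attribute] for e in examples}:
            subset = [e for e in examples if e[attribute] == value]
            remainder += len(subset) / len(examples) * entropy(subset)
        return entropy(examples) - remainder

    def dtl(examples, attributes, parent_examples):
        # Case 3: no examples left -> majority class of the parent node.
        if not examples:
            return plurality_value(parent_examples)
        # Case 2: all examples have the same class -> leaf.
        if len({e["class"] for e in examples}) == 1:
            return examples[0]["class"]
        # Case 4: no attributes left -> majority class of the remaining examples.
        if not attributes:
            return plurality_value(examples)
        # Case 1: pick the best attribute and split recursively.
        best = max(attributes, key=lambda a: gain(a, examples))
        tree = {best: {}}
        for value in {e[best] for e in examples}:
            subset = [e for e in examples if e[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = dtl(subset, rest, examples)
        return tree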

Decision Trees
Choice of the attribute Patrons and continuation of the algorithm with the choice of the next best attribute, Hungry.
(Figure: starting from positives {X1, X3, X4, X6, X8, X12} and negatives {X2, X5, X7, X9, X10, X11}: (a) splitting on Patrons gives None = {X7, X11} (all negative), Some = {X1, X3, X6, X8} (all positive) and a mixed Full branch (+: X4, X12; -: X2, X5, X9, X10); (b) splitting on Type leaves every branch (French, Italian, Thai, Burger) with a mix of positives and negatives; (c) the Full branch of the Patrons split is further split on Hungry, giving Hungry = Yes (+: X4, X12; -: X2, X10) and Hungry = No (-: X5, X9).)

Decision Trees
Possible tree generated by an inductive decision tree learning algorithm:

Patrons?
  None -> No
  Some -> Yes
  Full -> Hungry?
            No  -> No
            Yes -> Type?
                     French  -> Yes
                     Italian -> No
                     Thai    -> Fri/Sat?  (No -> No, Yes -> Yes)
                     Burger  -> Yes

Decision Trees
- Notes: the algorithm may conclude facts that are not evident from the examples, for example, always wait for a Thai restaurant on a weekend. Because of this, a lot of time can be wasted looking for bugs that do not exist. The more examples, the more detailed the decision tree will be. In this example, the tree can answer incorrectly, because it never saw a case where the waiting time is 0-10 minutes but the restaurant is full.
- Question: if the algorithm induces a consistent tree but makes mistakes when classifying some examples, how incorrect is the tree?

Decision Trees
Pruning consists of removing redundant nodes. The most common approach is post-pruning. One of the simplest forms of post-pruning is reduced error pruning: starting at the leaves, each node is replaced with its most popular class, and if the prediction accuracy (measured on a validation set) is not affected, the change is kept. While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
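
A hedged Python sketch of reduced error pruning for nested-dict trees like the one built in the earlier sketch; the tree format, the predict helper and the use of a separate validation set are assumptions made for illustration:

    from collections import Counter

    def predict(tree, example):
        """Walk a nested-dict tree {attribute: {value: subtree-or-class}} to a leaf."""
        while isinstance(tree, dict):
            attribute = next(iter(tree))
            tree = tree[attribute].get(example[attribute])
        return tree

    def accuracy(tree, validation):
        return sum(predict(tree, e) == e["class"] for e in validation) / len(validation)

    def majority_class(examples):
        return Counter(e["class"] for e in examples).most_common(1)[0][0]

    def reduced_error_prune(tree, validation):
        """Bottom-up: try replacing each internal node with the majority class of the
        validation examples that reach it; keep the change if accuracy does not drop."""
        if not isinstance(tree, dict) or not validation:
            return tree
        attribute = next(iter(tree))
        for value, subtree in tree[attribute].items():
            reaching = [e for e in validation if e[attribute] == value]
            tree[attribute][value] = reduced_error_prune(subtree, reaching)
        before = accuracy(tree, validation)
        pruned = majority_class(validation)          # candidate replacement leaf
        if accuracy(pruned, validation) >= before:   # prune only if accuracy is not hurt
            return pruned
        return tree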

Decision Trees
Example of pruning (from Eibe Frank's PhD thesis, Pruning Decision Trees and Lists).

Performance of a Machine Learning Algorithm
- A learning algorithm is good if it produces hypotheses that correctly classify examples not yet seen.
- Simple method to evaluate performance (not always the best): check predictions over a test set (data unseen during the training phase), as sketched below.
  1. Choose a set of examples.
  2. Divide this set in two: training and test.
  3. Use the training set to produce the hypothesis H.
  4. Calculate the percentage of correctly classified examples in the test set according to H (the evaluation metric can vary depending on what matters most).
  5. Repeat steps 1 to 4 for different sizes of randomly selected training and test sets.
- Result: data that can be used to produce a learning curve.
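
A minimal Python sketch of steps 1-5; the MajorityClassifier placeholder and the toy dataset are assumptions, and any learner with fit/predict could be plugged in:

    import random
    from collections import Counter

    class MajorityClassifier:
        """Placeholder learner: always predicts the most common training class."""
        def fit(self, xs, ys):
            self.label = Counter(ys).most_common(1)[0][0]
            return self
        def predict(self, x):
            return self.label

    def learning_curve(dataset, train_sizes, trials=20):
        """For each training-set size, average test accuracy over random splits."""
        curve = []
        for size in train_sizes:
            accs = []
            for _ in range(trials):
                data = dataset[:]
                random.shuffle(data)                      # steps 1-2: random split
                train, test = data[:size], data[size:]
                xs, ys = zip(*train)
                h = MajorityClassifier().fit(xs, ys)      # step 3: induce hypothesis H
                correct = sum(h.predict(x) == y for x, y in test)
                accs.append(correct / len(test))          # step 4: accuracy on test set
            curve.append((size, sum(accs) / len(accs)))   # step 5: repeat and average
        return curve

    # Usage with a toy dataset of (feature, class) pairs:
    toy = [(i, "yes" if i % 3 else "no") for i in range(60)]
    for size, acc in learning_curve(toy, train_sizes=[5, 10, 20, 40]):
        print(f"train size {size:>2}: accuracy {acc:.2f}")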

Performance of a Machine Learning Algorithm
(Figure: learning curve, plotting the proportion of correct predictions on the test set (roughly 0.4 to 1.0) against the training set size; accuracy improves as the training set grows.)

Information Theory
- Used to obtain formal metrics for ranking attributes as good, reasonable, poor etc.
- Information is measured in bits. If I(p) = 1, we need 1 bit of information; if I(p) = 0, we need no additional information.
- Let an attribute have possible values v_1, ..., v_n with probabilities P(v_i). Total information: I(P(v_1), ..., P(v_n)) = -Σ_{i=1}^{n} P(v_i) log2 P(v_i).
- An optimal-length code uses -log2 p bits for a value with probability p.

Information Theory
- Considering positive and negative examples: I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2 (p/(p+n)) - (n/(p+n)) log2 (n/(p+n)), an estimate of the information contained in a correct answer.
- Information gain: the difference between the original information and the information still required after testing a new attribute: Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A).
- The heuristic used by CHOOSE-ATTRIBUTE picks the attribute with the largest gain (least remaining entropy), as computed below.
- Example: Gain(Patrons) = 1 - [2/12 I(0, 1) + 4/12 I(1, 0) + 6/12 I(2/6, 4/6)] ≈ 0.541 bits.
- The 1 in the formula comes from the initial information: we have 6 positive examples (WillWait = Yes) and 6 negative examples (WillWait = No). Initial information: -(6/12) log2 (6/12) - (6/12) log2 (6/12) = 1.
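
The example calculation can be checked with a few lines of Python; the branch counts come from the restaurant table above, and the helper names I and gain are illustrative:

    import math

    def I(*probs):
        """Information content (entropy) of a discrete distribution, in bits."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def gain(branches, total=12):
        """Gain = initial information (1 bit for 6 positives / 6 negatives) minus the
        remainder, where branches is a list of (positives, negatives) per attribute value."""
        remainder = sum((p + n) / total * I(p / (p + n), n / (p + n))
                        for p, n in branches)
        return I(6 / 12, 6 / 12) - remainder

    # Patrons: None = (0+, 2-), Some = (4+, 0-), Full = (2+, 4-)
    print(round(gain([(0, 2), (4, 0), (2, 4)]), 3))          # ~0.541 bits
    # Type: French = (1+, 1-), Italian = (1+, 1-), Thai = (2+, 2-), Burger = (2+, 2-)
    print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))  # 0.0 bits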

Algorithm ID3 for Decision Tree Induction

ID3(Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, return the single-node tree Root, with label = +
    If all examples are negative, return the single-node tree Root, with label = -
    If the set of predicting attributes is empty, return the single-node tree Root,
        with label = most common value of the target attribute in the examples
    Else
        A = the attribute that best classifies the examples
        Decision tree attribute for Root = A
        For each possible value vi of A:
            Add a new tree branch below Root, corresponding to the test A = vi
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty:
                Below this new branch add a leaf node with
                    label = most common target value in the examples
            Else:
                Below this new branch add the subtree
                    ID3(Examples(vi), Target_Attribute, Attributes - {A})
            EndIf
        EndFor
    EndIf
    Return Root

ID3 algorithm
- Limitations:
  - information gain is useful only for problems with two classes;
  - the ID3 algorithm does not deal with numerical values.
- Alternatives for measuring attribute utility: Gini index, gain ratio etc.
- Alternative algorithms that handle numerical values: C4.5, C5.0, J48 (the implementation of C4.5 in WEKA).
- When handling numerical values, discretization is needed.
- Methods: unsupervised (already studied: fixed width, fixed frequency or clustering) or supervised.
- A simple supervised method: 1Rule (1R).
- 1Rule works with the attribute and the class variable: it sorts the attribute values and splits at each change of class. It is common to require a minimum number of elements in an interval before splitting.
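
A small sketch of 1Rule-style supervised discretization as described above; the min_size parameter, the halfway split points and the toy data are assumptions for illustration:

    def one_rule_discretize(values, classes, min_size=3):
        """Sort (value, class) pairs by value and propose a split point where the class
        changes, but only once the current interval has at least min_size elements."""
        pairs = sorted(zip(values, classes))
        splits = []
        count = 0
        for (value, cls), (next_value, next_cls) in zip(pairs, pairs[1:]):
            count += 1
            if cls != next_cls and count >= min_size:
                # Split halfway between the two values where the class changes.
                splits.append((value + next_value) / 2)
                count = 0
        return splits

    # Usage: a numeric attribute (e.g. a temperature reading) with a binary class.
    temps   = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
    classes = ["y", "n", "y", "y", "y", "n", "n", "y", "n", "y", "y", "n"]
    print(one_rule_discretize(temps, classes, min_size=3))   # proposed cut points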