Learning


Contents

Learning
    Learning Definitions
    More Learning Definitions
    Example of Examples
    More about Inductive Learning
    Error in Learning
Decision Trees
    Definition
    Example of a Decision Tree
    Algorithm for Growing Decision Trees
    Comparing Attributes: Information Gain
    Plot of Information Function
    Plot of Information Gain
    Example of Attribute Selection
    Attribute Selection, Continued
    Alternative Attribute Measures
    Special Cases in Decision Trees
    Pruning Decision Trees
    Estimating Error
    Algorithm for Pruning Decision Trees
Ensemble Learning
    Definition
    Boosting
    Example Boosting Algorithm
    Example Run of AdaBoost
    Example Run of AdaBoost, Continued

Learning Definitions

Learning is improvement of performance (time, accuracy).
Inductive inference is improving accuracy by generalizing from experience.
An example is a single, specific experience.
In supervised learning, each example is an input/output pair.
Regression is when the output is continuous.
Classification is when the output is discrete.
Concept learning has two possible outputs (positive or negative).

More Learning Definitions

In unsupervised learning, examples do not always have outputs.
In reinforcement learning, an agent performs a series of actions, receiving intermittent feedback.
In batch learning, the learner receives all the examples at the same time.
In online learning, the learner receives the examples one at a time.

Example of Examples

No.  Outlook   Temp  Humidity  Windy  Class
 1   sunny     hot   high      false  neg
 2   sunny     hot   high      true   neg
 3   overcast  hot   high      false  pos
 4   rain      mild  high      false  pos
 5   rain      cool  normal    false  pos
 6   rain      cool  normal    true   neg
 7   overcast  cool  normal    true   pos
 8   sunny     mild  high      false  neg
 9   sunny     cool  normal    false  pos
10   rain      mild  normal    false  pos
11   sunny     mild  normal    true   pos
12   overcast  mild  high      true   pos
13   overcast  hot   normal    false  pos
14   rain      mild  high      true   neg
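The later code sketches in this document assume the 14 training examples above are available in some machine-readable form. One possible encoding in Python is shown below; the name EXAMPLES and the attribute/class representation are illustrative choices, not part of the course material.

# A sketch of the "Example of Examples" table as (attribute_dict, class) pairs.
EXAMPLES = [
    ({"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": False}, "neg"),
    ({"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": True},  "neg"),
    ({"outlook": "overcast", "temp": "hot",  "humidity": "high",   "windy": False}, "pos"),
    ({"outlook": "rain",     "temp": "mild", "humidity": "high",   "windy": False}, "pos"),
    ({"outlook": "rain",     "temp": "cool", "humidity": "normal", "windy": False}, "pos"),
    ({"outlook": "rain",     "temp": "cool", "humidity": "normal", "windy": True},  "neg"),
    ({"outlook": "overcast", "temp": "cool", "humidity": "normal", "windy": True},  "pos"),
    ({"outlook": "sunny",    "temp": "mild", "humidity": "high",   "windy": False}, "neg"),
    ({"outlook": "sunny",    "temp": "cool", "humidity": "normal", "windy": False}, "pos"),
    ({"outlook": "rain",     "temp": "mild", "humidity": "normal", "windy": False}, "pos"),
    ({"outlook": "sunny",    "temp": "mild", "humidity": "normal", "windy": True},  "pos"),
    ({"outlook": "overcast", "temp": "mild", "humidity": "high",   "windy": True},  "pos"),
    ({"outlook": "overcast", "temp": "hot",  "humidity": "normal", "windy": False}, "pos"),
    ({"outlook": "rain",     "temp": "mild", "humidity": "high",   "windy": True},  "neg"),
]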

More about Inductive Learning

The learner learns a hypothesis h from a set of training examples.
h can be evaluated empirically on a set of test examples, or theoretically on the probability distribution of the examples.
Inductive bias refers to the hypotheses that the learner prefers.
One kind of inductive bias is to restrict the hypothesis space, the set of hypotheses to be considered.

Error in Learning

Perfect learning cannot be guaranteed by any learning algorithm from a finite set of training examples. The training examples might not cover all the possibilities, or might not be representative.
No learning algorithm is best. All learning algorithms are forced to make assumptions which might not be true.
The goal of PAC learning (PAC = probably approximately correct) is to find a hypothesis that is unlikely (probability δ or less) to have high error (ε or more).

Decision Trees

Definition

Decision trees are a representation for classification. The root is labeled by an attribute. Edges are labeled by attribute values. Edges go to decision trees or leaves. Each leaf is labeled by a class.
Growth Phase: Construct the tree top-down. Find the best attribute. Split the examples based on the attribute's values.
Pruning Phase: Prune the tree bottom-up. For each node, keep the subtree or change it to a leaf.

Example of a Decision Tree

[Figure: a decision tree for the weather data. The root tests outlook; the sunny branch tests humidity (high -> neg, normal -> pos), the overcast branch is a pos leaf, and the rain branch tests windy (true -> neg, false -> pos). A second, partially grown tree rooted at temp, with further outlook tests and some leaves still undetermined (???), is also shown.]

Algorithm for Growing Decision Trees

Grow_DT(examples)
 1. N ← a new node
 2. N.class ← most common class in examples
 3. if examples have identical class or values
 4. then return N
 5. N.test ← best attribute (or test)
 6. for each value v_j of N.test
 7.     examples_j ← examples with N.test = v_j
 8.     if examples_j is empty
 9.     then N.branch_j ← N.class
10.     else N.branch_j ← Grow_DT(examples_j)
11. return N
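A rough Python rendering of the Grow_DT pseudocode above. It is a sketch under the assumption that examples are (attribute_dict, class) pairs as in the earlier encoding and that best_attribute is supplied by the caller (for instance, the information-gain criterion on the next slides); none of these names come from the course itself.

# Sketch of Grow_DT: build a decision tree top-down.
from collections import Counter

class Node:
    def __init__(self, cls):
        self.cls = cls        # most common class among this node's examples
        self.test = None      # attribute tested at this node (None for a leaf)
        self.branches = {}    # attribute value -> child Node

def grow_dt(examples, attributes, best_attribute):
    # examples: list of (attribute_dict, class) pairs
    # attributes: dict mapping each attribute name to its list of possible values
    node = Node(Counter(c for _, c in examples).most_common(1)[0][0])
    if len({c for _, c in examples}) == 1 or not attributes \
            or len({tuple(sorted(a.items())) for a, _ in examples}) == 1:
        return node                                   # identical class or values: leaf
    node.test = best_attribute(examples, attributes)  # e.g. highest information gain
    rest = {a: vs for a, vs in attributes.items() if a != node.test}
    for v in attributes[node.test]:
        subset = [(a, c) for a, c in examples if a[node.test] == v]
        if not subset:
            node.branches[v] = Node(node.cls)         # empty branch: leaf with parent's class
        else:
            node.branches[v] = grow_dt(subset, rest, best_attribute)
    return node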

Comparing Attributes: Information Gain

Suppose there are p positive examples and n negative examples. The information contained is:

    I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}

Attribute A has v values, with p_j positive examples and n_j negative examples when A = v_j. The remainder of A is:

    Remainder(A) = \sum_{j=1}^{v} \frac{p_j + n_j}{p + n} \, I(p_j, n_j)

The information gain of A is:

    Gain(A) = I(p, n) - Remainder(A)

Plot of Information Function

p positive examples and n negative examples.
[Figure: plot of I(p, n = 100 - p) for p from 0 to 100; the curve rises from 0 at p = 0 to its maximum of 1 at p = n = 50 and falls back to 0 at p = 100.]

Plot of Information Gain

p_1 positive and n_1 negative examples when attr. = v_1; p_2 positive and n_2 negative examples when attr. = v_2.
[Figure: surface plot of gain(p_1, n_1 = 50 - p_1, p_2, n_2 = 50 - p_2) for p_1 and p_2 from 0 to 50; vertical axis from 0 to 1.]

Example of Attribute Selection

Refer to Example of Examples earlier.

Outlook:  sunny 2 pos, 3 neg    overcast 4 pos, 0 neg    rain 3 pos, 2 neg
    Gain(Outlook) ≈ 0.246
Temp:     cool 3 pos, 1 neg     mild 4 pos, 2 neg        hot 2 pos, 2 neg
    Gain(Temp) ≈ 0.029
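A small Python sketch (mine, not the course's code) of the information-gain formulas above, checked against the Gain(Outlook) and Gain(Temp) values on this slide. The functions work directly from the per-value (p_j, n_j) counts.

# Information gain from per-value positive/negative counts.
from math import log2

def info(p, n):
    # I(p, n); taken as 0 when one of the classes is absent
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c > 0)

def gain(counts):
    # counts: list of (p_j, n_j) pairs, one per attribute value
    p = sum(pj for pj, _ in counts)
    n = sum(nj for _, nj in counts)
    remainder = sum((pj + nj) / (p + n) * info(pj, nj) for pj, nj in counts)
    return info(p, n) - remainder

print(round(gain([(2, 3), (4, 0), (3, 2)]), 3))   # Outlook -> 0.247 (slide: ~0.246)
print(round(gain([(3, 1), (4, 2), (2, 2)]), 3))   # Temp    -> 0.029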

Attribute Selection, Continued

Humidity:  normal 6 pos, 1 neg    high 3 pos, 4 neg
    Gain(Humidity) ≈ 0.152
Windy:     true 3 pos, 3 neg      false 6 pos, 2 neg
    Gain(Windy) ≈ 0.048

Outlook has the highest gain. The overcast branch is pure. Decision trees still need to be constructed for the other two branches.

Alternative Attribute Measures

(A code sketch of these measures appears after the next two slides.)

Maximize Information Gain Ratio:

    GainRatio(A) = \frac{Gain(A)}{I(p_1 + n_1, \ldots, p_v + n_v)}

Minimize Gini Index:

    Gini(p, n) = 1 - \left(\frac{p}{p+n}\right)^2 - \left(\frac{n}{p+n}\right)^2

    GiniIndex(A) = \sum_{j=1}^{v} \frac{p_j + n_j}{p + n} \, Gini(p_j, n_j)

Maximize Chi-Squared Statistic:

    \chi^2 = \sum_{j=1}^{v} \left[ \frac{(p_j - p\,s_j)^2}{p\,s_j} + \frac{(n_j - n\,s_j)^2}{n\,s_j} \right], where s_j = (p_j + n_j)/(p + n)

Special Cases in Decision Trees

Attribute A is numeric. Find the best A ≤ v test. Requires sorting. Or: Discretization. Partition A into ranges.
Attribute A has missing values. Pretend missing is just another value. Or: Ignore missing values. Split examples with missing values across branches.
Attribute A has many discrete values. Find the best A = v test. Forms a binary tree. Or: Partition the values into subsets.

Pruning Decision Trees

Why are there errors? Statistical fluctuations. Examples might have noise and/or outliers. The DT approximates the decision boundary. This results in overfitting at the lower levels of the DT.

Pruning:
Prepruning: Avoid creation of subtrees based on the number of examples or attribute relevance.
Postpruning: Create the overfitting DT and substitute subtrees with leaves if the estimated error is reduced.
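The following sketch illustrates the measures on the "Alternative Attribute Measures" slide above; it is illustrative code, not from the course, and reuses the per-value count representation from the earlier gain sketch.

# Gain ratio denominator and Gini index from per-value counts.
from math import log2

def split_info(counts):
    # I(p_1 + n_1, ..., p_v + n_v): information of the partition itself
    sizes = [pj + nj for pj, nj in counts]
    total = sum(sizes)
    return -sum(s / total * log2(s / total) for s in sizes if s > 0)

def gini(p, n):
    return 1 - (p / (p + n)) ** 2 - (n / (p + n)) ** 2

def gini_index(counts):
    total = sum(pj + nj for pj, nj in counts)
    return sum((pj + nj) / total * gini(pj, nj) for pj, nj in counts)

# Outlook from the example: sunny (2 pos, 3 neg), overcast (4, 0), rain (3, 2)
outlook = [(2, 3), (4, 0), (3, 2)]
print(round(gini_index(outlook), 3))   # ~0.343 (lower is better)
print(round(split_info(outlook), 3))   # ~1.577, the GainRatio denominator,
                                       # so GainRatio(Outlook) ~ 0.247 / 1.577 ~ 0.157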

Estimating Error

Use a validation set of examples. (Training set, validation set, and test set should be disjoint.)
Minimum Description Length principle (minimize the size of the tree plus the size of the errors).
Add some error to each leaf (C4.5). Suppose a leaf has e errors on n examples. Find a 75% confidence interval using the binomial distribution. Estimate the true error as the upper limit of the interval.

Algorithm for Pruning Decision Trees

Prune_DT(N: node, examples)
 1. leaferr ← number of examples whose class ≠ N.class
 2. increase leaferr if the examples are the training set
 3. if N is a leaf then return leaferr
 4. treeerr ← 0
 5. for each value v_j of N.test
 6.     examples_j ← examples with N.test = v_j
 7.     suberr ← Prune_DT(N.branch_j, examples_j)
 8.     treeerr ← treeerr + suberr
 9. if leaferr < treeerr
10. then make N a leaf; return leaferr
11. else return treeerr

Ensemble Learning

Definition

There are many algorithms for learning a single hypothesis. Ensemble learning learns and combines a collection of hypotheses by running the algorithm on different training sets.
Bagging (briefly mentioned in the book) runs a learning algorithm on repeated subsamples of the training set. If there are n examples, then a subsample of n examples is generated by sampling with replacement. On a test example, each hypothesis casts 1 vote for the class it predicts.

Boosting

In boosting, the hypotheses are learned in sequence. Both hypotheses and examples have weights, with different purposes. After each hypothesis is learned, its weight is based on its error rate, and the weights of the training examples (initially all equal) are also modified. On a test example, when each hypothesis predicts a class, its weight is the size of its vote. The ensemble predicts the class with the highest total vote.

Example Boosting Algorithm

AdaBoost(examples, algorithm, iterations)
 1. n ← number of examples
 2. initialize weights w[1...n] to 1/n
 3. for i from 1 to iterations
 4.     h[i] ← algorithm(examples)
 5.     error ← sum of the weights of the examples misclassified by h[i]
 6.     for j from 1 to n
 7.         if h[i] is correct on example j
 8.         then w[j] ← w[j] · error/(1 - error)
 9.     normalize w[1...n] so it sums to 1
10.     weight of h[i] ← log((1 - error)/error)
11. return h[1...iterations] and their weights
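A Python sketch of the AdaBoost pseudocode above. The weak learner `algorithm` is assumed to accept the current example weights and return a callable hypothesis; that interface, and the early exit on degenerate error values, are my assumptions rather than part of the slide.

# Sketch of AdaBoost and weighted voting.
from math import log

def adaboost(examples, algorithm, iterations):
    # examples: list of (attributes, cls) pairs
    n = len(examples)
    w = [1.0 / n] * n                         # example weights, initially uniform
    ensemble = []                             # list of (hypothesis, vote weight)
    for _ in range(iterations):
        h = algorithm(examples, w)
        wrong = {j for j, (x, c) in enumerate(examples) if h(x) != c}
        error = sum(w[j] for j in wrong)      # weighted error of h
        if error == 0.0 or error >= 0.5:      # degenerate cases not covered on the slide
            break
        for j in range(n):
            if j not in wrong:                # shrink the weights of correct examples
                w[j] *= error / (1.0 - error)
        total = sum(w)
        w = [wj / total for wj in w]          # renormalize so the weights sum to 1
        ensemble.append((h, log((1.0 - error) / error)))
    return ensemble

def predict(ensemble, x):
    # each hypothesis casts a vote whose size is its weight
    votes = {}
    for h, weight in ensemble:
        votes[h(x)] = votes.get(h(x), 0.0) + weight
    return max(votes, key=votes.get)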

Example Run of AdaBoost

Using the 14 examples as a training set:
The hypothesis "windy = false → class = pos" is wrong on 5 of the 14 examples.
The weights of the correctly classified examples are multiplied by 5/9, then all weights are multiplied by 14/10 so they sum to 1 again.
This hypothesis has a weight of log(9/5).
Note that after weight updating, the total weight of the correctly classified examples equals the total weight of the incorrectly classified examples.

Example Run of AdaBoost, Continued

The next hypothesis must be different from the previous one to have error less than 1/2.
Now the hypothesis "outlook = overcast → class = pos" has an error rate of 29/90 ≈ 0.322.
The weights of the correctly classified examples are multiplied by 29/61 ≈ 0.475, then all weights are multiplied by 90/58 ≈ 1.55 so they sum to 1 again.
This hypothesis has a weight of log(61/29).
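As a check on the arithmetic above, the short script below (my own, using exact fractions) reproduces the weighted errors of the two hypotheses; the misclassified example numbers in the comments come from applying the two rules to the Example of Examples table.

# Verify the AdaBoost weight arithmetic for the two rounds above.
from fractions import Fraction as F

w = [F(1, 14)] * 14                 # initial weights of examples 1..14
wrong1 = {0, 6, 7, 10, 11}          # "windy = false -> pos" errs on examples 1, 7, 8, 11, 12
error1 = sum(w[j] for j in wrong1)  # 5/14
w = [wj if j in wrong1 else wj * error1 / (1 - error1)   # correct weights times 5/9
     for j, wj in enumerate(w)]
total = sum(w)
w = [wj / total for wj in w]        # renormalize (multiplies every weight by 14/10)

wrong2 = {3, 4, 8, 9, 10}           # "outlook = overcast -> pos" errs on examples 4, 5, 9, 10, 11
error2 = sum(w[j] for j in wrong2)
print(error1, error2)               # 5/14 29/90, matching the slides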