CSC 4510/9010: Applied Machine Learning. Rule Inference. Dr. Paula Matuszek


CSC 4510/9010: Applied Machine Learning
Rule Inference
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu | Paula.Matuszek@gmail.com | (610) 647-9789

Classification Rules

Popular alternative to decision trees.
- Antecedent (pre-condition): a series of tests, just like the tests at the nodes of a decision tree. Tests are usually logically ANDed together (but may also be general logical expressions).
- Consequent (conclusion): the class, set of classes, or probability distribution assigned by the rule.
- Individual rules are often logically ORed together; conflicts arise if different conclusions apply.

Slides from the Weka text (Data Mining: Practical Machine Learning Tools and Techniques, Chapter 3) can be found at http://www.cs.waikato.ac.nz/ml/weka/book.html

Rules in AI

- Mycin (Shortliffe and Buchanan, mid-1970s): hand-crafted rules for diagnosing blood infections. It popularized the idea of If..Then rules that model human expertise.
- Emycin ("empty Mycin"): the inference component of Mycin, without the actual rules.
- Expert Systems (ES): a knowledge base (KB) plus an inference engine. Effective, and still in extensive use.
http://people.dbmi.columbia.edu/~ehs7001/buchanan-shortliffe-1984/mycin%20book.htm

Example Rules in an ES

Rule Class1:
  IF   it is snowing, and SEPTA is not running
  THEN it is possible that 4510 will not meet (0.7)

Rule Class2:
  IF   there is an exam scheduled
  THEN it is possible that 4510 will meet (0.8)

Rule Class3:
  IF   Villanova cancels class
  THEN it is possible that 4510 will not meet (1.0)
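These rules are hand-written, but it is easy to see how they might be fed to a simple interpreter. Below is a minimal Python sketch under an assumed toy representation; the dictionary layout, the certainty numbers attached to conclusions, and the fire helper are all illustrative, not Emycin's actual format.

    # Toy encoding of the class-meeting rules above (illustrative, not Emycin).
    rules = [
        {"name": "Class1",
         "if":   ["it is snowing", "SEPTA is not running"],
         "then": ("4510 will not meet", 0.7)},
        {"name": "Class2",
         "if":   ["there is an exam scheduled"],
         "then": ("4510 will meet", 0.8)},
        {"name": "Class3",
         "if":   ["Villanova cancels class"],
         "then": ("4510 will not meet", 1.0)},
    ]

    def fire(rules, facts):
        """Return the conclusion of every rule whose antecedents all hold."""
        return [r["then"] for r in rules if all(f in facts for f in r["if"])]

    print(fire(rules, {"it is snowing", "SEPTA is not running"}))
    # [('4510 will not meet', 0.7)]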

Some Problems Here

- Rules are time-consuming to elicit; building them typically requires both a domain expert and a knowledge engineer.
- Conflict resolution is not trivial to handle.
- The inference looks like straight logic and probabilities, but in fact people are terrible at both.
All of these mean that it would be nice to find some way to create the rules automatically: rule induction.

Inductive Logic

Deductive logic:
  If it's Tuesday, then Paula is teaching.
  It is Tuesday.
  Therefore Paula is teaching.
Sound: if the premises are true, the conclusion is always true.

Inductive logic:
  Paula taught 8/30, 9/6, 9/13, 9/20, 9/27. That is every Tuesday this semester.
  Therefore Paula is teaching on Tuesdays.

  If it's Tuesday then Paula is teaching.
  Paula is teaching.
  Therefore it is Tuesday.
Not sound. But often useful.

Simple Rule Induction

Given:
- Features
- Training examples
- Outputs for the training examples
generate automatically a set of rules that will allow you to judge new objects.

Basic approach:
- Combinations of features become antecedents (or links).
- Examples become consequents (or nodes).

Simple Rule Induction Example

Starting with 100 cases, 10 outcomes, 15 variables:
- Form 100 rules, each with 15 antecedents and one consequent.
- Collapse rules (a toy sketch follows this list):
  - Cancellation: if we have C, A => B and not-C, A => B, collapse to A => B.
  - Drop terms: if we have D, E => F and D, G => F, collapse to D => F.
- Test the rules and undo a collapse if performance gets worse.
- Additional heuristics for combining rules.
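Here is a toy sketch of the cancellation step, under an assumed representation in which a rule is a set of (attribute, truth-value) literals plus a consequent; the drop-terms step is analogous. All names here are illustrative.

    # Cancellation: C, A => B and not-C, A => B collapse to A => B.
    # A rule is (frozenset of (attribute, bool) literals, consequent).
    def cancel(r1, r2):
        """Collapse two rules that differ only in the sign of one literal."""
        (a1, c1), (a2, c2) = r1, r2
        if c1 != c2:
            return None
        diff = a1 ^ a2                       # literals not shared by both
        if len(diff) == 2 and len({lit[0] for lit in diff}) == 1:
            return (a1 & a2, c1)             # drop the cancelling literal
        return None

    r1 = (frozenset({("C", True),  ("A", True)}), "B")
    r2 = (frozenset({("C", False), ("A", True)}), "B")
    print(cancel(r1, r2))  # (frozenset({('A', True)}), 'B')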

Rose Diagnosis

Case  Yellow Leaves  Wilted Leaves  Brown Spots  Diagnosis
1     N              Y              Y            Fungus
2     N              Y              Y            Bugs
3     Y              N              N            Nutrition
4     N              N              Y            Fungus
5     Y              N              Y            Fungus
6     Y              Y              N            Bugs

R1: If not yellow leaves and wilted leaves and brown spots, then fungus.
R6: If wilted leaves and yellow leaves and not brown spots, then bugs.

Rose Diagnosis (continued)

Cases 1 and 4 have opposite values for wilted leaves, so create a new rule:
R7: If not yellow leaves and brown spots, then fungus.

The KB is the rules. The learner is the system collapsing and testing rules. The critic is the test cases. The performer is rule-based inference.
Problems:
- Over-generalization
- Irrelevance
- Needs data on all features for all training cases
- Computationally painful
Useful if you have enough good training cases, and the output can be understood and modified by humans.

Alternate Approach: Covering

Rather than starting with one rule per example, we can start with a rule that includes, or covers, all of one class and excludes the others:
- Pick an attribute and a value that comes closest; expand the rule, or add rules, to capture additional cases (see the sketch below).
- Repeat for additional classes.
Examples: PRISM, RIPPER (JRip in Weka); pp. 108-116 in the text.
Similar to a decision tree algorithm, but bottom-up instead of top-down.
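A minimal sketch of a single covering step in the PRISM style: among all attribute-value tests, pick the one whose covered instances are purest for the target class. The data layout and names are illustrative; the real algorithms in the text add rule refinement and remove covered instances before repeating.

    # One covering step: choose the purest attribute-value test for a class.
    def best_test(instances, target):
        """instances: list of (dict attribute->value, class label)."""
        candidates = {(a, v) for attrs, _ in instances for a, v in attrs.items()}
        def accuracy(test):
            a, v = test
            covered = [cls for attrs, cls in instances if attrs.get(a) == v]
            # Purity first; break ties in favor of higher coverage.
            return (covered.count(target) / len(covered), len(covered))
        return max(candidates, key=accuracy)

    data = [({"outlook": "sunny",    "windy": "true"},  "no"),
            ({"outlook": "sunny",    "windy": "false"}, "no"),
            ({"outlook": "overcast", "windy": "true"},  "yes")]
    print(best_test(data, "yes"))  # ('outlook', 'overcast')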

Induced Rule Sets

Positives:
- A covering set of rules can match a consistent dataset perfectly.
- Human-readable, and even modifiable: white-box.
Negatives:
- Tend to overfit.
- Computationally difficult.
- For a large dataset or many attributes, can end up with a complex set of rules.
- And conflict resolution is still an issue.
Can we make it simpler?

Simplicity First

Simple algorithms often work very well! There are many kinds of simple structure, e.g.:
- One attribute does all the work
- All attributes contribute equally and independently
- A weighted linear combination might do
- Instance-based: use a few prototypes
- Use simple logical rules
Success of a method depends on the domain.

Slides from the Weka text (Data Mining: Practical Machine Learning Tools and Techniques, Chapter 4) can be found at http://www.cs.waikato.ac.nz/ml/weka/book.html

Inferring Rudimentary Rules

1R learns a one-level decision tree, i.e., rules that all test one particular attribute.
Basic version (assumes nominal attributes):
- One branch for each value of the attribute.
- Each branch assigns the most frequent class.
- Error rate: the proportion of instances that don't belong to the majority class of their corresponding branch.
- Choose the attribute with the lowest error rate.

Pseudo-code for 1R

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

Note: "missing" is treated as a separate attribute value.
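A direct Python rendering of this pseudo-code for nominal attributes. The data layout is illustrative; Weka's OneR classifier additionally handles numeric and missing values.

    # 1R: for each attribute, build one rule per value, keep the attribute
    # whose rules make the fewest errors on the training data.
    from collections import Counter

    def one_r(instances, attributes):
        """instances: list of (dict attribute->value, class label)."""
        best = None
        for attr in attributes:
            by_value = {}
            for attrs, cls in instances:
                by_value.setdefault(attrs[attr], Counter())[cls] += 1
            # Each value predicts its most frequent class.
            pred = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
            errors = sum(cls != pred[attrs[attr]] for attrs, cls in instances)
            if best is None or errors < best[0]:
                best = (errors, attr, pred)
        return best[1], best[2]

    weather = [({"outlook": "sunny",    "windy": "false"}, "no"),
               ({"outlook": "sunny",    "windy": "true"},  "no"),
               ({"outlook": "overcast", "windy": "false"}, "yes"),
               ({"outlook": "rainy",    "windy": "false"}, "yes"),
               ({"outlook": "rainy",    "windy": "true"},  "no")]
    print(one_r(weather, ["outlook", "windy"]))
    # ('outlook', {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'})
    # (ties are broken by first occurrence)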

Evaluating the Weather Attributes

The nominal weather data (14 instances):

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

1R evaluation (* indicates a tie):

Attribute  Rules            Errors  Total errors
Outlook    Sunny -> No      2/5     4/14
           Overcast -> Yes  0/4
           Rainy -> Yes     2/5
Temp       Hot -> No*       2/4     5/14
           Mild -> Yes      2/6
           Cool -> Yes      1/4
Humidity   High -> No       3/7     4/14
           Normal -> Yes    1/7
Windy      False -> Yes     2/8     5/14
           True -> No*      3/6

Dealing with Numeric Attributes

Discretize numeric attributes:
- Divide each attribute's range into intervals.
- Sort instances according to the attribute's values.
- Place breakpoints where the (majority) class changes.
This minimizes the total error.

Example: temperature from the weather data (first rows of the numeric version shown):

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...

64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
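A short sketch of the naive breakpoint placement just described: sort by value, then cut wherever the class changes. Names are illustrative.

    # Place a breakpoint at the midpoint between values whose classes differ.
    def breakpoints(values, classes):
        pairs = sorted(zip(values, classes))
        cuts = []
        for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
            if c1 != c2:
                cuts.append((v1 + v2) / 2)
        return cuts

    temps   = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    classes = ["yes","no","yes","yes","yes","no","no","yes","yes","yes",
               "no","yes","yes","no"]
    print(breakpoints(temps, classes))
    # Note the two 72s with different classes: the "cut" at 72.0 cannot
    # actually separate them, which previews the noise problem below.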

The Problem of Overfitting

Discretization is very sensitive to noise:
- One instance with an incorrect class label will probably produce a separate interval.
- Also, a time-stamp attribute will have zero errors.
Simple solution: enforce a minimum number of instances in the majority class per interval.

Example (with min = 3), on the temperatures above:

64   65   68   69   70   71   72   72   75   75  |  80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes |  No   Yes  Yes  No

The many one-change intervals collapse to a single split (around 77.5): Yes below, No* above.
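A simplified sketch of the min-instances fix; this is a loose reading of the heuristic above, not Weka's exact procedure. Grow each interval until its majority class has at least min_count members and the next value's class differs, then merge adjacent intervals that predict the same class.

    from collections import Counter

    def partition(pairs, min_count=3):
        """pairs: (value, class) sorted by value. Returns [lo, hi, class]
        intervals. Ties go to the class seen first in the interval."""
        intervals, current = [], []
        for i, (v, c) in enumerate(pairs):
            current.append((v, c))
            majority, n = Counter(x for _, x in current).most_common(1)[0]
            nxt = pairs[i + 1][1] if i + 1 < len(pairs) else None
            if (n >= min_count and nxt != majority) or nxt is None:
                intervals.append([current[0][0], v, majority])
                current = []
        merged = [intervals[0]]
        for lo, hi, cls in intervals[1:]:
            if cls == merged[-1][2]:
                merged[-1][1] = hi          # same prediction: merge neighbors
            else:
                merged.append([lo, hi, cls])
        return merged

    temps = list(zip(
        [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85],
        ["yes","no","yes","yes","yes","no","no","yes","yes","yes",
         "no","yes","yes","no"]))
    print(partition(temps))
    # [[64, 75, 'yes'], [80, 85, 'no']]  -> one split: yes below, no above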

Discussion of 1R

1R was described in a paper by Holte (1993):
- Contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data).
- The minimum number of instances was set to 6 after some experimentation.
- 1R's simple rules performed not much worse than much more complex decision trees.
Simplicity first pays off!

Robert C. Holte, "Very Simple Classification Rules Perform Well on Most Commonly Used Datasets", Computer Science Department, University of Ottawa.

Rules and Trees

A decision tree can be turned into rules: walk the tree and AND the tests, producing one rule for each leaf (see the sketch below). The resulting rules are unambiguous and not order-dependent, but unnecessarily complex.
Rules to decision trees is harder. Trees do not capture OR well:
  If x = 3 then class = A
  If x = 4 then class = A
And if rules have conflicts, how do they get encoded?
Top-down (decision trees) and bottom-up (rules) can lead to different models.
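A sketch of the easy direction, tree to rules: walk the tree and AND the tests along each root-to-leaf path. The tree encoding here is illustrative.

    # Convert a decision tree to rules: one rule per root-to-leaf path.
    def tree_to_rules(node, path=()):
        """node: ('leaf', class) or (attribute, {value: subtree})."""
        if node[0] == "leaf":
            return [(path, node[1])]
        attr, branches = node
        rules = []
        for value, subtree in branches.items():
            rules += tree_to_rules(subtree, path + ((attr, value),))
        return rules

    tree = ("outlook", {
        "sunny":    ("humidity", {"high":  ("leaf", "no"),
                                  "normal": ("leaf", "yes")}),
        "overcast": ("leaf", "yes"),
        "rainy":    ("windy", {"true":  ("leaf", "no"),
                               "false": ("leaf", "yes")}),
    })
    for conds, cls in tree_to_rules(tree):
        print(" AND ".join(f"{a}={v}" for a, v in conds), "=>", cls)
    # e.g. outlook=sunny AND humidity=high => no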

Rules Summary

- Multiple approaches, but the basic idea is the same: infer simple rules that make the decision based on logical combinations of attributes.
- ZeroR: predict the most common class.
- OneR (1R) is a good first test.
- For simple domains, the rules are easy for humans to understand.
- Sensitive to noise and overfitting.
- Not a good fit for complex domains or large numbers of attributes.