Deriving Decision Trees from Case Data


Topic 4: Automatic Knowledge Acquisition, PART II

Contents
5.1 The Bottleneck of Knowledge Acquisition
5.2 Inductive Learning: Decision Trees
5.3 Converting Decision Trees into Rules
5.4 Generating Decision Trees: Information Gain

Deriving Decision Trees from Case Data

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

There are various ways to derive decision trees from case data. A simple method is called Random-Tree, and is described on the next few slides.
We assume that all given attributes are discrete (not continuous).
We assume that the expert classification is binary (e.g., yes or no, true or false, treat or don't-treat, etc.).

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

Let's take a case study: will someone play tennis, given the weather?

[Case table: cases 1 to 9 with attributes Outlook, Temp., Humidity, Wind and the expert's decision Play?; apart from "overcast", the cell values were not recovered in extraction. The same table is repeated on the following slides.]

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

To develop a decision tree with Random-Tree, we select an attribute at random, e.g., Humidity. We make the root node of the tree with this attribute:

    Humidity?

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

We then list, for each branch, the cases which fit that branch:

    Humidity? with one branch covering cases 5, 6, 8, 9 and the other covering cases 1, 2, 3, 4, 7.

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

If all of the cases of a branch share the same conclusion, we make this branch a leaf, and just show the decision. This is not the case here, since the decisions are mixed on both branches.

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

For non-terminal nodes, we then select a second attribute at random, and create the branches:

    Humidity? at the root, with a Wind? node on each branch; the four sub-branches now cover cases 5, 6; cases 8, 9; cases 1, 3, 4, 7; and case 2.

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

And then repeat the previous steps until finished:

    The finished tree has Humidity? at the root, Wind? nodes below it, and an Outlook? node (with an overcast branch) further down.

5. Automatic Knowledge Acquisition
Deriving Decision Trees: Random-Tree

Trees will differ depending on the order in which attributes are used. Trees may be smaller or larger (in number of nodes, depth) than others. Later, we will look at a means to produce compact trees (ID3-Tree).
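
The slides describe Random-Tree only in prose, so here is a minimal sketch of the idea in Python. It assumes each case is a dictionary of discrete attribute values plus a binary decision under the key "Play?"; the function name, the data representation and the three example cases are our own illustration, not taken from the slide's case table.

```python
import random

def random_tree(cases, attributes, decision="Play?"):
    """Build a decision tree by picking split attributes at random (Random-Tree)."""
    decisions = {c[decision] for c in cases}
    if len(decisions) == 1:          # all cases agree: make a leaf with that decision
        return decisions.pop()
    if not attributes:               # no attributes left: leaf with the majority decision
        values = [c[decision] for c in cases]
        return max(set(values), key=values.count)
    attr = random.choice(attributes)             # pick the next attribute at random
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {c[attr] for c in cases}:       # one branch per observed value
        branch_cases = [c for c in cases if c[attr] == value]
        tree[attr][value] = random_tree(branch_cases, remaining, decision)
    return tree

# Hypothetical mini data set in the spirit of the tennis example:
cases = [
    {"Outlook": "sunny",    "Humidity": "high",   "Wind": "strong", "Play?": "no"},
    {"Outlook": "overcast", "Humidity": "high",   "Wind": "weak",   "Play?": "yes"},
    {"Outlook": "rain",     "Humidity": "normal", "Wind": "weak",   "Play?": "yes"},
]
print(random_tree(cases, ["Outlook", "Humidity", "Wind"]))
```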

5. Automatic Knowledge Acquisition
Other Models: The Restaurant Problem

Will people wait for a table to be available? Assume that an expert provided the following decision tree. (But... is Sociology an exact science? Or is Medicine?...)

[Expert's decision tree not recovered in extraction.]

5. Automatic Knowledge Acquisition

Data from experts, their observations/diagnostics, guide us.
Perfection is unreachable (even for experts; see X4 and the previous tree).
Goal: an equal (or better) successful prediction rate on unseen instances, compared with the human expert.

5. Automatic Knowledge Acquisition

The best tree for this data might be:

[Tree figure not recovered in extraction.]

It is smaller than the expert's tree (this is an advantage: Occam's Razor).
Both trees agree on the root and two branches.
The best (the only?) measure of quality is the prediction rate on unseen instances.
Later, we will use Information Theory to obtain trees as good as this one.

5. Automatic Knowledge Acquisition

For another problem, the following tree was produced from a set of training data. What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.

5. Automatic Knowledge Acquisition

What's wrong with this tree? Experts notice it. Non-experts do not, nor would a program.
The problem is, the training data had no cases of diabetic women on their first pregnancy who were renally insufficient. The tree DID cover all observed cases, but not all possible cases! The wrong recommendation would be given in these cases.

Deriving Rules from Decision Trees

5. Automatic Knowledge Acquisition
RULE EXTRACTION

Traversing the tree from root to leaves produces a set of rules.
We focus on rules for the class NO because there are fewer of them.
The other class is defined using negation by default (as in Prolog).

Simplification of a Rule
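
Rule extraction is just an enumeration of root-to-leaf paths. A minimal sketch, assuming the nested-dictionary tree representation used in the Random-Tree sketch above; it keeps only the paths ending in the class "no", leaving the other class to negation by default as the slide suggests. The example tree is hypothetical.

```python
def extract_rules(tree, target="no", conditions=()):
    """Collect (conditions, decision) pairs for every root-to-leaf path ending in `target`."""
    if not isinstance(tree, dict):                    # leaf: the decision itself
        return [(list(conditions), tree)] if tree == target else []
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules += extract_rules(subtree, target, conditions + ((attribute, value),))
    return rules

# Example: rules for the class "no"; the class "yes" is left to negation by default.
tree = {"Humidity": {"high": {"Wind": {"strong": "no", "weak": "yes"}}, "normal": "yes"}}
for conds, decision in extract_rules(tree):
    print(" & ".join(f"{a}={v}" for a, v in conds), "->", decision.upper())
```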

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION

Sometimes, we can delete conditions in our rules without affecting the results produced by the rules (compare the original rules with the simplified rules):
"pregnant" is implied by this being the patient's first pregnancy, so it can be dropped.
Dropping "renal insufficiency" actually improves the working of the rules, because those with renal insufficiency and a diabetic first pregnancy should also not be treated.

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION

A rule can be simplified by dropping some of its conditions, where dropping the conditions will not affect the decision the rule makes. There are two main ways to drop conditions:
1. The logical approach: where one condition is logically implied by another, the implied condition can be dropped:
   pregnant & first-pregnancy: BUT first-pregnancy implies pregnant!
   Age > 23 and Age > 42: BUT Age > 42 implies Age > 23!
2. The statistical approach: where one condition can be dropped without changing the decisions made by the rule over a set of data, drop it. OR BETTER: when dropping the condition leaves the decisions made unchanged OR IMPROVES them, drop it.

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION: the logical approach

Algorithm:
For each rule:
  For each condition:
    If another condition of this rule logically implies this one, then delete this one.
Logical implication can be derived from the training set: a condition X is implied by a condition Y if X is true whenever Y is true.
E.g., Age > 23 is true whenever Age > 52 is true.
E.g., pregnant is true whenever first-pregnancy is true.

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION: the statistical approach

We need a test data set: a set of data with the expert's classification which was NOT used to derive the set of rules. Thus, two sets of data: one to derive the rules, another to simplify them.
To test the precision of a rule set:
1. Set SCORE to 0.
2. For each case in the test set:
   Apply the rules to the case data to produce a conclusion.
   If the estimated conclusion is the same as the expert's, increment SCORE.
3. PRECISION = SCORE / number of cases.
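
The precision test translates almost line for line into code. A sketch under the same assumptions as above: each rule is a (conditions, conclusion) pair, each test case is a dictionary that also holds the expert's decision under "Play?", and a case matched by no rule falls back to a default class (negation by default). The helper names are ours.

```python
def apply_rules(rules, case, default="yes"):
    """Return the conclusion of the first rule whose conditions all hold for the case."""
    for conditions, conclusion in rules:
        if all(case.get(attr) == value for attr, value in conditions):
            return conclusion
    return default                       # no rule fired: negation by default

def precision(rules, test_cases, decision="Play?"):
    """SCORE / number of cases, where SCORE counts agreement with the expert."""
    score = sum(1 for case in test_cases
                if apply_rules(rules, case) == case[decision])
    return score / len(test_cases)
```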

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION: the statistical approach

To simplify rules:
1. Test the precision of the rules.
2. For each rule:
   For each condition of the rule:
     Make a copy of the rule set with this condition deleted.
     Test the precision of the new rule set on the test data.
     If the precision is equal to or better than the original precision:
       » Replace the original rule set with the copy.
       » Replace the original precision with this one.
       » Restart Step 2.
3. We get here when no more conditions can be deleted. The rules are maximally simple.

5. Automatic Knowledge Acquisition
RULE SIMPLIFICATION: the statistical approach

The decision tree from above made a mistake in that it does not deal with cases with renal insufficiency who are diabetic and on their first pregnancy. Assuming there are such cases in our test database, using the statistical approach would lead to the renal insufficiency condition being dropped from our first rules, as it would improve precision.
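
A sketch of the statistical simplification loop itself, reusing the hypothetical precision() helper from the previous snippet: whenever dropping one condition keeps or improves precision on the test set, the simplified copy replaces the rule set and the scan restarts, exactly as in steps 1 to 3 above.

```python
def simplify_rules(rules, test_cases):
    """Greedily drop conditions while test-set precision does not get worse."""
    best = precision(rules, test_cases)
    improved = True
    while improved:                                   # "restart Step 2" until nothing changes
        improved = False
        for i, (conditions, conclusion) in enumerate(rules):
            for j in range(len(conditions)):
                trial = list(rules)                   # copy of the rule set
                trial[i] = (conditions[:j] + conditions[j + 1:], conclusion)
                p = precision(trial, test_cases)
                if p >= best:                         # equal or better: keep the copy
                    rules, best, improved = trial, p, True
                    break
            if improved:
                break
    return rules
```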

Deleting Subsumed Rules

5. Automatic Knowledge Acquisition
RULE DELETION

More complete training data would produce a better tree:

[Tree figure not recovered in extraction.]

Is this tree better or worse? It is more complex (larger) and has redundancy. But for a doctor, it has better semantics. And for a machine, it has better predictive accuracy on the test set.

5. Automatic Knowledge Acquisition
RULE DELETION

Let's look at the rules from this case:
  (no renal insuff. & pregnant & diabetes & first-preg) -> NO
  (renal insuff. & press & pregnant & diabetes & first-preg) -> NO
  (renal insuff. & press) -> NO

5. Automatic Knowledge Acquisition

Simplifying these rules using logic gives:
  (no renal insuff. & diabetes & first-preg) -> NO
  (renal insuff. & press & diabetes & first-preg) -> NO
  (renal insuff. & press) -> NO
Looking at predictive accuracy, we see that deleting the renal insufficiency condition from the first rule does not change the predictions:
  (diabetes & first-preg) -> NO
  (renal insuff. & press & diabetes & first-preg) -> NO
  (renal insuff. & press) -> NO
Now, the second rule can't fire unless the first does. So, we can delete the second rule (it covers a subset of the cases of the first).

5. Automatic Knowledge Acquisition
RULE DELETION

As with deleting conditions from a rule, we can apply the same methods to deleting rules:
1. The logical approach: where one rule is logically implied by another, the implied rule can be dropped.
2. The statistical approach: where one rule can be dropped without worsening the predictive accuracy of the rule set as a whole, delete the rule.

Producing Optimal Decision Trees: ID3-Tree
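
The logical half of rule deletion can be written as a subsumption check: rule A implies rule B when B carries every condition of A (plus possibly more) and reaches the same conclusion, so B can never fire on a case that A does not already cover. A small sketch using the same (conditions, conclusion) representation assumed earlier; the function names are ours.

```python
def subsumes(rule_a, rule_b):
    """rule_a subsumes rule_b if rule_b has every condition of rule_a, plus possibly more."""
    conds_a, concl_a = rule_a
    conds_b, concl_b = rule_b
    return concl_a == concl_b and set(conds_a) <= set(conds_b)

def drop_subsumed(rules):
    """Delete every rule that is subsumed by another rule in the set."""
    return [r for i, r in enumerate(rules)
            if not any(i != j and subsumes(other, r)
                       for j, other in enumerate(rules))]
```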

5. Automatic Knowledge Acquisition
Pseudo-code to generate a decision tree

[Pseudo-code on the slide not recovered in extraction.]

5. Automatic Knowledge Acquisition
Selecting the best attribute for the root

Random-Tree selects an attribute at random for the root of the tree. This approach instead tries to select the best attribute for the root: we seek the attribute which most determines the expert's decision.
ID3 assesses each attribute in terms of how much it helps to make a decision. Using the attribute splits the cases into smaller subsets. The closer these subsets are to being purely one of the decision classes, the better. The formula used is called Information Gain.

5. Automatic Knowledge Acquisition
Information

Suppose we have a set of cases, and the expert judges whether to treat the patient or not. In 50% of the cases, the expert proposes treatment, and in the other 50% proposes no treatment. For a given new case, without looking at attributes, the probability of treatment is 50% (we have no information to favor treatment or not).
Now, assume we use an attribute to split our cases into two sets:
  Set 1: treatment recommended in 75% of cases
  Set 2: treatment recommended in 25% of cases
Now, in each subset, we have more information as to what decision to make -> Information Gain.

5. Automatic Knowledge Acquisition
How to Calculate the Information Gain of an Attribute

Firstly, we calculate the information contained before the split. The formula we use is H (for entropy):
  H(p, q) = -p * log2(p) - q * log2(q)
...where p is the probability of one decision and q is the probability of the reverse decision.
In our previous case, initially: p = 50%, q = 50%
  H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1   (no information)
  Information = 1 - Entropy = 1 - H(p, q)

5. Automatic Knowledge Acquisition
Alternative Formulas

Both give equal values. Values are always between 0 and 1.

5. Automatic Knowledge Acquisition
Special Cases

H(1/3, 2/3) = H(2/3, 1/3) = 0.92 bits
H(1/2, 1/2) = 1 bit (no information)
H(1, 0) = 0 bits (maximum information)
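
These special-case values are easy to verify numerically. A minimal sketch (entropy() is our own name for the H formula above):

```python
from math import log2

def entropy(p, q):
    """H(p, q) = -p*log2(p) - q*log2(q), treating 0*log2(0) as 0."""
    return sum(-x * log2(x) for x in (p, q) if x > 0)

print(round(entropy(1/3, 2/3), 2))   # 0.92 bits
print(entropy(0.5, 0.5))             # 1.0 bit  (no information)
print(entropy(1.0, 0.0))             # 0.0 bits (maximum information)
```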

5. Automatic Knowledge Acquisition
How to Calculate the Information Gain of an Attribute

Initially: p = 50%, q = 50%
  H(p, q) = -0.5 * log2(0.5) - 0.5 * log2(0.5) = 1   (no information)
Splitting the data, we get:
  H1(p, q) = -0.75 * log2(0.75) - 0.25 * log2(0.25) = 0.81
  H2(p, q) = -0.25 * log2(0.25) - 0.75 * log2(0.75) = 0.81
We derive the combined entropy of the two subsets by weighting each subset's entropy by the probability of that set. Let's assume the first set is 2/3 of the cases:
  Hnew(p, q) = 0.66 * H1(p, q) + 0.34 * H2(p, q) = 0.81

5. Automatic Knowledge Acquisition
How to Calculate the Information Gain of an Attribute

Given that the original entropy of the case data was 1.0, and the entropy of the cases divided by the attribute is 0.811, we have an information gain of 0.189.
The idea is, we look at each of the attributes in turn, and choose the attribute which gives us the largest gain in information.
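
The worked numbers can be reproduced the same way, and the same computation gives a best-attribute selector: weight each subset's entropy by its share of the cases and pick the attribute whose split lowers the entropy the most. Plugging such a selector into the recursive tree builder in place of the random choice is essentially what ID3 does. A sketch (entropy() is redefined here so the snippet runs on its own; best_attribute() and the "yes" class label are our own illustrative choices):

```python
from math import log2

def entropy(p, q):
    """H(p, q) = -p*log2(p) - q*log2(q), treating 0*log2(0) as 0."""
    return sum(-x * log2(x) for x in (p, q) if x > 0)

# Reproducing the worked example: a 50/50 set split 2/3 vs 1/3 into 75/25 and 25/75 subsets.
before = entropy(0.5, 0.5)                                          # 1.0
after = (2/3) * entropy(0.75, 0.25) + (1/3) * entropy(0.25, 0.75)   # ~0.811
print(round(before - after, 3))                                     # information gain ~0.189

def best_attribute(cases, attributes, decision="Play?"):
    """Pick the attribute whose split gives the lowest weighted entropy (largest gain)."""
    def weighted_entropy(attr):
        h = 0.0
        for value in {c[attr] for c in cases}:
            subset = [c for c in cases if c[attr] == value]
            p = sum(1 for c in subset if c[decision] == "yes") / len(subset)
            h += len(subset) / len(cases) * entropy(p, 1 - p)
        return h
    return min(attributes, key=weighted_entropy)
```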

5. Automatic Knowledge Acquisition
The Restaurant Case Revisited

[Four slides working through the restaurant example; the figures were not recovered in extraction.]
