Knowledge Engineering, Semester 2: Knowledge Acquisition and Inductive Learning


Knowledge Engineering, Semester 2, 2004-05
Michael Rovatsos, mrovatso@inf.ed.ac.uk
Lecture 2: Decision Trees
14th January 2005

Where are we?

Last time...
- we defined knowledge, KBS and KE
- looked at the KE process
- identified important building blocks of the KE process
Today...
- marks the beginning of the Knowledge Acquisition (KA) part of the module
- we will discuss methods for automating KA, in particular: inductive learning

Knowledge Acquisition

- Knowledge acquisition is generally considered the bottleneck in the KE process
- Informal methods:
  - expert interviews (between developers and experts)
  - analysis of organisational databases and documents
  - independent analysis of domain knowledge (textbooks, online documents, etc.)
- (Although inevitable,) these methods are complex, costly, and inflexible, so automation is desirable
- Discussion of machine learning methods, in particular: inductive (symbolic) learning

Inductive Learning

- Idea: we are provided with examples (x, f(x)), where f(x) is the correct value of the target function f for input x, and we want to learn f
- Task of inductive inference: given a collection of examples of f, return a function h that approximates f
- h is a hypothesis taken from a hypothesis space H
- (Pure) inductive inference assumes no prior knowledge
- Validation: construct/adjust h using a training set, then evaluate its generalisation capabilities on a separate test set
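The last bullet is the whole supervised-learning loop in miniature. Below is a minimal, self-contained Python sketch; the toy hypothesis space of threshold functions and all names in it are illustrative, not from the lecture:

    # Learn a boolean target f from examples (x, f(x)).
    # Hypothesis space H: threshold functions h_t(x) = (x >= t) for t in 0..10.

    def target(x):                       # the unknown f (hidden from the learner)
        return x >= 7

    training = [(x, target(x)) for x in [1, 3, 5, 8, 9]]
    test     = [(x, target(x)) for x in [2, 6, 7, 10]]

    def consistent(t, examples):
        # h_t is consistent if it agrees with every example seen so far.
        return all((x >= t) == y for x, y in examples)

    H = range(11)
    candidates = [t for t in H if consistent(t, training)]  # -> [6, 7, 8]
    h = min(candidates)                  # pick one consistent hypothesis

    accuracy = sum((x >= h) == y for x, y in test) / len(test)
    print(h, accuracy)                   # -> 6 0.75

Note that several hypotheses are consistent with the training set, and the one chosen here misclassifies a test point: agreement with the training data does not guarantee generalisation.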

Inductive Learning (continued)

- Inductive learning (IL) is a form of supervised learning: information about the output value f(x) of each input x is explicit
- Art of inductive learning: given a set of training examples, choose the best hypothesis h
- consistent: h agrees with all example data seen so far (not all learning algorithms return consistent hypotheses)
- H defines the range of functions we can use and determines the expressiveness of the hypothesis
- A learning problem is realisable if f ∈ H (often this is not known in advance)

Choosing Hypotheses

- Ockham's razor: prefer the simplest hypothesis consistent with the data
- Why is this a reasonable policy?
  - Intuitively: why choose a complex hypothesis if a simple one does the job?
  - There exist more long (i.e. more complex) hypotheses than short ones, so an accidental choice of a bad hypothesis that happens to be consistent with the data is less likely if the hypothesis is simple
- Problem: identifying what "simple" hypotheses are
- Trade-off: the more expressive the hypothesis space, the more examples are needed (and the more complex the learning algorithm)
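The trade-off in the last bullet can be made concrete by counting: over n boolean attributes there are 2^(2^n) distinct boolean functions, so an expressive hypothesis space leaves many hypotheses consistent with any small example set. A small illustrative sketch (not from the lecture):

    from itertools import product

    n = 2
    inputs = list(product([False, True], repeat=n))               # the 4 input rows
    functions = list(product([False, True], repeat=len(inputs)))  # all 16 truth tables

    # Three observed examples pin down 3 of the 4 rows; every function that
    # agrees on those rows is "consistent", whatever it does on the unseen row.
    examples = {(False, False): False, (False, True): True, (True, False): True}
    consistent = [f for f in functions
                  if all(f[inputs.index(x)] == y for x, y in examples.items())]
    print(len(functions), len(consistent))  # -> 16 2

Already for n = 10 the space has 2^1024 functions, which is why more expressive spaces need many more examples to narrow down.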

Describing IL Methods

- What kind of information do the examples offer?
  - How much training data is available? All at once?
  - What are their attributes, and what are those attributes' domains (boolean, discrete, continuous)?
  - What is the range of possible classifications?
  - Do we have to consider noise in the data?
- The hypothesis space:
  - Choice of the right representation
  - Questions of expressiveness vs. complexity
  - How can the learning result be used after learning?
- Choosing hypotheses:
  - Incremental vs. batch processing of examples
  - Refining an initial hypothesis vs. starting with none
  - What kind of inductive bias is applied?

Decision Trees

- Attribute-based classification learning:
  - input x: a situation/object described in terms of attribute values
  - output f(x): a discrete-valued classification decision
- Here: Boolean classification, each example is classified as positive (true) or negative (false)
- Alternatively: f describes an unknown concept, and all values of x for which f(x) = true describe the instances of this concept
- Hypothesis = a decision tree (DT) whose nodes correspond to tests on attribute values to decide whether f(x) is true or false

Example

Assume we are given a set of situations in which a customer will or will not wait in a restaurant (examples), i.e. the goal predicate is WillWait(x).

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      | T   | F   | F   | T   | Some | $$$   | F    | T   | French  | 0-10  | T
X2      | T   | F   | F   | T   | Full | $     | F    | F   | Thai    | 30-60 | F
X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger  | 0-10  | T
X4      | T   | F   | T   | T   | Full | $     | T    | F   | Thai    | 10-30 | T
X5      | T   | F   | T   | F   | Full | $$$   | F    | T   | French  | >60   | F
X6      | F   | T   | F   | T   | Some | $$    | T    | T   | Italian | 0-10  | T
X7      | F   | T   | F   | F   | None | $     | T    | F   | Burger  | 0-10  | F
X8      | F   | F   | F   | T   | Some | $$    | T    | T   | Thai    | 0-10  | T
X9      | F   | T   | T   | F   | Full | $     | T    | F   | Burger  | >60   | F
X10     | T   | T   | T   | T   | Full | $$$   | F    | T   | Italian | 10-30 | F
X11     | F   | F   | F   | F   | None | $     | F    | F   | Thai    | 0-10  | F
X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger  | 30-60 | T

Attributes:
- Alternate: Is there an alternative restaurant nearby?
- Bar: Is there a bar that makes waiting comfortable?
- Fri/Sat: true if the current day is Friday or Saturday
- Patrons: None or some people in the restaurant, or is it full?
- Raining: Is it raining outside?
- Reservation: Was a reservation made?
- Estimate: How long is the estimated waiting time?
- ... and some others (self-explanatory)

Assume this is the actual decision tree used by the person in question:

[Figure: the target decision tree. The root tests Patrons? (None -> No; Some -> Yes; Full -> test WaitEstimate? with branches >60, 30-60, 10-30, 0-10), with further tests on Alternate?, Hungry?, Reservation?, Bar?, Fri/Sat? and Raining? deeper in the tree.]
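One way to make the "hypothesis = decision tree" idea concrete is to encode a tree as nested data plus a recursive test procedure. The sketch below is illustrative only: it encodes just the top of the figure's tree, with a hypothetical stand-in subtree for the Full branch:

    # A decision tree as nested tuples: inner nodes are (attribute, branches),
    # leaves are classifications.
    tree = ("Patrons", {
        "None": False,
        "Some": True,
        "Full": ("Hungry", {"T": True, "F": False}),  # stand-in subtree, not the figure's
    })

    def classify(tree, example):
        # Walk the tree, testing one attribute per node until a leaf is reached.
        if not isinstance(tree, tuple):
            return tree
        attribute, branches = tree
        return classify(branches[example[attribute]], example)

    x1 = {"Patrons": "Some", "Hungry": "T"}
    print(classify(tree, x1))  # -> True, matching WillWait(X1) in the table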

Expressiveness

- What kind of logical constraints can DTs express?
- Consider the conjunction P_i of attribute values along each path leading to a Yes leaf, and the disjunction G = P_1 ∨ ... ∨ P_n over these conjunctions
- DTs can represent any formula of propositional logic. Example: each truth table row corresponds to one path, e.g. for A xor B:

  A B | A xor B
  F F | F
  F T | T
  T F | T
  T T | F

- It is easy to build a tree that is consistent with all examples, but will it be able to generalise?

Algorithm

- Iteratively build a tree by selecting the best attribute and adding descendant nodes for all its values
- If all examples on some branch have the same classification, no more decision steps are necessary (add a leaf node with this classification)
- If some examples are positive and some negative, choose a new attribute to discriminate between them
- If we run out of attributes, some examples have the same description but different classifications (noise); use a majority vote as a workaround
- If we run out of examples, no data is available for the current attribute value; use the majority value of the parent node

The Algorithm

Decision-Tree-Learning(examples, attribs, default)
  inputs: examples, a set of examples; attribs, a set of attributes;
          default, a default value for the goal predicate
  if examples is empty then return default
  else if all examples have the same classification then return this classification
  else if attribs is empty then return Majority-Value(examples)
  else
    best <- Choose-Attribute(attribs, examples)
    tree <- a new decision tree with root test best
    m <- Majority-Value(examples)
    for each value v_i of best do
      examples_i <- {elements of examples with best = v_i}
      subtree <- Decision-Tree-Learning(examples_i, attribs - {best}, m)
      add a branch to tree with label v_i and subtree subtree
    return tree

Heuristics

- Best way to obtain a compact decision tree: find attributes that split the example set into purely positive/negative subsets

[Figure: splitting the 12 examples on Patrons? (None/Some/Full) yields two homogeneous subsets and one mixed one, whereas splitting on Type? (French/Italian/Thai/Burger) leaves every subset with an even mix of positive and negative examples.]
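The pseudocode translates almost line-for-line into Python. Below is a minimal sketch, not the lecture's own code: examples are dicts of attribute values, the goal predicate is hardcoded as WillWait, trees reuse the (attribute, branches) tuples from the earlier sketch, and Choose-Attribute already uses the entropy-based Remainder measure that the next slides define. For brevity, branches are created only for attribute values actually observed in the data:

    import math
    from collections import Counter

    GOAL = "WillWait"  # goal predicate, hardcoded for this sketch

    def majority_value(examples):
        # Most common classification among the examples.
        return Counter(e[GOAL] for e in examples).most_common(1)[0][0]

    def entropy(examples):
        # I(p/(p+n), n/(p+n)) over the class distribution of the examples.
        counts = Counter(e[GOAL] for e in examples)
        total = sum(counts.values())
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def choose_attribute(attribs, examples):
        # Minimising Remainder(A) maximises Gain(A) (see the entropy slides).
        def remainder(a):
            return sum(len(sub) / len(examples) * entropy(sub)
                       for v in {e[a] for e in examples}
                       for sub in [[e for e in examples if e[a] == v]])
        return min(attribs, key=remainder)

    def decision_tree_learning(examples, attribs, default):
        if not examples:
            return default
        classes = {e[GOAL] for e in examples}
        if len(classes) == 1:                      # same classification -> leaf
            return classes.pop()
        if not attribs:                            # noise -> majority vote
            return majority_value(examples)
        best = choose_attribute(attribs, examples)
        m = majority_value(examples)
        branches = {}
        for v in {e[best] for e in examples}:      # one branch per observed value
            examples_i = [e for e in examples if e[best] == v]
            rest = [a for a in attribs if a != best]
            branches[v] = decision_tree_learning(examples_i, rest, m)
        return (best, branches)

    # A six-example subset of the restaurant table (Patrons and Type only):
    data = [
        {"Patrons": "Some", "Type": "French", GOAL: True},   # X1
        {"Patrons": "Full", "Type": "Thai",   GOAL: False},  # X2
        {"Patrons": "Some", "Type": "Burger", GOAL: True},   # X3
        {"Patrons": "None", "Type": "Burger", GOAL: False},  # X7
        {"Patrons": "None", "Type": "Thai",   GOAL: False},  # X11
        {"Patrons": "Full", "Type": "Burger", GOAL: True},   # X12
    ]
    print(decision_tree_learning(data, ["Patrons", "Type"], default=False))
    # -> ('Patrons', {'None': False, 'Some': True,
    #                 'Full': ('Type', {'Thai': False, 'Burger': True})})
    # (branch order may vary)

On this subset the learner picks Patrons? at the root, exactly as the heuristics slide suggests, and only needs Type? to resolve the mixed Full branch.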

Entropy-Based Measures

- Information-theoretic entropy can be used as a measure of the amount of information
- If v_1, ..., v_n are attribute values with probabilities P(v_i), the information content is

  I(P(v_1), ..., P(v_n)) = - Σ_{i=1}^{n} P(v_i) log_2 P(v_i)

- For example: I(0.5, 0.5) = 1 (bit), I(0.01, 0.99) = 0.08 (bits)
- Assume we have p positive and n negative examples; classifying a given example correctly requires I(p/(p+n), n/(p+n)) bits of information

Information Gain

- Attribute A splits the example set into n subsets E_i containing p_i positive and n_i negative examples
- How much information do we still need after this test? Assumption: an example has value v_i for the attribute in question with probability (p_i + n_i)/(p + n)
- Measure for the remaining "information-to-go":

  Remainder(A) = Σ_{i=1}^{n} ((p_i + n_i)/(p + n)) I(p_i/(p_i + n_i), n_i/(p_i + n_i))

- Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A) provides a measure of the information gain provided by A
- Heuristics: choose the A that maximises Gain(A)

Overfitting

- Problem: if the hypothesis space is large enough, there is a probability of finding meaningless regularities
  - Example: date-of-birth data as a predictor for getting an MSc in Informatics
- If the hypothesis overfits the learning data, it may be consistent with the examples but useless for generalisation purposes
- This is a general problem of all learning algorithms
- One way of dealing with overfitting: decision tree pruning (e.g. use significance tests to determine the irrelevance of attributes)

Validation

Typical validation for inductive learning methods:
- Split the example data into a training set and a test set
- Train the system with the training data
- Evaluate prediction accuracy on the test set
- Optionally: use cross-validation to prevent overfitting
  - Set a portion (e.g. 1/k) of the data aside
  - Conduct k experiments, using the left-out examples as the test set (and the remaining data as the training set)
  - Average performance over the k runs
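A short sketch that reproduces the numbers from the entropy slide (I(0.5, 0.5) = 1, I(0.01, 0.99) ≈ 0.08) and applies Gain to the Patrons? and Type? splits of the restaurant table; the (p_i, n_i) split counts are read off the example table, everything else is illustrative:

    import math

    def I(*probs):
        # I(P(v1), ..., P(vn)) = -sum_i P(vi) log2 P(vi); 0 log 0 taken as 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(I(0.5, 0.5))              # -> 1.0 bit
    print(round(I(0.01, 0.99), 2))  # -> 0.08 bits

    def gain(splits, p, n):
        # Gain(A) = I(p/(p+n), n/(p+n)) - Remainder(A),
        # where splits is a list of (pi, ni) pairs, one per value of A.
        remainder = sum((pi + ni) / (p + n) * I(pi / (pi + ni), ni / (pi + ni))
                        for pi, ni in splits)
        return I(p / (p + n), n / (p + n)) - remainder

    # Patrons? splits the 12 examples into None (0+,2-), Some (4+,0-), Full (2+,4-):
    print(round(gain([(0, 2), (4, 0), (2, 4)], p=6, n=6), 3))          # -> 0.541
    # Type? splits them into French (1+,1-), Italian (1+,1-), Thai (2+,2-), Burger (2+,2-):
    print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)], p=6, n=6), 3))  # -> 0.0

So Patrons? gains roughly half a bit while Type? gains nothing, which is why the heuristic prefers Patrons? as the root test. The k-fold cross-validation procedure on the Validation slide can likewise be written as a short skeleton; train_fn and accuracy_fn are hypothetical placeholders for any learner and scoring function:

    def cross_validate(examples, k, train_fn, accuracy_fn):
        folds = [examples[i::k] for i in range(k)]   # k disjoint folds
        scores = []
        for i in range(k):
            test = folds[i]                          # fold i is the test set
            train = [e for j, fold in enumerate(folds) if j != i for e in fold]
            scores.append(accuracy_fn(train_fn(train), test))
        return sum(scores) / k                       # average over the k runs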

Critique

- Many functions are not easy to represent with DTs (e.g. the majority function or mathematical functions)
- Best for problems with a limited number of attributes and attribute values
- Assumes examples are unambiguously and completely (no missing data) described/classified, i.e. a deterministic and fully observable environment
- No use of prior knowledge; learning can be very slow
- Is DTL (1) an incremental and/or (2) an anytime algorithm?
- Is this an adequate model of real learning?

Summary

- Inductive learning: inference of knowledge from examples
- Decision Trees: a simple yet effective method for attribute-based inductive inference
- Expressiveness vs. complexity, Ockham's Razor
- Entropy-based heuristics for attribute selection
- Problems of noise and overfitting
- Next lecture: version space learning