Knowledge Representation. Model Selection and Assessment. (c) Marcin Sydow

Topics covered by this lecture: knowledge representation, decision rules, decision trees, the ID3 algorithm, model complexity, model selection and assessment, overfitting and methods of overcoming it, cross-validation.

Variety of ML models. There are many models available in machine learning: neural networks, decision trees, decision rules, support vector machines, and many others.

Neural networks as a black box. A multi-layer non-linear neural network is a powerful tool used in machine learning and AI. However, in a NN the learnt knowledge is encoded in the numerical values of weights and thresholds. Such an encoding is incomprehensible to humans and unsuitable for analysis. For this reason, NNs are considered an example of a so-called black-box model: given an input, it produces useful output, but its internal structure is impenetrable.

There are models in machine learning, other than NNs, that represent the learnt knowledge in a much more interpretable way. For example: decision rules and decision trees.

Example - medicine, in the raw form of a decision table:

  age             prescription   astigmatism  tear prod.  DECISION
  young           myope          no           reduced     NONE
  young           myope          no           normal      SOFT
  young           myope          yes          reduced     NONE
  young           myope          yes          normal      HARD
  young           hypermetrope   no           reduced     NONE
  young           hypermetrope   no           normal      SOFT
  young           hypermetrope   yes          reduced     NONE
  young           hypermetrope   yes          normal      HARD
  pre-presbyopic  myope          no           reduced     NONE
  pre-presbyopic  myope          no           normal      SOFT
  pre-presbyopic  myope          yes          reduced     NONE
  pre-presbyopic  myope          yes          normal      HARD
  pre-presbyopic  hypermetrope   no           reduced     NONE
  pre-presbyopic  hypermetrope   no           normal      SOFT
  pre-presbyopic  hypermetrope   yes          reduced     NONE
  pre-presbyopic  hypermetrope   yes          normal      NONE
  presbyopic      myope          no           reduced     NONE
  presbyopic      myope          no           normal      NONE
  presbyopic      myope          yes          reduced     NONE
  presbyopic      myope          yes          normal      HARD
  presbyopic      hypermetrope   no           reduced     NONE
  presbyopic      hypermetrope   no           normal      SOFT
  presbyopic      hypermetrope   yes          reduced     NONE
  presbyopic      hypermetrope   yes          normal      NONE

(the least compact form - each row is a separate case)

In the form of decision rules. Example of the first few decision rules automatically generated by a so-called covering algorithm for the problem above:

  IF tear production rate = reduced THEN recommendation = NONE
  IF age = young AND astigmatic = no AND tear production rate = normal THEN recommendation = SOFT
  IF age = presbyopic AND astigmatic = no AND tear production rate = normal THEN recommendation = SOFT
  IF age = presbyopic AND spectacle prescription = myope AND astigmatic = no THEN recommendation = NONE

Decision rules are convenient for analysis and are much more compact than the decision table. The covering algorithm, in each iteration, greedily covers the maximum possible number of still-uncovered cases, until some stop condition is met.
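The greedy step of such a covering algorithm can be sketched in a few lines of Python. This is an illustrative sketch only (the attribute names and the best_test helper are my own, not the lecture's); the data is the "young" subset of the lens table. The step picks the single attribute-value test whose covered cases are purest for the target class:

```python
# Minimal sketch of one step of a covering (rule-induction) algorithm:
# greedily pick the attribute-value test whose covered examples are
# purest for the target class. Dataset: the "young" rows of the lens table.

data = [
    # (age, prescription, astigmatism, tear_production, decision)
    ("young", "myope", "no", "reduced", "NONE"),
    ("young", "myope", "no", "normal", "SOFT"),
    ("young", "myope", "yes", "reduced", "NONE"),
    ("young", "myope", "yes", "normal", "HARD"),
    ("young", "hypermetrope", "no", "reduced", "NONE"),
    ("young", "hypermetrope", "no", "normal", "SOFT"),
]
ATTRS = ["age", "prescription", "astigmatism", "tear_production"]

def best_test(rows, target):
    """Return the (attribute, value) test with the highest precision
    for `target` among the rows it covers."""
    best, best_prec = None, -1.0
    for i, attr in enumerate(ATTRS):
        for value in {r[i] for r in rows}:
            covered = [r for r in rows if r[i] == value]
            prec = sum(r[-1] == target for r in covered) / len(covered)
            if prec > best_prec:
                best, best_prec = (attr, value), prec
    return best, best_prec

rule, precision = best_test(data, "NONE")
print(rule, precision)  # ('tear_production', 'reduced') 1.0
```

This recovers the first rule above (tear production rate = reduced => NONE) with precision 1.0; a full covering algorithm would then remove the covered cases and repeat until the stop condition.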

In the form of a decision tree. It is much more compact than the decision table. (Notice: it represents the whole decision table except 2 cases.)

ID3 algorithm for decision tree generation. In short: 1. an attribute is selected according to some criterion; 2. branches are created for the different values of that attribute; 3. steps 1 and 2 are repeated until the leaves are almost pure (only one category). Note: the more iterations, the higher the danger of overfitting. The attribute for branching is selected with regard to the following: high classification accuracy, and simplicity of the tree (notice the conflict of interests between the two).

Decision tree generation - example. Outdoor game:

  outlook   temperature  humidity  windy  PLAY?
  sunny     hot          high      false  no
  sunny     hot          high      true   no
  overcast  hot          high      false  yes
  rainy     mild         high      false  yes
  rainy     cool         normal    false  yes
  rainy     cool         normal    true   no
  overcast  cool         normal    true   yes
  sunny     mild         high      false  no
  sunny     cool         normal    false  yes
  rainy     mild         normal    false  yes
  sunny     mild         normal    true   yes
  overcast  mild         high      true   yes
  overcast  hot          normal    false  yes
  rainy     mild         high      true   no

Decision tree generation - example, cont. We have 4 attributes: outlook, temperature, humidity and windy. Which is the best?

How to choose the attribute to split: the ID3 algorithm. Intuitively, an attribute is better if it better distinguishes the categories (ideally, each leaf contains cases from only one category). More precisely, we can introduce some measure of the quality of a split on each possible attribute and choose the attribute for which this measure is best. There are many possible ideas: fractions of categories in the leaves, entropy, information gain. Which attributes are good in our example?
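To make the notion of split quality concrete, here is a short Python sketch (the helper names are my own) that computes the information gain of every attribute on the outdoor-game table above; ID3 splits on the attribute with the highest gain:

```python
# Information gain for each attribute of the "outdoor game" table.
# gain(A) = entropy(labels) - sum over values v of A of
#           (|subset_v| / |rows|) * entropy(labels of subset_v)
from math import log2
from collections import Counter

rows = [
    ("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
    ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
    ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
    ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
    ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
    ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
    ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no"),
]
ATTRS = ["outlook", "temperature", "humidity", "windy"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    gain = entropy([r[-1] for r in rows])
    for value in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

gains = {a: info_gain(rows, i) for i, a in enumerate(ATTRS)}
print(max(gains, key=gains.get))  # outlook
```

On this classic dataset, outlook has the highest gain (about 0.247 bits), so ID3 splits on it first; the overcast branch is already pure (all "yes").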

Model complexity. Usually, we can control the degree of so-called complexity of the model. For example: in neural networks, the complexity of the model increases with the number of neurons/layers (i.e. the more hidden neurons, the more complex the model); in a decision tree, the complexity increases with the number of leaves; in decision rules, with the number of rules; etc. Almost all models have some parameters that control their complexity.

Is complexity good? In general, the more complex the model, the more complicated concepts it can learn. However, there is a price for complexity: overfitting. Thus, unnecessary complexity should be avoided and simplicity preferred. For example, a neural network with fewer neurons should be preferred if it solves the given task sufficiently well.

Examples of too complex models: using a 100-leaf decision tree for the iris problem (described before) is unnecessary; using a multi-layer neural network with 100 neurons for modelling the XOR problem is not a good idea.

Why should the complexity of the model be controlled? Obviously, a model that is too simple cannot learn the concept; e.g. a single neuron can by no means learn the XOR problem. However, models that are too complex are also problematic: they are more difficult to train, and they can fit too perfectly to the training data (overfitting). Overfitting means that the model fits exactly to the training set without the ability to generalise, i.e. it achieves perfect performance on the data on which it was trained but does poorly on new, unseen examples. This is similar to learning data by heart without observing any general rules.
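This "learning by heart" effect can be shown with a tiny self-contained example (the dataset and the 1-nearest-neighbour model are my own illustration, not from the lecture): a 1-NN classifier reproduces its noisy training labels perfectly, and exactly those memorized noise cases become errors on unseen points.

```python
# Overfitting by memorization: 1-nearest-neighbour scores 100% on its
# own (noisy) training data, but the memorized noise hurts it on
# new, unseen points.
def nn_predict(train, x):
    # predict the label of the closest training point
    return min(train, key=lambda p: abs(p[0] - x))[1]

true_label = lambda x: 1 if x < 10 else 0
train = [(x, true_label(x)) for x in range(20)]
train[3] = (3, 0)    # label noise
train[15] = (15, 1)  # label noise

train_acc = sum(nn_predict(train, x) == y for x, y in train) / len(train)
test = [(x + 0.25, true_label(x + 0.25)) for x in range(20)]
test_acc = sum(nn_predict(train, x) == y for x, y in test) / len(test)
print(train_acc, test_acc)  # 1.0 0.9
```

The model "fits exactly" to the training set (accuracy 1.0), yet mispredicts precisely at the two memorized noisy cases on fresh data (accuracy 0.9).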

Dependence between model complexity and training/test error: overfitting. Overfitting is visible in the rightmost part of the graph (too complex model). As can be seen, the best balance (minimum test error) is found at intermediate complexity (statisticians call this the bias vs. variance balance). (Hastie, Tibshirani, Elements of Statistical Learning, p. 194)

Actually, selecting the appropriate model complexity is not the only task to be solved. There are two important tasks: model selection (choosing the appropriate model and its complexity level) and model assessment (predicting how well it will perform on new, unseen examples). If we measure performance only on the training data, it is overestimated (another view of the overfitting problem).

How to avoid overestimating the model's performance (equivalently: avoid overfitting)? If there is enough labelled data (training data), it is best to divide it into three different subsets: 1. training set (for teaching the model on data); 2. validation set (for model selection and complexity control); 3. test set (kept only for the final assessment of the model's future generalisation ability). There is no single rule for the proportions, but they can be 50%, 25% and 25%, respectively.
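A minimal sketch of such a 50/25/25 split in Python (the function name and the fixed seed are illustrative assumptions, not part of the lecture):

```python
# Shuffle once with a fixed seed, then cut into 50% / 25% / 25%.
import random

def split_data(data, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)  # reproducible shuffle
    n = len(data)
    train = data[: n // 2]
    valid = data[n // 2 : (3 * n) // 4]
    test = data[(3 * n) // 4 :]
    return train, valid, test

train, valid, test = split_data(range(100))
print(len(train), len(valid), len(test))  # 50 25 25
```

Only the training set is ever shown to the learner; the validation set drives model selection, and the test set is opened once, at the very end.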

If there is not enough data: cross-validation, leave-one-out, bootstrap. Cross-validation is the most popular.

Cross-validation makes it possible to achieve two seemingly conflicting goals: use the whole data for training (in some way), and avoid assessing the errors on the training examples. Randomly split the data into N non-overlapping parts. Repeat N times (once for each part): take the i-th part as the testing set (to compute the error) and the remaining N-1 parts as the training set. Average the error over the N iterations. Quite often N=10 (10-fold cross-validation).
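The procedure above can be sketched as follows. The fit/error callables are placeholders for any model and error measure; the toy "predict the training mean" regressor is my own illustration:

```python
# N-fold cross-validation: split into N non-overlapping parts, use each
# part once as the test set, train on the remaining N-1, average the error.
import random

def cross_val_error(data, n_folds, fit, error, seed=0):
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]  # non-overlapping parts
    errs = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit(train)
        errs.append(sum(error(model, x) for x in test) / len(test))
    return sum(errs) / n_folds  # average over the N iterations

# toy model: predict the mean of the training targets, squared error
fit = lambda train: sum(train) / len(train)
error = lambda m, x: (x - m) ** 2
print(cross_val_error(list(range(100)), 10, fit, error))
```

Every example serves once as test data and N-1 times as training data, so the whole dataset is used for training while no error is ever measured on an example the model was trained on.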

Questions/Problems: black-box models, knowledge representation, decision rules and the covering algorithm (idea), decision trees, model complexity, model selection and assessment, overfitting and overcoming it, training/testing/validation sets, cross-validation.

Thank you for your attention.