Combining Multiple Models

Lecture Outline:
- Combining Multiple Models
- Bagging
- Boosting
- Stacking
- Using Unlabeled Data

Reading:
- Chapter 7.5, Witten and Frank, 2nd ed.
- Nigam, McCallum, Thrun & Mitchell. Text Classification from Labeled and Unlabeled Data using EM. Machine Learning, 39, pp. 103-134, 2000.

Combining Multiple Models
- When making critical decisions people usually consult several experts, rather than just one
- A model generated by an ML technique over some training data can be viewed as an expert
- Natural to ask: can we combine judgements of multiple models to get a decision that is more reliable than that of any single one on its own?
- Answer is: yes! (though not always)
- Disadvantage is that resulting combined models may be hard to understand/analyse

Why Combining Models Works
- Suppose (ideally) we have an infinite number of independent training sets of the same size, from which we train an infinite number of classifiers (using one learning scheme), which are used to classify a test instance via majority vote
- Such a combined classifier will still make errors, depending on how well the ML method fits the problem and on noise in the data
- If we were to average the error rate of the combined classifier across an infinite number of independently chosen test examples, we would arrive at the bias of the learning algorithm for the learning problem: the residual error that cannot be eliminated regardless of the number of training sets
- A second source of error arises from the use, in practice, of finite data sets, which inevitably are not fully representative of the entire instance population
- The average of this error over all training sets of a given size and all test sets is the variance of the learning method for the problem
- Total error is the sum of bias and variance (the bias-variance decomposition)
- Combining classifiers reduces the variance component of the error
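
As an added illustration (not on the original slide), the decomposition is easiest to state for squared error in a regression setting, with noisy target y = g(x) + epsilon and \hat{f}_D denoting the model trained on a random dataset D:

E_{D,\varepsilon}\big[(y - \hat{f}_D(x))^2\big]
  = \underbrace{\big(g(x) - E_D[\hat{f}_D(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[\big(\hat{f}_D(x) - E_D[\hat{f}_D(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma_\varepsilon^2}_{\text{noise}}

Averaging many models trained on independent datasets leaves the bias and noise terms alone but shrinks the variance term, which is why the idealised voting scheme above helps.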

Bagging
- Stands for bootstrap aggregating
- A process whereby a single classifier is constructed from a number of classifiers
- Each classifier is learned by applying a single learning scheme to multiple artificial training datasets that are derived from a single, original training dataset
- The artificial datasets are obtained by randomly sampling with replacement from the original dataset, creating new datasets of the same size
- The sampling procedure deletes some instances and replicates others
- E.g. a decision tree learner could be applied to k artificial datasets derived by this random sampling procedure, resulting in k decision trees
- The combined classifier works by applying each of the learned classifiers (e.g. the k decision trees) to novel instances and deciding their classification by majority vote
- For numeric prediction, final values are determined by averaging classifier outputs

A Bagging Algorithm

Model Generation
  Let n be the number of instances in the training data
  For each of t iterations:
    Sample n instances with replacement from training data
    Apply the learning algorithm to the sample
    Store the resulting model

Classification
  For each of the t models:
    Predict class of instance using model
  Return class that has been predicted most often

Bagging produces a combined model that often performs significantly better than a single model built from the original data set and never performs substantially worse.
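
A minimal Python sketch of this procedure, assuming scikit-learn and NumPy are available and using decision trees as the base learner. Function names such as bagging_fit are illustrative, not from the lecture, and non-negative integer class labels are assumed:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, t=25, random_state=0):
    """Train t trees, each on a bootstrap sample of the training data."""
    rng = np.random.RandomState(random_state)
    n = len(X)
    models = []
    for _ in range(t):
        idx = rng.randint(0, n, size=n)   # sample n instances with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify each instance by majority vote over the stored models."""
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # shape (t, n_instances)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

For numeric prediction, the voting step would simply be replaced by averaging, e.g. votes.mean(axis=0).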

Randomisation
- Bagging generates an ensemble of classifiers by introducing randomness into the learner's input
- Some learning algorithms have randomness built in
  - For example, perceptrons start out with randomly assigned connection weights which are then adjusted during training
  - One way to make such algorithms more stable is to run them several times with different random number seeds and combine classifier predictions by voting/averaging
- A random element can be added to most learning algorithms
  - E.g. for decision trees, instead of picking the best attribute to split on at each node, randomly pick one of the best n attributes
- Randomisation requires more work than bagging, because the learning algorithm has to be modified; however, it can be applied to a wider range of learners. For example:
  - Bagging fails with stable learners, those whose output is insensitive to small changes in the input, such as kNN
  - However, randomisation can be applied by, e.g., selecting different randomly chosen subsets of attributes on which to base the classifiers
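
As a concrete illustration of that last point, a hedged sketch of the random-subspace idea for a stable learner such as kNN, again assuming scikit-learn; the function names and the 50% subset size are arbitrary choices of this sketch, not from the lecture:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def random_subspace_fit(X, y, n_models=15, subset_frac=0.5, random_state=0):
    """Build an ensemble of kNN classifiers, each trained on a random subset of attributes."""
    rng = np.random.RandomState(random_state)
    n_features = X.shape[1]
    k = max(1, int(subset_frac * n_features))
    ensemble = []
    for _ in range(n_models):
        feats = rng.choice(n_features, size=k, replace=False)  # random attribute subset
        ensemble.append((feats, KNeighborsClassifier().fit(X[:, feats], y)))
    return ensemble

def random_subspace_predict(ensemble, X):
    """Majority vote over the ensemble (assumes non-negative integer class labels)."""
    votes = np.stack([m.predict(X[:, feats]) for feats, m in ensemble]).astype(int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)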

Boosting
- Like bagging, it works by combining, via voting or averaging, multiple models produced by a single learning scheme
- Unlike bagging, it does not derive models from artificially produced datasets generated by random sampling
  - Instead it builds models iteratively: each model takes into account the performance of the last
- Boosting encourages subsequent models to emphasise examples badly handled by earlier ones, building classifiers whose strengths complement each other
- In AdaBoost.M1 this is achieved by using the notion of a weighted instance:
  - Error is computed by taking into account the weights of misclassified instances, rather than just the proportion of misclassified instances
  - By increasing the weight of misclassified instances following the training of one model, the next model can be made to attend to these instances
  - Final classification is determined by weighted voting across all the classifiers, where weighting is based on classifier performance, in the AdaBoost.M1 case on the error of the individual classifiers

The AdaBoost.M1 Algorithm

Model Generation
  Assign equal weight to each training instance
  For each of t iterations:
    Apply the learning algorithm to the weighted dataset and store the resulting model
    Compute the error e of the model on the weighted dataset and store the error
    If e = 0 or e >= 0.5 then terminate model generation
    For each instance i in the dataset:
      If i is classified correctly by the model then weight_i <- weight_i * e/(1 - e)
    Normalise the weights of all instances so that their summed weight remains constant

Classification
  Assign a weight of zero to all classes
  For each of the t (or fewer) models:
    Add -log(e/(1 - e)) to the weight of the class predicted by the model
  Return the class with the highest weight
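
The pseudocode translates fairly directly into Python. The sketch below is a simplified illustration, not a reference implementation: it assumes scikit-learn, uses weighted fitting (sample_weight) rather than resampling, and takes decision stumps as the weak learner. The vote weight log((1-e)/e) is the same quantity as -log(e/(1-e)) above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1_fit(X, y, t=20):
    """AdaBoost.M1 sketch: reweight instances, store models and their vote weights."""
    n = len(X)
    w = np.full(n, 1.0 / n)                     # assign equal weight to each instance
    models, alphas = [], []
    for _ in range(t):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        e = w[pred != y].sum() / w.sum()        # weighted error on the training data
        if e == 0 or e >= 0.5:                  # terminate model generation
            break
        models.append(stump)
        alphas.append(np.log((1 - e) / e))      # classifier vote weight
        w[pred == y] *= e / (1 - e)             # down-weight correctly classified instances
        w /= w.sum()                            # normalise so summed weight stays constant
    return models, alphas

def adaboost_m1_predict(models, alphas, X, classes):
    """Weighted vote: add each model's weight to the class it predicts; classes lists the possible labels."""
    scores = np.zeros((len(X), len(classes)))
    for m, a in zip(models, alphas):
        pred = m.predict(X)
        for ci, c in enumerate(classes):
            scores[pred == c, ci] += a
    return np.array(classes)[scores.argmax(axis=1)]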

Boosting: Observations
- Boosting often performs substantially better than bagging
- Unlike bagging, which never produces a combined classifier that is substantially worse than a single classifier built from the same data, boosting can sometimes do so (overfitting)
- Interestingly, performing more boosting iterations after the error on the training data has dropped to zero can further improve performance on new test data
  - This seems to contradict Occam's razor (prefer the simpler hypothesis), since more iterations lead to a more complex hypothesis which does not explain the training data any better
  - However, more iterations improve the classifier's confidence in its predictions: the difference between the estimated probability of the true class and that of the most likely predicted class other than the true class (called the margin)
- Boosting allows powerful combined classifiers to be built from simple ones (provided they achieve less than 50% error on the reweighted data)
  - Such simple learners are called weak learners
  - Examples are learners such as decision stumps (one-level decision trees) or OneR (a single conjunctive rule)
- Good example: a Weka decision stump on the mushroom data; try it without, then with, boosting
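
A rough scikit-learn analogue of that Weka exercise (the mushroom data is not bundled with scikit-learn, so the breast-cancer dataset is used here purely as a stand-in; the keyword is named base_estimator rather than estimator in older scikit-learn releases):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)

# A single decision stump on its own ...
print("stump alone   :", cross_val_score(stump, X, y, cv=10).mean())

# ... versus the same weak learner inside AdaBoost
boosted = AdaBoostClassifier(estimator=stump, n_estimators=50)
print("boosted stumps:", cross_val_score(boosted, X, y, cv=10).mean())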

Stacking (1)
- Bagging and boosting combine multiple models produced by one learning scheme
- Stacking is normally used to combine models built by different learning algorithms
- Rather than simply voting, stacking attempts to learn which classifiers are the reliable ones, using a metalearner
- Inputs to the metalearner are instances built from the outputs of the level 0, or base level, models
  - These level 1 instances consist of one attribute for each level 0 learner: the class that level 0 learner predicts for the instance in question
  - From these instances the level 1 model makes the final prediction
- During training, the level 1 model is given instances which are the level 0 predictions for level 0 instances, plus the actual class of the instance
  - However, if the predictions of the level 0 learners over the data they were trained on are used, the result will be a metalearner trained to prefer classifiers that overfit the training data

Stacking (2)
- In order to avoid overfitting, the level 1 instances must be formed either from level 0 predictions over instances that were held out from level 0 training, or from predictions on the instances in the test folds, if cross-validation was used for training at level 0
- Stacking can be extended to deal with:
  - level 0 classifiers that produce probability distributions over output class labels
  - numeric prediction rather than classification
- While any ML algorithm could be used at level 1, simple level 1 algorithms such as linear regression have proved best
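
A compact sketch of this scheme, assuming scikit-learn: three different level 0 algorithms, level 1 instances built from cross-validated level 0 predictions (to respect the overfitting point above), and a simple linear model as the metalearner, here logistic regression since the task is classification. All names are illustrative:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def stacking_fit(X, y, cv=5):
    """Level 0: different learning algorithms. Level 1: a simple metalearner
    trained on cross-validated level 0 predictions, not on resubstitution predictions."""
    level0 = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]
    # Level 1 instances: one attribute per level 0 learner (its predicted class),
    # taken from the test folds of a cross-validation over the training data.
    meta_X = np.column_stack([cross_val_predict(m, X, y, cv=cv) for m in level0])
    level1 = LogisticRegression(max_iter=1000).fit(meta_X, y)
    level0 = [m.fit(X, y) for m in level0]      # refit the base models on all the data
    return level0, level1

def stacking_predict(level0, level1, X):
    """Level 0 models predict; the level 1 model makes the final prediction."""
    meta_X = np.column_stack([m.predict(X) for m in level0])
    return level1.predict(meta_X)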

Using Unlabeled Data
- Labeled training data, i.e. data with an associated target class, is always limited
  - it frequently requires extensive/expensive manual annotation or cleaning
- However, large amounts of unlabeled data may be readily available
  - pre-classified text is hard to get (e.g. catalogued news articles)
  - unclassified text is very easy to get
- Is there any way we can utilise unlabeled training data to improve a classifier?

Using Unlabeled Data: Clustering for Classification
- One possibility is to couple a probabilistic classifier, such as Naïve Bayes, with Expectation-Maximisation (EM) iterative probabilistic clustering
- Suppose we have labelled training data L plus unlabelled training data U. Proceed as follows:
  - train a Naïve Bayes classifier on L
  - repeat until convergence:
    - (E-step) use the current classifier to estimate the component mixture for each instance in U (i.e. the probability that each mixture component generated each instance)
    - (M-step) re-estimate the classifier using the estimated component mixture for each instance in L + U
  - output a classifier that predicts labels for unlabelled instances (after Nigam et al. 2000)
- Experiments show such a learner can attain equivalent performance to a traditional learner using less than 1/3 of the labeled training examples together with 5 times as many unlabeled examples
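
A hedged sketch of this loop for text-style data, assuming scikit-learn's MultinomialNB over term-count features. MultinomialNB has no direct interface for soft labels, so each unlabeled document is added once per class with a sample weight equal to its current posterior, which has the same effect in the M-step; a fixed number of iterations stands in for a convergence test:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def nb_em(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    """Naive Bayes + EM in the spirit of Nigam et al. (2000). Dense count matrices
    are assumed; for sparse matrices use scipy.sparse.vstack instead of np.vstack."""
    classes = np.unique(y_labeled)
    clf = MultinomialNB().fit(X_labeled, y_labeled)          # initial classifier on L
    for _ in range(n_iter):
        # E-step: posterior P(class | doc) for every unlabeled instance
        post = clf.predict_proba(X_unlabeled)                # columns follow sorted classes
        # M-step: retrain on L plus one weighted copy of each unlabeled doc per class
        X_all = np.vstack([X_labeled] + [X_unlabeled] * len(classes))
        y_all = np.concatenate(
            [y_labeled] + [np.full(len(X_unlabeled), c) for c in classes])
        w_all = np.concatenate(
            [np.ones(len(X_labeled))] + [post[:, i] for i in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf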

Using Unlabeled Data: Co-training
- Suppose there are two independent perspectives or views (feature sets) on a classification task. E.g. for web page classification:
  - the web page's content
  - links to the web page from other pages
- Co-training exploits these two perspectives:
  - Train model A using perspective 1 on the labelled data
  - Train model B using perspective 2 on the labelled data
  - Label the unlabeled data using model A and model B separately
  - For each model, select the example it most confidently labels positively and the one it most confidently labels negatively, and add these to the pool of labeled examples
  - Repeat the whole process, training both models on the augmented pool of labeled examples, until there are no more unlabeled examples
- There is some experimental evidence to indicate that co-training using Naïve Bayes as the learner outperforms an approach which learns a single model using all features from both perspectives
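
A sketch of this loop for a binary task, again assuming scikit-learn's MultinomialNB. X1 and X2 are the two feature views of the same instances (e.g. page-text counts and link-text counts); the convention that unlabeled instances carry the label -1 is an implementation choice of this sketch, not part of the lecture:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(X1, X2, y, n_rounds=30):
    """Co-training sketch: y holds 0/1 for labeled instances and -1 for unlabeled ones
    (y is modified in place as instances are moved into the labeled pool)."""
    labeled = list(np.where(y != -1)[0])
    pool = set(np.where(y == -1)[0])
    a = b = None
    for _ in range(n_rounds):
        if not pool:
            break
        a = MultinomialNB().fit(X1[labeled], y[labeled])   # model A, perspective 1
        b = MultinomialNB().fit(X2[labeled], y[labeled])   # model B, perspective 2
        for model, X in ((a, X1), (b, X2)):
            idx_u = sorted(pool)
            if not idx_u:
                break
            proba = model.predict_proba(X[idx_u])
            pos = idx_u[int(np.argmax(proba[:, 1]))]       # most confidently positive
            neg = idx_u[int(np.argmax(proba[:, 0]))]       # most confidently negative
            y[pos], y[neg] = 1, 0
            for pick in {pos, neg}:                        # move into the labeled pool
                labeled.append(pick)
                pool.discard(pick)
    return a, b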

Using Unlabeled Data: Co-EM
- Co-EM:
  - trains model A using perspective 1 on the labeled data
  - uses model A to probabilistically label all the unlabeled data
  - trains model B using perspective 2 on the original labeled data + the unlabeled data tentatively labeled using model A
  - uses model B to probabilistically relabel all the data for use in retraining model A
  - the process iterates until the classifiers converge
- Co-EM appears to perform consistently better than co-training (because it does not commit to class labels, but re-estimates their probabilities at each iteration)
- Co-training/co-EM are limited to applications where multiple perspectives on the data are available
  - there is some recent evidence that this split perspective can be artificially manufactured (e.g. by random selection of features, though feature independence is preferred)
  - there are some recent arguments/evidence that co-training using models derived by different classifiers (instead of from different feature sets) also works
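
For contrast with co-training, a sketch of co-EM under the same assumptions (scikit-learn's MultinomialNB, two views, soft labels handled via per-class sample weights as in the Naïve Bayes + EM sketch earlier; a fixed iteration count stands in for the convergence test):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def fit_soft(X_l, y_l, X_u, post):
    """Fit Naive Bayes on labeled data plus soft-labeled unlabeled data:
    each unlabeled instance enters once per class, weighted by its posterior."""
    classes = np.unique(y_l)
    X_all = np.vstack([X_l] + [X_u] * len(classes))
    y_all = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
    w_all = np.concatenate(
        [np.ones(len(X_l))] + [post[:, i] for i in range(len(classes))])
    return MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

def co_em(X1_l, X2_l, y_l, X1_u, X2_u, n_iter=10):
    """Co-EM sketch: model A (view 1) soft-labels the unlabeled data for model B
    (view 2), which relabels it for A; repeat for a fixed number of iterations."""
    a = MultinomialNB().fit(X1_l, y_l)                       # model A on labeled data, view 1
    for _ in range(n_iter):
        b = fit_soft(X2_l, y_l, X2_u, a.predict_proba(X1_u))  # B trained on A's soft labels
        a = fit_soft(X1_l, y_l, X1_u, b.predict_proba(X2_u))  # A retrained on B's soft labels
    return a, b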

Summary
- Multiple learned models may be combined in various ways to produce classifiers whose performance is superior to a single model on its own:
  - Bagging trains multiple models using a single learning scheme on multiple training sets artificially derived from a single data set through random deletion and repetition of instances; final classification is arrived at by simple majority voting
  - Boosting builds multiple models using a single learning scheme iteratively over a single data set, where the instances are re-weighted between iterations so that subsequent models pay more attention to instances misclassified by earlier models; final classification is arrived at by weighted voting of all classifiers, each vote weighted by classifier performance
  - Stacking combines the models built by different learning schemes by training a metalearner to decide amongst the predictions of the base level learners
- Unlabeled data can be utilised to improve the performance of classifiers, or to allow them to attain equivalent performance using less labeled (expensive) training data. Approaches include:
  - learning over probabilistically clustered unlabeled data (Naïve Bayes + EM)
  - co-training and co-EM, which assume different perspectives (feature views) over the same data, with models/estimates iteratively improved over the unlabeled data