COMP9444: Neural Networks Committee Machines


COMP9444 09s2 Committee Machines 1 Committee Machines

COMP9444 09s2 Committee Machines 2 Motivation

If several classifiers are trained on (subsets of) the same training items, can their outputs be combined to produce a composite machine with better accuracy than the individual classifiers?

COMP9444 09s2 Committee Machines 3 Outline

Static structures (the combiner does not make direct use of the input):
- Ensemble Averaging
- Bagging
- Boosting

Dynamic structures (the combiner does make direct use of the input):
- Mixture of Experts
- Hierarchical Mixture of Experts

COMP9444 09s2 Committee Machines 4 Ensemble Experiment

Distinguish between two classes, each generated according to a Gaussian distribution:
- Class 1: µ_1 = (0, 0), σ_1² = 1
- Class 2: µ_2 = (2, 0), σ_2² = 4
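To make the setup concrete, here is a minimal NumPy sketch of how such a two-class data set might be sampled; the equal class split and the random seed are illustrative assumptions, not part of the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                   # patterns per data set, as in the experiment

# Class 1: mean (0, 0), variance 1 -> std 1;  Class 2: mean (2, 0), variance 4 -> std 2
x1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n // 2, 2))
x2 = rng.normal(loc=[2.0, 0.0], scale=2.0, size=(n - n // 2, 2))
X = np.vstack([x1, x2])
y = np.concatenate([np.zeros(n // 2, dtype=int), np.ones(n - n // 2, dtype=int)])
```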

COMP9444 09s2 Committee Machines 5 Ensemble Experiment

Ten neural networks (MLPs with 2 hidden nodes) were:
- trained on the same 500 patterns, each with different initial weights
- trained with the same learning rate and momentum
- tested on the same 500 (new) patterns
- deliberately overtrained as individual networks

Classifier   % correct
Net 1        80.65
Net 2        76.91
Net 3        80.06
Net 4        80.47
Net 5        80.44
Net 6        76.89
Net 7        80.55
Net 8        80.47
Net 9        76.91
Net 10       80.38

COMP9444 09s2 Committee Machines 6 Ensemble Experiment

The average probability of correct classification for the individual networks is 79.37%. If we instead base our classification on the sum of the outputs of the individual networks, the probability of correct classification rises, but only marginally, to 80.27%.

Question: Can we do better?
Answer: Yes, by feeding a different distribution of inputs to each classifier.
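As an illustration of basing the classification on the sum of the outputs, here is a minimal sketch; it assumes a list of already-trained models exposing a scikit-learn-style predict_proba method, which is an interface assumption rather than part of the lecture.

```python
import numpy as np

def ensemble_predict(nets, X):
    """Classify by summing the class-probability outputs of all trained networks.

    `nets` is assumed to be a list of fitted models with a predict_proba(X) method
    (e.g. sklearn.neural_network.MLPClassifier) -- an illustrative assumption.
    """
    summed = np.sum([net.predict_proba(X) for net in nets], axis=0)
    return np.argmax(summed, axis=1)      # pick the class with the largest summed output
```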

COMP9444 09s2 Committee Machines 7 Weak and Strong Learners

- A weak learner is one that is only guaranteed to achieve an error rate slightly less than what would be achieved by random guessing.
- A strong learner is one which can achieve an error rate arbitrarily close to zero, in the PAC learning sense.

Question: Can a weak learner be boosted into a strong learner, by applying it repeatedly to different subsets of the training data?
Answer: Yes!

COMP9444 09s2 Committee Machines 8 Boosting by Filtering

Assume you have access to an unlimited stream of training examples:
1. The first classifier C_1 is generated by applying the weak learner to n training examples.
2. C_1 is used as a filter to collect n new training examples. A fair coin is flipped: if it comes up heads, the next example from the stream that is incorrectly classified by C_1 is collected; if it comes up tails, the next example that is correctly classified by C_1 is collected.
3. A new classifier C_2 is generated by applying the weak learner to the collected training examples.
4. A third classifier is generated by applying the weak learner to a training sample of n examples, created by retaining just those examples which are classified differently by C_1 and C_2.
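A rough sketch of the two filtering steps above, assuming an endless iterator of (x, d) pairs and classifiers with a predict method (both interface assumptions for illustration):

```python
import random

def collect_for_c2(stream, c1, n):
    """Collect n training examples for C_2, using C_1 as a filter."""
    collected = []
    while len(collected) < n:
        heads = random.random() < 0.5          # fair coin flip
        for x, d in stream:                    # scan the stream for the next matching example
            misclassified = (c1.predict(x) != d)
            if misclassified == heads:         # heads -> want a mistake, tails -> want a correct one
                collected.append((x, d))
                break
    return collected

def collect_for_c3(stream, c1, c2, n):
    """Collect n examples on which C_1 and C_2 disagree, for training the third classifier."""
    collected = []
    while len(collected) < n:
        x, d = next(stream)
        if c1.predict(x) != c2.predict(x):
            collected.append((x, d))
    return collected
```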

COMP9444 09s2 Committee Machines 9 Boosting by Filtering

- Of the total number of items seen, only a subset are used for the actual training of the classifiers; the procedure filters out items that are easy to learn and focuses on those that are hard to learn.
- In the original work (Schapire, 1990) a voting mechanism was used to combine the classifiers, but it has later been shown that summing the outputs of the individual classifiers gives better performance.
- It can be proved that if the error rate for the individual classifiers is ε < 1/2, then the error rate for the committee machine is less than g(ε) = 3ε² − 2ε³.
- Therefore, by applying the boosting algorithm recursively, the error rate can be made arbitrarily close to zero.
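To see how the recursion drives the error down, here is a tiny numerical check of the bound; the starting error of 0.45 is just an illustrative value.

```python
def g(eps):
    """Committee error bound for base classifiers with error eps < 1/2."""
    return 3 * eps**2 - 2 * eps**3

eps = 0.45
for level in range(5):
    print(level, round(eps, 3))   # prints roughly 0.45, 0.425, 0.389, 0.336, 0.263
    eps = g(eps)                  # apply boosting one more level
```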

COMP9444 09s2 Committee Machines 10 Discussion

Boosting by Filtering has the drawback that it requires a huge number of training items. There are alternative algorithms which use fewer items, by judiciously re-using data:
- Bagging
- AdaBoost

COMP9444 09s2 Committee Machines 11 Bagging

- Start with a training set of N items.
- For each classifier, choose a set of N items from the original set with replacement; this means that some items can be chosen more than once, while others are left out.
- Train each classifier on its chosen items.
- Once all classifiers have been trained, new (test set) items are classified by majority vote, or by averaging the outputs of the individual classifiers for numerical outputs.
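A minimal sketch of this procedure, assuming a base_fit(X, y) routine that returns a fitted model whose predict method yields non-negative integer class labels (both assumptions for illustration):

```python
import numpy as np

def bagging_fit(base_fit, X, y, n_classifiers, seed=0):
    """Train each classifier on N items drawn from the training set with replacement."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, N, size=N)    # bootstrap sample: repeats allowed, some items left out
        models.append(base_fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify new items by majority vote over the individual classifiers."""
    votes = np.stack([m.predict(X) for m in models])   # shape (n_classifiers, n_items)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```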

COMP9444 09s2 Committee Machines 12 AdaBoost

Given: N training items (x_1, d_1), ..., (x_N, d_N).

Train a series of learners C_1, ..., C_T producing hypotheses f_1, ..., f_T, where the training items for C_n are chosen using distribution D_n.

Initialize: D_1(i) = 1/N for i = 1, ..., N.

Update: β_n = ε_n / (1 − ε_n), where ε_n is the training error of f_n, and

    D_{n+1}(i) = (D_n(i) / Z_n) × { β_n if f_n(x_i) = d_i; 1 otherwise }

where Z_n is a normalizing constant.

COMP9444 09s2 Committee Machines 13 AdaBoost

Output the final hypothesis:

    f(x) = sign( Σ_{n=1}^{T} f_n(x) log(1/β_n) )
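Putting the update rule and the final hypothesis together, here is a sketch under the assumption of ±1 labels and a base_fit(X, d, weights) routine that returns a hypothesis f with f(X) giving ±1 predictions (both interface assumptions, not part of the slides):

```python
import numpy as np

def adaboost(X, d, base_fit, T):
    """AdaBoost as on the slides, for labels d in {-1, +1}."""
    N = len(X)
    D = np.full(N, 1.0 / N)                     # D_1(i) = 1/N
    hypotheses, betas = [], []
    for _ in range(T):
        f = base_fit(X, d, D)                   # train f_n using distribution D_n
        correct = (f(X) == d)
        eps = D[~correct].sum()                 # weighted training error of f_n
        beta = eps / (1.0 - eps)
        D = D * np.where(correct, beta, 1.0)    # shrink weights of correctly classified items
        D = D / D.sum()                         # divide by Z_n so D_{n+1} is a distribution
        hypotheses.append(f)
        betas.append(beta)

    def final(Xq):
        score = sum(f(Xq) * np.log(1.0 / b) for f, b in zip(hypotheses, betas))
        return np.sign(score)                   # f(x) = sign( sum_n f_n(x) log(1/beta_n) )
    return final
```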

COMP9444 09s2 Committee Machines 14 AdaBoost Generalization

- The base learner for AdaBoost could be any kind of learner (neural networks, decision trees, stumps, ...).
- With AdaBoost, as with SVMs, the test error often continues to decrease even after the training error has already reached zero.
- This goes against the traditional conception of the bias-variance trade-off, Ockham's Razor and overfitting: although the number of free parameters is enormous, each additional degree of freedom is highly constrained.

COMP9444 09s2 Committee Machines 15 Sensitivity to Errors

- AdaBoost, like SVM, is very sensitive to mislabeled data.
- AdaBoost will assign enormous weight to incorrectly labeled items, and put huge effort into learning them.

COMP9444 09s2 Committee Machines 16 Mixture of Experts

COMP9444 09s2 Committee Machines 17 Mixture of Experts

- Each individual expert tries to approximate the target function on some subset of the input space.
- The gating network tries to learn which expert(s) are best suited to the current input.
- For each expert k, the gating network produces a linear function u_k of the inputs.
- The outputs g_1, ..., g_K of the gating network are computed using the softmax principle:

    g_k = exp(u_k) / Σ_j exp(u_j)

- In stochastic training, g_k is treated as the probability of selecting expert k; for soft training, it is treated as a mixing parameter for expert k.
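A sketch of the gating computation and a soft combination of experts; the experts are assumed to be callables mapping an input vector to an output, and V is a matrix whose k-th row holds the linear gating weights for expert k (both illustrative assumptions):

```python
import numpy as np

def gating_weights(x, V):
    """Softmax gating: u_k = V[k] . x, then g_k = exp(u_k) / sum_j exp(u_j)."""
    u = V @ x
    u = u - u.max()          # subtract the max for numerical stability (implementation detail)
    e = np.exp(u)
    return e / e.sum()

def moe_output(x, experts, V):
    """Soft combination: weight each expert's output by its gating value g_k."""
    g = gating_weights(x, V)
    return sum(g_k * expert(x) for g_k, expert in zip(g, experts))
```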

COMP9444 09s2 Committee Machines 18

COMP9444 09s2 Committee Machines 19 Hierarchical Mixture of Experts

COMP9444 09s2 Committee Machines 20 Hierarchical Mixture of Experts

- An HME can be trained either by maximum likelihood estimation or by the expectation maximization (EM) algorithm.
- The HME model is often seen as a soft version of decision trees.