Machine Learning for Language Technology

Machine Learning for Language Technology. Lecture 6: Ensemble Methods. Marina Santini, Uppsala University, Department of Linguistics and Philology. October 2013.

Where we are. Previous lectures covered various learning methods: decision trees, nearest neighbors, linear classifiers, structured prediction. This lecture: how to combine classifiers.

Combining Multiple Learners. Thanks to E. Alpaydin and Oscar Täckström.

Wisdom of the Crowd. Asked to guess the weight of an ox, a crowd's average guess comes close to the true weight, and is better than most individual members' guesses and even the cattle experts' guesses. Intuitively, this is the law of large numbers.
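
A minimal simulation of this idea (the weight, noise level, and crowd size below are made-up assumptions, not from the lecture): averaging many independent noisy guesses gives an estimate that beats most of the individual guesses.

```python
# Simulate the ox-weighing anecdote: the crowd average vs. individual guesses.
import numpy as np

rng = np.random.default_rng(0)
true_weight = 550.0                                   # hypothetical true weight (kg)
guesses = true_weight + rng.normal(0, 50, size=800)   # 800 noisy individual guesses

crowd_error = abs(guesses.mean() - true_weight)
individual_errors = np.abs(guesses - true_weight)

print(f"crowd average error: {crowd_error:.1f} kg")
print(f"share of individuals beaten by the crowd: "
      f"{(individual_errors > crowd_error).mean():.0%}")
```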

Definition. An ensemble of classifiers is a set of classifiers whose individual decisions are combined in some way to classify new examples (Dietterich, 2000).

Diversity vs. accuracy. For an ensemble of classifiers to be more accurate than any of its individual members, the individual classifiers composing it must be accurate and diverse: an accurate classifier is one whose error rate on new examples is better than random guessing; two classifiers are diverse if they make different errors on new data points.

Why it can be a good idea to build an ensemble. It is possible to build good ensembles for three fundamental reasons (Dietterich, 2000): 1. Statistical: there is too little data to single out the best hypothesis. 2. Computational: there is enough data, but local search gets stuck in local optima. 3. Representational: the true function f cannot be represented by any single hypothesis in H (weighted sums of hypotheses drawn from H may expand the representable space).

Distinctions. A base learner is an arbitrary learning algorithm that could be used on its own. An ensemble is a learning algorithm composed of a set of base learners; the base learners may be organized in some structure. However, the distinction is not completely clear-cut: e.g., a linear classifier is itself a combination of multiple simple learners, in the sense that each dimension can be seen as a simple predictor.

The main purpose of an ensemble is to maximise individual accuracy and diversity. Different learners can differ in their algorithms, hyperparameters, representations (modalities/views), training sets, or subproblems.

Practical Example

Rationale. No Free Lunch Theorem: there is no single algorithm that is always the most accurate in all situations. Instead, generate a group of base learners which, when combined, has higher accuracy.

Methods for Constructing Ensembles

Approaches. How do we generate base learners that complement each other? How do we combine the outputs of the base learners for maximum accuracy? Examples: voting, bootstrap resampling, bagging, boosting, AdaBoost, stacking, cascading.

Voting. The simplest way to combine base learners is voting, i.e. taking a (weighted) linear combination of their outputs.
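
In the standard textbook notation (this formula follows Alpaydin, 2010, rather than being reproduced from the slide): with L base learners d_j and y_i the combined output for class C_i,

\[
y_i = \sum_{j=1}^{L} w_j \, d_{ji}, \qquad w_j \ge 0, \quad \sum_{j=1}^{L} w_j = 1,
\]

and simple (unweighted) voting is the special case w_j = 1/L.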

Fixed Combination Rules
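
As a sketch of what such fixed rules look like in practice (the code and its names are illustrative assumptions, not from the slides): the usual choices combine the base learners' class-probability outputs by sum, median, max, min, or product, and then predict the argmax.

```python
# Fixed combination rules over base-learner class probabilities.
# `probs` has shape (L, n_classes); each rule reduces over the learner axis.
import numpy as np

def combine(probs: np.ndarray, rule: str = "sum") -> int:
    rules = {
        "sum":     probs.sum(axis=0),
        "median":  np.median(probs, axis=0),
        "max":     probs.max(axis=0),
        "min":     probs.min(axis=0),
        "product": probs.prod(axis=0),
    }
    return int(np.argmax(rules[rule]))

# Three base learners, three classes: the sum rule picks class 1.
probs = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.6, 0.3],
                  [0.4, 0.3, 0.3]])
print(combine(probs, "sum"))      # -> 1
```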

Bootstrap Resampling (Daumé, 2012: 150)
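
A minimal sketch of bootstrap resampling (the function and data below are illustrative assumptions, not from Daumé, 2012): draw a sample of N instances from a training set of size N uniformly with replacement.

```python
# Draw one bootstrap sample: N indices sampled with replacement.
import numpy as np

def bootstrap_sample(X: np.ndarray, y: np.ndarray, rng: np.random.Generator):
    n = len(X)
    idx = rng.integers(0, n, size=n)     # indices drawn with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)
Xb, yb = bootstrap_sample(X, y, rng)
# On average, roughly 63% of the original instances appear in a bootstrap sample.
print(len(np.unique(yb)) / len(y))
```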

Bagging (bootstrap + aggregating). Use bootstrapping to generate L training sets and train L base learners with an unstable learning procedure; during testing, take the average. In bagging, generating complementary base learners is left to chance and to the instability of the learning method. (An unstable algorithm is one for which a small change in the training set causes a large difference in the resulting base learners.)
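
A minimal sketch of bagging with an unstable base learner, decision trees (the toy dataset, ensemble size, and use of scikit-learn are illustrative assumptions); the bagging loop is written out to mirror the description above.

```python
# Bagging: L trees trained on bootstrap samples, combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
rng = np.random.default_rng(0)

L = 25
learners = []
for _ in range(L):
    idx = rng.integers(0, len(X), size=len(X))      # bootstrap sample
    learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# At test time, average the base learners' votes (majority vote for classification).
votes = np.stack([t.predict(X) for t in learners])   # shape (L, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the bagged ensemble:", (y_pred == y).mean())
```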

Boosting: weak learner vs. strong learner. In boosting, we actively try to generate complementary base learners by training the next learner on the mistakes of the previous learners. The original boosting algorithm (Schapire, 1990) combines three weak learners to generate a strong learner. A weak learner has error probability less than 1/2, which makes it better than random guessing on a two-class problem; a strong learner has arbitrarily small error probability.

Boosting (ii) [Alpaydin, 2010: 431]. Given a large training set, we randomly divide it into three parts X1, X2, X3. We use X1 to train d1. We then take X2 and feed it to d1: all instances of X2 misclassified by d1, plus as many instances on which d1 is correct, together form the training set of d2. We then take X3 and feed it to d1 and d2: the instances on which d1 and d2 disagree form the training set of d3. During testing, given an instance, we give it to d1 and d2; if they agree, that is the response, otherwise the response of d3 is taken as the output.
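
A minimal sketch of this three-learner scheme (the dataset, the choice of decision stumps as weak learners, and all variable names are illustrative assumptions, not from the slides).

```python
# Original boosting (Schapire, 1990) with three weak learners, following the
# description in Alpaydin (2010: 431).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, random_state=0)
X1, X2, X3 = np.array_split(X, 3)
y1, y2, y3 = np.array_split(y, 3)

d1 = DecisionTreeClassifier(max_depth=1).fit(X1, y1)

# d2: instances of X2 misclassified by d1, plus as many correctly classified ones.
p2 = d1.predict(X2)
wrong, right = np.where(p2 != y2)[0], np.where(p2 == y2)[0]
keep = np.concatenate([wrong, right[:len(wrong)]])
d2 = DecisionTreeClassifier(max_depth=1).fit(X2[keep], y2[keep])

# d3: instances of X3 on which d1 and d2 disagree.
disagree = d1.predict(X3) != d2.predict(X3)
d3 = (DecisionTreeClassifier(max_depth=1).fit(X3[disagree], y3[disagree])
      if disagree.any() else d1)        # degenerate case: d1 and d2 never disagree

def predict(X_new):
    a, b = d1.predict(X_new), d2.predict(X_new)
    return np.where(a == b, a, d3.predict(X_new))   # use d3 only on disagreements

print("ensemble training accuracy:", (predict(X) == y).mean())
```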

Boosting: drawback. Though it is quite successful, the disadvantage of the original boosting method is that it requires a very large training sample.

AdaBoost (adaptive boosting). AdaBoost uses the same training set over and over, so the training set need not be large. The classifiers must be simple so that they do not overfit. AdaBoost can combine an arbitrary number of base learners, not only three.

AdaBoost generates a sequence of base learners, each focusing on the previous one's errors. The probability (weight) of a correctly classified instance is decreased, and the probability of a misclassified instance is increased; this has the effect that the next classifier focuses more on the instances misclassified by the previous classifier. [Alpaydin, 2010: 432-433]

AdaBoost: testing. Given an instance, all the classifiers decide and a weighted vote is taken; the weights are proportional to the base learners' accuracies on the training set. The success of AdaBoost, i.e. its improved accuracy, is due to its property of increasing the margin: if the margin increases, the training instances are better separated and errors are less likely. (This aim is similar to that of SVMs.)
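
A minimal sketch of AdaBoost's weight update and weighted vote, assuming binary labels in {-1, +1} and decision stumps as the simple base learners (the dataset, number of rounds, and all names are illustrative assumptions).

```python
# AdaBoost.M1 with decision stumps: reweight instances, then take a weighted vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                                  # map {0, 1} -> {-1, +1}

n, T = len(X), 30
w = np.full(n, 1.0 / n)                          # instance weights
stumps, alphas = [], []

for _ in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()
    if err >= 0.5:                               # no better than random: stop
        break
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    # Decrease weights of correctly classified instances, increase the rest.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Testing: a vote weighted by each learner's alpha (its training-set accuracy).
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", (np.sign(scores) == y).mean())
```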

Stacking (i). In stacked generalization, the combiner f(·) is another learner and is not restricted to being a linear combination as in voting.

Stacking (ii). The combiner should learn how the base learners make errors: stacking is a means of estimating and correcting for the biases of the base learners. Therefore, the combiner should be trained on data that was not used to train the base learners.
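
A minimal sketch of stacked generalization using a simple hold-out split (cross-validated stacking is the more careful variant); the dataset, base learners, and combiner chosen here are illustrative assumptions, not from the slides.

```python
# Stacking: base learners trained on one half, the combiner trained on their
# outputs for the other half, so it can learn how the base learners make errors.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=600, random_state=0)
X_base, X_comb, y_base, y_comb = train_test_split(X, y, test_size=0.5, random_state=0)

# Level 0: diverse base learners trained on the first half.
base = [DecisionTreeClassifier(max_depth=3).fit(X_base, y_base),
        KNeighborsClassifier(5).fit(X_base, y_base)]

# Level 1: the combiner f(.) learns from base-learner probabilities on held-out data.
def meta_features(X_new):
    return np.hstack([b.predict_proba(X_new)[:, 1:] for b in base])

combiner = LogisticRegression().fit(meta_features(X_comb), y_comb)
# Training accuracy of the combiner, just to show that the pieces fit together.
print("stacked accuracy:", combiner.score(meta_features(X_comb), y_comb))
```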

Cascading. Use d_j only if the preceding learners are not confident. Cascade learners in order of increasing complexity.

Cascading is a multistage method: we use d_j only if all preceding learners are not confident. Associated with each learner is a confidence w_j such that we say d_j is confident of its output, and can be used, if w_j > θ_j (a threshold). The instances passed on to the next stage are the misclassified instances as well as the instances for which the previous learner's posterior is not high enough. Important: the idea is that an early, simple classifier handles the majority of instances, and a more complex classifier is used only for a small percentage of them, so it does not significantly increase the overall complexity.
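
A minimal sketch of a two-stage cascade at test time: a cheap classifier handles the instances it is confident about (maximum posterior above a threshold), and a more complex one handles the rest. The models, dataset, and threshold are illustrative assumptions.

```python
# Two-stage cascade: simple first stage, complex fallback for low-confidence cases.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, random_state=0)
d1 = LogisticRegression(max_iter=1000).fit(X, y)         # simple, cheap first stage
d2 = RandomForestClassifier(random_state=0).fit(X, y)    # complex second stage
theta = 0.9                                              # confidence threshold for d1

post1 = d1.predict_proba(X)
confident = post1.max(axis=1) > theta
y_pred = np.where(confident, post1.argmax(axis=1), d2.predict(X))

print("share handled by the simple classifier:", confident.mean())
print("cascade training accuracy:", (y_pred == y).mean())
```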

Summary. It is often a good idea to combine several learning methods. We want diverse classifiers, so that their errors cancel out. However, remember that ensemble methods do not get a free lunch either.

Example. In the case of arc-factored graph-based parsing, we relied on finding a spanning tree over a dense graph over the input: a dense graph is a graph that contains all possible arcs between words (wordforms); a spanning tree is a tree that has an incoming arc for each word.

Example: Ensemble MST Dependency Parsing

Conclusions. Combining multiple learners has been a popular topic in machine learning since the early 1990s, and research has been going on ever since. More recently, it has been noticed that ensembles do not always improve accuracy, and research has started to focus on the criteria that a good ensemble should satisfy and on how to form a good one.

Reading: Dietterich (2000); Alpaydin (2010), Ch. 17; Daumé (2012), Ch. 11.

Thanks for your attention!