Active Learning. Yingyu Liang. Computer Sciences 760, Fall 2017.


Active Learning Yingyu Liang Computer Sciences 760 Fall 2017 http://pages.cs.wisc.edu/~yliang/cs760/ Some of the slides in these lectures have been adapted/borrowed from materials developed by Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Matt Gormley, Elad Hazan, Tom Dietterich, and Pedro Domingos.

Goals for the lecture. You should understand the following concepts: active learning; active SVM and uncertainty sampling; disagreement-based active learning; other active learning techniques.

The Classic Fully Supervised Learning Paradigm Is Insufficient Nowadays. Modern applications generate massive amounts of raw data, and only a tiny fraction can be annotated by human experts. Examples: protein sequences, billions of webpages, images.

Modern ML: New Learning Approaches. Modern applications: massive amounts of raw data. Active learning: techniques that best utilize the data, minimizing the need for expert/human intervention.

Batch Active Learning. Setup: a data source provides unlabeled examples drawn from an underlying data distribution D; the learning algorithm repeatedly sends the expert a request for the label of an example and receives a label for that example. The algorithm outputs a classifier w.r.t. D. The learner can choose the specific examples to be labeled. Goal: use fewer labeled examples [pick informative examples to be labeled].

Selective Sampling Active Learning (Online AL). Setup: a stream of unlabeled examples arrives from the underlying data distribution D; when each example x_i arrives, the learner decides whether to request its label y_i from the expert or to let it go. The algorithm outputs a classifier w.r.t. D. Goal: use fewer labeled examples [pick informative examples to be labeled].
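As a toy illustration of the stream-based protocol (not from the slides; the function name and the interval-tracking strategy are illustrative), here is a sketch for the simplest hypothesis class, thresholds on [0, 1]: the learner requests a label only when the arriving point's label is still uncertain.

```python
import random

def selective_sampling_1d(stream, oracle, lo=0.0, hi=1.0):
    """Stream-based active learning for a 1-D threshold h_w(x) = 1[x >= w].

    Keep an interval (lo, hi) that must contain the true threshold w.
    A label is requested only when the arriving point falls inside the
    interval (its label is still uncertain); otherwise we "let it go".
    """
    queries = 0
    for x in stream:
        if lo < x < hi:         # uncertain: request the label
            queries += 1
            if oracle(x) == 1:  # x >= w, so the threshold is at or below x
                hi = x
            else:               # x < w, so the threshold is above x
                lo = x
        # points outside (lo, hi) have labels we can already infer
    return (lo, hi), queries
```

On a uniform stream of 10,000 points, this typically requests only a few dozen labels while pinning the threshold down to a tiny interval.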

What Makes a Good Active Learning Algorithm? It is guaranteed to output a relatively good classifier for most learning problems, and it doesn't make too many label requests (hopefully far fewer than passive learning and semi-supervised learning). We need to choose the label requests carefully, to get informative labels.

Can adaptive querying really do better than passive/random sampling? YES! (sometimes) We often need far fewer labels for active learning than for passive. This is predicted by theory and has been observed in practice.

Can adaptive querying help? [CAL92, Dasgupta04] Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}. Active algorithm: get N unlabeled examples. How can we recover the correct labels with far fewer than N queries? Do binary search! We just need O(log N) labels. Output a classifier consistent with the N inferred labels. With N = O(1/ϵ), we are guaranteed to get a classifier of error ≤ ϵ. Passive supervised learning: Ω(1/ϵ) labels to find an ϵ-accurate threshold. Active: only O(log 1/ϵ) labels. Exponential improvement.
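The binary-search argument above can be sketched in a few lines (a hypothetical helper, assuming the N points are sorted and labels come from a threshold function):

```python
def labels_by_binary_search(xs, oracle):
    """Recover the labels of N sorted unlabeled points under a threshold
    function h_w(x) = 1[x >= w] using only O(log N) label queries."""
    queries = 0
    lo, hi = 0, len(xs)           # first index with label 1 lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) == 1:  # xs[mid] >= w: boundary is at or left of mid
            hi = mid
        else:                     # xs[mid] < w: boundary is right of mid
            lo = mid + 1
    return [0] * lo + [1] * (len(xs) - lo), queries
```

For N = 1000 points this makes at most 10 queries, yet every one of the 1000 labels is inferred correctly.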

Common Technique in Practice. Uncertainty sampling in SVMs is common and quite useful in practice. E.g., [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010; Schohn & Cohn, ICML 2000]. Active SVM Algorithm: at any time during the algorithm, we have a current guess w_t of the separator: the max-margin separator of all labeled points so far. Request the label of the example closest to the current separator.

Common Technique in Practice. Active SVM seems to be quite useful in practice. [Tong & Koller, ICML 2000; Jain, Vijayanarasimhan & Grauman, NIPS 2010] Algorithm (batch version). Input: S_u = {x_1, ..., x_mu} drawn i.i.d. from the underlying source D. Start: query for the labels of a few random x_i's. For t = 1, 2, ...: find w_t, the max-margin separator of all labeled points so far; request the label of the example closest to the current separator, i.e., the x_i minimizing |x_i · w_t| (highest uncertainty).
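A minimal sketch of the batch uncertainty-sampling loop (all names here are illustrative; to stay dependency-free, the max-margin SVM step is replaced by a perceptron run to convergence, so this is a stand-in rather than a true active SVM):

```python
import random

def train_linear(points, labels, max_epochs=1000):
    """Stand-in linear learner: a perceptron run to convergence.
    (The slides call for the max-margin separator; a real implementation
    would use an SVM solver instead.)"""
    w = [0.0] * (len(points[0]) + 1)      # last coordinate is the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):  # y in {-1, +1}
            score = sum(wi * xi for wi, xi in zip(w, x)) + w[-1]
            if (1 if score >= 0 else -1) != y:
                for i, xi in enumerate(x):
                    w[i] += y * xi
                w[-1] += y
                mistakes += 1
        if mistakes == 0:
            break
    return w

def margin(w, x):
    """Unsigned distance proxy: |w . x + b| (smaller = more uncertain)."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])

def uncertainty_sampling(pool, oracle, n_seed=4, n_queries=20):
    """Batch active-SVM sketch: start from a few random labels, then
    repeatedly refit and query the unlabeled point closest to the
    current separator (highest uncertainty)."""
    labels = {i: oracle(pool[i]) for i in random.sample(range(len(pool)), n_seed)}
    for _ in range(n_queries):
        w = train_linear([pool[i] for i in labels], list(labels.values()))
        best = min((i for i in range(len(pool)) if i not in labels),
                   key=lambda i: margin(w, pool[i]))
        labels[best] = oracle(pool[best])
    return train_linear([pool[i] for i in labels], list(labels.values())), labels
```

Note the design choice: the query rule only looks at distance to the current separator, which is exactly what makes it myopic (the sampling-bias caveat discussed below).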

Common Technique in Practice. Active SVM seems to be quite useful in practice. E.g., Jain, Vijayanarasimhan & Grauman, NIPS 2010: 20 Newsgroups dataset (20,000 documents from 20 categories).

Common Technique in Practice. Active SVM seems to be quite useful in practice. E.g., Jain, Vijayanarasimhan & Grauman, NIPS 2010: CIFAR-10 image dataset (60,000 images from 10 categories).

Active SVM/Uncertainty Sampling. Works sometimes. However, we need to be very, very careful! This myopic, greedy technique can suffer from sampling bias: a bias created by the querying strategy; as time goes on, the sample is less and less representative of the true data source. [Dasgupta10]

Active SVM/Uncertainty Sampling. This sampling bias has been observed in practice too! Main tension: we want to choose informative points, but we also want to guarantee that the classifier we output does well on truly random examples from the underlying distribution.

Safe Active Learning Schemes: Disagreement Based Active Learning (Hypothesis Space Search). [CAL92] [BBL06] [Hanneke'07, DHM'07, Wang'09, Fridman'09, Kolt'10, BHW'08, BHLZ'10, H'10, Ailon'12, ...]

Version Spaces. X: feature/instance space; distribution D over X; c*: target function. Fix a hypothesis space H. Definition (Mitchell'82). Assume the realizable case: c* ∈ H. Given a set of labeled examples (x_1, y_1), ..., (x_ml, y_ml) with y_i = c*(x_i), the version space of H is the part of H consistent with the labels so far. I.e., h ∈ VS(H) iff h(x_i) = c*(x_i) for all i ∈ {1, ..., m_l}.

Version Spaces. E.g.: the data lies on a circle in R^2 and H = homogeneous linear separators. (Figure: the current version space and the region of disagreement in data space.)

Version Spaces: Region of Disagreement. Definition (CAL'92). Version space: the part of H consistent with the labels so far. Region of disagreement = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space): for x ∈ X, x ∈ DIS(VS(H)) iff ∃ h_1, h_2 ∈ VS(H) with h_1(x) ≠ h_2(x). E.g.: the data lies on a circle in R^2 and H = homogeneous linear separators. (Figure: the current version space and the region of disagreement in data space.)
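For a concrete toy instance of these two definitions, take 1-D threshold functions on a finite grid (the helper names below are illustrative):

```python
def version_space(thresholds, labeled):
    """Hypotheses h_w(x) = 1[x >= w] consistent with every labeled
    example seen so far (the realizable case)."""
    return [w for w in thresholds
            if all((1 if x >= w else 0) == y for x, y in labeled)]

def disagreement_region(vs, xs):
    """Points whose label is still uncertain: some pair of hypotheses
    in the version space disagrees on them."""
    return [x for x in xs if len({1 if x >= w else 0 for w in vs}) > 1]
```

E.g., after seeing (0.25, 0) and (0.8, 1), only thresholds in (0.25, 0.8] survive, and the disagreement region is exactly the span those surviving thresholds cover.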

Disagreement Based Active Learning [CAL92]. Algorithm: pick a few points at random from the current region of uncertainty and query their labels. Stop when the region of uncertainty is small. Note: it is active since we do not waste labels by querying in regions of the space where we are already certain about the labels. (Figure: the current version space and the region of uncertainty.)

Disagreement Based Active Learning [CAL92]. Algorithm: query for the labels of a few random x_i's, and let H_1 be the current version space. For t = 1, 2, ...: pick a few points at random from the current region of disagreement DIS(H_t) and query their labels; let H_{t+1} be the new version space.
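The CAL loop above can be sketched for 1-D thresholds (illustrative names; this version queries one point at a time rather than a few):

```python
import random

def cal_threshold(thresholds, pool, oracle, budget=15):
    """Disagreement-based AL sketch for 1-D thresholds: query random
    points from the current region of disagreement; each answer prunes
    every hypothesis in the version space that disagrees with it."""
    vs = list(thresholds)
    queries = 0
    while queries < budget:
        dis = [x for x in pool
               if len({1 if x >= w else 0 for w in vs}) > 1]
        if not dis:
            break                  # no uncertainty left anywhere
        x = random.choice(dis)
        y = oracle(x)
        queries += 1
        vs = [w for w in vs if (1 if x >= w else 0) == y]
    return vs, queries
```

Two properties worth noting: in the realizable case the target hypothesis is never eliminated, and every query from the disagreement region removes at least one hypothesis, so labels are never wasted on points whose label is already determined.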

Region of uncertainty [CAL92]. Current version space: the part of C consistent with the labels so far. Region of uncertainty = the part of the data space about which there is still some uncertainty (i.e., disagreement within the version space). (Figure: the current version space and the region of uncertainty in data space.)

Region of uncertainty [CAL92]. As more labels are queried, the version space shrinks. (Figure: the new version space and the new region of disagreement in data space.)

Other Interesting AL Techniques Used in Practice. It is an interesting open question to analyze under what conditions they are successful.

Density-Based Sampling Centroid of largest unsampled cluster [Jaime G. Carbonell]

Uncertainty Sampling Closest to decision boundary (Active SVM) [Jaime G. Carbonell]

Maximal Diversity Sampling Maximally distant from labeled x's [Jaime G. Carbonell]

Ensemble-Based Possibilities Uncertainty + Diversity criteria Density + uncertainty criteria [Jaime G. Carbonell]
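The density and diversity criteria named above can each be sketched as a one-step query rule (simplified, illustrative readings; e.g., `density_query` assumes the clusters have already been computed by some clustering step):

```python
def density_query(clusters, sampled):
    """Density-based sampling (simplified reading of the slide): return
    the centroid of the largest cluster with no sampled point yet."""
    unsampled = [c for c in clusters if not any(p in sampled for p in c)]
    if not unsampled:
        return None
    largest = max(unsampled, key=len)
    dim = len(largest[0])
    return tuple(sum(p[d] for p in largest) / len(largest) for d in range(dim))

def diversity_query(unlabeled, labeled):
    """Diversity sampling: return the unlabeled point maximally distant
    from its nearest labeled point."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return max(unlabeled, key=lambda x: min(dist2(x, l) for l in labeled))
```

An ensemble rule would then combine scores, e.g. rank candidates by a weighted sum of uncertainty and density rather than either criterion alone.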

What You Should Know. Active learning can be really helpful and can provide exponential improvements in label complexity (both theoretically and practically). Common heuristics (e.g., those based on uncertainty sampling) need to be used very carefully due to sampling bias. Safe disagreement-based active learning schemes: understand how they operate precisely in the realizable case (noise-free scenarios).