Theoretical Foundations of Active Learning
Steve Hanneke
Machine Learning Department, Carnegie Mellon University
shanneke@cs.cmu.edu
Passive Learning

[Diagram: the learning algorithm receives a batch of labeled data points from the data source (labeled by an expert/oracle) and outputs a classifier.]
Active Learning

[Diagram: the learning algorithm sequentially requests the label of one data point at a time from the expert/oracle, receives each label, and finally outputs a classifier.]
Active Learning (Sequential Design)

How many label requests are required to learn? This quantity is the label complexity.

e.g., Das04, Das05, DKM05, BBL06, Kaa06, Han07a&b, BBZ07, DHM07, BHW08
Active Learning Sometimes Helps

An example: 1-dimensional threshold functions.

[Diagram: points on a line, labeled - to the left of the threshold and + to the right.]
Active Learning Sometimes Helps

An example: 1-dimensional threshold functions.

Take m unlabeled examples. Repeatedly request the label of the median point between the known -/+ boundaries. Take any threshold consistent with the observed labels.

[Diagram: - - - - - - - - - - - + + + + + + +]

This uses only log(m) label requests, yet yields a classifier consistent with all m examples: an exponential improvement over passive!
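The binary-search strategy above can be sketched in a few lines. This is an illustrative implementation, not code from the talk; the function name, the oracle interface, and the assumption that points lie in [0, 1] are my own choices.

```python
def learn_threshold(xs, oracle):
    """Active learner for 1-d thresholds via binary search.

    xs: unlabeled points in [0, 1], sorted in increasing order.
    oracle(x): returns -1 below the target threshold, +1 at or above it.
    Returns (threshold_estimate, number_of_label_requests).
    """
    lo, hi = 0, len(xs)  # invariant: index of the first positive example is in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) < 0:
            lo = mid + 1  # xs[mid] is negative: the boundary lies to the right
        else:
            hi = mid      # xs[mid] is positive: the boundary is at mid or to the left
    # any threshold between xs[lo-1] and xs[lo] is consistent with
    # the labels of all m = len(xs) examples
    left = xs[lo - 1] if lo > 0 else 0.0
    right = xs[lo] if lo < len(xs) else 1.0
    return (left + right) / 2, queries
```

On m = 100 points this makes at most ceil(log2(100)) = 7 label requests, yet the returned threshold classifies all 100 points correctly.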
Outline

- Formal Model
- Analysis of Uncertainty-based Active Learning
- Strict Improvements Over Passive Learning
- Open Problems
Formal Model
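The definitions on this slide did not survive extraction. The following is a reconstruction of the standard setup used throughout this line of work, not the slide's exact content:

```latex
% Instance space \mathcal{X}, joint distribution \mathcal{D}_{XY} over \mathcal{X} \times \{-1,+1\}
% Concept class \mathbb{C} \subseteq \{ h : \mathcal{X} \to \{-1,+1\} \}
\mathrm{er}(h) = \mathbb{P}_{(X,Y) \sim \mathcal{D}_{XY}}\big( h(X) \neq Y \big),
\qquad
\nu = \inf_{h \in \mathbb{C}} \mathrm{er}(h)
```

The label complexity Λ(ν + ε, δ) is then the number of label requests sufficient for the algorithm's output ĥ to satisfy er(ĥ) ≤ ν + ε with probability at least 1 − δ.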
CAL

A simple idea from Cohn, Atlas & Ladner (1994): process unlabeled examples in sequence, and request a label only when the classifiers still consistent with the previously observed labels disagree on it.

Assuming ν = 0, CAL produces a perfectly labeled data set, which we can feed into any passive algorithm. So we get a natural fallback guarantee.

Can we characterize the label complexity achieved by CAL? Can we generalize it to handle label noise or non-separable data?
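A minimal sketch of CAL specialized to 1-d thresholds, where the version space is always an interval of thresholds. Illustrative only; the function name and the interval representation are assumptions of this example, not the talk's notation.

```python
def cal_thresholds(xs, oracle):
    """CAL specialized to 1-d thresholds h_t(x) = +1 iff x >= t, on [0, 1].

    Maintains the version space as the interval (lo, hi] of thresholds
    consistent with all labels seen so far. Queries the oracle only for
    points in the disagreement region (lo, hi); all other labels are
    inferred. Assumes the realizable case (nu = 0).
    Returns (fully labeled data, number of label requests).
    """
    lo, hi = 0.0, 1.0
    labeled, queries = [], 0
    for x in xs:
        if x <= lo:
            y = -1         # every consistent threshold labels x negative
        elif x >= hi:
            y = +1         # every consistent threshold labels x positive
        else:
            y = oracle(x)  # disagreement region: request the label
            queries += 1
            if y < 0:
                lo = x     # negative label: the threshold must exceed x
            else:
                hi = x     # positive label: the threshold is at most x
        labeled.append((x, y))
    return labeled, queries
```

Every point gets a correct label, but only the points that fell inside the shrinking disagreement region cost a label request.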
Disagreement Coefficient [Hanneke, 07]

For f in C and r > 0, let B(f, r) = {h in C : P(h(X) ≠ f(X)) ≤ r} be the ball of radius r around f, and let DIS(V) = {x : ∃ h, h' in V with h(x) ≠ h'(x)} be the region of disagreement of V.

The disagreement coefficient is θ = sup_{r > r_0} P(DIS(B(f, r))) / r (for our purposes, take r_0 = ε).

[Diagram: concepts in B(f, r) look like small perturbations of f; DIS(B(f, r)) is the region where some pair of them disagrees.]
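For 1-d thresholds under the uniform distribution on [0, 1], a threshold s has error |s − t| against target t, so B(f_t, r) is the set of thresholds within r of t and DIS(B(f_t, r)) is the interval (t − r, t + r). The ratio P(DIS(B(f, r)))/r is therefore 2 for every r, giving θ = 2. A small Monte Carlo check (illustrative code, not from the talk; the uniform distribution is an assumption of this example):

```python
import random

def disagreement_mass(t, r, n=200_000, seed=0):
    """Monte Carlo estimate of P(DIS(B(f_t, r))) for 1-d thresholds under
    the uniform distribution on [0, 1], where f_t(x) = sign(x - t).

    Since er(f_s) = |s - t|, we have B(f_t, r) = {f_s : |s - t| <= r},
    and its region of disagreement is the interval (t - r, t + r).
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if abs(rng.random() - t) < r)
    return hits / n

# theta = sup_r P(DIS(B(f, r))) / r; for thresholds the ratio is 2 at every r
ratio = disagreement_mass(0.5, 0.1) / 0.1
```

The estimated ratio comes out very close to 2, matching the analytic value of θ for this class.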
θ Characterizes CAL's Performance

In the realizable case, CAL's label complexity scales roughly as θ · d · polylog(1/ε), where d is the VC dimension.
What about Noise?
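The noise-tolerant bounds on this slide were lost in extraction. For context, agnostic disagreement-based methods (e.g., the A² algorithm of [BBL06] and its refinements) are known to achieve label complexity of roughly the following form; this is a reconstruction from the literature, not the slide's exact statement:

```latex
\Lambda(\nu + \epsilon, \delta)
  \;=\; \tilde{O}\!\left(
    \theta \left( \frac{\nu^2}{\epsilon^2} + 1 \right)
    d \, \mathrm{polylog}\!\left( \frac{1}{\epsilon \delta} \right)
  \right)
```

When ν is small relative to ε, this can be far below the passive minimax rate of order d/ε² recalled later in the talk.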
Activized Learning

[Diagram: an activizer meta-algorithm wraps a given passive learning algorithm (supervised or semi-supervised), sequentially requests labels from the expert/oracle, and outputs a classifier.]
Activized Learning

Are there general-purpose activizers that strictly improve the label complexity of any passive algorithm?
Formal Model
Uncertainty-based Sampling Doesn't Activize Intervals

[Diagram: the unit interval, with a positive interval in the middle: 0 ... - + - ... 1]
Uncertainty-based Sampling Doesn't Activize Intervals

Suppose the target labels everything -1:

[Diagram: 0 - - - - - - - - 1]

Uncertainty-based sampling requests every label. No improvement over passive.
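The failure can be seen concretely: for the intervals class, any single unlabeled point could fall inside some tiny positive interval that is still consistent with all observed negatives, so its label can never be inferred and a disagreement-based learner must query every point. A small sketch (illustrative only; the function name and the `eps` width check are artifacts of this example):

```python
def consistent_interval_exists(negatives, x, eps=1e-9):
    """For the class of intervals [a, b] on [0, 1] (label +1 inside,
    -1 outside), check whether some interval consistent with the observed
    negative points labels x positive. If so, x is still in the
    disagreement region of the version space.
    """
    # a sufficiently tiny interval around x avoids every observed negative
    return all(abs(q - x) > eps for q in negatives)

# all-negative target: after querying any finite set of points, every
# *other* point still lies in the disagreement region, so nothing can
# be inferred and every label must be requested
negatives = [i / 10 for i in range(11)]       # 11 observed negative points
fresh = [i / 10 + 0.05 for i in range(10)]    # 10 new, unqueried points
still_uncertain = all(consistent_interval_exists(negatives, x) for x in fresh)
```

Note the degenerate empty interval (everything -1) is also consistent, which is exactly why both labels remain possible at every fresh point.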
What's Wrong? (formally)
How Can We Fix It?
A Simple Activizer

So, for whichever of the 2^k classifications cannot be realized by V: look at the label of x and take the opposite.
This Works for Any C!

The [HLW94] passive algorithm has O(1/ε) sample complexity.
Dealing with Noise and Misspecification

Recall that passive learning gets O(1/ε²) (minimax).
Open Questions

- What can we activize with noise?
- Can we give more detailed bounds on Λ_a when θ ≫ 1?
- Is there a labeled/unlabeled trade-off under arbitrary D_XY?
Thank You
A Simple Activizer: Intervals Revisited

Again, suppose the target labels everything -1:

[Diagram: 0 - - - - - - - - ... x ... - - - - 1]

The passive algorithm is trained on Ω(n²) samples: improved label complexity.
Efficiency?

Let m = the number of unlabeled examples used by the algorithm. Suppose we can test separability of O(n) points in poly(n) time. Then SimpleActivizer runs in poly(n)·m time (plus the running time of the passive algorithm).

For most learning problems, we can set a poly(n) limit on m in the algorithm without losing our guarantees.
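For the intervals class, the separability test assumed above is straightforward. An illustrative O(n log n) consistency check (my naming and interface, not code from the talk):

```python
def interval_consistent(labeled):
    """Test whether some interval [a, b] on [0, 1] (+1 inside, -1 outside)
    is consistent with a list of (x, y) pairs, y in {-1, +1}.

    Runs in O(n log n) time: the smallest interval covering all positives
    is consistent iff it contains no negative point. Illustrates the kind
    of poly-time separability test the efficiency argument assumes.
    """
    pos = sorted(x for x, y in labeled if y > 0)
    if not pos:
        return True            # the empty interval fits all-negative data
    a, b = pos[0], pos[-1]     # smallest interval covering every positive
    return all(not (a <= x <= b) for x, y in labeled if y < 0)
```

Any class admitting such a test (e.g., via a consistency oracle or an ERM routine) supports the poly(n)·m running-time claim above.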