Theoretical Foundations of Active Learning
Steve Hanneke
Machine Learning Department, Carnegie Mellon University
shanneke@cs.cmu.edu
Passive Learning

[Diagram: the learning algorithm receives a batch of labeled data points from the data source (labeled by an expert/oracle) and outputs a classifier.]
Active Learning

[Diagram: the learning algorithm sequentially requests the label of one data point at a time from the expert/oracle, receives each label, and finally outputs a classifier.]
Active Learning (Sequential Design)

How many label requests are required to learn? This quantity is the label complexity.

e.g., Das04, Das05, DKM05, BBL06, Kaa06, Han07a&b, BBZ07, DHM07, BHW08
Active Learning Sometimes Helps

An example: 1-dimensional threshold functions.

[Diagram: points on a line, labeled - to the left of the threshold and + to the right.]
Active Learning Sometimes Helps

An example: 1-dimensional threshold functions.

Take m unlabeled examples. Repeatedly request the label of the median point between the known -/+ boundaries. Take any threshold consistent with the observed labels.

[Diagram: - - - - - - - - - - - + + + + + + +]

This uses only log(m) label requests, yet yields a classifier consistent with all m examples: an exponential improvement over passive!
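The binary-search strategy above can be sketched in a few lines. This is an illustrative implementation, not code from the talk; the function name, the oracle interface, and the assumption that points lie in [0, 1] are my own choices.

```python
def learn_threshold(xs, oracle):
    """Active learner for 1-d thresholds via binary search.

    xs: unlabeled points in [0, 1], sorted in increasing order.
    oracle(x): returns -1 below the target threshold, +1 at or above it.
    Returns (threshold_estimate, number_of_label_requests).
    """
    lo, hi = 0, len(xs)  # invariant: index of the first positive example is in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(xs[mid]) < 0:
            lo = mid + 1  # xs[mid] is negative: the boundary lies to the right
        else:
            hi = mid      # xs[mid] is positive: the boundary is at mid or to the left
    # any threshold between xs[lo-1] and xs[lo] is consistent with
    # the labels of all m = len(xs) examples
    left = xs[lo - 1] if lo > 0 else 0.0
    right = xs[lo] if lo < len(xs) else 1.0
    return (left + right) / 2, queries
```

On m = 100 points this makes at most ceil(log2(100)) = 7 label requests, yet the returned threshold classifies all 100 points correctly.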
Outline

- Formal Model
- Analysis of Uncertainty-based Active Learning
- Strict Improvements Over Passive Learning
- Open Problems
Formal Model
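The definitions on this slide did not survive extraction. The following is a reconstruction of the standard setup used throughout this line of work, not the slide's exact content:

```latex
% Instance space \mathcal{X}, joint distribution \mathcal{D}_{XY} over \mathcal{X} \times \{-1,+1\}
% Concept class \mathbb{C} \subseteq \{ h : \mathcal{X} \to \{-1,+1\} \}
\mathrm{er}(h) = \mathbb{P}_{(X,Y) \sim \mathcal{D}_{XY}}\big( h(X) \neq Y \big),
\qquad
\nu = \inf_{h \in \mathbb{C}} \mathrm{er}(h)
```

The label complexity Λ(ν + ε, δ) is then the number of label requests sufficient for the algorithm's output ĥ to satisfy er(ĥ) ≤ ν + ε with probability at least 1 − δ.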
CAL

A simple idea from Cohn, Atlas & Ladner (1994): process unlabeled examples in sequence, and request a label only when the classifiers still consistent with the previously observed labels disagree on it.

Assuming ν = 0, CAL produces a perfectly labeled data set, which we can feed into any passive algorithm. So we get a natural fallback guarantee.

Can we characterize the label complexity achieved by CAL? Can we generalize it to handle label noise or non-separable data?
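A minimal sketch of CAL specialized to 1-d thresholds, where the version space is always an interval of thresholds. Illustrative only; the function name and the interval representation are assumptions of this example, not the talk's notation.

```python
def cal_thresholds(xs, oracle):
    """CAL specialized to 1-d thresholds h_t(x) = +1 iff x >= t, on [0, 1].

    Maintains the version space as the interval (lo, hi] of thresholds
    consistent with all labels seen so far. Queries the oracle only for
    points in the disagreement region (lo, hi); all other labels are
    inferred. Assumes the realizable case (nu = 0).
    Returns (fully labeled data, number of label requests).
    """
    lo, hi = 0.0, 1.0
    labeled, queries = [], 0
    for x in xs:
        if x <= lo:
            y = -1         # every consistent threshold labels x negative
        elif x >= hi:
            y = +1         # every consistent threshold labels x positive
        else:
            y = oracle(x)  # disagreement region: request the label
            queries += 1
            if y < 0:
                lo = x     # negative label: the threshold must exceed x
            else:
                hi = x     # positive label: the threshold is at most x
        labeled.append((x, y))
    return labeled, queries
```

Every point gets a correct label, but only the points that fell inside the shrinking disagreement region cost a label request.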
Disagreement Coefficient [Hanneke, 07]

For f in C and r > 0, let B(f, r) = {h in C : P(h(X) ≠ f(X)) ≤ r} be the ball of radius r around f, and let DIS(V) = {x : ∃ h, h' in V with h(x) ≠ h'(x)} be the region of disagreement of V.

The disagreement coefficient is θ = sup_{r > r_0} P(DIS(B(f, r))) / r (for our purposes, take r_0 = ε).

[Diagram: concepts in B(f, r) look like small perturbations of f; DIS(B(f, r)) is the region where some pair of them disagrees.]
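For 1-d thresholds under the uniform distribution on [0, 1], a threshold s has error |s − t| against target t, so B(f_t, r) is the set of thresholds within r of t and DIS(B(f_t, r)) is the interval (t − r, t + r). The ratio P(DIS(B(f, r)))/r is therefore 2 for every r, giving θ = 2. A small Monte Carlo check (illustrative code, not from the talk; the uniform distribution is an assumption of this example):

```python
import random

def disagreement_mass(t, r, n=200_000, seed=0):
    """Monte Carlo estimate of P(DIS(B(f_t, r))) for 1-d thresholds under
    the uniform distribution on [0, 1], where f_t(x) = sign(x - t).

    Since er(f_s) = |s - t|, we have B(f_t, r) = {f_s : |s - t| <= r},
    and its region of disagreement is the interval (t - r, t + r).
    """
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if abs(rng.random() - t) < r)
    return hits / n

# theta = sup_r P(DIS(B(f, r))) / r; for thresholds the ratio is 2 at every r
ratio = disagreement_mass(0.5, 0.1) / 0.1
```

The estimated ratio comes out very close to 2, matching the analytic value of θ for this class.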
θ Characterizes CAL's Performance

In the realizable case, CAL's label complexity scales roughly as θ · d · polylog(1/ε), where d is the VC dimension.
What about Noise?
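The noise-tolerant bounds on this slide were lost in extraction. For context, agnostic disagreement-based methods (e.g., the A² algorithm of [BBL06] and its refinements) are known to achieve label complexity of roughly the following form; this is a reconstruction from the literature, not the slide's exact statement:

```latex
\Lambda(\nu + \epsilon, \delta)
  \;=\; \tilde{O}\!\left(
    \theta \left( \frac{\nu^2}{\epsilon^2} + 1 \right)
    d \, \mathrm{polylog}\!\left( \frac{1}{\epsilon \delta} \right)
  \right)
```

When ν is small relative to ε, this can be far below the passive minimax rate of order d/ε² recalled later in the talk.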
Activized Learning

[Diagram: an activizer meta-algorithm wraps a given passive learning algorithm (supervised or semi-supervised), sequentially requests labels from the expert/oracle, and outputs a classifier.]
Activized Learning

Are there general-purpose activizers that strictly improve the label complexity of any passive algorithm?
Formal Model
Uncertainty-based Sampling Doesn't Activize Intervals

[Diagram: the unit interval, with a positive interval in the middle: 0 ... - + - ... 1]
Uncertainty-based Sampling Doesn't Activize Intervals

Suppose the target labels everything -1:

[Diagram: 0 - - - - - - - - 1]

Uncertainty-based sampling requests every label. No improvement over passive.
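The failure can be seen concretely: for the intervals class, any single unlabeled point could fall inside some tiny positive interval that is still consistent with all observed negatives, so its label can never be inferred and a disagreement-based learner must query every point. A small sketch (illustrative only; the function name and the `eps` width check are artifacts of this example):

```python
def consistent_interval_exists(negatives, x, eps=1e-9):
    """For the class of intervals [a, b] on [0, 1] (label +1 inside,
    -1 outside), check whether some interval consistent with the observed
    negative points labels x positive. If so, x is still in the
    disagreement region of the version space.
    """
    # a sufficiently tiny interval around x avoids every observed negative
    return all(abs(q - x) > eps for q in negatives)

# all-negative target: after querying any finite set of points, every
# *other* point still lies in the disagreement region, so nothing can
# be inferred and every label must be requested
negatives = [i / 10 for i in range(11)]       # 11 observed negative points
fresh = [i / 10 + 0.05 for i in range(10)]    # 10 new, unqueried points
still_uncertain = all(consistent_interval_exists(negatives, x) for x in fresh)
```

Note the degenerate empty interval (everything -1) is also consistent, which is exactly why both labels remain possible at every fresh point.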
What's Wrong? (formally)
How Can We Fix It?
A Simple Activizer

So, for whichever of the 2^k classifications cannot be realized by V: look at the label of x and take the opposite.
This Works for Any C!

The [HLW94] passive algorithm has O(1/ε) sample complexity.
Dealing with Noise and Misspecification

Recall that passive learning gets O(1/ε²) (minimax).
Open Questions

- What can we activize with noise?
- Can we give more detailed bounds on Λ_a when θ ≫ 1?
- Is there a labeled/unlabeled trade-off under arbitrary D_XY?
Thank You
A Simple Activizer: Intervals Revisited

Again, suppose the target labels everything -1:

[Diagram: 0 - - - - - - - - ... x ... - - - - 1]

The passive algorithm is trained on Ω(n²) samples: improved label complexity.
Efficiency?

Let m = the number of unlabeled examples used by the algorithm. Suppose we can test separability of O(n) points in poly(n) time. Then SimpleActivizer runs in poly(n)·m time (plus the running time of the passive algorithm).

For most learning problems, we can set a poly(n) limit on m in the algorithm without losing our guarantees.
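For the intervals class, the separability test assumed above is straightforward. An illustrative O(n log n) consistency check (my naming and interface, not code from the talk):

```python
def interval_consistent(labeled):
    """Test whether some interval [a, b] on [0, 1] (+1 inside, -1 outside)
    is consistent with a list of (x, y) pairs, y in {-1, +1}.

    Runs in O(n log n) time: the smallest interval covering all positives
    is consistent iff it contains no negative point. Illustrates the kind
    of poly-time separability test the efficiency argument assumes.
    """
    pos = sorted(x for x, y in labeled if y > 0)
    if not pos:
        return True            # the empty interval fits all-negative data
    a, b = pos[0], pos[-1]     # smallest interval covering every positive
    return all(not (a <= x <= b) for x, y in labeled if y < 0)
```

Any class admitting such a test (e.g., via a consistency oracle or an ERM routine) supports the poly(n)·m running-time claim above.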