Capacity, Learning, Teaching
Xiaojin Zhu
Department of Computer Sciences, University of Wisconsin-Madison
jerryzhu@cs.wisc.edu
Machine learning ↔ human learning
- Learning: capacity and generalization bounds
- Beyond supervised learning: semi-supervised, active
- Beyond learning: teaching
Capacity: VC dimension
- F: a family of binary classifiers
- VC dimension VC(F): the size of the largest set that F can shatter
- R(f): error of f in the future; R_n(f): error of f on a training set of size n
- With probability at least 1 − δ,
    sup_{f ∈ F} R(f) − R_n(f) ≤ 2 √( [2 VC(F) log n + VC(F) log(2e/VC(F)) + log(2/δ)] / n )
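The shattering requirement can be checked by brute force for a toy class. A minimal sketch (my own illustration, assuming 1-D threshold classifiers f_t(x) = 1 iff x ≥ t, a class not named on this slide): thresholds shatter any single point but no two-point set, so their VC dimension is 1.

```python
def shattered(xs):
    """Check whether 1-D thresholds f_t(x) = 1[x >= t] shatter the set xs."""
    xs = sorted(xs)
    # one candidate threshold per "gap" suffices for this monotone class
    ts = [xs[0] - 1.0] + [x + 1e-9 for x in xs]
    realizable = {tuple(int(x >= t) for x in xs) for t in ts}
    return len(realizable) == 2 ** len(xs)

print(shattered([0.3]))        # True: a single point is shattered
print(shattered([0.3, 0.7]))   # False: the labeling (1, 0) is unrealizable
```

The enumeration over all 2^n labelings is exactly the combinatorial cost the later slides complain about.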
Capacity: Rademacher complexity
- σ_1, …, σ_n: i.i.d. random signs with P(σ_i = 1) = P(σ_i = −1) = 1/2
- Rademacher complexity: Rad_n(F) = E_{σ,x} [ sup_{f ∈ F} (1/n) Σ_{i=1}^n σ_i f(x_i) ]
- With probability at least 1 − δ,
    sup_{f ∈ F} R(f) − R_n(f) ≤ 2 Rad_n(F) + √( log(2/δ) / (2n) )
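Rad_n(F) can be estimated numerically for simple classes. A sketch (my own illustration, not from the slides) for 1-D threshold classifiers, where the supremum over f is an exact scan over the n + 1 labelings the class induces on a sorted sample:

```python
import numpy as np

rng = np.random.default_rng(0)

def rademacher_thresholds(xs, n_trials=2000):
    """Monte Carlo estimate of Rad_n for f_t(x) = +1 if x >= t else -1."""
    n = len(xs)
    order = np.argsort(xs)
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)
        s = sigma[order]
        # prefix[k] = sum of the first k sigmas (the points f_t labels -1)
        prefix = np.concatenate(([0.0], np.cumsum(s)))
        # correlation at the threshold after sorted position k:
        #   (1/n) * (-prefix[k] + (prefix[-1] - prefix[k]))
        vals = (prefix[-1] - 2.0 * prefix) / n
        total += vals.max()
    return total / n_trials

xs = rng.uniform(0, 1, size=50)
print(round(rademacher_thresholds(xs), 2))  # small, and shrinks as n grows
```

The same Monte Carlo idea, with human learners standing in for the sup, is what the next slides measure.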
Machine learning ↔ human learning
- f: you categorize x by f(x)
- F: all the classifiers in your mind
- R_n(f): how you did in class; R(f): how well you can do outside class
- Capacity: can we measure it in humans?
- VC(F): too brittle (it suffices to find one shatterable set of size n) and combinatorial (verifying shattering is expensive)
- Other measures may behave better, e.g., Rad_n(F)
Measuring human Rademacher complexity
- Participants learn random labels (x_1, σ_1), …, (x_n, σ_n), e.g., (grenade, B), (skull, A), (conflict, A), (meadow, B), (queen, B)
- Estimate from the m hypotheses f̂^(1), …, f̂^(m) that subjects actually learn:
    Rad_n(F) ≈ (1/m) Σ_{j=1}^m (1/n) Σ_{i=1}^n σ_i^(j) f̂^(j)(x_i^(j))
- f̂ mnemonics: "a queen was sitting in a meadow and then a grenade was thrown (B = before), then this started a conflict ending in bodies & skulls (A = after)"
- f̂ wrong rules: (daylight, A), (hospital, B), (termite, B), (envy, B), (scream, B), "anything related to omitting [sic] light"; rape, killer, funeral vs. fun, laughter, joy
[Figures: estimated human Rademacher complexity as a function of n, for the two stimulus types]
Overfitting
[Figure: observed overfitting e − ê vs. the Rademacher bound, for Shape and Word stimuli]
- e: test-set error; ê: training-set error
- The generalization error bound holds
- Actual overfitting tracks the bound (nice, but not predicted by theory)
- The study of capacity may constrain cognitive models, and help us understand how groups differ in age, health, education, etc.
Human semi-supervised learning
- Humans learn supervised first, then see unlabeled test examples
- The decision boundary shifts toward a trough (low-density region) in the test data
- Can be explained by a variety of semi-supervised machine learning models
[Figures: left-shifted Gaussian-mixture test distribution over x, with the range of the supervised examples marked; percent of "class 2" responses on test 1 (all subjects) and test 2 (left-shifted and right-shifted subjects)]
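One such model is a Gaussian-mixture learner fit with EM, with the labeled points clamped. A sketch (with made-up numbers: supervised boundary at 0, unlabeled data from a left-shifted mixture) showing the learned boundary move left, toward the trough:

```python
import numpy as np

rng = np.random.default_rng(1)

def em_boundary(lx, ly, ux, sigma=0.5, n_iter=50):
    """EM for a 2-component, equal-variance, equal-prior Gaussian mixture,
    with the labeled examples' assignments clamped. Returns the decision
    boundary (midpoint of the two component means)."""
    x = np.concatenate([lx, ux])
    nl = len(lx)
    mu = np.array([lx[ly == 0].mean(), lx[ly == 1].mean()])
    for _ in range(n_iter):
        # E-step: responsibilities, shape (2, N)
        logp = np.stack([-(x - m) ** 2 / (2 * sigma ** 2) for m in mu])
        r = np.exp(logp - logp.max(axis=0))
        r /= r.sum(axis=0)
        for k in (0, 1):
            r[k, :nl] = (ly == k).astype(float)  # clamp labeled points
        # M-step: update the component means
        mu = (r * x).sum(axis=1) / r.sum(axis=1)
    return mu.mean()

labeled_x = np.array([-1.0, 1.0])     # one labeled example per class
labeled_y = np.array([0, 1])          # supervised-only boundary: 0
unlabeled = np.concatenate([rng.normal(-1.5, 0.4, 200),
                            rng.normal(0.5, 0.4, 200)])
print(em_boundary(labeled_x, labeled_y, unlabeled))  # lands left of 0
```

The boundary ends up near the trough of the unlabeled mixture rather than at the supervised midpoint, which is the qualitative effect observed in subjects.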
Human semi-supervised learning, the other way around
- Human unsupervised learning first: exposure to unlabeled x whose distribution over time forms a trough, a peak, a uniform, or a converging pattern
- ... influences the subsequent (identical) supervised learning task
[Figures: density over x vs. time for the four unlabeled-exposure conditions; mean accuracy ± std err on the supervised task for the trough, converge, uniform, and peak conditions]
Active learning
- Passive learning (slow):
    inf_{θ̂_n} sup_{θ ∈ [0,1]} E[ |θ̂_n − θ| ] ≥ (1/4) (1 + 2ε/(1 − 2ε)) (1/(n + 1))
- Active learning (fast):
    sup_{θ ∈ [0,1]} E[ |θ̂_n − θ| ] ≤ (1/2) (1/2 + √(ε(1 − ε)))^n
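The rate gap is easy to see numerically in the noiseless case (ε = 0): passive i.i.d. sampling localizes θ only to an interval of expected length about 1/n, while active bisection halves the interval on every query. A sketch with a made-up θ:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.637  # hidden threshold (made-up value)

def label(x):
    return 1 if x >= theta else 0  # noiseless oracle

def passive_estimate(n):
    # n i.i.d. uniform queries; guess the midpoint of the innermost pair
    xs = rng.uniform(0, 1, n)
    ys = np.array([label(x) for x in xs])
    lo = xs[ys == 0].max(initial=0.0)
    hi = xs[ys == 1].min(initial=1.0)
    return (lo + hi) / 2

def active_estimate(n):
    # bisection: each query halves the feasible interval
    lo, hi = 0.0, 1.0
    for _ in range(n):
        mid = (lo + hi) / 2
        if label(mid):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

n = 20
passive_err = np.mean([abs(passive_estimate(n) - theta) for _ in range(500)])
active_err = abs(active_estimate(n) - theta)
print(passive_err, active_err)  # roughly 1/n vs roughly 2**-n
```

With noise ε > 0 the same contrast survives, with the active base degrading from 1/2 to 1/2 + √(ε(1 − ε)) as in the bound above.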
Active learning in humans
[Figure: human estimation error vs. number of queries, under passive (top row) and active (bottom row) querying, at noise levels ε = 0, 0.05, 0.1, 0.2, 0.4]
Machine teaching
Example: a threshold classifier in 1D
- passive learning: (x_i, y_i) iid from p, risk O(1/n)
- active learning: risk ~ 2^{−n}
- taught: n = 2 examples suffice
Teaching dimension; curriculum learning
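For a continuous threshold, the teacher needs only two examples: one just on each side of θ. A sketch (made-up θ, and an assumed learner that takes the midpoint of the innermost oppositely labeled pair):

```python
theta = 0.637   # the threshold the teacher knows (made-up value)
gap = 1e-6      # how tightly the teacher brackets it

# the two-example teaching set: a 0 just below theta, a 1 just above
D = [(theta - gap / 2, 0), (theta + gap / 2, 1)]

# any consistent midpoint learner recovers theta to within gap/2
theta_hat = (D[0][0] + D[1][0]) / 2
print(abs(theta_hat - theta) <= gap / 2)  # True
```

Two examples achieve what takes a passive learner n = O(1/risk) examples: this is the teaching-dimension phenomenon the next slides formalize.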
Human teacher behaviors

strategy      graspability (n = 31)   lines (n = 32)
boundary      10%                     56%
curriculum    48%                     9%
linear        42%                     25%
positive      0%                      10%
A framework for teaching a Bayesian learner
1. World: p(x, y | θ*), loss function ℓ(f(x), y)
2. Learner: Bayesian
   - prior over Θ (θ ∈ Θ), likelihood p(x, y | θ)
   - maintains posterior p(θ | data) by Bayesian update
   - makes prediction f(x | data) using the posterior
3. Teacher: clairvoyant, knows everything above
   - can only teach by examples (x, y)
   - goal: choose the least-effort teaching set D = (x, y)_{1:n} to minimize the learner's future loss (risk):
       E_{θ*}[ ℓ(f(x | D), y) ] + effort(D)
   - if the future loss approaches the Bayes risk, D is a teaching set and n is the (generalized) teaching dimension
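A tiny instance of the framework (my own toy, not from the slides): a learner with a uniform prior over nine candidate thresholds and a deterministic likelihood, and a teacher who searches for the smallest example set whose posterior puts all mass on θ*. Two examples suffice, so the teaching dimension here is 2.

```python
import itertools

import numpy as np

hypotheses = list(np.round(np.arange(0.1, 1.0, 0.1), 1))  # candidate thresholds
theta_star = 0.6                                          # the world's parameter

def consistent(h, D):
    # deterministic likelihood: f_h(x) = 1 iff x >= h
    return all((x >= h) == bool(y) for x, y in D)

def posterior_support(D):
    # uniform prior + 0/1 likelihood: the posterior is uniform on consistent h
    return [h for h in hypotheses if consistent(h, D)]

# candidate examples the teacher may use, labeled by theta_star
pool = [(x, int(x >= theta_star)) for x in np.round(np.arange(0.05, 1.0, 0.1), 2)]

# least-effort teaching set: the smallest D whose posterior support is {theta_star}
teaching_set = None
for size in (1, 2):
    for D in itertools.combinations(pool, size):
        if posterior_support(list(D)) == [theta_star]:
            teaching_set = list(D)
            break
    if teaching_set:
        break
print(teaching_set)  # two examples bracketing theta_star
```

Here effort(D) is simply |D|, so minimizing effort means exhaustively trying set sizes in increasing order; richer effort functions fit the same search.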
References
- R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, and X. Zhu. Human active learning. In Advances in Neural Information Processing Systems (NIPS) 22, 2008.
- B. R. Gibson, T. T. Rogers, and X. Zhu. Human semi-supervised learning. Topics in Cognitive Science, 5(1):132–172, 2013.
- F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems (NIPS) 25, 2011.
- X. Zhu, T. Rogers, R. Qian, and C. Kalish. Humans perform semi-supervised classification too. In Twenty-Second AAAI Conference on Artificial Intelligence (AAAI-07), 2007.
- X. Zhu, T. T. Rogers, and B. Gibson. Human Rademacher complexity. In Advances in Neural Information Processing Systems (NIPS) 23, 2009.