Do we need more training data or better models for object detection?
Xiangxin Zhu, Carl Vondrick, Deva Ramanan, Charless Fowlkes
University of California, Irvine
Appeared in BMVC 2012. Slides adapted from Charless Fowlkes.
Motivations
Current state of object recognition
[Figure: average AP vs. year (2006-2011), and average AP vs. average number of training samples per class (400-1400)]
The PASCAL VOC detection challenge provides a realistic benchmark of object detection performance. Performance has steadily increased!
Current state of object recognition
[Same figures] Performance has steadily increased... but so has the amount of training data??
Bayes Risk
[Figure: overlapping class-conditional densities P(X | face) and P(X | background)]
The feature space may limit our ultimate classification performance.
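A toy illustration of this point (my own sketch, not from the paper): with overlapping class-conditional densities, no classifier on this feature can beat the overlap integral of the joint densities. All names here are illustrative; 1-D Gaussians stand in for the real feature space.

```python
# Sketch: Bayes error for two 1-D Gaussian class-conditionals.
# No classifier on feature x can do better than this overlap integral.
import numpy as np

def bayes_error(mu_face, mu_bg, sigma=1.0, prior_face=0.5,
                lo=-10.0, hi=10.0, n=100001):
    """Riemann-sum approximation of integral of min(P(x,face), P(x,bg)) dx."""
    x = np.linspace(lo, hi, n)
    gauss = lambda mu: (np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                        / np.sqrt(2 * np.pi * sigma ** 2))
    joint_face = prior_face * gauss(mu_face)
    joint_bg = (1 - prior_face) * gauss(mu_bg)
    dx = x[1] - x[0]
    return float(np.minimum(joint_face, joint_bg).sum() * dx)
```

With class means at -1 and +1 and unit variance, the Bayes error is about 0.159; pushing the means apart (a better feature space, not more data) is the only way to lower this floor.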
Performance saturation
[Figure: performance vs. amount of data, saturating below the ideal]
Model Bias
The class of models may not be flexible enough.
[Figure: performance vs. model complexity, approaching the ideal]
Experiments
Experiment #1: Single Face Template
HOG feature vector. Train a linear classifier using an SVM: positive examples + hard negative mining.
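A minimal sketch of this training loop, assuming precomputed HOG feature vectors. The tiny Pegasos-style SGD below stands in for the paper's SVM solver, and all function names are illustrative:

```python
# Sketch of Experiment #1: a single linear template trained with hinge loss
# plus hard negative mining. HOG extraction is omitted; `pos` / `neg_pool`
# stand in for precomputed HOG feature vectors.
import numpy as np

def svm_sgd(X, y, lam=1.0, epochs=50, seed=0):
    """Minimise lam/2 ||w||^2 + mean hinge loss with SGD (Pegasos-style)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            w = (1 - eta * lam) * w
            if y[i] * (w @ X[i]) < 1:          # margin violation
                w += eta * y[i] * X[i]
    return w

def train_with_hard_negatives(pos, neg_pool, rounds=3, n_init=100, seed=0):
    """Retrain repeatedly, adding negatives that score inside the margin."""
    rng = np.random.default_rng(seed)
    neg = neg_pool[rng.choice(len(neg_pool), size=n_init, replace=False)]
    w = np.zeros(pos.shape[1])
    for _ in range(rounds):
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        w = svm_sgd(X, y)
        hard = neg_pool[neg_pool @ w > -1.0]   # "hard" negatives
        if len(hard):
            neg = np.vstack([neg, hard])
    return w
```

Mining only the hard negatives keeps the training set small: the full negative pool (every background window) is far too large to fit on at once.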
Performance vs. #training examples
[Figure: single-template face model; average precision vs. number of training samples, fixed C = 0.002 vs. cross-validated C]
Worse performance with more training data?!?
Need to make cross-validation easy for everyday users!
[Figure: average precision vs. regularization parameter C (10^-7 to 10^1), for N = 10, 50, 100, 500, 900 training samples]
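The cross-validation advocated here can be sketched as a plain k-fold grid search over C. This is my own sketch, not the paper's code; `fit` is a stand-in for any linear trainer that returns a weight vector:

```python
# Sketch: pick the SVM regularisation strength C by k-fold validation
# accuracy. `fit(X, y, C)` is a hypothetical trainer returning weights.
import numpy as np

def crossval_C(X, y, fit, Cs=(1e-7, 1e-5, 1e-3, 1e-1, 1e1), k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    best_C, best_acc = None, -1.0
    for C in Cs:
        accs = []
        for f in range(k):
            val = folds[f]
            trn = np.concatenate([folds[j] for j in range(k) if j != f])
            w = fit(X[trn], y[trn], C)
            accs.append((np.sign(X[val] @ w) == y[val]).mean())
        acc = float(np.mean(accs))
        if acc > best_acc:
            best_C, best_acc = C, acc
    return best_C, best_acc
```

The point of the slide is that curves like the one above shift with N, so no single fixed C (such as C = 0.002) is safe as the training set grows.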
Experiment #2
We want to detect faces at many different viewpoints. What positive training data should we use?
(a) include all viewpoints in training
(b) only train on a subset of views (e.g. frontal faces)
Single-template face model
[Figure: average precision vs. number of training samples (0-800), all views vs. frontal only]
Worse performance with more training data?!? A single template trained with 200 clean frontal faces outperforms a template trained with 800 images that include all views of faces. This holds for both training and test performance.
Learned templates
[Figure: HOG templates learned from all views vs. frontal views only]
SVM is sensitive to outliers
[Figure: single-template face model, average precision vs. number of training samples, all vs. frontal]
ALL has a lower training objective, but higher 0-1 loss!
Experiment #3
Increase model complexity by using mixture components to model different viewpoints.
Model a wider range of variability by using a mixture of rigid templates.
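A mixture of rigid templates scores a window with every component's linear template and keeps the best response. A minimal sketch with illustrative names (the components would come from clustering the positives, e.g. by viewpoint):

```python
# Sketch: score a window under a mixture of K rigid linear templates.
import numpy as np

def mixture_score(x, templates):
    """x: (d,) feature vector; templates: (K, d), one row per mixture
    component. Returns the best score and which component fired."""
    scores = templates @ x
    k = int(np.argmax(scores))
    return float(scores[k]), k
```

Detection then thresholds the max score; the winning component index also tells you which viewpoint cluster explained the window.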
Discriminative clustering uses mixture components to take care of outliers?
[Figure: AP vs. dataset size]
Human-supervised clustering
[Figure: example clusters for K = 1, 3, 5, 13]
Human-in-the-loop clustering can boost mixture model performance
[Figure (Face): average precision vs. number of training samples; human clustering (K = 5) vs. k-means clustering (K = 4)]
[Figure (Bus): average precision vs. number of training samples (0-5000); human clustering (K = 5) vs. k-means clustering (K = 4)]
[Figures (Face): AP vs. number of training examples for K = 1, 3, 5, 13, 26 mixtures, and AP vs. number of mixtures for N = 50, 100, 500, 900 — empirical counterparts of the ideal performance-vs.-data and performance-vs.-model-complexity curves]
Bus category
[Figures: AP vs. number of training examples for K = 1, 3, 5, 11, 21 mixtures, and AP vs. number of mixtures for N = 50, 100, 500, 1000, 1898]
PASCAL 10x Dataset
Collected 10 times as much positive training data as the original PASCAL dataset: images collected from Flickr, labeled by MTurk users.
10x training dataset with DPM
[Figure: average precision vs. number of training samples (up to 14,000) for 11 PASCAL categories: horse, bicycle, bus, cat, cow, diningtable, motorbike, sheep, sofa, train, tvmonitor]
Cross-validation chooses the optimal regularization and number of mixture components for each category. Performance saturates with 10 templates per category and 100 positive training examples per template.
Experiment #4: Have we reached Bayes risk for linear classifiers with HOG features?
[Figure: performance vs. data, saturating below the ideal]
Deformable part models
[Figure: part templates, deformation model, detector output]
Represent local part appearance with templates, connected by springs that encode relative locations. Trained using an SVM. [Felzenszwalb, McAllester, Ramanan, 2008]
Alternate view of DPM
Every placement of parts synthesizes a rigid template. The dynamic programming used in DPM is a fast way to index a very large collection of rigid templates.
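This "index a huge collection of rigid templates" view can be sketched in 1-D: each root location scores as the root response plus each part's best placement minus a quadratic deformation cost. The brute-force max below is what a generalized distance transform computes in linear time in real DPM implementations; all names are illustrative.

```python
# Sketch of the DPM scoring recurrence, in 1-D for clarity.
import numpy as np

def dpm_score_1d(root_scores, part_scores, anchors, defcost=0.1):
    """root_scores: (L,) root filter responses; part_scores: (P, L) part
    filter responses; anchors: ideal offset of each part from the root."""
    L = len(root_scores)
    total = root_scores.copy()
    pos = np.arange(L)
    for p, anchor in enumerate(anchors):
        for x in range(L):
            disp = pos - (x + anchor)          # displacement from anchor
            # best placement of part p for a root at x (brute-force max;
            # a distance transform does this in O(L) over all x at once)
            total[x] += np.max(part_scores[p] - defcost * disp ** 2)
    return total  # best-configuration score at every root location
```

Taking the max over displacements is exactly a max over the (exponentially many) rigid templates that part placements could synthesize.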
Why does DPM do better than rigid mixtures?
Part appearances are shared during training.
Can extrapolate to new, unseen configurations.
Rigid Part Model (RPM): part appearance is learned from training data; only scores spatial configurations of parts seen during training; very fast to test.
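The RPM idea can be sketched as a max over the finite list of part configurations observed at training time, instead of DPM's search over all placements (my own sketch with illustrative names, 1-D positions):

```python
# Sketch of the Rigid Part Model: score only part configurations that
# were seen during training, taking the best one.
import numpy as np

def rpm_score(root_x, part_scores, configs):
    """part_scores: (P, L) per-part response maps; configs: list of
    per-part offset tuples, one tuple per training configuration."""
    best = -np.inf
    for cfg in configs:
        s = 0.0
        for p, off in enumerate(cfg):
            idx = root_x + off
            if not (0 <= idx < part_scores.shape[1]):
                s = -np.inf                  # part falls outside the image
                break
            s += part_scores[p][idx]
        best = max(best, s)
    return best
```

Because `configs` is a short fixed list, testing is very fast, but the model cannot extrapolate to an unseen configuration the way DPM's deformation model can.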
Faces
[Figure: average precision (0.4-0.9) vs. number of training samples for supervised DPM, RPM, latent DPM, mixture of HOG templates, K = 1 frontal, and K = 1 all views]
State-of-the-art face detection with only 100 training examples: DPM with shared parameters [Zhu & Ramanan, 2012]
[Figure: precision-recall curves]
Do we need more data?
Do we need more data?
- More training data helps, but only if you are careful.
- Clean training data helps the SVM, which is sensitive to outliers.
- Having the proper correspondence / alignment / clustering can greatly improve model performance.
- Better models might provide more bang for the buck.
Dataset Bias: Distributions Match