CSE 291: Advances in Computer Vision. Manmohan Chandraker. Lecture 2: Background

1 CSE 291: Advances in Computer Vision Manmohan Chandraker Lecture 2: Background

2 Recap

3 Features have been key: SIFT [Lowe IJCV 04], HOG [Dalal and Triggs CVPR 05], SPM [Lazebnik et al. CVPR 06], Textons, and many others: SURF, MSER, LBP, GLOH, ...

4 Learning a Hierarchy of Feature Extractors. Hierarchical and expressive feature representations. Trained end-to-end, rather than hand-crafted for each task. Remarkable in transferring knowledge across tasks.

5 Significant recent impact on the field: big labeled datasets, deep learning, GPU technology.

6 Neuron. Inputs are feature values. Each feature has a weight. The weighted sum is the activation. If the activation is positive, output +1; if negative, output -1. Slide credit: Pieter Abbeel and Dan Klein
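The neuron described on this slide can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the feature and weight values are hypothetical, and treating an activation of exactly zero as -1 is an assumption the slide does not settle.

```python
import numpy as np

def neuron_output(features, weights):
    """Return +1 if the weighted sum (the activation) is positive, else -1."""
    activation = float(np.dot(weights, features))
    # Assumption: an activation of exactly 0 is mapped to -1.
    return 1 if activation > 0 else -1

# Hypothetical feature values and weights, chosen only for illustration.
features = np.array([1.0, 2.0, -1.0])
weights = np.array([0.5, -0.25, 1.0])
output = neuron_output(features, weights)
print(output)
```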

7 Two-layer neural network Slide credit: Pieter Abbeel and Dan Klein

8 Activation functions

9 From fully connected to convolutional networks image Fully connected layer Slide: Lazebnik

10 From fully connected to convolutional networks feature map learned weights image Convolutional layer Slide: Lazebnik

11 From fully connected to convolutional networks feature map learned weights image Convolutional layer Slide: Lazebnik

12 From fully connected to convolutional networks image Convolutional layer next layer Slide: Lazebnik

13 Learnable filters

14 Number of weights

15 Filters over the whole image

16 Weight sharing Insight: Images have similar features at various spatial locations!
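To make the weight-sharing idea concrete, here is a rough sketch that slides one small filter over every location of an image and compares the number of learnable weights with a fully connected layer producing the same number of outputs. The 6x6 image, the 3x3 filter values, and the function name `conv2d_valid` are all assumptions for illustration; biases are ignored.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Apply the same kernel at every valid location (cross-correlation, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused for every (i, j): weight sharing.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical 6x6 image and a 3x3 horizontal-difference filter.
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)

conv_weights = kernel.size                   # one shared 3x3 filter
fc_weights = image.size * feature_map.size   # every output connected to every pixel
print(feature_map.shape, conv_weights, fc_weights)
```

Because the filter responds the same way wherever the pattern appears, one set of 9 weights serves all 16 output locations, where a fully connected layer of the same output size would need 576.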

17 Pooling operations. Aggregate multiple values into a single value. Invariance to small transformations. Keep only the most important information for the next layer. Reduces the size of the next layer: fewer parameters, faster computations. Observe a larger receptive field in the next layer. Hierarchically extract more abstract features.
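As one example of such an aggregation, the sketch below applies 2x2 max pooling with stride 2 to a small feature map. The window size, the stride, and the input values are assumptions; the slide does not fix a particular pooling operator.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D feature map with even side lengths."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split into 2x2 blocks, then keep the largest value in each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Hypothetical 4x4 feature map.
fmap = np.array([[1.0, 2.0, 0.0, 1.0],
                 [3.0, 4.0, 1.0, 0.0],
                 [0.0, 1.0, 5.0, 6.0],
                 [2.0, 0.0, 7.0, 8.0]])
pooled = max_pool_2x2(fmap)
print(pooled)  # 2x2 output: each entry summarizes one 2x2 block
```

The output has a quarter as many values as the input, so the next layer is smaller and each of its units covers a larger region of the original image.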

18 Convolutional Neural Networks

19 Architectural details of AlexNet. Similar framework to LeCun 1998 but: bigger model (7 hidden layers, 650k units, 60M parameters); more data (10^6 images instead of 10^3 images); GPU implementation (50 times speedup over CPU).

20 VGGNet architecture. Much more accurate: AlexNet 18.2% top-5 error, VGGNet 6.8% top-5 error. More than twice as many layers. Filters are much smaller. Harder and slower to train.

21 Deep residual learning. Plain net: simple design; use only 3x3 conv (like VGG); no hidden FC.

22 Key ideas for CNN architectures. Convolutional layers: same local functions evaluated everywhere; far fewer parameters. Pooling: larger receptive field. ReLU: maintains a gradient signal over a large portion of the domain. Limit parameters: sequence of 3x3 filters instead of large filters; 1x1 convolutions to reduce feature dimensions. Skip network: easier optimization with greater depth.
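The "sequence of 3x3 filters instead of large filters" point can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions, the same number of input and output channels in every layer, and no biases; the channel count of 64 is a hypothetical choice.

```python
def stacked_3x3_conv(n_layers, channels):
    """Receptive field and weight count for n stacked 3x3, stride-1 conv layers."""
    receptive_field = 2 * n_layers + 1          # each layer widens the field by 2
    weights = n_layers * 3 * 3 * channels * channels
    return receptive_field, weights

def single_conv_weights(kernel, channels):
    """Weight count for one conv layer with a kernel x kernel filter."""
    return kernel * kernel * channels * channels

channels = 64  # hypothetical channel count
rf, stacked = stacked_3x3_conv(3, channels)
large = single_conv_weights(rf, channels)
print(rf, stacked, large)
```

Under these assumptions, three 3x3 layers cover the same 7x7 receptive field as a single 7x7 layer with 27C^2 instead of 49C^2 weights, and they interleave more nonlinearities.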

23 Optimization in CNNs

24 A 3-layer network for digit recognition MNIST dataset

25 Cost function. The network tries to approximate the function y(x), and its output is a. We use a quadratic cost function, also called MSE or L2 loss.
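One common way to write the quadratic cost is shown below. The sum over n training inputs x and the 1/(2n) normalization are assumptions (the slide text does not include the formula); y(x) and a are the slide's own symbols.

```latex
C(w, b) = \frac{1}{2n} \sum_{x} \lVert y(x) - a \rVert^{2}
```

Here a depends on the weights w and biases b through the network, so minimizing C over (w, b) pulls the output a toward the target y(x) on every training input.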

26 Gradient descent

27 Stochastic gradient descent Update rules for each parameter: Cost function is a sum over all the training samples: Gradient from entire training set: Usually, n is very large.

28 Stochastic gradient descent. Gradient from entire training set: for large training data, gradient computation takes a long time, which leads to slow learning. Instead, consider a mini-batch with m samples. If the sample size is large enough, its properties approximate those of the dataset.
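The mini-batch idea can be sketched on a toy problem. The example below fits a linear model with a quadratic cost, estimating the gradient from m samples at a time instead of all n. The synthetic data, learning rate, batch size, and function name `minibatch_sgd` are assumptions for illustration, not values from the lecture.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=10, epochs=50, seed=0):
    """Minimize the mean squared error of a linear model with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            # Gradient of (1 / 2m) * sum(residual^2) over the mini-batch only.
            grad = X[idx].T @ residual / len(idx)
            w -= lr * grad
    return w

# Synthetic, noise-free data generated from hypothetical true weights.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w
w_hat = minibatch_sgd(X, y)
print(w_hat)
```

Each update touches only 10 of the 200 samples, yet the estimate still approaches the weights that generated the data.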

29 Stochastic gradient descent

30 Stochastic gradient descent

31 Stochastic gradient descent

32 Stochastic gradient descent Build up velocity as a running mean of gradients.
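A minimal sketch of a momentum update follows, assuming the common form in which the velocity is a decaying accumulation of past gradients. The coefficient names `lr` and `mu`, their values, and the toy objective f(w) = 0.5 * ||w||^2 are assumptions, not taken from the slide.

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """One SGD-with-momentum update: accumulate velocity, then move along it."""
    v = mu * v - lr * grad   # velocity: decaying running accumulation of gradients
    w = w + v
    return w, v

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(300):
    grad = w
    w, v = momentum_step(w, v, grad)
print(w)
```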

33 Backpropagation. This is all you need to know to get the gradients in a neural network! Backpropagation: application of the chain rule in a particular order, taking advantage of forward propagation to compute gradients efficiently.

34 Backpropagation example [Slides credit: Fei-Fei Li]

35 Backpropagation example

36 Backpropagation example

37 Backpropagation example

38 Backpropagation example

39 Backpropagation example

40 Backpropagation example

41 Backpropagation example

42 Backpropagation example

43 Backpropagation example

44 Backpropagation example

45 Backpropagation example

46 Backpropagation example. Add gate: gradient distributor. Mul gate: gradient switcher.
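The two gate behaviors can be seen on a tiny computational graph. The sketch below uses f(x, y, z) = (x + y) * z with hypothetical input values; the slide figures are not available here, so this may not be the exact example they show.

```python
def forward_backward(x, y, z):
    """Forward and backward pass through f = (x + y) * z using the chain rule."""
    # Forward pass.
    q = x + y          # add gate
    f = q * z          # mul gate

    # Backward pass, starting from df/df = 1.
    df_dq = z          # mul gate: each input receives the OTHER input's value
    df_dz = q
    df_dx = df_dq * 1.0   # add gate: passes the upstream gradient to both inputs
    df_dy = df_dq * 1.0
    return f, (df_dx, df_dy, df_dz)

# Hypothetical inputs.
f, grads = forward_backward(-2.0, 5.0, -4.0)
print(f, grads)
```

The add gate hands the same upstream gradient to x and y (a distributor), while the mul gate swaps its inputs when forming the local gradients (a switcher).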

47 Convolutional layer is differentiable

48 Max Pooling

49 Loss Functions and Regularizations

50 Slow learning with sigmoid neurons

51 Slow learning with sigmoid neurons. When the neuron's output is close to 1, learning becomes slow.

52 Cross-Entropy Loss

53 Cross-Entropy Loss. Rate of learning depends on the error in prediction! Prevents the learning slowdown caused by the derivative of the sigmoid.
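A short derivation for a single sigmoid neuron shows why. The setup is an assumption for illustration: one input x, weight w, bias b, target y, and output a = sigma(z) with z = wx + b.

```latex
\begin{align*}
\text{Quadratic cost:}\quad
  C &= \tfrac{1}{2}(y - a)^{2},
  & \frac{\partial C}{\partial w} &= (a - y)\,\sigma'(z)\,x \\
\text{Cross-entropy:}\quad
  C &= -\bigl[\,y \ln a + (1 - y)\ln(1 - a)\,\bigr],
  & \frac{\partial C}{\partial w} &= \frac{a - y}{a(1 - a)}\,\sigma'(z)\,x = (a - y)\,x
\end{align*}
```

The last step uses sigma'(z) = a(1 - a). With the quadratic cost the gradient carries the factor sigma'(z), which vanishes when the neuron saturates; with cross-entropy that factor cancels, leaving a gradient proportional to the prediction error a - y.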

54 Better activation functions. ReLU: computes f(x) = max(0, x); does not saturate (in positive region); computationally efficient; converges faster than sigmoid. A ReLU variant: same advantages as ReLU; stays alive when x < 0.
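Both kinds of activation can be sketched in NumPy. ReLU follows the formula on the slide. The second function is a leaky ReLU, which is one common activation matching "stays alive when x < 0"; that the slide means this particular variant, and the slope `alpha=0.01`, are assumptions.

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small nonzero slope alpha for x <= 0 (alpha is an assumed value)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))
print(leaky_relu(x))
```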

55 Over-fitting

56 More data prevents over-fitting. But it is not always feasible to obtain more data that is relevant.

57 Regularization reduces over-fitting

58 L2 regularization L2 regularization: Partial derivatives: Update rule:
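The formulas for this slide did not survive in the text. A standard form of L2 regularization is sketched below; the symbols lambda (regularization strength), eta (learning rate), and n (training set size) are assumptions, and C_0 denotes the unregularized cost.

```latex
\begin{align*}
\text{L2 regularization:}\quad
  C &= C_0 + \frac{\lambda}{2n} \sum_{w} w^{2} \\
\text{Partial derivatives:}\quad
  \frac{\partial C}{\partial w} &= \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\, w,
  \qquad
  \frac{\partial C}{\partial b} = \frac{\partial C_0}{\partial b} \\
\text{Update rule:}\quad
  w &\rightarrow \Bigl(1 - \frac{\eta \lambda}{n}\Bigr) w - \eta\, \frac{\partial C_0}{\partial w}
\end{align*}
```

In this form each step shrinks every weight by a constant factor before applying the usual gradient update, which is why L2 regularization is also called weight decay.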

59 L1 regularization L1 regularization: Partial derivatives: C_0 is the cross-entropy term. Update rule:
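The formulas for this slide are missing from the text as well. A standard form of L1 regularization is sketched below, with lambda, eta, and n as assumed symbols for the regularization strength, learning rate, and training set size.

```latex
\begin{align*}
\text{L1 regularization:}\quad
  C &= C_0 + \frac{\lambda}{n} \sum_{w} \lvert w \rvert \\
\text{Partial derivatives:}\quad
  \frac{\partial C}{\partial w} &= \frac{\partial C_0}{\partial w} + \frac{\lambda}{n}\, \operatorname{sgn}(w) \\
\text{Update rule:}\quad
  w &\rightarrow w - \frac{\eta \lambda}{n}\, \operatorname{sgn}(w) - \eta\, \frac{\partial C_0}{\partial w}
\end{align*}
```

In this form the penalty moves each weight toward zero by a constant amount per step, regardless of its magnitude, which tends to drive many weights to exactly zero.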

60 L2 or L1 regularization

61 Regularization reduces over-fitting

62 Dropout as a regularization. Modify the network itself. Randomly delete half the hidden neurons in the network. Repeat several times to learn weights and biases. At runtime, twice as many neurons are active, so halve the weights outgoing from each neuron.

63 Dropout as a regularization. Modify the network itself. Randomly delete half the hidden neurons in the network. Repeat several times to learn weights and biases. At runtime, twice as many neurons are active, so halve the weights outgoing from each neuron. Averaging or voting scheme to decide output: same training data, but random initializations. Each network over-fits in a different way. Average output is not sensitive to a particular mode.

64 Dropout as a regularization. A useful insight from the AlexNet paper: reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of others. Each neuron is forced to learn independent features in conjunction with random other neurons. Dropout ensures the model can make robust predictions.
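The dropout recipe from these slides (drop half the hidden neurons while training, halve the outgoing weights at runtime) can be sketched as follows. The function names, array sizes, and the use of a NumPy random generator are assumptions for illustration.

```python
import numpy as np

def dropout_train(hidden, rng, p_drop=0.5):
    """Training time: zero each hidden activation independently with probability p_drop."""
    mask = rng.random(hidden.shape) >= p_drop
    return hidden * mask, mask

def dropout_test_weights(w_out, p_drop=0.5):
    """Runtime: all neurons are active, so scale outgoing weights by (1 - p_drop)."""
    return w_out * (1.0 - p_drop)

rng = np.random.default_rng(0)
hidden = np.ones(1000)                 # hypothetical hidden activations
dropped, mask = dropout_train(hidden, rng)

w_out = np.array([2.0, 4.0])           # hypothetical outgoing weights
w_test = dropout_test_weights(w_out)
print(mask.mean(), w_test)
```

Scaling by (1 - p_drop) keeps the expected input to the next layer at runtime close to what it saw during training, when only about half the neurons contributed.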

65 Data augmentation as regularization

66 Data augmentation as regularization Horizontal flips

67 Data augmentation as regularization Random crops and scales

68 Data augmentation as regularization Color jitter

69 Data augmentation as regularization. Color jitter. Can do a lot more: rotation, shear, non-rigid, motion blur, lens distortions, ...
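Three of the augmentations named on these slides can be sketched in NumPy. The sketch assumes images are H x W x 3 float arrays with values in [0, 1]; the function names and the per-channel additive form of color jitter are illustrative choices, not the lecture's implementation.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

def color_jitter(img, rng, max_shift=0.1):
    """Add a small random offset to each color channel, then clip to [0, 1]."""
    shift = rng.uniform(-max_shift, max_shift, size=(1, 1, img.shape[2]))
    return np.clip(img + shift, 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((8, 8, 3))            # hypothetical random image
flipped = horizontal_flip(img)
cropped = random_crop(img, 6, 6, rng)
jittered = color_jitter(img, rng)
print(flipped.shape, cropped.shape, jittered.shape)
```

Applying a fresh random combination of these to each training sample yields new, label-preserving variants of the same images.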

70 Transfer Learning in CNNs

71 Transfer Learning. Improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned. Weight initialization for CNN. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks [Oquab et al. CVPR 2014]. Slide: Jiabin Huang

72 Transfer Learning

73 CNNs are good at transfer learning

74 Fine-tune h_T using h_S as initialization

75 Initializing h_T with h_S

76 Initializing h_T with h_S

77 Initializing h_T with h_S

78 Initializing h_T with h_S

79 Strategy for fine-tuning Amount of data needed

80 Use h_S as a feature extractor for h_T

81 Transfer learning is a common choice

82 Training a Good CNN

83 Verifying that CNN is Trained Well [M. Ranzato]

84 Verifying that CNN is Trained Well [M. Ranzato]
