Learning with Hidden Information using A Max-Margin Latent Variable Model

2014 22nd International Conference on Pattern Recognition

Ziheng Wang, Tian Gao, and Qiang Ji
Department of Electrical, Computer & Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY, USA, 12180
Email: wangz10@rpi.edu, gaot@rpi.edu, jiq@rpi.edu

Abstract: Classifier learning is challenging when the training data is inadequate in either quantity or quality. Prior knowledge is therefore important in such cases to improve classification performance. In this paper we study a specific type of prior knowledge called hidden information, which is available during training but not during testing. Hidden information has abundant applications in many areas but has not been thoroughly studied. We propose to exploit the hidden information during training to help design an improved classifier. Towards this goal, we introduce a novel approach which automatically learns and transfers the useful hidden information through a latent variable model. Experiments on both digit recognition and gesture recognition tasks demonstrate the effectiveness of the proposed method in capturing hidden information for improved classification.

I. INTRODUCTION

Classification is a fundamental problem in pattern recognition. Classifier learning has been predominantly data-driven and can be formulated as follows: given a set of $n$ i.i.d. training pairs $(x_1, y_1), \ldots, (x_n, y_n)$ sampled from an unknown distribution $P(x, y)$, where $x \in \mathcal{X}$ is the feature vector and $y \in \mathcal{Y}$ is the class label, learn a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ that can classify unseen samples as accurately as possible.

However, data-based learning can be challenging when the training data is limited in either quantity or quality. To address this issue, a growing body of work incorporates extra sources of information beyond the training data to improve classification performance. For instance, contextual information (e.g., related objects beside the target object) [1]-[3] and object attributes [4], [5] have been exploited to improve image-based object recognition, and depth videos have been combined with RGB videos to enhance gesture recognition [6]. Despite their success, these approaches require the extra information, explicitly or implicitly, during both training and testing, which can be impractical in many applications for two reasons. First, acquiring the extra information during testing can be expensive; for example, capturing depth videos is costly in large-scale surveillance applications. Second, information such as object attributes must be indirectly derived from measurements in the testing phase, which may not be accurate; errors can propagate and even hurt classification.

Fig. 1: Examples of hidden information in different applications: facial action units for expression recognition, joint positions for gesture recognition, attributes for object recognition, and bounding boxes for action recognition.

These issues motivate us to ask whether it is possible to incorporate information to which we only have access during training, and if so, whether it can still improve the learning performance.
In this paper we give positive answers by proposing a novel approach that integrates all the related information through a latent variable model. We call the information which is available during training but not during testing hidden information. Hidden information exists in different formats; here we focus on the form that can be represented as additional features $h_i \in \mathcal{H}$ for each training sample $(x_i, y_i)$. In the remainder of this paper we also call $x$ the primary features and $h$ the hidden features.

Hidden information can be utilized in many important areas. Besides the primary training data, it supplies additional sources of information for learning the classifier. Figure 1 shows some examples of hidden information in different applications. For example, human-labeled action units can be obtained along with the training images for facial expression recognition. Human joint positions can be collected for each training instance to help recognize gestures from RGB measurements. Attributes can be used as hidden information for image-based object classification, and bounding boxes can serve as hidden information for action recognition.

Learning with hidden information also has a close analogy to human learning. Humans usually turn to teachers or books to help them learn more efficiently and effectively. Likewise, classifier learning can benefit from hidden information. Learning with hidden information is similar to constructing a building with a scaffold: the scaffold facilitates the construction of the model during training and is then disassembled for testing.

Nevertheless, incorporating hidden information is challenging. It cannot simply be used as additional features combined with the primary features, since it is absent during testing. Hence we need a learning algorithm that can transfer the useful hidden information to the target classifier $y = f(x)$.

In this paper we propose to incorporate hidden information through a latent variable, which serves as a pivot relating the primary feature $x$, the target label $y$, and the hidden information $h$ simultaneously. By connecting $x$ and $y$ through a latent variable $z$, the model can better interpret the intermediate complex structure of the feature space and thereby better capture the relationship between the input features and the output class label. Such a latent variable model has been demonstrated to yield superior classification performance, especially when large intra-class variations exist [7]. Moreover, by connecting $h$ to $z$, the learned latent space adapts to the hidden information. The useful prior knowledge within the hidden information is implicitly captured and transferred to the target classifier through the latent variable, and the model is learned in a max-margin fashion for optimal discrimination. Our basic assumption is that learning jointly with the hidden information positively influences the estimation of the relationship between $x$ and $y$, and therefore results in a better classifier than one learned purely from the training data.

The remainder of this paper is organized as follows. A brief review of related work is provided in Section II. The proposed algorithm is introduced in detail in Section III. Experimental results are presented in Section IV. Finally, we conclude the paper in Section V.

II. RELATED WORK

Learning with hidden information was originally proposed by Vapnik et al. [8], [9], where hidden information is also called privileged information. Since then, various approaches have been developed to capture hidden information for different applications. Vapnik et al. [8] proposed the SVM+ algorithm, in which the slack variable $\xi_i$ for each training instance is modeled as a function of the privileged information $h_i$. The basic idea is that privileged information indicates which samples are easy to classify and which are hard. Corresponding theory proves that SVM+ improves the learning rate of SVM from $O(1/\sqrt{n})$ to $O(1/n)$ when the privileged information is the ground-truth value of the slack variables [8], [10]. Efficient implementations of SVM+ have been proposed in [11], [12]. Niu and Wu [13] further studied using L1 regularization in SVM+ to capture hidden information. Instead of relating hidden information to the slack variables, Sharmanska et al. [14] proposed to transfer the score rank generated from the hidden information to the primary data modality; empirical evaluations on multiple datasets demonstrate that rank transfer achieves results comparable to SVM+ but is easier to implement. In addition, Lapin et al. [15] proved that SVM+ can be reformulated as a special case of instance-weighted SVM, and Liang and Cherkassky [16] studied the connection between SVM+ and multi-task learning.

Hidden information has also been used to improve classifiers other than SVM. Chen et al. [17] proposed to incorporate hidden information into the AdaBoost classifier, where the hidden information is used as additional targets to construct weak classifiers. Yang and Patras [18] used privileged information to help select split functions when constructing a conditional regression forest. Hidden information in these two approaches serves as auxiliary targets that provide richer information about the class label. However, all of these approaches relate the hidden information to the primary data or to the classifier parameters through direct regressions, which is too strong an assumption for many applications. Instead, our proposed method assumes a latent relationship between the primary data and the hidden information, which is automatically learned during training through a max-margin approach. In addition, the existing approaches are limited to binary classification, while the proposed method can also be used for multi-class classification.

Beyond classifier learning, hidden information has been used in other learning problems. For instance, [19] used hidden information as auxiliary targets to guide feature selection, based on the assumption that features that are effective for both the target and the hidden information are better for classification. Feyereisl and Aickelin [20] used hidden information for clustering, and Fouad et al. [21] incorporated it into metric learning.

III. PROPOSED ALGORITHM

Instead of associating the hidden information and the primary data through direct regression, the proposed model relates them with a latent variable and implicitly captures and transfers the information from $h$ to learn the target classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$. In this section we first give a mathematical definition of learning with hidden information and then introduce the proposed approach in detail.

A. Problem Definition

Learning with hidden information represents a paradigm shift from the traditional classification learning problem. Mathematically, it is stated as follows: given a set of training data

$(x_1, y_1), \ldots, (x_n, y_n), \quad x \in \mathcal{X},\ y \in \mathcal{Y},$

where $x$ is the input feature vector and $y$ is the output class label, as well as hidden information that can be represented as additional features for each training instance,

$h_1, \ldots, h_n, \quad h \in \mathcal{H},$

the goal of learning with hidden information is to learn a classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ that classifies unseen samples better than a classifier learned from the training samples alone. We emphasize that the learned classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ only uses the primary feature $x$ as input, since the hidden information $h$ will not be present during testing.
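To make the asymmetry of this setting concrete, the following minimal Python sketch (an illustration of the interface only; the class and method names are hypothetical and not from the paper) shows where the hidden features may and may not appear:

```python
import numpy as np

class HiddenInfoClassifier:
    """Interface sketch: hidden features H are consumed at training time only."""

    def fit(self, X, H, y):
        # X: (n, d_x) primary features; H: (n, d_h) hidden features;
        # y: (n,) class labels. H may shape the learned parameters,
        # but must not be required at prediction time.
        raise NotImplementedError

    def predict(self, X):
        # Testing uses the primary features alone; H is unavailable here.
        raise NotImplementedError
```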

Therefore the hidden information $h$ cannot simply be treated as additional features to be combined with the primary feature $x$. Instead, it indirectly and implicitly influences the choice of the classifier (and its parameters) during the training phase.

B. Capturing Hidden Information with a Latent Variable

The key to learning with hidden information is how to properly extract and transfer the useful part of the hidden information to the original data modality. In this paper we achieve this goal with a latent variable model. Figure 2 shows the graphical depiction of the proposed model, where $x \in \mathcal{X}$ is the primary feature, $h \in \mathcal{H}$ is the hidden information, $y \in \mathcal{Y}$ is the class label, and $z$ is a latent node connected to the other three variables $x$, $h$, and $y$.

Fig. 2: The proposed model to incorporate hidden information, where $y$ is the class label, $x$ is the primary feature, $h$ is the hidden information, and $z$ is a latent node connecting the other three nodes.

In this paper we assume a discrete latent node $z \in \{1, \ldots, K\}$; however, the model can be extended to a discrete vector or a continuous node. We also study the case where the class label $y$ takes $C$ values, $y \in \{1, \ldots, C\}$. The total potential of the model $\Psi(x, h, z, y)$ is defined in Equation (1), where $\{b_z^k\}$ and $\{d_y^c\}$ are the biases for each state of the latent variable $z$ and of the class label $y$. The first set of parameters $\{w_{xz}^k\}$ measures the compatibility between the primary feature $x$ and each latent state. Similarly, the second set of parameters $\{w_{hz}^k\}$ measures the compatibility between the hidden information $h$ and each latent state. The third set of parameters $\{w_{yz}^{k,c}\}$ models the compatibility between each latent state and each class. $\mathbb{1}(\cdot)$ is the indicator function.

$$\Psi(x, h, z, y) = \sum_{k=1}^{K} w_{xz}^{k\top} x\, \mathbb{1}(z = k) + \sum_{k=1}^{K} w_{hz}^{k\top} h\, \mathbb{1}(z = k) + \sum_{k=1}^{K} \sum_{c=1}^{C} w_{yz}^{k,c}\, \mathbb{1}(z = k, y = c) + \sum_{k=1}^{K} b_z^k\, \mathbb{1}(z = k) + \sum_{c=1}^{C} d_y^c\, \mathbb{1}(y = c) \quad (1)$$

Concatenating the parameters, the total potential can be written as a linear model

$$\Psi(x, h, z, y) = w^\top \Phi(x, h, z, y), \quad (2)$$

where $w$ is a vector consisting of all the model parameters $\{w_{xz}^k\}, \{w_{hz}^k\}, \{w_{yz}^{k,c}\}, \{b_z^k\}, \{d_y^c\}$, and $\Phi(x, h, z, y)$ is the joint feature vector constructed by arranging all the features in the order corresponding to the parameters in $w$. The latent variable $z$ thus acts as a pivot connecting all the related components $(x, h, y)$ during training. Below we analyze the proposed model in detail.

Part 1, Connecting X and Y through Z (X-Z-Y): First, the input feature $x$ and the output class variable $y$ are not directly related as in SVM or other popular classifiers; they are connected through the latent variable $z$. From a bottom-up point of view, the raw input features $x$ are decomposed into a set of latent states before being used to discriminate the class label $y$. This allows the model to capture a more complex intermediate structure of the feature space and better characterize the relationship between input and output. By transforming the feature space into a discrete latent space, the model can also deal more effectively with the large intra-class variations prevalent in many classification applications.

Part 2, Connecting X and H through Z (X-Z-H): Second, the latent variable also serves as a bridge relating the hidden information $h$ to the primary features $x$, so the latent space also captures relationships between the primary data modality and the hidden information.
Compared to existing works that assume a direct regression between the hidden information and the primary data, our model makes no explicit assumption: the relationship between $x$ and $h$ is entirely latent and is discovered automatically through learning. Moreover, by attaching the hidden information $h$ to the latent variable, the estimated relationship between $x$ and $y$ (i.e., the parameters $\{w_{yz}^{k,c}\}$ and $\{w_{xz}^k\}$) also changes accordingly during training, and hence the hidden information is implicitly transferred for classifying $y$ from $x$. We assume that the hidden information brings positive influence by providing auxiliary, useful information to the learning problem.

Part 3, Connecting H and Y through Z (H-Z-Y): Finally, the latent variable also relates the hidden information $h$ to the target class label $y$. This ensures that the information extracted and transferred from the hidden information must favor the discrimination of the class $y$.
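As a concrete illustration of Equations (1) and (2), the sketch below (our own reconstruction from the stated definitions, not the authors' code; states are zero-indexed here) builds the sparse joint feature vector $\Phi(x, h, z, y)$ and evaluates the potential as a single dot product:

```python
import numpy as np

def joint_feature(x, h, z, y, K, C):
    """Sparse joint feature vector Phi(x, h, z, y) matching Eq. (1)-(2).

    Parameter layout of w (and Phi):
    [w_xz (K*d_x) | w_hz (K*d_h) | w_yz (K*C) | b_z (K) | d_y (C)].
    Here z is in {0, ..., K-1} and y is in {0, ..., C-1}.
    """
    d_x, d_h = len(x), len(h)
    phi = np.zeros(K * d_x + K * d_h + K * C + K + C)
    off = 0
    phi[off + z * d_x : off + (z + 1) * d_x] = x      # x-z compatibility block
    off += K * d_x
    phi[off + z * d_h : off + (z + 1) * d_h] = h      # h-z compatibility block
    off += K * d_h
    phi[off + z * C + y] = 1.0                        # z-y compatibility
    off += K * C
    phi[off + z] = 1.0                                # latent-state bias b_z
    off += K
    phi[off + y] = 1.0                                # class bias d_y
    return phi

def potential(w, x, h, z, y, K, C):
    """Total potential Psi(x, h, z, y) = w^T Phi(x, h, z, y), Eq. (2)."""
    return float(w @ joint_feature(x, h, z, y, K, C))
```

Because all compatibility terms in Equation (1) are linear in the indicator features, the whole model collapses into this single linear scoring function, which is what makes the structural SVM machinery below applicable.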

C. Max-Margin Model Learning and Inference

The model is learned in a max-margin Latent Structural SVM framework [22], which has demonstrated superior classification performance in many applications. Compared to probabilistic learning approaches for undirected graphical models, the max-margin approach is also more efficient to implement, since it does not need to deal with the partition function. To simultaneously learn the relationships between $x$, $h$, and $y$ (i.e., the parameters $\{w_{xz}^k\}, \{w_{hz}^k\}, \{w_{yz}^{k,c}\}, \{b_z^k\}, \{d_y^c\}$), during training we maximize the margin for classifying $y$ based on $x$ and $h$ through the latent variable $z$, using the training data $\{(x_i, y_i)\}_{i=1}^n$ and the hidden information $\{h_i\}_{i=1}^n$. This is equivalent to minimizing the objective function in Equation (3):

$$\min_w \frac{1}{2}\|w\|^2 + C_1 \sum_{i=1}^{n} \left( \max_{\hat{y}_i, \hat{z}_i} \left[ w^\top \Phi(x_i, h_i, \hat{y}_i, \hat{z}_i) + \Delta(y_i, \hat{y}_i) \right] - \max_{z_i} w^\top \Phi(x_i, h_i, y_i, z_i) \right) \quad (3)$$

The concave-convex procedure (CCCP) [22] is employed to solve this optimization problem, by iteratively finding the maximum a posteriori (MAP) estimate of the latent variable and then solving a standard structural SVM problem with the latent variable fully observed. A subgradient descent algorithm is adopted for the structural SVM optimization subroutine.
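The following schematic sketch shows the shape of this CCCP procedure, reusing joint_feature and potential from the previous block. It is a simplified reconstruction under stated assumptions: a 0/1 loss for $\Delta$, a regularization weight lam standing in for the $1/C_1$ trade-off, and fixed iteration counts rather than a convergence test.

```python
import numpy as np

def cccp_train(X, H, y, K, C, n_outer=10, n_sgd=100, lam=1e-2, lr=1e-3):
    """Schematic CCCP loop for Eq. (3); an illustrative sketch, not the paper's code."""
    n = len(y)
    d = len(joint_feature(X[0], H[0], 0, 0, K, C))
    w = np.zeros(d)
    for _ in range(n_outer):
        # Step 1 (concave part): MAP-impute the latent state of each sample
        # under the true label, with the current parameters.
        z_hat = [max(range(K),
                     key=lambda z: potential(w, X[i], H[i], z, y[i], K, C))
                 for i in range(n)]
        # Step 2 (convex part): subgradient descent on the resulting
        # structural-SVM upper bound with z_hat held fixed.
        for _ in range(n_sgd):
            g = lam * w.copy()
            for i in range(n):
                # Loss-augmented inference over (y_hat, z_hat); 0/1 loss Delta.
                yz = max(((c, z) for c in range(C) for z in range(K)),
                         key=lambda p: potential(w, X[i], H[i], p[1], p[0], K, C)
                                       + (p[0] != y[i]))
                g += (joint_feature(X[i], H[i], yz[1], yz[0], K, C)
                      - joint_feature(X[i], H[i], z_hat[i], y[i], K, C)) / n
            w -= lr * g
    return w
```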
However, such a model cannot be directly used for testing, since it requires the hidden information $h$ as input. The hidden information node $h$ has to be properly detached from the latent variable. To address this issue we propose the following method. First, we infer the latent state for each training instance with Equation (4):

$$z_i^* = \arg\max_{z_i} w^\top \Phi(x_i, h_i, y_i, z_i) \quad (4)$$

Then we use the memorized latent states $\{z_i^*\}$ together with the training data $\{(x_i, y_i)\}_{i=1}^n$ to relearn a baseline latent variable model without the node $h$ (see Figure 3), and use that model for testing. The latent variable is still unknown during testing but is now observed during training, so learning can be formulated as a standard structural SVM with the objective function in Equation (5):

$$\min_w \frac{1}{2}\|w\|^2 + C_2 \sum_{i=1}^{n} \left( \max_{\hat{y}_i, \hat{z}_i} \left[ w^\top \Phi(x_i, \hat{y}_i, \hat{z}_i) + \Delta(y_i, \hat{y}_i, \hat{z}_i, z_i^*) \right] - w^\top \Phi(x_i, y_i, z_i^*) \right) \quad (5)$$

In this way, the hidden information is indirectly transferred to the baseline model through the memorized latent states of the training instances. Compared to a baseline latent variable model learned purely from the training data, the parameter values obtained by the proposed algorithm are different, since the latent space is learned under the guidance of the hidden information. In this sense the latent variable can be seen as only partially latent, as it incorporates information from external expertise. During testing, the label of each sample is predicted with Equation (6):

$$f_y(x) = \max_z w^\top \Phi(x, z, y), \qquad y^* = \arg\max_y f_y(x) \quad (6)$$

Fig. 3: A baseline latent variable model without the hidden information node $h$. It is used during testing to classify unseen samples based only on the primary feature $x$.
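A sketch of this detach-and-relearn stage and of the prediction rule follows, continuing the previous sketches (again a reconstruction; the structural-SVM retraining of Equation (5) mirrors the earlier subgradient loop with $z_i^*$ held fixed and is omitted here):

```python
def joint_feature_xzy(x, z, y, K, C):
    """Joint feature for the baseline model of Fig. 3 (hidden node h removed)."""
    d_x = len(x)
    phi = np.zeros(K * d_x + K * C + K + C)
    phi[z * d_x : (z + 1) * d_x] = x                  # x-z compatibility block
    off = K * d_x
    phi[off + z * C + y] = 1.0                        # z-y compatibility
    off += K * C
    phi[off + z] = 1.0                                # latent-state bias
    phi[off + K + y] = 1.0                            # class bias
    return phi

def infer_latent_states(w_full, X, H, y, K, C):
    """Eq. (4): memorized MAP latent state z_i* per training sample, using h."""
    return [max(range(K),
                key=lambda z: potential(w_full, X[i], H[i], z, y[i], K, C))
            for i in range(len(y))]

def predict(w_base, x, K, C):
    """Eq. (6): f_y(x) = max_z w^T Phi(x, z, y); return argmax_y f_y(x)."""
    scores = [max(float(w_base @ joint_feature_xzy(x, z, y, K, C))
                  for z in range(K))
              for y in range(C)]
    return int(np.argmax(scores))
```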

IV. EXPERIMENT

In this section we demonstrate the effectiveness of the proposed algorithm on two applications: handwritten digit recognition and human gesture recognition. Holistic descriptions of the digit images and human joint positions extracted from depth videos are used as the hidden information, respectively. To be consistent with related work, we call the baseline latent variable model (see Figure 3) learned purely from the training data LSSVM, and the proposed algorithm LSSVM+, and compare both with the support vector machine (SVM) and the SVM+ model. While all the models can be extended to more sophisticated kernels, we use linear kernels to compare their performance. For the two latent variable models, the k-means algorithm is used to cluster the data to initialize the hidden states; each experiment is run 20 times for these two models and we report the average results.

A. Digit Recognition

In the first experiment we test the performance of the proposed algorithm for digit recognition, following the evaluation protocol of [8]. The goal is to classify images of digit 5 versus digit 8. The training data contains 50 images of digit 5 and 50 images of digit 8, and the testing data consists of 1866 images selected from the MNIST handwritten digit dataset [23]. Note that the original experiment in [8] was based on an idealistic setting in which a huge validation dataset (about 4000 digit samples) is available, so that the learned model is nearly optimal after model selection. In practice, however, one can rarely obtain a validation dataset larger than the testing data, so in our experiments we tune the parameters of all models by five-fold cross-validation within the available training data. Images are resized to 10x10 pixels and the pixel values are used as the primary feature $x$. Figure 4 illustrates some digit samples from the training data.

Fig. 4: Examples of 10x10 digit images in the training data. The top row corresponds to digit 5 and the bottom row to digit 8.

The hidden information in this case consists of the holistic descriptions made by a domain expert for each digit image in the training set. The descriptions cover a total of 21 properties of each digit, such as the thickness of the stroke and the degree of tilt. Each property is measured with an integer from 0 to m; for example, a subset of these properties is: tilting to the right (0-3); thickness of the line (0-4); stability (0-3); uniformity (0-3). Figure 5 shows two digits and their quantized holistic descriptions for four properties. The description of each digit is translated into a 21-dimensional feature vector and used as the hidden information.

Fig. 5: Two digit images and the corresponding values of four properties (tilt, thickness, stability, uniformity). Left: digit 5. Right: digit 8.

Figure 6 shows the performance of all the models as the number of training samples is gradually increased. For every training set size below 100, 12 different random samples are drawn and the average results are reported, as in [8]. Both latent variable models, LSSVM and LSSVM+, use 5 hidden states.

Fig. 6: Average classification accuracy of each model with respect to the number of training samples.

The results show that by incorporating the expert descriptions of the digit images, both SVM+ and LSSVM+ outperform their counterparts, which demonstrates the usefulness of hidden information and the effectiveness of the proposed method. Moreover, the improvement achieved by LSSVM+ is significantly larger than that achieved by SVM+. Regardless of the number of training samples, the proposed LSSVM+ always achieves the best performance among the models, with or without hidden information. Another observation is that as the number of training samples increases, the improvement from incorporating hidden information gradually decreases. This makes sense: more data by itself increases classification performance, shrinking the room for improvement. It also suggests that hidden information is most important when training data is scarce.

B. Gesture Recognition

The second experiment classifies the 10 unique gestures in the devel01 gesture dataset [24]. In the dataset, each gesture has an RGB video and a corresponding depth video for both training and testing. For our purpose, the RGB videos are used as the primary measurement, and the depth videos are excluded during testing and used only as the source of hidden information. Figure 7 shows a sample RGB frame and its corresponding depth image. The dataset is designed for one-shot learning; in other words, each gesture has only one RGB video during training, so training a video-based SVM or LSSVM classifier is impossible. We therefore perform frame-based classification instead: the classifier is learned to predict the gesture label of each frame, and a video sequence is assigned to the gesture that receives the most votes from its frames.

HOG and HOF features are extracted from each RGB frame as the primary input feature $x$. To obtain these features, the gradient and optical flow are first computed for each pixel and quantized into one of 8 directions. The image is then sequentially decomposed into 1, 4, and 9 blocks, and the histogram of oriented gradients (HOG) and the histogram of optical flow (HOF) are calculated for each block. Concatenating all the histograms results in a 112-dimensional feature vector $x$ for each frame.
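The block-histogram pooling just described can be sketched as follows. This is our illustration for the gradient channel of a single grayscale frame only; the same pooling would apply to the optical-flow channel, and the per-block normalization is an assumption rather than the paper's stated choice.

```python
import numpy as np

def block_orientation_histograms(frame, n_bins=8, grids=(1, 2, 3)):
    """Magnitude-weighted orientation histograms over 1x1, 2x2 and 3x3 grids.

    frame: 2-D float array (grayscale image). Returns 8 * (1 + 4 + 9) = 112 dims.
    """
    gy, gx = np.gradient(frame)
    mag = np.hypot(gx, gy)
    # Quantize each pixel's gradient direction into one of n_bins sectors.
    bins = ((np.arctan2(gy, gx) + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    h_img, w_img = frame.shape
    feats = []
    for g in grids:                      # g x g grid -> 1, 4, then 9 blocks
        for bi in range(g):
            for bj in range(g):
                rs, re = bi * h_img // g, (bi + 1) * h_img // g
                cs, ce = bj * w_img // g, (bj + 1) * w_img // g
                hist = np.bincount(bins[rs:re, cs:ce].ravel(),
                                   weights=mag[rs:re, cs:ce].ravel(),
                                   minlength=n_bins)
                feats.append(hist / (hist.sum() + 1e-8))   # per-block L1 norm
    return np.concatenate(feats)
```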
Fig. 7: (a) An RGB image frame from a gesture video; (b) the corresponding depth image with the labeled 3D joint positions that form the hidden information $h$.

The depth video can provide much useful information, and in this experiment we use the manually labeled human joint positions as the hidden information. As shown in Figure 7b, the centers of the head and the two hands, as well as the two shoulder joints and the two elbow joints, are manually labeled in the provided depth videos for all the training data. Their 3D positions are concatenated into a 21-dimensional feature vector used as the hidden information $h$. A total of 80 RGB videos are used during testing.

The latent variable models are evaluated with respect to the number of latent states, as illustrated in Figure 8, where the x axis is the number of latent states and the y axis is the average accuracy. As the number of latent states increases, the accuracies of both LSSVM and LSSVM+ increase and peak at approximately 40 latent states. By incorporating the joint positions as hidden information, LSSVM+ always outperforms LSSVM regardless of the number of latent states; in particular, LSSVM+ improves over LSSVM by about 5% on average. The two horizontal lines in Figure 8 show the accuracies of SVM and SVM+, which are much lower than the best performance of the proposed LSSVM+. Moreover, the improvement of LSSVM+ over LSSVM is much greater than that of SVM+ over SVM.

Fig. 8: Average classification accuracy of each algorithm with respect to the number of hidden states.

Detailed classification accuracies for each gesture are given in Table I, where both latent models use 40 hidden states. The recognition accuracies of 9 out of the 10 gestures are improved by the proposed algorithm to varying degrees; in particular, significant improvements are observed for gestures G4 and G10.

TABLE I: Classification Accuracy (%) for Each Gesture

Algorithm | G1  | G2    | G3    | G4    | G5    | G6    | G7    | G8    | G9  | G10   | Average
SVM       | 100 | 100   | 85.71 | 42.86 | 100   | 100   | 75.00 | 12.50 | 100 | 37.50 | 77.52
SVM+      | 100 | 100   | 71.43 | 42.86 | 100   | 100   | 66.67 | 50.00 | 100 | 37.50 | 78.65
LSSVM     | 100 | 95.83 | 32.14 | 37.21 | 95.83 | 98.75 | 94.79 | 53.13 | 100 | 65.63 | 83.57
LSSVM+    | 100 | 97.22 | 55.36 | 75.00 | 100   | 100   | 95.83 | 56.25 | 100 | 85.94 | 88.48

V. CONCLUSION

In this paper we studied the novel problem of learning with hidden information, where additional information about the training samples is available during training but absent during testing. We proposed a novel approach to incorporate hidden information through a max-margin latent variable model. Experiments on both digit and gesture recognition tasks demonstrated the feasibility and effectiveness of our approach in capturing hidden information. The proposed method can readily be extended to more sophisticated models involving a richer set of latent components.

ACKNOWLEDGMENT

The work described in this paper is supported in part by grant IIS 1145152 from the National Science Foundation.

REFERENCES

[1] A. Torralba, "Contextual priming for object detection," Int. J. Comput. Vision, vol. 53, pp. 169-191, 2003.
[2] M. Marszalek, I. Laptev, and C. Schmid, "Actions in context," in Computer Vision and Pattern Recognition, IEEE Conference on, 2009, pp. 2929-2936.
[3] X. Wang and Q. Ji, "Incorporating contextual knowledge to dynamic Bayesian networks for event recognition," in International Conference on Pattern Recognition, 2012.
[4] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing objects by their attributes," in Computer Vision and Pattern Recognition, IEEE Conference on, June 2009, pp. 1778-1785.
[5] Y. Wang and G. Mori, "A discriminative latent model of object classes and attributes," in ECCV, 2010, pp. 155-168.
[6] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action recognition with depth cameras," in Computer Vision and Pattern Recognition, IEEE Conference on, 2012.
[7] A. Quattoni, S. Wang, L. P. Morency, M. Collins, and T. Darrell, "Hidden-state conditional random fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[8] V. Vapnik and A. Vashist, "A new learning paradigm: Learning using privileged information," Neural Networks, vol. 22, pp. 544-557, July 2009.
[9] V. Vapnik, A. Vashist, and N. Pavlovitch, "Learning using hidden information (learning with teacher)," in Neural Networks, 2009. IJCNN 2009. International Joint Conference on. IEEE, 2009, pp. 3188-3195.
[10] D. Pechyony and V. Vapnik, "On the theory of learning with privileged information," in Advances in Neural Information Processing Systems 23, 2010.
[11] D. Pechyony, R. Izmailov, A. Vashist, and V. Vapnik, "SMO-style algorithms for learning using privileged information," in DMIN'10, 2010, pp. 235-241.
[12] D. Pechyony and V. Vapnik, "Fast optimization algorithms for solving SVM+," in Statistical Learning and Data Science. Chapman & Hall, 2011.
[13] L. Niu and J. Wu, "Nonlinear L-1 support vector machines for learning using privileged information," in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on. IEEE, 2012, pp. 495-499.
[14] V. Sharmanska, N. Quadrianto, and C. H. Lampert, "Learning to rank using privileged information," in ICCV, 2013.
[15] M. Lapin, M. Hein, and B. Schiele, "Learning using privileged information: SVM+ and weighted SVM," arXiv preprint arXiv:1306.3161, 2013.
[16] L. Liang and V. Cherkassky, "Connection between SVM+ and multi-task learning," in Neural Networks, 2008. IJCNN 2008 (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on. IEEE, 2008, pp. 2048-2054.
[17] J. Chen, X. Liu, and S. Lyu, "Boosting with side information," in Proceedings of the 11th Asian Conference on Computer Vision, Nov 2012.
[18] H. Yang and I. Patras, "Privileged information-based conditional regression forest for facial feature detection," in Automatic Face and Gesture Recognition, 10th IEEE International Conference and Workshops on, 2013.
[19] H. Wang, F. Nie, H. Huang, S. Risacher, A. J. Saykin, and L. Shen, "Identifying AD-sensitive and cognition-relevant imaging biomarkers via joint classification and regression," in MICCAI'11, 2011, pp. 115-123.
[20] J. Feyereisl and U. Aickelin, "Privileged information for data clustering," Information Sciences, vol. 194, pp. 4-23, 2012.
[21] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider, "Learning using privileged information in prototype based models," in Artificial Neural Networks and Machine Learning, ICANN 2012. Springer, 2012, pp. 322-329.
[22] C.-N. J. Yu and T. Joachims, "Learning structural SVMs with latent variables," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1169-1176.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, 1998.
[24] ChaLearn, "ChaLearn gesture dataset (CGD2011)," California, 2011.