Relation Classification with Gated Recursive Convolutional Networks

Karl-Heinz Krachenfels
CIS, LMU Munich, Germany

February 21, 2017

Abstract

In this work we investigate variants of recursive convolutional networks (rcnns) for a relation classification task. We give a short intuition for why rcnns are well suited for tasks that depend on structural patterns. We then show that we can improve the performance substantially by adding a gating logic to our convolutional architecture. The task is taken from exercise 7 of the Deep Learning master course at CIS, LMU Munich, WS 2016/17.

1 Task Description and Encoding

The task is to classify a given sentence that contains a relation. The sentence is already preprocessed and consists of a left context (5 words), a middle context (10 words) and a right context (5 words), separated by a query tag and an arg tag (see figure 1). The task is to classify the input into one of 5 given relation classes.

Because the convolutional network does not receive the input sequentially and the weight matrix is shared throughout the convolutional network (in contrast to the convolutional architecture we used in the lecture, where we had 3 segments with different weight matrices), we add 12 padding symbols to our encoding. In this way the convolutional operators can determine in which context a word occurs as early as possible in the information flow through the network layers.

Figure 1: Encoding with 12 different padding symbols
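To make the encoding concrete, the following is a minimal Python sketch of such an encoder. The segment sizes (5/10/5) and the total input length of 32 (see section 3.2) come from the text; the symbol names and the exact placement of the 12 distinct padding symbols are illustrative assumptions.

    # Hypothetical encoder; segment sizes (5/10/5) and total length 32 are
    # from the paper, the padding layout and symbol names are assumptions.
    PADS = [f"<PAD{i}>" for i in range(12)]  # 12 distinct padding symbols

    def encode(left, middle, right):
        assert len(left) <= 5 and len(middle) <= 10 and len(right) <= 5
        pads = iter(PADS * 3)  # cycle through the distinct symbols as needed
        seq = [next(pads) for _ in range(5 - len(left))] + left
        seq += ["<QUERY>"] + middle + [next(pads) for _ in range(10 - len(middle))]
        seq += ["<ARG>"] + right + [next(pads) for _ in range(5 - len(right))]
        seq += [next(pads) for _ in range(32 - len(seq))]  # pad to fixed length 32
        return seq

For a fully filled sentence this yields 22 content tokens plus 10 padding tokens, so every input reaches the fixed length of 32, and the distinct padding symbols let a convolutional unit infer which segment it is looking at.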
2 Motivation and Intuition for Recursive Approach

2.1 Intuition for Recursive Network

Figure 2 shows our intuition for why a recursive architecture is a good representation for structural dependencies. On the left side we see a production of a grammar that is mapped to a convolutional unit (in the middle of the figure) and the combination of multiple such units into a recursive network (on the right side). The intuition is that the production C → AB can be reduced multiple times if the representation occurs on multiple levels. This is based on the assumption that the representation of the symbols stays constant across the different layers. We believe that building neural architectures which enforce the stability of feature representations in deep networks is one of the main challenges, and a precondition for this type of recursive network to work.

Figure 2: Production, convolutional unit, multiple layers of recursive convolution

2.2 Motivation for Gating

Cho et al. (2014) investigate gated recursive CNNs (grcnns) as an alternative encoder in encoder-decoder based neural translation systems. The idea of the gating is to either pass through the left or the right input of the convolutional cell, or to apply the convolution followed by a sigmoid nonlinearity, denoted h̃ in the drawing (figure 3, left side). One of the intuitions is that a convolutional network with gating logic automatically learns the structure of the input (figure 4).

Figure 3: Gating unit, from Cho et al. (2014)

Figure 4: Learning structure, from Cho et al. (2014)
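As a reference for the gating logic, here is a minimal numpy sketch of the gated recursive convolutional unit along the lines of Cho et al. (2014); parameter names are ours, biases are omitted for brevity, and we use tanh as the candidate nonlinearity as in their paper.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def grconv_unit(h_left, h_right, Wl, Wr, Gl, Gr):
        """Combine two child vectors (dim d) into one parent vector.

        Wl, Wr: (d, d) convolution weights; Gl, Gr: (3, d) gating weights.
        """
        # candidate: the actual convolution over the two children
        h_new = np.tanh(Wl @ h_left + Wr @ h_right)
        # three scalar gates: pass the candidate, the left or the right child
        w_c, w_l, w_r = softmax(Gl @ h_left + Gr @ h_right)
        return w_c * h_new + w_l * h_left + w_r * h_right

Because the three gate weights sum to one, the unit can smoothly interpolate between copying one child upwards unchanged and computing a newly composed representation, which is exactly the behaviour sketched in figure 3.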
3 Experiments

3.1 Experiments from Lecture

We mapped all words outside the 1000 most frequent words to an <UNK> symbol, both to avoid overfitting and to reduce training times. Since the focus of this work was the comparison of architectures, and our hardware was quite limited for learning deep networks (27 layers for our deepest topology, see below), we did not invest heavily in hyperparameter optimization and trained only for one epoch. We trained on the 366,565 training samples from the Deep Learning course, exercise 7, and tested on 746 samples. We achieved 65.2% accuracy for the convolutional architecture with 3 convolutional segments followed by a maxpooling layer and a softmax layer (figure 5), and 73.5% accuracy for the variant with LSTMs, consisting of an LSTM layer followed by a fully connected layer and a softmax layer (figure 6).

Figure 5: Topology with 3-segment CNN

Figure 6: Topology with LSTM (unfolded)

3.2 Pyramidal Recursive CNN Topology

The pyramidal recursive CNN is a multi-layer CNN in which all convolutional units on all layers share their weights (figure 7). For simplicity we use the same input encoding for the pyramidal topology and for the binary topology in the next section. For this reason the encoding is not optimal: we filled it up with additional padding symbols until we reached an input length of 32. The topology has 24 recursive convolutional layers, and the top-level convolutional layer has 8 convolutional units. The input and output dimension of the convolutional cell, as well as the embedding dimension, is 50 in all our experiments with recursive CNNs. On top we added a fully connected layer with 20 hidden units, followed by a softmax layer with 5 output units for the predicted classes.

Figure 7: Pyramidal Recursive CNN

3.3 Experiments with Pyramidal Recursive CNNs

We implemented the logic of the recursive CNN in theano. The implementation contains a software switch to toggle between the gated and the non-gated variant. We measured an accuracy of 54.3% for the variant without gating and 73.9% for the variant with gating.
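The pyramidal forward pass can be summarized in a few lines, reusing the grconv_unit sketch from section 2.2; this is an illustrative reconstruction, not our theano code.

    def pyramid_forward(H, params, num_layers=24):
        """H: list of 32 embedded word vectors (dim 50); params: shared weights.

        Each layer combines every pair of neighbours, shrinking the sequence
        by one; after 24 shared-weight layers, 8 top-level vectors remain.
        """
        for _ in range(num_layers):
            H = [grconv_unit(H[i], H[i + 1], *params) for i in range(len(H) - 1)]
        return H  # 32 - 24 = 8 vectors for the classifier

The 8 remaining top-level vectors are then fed into the fully connected layer with 20 hidden units and the softmax layer (we assume they are concatenated first; this detail is not spelled out above).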
3.4 Binary Recursive CNN Topology

As a further variant we evaluated a binary recursive CNN topology. This architecture is presumably not able to represent arbitrary structures, but it is well capable of representing order-dependent features. The topology is shown in figure 8. We performed the experiments again with a gated and a non-gated variant of our recursive convolution architecture. We achieved an accuracy of 56.4% without gates and of 76.3% with the gated variant.

Figure 8: Binary Recursive CNN
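Again reusing the grconv_unit sketch, the binary topology halves the sequence at every layer; the non-overlapping pairing scheme below is our reading of figure 8.

    def binary_forward(H, params):
        """H: list of 32 embedded word vectors (a power of two).

        Each layer combines non-overlapping pairs, halving the length:
        32 -> 16 -> 8 -> 4 -> 2 -> 1 after five shared-weight layers.
        """
        while len(H) > 1:
            H = [grconv_unit(H[i], H[i + 1], *params) for i in range(0, len(H), 2)]
        return H[0]  # the single root vector, fed into the classifier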
4 Discussion

4.1 Conclusion

This work shows that recursive convolutional networks without maxpooling can solve classification tasks that depend on structure with state-of-the-art performance when the convolutional operators are combined with a gating logic. In our experiments we outperformed the LSTM variant by around 3% (76.3% vs. 73.5% accuracy). Even the deep network with 24 convolutional layers worked at a state-of-the-art level, but the intuition that this network represents the structure well and is therefore superior did not hold. One reason might be that it suffers from the depth of the network, possibly caused by the vanishing/exploding gradient problem, although the gating helps somewhat. A research direction might be to combine grcnns with techniques for building very deep networks; specifically, Highway Networks (Srivastava et al., 2015) and Residual Networks (He et al., 2016) are candidates for such architectures.

4.2 Critics and Future Work

A point of criticism might be that the feature size does not grow, so the network cannot model richer and higher-level features. A remedy could be a recursive convolution operation that reuses the weight matrix from the previous layer but adds new entries per layer, steadily increasing the dimensionality of the features; this could be seen as a compromise between feature stability on the one hand and inventing new features on the other. Another research aspect is that the gating could be applied element-wise for each feature, as in the work of Dauphin et al. (2016). Finally, all convolutions in our experiments were binary convolutions; it would also be an option to investigate broader convolutions, or to combine the binary convolution with broader inputs for the gating logic as shown in figure 9.

Figure 9: Improved gating variant

References

K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

5 Appendix

5.1 Parameters

The following parameters were used for all experiments with recursive CNNs:

learning rate = 0.1
L1 regularization = 0.00001
embedding size = 50
hidden layer size = 20
conv fan in = conv fan out = 50
learning mode: SGD
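5.2 Sketch of Element-wise Gating

For illustration, here is the element-wise gating of Dauphin et al. (2016) mentioned in section 4.2: instead of a few scalar gates, every feature dimension receives its own sigmoid gate. A minimal numpy sketch (parameter names are ours):

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def glu(x, W, b, V, c):
        """Gated linear unit: (xW + b) * sigmoid(xV + c), one gate per feature."""
        return (x @ W + b) * sigmoid(x @ V + c)

How best to fold such per-feature gates into the recursive convolutional unit is left open here; the sketch only shows the gating mechanism itself.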