CS224D Final Report: Deep Recurrent Attention Networks for LaTeX to Source


Keegan Go, Department of Computer Science, Stanford University, Stanford, CA 94305, keegango@stanford.edu
Kenji Hata, Department of Computer Science, Stanford University, Stanford, CA 94305, khata@stanford.edu

Abstract

For our project, we explore the problem of recognizing LaTeX expressions in images and translating them to source using a deep neural network. Previous work involving attention models has improved sequence-to-sequence mappings and greatly helped in digit recognition. Inspired by this work, we implemented an attention model to recognize simple LaTeX expressions and also tested it on a small subset of CROHME, a dataset of handwritten mathematical expressions. We thoroughly explain our network, training procedure, hyperparameter tuning, and results. We achieve a classification accuracy of 63.4% on an automatically generated dataset and 28.7% on a subset of CROHME.

1 Introduction

LaTeX documents pervade the computer science community (and many others). In this project, we tackle the problem of transcribing LaTeX expressions to source code. The applications of a highly robust system for transcribing images of LaTeX to source code are numerous: for example, taking a picture of any paper and obtaining its source to compile and edit would be extremely useful for the research community.

The expressions found in LaTeX equations naturally have much more structure than regular text. For instance, the expression 1 + 1 + 4 + 1 + 1 + 4 + 1 + 1 + 4 has a recursive structure, which a deep convolutional recurrent network could learn and generalize to other expressions.

Convolutional neural networks have recently seen great success, and subsequently gained immense popularity, in numerous classification and recognition tasks [10]. Additionally, recurrent networks have proven able to learn sequence-to-sequence mappings. Combinations of these two types of networks have become extremely effective recently, especially with the addition of attention mechanisms that further improve how deep networks see images [1, 4, 5].

In this project, we implement a deep recurrent, convolutional network with an attention model in order to identify LaTeX tokens from simple expressions in images. We discuss our training process, hyperparameter tuning, and the results our network produces. Overall, we find that this network achieves an accuracy of 63.4% over 41 token classes.

Figure 1: An example automatically generated LaTeX image.

2 Related Work

2.1 Digit, Handwriting, and Mathematical Expression Recognition

Digit recognition is a simpler classification problem than full mathematical expression recognition, but it is still instructive. For example, the MNIST dataset [11] of handwritten digits has been used as a benchmark for a wide variety of machine learning algorithms. A more unstructured application of digit recognition is Google's transcription of Street View house numbers [4]. There are two predominant forms of handwriting recognition: online, where the writing is recorded in real time [6], and offline, where the data is stored statically, such as on paper [7]. The Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) dataset [13] provides a mapping from handwriting strokes to mathematical expressions. Work submitted to the CROHME competition often includes online recognition using stroke and shape-context features with AdaBoost [9], and a variety of foundational machine learning techniques (support vector machines, random forests, etc.) for offline recognition [2].

Previous approaches to LaTeX expression recognition in apps like Photomath usually involve segmenting each individual token and then running each segmented token through a classifier. As far as we know, there has been no substantial work on recognizing LaTeX expressions using a deep neural network. Thus, we believe this project is a natural extension and application of previous deep learning techniques.

2.2 Attention Models

Many deep network models take inspiration from the way humans perform visual recognition tasks, specifically focusing on relevant areas as they progress through sequences [3]. Attention models for deep neural networks focus on different parts of images or sentences to handle the recurrent or sequential nature inherent to many tasks [5, 8]. These attention models have proven to work on a wide variety of real-world problems, such as image captioning [15], multiple digit recognition [1], and image generation [8].

3 Data and Model

3.1 Data

We initially simplify our problem as much as possible by generating LaTeX documents and taking images of their rendered PDFs. By doing so, we can artificially generate as much data as we want to better train our neural networks. A model trained this way could then be applied to images of long expressions from books or papers that one would generally not want to re-typeset.

For our dataset, we generated ten thousand images of three- to ten-character expressions drawn from a set of 41 tokens (a to z, 0 to 9, and a few operators). To generate each expression, we first choose the length uniformly at random from the above range, and then choose characters uniformly at random from the character set. We did not enforce any semantic constraints at this point, since we wanted our model to learn how to map arbitrary sequences back into source. Figure 1 shows an example image from our automatically generated dataset. While the compiled output varied in height and width, we standardized our data by first scaling all images to a height of 32 pixels and then padding each width to that of the widest image. While we initially hoped to avoid the final padding, we found it difficult to process the data in batches without matching the image sizes.

Additionally, there exists the CROHME dataset [13], which allows us to extend to the handwritten domain as well. Figure 2 shows an example image from the CROHME dataset.
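To make the generation procedure concrete, the following is a minimal Python sketch of the pipeline. It is a reconstruction under stated assumptions, not our original code: the exact operator set is not recorded above, and we substitute matplotlib's mathtext renderer for a full LaTeX-to-PDF toolchain.

```python
# Sketch of the synthetic data pipeline: sample a 3-10 token expression from
# a 41-token alphabet, render it, scale to height 32, and pad widths in a
# batch to the widest image. The operator set and renderer are assumptions.
import random
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image

TOKENS = [chr(c) for c in range(ord("a"), ord("z") + 1)] \
       + [str(d) for d in range(10)] \
       + ["+", "-", "=", "<", ">"]          # 26 + 10 + 5 = 41 tokens

def random_expression():
    length = random.randint(3, 10)          # length uniform over 3..10
    return [random.choice(TOKENS) for _ in range(length)]

def render(tokens, height=32):
    """Render a token sequence and scale it to a fixed height of 32 px."""
    fig = plt.figure(figsize=(4, 1), dpi=100)
    fig.text(0.05, 0.4, "$" + " ".join(tokens) + "$", fontsize=20)
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    plt.close(fig)
    img = Image.fromarray(rgba[..., :3]).convert("L")
    w, h = img.size
    return np.asarray(img.resize((max(1, w * height // h), height)))

# Pad every image in a batch to the width of the widest one.
batch = [render(random_expression()) for _ in range(8)]
max_w = max(im.shape[1] for im in batch)
batch = np.stack([np.pad(im, ((0, 0), (0, max_w - im.shape[1])),
                         constant_values=255) for im in batch])
```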

Figure 2: An example CROHME image.

Figure 3: The RAM network we used to predict LaTeX tokens. Multiple RAM networks can be stacked together to predict multiple tokens.

3.2 Model

Noticing that mathematical expression recognition, in a very simplified sense, can be represented as a sequence of digit or character recognitions, we drew inspiration from Recurrent Attention Model (RAM) networks [12, 1]. RAM networks identify salient parts of the image to look at and focus on these parts to improve their discriminative ability. At a high level, our model consists of the following components:

Glimpse Network: At time step $t$, given a location $l_t$, we generate a glimpse $x_t$ of the image. This glimpse mirrors foveal vision in humans, in that the area the glimpse focuses on is at the highest resolution; as pixels become more distant from the glimpse center, their resolution progressively decreases. We send each glimpse $x_t$ through three convolutional layers and a fully connected layer to generate $g^{(x)}_t$. Furthermore, we send each location $l_t$ through a fully connected layer to generate $g^{(l)}_t$.

Recurrent Network: We have two recurrent networks interwoven:
$$r^{(1)}_t = \mathrm{LSTM}\!\left(g^{(x)}_t g^{(l)}_t,\; r^{(1)}_{t-1}\right)$$
$$r^{(2)}_t = \mathrm{LSTM}\!\left(r^{(1)}_t,\; r^{(2)}_{t-1}\right)$$
where $g^{(x)}_t g^{(l)}_t$ denotes the combined glimpse feature described below. Long short-term memory (LSTM) units provide the nonlinearity because of their ability to learn stable, long-range dependencies.

Classification Network: By sending the final $r^{(1)}_n$ through a fully connected layer and into a softmax, we compute the probabilities of the next LaTeX token. The most probable token is our predicted value.

Emission Network: Using a fully connected layer with $r^{(2)}_t$ as input, we compute $l_{t+1}$, the next location at which the network will focus its attention.

To be explicit, we give the sizes and compositions of each component in the network. The Glimpse Network is composed of two sub-networks. It first maps the location (a 2-vector) and the associated glimpse (an 8×8 image) into two separate 128-vector hidden states using ReLUs. It then combines these two state vectors into a single 256-vector hidden state, again using a ReLU. We used LSTMs with hidden layers of size 256 for our Recurrent Networks. Our Classification Network is a single fully connected layer fed into a softmax loss; it takes the 256-dimensional hidden state and maps it to predictions over the 41 character classes. The Emission Network consists of an affine layer fed into a tanh nonlinearity, which bounds the locations to the square given by ±1 along each axis. We then scale these coordinates by the image height and width in order to allow the algorithm to gaze at any part of the image.
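The following sketch assembles these components with the sizes stated above (2-vector location, 8×8 glimpse, 128- and 256-dimensional hidden states, 41 classes). It is written in PyTorch rather than the Torch/TensorFlow we actually used, and the convolution channel counts and foveated glimpse extraction, which are not specified above, are assumptions.

```python
# Minimal sketch of the model's components; channel counts and the glimpse
# extraction itself (foveation) are omitted or assumed.
import torch
import torch.nn as nn

class GlimpseNetwork(nn.Module):
    def __init__(self, glimpse_size=8, hidden=128, out=256):
        super().__init__()
        # Three conv layers + a fully connected layer for the glimpse patch.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.fc_x = nn.Linear(32 * glimpse_size * glimpse_size, hidden)
        self.fc_l = nn.Linear(2, hidden)          # location is a 2-vector
        self.fc_g = nn.Linear(2 * hidden, out)    # combine into a 256-vector

    def forward(self, x_t, l_t):
        # Two 128-vector hidden states (ReLU), then a combined 256-vector.
        g_x = torch.relu(self.fc_x(self.conv(x_t).flatten(1)))
        g_l = torch.relu(self.fc_l(l_t))
        return torch.relu(self.fc_g(torch.cat([g_x, g_l], dim=1)))

glimpse_net = GlimpseNetwork()
rnn1 = nn.LSTMCell(256, 256)   # r1: feeds the classification network
rnn2 = nn.LSTMCell(256, 256)   # r2: feeds the emission network
classifier = nn.Linear(256, 41)                        # softmax over 41 tokens
emission = nn.Sequential(nn.Linear(256, 2), nn.Tanh())  # l_{t+1} in [-1, 1]^2
```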

Our model was implemented in Torch (and partially in TensorFlow) on an NVIDIA GTX 970.

3.3 Training

While the model described above is powerful and has led to high-performance results in a number of areas, it introduces additional complexity into training. In particular, the step of indexing into the image to create the next glimpse is a non-differentiable function, which means that standard backpropagation alone is not sufficient to train the model. The solution is to use reinforcement learning to train the model to select glimpse locations that lead to good classification results. Like the other attention models, we used the REINFORCE algorithm [14]. At a high level, this technique samples around the predicted location in the forward pass, and then treats the sampled location as a fixed input in the backward pass. To propagate error through the network, the Emission Network generates its own gradient using a reward function based on the success of the classification.
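A single training step under this hybrid scheme might look as follows. This is a hedged sketch rather than our Torch code: the `model` interface (returning logits plus the emitted location means and the locations actually sampled around them) and the default reward scale are hypothetical.

```python
# One hybrid backprop/REINFORCE step: cross-entropy trains the
# differentiable path; the emission network is trained by weighting the
# log-probability of each sampled location by the classification reward.
import torch
import torch.nn.functional as F

def train_step(model, images, targets, optimizer, std=0.2, reward_scale=1.0):
    optimizer.zero_grad()
    # Forward pass: loc_means are the emission network's outputs and
    # loc_samples ~ Normal(loc_means, std) are the glimpse locations used.
    logits, loc_means, loc_samples = model(images, std=std)

    # Differentiable path: ordinary softmax cross-entropy on the tokens.
    class_loss = F.cross_entropy(logits, targets)

    # Non-differentiable path: reward 1 for a correct prediction, else 0.
    with torch.no_grad():
        reward = (logits.argmax(dim=1) == targets).float() * reward_scale

    # REINFORCE: grad of log p(sample) weighted by the reward; the sampled
    # location is treated as a fixed input, as described in the text.
    dist = torch.distributions.Normal(loc_means, std)
    log_p = dist.log_prob(loc_samples).sum(dim=-1).sum(dim=0)  # per example
    reinforce_loss = -(log_p * reward).mean()

    (class_loss + reinforce_loss).backward()
    optimizer.step()
    return class_loss.item(), reward.mean().item()
```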
4 Experiment and Results

4.1 Hyperparameter Tuning

After implementing the RAM network, we wanted to determine which hyperparameters were the most influential on the network's ability to learn. After a few quick tests, we found that the learning rate, the reward scale for REINFORCE, and the standard deviation of the Monte Carlo sampling were among the most influential factors. We therefore ran a standard grid search over these hyperparameters to find a combination that worked well. Figure 4 shows the results of our tuning; the best-performing combination of learning rate, reward scale, and standard deviation is shown in red.

However, upon completing this grid search, we wanted to further test the effect of the standard deviation of the Monte Carlo sampling, since the network seemed somewhat sensitive to it. We believe this is because, in order for the network to read each token at the correct position, the glimpse windows need to adjust quickly toward the correct locations during training. We therefore performed additional tuning on the standard deviation, keeping the previously chosen learning rate and reward scale. As seen in Figure 5, we found that the standard deviation of the Monte Carlo sampling did not play a major role. However, we noticed that if the value was too low or too high, the network would sometimes not learn at all. We believe this effect stems from the reinforcement learning failing to improve the emission network.
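The grid search itself can be driven by a few lines like the following. The grids shown are purely illustrative assumptions, and `train_and_evaluate` is a placeholder for a full training run returning validation accuracy.

```python
# Hypothetical grid-search driver; the hyperparameter grids and the
# train_and_evaluate() helper are illustrative assumptions.
import random
from itertools import product

def train_and_evaluate(lr, reward_scale, std):
    """Placeholder: train the RAM network to convergence with the given
    hyperparameters and return validation accuracy."""
    return random.random()  # stand-in so the driver runs end to end

learning_rates = [0.005, 0.01, 0.05, 0.1]
reward_scales = [0.1, 1.0]
sampling_stds = [0.1, 0.2, 0.4]

best = None
for lr, reward, std in product(learning_rates, reward_scales, sampling_stds):
    val_acc = train_and_evaluate(lr=lr, reward_scale=reward, std=std)
    if best is None or val_acc > best[0]:
        best = (val_acc, lr, reward, std)

print("best (val_acc, lr, reward_scale, std):", best)
```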

Figure 4: Hyperparameter tuning on the learning rate, the reward scale for REINFORCE, and the standard deviation of the Monte Carlo sampling, using a grid search. The best-tuned network (red line) shows a quick drop in loss and a correspondingly sharp increase in accuracy.

4.2 Results

Our best model achieves 58.6% accuracy on the training set, 66.5% accuracy on the validation set, and 63.4% accuracy on the test set. Although these numbers seem low compared to the 90% accuracy achieved by similar networks on the MNIST dataset, we believe that our data is harder for the network to learn because of the additional tokens within each image. These additional tokens make it difficult for the emission network to predict the next location.

4.3 Further Experiments on CROHME

As mentioned previously, we wanted to test our network on the CROHME dataset. Although the majority of the CROHME dataset consists of very complex expressions, we chose a subset of around 500 images of easier expressions and tried to classify only a reduced set of tokens (a few common characters and numbers). Ultimately, using our best model, we achieved a training set accuracy of 45.1%, a validation set accuracy of 32.3%, and a test set accuracy of 28.7%. This is still much better than chance, but there is a lot of room for improvement.

Figure 5: Hyperparameter tuning specifically on the standard deviation of the Monte Carlo sampling. We found that when the value was too low or too high, the network more frequently failed to learn quickly.

5 Conclusion

We have implemented an attention-based model for classifying LaTeX expressions character by character. We trained the network while varying hyperparameters and found a configuration that trains quickly and achieves significant results on our test set. However, as mentioned, training this type of model introduces a number of new parameters associated with the reinforcement learning that must be tuned in conjunction with the usual parameters. In addition to the larger parameter space, the parallel updates to the model from the reward and error terms lead to more complicated behavior, so it is difficult to tell whether the best possible results are being achieved.

For future work, we would like to extend our model to make accurate predictions for arbitrary numbers of digits and for structured expressions. The second of these is especially interesting, but will likely require a more recursive structure, perhaps outputting multiple subsequent glimpse locations and then tracing each of them separately.

References

[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

[2] K. Davila, S. Ludi, and R. Zanibbi. Using off-line features and synthetic data for on-line handwritten math symbol recognition. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 323–328. IEEE, 2014.
[3] R. Desimone and J. Duncan. Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18(1):193–222, 1995.
[4] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.
[5] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(5):855–868, 2009.
[7] A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 545–552, 2009.
[8] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
[9] L. Hu and R. Zanibbi. Segmenting handwritten math symbols using AdaBoost and multi-scale shape context features. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 1180–1184. IEEE, 2013.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998.
[12] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.
[13] H. Mouchere, C. Viard-Gaudin, R. Zanibbi, and U. Garain. ICFHR 2014 competition on recognition of on-line handwritten mathematical expressions (CROHME 2014). In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 791–796. IEEE, 2014.
[14] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.
[15] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.