Training a Neural Network to Answer 8th Grade Science Questions
Steven Hewitt, An Ju, Katherine Stasaski

Problem Statement and Background

Given a collection of 8th grade science questions, possible answer choices, and a set of educational resources, can a neural network identify the correct answer choice? For example, given the question "Which of the following does not allow sound to travel through?" with answer choices of solid, liquid, gas, and vacuum, a model that successfully completes this task would output "vacuum" as the correct answer.

This task was posed as a previous Kaggle challenge and touches on several broader concerns in natural language processing (NLP) research. Question answering (QA) is a large area of NLP research and is a crucial part of shaping how humans interact with computer systems. Additionally, the ability to weigh different answer choices against one another poses an interesting test of a network's capacity to synthesize information and make decisions. The task also has a larger application in education: if we can isolate the pieces of information that helped our model answer a question, we could potentially point a student who is struggling in an area toward a resource that would help him or her answer the question and better learn the material.

The team that won the initial Kaggle challenge achieved an accuracy of 59.3% on the test question set. Second place achieved 58.3%, and third place achieved 48.1%. These accuracies were calculated on a slightly different test set than the one we use in our project, but the results should be comparable.

State-of-the-art approaches to this task include both neural and non-neural methods. One team used textbooks and outside sources to hand-craft features that were scored by a scoring function [1]. The approach starts by examining each question and forming a hypothesis for each answer choice, combining that choice with the base of the question. From there, hand-crafted features (including Tf-idf, BM-25, and entailment) comparing each hypothesis with textbooks and other knowledge sources were used in an SVM with a scoring function to select the most probable hypothesis. This approach obtained 47.8% accuracy.

Another state-of-the-art approach uses a sequence-to-sequence model to rank each answer choice on relevancy [2]. The model begins by creating a full-sentence hypothesis for each answer choice. It then creates sentence embeddings from both open source textbook sources and each hypothesis and compares them. The network used to do this is a recurrent neural network with attention. A scoring mechanism scores each hypothesis and selects a final answer. This approach obtained 44.1% accuracy.

The winner of the Kaggle challenge used a combination of 28 hand-crafted features and a regression model to output probabilities for each answer choice [3]. These features were created by searching through a variety of external documents to determine the relevancy of answer choices. An SVM was then trained to predict the correct answer choice. Their final accuracy was 59.3%.

Approach

We gathered data from two sources: a collection of 8th grade science questions from AI2 and open source textbooks from ck12.org. We also augmented the science question data with a textual entailment dataset, the Stanford Natural Language Inference (SNLI) Corpus [4].

The AI2 8th Grade Science Dataset includes 3,710 8th grade science questions, already divided into training, development, and test sets. To preprocess the data, we remove all columns of the AI2 question CSV except for the question, the four answer choices, and the correct answer. The evidence that we pass to our model to answer the questions comes from an open source collection of science textbooks from ck12.org. Textbooks provide a hierarchical representation of material that may be advantageous to our model, as information is structured and topics are clearly delineated by chapter and section titles. We preprocessed these resources by removing hyperlinks and author information, which appear on every page of the textbooks and skew our retrieval method.

Tools

TensorFlow was used to implement all of the models reported on here. Compared with other machine learning libraries such as Caffe or Theano, TensorFlow provides APIs at various levels, allowing us to improve our baseline models beyond what is possible with off-the-shelf solutions. Additionally, TensorFlow provides easy GPU support and easy-to-use visualization tools. For retrieving relevant evidence from the textbook corpus, we used PyLucene, an indexing and querying search engine library based on Tf-idf. Finally, a pretrained GloVe model was used as the word embedding model [5].
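As an illustration of the word-vector step, here is a minimal sketch of loading pretrained GloVe vectors and looking up per-token embeddings for a sentence. The file name and dimensionality are illustrative, and the helper names are ours for this sketch rather than part of the project code.

```python
import numpy as np

def load_glove(path="glove.6B.100d.txt"):
    """Parse a pretrained GloVe text file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_tokens(tokens, vectors, dim=100):
    """Look up a GloVe vector per token; unknown words map to zeros."""
    return np.stack([vectors.get(t.lower(), np.zeros(dim, dtype=np.float32))
                     for t in tokens])

# Example: token-level embeddings for one hypothesis sentence.
glove = load_glove()
sentence = "vacuum does not allow sound to travel through".split()
emb = embed_tokens(sentence, glove)   # shape: (num_tokens, 100)
```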

Models

We took two main approaches to the question answering problem: entailment models and general QA models. We implemented a total of four entailment models (including our baseline) and two general question answering models.

Entailment Models

The structure of our entailment model pipeline, introduced by Baudis et al. [2], is shown in Figure 1.

Figure 1: Entailment model structure.

The pipeline is composed of four modules:

1. Form Hypothesis (FH): Given a question and an answer choice, the FH module generates a hypothesis. For example, given the question "Which of the following does not allow sound to travel through?" and the answer choice "vacuum", FH outputs the hypothesis "vacuum does not allow sound to travel through". The FH module is implemented mainly with regular expression substitution.

2. Find Evidence (FE): Given a hypothesis, the FE module searches the textbooks for supporting evidence, i.e. sentences that support or refute the hypothesis. For example, given the hypothesis "vacuum does not allow sound to travel through", the FE module returns sentences from the textbooks such as "sound cannot travel in a vacuum", "sound must travel through air", and "vacuum means the absence of air in an environment". The FE module also returns a confidence score for each piece of evidence. It is implemented with PyLucene.

3. Entailment (E): The entailment model computes the entailment relation between two vectorized input sentences. If sentence A entails sentence B, then given the information in sentence A, the information in sentence B logically follows. Conversely, if sentence A contradicts sentence B, then given sentence A, sentence B cannot logically follow. For each sentence pair, the output is a probability distribution over three categories: contradiction, entailment, and neutral. This distribution is further converted into a single score, where -1 means contradiction, 0 means neutral, and 1 means entailment. Each of the four entailment models described below was tested as this part of the pipeline.

4. Evidence Weighing (EW): Given the confidence c_i and the entailment score r_i for each piece of evidence i, EW computes a single score for the answer: y = Σ_i c_i r_i. The final answer is the choice with the maximum y value.

Baseline Model

The Baseline model, shown in Figure 2, is a bidirectional RNN that takes the two sentences concatenated as input, applies dropout at the word level, and passes the resulting sentence embedding to a fully connected layer with dropout and regularization to determine entailment.

Figure 2: Baseline model architecture.
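For concreteness, a minimal sketch of the Baseline architecture is shown below. It uses the modern tf.keras API rather than the lower-level TensorFlow graph code we actually wrote, and the layer sizes, dropout rates, and sequence length are illustrative rather than our exact hyperparameters.

```python
import tensorflow as tf

EMB_DIM, MAX_LEN, NUM_CLASSES = 100, 60, 3  # illustrative sizes

# Input: GloVe vectors for the premise and hypothesis tokens, concatenated along time.
tokens = tf.keras.Input(shape=(MAX_LEN, EMB_DIM), name="premise_plus_hypothesis")

x = tf.keras.layers.Dropout(0.2)(tokens)                         # word-level dropout
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128))(x)  # sentence embedding
x = tf.keras.layers.Dropout(0.5)(x)
probs = tf.keras.layers.Dense(
    NUM_CLASSES, activation="softmax",
    kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)        # contradiction / neutral / entailment

model = tf.keras.Model(tokens, probs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```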

Siamese Model

The Siamese model, shown in Figure 3, takes an approach similar to the Baseline model. However, the sentence embeddings are generated separately, and the difference vector between them is fed as input to a fully connected layer to determine entailment [6]. We hypothesized that by generating the sentence embeddings separately and having the fully connected layer compare the two embeddings, the entailment model would be more robust and outperform the baseline.

Figure 3: Siamese model architecture.

MFF Model

We wanted to compare our sequence-based models to a non-sequence model, based on Parikh et al. [7]. We implemented a multi feed-forward (MFF) model, shown in Figure 4. This model creates an attention map from the two sentences and uses several feed-forward neural networks to synthesize it and determine entailment. As shown in Figure 4, each of the three modules contains a nonlinear function, implemented as a feed-forward network with ReLU activations. The hidden layers of these networks are all the same size. We tested two configurations: MFF-32, with a hidden layer of size 32, and MFF-64, with a hidden layer of size 64.

Figure 4: MFF model architecture.
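To make the attention-map idea concrete, here is a small sketch of building a word-pair attention matrix between a hypothesis and an evidence sentence from GloVe token vectors (reusing the illustrative embed_tokens helper above). It is purely illustrative: the models of Parikh et al. first project the vectors through a learned feed-forward network before scoring, whereas this sketch uses raw dot products. The MFF model feeds such a map through feed-forward layers, while the CNN variant described next treats a stack of such maps as input channels.

```python
import numpy as np

def attention_map(hyp_vectors, ev_vectors):
    """Word-pair attention: softmax-normalized dot products between
    every hypothesis token vector and every evidence token vector."""
    scores = hyp_vectors @ ev_vectors.T            # (len_hyp, len_ev)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# attn[i, j] is how strongly hypothesis token i attends to evidence token j.
# attn = attention_map(embed_tokens(hyp_tokens, glove), embed_tokens(ev_tokens, glove))
```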

Convolutional Neural Network (CNN) Model

As a variant of the MFF model, we propose using a convolutional neural network rather than feed-forward networks to synthesize the attention information, as shown in Figure 5. We use multiple attention maps so that we can capture different types of word correlation. Each attention map is treated as an input channel for the CNN, which captures both local and global information in the attention maps. A fully connected layer with softmax at the end of the network outputs a probability distribution over the three categories. Due to limited computing resources, we use only one CNN layer with a single 1x1 kernel, which combines the channels into a single matrix.

Figure 5: CNN model architecture.

General Question Answering Models

In addition to comparing multiple entailment models, we wanted to compare our results against an entirely different approach to question answering, one that searches through a paragraph of information for the answer to a question instead of comparing two sentences to determine entailment.

Dynamic Memory Network

The Dynamic Memory Network (DMN) model was introduced by Kumar et al. [8]. It is composed of five parts: a semantic memory module, an input module, a question module, an episodic memory module, and an answer module, as shown in Figure 6. The episodic memory module is the core of the DMN. It acts as a soft attention over facts (evidence retrieved by PyLucene in our case) and is designed to emulate how human attention shifts over time when answering a question. Attention is controlled by a recurrent neural network that is initialized with the question embedding and takes as input the information synthesized from the facts under the current attention. As shown in the paper, this module can also provide an appealing visualization of the question answering process: the attention shifts over time much as a human's would while answering the question, which can be interpreted as the model reasoning over the facts.

Figure 6: Architecture of the DMN model [8].
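A heavily simplified sketch of the soft attention at the heart of the episodic memory module follows. It compresses the learned gating network of Kumar et al. into plain dot-product scores and omits the GRU over facts, so it only illustrates the shape of one attention pass; the fact and question embeddings are assumed to be precomputed.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def episode(facts, question, memory):
    """One (simplified) pass of episodic attention.
    facts:   (num_facts, d) evidence sentence embeddings
    question, memory: (d,) vectors; memory starts as the question embedding."""
    # Score each fact by its similarity to the question and the current memory.
    gates = softmax(facts @ question + facts @ memory)
    # The episode is the attention-weighted summary of the facts; the full DMN
    # feeds this into a GRU to produce the next memory state.
    return gates @ facts, gates

# memory = question_embedding
# for _ in range(num_passes):
#     memory, attn = episode(fact_embeddings, question_embedding, memory)
```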

End-to-End Memory Network

In order to compare the DMN to another type of QA system, we also tested an end-to-end (E2E) memory network, as used by Bordes et al. in 2015 [9]. The E2E memory system relies on a memory structure and a number of hops over the input passage to reason about the question and produce an answer. In the original model implementation, the data was labeled with which passage lines were necessary to answer the question. However, we did not have the time or bandwidth to hand-label all of the training and test passages with relevant line numbers. Therefore, we ran this model with two different data settings: one that listed all lines in the passage as necessary to attend to, and one that listed none of them. We wanted to compare the two settings and see whether insights could be gained by running the model regardless.

Figure 7: (a) A single layer of the E2E network. (b) The combination of three layers that comprise the final model [9].
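For reference, here is a minimal numpy sketch of a single memory hop in an end-to-end memory network. The embedding matrices are random and the bag-of-words line representation is used only to show the shapes; the real model learns A, B, and C jointly, adds positional and temporal features, and stacks several hops before an answer layer.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 5000, 64                      # illustrative sizes
A = rng.normal(size=(vocab, d)) * 0.1    # memory (input) embedding
B = rng.normal(size=(vocab, d)) * 0.1    # question embedding
C = rng.normal(size=(vocab, d)) * 0.1    # memory (output) embedding

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(passage_bows, question_bow):
    """passage_bows: (num_lines, vocab) bag-of-words per passage line;
    question_bow: (vocab,). Returns the question state after one hop."""
    m = passage_bows @ A                 # one memory vector per passage line
    c = passage_bows @ C                 # one output vector per passage line
    u = question_bow @ B                 # question vector
    p = softmax(m @ u)                   # attention over passage lines
    return u + p @ c                     # fed to the next hop or the answer layer
```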

Results

As a preliminary step, we evaluated our entailment models on the SNLI dataset, a collection of sentence pairs designed specifically for entailment [4]. Each pair of sentences is accompanied by a label (entailment, contradiction, or neutral) as well as a confidence score. This was done to see whether our models could successfully complete an entailment task independent of the science question pipeline. Results are summarized in Figure 8.

Figure 8: Results of each model on SNLI entailment data.

After confirming that our models could complete entailment tasks, we turned to testing them on the AI2 science question data. Results for these experiments can be seen in Figure 9.

Figure 9: Results of each model on AI2 science question data. (Note: "E2E w/ lines" has all lines included in the attention, while "E2E no lines" has none.)

The non-neural Tf-idf baseline was computed by choosing the answer choice whose hypothesis had the highest Tf-idf similarity to a sentence from the textbook.
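A sketch of that baseline using scikit-learn is shown below (the vectorizer settings are illustrative, not necessarily the ones we used): fit Tf-idf on the textbook sentences, then pick the answer whose hypothesis has the highest cosine similarity to any single textbook sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_baseline(hypotheses, textbook_sentences):
    """hypotheses: one hypothesis string per answer choice.
    Returns the index of the answer whose hypothesis best matches a textbook sentence."""
    vectorizer = TfidfVectorizer(stop_words="english")
    sent_matrix = vectorizer.fit_transform(textbook_sentences)
    hyp_matrix = vectorizer.transform(hypotheses)
    # Best single-sentence similarity for each candidate hypothesis.
    best_per_hypothesis = cosine_similarity(hyp_matrix, sent_matrix).max(axis=1)
    return int(best_per_hypothesis.argmax())
```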

Lessons Learned

Running experiments and analyzing the performance of the six models we ultimately implemented yielded sub-par results. In fact, most of our neural models were not able to surpass the non-neural Tf-idf baseline of 33%. Thus, we explored why our models were underperforming and hypothesized several reasons.

Question Difficulties

One hypothesis for why our models were not performing well on the AI2 data was that the questions in the training and test sets varied enough that no single optimal network would work for every question. We examined the questions by hand and evaluated them in two ways: by question length and by question type. Each was compared using one representative sequence-based model (Baseline) and one non-sequence model (MFF).

Question length refers to whether or not a question includes one or more informative sentences that are necessary to answer it. For example, the question "The metal lid on a glass jar is hard to open so it is held under warm running water. What causes the jar to open easily after it was held under the water?" requires information from the first sentence to choose the correct answer. Results can be seen in Figure 10. Our hypothesis was that both networks would perform better on shorter questions, as these are typically more conceptual and place fewer demands on memory. This held true for the MFF model, but an interesting result was the Baseline model's comparative advantage on long questions. Even though MFF outperforms the Baseline on long questions overall, the Baseline model does better on longer questions than on shorter ones. Thus, it may be beneficial to split the questions and train networks specifically to answer longer or shorter questions.

Question type refers to the category of a question, determined by its "wh" word and other key words; results can be seen in Figure 11. It is evident that each model has certain strengths and shortcomings. Because of this, it may be worth exploring a multi-network approach that trains multiple networks, each on a single type of question.

Figure 10: The accuracy of the two representative models on questions of different lengths.

Figure 11: The accuracy of the two representative models on questions of different types.

Model Shortcomings

Entailment Models

The significant deterioration in performance when switching from entailment data to science question data leads us to believe that entailment may not have been the right approach to this problem. One obvious shortcoming of an entailment model is that it is designed to compare two sentences to each other directly. This means that any information we want the network to use to choose a given answer has to be captured entirely in one sentence. This is not the case for many questions in the dataset, which either require information from more than one sentence or require a level of complex reasoning that an entailment model does not capture.

General QA Models

The memory-based QA models were implemented as a first step to test the hypothesis that entailment models were not the optimal way to answer 8th grade science questions. Memory-based methods work best when the specific lines in the passage that the network should use to answer a question are explicitly stated.

However, the DMN implementation assumes that all lines in the evidence are equally relevant (which is not necessarily the case), while with the E2E approach we tried both treating all lines as equally relevant and giving the network no line numbers at all. Giving the model no information about line numbers slightly outperformed assuming all lines were relevant, which speaks to the importance of being able to backpropagate through the locations in the passage that the network should attend to.

Neural Models

It has been noted in several papers, blog posts, and talks that the AI2 dataset is difficult for neural models [10] [11] [12]. The fact that a simple non-neural baseline outperforms almost all of our neural models seems to confirm this. As discussed above, textual-entailment-based models have the intrinsic shortcoming of not being able to capture correlations among pieces of evidence. The QA models were studied to address this issue. However, it is still questionable whether such a model can perform human-like reasoning. According to our observations, answering certain questions in the AI2 dataset requires not only word-based reasoning but also pattern-based reasoning. Word-based reasoning only requires the model to capture information through word matching, such as the tasks in the bAbI dataset, on which the DMN and E2E memory networks perform very well [9]. Pattern-based reasoning, however, requires the model to capture information beyond surface word matches: syntax, semantics, definitions, background knowledge, and correlations between agents are all important for answering the question. Understanding any one of these is a difficult task with a rich literature of its own, to say nothing of combining all the pieces and reasoning over them. Nevertheless, it is still worth studying this problem with neural models. The argument above assumes neural models answer questions the same way humans do, which might not be the case. It is possible that a neural model, with enough training data and external resources, could answer these questions with high accuracy in a purely lookup-based way. If such a model exists, it could still have a huge impact on areas such as search and education.

Next Steps

We are interested in further exploring the findings of this report in several ways. In order to get a fully comparable QA model, we wish to hand-label the attention for the training and test data and build a more robust memory-based QA model. Additionally, we aim to study the differences between the entailment and QA models in terms of their performance on different types of questions as well as their attention patterns, for example whether the attentions are similar for the same question. We also aim to compare non-neural models with neural models, with the goal of understanding which types of questions non-neural models answer better than neural models and which patterns non-neural models capture that neural models miss. Finally, we would like to explore further ways of improving our neural models as well as the non-neural parts of the entailment pipeline that we have not yet considered. All in all, there are a multitude of ways to further explore this topic, and we wish to continue analyzing neural approaches to science question answering.

Team Contributions and Workload Percentage

An Ju (⅓): Wrote the basic training structure in TF. Integrated and tested the baseline model. Wrote scripts to speed up training. Wrote scripts to test model modules. Wrote the Siamese, MFF, and CNN models in TF. Helped write the DMN model in TF. Hyperparameter tuning.

Steven Hewitt (⅓): Created and improved GloVe-based word vectorization, improved hypothesis generation, and wrote data gathering scripts. Helped write the DMN model. Hyperparameter tuning.

Katherine Stasaski (⅓): Created the initial evidence retrieval method from the textbooks, improved evidence gathering by switching to PyLucene, created the first version of the entailment model in TF (later improved by An), preprocessed the data, created the naive hypothesis generator (later improved by Steven), found the MFF paper, and created the end-to-end question answering model in TF. Hyperparameter tuning.

References

[1] Sachan, M., Dubey, A., & Xing, E. P. Science Question Answering using Instructional Materials. 467-473.
[2] Baudis, Petr, Silvestr Stanko, and Jan Sedivy. 2016. Joint Learning of Sentence Embeddings for Relevance and Entailment. 8-17. http://arxiv.org/abs/1605.04655.
[3] https://github.com/cardal/kaggle_allenaiscience
[4] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[5] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.
[6] J. Mueller and A. Thyagarajan. Siamese Recurrent Architectures for Learning Sentence Similarity. AAAI, 2016.
[7] Parikh, Ankur P., Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. arXiv. http://arxiv.org/abs/1606.01933.
[8] Kumar, Ankit, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2015. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing. http://arxiv.org/abs/1506.07285.
[9] Bordes, Antoine, Nicolas Usunier, Sumit Chopra, and Jason Weston. Large-scale Simple Question Answering with Memory Networks. arXiv preprint arXiv:1506.02075, 2015.
[10] May, Rob. "How We Approached The Allen A.I. Challenge on Kaggle." 11 Jan. 2016. Web. Accessed 14 Dec. 2016.
[11] Vorontsov, Konstantin. "DeepHack.Q&A: Regularization of Topic Models for Question Answering." YouTube. 01 Feb. 2016. Web. Accessed 14 Dec. 2016.
[12] "Implementing Dynamic Memory Networks." YerevaNN. 05 Feb. 2016. Web. Accessed 14 Dec. 2016.