Investigating how well language models capture meaning in children's books Caitlin Hult Department of Mathematics UNC Chapel Hill Deep Learning Journal Club 11/30/16
Paper to discuss The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations Authors: Felix Hill*, Antoine Bordes, Sumit Chopra, Jason Weston Facebook AI Research; University of Cambridge https://arxiv.org/pdf/1511.02301.pdf
Outline of Talk Motivation The Children's Book Test (CBT) Memory representation with Memory Networks Comparison with other models Conclusions
Motivation Humans do not interpret language in isolation Context is important! Guiding questions: How well can statistical models exploit wider contexts to make predictions about natural language? How well can they capture meaning? What is the role of local and wider contextual information in making predictions about different types of words in children's stories? What requirements are necessary for a language model to accurately predict syntactic function words (verbs, prepositions) vs. semantic words (named entities, nouns)? Useful for applications requiring semantic coherence (e.g. language generation, machine translation, dialogue and question-answering systems)
Model Overview The problem: Accurately predict missing words in children's stories Types of performance: Humans: predict all word types with similar accuracy, use wider context RNNs with LSTMs: very good at predicting verbs and prepositions, primarily use local contexts Memory Networks: good at predicting nouns and named entities, use both local info and wider context Importance of distinguishing the task of predicting syntactic function words from that of predicting lower-frequency semantic-content words
The Goldilocks Principle A sweet spot between word and sentence Optimal performance level is dependent on window choice and self-supervised training: We find the way in which wider context is represented in memory to be critical. If memories are encoded from a small window around important words in the context, there is an optimal size for memory representations between single words and entire sentences, that depends on the class of word to be predicted.
The Children's Book Test The basis of experiments in this paper Children's books have clear narrative structure (thus the role of context is more discernible) Designed to test the role of memory and context in language processing and understanding; directly measures how well language models can exploit wider linguistic context Books allocated to training, validation, or test sets Create example questions by numbering 21 consecutive sentences Definitions: context = first 20 sentences (denoted S) query = 21st sentence (denoted q) answer word = removed word from the sentence (denoted a) selection of 10 candidate answers (denoted C) appearing in the context sentences and query For a question-answer pair (x, a): x = (q, S, C) S = ordered list of sentences q = a sentence (an ordered list q = q₁, …, q_k) containing a missing-word symbol C = bag of unique words such that a ∈ C, cardinality |C| = 10, and every candidate word ω ∈ C satisfies ω ∈ q ∪ S
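A minimal sketch of how a question of this shape might be assembled from 21 consecutive sentences. Candidate selection is simplified here (the real CBT draws the nine distractors from words of the same type as the answer); `make_cbt_question` and its frequency-based distractor choice are illustrative assumptions, not the paper's code:

```python
from collections import Counter

def make_cbt_question(sentences, answer_pos, n_candidates=10):
    """Build a CBT-style tuple (S, q, C, a) from 21 consecutive sentences.

    sentences  : list of 21 tokenised sentences (lists of words)
    answer_pos : index of the word to remove from the 21st sentence
    """
    assert len(sentences) == 21
    S = sentences[:20]                    # context: first 20 sentences
    q = list(sentences[20])               # query: the 21st sentence
    a = q[answer_pos]                     # answer word
    q[answer_pos] = "XXXXX"               # missing-word placeholder
    # Distractors: frequent context words other than the answer
    # (the real CBT restricts them to the same word type as the answer).
    freq = Counter(w for sent in S for w in sent if w != a)
    distractors = [w for w, _ in freq.most_common(n_candidates - 1)]
    C = set(distractors) | {a}            # candidate set, answer included
    return S, q, C, a
```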
The Children's Book Test: Example
The Children's Book Test, cont. Evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs, and Prepositions Classical language modeling evaluations compute average perplexity across all words in a text, thus placing proportionally more emphasis on accurate prediction of frequent words CBT allows focused analyses of semantic content-bearing words, making it a better proxy for how well models capture meaning
Comparison to similar models Microsoft Research Sentence Completion Challenge (MSRCC) limited to a single sentence, whereas CBT has wider context CBT has nearly 10 times the number of test questions, double the number of candidates on each question, and larger training and validation sets CNN QA requires models to identify missing entities from bullet-point summaries of online news articles, so it focuses more on paraphrasing, whereas CBT focuses on making inferences and predicting from contexts it is anonymised (models can't apply knowledge that isn't apparent from the article), whereas CBT is not anonymised MCTest of machine comprehension training set consists of only 300 examples
Memory Networks One of a class of contextual models that can interpret language at a given point in text conditioned directly on both local information and explicit representation of the wider context Applying them on the CBT enables us to examine impact of various ways of encoding context on their semantic processing ability over naturally occurring language Good resources include: Weston et al., 2015b; Weston et al., 2015a; Sukhbaatar et al., 2015
Encoding memories and queries Options for storing phrases s: Lexical memory: one word per memory slot, incorporates time features to record word order Window memory: each phrase s corresponds to a window of text from the context S centred on an individual mention of a candidate c in S; memory slots are windows of words Sentential memory: each phrase s corresponds to a complete sentence of S; for the CBT, each question has 20 memories
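The three encodings differ only in how the context S is chunked into memory slots. A sketch of the window encoding, which centres a window of width 2b+1 on each mention of a candidate word (the helper name and tuple format are illustrative assumptions):

```python
def window_memories(context_sentences, candidates, b=2):
    """One memory slot per mention of a candidate word in the context:
    the 2b+1 words centred on that mention (truncated at text edges)."""
    words = [w for sent in context_sentences for w in sent]   # flatten S
    memories = []
    for i, w in enumerate(words):
        if w in candidates:
            memories.append((w, words[max(0, i - b): i + b + 1]))
    return memories
```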
End-to-end memory networks MemN2N architecture allows for direct training of Memory Networks through backpropagation Two main steps: supporting memories are retrieved an answer distribution is returned using a softmax Memory Networks can perform several hops in memory before returning answer
Self-supervision for window memories Memory supervision is not provided at training time; it is inferred automatically Train by making gradient steps using SGD; during training, the model makes a hard attention selection, taking as its supporting memory the highest-scoring window among those containing the answer word At test time, a candidate's score is the sum of the attention weights α_i of the windows it appears in Does not exploit new label information beyond the training data (hard attention over memories)
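A toy sketch of both halves of this procedure, assuming window memories are (centre-word, window) pairs, `scores` are the model's pre-softmax match scores, and `alpha` the resulting attention weights (function names are illustrative):

```python
def hard_supporting_memory(scores, memories, answer):
    """Training-time self-supervision: among the windows whose centre is
    the true answer word, pick the highest-scoring one (hard attention)."""
    idxs = [i for i, (centre, _) in enumerate(memories) if centre == answer]
    return max(idxs, key=lambda i: scores[i])

def candidate_scores(alpha, memories, candidates):
    """Test-time scoring: each candidate's score is the sum of the
    attention weights alpha_i of the windows it is the centre of."""
    totals = {c: 0.0 for c in candidates}
    for a_i, (centre, _) in zip(alpha, memories):
        if centre in totals:
            totals[centre] += a_i
    return totals
```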
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines N-Gram Language Models Supervised Embedding Models Recurrent Language Models Human Performance
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines: Implemented two simple baselines based on word frequencies. 1) Select most frequent candidate in entire training corpus, 2) For a given question, select most frequent candidate in its context. N-Gram Language Models Supervised Embedding Models Recurrent Language Models Human Performance
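The two frequency baselines can be sketched directly (a toy version; the paper's exact tie-breaking may differ):

```python
from collections import Counter

def most_frequent_in_corpus(corpus_counts, candidates):
    """Baseline 1: pick the candidate that is most frequent in the
    whole training corpus (corpus_counts maps word -> count)."""
    return max(candidates, key=lambda c: corpus_counts.get(c, 0))

def most_frequent_in_context(context_sentences, candidates):
    """Baseline 2: pick the candidate occurring most often in this
    question's own 20-sentence context."""
    counts = Counter(w for sent in context_sentences for w in sent)
    return max(candidates, key=lambda c: counts[c])
```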
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines N-Gram Language Models: Trained an n-gram language model using the KenLM toolkit (Heafield et al., 2013). Compared with a variant with cache, in which they linearly interpolated n-gram model probabilities with unigram probabilities computed on the context. Supervised Embedding Models Recurrent Language Models Human Performance
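The cache variant's interpolation is a one-liner. A sketch assuming an n-gram probability is already available; the mixing weight `lam` is an assumed hyper-parameter, not a value from the paper:

```python
def cache_interpolated_prob(p_ngram, word, context_words, lam=0.9):
    """Linearly interpolate the n-gram probability of `word` with a
    unigram 'cache' probability estimated on the question's context."""
    p_cache = context_words.count(word) / len(context_words) if context_words else 0.0
    return lam * p_ngram + (1.0 - lam) * p_cache
```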
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines N-Gram Language Models Supervised Embedding Models: Goal is to directly test how much of CBT can be resolved by word embeddings alone. Learn both input and output embedding matrices for each word in the vocabulary For a given input passage q and possible answer word w, the score is the inner product of the summed input embeddings of q and the output embedding of w Equivalent to lobotomised Memory Networks with zero hops (no attention over the memory component) Investigate different input passages: context + query, query, window (sub-sentence of the query centred around the missing word), window + position (use a different embedding matrix for encoding each position of the window). Tune window size d = 5 on validation set Recurrent Language Models Human Performance
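A toy sketch of a supervised embedding scorer of this kind: sum the input embeddings of the passage words and take an inner product with the candidate's output embedding. The dict-of-lists embeddings here are illustrative; the paper learns these matrices by training on CBT questions:

```python
def embedding_score(passage_words, candidate, in_emb, out_emb):
    """score(q, w) = <sum of input embeddings of q, output embedding of w>.
    Words missing from in_emb contribute a zero vector."""
    dim = len(next(iter(out_emb.values())))
    passage_vec = [0.0] * dim
    for w in passage_words:
        for d, v in enumerate(in_emb.get(w, [0.0] * dim)):
            passage_vec[d] += v
    return sum(p * o for p, o in zip(passage_vec, out_emb[candidate]))
```

At test time the model would simply pick the candidate in C with the highest score.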
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines N-Gram Language Models Supervised Embedding Models Recurrent Language Models: Trained RNN language models with LSTM activation units on the training stories (5.5M words), using minibatch SGD to minimise the negative log-likelihood of the next word; hyper-parameters tuned on the validation set; best model had dimension 512 Investigated 2 variants: context + query (read entire context, then query), query (read only query). All models have access to query words after the missing word. Contextual LSTM learns a convolutional attention over windows of the context (given the objective of predicting all words in the query); window size tuned on validation set Human Performance
Applying different types of language modeling and machine reading architectures to the CBT Non-Learning Baselines N-Gram Language Models Supervised Embedding Models Recurrent Language Models Human Performance: 15 native English speakers attempt a randomly selected 10% from each question type of the CBT, either with query only or with query + context. Obtained 2000 answers total.
Results: Modeling syntactic flow Model performance strongly depends on the type of word to be predicted Conventional language models are good at verb/preposition prediction, less good at predicting nouns/named entities Interesting results: RNNs with LSTMs slightly better than n-gram models (except for named entities, when a cache is used) LSTM models better than humans at preposition prediction (suggests that sometimes several candidate prepositions are "correct") When only the query is available, LSTMs and n-gram models better than humans at verb prediction (possibly because models are better attuned to the verb distribution in children's books)
Results: Capturing semantic coherence Best-performing Memory Networks predict nouns and named entities more accurately than conventional language models Rely on access to wider context LSTMs do not effectively exploit context when it is provided Performance level on nouns and named entities doesn't change Embedding Model (query), which is equivalent to a Memory Network without contextual memory, performs poorly
Results: Getting memory representations just right Trained Memory Networks exploited context to different degrees when predicting nouns and named entities. For example: With each sentence of S stored as an ordered sequence of word embeddings (sentential memory + PE), performance is poor Lexical memory is effective for verbs and prepositions, less so for nouns and named entities Window memories centred around candidate words are most useful for noun and named-entity prediction
Results: Self-supervised memory retrieval Window-based Memory Network with self-supervision (hard attention selection made among window memories during training) outperforms all others at predicting nouns and named entities Simple window-based strategy for question representation Two examples:
Testing on the CNN QA task How well do conclusions generalize? Test best-performing Memory Network on CNN QA task (Hermann et al., 2015) CNN QA dataset: 93000 news articles, each coupled with a question derived from bullet point summary accompanying the article, and a single-word answer which is always a named entity Memory Network, with self-supervision added and removal of named entities appearing in bullet point summary from candidate list, results in best performance
Conclusions CBT is a new semantic language modeling benchmark Memories that encode sub-sentential chunks (windows) of informative text seem to be most useful to neural nets when interpreting and modeling language Most useful text chunk size depends on modeling task (semantic content vs. syntactic function words) Memory Networks that adhere to this principle can be efficiently trained using simple self-supervision to surpass all other methods for predicting named entities on both CBT and CNN QA benchmark
Works Cited Hill, F., Bordes, A., Chopra, S., & Weston, J. (2016). The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. In Proceedings of ICLR, 2016. URL https://arxiv.org/pdf/1511.02301.pdf. Sukhbaatar, S., Szlam, A., Weston, J., & Fergus, R. (2015). End-to-end memory networks. In Proceedings of NIPS, 2015. Weston, J., Bordes, A., Chopra, S., & Mikolov, T. (2015a). Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698. Weston, J., Chopra, S., & Bordes, A. (2015b). Memory networks. In Proceedings of ICLR, 2015.