Investigating how well language models capture meaning in children's books

Investigating how well language models capture meaning in children's books
Caitlin Hult, Department of Mathematics, UNC Chapel Hill
Deep Learning Journal Club, 11/30/16

Paper to discuss: The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations
Authors: Felix Hill*, Antoine Bordes, Sumit Chopra, Jason Weston (Facebook AI Research; University of Cambridge)
https://arxiv.org/pdf/1511.02301.pdf

Outline of Talk
- Motivation
- The Children's Book Test (CBT)
- Memory representation with Memory Networks
- Comparison with other models
- Conclusions

Motivation
Humans do not interpret language in isolation: context is important!
Guiding questions:
- How well can statistical models exploit wider contexts to make predictions about natural language? How well can they capture meaning?
- What is the role of local and wider contextual information in making predictions about different types of words in children's stories?
- What requirements are necessary for a language model to accurately predict syntactic function words (verbs, prepositions) vs. semantic content words (named entities, nouns)?
This is useful for applications requiring semantic coherence (e.g. language generation, machine translation, dialogue, and question-answering systems).

Model Overview
The problem: accurately predict missing words in children's stories.
Types of performance:
- Humans: predict all word types with similar accuracy; use wider context
- RNNs with LSTMs: very good at predicting verbs and prepositions; primarily use local contexts
- Memory Networks: good at predicting nouns and named entities; use both local information and wider context
This motivates distinguishing the task of predicting syntactic function words from that of predicting lower-frequency semantic content words.

The Goldilocks Principle
A sweet spot between word and sentence: the optimal performance level depends on window choice and self-supervised training.
"We find the way in which wider context is represented in memory to be critical. If memories are encoded from a small window around important words in the context, there is an optimal size for memory representations between single words and entire sentences, that depends on the class of word to be predicted."

The Children's Book Test
- The basis of the experiments in this paper
- Children's books have a clear narrative structure, so the role of context is more discernible
- Designed to test the role of memory and context in language processing and understanding; directly measures how well language models can exploit wider linguistic context
- Books are allocated to training, validation, or test sets
- Example questions are created from 21 consecutive sentences
Definitions:
- context = the first 20 sentences (denoted S)
- query = the 21st sentence (denoted q)
- answer word = the word removed from the query (denoted a)
- candidates = a selection of 10 candidate answers (denoted C) appearing in the context sentences and the query
For a question-answer pair (x, a): x = (q, S, C), where S is the ordered list of context sentences, q = (q1, ..., qk) is the query sentence containing a missing-word symbol, and C is a bag of unique words such that a ∈ C, |C| = 10, and every candidate word w ∈ C appears in q ∪ S. (A construction sketch follows.)
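
A minimal sketch of how one CBT-style question could be built from 21 consecutive sentences, assuming the book is already sentence- and word-tokenized and that `pick_answer_word` selects a word of the desired class (named entity, noun, verb, or preposition); both helpers are illustrative, not the authors' code.

    import random

    def make_cbt_question(sentences, pick_answer_word, n_candidates=10):
        """Build one (S, q, C, a) example from 21 consecutive tokenized sentences."""
        assert len(sentences) == 21
        context = sentences[:20]                      # S: first 20 sentences
        query_tokens = list(sentences[20])            # q: the 21st sentence
        answer = pick_answer_word(query_tokens)       # a: word removed from q
        idx = query_tokens.index(answer)
        query = query_tokens[:idx] + ["XXXXX"] + query_tokens[idx + 1:]
        # C: the answer plus distractors drawn from words appearing in S or q
        pool = {w for sent in sentences for w in sent}
        pool.discard(answer)
        distractors = random.sample(sorted(pool), n_candidates - 1)
        candidates = [answer] + distractors
        random.shuffle(candidates)
        return {"S": context, "q": query, "C": candidates, "a": answer}

Note: the real CBT draws distractors of the same word class as the answer; this sketch skips that filter.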

The Children's Book Test: Example

The Children's Book Test, cont.
- Four classes of question are evaluated, created by removing distinct types of word: Named Entities, (Common) Nouns, Verbs, and Prepositions
- Classical language modeling evaluations compute average perplexity across all words in a text, thus placing proportionally more emphasis on accurate prediction of frequent words
- The CBT allows focused analyses on semantic content-bearing words, making it a better proxy

Comparison to similar models
- Microsoft Research Sentence Completion Challenge (MSRCC): limited to a single sentence, whereas the CBT has wider context; the CBT also has nearly 10 times the number of test questions, double the number of candidates per question, and larger training and validation sets
- CNN QA: requires models to identify missing entities from bullet-point summaries of online news articles, so the focus is more on paraphrasing, whereas the CBT focuses on making inferences and predicting from context; CNN QA is anonymised (models can't apply knowledge that isn't apparent from the article), whereas the CBT is not anonymised
- MCTest of machine comprehension: the training set consists of only 300 examples

Memory Networks
- One of a class of contextual models that can interpret language at a given point in a text conditioned directly on both local information and an explicit representation of the wider context
- Applying them to the CBT lets us examine the impact of various ways of encoding context on their semantic processing ability over naturally occurring language
- Good resources include: Weston et al., 2015b; Weston et al., 2015a; Sukhbaatar et al., 2015

Encoding memories and queries
Options for storing phrases s:
- Lexical memory: one word per memory slot; incorporates time features to record word order
- Window memory: each phrase s corresponds to a window of text from the context S centered on an individual mention of a candidate c in S; memory slots are windows of words
- Sentential memory: phrases s correspond to complete sentences of S; for the CBT, each question has 20 memories
(A small sketch of the three encodings follows.)
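
A minimal sketch of the three memory encodings for one CBT context, assuming the context is a list of tokenized sentences and `candidates` is the candidate set C; the window half-width b and the flattened token list are illustrative choices, not the paper's exact preprocessing.

    def lexical_memory(context):
        # One word per slot, in order (time features omitted here).
        return [w for sent in context for w in sent]

    def sentential_memory(context):
        # One complete sentence per slot: 20 memories per CBT question.
        return [list(sent) for sent in context]

    def window_memory(context, candidates, b=2):
        # One slot per mention of a candidate, holding a window of 2*b+1 words
        # centered on that mention.
        tokens = [w for sent in context for w in sent]
        windows = []
        for i, w in enumerate(tokens):
            if w in candidates:
                lo, hi = max(0, i - b), min(len(tokens), i + b + 1)
                windows.append(tokens[lo:hi])
        return windows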

End-to-end memory networks
- The MemN2N architecture allows direct training of Memory Networks through backpropagation
- Two main steps: (1) supporting memories are retrieved, (2) an answer distribution is returned using a softmax
- Memory Networks can perform several hops in memory before returning an answer
(A single-hop sketch follows.)
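
A minimal single-hop sketch of a MemN2N-style read step with numpy, assuming bag-of-words embedding of memories and query and a candidate-restricted softmax at the output; the matrix names (A, B, U) and the bag-of-embeddings encoder are illustrative, not the paper's exact parameterization.

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def memn2n_hop(query_ids, memory_ids, candidate_ids, A, B, U):
        """query_ids: list[int]; memory_ids: list[list[int]]; candidate_ids: list[int].
        A, B: (vocab, dim) embedding matrices; U: (dim, dim) output map."""
        q = B[query_ids].sum(axis=0)                          # embed the query
        M = np.stack([A[m].sum(axis=0) for m in memory_ids])  # embed each memory slot
        alpha = softmax(M @ q)                                # attention over memories
        o = alpha @ M                                         # retrieved context vector
        scores = A[candidate_ids] @ (U @ (o + q))             # score each candidate
        return softmax(scores), alpha                         # answer distribution, attention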

Self-supervision for window memories
- Memory supervision (which memory to attend to) is not provided at training time; it is inferred automatically
- Training makes gradient steps using SGD; the model selects its single most relevant window memory via hard attention
- At test time, a candidate's score is the sum of the attention weights α_i of the windows in which it appears
- No label information beyond the training data is exploited (hard attention over memories)
(A sketch of the test-time scoring rule follows.)
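
A minimal sketch of the test-time scoring rule described above, reusing the attention weights from the single-hop sketch: each candidate's score is the sum of α_i over the window memories in which it appears (the hard-attention training step itself is omitted).

    def score_candidates(alpha, window_memories, candidates):
        """alpha: attention weight per window; window_memories: list of token windows."""
        scores = {c: 0.0 for c in candidates}
        for a_i, window in zip(alpha, window_memories):
            for c in candidates:
                if c in window:
                    scores[c] += a_i
        return max(scores, key=scores.get)   # predicted answer word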

Applying different types of language modeling and machine reading architectures to the CBT:
- Non-Learning Baselines
- N-Gram Language Models
- Supervised Embedding Models
- Recurrent Language Models
- Human Performance

Applying different types of language modeling and machine reading architectures to the CBT
- Non-Learning Baselines: two simple baselines based on word frequencies were implemented: 1) select the most frequent candidate in the entire training corpus, 2) for a given question, select the most frequent candidate in its context (see the sketch below)
- N-Gram Language Models
- Supervised Embedding Models
- Recurrent Language Models
- Human Performance
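
A minimal sketch of the two frequency baselines, assuming `train_tokens` is the flat list of training-corpus tokens and each question provides its context sentences and candidate list; the function names are illustrative.

    from collections import Counter

    def corpus_frequency_baseline(train_tokens, candidates):
        # 1) Most frequent candidate in the entire training corpus.
        counts = Counter(train_tokens)
        return max(candidates, key=lambda c: counts[c])

    def context_frequency_baseline(context_sentences, candidates):
        # 2) Most frequent candidate within the question's own context.
        counts = Counter(w for sent in context_sentences for w in sent)
        return max(candidates, key=lambda c: counts[c])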

Applying different types of language modeling and machine reading architectures to the CBT
- Non-Learning Baselines
- N-Gram Language Models: an n-gram language model was trained using the KenLM toolkit (Heafield et al., 2013), and compared with a variant with cache, in which n-gram model probabilities are linearly interpolated with unigram probabilities computed on the context (see the sketch below)
- Supervised Embedding Models
- Recurrent Language Models
- Human Performance
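
A minimal sketch of the "with cache" variant: linear interpolation of the n-gram probability with a unigram distribution estimated from the question's context. The mixing weight `lam` and the `ngram_prob` callable (e.g. a wrapper around a trained n-gram model's conditional probability) are assumptions for illustration.

    from collections import Counter

    def cached_prob(word, history, context_tokens, ngram_prob, lam=0.9):
        """P(word | history) = lam * P_ngram + (1 - lam) * P_unigram(context cache)."""
        counts = Counter(context_tokens)
        p_cache = counts[word] / max(1, len(context_tokens))
        return lam * ngram_prob(word, history) + (1.0 - lam) * p_cache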

Applying different types of language modeling and machine reading architectures to the CBT
- Non-Learning Baselines
- N-Gram Language Models
- Supervised Embedding Models: the goal is to directly test how much of the CBT can be resolved by word embeddings. Both input and output embedding matrices are learned for each word in the vocabulary; for a given input passage q and possible answer word w, the score is computed by matching the embedded passage against the output embedding of w (see the sketch below). These are "lobotomised" Memory Networks with zero hops (no attention over the memory component). Different input passages are investigated: context + query, query, window (a sub-sentence of the query centered on the missing word), and window + position (a different embedding matrix for each window position). The window size d = 5 is tuned on the validation set.
- Recurrent Language Models
- Human Performance
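
A minimal sketch of a zero-hop embedding scorer of this kind, assuming the passage is encoded as the sum of its input embeddings and each candidate is scored by a dot product with its output embedding; the paper's exact scoring function may differ in detail.

    import numpy as np

    def embedding_scores(passage_ids, candidate_ids, E_in, E_out):
        """E_in, E_out: (vocab, dim) input/output embedding matrices."""
        p = E_in[passage_ids].sum(axis=0)      # bag-of-embeddings passage vector
        return E_out[candidate_ids] @ p        # one score per candidate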

Applying different types of language modeling and machine reading architectures to the CBT
- Non-Learning Baselines
- N-Gram Language Models
- Supervised Embedding Models
- Recurrent Language Models: RNN language models with LSTM activation units were trained on the training stories (5.5M words), using minibatch SGD to minimize the negative log-likelihood of the next word; hyper-parameters were tuned on the validation set, and the best model had dimension 512. Two variants were investigated: context + query (read the entire context, then the query) and query (read only the query); all models have access to the query words after the missing word. A Contextual LSTM learns a convolutional attention over windows of the context (given the objective of predicting all words in the query), with the window size tuned on the validation set.
- Human Performance
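
A minimal sketch of one way a trained language model can answer a CBT question: substitute each candidate for the missing-word symbol and rank candidates by the model's log-probability of the completed query. The `query_log_prob` callable (summed log-probability of a token sequence under the trained LSTM) is an assumption for illustration; the slide does not spell out the exact scoring rule.

    def rank_candidates(query, candidates, query_log_prob, blank="XXXXX"):
        """query: token list containing the blank symbol; returns candidates best-first."""
        def filled(c):
            return [c if t == blank else t for t in query]
        return sorted(candidates, key=lambda c: query_log_prob(filled(c)), reverse=True)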

Applying different types of language modeling and machine reading architectures to the CBT
- Non-Learning Baselines
- N-Gram Language Models
- Supervised Embedding Models
- Recurrent Language Models
- Human Performance: 15 native English speakers attempted a randomly selected 10% of the questions from each question type of the CBT, either with the query only or with the query + context; 2,000 answers were obtained in total.

Results: Modeling syntactic flow
- Model performance strongly depends on the type of word to be predicted
- Conventional language models are good at verb/preposition prediction, less good at predicting nouns/named entities
Interesting results:
- RNNs with LSTMs are slightly better than n-gram models (except for named entities, when a cache is used)
- LSTM models are better than humans at preposition prediction (suggesting that sometimes several candidate prepositions are "correct")
- When only the query is available, LSTMs and n-gram models are better than humans at verb prediction (possibly because the models are better attuned to the verb distribution in children's books)

Results: Capturing semantic coherence
- The best-performing Memory Networks predict nouns and named entities more accurately than conventional language models; they rely on access to the wider context
- LSTMs do not effectively exploit the context when it is provided: their performance on nouns and named entities does not change
- The Embedding Model (query), which is equivalent to a memory network without contextual memory, performs poorly

Results: Getting memory representations just right
Trained memory networks exploit the context to different degrees when predicting nouns and named entities. For example:
- When each sentence in S is stored as an ordered sequence of word embeddings (sentential memory + PE), performance is poor
- Lexical memory is effective for verbs and prepositions, less so for nouns and named entities
- Window memories centered on candidate words are more useful for noun and named-entity prediction

Results: Self-supervised memory retrieval
- A window-based Memory Network with self-supervision (hard attention selection among window memories during training) outperforms all other models at predicting nouns and named entities
- It uses a simple window-based strategy for question representation
- Two examples:

Testing on the CNN QA task
- How well do the conclusions generalize? The best-performing Memory Network is tested on the CNN QA task (Hermann et al., 2015)
- CNN QA dataset: 93,000 news articles, each coupled with a question derived from the bullet-point summary accompanying the article and a single-word answer that is always a named entity
- A Memory Network with self-supervision added, and with named entities appearing in the bullet-point summary removed from the candidate list, gives the best performance

Conclusions
- The CBT is a new semantic language modeling benchmark
- Memories that encode sub-sentential chunks (windows) of informative text seem to be most useful to neural nets when interpreting and modeling language
- The most useful text chunk size depends on the modeling task (semantic content words vs. syntactic function words)
- Memory Networks that adhere to this principle can be efficiently trained using simple self-supervision to surpass all other methods at predicting named entities on both the CBT and the CNN QA benchmark

Works Cited
Hill, F., Bordes, A., Chopra, S., & Weston, J. (2016). The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations. Facebook AI Research. https://arxiv.org/pdf/1511.02301.pdf
Sukhbaatar, S., Szlam, A., Weston, J., & Fergus, R. (2015). End-to-End Memory Networks. Proceedings of NIPS, 2015.
Weston, J., Bordes, A., Chopra, S., & Mikolov, T. (2015a). Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. arXiv preprint arXiv:1502.05698.
Weston, J., Chopra, S., & Bordes, A. (2015b). Memory Networks. Proceedings of ICLR, 2015.