Predictive power of word surprisal for reading times is a linear function of language model quality

Adam Goodkind & Klinton Bicknell
Northwestern University
Cognitive Modeling & Computational Linguistics Workshop

PROBABILITY IN CONTEXT
"Don't touch the wet ___": paint / cement / bed (Wlotko & Federmeier, 2012)

MOTIVATION: HOW WE USE PROBABILITY IN CONTEXT
Studies of human sentence processing have shown that a word's probability in context is strongly related to processing difficulty, as seen in ERP responses (Wlotko & Federmeier, 2012) and reading times (Hale, 2001).
Do better estimates of word probability improve processing predictions?

SURPRISAL AND SURPRISAL THEORY
- From information theory (Shannon, 1948), a theory of communication
- The information content of a word: surprisal(w) = -log p(w | context)
- More information is more difficult to process: difficulty (the cognitive cost of processing a word) reflects how predictable the word is in a given context (Hale, 2001; Levy, 2008)
- Prior studies (e.g. Demberg & Keller, 2008) found that surprisal can predict reading times
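
As a minimal illustration of the formula above (base-2 logs give bits; the choice of base only rescales surprisal):

```python
import math

def surprisal(p: float) -> float:
    """Information content of a word with in-context probability p,
    in bits: -log2(p)."""
    return -math.log2(p)

print(surprisal(0.5))    # a coin-flip word: 1 bit
print(surprisal(0.001))  # a very unexpected word: ~10 bits
```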

LANGUAGE MODELS: CALCULATING WORD PROBABILITIES
- Cloze task (Taylor, 1953): count people's responses when filling in a missing word. Inaccurate and labor-intensive, hence the need for computational models.
- Language models: probability distributions over sequences of words. Good language models assign higher probability to word strings that occur more often.
- The quality (linguistic accuracy) of a language model is quantified as perplexity: lower is better.
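
A minimal sketch of how perplexity follows from the per-word probabilities a model assigns to held-out text; it is the exponentiated average negative log probability:

```python
import math

def perplexity(word_probs):
    """Perplexity of a model on held-out text, given the model's
    in-context probability for each word. Lower is better."""
    avg_neg_logprob = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(avg_neg_logprob)

# The geometric mean of these probabilities is 0.1, so perplexity is 1/0.1
print(perplexity([0.2, 0.1, 0.05]))  # ~10.0
```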

MANY TYPES OF LANGUAGE MODELS: DIFFERENT BUILDING BLOCKS
- n-grams (fixed sequence length): bigrams, trigrams, 4-grams, etc.; a bigram model, for example, uses p(w_n | w_{n-1}). Fixed dependency length.
- Neural network: word probabilities use dependencies spanning arbitrary distances (numbers of words); usually Long Short-Term Memory (LSTM) networks. Variable dependency length.
- Interpolated: combine multiple models.
Recent neural network-based language models have significantly improved linguistic accuracy.
[Figure: model classes compared with prior work: n-grams, NN, interpolated]
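
For concreteness, a toy maximum-likelihood bigram estimator; real n-gram models like the ones used here add smoothing (e.g. Kneser-Ney) so that unseen sequences do not receive zero probability:

```python
from collections import Counter

def bigram_probs(tokens):
    """Unsmoothed maximum-likelihood estimates of p(w_n | w_{n-1})."""
    context_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}

probs = bigram_probs("the cat sat on the mat".split())
print(probs[("the", "cat")])  # 0.5: "the" is followed by "cat" half the time
```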

DEFINING ACCURACY
- Linguistic accuracy: how well a language model predicts unseen language. Measured by perplexity.
- Psychological accuracy: how well a language model predicts psychological phenomena, e.g. eye gaze durations or ERP response amplitudes.

OUR STUDY
1. Build a range of different types of language models; different language models produce different estimates of surprisal.
2. Construct a regression model predicting gaze duration in an eye-tracking corpus from each language model's surprisal estimates.
3. Compare the regression models' quality of predictions for the gaze durations.
4. Understand the relationship between language model quality and predictions of processing difficulty.

METHODS: CREATING THE LANGUAGE MODELS
- Corpus: Google One Billion Word Benchmark ("1b"), collected from international English news services; ~900 million words, 800,000-word vocabulary.
- n-gram models created with KenLM, using Kneser-Ney smoothing.
- Neural network model created from Google's pre-trained models: Long Short-Term Memory (LSTM) units in a Recurrent Neural Network (RNN).
- Interpolated models created by mixing LSTM and 5-gram estimates.
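
A sketch of the interpolation step, assuming a simple linear mixture; the slides do not give the mixture weights, so lam here is a hypothetical value that would be tuned on held-out data:

```python
def interpolated_prob(p_lstm: float, p_5gram: float, lam: float = 0.5) -> float:
    """Linearly mix LSTM and 5-gram probability estimates for the same
    word in the same context; lam is an illustrative mixture weight."""
    return lam * p_lstm + (1.0 - lam) * p_5gram

print(interpolated_prob(0.08, 0.02))  # 0.05
```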

OUR LANGUAGE MODELS
[Figure: the language models built for this study, by class: n-grams, NN, interpolated]

METHODS: EYE-TRACKING DATA
- Dundee Corpus: 61,000 tokens from a British newspaper, read by 10 participants; ~300,000 total tokens, 37,000-word vocabulary.
- Extracted gaze durations: how long a word was fixated during first-pass reading.
- Exclusions: words that were not fixated, words at the beginning/end of a line, and others.
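
A rough pandas sketch of the extraction and exclusion logic, with hypothetical file and column names (the Dundee Corpus does not ship in this exact format):

```python
import pandas as pd

# Hypothetical fixation table: one row per fixation
fix = pd.read_csv("dundee_fixations.csv")

# First-pass gaze duration: total fixation time on a word before the eyes
# first leave it. Unfixated words are absent by construction; also drop
# words at the beginning or end of a line.
valid = fix[fix["first_pass"] & ~fix["line_edge"]]
gaze = valid.groupby(["subject", "word_id"])["fix_dur_ms"].sum()
```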

METHODS: PREDICTIVE REGRESSION MODELS
- Generalized Additive Models (GAMs): a type of regression model that allows for non-linear effects.
- Predictors of interest: surprisal of the current and previous words.

METHODS: PREDICTIVE REGRESSION MODELS
We used Generalized Additive Mixed Models (GAMMs) to predict eye gaze duration given:
- Surprisal of the current and previous word
- Non-linear effects of control covariates: the interaction of word frequency and length, sequential word number, and whether the prior word was fixated
- Random intercepts for each subject
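
One way to write that model down, as a sketch (the exact specification is not given on the slide; f_k are smooth functions, i indexes words, j indexes subjects):

```latex
\mathrm{gaze}_{ij} = \alpha + b_j
  + f_1(\mathrm{surprisal}_i) + f_2(\mathrm{surprisal}_{i-1})
  + f_3(\mathrm{freq}_i, \mathrm{length}_i)
  + f_4(\mathrm{wordnum}_i)
  + \beta \, \mathrm{prevfixated}_i
  + \varepsilon_{ij},
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2)
```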

METHODS: PREDICTIVE REGRESSION MODELS
Linear versus non-linear GAMMs:
- The first set of experiments forced surprisal to be a linear predictor.
- The second set of experiments allowed surprisal to make non-linear predictions.
- Other predictors remained non-linear in both.
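
A minimal sketch of this linear-versus-smooth contrast using the Python pygam library on synthetic data (an assumption: the original analysis was presumably fit with R's mgcv, and pygam has no true random intercepts):

```python
import numpy as np
from pygam import LinearGAM, l, s  # pip install pygam

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))  # columns: surprisal_t, surprisal_{t-1}, a control
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + np.sin(X[:, 2]) + rng.normal(size=n)

# First analysis: surprisal terms forced to be linear; control stays smooth
linear_gam = LinearGAM(l(0) + l(1) + s(2)).fit(X, y)
# Second analysis: surprisal terms also free to be non-linear
smooth_gam = LinearGAM(s(0) + s(1) + s(2)).fit(X, y)

print(linear_gam.statistics_["loglikelihood"])
print(smooth_gam.statistics_["loglikelihood"])
```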

METHODS: PSYCHOLOGICAL ACCURACY
We measured the improvement in predictions contributed by each language model:

    ΔLogLik(model_m) = LogLik(model_m) - LogLik(baseline_model)

- LogLik (log likelihood): a measure of regression model accuracy.
- model_m: includes language model m's surprisal as a predictor.
- baseline_model: omits the predictor of interest (surprisal); includes only the control covariates.
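
A sketch of the comparison under a Gaussian-residual assumption (the actual log likelihoods come from the fitted GAMMs; this just makes the ΔLogLik arithmetic concrete):

```python
import numpy as np

def gaussian_loglik(y, y_hat):
    """Log likelihood of observations y under a Gaussian regression
    with fitted values y_hat and MLE residual variance."""
    resid = np.asarray(y) - np.asarray(y_hat)
    sigma2 = np.mean(resid ** 2)
    n = len(resid)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def delta_loglik(y, y_hat_model, y_hat_baseline):
    """Improvement from adding a language model's surprisal as a predictor."""
    return gaussian_loglik(y, y_hat_model) - gaussian_loglik(y, y_hat_baseline)
```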

RESULTS: RELATIONSHIP BETWEEN LINGUISTIC AND PSYCHOLOGICAL ACCURACY
Using a linear regression model, we investigate the relationship between language models and their psychological predictions: what is the relationship between linguistic accuracy (perplexity) and psychological prediction quality (ΔLogLik)?

RESULTS: RELATIONSHIP BETWEEN LINGUISTIC AND PSYCHOLOGICAL ACCURACY
With linear GAMMs: as the perplexity of a language model improves, the model makes more accurate predictions of reading times. This relationship holds across model types.

RESULTS: MAGNITUDE OF EFFECT
- As language models continue to improve and make better predictions, does the magnitude (effect size) of surprisal change?
- Do better language models put more weight on the surprisal of the current and previous words?
- We can compare the surprisal coefficients from each model to gauge the magnitude of the effect.

RESULTS: MAGNITUDE OF EFFECT
The magnitude of the effect does not correlate with linguistic accuracy: the effect size of surprisal does not appear to be biased for worse language models.
[Figure: surprisal coefficients by language model quality, for the current and previous word]

RESULTS: SHAPE OF EFFECT
- Smith & Levy (2013) examined the shape of the effect of surprisal and found a linear relationship.
- This supports various derivations of surprisal theory (e.g., Hale, 2001; Levy, 2008; Bicknell & Levy, 2009; Smith & Levy, 2013), contra alternative probabilistic processing theories (e.g., Narayanan & Jurafsky, 2004; theories predicting UID optimality).
- Does this linear relationship hold for more sophisticated models, if we allow surprisal to be non-linear?

RESULTS: SHAPE OF EFFECT
For the probability of both the current and the previous word, gaze duration changes at a linear rate for all models, and possibly becomes even more linear as language model accuracy improves.
[Figure: estimated shapes of the surprisal effect for the current and previous word]

RESULTS: RELATIONSHIP BETWEEN LINGUISTIC AND PSYCHOLOGICAL ACCURACY (PART II)
With non-linear GAMMs: if we allow for non-linear effects, not only do the models' predictions improve, but the relationship between linguistic and psychological accuracy becomes more linear.

TAKEAWAYS
- There is a strong relationship between a language model's linguistic quality and its psychological predictive power.
- No language model class is privileged: better perplexity improves psychological predictions regardless of model type.
- The size of the surprisal effect was consistent across models: estimates of surprisal's effect size from worse language models appear to be relatively unbiased.
- The effect of surprisal is linear across all models and distributions of word probabilities, supporting surprisal-theoretic processing models even with state-of-the-art language models.

THANK YOU!
Funding sources:
Adam Goodkind, a.goodkind@u.northwestern.edu