Computational Cognitive Science


Lecture 14: Syntactic Surprisal
Chris Lucas (slides adapted from Frank Keller's)
School of Informatics, University of Edinburgh
clucas2@inf.ed.ac.uk
9 November 2017

1 Background
  Expectations in Sentence Processing
  Beyond Ambiguity
2 The Surprisal Model
  Computing Surprisal
  Results

Reading: Hale (2001).

Garden Paths and Odds

Last time, we saw the model of human parsing proposed by Jurafsky (1996):
- all syntactic trees for a sentence are computed in parallel and assigned probabilities;
- the probabilities are used to rank the set of trees;
- low-probability trees are pruned (no longer considered);
- the set of trees (and their probabilities) is updated incrementally (word by word) as the input comes in;
- if the tree that ultimately turns out to be correct has been pruned, a garden path occurs.

Garden Paths and Odds

Example garden path sentence:

(1) The horse raced past the barn fell.

First parse tree, with "raced" as the main verb: P(race, <agent>) = 0.92

t_1 = [S [NP the horse] [VP raced ...]]
P(t_1) = 0.92 (preferred)

Garden Paths and Odds

Second parse tree, with "raced" heading a reduced relative clause: P(race, <agent, theme>) = 0.08, and the rule NP -> NP XP has probability 0.14:

t_2 = [S [NP [NP the horse] [XP raced ...]] ...]
P(t_2) = 0.08 × 0.14 = 0.0112 (grossly dispreferred)

Garden Paths and Expectation

In the Jurafsky model, processing difficulty is predicted by the ratio of the probability of the best (preferred) parse to that of the ultimately correct parse. For example, in (1):

P(t_1) / P(t_2) = 0.92 / 0.0112 ≈ 82 : 1

Intuitively, a high ratio means that the parser has a strong expectation about the correct structure. But maybe this expectation should be based not only on two trees (the most probable one and the correct one) but on all trees of the sentence.

Syntactic surprisal: processing difficulty (including garden paths) occurs when the probability distribution over parse trees changes.

Expectations in Sentence Processing

There is evidence that expectation plays a role in sentence processing. For instance, when the parser sees an "either", it expects an "or" (Staub & Clifton, 2006):

(2) Peter read either a book or an essay in the school magazine.
(3) Peter read a book or an essay in the school magazine.

The region "an essay" is read faster in (2) than in (3). The parser is surprised to see an "or" if it doesn't expect it, i.e., if there is no "either". Surprisal leads to processing difficulty.

Expectations in Sentence Processing

Intuitively, this is also what is going on in garden paths:

(4) a. The horse raced past the barn fell.
    b. The bird found in the room died.

In (4-a), the parser is surprised when it gets "fell", as it expected the sentence to end at "barn". In (4-b), the surprisal at "died" is lower.

(5) a. The complex houses married and single students.
    b. The warehouse fires a dozen employees each year.

In (5-a), the parser is surprised when it gets "married", as it expected a verb. In (5-b), it assumes "fires" is a verb, and is not surprised.

Beyond Ambiguity

Ambiguity resolution (and garden paths) is not the only thing we want to model in sentence processing. Some sentences cause difficulty even though they are not ambiguous:

(6) a. The reporter that attacked the senator admitted the error.
    b. The reporter that the senator attacked admitted the error.

The object relative clause (ORC) in (6-b) is more difficult to process than the subject relative clause (SRC) in (6-a). To be modeled: reading time differences on the relative clause verb and noun phrase (Staub, 2010).

Beyond Ambiguity

[Figure: empirical data from Staub (2010), Expt 1 — reading times in ms (0-600) for the SRC and ORC conditions at each word region (rel_pron, src_vb, det, noun, orc_vb, main_vb), with significant SRC/ORC differences (***) at three regions.]

The Surprisal Model

The surprisal model (Hale, 2001) assumes an incremental, parallel, probabilistic parser:
- all syntactic trees for a sentence prefix w_1 ... w_k are computed at the same time, and assigned probabilities;
- the set of trees is updated as a new word w_{k+1} comes in; trees no longer compatible with the input are removed;
- surprisal measures the change in the probability distribution as trees are removed (disconfirmed) when w_{k+1} is processed;
- if w_{k+1} disconfirms trees with a large probability mass (high surprisal), then processing difficulty occurs.

The Surprisal Model

Surprisal is defined in terms of P(T | w_1 ... w_k), the probability distribution over trees T given a sentence prefix w_1 ... w_k, but comes down to:

S_{k+1} = -log P(w_{k+1} | w_1 ... w_k)

We've already seen surprisal in the Frank et al. model (object individuation).

The Surprisal Model

Levy argues that a good measure of belief change is the Kullback-Leibler divergence (relative entropy) between the syntactic expectations before and after seeing the new word. The KL divergence between two distributions P and Q is:

D(P || Q) = Σ_i P(i) log [P(i) / Q(i)]   (1)

For a mathematical argument that surprisal is equivalent to KL divergence, see Levy (2008).

The Surprisal Model

The KL divergence at word w_{k+1} is:

Σ_T P(T | w_1 ... w_{k+1}) log [P(T | w_1 ... w_{k+1}) / P(T | w_1 ... w_k)]   (2)

This captures the difference in beliefs before and after seeing w_{k+1}.
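The belief-change computation in (2) can be sketched in a few lines of Python. The distributions below are invented for illustration: three candidate trees, one of which (t3) is disconfirmed by the new word, so its mass is renormalised over the survivors.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_i P(i) * log2(P(i) / Q(i)).

    p and q map outcomes (here: parse trees) to probabilities;
    outcomes with P(i) = 0 contribute nothing to the sum.
    """
    return sum(p_i * math.log2(p_i / q[i]) for i, p_i in p.items() if p_i > 0)

# Hypothetical distributions over three candidate trees, before and
# after seeing the next word: t3 is disconfirmed, and its probability
# mass (0.1) is redistributed over the surviving trees t1 and t2.
before = {"t1": 0.6, "t2": 0.3, "t3": 0.1}
after = {"t1": 0.6 / 0.9, "t2": 0.3 / 0.9, "t3": 0.0}

belief_change = kl_divergence(after, before)
```

Here belief_change works out to -log2(0.9), the negative log of the probability mass that survives the new word, i.e. exactly the surprisal of that word — a small numerical illustration of the equivalence Levy (2008) proves.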

Computing Surprisal

One advantage of using surprisal rather than computing the KL divergence directly is that surprisal doesn't depend on which representation we use; we just need to compute P(w_{k+1} | w_1 ... w_k). We could use:
- an incremental parser, which computes probabilities over trees;
- an n-gram model, which computes probabilities over sequences of words;
- intermediate cases, such as a model which computes probabilities over part-of-speech sequences (a tagger);
- a recurrent neural network.

However, when modeling human language processing, we are interested in the cognitive process that leads to surprisal (e.g., an n-gram model doesn't tell us much about that).
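As a concrete (if cognitively uninformative) instance of the second option, here is a toy bigram model; the two-sentence corpus and the queried bigram are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (invented); a real model would be trained on far more data.
corpus = ("peter read either a book or an essay . "
          "peter read a book and an essay .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_surprisal(prev, word):
    """-log2 P(word | prev), with P estimated by relative frequency."""
    return -math.log2(bigrams[(prev, word)] / unigrams[prev])

# "or" follows "book" in half of its occurrences in this corpus,
# so its surprisal is -log2(1/2) = 1 bit.
s = bigram_surprisal("book", "or")
```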

Computing Surprisal

The prefix probability P(w_1 ... w_k) can be obtained from a parser by summing over all trees compatible with the prefix:

P(w_1 ... w_k) = Σ_T P(T, w_1 ... w_k)   (3)

We can now formulate surprisal in terms of prefix probabilities:

S_{k+1} = -log [P(w_1 ... w_{k+1}) / P(w_1 ... w_k)]
        = -log [Σ_T P(T, w_1 ... w_{k+1}) / Σ_T P(T, w_1 ... w_k)]   (4)

This is how surprisal is computed in practice.
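Equation (4) amounts to two sums and a log ratio. A minimal sketch, with the joint tree/prefix probabilities invented for illustration (in practice a probabilistic parser would supply them):

```python
import math

def surprisal(prefix_probs, extended_probs):
    """S_{k+1} = -log2( sum_T P(T, w_1..w_{k+1}) / sum_T P(T, w_1..w_k) )."""
    return -math.log2(sum(extended_probs) / sum(prefix_probs))

# Hypothetical joint probabilities P(T, prefix): three trees are
# compatible with w_1..w_k, but the next word disconfirms two of them
# and lowers the probability of the third.
p_k = [0.02, 0.005, 0.001]   # per-tree P(T, w_1..w_k)
p_k1 = [0.0002]              # per-tree P(T, w_1..w_{k+1})

s_next = surprisal(p_k, p_k1)
```

Disconfirming most of the probability mass yields a high surprisal (here about 7 bits), which the model maps onto processing difficulty.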

Surprisal: Example

Assume we want to compute the prefix probability for:

(7) The reporter who ...

The prefix probability, by definition, is:

P(the, reporter, who) = Σ_T P(T, the, reporter, who)

Assume that there is only one tree. We compute its probability using a PCFG, i.e., by multiplying the probabilities of the rules in T:

P(T, the, reporter, who) = Π_i P(rule_i)

Surprisal: Example

Assume the following syntactic tree (bracket notation for the slide's tree diagram):

[S [NP [NP [DT The] [NN reporter]] [SBAR [WHNP [WP who]] [S ...]]] [VP ...]]

Surprisal: Example

An example PCFG that generates this tree:

Example                Rule             Rule probability
The reporter who ...   S -> NP VP       p = 0.6
The reporter who ...   NP -> NP SBAR    p = 0.004
The reporter           NP -> DT NN      p = 0.5
The                    DT -> the        p = 0.7
reporter               NN -> reporter   p = 0.0002
who ...                SBAR -> WHNP S   p = 0.12
who                    WHNP -> WP       p = 0.2
who                    WP -> who        p = 0.8
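Multiplying these rule probabilities gives the single-tree prefix probability from the previous slide; a sketch, with the rules transcribed from the example PCFG:

```python
import math

# Rule probabilities for the example PCFG (as on the slide).
rules = [
    ("S -> NP VP", 0.6),
    ("NP -> NP SBAR", 0.004),
    ("NP -> DT NN", 0.5),
    ("DT -> the", 0.7),
    ("NN -> reporter", 0.0002),
    ("SBAR -> WHNP S", 0.12),
    ("WHNP -> WP", 0.2),
    ("WP -> who", 0.8),
]

# P(T, the, reporter, who) = product of the rule probabilities in T.
p_prefix = math.prod(p for _, p in rules)
```

Since there is only one tree by assumption, p_prefix (about 3.2e-9) is also P(the, reporter, who); given the prefix probability after the next word as well, surprisal follows as in (4).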

Results

To evaluate surprisal, we need a probabilistic parser; the incremental top-down parser of Roark (2001) is often used. Evaluation procedure:
- train the parser on a training corpus (e.g., the Penn Treebank);
- take experimental materials from psycholinguistic experiments;
- parse them using the parser and compute the surprisal values for each sentence;
- compare these to the reading time results for the sentences (typically by-condition averages).

Results: either ... or

Compare reading times for "either ... or" sentences against surprisal: surprisal successfully models the data.

Results: Relative Clauses

Compare reading times for relative clauses against surprisal values:

[Figure: empirical data (Staub 2010, Expt 1; reading times in ms) alongside surprisal predictions, for the SRC and ORC conditions at each word region (rel_pron, src_vb, det, noun, orc_vb, main_vb).]

Surprisal successfully models only the difference at the NP. To model the difference at the verb, we need to add a distance-based memory cost component (Demberg, Keller, & Koller, 2013).

Results: Garden Paths

Garden paths still work (Hale, 2001):

[Figure: log(previous prefix probability / current prefix probability) at each word of "the horse raced past the barn fell"; the ratio stays small over the early words and spikes at the disambiguating word, where garden-pathing occurs.]

Results: Garden Paths

Compare the reduced relative clause (garden path) against the unreduced relative clause (not a garden path):

[Figure: word-by-word log prefix-probability ratios for the subject relative clause "the banker who was told about the buy-back resigned" (peak 5.88) and for the reduced relative clause "the banker told about the buy-back resigned" (peak 6.68); the reduced version shows the larger peak.]

Summary

- The human sentence processor builds up expectations about the input;
- if these expectations are not met, surprisal ensues, which manifests itself as processing difficulty;
- mathematically, surprisal is the change in the probability distribution over possible trees from one word to the next;
- it can be computed based on the prefix probabilities returned by a probabilistic parser;
- the surprisal model accounts for garden path sentences, but also for processing difficulty not related to ambiguity.

References

Demberg, V., Keller, F., & Koller, A. (2013). Incremental, predictive parsing with psycholinguistically motivated tree-adjoining grammar. Computational Linguistics, 39(4), 1025–1066.
Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model. In Proceedings of NAACL 2001 (Vol. 2, pp. 159–166).
Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20(2), 137–194.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177.
Roark, B. (2001). Probabilistic top-down parsing and language modeling. Computational Linguistics, 27(2), 249–276.
Staub, A. (2010). Eye movements and processing difficulty in object relative clauses. Cognition, 116, 71–86.
Staub, A., & Clifton, C. (2006). Syntactic prediction in language comprehension: Evidence from either ... or. Journal of Experimental Psychology: Learning, Memory, and Cognition, 32, 425–436.