Crosslinguistic Quantitative Syntax: Dependency Length and Beyond
Richard Futrell, joint work with Kyle Mahowald and Ted Gibson
22 September 2016


Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Quantitative Syntax and Functional Typology
This work is about using crosslinguistic dependency corpora to do quantitative syntax. It's also about communicative functional typology, which posits that languages have developed structures that make utterances easy to use in communication. Such theories make predictions at the level of the utterance, since they predict that the average utterance will have desirable properties. So quantitative corpus syntax is a natural way to test communicative hypotheses about language universals. This talk explores the hypothesis that there is a universal pressure to minimize dependency lengths, which would make sentences easier to parse and generate.

Preview of Dependency Length Results
We find that dependency length in real utterances is shorter than in linguistically motivated random baselines. We develop models of grammatical dependency tree linearizations. In almost all languages, dependency length in real utterances is shorter than in random grammatical reorderings of those utterances. [Figure: dependency length vs. sentence length per language, comparing real orders with free random and random grammatical linearizations.] We also explore crosslinguistic variation in dependency length and find it is non-uniform; explaining these results is a challenge for functional typology. [Figure: dependency length at sentence lengths 10, 15, and 20 vs. proportion of head-final dependencies, one point per language.]

Data Sources
There has been a recent effort in the NLP community to develop standardized dependency corpora of many languages for use in training parsers. Results:
Universal Dependencies: hand-parsed or hand-corrected corpora of 35+ languages, modern and ancient (Nivre et al., 2015).
HamleDT: automatic conversion of hand-parsed corpora to Universal Dependencies style (Zeman et al., 2012, 2014).
Google Universal Treebank: a predecessor to UD, which still has some languages that UD doesn't (McDonald et al., 2013).
PROIEL: texts in Indo-European classical languages (Haug and Jøhndal, 2008).
Corpora vary in their public availability, but most are easy to get.

Languages and their families (subfamily or Sprachbund):
Indonesian: Austronesian
Tamil: Dravidian
Telugu: Dravidian
Japanese: East Asian
Korean: East Asian
Classical Armenian: IE/Armenian
Irish: IE/Celtic
Ancient Greek: IE/Classical
Latin: IE/Classical
Danish: IE/Germanic
German: IE/Germanic
English: IE/Germanic
Dutch: IE/Germanic
Swedish: IE/Germanic
Gothic: IE/Germanic
Norwegian (Bokmål): IE/Germanic
Modern Greek: IE/Greek
Bengali: IE/Indo-Aryan
Persian: IE/Indo-Aryan
Hindi: IE/Indo-Aryan
Catalan: IE/Romance
Spanish: IE/Romance
French: IE/Romance
Italian: IE/Romance
Portuguese: IE/Romance
Romanian: IE/Romance
Bulgarian: IE/Slavic
Church Slavonic: IE/Slavic
Croatian: IE/Slavic
Czech: IE/Slavic
Russian: IE/Slavic
Old Russian: IE/Slavic
Slovak: IE/Slavic
Slovenian: IE/Slavic
Polish: IE/Slavic
Basque: Isolate
Arabic: Semitic
Hebrew: Semitic
Turkish: Turkic
Estonian: Uralic/Finnic
Finnish: Uralic/Finnic
Hungarian: Uralic/Ugric
Mandarin: Sino-Tibetan

Universal Dependencies Annotation
[Example UD tree for Sentence 99 in the English UD dev set, with POS tags PRON VERB ADP DET NOUN ADP DET NOUN.] Sentences are annotated with a dependency tree, dependency arc labels, wordforms, and Google Universal POS tags. For most but not all languages, all of these levels of annotation are available. Data sources are newspapers, novels (including some translated novels), blog posts, and some spoken language.

Universal Dependencies Annotation
[Same example tree, Sentence 99 in the English UD dev set.] Some of the annotation decisions are surprising from the perspective of purely syntactic dependencies.

Universal Dependencies Annotation
In order to parse English uniformly with languages that would express the same relation with a case marker and no adposition, UD parses prepositional phrases with the noun as the head and the preposition as a dependent. Similarly, complementizers are dependents of verbs, auxiliary verbs are dependents of content verbs, and predicates are heads of copula verbs (!). The dependencies in their raw form thus reflect grammatical relations more than syntactic dependencies (de Marneffe and Manning, 2008; de Marneffe et al., 2014; Nivre, 2015).

Universal Dependencies Annotation
[Same example tree.] Fortunately, it is often possible to convert UD dependencies automatically into syntactic dependencies. And for HamleDT corpora and other non-UD corpora, Prague-style syntactic dependencies are often available. We prefer syntactic dependencies for dependency length studies.

Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Section: Dependency Length Minimization
  As an empirical phenomenon
  As a typological theory
  Cognitive motivations

Dependency Length Minimization as an Empirical Phenomenon: Weight Preferences
Behaghel's Four Laws of Word Order (1909):
1. Mentally closely related items are placed close together.
2. Given before new.
3. Modifier before modified.
4. Short constituents before long constituents.
There is good quantitative evidence for this preference in English (e.g. Wasow, 2002) and German (Hawkins, 2004). In extreme cases the short-before-long preference produces orders that are otherwise ungrammatical (Heavy NP Shift). Short-before-long produces shorter dependency lengths than long-before-short in head-initial structures; in head-final structures, long-before-short is preferred (Yamashita & Chang, 2001).

Dependency Length Minimization as an Empirical Phenomenon: Weight Preferences [figure]

Section: Dependency Length Minimization
  As an empirical phenomenon
  As a typological theory
  Cognitive motivations

Dependency Length Minimization as a Typological Theory
In addition to explaining order preferences within utterances, a pressure to minimize dependency lengths can explain word order universals (Hawkins, 1991, 1994): when dependency trees have low arity, DLM is achieved by consistent head direction [tree diagrams]. This is a possible explanation for harmonic orders (Greenberg, 1963; Vennemann, 1973): OV order is correlated with Noun-Adposition, Adjective-Noun, Determiner-Noun, etc. (consistently head-final), while VO order is correlated with Adposition-Noun, Noun-Adjective, Noun-Determiner, etc. (consistently head-initial).

Dependency Length Minimization as a Typological Theory
For higher-arity trees, a (projective) grammar should arrange phrases outward from a head in order of increasing average length (Gildea & Temperley, 2007) [tree diagrams]. This is consistent with Dryer's (1992) observation that exceptions to harmonic orders are usually for short constituents such as determiners.

Dependency Length Minimization as a Typological Theory
DLM has been advanced as an explanation for the prevalence of projective dependencies, corresponding to context-freeness in language (Ferrer i Cancho, 2004). [Dependency diagram of the non-projectively ordered Latin line "in nova fert animus mūtātas dīcere formas corpora", whose projective bracketing would be: [animus fert [dīcere [formas [mūtātas [in [nova corpora]]]]]].]

Section: Dependency Length Minimization
  As an empirical phenomenon
  As a typological theory
  Cognitive motivations

Motivation for Dependency Length Minimization
When parsing a sentence incrementally, dependency length is a lower bound on the amount of time you have to hold a word in memory (Abney & Johnson, 1991). Making a syntactic connection between the current word and a previous word may be hard if the previous word has been in memory for a long time, because of decay of memory representations (DLT integration cost: Gibson, 1999, 2000) or similarity-based interference (Vasishth & Lewis, 2006). There is reading time evidence for integration cost in controlled experiments (Grodner & Gibson, 2005), though corpus evidence is mixed (see Demberg & Keller, 2008). Short dependencies also mean a smaller domain to search for the head of a phrase (Hawkins, 1994).

Motivation for Dependency Length Minimization
Multiple theories converge on predicting easier processing when dependency length is minimized. In the current work we are agnostic about the precise motivation for DLM.

Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Section: Comparison to Random Baselines
  Motivation and Methodology
  Free Order Projective Baseline
  Fixed Order Projective Baseline
  Consistent Head Direction Projective Baseline

DLM is an appealing theory, but...
There are other explanations for the putative typological effects of DLM: consistent head direction might have to do with simplicity of grammar, and projectivity might be motivated by parsing complexity. If actual utterances do not have shorter dependency length than what one would expect from these (and other) independently motivated constraints, then the evidence for DLM as the functional pressure explaining these constraints is weakened. Our research question: do real utterances in many languages have word orders that minimize dependency length, compared to what one would expect under these constraints?

Random Reorderings as a Baseline
Do the recently available parsed corpora of 40+ languages show evidence that dependency lengths are shorter than what we would expect under independently motivated constraints? Methodology: compare attested orders to random reorderings of the same dependency trees under various constraints, following Gildea & Temperley (2007, 2010), Park & Levy (2009), Hawkins (1999), and Gildea & Jaeger (ms). A similar approach compares to random tree structures (Liu, 2008; Ferrer i Cancho and Liu, 2015; Lu, Xu, and Liu, 2015). We measure the length of a dependency as the number of words intervening between head and dependent, plus 1.
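To make the measure concrete, here is a minimal Python sketch (our illustration, not the authors' code; the tree for the running example "From the AP comes this story" is reconstructed from the later slides):

```python
# Minimal sketch of the measure. A linearized tree is a set of
# (head_position, dependent_position) pairs; the length of one dependency is
# the count of intervening words plus 1, i.e. the absolute difference of the
# two word positions.
def total_dependency_length(edges):
    return sum(abs(head - dep) for head, dep in edges)

# "From the AP comes this story" (positions 0-5): comes(3) governs from(0)
# and story(5); from(0) governs AP(2); AP(2) governs the(1); story(5)
# governs this(4).
edges = [(3, 0), (3, 5), (0, 2), (2, 1), (5, 4)]
print(total_dependency_length(edges))  # 3 + 2 + 2 + 1 + 1 = 9
```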

Why Random Reorderings?
[Diagram: tree structures (the content expressed) plus word order rules and preferences jointly determine dependency length.] Our approach is to hold tree structure constant and study whether word orders are optimized given those tree structures. This allows us to isolate the specific effect of DLM on word order.

Unconstrained Random Baseline
[The example tree reordered by uniformly random permutations of the words. Attested total dependency length: 9; random permutation totals: 9, 13, 11, 13, 13, 13.]
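A sketch of this baseline under the same assumptions: permute word positions uniformly at random, hold the tree fixed, and remeasure.

```python
# Sketch of the unconstrained baseline (our reconstruction of the procedure).
import random

def total_dependency_length(edges):  # repeated from the sketch above
    return sum(abs(h - d) for h, d in edges)

edges = [(3, 0), (3, 5), (0, 2), (2, 1), (5, 4)]  # the running example

def random_baseline_length(edges, n_words):
    perm = list(range(n_words))
    random.shuffle(perm)  # assign each word a new random position
    return total_dependency_length([(perm[h], perm[d]) for h, d in edges])

samples = [random_baseline_length(edges, 6) for _ in range(10000)]
print(sum(samples) / len(samples))  # mean dependency length under the baseline
```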

Section: Comparison to Random Baselines
  Motivation and Methodology
  Free Order Projective Baseline
  Fixed Order Projective Baseline
  Consistent Head Direction Projective Baseline

Projective Random Baseline
[Animation: the example tree for "From the AP comes this story" is relinearized by randomly ordering each head and its dependent phrases, keeping every phrase contiguous (projective).]

Projective Random Baseline
[Random projective reorderings of the example tree. Attested total dependency length: 9; random projective totals: 9, 8, 6, 10, 11, 6.]
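A minimal sketch of one way to sample such a projective linearization: recursively order each head together with its dependents' phrases, keeping every phrase contiguous. This is our illustration of the idea behind the baseline, not Gildea & Temperley's code:

```python
import random

def total_dependency_length(edges):
    return sum(abs(h - d) for h, d in edges)

def linearize_projective(tree, head):
    """Randomly linearize the phrase headed by `head`, keeping it contiguous."""
    parts = [[head]] + [linearize_projective(tree, d) for d in tree.get(head, [])]
    random.shuffle(parts)  # random order of the head and its dependent phrases
    return [w for part in parts for w in part]

# running example "From the AP comes this story": comes(3) -> from(0), story(5);
# from(0) -> AP(2); AP(2) -> the(1); story(5) -> this(4)
tree = {3: [0, 5], 0: [2], 2: [1], 5: [4]}
edges = [(h, d) for h, deps in tree.items() for d in deps]
order = linearize_projective(tree, 3)
pos = {w: i for i, w in enumerate(order)}
print(order, total_dependency_length([(pos[h], pos[d]) for h, d in edges]))
```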

Previous Results The random projective baseline was used previously in Gildea & Temperley (2007, 2010) and Park & Levy (2009).

Statistical Model
To test the significance of the effect that real dependency lengths are shorter than the random baseline, we fit a mixed-effects regression for each language. For each sentence, we predict the dependency length of each linearized tree given: (1) squared sentence length, (2) whether the linearization is real [0] or random [1], and (3) a random slope of (2) conditional on sentence identity. The coefficient for (1) is the dependency length growth rate for real sentences. The interaction of (1) and (2) is the difference in growth rate for baseline linearizations as opposed to attested linearizations; this interaction is the coefficient of interest. It is significantly positive in all languages (p < 0.001).
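A sketch of how such a model could be fit in Python. The software choice, column names, and toy data are assumptions; the slide specifies only the model structure:

```python
# Sketch of the per-language mixed-effects regression (real fits would use
# thousands of sentences; this toy data only shows the model form).
import pandas as pd
import statsmodels.formula.api as smf

# one row per (sentence, linearization); is_random: 0 = attested, 1 = baseline
df = pd.DataFrame({
    "dep_length": [9, 13, 11, 22, 30, 27, 15, 19, 24, 40, 55, 48],
    "sent_len":   [6, 6, 6, 10, 10, 10, 8, 8, 8, 14, 14, 14],
    "is_random":  [0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    "sent_id":    [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})
df["sent_len_sq"] = df["sent_len"] ** 2

model = smf.mixedlm(
    "dep_length ~ sent_len_sq * is_random",  # interaction = coefficient of interest
    df, groups="sent_id",
    re_formula="~is_random",  # random slope for is_random by sentence
)
print(model.fit().summary())
```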

Conclusions So Far Observed dependency length is not explained by projectivity alone.

Section: Comparison to Random Baselines
  Motivation and Methodology
  Free Order Projective Baseline
  Fixed Order Projective Baseline
  Consistent Head Direction Projective Baseline

Fixed Order Projective Baseline
The previous baseline simulated languages with no word order restrictions beyond projectivity: speakers speaking random languages randomly. Here we simulate random linearization grammars with fixed word order for given dependency types: speakers speaking random languages deterministically. E.g., languages in which subjects always come before verbs, or adjectives always come before nouns, etc. This might affect dependency length because head direction will be more consistent within utterances.

Fixed Order Projective Baseline
[Example tree for "From the AP comes this story" with a sample linearization grammar: adpmod: -0.9, pobj: 0.5, det: 0.4, nsubj: -0.3.] Procedure: assign each dependency type (nsubj, adpmod, etc.) a "weight" in [-1, 1]. Call the mapping of dependency types to weights a linearization grammar G. Linearize the sentence according to G: place each dependent in order of increasing weight, placing the head as if it had weight 0. (A code sketch follows the worked example below.)

Fixed Order Projective Baseline
[Worked example: applying the grammar above deterministically yields the order "from AP the story this comes".]

Conclusions So Far Observed dependency length is not explained by projectivity alone. Observed dependency length is not explained by projectivity in conjunction with fixed word order.

Section: Comparison to Random Baselines
  Motivation and Methodology
  Free Order Projective Baseline
  Fixed Order Projective Baseline
  Consistent Head Direction Projective Baseline

Consistent Head Direction Projective Baseline
Could observed dependency length be explained by a combination of (1) projectivity and (2) consistent head direction? Let's compare to random projective reorderings with consistent head direction.

Conclusions So Far
Observed dependency length is not explained by projectivity alone. Observed dependency length is not explained by projectivity in conjunction with fixed word order. Observed dependency length is not explained by a pressure for consistency in head direction. For strongly head-initial and head-final languages, this implies the existence of short-before-long or long-before-short order preferences. Overall, dependency length minimization effects are not explained by various alternative principles: evidence that dependency length minimization is a pressure in itself.

Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Section: Grammar and Usage
  Relevance to Dependency Length Minimization
  Modeling Grammatical Orders
  Results

Grammar and Usage
We can think of each attested linearization of a tree as resulting from the application of multiple filters: all orders → (choose many) → grammatical orders for the particular language → (choose one) → attested order. Where does DLM happen?

Grammar and Usage
Where does DLM happen? (The two options are not exclusive.) Grammar: the language filters out bad orders, so that a random sample from the set of grammatical orders already has desirable dependency length. Usage: the speaker chooses orders based on dependency length; there need not be any optimization at the grammar step.

Grammar and Usage
DLM through Usage: choosing optimal orderings on a per-sentence basis. With an unconstrained grammar, this would give the best possible dependency length properties.

Grammar and Usage
DLM through Grammar: compare two toy verb-final grammars linearizing "the tall woman lives on Mars". Language 1 (V-final, A-N, N-P) yields orders such as "tall woman Mars on lives" and "Mars on tall woman lives"; Language 2 (V-final, N-A, P-N) yields "woman tall on Mars lives" and "on Mars woman tall lives". For certain sentences, Language 1 is better on average than Language 2.

Random Grammatical Reorderings
[Reorderings of the example tree sampled from a linearization grammar. Attested total dependency length: 9; totals under random grammatical reorderings: 9, 8, 6, 10, 11, 6.]

Random Grammatical Reorderings
[As before, but each reordering is weighted by its probability under a linearization model; the figure compares the attested total (9) with probability-weighted totals (6.86, 9.14).]

Section: Grammar and Usage
  Relevance to Dependency Length Minimization
  Modeling Grammatical Orders
  Results

Linearization Models We want to be able to induce from corpora a model of the possible grammatical linearizations of a given dependency tree. Task: Given an unordered dependency tree U, find the probability distribution over ordered dependency trees T with the same structure as U. This is a known task in NLP, as part of natural language generation pipelines (Belz et al., 2011; Rajkumar & White, 2014). For more details on models and their evaluation, see Futrell & Gibson (2015, EMNLP).

Conditioning on Trees
[Example tree with POS tags NOUN VERB NOUN DET.] In an ideal world, we would base a linearization model on joint counts of full tree structures and full word orders. But counts of ordered tree structures given unordered full tree structures as the conditioning variable would be far too sparse: most tree structures appear only once. (Hans sah den Mann: 1; den Mann sah Hans: 0.) The first thing to do is drop wordforms and condition on tree structures with POS tags. But even this will still be sparse.

Breaking Trees Apart
[Tree diagram.] So, we get conditional counts of orders of local subtrees.

Breaking Trees Apart
Example counts:
NOUN/nsubj VERB/head NOUN/dobj: 55
NOUN/dobj VERB/head NOUN/nsubj: 25
DET/det NOUN/head: 500
NOUN/head DET/det: 1
So, we get conditional counts of orders of local subtrees. This is interpretable: we get information about order constraints between sister dependents. Modeling only local subtrees is equivalent to modeling a language with a (strictly headed) PCFG. But: we lose conditioning information from outside the local subtree, and we lose the ability to model non-projective (non-context-free) orders.
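A sketch of this counting step; the string representation of subtree elements is our assumption:

```python
# For each head, record the attested order of the head and its immediate
# dependents, keyed by the unordered local subtree.
from collections import Counter, defaultdict

counts = defaultdict(Counter)  # unordered local subtree -> Counter over orders

def observe(head, ordered):
    """head: POS of the head; ordered: head and its immediate dependents in
    attested order, as 'POS/relation' strings with the head as 'POS/head'."""
    deps = tuple(sorted(x for x in ordered if not x.endswith("/head")))
    counts[(head, deps)][tuple(ordered)] += 1

observe("VERB", ("NOUN/nsubj", "VERB/head", "NOUN/dobj"))
observe("VERB", ("NOUN/nsubj", "VERB/head", "NOUN/dobj"))
observe("VERB", ("NOUN/dobj", "VERB/head", "NOUN/nsubj"))

key = ("VERB", ("NOUN/dobj", "NOUN/nsubj"))
total = sum(counts[key].values())
for order, c in counts[key].most_common():
    print(order, c / total)  # ML estimate of p(order | unordered local subtree)
```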

What's in a Tree?
Another question is what aspects of the local subtrees to condition on: POS tags for head and dependents plus relation types? Or don't consider the POS of the head? Or don't consider the POS of the dependents? [Example count tables under each conditioning choice, e.g. NOUN/nsubj VERB/head NOUN/dobj vs. X/nsubj X/head X/dobj.]

Linearization Models To strike a balance between accuracy and data sparsity, we combine models that condition on more and less context to form a backoff distribution. We can also smooth the model by considering N-gram probabilities of orders within local subtrees. Backoff weights determined by Baum-Welch algorithm.

Linearization Models from Generative Dependency Models
We want a model of ordered trees T conditional on unordered trees U. We can derive these models from head-outward generative models that generate T from scratch (Eisner, 1996; Klein and Manning, 2004). [Diagram: head-outward generation of "From the AP comes this story today", with # marking the stop symbol at each end of a head's dependent sequence.]

Linearization Models from Generative Dependency Models
In these models, dependency trees are generated from a set of N-gram models conditional on head word and direction. So if we want a model of ordered trees conditional on unordered trees, we just need a model of ordered sequences conditional on unordered sequences generated by an N-gram model:

p(abc | {A, B, C}) = p(abc) / Σ_w p(w), where w ranges over permutations of {A, B, C}

The sum over permutations can be computed by dynamic programming.
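For illustration, the renormalization can be brute-forced over permutations (a sketch; the toy scorer stands in for an N-gram model, and a real implementation would use the dynamic program mentioned above):

```python
# Condition an ordered-sequence model on an unordered multiset by
# renormalizing over all permutations (brute force, for illustration only).
from itertools import permutations

def p_order_given_multiset(seq, p_seq):
    """p_seq: function scoring an ordered sequence, e.g. an N-gram model."""
    z = sum(p_seq(perm) for perm in set(permutations(seq)))
    return p_seq(tuple(seq)) / z

def toy_p(seq):  # hypothetical probabilities standing in for an N-gram model
    return {"ABC": 0.5, "ACB": 0.1, "BAC": 0.1,
            "BCA": 0.1, "CAB": 0.1, "CBA": 0.1}["".join(seq)]

print(p_order_given_multiset(("A", "B", "C"), toy_p))  # 0.5
```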

Evaluating Linearization Models We have a large parameter space for linearization models. We evaluate different parameters in three ways: 1. Test Set Perplexity: Which model setting gives the highest probability to unseen trees in dependency corpora? 2. Acceptability: Ask people how natural the reordered sentences sound on a scale of 1 to 5. 3. Same meaning: Ask people whether the reordered sentence means the same thing as the original sentence. The last two evaluations were done on Mechanical Turk for English models only.

Best Models The best model for perplexity is the one with the most smoothing. The best models for acceptability and same meaning are more conservative models based on POS N-grams within local subtrees. For English, best acceptability is 3.8 / 5 on average. (Original sentences are 4.7 / 5 on average.) For English, the best model produces orders with the same meaning as the original 85% of the time (close to the state of the art).

Best Models Models that give higher probability to held-out orders also produce orders that are rated more acceptable in English.

Models to Run For dependency length experiments, we compare attested dependency length to random linearizations under three models: 1. The model that selects uniformly among attested orders for local subtrees, conditional on POS tags for head and dependent. 2. The model with the best perplexity score (highly smoothed). 3. The model with the best same-meaning rating for English (more conservative).

Section: Grammar and Usage
  Relevance to Dependency Length Minimization
  Modeling Grammatical Orders
  Results

[Figure: dependency length vs. sentence length for 30 languages, comparing linearizations: Real, Free random projective, Random (licit), Random (same meaning), Random (best perplexity). A second version zooms in on sentence lengths 15-30. Real orders show shorter dependency lengths than all the random linearizations.]

Conclusions Dependency length of real utterances is shorter than random grammatical linearizations under these models. We would like to conclude that this means: (1) There is a universal pressure in usage for DLM, (2) Grammars are optimized so that the average utterance will have short dependency length. However, our conclusions are only as strong as our linearization models. We only consider projective reorderings within local subtrees. The models are based on limited data and may miss certain licit orders.

Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Residue of DLM
We have studied dependency length under the hypothesis that there is a universal pressure for dependency lengths to be short, and that this affects grammar and usage. But having controlled for various baselines, there remains residual variance between languages in dependency length. There are no new baselines in this part; rather, we ask: what linguistic properties determine whether a language has short or long dependencies? We do not have formal explanations for these findings, but we offer some directions for explaining them.

Residue of DLM

Head-Finality
We see relatively long dependency lengths for strongly head-final languages such as Japanese, Korean, Tamil, and Turkish. Comparing dependency length at fixed sentence lengths to the proportion of head-final dependencies in a corpus, we find correlations of dependency length with head-finality. [Figure: dependency length at sentence lengths 10, 15, and 20 vs. proportion of head-final dependencies, one point per language.]

[Figures: weight vs. position, one panel per language (ar, bg, cs, cu, da, de, ...).]

Dependency Length and Head-Finality
Under integration cost theories of processing difficulty, where it is hard to link the current word to another word that has been in memory for a long time, we expect no asymmetry between head-final and head-initial dependencies. But integration cost effects are typically not observed in head-final constructions where many modifiers precede the head (Konieczny, 2000; Vasishth & Lewis, 2006; Levy, 2008). Perhaps head-final dependencies incur less processing cost, so there is less pressure to minimize their length.

Back to this figure

Word Order Freedom
We measure word order freedom as the conditional entropy of the direction of a word's head, conditional on the part of speech of the word and the relation type of the dependency (Futrell, Mahowald & Gibson, 2015, DepLing). [Figure: dependency length at sentence lengths 10, 15, and 20 vs. this entropy (axis labeled BDE), one point per language.]
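A sketch of this measure using plain plug-in estimation (the paper's exact estimator may differ):

```python
# Plug-in conditional entropy of head direction given the dependent's POS and
# relation type.
import math
from collections import Counter

def head_direction_entropy(observations):
    """observations: list of (pos, relation, direction) triples, with
    direction in {'head-left', 'head-right'}."""
    joint = Counter(observations)
    context = Counter((pos, rel) for pos, rel, _ in observations)
    n = len(observations)
    h = 0.0
    for (pos, rel, d), c in joint.items():
        h -= (c / n) * math.log2(c / context[(pos, rel)])
    return h  # bits; 0 = direction fully determined by POS and relation

obs = [("NOUN", "det", "head-right")] * 95 + [("NOUN", "det", "head-left")] * 5
print(head_direction_entropy(obs))  # ~0.29 bits
```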

Dependency Length and Word Order Freedom
In languages with a high degree of freedom in whether the head of a word is to its right or left, we find longer dependencies. One would think that speakers of languages with lots of word order freedom would use that freedom to select orders that minimize dependency length. On the other hand, such languages typically have complex morphology. If the difficulty of processing long dependencies is due to similarity-based interference (Lewis & Vasishth, 2006), then words with more distinctive morphology will be less confusable, and retrieving them from memory will be easier. So we might expect morphologically complex languages to have longer dependencies: long dependencies incur less processing difficulty in such languages.

Morphological Complexity
We measure morphological complexity as the entropy of words (the information content of words) minus the entropy of lemmas (the information content of lemmas). We estimate these entropies from corpus counts using state-of-the-art entropy estimation methods (the Pitman-Yor mixture method of Archer et al., 2014). [Figure: mean dependency distance at sentence lengths 10, 15, and 20 vs. morphological entropy, one point per language.]
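A sketch of the measure with plain plug-in entropies; the talk's actual estimates use the Pitman-Yor mixture estimator of Archer et al. (2014), which corrects for unseen types:

```python
# Morphological complexity as H(word) - H(lemma), with naive plug-in entropy.
import math
from collections import Counter

def plugin_entropy(tokens):
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

words  = ["dogs", "dog", "ran", "runs", "running"]   # toy corpus
lemmas = ["dog", "dog", "run", "run", "run"]
print(plugin_entropy(words) - plugin_entropy(lemmas))  # bits carried by inflection
```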

Dependency Length and Morphology
Consistent with the idea that languages with more informative morphology create less difficulty in processing long dependencies, we find longer dependency lengths in such languages. A real formalization of this notion would require a processing model that integrates morphological complexity and dependency length, and a way to find orders that minimize parsing difficulty under such a model.

Outline
  Quantitative Syntax with Dependency Corpora
  Dependency Length Minimization
  Comparison to Random Baselines
  Grammar and Usage
  Residue of Dependency Length Minimization
  Conclusion

Conclusion We have provided large-scale corpus evidence for dependency length minimization beyond what is explained by projectivity, fixedness of word order, and consistency of head direction. Evidence for dependency length minimization as a principle that is independent of those other constraints, or which subsumes those constraints.

Conclusion
We have shown that attested utterances have shorter dependency length than random grammatical reorderings of those utterances, and that random reorderings under the attested grammars have shorter dependency length than reorderings under random grammars. This is evidence for universal DLM in both grammar and usage.

Conclusion
We have shown residual covariance of dependency length with other linguistic features. This suggests that DLM is not enough: we need other, more detailed theories to explain the quantitative distribution of dependency lengths.

Thanks all! Thanks to Tim O'Donnell, Roger Levy, Kristina Gulordava, Paola Merlo, Ramon Ferrer i Cancho, Christian Bentz, and Timothy Osborne for helpful discussions. This work was supported by NSF Doctoral Dissertation Improvement Grant #1551543 to Richard Futrell, an NDSEG fellowship to Kyle Mahowald, and NSF grant #6932627 to Ted Gibson.

This talk is based on these papers (but a lot of it isn't published yet!):
Futrell, Mahowald & Gibson (2015). Large-scale evidence of dependency length minimization in 37 languages. PNAS.
Futrell, Mahowald & Gibson (2015). Quantifying word order freedom in dependency corpora. Proceedings of DepLing.
Futrell & Gibson (2015). Experiments with generative models for dependency tree linearization. Proceedings of EMNLP.