The Effect of Multiple Grammatical Errors on Processing Non-Native Writing


Courtney Napoles, Johns Hopkins University (courtneyn@jhu.edu)
Aoife Cahill and Nitin Madnani, Educational Testing Service ({acahill,nmadnani}@ets.org)

Abstract

In this work, we estimate the deterioration of NLP processing given an estimate of the amount and nature of grammatical errors in a text. From a corpus of essays written by English-language learners, we extract ungrammatical sentences, controlling the number and types of errors in each sentence. We focus on six categories of errors that are commonly made by English-language learners, and consider sentences containing one or more of these errors. To evaluate the effect of grammatical errors, we measure the deterioration of ungrammatical dependency parses using the labeled F-score, an adaptation of the labeled attachment score. We find notable differences between the influence of individual error types on the dependency parse, as well as interactions between multiple errors.

1 Introduction

With the large number of English-language learners and the prevalence of informal web text, noisy text containing grammatical errors is widespread. However, the majority of NLP tools are developed and trained over clean, grammatical text, and the performance of these tools may be negatively affected when processing errorful text. One possible workaround is to adapt the tools for noisy text, e.g. (Foster et al., 2008; Cahill et al., 2014). However, it is often preferable to use tools trained on clean text, mainly because of the resources necessary for training and the limited availability of large-scale annotated corpora, but also because the tools should work correctly in the presence of well-formed text.

Our goal is to measure the performance degradation of an automatic NLP task based on an estimate of the grammatical errors in a text. For example, if we are processing student responses within an NLP application, and the responses contain a mix of native and non-native texts, it would be useful to be able to estimate the difference in performance (if any) of the NLP application on both types of text. We choose dependency parsing as our prototypic task because it is often one of the first complex downstream tasks in NLP pipelines. We consider six common grammatical errors made by non-native speakers of English and systematically control the number and types of errors present in a sentence. As errors are introduced to a sentence, the degradation of the dependency parse is measured by the decrease in F-score over dependency relations.

In this work, we will show that increasing the number of errors in a sentence decreases the accuracy of the dependency parse (Section 4.1); that the distance between errors does not affect accuracy (Section 4.2); and that some types of grammatical errors have a greater impact, alone or in combination with other errors (Section 4.3). While these findings may seem self-evident, they have not previously been quantified on a large corpus of naturally occurring errors. Our analysis will serve as a first step toward understanding what happens to an NLP pipeline when confronted with grammatical errors.

2 Data

Previous research concerning grammatical errors has artificially generated errors over clean text, such as Foster et al. (2008) and Felice and Yuan (2014), among others. While this is one approach for building a large-scale corpus of grammatical and ungrammatical sentence pairs, we use text with naturally occurring errors so that our analysis covers the types of errors typically seen in non-native writing.

As the source of our data, we use the training section of the NUS Corpus of Learner English (NUCLE, version 3.2), a large corpus of essays written by non-native English speakers (Dahlmeier et al., 2013). The NUCLE corpus has been annotated with corrections to grammatical errors, and each error has been labeled with one of 28 error categories. We only consider the following common error types, which constitute more than 50% of the 44 thousand corrections in NUCLE:

- Article or determiner [Det]
- Mechanical (punctuation, capitalization, and spelling) [Mec]
- Noun number [Noun]
- Preposition [Prep]
- Word form [Wform]
- Verb tense and verb form [Verb]

While other error-coding schemes specify the nature of the error (whether text is unnecessary, missing, or needs to be replaced) in addition to the word class (Nicholls, 2004), the NUCLE error categories do not make that distinction. We therefore automatically labeled each error with an additional tag for the operation of the correction, depending on whether the text was missing a token, had an unnecessary token, or needed to replace a token. We labeled all noun, verb, and word form errors as replacements, and automatically detected the label of article, mechanical, and preposition errors by comparing the tokens in the original and corrected spans of text. If the correction had fewer unique tokens than the original text, it was labeled unnecessary. If the correction had more unique tokens, it was labeled missing. Otherwise the operation was labeled a replacement. To verify the validity of this algorithm, we reviewed the 100 most frequent error-correction pairs labeled with each operation, which encompass 69% of the errors in the corpus. (Many error-correction pairs are very frequent: for example, inserting or deleting "the" accounts for 3,851 of the errors, and inserting or deleting a plural -s for 2,804.)

Figure 1: The number of corrections by error type and operation that we used in this study.

To compile our corpus of sentences, we selected all of the corrections from NUCLE addressing one of the six error types above. We skipped corrections that spanned multiple sentences or the entire length of a sentence, as well as corrections that addressed punctuation spacing, since those errors would likely be addressed during tokenization. (NLTK was used for sentence and token segmentation, http://nltk.org.) We identified 14,531 NUCLE sentences containing errors subject to these criteria. We applied the corrections of all other types of errors and, in the rest of our analysis, we will use the term "errors" to refer only to errors of the types outlined above. On average, each of these sentences has 26.4 tokens and 1.5 errors, with each error spanning 1.2 tokens and its correction 1.5 tokens. In total, there are 22,123 errors, and Figure 1 shows the total number of corrections by error type and operation.

Because of the small number of naturally occurring sentences with exactly 1, 2, 3, or 4 errors (Table 1), we chose to generate new sentences with varying numbers of errors from the original ungrammatical sentences. For each of the NUCLE sentences, we generated ungrammatical sentences with n errors by systematically selecting n corrections to ignore, applying all of the other corrections.
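A minimal sketch of the operation-labeling heuristic described earlier in this section, assuming each correction is available as its original and corrected token lists (an assumed representation, not necessarily the authors' code):

    REPLACE_ONLY = {"Noun", "Verb", "Wform"}

    def label_operation(error_type, original_tokens, corrected_tokens):
        """Tag one correction as 'missing', 'unnecessary', or 'replacement'."""
        if error_type in REPLACE_ONLY:
            return "replacement"             # noun, verb, and word form errors
        n_orig = len(set(original_tokens))
        n_corr = len(set(corrected_tokens))
        if n_corr < n_orig:
            return "unnecessary"             # the correction removes a token
        if n_corr > n_orig:
            return "missing"                 # the correction adds a token
        return "replacement"

    # e.g. a dropped article: the original span is empty, the correction inserts "the"
    print(label_operation("Det", [], ["the"]))   # -> missing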

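A sketch of the generation scheme described above, assuming each correction is recorded as a (start, end, replacement tokens) edit over a token list (again an assumed representation):

    from itertools import combinations

    def apply_corrections(tokens, corrections):
        """Apply (start, end, replacement) edits right to left so offsets stay valid."""
        tokens = list(tokens)
        for start, end, replacement in sorted(corrections, reverse=True):
            tokens[start:end] = replacement
        return tokens

    def generate_errorful(tokens, corrections, max_errors=4):
        """Leave every size-n subset of corrections unapplied (those errors remain)
        and apply all the others, for n = 1 .. max_errors."""
        for n in range(1, min(max_errors, len(corrections)) + 1):
            for kept in combinations(range(len(corrections)), n):
                applied = [c for i, c in enumerate(corrections) if i not in kept]
                yield n, apply_corrections(tokens, applied)

Applying the edits right to left keeps the annotated offsets valid without re-indexing; the subsets of ignored corrections give the binomial counts in the next paragraph.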
We generated sentences with n = 1 to 4 errors, whenever there were at least n corrections to the original sentence. For example, a NUCLE sentence with 6 annotated corrections would yield the following numbers of ungrammatical sentences: 6 sentences with one error, C(6,2) = 15 sentences with two errors, C(6,3) = 20 sentences with three errors, and so on. The number of original NUCLE sentences and generated sentences with each number of errors is shown in Table 1. We also generated a grammatical sentence, with all of the corrections applied, for comparison.

    # errors    NUCLE sentences    Generated sentences    Exactly n errors
    1           14,531             22,123                 9,474
    2            5,030             11,561                 3,341
    3              572              5,085                     0
    4              570              3,577                   362

Table 1: The number of NUCLE sentences containing at least n errors, the number of sentences with n errors that were generated from them, and the number of NUCLE sentences with exactly n errors.

We parsed each sentence with the ZPar constituent parser (Zhang and Clark, 2011) and generated dependency parses from the ZPar output using the Stanford Dependency Parser (the EnglishGrammaticalStructure class with the flags -noncollapsed and -keeppunct) and the universal dependencies representation (De Marneffe et al., 2014). We make the over-confident assumption that the automatic analyses in our pipeline (tokenization, parsing, and error-type labeling) are all correct.

Our analysis also depends on the quality of the NUCLE annotations. When correcting ungrammatical text, annotators are faced with decisions of whether a text needs to be corrected and, if so, how to edit it. Previous work has found low inter-annotator agreement for the basic task of judging whether a sentence is grammatical (0.16 ≤ κ ≤ 0.40) (Rozovskaya and Roth, 2010). The NUCLE corpus is no different, with three NUCLE annotators having moderate agreement on how to correct a span of text (κ = 0.48) and only fair agreement for identifying what span of text needs to be corrected (κ = 0.39) (Dahlmeier et al., 2013). Low inter-annotator agreement is not necessarily an indication of the quality of the annotations, since it could also be attributed to the diversity of appropriate corrections that have been made. We assume that the annotations are correct and complete, meaning that the spans and labels of the annotations are correct and that all of the grammatical errors are annotated. We further assume that the annotations only fix grammatical errors, instead of providing stylistic alternatives to grammatical text.

3 Metric: Labeled F-score

To measure the effect of grammatical errors on the performance of the dependency parser, we compare the dependencies identified in the corrected sentence to those from the ungrammatical sentence. The labeled attachment score (LAS) is a commonly used method for evaluating dependency parsers (Nivre et al., 2004). The LAS calculates the accuracy of dependency triples from a candidate dependency graph with respect to those of a gold standard, where each triple represents one relation, consisting of the head, the dependent, and the type of relation. The LAS assumes that the surface forms of the sentences are identical and only the relations have changed. In this work, we require a method that accommodates unaligned tokens, which occur when an error involves deleting or inserting tokens, and unequal surface forms (replacement errors). There are some metrics that compare parses of unequal sentences, including SParseval (Roark et al., 2006) and TEDeval (Tsarfaty et al., 2011); however, neither of these metrics operates over dependencies. We chose to evaluate dependencies because dependency-based evaluation has been shown to be more closely related to the linguistic intuition of good parses compared to two other tree-based evaluations (Rehbein and van Genabith, 2007).

Since we cannot calculate LAS over sentences of unequal lengths, we instead measure the F1-score of dependency relations. So that substitutions (such as morphological changes) are not severely penalized, we represent tokens with their index instead of the surface form. First, we align the tokens in the grammatical and ungrammatical sentences and assign an index to each token such that aligned tokens in each sentence share the same index. Because reordering is uncommon in NUCLE corrections, we use dynamic programming to find the lowest-cost alignment between a sentence pair, where the cost for insertions and deletions is 1, and substitutions receive a cost proportionate to the Levenshtein edit distance between the tokens (to award partial credit for inflections). We calculate the Labeled F-score (LF) over dependency relations of the form <head index, dependent index, relation>. This evaluation metric can be used for comparing the dependency parses of aligned sentences with unequal lengths or tokens. (Available for download at https://github.com/cnap/ungrammatical-dependencies.)

A variant of LAS, the Unlabeled Attachment Score, is calculated over pairs of heads and dependents without the relation. We considered the corresponding unlabeled F-score and, since there was no meaningful difference between it and the labeled F-score, we chose to use labeled relations for greater specificity.

In the subsequent analysis, we focus on the difference in LF before and after an error is introduced to a sentence. We refer to the LF of a sentence with n errors as LF_n. The LF of a sentence identical to the correct sentence is 100, therefore LF_0 is always 100. The decrease in LF of an ungrammatical sentence with n errors from the correct parse is LF_0 - LF_n = 100 - LF_n, where a higher value indicates a larger divergence from the correct dependency parse.
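A self-contained sketch of this metric, assuming each parse is given as a set of (head position, dependent position, relation) triples over its token list, with the root's head being any out-of-sentence value; the released implementation linked above may differ in its details:

    def lev(a, b):
        """Character-level Levenshtein distance between two tokens."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,            # delete ca
                               cur[j - 1] + 1,         # insert cb
                               prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def align(src, tgt):
        """Lowest-cost token alignment: insertions/deletions cost 1, substitutions
        cost proportionally to the Levenshtein distance (partial credit for
        inflections). Returns (i, j) pairs of aligned positions; None marks a gap."""
        def sub(a, b):
            return lev(a, b) / max(len(a), len(b), 1)
        n, m = len(src), len(tgt)
        cost = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            cost[i][0] = float(i)
        for j in range(1, m + 1):
            cost[0][j] = float(j)
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost[i][j] = min(cost[i - 1][j] + 1,
                                 cost[i][j - 1] + 1,
                                 cost[i - 1][j - 1] + sub(src[i - 1], tgt[j - 1]))
        pairs, i, j = [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub(src[i - 1], tgt[j - 1]):
                pairs.append((i - 1, j - 1)); i -= 1; j -= 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                pairs.append((i - 1, None)); i -= 1
            else:
                pairs.append((None, j - 1)); j -= 1
        return pairs[::-1]

    def labeled_f_score(gold_tokens, gold_deps, cand_tokens, cand_deps):
        """F1 (x100) over <head index, dependent index, relation> triples,
        with indices shared across the two sentences via the alignment."""
        gold_idx, cand_idx = {}, {}
        for shared, (gi, ci) in enumerate(align(gold_tokens, cand_tokens)):
            if gi is not None:
                gold_idx[gi] = shared
            if ci is not None:
                cand_idx[ci] = shared

        def remap(deps, idx):
            # Heads outside the sentence (the artificial root) map to a shared sentinel.
            return {(idx.get(h, "ROOT"), idx[d], rel) for h, d, rel in deps}

        gold, cand = remap(gold_deps, gold_idx), remap(cand_deps, cand_idx)
        tp = len(gold & cand)
        prec = tp / len(cand) if cand else 0.0
        rec = tp / len(gold) if gold else 0.0
        return 100.0 * 2 * prec * rec / (prec + rec) if prec + rec else 0.0

With this indexing, an inflectional substitution keeps its index and can still match its relations, while inserted or deleted tokens only cost the relations that touch them.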

4 Analysis

Our analysis is broken down by different characteristics of the ungrammatical sentences, quantifying their effect on LF. Specifically, we examine increasing numbers of errors in a sentence, the distance between errors, individual error types, and adding more errors to an already ungrammatical sentence.

4.1 Number of errors

The first step of our analysis is to verify our hypothesis that the absolute LF decrease (LF_0 - LF_n) increases as the number of errors in a sentence increases from n = 1 to n = 4. Pearson's correlation reveals a weak correlation between the LF decrease and the number of errors (Figure 2). Since this analysis considers sentences generated with only a subset of the errors from the original sentence, we verify the validity of this data by comparing the LF decrease of the generated sentences to the LF decrease of the sentences that originally had exactly n errors.

Figure 2: Mean absolute decrease in LF by number of errors in a sentence (100 - LF_n), for all generated sentences with n errors (r = 0.31) and for sentences that originally had exactly n errors (r = 0.22).

Figure 3: The distribution of error types in sentences with one error. The distribution is virtually identical (±2 percentage points) in sentences with 2-4 errors.

Since the LF decreases of the generated and original sentences are very similar, we presume that the generated sentences exhibit similar properties to the original sentences with the same number of errors. We further compared the distribution of sentences with each error type as the number of errors per sentence changes, and find that the distribution is fairly constant. The distribution of sentences with one error is shown in Figure 3. We next investigate whether the LF decrease is due to an interaction between errors or whether there is an additive effect.
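The correlations reported in this subsection are plain Pearson coefficients over per-sentence values; a small sketch of that computation against a hypothetical results table (column names are illustrative):

    import pandas as pd
    from scipy.stats import pearsonr

    # One row per sentence: its number of errors, its LF, and whether it was
    # generated from a sentence with more errors or originally had exactly
    # that many errors. Schema is assumed for illustration only.
    df = pd.read_csv("lf_scores.csv")                  # hypothetical file
    df["lf_decrease"] = 100 - df["lf"]

    r_all, _ = pearsonr(df["num_errors"], df["lf_decrease"])
    orig = df[~df["generated"]]
    r_orig, _ = pearsonr(orig["num_errors"], orig["lf_decrease"])
    print(f"all generated: r = {r_all:.2f}; originally exact: r = {r_orig:.2f}")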

Figure 4: Distance between two errors (in tokens) and the decrease in LF (r = -0.07).

4.2 Distance between errors

To determine whether the distance between errors is a factor in dependency performance, we took the sentences with only two errors and counted the number of tokens between the errors (Figure 4). Surprisingly, there is no relationship between the distance separating the errors and the dependency parse accuracy. We hypothesized that errors near each other would either interact and cause the parser to misinterpret more of the sentence, or conversely that they would disrupt the interpretation of only one clause and not greatly affect LF. However, neither of these effects was evident, given the very weak negative correlation. For sentences with more than two errors, we calculated the mean, minimum, and maximum distances between all errors in each sentence, and found a weak to very weak negative correlation between those measures and the LF decrease (-0.15 ≤ r ≤ -0.04).

4.3 Error type and operation

Next, we considered the specific error types and their operation: whether they were missing, unnecessary, or needed replacement. To isolate the impact of individual error types on LF, we calculated the mean LF decrease (100 - LF_1) by error type and operation over sentences with only one error (Figure 5). The mean values by error type are shown in Figure 6, column 1. Two trends are immediately visible: there is a clear difference between error types and, except for determiner errors, missing and unnecessary errors have a greater impact on the dependency parse than replacements.

Figure 5: The mean decrease in LF (100 - LF_1) for sentences with one error, by error type and operation (missing, unnecessary, replace).

Nouns and prepositions needing replacement have the lowest impact on LF, with 100 - LF_1 < 4. This could be because the part-of-speech tag for these substitutions does not often change (or only changes NN to NNS in the case of nouns), which would therefore not greatly affect a dependency parser's interpretation of the sentence, but this hypothesis needs to be verified in future work. A prepositional phrase or noun phrase would likely still be found headed by that word. Verb replacements exhibit more than twice the decrease in LF of noun and preposition replacements. Unlike noun and preposition replacements, replacing a verb tends to elicit greater structural changes, since some verbs can be interpreted as nouns or past participles, gerunds can be interpreted as modifying nouns, and so on (Lee and Seneff, 2008).

Determiner errors also have a low impact on LF, and there is practically no difference by the operation of the correction. This can be explained by the fact that determiners occur at the beginning of noun phrases, so deleting, inserting, or replacing a determiner would typically affect one child of the noun phrase and not the overall structure. However, mechanical errors and missing or unnecessary prepositions have a great impact on LF, with LF_1 at least 10% lower than LF_0. Inserting or deleting these types of words can greatly alter the structure of a sentence. For example, inserting a missing preposition would introduce a new prepositional phrase, and the subsequent noun phrase would attach to that phrase. Regarding Mec errors, inserting commas can drastically change the structure by breaking apart constituents, and removing commas can cause constituents to become siblings.

4.4 Adding errors to ungrammatical sentences

We have seen the mean LF decrease in sentences with one error, over the different error types. Next, we examine what happens to the dependency parse when an error is added to a sentence that is already ungrammatical.

We calculated the LF of sentences with one error (LF_1), introduced a second error into each sentence, and calculated the decrease in LF (LF_1 - LF_2). We controlled for the types of errors both present in the original sentence and introduced to the sentence, not differentiating the operation of the error for ease of interpretation. The mean differences by error type are shown in Figure 6. Each column indicates what type of error was present in the original sentence (the first error), with None indicating that the original sentence was grammatically correct and had no errors. Each row represents the type of error that was added to the sentence (the second error). Note that this does not indicate the left-to-right order of the errors. This analysis considers all combinations of errors: for example, given a sentence with two determiner errors A and B, we calculate the LF decrease after inserting error A into the sentence that already had error B, and vice versa.

    Inserted    Error in the original sentence
    error       None   Det    Mec    Noun   Prep   Verb   Wform
    Det          5.3    4.2    3.5    4.1    4.2    3.7    4.2
    Mec          8.9    6.7    4.6    6.8    5.6    6.2    6.0
    Noun         3.6    2.8    2.3    2.7    1.9    2.7    2.4
    Prep         5.8    4.2    3.8    4.3    3.9    3.0    3.4
    Verb         8.0    6.4    5.9    6.6    5.2    5.1    7.3
    Wform        6.8    4.8    5.3    5.1    4.5    4.7    6.6

Figure 6: Mean decrease in LF (LF_1 - LF_2) when introducing an error (row) into a sentence that already has an error of the type in the column. The None column contains the mean decrease when introducing a new error to a grammatical sentence (100 - LF_1).

Generally, with respect to error type, the relative magnitude of the change caused by adding a second error (column 2) is similar to that of adding the same type of error to a sentence with no errors (column 1). However, introducing a second error always has a lower mean LF decrease than introducing the first error into a sentence, suggesting that each added error is less disruptive to the dependency parse as the number of errors increases. To verify this trend, we added an error to sentences with 0 to 3 errors and calculated the LF change (LF_n - LF_{n+1}) each time a new error was introduced. Figure 7 shows the mean LF decrease after adding an error of a given type to a sentence that already had 0, 1, 2, or 3 errors.

Figure 7: Mean decrease in LF (LF_n - LF_{n+1}) when an error of a given type is added to a sentence that already has n errors.

Based on Figure 7, it appears that the LF decrease may converge for some error types, specifically determiner, preposition, verb, and noun errors. However, the LF decreases at a fairly constant rate for mechanical and word form errors, suggesting that ungrammatical sentences become increasingly uninterpretable as these types of errors are introduced. Further research is needed to make definitive claims about what happens as a sentence becomes increasingly errorful.
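The per-type numbers behind Figures 5 through 7 are plain conditional means over the generated sentences; a sketch of that aggregation, with an illustrative (hypothetical) schema:

    import pandas as pd

    # One row per (sentence, added error): the added error's type and operation,
    # the type of error already in the sentence ("None" if it was grammatical),
    # and LF before/after the addition. Column names are illustrative.
    df = pd.read_csv("added_errors.csv")               # hypothetical file
    df["lf_decrease"] = df["lf_before"] - df["lf_after"]

    # Figure 5: mean 100 - LF_1 by error type and operation (first error only).
    first = df[df["existing_error_type"] == "None"]
    fig5 = first.pivot_table(index="error_type", columns="operation",
                             values="lf_decrease", aggfunc="mean")

    # Figure 6: mean LF_1 - LF_2 for adding an error (rows) to a sentence that
    # already contains an error of the column's type.
    fig6 = df.pivot_table(index="error_type", columns="existing_error_type",
                          values="lf_decrease", aggfunc="mean")
    print(fig5.round(1))
    print(fig6.round(1))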

5 Qualifying the LF decrease

In the previous analysis, the LF decreases range from 1 to around 10, suggesting that approximately 1% to 10% of the dependency parse was changed due to errors. However, this begs the question of what an LF decrease of 1, 5, or 10 actually means for a pair of sentences. Is the ungrammatical sentence garbled after the LF decrease reaches a certain level? How different are the dependencies found in a sentence with an LF decrease of 1 versus one of 10? To illustrate these differences, we selected an example sentence and calculated the LF decrease and dependency graph as more errors were added (Table 2, Figure 8, and Figure 9). Notice that the largest decreases in LF occur after the first and second errors are introduced (10 and 13 points, respectively). The introduction of these errors results in structural changes to the graph, as does the fourth error, which results in a lesser LF decrease of 5. In contrast, the third error, a missing determiner, causes a lesser decrease of about 2, since the graph structure is not affected by this insertion.

Considering the LF decrease as the percent of a sentence that is changed, for a sentence with 26 tokens (the mean length of sentences in our dataset), an LF decrease of 5 corresponds to a change in 1.3 of the tokens, while a decrease of 10 corresponds to a change in 2.6 tokens. Lower LF decreases (< 5 or so) generally indicate the insertion or deletion of a token that does not affect the graph structure, or a change in the label of a dependency relation. On the other hand, greater decreases likely reflect a structural change in the dependency graph of the ungrammatical sentence, which affects more relations than those containing the ungrammatical tokens.

6 Related work

There is a modest body of work focused on improving parser performance on ungrammatical sentences. Unlike our experiments, most previous work has used small (around 1,000 sentences) or artificially generated corpora of ungrammatical/grammatical sentence pairs. The most closely related works compared the structure of parses of ungrammatical and corrected sentences: with naturally occurring errors, Foster (2004) and Kaljahi et al. (2015) evaluate parses of ungrammatical text based on the constituent parse, and Geertzen et al. (2013) evaluate performance over dependencies. Cahill (2015) examines parser performance using artificially generated errors, and Foster (2007) analyzes parses of both natural and artificial errors. In Wagner and Foster (2009), the authors compared the parse probabilities of naturally occurring and artificially generated ungrammatical sentences to the probabilities of the corrected sentences. They found that natural ungrammatical sentences had a lower reduction in parse probability than artificial sentences, suggesting that artificial errors are not interchangeable with spontaneous errors. This analysis suggests the importance of using naturally occurring errors, which is why we chose to generate sentences from spontaneous NUCLE errors.

Several studies have attempted to improve the accuracy of parsing ungrammatical text. Some approaches include self-training (Foster et al., 2011; Cahill et al., 2014), retraining (Foster et al., 2008), and transforming the input and training text to be more similar (Foster, 2010). Other work with ungrammatical learner text includes Caines and Buttery (2014), which identifies the need to improve the parsing of spoken learner English, and Tetreault et al. (2010), which analyzes the accuracy of prepositional-phrase attachment in the presence of preposition errors.

7 Conclusion and future work

The performance of NLP tools over ungrammatical text is little understood. Given the expense of annotating a grammatical-error corpus, previous studies have used either small annotated corpora or generated artificial grammatical errors in clean text. This study represents the first large-scale analysis of the effect of grammatical errors on an NLP task.

We have used a large, annotated corpus of grammatical errors to generate more than 44,000 sentences with up to four errors in each sentence. The ungrammatical sentences contain an increasing number of naturally occurring errors, facilitating a comparison of parser performance as more errors are introduced to a sentence. This is a first step toward the larger goal of providing a confidence score for parser accuracy based on an estimate of how ungrammatical a text may be. While many of our findings may seem obvious, they have previously not been quantified on a large corpus of naturally occurring grammatical errors. In the future, these results should be verified over a selection of manually corrected dependency parses.

Future work includes predicting the LF decrease based on an estimate of the number and types of errors in a sentence. As yet, we have only measured change by the LF decrease over all dependency relations. The decrease can also be measured over individual dependency relations to get a clearer idea of which relations are affected by specific error types. We will also investigate the effect of grammatical errors on other NLP tasks.

We chose the NUCLE corpus because it is the largest annotated corpus of learner English (1.2 million tokens). However, this analysis relies on the idiosyncrasies of this particular corpus, such as its typical sentence length and complexity. The essays were written by students at the National University of Singapore, who do not have a wide variety of native languages. The types and frequency of errors differ depending on the native language of the student (Rozovskaya and Roth, 2010), which may bias the analysis herein. The available corpora that contain a broader representation of native languages are much smaller than the NUCLE corpus: the Cambridge Learner Corpus First Certificate in English has 420 thousand tokens (Yannakoudakis et al., 2011), and the corpus annotated by Rozovskaya and Roth (2010) contains only 63 thousand words.

One limitation of our method for generating ungrammatical sentences is that relatively few sentences are the source of the ungrammatical sentences with four errors. Even though we drew sentences from a large corpus, only 570 sentences had at least four errors (of the types we were considering), compared to 14,500 sentences with at least one error. Future work examining the effect of multiple errors would need to consider a more diverse set of sentences with more instances of at least four errors, since there could be peculiarities or noise in the original annotations, which would be amplified in the generated sentences.

Acknowledgments

We would like to thank Martin Chodorow and Jennifer Foster for their valuable insight while developing this research, and Beata Beigman Klebanov, Brian Riordan, Su-Youn Yoon, and the BEA reviewers for their helpful feedback. This material is based upon work partially supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1232825.

References

Aoife Cahill, Binod Gyawali, and James Bruno. 2014. Self-training for parsing learner text. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, pages 66-73, Dublin, Ireland, August. Dublin City University.

Aoife Cahill. 2015. Parsing learner text: To shoehorn or not to shoehorn. In Proceedings of the 9th Linguistic Annotation Workshop, pages 144-147, Denver, Colorado, USA, June. Association for Computational Linguistics.

Andrew Caines and Paula Buttery. 2014. The effect of disfluencies and learner errors on parsing of spoken learner language. In Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages, pages 74-81, Dublin, Ireland, August. Dublin City University.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22-31, Atlanta, Georgia, June. Association for Computational Linguistics.

Marie-Catherine de Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of the Language Resources and Evaluation Conference (LREC), volume 14, pages 4585-4592.

Mariano Felice and Zheng Yuan. 2014. Generating artificial errors for grammatical error correction. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 116-126, Gothenburg, Sweden, April. Association for Computational Linguistics.

    Num.    Inserted      LF
    errors  error type    decrease   Sentence
    0       n/a           n/a        One of factors that determines and shapes technological innovation most is country's economic status.
    1       Verb          10.0       One of factors that determined and shapes technological innovation most is country's economic status.
    2       Mec           13.1       One of factors that determined and shapes technological innovation most is country economic status.
    3       Det            1.9       One of factors that determined and shapes technological innovation most is country economic status.
    4       Verb           5.0       One of factors that determined and shaped technological innovation most is country economic status.

Table 2: An example of a sentence with 4 errors added and the LF decrease (LF_{n-1} - LF_n) after adding each subsequent error to the previous sentence.

Figure 8: Dependency graph of the correct sentence in Table 2.

Figure 9: The dependency graphs of the sentence in Table 2 and Figure 8 after each error is introduced.

Jennifer Foster, Joachim Wagner, and Josef van Genabith. 2008. Adapting a WSJ-trained parser to grammatically noisy text. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 221-224. Association for Computational Linguistics.

Jennifer Foster, Özlem Çetinoğlu, Joachim Wagner, and Josef van Genabith. 2011. Comparing the use of edited and unedited text in parser self-training. In Proceedings of the 12th International Conference on Parsing Technologies, pages 215-219, Dublin, Ireland, October. Association for Computational Linguistics.

Jennifer Foster. 2004. Parsing ungrammatical input: An evaluation procedure. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Jennifer Foster. 2007. Treebanks gone bad. International Journal of Document Analysis and Recognition (IJDAR), 10(3-4):129-145.

Jennifer Foster. 2010. "cba to check the spelling": Investigating parser performance on discussion forum posts. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 381-384, Los Angeles, California, June. Association for Computational Linguistics.

Jeroen Geertzen, Theodora Alexopoulou, and Anna Korhonen. 2013. Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT). In Proceedings of the 31st Second Language Research Forum. Somerville, MA: Cascadilla Proceedings Project.

Rasoul Kaljahi, Jennifer Foster, Johann Roturier, Corentin Ribeyre, Teresa Lynn, and Joseph Le Roux. 2015. Foreebank: Syntactic analysis of customer support forums. In Conference on Empirical Methods in Natural Language Processing (EMNLP).

John Lee and Stephanie Seneff. 2008. Correcting misuse of verb forms. In Proceedings of ACL-08: HLT, pages 174-182, Columbus, Ohio, June. Association for Computational Linguistics.

Diane Nicholls. 2004. The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT. In Proceedings of the Corpus Linguistics 2003 Conference, pages 572-581.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2004. Memory-based dependency parsing. In HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004), pages 49-56, Boston, Massachusetts, USA, May. Association for Computational Linguistics.

Ines Rehbein and Josef van Genabith. 2007. Evaluating evaluation measures. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA), pages 372-379.

Brian Roark, Mary Harper, Eugene Charniak, Bonnie Dorr, Mark Johnson, Jeremy G. Kahn, Yang Liu, Mari Ostendorf, John Hale, Anna Krasnyanskaya, et al. 2006. SParseval: Evaluation metrics for parsing speech. In Proceedings of the Language Resources and Evaluation Conference (LREC).

Alla Rozovskaya and Dan Roth. 2010. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, pages 28-36. Association for Computational Linguistics.

Joel Tetreault, Jennifer Foster, and Martin Chodorow. 2010. Using parse features for preposition selection and error detection. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pages 353-358, Uppsala, Sweden, July. Association for Computational Linguistics.

Reut Tsarfaty, Joakim Nivre, and Evelina Andersson. 2011. Evaluating dependency parsing: Robust and heuristics-free cross-annotation evaluation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 385-396, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Joachim Wagner and Jennifer Foster. 2009. The effect of correcting grammatical errors on parse probabilities. In Proceedings of the 11th International Conference on Parsing Technologies, pages 176-179. Association for Computational Linguistics.

Helen Yannakoudakis, Ted Briscoe, and Ben Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 180-189. Association for Computational Linguistics.

Yue Zhang and Stephen Clark. 2011. Syntactic processing using the generalized perceptron and beam search. Computational Linguistics, 37(1):105-151.