III Related Research. IV Z-corpora - Description and Annotation Criteria

The examples below represent the main groups of impersonal sentences in Bulgarian:

a) sentences with an impersonal verb (Ex. 6 a). Verbs from this category cannot be part of finite constructs; they are always impersonal;
b) sentences with a verb that can be used both as finite and as impersonal (Ex. 6 b, c);
c) sentences with a copula and a predicative word (Ex. 6 d).

III Related Research

The distribution of zero pronouns has been investigated in several other pro-drop languages: Spanish [12], Portuguese [9] and Romanian [6]. An algorithm for zero pronoun (ZP) resolution in Spanish can be found in [10]. The authors apply the idea of constraints and preferences; the same idea lies at the root of Mitkov's knowledge-poor pronoun resolution approach [7]. Detection of impersonal clauses, which can improve and complement the algorithm for Spanish, is discussed in [12].

Although anaphora resolution has attracted the attention of many researchers and many approaches have been developed [7], we found only one work dealing with this subject for Bulgarian [16]. That paper presents an anaphora resolver that adapts Mitkov's knowledge-poor pronoun resolution approach to Bulgarian. It resolves only third-person personal pronouns; the problem of zero pronoun resolution in Bulgarian is not studied there.

Our first study on this problem is presented in [3, 4]. An algorithm for zero pronoun resolution based on constraints and preferences is discussed there. The algorithm takes into account some features of Bulgarian; for instance, a noun phrase (NP) can be lexically realized by an adjective with a definite article. More rules for the identification of impersonal clauses were added in [4]. One of the goals of the present study is to improve the zero pronoun resolution algorithm with new heuristic criteria typical of Bulgarian.

IV Z-corpora - Description and Annotation Criteria

Annotated corpora play an important role in most natural language processing applications.
Our immediate use of such corpora is to observe patterns and deduce rules for a rule-based anaphora resolver. Later, the same corpora will be used for machine learning methods.

We had access to the existing annotated corpora described in [14], created in the Linguistic Modeling Department at the Bulgarian Academy of Sciences (BAS). These language resources and tools are presented in [17]. Although the existing corpora are a valuable resource and every word is marked up with detailed linguistic information, we decided to create our own corpora specifically for the purposes of zero pronominal anaphora. The main features that make the existing corpora unsuitable for our goals are the following:

- Co-referential relations are marked up only within a single sentence. Inter-sentential anaphora is not a rare phenomenon: 28% of the ZPs with a lexical antecedent in our annotated corpora are inter-sentential.
- Impersonal verbs are marked up, but impersonal clauses are not. Impersonal constructs can be expressed by finite verbs in Bulgarian. In the existing corpora the verb in such a clause is marked up as finite, although the clause is in fact impersonal.
- Verb phrases with a modal verb are treated as consisting of two verbs, and the second verb is marked as having an omitted subject. In our opinion, this artificially inflates the number of zero pronouns. Verb phrases with a modal verb are a specific case of the compound verb predicate. Such a predicate expresses a unified process of the action⁵ [11], and the subject of the first verb always coincides with the subject of the second.
- In the existing corpora, clauses with verb zero anaphora are also marked as having an omitted subject. Our goal is to recover the missing pronoun only when the verb is present (not omitted). A specific case of verb zero anaphora is the omission of the copula. When the compound noun predicate consists of a copula plus a past participle, the past participle is used as an adjective [1].
If there is more than one adjective, the copula is usually used only once, before the first one. We do not consider the remaining participles as verb phrases with ZPs (as our colleagues did), but as adjectives, and we do not mark them up as having ZPs.

Our final task is to create an application that will recover the missing pronouns in unrestricted texts of different genres. In line with this goal, the corpora consist of full and partial texts retrieved from the web and from digitized books, encompassing several genres: legal, literary, news and encyclopedic. The Bulgarian Constitution and the beginning of the Labour Code represent the legal genre. The literary genre contains texts only by Bulgarian authors: Dimitar Dimov, Dimitar Talev and Zlatko Enev, an author of children's books. The texts in the news genre were extracted from articles in web newspapers at the end of 2011. Texts with direct speech are avoided. The encyclopedic genre includes texts from computer, historical and medical literature taken from the web portal BooksBg.org. The corpora contain 1029 zero pronouns, more or less evenly distributed across these genres. Direct speech is not annotated.

Annotation criteria are an important issue in corpus annotation. Different annotation schemes for annotating anaphora are discussed in [2]. Our annotation scheme is similar to those in [6, 9, 12], with some differences and additions. The authors of those papers classify clauses as main, subordinate, coordinate and juxtaposed. We classify clauses only as main and subordinate, but we also include the type of the sentence as an annotation criterion. The type is one of the following: simple, compound, complex, complex-compound [1, 11]. Before every ZP we record the following information: the omitted pronoun, its antecedent (the head noun of the NP), its dependency head (the clause verb on which the ZP depends), the relation (anaphora/cataphora), the type of the sentence, and the type of the clause.
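The annotation fields listed above could be held in a small record type. The following is a minimal sketch under stated assumptions, not the authors' annotation format: the class name, the field names and the example values are hypothetical, and representing an exophoric antecedent as `None` is a choice made here for illustration. Only the set of fields and the allowed sentence types, clause types and relations come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values taken from the annotation criteria described in the text.
SENTENCE_TYPES = {"simple", "compound", "complex", "complex-compound"}
CLAUSE_TYPES = {"main", "subordinate"}
RELATIONS = {"anaphora", "cataphora"}


@dataclass(frozen=True)
class ZeroPronounAnnotation:
    """One annotated zero pronoun (hypothetical field names)."""
    omitted_pronoun: str          # the pronoun that would fill the gap
    antecedent: Optional[str]     # head noun of the antecedent NP; None if exophoric (assumption)
    dependency_head: str          # the clause verb on which the ZP depends
    relation: str                 # "anaphora" or "cataphora"
    sentence_type: str            # one of SENTENCE_TYPES
    clause_type: str              # one of CLAUSE_TYPES

    def __post_init__(self):
        # Reject values outside the closed sets named in the annotation scheme.
        if self.relation not in RELATIONS:
            raise ValueError(f"unknown relation: {self.relation!r}")
        if self.sentence_type not in SENTENCE_TYPES:
            raise ValueError(f"unknown sentence type: {self.sentence_type!r}")
        if self.clause_type not in CLAUSE_TYPES:
            raise ValueError(f"unknown clause type: {self.clause_type!r}")


# Invented example values, given in English for readability.
example = ZeroPronounAnnotation(
    omitted_pronoun="he",
    antecedent="author",
    dependency_head="writes",
    relation="anaphora",
    sentence_type="complex",
    clause_type="subordinate",
)
```

Validating against closed sets at construction time keeps a misspelled sentence or clause type from silently entering later counts.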
The antecedent to which the

⁵ Translated into English by the author.

information technologies and control 1 2012 31

Table 2. ZP and impersonal clauses as a percentage of the total number of clauses

Corpus         Clauses with ZP (%)   Impersonal clauses (%)
Legal          26.45                 0.32
Literary       26.92                 2.73
News           27.27                 7.14
Encyclopedic   27.40                 8.85

Table 3. Distribution of anaphoric and cataphoric clauses

Corpus         Clauses with ZP   Anaphoric   Cataphoric
Legal          251               251         0
Literary       266               256         10
News           252               247         5
Encyclopedic   260               257         3
Total          1029              1011        18

Table 4. Distribution of lexical and exophoric antecedents

Corpus         Lexical ant.   Percentage   Exophoric ant.   Percentage
Legal          251            100%         0                0%
Literary       257            96.62%       9                3.38%
News           145            57.54%       107              42.46%
Encyclopedic   157            60.38%       103              39.62%

Table 5. Distribution of ZPs by type of the sentence

Corpus         Simple   Compound   Complex               Complex-compound      Total
                                   Main   Subordinated   Main   Subordinated
Legal          49       90         2      29             9      72             251
Literary       7        45         13     60             44     97             266
News           32       33         32     97             13     45             252
Encyclopedic   9        49         24     63             19     96             260
Total          98       217        72     251            85     310            1029

Table 6. ZPs in main and subordinated clauses

Corpus         Main   Subordinated   Proportion in percentage (main / subordinated)
Legal          150    101            59.76 / 40.24
Literary       109    157            40.98 / 59.02
News           110    142            43.65 / 56.35
Encyclopedic   101    159            38.85 / 61.15
Total          470    559            Avg. 45.81 / 54.19
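The relation between Table 5 and Table 6 can be checked with a few lines of arithmetic. The sketch below assumes that ZPs in simple and compound sentences are counted as occurring in main clauses; this is an inference from the figures, not something the text states, but it reproduces the Table 6 counts and percentages. The function names are illustrative.

```python
# Table 5 rows as printed:
# (simple, compound, complex main, complex subordinated,
#  complex-compound main, complex-compound subordinated)
TABLE_5 = {
    "Legal":        (49, 90, 2, 29, 9, 72),
    "Literary":     (7, 45, 13, 60, 44, 97),
    "News":         (32, 33, 32, 97, 13, 45),
    "Encyclopedic": (9, 49, 24, 63, 19, 96),
}


def main_and_subordinated(row):
    """Collapse one Table 5 row into (main, subordinated) ZP counts."""
    simple, compound, cx_main, cx_sub, cc_main, cc_sub = row
    # Assumption: simple and compound sentences contain only main clauses.
    main = simple + compound + cx_main + cc_main
    subordinated = cx_sub + cc_sub
    return main, subordinated


def derive_table_6(table_5):
    """Return {corpus: (main, subordinated, main %, subordinated %)}."""
    result = {}
    for corpus, row in table_5.items():
        main, sub = main_and_subordinated(row)
        total = main + sub
        result[corpus] = (
            main,
            sub,
            round(100 * main / total, 2),
            round(100 * sub / total, 2),
        )
    return result


def average_main_share(table_6):
    """Unweighted mean of the per-corpus main-clause percentages."""
    shares = [row[2] for row in table_6.values()]
    return round(sum(shares) / len(shares), 2)
```

Under this reading, the "Avg." entry of Table 6 is the unweighted mean of the four per-corpus percentages rather than 470 / 1029. Summing the Table 5 columns over the four corpora gives 97, 217, 71, 249, 85 and 310, which differs from the printed Total row in three cells, while the row totals and the overall total of 1029 agree.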

phenomenon: only 18 cataphoric clauses against 1011 anaphoric ones. This is on average 1.74% of all clauses with ZPs, with a standard deviation of 1.58 across the genres, i.e. the distribution over the genres is non-uniform. Our observation shows that cataphoric clauses are part of the author's style. Nine of the ten cataphoric clauses in the literary genre belong to one of the three authors, Dimitar Dimov.

Another section of the data presents the distribution of lexical and exophoric antecedents (Table 4). Again, the genres diverge considerably. Exophoric antecedents are absent in the legal genre, make up only 3.38% in the literary genre, and reach 42.46% in the news. The analysis of the texts shows that in the news and in the encyclopedic texts the authors very often express their own opinion and address the readers without using personal pronouns. Definite-personal clauses account for 83.56% of the clauses with an exophoric antecedent. Another stylistic device in the encyclopedic and news genres is the use of indefinite-personal (10.96%) and generalized-personal (5.48%) clauses.

The next aspect of the study concerns the type of the sentence in which the zero pronouns occur. Table 5 gives detailed information about the kinds of sentences that include zero pronouns. A compound sentence consists of independent clauses, whereas complex and complex-compound sentences have an independent (main) clause and subordinate clause(s). A complex sentence has one main clause and at least one subordinate clause. Independent clauses are connected by coordinating conjunctions, subordinated clauses by subordinating conjunctions. In complex-compound sentences some of the clauses are connected as independent clauses, while others are connected as subordinated clauses [11].

Table 6 shows how many ZPs occur in main clauses and how many in subordinated clauses. Literary, news and encyclopedic texts have more ZPs in subordinated clauses, in contrast to the legal genre, where the proportion is reversed.
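The cataphora figures quoted above (1.74% overall, standard deviation 1.58) can be reproduced from Table 3. The sketch below assumes that the standard deviation is the sample standard deviation of the four per-genre cataphora percentages; the text does not say how it was computed, but this reading matches the reported value. The function names are illustrative.

```python
from statistics import stdev

# Table 3 as printed: corpus -> (clauses with ZP, cataphoric clauses)
TABLE_3 = {
    "Legal":        (251, 0),
    "Literary":     (266, 10),
    "News":         (252, 5),
    "Encyclopedic": (260, 3),
}


def cataphora_shares(table):
    """Percentage of cataphoric clauses among clauses with ZP, per corpus."""
    return {corpus: 100 * cat / total for corpus, (total, cat) in table.items()}


def overall_cataphora_share(table):
    """Percentage of cataphoric clauses over all corpora taken together."""
    total = sum(t for t, _ in table.values())
    cataphoric = sum(c for _, c in table.values())
    return 100 * cataphoric / total


def cataphora_spread(table):
    """Sample standard deviation of the per-corpus shares (assumed method)."""
    return stdev(list(cataphora_shares(table).values()))
```

The per-genre shares range from 0% (legal) to about 3.76% (literary), so the spread is almost as large as the overall share itself, which is what the text means by a non-uniform distribution.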
The authors who write literature, news and encyclopedic books use more narrative and descriptive sentences. On the one hand, these sentences are very often complex and complex-compound; on the other hand, they omit pronouns to avoid redundancy.

Table 7 covers the next aspect of the study: the distance between the anaphor and the antecedent. The table shows that this distance is longest in the legal genre. To be more precise, we calculated not only the average distance (as a number of sentences) but also the standard deviation. The legal genre has the highest standard deviation, 2.96. The literary genre has a standard deviation of 1.14, the news 0.70, and the encyclopedic genre 0.55. The tendency is the same when the distance is measured in clauses. We were also interested in the most frequently occurring value in the data, i.e. the mode. The results show that the antecedent is most often in the same sentence as the anaphor, in the previous clause. The usual position of the anaphor is next to the dependent verb. The distance to the verb increases when a conjunction, an adverb, a negative particle or a combination of them precedes the verb. The quantity with the greatest dispersion of values is the distance to the antecedent measured in words. Often this distance is 2, 6 or 7 words, but we have an example of a distance of 163 words in the literary genre.

The final aspect of this study is the syntactic position of the antecedent. Data from the corpora show that of the 809 anaphoric clauses with lexical antecedents, in 741 (91.59%) the antecedent is the subject of a previous clause, and in 68 (8.40%) it has another syntactic role: direct object - 28 (3.46%), indirect object - 21 (2.56%), uncoordinated attribute - 16 (1.98%) and adjunct phrase - 3 (0.37%).

VI. Qualitative Analysis

The parser is based on a bottom-up strategy and context-free grammars with extensions.
It is implemented in Java. The extension allows a meta-symbol, which can be attached to the right of any symbol in any production, to define the number of possible occurrences of that symbol. The possible meta-symbols are: ? - the symbol can occur zero or one time; * - the symbol can occur zero or more times; + - the symbol can occur one or more times. If no meta-symbol is attached to a symbol, it must occur exactly once. These extensions reduce the number of productions that constitute the grammar: we do not need a separate rule for each possible position of the words in the clause. For instance, the production in Ex. 7 means that the only obligatory constituent of the clause is a verb phrase (VP). Before and after this VP any number of phrases of any kind may occur, including none.

Ex. 7  Clause → Phrase* VP Phrase*

Because we do not have at our disposal a morphology

Table 7. Antecedent distance and dependent verb distance

Corpus         Distance to antecedent (avg.)      Distance to dependent verb (avg.)
               Sentences   Clauses   Words        Words
Legal          1.67        3.97      25.75        1.36
Literary       0.40        2.10      11.87        1.60
News           0.40        1.75      9.39         1.54
Encyclopedic   0.25        1.66      12.18        1.45
Average        0.67        2.36      14.75        1.49
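The effect of the ?, * and + meta-symbols on a single production such as Ex. 7 can be illustrated with a short sketch. This is not the authors' parser, which is written in Java and works bottom-up over a full grammar; the sketch only checks whether a flat sequence of category labels matches one right-hand side, and it does so by translating the right-hand side into a regular expression. The function names, and the convention that a meta-symbol is written directly after its symbol without a space, are assumptions made here.

```python
import re

META_SYMBOLS = {"?", "*", "+"}


def compile_rhs(rhs):
    """Translate a right-hand side such as 'Phrase* VP Phrase*' into a regex.

    Each symbol becomes a group matching the label followed by one space;
    an attached meta-symbol becomes the regex quantifier of that group.
    A symbol without a meta-symbol must occur exactly once.
    """
    parts = []
    for item in rhs.split():
        if len(item) > 1 and item[-1] in META_SYMBOLS:
            symbol, quantifier = item[:-1], item[-1]
        else:
            symbol, quantifier = item, ""
        parts.append("(?:" + re.escape(symbol) + " )" + quantifier)
    return re.compile("".join(parts))


def matches(rhs, labels):
    """Return True if the label sequence is licensed by the right-hand side."""
    text = "".join(label + " " for label in labels)
    return compile_rhs(rhs).fullmatch(text) is not None


# The production of Ex. 7: Clause -> Phrase* VP Phrase*
EX_7 = "Phrase* VP Phrase*"
```

With `EX_7`, a lone `["VP"]` is accepted, as is `["Phrase", "VP", "Phrase", "Phrase"]`, while a sequence without a VP or with two VPs is rejected. Without the meta-symbols, each of these word orders would need its own production, which is the reduction in grammar size the text describes.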