III Related Research. IV Z-corpora - Description and Annotation Criteria

The examples below represent the main groups of impersonal sentences in Bulgarian: a) sentences with an impersonal verb (Ex. 6 a); verbs of this category cannot be part of finite constructs, as they are always impersonal; b) sentences with a verb that can be used both as finite and as impersonal (Ex. 6 b, c); c) sentences with a copula and a predicative word (Ex. 6 d).

III Related Research

The distribution of zero pronouns has been investigated in several other pro-drop languages: Spanish [12], Portuguese [9] and Romanian [6]. An algorithm for ZP resolution in Spanish can be found in [10]. The authors apply the idea of constraints and preferences; the same idea lies at the root of Mitkov's knowledge-poor pronoun resolution approach [7]. Detection of impersonal clauses, which can improve and complement the algorithm for Spanish, is discussed in [12].

Although anaphora resolution has attracted the attention of many researchers and many approaches have been developed [7], we found only one work dealing with this subject for Bulgarian [16]. That paper presents an anaphora resolver which adapts Mitkov's knowledge-poor pronoun resolution approach to Bulgarian. It resolves only third-person personal pronouns; the problem of zero pronoun resolution in Bulgarian is not studied there.

Our first study on this problem is presented in [3,4]. An algorithm for zero pronoun resolution based on constraints and preferences is discussed there. The algorithm takes into account some features of Bulgarian; for instance, a noun phrase (NP) can be lexically realized by an adjective with a definite article. More rules for the identification of impersonal clauses have been added in [4]. One of the goals of the present study is to improve the zero pronoun resolution algorithm with new, typically Bulgarian heuristic criteria.

IV Z-corpora - Description and Annotation Criteria

Annotated corpora play an important role in most natural language processing applications.
Our immediate use of such corpora is to observe patterns and deduce rules for a rule-based anaphora resolver. Later the same corpora will be used for machine learning methods. We had access to the existing annotated corpora described in [14], created in the Linguistic Modeling Department at the Bulgarian Academy of Sciences (BAS). These language resources and tools are presented in [17]. Although the existing corpora are a valuable resource and every word is marked up with detailed linguistic information, we decided to create our own corpora specifically for the purposes of zero pronominal anaphora. The main features which make the existing corpora unsuitable for our goals are the following:
a) Co-referential relations are marked up only within a single sentence. Inter-sentential anaphora is not a rare phenomenon: 28% of the ZPs with a lexical antecedent in our annotated corpora are inter-sentential.
b) Impersonal verbs are marked up, but impersonal clauses are not. Impersonal constructs can be expressed by finite verbs in Bulgarian. In the existing corpora the verb in such a clause is marked up as finite, although the clause is in fact impersonal.
c) Verb phrases with a modal verb are considered as consisting of two verbs, and the second verb is marked as having an omitted subject. In our opinion, this increases the number of zero pronouns unnaturally. Verb phrases with a modal verb are a specific case of the compound verb predicate. Such a predicate expresses a "unified process of the action"⁵ [11], and the subject of the first verb unconditionally coincides with the subject of the second.
d) In the existing corpora clauses with verb zero anaphora are also marked as having an omitted subject. Our goal is to recover the missing pronoun only when the verb is present (not when it is omitted). A specific case of verb zero anaphora is the omission of the copula. When the compound noun predicate consists of a copula plus a past participle, the past participle is used as an adjective [1].
If there is more than one adjective, the copula is usually used only once, before the first one. We do not consider the remaining participles as verb phrases with ZPs (as our colleagues did), but as adjectives, and we do not mark them up as having ZPs.

Our final task is to create an application which will recover the missing pronouns in unrestricted texts of different genres. In line with this goal, the corpora consist of full and partial texts retrieved from the web and from digitized books, encompassing several genres: legal, literary, news and encyclopedic. The Bulgarian Constitution and the beginning of the Labour Code represent the legal genre. The literary genre contains texts only from Bulgarian authors: Dimitar Dimov, Dimitar Talev and Zlatko Enev, an author of children's books. The texts in the news genre have been extracted from articles in web newspapers at the end of 2011. Texts with direct speech are avoided. The encyclopedic genre includes texts from computer, historical and medical literature taken from the web portal BooksBg.org. The corpora contain 1029 zero pronouns, more or less evenly distributed across the mentioned genres. Direct speech is not annotated.

Annotation criteria are an important issue in corpus annotation. Different schemes for annotating anaphora are discussed in [2]. Our annotation scheme is similar to those in [6, 9, 12], with some differences and additions. The authors of the mentioned papers classify the clauses as main, subordinate, coordinate and juxtaposed. We classify clauses only as main and subordinate, but we also include the type of the sentence as an annotation criterion. The type is one of the following: simple, compound, complex, complex-compound [1,11]. Before every ZP we record the following information: the omitted pronoun, its antecedent (the head noun of the NP), its dependency head (the clause verb on which the ZP depends), the relation (anaphora/cataphora), the type of the sentence and the type of the clause.
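The annotation fields listed above can be pictured as a small record type. The following Python sketch is only an illustration under stated assumptions: the class and field names, the use of `None` for an exophoric antecedent, and the sample values (an invented pronoun, noun and verb) are all hypothetical and do not come from the authors' annotation tool or corpora.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    ANAPHORA = "anaphora"
    CATAPHORA = "cataphora"


class SentenceType(Enum):
    SIMPLE = "simple"
    COMPOUND = "compound"
    COMPLEX = "complex"
    COMPLEX_COMPOUND = "complex-compound"


class ClauseType(Enum):
    MAIN = "main"
    SUBORDINATE = "subordinate"


@dataclass(frozen=True)
class ZPAnnotation:
    """One zero-pronoun annotation; all names here are hypothetical."""
    omitted_pronoun: str          # the pronoun that would fill the gap
    antecedent: Optional[str]     # head noun of the antecedent NP; None = exophoric (assumption)
    dependency_head: str          # the clause verb on which the ZP depends
    relation: Relation            # anaphora or cataphora
    sentence_type: SentenceType   # simple, compound, complex, complex-compound
    clause_type: ClauseType       # main or subordinate


# Invented example values, used only to show how a record is filled in.
example = ZPAnnotation(
    omitted_pronoun="той",
    antecedent="Иван",
    dependency_head="седна",
    relation=Relation.ANAPHORA,
    sentence_type=SentenceType.SIMPLE,
    clause_type=ClauseType.MAIN,
)
```

Keeping the sentence type and the clause type as separate enumerated fields mirrors the two-level classification described in the text and makes it straightforward to tabulate ZPs by either criterion.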
The antecedent to which the

⁵ Translated into English by the author.

information technologies and control 1 2012 31

Table 2. ZP and impersonal clauses as a percentage of the total number of clauses

Corpus          Clauses with ZP (%)   Impersonal clauses (%)
Legal           26.45                 0.32
Literary        26.92                 2.73
News            27.27                 7.14
Encyclopedic    27.40                 8.85

Table 3. Distribution of anaphoric and cataphoric clauses

Corpus          Clauses with ZP   Anaphoric   Cataphoric
Legal           251               251         0
Literary        266               256         10
News            252               247         5
Encyclopedic    260               257         3
Total           1029              1011        18

Table 4. Distribution of lexical and exophoric antecedents

Corpus          Lexical ant.   Percentage   Exophoric ant.   Percentage
Legal           251            100%         0                0%
Literary        257            96.62%       9                3.38%
News            145            57.54%       107              42.46%
Encyclopedic    157            60.38%       103              39.62%

Table 5. Distribution of ZPs by type of the sentence

                                      Complex                Complex-compound
Corpus          Simple   Compound   Main   Subordinated   Main   Subordinated   Total
Legal           49       90         2      29             9      72             251
Literary        7        45         13     60             44     97             266
News            32       33         32     97             13     45             252
Encyclopedic    9        49         24     63             19     96             260
Total           98       217        72     251            85     310            1029

Table 6. ZPs in main and subordinated clauses

Corpus          Main   Subordinated   Proportion in percentage
Legal           150    101            59.76 / 40.24
Literary        109    157            40.98 / 59.02
News            110    142            43.65 / 56.35
Encyclopedic    101    159            38.85 / 61.15
Total           470    559            Avg. 45.81 / 54.19
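As a cross-check on Table 3, the share of cataphoric clauses per genre and its spread can be recomputed from the counts. The Python sketch below is not the authors' procedure; it assumes that the share is taken relative to all clauses with a ZP in each genre and that the reported standard deviation of 1.58 is a sample standard deviation over the four genre percentages. Under these assumptions the script reproduces the value 1.58, and the pooled share 18/1029 comes out at about 1.75%.

```python
from statistics import stdev

# Counts copied from Table 3: (clauses with ZP, cataphoric clauses).
TABLE_3 = {
    "Legal": (251, 0),
    "Literary": (266, 10),
    "News": (252, 5),
    "Encyclopedic": (260, 3),
}

# Percentage of cataphoric clauses in each genre.
cataphora_share = {
    genre: 100.0 * cataphoric / total
    for genre, (total, cataphoric) in TABLE_3.items()
}

# Pooled share over all genres together.
all_clauses = sum(total for total, _ in TABLE_3.values())
all_cataphoric = sum(cataphoric for _, cataphoric in TABLE_3.values())
pooled_share = 100.0 * all_cataphoric / all_clauses

# Sample standard deviation of the four per-genre percentages (assumption).
spread = stdev(cataphora_share.values())

for genre, share in cataphora_share.items():
    print(f"{genre:<13} {share:5.2f}%")
print(f"pooled share  {pooled_share:5.2f}%")
print(f"std deviation {spread:5.2f}")
```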

phenomenon - only 18 cataphoric clauses against 1011 anaphoric ones. On average this is 1.74% of all anaphoric relations, with a standard deviation of 1.58, i.e. the distribution across the genres is non-uniform. Our observation shows that cataphoric clauses are part of the author's style: nine out of the ten cataphoric clauses in the literary genre belong to one of the three authors, Dimitar Dimov.

Another section of the data presents the distribution of the lexical and exophoric antecedents (Table 4). Again, the genres diverge considerably. Exophoric antecedents are absent in the legal genre and account for only 3.38% in the literary genre, but reach 42.46% in the news and 39.62% in the encyclopedic texts. The analysis of the texts shows that in the news and in the encyclopedic texts the authors very often express their own opinion and address the readers without using personal pronouns. Definite-personal clauses make up 83.56% of the clauses with an exophoric antecedent. Another technique used in the encyclopedic and news genres is the use of indefinite-personal (10.96%) and generalized-personal clauses (5.48%).

The next aspect of the study concerns the type of the sentence in which the zero pronouns occur. Table 5 gives detailed information about the kinds of sentences which include zero pronouns. A compound sentence consists of independent clauses, whereas complex and complex-compound sentences have independent (main) clauses and subordinate clause(s). A complex sentence has one main clause and at least one subordinate clause. Independent clauses are connected by coordinating conjunctions, subordinated clauses by subordinating conjunctions. In complex-compound sentences some of the clauses are connected as independent clauses, while others are connected as subordinated clauses [11]. Table 6 gives a clear picture of how many ZPs occur in main and how many in subordinated clauses. Literary, news and encyclopedic texts have more ZPs in subordinated clauses, in contrast to the legal genre, where the proportion is reversed.
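The main/subordinated counts in Table 6 can be derived from the per-genre rows of Table 5. The Python sketch below assumes that ZPs in simple and compound sentences are counted as occurring in main clauses; this is an inference that happens to reproduce the Table 6 counts and percentages, not a rule stated by the authors. The dictionary keys are hypothetical names chosen for the illustration.

```python
# Per-genre rows copied from Table 5 (the "Total" row is not used).
TABLE_5 = {
    "Legal":        {"simple": 49, "compound": 90, "complex_main": 2,  "complex_sub": 29, "cc_main": 9,  "cc_sub": 72},
    "Literary":     {"simple": 7,  "compound": 45, "complex_main": 13, "complex_sub": 60, "cc_main": 44, "cc_sub": 97},
    "News":         {"simple": 32, "compound": 33, "complex_main": 32, "complex_sub": 97, "cc_main": 13, "cc_sub": 45},
    "Encyclopedic": {"simple": 9,  "compound": 49, "complex_main": 24, "complex_sub": 63, "cc_main": 19, "cc_sub": 96},
}


def main_vs_subordinated(row):
    """Return (ZPs in main clauses, ZPs in subordinated clauses) for one genre."""
    # Assumption: simple and compound sentences contribute only main clauses.
    main = row["simple"] + row["compound"] + row["complex_main"] + row["cc_main"]
    subordinated = row["complex_sub"] + row["cc_sub"]
    return main, subordinated


table_6 = {genre: main_vs_subordinated(row) for genre, row in TABLE_5.items()}

for genre, (main, sub) in table_6.items():
    total = main + sub
    print(f"{genre:<13} {main:4d} {sub:4d}   "
          f"{100 * main / total:.2f} / {100 * sub / total:.2f}")
```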
Authors of literature, news and encyclopedic books use more narrative and descriptive sentences. On the one hand, these sentences are very often complex and complex-compound; on the other hand, pronouns are omitted in them to avoid redundancy.

Table 7 covers the next aspect of the study, the distance between the anaphor and the antecedent. The table shows that this distance is longest in the legal genre. To be more precise, we calculated not only the average distance (as a number of sentences) but also the standard deviation. The legal genre has the highest standard deviation, 2.96. The literary genre has a standard deviation of 1.14, the news 0.70, and the encyclopedic genre 0.55. The tendency is the same when the distance is measured in clauses. We were also interested in the most frequently occurring value in the data, i.e. the mode. The results show that the antecedent is most often in the same sentence as the anaphor and in the previous clause. The usual position of the anaphor is next to the dependent verb. The distance to the verb increases when a conjunction, an adverb, a negative particle or a combination of them precedes the verb. The quantity with the greatest dispersion of values is the distance to the antecedent measured in words. Often this distance is 2, 6 or 7 words, but we have an example of a distance of 163 words in the literary genre.

The final aspect of this study is the syntactic position of the antecedent. Data from the corpora show that out of 809 anaphoric clauses with lexical antecedents, in 741 (91.59%) the antecedents are subjects of some previous clause, and in 68 (8.41%) they have some other syntactic role: direct object - 28 (3.46%), indirect object - 21 (2.60%), uncoordinated attribute - 16 (1.98%) and adjunct phrase - 3 (0.37%).

VI. Qualitative Analysis

The parser is based on a bottom-up strategy and context-free grammars with extensions.
It is implemented in Java. The extension allows a meta-symbol to be attached to the right of any symbol in any production in order to define the number of possible occurrences of that symbol. The possible meta-symbols are:
? - the symbol can occur zero or one time;
* - the symbol can occur zero or more times;
+ - the symbol can occur one or more times.
If no meta-symbol is attached to a symbol, it must occur exactly once. These extensions reduce the number of productions which constitute the grammar: we do not need a separate rule for each possible position of the words in the clause. For instance, the production in Ex. 7 means that the only obligatory constituent of a clause is a verb phrase (VP); before and after this VP any number of phrases of any kind may occur, including none.

Ex. 7 Clause → Phrase* VP Phrase*

Because we do not have at our disposal a morphology

Table 7. Antecedent distance and dependent verb distance

                Distance to antecedent (avg.)        Distance to dependent verb (avg.)
Corpus          Sentences   Clauses   Words          Words
Legal           1.67        3.97      25.75          1.36
Literary        0.40        2.10      11.87          1.60
News            0.40        1.75      9.39           1.54
Encyclopedic    0.25        1.66      12.18          1.45
Average         0.67        2.36      14.75          1.49
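The occurrence meta-symbols described above can be illustrated with a small matcher for a single production. This is a minimal sketch in Python, not the authors' Java parser: it assumes the input has already been reduced to a flat sequence of constituent labels, that the labels `Phrase` and `VP` are distinct, and that a production's right side is written as space-separated symbols. The function names `parse_rhs` and `matches` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# A right-side symbol: (name, meta-symbol), meta-symbol in {"", "?", "*", "+"}.
Symbol = Tuple[str, str]

# Minimum and maximum number of occurrences for each meta-symbol.
BOUNDS = {"": (1, 1), "?": (0, 1), "*": (0, None), "+": (1, None)}


def parse_rhs(rhs: str) -> List[Symbol]:
    """Split a right side such as 'Phrase* VP Phrase*' into (name, meta) pairs."""
    symbols: List[Symbol] = []
    for token in rhs.split():
        if len(token) > 1 and token[-1] in "?*+":
            symbols.append((token[:-1], token[-1]))
        else:
            symbols.append((token, ""))
    return symbols


def matches(rhs: Sequence[Symbol], labels: Sequence[str]) -> bool:
    """Check whether the whole label sequence can be derived from the right side."""
    if not rhs:
        return not labels
    (name, meta), rest = rhs[0], rhs[1:]
    low, high = BOUNDS[meta]
    # Count how many leading labels carry the expected name.
    run = 0
    while run < len(labels) and labels[run] == name:
        run += 1
    limit: Optional[int] = high
    if limit is not None:
        run = min(run, limit)
    # Backtrack over every admissible number of repetitions, longest first.
    for count in range(run, low - 1, -1):
        if matches(rest, labels[count:]):
            return True
    return False


clause = parse_rhs("Phrase* VP Phrase*")
print(matches(clause, ["VP"]))                     # a bare verb phrase
print(matches(clause, ["Phrase", "VP", "Phrase"]))  # phrases on both sides
print(matches(clause, ["Phrase"]))                 # no VP present
```

With the production from Ex. 7, a sequence is accepted exactly when it contains one `VP` surrounded by any number of `Phrase` labels, which is why a single production can stand in for a separate rule per word-order variant.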