Automatic Identification of Explicit Connectives

Introduction

This project was part of building an automatic discourse tagger. Automatically identifying discourse connectives, their relations, and their arguments is an essential basis for discourse processing studies and applications. In this project we tried to identify Explicit discourse connectives using a list of connectives.

Corpus Used

We annotated a part of the Hindi corpus made available to us during Inter Annotator Agreement. From this, together with Sections 16 and 17 of the Discourse Corpus, we extracted a list of all Explicit connectives, along with their senses and frequencies of occurrence. Since AltLexes behave more or less the same as Explicits, we analyzed those as well.

Methodology

List Based

We auto-annotated the two sections using the list of Explicits and found that some connectives were annotated with very high accuracy, some with very low accuracy, and some were moderately correct. We sorted the list into these levels and handled each level accordingly.

Low Frequency - Mostly Correct

Since these Explicits were mostly correct in Sections 16 and 17, we tested them on our own annotated corpus and found similar results: they were mostly accurate.

High Frequency - Mostly Correct

These Explicits had high accuracy and so could be assumed to be unambiguous.

Mostly Erroneous

These Explicits were annotated with many errors and so were given most of the attention.

The first two types were grouped together as Type I and the last one as Type II.
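The list-based pass and the accuracy levels described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`annotate`, `tally`, `accuracy`, `assign_type`), the triple representation of an annotation, and the 0.5 cut-off between Type I and Type II are all assumptions introduced here, and multi-word or paired connectives are ignored.

```python
from collections import Counter
from typing import List, Set, Tuple

# (sentence_id, token_index, connective): a hypothetical annotation record
Annotation = Tuple[int, int, str]


def annotate(tokens: List[str], connective_list: Set[str]) -> List[int]:
    """Return the indices of tokens that appear in the connective list.

    Only single-token connectives are handled in this sketch.
    """
    return [i for i, tok in enumerate(tokens) if tok in connective_list]


def tally(predicted: Set[Annotation], gold: Set[Annotation]):
    """Count, per connective, predictions confirmed or not confirmed by gold."""
    correct, incorrect = Counter(), Counter()
    for record in predicted:
        connective = record[2]
        if record in gold:
            correct[connective] += 1
        else:
            incorrect[connective] += 1
    return correct, incorrect


def accuracy(correct: int, incorrect: int) -> float:
    """Share of list-based annotations that were correct."""
    total = correct + incorrect
    return correct / total if total else 0.0


def assign_type(acc: float, threshold: float = 0.5) -> str:
    """Group a connective by accuracy; the threshold is an assumed value."""
    return "Type I" if acc >= threshold else "Type II"
```

For instance, a group of connectives matched 303 times correctly and 56 times incorrectly gets an accuracy of 303 / (303 + 56), about 84.4%.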

Result-List Based

Type       Correct   Incorrect   Accuracy
Type I     303       56          84.4%
Type II    114       748         13.2%
Overall    417       804         34.1%

Resolving Ambiguity

Discourse vs. Non-Discourse Usage

We explored the predictive power of syntactic features for distinguishing discourse from non-discourse usage. The following examples illustrate how this helps:

त र थरय त र य त र य क उपस थ त स थ य त और म सम क क त ररण उत न गर म स भ र य ह म ल गर क ल क वजह स सभ र ल गर दशन नह कर पस थ त रत ह म त रर र सरक त रर क जनत त र क य ह त क त र पस थ र त र ध य त रन ह और वह स त रह स र फ सल ल न स पस थ र छ नह ह टत र

Thus we can see that, for some connectives, the syntactic categories of their left and right neighbors impose restrictions. We used these restrictions to separate the discourse usage of these connectives from their non-discourse usage.

Rule Based

We used the TnT tagger to find the syntactic categories and got the following results.

Result-TnT

Connective   Correct   Incorrect   Accuracy
और           85        254         25%
पस थर        12        343         3.3%
पस थह ल      1         49          2%
य त र        8         19          29.26%
व            2         74          2.63%
आगर          5         11          31.25%

However, since we used a general TnT tagger rather than one based on gold data, the accuracy was not good, so we decided to use a shallow parser for better results. There were also some availability issues associated with the taggers.

Result-Shallow Parser

Connective   Correct   Incorrect   Accuracy
और           85        16          84.1%
पस थर        12        25          32.5%
पस थह ल      1         42          2.32%
य त र        8         5           61.5%
व            2         2           50%
आगर          5         8           38.45%

These taggers still had one problem: they were not capturing the syntactic category we actually wanted. We analyzed the remaining errors, and it seemed that the phrasal category of the neighbors would be more appropriate, so we moved on to using chunkers.

Result-Chunker

Connective   Correct   Incorrect   Accuracy
और           85        3           96.5%
पस थर        12        2           85.7%
पस थह ल      1         41          2%
य त र        8         1           42%
व            2         0           100%
आगर          5         7           41.65%
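The neighbor-based restriction described above can be pictured as a small rule table over chunk tags. The sketch below is only an illustration under stated assumptions: the rule that और between two NP chunks is a non-discourse (noun coordination) use is a hypothetical example rather than a rule taken from the project, the chunk tags (NP, VGF, CCP) and the toy sentences are made up for the demonstration, and `NON_DISCOURSE_CONTEXTS` and `discourse_connectives` are names invented here.

```python
from typing import Dict, List, Optional, Set, Tuple

Chunk = Tuple[str, str]  # (chunk tag, chunk text)
Context = Tuple[Optional[str], Optional[str]]  # (left tag, right tag)

# Hypothetical rule table: for each connective, the (left, right) chunk-tag
# contexts in which it is treated as NON-discourse usage.
NON_DISCOURSE_CONTEXTS: Dict[str, Set[Context]] = {
    "और": {("NP", "NP")},
}


def neighbour_tags(chunks: List[Chunk], i: int) -> Context:
    """Chunk tags immediately left and right of position i (None at edges)."""
    left = chunks[i - 1][0] if i > 0 else None
    right = chunks[i + 1][0] if i + 1 < len(chunks) else None
    return left, right


def discourse_connectives(
    chunks: List[Chunk],
    rules: Dict[str, Set[Context]] = NON_DISCOURSE_CONTEXTS,
) -> List[int]:
    """Indices of listed connectives whose context is not ruled out."""
    hits = []
    for i, (_tag, text) in enumerate(chunks):
        if text not in rules:
            continue  # not a connective covered by the rule table
        if neighbour_tags(chunks, i) in rules[text]:
            continue  # context marks this occurrence as non-discourse
        hits.append(i)
    return hits


# Toy inputs (invented): noun coordination vs. clause linking.
coordination = [("NP", "राम"), ("CCP", "और"), ("NP", "श्याम"), ("VGF", "आए")]
clause_link = [("NP", "राम"), ("VGF", "आया"), ("CCP", "और"),
               ("NP", "श्याम"), ("VGF", "गया")]
```

With this rule table, `discourse_connectives(coordination)` keeps nothing, while `discourse_connectives(clause_link)` keeps the और at index 2, because its left neighbor is a verb chunk rather than a noun chunk.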

Overall Result

Type       Correct   Incorrect   Accuracy
Type I     303       56          84.4%
Type II    114       182         38.51%
Overall    417       541         77.2%

Conclusion and Future Work

We were able to handle most of the Explicit connectives fairly well, but some issues are worth mentioning. Two connectives, आगर and पस थह ल, remained ambiguous. We have few examples of them in the corpus, so they need a different approach. Since the chunker and the tagger were not based on gold data, they introduced errors of their own. Paired connectives also raise a problem: when more than one candidate for the second word is present, it is unclear which word belongs to the pair.

In future, we would include richer linguistic information in the rule-based technique to improve the results. Apart from that, machine-learning techniques would be used to identify explicit connectives. We would then move on to implicit connective identification.

Apart from this, we also explored sense annotation of some explicit connectives.

ल य कन            Comparison: 59, Expansion: 1
य य द..त          Contingency: 12
ब त रद म          Temporal: 5
जब..त             Temporal: 3, Contingency: 3
बह रह त रल        Comparison: 8
जब य क            Comparison: 26
इसक ब त रद        Temporal: 9
पस थर             Comparison: 8
स त रथ ह र        Expansion: 11
इसक स त रथ ह र    Expansion: 6
इस ल ए            Contingency: 15
द सर र ओर         Comparison: 11
और                Comparison: 3, Contingency: 4, Temporal: 1, Expansion: 89
अगर र..त          Contingency: 11
क य य क           Contingency: 6
य त र             Expansion: 9
वह                Comparison: 8
इसस               Contingency: 8
त त र य क         Contingency: 6
ब त ल क           Comparison: 6, Expansion: 1
उधर               Comparison: 22
इस पस थर          Contingency: 5
इसस पस थह ल       Temporal: 4
आगर               Temporal: 2, Expansion: 3
ह त रल त र य क    Comparison: 12

As we can see, at the top level there would be very few errors, but errors appear as we move to more fine-grained sense annotation.
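One way to read the counts above is to ask how often a connective's most frequent top-level sense would be the right label. The sketch below computes that for a few entries copied from the list (the connective strings are reproduced as they appear above). The majority-sense baseline and the function names are assumptions introduced for illustration, not part of the reported experiments.

```python
from typing import Dict

# A few rows copied from the sense counts above; keys are kept as printed.
SENSE_COUNTS: Dict[str, Dict[str, int]] = {
    "ल य कन": {"Comparison": 59, "Expansion": 1},
    "जब..त": {"Temporal": 3, "Contingency": 3},
    "उधर": {"Comparison": 22},
    "और": {"Comparison": 3, "Contingency": 4, "Temporal": 1, "Expansion": 89},
}


def majority_sense(counts: Dict[str, int]) -> str:
    """Most frequent sense; ties go to the sense listed first."""
    return max(counts, key=counts.get)


def baseline_accuracy(table: Dict[str, Dict[str, int]]) -> float:
    """Accuracy if every occurrence received its connective's majority sense."""
    hits = sum(max(counts.values()) for counts in table.values())
    total = sum(sum(counts.values()) for counts in table.values())
    return hits / total
```

On these four entries the baseline would be right 173 times out of 185. Almost all of its errors come from the two connectives whose occurrences are spread over several senses, और and जब..त.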
