Automatic Identification of Explicit Connectives


Introduction

This project was part of building an automatic discourse tagger. Automating the identification of discourse connectives, their relations, and their arguments is an essential basis for discourse processing studies and applications. In this project we tried to identify Explicit discourse connectives using a list of connectives.

Corpus Used

We annotated a part of the Hindi corpus made available to us during the Inter-Annotator Agreement exercise. From this, together with Sections 16 and 17 of the Discourse Corpus, we extracted a list of all Explicit connectives along with their senses and frequencies of occurrence. Since AltLexes behave more or less the same as Explicits, we analyzed those as well.

Methodology

List Based

We auto-annotated the two sections using the list of Explicit connectives (a minimal sketch of this list-based matching is given below) and found that some connectives were annotated with very high accuracy, some with very low accuracy, and some were moderately correct. We sorted the list into these levels and handled each group accordingly.

Low Frequency, Mostly Correct: Since these Explicits were mostly correct in Sections 16 and 17, we tested them on our own annotated corpus and found similar results; they were mostly accurate.

High Frequency, Mostly Correct: These Explicits had high accuracy and so could be assumed to be unambiguous.

Mostly Erroneous: These Explicits were very erroneous and so were given most of the attention.

The first two types were clubbed together as Type I and the last one as Type II.
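To make the list-based step concrete, here is a minimal sketch in Python, assuming a plain list of connective strings and whitespace-tokenized text; the connective list, tokenization, and example sentence are illustrative, not the project's actual resources.

# A minimal sketch of list-based auto-annotation of Explicit connectives.
# The connective list and example below are illustrative assumptions; the
# project's actual list was extracted from Sections 16/17 of the Hindi
# Discourse Corpus.

CONNECTIVE_LIST = ["लेकिन", "क्योंकि", "इसलिए", "इसके बाद", "साथ ही", "और", "पर"]

def find_explicit_connectives(tokens, connectives=CONNECTIVE_LIST):
    """Return (start, end) token spans matching a listed connective,
    preferring the longest match at each position (for multiword items)."""
    patterns = sorted((c.split() for c in connectives), key=len, reverse=True)
    spans = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            if tokens[i:i + len(pattern)] == pattern:
                spans.append((i, i + len(pattern)))
                i += len(pattern)
                break
        else:
            i += 1
    return spans

# Example (hypothetical sentence): marks "और" and the multiword "इसके बाद".
tokens = "बारिश हुई और इसके बाद मौसम ठंडा हो गया".split()
print(find_explicit_connectives(tokens))   # [(2, 3), (3, 5)]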

Result - List Based

Type       Correct   Incorrect   Accuracy
Type I     303       56          84.4%
Type II    114       748         13.2%
Overall    417       804         34.1%

Resolving Ambiguity: Discourse vs. Non-Discourse Usage

We explored the predictive power of syntactic features for distinguishing discourse from non-discourse usage. The following examples illustrated how this is helpful:

[Hindi example sentences, garbled in extraction]

Thus, for some connectives there are restrictions imposed by the syntactic categories of their left and right neighbors. We made use of these restrictions to disambiguate such connectives from their non-discourse usage.

Rule Based

We used the TnT tagger to find the syntactic categories and got the following results (a sketch of such a neighbor-category check is given after the chunker results below).

Result-TnT

Connective   Correct   Incorrect   Accuracy
और           85        254         25%
पर           12        343         3.3%
पहले         1         49          2%
या           8         19          29.26%
व            2         74          2.63%
आगे          5         11          31.25%

But since we used a general TnT tagger rather than one trained on gold data, the accuracy was not good, so we decided to use a shallow parser for better results. There were also some availability issues with the taggers.

Result-Shallow Parser

Connective   Correct   Incorrect   Accuracy
और           85        16          84.1%
पर           12        25          32.5%
पहले         1         42          2.32%
या           8         5           61.5%
व            2         2           50%
आगे          5         8           38.45%

This selection of taggers still faced one problem: the taggers were not capturing the syntactic category we actually wanted. Analyzing the remaining errors, it seemed that the phrasal category of the neighbors would be more appropriate, so we moved on to using chunkers.

Result-Chunker

Connective   Correct   Incorrect   Accuracy
और           85        3           96.5%
पर           12        2           85.7%
पहले         1         41          2%
या           8         1           42%
व            2         0           100%
आगे          5         7           41.65%
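As promised above, here is a minimal sketch of the neighbor-category restriction: a candidate connective is kept as a discourse connective only if the categories of its left and right neighbors (from a tagger or chunker) satisfy a per-connective rule. The rule table, tag names, and helper are hypothetical placeholders, not the rules actually used in the project.

# A minimal sketch of the neighbor-category rule check described above.
# connective -> (allowed left-neighbor tags, allowed right-neighbor tags);
# None means no restriction on that side.  The rules below are hypothetical.
RULES = {
    "और": ({"VM", "VAUX", "SYM"}, None),   # e.g. clause-final verb on the left
    "पर": (None, {"PRP", "NN", "VM"}),     # hypothetical right-context restriction
}

def is_discourse_usage(tagged_tokens, i):
    """tagged_tokens: list of (token, tag) pairs; i: index of the candidate."""
    token, _ = tagged_tokens[i]
    if token not in RULES:
        return True  # no rule: keep the list-based decision
    left_allowed, right_allowed = RULES[token]
    left_tag = tagged_tokens[i - 1][1] if i > 0 else "BOS"
    right_tag = tagged_tokens[i + 1][1] if i + 1 < len(tagged_tokens) else "EOS"
    left_ok = left_allowed is None or left_tag in left_allowed or left_tag == "BOS"
    right_ok = right_allowed is None or right_tag in right_allowed
    return left_ok and right_ok

# Example with hypothetical tags: "और" after a verb is kept as a discourse
# connective, while "और" coordinating two nouns is rejected.
print(is_discourse_usage([("गया", "VM"), ("और", "CC"), ("फिर", "RB")], 1))      # True
print(is_discourse_usage([("राम", "NNP"), ("और", "CC"), ("श्याम", "NNP")], 1))  # False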

Overall Result

Type       Correct   Incorrect   Accuracy
Type I     303       56          84.4%
Type II    114       182         38.51%
Overall    417       541         77.2%

Conclusion and Future Work

We were able to handle most of the Explicit connectives fairly well, but some issues are worth mentioning. Two connectives, आगे and पहले, still remained ambiguous; we have few examples of them in the corpus, so they need a different approach. Since the chunker and the tagger were not trained on gold data, they introduced errors of their own. Paired connectives also raise the question of which word belongs to the pair when more than one candidate second word is present.

In future work, we would include richer linguistic information in the rule-based technique to improve the results. Apart from that, machine-learning techniques would be used to identify Explicit connectives. We would then move on to implicit connective identification.

Apart from this, we also explored sense annotation of some Explicit connectives. The observed sense distributions were:

लेकिन {'Comparison': 59, 'Expansion': 1}
यदि..तो {'Contingency': 12}
बाद में {'Temporal': 5}
जब..तो {'Temporal': 3, 'Contingency': 3}
बहरहाल {'Comparison': 8}
जबकि {'Comparison': 26}
इसके बाद {'Temporal': 9}
पर {'Comparison': 8}
साथ ही {'Expansion': 11}
इसके साथ ही {'Expansion': 6}
इसलिए {'Contingency': 15}
दूसरी ओर {'Comparison': 11}
और {'Comparison': 3, 'Contingency': 4, 'Temporal': 1, 'Expansion': 89}
अगर..तो {'Contingency': 11}
क्योंकि {'Contingency': 6}
या {'Expansion': 9}
वहीं {'Comparison': 8}
इससे {'Contingency': 8}
ताकि {'Contingency': 6}
बल्कि {'Comparison': 6, 'Expansion': 1}
उधर {'Comparison': 22}
इस पर {'Contingency': 5}
इससे पहले {'Temporal': 4}
आगे {'Temporal': 2, 'Expansion': 3}
हालांकि {'Comparison': 12}

As can be seen, at the top level of the sense hierarchy the errors would be very small, but as we move to more fine-grained sense annotation, errors appear.
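Since most connectives above are dominated by a single top-level sense, assigning each its most frequent sense would already be fairly accurate at the top level. Below is a minimal sketch of such a baseline, using two entries copied from the list above; the helper and data structure are our own illustration, not part of the project.

# A minimal sketch of a most-frequent-sense baseline over sense counts like
# those listed above.  Only two entries are shown for brevity.
SENSE_COUNTS = {
    "लेकिन": {"Comparison": 59, "Expansion": 1},
    "और": {"Comparison": 3, "Contingency": 4, "Temporal": 1, "Expansion": 89},
}

def most_frequent_sense(connective):
    counts = SENSE_COUNTS.get(connective)
    return max(counts, key=counts.get) if counts else None

print(most_frequent_sense("और"))     # Expansion (89 of 97 occurrences)
print(most_frequent_sense("लेकिन"))  # Comparison (59 of 60 occurrences)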
