Analysis and Reconstruction of Dictionary Definition Units


Chung-Won Seo and Key-Sun Choi
Department of Computer Science, KAIST/AITRC/KORTERM
KAIST 373-1 Kusong-dong, Yusong-ku, Taejon, 305-701, Republic of Korea
{cwseo,kschoi}@world.kaist.ac.kr

Abstract

In this paper, we analyze the dictionary definitions of verbs to find the units of a definition, and we study how to reconstruct dictionary definitions without ambiguity. To find the units of a definition, we analyze the dependency structures of definitions and relate the main verb to the other syntactic units. We build a hierarchical structure from the relation between a headword and the main verb of its definition. From that result, we reduce the verbs, in order of frequency, to the basic verbs that define other verbs. In dictionary definitions, conjunctive verb endings are used with a limited range of meanings. We restrict the use of conjunctive verb endings because they cause many errors and ambiguities in syntactic and semantic analysis. With the restricted conjunctive verb endings, we reconstruct the dictionary definitions without ambiguity.

1 Introduction

Dictionaries are resources that provide lexical information such as morphology and semantics. In NLP, it is important to build a dictionary that is good in both size and quality. When we define a new lexical entry, we need to give its semantic category, its usage, its relations to pre-defined lexical entries, and so on. We also need tools for defining and describing a lexical entry.

A dictionary definition should be simple, unambiguous in meaning, and sufficiently informative. Like a sub-language, it is characterized by restricted sentence patterns and a pre-defined vocabulary. Sub-languages are used to improve the performance of machine translation by restricting the lexicon and grammar (Ananiadou et al. 1995; Grover et al. 2000). In a dictionary definition, we can define a new lexical entry from pre-defined words and from the semantic relations between the new entry and existing lexical entries (Lee et al. 1998; Rigau 1998; Buitelaar 1997). A sub-language for MT only has a restricted lexicon and grammar, but a sub-language for dictionary definitions also needs basic verbs that describe other verbs and description patterns for definitions.

In this paper, we analyze the definitions of the verbs in Urimal Korean Unabridged Dictionary (1997) and find the units of verb definitions. From the relations between the headword and the main verb of the definition, we find the basic verbs for definitions. We restrict the uses and meanings of the conjunctive verb endings to resolve ambiguities in automatic processing.

2 Related Works

Sub-languages have been studied as a way to improve the performance of machine translation. Adriaens (1994) describes Simplified English Grammar (SEG), which restricts sentence patterns, the lexicon, and the use of modification for human writers, and provides editing and checking programs for SEG. For dictionary definitions, we need to find the description patterns and the basic elements of the definitions. Choi et al. (2000) give a method for obtaining basic verbs for a morpheme dictionary and for information retrieval. Their basic verbs are determined by corpus frequencies and by weights assigned by human experts.

Conjunctive verb endings connect two or more verbs in one sentence. They are similar to conjunctions. Lee (2000) gives semantic relations of conjunctions based on RST. However, conjunctive verb endings are further subdivided by their usages. Hovy (1993) lists 62 discourse relations used in earlier work, together with how often other researchers refer to them.

3 Verb Definitions

Urimal Korean Unabridged Dictionary (1997) consists of about 204,350 lexical entries and 274,480 meanings. Of these, about 12,800 entries and 21,700 meanings are verbs. The basic rules for writing definitions are as follows (The National Academy of Korean Language, 2000).

1. Avoid cyclic definitions.
   A. Some compound words, irreplaceable main concepts, and derivatives formed by affixes are exceptions.
2. Do not define an entry with a single word.
3. A definition contains one sentence of meaning and additional explanations.
   A. Definition: difference + genus. <Additional explanation>
   B. Additional explanation: special features, structures, and usages of the headword. It is optional.
4. Exceptional cases of rule 3:
   A. Verb stems combined with some propositions or endings.
   B. Dependent nouns.
   C. Grammatical formations.
   D. Adverbs, unconjugated adjectives, and pronouns.
5. Some entries can end with an abbreviation term, a term of respect, a court term, an original term, and so on.

Because a lexical entry is defined by difference + genus, the last word of the definition is important for describing its concept. In a verb definition, the last word tends to be the main verb of the definition. We take as basic verbs the verbs that define other verbs. First, we select verbs by frequency; then we build a hierarchical structure between headwords and the main verbs of their definitions. From that result, we select the basic verbs at the top level. To determine the basic verbs, we look at the high-frequency words of the KAIST Corpus (2000): we select the 1,578 verbs that appear among the 10,000 most frequent words.

Table 1: words of high frequency

A headword and the main verb of its definition have similar meanings; the other units of the definition distinguish them.

nubda (lie down): alh-geona hayeo ileona-ji moshada.
[[ subordinate alh-geona hayeo (be ill) ] sentence ileona-ji moshada (cannot rise) ]

bulreoileokida (excite): eoddeon maeum, heangdong, sangtae-reul ileona-ge hada.
[[[[[ modifier eoddeon (certain) ] parallel maeum (mind) ] parallel heangdong (behavior) ] object sangtae-reul (state) ] sentence ileona-ge hada (cause) ]

seanggaknada (remind): mueot-eul hago sipeun maeum-i ileonada.
[[[[ object mueot-eul (something) ] modifier ha-go sipeun (wants to do) ] subject maeum-i (mind) ] sentence ileona-da (occur) ]

These three words are defined by the same verb, [ileonada]. Their meanings are differentiated by the caseframe, mood, voice, and conjoined verbs.

3.1 Basic verbs
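The relation between headwords and the main verbs of their definitions, as in the three examples above, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the mapping `MAIN_VERB` is assumed to be given (extracting the main verb from a Korean definition would need a morphological analyzer, which is not shown), and the function names `build_hierarchy` and `top_level_verbs` are hypothetical.

```python
from collections import defaultdict

# Assumed input: headword -> main verb of its definition (romanized).
# The three pairs come from the examples in the text; main-verb
# extraction itself is not shown here.
MAIN_VERB = {
    "nubda": "ileonada",
    "bulreoileokida": "ileonada",
    "seanggaknada": "ileonada",
}


def build_hierarchy(main_verb):
    """Group headwords under the main verb that defines them."""
    children = defaultdict(list)
    for headword, verb in main_verb.items():
        children[verb].append(headword)
    return dict(children)


def top_level_verbs(main_verb):
    """Follow headword -> main verb links until no definition is left.

    A cycle stops the walk, so verbs that only define each other are
    reported as top-level as well.
    """
    tops = set()
    for verb in main_verb:
        seen = set()
        while verb in main_verb and verb not in seen:
            seen.add(verb)
            verb = main_verb[verb]
        tops.add(verb)
    return tops


print(build_hierarchy(MAIN_VERB))
print(top_level_verbs(MAIN_VERB))
```

With the three example entries, all headwords are grouped under [ileonada], which is also the only top-level verb. Stopping at cycles is a design choice of this sketch, so that verbs defined only in terms of each other stay in the result instead of looping forever.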

Figure 1: the relations between head verbs and main verb.

The units of a verb definition are the main verb, an auxiliary verb sequence, other verbs connected by conjunctive verb endings, and the caseframe of the main verb.

Table 2: the units of the verb definitions

Headword   Main verb                 Concepts
           Auxiliary verb            Meanings of verb (negative, passive, and so on)
           Caseframe                 Arguments
           Conjunctive verb ending   Additional meaning

To determine the basic verbs used to explain other verbs, we need to build hierarchical relations among the verbs from the headwords and the main verbs. Among the 1,578 verbs and their definitions, 350 words are common to headwords and main verbs, and 500 headwords do not appear in any definition. Another 170 words appear only in definitions. The 500 verbs are leaf entries of the verb hierarchy.

Table 3: the intersections of the headwords and main verbs

Only headwords [0]   Common [1]   Only definition [2]
500                  350          170

From the 170 new words, we remove the non-verb entries and get 96 verbs. Comparing the definitions of these 96 verbs with [1] yields 40 new verbs. We can proceed in the same way for the definitions of those 40 verbs.

Figure 2: the contraction process
[ggieonda] (pour) [2]: defined with the main verb [naedeonjida] (throw) [1]

After repeating the process three times, most verbs reduce to verbs of [1], as [ggieonda] does. Nine verbs remain. The following two verbs, for instance, do not reduce to [1].

[golhda] (suffer): [eongeolmeokda]. [eongeolmeokda] (suffer a big loss)
[gubuleojida] (be curved): [gubutha] (bent)

Table 4: the result of the third application

Common words <[1]+[2]>   Only in definition [3]
430                      9

We thus obtain 439 basic verbs for the 1,578 high-frequency verbs of the KAIST Corpus.

3.2 Conjunctive verb endings

The conjunctive verb endings cause many errors in morphological and syntactic analysis. There are three kinds of conjunctive verb ending: equal, subordinate, and assistant. Most endings are used with two or more kinds of POS, and this causes errors in automatic analysis. If a sentence contains conjunctive verb endings, the accuracy of syntactic analysis drops considerably. Of the 1,578 verb definitions, 396 sentences contain equal or subordinate conjunctive verb endings, which is about one fourth. When assistant conjunctive verb endings are included, the share rises to 87% of the definitions. For accurate analysis of the definitions, we must restrict the use of the conjunctive verb endings. Only 15 conjunctive verb endings cover 95% of the occurrences in definitions. The high-frequency endings are used with two or more POSs and are more ambiguous than the low-frequency ones.

Table 5: the distributions of conjunctive verb endings in the verb definitions

Equal and subordinate endings   Conjunctive verb endings   Total sentences
396 (25%)                       1373 (87%)                 1578

Many conjunctive verb endings are used with two or more POSs and meanings. Some endings can be replaced by others without restriction, some can be replaced only under certain restrictions, and some cannot be replaced. If we can assign each ending exactly one POS and one meaning, we can resolve the ambiguities of conjunctive verb endings. There are 170 conjunctive verb endings in the morpheme dictionary. In the corpus there are 280 endings, because the corpus also contains their variant forms. In the dictionary definitions, 106 conjunctive endings appear. In the verb definitions, there are only 44 conjunctive verb endings. We can reduce this number by restricting usages and meanings.

Table 6: the usages of conjunctive verb endings.

Figure 3: the surface and deep structures of the dictionary definitions.

The conjunctive verb endings establish a semantic relation between two verbs.
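The shares reported in this section, and the kind of count behind a statement such as "15 endings cover 95%", can be reproduced with a short Python sketch. The sentence counts 396, 1373, and 1578 come from the text; the per-ending frequencies in `toy_counts` are invented purely for illustration, and the function name `endings_needed` is hypothetical.

```python
from collections import Counter


def share(part, total):
    """Percentage of `total` represented by `part`, to one decimal."""
    return round(100 * part / total, 1)


def endings_needed(counts, target=0.95):
    """Smallest number of most frequent endings whose cumulative
    share of all occurrences reaches `target`."""
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)


# Sentence counts reported in the text.
print(share(396, 1578))    # equal or subordinate endings
print(share(1373, 1578))   # all conjunctive verb endings

# Invented frequencies, NOT taken from the paper.
toy_counts = Counter({"go": 50, "eo": 30, "ge": 16, "geona": 2, "neunde": 2})
print(endings_needed(toy_counts))
```

The first two values come out as 25.1 and 87.0, matching the rounded 25% and 87% in the text. For the invented counts, three endings are enough to reach the 95% threshold.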
We can reduce the number of conjunctive verb endings by restricting their relations. First, we find the usages and meanings of each conjunctive verb ending and group the endings with similar meanings. Some conjunctive verb endings belong to two or more relations. When a relation is expressed by two or more conjunctive verb endings, one ending can be chosen for that relation. However, some relations contain only one verb ending, which cannot be replaced by another. Those verb endings are used for only one POS, so they are not ambiguous at the morphological level. Frequent verb endings such as [go], [eo], and [ge] can be replaced with other verb endings. [eo] is used as both a subordinate and an assistant conjunctive verb ending. When [eo] is needed as a subordinate ending, we replace it with [eoseo], which has the same meaning. When it is used as an assistant ending, we do not replace it. In this way we restrict [eo] to the assistant conjunctive ending and [eoseo] to the subordinate ending with the meaning of condition. [go] and [ge] can be restricted in the same way, using endings with the same relations as replacements. In the verb definitions, we find 16 relations (sequence, cause/reason, purpose, background, and so on). Table 7 shows the mapping between relations and verb endings.

Table 7: the relations of conjunctive verb endings

Relations    Conjunctive verb endings
Sequence     [geona], [deunji]
Purpose      [goja], [ryeogo]
Reason       [ni, euni], [meuro]
Excuse       [neura]
Contrast     [neunde]
Cause        [dago]
Condition    [eoseo/aseo]

Further cells of the table whose row alignment is not recoverable: relations Purpose, Precedence, Reason/Cause, Background; endings [eo / a], [neunde]; restrictions Include effort, Opposite reason, Following state or action.

Using these relations, we can restrict the number of conjunctive verb endings to 20 and resolve the POS ambiguities.

4 Evaluation

Using the restricted conjunctive verb endings, we rewrote 200 verb definitions that contain conjunctive verb endings. This work was done by hand. We evaluate the accuracy of dependency parsing (Seo, 2001). Because the uses of conjunctive verb endings are restricted, the accuracy of the parsing result can improve. The 200 sentences consist of 1453 word phrases. The inputs to the dependency parser are raw sentences. For the modified sentences, we add post-processors for conjunctive verb endings. To the dependency parser, we add rules that determine the dependent of a node that includes a conjunctive verb ending. We evaluate the sentence accuracy.

Table 8: The result of parsing

                       Accuracy
Original definitions   105/200 (52.5%)
Modified definitions   149/200 (74.5%)

The sentence accuracy improves by 22 percentage points.
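The figures in Table 8 can be checked with a short calculation. The counts (105, 149, 200) are from the table; the relative gain is an extra figure computed here for context and is not reported in the paper.

```python
def accuracy(correct, total):
    """Sentence accuracy as a percentage."""
    return 100.0 * correct / total


original = accuracy(105, 200)   # original definitions
modified = accuracy(149, 200)   # modified definitions

# Absolute gain in percentage points, as reported in the text.
gain_points = modified - original

# Relative gain over the original number of correct sentences
# (not stated in the paper; shown only for comparison).
relative_gain = round(100.0 * (149 - 105) / 105, 1)

print(original, modified, gain_points, relative_gain)
```

The absolute difference is 22 percentage points (52.5% to 74.5%), which corresponds to roughly 41.9% more correctly parsed sentences than with the original definitions.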
The original definitions suffer from tagging errors on the conjunctive verb endings and from ambiguous subordinate clause boundaries. For the modified definitions, we add some heuristics for conjunctive verb endings (determining the dependent of nodes that include conjunctive verb endings, and the boundaries of subordinate clauses). Most remaining errors occur at parallel noun phrases and at adverbs in subordinate or relative clauses. From this result, we conclude that restricting the conjunctive verb endings resolves syntactic ambiguities. We still need to evaluate the differences in meaning between the original and the modified definitions, to verify that the restrictions do not lose information.

5 Conclusion

In this paper, we analyze verb definitions and find the units of definitions. We determine the basic verbs that define other verbs. We set relations for the conjunctive verb endings to resolve ambiguities. From those results, we can reconstruct the verb definitions.

Dictionary definitions consist of terms that represent differences and a genus term. We can find basic verbs from the genus terms: the genus terms of verb definitions converge to the basic verbs. From the 1,578 definitions, we get 439 basic verbs.

The conjunctive verb endings represent the relations between the main verb of a definition and the other events. Those relations can be defined as discourse relations. In dictionary definitions, the use of relations is limited. We restrict the use of conjunctive verb endings to 16 discourse relations. Because unrestricted endings are highly ambiguous in syntax and meaning, the restriction resolves ambiguities. From the basic verbs and the restricted verb endings, we can reconstruct verb definitions without ambiguity.

In future work, we plan to analyze the ambiguities in the restrictions of lexical entries and to generate definitions automatically. The basic verbs and their relations can be compared with a thesaurus, and we can obtain definitions from a corpus by constructing the semantic hierarchy. The results for verb definitions can be extended to adverbs and nouns.

References

Adriaens G. (1994). The LRE SECC Project: Simplified English Grammar and Style Correction in an MT Framework, in Proceedings of Linguistic Engineering Convention, Paris.

Ananiadou, S., Radford, I. and Tsujii, J. (1995). Sublanguage knowledge acquisition for hypertext optimization, Proceedings of NLPRS, Seoul.

Buitelaar (1997). A Lexicon for Underspecified Semantic Tagging, Proceedings of ANLP 97.

Choi Young-suk, Woon-jae Lee, and Key-sun Choi (2000). A Study on Verbs Statistics in Corpus, Proceedings of KLIP2000.

Claire Grover, et al. (2000). Designing a controlled language for interactive model checking, Proceedings of the Third International Workshop on Controlled Language Application.

Eduard H. Hovy (1993). Automatic Discourse Generation using Discourse Structure Relations, in Artificial Intelligence 63, Special Issue on Natural Language Processing.

German Rigau (1998). Automatic Acquisition of Lexical Knowledge from MRDs, PhD Thesis, Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya.

Hanguel Society, Ed. (1997). Urimal Korean Unabridged Dictionary, Eomungag.

Ivan A. Sag and Thomas Wasow (1999). A Syntax Theory, CSLI Publications.

Korterm (2000). KAIST Corpus, http://morph.kaist.ac.kr/kcp/.

Lee, Chungmin, Seungho Nam and Beom-mo Kang (1998). A Generative Approach to the Lexical Semantics of Korean Predicates, Korean Journal of Cognitive Science 9.3, Korean Society for Cognitive Science.

Lee, Yu-Ri (2000). Text Summarization using Rhetorical Structure, MS thesis, Dept. of Computer Science, KAIST.

Seo, Chung-Won (2001). Dependency Parsing of Simple Korean Sentence Using Verb Caseframe, MS thesis, Dept. of Electrical Engineering & Computer Science, KAIST.

The National Academy of Korean Language (2000). Standard Korean Big Dictionary Editing Guides II, the National Academy of Korean Language.