Citation for published version (APA): Gaustad, T. (2004). Linguistic Knowledge and Word Sense Disambiguation Groningen: s.n.

University of Groningen

Linguistic Knowledge and Word Sense Disambiguation
Gaustad, Tanja

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version: Publisher's PDF, also known as Version of record
Publication date: 2004
Link to publication in University of Groningen/UMCG research database

Copyright: Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

Take-down policy: If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the number of authors shown on this cover page is limited to 10 maximum.

Download date: 30-09-2018

Chapter 1 Introduction

1.1 Ambiguity in Language

In the field of computational linguistics, researchers are mainly concerned with the computational processing of natural language. A number of results have already been obtained, ranging from concrete and applicable systems able to understand or produce language to theoretical descriptions of the underlying algorithms. However, a number of important research problems have not been solved. A particular challenge for computational linguistics, pertaining to all levels of language, is ambiguity.

Most people are quite unaware of how vague and ambiguous human languages really are, and they are disappointed when computers are hardly able to understand language and linguistic communication the way humans do. Ambiguity means that a word or sentence can be interpreted in more than one way, that is, it has more than one meaning. It should not be confused with vagueness, in which a word or phrase has only one meaning whose boundaries are not sharply defined. In most cases ambiguity does not pose a problem for humans and is therefore not perceived as such. The only exception is jokes and puns, where ambiguity is actively employed. For a computer, however, ambiguity is one of the main problems encountered in the analysis and generation of natural languages.

We can distinguish various kinds of ambiguity. A word can be ambiguous with regard to its internal structure (morphological ambiguity). Compounds are a typical source of morphological ambiguity. Two Dutch examples are massagebed, which can be analyzed as massage-bed ('massage bed') or massa-gebed ('mass prayer'), and computertaalkunde, with the two analyses computer-taalkunde ('computer linguistics') and computertaal-kunde ('programming language knowledge').

This kind of ambiguity can also be observed more implicitly, for example with the English verb form look: it can be the infinitive or the first or second person singular or plural, but as soon as the word immediately preceding look is known, the ambiguity can be resolved in most cases (e.g. to look is the infinitive, I look is first person singular, etc.).¹

Look can also be ambiguous with regard to its syntactic class, its so-called part-of-speech. In the sentence "We look at her", look is a verb, whereas in "She gave him a warning look" it is a noun.

Another kind of syntactic ambiguity can be found at sentence level. A classic example is so-called PP attachment ambiguity: the sentence "The man saw the girl with the telescope" is ambiguous with respect to whether the man had the telescope and was using it to see the girl or whether the girl was carrying the telescope. In contrast, the sentence "The man saw the girl with the ice cream" is not ambiguous for the human reader (we know that ice cream cannot be used to see), while it presents the same difficulty as the telescope sentence for the computer to resolve.

Pragmatics can also lead to ambiguity, for example in the interpretation of pronouns. Consider the two utterances in (1):

(1) Mary's mother is a gardener. John likes her.

The pronoun her in the second sentence can refer either to Mary or to her mother. The preferred (and congruent) reading would be that John likes Mary's mother, but once more there is potential ambiguity that needs to be resolved.

At word level again, lexical semantic ambiguity occurs when a single word is associated with multiple senses. We will be focusing on this type of ambiguity in the present thesis. To illustrate the problem of lexical ambiguity, consider the noun party. It can refer to (at least) four different things:

- an organization to gain political power (political party),
- a band of people associated temporarily in some activity (search party / party of three),
- a group of people gathered together for pleasure (birthday party),
- a person involved in legal proceedings (third party rights).

¹ An exception occurs if the preceding word is the personal pronoun you, which can be either singular or plural and which, in addition, can also be used as a direct object instead of as the subject. In those cases, more context and more information have to be taken into account to achieve disambiguation.

Without any further information, a list of possible senses like the one above is the best we can do to decide what party refers to. One could also argue that all these meanings are related and could be subsumed in a more general sense of party, namely group of people (but for many other words no such general sense can be found). However, for various applications, such as information retrieval queries or machine translation, it is important to be able to distinguish between the different senses of the word party.

In order to translate an English sentence containing party into Dutch correctly, for example, we first have to know which meaning of party is intended in English and then find the best translation equivalent in the given context in Dutch. The preferred translation for birthday party would be (verjaardags)feestje, whereas for political party it would be partij, two words with quite distinct meanings. Also, when we formulate an Internet query, there is usually one specific meaning we intend, and we only want to retrieve documents or links relevant for that particular meaning. So if, for instance, we are looking for information on a political party, we are not interested in documents on search parties that have been conducted or on legal issues. For this reason, it is crucial to be able to distinguish the various senses of a word.

Now let us consider the meaning of party in the following sentence:

(2) The guests left John's party right away.

It is quite clear to the human reader that the only possible reading here is the social gathering for pleasure. It is interesting to note that most people are not even aware of the potential ambiguity contained in this sentence. Humans are so skilled at resolving potential ambiguities that they do not realize that they are doing it. There has been research on how people resolve ambiguities (see Small et al. (1988) for a collection of articles from a psycholinguistic and neurolinguistic point of view), but since we (still) do not know exactly how humans resolve lexical ambiguity, it is even more difficult to teach a computer to achieve the same thing.

Especially if more than one ambiguous word is present in a sentence, the number of potential interpretations of the sentence explodes: the number of interpretations is the product of the numbers of possible meanings of the individual words. Assume that only left and party are ambiguous in the example sentence, and that they both have 4 senses. This brings the number of possible interpretations to 4 × 4 = 16. Imagine what happens if there are more senses to take into account, as illustrated in figure 1.1, or if the sentence gets longer.

Figure 1.1: Figure illustrating the possible interpretations of the sentence "The guests left John's party right away". The dotted lines show all possible combinations of senses for all words, the black line indicates the correct path.

The most prominent way to determine the meaning of a word in a particular usage is to examine its context. The context can be seen as the words surrounding the ambiguous word, in this case party. A word such as guest might be a good cue for a particular sense of party. But the words surrounding the ambiguous word are not the only kind of information that is available. Underneath the simple words lies information on whether a word in the context is a noun or a verb (its syntactic class), on whether that same word plays the role of subject or object, on the syntactic structure of the entire sentence, etc. All this information is certainly available to people in the process of disambiguation, and a combination of all these different kinds of information, together with general knowledge about the situation and the world, is used to rule out improbable readings.

The main research question we will try to answer in the present thesis is which linguistic knowledge sources are most useful for word sense disambiguation, more specifically word sense disambiguation of Dutch. Therefore, the structure of the thesis is based on the various levels of linguistic information tested for word sense disambiguation, including morphology, information on the syntactic class of a particular ambiguous word, and the syntactic structure of the entire sentence containing an ambiguous word. Each source of linguistic knowledge is tested and evaluated individually in order to assess its value for word sense disambiguation. Finally, combinations of knowledge sources are investigated and evaluated.
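The growth in the number of readings described for example (2) can be checked with a short sketch. It assumes, as the text does, that only left and party are ambiguous and that each has 4 senses. The sense labels below are hypothetical placeholders chosen for illustration and are not the sense inventory used in the thesis.

```python
from itertools import product
from math import prod

# Hypothetical sense inventory: placeholder labels for illustration only.
senses = {
    "left": ["depart", "remain", "bequeath", "left-hand side"],
    "party": ["political", "search", "social", "legal"],
}


def count_readings(inventory):
    """Number of sentence readings: product of the per-word sense counts."""
    return prod(len(word_senses) for word_senses in inventory.values())


def enumerate_readings(inventory):
    """Yield every combination that picks one sense per ambiguous word."""
    words = list(inventory)
    for combination in product(*(inventory[word] for word in words)):
        yield dict(zip(words, combination))


readings = list(enumerate_readings(senses))
print(count_readings(senses))  # 16 for two words with 4 senses each
```

Each additional ambiguous word multiplies the count by its number of senses, which is why longer sentences quickly become expensive to disambiguate by exhaustive enumeration.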
The goal of our project was to develop a tool which is able to automatically determine the meaning of a particular ambiguous word in context, a so-called word sense disambiguation system. In order to achieve this, we make use of the information contained in the context, similar to what humans do. We use the words surrounding the ambiguous word, and additional underlying information, such as syntactic class and structure, to build a statistical language model. This model is then used to determine the meaning of examples of that particular ambiguous word in new contexts.

1.2 Overview

Chapter 2 contains an overview of word sense disambiguation, starting with an outline of the problem of word sense disambiguation and the difficulty of defining word senses. We then continue with an elaboration of the different approaches possible and the various information types used for sense disambiguation in computational linguistics. Next, a crucial yet difficult issue in word sense disambiguation is addressed, namely the problem of evaluation. A description of the general approach adopted in this thesis concludes this chapter.

In chapter 3, preliminary experiments with pseudowords instead of real ambiguous words are reported on, investigating the importance of corpus size and frequency of context words. Furthermore, the equivalence between employing pseudowords or real ambiguous words to test word sense disambiguation algorithms is examined. The main conclusion is that the task of disambiguating pseudowords and real ambiguous words is not comparable.

The experimental setup used in the remainder of this thesis is introduced in chapter 4. We describe the classification algorithm and smoothing techniques as well as the corpus employed. A detailed explanation of the system and its implementation, as well as first results, make up the rest of the chapter. These first results using only the context for disambiguation show that maximum entropy works well as a classification algorithm for word sense disambiguation when compared to the frequency baseline.

Chapter 5 presents a variation on the word sense disambiguation system introduced, the lemma-based approach. It tests the hypothesis that lemmas as bases for classifiers improve generalization and therefore accuracy.
Comparing the lemma-based approach with the (traditional) word form approach on the Dutch Senseval-2 data shows a significant improvement when lemmatization is used. Furthermore, the resulting word sense disambiguation system is smaller and more robust. We can conclude from this that the lemma-based approach is a better alternative than the word form-based approach. A detailed description and evaluation of a newly built stemmer/lemmatizer for Dutch (a necessary pre-processing tool for the lemma-based approach) are included, too.

Extending our word sense disambiguation system with information on part-of-speech and reporting on its impact on word sense disambiguation is the subject of chapter 6. We were especially interested in the importance of the quality of the part-of-speech tagger used during pre-processing. We therefore compare the accuracy of our system including the part-of-speech of the ambiguous word generated by three different part-of-speech taggers. Two conclusions can be drawn from our results: first, that the most accurate tagger on a stand-alone task also outperforms the other taggers on the word sense disambiguation task, and second, that including information about the part-of-speech of the ambiguous word increases performance significantly. Including parts-of-speech of the context leads to an even bigger improvement in disambiguation accuracy.

The addition of deep linguistic knowledge, in the form of syntactic dependency relations, is discussed and evaluated in chapter 7. The results of our maximum entropy word sense disambiguation system including dependency relations are preceded by a detailed explanation of Alpino, the wide-coverage parser for Dutch used to annotate the data, as well as a description of the dependency relations employed. The results show that adding dependency relations to our statistical disambiguation system results in a significant increase in performance compared with all results presented earlier. The best results on the tuning data are achieved with a combination of features, including the part-of-speech of the ambiguous word, the context, and the dependency relations linked to the ambiguous word.

Chapter 8 presents the results on the Dutch Senseval-2 test data with the best model based on the tuning evaluation. First, we summarize our findings using the training data in a leave-one-out approach. Then, the results on the test data are presented. The first conclusion we reach is that the best model on the tuning data including syntactic information also works best on the test data.
When applying the same model in a comparison between the word form-based approach and the lemma-based approach, we find that the lemma-based approach using dependency relations as features achieves the best overall performance of our system on the test data. In a last step, we compare our best model to another word sense disambiguation system which, to the best of our knowledge, has produced the best results for Dutch to date. Our system achieves significantly higher disambiguation accuracy than the other model, which makes it state-of-the-art for Dutch word sense disambiguation. This is mainly due to the combination of the lemma-based approach and the integration of deep linguistic knowledge in the form of dependency relations.

We conclude in chapter 9 with some final remarks on the findings presented in the present thesis and thoughts on future work.
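The knowledge sources named in this overview (context words, lemmas, and the part-of-speech of the ambiguous word) could be turned into classifier features roughly as sketched below. This is only an illustration under stated assumptions: the token representation, the tag names, the window size, and the feature naming scheme are all hypothetical and are not taken from the thesis, and dependency relations are left out for brevity.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: word form, lemma, part-of-speech tag."""
    form: str
    lemma: str
    pos: str


def extract_features(tokens, target, window=3):
    """Collect context-lemma and part-of-speech features for tokens[target].

    The window size and feature names are assumptions made for this sketch.
    """
    features = {f"target_pos={tokens[target].pos}"}
    start = max(0, target - window)
    end = min(len(tokens), target + window + 1)
    for i in range(start, end):
        if i != target:  # the ambiguous word itself is not a context feature
            features.add(f"context={tokens[i].lemma}")
    return features


# Example (2) from section 1.1, annotated by hand with illustrative tags.
sentence = [
    Token("The", "the", "DET"),
    Token("guests", "guest", "NOUN"),
    Token("left", "leave", "VERB"),
    Token("John's", "John", "PROPN"),
    Token("party", "party", "NOUN"),
    Token("right", "right", "ADV"),
    Token("away", "away", "ADV"),
]

features = extract_features(sentence, target=4)
print(sorted(features))
```

Using lemmas rather than word forms for the context features means that guests and guest map to the same feature, which is one way to picture the generalization argument behind the lemma-based approach.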