WordNet: Structure and Use in Natural Language Processing


Abstract

There are many electronic dictionaries, thesauri, and lexical databases available today. WordNet is one of the largest and most widely used of these, and it has been applied to many natural language processing tasks, including word sense disambiguation and question answering. This paper explores the structure of WordNet, surveys how and for what applications it is used, and examines where its strengths and weaknesses lie.

1. WordNet as a lexical database

1.1 Background

Before the 1990s, most English dictionaries existed only in paper form, and the few available in electronic form were restricted to small groups of researchers. This hindered work in certain areas of computational linguistics, for example word sense disambiguation (WSD). In 1993, WordNet was introduced: a lexical database organized as a semantic network. Its development began in 1985 at Princeton University, led by a group of psychologists and linguists, and the university still maintains the database. Although it was not created to serve as a knowledge source for computational linguistics, it has been widely used as such: it has served as a lexical resource for many different tasks, has been ported to several languages, and has spawned many derived resources. One task for which it has been used extensively is the previously mentioned WSD.

1.2 Structure

WordNet consists of three separate databases: one for nouns, one for verbs, and one for adjectives and adverbs. It does not include closed-class words. The current version available for download is WordNet 3.0, released in December 2006, which contains 117,097 nouns, 22,141 adjectives, 11,488 verbs, and 4,601 adverbs. [2] A later release, 3.1, is available for online use. The basic building blocks are synsets.
These are sets of synonyms, or more precisely near-synonyms, since few if any true synonyms exist. A synset contains a set of lemmas, and each synset is tagged with the sense it represents. These senses can be thought of as concepts: all of the lemmas (or words) in a synset can be said to express the same concept. Word forms with different meanings appear in different synsets. For example, the noun bank has 10 different senses in WordNet, and thus it appears in 10 different synsets; it also appears as a verb in 8 different synsets. Each synset is also connected to other synsets by relations of various kinds. Which relations apply depends on the part of speech of the word, although the hypernym/hyponym relationship (the is-a relation) is the most common, appearing for both nouns and verbs (hyponyms of verbs are known as troponyms; the difference that leads to the different names is expanded on below). One thing WordNet does not take into account is pronunciation, as can be observed with the noun bass: the pronunciation differs depending on whether bass refers to the low tone or the instrument, or to the fish.

1.2.1 Nouns

Nouns have the richest set of relations of all the parts of speech represented in WordNet, with 12 different relations. As previously stated, the hyponym/hypernym relation is the most frequently used one. For example, consider the noun bass again (which has 8 different senses), now in the sense of sea bass: it is a saltwater fish, which is a kind of seafood, which is a kind of solid food, and so on. These relations are transitive, which means that sea bass is a type of food just as much as it is a type of saltwater fish.

Sense 4: sea bass, bass
  => saltwater fish
    => seafood
      => food, solid food
        => solid
          => matter
            => physical entity
              => entity
Table 1: Hypernyms of bass in the sense of sea bass.

WordNet also separates hyponyms into types and instances. A chair is a type of furniture; Hesse, however, is not a type of author but an instance of author. An instance is thus a specific form of hyponym, and instances are usually proper nouns describing a unique entity, such as a person, city, or company. Like types, these instance relations go in both directions.

Meronymy, the part-of relationship, is divided into three different types: member meronymy, part meronymy, and substance meronymy. Like hyponymy, it has a counterpart, holonymy. Where the meronym relation is displayed as has-part, the holonym relation is part-of. And just like hyponyms, meronyms form a transitive relationship: if a tree has branches, and a branch has leaves, then the tree has leaves.
Part meronymy, the relationship most commonly associated with the term, describes the parts of an entity. Substance meronymy describes substances contained in an entity: for example, the word water in the sense of the chemical substance H2O has the substances hydrogen and oxygen. The last subtype, member meronymy, describes the relationship of belonging to a larger group. Looking at the word tree again, we can see that it is a member of the entity forest, wood and woods. See Table 2 for a description of the different types of meronymy.

Part meronymy:
Sense 1: tree
  HAS PART: stump, tree stump
  HAS PART: crown, treetop
  HAS PART: limb, tree branch
  HAS PART: trunk, tree trunk, bole
  HAS PART: burl
Substance meronymy:
Sense 1: water, H2O
  HAS SUBSTANCE: hydrogen, H, atomic number 1
  HAS SUBSTANCE: oxygen, O, atomic number 8
Member meronymy:
Sense 1: forest, wood, woods
  HAS MEMBER: underbrush, undergrowth, underwood
  HAS MEMBER: tree
Table 2: Different types of meronymy used in WordNet.

Antonyms describe words that are semantically opposed. If you are someone's parent, you cannot at the same time be that person's child. Antonyms do not have to rule one another out, however: even though poor and rich are antonyms, saying that someone is not rich does not automatically mean that they are poor.

1.2.2 Verbs

Verbs, like nouns, have the hypernym relationship. Where its counterpart among nouns is called hyponymy, the corresponding relationship among verbs is called troponymy. These relations go from an event to a superordinate event and from an event to a subordinate event, respectively. A troponym can also be described as the manner in which something is done, which explains the difference in names. Antonymy also exists for verbs and functions the same way: stop is an antonym of start. The third relation, entailment, goes from an event to an event it entails. Entailment is used in pragmatics to describe a relationship between two sentences where the truth of one depends on the truth of the other: if sentence A is true, then sentence B also has to be true. For example, if The criminal was sentenced to death (A) is true, then The criminal is dead (B) also has to be true. This is the kind of relationship described by the entailment pointer in WordNet. If you snore, you are also sleeping, which is represented as an entailment relation between the two words, and thus there is an entailment pointer from snore to sleep.

1.2.3 Adjectives and adverbs

Adjectives are organized mostly in terms of antonymy. As with nouns and verbs, antonyms are words whose meanings are semantically opposed. Like all words in WordNet, adjectives are part of a synset. The other adjectives in a given synset also have their own antonyms, and thus the antonyms of those words become indirect antonyms of their synonyms. The pertainym relation points from an adjective to the noun it was derived from. This is one of the relations that crosses parts of speech, though there are a few rare cases in which it points to another adjective.

The number of adverbs is quite small, because most adverbs in English are derived from adjectives. Those that do exist are mostly organized in the same way as adjectives, with antonyms. They also have a relation resembling the pertainym relation of adjectives, likewise a cross-part-of-speech pointer, which points to the adjective they were derived from.

1.2.4 Relations across parts of speech

Most of the relations in WordNet hold among words of the same part of speech. There are, however, some pointers that cross between the parts of speech it covers. One has already been mentioned: pertainyms, which point from an adjective to the noun it was derived from. In addition, there are pointers to semantically similar words that share the same stem, called derivationally related forms. For many of these noun-verb pairs, the thematic role is also described: the verb kill has a pointer to the noun killer, and killer is the agent of kill.

2. Using WordNet for Natural Language Processing

Several subfields of natural language processing can benefit from a large lexical database, especially one as big and extensive as WordNet. Many semantic applications in particular can draw benefits from using WordNet, including WSD and sentiment analysis.
Many papers have been published on WordNet and WSD, exploring different approaches and algorithms; this is the main field in which WordNet is used. In fact, WordNet can be said to be the de facto standard knowledge source for WSD in English. [4] This success depends on several factors: it is not domain specific, it is very extensive, and it is publicly available. Since WSD is the subfield that has used WordNet most extensively, it is what will be focused on here. It is worth mentioning, though, that packages for accessing WordNet exist in several programming languages, including Perl and Python. For Python, the widely used Natural Language Toolkit (NLTK), which offers many modules and tools for analyzing and processing natural language, includes tools for working with WordNet, such as finding synsets and other relations between words.

2.1 WordNet for Word Sense Disambiguation

WSD is a field that has been around for as long as humans have tried to process natural language with computers. It has been described as an AI-complete problem and is considered an intermediate step in many NLP tasks. The two main approaches to solving it are knowledge-based methods and supervised methods.

Supervised methods suffer from sparseness of training data, in contrast to syntactic parsing, where many resources of tagged data exist. SemCor is a subset of the Brown Corpus tagged with senses from WordNet: 186 of the 500 files that constitute the Brown Corpus have tags for all content words (nouns, verbs, adjectives, and adverbs), and another 166 files have tags for the verbs. Even if this may be sufficient for evaluation, it is not enough for building a robust WSD system. [3]

Knowledge-based methods use some kind of knowledge source, such as WordNet, to retrieve word senses, and it is for these methods that WordNet has been used extensively; WordNet keeps appearing in papers on WSD to this date. Because knowledge-based methods using WordNet perform worse than supervised methods, approaches to extending the knowledge contained in WordNet have been proposed. They range from semantically tagging the glosses in WordNet in order to enrich the semantic relations, to extracting knowledge from Wikipedia. Combining WordNet with ConceptNet, a semantic network of common-sense relations, to improve performance has also been proposed. [5]

3. Discussion

WordNet is an impressive database, with its large number of words and its encoding of relations. Being freely available also makes it very practical to use for natural language processing, just as it has been used. There are, however, quite a few things that may speak against it. The very fine-grained sense distinctions in the database can be problematic for several tasks. Difficulty, for example, has four different senses in WordNet, all of them very similar and hard to tell apart, not just for computers but also for humans. As such, not all senses may be relevant when disambiguating a word. Other problems are that WordNet was mainly annotated and tagged by humans, which may produce some inconsistencies, and that it was not produced specifically to solve NLP tasks.

WordNet is still widely used in semantic natural language processing, as becomes clear when reading papers, particularly on WSD. In recent research WordNet has not been abandoned; instead it has been combined with other resources or improved in various ways. Since WordNet 3.0, it also includes a corpus of semantically annotated, disambiguated glosses, which can itself prove useful. [8] WordNet will be used for some time to come for WSD, mostly because of the sparseness of data for supervised methods. Improving the lexical knowledge and the algorithms that use it may be the best way forward for the time being.

Bibliography

[1] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross & Katherine Miller, Introduction to WordNet: An On-line Lexical Database (1993)
[2] Daniel Jurafsky & James H. Martin, Speech and Language Processing (Pearson Education International, 2009)
[3] Eneko Agirre & Philip Edmonds, Word Sense Disambiguation: Algorithms and Applications (Springer, 2006)
[4] Roberto Navigli, Word Sense Disambiguation: A Survey (2009)
[5] Junpeng Chen & Juan Liu, Combining ConceptNet and WordNet for Word Sense Disambiguation (2011)
[6] Jorge Morato, Miguel Ángel Marzal, Juan Lloréns & José Moreiro, WordNet Applications (2003)
[7] Julian Szymański & Włodzisław Duch, Annotating Words Using WordNet Semantic Glosses (2012)