An Entity-Relation Approach to Information Retrieval


Antonio Ferrández, Julio Martínez and Jesús Peral
Dept. of Languages and Information Systems, University of Alicante
Carretera San Vicente S/N, 03080 Alicante, SPAIN
{antonio, jmartinez, jperal}@dlsi.ua.es

Abstract

In this paper, a novel indexing model for IR is proposed in order to overcome the problems of traditional bag-of-words approaches, by indexing the entities in the documents and the relations between those entities, obtained through clause splitting and anaphora resolution. The model has been evaluated on the Los Angeles Times collection. The results have been compared with the vectorial model, obtaining a 12% increase in average precision and a 13% increase in R-Precision.

1 Introduction

In the literature, Natural Language Processing (NLP) techniques have been reported to show no significant improvement in retrieval performance, although it seems clear that they could overcome the inadequacies of purely quantitative methods of text Information Retrieval (IR): statistical full-text retrieval and bag-of-words representations. As an example of attempts to overcome these inadequacies, the work of Strzalkowski can be consulted (e.g. Strzalkowski, 1999). As he puts it, one possible explanation is that the syntactic analysis is just not going far enough; or, perhaps more appropriately, that the semantic uniformity predictions made on the basis of syntactic structures are less reliable than we had hoped. Of course, the relatively low quality of parsing may be a major problem, although there is little evidence to support that.

In this paper, we propose a novel IR model that incorporates NLP techniques such as POS tagging and partial parsing to improve on traditional bag-of-words representations. This model indexes entities and the relations between these entities.
These relations are based on the clause splitting of the document and on the resolution of the anaphora phenomenon between these entities. In this way, we improve on other approaches that use this kind of knowledge, such as the work of Zhai et al. (1996), in which only sets of nouns and/or adjectives are indexed through the vector space retrieval model, so the relations between entities are not considered.

In the following section, the proposed model is presented in its intuitive form. This is followed by its implementation in a computational system, which is finally evaluated on the Los Angeles Times collection and compared with the vectorial model.

2 The intuitive model

The model proposed in this paper tries to overcome the problems of traditional bag-of-words approaches by extracting the entities in the documents. The entities are obtained from the syntactic knowledge of the document, i.e., from the complex noun phrases (NPs) that are parsed (the NPs should be complex enough to capture all the information about each entity; they may be formed by relative clauses, appositions, coordinated PPs and coordinated adjectives). These NPs interact with each other by means of clauses, whose main head is the verb, as shown in Figure 1. In this figure, sentence (1) is represented by means of four entities: Peter's son, Peter, the garden and flowers; these entities interact with each other by means of two clauses whose heads are stay and catch respectively. In clause 1, entity 1 appears as the agent [2] or subject, and entity 3 as a modifier, since it is included in a prepositional phrase (PP).

(1) Peter's son stayed in the garden. He was catching flowers.

[Figure 1. IRS intuitive model: entities vs. clauses in sentence (1). The figure shows four entities (V: son, with modifier Peter; X: Peter; Y: garden; Z: flower) and two clauses (clause 1: ACTION stay, AGENT V, PP modifier "in" linking entity Y; clause 2: ACTION catch, THEME Z, AGENT the pronoun "he", linked to entity V by a coreference chain).]

Moreover, these entities interact with other entities by means of the anaphora phenomenon, which is defined by Hirst (1981) as "the device, in discourse, of making an abbreviated reference to some entity or entities, in the expectation that the receiver of the discourse will be able to dis-abbreviate it and determine the identity of the entity". For example, in Figure 1, the pronoun he allows entity 1 to interact with entity 4 through the verb of clause 2. The anaphoric relations between entities also allow more information to be captured about the entities themselves. For example, let us suppose that sentence (2) occurs after (1).

[1] This paper has been partially supported by the Spanish Government (CICYT) project number TIC2000-0664-C02-02 and (PROFIT) project number FIT-150500-2002-416.

[2] In Figure 1, the clauses store the semantic roles ACTION, AGENT, THEME and MODIFIER, which correspond to the verb, subject, object and prepositional phrases of the clause respectively.
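The entity and clause records of Figure 1 can be sketched as simple data structures. The following is a minimal illustration for sentence (1); the class and field names are ours, not SUPAR's:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Entity:
    ident: str                      # identifier used to link clauses to entities
    head: str                       # head noun of the NP
    features: Tuple[str, str, str]  # (number, gender, person)
    modifiers: List[str] = field(default_factory=list)

@dataclass
class Clause:
    sentence_id: int
    action: str                     # verb head (ACTION role)
    agent: Optional[str] = None     # entity identifier filling the AGENT role
    theme: Optional[str] = None     # entity identifier filling the THEME role
    modifiers: List[tuple] = field(default_factory=list)  # e.g. (cat, prep, entity)

# Sentence (1): "Peter's son stayed in the garden. He was catching flowers."
e1 = Entity("V", "son", ("sing", "masc", "third"), ["Peter"])
e2 = Entity("X", "Peter", ("sing", "masc", "third"))
e3 = Entity("Y", "garden", ("sing", "masc", "third"))
e4 = Entity("Z", "flower", ("plural", "fem", "third"))

c1 = Clause(1, "stay", agent="V", modifiers=[("PP", "in", "Y")])
# After resolving the pronoun "he", clause 2's AGENT is the same entity V:
c2 = Clause(2, "catch", agent="V", theme="Z")
```

Note how the coreference chain of the figure is captured simply by clause 2 reusing the identifier V of entity 1.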

In this case, the information that Peter is Jane's husband is added to the previous information about entity 2.

(2) Peter, Jane's husband, called his son.

The model proposed in this paper overcomes the drawbacks of bag-of-words approaches because it does not index independent words, but entities and their relations. In this way, our approach also improves on other IR approaches that use NLP: we do not index just contiguous words as pairs, ternary expressions or phrases, but whole entities, adding the new information that is presented at different points of the document by resolving anaphora. Therefore, if a query asks for information about Peter as the husband of Susan, this document will not be returned.

3 The implementation of the intuitive model

In order to implement the model proposed in the previous section, we have worked on the output of the computational system called Slot Unification Parser for Anaphora Resolution (SUPAR). This system, presented in Ferrández et al. (1999), resolves anaphora in both English and Spanish texts, although it can easily be extended to other languages [3]. SUPAR works on the output of a POS tagger and partially parses the text: it parses coordinated NPs, coordinated PPs, verbal phrases and conjunctions, where NPs can include relative clauses, appositions, coordinated PPs and coordinated adjectives. Conjunctions are used to split sentences into clauses.

An example of the parsing process and the detection of noun phrase entities in a sentence can be observed in (3), where 10 entities have been extracted.

(3) [[David R. Marples's]1 new book, his second on [the Chernobyl accident of [April 26, 1986]2]3]4, is [a shining example of [the best type of [non-Soviet analysis into [topics]5]6]7]8 that only recently were [absolutely taboo in [Moscow official circles]9]10.

The output of SUPAR is stored in three tables: ER, PP and CC.
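The clause-splitting step described above, splitting a POS-tagged sentence at conjunctions, can be illustrated with a simplified sketch. This is not SUPAR's actual code; the function name and the Penn-style "CC" tag for conjunctions are our assumptions:

```python
# Simplified illustration of clause splitting on a POS-tagged sentence.
# Each token is a (word, tag) pair; a coordinating conjunction (tag "CC")
# closes the current clause and starts a new one (the conjunction itself
# is dropped from the clause content).

def split_clauses(tagged_tokens):
    clauses, current = [], []
    for word, tag in tagged_tokens:
        if tag == "CC" and current:
            clauses.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        clauses.append(current)
    return clauses

sent = [("Ross", "NNP"), ("folded", "VBD"), ("his", "PRP$"),
        ("trousers", "NNS"), ("and", "CC"), ("climbed", "VBD"),
        ("into", "IN"), ("bed", "NN")]
parts = split_clauses(sent)
# Two clauses result; the second has no subject of its own, so the
# omitted subject ("Ross") can then be recovered from the first clause,
# as described later for the CC table.
```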
The ER table stores the entities and the relations between entities in the document, where each entity corresponds to a noun phrase and each relation corresponds to a clause whose head is a verb. The PP table stores the entities that appear in a prepositional phrase, together with the preposition. Finally, the CC table stores the clauses represented in Figure 1.

Table 1 shows the representation of sentence (1). Each table, as well as the document identification, also stores the frequency of each entity: e.g. the frequency of the son entity is 2, because it appears as the subject of clauses 1 and 2 (due to pronoun resolution), whereas the frequency of the remaining entities is 1. In addition to the frequency of each entity, the number of documents in which the entity appears is also stored.

With regard to the ER table, when sentence (2) occurs, the anaphora resolution process adds the modifiers Jane's husband to the entity Peter. Therefore, this entity remains as Peter [husband, Jane], and its frequency is set to 2. However, if the entity John's son appears in the document, then a new entity with the head son is indexed, because the modifiers do not match. In Table 1, the record son would then store two different entities, [[Peter], [John]], although there is only one frequency for both entities, i.e., its frequency is set to 2. Let us further suppose that the sentence Peter's son is black appears: since the verb of this clause is copulative, a new characteristic (black) is added to the entity Peter's son, so the modifiers of the son entity remain as [[black, Peter], [John]]. In conclusion, whenever the modifiers of a new entity are included in a list of modifiers of an entity previously stored in the table, the new modifiers are added to that list; otherwise, a new list of modifiers is stored.

Table 1. Tables used to represent the entities in sentence (1).

  ER:  Head      Modifiers
       son       [Peter]
       Peter     []
       garden    []
       flowers   []
       stayed    [garden, Peter, son]
       catching  [flowers, Peter, son]

  PP:  Preposition  NP
       in           garden

  CC:  Verb      Subject  Objects
       stayed    son      [garden]
       catching  son      [flowers]

With reference to the PP table, only the head of the NP is stored, and when there are several heads, a new entry is stored for each one: for example, for the PP "for books and cigars", two entries are stored, "for books" and "for cigars". In the CC table, omitted subjects are also detected thanks to the clause splitting, as in "Ross carefully folded his trousers and climbed into bed", where Ross is also included as the subject of the second clause.

In this way, different tree structures are normalized into the same entity: e.g. "Chinese communist invasion", "invasion of communist Chinese", "invasion of communists of China", "invasion of communists that are from China", "invasion of communists that are Chinese", "invasion of Chinese communists" and "invasion of Chinese that are communists" are all conflated into the entity invasion [Chin--, communist]. Moreover, thanks to the anaphora resolution process, this relation can also be captured when the entities appear separately.

The user's query is processed in a similar way, and the three tables ER, PP and CC obtained from it are compared with the tables for each document. Each table is thus used in the same way as in the vectorial model. As the similarity measure, the one proposed in Kaszkiel et al. (1999) is used, although in the ER table this measure is improved: the traditional vectorial weights are multiplied by the factors (F) in Table 2.

[3] The SUPAR system can be tested at http://supar.dlsi.ua.es/supar/. It resolves English pronominal anaphora with a 74% success rate, and Spanish pronominal anaphora with 81%.
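The merging rule stated above for the ER table (if the new entity's modifiers are included in a stored modifier list, fold new characteristics into that list; otherwise start a new list) can be sketched as follows. The table layout and function name are our illustration, not the paper's actual code:

```python
def index_entity(er_table, head, modifiers, extra=()):
    """er_table maps a head noun to {"lists": [...], "freq": int}.

    If the new entity's modifiers are included in a stored modifier list,
    any extra characteristics (e.g. from a copulative clause) are added to
    that list; otherwise a new list is stored, representing a distinct
    entity with the same head. A single frequency is kept per head."""
    rec = er_table.setdefault(head, {"lists": [], "freq": 0})
    rec["freq"] += 1
    for lst in rec["lists"]:
        if set(modifiers) <= set(lst):      # modifiers included in stored list
            for m in extra:
                if m not in lst:
                    lst.insert(0, m)
            return rec
    rec["lists"].append(list(modifiers) + list(extra))
    return rec

er = {}
index_entity(er, "son", ["Peter"])                    # Peter's son
index_entity(er, "son", ["John"])                     # John's son -> new list
index_entity(er, "son", ["Peter"], extra=["black"])   # copulative: "... is black"
# er["son"]["lists"] is now [["black", "Peter"], ["John"]], as in the text.
```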
These factors depend on the list of modifiers of the entity stored in the table (MT), the list of modifiers of the entity that appears in the user's query (MU), and the number of modifiers common to both lists (Common).

Table 2. Factors in the ER table.

  Condition                                  Factor F
  MT = MU = []                               1.3
  MT = []                                    0
  MU = []                                    2.1
  (MU ⊆ MT) and (MU ∩ MT ≠ [])               2.2 * log(Common+1)
  (MT ⊆ MU) and (MU ∩ MT ≠ [])               1.6 * log(Common+1)
  (Common ≠ 0), (MU ⊄ MT) and (MT ⊄ MU)      1.4 * log(Common+1)
  Common = 0                                 1.1
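The conditions of Table 2 are reconstructed here from a garbled rendering of the original, so the boundary cases below should be read as an approximation; the base of the logarithm is not specified in the text, and a natural logarithm is assumed:

```python
from math import log

def factor(mt, mu):
    """Weighting factor for an ER-table match (sketch of Table 2).

    mt: modifiers of the entity stored in the table (MT).
    mu: modifiers of the entity in the user's query (MU)."""
    mt, mu = set(mt), set(mu)
    common = len(mt & mu)
    if not mt and not mu:
        return 1.3                      # both entities bare heads
    if not mt:
        return 0.0                      # query asks for modifiers the table lacks
    if not mu:
        return 2.1                      # bare query head matches any modifier list
    if common == 0:
        return 1.1                      # modifiers present but disjoint
    if mu <= mt:
        return 2.2 * log(common + 1)    # all query modifiers found in the table
    if mt <= mu:
        return 1.6 * log(common + 1)
    return 1.4 * log(common + 1)        # partial overlap, neither list included

# A query entity whose modifiers all appear in the table scores highest:
f = factor(["Jane", "husband"], ["husband"])
```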

4 Evaluation

Several experiments have been carried out to measure the improvement of our proposal over the vectorial model as proposed in Kaszkiel et al. (1999). We have worked with the Cross-Language Evaluation Forum (CLEF) queries: specifically, queries 41 to 90 were used for the evaluation results presented in this section, whereas the remaining queries were used to train the system, in order to obtain the factors in Table 2. The corpus on which these experiments were carried out is the Los Angeles Times collection, a set of 113,005 articles from this English-language newspaper (approximately 425 Mb). For each query, 1,000 documents are returned. Finally, we have used only the short version of the queries (i.e., the title and description fields).

During the training process, the best factor values were obtained, specifically those in Table 2. Moreover, the best results were obtained when the stem of the lemma was used, where the lemma was obtained from the Tree-Tagger [4].

[Figure 2. Precision vs. recall for the ER, PP and CC tables and for the vectorial model, over the eleven standard recall points.]

The results obtained are shown in Figure 2 and Figure 3. Figure 2 shows the interpolated recall-precision averages obtained with each independent table (ER, PP, CC), compared with the vectorial model. It can be observed that only the ER table obtains better results than the vectorial model: specifically, a 12% increase in average precision and a 13% increase in R-Precision. The PP and CC tables always obtain low results; we should therefore improve them in order to obtain better results when they are used jointly with the ER table. Figure 3 shows the precision at N documents when only the ER table is used; the improvement over the vectorial model when only 5 documents are returned is worth remarking.
[4] http://www.ims.uni-stuttgart.de/projekte/corplex/treetagger/decisiontreetagger.html
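The interpolated recall-precision averages of Figure 2 follow the standard TREC-style computation, which can be sketched per query as follows; this is shown for illustration and is not the authors' own evaluation code:

```python
def interpolated_precision(ranked_relevance, num_relevant):
    """11-point interpolated recall-precision for one query.

    ranked_relevance: booleans over the ranked result list (True = relevant).
    num_relevant: total number of relevant documents for the query."""
    precisions, recalls, hits = [], [], 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / num_relevant)
    points = []
    for r in (i / 10 for i in range(11)):
        # Interpolated precision at recall r: the maximum precision
        # observed at any recall level >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        points.append(max(candidates) if candidates else 0.0)
    return points

# Three relevant documents retrieved at ranks 1, 3 and 6 (3 relevant in total):
pts = interpolated_precision([True, False, True, False, False, True], 3)
```

Averaging these eleven points over all queries gives curves such as those in Figure 2.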

[Figure 3. Precision at N documents (N = 5, 10, 15, 20, 30) for the ER-based IRS and for the vectorial model.]

5 Conclusion

In this paper, we have proposed a novel indexing model for IR. This model tries to overcome the problems of traditional bag-of-words approaches by indexing the entities in the documents and the relations between these entities. The entities are obtained from the partial parsing of the documents, and they interact with each other by means of clauses and anaphoric relations.

In the implementation of this model, we have used three tables, ER, PP and CC, in which we store the entities and their relations, the prepositional phrases, and the clauses of the documents. These tables are used in a similar way to the traditional vectorial model, with the similarity measure proposed in Kaszkiel et al. (1999).

The model has been evaluated on the short CLEF queries (queries 41 to 90 for evaluation, the preceding ones for training) and on the Los Angeles Times collection. The results have been compared with the vectorial model: a 12% increase in average precision and a 13% increase in R-Precision are obtained when only the ER table is used.

As future work, we will try to improve the PP and CC tables in order to combine them with the ER table. Moreover, we expect to evaluate this model on the narrative version of the queries, where we expect better results due to their length (i.e., a greater number of clauses and noun phrases), since long and descriptive queries usually respond well to NLP, while terse one-sentence search directives show hardly any improvement.

6 References

Ferrández, A., Palomar, M. and Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14(3/4), 191-216.

Hirst, G. (1981). Anaphora in Natural Language Understanding. Berlin: Springer-Verlag.

Kaszkiel, M., Zobel, J. and Sacks-Davis, R. (1999). Efficient passage ranking for document databases. ACM Transactions on Information Systems, 17(4), 406-439.

Strzalkowski, T. (1999). Natural Language Information Retrieval. Kluwer Academic Publishers.

Zhai, C., Tong, X., Milic-Frayling, N. and Evans, D. A. (1996). Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report. In Proceedings of the Fifth Text REtrieval Conference (TREC-5).