An Entity-Relation Approach to Information Retrieval


Antonio Ferrández, Julio Martínez and Jesús Peral
Dept. of Languages and Information Systems, University of Alicante
Carretera San Vicente S/N, 03080 Alicante, SPAIN
{antonio, jmartinez, jperal}@dlsi.ua.es

Abstract

In this paper, a novel indexing model for IR is proposed in order to overcome the problems of traditional bag-of-words approaches, by indexing the entities in the documents and the relations between those entities, obtained through clause splitting and anaphora resolution. The model has been evaluated on the Los Angeles Times collection. The results have been compared with the vectorial model, obtaining a 12% increase in average precision and a 13% increase in R-Precision.

1 Introduction

In the literature, Natural Language Processing (NLP) techniques have been reported to show no significant improvement in retrieval performance, although it seems clear that they could overcome the inadequacies of purely quantitative methods of text Information Retrieval (IR): statistical full-text retrieval and bag-of-words representations. As an example of attempts to overcome these inadequacies, the work of Strzalkowski can be consulted (e.g. Strzalkowski, 1999). As he puts it, one possible explanation is that the syntactic analysis is just not going far enough; or, perhaps more appropriately, that the semantic uniformity predictions made on the basis of syntactic structures are less reliable than we had hoped. Of course, the relatively low quality of parsing may be a major problem, although there is little evidence to support that.

In this paper, we propose a novel IR model that incorporates NLP techniques such as POS tagging and partial parsing to improve on traditional bag-of-words representations. This model indexes entities and the relations between these entities.
These relations are based on the clause splitting of the document and on the resolution of the anaphora phenomenon between these entities. In this way, we improve on other approaches that use this kind of knowledge, such as the work of Zhai et al. (1996), in which only sets of nouns and/or adjectives are indexed through the vector space retrieval model, so the relations between entities are not considered.

In the following section, the proposed model is presented in its intuitive form. This is followed by its implementation in a computational system, which is finally evaluated on the Los Angeles Times collection and compared with the vectorial model.

2 The intuitive model

The model proposed in this paper tries to overcome the problems of traditional bag-of-words approaches by extracting the entities in the documents. The entities are obtained from the syntactic knowledge of the document, i.e., from the complex noun phrases (NPs) that are parsed (the NPs should be complex enough to capture all the information about each entity; they may be formed by relative clauses, appositions, coordinated PPs and coordinated adjectives). These NPs interact with each other by means of clauses, whose main head is the verb, as shown in Figure 1. In this figure, sentence (1) is represented by means of four entities: Peter's son, Peter, the garden and flowers; these entities interact with each other by means of two clauses whose heads are stay and catch respectively. In clause 1, entity 1 appears as the agent [2] or subject, and entity 3 as a modifier, since it is included in a prepositional phrase (PP).

(1) Peter's son stayed in the garden. He was catching flowers.

[Figure 1. IRS intuitive model: entities vs. clauses in sentence (1). The figure shows four entities (V: son, with modifier Peter; X: Peter; Y: garden; Z: flower) and two clauses (clause 1: ACTION stay, AGENT V, PP modifier "in" linking entity Y; clause 2: ACTION catch, THEME Z, AGENT the pronoun "he", linked to entity V by a coreference chain).]

Moreover, these entities interact with other entities by means of the anaphora phenomenon, which is defined by Hirst (1981) as "the device, in discourse, of making an abbreviated reference to some entity or entities, in the expectation that the receiver of the discourse will be able to dis-abbreviate it and determine the identity of the entity". For example, in Figure 1, the pronoun he allows entity 1 to interact with entity 4 through the verb of clause 2. The anaphoric relations between entities also allow more information to be captured about the entities themselves. For example, let us suppose that sentence (2) occurs after (1).

[1] This paper has been partially supported by the Spanish Government (CICYT) project number TIC2000-0664-C02-02 and (PROFIT) project number FIT-150500-2002-416.

[2] In Figure 1, the clauses store the semantic roles ACTION, AGENT, THEME and MODIFIER, which correspond to the verb, subject, object and prepositional phrases of the clause respectively.
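The entity and clause records of Figure 1 can be sketched as simple data structures. The following is a minimal illustration for sentence (1); the class and field names are ours, not SUPAR's:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Entity:
    ident: str                      # identifier used to link clauses to entities
    head: str                       # head noun of the NP
    features: Tuple[str, str, str]  # (number, gender, person)
    modifiers: List[str] = field(default_factory=list)

@dataclass
class Clause:
    sentence_id: int
    action: str                     # verb head (ACTION role)
    agent: Optional[str] = None     # entity identifier filling the AGENT role
    theme: Optional[str] = None     # entity identifier filling the THEME role
    modifiers: List[tuple] = field(default_factory=list)  # e.g. (cat, prep, entity)

# Sentence (1): "Peter's son stayed in the garden. He was catching flowers."
e1 = Entity("V", "son", ("sing", "masc", "third"), ["Peter"])
e2 = Entity("X", "Peter", ("sing", "masc", "third"))
e3 = Entity("Y", "garden", ("sing", "masc", "third"))
e4 = Entity("Z", "flower", ("plural", "fem", "third"))

c1 = Clause(1, "stay", agent="V", modifiers=[("PP", "in", "Y")])
# After resolving the pronoun "he", clause 2's AGENT is the same entity V:
c2 = Clause(2, "catch", agent="V", theme="Z")
```

Note how the coreference chain of the figure is captured simply by clause 2 reusing the identifier V of entity 1.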

In this case, the information that Peter is Jane's husband is added to the previous information about entity 2.

(2) Peter, Jane's husband, called his son.

The model proposed in this paper overcomes the drawbacks of bag-of-words approaches because it does not index independent words, but entities and their relations. In this way, our approach also improves on other IR approaches that use NLP: we do not index just contiguous words as pairs, ternary expressions or phrases, but whole entities, adding the new information that is presented at different points of the document by resolving anaphora. Therefore, if a query asks for information about Peter as the husband of Susan, this document will not be returned.

3 The implementation of the intuitive model

In order to implement the model proposed in the previous section, we have worked on the output of the computational system called Slot Unification Parser for Anaphora Resolution (SUPAR). This system, presented in Ferrández et al. (1999), resolves anaphora in both English and Spanish texts, although it can easily be extended to other languages [3]. SUPAR works on the output of a POS tagger and partially parses the text: it parses coordinated NPs, coordinated PPs, verbal phrases and conjunctions, where NPs can include relative clauses, appositions, coordinated PPs and coordinated adjectives. Conjunctions are used to split sentences into clauses.

An example of the parsing process and the detection of noun phrase entities in a sentence can be observed in (3), where 10 entities have been extracted.

(3) [[David R. Marples's]1 new book, his second on [the Chernobyl accident of [April 26, 1986]2]3]4, is [a shining example of [the best type of [non-Soviet analysis into [topics]5]6]7]8 that only recently were [absolutely taboo in [Moscow official circles]9]10.

The output of SUPAR is stored in three tables: ER, PP and CC.
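The clause-splitting step described above, splitting a POS-tagged sentence at conjunctions, can be illustrated with a simplified sketch. This is not SUPAR's actual code; the function name and the Penn-style "CC" tag for conjunctions are our assumptions:

```python
# Simplified illustration of clause splitting on a POS-tagged sentence.
# Each token is a (word, tag) pair; a coordinating conjunction (tag "CC")
# closes the current clause and starts a new one (the conjunction itself
# is dropped from the clause content).

def split_clauses(tagged_tokens):
    clauses, current = [], []
    for word, tag in tagged_tokens:
        if tag == "CC" and current:
            clauses.append(current)
            current = []
        else:
            current.append((word, tag))
    if current:
        clauses.append(current)
    return clauses

sent = [("Ross", "NNP"), ("folded", "VBD"), ("his", "PRP$"),
        ("trousers", "NNS"), ("and", "CC"), ("climbed", "VBD"),
        ("into", "IN"), ("bed", "NN")]
parts = split_clauses(sent)
# Two clauses result; the second has no subject of its own, so the
# omitted subject ("Ross") can then be recovered from the first clause,
# as described later for the CC table.
```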
The ER table stores the entities and the relations between entities in the document, where each entity corresponds to a noun phrase and each relation corresponds to a clause whose head is a verb. The PP table stores the entities that appear in a prepositional phrase, together with the preposition. Finally, the CC table stores the clauses represented in Figure 1.

Table 1 shows the representation of sentence (1). Each table, as well as the document identification, also stores the frequency of each entity: e.g. the frequency of the son entity is 2, because it appears as the subject of clauses 1 and 2 (due to pronoun resolution), whereas the frequency of the remaining entities is 1. In addition to the frequency of each entity, the number of documents in which the entity appears is also stored.

With regard to the ER table, when sentence (2) occurs, the anaphora resolution process adds the modifiers Jane's husband to the entity Peter. Therefore, this entity remains as Peter [husband, Jane], and its frequency is set to 2. However, if the entity John's son appears in the document, then a new entity with the head son is indexed, because the modifiers do not match. In Table 1, the record son would then store two different entities, [[Peter], [John]], although there is only one frequency for both entities, i.e., its frequency is set to 2. Let us further suppose that the sentence Peter's son is black appears: since the verb of this clause is copulative, a new characteristic (black) is added to the entity Peter's son, so the modifiers of the son entity remain as [[black, Peter], [John]]. In conclusion, whenever the modifiers of a new entity are included in a list of modifiers of an entity previously stored in the table, the new modifiers are added to that list; otherwise, a new list of modifiers is stored.

Table 1. Tables used to represent the entities in sentence (1).

  ER:  Head      Modifiers
       son       [Peter]
       Peter     []
       garden    []
       flowers   []
       stayed    [garden, Peter, son]
       catching  [flowers, Peter, son]

  PP:  Preposition  NP
       in           garden

  CC:  Verb      Subject  Objects
       stayed    son      [garden]
       catching  son      [flowers]

With reference to the PP table, only the head of the NP is stored, and when there are several heads, a new entry is stored for each one: for example, for the PP "for books and cigars", two entries are stored, "for books" and "for cigars". In the CC table, omitted subjects are also detected thanks to the clause splitting, as in "Ross carefully folded his trousers and climbed into bed", where Ross is also included as the subject of the second clause.

In this way, different tree structures are normalized into the same entity: e.g. "Chinese communist invasion", "invasion of communist Chinese", "invasion of communists of China", "invasion of communists that are from China", "invasion of communists that are Chinese", "invasion of Chinese communists" and "invasion of Chinese that are communists" are all conflated into the entity invasion [Chin--, communist]. Moreover, thanks to the anaphora resolution process, this relation can also be captured when the entities appear separately.

The user's query is processed in a similar way, and the three tables ER, PP and CC obtained from it are compared with the tables for each document. Each table is thus used in the same way as in the vectorial model. As the similarity measure, the one proposed in Kaszkiel et al. (1999) is used, although in the ER table this measure is improved: the traditional vectorial weights are multiplied by the factors (F) in Table 2.

[3] The SUPAR system can be tested at http://supar.dlsi.ua.es/supar/. It resolves English pronominal anaphora with a 74% success rate, and Spanish pronominal anaphora with 81%.
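The merging rule stated above for the ER table (if the new entity's modifiers are included in a stored modifier list, fold new characteristics into that list; otherwise start a new list) can be sketched as follows. The table layout and function name are our illustration, not the paper's actual code:

```python
def index_entity(er_table, head, modifiers, extra=()):
    """er_table maps a head noun to {"lists": [...], "freq": int}.

    If the new entity's modifiers are included in a stored modifier list,
    any extra characteristics (e.g. from a copulative clause) are added to
    that list; otherwise a new list is stored, representing a distinct
    entity with the same head. A single frequency is kept per head."""
    rec = er_table.setdefault(head, {"lists": [], "freq": 0})
    rec["freq"] += 1
    for lst in rec["lists"]:
        if set(modifiers) <= set(lst):      # modifiers included in stored list
            for m in extra:
                if m not in lst:
                    lst.insert(0, m)
            return rec
    rec["lists"].append(list(modifiers) + list(extra))
    return rec

er = {}
index_entity(er, "son", ["Peter"])                    # Peter's son
index_entity(er, "son", ["John"])                     # John's son -> new list
index_entity(er, "son", ["Peter"], extra=["black"])   # copulative: "... is black"
# er["son"]["lists"] is now [["black", "Peter"], ["John"]], as in the text.
```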
These factors depend on the list of modifiers of the entity stored in the table (MT), the list of modifiers of the entity that appears in the user's query (MU), and the number of modifiers common to both lists (Common).

Table 2. Factors in the ER table.

  Condition                                  Factor F
  MT = MU = []                               1.3
  MT = []                                    0
  MU = []                                    2.1
  (MU ⊆ MT) and (MU ∩ MT ≠ [])               2.2 * log(Common+1)
  (MT ⊆ MU) and (MU ∩ MT ≠ [])               1.6 * log(Common+1)
  (Common ≠ 0), (MU ⊄ MT) and (MT ⊄ MU)      1.4 * log(Common+1)
  Common = 0                                 1.1
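The conditions of Table 2 are reconstructed here from a garbled rendering of the original, so the boundary cases below should be read as an approximation; the base of the logarithm is not specified in the text, and a natural logarithm is assumed:

```python
from math import log

def factor(mt, mu):
    """Weighting factor for an ER-table match (sketch of Table 2).

    mt: modifiers of the entity stored in the table (MT).
    mu: modifiers of the entity in the user's query (MU)."""
    mt, mu = set(mt), set(mu)
    common = len(mt & mu)
    if not mt and not mu:
        return 1.3                      # both entities bare heads
    if not mt:
        return 0.0                      # query asks for modifiers the table lacks
    if not mu:
        return 2.1                      # bare query head matches any modifier list
    if common == 0:
        return 1.1                      # modifiers present but disjoint
    if mu <= mt:
        return 2.2 * log(common + 1)    # all query modifiers found in the table
    if mt <= mu:
        return 1.6 * log(common + 1)
    return 1.4 * log(common + 1)        # partial overlap, neither list included

# A query entity whose modifiers all appear in the table scores highest:
f = factor(["Jane", "husband"], ["husband"])
```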

4 Evaluation

Several experiments have been carried out to measure the improvement of our proposal over the vectorial model as proposed in Kaszkiel et al. (1999). We have worked with the Cross-Language Evaluation Forum (CLEF) queries: specifically, queries 41 to 90 were used for the evaluation results presented in this section, whereas the remaining queries were used to train the system, in order to obtain the factors in Table 2. The corpus on which these experiments were carried out is the Los Angeles Times collection, a set of 113,005 articles from this English-language newspaper (approximately 425 Mb). For each query, 1,000 documents are returned. Finally, we have used only the short version of the queries (i.e., the title and description fields).

During the training process, the best factor values were obtained, specifically those in Table 2. Moreover, the best results were obtained when the stem of the lemma was used, where the lemma was obtained from the Tree-Tagger [4].

[Figure 2. Precision vs. recall for the ER, PP and CC tables and for the vectorial model, over the eleven standard recall points.]

The results obtained are shown in Figure 2 and Figure 3. Figure 2 shows the interpolated recall-precision averages obtained with each independent table (ER, PP, CC), compared with the vectorial model. It can be observed that only the ER table obtains better results than the vectorial model: specifically, a 12% increase in average precision and a 13% increase in R-Precision. The PP and CC tables always obtain low results; we should therefore improve them in order to obtain better results when they are used jointly with the ER table. Figure 3 shows the precision at N documents when only the ER table is used; the improvement over the vectorial model when only 5 documents are returned is worth remarking.
[4] http://www.ims.uni-stuttgart.de/projekte/corplex/treetagger/decisiontreetagger.html
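The interpolated recall-precision averages of Figure 2 follow the standard TREC-style computation, which can be sketched per query as follows; this is shown for illustration and is not the authors' own evaluation code:

```python
def interpolated_precision(ranked_relevance, num_relevant):
    """11-point interpolated recall-precision for one query.

    ranked_relevance: booleans over the ranked result list (True = relevant).
    num_relevant: total number of relevant documents for the query."""
    precisions, recalls, hits = [], [], 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
            recalls.append(hits / num_relevant)
    points = []
    for r in (i / 10 for i in range(11)):
        # Interpolated precision at recall r: the maximum precision
        # observed at any recall level >= r.
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        points.append(max(candidates) if candidates else 0.0)
    return points

# Three relevant documents retrieved at ranks 1, 3 and 6 (3 relevant in total):
pts = interpolated_precision([True, False, True, False, False, True], 3)
```

Averaging these eleven points over all queries gives curves such as those in Figure 2.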

[Figure 3. Precision at N documents (N = 5, 10, 15, 20, 30) for the ER-based IRS and for the vectorial model.]

5 Conclusion

In this paper, we have proposed a novel indexing model for IR. This model tries to overcome the problems of traditional bag-of-words approaches by indexing the entities in the documents and the relations between these entities. The entities are obtained from the partial parsing of the documents, and they interact with each other by means of clauses and anaphoric relations.

In the implementation of this model, we have used three tables, ER, PP and CC, in which we store the entities and their relations, the prepositional phrases, and the clauses of the documents. These tables are used in a similar way to the traditional vectorial model, with the similarity measure proposed in Kaszkiel et al. (1999).

The model has been evaluated on the short CLEF queries (queries 41 to 90 for evaluation, the preceding ones for training) and on the Los Angeles Times collection. The results have been compared with the vectorial model: a 12% increase in average precision and a 13% increase in R-Precision are obtained when only the ER table is used.

As future work, we will try to improve the PP and CC tables in order to combine them with the ER table. Moreover, we expect to evaluate this model on the narrative version of the queries, where we expect better results due to their length (i.e., a greater number of clauses and noun phrases), since long and descriptive queries usually respond well to NLP, while terse one-sentence search directives show hardly any improvement.

6 References

Ferrández, A., Palomar, M. and Moreno, L. (1999). An empirical approach to Spanish anaphora resolution. Machine Translation, 14(3/4), 191-216.

Hirst, G. (1981). Anaphora in Natural Language Understanding. Berlin: Springer-Verlag.

Kaszkiel, M., Zobel, J. and Sacks-Davis, R. (1999). Efficient passage ranking for document databases. ACM Transactions on Information Systems, 17(4), 406-439.

Strzalkowski, T. (1999). Natural Language Information Retrieval. Kluwer Academic Publishers.

Zhai, C., Tong, X., Milic-Frayling, N. and Evans, D. A. (1996). Evaluation of Syntactic Phrase Indexing - CLARIT NLP Track Report. In Proceedings of the Fifth Text REtrieval Conference (TREC-5).