
Combining and selecting characteristics of information use

Ian Ruthven*1, Mounia Lalmas2 and Keith van Rijsbergen3
1 Department of Computer and Information Sciences, University of Strathclyde, Glasgow, G1 1XH
2 Department of Computer Science, Queen Mary, University of London, London, E1 4NS
3 Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ
Ian.Ruthven@cis.strath.ac.uk, mounia@cs.qmul.ac.uk, keith@cs.gla.ac.uk

* Corresponding author. This work was completed while the first author was at the University of Glasgow.

Abstract

In this paper we report on a series of experiments designed to investigate the combination of term and document weighting functions in Information Retrieval. We describe a series of weighting functions, each of which is based on how information is used within documents and collections, and use these weighting functions in two types of experiments: one based on combination of evidence for ad-hoc retrieval, the other based on selective combination of evidence within a relevance feedback situation. We discuss the difficulties involved in predicting good combinations of evidence for ad-hoc retrieval, and suggest the factors that may lead to the success or failure of combination. We also demonstrate how, in a relevance feedback situation, the relevance assessments can provide a good indication of how evidence should be selected for query term weighting. The use of relevance information to guide the combination process is shown to reduce the variability inherent in combination of evidence.

1 Introduction

Most relevance feedback (RF) algorithms attempt to bring a query closer to the user's information need by reweighting or modifying the terms in a query. The implicit assumption behind these algorithms is that we can find an optimal combination of weighted terms to describe the user's information need at the current stage in a search. However, relevance as a user judgement is not necessarily dictated only by the presence or absence of terms in a document. Rather, relevance is a factor of what concepts the terms represent, the relations between these concepts, how users interpret them and how they relate to the information in the document.

From studies such as those conducted by Barry and Schamber [BS98], it is clear that current models of RF, although successful at improving recall-precision, are not very sophisticated in expressing what makes a document relevant to a user. Denos et al. [DBM97], for example, make the good point that although users can make explicit judgements on why documents are relevant, most systems cannot use this information to improve a search. Not only are users' judgements affected by a variety of factors, but they are based on the document text. RF algorithms, on the other hand, are typically based on a representation of a text and only consider frequency information or the presence or absence of terms in documents. These algorithms do not look deeper to see what it is about terms that indicates relevance; they ignore information on how the term is used within documents. For example, a document may only be relevant if the terms appear in a certain context, if certain combinations of terms occur, or if the main topic of the document is important. Extending feedback algorithms to incorporate the usage of a term within documents would not only allow more precise querying by the user but also allow relevance feedback algorithms to adapt more subtly to users' relevance judgements. In this paper we investigate how incorporating more information on the usage of terms can improve retrieval effectiveness.

We examine a series of term and document weighting functions in combination and in selective combination: selecting which characteristics of a term (e.g. frequency, context, distribution within documents) or document (complexity, ratio of useful information) should be used to retrieve documents. This research extends the initial study presented in [RL99], which demonstrated that a subset of the weighting functions used in this paper were successful for precision enhancement. In particular we investigate the role of combination of evidence in RF.

The following sections outline how we describe term and document characteristics (section 2), the data we use in our experiments (section 3), our experimental methodology (section 4), and the results of three sets of experiments. The first set of experiments examines each characteristic as a single retrieval function (section 5). The second set looks at combining evidence of term use in standard retrieval (section 6), and the third set examines selecting evidence in RF (sections 7 and 8). We summarise our findings in section 9.

2 Term and document characteristics

In this section we outline five alternative ways of describing term importance in a document or collection - or term characteristics. Three of these are standard term weighting functions (idf, tf and noise); the other two were developed for this work.

- inverse document frequency (idf), based on how often a term appears within a collection, described in section 2.1
- noise, also based on how often a term appears within a collection but based on within-document frequency, section 2.2
- term frequency (tf), based on how often a term appears within a document, section 2.3
- thematic nature, or theme, based on how a term is distributed within a document, section 2.4
- context, based on the proximity of one query term to another query term within the same document, section 2.5

In addition, we introduce two document characteristics. These describe some aspect of a document that differentiates it from other documents:

- specificity, based on how many unique terms appear in a document, section 2.6
- information-noise, based on the proportion of useful to non-useful content within a document, section 2.7

These characteristics were chosen to be representative of general weighting schemes - those that represent information on general term appearance, e.g. idf, tf - and specific weighting schemes - those that represent specific features of how terms appear in documents, e.g. theme.

2.1 idf

Inverse document frequency, or idf, [SJ72] is a standard IR term weighting function that measures the infrequency, or rarity, of a term's occurrence within a document collection. The less likely a term is to appear in a collection, the better it is likely to be at discriminating relevant from irrelevant documents. In these experiments we measure idf by the equation shown in equation 1.

idf(t) = \log(N / n) + 1

Equation 1: inverse document frequency (idf), where n is the number of documents containing the indexing term t and N is the number of documents in the collection.

2.2 noise

The second term characteristic we investigated was the noise characteristic discussed in [Sal83, Har86], equation 2. The noise characteristic gives a measure of how important a term is within a collection but, unlike idf, noise is based on within-document frequency.

noise(t_k) = \sum_{i=1}^{N} \frac{frequency_{ik}}{total\_frequency_k} \cdot \log \frac{total\_frequency_k}{frequency_{ik}}

Equation 2: noise, where N is the number of documents in the collection, frequency_{ik} is the frequency of term k in document i, and total_frequency_k is the total frequency of term k in the collection.

From equation 2, if a term appears in only one document, it receives a noise score of zero. Terms that appear more commonly throughout a collection receive a higher noise value. The noise value is therefore inversely proportional to its discrimination power. The noise characteristic as defined here thus requires normalisation, [Har86], to ensure that the noise value of a term reflects its discriminatory power. To normalise the noise score, we subtracted the noise score of a term from the maximum noise score. The normalised noise characteristic gives a maximum noise score to a term if all its occurrences appear in one document and the lowest noise score if all occurrences of the term appear in different documents.

2.3 tf

Including information about how often a term occurs in a document - term frequency (tf) information - has often been shown to increase retrieval performance, e.g. [Har92]. For these experiments we used the following formula:

tf_d(t) = \log(occurrences_t(d) + 1) / \log(occurrences_{total}(d))

Equation 3: term frequency (tf), where occurrences_t(d) is the number of occurrences of term t in document d, and occurrences_total(d) is the total number of term occurrences in document d.
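
For concreteness, the sketch below shows how the three standard characteristics (equations 1-3) might be computed from raw frequency counts. The function names and data layout are illustrative rather than taken from the paper, and equation 1 is assumed to use the ratio N/n, which matches the stated behaviour that rarer terms score higher.

```python
import math

def idf(n_docs_containing_term, n_docs_in_collection):
    # Equation 1 (as reconstructed): idf(t) = log(N / n) + 1
    return math.log(n_docs_in_collection / n_docs_containing_term) + 1

def noise(per_document_frequencies):
    # Equation 2: sum_i (f_ik / F_k) * log(F_k / f_ik), where f_ik is the frequency
    # of term k in document i and F_k its total frequency in the collection.
    total = sum(per_document_frequencies)
    return sum((f / total) * math.log(total / f)
               for f in per_document_frequencies if f > 0)

def normalised_noise(per_document_frequencies_by_term):
    # Section 2.2 normalisation: subtract each term's noise score from the maximum
    # noise score, so a term concentrated in a single document scores highest.
    raw = {t: noise(freqs) for t, freqs in per_document_frequencies_by_term.items()}
    max_noise = max(raw.values())
    return {t: max_noise - score for t, score in raw.items()}

def tf(occurrences_of_term, total_occurrences_in_document):
    # Equation 3: tf_d(t) = log(occurrences_t(d) + 1) / log(occurrences_total(d));
    # assumes the document contains more than one term occurrence in total.
    return math.log(occurrences_of_term + 1) / math.log(total_occurrences_in_document)
```

Under these definitions a term that occurs in every document receives a low idf, while a term whose occurrences all fall in a single document receives the maximum normalised noise score, as described above.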

2.4 theme

Previous work by, for example, Hearst and Plaunt [HP93] and Paradis and Berrut [PB96] demonstrated that taking into account the topical or thematic nature of documents can improve retrieval effectiveness. Hearst and Plaunt present a method specifically for long documents, whereas Paradis and Berrut's method is based on precise conceptual indexing of documents. We present a simple term-based alternative based on the distribution of term occurrences within a document. This is based on the assumption that the less evenly distributed the occurrences of a term are within a document, the more likely the term is to correspond to a localised discussion in the document, e.g. a topic in one section of the document only (Figure 1, left-hand side). Conversely, if the term's occurrences are more evenly spread throughout the document, then we may assume that the term is somehow related to the main topic of the document (Figure 1, right-hand side). Unlike Hearst and Plaunt we do not split the document into topics and assign a sub- or main-topic classification. Instead we define a theme value of a term, which is based on the likelihood of the term being a main topic.

Figure 1: Localised discussion of term X (left-hand side), general discussion of term X (right-hand side).

The algorithm we developed for this is shown in equation 4. The value is based on the difference between the position of each occurrence of a term and its expected position. Table 1 gives a short example for a document of 1000 words containing five occurrences of term t. First, we calculate whether the first occurrence of term t occurs further into the document than we would expect, based on the expected distribution (first_d(t) in equation 4). Next, we calculate whether the last occurrence of the term appears further from the end of the document than we would expect (last_d(t) in equation 4). For the remaining occurrences we calculate the difference between the expected position of an occurrence, based on the actual position of the preceding occurrence and the expected difference between two occurrences (the sum \sum_{i=2}^{n-1} | predicted\_position_i(t) - actual\_position_i(t) | in equation 4).

theme_d(t) = (length_d - difference_d(t)) / length_d

where

difference_d(t) = first_d(t) + last_d(t) + \sum_{i=2}^{n-1} | predicted\_position_i(t) - actual\_position_i(t) |

first_d(t) = 0 if actual\_position_1(t) \le distribution_d(t); otherwise actual\_position_1(t) - distribution_d(t)

last_d(t) = 0 if length_d - actual\_position_n(t) \le distribution_d(t); otherwise length_d - (actual\_position_n(t) + distribution_d(t))

predicted\_position_i(t) = actual\_position_{i-1}(t) + distribution_d(t)

distribution_d(t) = length_d / occurrences_d(t)

Equation 4: theme characteristic, where distribution_d(t) is the expected distribution of term t in document d, assuming all occurrences of t are equally distributed; predicted_position_i(t) is the expected position of the ith occurrence of term t; actual_position_i(t) is the actual position of the ith occurrence; occurrences_d(t) is the number of occurrences of term t in document d; and n is the number of occurrences of term t in the document.

Table 1: Example calculation of the theme value for a term (a document of length 1000 words containing five occurrences of term t, giving an expected distribution of 200 and a theme value of 0.3).

We then sum these values to get a measure of the difference between the expected positions of the term occurrences and their actual positions. The greater the difference between where term occurrences appear and where we would expect them to appear, the smaller the theme value for the term. The smaller the difference, the larger the theme value for the term.
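
As a worked illustration of equation 4, the sketch below computes a theme value from the word positions of a term's occurrences. The positions are assumed to be 1-based word offsets, the guard for single-occurrence terms follows the observation in section 5.2 that theme gives such terms a zero weight, and all names are illustrative.

```python
def theme(positions, doc_length):
    """Equation 4 (as reconstructed): theme_d(t) = (length_d - difference_d(t)) / length_d.

    positions  -- sorted word positions of term t in document d
    doc_length -- number of words in document d
    """
    n = len(positions)
    if n < 2:
        return 0.0                               # single occurrences get zero theme weight
    distribution = doc_length / n                # expected gap between occurrences
    # penalty if the first occurrence falls later than expected
    first = max(0.0, positions[0] - distribution)
    # penalty if the last occurrence falls further from the end than expected
    last = max(0.0, doc_length - (positions[-1] + distribution))
    # penalty for middle occurrences that stray from their predicted positions
    middle = sum(abs((positions[i - 1] + distribution) - positions[i])
                 for i in range(1, n - 1))
    difference = first + last + middle
    return (doc_length - difference) / doc_length

# a 1000-word document with five occurrences of a term
print(theme([100, 300, 500, 700, 900], 1000))   # evenly spread occurrences
print(theme([100, 110, 120, 130, 140], 1000))   # localised occurrences
```

The first call returns 1.0, since the occurrences sit exactly at their expected positions, while the second returns a much lower value; such raw values are later rescaled into a common range (section 2.8).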

2.5 context

There are various ways in which one might incorporate information about the context of a query term. For example, we might rely on co-occurrence information [VRHP81], information about phrases [Lew92], or information about the logical structures, e.g. sentences, in which the term appears [TS98]. We define the importance of context to a query term as being measured by its distance from the nearest query term, relative to the average expected distribution of all query terms in the document.

context_d(t) = (distribution_d(q) - min_d(t)) / distribution_d(q)

min_d(t) = \min_{t' \ne t} | position_d(t) - position_d(t') |

distribution_d(q) = length_d / occurrences_d(q)

Equation 5: context characteristic for term t in document d, where distribution_d(q) is the expected distribution of all query terms in the document, assuming terms are distributed equally; position_d(t) is the position of term t; and min_d(t) is the minimum difference from any occurrence of term t to another, different query term.

2.6 specificity

The first document characteristic we propose is the specificity characteristic, which is related to idf. The idf characteristic measures the infrequency of a term's occurrence within a document collection; the less likely a term is to appear in a document, the better it is likely to be at discriminating relevant from irrelevant documents. However, idf does not consider the relative discriminatory power of other terms in the document. If a document contains a higher proportion of terms with a high idf, it may be more difficult to read, e.g. if it contains a lot of technical terms. On the other hand, a document containing a lot of terms with very low idf values may contain too few information-bearing words. We propose the specificity characteristic as a measure of how specialised a document's contents are, relative to the other documents in the collection. This is a very simple measure as we do not take into account the domain of the document or external knowledge sources, which would allow us to represent the complexity of the document based on its semantic content. The specificity characteristic is a document characteristic, giving a score to an entire document rather than to individual terms. It is measured by the sum of the idf values of each term in the document, divided by the number of terms in the document, giving an average idf value for the document, equation 6.

specificity(d) = \frac{1}{n} \sum_{i=1}^{n} idf(i)

Equation 6: specificity document characteristic of document d, where n is the number of terms in document d.

2.7 information-to-noise

The specificity characteristic measures the complexity of the document based on idf values. An alternative measure is the information-to-noise ratio [ZG00], abbreviated to info_noise. This is calculated as the number of tokens after processing (stemming and stopping) of the document divided by the length of the document before stopping and stemming, equation 7.

info\_noise(d) = processed\_length(d) / length(d)

Equation 7: info_noise document characteristic of document d, where processed_length(d) is the number of terms in document d after stopping and stemming, and length(d) is the number of terms in document d before stopping and stemming.

info_noise, as described in [ZG00], measures the proportion of useful to non-useful information content within a document.

2.8 Summary

The idf and noise characteristics give values to a term depending on its importance within a collection, the tf and theme characteristics give values depending on the term's importance within an individual document, and the specificity and info_noise characteristics give values to individual documents based on their content. The context characteristic gives a value to a term based on its proximity to another query term in the same document. Each of the term characteristics can be used to differentiate documents based on how terms are used within the documents, and the document characteristics allow differentiation of documents based on their content. The document characteristics also allow RF algorithms to base feedback decisions on the document taken as a whole, rather than only on individual components of the document.

Each of the algorithms that calculate the characteristic values gives scores in different ranges. In our experiments we scale all values of the characteristics to fall within the same range, 0-50, to ensure that we are working with comparable values for each characteristic. In the next section we outline the data we use in our experiments.
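
The remaining characteristics and the rescaling step of section 2.8 can be sketched in the same style. The min-max form of the rescaling is an assumption (the paper only states that values are scaled into the range 0-50), and the zero score returned by context when no other query term is present mirrors the behaviour noted in section 5.2; names are illustrative.

```python
def context(term_positions, other_query_term_positions, doc_length, query_term_occurrences):
    """Equation 5: context_d(t) = (distribution_d(q) - min_d(t)) / distribution_d(q).
    Returns 0 when no other query term occurs in the document."""
    if not term_positions or not other_query_term_positions:
        return 0.0
    distribution_q = doc_length / query_term_occurrences
    min_gap = min(abs(p - o) for p in term_positions for o in other_query_term_positions)
    return (distribution_q - min_gap) / distribution_q

def specificity(idf_values_of_document_terms):
    # Equation 6: average idf of the terms in the document
    return sum(idf_values_of_document_terms) / len(idf_values_of_document_terms)

def info_noise(processed_length, raw_length):
    # Equation 7: proportion of tokens surviving stopping and stemming
    return processed_length / raw_length

def scale_to_range(values, low=0.0, high=50.0):
    # Section 2.8: rescale characteristic scores into a common 0-50 range
    # (min-max scaling assumed; the paper does not specify the method).
    lo, hi = min(values), max(values)
    if hi == lo:
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]
```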

3 Data

For the experiments reported in this paper we used two sets of collections. The first is a set of three small test collections (the CACM, CISI and MEDLARS collections, footnote 1); the second is a set of two larger collections (the Associated Press (1988) (AP) and the Wall Street Journal (1990-92) (WSJ) collections from the TREC initiative [VH96]). Statistics for these collections are given in Table 2.

                                           CACM     CISI     MEDLARS  AP        WSJ
Number of documents                        3204     1460     1033     79,919    74,520
Number of queries used (footnote 2)        52       76       30       48        45
Average document length (footnote 3)       47.36    75.4     89       284       326
Average words per query (footnote 4)       11.88    27.27    10.4     3.04      3.04
Average relevant documents per query       15.3     41       23       35        24
Number of unique terms in the collection   7861     7156     9397     129,240   123,852

Table 2: Details of the CACM, CISI, MEDLARS, AP and WSJ collections.

Footnote 1: http://www.cs.gla.ac.uk/idom/ir_resources/test_collections/
Footnote 2: Each collection comes with a number of queries. However, for some queries there are no relevant documents in the collection. As these queries cannot be used to calculate recall-precision figures, they are not used in these experiments. This row shows the number of queries, for each collection, for which there is at least one relevant document.
Footnote 3: After the application of stemming and stopword removal.
Footnote 4: This row shows the average length of the queries that were used in the experiments.

The AP and WSJ test collections each come with fifty so-called TREC topics. Each topic describes an information need and the criteria that were used in assessing relevance when the test collection was created. A TREC topic has a number of sections (see Figure 2 for an example topic). In our experiments we only use the short Title section from topics 251-300 as queries, as using any more of the topic description may be unrealistic as a user query.

Number: 301
Title: International Organized Crime
Description: Identify organisations that participate in international criminal activity, the activity, and, if possible, collaborating organisations and the countries involved.
Narrative: A relevant document must as a minimum identify the organisation and the type of illegal activity (e.g., Columbian cartel exporting cocaine). Vague references to international drug trade without identification of the organisation(s) involved would not be relevant.

Figure 2: Example of a TREC topic.

Stopwords were removed using the stopword list in [VR79], and the collections were stemmed using the Porter stemming algorithm [Por80].
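
A minimal sketch of this preprocessing step follows: stopword removal and then Porter stemming. NLTK's PorterStemmer is used here purely as a stand-in for [Por80], and the stopword set is a tiny illustrative sample rather than the [VR79] list used in the paper.

```python
from nltk.stem.porter import PorterStemmer

STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "that"}   # illustrative subset only
stemmer = PorterStemmer()

def preprocess(text):
    # lowercase, drop stopwords, then stem whatever remains
    tokens = text.lower().split()
    return [stemmer.stem(token) for token in tokens if token not in STOPWORDS]

# preprocess("Identify organisations that participate in international criminal activity")
# returns a list of stemmed, stopped tokens ready for indexing
```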

4 Outline of experiments

In this paper we describe three sets of experiments:

i. retrieval by single characteristic. In section 5 we present results obtained by running each characteristic as a single retrieval function. In this section we examine the relative performance of each characteristic on the test collections, and discuss why some characteristics perform better than others as retrieval functions.
ii. retrieval by combination of characteristics. In section 6 we investigate whether combining characteristics can improve retrieval effectiveness over retrieval by a single characteristic. We also discuss factors that affect the success of combination, such as the size of the combination and which characteristics are combined.
iii. relevance feedback. In section 7 we investigate how we can use relevance assessments to select good combinations of characteristics of terms and documents to use for relevance feedback. We describe several methods of selecting which characteristics are important for a query and compare these methods against methods that do not use selection of characteristics.

5 Retrieval by single characteristic

In this section we examine the performance of running each characteristic (term and document characteristics) as a single retrieval function (retrieval by the sum of the idf values of each query term, retrieval by the sum of the tf values of each query term, etc.). The results are presented in section 5.2, but before this, in section 5.1, we look at how document characteristics should be used to score documents.

5.1 Document characteristics - initial investigations

As the specificity and info-noise characteristics are document rather than term characteristics, they give a value to each document irrespective of which terms are in the query. However, we can use the document characteristics to produce different rankings based on two criteria:

i. which documents receive a score. Although all documents have a value for the specificity and info-noise characteristics, we may choose to score only those documents that contain at least one query term, as these documents are the most likely to be relevant. To investigate this, we assessed two methods of retrieving documents: the query dependent and the query independent strategies. In the query independent strategy the retrieval score of a document is the characteristic score (info_noise or specificity). This method gives an identical ranking of documents for all queries. In the query dependent strategy the retrieval score of a document is also the characteristic score, but this score is only assigned to those documents that contain at least one query term. If the document contains no query terms then the retrieval score is zero. In this method, then, we retrieve all documents that contain a query term before the documents that contain no query terms, giving a different ranking for each query.
ii. how to order the documents. The specificity characteristic gives high scores to more complex documents, whereas the info_noise characteristic gives high scores to documents that have a high proportion of useful information. This means that we are asserting that relevant documents are more likely to have a higher amount of useful information or a higher complexity. This requires testing. We tested two strategies: standard, in which we rank documents in decreasing order of characteristic score, and reverse, in which we rank documents in increasing order of characteristic score.

These two criteria give us four combinations of strategy: query dependent and standard, query independent and standard, query dependent and reverse, query independent and reverse.

Each of these strategies corresponds to a different method of ranking documents. The results of these ranking strategies are shown in Table 3 for the specificity characteristic and in Table 4 for the info_noise characteristic (footnote 5). Also shown in each table, for comparison, are the results of two random retrieval runs on each collection (footnote 6). These are also based on a query dependent strategy (random order of all documents containing a query term, followed by random order of the remaining documents) and a query independent strategy (a completely random ordering of all documents).

Footnote 5: Full recall-precision tables for all experiments are given in an electronic appendix, available at http://www.cs.strath.ac.uk/~ir/papers/appendixapart1.pdf. The corresponding tables will be noted as footnotes throughout the paper. Appendix A, Tables A.3 - A.13.
Footnote 6: Tables A.1 and A.2.
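
The four strategies can be expressed as a single ranking routine; the following is a minimal sketch under the assumption that ties are broken arbitrarily, with illustrative names throughout.

```python
def rank_by_document_characteristic(doc_scores, docs_with_query_term,
                                    query_dependent=True, reverse=False):
    """Order documents by a document characteristic (specificity or info_noise).

    doc_scores           -- {doc_id: characteristic score}
    docs_with_query_term -- set of doc_ids containing at least one query term
    query_dependent      -- if True, documents containing a query term are ranked
                            ahead of those that do not (criterion i)
    reverse              -- if True, rank in increasing rather than decreasing
                            order of characteristic score (criterion ii)
    """
    def key(doc_id):
        score = doc_scores[doc_id]
        ordered = score if reverse else -score           # ascending vs descending
        if query_dependent:
            return (0 if doc_id in docs_with_query_term else 1, ordered)
        return (0, ordered)
    return sorted(doc_scores, key=key)

scores = {"d1": 3.2, "d2": 5.1, "d3": 1.4}
contains_query_term = {"d1", "d3"}
# the four strategies correspond to the four settings of the two flags
print(rank_by_document_characteristic(scores, contains_query_term, query_dependent=True, reverse=False))
print(rank_by_document_characteristic(scores, contains_query_term, query_dependent=True, reverse=True))
```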

                 standard specificity    reverse specificity     random
Collection       QD        QI            QD        QI            QD        QI
CACM             1.19      0.98          1.19      1.18          1.14      0.36
CISI             10.55     2.83          2.75      3.51          4.66      3.86
MEDLARS          4.62      3.33          4.62      4.48          12.39     4.82
AP               0.33      0.06          0.47      0.05          0.28      0.05
WSJ              0.42      0.10          0.57      0.02          0.35      0.04

Table 3: Average precision figures for the specificity characteristic (QD = query dependent strategy, QI = query independent strategy).

From Table 3, the specificity characteristic is best applied using a query dependent strategy. Whether it is applied in decreasing order of characteristic value (standard) or increasing order of characteristic score (reverse) is collection dependent; however, the overall preference is for the reverse strategy. From Table 4, the info_noise characteristic is best applied using the query dependent standard strategy: ordering documents containing a query term, with the highest proportion of useful information at the top of the ranking.

                 standard info_noise     reverse info_noise      random
Collection       QD        QI            QD        QI            QD        QI
CACM             1.67      0.50          0.86      1.63          1.14      0.36
CISI             4.08      3.28          3.48      2.78          4.66      3.86
MEDLARS          8.67      2.56          8.25      2.98          12.39     4.82
AP               0.44      0.05          0.29      0.05          0.28      0.05
WSJ              0.48      0.03          0.32      0.03          0.35      0.04

Table 4: Average precision figures for the info_noise characteristic (QD = query dependent strategy, QI = query independent strategy).

On all collections except the MEDLARS collection, at least one method of applying the characteristics gave better performance than random (query independent), and with the exception of MEDLARS and CISI also performed better than the query dependent random run. One possible reason for the poorer results on these collections is that the range of document characteristic values for these collections is not very wide. Consequently the characteristics do not have enough information to discriminate between documents. It is better to rank only those documents that contain a query term than all documents. This is not surprising as, using the query dependent strategy, we are in fact re-ranking the basic idf ranking for each query. We shall discuss the relative performance of the document characteristics against the term characteristics in the next section. Although the document characteristics do not give better results than the term characteristics (see next section), they do generally give better results than the random retrieval runs and can be used in combination to aid retrieval.

5.2 Single retrieval on all characteristics

The results from running each characteristic as a single retrieval function are summarised in Table 5 (footnote 7), measured against the query dependent random strategy. This is used as the baseline for this experiment as all the characteristics prioritise retrieval of documents that contain a query term over those documents that contain no query terms. Hence this method of running a random retrieval is more similar in nature to the term characteristics and, as it gives higher average precision, provides a stricter baseline measure for comparison. The majority of characteristics outperform the query dependent random retrieval baseline. However, some characteristics do perform more poorly than a random retrieval of the documents (info_noise on CISI; theme, specificity and info_noise on MEDLARS; context on WSJ) (footnote 8).

Footnote 7: Tables A.14 - A.18.
Footnote 8: All characteristics, for all collections except MEDLARS, outperformed a completely random retrieval.

Collection   idf      tf       theme    context  specificity  noise    info-noise  random
CACM         22.00    22.70    4.36     14.80    1.19         24.15    1.67        1.14
CISI         11.50    12.50    5.10     9.60     10.55        11.00    4.08        4.66
MEDLARS      43.10    43.70    11.10    36.10    4.60         43.90    8.80        12.39
AP           10.10    9.86     4.63     9.57     0.47         1.00     0.44        0.28
WSJ          12.19    7.39     1.00     0.04     0.42         1.05     0.48        0.38

Table 5: Average precision figures for term and document characteristics used as single retrieval functions.

The order in which the characteristics performed is shown in Figure 3, where > indicates a statistically significant difference and >= a non-significant one (footnote 9). The document characteristics perform quite poorly as they are insensitive to query terms. That is, although when using these characteristics we score only documents that contain a query term, the document characteristics do not distinguish between documents that contain good query terms and documents that contain poor query terms.

CACM     noise >= tf >= idf > context >= theme > inf_n > spec > random
CISI     tf > idf > noise > spec > context > theme > random >= inf_n
MEDLARS  noise >= tf >= idf > context > random > theme > inf_n > spec
AP       idf > tf >= context > theme > noise > spec >= inf_n > random
WSJ      idf > tf > noise >= theme >= spec >= inf_n > random > context

Figure 3: Statistical and non-statistical differences between characteristics on all collections, where spec = specificity and inf_n = info_noise.

On nearly all collections the standard characteristics (idf, tf, noise (footnote 10)) outperformed our new characteristics. One possible reason for this is that, although the new term characteristics (theme, context) give a weight to every term in a document, unlike the standard characteristics they do not always give a non-zero weight. The context characteristic, for example, will only assign a weight to a term if at least two query terms appear in the same document. In the case of the two larger collections we have relatively shorter queries, so the co-occurrence of query terms within a document may be low, with the result that most terms have a zero weight for this characteristic. This, in turn, will lead to a poor retrieval result as the characteristic cannot distinguish well between relevant and non-relevant documents. Similarly, the theme characteristic, as implemented here, will also lead to a high proportion of terms being assigned a zero weight compared with the tf characteristic. One reason for this is that theme assigns a zero weight to a term if it only appears once within a document. A collection such as the MEDLARS collection, which has a high number of terms that only appear in one document, may be more susceptible to this, as it contains a large number of unique terms.

The standard characteristics are also less strict algorithms: the information they represent, e.g. frequency of a term within a document, is more general than that represented by the new characteristics. This means that the standard characteristics will be useful for a wider range of queries. For example, tf will be a useful characteristic for most query terms as, generally, the more often a query term appears within a document, the more likely the document is to be relevant. The theme characteristic, on the other hand, will only be useful for those queries where the query terms are related to the main topic of the document. For queries where this condition is not met, the theme characteristic will not be useful.

Even though the new characteristics do not perform as well as the traditional weighting functions, they do improve retrieval effectiveness over random retrieval. These algorithms should not be seen as alternative weighting schemes but as additional ones: ones that provide additional methods of discriminating relevant from non-relevant material. In RF these additional characteristics will be used to score query terms if they are useful at indicating relevant documents for individual queries. That is, by providing evidence of different aspects of information use, they can be used to help retrieval performance in combination with other characteristics. This combination of evidence is the subject of the next section.

Footnote 9: Calculated using a paired t-test, p < 0.05, holding recall fixed and varying precision.
Footnote 10: Harman's [Har86] experimental investigation of the noise term weighting function on the Cranfield collection showed superior results for noise over idf. In these experiments, this held for the smaller CACM and MEDLARS collections. However, in the larger collections the noise characteristic performed relatively poorly.
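
The significance test of footnote 9 can be sketched as a paired t-test over precision values measured at the same fixed recall points for two runs; scipy is used here purely as a convenient stand-in, since the paper does not say how the test was computed.

```python
from scipy.stats import ttest_rel

def significantly_different(precision_run_a, precision_run_b, alpha=0.05):
    # precision values of two runs measured at the same fixed recall points
    t_statistic, p_value = ttest_rel(precision_run_a, precision_run_b)
    return p_value < alpha
```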

6 Retrieval by combination of characteristics

In the previous section we looked at the performance of each characteristic individually. In this section we look at whether the retrieval effectiveness of the characteristics can be improved if we use them in combination. Belkin et al. [BKFS95] examined the role of multiple query representations in ad-hoc IR. Their argument in favour of different representations of queries is twofold:

i. empirical: evidence that different retrieval functions retrieve different documents, e.g. [Lee98]. In our approach, combinations of different representations of query terms can retrieve documents that fulfil different criteria for relevance.
ii. theoretical: different query representations can provide different interpretations of a user's underlying information need. This has a strong connection to Ingwersen's work on polyrepresentation - multiple representations of the same object, in our case a query, drawn from different perspectives can provide better insight into what constitutes relevance than a single good representation, [Ing94].

In this experiment we tested all possible combinations of the characteristics, running each possible combination as a retrieval algorithm. For each collection, we effectively ran the powerset of combinations, each set comprising a different combination of characteristics. For each combination, the retrieval score of a document was given by the sum of the scores of each characteristic for each query term that occurred in the document. For example, for the combination of tf and theme, the score of a document was equal to the sum of the tf values of each query term plus the sum of the theme values of each query term.

Two versions of this experiment were run: the first used the values of the characteristics given at indexing time, the second treated the characteristics as being more or less important than each other. There are several reasons for this. For example, some characteristics may reflect aspects of information use that are more easily measured than others; others are better as retrieval functions and should be treated as being more important. We incorporated this by introducing a set of scaling weights (idf 1, tf 0.75, theme 0.15, context 0.5, noise 0.1, specificity and information_noise 0.1; footnote 11) that are used to alter the weight given to a term at indexing time. Each indexing weight of a term characteristic is multiplied by the corresponding scaling weight, e.g. all tf values are multiplied by 0.75, all theme values by 0.15, etc. This gives us two conditions - weighting and non-weighting of characteristics - for each combination of characteristics.

In the following sections we summarise our findings regarding three aspects: the effect on retrieval effectiveness of combining characteristics, the effect of weighting characteristics, and the effect of adding individual characteristics to other combinations. Each of these is discussed in a separate section, sections 6.1-6.3; we summarise in section 6.4 (footnote 12).

Footnote 11: These weights were derived from experiments using a sample of the data from each collection.
Footnote 12: Tables A.19 - A.70.
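
The experiment described above can be sketched as follows: every non-empty subset of the seven characteristics (127 combinations) is run as a retrieval function, with the retrieval score of a document being the sum of the characteristic scores of its query terms, optionally multiplied by the scaling weights of the weighting condition. The data structures are illustrative; document characteristics are assumed to contribute a single per-document value.

```python
from itertools import combinations

# scaling weights from section 6 (weighting condition)
SCALING = {"idf": 1.0, "tf": 0.75, "theme": 0.15, "context": 0.5,
           "noise": 0.1, "specificity": 0.1, "info_noise": 0.1}

CHARACTERISTICS = ["idf", "tf", "theme", "context", "specificity", "noise", "info_noise"]
# the 127 non-empty combinations run for each collection
ALL_COMBINATIONS = [c for r in range(1, len(CHARACTERISTICS) + 1)
                    for c in combinations(CHARACTERISTICS, r)]

def combined_score(doc, query_terms, term_values, doc_values, combo, weighted=False):
    """Retrieval score of a document under one combination of characteristics.

    term_values[c][(term, doc)] -- value of term characteristic c for a query term in doc
    doc_values[c][doc]          -- value of document characteristic c for doc
    """
    score = 0.0
    for characteristic in combo:
        w = SCALING[characteristic] if weighted else 1.0
        if characteristic in ("specificity", "info_noise"):
            score += w * doc_values[characteristic].get(doc, 0.0)
        else:
            score += w * sum(term_values[characteristic].get((term, doc), 0.0)
                             for term in query_terms)
    return score
```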

6.1 Effect of combining characteristics

Our experimental hypothesis is that combining characteristics can increase retrieval effectiveness over using individual characteristics. In section 6.3 we discuss how well the individual characteristics performed in combination; in this section we examine the basic hypothesis and discuss general findings.

In Table 6 we outline the effect on individual characteristic performance of adding other characteristics. Of the 127 possible combinations of characteristics for each collection, each characteristic appeared in 63 combinations. Each row is a count of how many of these 63 combinations containing each characteristic had higher average precision (increase) than the characteristic as a single retrieval function, lower average precision (decrease), or no change in average precision (none). For example, how many combinations containing idf gave an average precision figure that was better, worse or identical to the average precision of idf alone?

The first general conclusion from Table 6 is that all characteristics can benefit from combination with another characteristic or set of characteristics. Furthermore, with the exception of the noise characteristic on the CACM, and the tf and idf characteristics on the CISI, any characteristic was more likely to benefit from combination than to be harmed by it. This conclusion held under both the weighting and non-weighting conditions.

The second general conclusion is that the performance of a characteristic as a single retrieval function (section 5.2) is a good indicator of how well the characteristic will perform in combination. The poorer the characteristic is at retrieving relevant documents, the more likely it is to benefit from combination with another characteristic. For each collection, on the whole, the poorer characteristics (footnote 13) improved more often in combination with other characteristics. The reverse also holds: if a characteristic is good as a single retrieval function, then there is less chance that it will be improved in combination. For example, the best characteristics in the small collections (tf and idf on CISI, and noise on CACM) showed the lowest overall improvement in combination. However, the overall tendency is beneficial: combination benefits more characteristics than it harms.

Collection  Cond.  Change     idf  tf   theme  context  spec  noise  info_noise
CACM        NW     increase   54   41   63     63       62    15     62
                   decrease   9    22   0      0        0     48     0
                   none       0    0    0      0        1     0      1
            W      increase   50   42   63     63       62    11     62
                   decrease   8    18   0      0        0     52     0
                   none       5    3    0      0        1     0      1
CISI        NW     increase   27   1    63     63       49    39     63
                   decrease   35   62   0      0        14    24     0
                   none       1    0    0      0        0     0      0
            W      increase   23   7    63     63       52    40     63
                   decrease   34   53   0      0        0     23     0
                   none       6    3    0      0        11    0      0
MEDLARS     NW     increase   47   44   63     63       63    43     63
                   decrease   16   19   0      0        0     20     0
                   none       0    0    0      0        0     0      0
            W      increase   45   55   63     60       63    37     63
                   decrease   18   8    0      3        0     26     0
                   none       0    0    0      0        0     0      0
AP          NW     increase   47   55   63     59       62    62     62
                   decrease   16   8    0      4        1     1      1
                   none       0    0    0      0        0     0      0
            W      increase   54   60   62     61       63    60     63
                   decrease   4    0    3      0        0     0      0
                   none       5    3    0      2        0     3      0
WSJ         NW     increase   40   63   63     63       63    63     63
                   decrease   22   0    0      0        0     0      0
                   none       1    0    0      0        0     0      0
            W      increase   46   63   63     63       63    60     63
                   decrease   8    0    0      0        0     3      0
                   none       9    0    0      0        0     0      0

Table 6: Effect of combination on individual characteristics, where increase = increase in average precision when combined, decrease = decrease in average precision when in combination, none = no difference in average precision when in combination, NW = non-weighting condition, W = weighting condition.

Footnote 13: These were theme, context, specificity and info_noise for the CACM, CISI and MEDLARS collections, and theme, context, noise, specificity and info_noise for the AP and WSJ collections.
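
The counts in Table 6 can be reproduced from per-combination average precision figures along the lines of the following sketch (the dictionary layout is an assumption).

```python
def tally_combination_effect(avg_precision, characteristic):
    """avg_precision maps a frozenset of characteristic names to its average precision.
    Compare every combination containing the characteristic against the characteristic
    run on its own, and count increases, decreases and ties (as in Table 6)."""
    single = avg_precision[frozenset([characteristic])]
    counts = {"increase": 0, "decrease": 0, "none": 0}
    for combo, ap in avg_precision.items():
        if characteristic in combo and len(combo) > 1:
            if ap > single:
                counts["increase"] += 1
            elif ap < single:
                counts["decrease"] += 1
            else:
                counts["none"] += 1
    return counts
```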

In the remainder of this section we look at what affects the success of combination; in particular, we examine the size of combinations and the components of combinations.

In Table 7 we analyse the success of combination by size of combination, that is, how many characteristics were combined. For each condition, weighting and non-weighting, on each collection we ranked all combinations by average precision (footnote 14). We then took the median value (footnote 15) and counted the sizes of the combinations that appeared above and below this point. In the majority of cases the larger combinations (combinations of 4-7 characteristics) performed better than the median value, and the smaller combinations (combinations of 1-3 characteristics) performed worse than the median. There was little difference between the weighting and non-weighting conditions. One possible reason for the success of the larger combinations is that poor characteristics have a lower overall effect in a larger combination. That is, if we only combine two characteristics and one of these is a poor characteristic, then there is a greater chance that the combination will perform less well than the better individual characteristic. Conversely, if we combine a number of characteristics and one is poorer than the rest, then this will not have such a great effect on the performance of the combination.

Collection  Position  Cond.  1   2   3   4   5   6   7
CACM        Above     NW     2   5   12  20  17  7   1
                      W      2   6   13  21  15  6   1
            Below     NW     5   16  23  15  4   0   0
                      W      5   15  22  14  6   0   0
CISI        Above     NW     2   7   19  21  15  0   1
                      W      2   9   17  22  11  2   1
            Below     NW     5   14  16  14  6   7   0
                      W      5   12  18  13  10  5   0
MEDLARS     Above     NW     0   5   15  24  13  6   1
                      W      0   7   18  13  18  7   1
            Below     NW     7   16  20  11  8   1   0
                      W      7   14  18  22  3   0   0
AP          Above     NW     0   7   11  20  18  7   1
                      W      0   3   11  23  19  7   1
            Below     NW     7   14  24  15  3   0   0
                      W      7   18  24  12  2   0   0
WSJ         Above     NW     1   5   13  21  17  7   1
                      W      0   3   12  23  18  7   1
            Below     NW     7   16  22  14  4   0   0
                      W      7   18  23  12  3   0   0

Table 7: Distribution of combinations, by size, over the ranking of average precision, where Above = combination falls above or at the median point of the ranking, Below = combination falls below the median point of the ranking, NW = non-weighting condition, W = weighting condition.

A further reason for larger combinations performing more effectively is that they allow for a more distinct ranking. That is, the more methods we have of scoring documents, the less chance that documents will receive an equal retrieval score.

Now we look at how the components of the combinations affect the success of combining characteristics. As stated before, each characteristic appeared in a total of 63 combinations. Table 8 presents how many of these combinations appeared above the median combination in the ranking of average precision, i.e. how many times a combination containing a characteristic performed better than average. The better individual characteristics, e.g. idf and tf, appeared in more combinations above the median than below for all collections. The poorer characteristics, e.g. info_noise, tended to appear in more combinations below the median than above.

Footnote 14: Tables A.131 - A.141, in http://www.cs.strath.ac.uk/~ir/papers/appendixapart2.pdf
Footnote 15: For each collection, in each condition, there were 127 possible combinations; the median point was taken to be the 64th combination in the ranking of all combinations.

              CACM        CISI        MEDLARS     AP          WSJ
              NW    W     NW    W     NW    W     NW    W     NW    W
idf           42    41    38    43    41    40    39    43    41    46
tf            47    52    41    44    42    50    51    47    52    47
theme         33    32    44    38    48    42    30    41    32    41
context       29    30    20    16    28    28    41    45    44    42
spec          30    32    30    32    31    33    37    32    32    33
noise         49    50    27    29    41    37    36    36    32    34
info_noise    32    32    32    31    28    31    32    31    34    30

Table 8: Number of appearances of a characteristic in a combination appearing above the median combination (NW = non-weighting condition, W = weighting condition).

This is not to say, however, that poor characteristics always decrease the performance of a combination (see section 6.4 for example). Often a characteristic that performs less well as a single characteristic can improve a combination. What is important is how well a combination of characteristics separates relevant from irrelevant documents for an individual query: a particular combination may work poorly on average but work well for certain queries. This is important for our RF experiments, in which we select which are good characteristics for individual queries, section 7.

To summarise our findings: combination of characteristics, whether weighted or not, is beneficial for all characteristics on all collections tested. This benefit is greater when the characteristic is poor as a single retrieval function, but the overall benefit of combination still holds for good characteristics. The larger combinations (4-7 characteristics) tend to be better retrieval functions over the collections than the smaller ones (1-3 characteristics).

6.2 Effect of weighting characteristics

Our rationale for weighting characteristics was that some characteristics may be better at indicating relevance than others. In Table 9 we summarise the effect of weighting on each collection, indicating the number of combinations that increased or decreased in average precision when using weighting. Overall, 47% of combinations improved using weighting on the CACM collection, 61% on CISI, 60% on MEDLARS, 69% on AP and 66% on WSJ. As can be seen, for all collections except CACM, weighting was beneficial in that it improved the average precision of more combinations than it decreased. Generally these improvements were statistically significant.

              Increase                       Decrease
Collection    Significant  Non-significant   Significant  Non-significant
CACM          24 (20%)     32 (27%)          31 (26%)     33 (28%)
CISI          59 (49%)     14 (12%)          37 (31%)     10 (8%)
MEDLARS       45 (38%)     27 (23%)          23 (19%)     25 (21%)
AP            51 (43%)     32 (27%)          22 (18%)     15 (13%)
WSJ           67 (56%)     12 (10%)          26 (22%)     15 (13%)

Table 9: Effect of weighting on combination performance (Significant = statistically significant change, Non-significant = non statistically significant change).

Table 10 breaks these figures down by size of combination, i.e. the number of characteristics in the combination. The combinations that benefited most from weighting also tended to be the ones that performed best in combination, i.e. those combinations of four or more characteristics.

Collection   Change     2    3    4    5    6    7
CACM         Increase   8    14   17   12   4    0
             Decrease   13   21   18   9    3    1
CISI         Increase   9    22   24   11   7    1
             Decrease   12   13   11   10   0    0
MEDLARS      Increase   9    19   23   14   6    0
             Decrease   12   16   12   7    1    1
AP           Increase   8    21   27   7    1    1
             Decrease   13   14   8    19   6    0
WSJ          Increase   8    19   25   19   7    1
             Decrease   13   16   10   2    0    0

Table 10: Effect of weighting by size of combination.

In Table 11 we analyse which characteristics appeared in the combinations that did better using weighting than using no weighting. Generally, combinations containing idf and tf were helped by weighting across the collections, and theme and context were helped in the larger collections. The only characteristic to be consistently harmed by weighting was the noise characteristic.

             idf        tf         theme      context    spec       noise      info_noise
CACM         36 (64%)   42 (75%)   34 (61%)   23 (41%)   33 (59%)   18 (32%)   26 (46%)
CISI         46 (63%)   49 (67%)   27 (37%)   32 (44%)   42 (58%)   21 (29%)   38 (52%)
MEDLARS      43 (60%)   40 (56%)   29 (40%)   35 (49%)   46 (64%)   9 (13%)    48 (67%)
AP           52 (63%)   46 (55%)   55 (66%)   45 (54%)   40 (48%)   15 (18%)   48 (58%)
WSJ          54 (68%)   45 (57%)   49 (62%)   45 (57%)   39 (49%)   20 (25%)   39 (49%)

Table 11: Appearance of individual characteristics in combinations that were improved by weighting.

Weighting is generally beneficial, but it is important to get good values for the scaling weights. For example, both idf and tf were good individual retrieval algorithms and were highly weighted, which helped their performance in combination as the combination was more heavily biased towards the ranking given by these characteristics. noise, on the other hand, was a variable retrieval algorithm in that it performed well on some collections and more poorly on others. As it was weighted lowly, the overall effect of noise in combination was lessened in the weighting condition. Consequently, in cases where noise would have been a good individual retrieval algorithm, the combination did not perform as well as it might have without weighting.

A final observation is that although weighting did not generally improve the best combination for the collections (footnote 16), it did tend to improve the performance of the middle-ranking combinations significantly. These were the combinations that appeared in the middle of the ranking of combinations described in section 6.1. Weighting, then, was a success in that it improved the performance of most combinations. However, it achieved this by decreasing the performance of the poorer combinations and increasing the performance of the average combinations.

Footnote 16: Tables A.131 - A.141.

6.3 Effect of adding individual characteristics

In section 6.1 we gave general conclusions about the effect of combining characteristics. In this section we look more closely at the effect of combining individual characteristics and the effect of characteristics on the performance of a combination of characteristics. In Table 12 we summarise the effect of adding a characteristic to other combinations, e.g. adding idf to the 63 combinations that did not already contain idf. We measure whether the new information causes an increase in average precision (adding idf improves retrieval), a decrease in average precision (adding idf worsens retrieval), or no change in average precision (adding idf gives the same retrieval effectiveness).

We look first at the addition of individual characteristics to any combination of other characteristics. On all collections the addition of idf or tf information to a combination of characteristics was beneficial. This was more pronounced in the larger AP and WSJ collections, and held under both the weighting and non-weighting conditions. The addition of theme information improves the performance of other combinations in the smaller collections using either weighting or non-weighting. In the larger collections, the theme characteristic only improved performance under the weighting condition.

                     CACM          CISI          MEDLARS       AP            WSJ
                     No Wgt  Wgt   No Wgt  Wgt   No Wgt  Wgt   No Wgt  Wgt   No Wgt  Wgt
idf         Inc      51      58    54      50    47      48    55      63    62      62
            Same     0       1     0       0     0       0     0       0     0       0
            Dec      12      4     9       13    16      15    8       0     1       1
tf          Inc      60      59    57      54    53      56    60      62    62      62
            Same     1       3     1       1     1       1     1       1     1       1
            Dec      2       1     5       8     9       6     2       0     0       0
theme       Inc      33      26    48      45    51      49    22      38    26      54
            Same     2       6     3       2     1       1     1       2     2       2
            Dec      28      31    12      16    11      13    40      23    35      7
context     Inc      27      18    8       12    17      14    56      63    59      48
            Same     2       4     0       0     0       0     0       0     0       0
            Dec      34      41    55      51    46      49    7       0     4       15
spec        Inc      19      14    16      22    17      13    46      4     22      6
            Same     1       36    3       17    0       35    1       0     2       54
            Dec      43      13    44      24    46      15    14      56    39      3
noise       Inc      60      50    9       29    51      53    48      57    52      48
            Same     1       6     1       0     2       1     2       2     5       15
            Dec      2       7     53      34    10      9     13      4     6       0
info_noise  Inc      37      18    46      18    18      16    31      5     45      5
            Same     0       35    1       16    0       32    1       57    0       54
            Dec      26      10    16      29    45      15    31      1     18      4

Table 12: Effect of the addition of a characteristic to combinations of characteristics (No Wgt = non-weighting condition, Wgt = weighting condition).

The addition of the context characteristic performed poorly in the smaller collections, performing more poorly when using weighting. In the larger collections the majority of combinations improved after the addition of context information. With the exception of the CISI, the addition of the noise characteristic improves performance in both the weighting and non-weighting conditions.

The two document characteristics, specificity and info_noise, are very susceptible to how they are treated. The specificity characteristic tends to decrease the effectiveness of a combination of characteristics if the characteristics are not weighted.

If the characteristics are weighted, then the addition of specificity information is neutral: the combination performs as well as without the specificity information. The WSJ collection is the exception to this general conclusion: for this collection, under no weighting, the addition of specificity increases the effectiveness of a combination, while under weighting it decreases the effectiveness of a combination. The info_noise characteristic tends to improve the effectiveness of a combination when using no weighting and to be neutral with respect to weighting, i.e. it does not change the performance of the combination. The main exception to this is the MEDLARS collection, in which info_noise tends to harm the performance of a combination when not using weighting.

Having considered which characteristics improve or worsen combinations, we now examine which combinations are affected by the addition of new information. In Tables A.142 - A.151, in the Appendix, we present a summary of how often individual characteristics improve a combination containing another characteristic, e.g. how many combinations containing idf are improved by the addition of tf. Under both the weighting and non-weighting conditions the following generally held:

- idf improved combinations containing context more than other characteristics, and improved combinations containing noise least of all
- tf improved combinations containing context or noise more than other characteristics, and theme least
- theme improved combinations containing context most and combinations containing tf least
- context improved combinations containing noise least
- specificity improved combinations that contained theme and info_noise more than combinations containing other characteristics
- for the noise characteristic there were no general findings, except that combinations containing idf were usually less likely to be improved by the addition of noise information
- info_noise improved combinations containing theme and specificity most often.

Weighting slightly altered which combinations performed well, but the basic trends were the same across the conditions. On the larger collections, one effect of weighting was to reduce the effect of individual characteristics, in that the effect of adding a characteristic was less likely to depend on which characteristics were already in the combination. One further observation is that term weighting schemes that represent similar features (e.g. idf and noise, which both represent global term statistics, and tf/theme, which both represent within-document statistics) generally combine less well. That is, combining these pairs of weights does not generally help retrieval as much as combining complementary weights, e.g. idf and tf, or idf and theme. Combining the two document characteristics, however, does seem to give better results.

6.4 Summary

Our hypothesis was that combining evidence, i.e. combining characteristics of terms, can improve retrieval effectiveness over retrieval by single characteristics. In section 6 we demonstrated that this was generally the case: all characteristics could benefit from combination. Combinations work well where the additional characteristics act as additional means of ranking documents, that is, separating documents by other sets of features. Other researchers have considered this; for example, Salton and Buckley examined idf as a precision-enhancing function and tf as a recall-enhancing function, [SB88]. Similarly, Cooper [Coo73] (footnote 17) discusses the difficulty of assessing likely utility without considering additional features of document content.

However, not all combinations are successful. Two aspects of combination that are likely to predict success are the nature of the characteristics - complementary functions combine better - and the success of the characteristic as a single retrieval function. Weighting the characteristics to reflect the strength of each characteristic as a single retrieval function is also generally a good idea. However, it can be difficult to set optimal weights for two reasons. Firstly, it is likely that good weights will be collection dependent, as the individual characteristics have different levels of effectiveness on different collections. Secondly, the weights should reflect the effectiveness of the characteristics relative to each other. However, this becomes difficult to assess when we combine characteristics, as we have to measure the relative strength of each

Footnote 17: We are grateful to the anonymous referee for pointing us to this paper.