Data Fusion and Bias: Performance Evaluation of Various Data Fusion Methods

İlker Nadi Bozkurt, Hayrettin Gürkök, Eyüp Serdar Ayaz
Computer Engineering Department, Bilkent University, Ankara, Turkey
bozkurti@cs.bilkent.edu.tr, gurkok@cs.bilkent.edu.tr, serdara@cs.bilkent.edu.tr

Abstract: Data fusion is the combination of the results of independent searches on a document collection into a single output result set. It has been shown that data fusion can greatly improve retrieval effectiveness over that of the individual results. This paper compares some major data fusion and system selection techniques by experimenting on TREC ad hoc collections.

Keywords: data fusion; metasearch; information retrieval; rank aggregation; performance evaluation

I. INTRODUCTION

Data fusion (metasearch) is the term usually applied to techniques that combine the final results of a number of search engines in order to improve retrieval. Briefly, a metasearch engine takes as input the n ranked lists output by n search engines in response to a given query. It then computes a single ranked list as output, which is usually an improvement over any of the input lists.

Metasearch offers significant advantages to information retrieval. First, it improves upon the performance of the individual search engines, because different retrieval methods often return very different irrelevant documents but many of the same relevant documents. Second, metasearch provides more consistent and reliable performance than the individual search engines: since it aggregates the advice of several systems, it does not reflect the tendencies of any single system, resulting in a more reliable search system [1].

A reference software component architecture of a metasearch engine is illustrated in Figure 1.
The numbers on the edges indicate the sequence of actions for a query to be processed [2]. According to this illustration, the functionality of each software component and the interactions among these components are as follows:

Database Selector: the search engines to be fused are selected using a system selection method.
Query Dispatcher: the queries are submitted to the selected search engines.
Document Selector: the documents to be used from each search engine are determined.
Results Merger: the results of the search engines are merged using a metasearch algorithm.

Fig. 1: Metasearch software component architecture

In this paper we deal with the Database Selector and Results Merger components; the rest are used as-is from the TREC results. The metasearch algorithms mentioned throughout this paper can be classified by the data they require: Reciprocal Rank, Borda-fuse and Condorcet-fuse use ranks only, while CombSUM, CombMNZ and CombANZ require relevance scores (see Figure 2).

Fig. 2: Some fusion methods according to the data they require

Just as there are different fusion methods, there are also options for the system selection process. In this work we concentrate on the all (normal), best and bias selection methods.
The organization of the paper is as follows. In Section 2 we briefly review related work in the area of data fusion. In Section 3 we describe the data fusion and system selection methods used in this work in detail. In Section 4 we explain our experimental design in terms of the data sets and measures used. We present the experimental results and comparisons in Section 5, and we conclude the paper in Section 6.

II. RELATED WORK

A number of fusion techniques based on normalised scores were proposed by Fox and Shaw [2]. These techniques use relevance scores from the different systems. Among them, CombSUM and CombMNZ were shown to achieve the best performance and have been used in subsequent research.

Aslam and Montague compared the fusion process to a democratic election in which there are a few voters (the search engines) and many candidates (the documents). They achieved positive results with adapted implementations of two algorithms designed for that setting. Borda-fuse [3] awards a score to each document depending on its position in each input result set, the final score being the sum of these. Condorcet-fuse [4] ranks documents based on pairwise comparisons: a document is ranked above another if it appears above it in more input result sets. These two methods, together with the Reciprocal Rank method, use only ranks, in contrast to relevance scores.

In addition to fusion methods, there has been some work on system selection. Mowshowitz and Kawaguchi proposed selecting systems whose query responses differ from the norm, so that more refined results can be achieved [5].

III. DATA FUSION AND SYSTEM SELECTION METHODS

A. Data fusion methods

1) CombSUM
CombSUM uses the sum of the relevance scores given by each system as the fused relevance score.
2) CombMNZ
CombMNZ defines the fused relevance score as the sum of the relevance scores given by each system (that is, CombSUM), multiplied by the number of systems that returned the document.

3) CombANZ
CombANZ is similar to CombMNZ except that, instead of multiplying, we divide CombSUM by the number of systems that returned the document; in other words, it returns the average relevance score.

We can use the following general formula to compute CombSUM, CombMNZ and CombANZ. Let n denote the number of systems that returned document i and rel_ij the relevance score of document i given by system j; then

score(i) = n^p * (rel_i1 + rel_i2 + ... + rel_in),

where p = 0, 1 and -1 for CombSUM, CombMNZ and CombANZ, respectively.

4) Reciprocal rank fusion
In this approach, the rank positions assigned by the retrieval systems determine the fused score. When a document is returned by more than one system, the reciprocals of its rank positions are summed, since a document returned by several retrieval systems is more likely to be relevant; systems that do not rank a document are skipped. The rank score of document i, using its positions pos_ij in the systems that ranked it (j = 1...n), is

r(i) = 1 / (1/pos_i1 + 1/pos_i2 + ... + 1/pos_in).

First the rank position score of each document is computed; then the documents are sorted in non-decreasing order of these scores [6].

Example: Suppose that we have four different retrieval systems A, B, C and D and a document collection composed of documents a, b, c, d, e, f and g. Assume that for a given query their results are ranked as follows:

A = (b, d, c, a)
B = (a, b, c, f, g)
C = (c, a, f, e, b, d)
D = (a, d, g, f)

Computing the rank position of each document in each list, the rank scores of the documents are:

r(a) = 1/(1/4 + 1/1 + 1/2 + 1/1) = 0.36
r(b) = 1/(1/1 + 1/2 + 1/5) = 0.59
r(c) = 1/(1/3 + 1/3 + 1/1) = 0.60
r(d) = 1/(1/2 + 1/6 + 1/2) = 0.85
r(e) = 1/(1/4) = 4.00
r(f) = 1/(1/4 + 1/3 + 1/4) = 1.20
r(g) = 1/(1/5 + 1/3) = 1.87
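The Comb* scoring rule and reciprocal-rank fusion above can be sketched in a few lines of Python. The function names and data layout here are our own; the rank positions are taken from the example lists A-D:

```python
def comb_score(scores, p):
    """Fused score for one document. `scores` holds the relevance scores
    from the systems that returned it; p = 0 gives CombSUM, p = 1 gives
    CombMNZ and p = -1 gives CombANZ."""
    return (len(scores) ** p) * sum(scores)

def reciprocal_rank_score(positions):
    """Rank score r(i): reciprocal of the summed reciprocal rank
    positions. A lower score means a better final rank."""
    return 1.0 / sum(1.0 / pos for pos in positions)

# Rank positions of each document in the systems that returned it,
# from A = (b, d, c, a), B = (a, b, c, f, g), C = (c, a, f, e, b, d),
# D = (a, d, g, f).
positions = {
    "a": [4, 1, 2, 1], "b": [1, 2, 5], "c": [3, 3, 1],
    "d": [2, 6, 2], "e": [4], "f": [4, 3, 4], "g": [5, 3],
}

# Sort documents by rank score in non-decreasing order.
fused = sorted(positions, key=lambda d: reciprocal_rank_score(positions[d]))
print(fused)  # ['a', 'b', 'c', 'd', 'f', 'g', 'e']
```

Note that `comb_score` assumes the per-system relevance scores have already been collected for each document; score normalization across systems is a separate concern.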
Sorting these scores in non-decreasing order, the final ranked list of documents is

a > b > c > d > f > g > e,

i.e., a is the document with the highest rank (topmost document).

5) Borda-fuse
This is a method taken from the social theory of voting. The highest-ranked candidate in an n-way vote receives n points, and each subsequent candidate receives one point fewer (so the second receives n-1, the third n-2, and so on). If a voter leaves some candidates unranked, the remaining points are divided evenly among them. Then, for each candidate, all the points are added up, and the candidate with the highest total wins the election.

Example: Consider the example given for Reciprocal Rank fusion again. The Borda count BC(i) of a document i is computed by summing its Borda counts in the individual systems (BC_A in system A, etc.):

BC(i) = BC_A(i) + BC_B(i) + BC_C(i) + BC_D(i)

Computing the Borda count for each document:

BC(a) = 4 + 7 + 6 + 7 = 24
BC(b) = 7 + 6 + 3 + 2 = 18
BC(c) = 5 + 5 + 7 + 2 = 19
BC(d) = 6 + 1.5 + 2 + 6 = 15.5
BC(e) = 2 + 1.5 + 4 + 2 = 9.5
BC(f) = 2 + 4 + 5 + 4 = 15
BC(g) = 2 + 3 + 1 + 5 = 11

Sorting these scores in non-increasing order, the final ranked list of documents is

a > c > b > d > f > g > e

6) Condorcet-fuse
This is another method taken from the social theory of voting. In the Condorcet election method, voters rank the candidates in order of preference, and the vote counting procedure takes into account each voter's preference for one candidate over another. The Condorcet voting algorithm is a majoritarian method that declares as winner the candidate that beats each of the other candidates in a pairwise comparison.
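The pairwise counting behind Condorcet-fuse can be sketched as follows. The three toy rankings here are our own, not the paper's example, and we adopt the convention (also used in the worked example below) that a document not retrieved by a system loses every pairwise contest in that system:

```python
from itertools import combinations

# Each system maps a document to its rank (lower is better); ties share
# a rank, and documents a system did not retrieve are simply absent.
systems = [
    {"a": 1, "b": 2},  # system 1: a > b  (c not retrieved)
    {"b": 1, "c": 2},  # system 2: b > c  (a not retrieved)
    {"a": 1, "c": 2},  # system 3: a > c  (b not retrieved)
]

docs = sorted({d for s in systems for d in s})
wins = {d: 0 for d in docs}
losses = {d: 0 for d in docs}

for x, y in combinations(docs, 2):
    vx = vy = 0
    for s in systems:
        rx, ry = s.get(x), s.get(y)
        if rx is None and ry is None:
            continue                 # neither retrieved: no vote
        if ry is None or (rx is not None and rx < ry):
            vx += 1                  # x preferred (or y missing)
        elif rx is None or ry < rx:
            vy += 1                  # y preferred (or x missing)
        # equal ranks: this system's vote is a tie
    if vx > vy:
        wins[x] += 1; losses[y] += 1
    elif vy > vx:
        wins[y] += 1; losses[x] += 1

# Rank by pairwise wins (descending), breaking ties by fewer losses.
ranking = sorted(docs, key=lambda d: (-wins[d], losses[d]))
print(ranking)  # ['a', 'b', 'c']
```

Documents left with identical win and loss counts come out tied, which is exactly the situation handled by the tie-breaking discussion in the example that follows.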
Example: Consider the previous example again, but this time let the ranks be given as:

A: a > c = b > g
B: b > a > c > d > f > e > g
C: a = b > c > f > g > e
D: c > e > d

Note that, for instance, in system A documents b and c share the same rank. In the first stage, we build an N x N matrix for the pairwise comparison, where N is the number of candidates. Each non-diagonal entry (i, j) of the matrix records the votes of i over j; for example, cell [a, b] gives the number of wins, losses and ties of document a against document b, respectively. While counting the votes of a system, a document that was not retrieved by that system loses to all documents that the system did retrieve.

       a      b      c      d      e      f      g
a      -    1,1,2  3,1,0  3,1,0  3,1,0  3,0,1  3,0,1
b    1,1,2    -    2,1,1  3,1,0  3,1,0  3,0,1  3,0,1
c    1,3,0  1,2,1    -    4,0,0  4,0,0  4,0,0  4,0,0
d    1,3,0  1,3,0  0,4,0    -    1,2,1  2,1,1  2,1,1
e    1,3,0  1,3,0  0,4,0  2,1,1    -    1,2,1  2,2,0
f    0,3,1  0,3,1  0,4,0  1,2,1  2,1,1    -    2,1,1
g    0,3,1  0,3,1  0,4,0  1,2,1  2,2,0  1,2,1    -

Table 1: Pairwise wins, losses and ties in Condorcet

After that, we determine the pairwise winners. Each pair of documents is compared; the winner receives one point in its win column and the loser one point in its lose column. If the simulated pairwise election is a tie, both receive one point in the tie column.

     Win  Lose  Tie
a     5    0    1
b     5    0    1
c     4    2    0
d     2    4    0
e     1    4    1
f     2    4    0
g     0    5    1

Table 2: Total wins, losses and ties in Condorcet

To rank the documents we use their win and lose values. The document with the higher number of wins is ranked above the other. If the win counts are equal, we consider the lose counts, and the document with the smaller lose count wins; if both counts are equal, the documents are tied. The final ranking of the documents in the example is:

a = b > c > d = f > e > g
In our implementation, documents a and b are assigned ranks 1 and 2 at random, and documents d and f are assigned ranks 4 and 5 at random.

Condorcet paradox: In a paradoxical case there is an equivalence class of winners, and one cannot pick a single top winner or rank its members. A commonly used example is the following:

A: a > b > c
B: b > c > a
C: c > a > b

Here each document wins one pairwise contest and loses another; if such equivalent documents are considered tied, the problem is resolved.

B. System selection methods

1) Normal
All systems are selected for the fusion process.

2) Best
Only the top-performing systems are selected for data fusion. One common approach is to select a number of systems that yield high MAP when evaluated against the TREC qrels.

3) Bias
The systems that behave differently from the norm (the majority of all systems used in fusion) are selected. The motivation behind this approach is that using such systems eliminates ordinary systems from the fusion, which may provide better discrimination among documents and systems.

To compute the bias of a particular system, we first calculate the similarity between the vector of the norm and the vector of the retrieval system using the cosine measure, i.e., their dot product divided by the product of their lengths:

s(v, w) = (v . w) / (||v|| ||w||)

The bias between the two vectors is then defined as follows [5]:

B(v, w) = 1 - s(v, w)

Two variants of the bias calculation exist: one ignores the order of the documents in the retrieved set and the other does not. To ignore position, the frequency of document occurrence is used to build the vectors. To take the order of documents into account, we instead increment the frequency of a document by m/i, where m is the number of positions and i is the position of the document in the retrieved result set.
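The bias computation just described can be sketched as follows. The helper names and the toy vectors are our own; `m` stands for the number of positions used by the order-aware variant:

```python
import math

def cosine_similarity(v, w):
    """s(v, w): dot product divided by the product of vector lengths."""
    dot = sum(a * b for a, b in zip(v, w))
    return dot / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

def bias(system_vec, norm_vec):
    """B(v, w) = 1 - s(v, w): higher means further from the norm."""
    return 1.0 - cosine_similarity(system_vec, norm_vec)

def response_vector(result_lists, vocabulary, order_aware=False, m=10):
    """Build a system's response vector over `vocabulary` from its
    per-query result lists. Without order-awareness each occurrence
    counts 1; with it, an occurrence at position i counts m / i."""
    freq = dict.fromkeys(vocabulary, 0.0)
    for results in result_lists:
        for i, doc in enumerate(results, start=1):
            freq[doc] += (m / i) if order_aware else 1.0
    return [freq[d] for d in vocabulary]
```

A system's bias is then `bias(response_vector(...), norm)`, where the norm vector is the element-wise sum of the response vectors of all participating systems.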
Since users usually look only at the higher-ranked documents, considering order in the bias calculation is important [6].

Example: Suppose that we use two hypothetical retrieval systems A and B to define the norm, each processing three queries. Counting, for each document, how often it is retrieved over the three queries, the seven distinct documents retrieved by A or B are a, b, c, d, e, f and g, and the response vectors for A, B and the norm are, respectively:

X_A = (2, 4, 2, 3, 1, 2, 0)
X_B = (2, 4, 1, 3, 1, 3, 1)
X   = (4, 8, 3, 6, 2, 5, 1)

The similarity of X_A to X is

s(X_A, X) = 76 / sqrt(38 * 155) = 0.9903

and that of X_B to X is

s(X_B, X) = 79 / sqrt(41 * 155) = 0.9910,

so the bias values of the two systems are:

Bias(A) = 1 - 0.9903 = 0.0097
Bias(B) = 1 - 0.9910 = 0.0090

If we repeat the calculations taking the order of documents into account, the response vectors become

X_A = (11/2, 9, 11/2, 11/3, 1, 3, 0)
X_B = (8, 7, 4, 16/3, 1, 25/6, 1)
X   = (27/2, 16, 19/2, 9, 2, 43/6, 1)

and the bias values, found in the same way, are:

Bias(A) = 1 - 0.986704 = 0.013296
Bias(B) = 1 - 0.986702 = 0.013298

IV. DATA SETS AND MEASURES

We used the ad hoc tracks of TREC-3, -4, -5 and -7 [7]. Table 3 gives the number of runs for each TREC used in its track and in our experiments.
TREC   Runs
 3      40
 4      33
 5      61
 7     103

Table 3: Number of TREC runs

We used mean non-interpolated average precision (MAP) to evaluate the systems. Precision is the proportion of the retrieved documents that are relevant. Average precision for a single topic is the average of the precision values obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved. For multiple topics (queries), we take the mean of these average precisions.

All experiments were run on a Linux system with an Intel Core 2 Duo processor and 4 GB of RAM. No step took more than a few seconds, so we could easily test various pool depths and pseudorel percentages to find the optimum values.

V. EXPERIMENTAL RESULTS

In our first set of experiments we examined the effect of different pool depths and pseudorel percentages, using the TREC-7 data set with the Borda-fuse and CombMNZ methods. Pool depths from 30 to 150 and pseudorel percentages from 0.1 to 0.7 were examined; the following figures show the results. On TREC-7, the MAP of the fused run kept improving up to a pool depth of 150 and a pseudorel percentage of 0.7. We did not examine whether these values are also optimal for the other TRECs, but we nevertheless used them in the rest of the experiments, as we do not expect much performance degradation if they are slightly suboptimal.

Fig. 4: Comparison of pool depth and pseudorel percentage with Normal system selection

The following tables show our experimental results for the Borda count, CombMNZ, CombANZ and CombSUM fusion methods with the Best, Bias and Normal system selection approaches on TRECs 3, 4, 5 and 7. Note that we used a different number of systems in each TREC when testing the bias concept, because the number of systems that participated in each TREC differs, and a certain percentage of that number must be used with bias to get meaningful results.
Fig. 3: Comparison of pool depth and pseudorel percentage with Best system selection

The figures show that the MAP score of the fused results improves as we increase the pool depth and the pseudorel percentage: as long as relevant documents continue to appear above non-relevant documents, the average precision of the fused system improves. This is clearly the case in the TREC-7 experiments above.

In the following tables, the maximum entry in each column is shown in bold, indicating the best system selection method for that data fusion method in the corresponding TREC; the maximum entry of each row is underlined, indicating the best data fusion method for that system selection method.

TREC-3    BORDA    COMBMNZ  COMBANZ  COMBSUM
BEST10    0.2068   0.3875   0.1917   0.3657
BEST20    0.2011   0.3967   0.1879   0.3518
BIAS10    0.1451   0.2581   0.1093   0.2098
BIAS20    0.1534   0.2863   0.1040   0.2239
NORMAL    0.1478   0.3973   0.1856   0.3575

Table 4: TREC-3 results
TREC-4    BORDA    COMBMNZ  COMBANZ  COMBSUM
BEST10    0.0491   0.0399   0.2107   0.0687
BEST20    0.0378   0.0149   0.0176   0.0162
BIAS10    0.0000   0.0000   0.0000   0.0000
BIAS20    0.0003   0.0001   0.0000   0.0001
NORMAL    0.0232   0.0086   0.0096   0.0092

Table 5: TREC-4 results

TREC-5    BORDA    COMBMNZ  COMBANZ  COMBSUM
BEST10    0.2901   0.3360   0.1696   0.3142
BEST20    0.2608   0.3732   0.1368   0.3452
BIAS10    0.0858   0.1545   0.0398   0.1392
BIAS30    0.1316   0.3246   0.0926   0.0011
NORMAL    0.1278   0.3416   0.0934   0.3206

Table 6: TREC-5 results

TREC-7    BORDA    COMBMNZ  COMBANZ  COMBSUM
BEST10    0.3625   0.4639   0.2052   0.4585
BEST20    0.3291   0.4725   0.2097   0.4661
BIAS20    0.1276   0.4108   0.1227   0.4083
BIAS50    0.1257   0.3614   0.1083   0.3399
NORMAL    0.1038   0.3593   0.0849   0.3428

Table 7: TREC-7 results

The above tables allow us both to compare the system selection methods for each data fusion method (columns) and to compare the data fusion methods for each system selection method (rows). Examining the columns, we see that in all but one of the test results, best system selection yields better results than either normal or bias selection. This is an expected result: the best systems have high relevant overlap, and when relevant overlap is higher than non-relevant overlap, data fusion yields a performance improvement, as conjectured by Lee [8]. Further examination of the columns shows that bias selection improves over normal selection on TREC-5 and TREC-7. On TREC-3, normal selection yields better results than bias selection, but a closer look at the TREC-3 results shows that normal selection is competitive there even against best selection; one reason may be that all of the input systems participating in TREC-3 produce similar results. On TREC-4 we see anomalous results, almost the opposite of everything we expected, which may be the reason TREC-4 was not used in the experiments of [6].
Also, in [3], [9] and [10], TREC-3 and TREC-5 data are used in the experiments but not TREC-4.

To see which data fusion method performs best, we examine the rows of the above tables. The results show that, except for TREC-4, CombMNZ yields the best results, and the second best is almost always CombSUM. These results agree with previous results in the literature (e.g., see [3]), as CombMNZ is a very competitive data fusion method.

To see the effectiveness of the fusion methods, we compare their MAP scores against those of the top and the median systems of the TREC under consideration, using normal system selection. The MAP values for all systems are given in Table 8, and Figure 5 plots the performance of the fusion methods. The performance of CombMNZ and CombSUM is comparable to that of the top system in TREC-5 and TREC-7. CombANZ and Borda-fuse never outperform the other systems in any TREC. The median system is worse than the best fusion methods (CombMNZ and CombSUM) except on TREC-4.

               TREC-3  TREC-4  TREC-5  TREC-7
TOP SYSTEM     0.4748  0.3631  0.3165  0.3702
MEDIAN SYSTEM  0.2640  0.2287  0.1951  0.2032
BORDA-FUSE     0.1478  0.0232  0.1278  0.1038
COMBANZ        0.1856  0.0096  0.0934  0.0849
COMBMNZ        0.3973  0.0086  0.3416  0.3593
COMBSUM        0.3575  0.0092  0.3206  0.3428

Table 8: MAP values of different systems with normal system selection

Fig. 5: MAP graph of various systems with normal system selection
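For reference, the mean average precision measure used throughout this section can be sketched as follows (our own helper functions, with toy ranked lists in place of real TREC runs):

```python
def average_precision(ranked, relevant):
    """Average of the precision values observed after each relevant
    document is retrieved; relevant documents that are never retrieved
    contribute a precision of zero."""
    hits = 0
    precision_sum = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(topics):
    """`topics` is a list of (ranked_list, relevant_set) pairs,
    one per topic (query)."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)
```

Dividing by the total number of relevant documents (rather than by the number retrieved) is what makes missing relevant documents count as zero-precision contributions.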
VI. CONCLUSION

In this paper we evaluated and compared system selection and merging methods used in data fusion. For system selection, we considered the effectiveness of selecting all of the systems, only some of the best-performing systems, and finally the systems that behave differently from the majority (i.e., biased systems). The experiments show that the superior system selection method is best selection. Using biased systems can also improve retrieval effectiveness: in some cases bias selection outperforms normal selection, and it usually yields MAP scores close to those of best selection.

Among the data fusion methods, CombSUM and CombMNZ perform much better than the others. However, these two methods require relevance scores from the input systems. When only the orderings are available but not the original scores, they cannot be applied, and we have to resort to methods that use only ranking information, such as the Condorcet method [6] and the Borda count.

REFERENCES

[1] M. Montague and J. A. Aslam, "Condorcet fusion for improved retrieval," in Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM), pp. 538-548, 2002.
[2] E. A. Fox and J. A. Shaw, "Combination of multiple searches," in Proceedings of the 2nd Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, pp. 243-252, 1994.
[3] J. A. Aslam and M. Montague, "Models for metasearch," in SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 276-284, New York, NY, USA, 2001. ACM Press.
[4] M. Montague and J. A. Aslam, "Relevance score normalization for metasearch," in CIKM '01: Proceedings of the 10th International Conference on Information and Knowledge Management, pp. 427-433, New York, NY, USA, 2001. ACM Press.
[5] A. Mowshowitz and A.
Kawaguchi, "Measuring search engine bias," Information Processing and Management, vol. 41, no. 5, pp. 1193-1205, September 2005.
[6] R. Nuray and F. Can, "Automatic ranking of information retrieval systems using data fusion," Information Processing and Management, vol. 42, no. 3, pp. 595-614, May 2006.
[7] Text REtrieval Conference (TREC) home page, <http://trec.nist.gov>.
[8] J. H. Lee, "Analyses of multiple evidence combination," in Proceedings of the 20th Annual International ACM SIGIR Conference, pp. 267-276, 1997.
[9] M. Montague and J. A. Aslam, "Metasearch consistency," in SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2001. ACM Press.
[10] D. Lillis, F. Toolan, R. Collier, and J. Dunnion, "ProbFuse: a probabilistic approach to data fusion," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 139-146, New York, NY, USA, 2006. ACM Press.