
Data Fusion and Bias
Performance evaluation of various data fusion methods

İlker Nadi Bozkurt, Computer Engineering Department, Bilkent University, Ankara, Turkey, bozkurti@cs.bilkent.edu.tr
Hayrettin Gürkök, Computer Engineering Department, Bilkent University, Ankara, Turkey, gurkok@cs.bilkent.edu.tr
Eyüp Serdar Ayaz, Computer Engineering Department, Bilkent University, Ankara, Turkey, serdara@cs.bilkent.edu.tr

Abstract: Data fusion is the combination of the results of independent searches on a document collection into a single output result set. It has been shown that this can greatly improve retrieval effectiveness over that of the individual results. This paper compares some major data fusion and system selection techniques by experimenting on TREC ad hoc collections.

Keywords: data fusion; metasearch; information retrieval; rank aggregation; performance evaluation

I. INTRODUCTION

Data fusion (metasearch) is the term usually applied to techniques that combine the final results of a number of search engines in order to improve retrieval. Briefly, a metasearch engine takes as input the n ranked lists output by n search engines in response to a given query. It then computes a single ranked list as output, which is usually an improvement over any of the input lists.

Metasearch offers significant advantages to information retrieval. First, it improves upon the performance of the individual search engines, because different retrieval methods often return very different irrelevant documents but many of the same relevant documents. Second, metasearch provides more consistent and reliable performance than individual search engines: since it aggregates the advice of several systems, it does not reflect the tendencies of any single system, resulting in a more reliable search system [1].

A reference software component architecture of a metasearch engine is illustrated in Figure 1 [2]. The numbers on the edges indicate the sequence of actions for a query to be processed. According to this illustration, the functionality of each software component and the interactions among these components are as follows:

Database Selector: the search engines to be fused are selected using some system selection method
Query Dispatcher: the queries are submitted to the selected search engines
Document Selector: the documents to be used from each search engine are determined
Results Merger: the results of the search engines are merged using metasearch algorithms

Fig. 1: Metasearch software component architecture

In this paper, we deal with the Database Selector and Results Merger components; the rest are used as is from the TREC results. The metasearch algorithms mentioned throughout this paper can be classified by the data they require: whether they need relevance scores or only ranks (see Figure 2).

Fig. 2: Some fusion methods according to the data they require (ranks only: Reciprocal Rank, Borda-fuse, Condorcet-fuse; relevance scores: CombMNZ, CombANZ, CombSUM)

As we have different fusion methods, we also have options for the system selection process. In this work we concentrate on the all, best and bias selection methods.
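To make the component interactions concrete, the following is a minimal sketch of how such a pipeline could be wired together; the function and parameter names are hypothetical and not taken from any reference implementation.

```python
def metasearch(query, engines, select_systems, select_documents, merge):
    """Hypothetical wiring of the four components in Fig. 1."""
    selected = select_systems(engines)                                       # Database Selector
    result_lists = [engine.search(query) for engine in selected]             # Query Dispatcher
    result_lists = [select_documents(results) for results in result_lists]   # Document Selector
    return merge(result_lists)                                               # Results Merger
```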

The organization of the paper is as follows. In Section 2 we briefly mention related work in the area of data fusion. In Section 3 we describe the data fusion and system selection methods used in this work in detail. In Section 4 we explain our experimental design in terms of the data sets and measures used. We present the experimental results and comparisons in Section 5. Finally, we conclude the paper in Section 6.

II. RELATED WORK

A number of fusion techniques based on normalised scores were proposed by Fox and Shaw [2]. These techniques use relevance scores from different systems. Among these, CombSUM and CombMNZ were shown to achieve the best performance and have been used in subsequent research.

Aslam and Montague compared the fusion process to a democratic election in which there are voters (the search engines) and many candidates (the documents). They achieved positive results with adapted implementations of two algorithms designed for that situation. Borda-fuse [3] awards a score to each document depending on its position in each input result set, with its final score being the sum of these. Condorcet-fuse [4] ranks documents based on pairwise comparisons: a document is ranked above another if it appears above it in more input result sets. These two methods, together with the Reciprocal Rank method, use only ranks, in contrast to relevance scores.

In addition to fusion methods, there has been some work on system selection methods. Mowshowitz and Kawaguchi proposed selecting systems whose query responses differ from the norm, so that more refined results can be achieved [5].

III. DATA FUSION AND SYSTEM SELECTION METHODS

A. Data fusion methods

1) CombSUM
CombSUM uses the sum of the relevance scores given by each system as the fused relevance score.

2) CombMNZ
CombMNZ defines the fused relevance score as the sum of the relevance scores given by each system (that is, CombSUM), multiplied by the number of systems that returned the document.

3) CombANZ
CombANZ is similar to CombMNZ except that, instead of multiplying, we divide CombSUM by the number of systems that returned the document. In other words, it returns the average relevance score.

We can use the following general formula to calculate CombSUM, CombMNZ and CombANZ. Let $n_i$ denote the number of systems that returned document $i$ and $rel_{ij}$ the relevance score of document $i$ given by system $j$:

$\mathrm{score}(i) = n_i^{\,p} \sum_{j} rel_{ij}$

where $p = 0$, $1$ and $-1$ for CombSUM, CombMNZ and CombANZ respectively.
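A minimal sketch of the three Comb* methods under the general formula above, assuming each run is represented as a dictionary mapping document ids to already normalised relevance scores; the function name and input format are illustrative, not taken from the paper.

```python
from collections import defaultdict

def comb_fuse(runs, p=0):
    """score(i) = n_i**p * sum_j rel_ij; p = 0 gives CombSUM,
    p = 1 gives CombMNZ and p = -1 gives CombANZ."""
    totals = defaultdict(float)   # sum of relevance scores per document
    counts = defaultdict(int)     # number of systems that returned the document
    for scores in runs:
        for doc, rel in scores.items():
            totals[doc] += rel
            counts[doc] += 1
    fused = {doc: (counts[doc] ** p) * total for doc, total in totals.items()}
    # higher fused score means an earlier position in the merged list
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Two hypothetical runs with normalised scores; CombMNZ rewards d1 for
# being returned by both systems.
runs = [{"d1": 0.9, "d2": 0.4}, {"d1": 0.7, "d3": 0.5}]
print(comb_fuse(runs, p=1))   # [('d1', 3.2), ('d3', 0.5), ('d2', 0.4)]
```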
4) Reciprocal rank fusion
In this approach, the retrieval systems determine the rank positions. When a document is returned by more than one system, the inverses of its rank positions are summed, since documents returned by several retrieval systems may be more likely to be relevant. Systems that do not rank a document are skipped. The rank score of document $i$ is computed from its positions $rank_{ij}$ in the result lists of the systems $j = 1, \dots, n$ that returned it:

$r(i) = \dfrac{1}{\sum_{j} 1/rank_{ij}}$

First, the rank position score of each document to be combined is evaluated; then, using these rank position scores, the documents are sorted in non-decreasing order [6].

Example: Suppose that we have four different retrieval systems A, B, C and D, with a document collection composed of the documents a, b, c, d, e, f and g. Let us assume that for a given query their results are ranked as follows:

A = (b, d, c, a)
B = (a, b, c, f, g)
C = (c, a, f, e, b, d)
D = (a, d, g, f)

Now we compute the rank position score of each document:

r(a) = 1/(1/4 + 1/1 + 1/2 + 1/1) = 0.36
r(b) = 1/(1/1 + 1/2 + 1/5) = 0.59
r(c) = 1/(1/3 + 1/3 + 1/1) = 0.60
r(d) = 1/(1/2 + 1/6 + 1/2) = 0.85
r(e) = 1/(1/4) = 4.00
r(f) = 1/(1/4 + 1/3 + 1/4) = 1.20
r(g) = 1/(1/5 + 1/3) = 1.87

Sorting these scores in non-decreasing order, the final ranked list of documents is

a > b > c > d > f > g > e

i.e. a is the document with the highest rank (the topmost document).
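The reciprocal rank computation above can be sketched as follows; the input format (one ranked list of document ids per system, best first) is an assumption made for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fuse(rankings):
    """rankings: one ranked list of document ids per system, best first."""
    inv_sum = defaultdict(float)
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            inv_sum[doc] += 1.0 / pos      # systems not ranking a document are skipped
    scores = {doc: 1.0 / s for doc, s in inv_sum.items()}
    # non-decreasing order of r(i): the smallest score is the top document
    return sorted(scores.items(), key=lambda kv: kv[1])

runs = [list("bdca"), list("abcfg"), list("cafebd"), list("adgf")]
print(reciprocal_rank_fuse(runs))   # a, b, c, d, f, g, e -- as in the example above
```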

5) Borda-fuse
This is a method taken from the social theory of voting and adapted to data fusion. The highest ranked individual (in an n-way vote) gets n votes and each subsequent one gets one vote less (so the number two gets n-1, the number three gets n-2, and so on). If there are candidates left unranked by a voter, the remaining points are divided evenly among the unranked candidates. Then, for each alternative, all the votes are added up and the alternative with the highest number of votes wins the election.

Example: Consider the example given for Reciprocal Rank fusion again. The Borda count (BC) of a document i is computed by summing its Borda count values in the individual systems (BC_A in system A, etc.):

BC(i) = BC_A(i) + BC_B(i) + BC_C(i) + BC_D(i)

Now we compute the BC for each document:

BC(a) = 4 + 7 + 6 + 7 = 24
BC(b) = 7 + 6 + 3 + 2 = 18
BC(c) = 5 + 5 + 7 + 2 = 19
BC(d) = 6 + 1.5 + 2 + 6 = 15.5
BC(e) = 2 + 1.5 + 4 + 2 = 9.5
BC(f) = 2 + 4 + 5 + 4 = 15
BC(g) = 2 + 3 + 1 + 5 = 11

Sorting these scores in non-increasing order, the final ranked list of documents is

a > c > b > d > f > g > e
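A sketch of Borda-fuse as described above, including the even split of the leftover points among documents a system did not rank; the function signature is hypothetical.

```python
def borda_fuse(rankings, candidates):
    """rankings: one ranked list of document ids per system (best first);
    candidates: the set of all documents being voted on."""
    n = len(candidates)
    bc = {doc: 0.0 for doc in candidates}
    for ranking in rankings:
        for pos, doc in enumerate(ranking):          # the top document gets n points,
            bc[doc] += n - pos                       # the next one n-1, and so on
        unranked = [doc for doc in candidates if doc not in ranking]
        if unranked:
            leftover = sum(range(1, n - len(ranking) + 1))   # points left undistributed
            for doc in unranked:
                bc[doc] += leftover / len(unranked)          # shared evenly
    return sorted(bc.items(), key=lambda kv: kv[1], reverse=True)

runs = [list("bdca"), list("abcfg"), list("cafebd"), list("adgf")]
print(borda_fuse(runs, set("abcdefg")))   # a(24) > c(19) > b(18) > d(15.5) > f(15) > g(11) > e(9.5)
```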
6) Condorcet-fuse
This is another method taken from the social theory of voting. In the Condorcet election method, voters rank the candidates in order of preference. The vote counting procedure then takes into account each preference of each voter for one candidate over another. The Condorcet voting algorithm is a majoritarian method that specifies the winner as the candidate that beats each of the other candidates in a pairwise comparison.

Example: Again, consider the previous example, but this time let the ranks be given as:

A: a > c = b > g
B: b > a > c > d > f > e > g
C: a = b > c > f > g > e
D: c > e > d

Note that, for instance, in system A documents b and c have the same original rank.

In the first stage, we use an N x N matrix for the pairwise comparison, where N is the number of candidates. Each non-diagonal entry (i, j) of the matrix shows the number of votes for i over j (i.e., cell [a, b] shows the number of wins, losses and ties of document a against document b, respectively). While counting the votes of a system, a document loses to all the documents retrieved by that system if it is not itself retrieved.

      a        b        c        d        e        f        g
a     -        1, 1, 2  3, 1, 0  3, 1, 0  3, 1, 0  3, 0, 1  3, 0, 1
b     1, 1, 2  -        2, 1, 1  3, 1, 0  3, 1, 0  3, 0, 1  3, 0, 1
c     1, 3, 0  1, 2, 1  -        4, 0, 0  4, 0, 0  4, 0, 0  4, 0, 0
d     1, 3, 0  1, 3, 0  0, 4, 0  -        1, 2, 1  2, 1, 1  2, 1, 1
e     1, 3, 0  1, 3, 0  0, 4, 0  2, 1, 1  -        1, 2, 1  2, 2, 0
f     0, 3, 1  0, 3, 1  0, 4, 0  1, 2, 1  2, 1, 1  -        2, 1, 1
g     0, 3, 1  0, 3, 1  0, 4, 0  1, 2, 1  2, 2, 0  1, 2, 1  -

Table 1: Pairwise wins, losses and ties in Condorcet

After that, we determine the pairwise winners. Each pair of documents is compared; the winner receives one point in its win column and the loser receives one point in its lose column. If the simulated pairwise election is a tie, both receive one point in the tie column.

      Win  Lose  Tie
a     5    0     1
b     5    0     1
c     4    2     0
d     2    4     0
e     1    4     1
f     2    4     0
g     0    5     1

Table 2: Total wins, losses and ties in Condorcet

To rank the documents we use their win and lose values. If one document has more wins than the other, it wins. Otherwise, if their win counts are equal, we consider their lose counts: the document with the smaller lose count wins. If both the win and lose counts are equal, the two documents are tied. The final ranking of the documents in the example is:

a = b > c > d = f > e > g

In our implementation, the documents a and b are assigned the ranks 1 and 2 in a random fashion, and the documents d and f are assigned the ranks 4 and 5 in a random fashion.

Condorcet Paradox: In a paradoxical case there is an equivalence class of winners, and one cannot pick the top winner or rank them. A commonly used example is the following:

A: a > b > c
B: b > c > a
C: c > a > b

In this example, if the documents in such an equivalence class are considered tied, the problem is resolved.
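The pairwise-majority procedure can be sketched as below. The input format (one dict per system mapping documents to rank positions, with tied documents sharing a position and missing documents treated as losing to every retrieved document) and the small three-system usage example are illustrative assumptions, not data from the paper.

```python
from itertools import combinations
from math import inf

def condorcet_fuse(rank_maps):
    """rank_maps: one dict per system mapping document id -> rank position
    (smaller is better, tied documents share a position). Documents missing
    from a system get rank infinity, i.e. they lose to every document that
    system retrieved and tie with the other missing documents."""
    docs = sorted({d for m in rank_maps for d in m})
    wins = {d: 0 for d in docs}
    losses = {d: 0 for d in docs}
    for x, y in combinations(docs, 2):
        x_votes = sum(m.get(x, inf) < m.get(y, inf) for m in rank_maps)
        y_votes = sum(m.get(y, inf) < m.get(x, inf) for m in rank_maps)
        if x_votes > y_votes:
            wins[x] += 1; losses[y] += 1
        elif y_votes > x_votes:
            wins[y] += 1; losses[x] += 1
        # equal vote counts: the pairwise election is a tie, neither side scores
    # rank by number of wins, breaking equal wins by the smaller number of losses
    return sorted(docs, key=lambda d: (-wins[d], losses[d]))

# Small hypothetical election: three systems ranking three documents.
runs = [{"a": 1, "b": 2, "c": 3}, {"b": 1, "c": 2, "a": 3}, {"a": 1, "c": 2, "b": 3}]
print(condorcet_fuse(runs))   # ['a', 'b', 'c']: a beats b and c, b beats c (each 2 systems to 1)
```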

B. System selection methods

1) Normal
All systems are selected to be used in the fusion process.

2) Best
Only the top performing systems are selected to be used in data fusion. One common way is to select a number of systems that yield high MAP when evaluated against the TREC qrels.

3) Bias
The systems that behave differently from the norm (the majority of all systems used in fusion) are selected. The motivation behind this approach is that using such systems would eliminate ordinary systems from data fusion, which could provide better discrimination among documents and systems.

To compute the bias of a particular system, we first calculate the similarity between the response vector of the norm and that of the retrieval system using a metric, e.g. their dot product divided by the product of their lengths (the cosine similarity measure):

$s(v, w) = \dfrac{\sum_k v_k w_k}{\sqrt{\left(\sum_k v_k^2\right)\left(\sum_k w_k^2\right)}}$

The bias between these two vectors is then defined as follows [5]:

$B(v, w) = 1 - s(v, w)$

Two variants of the bias calculation exist: one ignores the order of documents in the retrieved sets and the other does not. To ignore position, the frequency of document occurrence is used to build the response vectors. To take the order of documents into account, we instead increment the frequency of a document by m/i, where m is the number of positions and i is the position of the document in the retrieved result set. Since users usually only look at the higher ranked documents, considering order in the bias calculation gains importance [6].

Example: Suppose that we use two hypothetical retrieval systems A and B to define the norm, with three queries processed by each retrieval system, and that the (seven) distinct documents retrieved by either A or B are a, b, c, d, e, f and g. Counting document occurrence frequencies, the response vectors for A, B and the norm are, respectively:

X_A = (2, 4, 2, 3, 1, 2, 0)
X_B = (2, 4, 1, 3, 1, 3, 1)
X   = (4, 8, 3, 6, 2, 5, 1)

The similarity of X_A to X is

s(X_A, X) = 76 / √(38 · 155) ≈ 0.9903

and that of X_B to X is

s(X_B, X) = 79 / √(41 · 155) ≈ 0.9910

So the bias values for the two systems are:

Bias(A) = 1 - 0.9903 = 0.0097
Bias(B) = 1 - 0.9910 = 0.0090

If we repeat the calculations taking the order of documents into account, the response vectors are:

X_A = (11/2, 9, 11/2, 11/3, 1, 3, 0)
X_B = (8, 7, 4, 16/3, 1, 25/6, 1)
X   = (27/2, 16, 19/2, 9, 2, 43/6, 1)

and the bias values, found in the same way, are:

Bias(A) = 1 - 0.986704 = 0.013296
Bias(B) = 1 - 0.986702 = 0.013298
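A sketch of the frequency-based (order-ignoring) bias computation, using the response vectors of the example above; taking the norm as the element-wise sum of the two systems' vectors follows that example.

```python
import math

def cosine(v, w):
    dot = sum(a * b for a, b in zip(v, w))
    return dot / math.sqrt(sum(a * a for a in v) * sum(b * b for b in w))

def bias(system_vector, norm_vector):
    """B(v, w) = 1 - s(v, w), with s the cosine similarity."""
    return 1.0 - cosine(system_vector, norm_vector)

# Frequency-based response vectors from the example above; the norm is the
# element-wise sum of the two systems' vectors.
x_a = [2, 4, 2, 3, 1, 2, 0]
x_b = [2, 4, 1, 3, 1, 3, 1]
norm = [a + b for a, b in zip(x_a, x_b)]
print(round(bias(x_a, norm), 4), round(bias(x_b, norm), 4))   # 0.0097 0.009
```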

IV. DATA SETS AND MEASURES

We used the ad hoc tracks of TREC-3, -4, -5 and -7 [7]. Table 3 gives the number of runs from each TREC ad hoc track used in our experiments.

TREC    Runs
3       40
4       33
5       61
7       103

Table 3: Number of TREC runs

We used mean non-interpolated average precision (MAP) to evaluate systems. Precision is the proportion of the retrieved documents that are relevant. Average precision for a single topic is the average of the precision values obtained after each relevant document is retrieved, using zero as the precision for relevant documents that are not retrieved. For multiple topics (queries), we take the mean of these average precisions.

All experiments were run on a Linux system with an Intel Core 2 Duo processor and 4 GB of RAM. None of the steps took more than a few seconds, so we were able to test various pool depths and pseudorel percentages easily to find the optimum values.
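The MAP measure described above can be sketched as follows; the single-topic usage example at the end is hypothetical.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for a single topic: the mean of the
    precision values observed after each relevant document is retrieved, with
    relevant documents that are never retrieved contributing zero."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(ranked_docs_by_topic, qrels):
    """MAP: the mean of the per-topic average precisions."""
    topics = list(ranked_docs_by_topic)
    return sum(average_precision(ranked_docs_by_topic[t], qrels.get(t, set()))
               for t in topics) / len(topics)

# Hypothetical topic with three relevant documents, two of them retrieved:
print(round(average_precision(["d1", "d4", "d2"], {"d1", "d2", "d3"}), 4))   # 0.5556
```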

V. EXPERIMENTAL RESULTS

In our first set of experiments we examined the effect of different pool depths and pseudorel percentages. We tested this on the TREC-7 dataset with the Borda-fuse and CombMNZ methods, examining pool depths from 30 to 150 and pseudorel percentages from 0.1 to 0.7. Figures 3 and 4 show the results.

Fig. 3: Comparison of pool depth and pseudorel percentage with Best system selection
Fig. 4: Comparison of pool depth and pseudorel percentage with Normal system selection

The figures show that as we increase the pool depth and pseudorel percentage, the MAP score of the fused results improves: if relevant documents continue to appear ranked higher than non-relevant documents, the average precision of the fused system will improve. This is clearly the case for TREC-7 up to a pool depth of 150 and a pseudorel percentage of 0.7. We did not examine whether these values are also optimal for the other TRECs we used, but we nevertheless used them in the rest of the experiments, as we do not expect much performance degradation if they are not optimal.

The following tables show our experimental results for the Borda count, CombMNZ, CombANZ and CombSUM fusion methods with the Best, Bias and Normal system selection approaches on TREC-3, -4, -5 and -7. Note that we used a different number of systems in each TREC for testing the bias concept, because the number of systems that participated in each TREC differs and we have to use a certain percentage of that number with bias to get meaningful results. In each table, the maximum entry in a column indicates the best system selection method for that data fusion method in the corresponding TREC, and the maximum entry in a row indicates the best data fusion method for that system selection method.

TREC-3    BORDA   COMBMNZ  COMBANZ  COMBSUM
BEST10    0.2068  0.3875   0.1917   0.3657
BEST20    0.2011  0.3967   0.1879   0.3518
BIAS10    0.1451  0.2581   0.1093   0.2098
BIAS20    0.1534  0.2863   0.1040   0.2239
NORMAL    0.1478  0.3973   0.1856   0.3575

Table 4: TREC-3 results

TREC-4    BORDA   COMBMNZ  COMBANZ  COMBSUM
BEST10    0.0491  0.0399   0.2107   0.0687
BEST20    0.0378  0.0149   0.0176   0.0162
BIAS10    0.0000  0.0000   0.0000   0.0000
BIAS20    0.0003  0.0001   0.0000   0.0001
NORMAL    0.0232  0.0086   0.0096   0.0092

Table 5: TREC-4 results

TREC-5    BORDA   COMBMNZ  COMBANZ  COMBSUM
BEST10    0.2901  0.3360   0.1696   0.3142
BEST20    0.2608  0.3732   0.1368   0.3452
BIAS10    0.0858  0.1545   0.0398   0.1392
BIAS30    0.1316  0.3246   0.0926   0.0011
NORMAL    0.1278  0.3416   0.0934   0.3206

Table 6: TREC-5 results

TREC-7    BORDA   COMBMNZ  COMBANZ  COMBSUM
BEST10    0.3625  0.4639   0.2052   0.4585
BEST20    0.3291  0.4725   0.2097   0.4661
BIAS20    0.1276  0.4108   0.1227   0.4083
BIAS50    0.1257  0.3614   0.1083   0.3399
NORMAL    0.1038  0.3593   0.0849   0.3428

Table 7: TREC-7 results

These tables allow us both to compare the system selection methods for each data fusion method (columns) and to compare the data fusion methods for each system selection method (rows).

Examining the columns, we see that in all but one of the test cases, best system selection yields better results than both normal system selection and bias-based system selection. This is an expected result: the best systems have high relevant overlap, and when relevant overlap is higher than non-relevant overlap, data fusion yields a performance improvement, as conjectured by Lee [8]. Further examination of the columns shows that bias-based system selection improves over normal system selection on TREC-5 and TREC-7. On TREC-3, normal system selection yields better results than bias selection, but a closer look at the TREC-3 results shows that normal selection is competitive there even against best selection; one reason may be that the systems participating in TREC-3 produce similar results. On TREC-4 the results are anomalous, almost the opposite of what we expected; this may be the reason TREC-4 was not used in the experiments of [6]. Likewise, [3], [9] and [10] use TREC-3 and TREC-5 data in their experiments but not TREC-4.

To see which data fusion method performs best, we examine the rows of the tables. The results show that, except for TREC-4, CombMNZ yields the best results, and the second best is almost always CombSUM. These results agree with previous results in the literature (e.g. see [3]), as CombMNZ is a very competitive data fusion method.

To see the effectiveness of the fusion methods, we compare their MAP scores under normal system selection against those of the top and median systems of the TREC under consideration. The MAP values for all systems are given in Table 8, and Figure 5 plots them. The performance of CombMNZ and CombSUM is comparable to that of the top system in TREC-5 and TREC-7. CombANZ and Borda-fuse never outperform the other systems in any TREC. The median system is worse than the best fusion methods (CombMNZ and CombSUM) except on TREC-4.

                TREC-3  TREC-4  TREC-5  TREC-7
TOP SYSTEM      0.4748  0.3631  0.3165  0.3702
MEDIAN SYSTEM   0.2640  0.2287  0.1951  0.2032
BORDA-FUSE      0.1478  0.0232  0.1278  0.1038
COMBANZ         0.1856  0.0096  0.0934  0.0849
COMBMNZ         0.3973  0.0086  0.3416  0.3593
COMBSUM         0.3575  0.0092  0.3206  0.3428

Table 8: MAP values of different systems with normal system selection

Fig. 5: MAP graph of various systems with normal system selection

VI. CONCLUSION

In this paper, we evaluated and compared system selection and merging methods used in data fusion. For system selection, we considered the effectiveness of selecting all of the systems, some of the best performing systems, and the systems that behave differently from the majority (i.e. biased systems). The experiments show that best selection is the superior system selection method. Using biased systems can also improve retrieval effectiveness: in some cases bias selection outperforms normal selection, and it usually yields MAP scores close to those of best selection. Among the data fusion methods, CombSUM and CombMNZ perform much better than the others. However, these two methods require relevance scores from the input systems. In cases where we only have the orderings but not the original scores, they cannot be applied, and we have to use methods that rely only on ranking information, such as the Condorcet method [6] and the Borda count.

REFERENCES

[1] M. Montague and J. A. Aslam. Condorcet fusion for improved retrieval. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM '02), pages 538-548, New York, NY, USA, 2002. ACM Press.
[2] E. A. Fox and J. A. Shaw. Combination of multiple searches. In Proceedings of the 2nd Text REtrieval Conference (TREC-2), National Institute of Standards and Technology Special Publication 500-215, pages 243-252, 1994.
[3] J. A. Aslam and M. Montague. Models for metasearch. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01), pages 276-284, New York, NY, USA, 2001. ACM Press.
[4] M. Montague and J. A. Aslam. Relevance score normalization for metasearch. In Proceedings of the 10th International Conference on Information and Knowledge Management (CIKM '01), pages 427-433, New York, NY, USA, 2001. ACM Press.
[5] A. Mowshowitz and A. Kawaguchi. Measuring search engine bias. Information Processing and Management, 41(5):1193-1205, September 2005.
[6] R. Nuray and F. Can. Automatic ranking of information retrieval systems using data fusion. Information Processing and Management, 42(3):595-614, May 2006.
[7] Text REtrieval Conference (TREC) Home Page, <http://trec.nist.gov>.
[8] J. H. Lee. Analyses of multiple evidence combination. In Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '97), pages 267-276, 1997.
[9] M. Montague and J. A. Aslam. Metasearch consistency. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '01), New York, NY, USA, 2001. ACM Press.
[10] D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), pages 139-146, New York, NY, USA, 2006. ACM Press.