Measuring Search Effectiveness: Lessons from Interactive TREC


Measuring Search Effectiveness: Lessons from Interactive TREC
School of Communication, Information and Library Studies, Rutgers University
http://www.scils.rutgers.edu/~muresan/

Objectives
- Discuss methodologies and measures of effectiveness that, in our experience (mainly in the TREC Interactive track), have proven successful in painting an accurate picture of user interaction when seeking information.
- Classify the measures and discuss the contexts in which they can be used.
- Attempt to provide guidelines as to which measures are appropriate under which conditions.

Before doing IR evaluation, ask: what do we want from an IRS?
- Systemic approach. Goal (for a known information need): return as many relevant documents as possible and as few non-relevant documents as possible.
- Cognitive approach. Goal (in an interactive information-seeking environment, with a given IRS): support the user's exploration of the problem domain and the completion of the task.

The role of an IR system: a modern view
Support the user in:
- exploring a problem domain, understanding its terminology, concepts and structure
- clarifying, refining and formulating an information need
- finding documents that match the information need description (as many relevant documents as possible, as few non-relevant documents as possible)
- exploring the retrieved documents

Aspects to evaluate
- INPUT: problem definition, source selection, problem articulation
- Engine
- OUTPUT: examination of results, extraction of information, integration with the overall task

Some IR Evaluation Issues
- How best to evaluate the performance of the system as a whole
- How to be realistic yet controlled
- How to gather sufficient and adequate data from which it is possible to generalize meaningfully
- How to tailor evaluation measures and methods to specific contexts and tasks

Evaluation: IR-specific vs. non-specific
- IR-specific evaluation
  - Systemic: quality of the search engine; influence of various modelling decisions (stopword removal, stemming, indexing, weighting scheme, ...) (see the sketch below)
  - Interaction: support for query formulation; support for exploration of the search output
- Non-specific evaluation
  - Task-oriented evaluation: usefulness, usability; task completion, user satisfaction
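To make the "modelling decisions" above concrete, here is a minimal, hypothetical Python sketch of a toy indexing pipeline showing where stopword removal, a crude suffix-stripping step and a TF-IDF weighting scheme come in. It is illustrative only, not part of the original tutorial; the stopword list and all names are assumptions.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}  # toy stopword list (assumption)

def tokenize(text):
    # Lowercase, split on whitespace, drop stopwords, apply a naive "stemmer".
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    tokens = [t for t in tokens if t and t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") else t for t in tokens]  # naive plural stripping

def tf_idf_index(docs):
    # docs: dict doc_id -> text; returns doc_id -> {term: tf-idf weight}.
    tokenized = {d: tokenize(text) for d, text in docs.items()}
    df = Counter(t for toks in tokenized.values() for t in set(toks))
    n = len(docs)
    return {d: {t: (1 + math.log(c)) * math.log(n / df[t])
                for t, c in Counter(toks).items()}
            for d, toks in tokenized.items()}

if __name__ == "__main__":
    docs = {"d1": "Evaluation of interactive retrieval systems",
            "d2": "Retrieval systems and the evaluation of users"}
    print(tf_idf_index(docs))
```

Swapping any one decision (the stopword list, the stemming rule, the weighting formula) changes the index and hence the measured effectiveness, which is exactly what a systemic evaluation tries to isolate.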

Task-oriented evaluation (non-IR-specific)
- Time to complete a task
- Time to complete a task after a specified time away from the product
- Number and type of errors per task
- Number of errors per unit of time
- Number of navigations to online help or manuals
- Number of users making a particular error
- Number of users completing the task successfully
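As a rough illustration of how such task-level measures might be derived from interaction logs, here is a small hypothetical Python sketch; the log format and field names are assumptions, not something prescribed by the tutorial.

```python
from datetime import datetime

# Hypothetical per-event log: (user, timestamp, event), where event is one of
# "start", "error:<type>", "help", "complete".
LOG = [
    ("u1", "2004-05-17 10:00:00", "start"),
    ("u1", "2004-05-17 10:02:10", "error:bad_query"),
    ("u1", "2004-05-17 10:05:30", "complete"),
    ("u2", "2004-05-17 10:00:00", "start"),
    ("u2", "2004-05-17 10:03:00", "help"),
]

def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

def task_metrics(log):
    # Computes completion counts, error counts, help consultations and
    # mean time to complete the task, per the measures listed above.
    users = {u for u, _, _ in log}
    completed, times, errors, help_uses = 0, [], 0, 0
    for u in users:
        events = sorted((parse(ts), ev) for uu, ts, ev in log if uu == u)
        start = next(t for t, ev in events if ev == "start")
        errors += sum(1 for _, ev in events if ev.startswith("error"))
        help_uses += sum(1 for _, ev in events if ev == "help")
        done = [t for t, ev in events if ev == "complete"]
        if done:
            completed += 1
            times.append((done[0] - start).total_seconds())
    return {"users": len(users), "completed": completed, "errors": errors,
            "help_uses": help_uses,
            "mean_time_to_complete_s": sum(times) / len(times) if times else None}

print(task_metrics(LOG))
# {'users': 2, 'completed': 1, 'errors': 1, 'help_uses': 1, 'mean_time_to_complete_s': 330.0}
```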

Evaluation: Qualitative vs. Quantitative
- Qualitative studies
  - Heuristic evaluation, expert reviews, cognitive walkthroughs, etc.: preferred if the purpose of the study is to establish the usability of a system.
  - Naturalistic/ethnographic studies: preferred if the purpose of the study is to capture the behavior or preferences of a group of people in a certain setting.
- Quantitative studies
  - Systematic studies can produce invaluable insight into the effect of various parameters, mathematical models, interaction models, or even interface elements such as the query formulation mechanism or the layout of the search results.
  - Control over experimental variables, repeatability, observability.

Measures and dimensions of evaluation
Two dimensions: task specificity (general vs. task-specific) and interactivity (non-interactive, i.e. laboratory evaluation of the retrieval algorithm, vs. interactive, i.e. evaluation of the interaction process and outcome).
- General, non-interactive
  - Effectiveness: recall, precision, E, F, expected search length (see the sketch below)
  - Efficiency: time and space complexity
- General, interactive
  - User satisfaction
  - User effort (clicks, iterations, scrolling, documents seen, viewed or read)
  - Effectiveness: expected search length, precision at N seen
  - Efficiency: time to complete the task
- Task-specific, non-interactive
  - Question answering: mean reciprocal rank (MRR)
  - Filtering: utility
  - Topic distillation: coverage and accuracy
- Task-specific, interactive
  - Aspect retrieval: aspectual recall, number of saved documents
  - Question answering: completeness and correctness of the answer
  - Topic distillation: coverage and accuracy
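For reference, the effectiveness measures named in the general, non-interactive cell have standard definitions. A brief sketch (Rel and Ret denote the relevant and retrieved document sets, Q the set of test queries; the E measure is given in van Rijsbergen's parameterized form):

```latex
% Precision, recall and the balanced F-measure
\[
  P = \frac{|Rel \cap Ret|}{|Ret|}, \qquad
  R = \frac{|Rel \cap Ret|}{|Rel|}, \qquad
  F_1 = \frac{2PR}{P + R}
\]
% van Rijsbergen's E measure (b trades off precision against recall)
% and mean reciprocal rank over a query set Q
\[
  E_b = 1 - \frac{(1 + b^2)\,P\,R}{b^2 P + R}, \qquad
  \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
\]
```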

Interactive TREC: human in the loop
- Searcher characteristics influence performance: familiarity with the topic, expertise; searching skills, experience with a certain system; relevance judgments (which differ from the assessors')
- Experimental design needs to take user variability into account
- Real user searches are interactive: multiple queries are submitted, documents from multiple runs are saved
- User studies are expensive (time, effort)

Interactive TREC, a brief history: TREC 1-8
- Tasks: routing (initially) and ad hoc (later)
- Manual (human) intervention in query construction; multiple iterations and relevance feedback were allowed; at some point the query was considered final and was evaluated
- Results: manual query formulation beats automatic formulation; insights into the human query formulation and judging process were gained

Interactive TREC, a brief history: TREC 3
- Task: routing
- Topic: title, description, narrative
- Training provided in the form of relevance judgments
- Results: humans do not find the routing task natural; they are better at seeking relevant information than at formulating one best query, while algorithms are better than humans at learning from training data

Interactive TREC, a brief history: TREC 4
- Task: ad hoc. Find as many relevant documents for each topic as possible, without collecting too much rubbish.
- Procedure: submit the lists of saved documents (a frozen-rank evaluation was conducted); construct the final best query and submit its top 1000 documents for comparison to the automatic runs.
- Results: the ad hoc task is more natural than routing; frozen ranks proved difficult to evaluate; the main differences observed were between relevance judgments (searcher vs. searcher, searcher vs. assessor).

Interactive TREC, a brief history: TREC 5-6
- Task: aspectual/instance recall. Find documents that cover as many aspects of a topic as possible; once an aspect is covered, additional documents are not needed.
- Submit: sets of documents
- Judgments by assessors: aspectual (for each document, the list of aspects covered) and binary relevance judgments for each document
- Measures of performance: aspectual recall (see the sketch below); precision
- Experimental design: a baseline system (NIST's ZPRISE) allowed inter-site comparisons
- Results: assessors' judgments were inconsistent; the experiment is labor-intensive, and fatigue may have an effect
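Aspectual recall itself is straightforward to state: the fraction of a topic's aspects that are covered by at least one saved document. A minimal Python sketch with hypothetical judgment data (the document IDs and aspect labels are illustrative, not TREC data):

```python
def aspectual_recall(saved_docs, aspect_judgments, all_aspects):
    """Fraction of a topic's aspects covered by the documents the searcher saved.

    saved_docs: iterable of document IDs the searcher saved
    aspect_judgments: dict mapping doc_id -> set of aspect labels that document covers
    all_aspects: set of all aspect labels the assessors identified for the topic
    """
    covered = set()
    for doc in saved_docs:
        covered |= aspect_judgments.get(doc, set())
    return len(covered & all_aspects) / len(all_aspects)

# Hypothetical example: the topic has 4 aspects; the searcher saved 3 documents.
judgments = {"d1": {"a1", "a2"}, "d2": {"a2"}, "d3": {"a3"}, "d4": {"a4"}}
print(aspectual_recall(["d1", "d2", "d3"], judgments, {"a1", "a2", "a3", "a4"}))  # 0.75
```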

Interactive TREC, a brief history: TREC 7-9
- Task: aspectual recall
- Experimental design: inter-site comparison dropped; participating groups encouraged to evaluate various research hypotheses; number of queries and overall duration reduced
- Measures: aspectual recall, aspectual precision, elapsed time
- Results: no significant differences between baselines and experimental systems
- Decision to use a two-year cycle: observational (qualitative) studies to identify key issues and generate research questions, then detailed metric-based evaluations of those research questions

Interactive TREC, a brief history: TREC 10-11 and TREC 12 (interactive sub-track of the Web track)
- Task: aspectual recall
- Experimental design: no inter-site comparison; participating groups encouraged to pursue their own interests (support in query formulation, effect of output layout, etc.)
- Measures: aspectual recall, aspectual precision, elapsed time, effort
- Results: specific to each participating group; the experimental designs and instruments developed (questionnaires, interviews, etc.) are now widely used in IR user studies

Lessons Learned from the TREC Experience
- IR is inherently interactive; measures of search effectiveness alone are insufficient.
- Information seeking is engaged in for many different purposes, in many different contexts, to accomplish many different tasks; one (or one set of) measure(s) for evaluating IR in general is a chimera.
- It may not be a good idea to rely on external objective judgments for evaluation purposes.
- Experimental methods can be used successfully in user-centered evaluation of interactive IR.

Some conclusions or recommendations
- Perceptions of performance are as important as objective measures; both should be interpreted with respect to measures of the search process.
- Different measures need to be established with respect to the goals of different tasks.
- Specific experimental tasks should be designed so that the subjects' performance in the task, and the subjects' own evaluation of that performance, are the criteria for the evaluation measures.

References
- Nicholas J. Belkin et al., "Measuring Web Search Effectiveness: Rutgers at Interactive TREC", in Measuring Web Search Effectiveness: The User Perspective, workshop at WWW 2004, New York, May 2004 (paper, presentation).
- Ellen M. Voorhees and Donna Harman (eds.), TREC: Experiment and Evaluation in Information Retrieval, MIT Press, 2005, ISBN 0-262-22073-3.
  - Ch. 3: "Retrieval System Evaluation", Chris Buckley and Ellen Voorhees
  - Ch. 6: "The TREC Interactive Tracks: Putting the User into Search", Susan T. Dumais and Nicholas J. Belkin