Incremental Retrieval of documents relevant to a topic


Caroline Lyon, Bob Dickerson, James Malcolm
University of Hertfordshire, UK
c.m.lyon@herts.ac.uk

Introduction

As new participants in TREC, on the Filtering Track, we have started by investigating two methods of producing document profiles. We begin by looking for "obvious" profiles that detect closely related documents. This year we have started by looking for:
- lexically similar cases
- semantically similar cases, based on a simple combination of keywords.

Characteristics of the Reuters data

Before addressing specific tasks we investigated the Reuters data. It was expected that in this domain there would be some similar text in different documents: the extent is quite significant. We used the Ferret software, which we have recently developed [2], designed to ferret out similar passages of text in large document collections. An experiment was carried out to compare each document with about 1000 others, taken in date order. We went through the test corpus (723,141 documents) and, for every set of 1000 documents, compared each with each (that is, 499,500 comparisons per set). Of course, if file A is similar to file B and to file C, then it is quite likely that file B is similar to file C.

We found 48,918 cases of identical text. Some of the files were very short; for instance, regular industrial reports might have no more than 10 content words in the text. Omitting files with 10 or fewer content words, 6,616 had identical text. The analysis also showed that in a further large number of file pairs the texts were very close (this and other terms are explained below): 287,391 pairs fell into this category, or 24,017 after excluding files with 10 or fewer content words. There are 718,443 pairs with significant matching passages; of those with more than 10 content words in the text, 228,130 fall into this category.
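The windowed, each-with-each comparison scheme above can be sketched as follows (a minimal Python illustration, not the Ferret implementation; the function name and document representation are assumptions):

```python
from itertools import combinations

def windowed_pairs(docs, window=1000):
    """Yield every pair of documents within consecutive windows of the
    collection, taken in date order, rather than comparing all pairs
    across the whole corpus."""
    for start in range(0, len(docs), window):
        block = docs[start:start + window]
        # each-with-each within one window: C(1000, 2) = 499,500 pairs
        yield from combinations(block, 2)
```

For a full window this generates 1000 * 999 / 2 = 499,500 pairs, matching the per-set comparison count reported above.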

Method of determining similarity

The method used is as follows. First, each document is pre-processed so that only the id number, the headline, and the text are kept, while tags are omitted. Stop words are filtered out; there are 440 stop words, and the list includes entries which, though not function words, have little semantic content. Then each document is converted into a set of word triples, composed of every sequential triple. Thus, the sentence:

  Given a topic description and some example relevant documents build a filtering profile.

would be converted into the set:

  given a topic
  a topic description
  topic description and
  etc.

or, after taking out stop words:

  given topic description
  topic description example
  description example relevant
  etc.

Then each pair of documents is compared for matching word triples. This raw score is converted into the metric "resemblance", based on set-theoretic principles. Informally, resemblance is the number of matches between two sets, scaled by joint set size; it is also known as the Jaccard coefficient. Let S(A) and S(B) be the sets of trigrams from documents A and B respectively, and let R(A,B) be the resemblance between A and B:

  R(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|

For the preliminary investigations into the Reuters data, documents are identical if, after pre-processing, R = 1.0. The category very close takes 1.0 > R >= 0.8, while significant matching passages takes 0.8 > R >= 0.4. These are arbitrary boundaries. As an indication of the scale of similarity, it is worth considering measures used in another field: the Ferret was originally developed for detecting plagiarism in students' work, where at a level of R > 0.04 (an order of magnitude smaller than that used here) matching passages were typically found, possibly quite short.

Time taken to process each set of 1000 files was about 1 minute, about 11 hours for the full test set, on a Pentium III processor at 700 MHz with 512 MB RAM. However, there is considerable scope for increasing the efficiency of this implementation.
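As a concrete sketch, the trigram-set construction and the resemblance measure can be written as follows (a minimal Python illustration; the real pre-processing also strips tags and uses the 440-entry stop word list, and the function names here are illustrative):

```python
def trigrams(text, stop_words=frozenset()):
    """Set of sequential word triples, after stop-word filtering."""
    words = [w for w in text.lower().split() if w not in stop_words]
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

def resemblance(a, b, stop_words=frozenset()):
    """Jaccard coefficient over the two documents' trigram sets:
    matches between the sets, scaled by joint set size."""
    sa, sb = trigrams(a, stop_words), trigrams(b, stop_words)
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

def category(r):
    """Bands used in the Reuters analysis (boundaries are arbitrary)."""
    if r == 1.0:
        return "identical"
    if r >= 0.8:
        return "very close"
    if r >= 0.4:
        return "significant matching passages"
    return "below threshold"
```

Two identical texts give R = 1.0; texts with no shared triple give R = 0.0.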

Theoretical background

The dominant approach in statistical pattern analysis is based on the well-known method of abstracting significant features and lining them up in a feature vector for further processing. However, there are relationships between the number of elements of the feature vector, the amount of training data available, and the level of generalization achieved. In text processing a very large number of words have to be processed, even after filtering through a stop word list, and the amount of training data will typically not be enough to ensure a satisfactory level of probably approximately correct outcomes; for further details see [1, 3]. Therefore, a set-theoretic approach may be appropriate in word-based text processing, as described in [2].

Routing filtering with lexical profiles

The method described above was then applied to give a preliminary analysis of topics in the filtering task. For this we took just the three sample documents given for the adaptive filtering task, and did not refer to the topic description. The three sample documents are stripped of XML tags, filtered through the stop word list, and concatenated. This text is then compared to all the documents in the test data (similarly detagged and filtered through the stop word list). For Topic 102, a pairing producing 16 matches (resemblance 0.05) is displayed in Figure 1. The number of matching word triples shown in the display is much greater than that produced by the match detection software, since for display we go back to the original documents, which include stop words and XML tags.

Figure 1: From Topic 102, display of sample text 79021 and relevant document 287139

In Figure 1 the lower of the two files, id 79021, is one of the three example documents for Topic 102. The upper file, id 287139, has short passages of matching text; it seems that the original story was picked up again some time later. This illustration shows how short passages of matching text can be detected.

Lexically similar text is often semantically similar too. However, this is not always the case, as when the processor picks up commonly occurring comments such as "Reuters has not verified these reports and cannot vouch for their accuracy." The type of lexical similarity described above indicates semantic similarity, but the opposite is not true: if two people write on the same topic independently, the resulting articles will not be lexically similar in this way, as previous experiments have shown. When texts are lexically similar it indicates that there has been some element of cutting and pasting.

Routing filtering with simple keyword profiles

The concept behind this method is to have several sets of keywords; for a document to be considered relevant it must have at least one member in each set. The keywords have been selected manually at this point, from the topic description and three sample documents for the adaptive filtering task. The rest of the training data was used for primary evaluation of this approach. Topics R101 and R125 were entered on this track.

For initial work there were 3 sets of keywords. It was essential to have a member in sets key1 and key2; key3 was a set of supporting keywords whose frequency of occurrence determined the ranking. As an example, the keywords for Topic R101, on industrial espionage, were as follows:

  key1: espionage, spy, spying
  key2: business, commercial, economic, industrial, technical

Figure 2: Essential keywords

Using this method cuts down on the possible combinatorial explosion of combinations of terms: industrial espionage, commercial espionage, industrial spying, etc. On later inspection, it seemed that key1 might also have included "secrets" and key2 "company". This would have caught some documents that slipped through the net, but might have produced false positives too.
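A sketch of this keyword-profile filter for Topic R101 (a Python illustration; the set contents are taken from Figures 2 and 3, while the scoring function itself is a hypothetical reconstruction):

```python
from collections import Counter

# Keyword sets for Topic R101, industrial espionage (Figures 2 and 3)
KEY1 = {"espionage", "spy", "spying"}
KEY2 = {"business", "commercial", "economic", "industrial", "technical"}
KEY3 = {"charges", "confidential", "court", "courts", "covert",
        "intelligence", "investigation", "police", "prosecution",
        "prosecutor", "prosecutors", "secret", "secrets", "surveillance"}

def score(text):
    """Return None if the document misses an essential set; otherwise
    rank by the frequency of supporting (key3) keywords."""
    words = text.lower().split()
    # the document must contain at least one member of each essential set
    if not (KEY1 & set(words)) or not (KEY2 & set(words)):
        return None
    counts = Counter(words)
    return sum(counts[w] for w in KEY3)
```

A document mentioning, say, "industrial espionage" passes the essential test regardless of which paraphrase it uses, which is what avoids enumerating every term combination.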

  key3: charges, confidential, court, courts, covert, intelligence, investigation, police, prosecution, prosecutor, prosecutors, secret, secrets, surveillance

Figure 3: Non-essential keywords used for ranking

Results

Using this method on Topic R101 produced a score of 0.428, compared to a median of 0.469 and a maximum of 0.902. On R125 it produced a result of 0.062, compared to a median of 0.327 and a maximum of 0.565. In both cases the number of relevant documents was well below the specified number: for R101, 477 were found; for R125, 260. However, limited random sampling indicated that no false positives were found.

Discrepancies in the data

In some cases the topic description and the training documents were not consistent. For example, Topic R110 was entitled "Terrorism Middle East tourism", and the narrative said relevant documents should correlate terrorism with tourism; however, terrorism and associated terms were not mentioned in the 3 training documents for adaptive filtering (42439, 82926, 85147). Topic R125 was entitled "Scottish Independence", but there was no mention of Scotland in any form in some documents judged relevant (27974, 48375, 68664). In Topic 134 the narrative of the topic description said that documents were relevant only if statistics were included, yet there were no statistics in one of the three training documents (73372).

Conclusion

The first method employed detected little of the lexical similarity between training and testing documents that is indicative of re-using text. However, our investigation of general characteristics of the data showed that there is much re-use of text on close dates. Taking a sideways glance at the Novelty Track, this method could be useful for sorting out similar versions of a story from ones with new information. Whether the new information is strictly relevant would be another matter: for instance, reports on ABA banking policy (100017, 100398) had similarities (resemblance 0.55), and the second had additional information, on the speaker's clothes, which might not be considered relevant.

The second method employed, using combinations of keywords, is a useful way of detecting a core of relevant documents. This could possibly be automated using thesauri and/or WordNet. If the Filtering Track is reinstated, we plan to move on to the more interesting hard-to-detect cases, and to integrate different profiles as in co-training.

References

1. A. K. Jain, R. P. W. Duin and J. Mao. Statistical Pattern Recognition: A Review. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(1), 2000.
2. Caroline Lyon, James Malcolm, and Bob Dickerson. Detecting short passages of similar text in large document collections. Proc. of Conference on Empirical Methods in Natural Language Processing, 2001.
3. C. Lyon and R. Frank. Using single layer networks for discrete, sequential data: an example from Natural Language Processing. Neural Computing Applications, 5(4), 1997.