TCDSCSS: Dimensionality Reduction to Evaluate Texts of Varying Lengths - an IR Approach


Arun Jayapal, Dept of Computer Science, Trinity College Dublin, jayapala@cs.tcd.ie
Martin Emms, Dept of Computer Science, Trinity College Dublin, martin.emms@cs.tcd.ie
John D. Kelleher, School of Computing, Dublin Institute of Technology, john.d.kelleher@dit.ie

Abstract

This paper provides the system description of our submission to the cross-level semantic similarity task of the SemEval-2014 workshop. Cross-level semantic similarity measures the degree of relatedness between texts of varying lengths, such as Paragraph to Sentence and Sentence to Phrase. Latent Semantic Analysis was used to evaluate the cross-level semantic relatedness between the texts, achieving above-baseline scores on the training and test datasets. We also tried a bag-of-vectors approach to evaluate the semantic relatedness; this approach, however, did not produce encouraging results.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

1 Introduction

Semantic relatedness between texts has been dealt with in many settings, but it is unusual to measure the semantic relatedness of texts of varying lengths, such as Paragraph to Sentence (P2S) and Sentence to Phrase (S2P). Such a measure is useful in natural language processing applications such as paraphrasing and summarization. The working principle of an information retrieval system, in which queries are much shorter than the documents in the index, is the motivation for this task. We attempted two ways to measure the semantic similarity for P2S and S2P on a scale of 0 to 4, where 4 means the two texts are similar and 0 means they are dissimilar: the first is Latent Semantic Analysis (LSA) and the second a bag-of-vectors (BV) approach. An example of target similarity ratings for comparison type S2P is provided in Table 1.

Sentence: Schumacher was undoubtedly one of the very greatest racing drivers there has ever been, a man who was routinely, on every lap, able to dance on a limit accessible to almost no-one else.

Score   Phrase
4       the unparalleled greatness of Schumacher's driving abilities
3       driving abilities
2       formula one racing
1       north-south highway
0       orthodontic insurance

Table 1: An example - Sentence to Phrase similarity ratings for each scale

2 Data

The task organizers provided training data, which included 500 pairs each of P2S, S2P and Phrase to Word (P2W), together with their similarity scores. The training data for P2S and S2P included text from different genres such as Newswire, Travel, Metaphoric and Reviews. In the training data for P2S, newswire text constituted 36% of the data, reviews constituted 10%, and the remaining three genres shared 54%. Considering the different genres in the training data, a chunk of the data provided for NIST TAC's Knowledge Base Population was used to build a term-by-document matrix on which to base the LSA method. The data included newswire text and web-text, where the web-text came mostly from blogs. We used 2343 documents from the NIST dataset¹, which were available in XML format. Further to the NIST dataset, all the paragraphs in the training data² of Paragraph to Sentence were added to the dataset.
To add these paragraphs to the dataset, we converted each paragraph into a new document and added the resulting documents to the corpus. The number of unique words identified in the corpus was approximately 40,000.

¹ Distributed by the LDC (Linguistic Data Consortium).
² Provided by the SemEval Task 3 organizers.
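For concreteness, the sketch below shows one way this corpus-assembly step could look. The file names, the one-paragraph-per-line layout of the training file, and the helper names are our own assumptions, not details from the paper.

```python
import glob
import re
from xml.etree import ElementTree

def extract_text(xml_path):
    # Concatenate all text nodes of one NIST TAC KBP document (XML).
    root = ElementTree.parse(xml_path).getroot()
    return " ".join(root.itertext())

# ~2343 NIST documents, each becoming one document in the corpus
corpus = [extract_text(p) for p in glob.glob("nist_tac_kbp/*.xml")]

# every training paragraph is appended as a separate document
with open("p2s_training_paragraphs.txt", encoding="utf-8") as f:
    corpus.extend(line.strip() for line in f if line.strip())

vocabulary = {tok for doc in corpus for tok in re.findall(r"[a-z]+", doc.lower())}
print(len(vocabulary))   # the paper reports roughly 40,000 unique words
```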

3 System description

We tried two different approaches for evaluating P2S and S2P similarity. Latent Semantic Analysis (LSA) using SVD worked better than the bag-of-vectors (BV) approach. Both approaches are described in this section.

3.1 Latent Semantic Analysis

LSA has been used for information retrieval, allowing retrieval via vectors over latent, arguably conceptual, dimensions rather than over surface word dimensions (Deerwester et al., 1990). It was thought this would be an advantage when comparing texts of varying length.

3.1.1 Representation

The data corpus was converted into an m x n term-by-document matrix A, where the counts c_i,j of each term w_i in the corpus are represented in rows and the respective documents d_j in columns:

            d_1     d_2     ...     d_n
    w_1    c_1,1   c_1,2    ...    c_1,n
A = w_2    c_2,1   c_2,2    ...    c_2,n
    ...     ...     ...     ...     ...
    w_m    c_m,1   c_m,2    ...    c_m,n

Document indexing rules such as text tokenization, case standardization, stop-word removal, token stemming, and removal of special characters and punctuation were followed to obtain the matrix A. Singular Value Decomposition (SVD) decomposes the matrix into U, Σ and V (i.e., A = UΣV^T) such that U and V are orthonormal matrices and Σ is a diagonal matrix of singular values. Retaining just the first k columns of U and V gives an approximation of A:

    A ≈ A_k = U_k Σ_k V_k^T    (1)

According to LSA, the columns of U_k can be thought of as representing latent, semantic dimensions, and an arbitrary m-dimensional vector v can be projected onto this semantic space by taking the dot-product with each column of U_k; we call the result v_sem. In the experiments reported later, the m-dimensional vector v is sometimes a vector of word counts, and sometimes a thresholded, boolean version mapping all non-zero counts to 1.

3.1.2 Similarity Calculation

To evaluate the similarity of a paragraph p and a sentence s, these are first represented as vectors of word counts, p and s, then projected into the latent semantic space to give p_sem and s_sem, between which the cosine similarity is calculated:

    cos(p_sem, s_sem) = (p_sem · s_sem) / (||p_sem|| ||s_sem||)    (2)

The cosine similarity metric provides a similarity value in the range 0 to 1, so to match the target range of 0 to 4, the cosine values were multiplied by 4. Exactly the same procedure is used for the sentence-to-phrase comparison. Further, the number of retained dimensions of U_k was varied, giving different dimensionalities of the LSA space. The results of testing at the reduced dimensions are discussed in Section 4.1.
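For concreteness, the following is a minimal sketch of this representation-and-similarity pipeline using numpy and scikit-learn; `corpus` is the document list from Section 2, the stemming step is elided, and all function names are ours rather than the system's. For a vocabulary of ~40,000 terms, a sparse truncated SVD (e.g. scipy.sparse.linalg.svds) would be preferable to the dense decomposition shown here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Term-by-document matrix A (m terms x n documents), stop words removed.
vectorizer = CountVectorizer(stop_words="english")
A = vectorizer.fit_transform(corpus).T.toarray()

# SVD of A; the first k columns of U span the latent semantic space.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = int(0.9 * U.shape[1])          # e.g. retain 90% of the dimensions
U_k = U[:, :k]

def project(text, boolean=True):
    # Represent a text over the m surface terms, optionally thresholded
    # to booleans (eq. discussion in 3.1.1), then project via U_k.
    v = vectorizer.transform([text]).toarray().ravel()
    if boolean:
        v = (v > 0).astype(float)
    return U_k.T @ v

def similarity_0_to_4(text_a, text_b):
    # Cosine in the latent space (eq. 2), rescaled from [0,1] to [0,4].
    a, b = project(text_a), project(text_b)
    return 4.0 * (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```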
3.2 Bag-of-Vectors

Another method we experimented with could be termed a bag-of-vectors (BV) approach: each word in an item to be compared is replaced by a vector representing its co-occurrence behavior, and the resulting bags of vectors enter into the comparison process.

3.2.1 Representation

For the BV approach, the same data sources as used for the LSA approach were turned into an m x m term-by-term co-occurrence matrix C:

            w_1     w_2     ...     w_m
    w_1    c_1,1   c_1,2    ...    c_1,m
C = w_2    c_2,1   c_2,2    ...    c_2,m
    ...     ...     ...     ...     ...
    w_m    c_m,1   c_m,2    ...    c_m,m

The same preprocessing steps as for the LSA approach were applied (text tokenization, case standardization, stop-word removal, and removal of special characters and punctuation). Via C, a bag-of-words representing a paragraph, sentence or phrase can be replaced by a bag-of-vectors, replacing each word w_i by the corresponding row of C; we call these rows word-vectors.
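A minimal sketch of building such a co-occurrence representation is shown below, assuming the windowed variant described in Section 4.2; the nested-counter structure and names are ours (a dense m x m array would be impractical at ~40,000 terms).

```python
from collections import Counter, defaultdict

# Term-by-term co-occurrence counts: C[w][u] is how often u occurs
# within WINDOW positions of w. The paper also tried a bigram variant.
C = defaultdict(Counter)
WINDOW = 6

for doc in corpus:
    toks = doc.lower().split()
    for i, w in enumerate(toks):
        # count neighbours up to WINDOW positions away, excluding w itself
        for u in toks[max(0, i - WINDOW):i] + toks[i + 1:i + 1 + WINDOW]:
            C[w][u] += 1

def word_vector(w):
    # A word's "word-vector": the row of C recording its co-occurrences.
    return C[w]
```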

3.2.2 Similarity Calculation

For calculating P2S similarity, the procedure is as follows. The paragraph and sentence are tokenized, stop-words are removed, and the results are represented as two token vectors p and s. For each word p_i in p, its word-vector from C is found and compared, via the cosine measure, to the word-vector of each word s_j in s. The highest similarity score for each word p_i is stored in a vector S_p, shown in (3). The overall semantic similarity score between the paragraph and the sentence is then the mean value of the vector S_p multiplied by 4, shown in (4):

    S_p = [ S_p_1  S_p_2  ...  S_p_n ]    (3)

    S_sim = (4/n) · Σ_{i=1}^{n} S_p_i    (4)

Exactly corresponding steps are carried out for the S2P similarity. Although experiments were carried out with this particular BV approach, the results were not encouraging. Details of the experiments are given in Section 4.2.
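The sketch below illustrates equations (3) and (4), reusing the `C` structure from the previous sketch; the helper names are ours, and the tokens are assumed to be non-empty, preprocessed lists.

```python
import math
from collections import Counter

def cosine(c1, c2):
    # Cosine between two sparse co-occurrence rows stored as Counters.
    dot = sum(v * c2[w] for w, v in c1.items() if w in c2)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def bv_similarity(paragraph_tokens, sentence_tokens):
    # Eq. (3): best cosine match among sentence words for each paragraph word.
    best = [max(cosine(C[p], C.get(s, Counter())) for s in sentence_tokens)
            for p in paragraph_tokens if p in C]
    # Eq. (4): mean of the best scores, rescaled to the 0-4 target range.
    return 4.0 * sum(best) / len(best) if best else 0.0
```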
4 Experiments

Different experiments were carried out using the LSA and BV systems described in Sections 3.1 and 3.2 on the dataset described in Section 2. Pearson correlation and Spearman's rank correlation were the metrics used to evaluate the performance of the systems. Pearson correlation measures the agreement between the system's score for each pair and the gold-standard score for that pair, while Spearman's rank correlation measures the agreement between the rankings of the pairs by similarity.
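As an illustration of this evaluation step, both metrics are available in scipy; `pairs` (the paragraph/sentence text pairs) and `gold` (their gold-standard ratings) are hypothetical names standing in for the training data.

```python
from scipy.stats import pearsonr, spearmanr

# Score every pair with the LSA sketch above, then correlate with gold.
system = [similarity_0_to_4(p, s) for p, s in pairs]
r, _ = pearsonr(system, gold)       # agreement of raw scores
rho, _ = spearmanr(system, gold)    # agreement of rankings
print(f"Pearson {r:.3f}  Spearman {rho:.3f}")
```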

4.1 LSA

The LSA model was used to evaluate the semantic similarity for both P2S and S2P.

4.1.1 Paragraph to Sentence

An initial word-document matrix A was built by extracting tokens based on spaces alone, with stop words removed and tokens sorted alphabetically. As described in Section 3.1.1, the SVD of A yields a matrix U_k which can be used to project an m-dimensional vector into a k-dimensional one. In one setting, the paragraph and sentence vectors projected into the LSA space carry word counts in their dimensions; in another setting, these vectors are thresholded into boolean versions before projection, with 1 for every non-zero count. The Pearson scores for these settings are in the first and second rows of Table 2, which shows how the scores vary with the number of dimensions of the LSA representation (that is, the number of columns of U that are kept)³. An observation is that using boolean values instead of word counts improved the results.

Dimensions                       100%    90%     50%     30%     10%
Basic word-doc representation    0.499   -       0.494   0.484   0.426
Evaluation - boolean counts      0.548   -       0.533   0.511   0.420
Constrained tokenization         0.368   0.564   0.540   0.516   0.480
Added data                       0.461   0.602   0.568   0.517   0.522

Table 2: Pearson scores at different dimensions³ - Paragraph to Sentence

Further experiments were conducted, retaining the boolean treatment of the vectors to be projected. In a new setting, the pre-processing step was improved, creating a new word-document matrix A using constrained tokenization rules, removing unnecessary spaces and tabs, and stemming the tokens⁴. The performance of the similarity calculation is shown in the third row of Table 2: correlation scores increase with dimensionality up to a maximum of 0.564, reached at the 90% dimension.

As these Pearson scores were still not convincing, more documents were added to the dataset to build a new word-document matrix A. The added documents comprised all the paragraphs from the training set, each added as a separate document. The experiment was performed maintaining the settings of the previous experiment, and the results are shown in the fourth row of Table 2; the new U produced from the SVD of this A follows the same trend of correlation scores increasing with dimensionality. Figure 2 shows the distribution of the similarity scores evaluated at the 90% dimension of this model against the gold standard. To compare the performance of the different experiments, all the results are plotted in Figure 1: every subsequent model shows an improvement in performance. In the first two settings (the red and blue lines, corresponding to the first two rows of Table 2), the Pearson correlation scores increase as the number of retained dimensions increases, whereas in the other two settings the scores reach their maximum at 90% and come down at the 100% dimension, which is unexpected and for which we have no explanation.

[Figure 1: Paragraph to Sentence - Pearson correlation scores for the four experiments (basic word-document representation, boolean values, constrained tokenization, added data) against the percentage of dimensions of U_k maintained³; y-axis: Pearson correlation, 0.35-0.7; x-axis: 0-100%.]

[Figure 2: Semantic similarity scores - gold standard (line plot) vs. system scores (scatter plot) for the 500 examples in the training data; y-axis: similarity score, 0-4.]

It can be observed from Figure 2 that the system's scores in the scatter plot are not always clustered around the gold-standard scores, plotted as a line: as the gold-standard score goes up, the system's prediction accuracy comes down. One reason for this pattern can be attributed to the corpus used to build the representation, which consisted mostly of newswire text and web-text; during evaluation, not all the words of a paragraph and/or sentence would therefore have received a position when projected onto the latent semantic space, which we believe pulled the accuracy down.

³ Here, the dimension X% means k = (X/100)·N, where N is the total number of columns in A in the unreduced SVD.
⁴ Stemmed using the Porter Stemmer module available from http://tartarus.org/martin/porterstemmer/

4.1.2 Sentence to Phrase

The experiments described for P2S in Section 4.1.1 were conducted for the S2P examples as well. The Pearson scores produced by the different experiments at different dimensions are given in Table 3. The table shows that the latest word-document representation, built with the added documents, did not improve the correlation scores, while the earlier representation of the third row, which used the original dataset preprocessed with constrained tokenization rules, removal of unnecessary spaces and tabs, and token stemming, gave a better correlation score at the 70% dimension. The different experiments at the different settings are compared in Figure 3.

Dimensions                       100%    90%     70%     50%     30%     10%
Basic word-doc representation    0.493   -       -       0.435   0.423   0.366
Evaluation - boolean counts      0.472   -       -       0.449   0.430   0.363
Constrained tokenization         0.498   0.494   0.517   0.485   0.470   0.434
Added data                       0.493   0.504   0.498   0.498   0.488   0.460

Table 3: Pearson scores at different dimensions³ - Sentence to Phrase

[Figure 3: Sentence to Phrase - Pearson correlation scores for the four experiments (basic word-document representation, boolean values, constrained tokenization, added data) against the percentage of dimensions of U_k maintained³; y-axis: Pearson correlation, 0.35-0.55.]
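To make footnote 3's dimension scheme concrete, a sweep in the style of Tables 2 and 3 might look as follows; `U` and `vectorizer` come from the earlier LSA sketch, and `pairs`/`gold` again stand in for the training pairs and their gold ratings.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_at(pct, pairs, gold):
    # Keep k = (pct/100)*N columns of U (footnote 3), score all pairs
    # with boolean vectors, and correlate against the gold ratings.
    k = int(pct / 100 * U.shape[1])
    Uk = U[:, :k]
    def score(a, b):
        va = Uk.T @ (vectorizer.transform([a]).toarray().ravel() > 0)
        vb = Uk.T @ (vectorizer.transform([b]).toarray().ravel() > 0)
        return 4.0 * (va @ vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    r, _ = pearsonr([score(a, b) for a, b in pairs], gold)
    return r

for pct in (10, 30, 50, 90, 100):
    print(f"{pct:3d}%  Pearson {pearson_at(pct, pairs, gold):.3f}")
```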

4.2 Bag of Vectors

The BV approach was tested in two different settings. The first representation was created with bigram co-occurrence counts, as described in Section 3.2.1, and experiments were carried out as described in Section 3.2.2; this produced negative Pearson correlation scores for both P2S and S2P. We then created another representation by taking co-occurrence counts within a window of 6 words in a sentence, which on evaluation produced correlation scores of 0.094 for P2S and 0.145 for S2P. As BV showed such poor results, we did not continue using the method on the test data. We believe, however, that the BV approach could produce better results if we compared the sentence to the paragraph rather than the paragraph to the sentence as described in Section 3.2.2: when comparing the sentence to the paragraph, we look for the best semantic match in the paragraph for each word in the sentence, which would increase the mean value by reducing the divisor, the number of words in the sentence. In the current setting, we believe that while computing the similarity from paragraph to sentence, the words of the paragraph (the longer text) match the same few words of the sentence multiple times, which cannot be right when comparing texts of varying lengths.

5 Conclusion and Discussion

On manual verification, it was identified that the dataset used to build the representation did not have documents related to the genres Metaphoric, CQA and Travel; the original dataset mostly had documents from newswire text and blogs, which included reviews as well. Further, as can be seen from Tables 2 and 3, the word-document representation with added documents from the training set improved the Pearson scores. This suggests the dataset did not contain a fully relevant set of documents for evaluating the training set, which included data from different genres. For the evaluation of the model on the test data, we submitted two runs, the better of which reported Pearson scores of 0.607 on P2S and 0.552 on S2P. In future work, we intend to experiment with more relevant data to build the model using LSI, and also to use a statistically strong unsupervised classifier, PLSI (Hofmann, 2001), for the same task. Further, as discussed in Section 4.2, we intend to experiment with the BV approach by comparing the sentence to the paragraph, which we believe will yield promising results for comparing texts of varying lengths.

References

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-401.

Thomas Hofmann. 2001. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177-196.