Vector Space Models (VSM) and Information Retrieval (IR)

T-61.5020 Statistical Natural Language Processing, 24 Feb 2016
Mari-Sanna Paukkeri, D.Sc. (Tech.)

Lecture 3: Agenda
Vector space models: word-document matrices; stemming, weighting, dimensionality reduction; similarity measures.
Information retrieval: queries; evaluation.

From rule-based to statistical approach
Earlier: sentence-level processing, part-of-speech tagging and parsing (traditional NLP).
What if the language is not English? Part-of-speech (PoS) taggers may not be available, or they may perform poorly.
What if the language doesn't follow spelling rules? Twitter, Facebook, e-mails, SMS messages, blog posts...
What if there are multiple languages present?

Vector space models (VSM)
The use of a high-dimensional space of documents (or words). Closeness in the vector space resembles closeness in the semantics or structure of the documents (depending on the features extracted). Makes the use of data mining possible.
Applications: document clustering/classification, finding similar documents, finding similar words, word disambiguation, information retrieval, term discrimination.

Vector space models (VSM)
Steps to build a vector space model:
1. Preprocessing
2. Defining a word-document or word-word matrix: choosing features
3. Dimensionality reduction: choosing features, removing noise, easing computation
4. Weighting and normalization: emphasizing the features
5. Similarity / distance measures: comparing the vectors
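
As a concrete illustration of steps 1 and 2, the sketch below builds a small document-word count matrix. The use of scikit-learn and the toy corpus are illustrative assumptions; the slides do not prescribe any toolkit.

```python
# Illustrative sketch (not from the slides): build a document-word
# count matrix with scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Oranges and other citrus fruits grow on trees.",
    "Citrus trees need warm weather.",
]

# Lowercasing and tokenization serve as simple preprocessing (step 1);
# fit_transform then produces the document-word matrix (step 2).
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(docs)            # shape: (n_docs, n_words)
print(vectorizer.get_feature_names_out())
print(X.toarray())
```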

VSM: (1) Preprocessing

What kinds of punctuation usage differences are there between languages?

VSM: (2) Word-word matrix
[Table: example word-word co-occurrence counts from the Europarl corpus (Koehn, 2005).]

VSM: (2) Word-word matrix
Choosing features:
First-order similarity: collected for a target word ("fruits") by counting the frequencies of its context words.
Second-order similarity: words that co-occur with the same target words, e.g. "trees", which co-occurs with both "oranges" and "citrus" -> second-order similarity between "fruits" and "trees".

VSM: (2) Word-document matrix
A document may be: a text document, an e-mail message, a tweet, a paragraph of a text, a sentence of a text, a phrase.

VSM: (2) Word-document matrix
Sliding window: n words before and after the target.
Bag-of-words: word order not taken into account.
Word order: word order taken into account, e.g. left and right context.
N-grams: unigrams, bigrams, trigrams, n-grams.
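
The following is a small, hypothetical sketch of sliding-window context counting (bag-of-words, window size 2); the corpus and the window size are illustrative assumptions.

```python
# Illustrative sketch: count context words in a sliding window of
# +/- 2 words around each target word; word order is ignored.
from collections import Counter, defaultdict

tokens = "oranges and other citrus fruits grow on citrus trees".split()
window = 2
cooc = defaultdict(Counter)   # target word -> context word counts

for i, target in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[target][tokens[j]] += 1

print(cooc["citrus"])   # first-order context counts for "citrus"
```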

VSM: (3) Dimensionality reduction
Choosing features, removing noise, easing computation.
Feature selection: choose the best features (= representation words) for your task and remove the rest. Can be mapped back to the original features.
Feature extraction (reparametrization): calculate a new, usually lower-dimensional feature space. New features are (complex) combinations of the old features. Mapping back to the original features (representation words) might be difficult.

VSM: (3) Dimensionality reduction
Feature selection:
excluding very frequent and/or very rare words
excluding stop words ('of', 'the', 'and', 'or', ...): words which are filtered out prior to the processing of natural language texts, in particular before storing the documents in the inverted index. A stop word list typically contains words such as a, about, above, across, after, afterwards, again, etc. The list reduces the size of the index but can also prevent querying some special phrases, such as "it magazine", "The Who", or "Take That".
removing punctuation and non-alphabetic characters
forward feature selection algorithms
keyphrase extraction
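
A possible sketch of frequency-based feature selection follows; the thresholds and the built-in English stop word list are illustrative assumptions, not prescribed by the slides.

```python
# Illustrative sketch: drop stop words plus very frequent and very
# rare words using scikit-learn's document-frequency thresholds.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]

vectorizer = CountVectorizer(
    stop_words="english",  # built-in English stop word list
    max_df=0.9,            # drop words in more than 90% of documents
    min_df=2,              # drop words in fewer than 2 documents
)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # -> ['sat']
```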

VSM: (3) Dimensionality reduction
Feature extraction (reparametrization):
Stemming and lemmatizing
Singular value decomposition (SVD), latent semantic indexing/analysis (LSI/LSA)
Principal component analysis (PCA)
Independent component analysis (ICA)
Random projection

VSM: (3) Dimensionality reduction -> Feature extraction -> Stemming and lemmatizing
Lemmatizing: finding the base form of an inflected word (requires a dictionary): laughs -> laugh, matrices -> matrix, Helsingille -> Helsinki, saunoihin -> sauna.
Stemming is an approximation of morphological analysis (a set of rules is enough). The stem of each word is used instead of the inflected form. Examples:

Stem     Word forms
laugh-   laughing, laugh, laughs, laughed
galler-  gallery, galleries
yö-      yöllinen, yötön, yöllä
öi-      öisin, öinen
saun-    saunan, saunaan, saunoihin, saunasta, saunoistamme

Stemming is a simplifying solution and does not suit languages like Finnish well in all NLP applications: for one basic word form there may be several search stems (e.g. "yö-" and "öi-" in the table both refer to the same base form "yö" (night)).
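
A brief sketch with NLTK's stemmers; the choice of NLTK and of the Porter and Snowball algorithms is an assumption for illustration.

```python
# Illustrative sketch: rule-based stemming with NLTK.
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

ps = PorterStemmer()
print([ps.stem(w) for w in ["laughing", "laugh", "laughs", "laughed"]])
# -> ['laugh', 'laugh', 'laugh', 'laugh']

# Snowball also provides a Finnish stemmer, though, as noted above,
# stemming suits Finnish less well than English.
fi = SnowballStemmer("finnish")
print([fi.stem(w) for w in ["saunan", "saunaan", "saunoihin"]])
```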

VSM: (3) Dimensionality reduction -> Feature extraction -> Singular value decomposition (SVD)
Latent semantic indexing (LSI) finds a low-rank approximation to the original term-document matrix using singular value decomposition (SVD).
W is a document-word matrix whose elements contain the value of a function of the frequency of a word in a document, e.g. the normalized entropy of the words over the whole corpus; often tf-idf weighting is used.
A singular value decomposition of rank R is calculated: SVD(W): W = U S V^T, in which S is a diagonal matrix containing the singular values on its diagonal, and U and V are used to project the words and documents into the latent space (T: matrix transpose). SVD calculates an optimal R-dimensional approximation of W. A typical value of R ranges between 100 and 200.
http://xieyan87.com/2015/06/stochastic-gradient-descent-sgd-singular-value-decomposition-svd-algorithms-notes/
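
A minimal LSI sketch, assuming NumPy and scikit-learn; R = 2 is used only because the toy data is tiny, versus the 100-200 suggested above.

```python
# Illustrative sketch: rank-R truncated SVD of a tf-idf weighted
# document-word matrix W, so that W ~= U_R S_R V_R^T.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["citrus fruits grow on trees",
        "oranges are citrus fruits",
        "dogs chase cats"]
W = TfidfVectorizer().fit_transform(docs).toarray()

U, S, Vt = np.linalg.svd(W, full_matrices=False)
R = 2
docs_latent = U[:, :R] * S[:R]   # documents projected into the latent space
words_latent = Vt[:R, :].T       # words projected into the latent space
print(docs_latent)
```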

VSM: (3) Dimensionality reduction -> Feature extraction -> Random projection
In random projection, a random matrix is used to project data vectors into a lower-dimensional space.
n_i: original document vector for document i.
R: random matrix whose columns are normally distributed unit vectors. Its dimensionality is rdim x ddim, in which ddim is the original dimension and rdim the new one, rdim << ddim.
x_i: new, randomly projected document vector for document i, with vector dimension rdim.
The projected document vectors are obtained as follows: x_i = R n_i.
In this kind of dimensionality reduction, it is essential that the column vectors of the projection matrix are as orthogonal as possible (i.e. the correlations between them are small). In the case of R, the vectors are not fully orthogonal, but if rdim is sufficiently large and the vectors are taken randomly from a uniform distribution on a hyperball, the correlation between any two vectors tends to be small. The values used for rdim typically range between 100 and 1000.
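
A sketch of the projection x_i = R n_i with NumPy; the dimensions and data are illustrative assumptions.

```python
# Illustrative sketch: random projection from ddim to rdim dimensions.
import numpy as np

rng = np.random.default_rng(0)
ddim, rdim, n_docs = 5000, 200, 10

# Random matrix whose columns are normally distributed unit vectors.
R = rng.standard_normal((rdim, ddim))
R /= np.linalg.norm(R, axis=0, keepdims=True)

N = rng.random((ddim, n_docs))   # original document vectors as columns
X = R @ N                        # x_i = R n_i; X has shape (rdim, n_docs)
print(X.shape)
```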

VSM: (4) Weighting and normalization

VSM: (4) Weighting and normalization
Normalization:
L1 norm: divide every element of a vector by the vector's Manhattan (city-block) length.
L2 norm: divide every element of a vector by the Euclidean length of the vector. Not required when using the cosine distance.
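
A small sketch of both norms, and of why L2 normalization is redundant for the cosine measure; NumPy is an assumption.

```python
# Illustrative sketch: L1 and L2 normalization and cosine similarity.
import numpy as np

v = np.array([3.0, 0.0, 4.0])
v_l1 = v / np.abs(v).sum()       # elements divided by the Manhattan length
v_l2 = v / np.linalg.norm(v)     # elements divided by the Euclidean length

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

w = np.array([1.0, 2.0, 2.0])
# Cosine similarity is invariant to vector length, so L2-normalizing v
# first does not change the result.
print(np.isclose(cosine(v, w), cosine(v_l2, w)))   # True
```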

VSM: (5) Similarity / distance measures

VSM: (5) Similarity / distance measures: applications
[Figure: document classification accuracy when reducing dimensionality to 2-1000 dimensions (x axis).]

What's the use of a vector space? Document clustering / classification, finding similar documents, finding similar words, word disambiguation, information retrieval, term discrimination.

Information retrieval (IR)
A traditional research area, currently part of NLP research: information retrieval from a large document collection.
1. Produce an indexed version (e.g. a vector space) of the collection.
2. The user provides a query term/phrase/document.
3. The query is compared to the index and the best-matching results are returned.
Example: the Google search engine.

IR
Traditionally: exact match retrieval. No NLP processing of the query or the index. Often Boolean queries (AND, OR, NOT) can be used, e.g. Q = (mouse OR mice) AND (dog OR dogs OR puppy OR puppies) AND NOT (cat OR cats). Works well for small document sets and if the user is experienced with IR.
Problems, especially with large and heterogeneous collections:
Order: the results are not ordered by any meaningful criterion.
Size: the result may be an empty set, or there may be a very large set of results.
Relevance: it is difficult to formulate a query such that one receives the relevant documents but as few non-relevant ones as possible. One cannot know what kinds of relevant documents exist that do not quite match the search criteria.

IR: Indexing & VSM
The documents in the document collection are processed in a similar way as in vector space modelling:
Preprocessing: removing punctuation, removing capitalization, stemming / lemmatizing.
Defining the word-document matrix.
Weighting and normalizing, e.g. tf-idf.
The queries are then mapped to the same vector space, and relevance is assessed in terms of (partial) similarity between query and document. The vector space model is one of the most used models for ad-hoc retrieval.
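
Putting the pieces together, a minimal retrieval sketch follows; scikit-learn and the toy collection and query are illustrative assumptions.

```python
# Illustrative sketch: index documents with tf-idf, map the query into
# the same vector space, and rank documents by cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["a dog and a puppy", "mice in the house", "cats and dogs"]
vectorizer = TfidfVectorizer()
index = vectorizer.fit_transform(docs)        # indexed collection

query = vectorizer.transform(["dog puppy"])   # query in the same space
scores = cosine_similarity(query, index).ravel()
for i in scores.argsort()[::-1]:              # best-matching first
    print(scores[i], docs[i])
```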

IR: Ranking of query results
Most IR systems compute a numeric score for how well each object in the database matches the query: distance in the vector space; the content and structure of the document collection can be used; the number of hits in a document; the number of hits in the title, first paragraph, or elsewhere; other meta-information in the documents or external knowledge. The retrieved objects are ranked according to this numeric score, and the top-ranking objects are shown to the user. For instance, Google's PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents; its purpose is to measure each element's relative importance within the set. https://fi.wikipedia.org/wiki/PageRank

IR: More terminology
Index term: a term (character string) that is part of an index. Index terms are typically full words but can also be, for instance, numerical codes or word segments such as stems.
(Inverted) index: the purpose of using an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. [http://en.wikipedia.org/wiki/Index_(search_engine)]
Relevance: how well the retrieved document(s) meet(s) the information need of the user.
Relevance feedback: taking the results that are initially returned for a given query and using information about whether or not those results are relevant to perform a new query. The feedback can be explicit or implicit. [http://en.wikipedia.org/wiki/Relevance_feedback]
Information extraction: a type of information retrieval whose goal is to automatically extract structured information, i.e. categorized and contextually and semantically well-defined data from a certain domain, from unstructured machine-readable documents. [http://en.wikipedia.org/wiki/Information_extraction]

IR: Various data types
In addition to text documents (in any language), there are also other types of data to be retrieved, such as pictures (image retrieval), videos (video/multimedia retrieval), audio (speech retrieval, music retrieval), data/document classifications, tags, categories (e.g. hashtags), and graphs, as well as cross-language information retrieval. How can these types of documents be retrieved using the information retrieval methods seen above?

IR: Evaluation
N = number of documents retrieved (true positives + false positives)
REL = number of relevant documents in the whole collection (true positives + false negatives)
rel = number of relevant documents in the retrieved set of documents (true positives)
http://en.wikipedia.org/wiki/Precision_and_recall
Precision P: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search: P = rel/N = true positives / (true positives + false positives).
Recall R: the number of relevant documents retrieved by a search divided by the total number of relevant documents: R = rel/REL = true positives / (true positives + false negatives).
An inverse relationship typically exists between P and R: it is not possible to increase one without reducing the other. One can usually increase R by retrieving a larger number of documents, which also increases the number of irrelevant documents and thus decreases P.

IR: Evaluation
F-measure: precision and recall scores can be combined into a single measure, such as the F-measure, which is the weighted harmonic mean of P and R: F_beta = (1 + beta^2) P R / (beta^2 P + R); with beta = 1 this gives F1 = 2PR / (P + R).
Accuracy: (true positives + true negatives) / (true positives + true negatives + false positives + false negatives). Not a good measure if the number of relevant documents is small, which is usually the case in IR.
Method comparison: different IR methods are usually compared using the precision (P) and recall (R) measures or the F-measure over a number of queries (e.g. 50), and the obtained averages are studied. A statistical test (e.g. Student's t-test) can be used to ensure the statistical significance of the observed differences.
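
A short sketch of these measures; the counts are made-up example values.

```python
# Illustrative sketch: precision, recall and the weighted F-measure.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of P and R; beta = 1 gives the F1 score.
    return (1 + beta**2) * p * r / (beta**2 * p + r)

p = precision(tp=8, fp=2)      # 0.8
r = recall(tp=8, fn=4)         # 0.667
print(p, r, f_measure(p, r))   # F1 ~ 0.727
```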

References
Manning, C. D. & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. MT Summit.
Paukkeri, M.-S. (2012). Language- and domain-independent text mining. Doctoral dissertation, Aalto University.
Figures and tables: (Paukkeri, 2012).

For the exam, from this lecture: these lecture slides; the exercises; Chapter 15 (Topics in Information Retrieval) from Manning & Schütze: Foundations of Statistical Natural Language Processing.