Short Text Similarity with Word Embeddings

Transcription:

Short Text Similarity with Word Embeddings
CS 6501 Advanced Topics in Information Retrieval @UVa
Tom Kenter and Maarten de Rijke, University of Amsterdam, Amsterdam, The Netherlands
Presented by Jibang Wu, Apr 19th, 2017

Outline
1 Introduction: Why Short Text Similarity? How Traditional Approaches Fail?
2 Methodology: From Word-level to Text-level Semantics; Saliency-weighted Semantic Similarity; Learning Algorithm
3 Summary: Experiment; Analysis; Conclusion

Introduction / Why Short Text Similarity?
Example:
  - The procedure is generally performed in the second or third trimester.
  - The technique is used during the second and, occasionally, third trimester of pregnancy.
Word-level similarity is not enough: tasks such as query-query similarity and query-image-caption similarity operate on whole (short) texts.
We cannot easily go from word-level to text-level similarity: text structure should be taken into account.

Introduction / How Traditional Approaches Fail?
Lexical Matching: longest common substring, edit distance, lexical overlap.
  1. United States vs. United Kingdom
  2. United States vs. USA
  FAILED: the second pair should be the better match (a lexical-overlap sketch follows this slide).
Linguistic Analysis: parse trees following grammatical features.
  Not all texts are necessarily parseable (e.g., tweets), and high-quality parses are usually expensive to compute at run time.
Structured Semantic Knowledge: WordNet, Wikipedia.
  Not available for all languages, nor for domain-specific terms.
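
To make the lexical-matching failure concrete, here is a minimal sketch using Python's difflib as a stand-in for the overlap measures named on the slide (the specific metric is an assumption; the slides do not prescribe one):

```python
from difflib import SequenceMatcher

def lexical_overlap(a: str, b: str) -> float:
    """Character-level overlap ratio in [0, 1], standing in for edit-distance-style matching."""
    return SequenceMatcher(None, a, b).ratio()

print(lexical_overlap("United States", "United Kingdom"))  # relatively high: long shared prefix
print(lexical_overlap("United States", "USA"))             # low, even though this is the better semantic match
```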

Introduction / How do we represent the meaning of a word?
Naive approach: one-hot representation, stored in a vector the size of the vocabulary.
Example:
  hotel = [0 0 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0]
  motel = [0 0 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0]
Dimensionality: 20K (speech), 500K (dictionary), 13M (Google 1T).
Problems: wasteful in memory, and hard to capture semantic similarity (any two distinct one-hot vectors are orthogonal).

Introduction / How do we represent the meaning of a word?
Word embeddings: distributional-similarity-based representations. Build a dense vector for each word type, chosen so that it is good at predicting other words appearing in its context.
Example:
  hotel = [0.286 0.792 -0.177 -0.107 0.109 -0.542 0.349 0.271]
  motel = [0.280 0.772 -0.171 -0.107 0.109 -0.542 0.349 0.271]
Dimensionality: 300-500 (Word2vec), 300 (GloVe). Trained with a neural network on extensive unlabeled context. [more details]
Advantages: efficient in memory and computation; semantic similarity is easy to measure (e.g., as cosine similarity between vectors).
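
As a minimal illustration, the toy 8-dimensional vectors from the slide already show the point: related words end up close under cosine similarity (real embeddings would be 300-500 dimensional):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors copied from the slide; real word2vec/GloVe vectors are much larger.
hotel = np.array([0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271])
motel = np.array([0.280, 0.772, -0.171, -0.107, 0.109, -0.542, 0.349, 0.271])

print(cosine(hotel, motel))  # close to 1.0: the two words sit near each other in embedding space
```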

Introduction / Intuitions (figure-only slide; the visualization is not included in this transcription).

Methodology / From Word-level to Text-level Semantics
Semantic space: the words of $S_1$ and of $S_2$ are points in the embedding space. Sum the word vectors to obtain a text-level representation:
  $\vec{S_1} = \sum_{w \in S_1} \vec{w}$,  $\vec{S_2} = \sum_{w \in S_2} \vec{w}$
Does this averaged sum give sentence similarity?
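
A minimal sketch of this baseline, assuming `emb` is a word-to-vector lookup (e.g. loaded word2vec or GloVe vectors); averaging rather than summing is used here so that text length does not change the vector's scale:

```python
import numpy as np

def text_vector(tokens, emb):
    """Mean of the embedding vectors of the in-vocabulary tokens."""
    vecs = [emb[w] for w in tokens if w in emb]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# emb: dict-like word -> np.ndarray lookup (an assumption; any pretrained embedding table works).
# similarity = cosine(text_vector(s1_tokens, emb), text_vector(s2_tokens, emb))
```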

Methodology / From Word-level to Text-level Semantics
Unweighted Semantic Similarity
  1. For each pair of terms (w_1, w_2) with w_1 in S_1 and w_2 in S_2, compute the cosine similarity.
  2. Build a fully connected, unweighted bipartite graph over the two term sets.
  3. Take a maximum bipartite matching.
  4. Separate the matched word pairs into bins by similarity level.
Shortcomings: not all terms are equally important, and a longer text has a higher chance of producing a high-scoring match. (A sketch of the procedure follows this slide.)
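
A minimal sketch of steps 1-4, assuming matrices `V1` and `V2` whose rows are the word vectors of the two texts; SciPy's assignment solver stands in for the maximum bipartite matching (the slide does not prescribe a particular matching algorithm):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_cosine(V1, V2):
    """Step 1: cosine similarity between every term of S1 (rows of V1) and S2 (rows of V2)."""
    V1n = V1 / np.linalg.norm(V1, axis=1, keepdims=True)
    V2n = V2 / np.linalg.norm(V2, axis=1, keepdims=True)
    return V1n @ V2n.T

def matched_pair_scores(V1, V2):
    """Steps 2-3: maximum-weight matching over the (implicitly fully connected) bipartite graph."""
    sims = pairwise_cosine(V1, V2)
    rows, cols = linear_sum_assignment(-sims)   # negate to maximize total similarity
    return sims[rows, cols]

def bin_counts(scores, edges=(0.45, 0.8)):
    """Step 4: bin the matched pairs by similarity level (thresholds taken from the experiment slide)."""
    return np.histogram(scores, bins=[-1.0, *edges, 1.0])[0]
```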

Methodology / Saliency-weighted Semantic Similarity
From BM25:
  $r(q, d) = \sum_{w \in q \cap d} IDF(w) \cdot \frac{c(w, d) \cdot (k_1 + 1)}{c(w, d) + k_1 \left(1 - b + b \frac{n}{n_{avg}}\right)}$
where $c(w, d)$ counts literal matches of the word $w$ in document $d$, $n$ is the document length, and $n_{avg}$ the average document length.
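
For reference, a short sketch of this classic BM25 scorer (IDF values and the average document length are assumed to be precomputed elsewhere):

```python
from collections import Counter

def bm25(query_terms, doc_terms, idf, avg_doc_len, k1=1.2, b=0.75):
    """Classic BM25: literal term matches, IDF-weighted and length-normalized."""
    counts = Counter(doc_terms)
    norm = k1 * (1 - b + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for w in set(query_terms):
        c = counts.get(w, 0)
        if c > 0:
            score += idf.get(w, 0.0) * (c * (k1 + 1)) / (c + norm)
    return score
```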

Methodology / Saliency-weighted Semantic Similarity
Adapted from BM25:
  $f_{sts}(s_l, s_s) = \sum_{w \in s_l} IDF(w) \cdot \frac{sem(w, s_s) \cdot (k_1 + 1)}{sem(w, s_s) + k_1 \left(1 - b + b \frac{|s_s|}{avg_{sl}}\right)}$
  $sem(w, s_s) = \max_{w' \in s_s} f_{sem}(w, w')$
where $f_{sem}(w, w')$ returns a semantic match score computed from the word embeddings. Common words have a smaller IDF(w) than rare words. Summands falling in different score ranges are grouped into separate bins.
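
A minimal sketch of $f_{sts}$, assuming `emb` (word-to-vector) and `idf` (word-to-IDF) lookups; `avg_sl` is the average length of the shorter texts in the collection:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sem(w, s_s, emb):
    """sem(w, s_s): best embedding match of w against any word of the shorter text."""
    return max((cosine(emb[w], emb[w2]) for w2 in s_s if w2 in emb), default=0.0)

def f_sts(s_l, s_s, emb, idf, avg_sl, k1=1.2, b=0.75):
    """Saliency-weighted semantic similarity: BM25 shape, with sem() replacing literal term counts."""
    norm = k1 * (1 - b + b * len(s_s) / avg_sl)
    score = 0.0
    for w in s_l:
        if w not in emb:
            continue
        s = sem(w, s_s, emb)
        score += idf.get(w, 0.0) * (s * (k1 + 1)) / (s + norm)
    return score
```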

Methodology / Learning Algorithm
Models: pre-trained, out-of-the-box word embeddings
  - Word2vec, 300-dimensional, by Mikolov et al.
  - Word2vec, 400-dimensional, by Baroni et al.
  - GloVe, 300-dimensional, trained on an 840-billion-token corpus
  - GloVe, 300-dimensional, trained on a 42-billion-token corpus
Auxiliary word embeddings trained on INEX (1.2 billion tokens), based on either Word2vec or GloVe.
A learning algorithm optimizes the parameter settings for predicting short text similarity.

Methodology / Learning Algorithm
Binary classifier trained with supervised learning (figure-only slide; the diagram is not included in this transcription).
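
The slide does not name the learner, so the following is only a hedged sketch: a support vector classifier over per-pair feature vectors (e.g. the bin counts from the unweighted and saliency-weighted features), with synthetic stand-in data so the snippet runs on its own:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: one row of bin/feature values per sentence pair, with a binary paraphrase label.
rng = np.random.default_rng(0)
X = rng.random((500, 12))
y = (X[:, :4].sum(axis=1) > 2.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)   # choice of SVM is an assumption, not from the slide
print(clf.score(X_test, y_test))
```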

Summary / Experiment
Experiment Setup
Dataset: Microsoft Research Paraphrase (MSR) Corpus; 5801 sentence pairs annotated with binary labels, divided into a training set of 4076 pairs and a test set of 1725 pairs.
Out-of-vocabulary words: ignored in training, mapped to random vectors at run time.
Parameter settings for f_sts: k_1 = 1.2, b = 0.75; IDF calculated from the INEX data.
Three bin thresholds (similarity levels Highly / Medium / Unlikely):
  Saliency-weighted semantic network: 0-0.15 / 0.15-0.4 / 0.4+
  Unweighted semantic network: 0-0.45 / 0.45-0.8 / 0.8+
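
A small sketch of the out-of-vocabulary handling described above (the dimensionality and the caching scheme are assumptions; the slide only says unseen words are mapped randomly at run time):

```python
import numpy as np

_oov_cache = {}
_rng = np.random.default_rng(0)

def lookup(word, emb, dim=300):
    """Return the embedding of `word`, or a (cached) random vector if it is out of vocabulary."""
    if word in emb:
        return emb[word]
    if word not in _oov_cache:
        _oov_cache[word] = _rng.normal(size=dim)   # random mapping at run time, per the slide
    return _oov_cache[word]
```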

Summary / Experiment
Experiment Results
Abbreviations: OoB = out-of-the-box vectors, aux = auxiliary vectors, w2v = Word2vec, glv = GloVe, unwghtd = unweighted semantic feature, swsn = saliency-weighted semantic feature.
The best model uses all features and all word embedding models.
The method overall outperforms previous approaches.

Summary / Analysis
Performance Across Sentence Lengths
The model performs better on sentence pairs that are alike in length, and tends to predict dissimilarity when the texts differ substantially in length.

Summary / Analysis
Performance Across Levels of Lexical Overlap
At low levels of lexical overlap, the algorithm shows the benefit of semantic matching over lexical matching.

Summary / Conclusion
Advantages:
  - Builds on word embeddings learned in an unsupervised way.
  - Substitutes for methods that rely on external structured semantic knowledge.
  - Has crucial applications in search and query suggestion.
Limitations:
  - The order of words is not taken into account.
  - Context awareness is important in real applications.

Appendix / Citations
Kenter, Tom, and Maarten de Rijke. "Short Text Similarity with Word Embeddings." Proceedings of the 24th ACM International Conference on Information and Knowledge Management. ACM, 2015.
Mikolov, Tomas, et al. "Distributed Representations of Words and Phrases and Their Compositionality." Advances in Neural Information Processing Systems. 2013.
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP. Vol. 14. 2014.

More (appendix)

More / Mainstream Algorithms
Word2vec: predict the surrounding words in a window of radius m around every word.
  - Continuous bag-of-words (CBOW): predicts the word given its context; several times faster to train than skip-gram, with slightly better accuracy for frequent words.
  - Skip-gram: predicts the context given a word; works well with small amounts of training data and represents even rare words or phrases well.
Global Vectors for Word Representation (GloVe): combines the advantages of global matrix factorization and local context-window methods.
(A small training sketch follows this slide.)
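
As a minimal sketch of the CBOW/skip-gram distinction, using gensim (the toy corpus and hyperparameters are illustrative, not from the slides):

```python
from gensim.models import Word2Vec

corpus = [["short", "text", "similarity", "with", "word", "embeddings"],
          ["word", "embeddings", "capture", "semantic", "similarity"]]

# sg=0 selects CBOW (predict the word from its context); sg=1 selects skip-gram (predict the context).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv.most_similar("similarity", topn=3))
```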

More / Window-based co-occurrence matrix
(Slide adapted from Stanford CS224n; the example matrix itself is not included in this transcription.)
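
A short sketch of how such a matrix is built; the three-sentence toy corpus is the one commonly used in the CS224n example and is an assumption here:

```python
import numpy as np

def cooccurrence_matrix(sentences, window=1):
    """Symmetric window-based co-occurrence counts over a tokenized corpus."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)), dtype=int)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    M[index[w], index[s[j]]] += 1
    return vocab, M

corpus = [["i", "like", "deep", "learning"], ["i", "like", "nlp"], ["i", "enjoy", "flying"]]
vocab, M = cooccurrence_matrix(corpus)
print(vocab)
print(M)
```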

More / Feature Highlights
Word embeddings capture relational regularities, e.g. W("woman") - W("man") ≈ W("aunt") - W("uncle").
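
With a pretrained model the same regularity can be queried via vector arithmetic; the model file name below is an assumption (Mikolov's publicly released GoogleNews vectors), not something given on the slide:

```python
from gensim.models import KeyedVectors

# Assumed pretrained model; any word2vec-format embedding file would do.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# W("uncle") - W("man") + W("woman") should land near W("aunt").
print(wv.most_similar(positive=["uncle", "woman"], negative=["man"], topn=1))
```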