A review of word embedding and document similarity algorithms applied to academic text

Computer Science Bachelor's Thesis. Author: Jon Ezeiza Alvarez. Supervisor: Prof. Dr. Hannah Bast.

Motivation. A consequence of two projects: the IXA group practicum and SCITODATE. A realization: there is no human endeavour as well documented as science. With faster progress and an increasing publication rate, it is getting hard for humans to keep a global grasp of science. A long-term goal: an AI toolbox for automatic understanding of large amounts of academic literature.

Scope. A small first step: a literature review of the state of the art in word embeddings and semantic textual similarity, and an empirical review of the algorithms on academic literature.

What are word embeddings? Dense algebraic representations of semantic content, trained on large corpora or knowledge graphs. Why? They offer an alternative to knowledge graphs, and an input representation for machine learning.

What are word embeddings? Words are placed in a high-dimensional vector space such that their distances reflect similarity or relatedness. Side effect: analogies and real-world knowledge are captured.
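The two properties above can be sketched with toy vectors. This is a minimal illustration, assuming made-up 3-dimensional embeddings; real models use hundreds of dimensions trained on large corpora.

```python
import math

# Hypothetical 3-d embeddings for illustration only.
emb = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.2, 0.1],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, 0.3, 0.0],
    "apple": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity: the standard relatedness measure in embedding spaces."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Distance in the space reflects relatedness...
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"])

# ...and the analogy side effect: king - man + woman lands near queen.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)  # -> queen
```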

Semantic Textual Similarity (STS). Task: approximate the similarity between pairs of texts: phrases, sentences, paragraphs, documents. Two families of approaches: document embeddings, and compositionality of word embeddings.

Training dataset. A corpus to learn from: bio-medical articles from PubMed, 3 billion tokens, with titles, abstracts and bodies kept separate. Cleaned and normalized: tokenization and stemming.

Testing datasets. Triplets: distinguish similarity from noise. The first two elements are related; the third is unrelated. Goal: sim(1, 2) > sim(1, 3). Word embeddings: UMLS synonyms. Document similarity: ORCID author linking.
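The triplet criterion reduces to a simple accuracy score: how often does the related pair beat the unrelated one? A minimal sketch, with hypothetical vectors standing in for the UMLS- and ORCID-derived items:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def triplet_accuracy(triplets, sim=cosine):
    """Fraction of (anchor, related, unrelated) triplets where
    sim(anchor, related) > sim(anchor, unrelated)."""
    hits = sum(1 for anchor, related, unrelated in triplets
               if sim(anchor, related) > sim(anchor, unrelated))
    return hits / len(triplets)

# Made-up 2-d embeddings for illustration.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # related pair clearly closer: hit
    ([0.5, 0.5], [0.4, 0.6], [0.6, 0.4]),  # exact tie: counted as a miss
]
print(triplet_accuracy(triplets))  # -> 0.5
```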

Word2Vec (Mikolov, K. Chen, et al., 2013). Major breakthrough. Key to its success: a shallow rather than deep model, which makes training on very large corpora feasible. Window scanning method, built on the distributional assumption: words that appear in similar contexts have similar meanings (Harris, 1954).
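The window scanning method can be sketched as pair generation: every word is paired with its neighbours inside a sliding window, and these pairs are what the model trains on. A minimal sketch (the example sentence is invented):

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs as produced by Word2Vec's sliding window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = "protein binds receptor".split()
print(skipgram_pairs(sentence, window=1))
# -> [('protein', 'binds'), ('binds', 'protein'),
#     ('binds', 'receptor'), ('receptor', 'binds')]
```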

GloVe (Pennington, Socher, and C. Manning, 2014). A formalization of the window scanning method, which is shown to be an implicit factorization of the global word-word co-occurrence statistics matrix. GloVe's alternative: explicit factorization of the co-occurrence matrix.
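The object GloVe factorizes is just a count matrix over the same sliding windows. A minimal sketch of building it (unweighted here; GloVe itself down-weights distant context words by 1/distance, and the example text is invented):

```python
from collections import Counter

def cooccurrence(tokens, window=2):
    """Global word-word co-occurrence counts accumulated over sliding windows."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

toks = "the cell divides and the cell grows".split()
X = cooccurrence(toks, window=1)
print(X[("the", "cell")])  # "the" precedes "cell" twice -> 2
```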

FastText (Bojanowski et al., 2016). Word2Vec with subword components: modular word embeddings built by composing character n-gram embeddings. Robust to language inconsistencies and morphological variations.
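The subword decomposition can be sketched as follows: the word is wrapped in boundary markers and split into character n-grams, and its vector is composed (summed) from the n-gram vectors, which is what makes rare and misspelled forms robust.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams with boundary markers, in the style of FastText."""
    marked = f"<{word}>"  # markers distinguish prefixes/suffixes from infixes
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
# -> ['<wh', 'whe', 'her', 'ere', 're>']
```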

WordRank (Ji et al., 2015). Optimizes a nearest-neighbour ranking objective instead of target-context pairwise distances, with the ranking loss tuned to have more resolution at the top. Reported to match the state of the art with smaller corpora, though this was not reflected in our experiments.

Results and conclusions. Word embedding accuracy by corpus size:

Algorithm                 1M     10M    100M   1B     2B
W2V CBoW - Total          0.03   0.17   0.46   0.83   0.89
W2V Skip-gram - Total     0.04   0.18   0.46   0.83   0.89
W2V CBoW - Known          0.67   0.73   0.80   0.85   0.90
W2V Skip-gram - Known     0.67   0.79   0.80   0.88   0.90
GloVe - Total             0.04   0.17   0.45   0.80   0.87
GloVe - Known             0.71   0.73   0.78   0.85   0.88
FastText - Total          0.81   0.88   0.90   0.93   -
WordRank - Total          0.02   0.21   0.45   0.78   0.89
WordRank - Known          0.69   0.75   0.77   0.84   0.90

STS baseline. It is early days for STS, so we make sure the state of the art actually beats naive methods. Baselines: VSM similarity (BoW, TF-IDF, BM25) and weighted word embedding centroids.
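One of the naive baselines above can be sketched end to end: TF-IDF weighting followed by cosine similarity in the vector space model. The three toy documents are invented; real runs would use a proper tokenizer and a large corpus.

```python
import math
from collections import Counter

docs = [
    "word embeddings for academic text".split(),
    "document similarity for academic literature".split(),
    "speech recognition using neural networks".split(),
]

# Document frequency of each term, for the IDF component.
df = Counter(term for doc in docs for term in set(doc))
N = len(docs)

def tfidf(doc):
    """Sparse TF-IDF vector as a dict: term -> tf * log(N / df)."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in docs]
# The two 'academic' documents score higher than an unrelated pair.
assert cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2])
```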

Doc2Vec (Quoc V. Le and Mikolov, 2014). An adaptation of Word2Vec: a global document vector is added to the context.

Doc2VecC (M. Chen, 2017). Realization: a simple word embedding average is a hard baseline to beat. Doc2VecC therefore optimizes word embeddings such that averaging them yields meaningful document vector representations, applying heavy corruption during training to improve generality.

Word Mover's Distance (Kusner et al., 2015). A pairwise document similarity metric: it compares two weighted sets of embeddings (weights from frequencies or VSM schemes) using the Earth Mover's Distance.
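The full metric solves an optimal-transport problem, but Kusner et al. also describe a cheap relaxation where each word simply moves all of its weight to its nearest word in the other document, a lower bound on the true WMD. A minimal sketch of that relaxation, with invented 2-d embeddings:

```python
import math

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def relaxed_wmd(doc_a, doc_b, emb):
    """Relaxed Word Mover's Distance: each word in doc_a travels to its
    nearest word in doc_b. Uniform word weights for simplicity."""
    weight = 1.0 / len(doc_a)
    return sum(weight * min(cosine_dist(emb[w], emb[v]) for v in doc_b)
               for w in doc_a)

# Hypothetical embeddings for illustration.
emb = {"obama": [0.9, 0.1], "president": [0.8, 0.2],
       "speaks": [0.1, 0.9], "talks": [0.2, 0.8], "banana": [0.5, -0.5]}
near = relaxed_wmd(["obama", "speaks"], ["president", "talks"], emb)
far = relaxed_wmd(["obama", "speaks"], ["banana", "banana"], emb)
assert near < far  # paraphrase is closer than an unrelated document
```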

Skip-thoughts (Kiros et al., 2015). Exploits sentence adjacency to train sentence embeddings, using the encoder-decoder RNN architecture that was a breakthrough in machine translation.

Sent2Vec (Pagliardini, Gupta, and Jaggi, 2017). A shallow sentence embedding model, heavily based on Word2Vec CBoW: the window is a full semantic unit (sentence, paragraph, document) instead of a few consecutive words.

Results and conclusions. Best results of each algorithm:

STS eval    Baseline      Doc2Vec       Doc2VecC     WMD     Sent2Vec
Titles      0.91 (EMB)    0.65 (1M)     0.87 (1M)    0.90    0.91 (1M)
Abstracts   0.93 (both)   0.86 (1M)     0.92 (50K)   0.92    0.87 (100K)
Bodies      0.96 (VSM)    0.97 (500K)   0.94 (10K)   -       0.83 (10K)

Summary of accomplishments. A thorough literature review of the state of the art: 10 algorithms analysed for intuition, mathematics and computational complexity. An empirical study: a computational benchmark and an evaluation.

Conclusions. Word embeddings: a very active field since Word2Vec, but most algorithms are derivatives of Word2Vec with no clear advantage in our evaluation; there are some breakthroughs, such as FastText. Semantic textual similarity: active but in its early days; most models barely match naive baselines, yet the amount of innovation and exploration may lead to a breakthrough in a few years.

Future work. The main barrier is the lack of official datasets in the scientific domain: human-scored similarity pairs, stronger article linkage, and a training set for document similarity are all needed. SCITODATE R&D roadmap: NER for linking to BioPortal, vocabulary mining, fact and relationship mining, and named entity prediction.