ENLP Lecture 21b Word & Document Representations; Distributional Similarity


ENLP Lecture 21b: Word & Document Representations; Distributional Similarity
Nathan Schneider (some slides by Marine Carpuat, Sharon Goldwater, Dan Jurafsky)
28 November 2016

Topics
- Similarity
- Thesauri & their limitations
- Distributional hypothesis
- Clustering (Brown clusters, LDA)
- Vector representations (count-based, dimensionality reduction, embeddings)

Word & Document Similarity

Question Answering
Q: What is a good way to remove wine stains?
A1: Salt is a great way to eliminate wine stains
A2: How to get rid of wine stains
A3: How to get red wine out of clothes
A4: Oxalic acid is infallible in removing iron-rust and ink stains.

Document Similarity
Given a movie script, recommend similar movies.

Word Similarity

Intuition of Semantic Similarity
Semantically close pairs: bank/money, apple/fruit, tree/forest, bank/river, pen/paper, run/walk, mistake/error, car/wheel
Semantically distant pairs: doctor/beer, painting/January, money/river, apple/penguin, nurse/fruit, pen/river, clown/tramway, car/algebra

Why are two words similar?
Meaning: the two concepts are close in terms of their meaning.
World knowledge: the two concepts have similar properties, often occur together, or occur in similar contexts.
Psychology: we often think of the two concepts together.

Two Types of Relations
Synonymy: two words are (roughly) interchangeable.
Semantic similarity (distance): the words are somehow related; sometimes there is an explicit lexical semantic relationship, often not.

Validity of Semantic Similarity
Is semantic distance a valid linguistic phenomenon?
Experiment (Rubenstein and Goodenough, 1965): compiled a list of word pairs; subjects were asked to judge the semantic distance (from 0 to 4) of each pair.
Result: the rank correlation between subjects is ~0.9. People are consistent!

Why do this?
Task: automatically compute semantic similarity between words.
This can be useful for many applications: detecting paraphrases (e.g., automatic essay grading, plagiarism detection), information retrieval, machine translation.
Why? Because similarity gives us a way to generalize beyond word identities.

Evaluation: Correlation with Humans
Ask the automatic method to rank word pairs in order of semantic distance, compare this ranking with a human-created ranking, and measure the correlation.
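As a concrete sketch (not from the lecture), SciPy computes this rank correlation directly; the word-pair scores below are invented for illustration:

    from scipy.stats import spearmanr

    # Hypothetical word pairs scored by humans (0-4 scale, as in
    # Rubenstein & Goodenough) and by an automatic similarity method.
    human_scores  = [3.9, 3.5, 2.7, 0.4, 0.1]
    system_scores = [0.92, 0.80, 0.60, 0.15, 0.05]

    rho, p = spearmanr(human_scores, system_scores)
    print("Spearman rank correlation:", round(rho, 2))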

Evaluation: Word-Choice Problems
Identify the alternative that is closest in meaning to the target:
accidental: wheedle, ferment, inadvertent, abominate
imprison: incarcerate, writhe, meander, inhibit

Evaluation: Malapropisms
"Jack withdrew money from the ATM next to the band."
band (a malapropism for bank) is unrelated to all of the other words in its context.

Word Similarity: Two Approaches
Thesaurus-based: we've invested in all these resources; let's exploit them!
Distributional: count words in context.

Thesaurus-based Similarity
Use the structure of a resource like WordNet: examine the relationship between the two concepts, and use a metric that converts the relationship into a real number. E.g., path length:
    sim(c1, c2) = -log pathlen(c1, c2)
How would you deal with ambiguous words?
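As an illustrative sketch (not part of the lecture), NLTK's WordNet interface offers such metrics; note that its path_similarity is 1/(1 + pathlen) rather than the negative-log variant above. One common answer to the ambiguity question is to take the maximum over all sense pairs:

    from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

    def wn_similarity(word1, word2):
        # Max path similarity over all synset (sense) pairs of the two words.
        best = 0.0
        for s1 in wn.synsets(word1):
            for s2 in wn.synsets(word2):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    print(wn_similarity("car", "automobile"))  # 1.0: they share a synset
    print(wn_similarity("car", "fruit"))       # much lower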

Thesaurus Methods: Limitations
- The measure is only as good as the resource
- Limited in scope: assumes IS-A relations, works mostly for nouns
- Role of context not accounted for
- Not easily domain-adaptable
- Resources not available in many languages

Distributional Similarity
"Difference of meaning correlates with difference of distribution" (Harris, 1970)
Idea: similar linguistic objects have similar contents (for documents, sentences) or contexts (for words).

Two Kinds of Distributional Contexts
1. Documents as bags of words: similar documents contain similar words; similar words appear in similar documents.
2. Words in terms of neighboring words: "You shall know a word by the company it keeps!" (Firth, 1957). Similar words occur near similar sets of other words (e.g., in a 5-word window).


Word Vectors
A word type can be represented as a vector of features indicating the contexts in which it occurs in a corpus:
    w = (f1, f2, f3, ..., fN)

Context Features
- Word co-occurrence within a window
- Grammatical relations

Context Features: Feature Values
- Boolean
- Raw counts
- Some other weighting scheme (e.g., idf, tf-idf)
- Association values (next slide)
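A minimal sketch of the raw-counts option, assuming a +/-2-word window and a made-up toy corpus:

    from collections import Counter, defaultdict

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "sat", "on", "the", "rug"]]
    window = 2

    # word -> Counter of context words seen within the window
    cooccur = defaultdict(Counter)
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    cooccur[w][sent[j]] += 1

    print(cooccur["cat"])  # Counter({'the': 1, 'sat': 1, 'on': 1})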

Association Metric
A commonly used metric is pointwise mutual information:
    PMI(w, f) = log2 [ P(w, f) / ( P(w) P(f) ) ]
Can be used as a feature value or by itself.
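Continuing the toy sketch above, PMI can be estimated from the co-occurrence counts; this uses a simple count-based estimator (real systems smooth these probabilities and often clip negative values, giving positive PMI):

    import math
    from collections import Counter

    total = sum(sum(ctr.values()) for ctr in cooccur.values())
    word_count = {w: sum(ctr.values()) for w, ctr in cooccur.items()}
    feat_count = Counter()
    for ctr in cooccur.values():
        feat_count.update(ctr)

    def pmi(w, f):
        p_wf = cooccur[w][f] / total
        p_w  = word_count[w] / total
        p_f  = feat_count[f] / total
        return math.log2(p_wf / (p_w * p_f)) if p_wf > 0 else float("-inf")

    print(pmi("cat", "sat"))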

Computing Similarity
Semantic similarity boils down to computing some measure on context vectors.

Words in a Vector Space
In 2 dimensions:
    v = (v1, v2), e.g., v = cat
    w = (w1, w2), e.g., w = computer

Euclidean Distance
    dist(v, w) = sqrt( Σi (vi - wi)² )
Can be oversensitive to extreme values.

Cosine Similarity
Borrowed from information retrieval:
    sim_cosine(v, w) = (v · w) / (|v| |w|) = ( Σi vi wi ) / ( sqrt(Σi vi²) sqrt(Σi wi²) )
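Both of the last two measures in a few lines of NumPy, with arbitrary example vectors:

    import numpy as np

    v = np.array([1.0, 3.0, 0.0, 2.0])
    w = np.array([2.0, 1.0, 1.0, 2.0])

    euclidean = np.sqrt(np.sum((v - w) ** 2))
    cosine = np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

    print(euclidean, cosine)  # cosine lies in [-1, 1]; 1 = same direction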

Distributional Approaches: Discussion
- No thesauri needed: data-driven
- Can be applied to any pair of words
- Can be adapted to different domains

Distributional Profiles: Example

Problem?

Distributional Profiles of Concepts

Semantic Similarity: Celebrity (semantically distant)

Semantic Similarity: Celestial body (semantically close!)

Word Clusters
E.g., the Brown clustering algorithm produces hierarchical clusters based on word context vectors; words in similar parts of the hierarchy occur in similar contexts.
[Figure: a binary tree whose internal nodes are labeled with bit strings (0, 1, 00, 01, 000, 001, 010, 011, 100, 101, ...) and whose leaves are word clusters, e.g., {CEO}, {chairman, president}, {November, October}, {run, sprint, walk}.]
Brown clusters created from Twitter data: http://www.cs.cmu.edu/~ark/tweetnlp/cluster_viewer.html

Document-Word Models
Features in the word vector can be word context counts or PMI scores. Alternatively, features can be the documents in which the word occurs; document occurrence features are useful for topical/thematic similarity.

Topic Models
Latent Dirichlet Allocation (LDA) and variants are known as topic models.
- Learned on a large document collection (unsupervised)
- Latent probabilistic clustering of words that tend to occur in the same document; each topic cluster is a distribution over words
- Generative model: each document is a sparse mixture of topics; each word in the document is chosen by sampling a topic from the document-specific topic distribution, then sampling a word from that topic
- Learned with EM or other techniques (e.g., Gibbs sampling)
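A sketch of fitting such a model with scikit-learn's LatentDirichletAllocation, which uses variational inference rather than Gibbs sampling; the corpus and parameter settings are placeholders (get_feature_names_out assumes scikit-learn 1.0+):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["the cat sat on the mat",
            "stock markets fell sharply today",
            "the dog chased the cat"]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)          # document-word count matrix

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)           # per-document topic mixtures

    # Each row of lda.components_ holds one topic's word weights.
    vocab = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        print("topic", k, ":", [vocab[i] for i in topic.argsort()[-3:]])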

Topic Models
(Figure from http://cacm.acm.org/magazines/2012/4/147361-probabilistic-topic-models/fulltext)

More on topic models: Mark Dredze (JHU), "Topic Models for Identifying Public Health Trends", tomorrow, 11:00 in STM 326.

DIMENSIONALITY REDUCTION
(Slides based on a presentation by Christopher Potts)

Why dimensionality reduction?
So far, we've defined word representations as rows in F, an m x n matrix, where m = vocabulary size and n = number of context dimensions/features.
Problems: n is very large, and F is very sparse.
Solution: find a low-rank approximation of F, a matrix of size m x d where d << n.

Methods
- Latent Semantic Analysis
- Also: principal component analysis, probabilistic LSA, Latent Dirichlet Allocation, word2vec

Latent Semantic Analysis
Based on Singular Value Decomposition (SVD).

LSA illustrated: SVD + select top k dimensions
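The same recipe in NumPy, with a tiny made-up stand-in for the word-document matrix F:

    import numpy as np

    # Toy F: rows are words, columns are documents (values are counts).
    F = np.array([[2., 0., 1., 0.],
                  [1., 0., 0., 1.],
                  [0., 3., 0., 2.],
                  [0., 2., 1., 1.]])

    U, S, Vt = np.linalg.svd(F, full_matrices=False)

    k = 2                      # keep only the top-k singular dimensions
    W = U[:, :k] * S[:k]       # dense k-dimensional word representations

    print(W.shape)             # (4, 2): one 2-d vector per word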

Word embeddings based on neural language models
So far: distributional vector representations constructed from counts (+ dimensionality reduction).
Recent finding: neural networks trained to predict neighboring words (i.e., language models) learn useful low-dimensional word vectors.
- Dimensionality reduction is built into the NN learning objective
- Once the neural LM is trained on massive data, the word embeddings can be reused for other tasks

Word vectors as a byproduct of language modeling
"A Neural Probabilistic Language Model", Bengio et al., JMLR 2003


Using neural word representations in NLP
Word representations from neural LMs, a.k.a. distributed word representations, a.k.a. word embeddings.
How would you use these word vectors?
Turian et al. (2010): word representations used as features consistently improve performance on named-entity recognition and text chunking tasks.

Word2vec
Mikolov et al. (2013) introduce simpler models: https://code.google.com/p/word2vec

Word2vec claims
- Useful representations for NLP applications
- Can discover relations between words using vector arithmetic: king - male + female = queen
The paper and tool received lots of attention, even outside the NLP research community. Try it out at the word2vec playground: http://deeplearner.fz-qqq.net/
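A sketch using gensim's Word2Vec (4.x API); the two-sentence corpus is a toy placeholder, far too small for the analogy to actually emerge:

    from gensim.models import Word2Vec

    sentences = [["the", "king", "is", "a", "male", "monarch"],
                 ["the", "queen", "is", "a", "female", "monarch"]]

    model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)

    # king - male + female, as vector arithmetic over learned embeddings:
    print(model.wv.most_similar(positive=["king", "female"],
                                negative=["male"], topn=1))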

Summary
Given a large corpus, the meanings of words can be approximated in terms of words occurring nearby: distributional context.
- Each word is represented as a vector of integer or real values.
- Different ways to choose context, e.g., context windows.
- Different ways to count cooccurrence, e.g., (positive) PMI.
- Vectors can be sparse (1 dimension for every context) or dense (reduced dimensionality, e.g., with Brown clustering or LSA).
This facilitates measuring similarity between words, which is useful for many NLP tasks!
- Different similarity measures, e.g., cosine (= normalized dot product).
- Evaluations: human relatedness judgments; extrinsic tasks.