Similarity and Vectors

Adam Meyers New York University

Summary
- Vectors representing documents: IR and document classification
- Similarity between vectors
- Vectors representing words: word similarity, word sense disambiguation, paraphrase/entailment
- Reducing dimensions of large vectors
- Neural networks, a.k.a. deep learning

Term Document Matrix: Information Retrieval Lecture 6 & Homework 5
- Matrix of documents and words: columns = documents, rows = words
- Rows are vectors; columns are dimensions of the vectors
- Scores in the matrix = TF-IDF scores: how significant is the word t at a row for the document at a column?
- TFIDF(t) = TF(t) × IDF(t)
  - TF(t) = measure of the frequency of t in the document
  - IDF(t) = measure of how few documents contain t
  - IDF(t) = log( NumberOfDocuments / NumberOfDocumentsContaining(t) )

Example: coconut milk vs. tablespoon
- coconut milk occurs ~3 times in a chicken and coconut soup recipe
  - term frequency = 3
  - occurs in 4 out of 10,000 documents in the collection
  - inverse document frequency = log(10000/4) = log(2500) = 7.82
  - TF-IDF = 3 × 7.82 = 23.46
- tablespoon occurs 4 times in the chicken and coconut soup recipe
  - term frequency = 4
  - occurs in 1,200 out of 10,000 documents in the corpus
  - inverse document frequency = log(10000/1200) = log(8.33) = 2.12
  - TF-IDF = 4 × 2.12 = 8.48
- So coconut milk is more highly weighted for Thai soup recipes than tablespoon
- Note: the suitability of a query term may depend on the nature of the collection
  - If this is a collection of recipes, tablespoon is not a good search term
  - If the collection is diverse (instructions, news, ...), tablespoon may be a good search term
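The weighting in this example can be reproduced with a few lines of Python. This is a minimal sketch, assuming the natural log used in the worked numbers above; the function name tf_idf is illustrative.

import math

def tf_idf(term_freq, num_docs, docs_containing_term):
    # TF-IDF(t) = TF(t) * log(NumberOfDocuments / NumberOfDocumentsContaining(t))
    idf = math.log(num_docs / docs_containing_term)
    return term_freq * idf

print(tf_idf(3, 10000, 4))     # coconut milk: 3 * log(2500) ~ 23.5
print(tf_idf(4, 10000, 1200))  # tablespoon:   4 * log(8.33) ~ 8.5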

Cosine Similarity: Similarity Between Vectors
- Similarity(A, B) = cosine of the angle between the vectors
- cos(A, B) = Σᵢ aᵢbᵢ / ( √(Σᵢ aᵢ²) × √(Σᵢ bᵢ²) )
- Cosine similarity is high if the values of a and b are similar, i.e., if the angle between the vectors is small
- Used for all kinds of vectors: we applied it to information retrieval, but it also applies to word sense disambiguation, sentiment analysis, paraphrase/entailment, ...
- Other similarity metrics: Jaccard, Dice, KL divergence, etc.
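A minimal sketch of this formula in Python, using plain lists and no external libraries; the function name cosine_similarity is just for illustration.

import math

def cosine_similarity(a, b):
    # sum_i a_i*b_i / (sqrt(sum_i a_i^2) * sqrt(sum_i b_i^2))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)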

Information Retrieval Example
- Vectors have values corresponding to the terms: potato chip, chicken, sesame seed, coconut milk, ground beef
- 2 queries
  - Q1 chicken, coconut milk: (0, 5, 0, 5, 0)
  - Q2 ground beef, potato chip: (4, 0, 0, 0, 7)
- 2 documents
  - D1 Chicken and Coconut Soup recipe: (0, 7, 0, 9, 0)
  - D2 Hamburger recipe: (3, 0, 2, 0, 9)
- Cosine similarities (scaled to 0-100):

        Q1     Q2
  D1    99.2    0
  D2     0     95.9
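These numbers can be checked with a short numpy sketch (a minimal verification using the vectors from this slide; the values are printed in the 0-1 range rather than scaled to 0-100).

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

q1 = np.array([0, 5, 0, 5, 0])   # chicken, coconut milk
q2 = np.array([4, 0, 0, 0, 7])   # ground beef, potato chip
d1 = np.array([0, 7, 0, 9, 0])   # Chicken and Coconut Soup recipe
d2 = np.array([3, 0, 2, 0, 9])   # Hamburger recipe

print(cos(q1, d1), cos(q1, d2))  # ~0.992, 0.0
print(cos(q2, d1), cos(q2, d2))  # 0.0, ~0.959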

Other Uses of Document Vectors
- Document classification
  - Given sets of documents with known classifications
  - Compute an average vector for each class
  - Create a vector for the unclassified document
  - Place the new document in the class with the highest similarity
- Sentiment analysis
  - Like document classification, but the classes are sentiments
  - But may need different vectors for different domains/types of products/etc.
  - Words relevant to sentiment are selected as the dimensions of the vectors
    - Part of the challenge = the choice of words (great, terrible, ...)
    - Maybe domain-specific (low interest: loans vs. investments)
  - Adjustments to account for negation: combine negative words with nearby sentiment words, e.g., don't like → not_like
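A minimal sketch of the average-vector (nearest-centroid) classification described above, assuming documents are already represented as numpy vectors (e.g., TF-IDF weighted); the names train_centroids and classify are illustrative, not from the slides.

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def train_centroids(labeled_vectors):
    # labeled_vectors: dict mapping class name -> list of document vectors
    # returns one average (centroid) vector per class
    return {label: np.mean(vectors, axis=0) for label, vectors in labeled_vectors.items()}

def classify(doc_vector, centroids):
    # place the new document in the class whose centroid is most similar
    return max(centroids, key=lambda label: cos(doc_vector, centroids[label]))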

Word-Word Matrix Using Pointwise Mutual Information
- Word-word matrix (a.k.a. word embedding)
  - Rows represent a word R
  - Columns (a.k.a. dimensions) represent words C co-occurring with R
  - Can be generalized to multi-words (n-grams, phrases, ...): word to multi-word, multi-word to multi-word
  - Context can be defined in other ways, e.g., proximity in a syntactic tree
- Approximation of meaning
  - Words in the same contexts tend to have similar meanings (Harris, 1954)
  - "You shall know a word by the company it keeps" (Firth, 1957)
- Scores in the matrix: how related is word R to the word represented by column C?
- Pointwise Mutual Information: PMI(R, C) = log( prob(R, C) / (prob(R) × prob(C)) )

Modifications to PMI
- Negative values should be treated as 0 (positive PMI)
- PMI is high for low-frequency words
  - Example: banana occurs once in a corpus of 1K words, face occurs twice, and banana face occurs once
  - PMI(banana, face) = log₂( .001 / (.001 × .002) ) = log₂(500) ≈ 9.0
- Smoothing: different methods that raise the denominator slightly, which offsets this effect
  - Example (Laplace): add a small constant to all counts, e.g., add 1 (banana = 2, banana face = 2, face = 3)
  - PMI(banana, face) = log₂( .002 / (.002 × .003) ) ≈ 8.4
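A minimal sketch of PPMI (positive PMI) with add-k smoothing over a word-by-context count table; the structure of the counts dictionary and the constant k are illustrative assumptions, not values from the slides.

import math
from collections import defaultdict

def ppmi(cooc_counts, k=1.0):
    # cooc_counts: dict mapping (row_word, context_word) -> co-occurrence count
    # add-k smoothing: add a small constant to every count to damp PMI for rare words
    smoothed = {pair: c + k for pair, c in cooc_counts.items()}
    total = sum(smoothed.values())
    row_totals = defaultdict(float)
    col_totals = defaultdict(float)
    for (r, c), count in smoothed.items():
        row_totals[r] += count
        col_totals[c] += count
    scores = {}
    for (r, c), count in smoothed.items():
        pmi = math.log2((count / total) / ((row_totals[r] / total) * (col_totals[c] / total)))
        scores[(r, c)] = max(pmi, 0.0)   # negative values are treated as 0 (PPMI)
    return scores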

Sample Word Embedding 1
- Assume a bag-of-words approach: the order of words doesn't matter
- Assume that words are stemmed
- Use words in a window of K words before and K words after the word R
  - Let's assume K = 5 for this example
- Eliminate stop words and high-frequency (low-IDF) words
- Use integers in the vectors (scores are usually between 0 and 1)
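A minimal sketch of window-based co-occurrence counting under the assumptions above (a ±K window over a bag of already-stemmed words); the stop-word handling and tokenization are simplified illustrations. Its output could be fed to a PPMI weighting like the sketch above.

from collections import Counter

def cooccurrence_counts(tokens, k=5, stop_words=frozenset()):
    # tokens: a list of (already stemmed) words from the corpus
    # counts how often each context word appears within k words of each target word
    counts = Counter()
    words = [w for w in tokens if w not in stop_words]
    for i, target in enumerate(words):
        window = words[max(0, i - k):i] + words[i + 1:i + 1 + k]
        for context in window:
            counts[(target, context)] += 1
    return counts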

Sample Word Embedding 2
- From a hypothetical recipe corpus
- Rows = words being classified; columns = words in context
- Numbers = arbitrary scores ranking the likelihood that the column word occurs within ±5 words of the row word (higher number → higher rank)

            cup  ounce  taste  chicken  stir  bake  chocolate
  beef       1     4      1       0      4     5       0
  cabbage    3     0      0       0      0     5       0
  lemon      3     3      4       2      2     0       1
  parsley    2     1      4       2      1     2       0
  pepper     0     4      4       3      0     5       0
  salt       1     3      4       4      0     5       1
  sugar      5     1      4       0      1     2       5

Cosine Similarity for the Word Vectors from the Previous Slide

            beef  cabbage  lemon  parsley  pepper  salt  sugar
  beef       1      .63     .54     .57     .72    .66    .41
  cabbage   .63      1      .25     .51     .53    .58    .51
  lemon     .54     .25      1      .86     .64    .68    .74
  parsley   .57     .51     .86      1      .81    .86    .69
  pepper    .72     .53     .64     .81      1     .97    .44
  salt      .66     .58     .68     .86     .97     1     .56
  sugar     .41     .51     .74     .69     .44    .56     1
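This similarity matrix can be reproduced from the toy table on the previous slide with a short numpy sketch (a minimal verification; row order follows the table).

import numpy as np

words = ["beef", "cabbage", "lemon", "parsley", "pepper", "salt", "sugar"]
vectors = np.array([
    [1, 4, 1, 0, 4, 5, 0],   # beef
    [3, 0, 0, 0, 0, 5, 0],   # cabbage
    [3, 3, 4, 2, 2, 0, 1],   # lemon
    [2, 1, 4, 2, 1, 2, 0],   # parsley
    [0, 4, 4, 3, 0, 5, 0],   # pepper
    [1, 3, 4, 4, 0, 5, 1],   # salt
    [5, 1, 4, 0, 1, 2, 5],   # sugar
], dtype=float)

normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = normalized @ normalized.T
print(np.round(similarity, 2))   # e.g., cos(beef, cabbage) ~ 0.63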

Demo for finding similar words: http://demo.patrickpantel.com/demos/lexsem/thesaurus.htm

Word Sense Disambiguation
- Demo of a word sense disambiguator: http://www.ling.gu.se/~lager/home/pwe_ui.html
- Sense-annotated data for shared tasks includes SemCor: http://web.eecs.umich.edu/~mihalcea/downloads.html#semcor
- Using word vectors for word sense disambiguation
  - Vectors represent word senses rather than words
  - Need a sense-annotated corpus
  - Create vectors for the words in the new text
  - Compute the similarity of the words in the new text with the sense vectors and choose the most similar sense
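A minimal sketch of the vector-based disambiguation step described above, assuming sense vectors have already been built from a sense-annotated corpus; disambiguate and the sense labels are illustrative names.

import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def disambiguate(context_vector, sense_vectors):
    # sense_vectors: dict mapping a sense label (e.g., "bank_1") -> vector built
    # from a sense-annotated corpus; returns the label of the most similar sense
    return max(sense_vectors, key=lambda sense: cos(context_vector, sense_vectors[sense]))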

Paraphrase and Entailment
- SemEval Text Similarity Task (Task 1)
  - http://alt.qcri.org/semeval2014/task1/ (webpage)
  - https://aclweb.org/anthology/s/s16/s16-1081.pdf (write-up)
- Input: pairs of text snippets
  - English/English pairs (like previous years' tasks)
  - Spanish/English pairs (an innovation for that year): the previous snippets, with one member of the pair translated
- The system produces a score from 0 to 5 indicating similarity
- Manually tagged data (test, dev, and training sets)
  - Data collection: snippets selected based on heuristics and manually annotated
  - One heuristic is based on word embedding similarity, where the embedding of a sentence = the sum of the embeddings of its words
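A minimal sketch of the sum-of-word-embeddings heuristic mentioned above; the embeddings dictionary is an illustrative assumption (in practice it would come from a trained embedding model).

import numpy as np

def sentence_embedding(tokens, embeddings):
    # embeddings: dict mapping a word -> its vector (e.g., from word2vec)
    # sentence embedding = sum of the embeddings of its words
    dim = len(next(iter(embeddings.values())))
    return sum((embeddings.get(w, np.zeros(dim)) for w in tokens), np.zeros(dim))

def sentence_similarity(tokens_a, tokens_b, embeddings):
    a = sentence_embedding(tokens_a, embeddings)
    b = sentence_embedding(tokens_b, embeddings)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))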

Human judges rate similarity from 0 to 5 (from Agirre et al.):
- 5: mean exactly the same thing
  - The bird is in the sink / Birdie is washing itself in the water basin
- 4: mostly the same, differences unimportant
  - In May 2010, the troops attempted to invade Kabul / The US army invaded Kabul on May 7th last year, 2010
- 3: roughly the same, with important differences/omissions
  - John said he is considered a witness but not a suspect / He is not a suspect anymore. John said
- 2: same topic, share some details
  - They flew out of the nest in groups / They flew out of the nest together
- 1: same topic
  - The woman is playing the violin / The young lady enjoys listening to guitar
- 0: dissimilar
  - John went horseback riding with a whole group of friends / Sunrise at dawn is a magnificent view to take in if you wake up early enough for it

Evaluation
- Systems are scored by the Pearson correlation between their scores and the manual annotation
- Samsung's system got the highest score: .7781
- I looked at the papers about the top 3 systems: all used word embeddings in one form or another
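A minimal sketch of the evaluation metric (Pearson correlation between system scores and gold annotations), using numpy; the score lists are hypothetical placeholders, not task data.

import numpy as np

system_scores = [4.2, 1.0, 3.5, 0.2, 5.0]   # hypothetical system outputs
gold_scores   = [4.0, 1.5, 3.0, 0.0, 5.0]   # hypothetical human annotations

pearson = np.corrcoef(system_scores, gold_scores)[0, 1]
print(pearson)   # correlation between the two score lists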

The Top System (Samsung) Used Word Embeddings
- Vectors contained words and multi-word phrases
- Methods for combining the embeddings of words into embeddings of sentences
- Used other features, e.g., from WordNet
- Used dependency parses of the snippets
- Machine learning algorithms (e.g., SVM) to predict the 0-to-5 textual similarity score
  - Features include the cosine similarity of the roots of the parses
  - Similarity is derived by combining the children's similarities according to an algorithm
- Most top systems used word embeddings

Real Vectors Have Many Dimensions
- The preceding toy examples use few dimensions; vectors often have tens of thousands of dimensions
- More dimensions mean:
  - Better output (higher recall and precision)
  - Slower speed (e.g., it takes longer to compute similarity)
- Large vectors are sparse (lots of zeros)
- Context: a window of 3 to 17 words (or the whole sentence)
- Reducing dimensions yields smaller, less sparse vectors
  - Captures generalizations, gives more efficient processing, etc.
  - One such method is called Latent Semantic Analysis
- Many other methods exist for refining vector-based analyses

Latent Semantic Analysis: Reducing Dimensions
[Figure: an original 2-D vector space is rotated/moved so the points lie closer to the x and y axes; one dimension is then eliminated]
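A minimal sketch of this kind of dimensionality reduction using a truncated SVD, the linear-algebra operation underlying LSA; the toy matrix and the target rank k are illustrative assumptions.

import numpy as np

def reduce_dimensions(term_doc_matrix, k):
    # truncated SVD: keep only the k largest singular values and their vectors
    u, s, vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return u[:, :k] * s[:k]          # each row = a reduced k-dimensional vector

# usage: reduce the 5-dimensional toy vectors from the IR example to 2 dimensions
matrix = np.array([[0, 5, 0, 5, 0],
                   [0, 7, 0, 9, 0],
                   [4, 0, 0, 0, 7],
                   [3, 0, 2, 0, 9]], dtype=float)
print(reduce_dimensions(matrix, 2))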

Other Factors
- Softmax functions: functions that normalize a range of values into the range 0 to 1 (summing to 1), so they can be used as probabilities
- Eliminating dimensions that do not discriminate between vectors: high/low frequency words, words with low IDF, etc.
- Feature types
  - Bag-of-words features (so far)
  - Features that include relative positions
  - Features based on parser output, dictionaries, other databases, ...
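A minimal sketch of a softmax function in numpy; subtracting the maximum is a standard numerical-stability trick, not something from the slides.

import numpy as np

def softmax(scores):
    # exponentiate and normalize so the outputs lie in (0, 1) and sum to 1
    exps = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))    # e.g., ~[0.66, 0.24, 0.10]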

Deep Learning
- Initialize vectors with scores predicting words given neighboring words
- Randomly initialize weights according to a prior distribution
- Randomly initialize parameterized-length matrices (the weights of the network); these represent the layers of the neural network
- Weights are tuned by running multiple times on different pieces of the training corpus
  - On each batch, weights are adjusted to improve the probabilities
  - For example, maximizing the average log of the probabilities that each (center) word is predicted by its neighboring words
  - Training ends when the probabilities converge or after a maximum number of iterations
- Example deep learning (a.k.a. neural network) approaches (see the sketch below):
  - Word2Vec: CBOW and skip-gram
  - Convolutional neural networks
  - Recurrent neural networks
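A minimal sketch of training a skip-gram Word2Vec model, assuming the gensim library (version 4 or later, where the size parameter is called vector_size); the toy sentences are illustrative.

from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens
sentences = [
    ["chicken", "coconut", "milk", "soup"],
    ["ground", "beef", "hamburger", "recipe"],
    ["salt", "pepper", "taste"],
]

# sg=1 selects skip-gram (sg=0 would select CBOW); window = context size in words
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1)
print(model.wv["chicken"])                # the learned 50-dimensional embedding
print(model.wv.most_similar("chicken"))   # nearest neighbors by cosine similarity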

Deep Learning at NYU
- Machine translation: Prof. Kyunghyun Cho (http://www.kyunghyuncho.me/)
- Natural language semantics: Prof. Sam Bowman (https://www.nyu.edu/projects/bowman/)
- ACE event detection: Thien Nguyen (http://www.cs.nyu.edu/~thien/)
- And others

Documentation and Code
- Jurafsky and Martin, 3rd edition (Chapters 15 and 16): https://web.stanford.edu/~jurafsky/slp3/
- Word2Vec:
  - https://www.tensorflow.org/versions/r0.12/tutorials/word2vec/index.html
  - https://deeplearning4j.org/word2vec
  - https://github.com/dav/word2vec

Summary
- Vector characterizations of documents
  - Dimensions represent terms relevant to classification
  - IR: dimensions represent query terms
  - Sentiment: dimensions represent opinion words
  - Topics: dimensions represent topic words
- Vector characterizations of words (word embeddings)
  - Dimensions represent words in context within a window
  - Related words/word senses/translations/etc. have similar embeddings
- Dimensions are weighted using TF-IDF, PMI, and other metrics
- Similarity is calculated with cosine similarity, Jaccard similarity, ...
- Real systems use large sparse vectors which are converted into smaller dense vectors using various deep learning methods