Distributional Semantics

Advanced Machine Learning for NLP
Jordan Boyd-Graber
Slides adapted from Yoav Goldberg and Omer Levy

From Distributional to Distributed Semantics
The new kid on the block: deep learning / neural networks.
Distributed word representations: feed text into a neural net, get back word embeddings.
Each word is represented as a low-dimensional vector.
Vectors capture semantics: word2vec (Mikolov et al.).

From Distributional to Distributed Semantics
This part of the talk:
word2vec as a black box
a peek inside the black box
the relation between word embeddings and the distributional representation
tailoring word embeddings to your needs using word2vec

word2vec
dog: cat, dogs, dachshund, rabbit, puppy, poodle, rottweiler, mixed-breed, doberman, pig
sheep: cattle, goats, cows, chickens, sheeps, hogs, donkeys, herds, shorthorn, livestock
november: october, december, april, june, february, july, september, january, august, march
jerusalem: tiberias, jaffa, haifa, israel, palestine, nablus, damascus, katamon, ramla, safed
teva: pfizer, schering-plough, novartis, astrazeneca, glaxosmithkline, sanofi-aventis, mylan, sanofi, genzyme, pharmacia

Working with Dense Vectors: Word Similarity
Similarity is calculated using cosine similarity:
sim(dog, cat) = (dog · cat) / (‖dog‖ ‖cat‖)
For normalized vectors (‖x‖ = 1), this is equivalent to a dot product:
sim(dog, cat) = dog · cat
Normalize the vectors when loading them.
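A minimal numpy sketch of these two formulas (the toy vectors and their values are illustrative, not from the slides):

import numpy as np

# two toy word vectors (made-up values, purely for illustration)
dog = np.array([0.2, 0.7, 0.1])
cat = np.array([0.3, 0.6, 0.2])

# cosine similarity of the raw vectors
cos = dog.dot(cat) / (np.linalg.norm(dog) * np.linalg.norm(cat))

# after L2-normalization, the plain dot product gives the same value
dog_n = dog / np.linalg.norm(dog)
cat_n = cat / np.linalg.norm(cat)
assert np.isclose(dog_n.dot(cat_n), cos)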

Working with Dense Vectors: Finding the most similar words to dog
Compute the similarity from word v to all other words.
This is a single matrix-vector product: W · v
The result is a V-sized vector of similarities. Take the indices of the k highest values.
Fast: for 180k words and d = 300, about 30 ms.

Working with Dense Vectors: Most Similar Words, in python+numpy code
# load_and_norm_vectors is a helper that reads the vectors file and
# L2-normalizes each row; W and words are numpy arrays.
W, words = load_and_norm_vectors("vecs.txt")
w2i = {w: i for i, w in enumerate(words)}      # word -> row index
dog = W[w2i["dog"]]                            # get the dog vector
sims = W.dot(dog)                              # similarities to all words
most_similar_ids = sims.argsort()[-1:-10:-1]   # indices of the 9 highest scores
sim_words = words[most_similar_ids]

Working with Dense Vectors: Similarity to a group of words
Find me words most similar to cat, dog and cow.
Calculate the pairwise similarities and sum them: W · cat + W · dog + W · cow
Now find the indices of the highest values as before.
Three matrix-vector products are wasteful. Better option: W · (cat + dog + cow)
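A short numpy sketch of the cheaper option, reusing the (hypothetical) W, words and w2i from the snippet above:

# sum the query vectors first, then do a single matrix-vector product
query = W[w2i["cat"]] + W[w2i["dog"]] + W[w2i["cow"]]
sims = W.dot(query)                    # equals W.dot(cat) + W.dot(dog) + W.dot(cow)
top_ids = sims.argsort()[-1:-11:-1]    # indices of the 10 highest scores
print(words[top_ids])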

Working with dense word vectors can be very efficient.
But where do these vectors come from?

How does word2vec work?
word2vec implements several different algorithms:
Two training methods: Negative Sampling, Hierarchical Softmax
Two context representations: Continuous Bag of Words (CBOW), Skip-grams
We'll focus on skip-grams with negative sampling; the intuitions apply to the other models as well.

How does word2vec work?
Represent each word as a d-dimensional vector.
Represent each context as a d-dimensional vector.
Initialize all vectors to random weights.
Arrange the vectors in two matrices, W and C.
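A minimal sketch of this setup; the vocabulary size, dimension and initialization scale are illustrative assumptions, not the exact word2vec defaults:

import numpy as np

V, d = 10_000, 100                    # vocabulary size and dimension (assumed)
rng = np.random.default_rng(0)

# one row per word / per context word, small random initial weights
W = (rng.random((V, d)) - 0.5) / d    # word vectors
C = (rng.random((V, d)) - 0.5) / d    # context vectors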

How does word2vec work?
While more text:
Extract a word window:
A springer is [ a cow or heifer close to calving ].
Here heifer is the focus word w (a row of W); the surrounding words a, cow, or, close, to, calving are the context words c1 ... c6 (rows of C).
Try setting the vector values such that
σ(w · c1) + σ(w · c2) + σ(w · c3) + σ(w · c4) + σ(w · c5) + σ(w · c6)
is high.
Create a corrupt example by replacing the focus word with a random word w′:
[ a cow or comet close to calving ]
Try setting the vector values such that
σ(w′ · c1) + σ(w′ · c2) + σ(w′ · c3) + σ(w′ · c4) + σ(w′ · c5) + σ(w′ · c6)
is low.
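A rough numpy sketch of one such update for a single window, building on the W and C initialized above. One negative sample and a fixed learning rate are simplifying assumptions; this illustrates the idea rather than reproducing the reference word2vec code:

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_window_update(W, C, w_id, ctx_ids, neg_id, lr=0.025):
    # w_id: row of the focus word; ctx_ids: rows of its context words;
    # neg_id: row of a randomly sampled "corrupt" word.
    for c_id in ctx_ids:
        w, c, wn = W[w_id].copy(), C[c_id].copy(), W[neg_id].copy()
        g_pos = lr * (1.0 - sigmoid(w.dot(c)))    # push sigma(w . c) towards 1
        g_neg = lr * (0.0 - sigmoid(wn.dot(c)))   # push sigma(w' . c) towards 0
        W[w_id]   += g_pos * c
        W[neg_id] += g_neg * c
        C[c_id]   += g_pos * w + g_neg * wn

# e.g. for the window above: w_id = row of "heifer",
# ctx_ids = rows of a / cow / or / close / to / calving, neg_id = row of "comet".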

How does word2vec work?
The training procedure results in:
w · c for good word-context pairs is high
w · c for bad word-context pairs is low
w · c for ok-ish word-context pairs is neither high nor low
As a result:
Words that share many contexts get close to each other.
Contexts that share many words get close to each other.
At the end, word2vec throws away C and returns W.

Reinterpretation
Imagine we didn't throw away C. Consider the product W Cᵀ.
The result is a matrix M in which:
Each row corresponds to a word.
Each column corresponds to a context.
Each cell holds w · c, the association between a word and a context.
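Continuing the same sketch, the product can be formed explicitly (only feasible here because the toy vocabulary is small):

M = W @ C.T                                  # rows: words, columns: contexts
assert np.isclose(M[3, 7], W[3].dot(C[7]))   # each cell is the w . c association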

Reinterpretation
Does this remind you of something?
It is very similar to SVD over the distributional representation.

Relation between SVD and word2vec
SVD:
Begin with a word-context matrix.
Approximate it with a product of low-rank (thin) matrices.
Use the thin matrix as the word representation.
word2vec (skip-grams, negative sampling):
Learn thin word and context matrices.
These matrices can be thought of as approximating an implicit word-context matrix.
Levy and Goldberg (NIPS 2014) show that this implicit matrix is related to the well-known PPMI matrix.
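For comparison, a rough self-contained sketch of the SVD route over an explicit PPMI word-context matrix; the co-occurrence counts here are random stand-ins, purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
V, d = 2_000, 100                                      # toy vocabulary size and dimension (assumed)
counts = rng.poisson(0.2, size=(V, V)).astype(float)   # stand-in word-context co-occurrence counts

total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total        # P(word)
p_c = counts.sum(axis=0, keepdims=True) / total        # P(context)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_c))
ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)  # positive PMI

# truncated SVD: keep the top-d singular vectors as the word representation
U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
word_vectors = U[:, :d] * S[:d]    # one common choice; variants weight S differently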

Relation between SVD and word2vec
word2vec is a dimensionality reduction technique over an (implicit) word-context matrix, just like SVD.
With a few tricks (Levy, Goldberg and Dagan, TACL 2015) we can get SVD to perform just as well as word2vec.
However, word2vec:
works without building or storing the actual matrix in memory;
is very fast to train and can use multiple threads;
can easily scale to huge data and very large word and context vocabularies.