Distributional Semantics and Word Embeddings

Announcements: Midterm returned at end of class today (only exams that were taken on Thursday). Today: moving into neural nets via word embeddings. Tuesday: introduction to basic neural net architecture; Chris Kedzie to lecture. Homework out on Tuesday: language applications using different architectures.

Slide from Kapil Thadani

Slide from Kapil Thadani

Slide from Kapil Thadani

Methods so far. WordNet: an amazing resource... but what are some of its disadvantages?

Methods so far: Bag of words. Simple and interpretable. In vector space, represent a sentence such as "John likes milk" as [0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0], a one-hot vector. Values could be frequency or TF*IDF. Sparse representation. Dimensionality: 50K unigrams, 500K bigrams. Curse of dimensionality!
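
To make the sparse bag-of-words representation above concrete, here is a minimal Python sketch; the toy vocabulary and sentence are illustrative stand-ins, not the lecture's 50K-unigram vocabulary.

from collections import Counter

# Hypothetical toy vocabulary; a real one would have ~50K unigrams.
vocab = ["John", "likes", "milk", "bread", "Mary"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens, word_to_index):
    """Return a |V|-dimensional count vector for one sentence."""
    vec = [0] * len(word_to_index)
    for token, count in Counter(tokens).items():
        if token in word_to_index:
            vec[word_to_index[token]] = count
    return vec

print(bag_of_words("John likes milk".split(), word_to_index))
# -> [1, 1, 1, 0, 0]; the values could also be TF*IDF weights instead of counts.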

From Symbolic to Distributed Representations. It's a problem, e.g., for web search: if a user searches for [Dell notebook battery], we should match documents with "Dell laptop battery"; if a user searches for [Seattle motel], we should match documents containing "Seattle hotel". But Motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0] and Hotel = [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]: our query and document vectors are orthogonal. There is no natural notion of similarity in a set of one-hot vectors -> explore a direct approach where vectors encode it. Slide from Chris Manning
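
A quick numeric check of the orthogonality point, using an assumed toy vocabulary:

import numpy as np

vocab = ["the", "a", "motel", "hotel", "battery", "laptop"]

def one_hot(word, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# The dot product of two distinct one-hot vectors is always 0,
# so "motel" and "hotel" look completely unrelated.
print(np.dot(one_hot("motel", vocab), one_hot("hotel", vocab)))  # 0.0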

Distributional Semantics. "You shall know a word by the company it keeps" [J.R. Firth 1957]. "Marco saw a hairy little wampunuk hiding behind a tree." Words that occur in similar contexts have similar meaning. Record word co-occurrence within a window over a large corpus.

Word-Context Matrices. Each row i represents a word; each column j represents a linguistic context. Entry (i, j) represents the strength of association: M^f ∈ R^{|V_W| × |V_C|}, with M^f_{i,j} = f(w_i, c_j), where f is an association measure of the strength between a word and a context. Example:

         I     hamburger  book   gift   spoon
  ate   .45      .56      .02    .03     .3
  gave  .46      .13      .67    .7      .25
  took  .46      .1       .7     .5      .3
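
A minimal sketch of accumulating such a word-context matrix with a ±2-word window; the three-sentence corpus is a made-up stand-in and f here is a raw count, so the numbers will not match the table above.

from collections import defaultdict
import numpy as np

corpus = [
    "i ate a hamburger with a spoon".split(),
    "i gave her a book and a gift".split(),
    "i took the book".split(),
]

window = 2
counts = defaultdict(lambda: defaultdict(float))
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][sentence[j]] += 1.0   # f(w_i, c_j) as a raw count

words = sorted(counts)
contexts = sorted({c for w in counts for c in counts[w]})
M = np.array([[counts[w][c] for c in contexts] for w in words])
# M[i, j] holds f(w_i, c_j); any association measure (e.g., PMI) can replace raw counts.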

Associations and Similarity. Effective association measure: Pointwise Mutual Information (PMI): PMI(w, c) = log P(w,c) / (P(w) P(c)) = log (#(w,c) * |D|) / (#(w) * #(c)). Compute similarity between words and texts with cosine similarity: cos(u, v) = Σ_i u_i v_i / (sqrt(Σ_i u_i^2) * sqrt(Σ_i v_i^2)).
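
The two formulas above translate almost directly into code; this sketch assumes a raw co-occurrence count matrix M (rows = words, columns = contexts) such as the one built in the previous sketch.

import numpy as np

def pmi_matrix(M):
    """PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) ), computed from counts."""
    total = M.sum()                                 # |D|: total number of (w, c) pairs
    word_counts = M.sum(axis=1, keepdims=True)      # #(w)
    context_counts = M.sum(axis=0, keepdims=True)   # #(c)
    with np.errstate(divide="ignore"):
        pmi = np.log(M * total / (word_counts * context_counts))
    return np.where(M > 0, pmi, 0.0)                # leave unseen pairs at 0

def cosine(u, v):
    """Cosine similarity: sum_i u_i v_i / (sqrt(sum_i u_i^2) * sqrt(sum_i v_i^2))."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))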

Dimensionality Reduction. Captures context, but still has sparseness issues. Singular value decomposition (SVD) factors matrix M into two narrow matrices, W (a word matrix) and C (a context matrix), such that W C^T = M' is the best rank-d approximation of M. A smoothed version of M: adds words to contexts if other words in this context seem to co-locate with each other. Represents each word as a dense d-dimensional vector instead of a sparse |V_C|-dimensional one.
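
A sketch of the SVD step using numpy; d is an illustrative embedding size, and folding the singular values into W is one common convention rather than the only choice.

import numpy as np

def svd_embeddings(M, d):
    """Return dense d-dimensional word vectors W and context vectors C
    such that W @ C.T is the best rank-d approximation of M."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    W = U[:, :d] * S[:d]     # word matrix (singular values folded into W)
    C = Vt[:d, :].T          # context matrix
    return W, C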

Slide from Kapil Thadani

Neural Nets. A family of models within deep learning. The machine learning approaches we have seen to date rely on feature engineering; with neural nets, instead we learn by optimizing a set of parameters.

Why Deep Learning? Representation learning attempts to automatically learn good features or representations. Deep learning algorithms attempt to learn (multiple levels of) representation and an output from raw inputs x (e.g., sound, characters, words). Slide adapted from Chris Manning

Reasons for Exploring Deep Learning. Manually designed features can be overspecific or take a long time to design, but can provide an intuition about the solution. Learned features are easy to adapt. Deep learning provides a very flexible framework for representing word, visual and linguistic information. Both supervised and unsupervised methods. Slide adapted from Chris Manning

Progress with deep learning. Huge leaps forward with speech, vision, and machine translation [Krizhevsky et al. 2012]. More modest advances in other areas.

From Distributional Semantics to Neural Networks. Instead of count-based methods, distributed representations of word meaning: each word is associated with a vector where meaning is captured in different dimensions as well as in dimensions of other words. Dimensions in a distributed representation are not interpretable; specific dimensions do not correspond to specific concepts.

Basic Idea of Learning Neural Network Embeddings. Define a model that aims to predict between a center word w_t and context words in terms of word vectors: p(context | w_t) = ... This has a loss function, e.g., J = 1 - p(w_{-t} | w_t). We look at many positions t in a large corpus and keep adjusting the vector representations of words to minimize the loss. Slide adapted from Chris Manning

Embeddings Are Magic: vector('king') - vector('man') + vector('woman') ≈ vector('queen'). Slide from Dragomir Radev; image courtesy of Jurafsky & Martin

Relevant approaches, in Yoav Goldberg's book: Chapter 9: A neural probabilistic language model (Bengio et al. 2003); Chapter 10, p. 113: NLP (almost) from Scratch (Collobert & Weston 2008); Chapter 10, p. 114: word2vec (Mikolov et al. 2013).

Main Idea of word2vec: predict between every word and its context. Two algorithms: Skip-gram (SG), which predicts context words given the target (position independent), and Continuous Bag of Words (CBOW), which predicts the target word from the bag-of-words context. Slide adapted from Chris Manning

Training Methods. Two (moderately efficient) training methods: hierarchical softmax and negative sampling. Today: naïve softmax. Slide adapted from Chris Manning

Example sentences with a center word and a 2-word context window on each side: "Instead, a bank can hold the investments in a custodial account" and "But as agriculture burgeons on the east bank, the river will shrink" (center word w_t = bank; the two words on either side are the context words).

Objective Function. Maximize the probability of context words given the center word: J'(Θ) = Π_{t=1..T} Π_{-m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; Θ). Negative log likelihood: J(Θ) = -(1/T) Σ_{t=1..T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t), where Θ represents all variables to be optimized. Slide adapted from Chris Manning
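
A sketch of how the negative log-likelihood J(Θ) is accumulated over positions t and window offsets j; prob_context_given_center is a placeholder for the softmax defined on the next slide, and the uniform "model" in the usage line is purely illustrative.

import math

def skip_gram_loss(tokens, m, prob_context_given_center):
    """J = -(1/T) * sum_t sum_{-m <= j <= m, j != 0} log P(w_{t+j} | w_t)."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            total += math.log(prob_context_given_center(tokens[t + j], tokens[t]))
    return -total / T

tokens = "the cat sat on the mat".split()
print(skip_gram_loss(tokens, m=2, prob_context_given_center=lambda o, c: 0.1))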

Softmax: using word c to obtain the probability of word o. Convert P(w_{t+j} | w_t) to P(o | c) = exp(u_o^T v_c) / Σ_{w=1..V} exp(u_w^T v_c): exponentiate to make positive, then normalize. Here o is the outside (or output) word index, c is the center word index, and v_c and u_o are the center and outside vectors of indices c and o. Slide adapted from Chris Manning
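
A sketch of this softmax with numpy; U and V are toy parameter matrices (outside and center vectors) drawn at random, so the probabilities are meaningless but the computation matches the formula.

import numpy as np

def prob_outside_given_center(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    U: |V| x d outside vectors, V: |V| x d center vectors; o, c are word indices."""
    scores = U @ V[c]            # u_w^T v_c for every word w in the vocabulary
    scores -= scores.max()       # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))   # 5-word vocab, d = 3
print(prob_outside_given_center(o=2, c=0, U=U, V=V))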

Softmax Slide from Dragomir Radev

Dot Product: u^T v = u · v = Σ_{i=1..n} u_i v_i. Bigger if u and v are more similar.

Slide from Kapil Thadani

Embeddings Are Magic: vector('king') - vector('man') + vector('woman') ≈ vector('queen'). Slide from Dragomir Radev; image courtesy of Jurafsky & Martin

Evaluating Embeddings: nearest neighbors, analogies (A:B)::(C:?), information retrieval, semantic hashing. Slide from Dragomir Radev
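
As an illustration of the analogy test, here is a sketch that answers (A:B)::(C:?) by nearest neighbor under cosine similarity; the random embedding matrix is a placeholder, so real pretrained vectors would be needed to get the expected answer.

import numpy as np

def analogy(a, b, c, vocab, E):
    """Return the word whose vector is closest to vector(b) - vector(a) + vector(c)."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    sims = E @ target / (np.linalg.norm(E, axis=1) * np.linalg.norm(target))
    sims[[idx[a], idx[b], idx[c]]] = -np.inf   # exclude the query words themselves
    return vocab[int(np.argmax(sims))]

vocab = ["king", "queen", "man", "woman", "milk"]
E = np.random.default_rng(1).normal(size=(len(vocab), 50))
print(analogy("man", "king", "woman", vocab, E))  # ideally "queen" with real embeddings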

Similarity Data Sets [Table from Faruqui et al. 2016]

[Mikolov et al. 2013]

Semantic Hashing [Salakhutdinov and Hinton]

How are word embeddings used? As features in supervised systems, or as the main representation in a neural net application/task.

Are Distributional Semantics and Word Embeddings all that different?

Homework 2: Max 99.6, Min 4, Stdev 21.4; Mean 82.2, Median 92.1. The vast majority of F1 scores are between 90 and 96.5.

Midterm: Max 95, Min 22.5; Mean 66.6, Median 68.5; Standard Deviation 15. Will be curved, and the curve will be provided in the next lecture.