Word Embeddings: Can Vectors Encode Meaning?

Word Embeddings: Can Vectors Encode Meaning? Katy Gero and Jeff Jacobs. NYC Digital Humanities Week, Feb 9 2018.

Who are we?

Who are you?

Plan
1. Theory of using vectors to represent words (20 min)
2. Practice of creating embeddings (20 min)
3. Applications of embeddings (20 min)
4. Pitfalls and bias (20 min)

Theory: Word representations

How to represent words in computing?

Dictionaries for computers, aka lexical resources

Visualization of ConceptNet https://blog.conceptnet.io/tag/conceptnet/

Problems with lexical resources
1. Requires skilled people: time-consuming to create
2. Personal judgements: prejudiced toward the views of the creators
3. Representations are discrete: hard to share information between words
"...the WordNet team relied on existing lexicographic sources as well as on introspection." (Fellbaum 2010)
Fellbaum, Christiane (2010). "WordNet." pp. 231-243. Dordrecht: Springer Netherlands.

How do we know what a word means? litofar

Does this help? The hairy little litofar hid behind a tree.

The distributional hypothesis: the meaning of words can be discovered purely from the contexts in which they are used. "You shall know a word by the company it keeps." (Firth, 1957) "Here we will discuss how each language can be described in terms of a distributional structure, i.e. in terms of the occurrence of parts (ultimately sounds) relative to other parts, and how this description is complete without intrusion of other features such as history or meaning." (Harris, 1954) John Rupert Firth (1957). "A synopsis of linguistic theory 1930-1955." In Special Volume of the Philological Society. Oxford: Oxford University Press. Harris, Z. S. (1954). "Distributional structure." WORD, vol. 10, no. 2-3, pp. 146-162.

A simple measure of context: the context window. In "The hairy little litofar hid behind a tree.", the target is "litofar" and the words falling inside the window around it are its context.
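
To make the context window concrete, here is a minimal sketch (plain Python; the sentence and window size are just toy choices) of counting which words fall inside a symmetric two-word window around each target:

```python
from collections import Counter, defaultdict

sentence = "the hairy little litofar hid behind a tree".split()
window = 2  # number of words counted on each side of the target

context_counts = defaultdict(Counter)
for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            context_counts[target][sentence[j]] += 1

print(context_counts["litofar"])
# Counter({'hairy': 1, 'little': 1, 'hid': 1, 'behind': 1})
```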

Related to, but not the same as, n-grams. "The hairy little litofar hid behind a tree." Bigrams: "The hairy", "hairy little", "little litofar", ...

Can we do better than n-gram counts? Speech and Language Processing (3rd ed. draft), Dan Jurafsky and James H. Martin, https://web.stanford.edu/~jurafsky/slp3/4.pdf

Pointwise Mutual Information (PMI). Compares the joint probability of seeing word_1 and word_2 together with the probability of seeing them together by chance (based on how frequently each is seen separately). Church, Kenneth Ward, and Patrick Hanks. "Word association norms, mutual information, and lexicography." Computational Linguistics 16.1 (1990): 22-29.
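
Written out, PMI(w1, w2) = log2[ P(w1, w2) / (P(w1) P(w2)) ]. A minimal sketch in Python (the counts below are made up purely for illustration):

```python
import math

def pmi(joint_count, count_1, count_2, total_windows):
    """log2 of how much more often two words co-occur than chance predicts."""
    p_joint = joint_count / total_windows
    p_1 = count_1 / total_windows
    p_2 = count_2 / total_windows
    return math.log2(p_joint / (p_1 * p_2))

# Hypothetical counts: "litofar" and "tree" co-occur in 5 of 10,000 windows,
# and appear 50 and 400 times overall.
print(pmi(5, 50, 400, 10_000))  # ~1.32, i.e. above chance
```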

Meaning is distributed across context

Word \ Context | tree | hairy | ... | plant
dog            | 5    | 3     | ... | 1
cat            | 7    | 2     | ... | 2
litofar        | 5    | 6     | ... | 0

It's like this, but with millions of rows and columns.

Embeddings are vectors: each row of that table is a vector, e.g. dog = [5, 3, ..., 1], and likewise cat and litofar.
- Continuous
- Represents meaning, not just uniqueness
- Can calculate similarity as the cosine of the angle between two vectors (see the sketch below)
- Can do other vector operations
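
As a minimal sketch of the similarity calculation (NumPy; the three-dimensional vectors are truncated versions of the toy rows above):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product of the two vectors after length-normalizing each."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

dog = np.array([5.0, 3.0, 1.0])
cat = np.array([7.0, 2.0, 2.0])
litofar = np.array([5.0, 6.0, 0.0])

print(cosine(dog, cat), cosine(dog, litofar))  # both close to 1: similar contexts
```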

Dimensionality reduction: SVD, PCA, LSA/LSI. We can get to embeddings from PMI and other count-based measures with dimensionality reduction: an m×n word-by-context matrix is factored into an (m×p)(p×p)(p×n) product, and keeping only the top p dimensions gives each word a short, dense vector.
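
A minimal sketch of that reduction (scikit-learn's TruncatedSVD; the matrix here is a random stand-in for a real word-by-context count or PPMI matrix):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# M: m words x n context features; a toy random matrix instead of real counts.
rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(1000, 5000)).astype(float)

svd = TruncatedSVD(n_components=200)  # keep p = 200 dimensions
embeddings = svd.fit_transform(M)     # shape (1000, 200): one dense vector per word
print(embeddings.shape)
```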

Now for something completely different... Neural networks

Get your embeddings for free! (Diagram of a neural language model: the input words "show", "me", "the", ... each look up a vector in the embedding layer, one big shared matrix; a hidden layer follows; the output assigns, to all possible words, the probability of being the next word, which here is "litofar".)

Just give me a good embedding... (Same setup, but all we really want is the embedding layer.) Dear neural net, please make this useful. 200 dimensions. Thanks.

What's your context?
(The same word-by-context table, but now the column labels are unknown: dog = [5, 3, ..., 1], cat = [7, 2, ..., 2], litofar = [5, 6, ..., 0].)
- Count-based: context is interpretable (other words, or selected features)
- Neural network: context is learned

Practice: The embedding layer of neural networks

Language model: predict the next word. (Same diagram: "show", "me", "the", ... pass through the shared embedding layer and a hidden layer, and the output gives, for all possible words, the probability of being the next word, which here is "money".)

Embedding layer is learned implicitly... It is just a lookup table from words to vectors, e.g. me = [.5, .3, ..., .1], show = [.7, .2, ..., .2], litofar = [.5, .6, ..., .0]; embedding "show" means returning its row.
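
A minimal sketch of the same idea in code (PyTorch; the sizes and architecture are illustrative, not the exact model from the slides): an embedding layer shared across input positions, a hidden layer, and a score for every word in the vocabulary.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, context_len = 1000, 200, 128, 3

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # one big shared matrix
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)           # scores over all possible words

    def forward(self, context_ids):                            # (batch, context_len) word ids
        e = self.embed(context_ids).flatten(start_dim=1)       # look up and concatenate embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                                     # logits; softmax gives next-word probabilities

model = TinyLM()
logits = model(torch.tensor([[11, 42, 7]]))  # e.g. hypothetical ids for "show", "me", "the"
# After training with cross-entropy on next-word prediction,
# model.embed.weight holds the learned word embeddings.
```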

Can we make this simpler? Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

More on efficiency: negative sampling. Don't predict which word it is among all words in the vocabulary; instead, distinguish the true word from a small set of noise words drawn from a distribution. word2vec draws from the unigram distribution raised to the 3/4 power. A similar thing goes on in GloVe. Magic? Or smoothing? Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems, 2013.
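
In practice these details are handled by libraries; a minimal sketch of skip-gram with negative sampling in gensim (parameter names as in recent gensim releases; the repeated toy sentence stands in for a real corpus):

```python
from gensim.models import Word2Vec

sentences = [["the", "hairy", "little", "litofar", "hid", "behind", "a", "tree"]] * 100

model = Word2Vec(
    sentences,
    vector_size=100,   # embedding dimensions
    window=2,          # context words on each side
    sg=1,              # skip-gram (sg=0 is CBOW)
    negative=5,        # noise words drawn per true word
    ns_exponent=0.75,  # the unigram-to-the-3/4-power distribution mentioned above
    min_count=1,
)
print(model.wv["litofar"][:5])  # first few dimensions of the learned vector
```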

Wait, but why neural networks and not the count-based stuff? (n-grams, PMI, etc.)

The popularity of word2vec
- Works better than deterministic methods
- Has a catchy name (and slogan) and feels kind of magical
- Pre-trained embeddings are made available
- The method is an exciting research area
- Improves the performance of neural network applications

Word embedding relations in 2D Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

Word embeddings as starbursts Allison Parrish, http://static.decontextualize.com/vecviz/

Off the shelf, or train yourself. Others have trained embeddings that you can download! E.g. Stanford has embeddings trained on 840 billion words from a web crawl (vocabulary size of 2.2 million). You can get a subset of the word2vec embeddings from Python's Natural Language Toolkit directly (or the whole thing from a web download). Or you can download the code and train them yourself (see the sketch after the list). Data and context matter.
- word2vec (Google)
- GloVe (Stanford)
- fastText (Facebook)
- Numberbatch (ConceptNet)
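
A minimal sketch of the off-the-shelf route with gensim's downloader (the model name below is one of the smaller GloVe sets in gensim's catalogue; the 840-billion-word Common Crawl vectors are a separate, much larger download from Stanford):

```python
import gensim.downloader as api

# GloVe vectors trained on Wikipedia + Gigaword, 100 dimensions.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("dog", topn=5))
print(glove.similarity("dog", "cat"))
```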

The effect of the window. "The hairy little litofar hid behind a tree." There is no reason the context window has to be a certain size, symmetric, or even rectangular. Levy, Omer, and Yoav Goldberg. "Dependency-based word embeddings." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2014.

Adding specific information: add synonym/antonym information. Whether you want antonyms to end up similar or dissimilar depends on your use case. N. Mrksic, et al. "Counter-fitting word vectors to linguistic constraints." CoRR, vol. abs/1603.00892, 2016.

Adding general information: add ConceptNet information. Speer and Lowry-Duda. "ConceptNet at SemEval-2017 Task 2: Extending word embeddings with multilingual relational knowledge." arXiv preprint arXiv:1704.03560 (2017). (Also: https://github.com/commonsense/conceptnet-numberbatch)

Evaluations are tricky
- Similarity rankings
- Word analogies (see the sketch below)
- Task-specific evaluations
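
A minimal sketch of the word-analogy style of evaluation (gensim, reusing pretrained GloVe vectors as above): king - man + woman should land near queen, and similarity rankings can be compared against human judgements.

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

# Analogy: which word is to "woman" as "king" is to "man"?
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Similarity ranking: does the model's ordering match human intuition?
for pair in [("dog", "cat"), ("dog", "tree"), ("dog", "democracy")]:
    print(pair, glove.similarity(*pair))
```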

Applications in the digital humanities

Diachronic Word Embeddings. W. Hamilton, et al. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. TACL. https://arxiv.org/abs/1605.09096.

M. Rudolph and D. Blei. 2017. Dynamic Bernoulli Embeddings for Language Evolution. ArXiv Preprint. https://arxiv.org/abs/1703.08052.

Multilingual Word Embeddings M. Rudolph and D. Blei. 2017. Dynamic Bernoulli Embeddings for Language Evolution. ArXiv Preprint. https://arxiv.org/abs/1703.08052.

Translation Mover's Distance. J. Jacobs. 2018. How to Do Things with Translations (forthcoming!). Figure adapted from http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/

Pitfalls and bias

Man is to Computer Programmer as Woman is to Homemaker? T. Bolukbasi, et al. 2016. Man is to Computer Programmer as Woman is to Homemaker?: Debiasing Word Embeddings. ArXiv Preprint. https://arxiv.org/abs/1607.06520

Doesn't Diminish Performance!

Why Does it Matter? (Downstream Tasks) A. Caliskan, et al. 2017. Semantics Derived Automatically from Language Corpora Contain Human-Like Biases. Science. doi:10.1126/science.aal4230

But Training Data! (The Cop-Out) http://genderedinnovations.stanford.edu/case-studies/nlp.html

So Why Does it Matter Again? (Sans Computers) The Sapir-Whorf Hypothesis. "The 'real world' is to a large extent unconsciously built upon the language habits of the group... The worlds in which different societies live are distinct worlds, not merely the same world with different labels attached... We see and hear and otherwise experience very largely as we do because the language habits of our community predispose certain choices of interpretation." (Sapir, 1958) "The world is presented in a kaleidoscopic flux of impressions which has to be organized by our minds, and this means largely by the linguistic systems in our minds. We cut nature up, organize it into concepts, and ascribe significances as we do, largely because we are parties to an agreement to organize it in this way, an agreement that holds throughout our speech community and is codified in the patterns of our language." (Whorf, 1940)

Linguistic Battle Royale

Man is to Computer Programmer Redux. "Debiased word embeddings can hopefully contribute to reducing gender bias in society. At the very least, machine learning should not be used to inadvertently amplify these biases." (15)