Grammatical and topical gender in crosslinguistic word embeddings. Kate McCurdy Berlin NLP June

Grammatical and topical gender in crosslinguistic word embeddings Kate McCurdy Berlin NLP June 14 2017

Word embeddings: From (almost) scratch to NLP
Goal: word representations that...
- capture maximal semantic/syntactic information, yet
- require minimal task-specific feature engineering
Neural embeddings to the rescue!
Input: barely processed, massive corpora
- In general: tokenization + trimming the long tail of the vocabulary
- Collobert et al.: capitalization as a feature + a few extra tweaks
- Mikolov et al.: n-gram phrase identification
Output: dense, magically performant vectors
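The "tokenization + trimming the long tail" step can be sketched in a few lines. This is only an illustration, not the preprocessing used in the talk: the whitespace tokenizer, the `min_count` threshold, and the `<unk>` placeholder are all assumptions.

```python
from collections import Counter

def tokenize(line):
    # Assumption: lowercasing + whitespace splitting stands in for a real tokenizer.
    return line.lower().split()

def trim_vocab(sentences, min_count=2, unk="<unk>"):
    # Count every token, then replace rare ones (the "long tail") with a placeholder.
    counts = Counter(tok for sent in sentences for tok in sent)
    keep = {word for word, count in counts.items() if count >= min_count}
    return [[tok if tok in keep else unk for tok in sent] for sent in sentences]

# Toy corpus, made up for illustration.
corpus = ["The cat sat", "the dog sat", "a cat ran"]
sentences = [tokenize(line) for line in corpus]
trimmed = trim_vocab(sentences, min_count=2)
```

Tokens seen fewer than `min_count` times collapse into one placeholder, which keeps the vocabulary small without any task-specific features.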

but there are pitfalls

You shall know a word by the company it keeps. Firth 1957

Pitfall #1 What if your words keep company with some unsavory stereotypes?

Analogous relations in the GloVe word embedding; from Caliskan-Islam et al 2016

Stereotypes in word embeddings: Bolukbasi et al. 2016
- addiction : eating disorder
- accountant : paralegal
- pilot : flight attendant
- athlete : gymnast
- professor emeritus : associate professor

Bias in humans: the Implicit Association Test
Standard psychological test to assess implicit bias
Design: Greenwald et al. 1998
- Two sets of attribute words: Male, man, boy, ... / Female, woman, ...
- Two sets of target words: Children, wedding, ... / Office, salary, ...
- Task: fast left vs. right categorization of both sets
- Measurement: differential association in average response time

WEAT: the Word Embedding Association Test
- Parallels the Implicit Association Test
- Measures the differential association between paired target and attribute word sets via cosine similarity
- Core finding: nearly every prejudice uncovered by the IAT is replicated by the WEAT on Google News and GloVe word embeddings
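A minimal sketch of the WEAT effect size as defined by Caliskan et al., using the slide's terminology: target sets X and Y (e.g. career vs. family words) and attribute sets A and B (e.g. male vs. female words). The 2-D vectors are made up for illustration, and the use of the sample standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, stdev

def cosine(u, v):
    # Cosine similarity between two vectors given as lists of floats.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B, emb):
    # s(w, A, B): mean similarity of w to attribute set A minus mean similarity to B.
    return (mean([cosine(emb[w], emb[a]) for a in A])
            - mean([cosine(emb[w], emb[b]) for b in B]))

def weat_effect_size(X, Y, A, B, emb):
    # Difference of mean associations of the two target sets,
    # normalized by the standard deviation over all target words.
    sx = [association(x, A, B, emb) for x in X]
    sy = [association(y, A, B, emb) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)

# Toy 2-D "embeddings" (hypothetical values, not from any trained model).
emb = {
    "office": [1.0, 0.1], "salary": [0.9, 0.2],
    "children": [0.1, 1.0], "wedding": [0.2, 0.9],
    "man": [1.0, 0.0], "boy": [0.9, 0.1],
    "woman": [0.0, 1.0], "girl": [0.1, 0.9],
}
career, family = ["office", "salary"], ["children", "wedding"]
male, female = ["man", "boy"], ["woman", "girl"]
effect = weat_effect_size(career, family, male, female, emb)
```

A positive `effect` means the first target set sits closer to the first attribute set than the second target set does; swapping the target sets flips the sign.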

Pitfall #2 What if your content words hang out with your function words and make weird artefacts?

Crosslinguistic word embeddings Work with Oguz Serbetci (not pictured)

Data
Corpus: OpenSubtitles
- ~5.5K movies with subtitles in 4 languages (2.6-2.9m words):
  - German: grammatical gender
  - Spanish: grammatical gender
  - Dutch: grammatical gender orthogonal to natural gender
  - English: natural gender
- Lemmatized each corpus to remove gender
- Trained 10 word2vec CBOW embeddings per condition: Language (4) x Corpus version (2: unprocessed vs. lemmatized)
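The experimental grid on this slide can be enumerated as follows. This is a sketch of the design only: the list names are hypothetical, and using the run index as a random seed is an assumption, not something stated in the talk.

```python
from itertools import product

LANGUAGES = ["German", "Spanish", "Dutch", "English"]
CORPUS_VERSIONS = ["unprocessed", "lemmatized"]
RUNS_PER_CONDITION = 10  # 10 embeddings trained per condition

# One condition per (language, corpus version) pair.
conditions = list(product(LANGUAGES, CORPUS_VERSIONS))

# Assumption: each repeated run is distinguished by its index, used here as a seed.
runs = [(lang, version, seed)
        for lang, version in conditions
        for seed in range(RUNS_PER_CONDITION)]
```

That gives 4 x 2 = 8 conditions and 80 trained embeddings in total.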

Method
Measurement: differential association using the Word Embedding Association Test (WEAT, Caliskan et al.) over the word sets {career}, {family}, {male}, {female}
Comparisons:
- Topical semantic gender bias: replicate IAT findings of Caliskan et al. on dimension male:career::female:family
- Grammatical gender bias: use stimuli from Phillips & Boroditsky on dimension male:masculine::female:feminine, e.g. Spanish el sol (m), German die Sonne (f)

[Figure: Topical gender bias, average increase in cosine similarity per word]

[Figure panels: Topical gender bias; Grammatical gender bias]

Words can keep strange company! And arbitrary properties like grammatical gender can distort your embeddings.

Thanks! Questions?

References
Bolukbasi, T., Chang, K.-W., Zou, J., Saligrama, V., & Kalai, A. (2016). Quantifying and reducing stereotypes in word embeddings. arXiv preprint arXiv:1606.06121.
Caliskan-Islam, A., Bryson, J. J., & Narayanan, A. (2016). Semantics derived automatically from language corpora necessarily contain human biases. arXiv preprint arXiv:1608.07187.
Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183–186.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493–2537.
Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. In Studies in linguistic analysis, 1–32. Oxford: Blackwell.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. (1998). Measuring individual differences in implicit cognition: the implicit association test. Journal of Personality and Social Psychology, 74(6), 1464.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111–3119).

Appendix

Interaction between topical and grammatical gender effects in DE + ES

Stereotypes in word embeddings: Bolukbasi et al. 2016
1. Define gender subspace
2. Project profession names onto subspace
3. Generate analogies & get stereotype ratings from MTurk, e.g. addiction : eating disorder, accountant : paralegal, pilot : flight attendant, athlete : gymnast, professor emeritus : associate professor
4. Compute transformation matrix to debias designated words
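Steps 1, 2 and 4 can be illustrated with a deliberately simplified sketch. It is not the procedure from Bolukbasi et al.: the gender "subspace" is reduced to a single direction (the average of normalized difference vectors rather than a PCA), debiasing is done by simply removing a word's component along that direction rather than by a learned transformation matrix, and the 3-D vectors are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def gender_direction(pairs, emb):
    # Step 1 (simplified): average the normalized difference vectors of
    # definitional pairs such as (he, she) to get one gender direction.
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for male, female in pairs:
        diff = normalize([a - b for a, b in zip(emb[male], emb[female])])
        total = [t + d for t, d in zip(total, diff)]
    return normalize(total)

def project(word, g, emb):
    # Step 2: signed projection of the (normalized) word vector onto the direction.
    return dot(normalize(emb[word]), g)

def neutralize(word, g, emb):
    # Step 4 (simplified): remove the component of the word vector along g.
    v = emb[word]
    component = dot(v, g)
    return [a - component * b for a, b in zip(v, g)]

# Toy 3-D "embeddings" (hypothetical values for illustration only).
emb = {
    "he": [1.0, 0.0, 0.2], "she": [-1.0, 0.0, 0.2],
    "man": [0.9, 0.1, 0.1], "woman": [-0.9, 0.1, 0.1],
    "pilot": [0.4, 0.8, 0.1],
}
g = gender_direction([("he", "she"), ("man", "woman")], emb)
pilot_bias = project("pilot", g, emb)
pilot_debiased = neutralize("pilot", g, emb)
```

After neutralizing, the profession word has no remaining component along the gender direction, while its other coordinates are untouched.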