TTIC 31210: Advanced Natural Language Processing. Kevin Gimpel. Spring 2017. Lecture 3: Word Embeddings

TTIC 31210: Advanced Natural Language Processing Kevin Gimpel Spring 2017 Lecture 3: Word Embeddings 1

Assignment 1 Assignment 1 due tonight 2

Roadmap: review of TTIC 31190 (week 1); deep learning for NLP (weeks 2-4); generative models & Bayesian inference (week 5); Bayesian nonparametrics in NLP (week 6); EM for unsupervised NLP (week 7); syntax/semantics and structure prediction (weeks 8-9); applications (week 10) 3

Neural Similarity Modeling: Siamese networks (Bromley et al., 1993): two identical networks with shared parameters; at the end, similarity is computed between the two representations 4

Similarity Functions: many choices for similarity functions; we talked about some during Lecture 2 5
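The specific similarity functions from Lecture 2 are not reproduced in this transcript; as a minimal sketch, two common choices (dot product and cosine similarity) over numpy representation vectors:

    import numpy as np

    def dot_similarity(u, v):
        # unnormalized dot product between two representation vectors
        return np.dot(u, v)

    def cosine_similarity(u, v, eps=1e-8):
        # dot product normalized by the vector lengths; lies in [-1, 1]
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)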

Learning for Similarity: We want to learn the input representation function as well as any parameters of the similarity function. We'll just write all these parameters as θ. How about this loss? (loss A on your handout) Any potential problems with this? 6
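Loss A itself is only on the handout; as a hypothetical reconstruction, writing θ for all parameters, g_θ for the representation function, sim for the similarity function, and P for the set of positive (similar) pairs, a loss of this kind would simply maximize similarity over positive pairs:

    \ell_A(\theta) = -\sum_{(u, v) \in \mathcal{P}} \mathrm{sim}\big(g_\theta(u), g_\theta(v)\big)

One potential problem with a loss like this: nothing penalizes similarity between non-pairs, so a degenerate g_θ that maps every input to the same vector does very well, which is presumably what motivates the contrastive losses on the next slides.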

(Better) Learning for Similarity. Contrastive hinge loss (loss B on handout): is a negative example. Any potential problems with this? 7

(Better) Learning for Similarity. Large-margin contrastive hinge loss: is the margin 8

(Better) Learning for Similarity. Large-margin contrastive hinge loss: How should we choose negative examples? 9

(Better) Learning for Similarity. Large-margin contrastive hinge loss: How should we choose negative examples? random: just pick v randomly from the data; max: pick the v that maximizes the similarity under the current model (the hardest negative); many other ways depending on the problem 10
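The contrastive and large-margin losses themselves are also on the handout; a standard form of the large-margin contrastive hinge loss, using the assumed notation v' for a negative example and δ for the margin, is:

    \ell(\theta) = \sum_{(u, v)} \max\big(0,\ \delta - \mathrm{sim}(g_\theta(u), g_\theta(v)) + \mathrm{sim}(g_\theta(u), g_\theta(v'))\big)

Under the "random" strategy, v' is drawn uniformly from the data; under the "max" strategy, v' = argmax over candidates v'' ≠ v of sim(g_θ(u), g_θ(v'')), i.e., the hardest negative under the current model.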

Aside: 11

Recurrent Neural Networks hidden vector 12

Recurrent Neural Networks: Multiplicative Integration Recurrent Neural Networks 13

RNN MI-RNN 16
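Slides 12-16 show these models only as figures; the standard update equations, following the usual formulation of multiplicative integration (e.g., Wu et al., 2016), are sketched below, with W and U the input and recurrent weight matrices, b a bias, and ⊙ the elementwise product:

    vanilla RNN:  h_t = \tanh(W x_t + U h_{t-1} + b)
    MI-RNN:       h_t = \tanh(W x_t \odot U h_{t-1} + b)

The more general MI form adds gating vectors: h_t = tanh(α ⊙ W x_t ⊙ U h_{t-1} + β_1 ⊙ U h_{t-1} + β_2 ⊙ W x_t + b).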

Word Embeddings Turian et al. (2010) 17

A Neural Probabilistic Language Model. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Jauvin. Journal of Machine Learning Research 3 (2003) 1137-1155. idea: use a neural network for n-gram language modeling 18

A Neural Probabilistic Language Model (Bengio, Ducharme, Vincent, Jauvin; JMLR 2003). this is not the earliest paper on using neural networks for n-gram language models, but it's the most well-known and the first to scale up; see the paper for citations of earlier work 19

Neural Probabilistic Language Models (Bengio et al., 2003). 1.1 Fighting the Curse of Dimensionality with Distributed Representations: In a nutshell, the idea of the proposed approach can be summarized as follows: 1. associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in R^m), 2. express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence, and 3. learn simultaneously the word feature vectors and the parameters of that probability function. 20

Model (Bengio et al., 2003): figure of the feed-forward architecture. i-th output = P(w_t = i | context); softmax output layer (most computation here); tanh hidden layer; word feature vectors C(w_{t-n+1}), ..., C(w_{t-2}), C(w_{t-1}) obtained by table look-up in the matrix C, with parameters shared across words; inputs are the indices of w_{t-n+1}, ..., w_{t-2}, w_{t-1}. 21
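In equations, the model in the figure computes the following (as described in the paper), where x is the concatenation of the input word feature vectors and W = 0 when there are no direct connections from word features to outputs:

    x = [C(w_{t-1}); C(w_{t-2}); \ldots; C(w_{t-n+1})]
    y = b + W x + U \tanh(d + H x)
    P(w_t = i \mid w_{t-1}, \ldots, w_{t-n+1}) = \frac{e^{y_i}}{\sum_j e^{y_j}}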

Bengio et al. (2003) Experiments: they minimized log loss of the next word conditioned on a fixed number of previous words; no RNNs here, just a feed-forward network; ~800k training tokens, vocab size of 17k; they trained for 5 epochs, which took 3 weeks on 40 CPUs! 22
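To connect this objective to the perplexities in the tables that follow: the log loss is the average negative log-likelihood of the next word, and perplexity is its exponential, so lower is better in both cases:

    L(\theta) = -\frac{1}{T} \sum_{t} \log P_\theta(w_t \mid w_{t-n+1}, \ldots, w_{t-1}), \qquad \mathrm{perplexity} = e^{L(\theta)}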

Experiments (Bengio et al., 2003)

         n   c   h    m   direct  mix  train.  valid.  test.
MLP1     5   -   50   60  yes     no   182     284     268
MLP2     5   -   50   60  yes     yes  -       275     257
MLP3     5   -   0    60  yes     no   201     327     310
MLP4     5   -   0    60  yes     yes  -       286     272
MLP5     5   -   50   30  yes     no   209     296     279
MLP6     5   -   50   30  yes     yes  -       273     259
MLP7     3   -   50   30  yes     no   210     309     293
MLP8     3   -   50   30  yes     yes  -       284     270
MLP9     5   -   100  30  no      no   175     280     276
MLP10    5   -   100  30  no      yes  -       265     252

n: order of the model. c: number of word classes in class-based n-grams. h: number of hidden units. m: number of word features for MLPs, number of classes for class-based n-grams. direct: whether there are direct connections from word features to outputs. mix: whether the output probabilities of the neural network are mixed with the output of the trigram (with a weight of 0.5 on each). The last three columns give perplexity on the training, validation and test sets. 23

Experiments (Bengio et al., 2003), same table as above. Observations: hidden layer (h > 0) helps; interpolating with the n-gram model ("mix") helps; using higher word embedding dimensionality helps; 5-gram model better than trigram 24

Experiments: n-gram baselines (MLP rows as in the table above)

                       n   c     train.  valid.  test.
Del. Int.              3   -     31      352     336
Kneser-Ney back-off    3   -     -       334     323
Kneser-Ney back-off    4   -     -       332     321
Kneser-Ney back-off    5   -     -       332     321
class-based back-off   3   150   -       348     334
class-based back-off   3   200   -       354     340
class-based back-off   3   500   -       326     312
class-based back-off   3   1000  -       335     319
25

Bengio et al. (2003): they discuss how the word embedding space might be interesting to examine, but they don't do this; they suggest that a good way to visualize/interpret word embeddings would be to use 2 dimensions :) ; they discussed handling polysemous words, unknown words, inference speed-ups, etc. 26

Collobert et al. (2011): Natural Language Processing (Almost) from Scratch. Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa. Journal of Machine Learning Research 12 (2011) 2493-2537. 27

(Figure: window approach network.) Input window centered on the word of interest, e.g., "cat sat on the mat"; per-word features 1..K; lookup tables LT_{W^1}, ..., LT_{W^K} producing d-dimensional vectors; concatenation; Linear layer M^1 with n^1_hu hidden units; HardTanh; Linear layer M^2 with n^2_hu = #tags outputs. 28
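A minimal numpy sketch of the window architecture in the figure, assuming a single feature per word (its identity); all names and dimensions here are hypothetical, not taken from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, d, win, n_hidden, n_tags = 10000, 50, 5, 300, 45  # hypothetical sizes

    LT = rng.normal(scale=0.1, size=(vocab_size, d))      # lookup table of word embeddings
    M1 = rng.normal(scale=0.1, size=(n_hidden, win * d))  # first linear layer
    b1 = np.zeros(n_hidden)
    M2 = rng.normal(scale=0.1, size=(n_tags, n_hidden))   # second linear layer: one score per tag
    b2 = np.zeros(n_tags)

    def score_window(word_ids):
        # word_ids: indices of the `win` words in the window, centered on the word of interest
        x = LT[word_ids].reshape(-1)          # look up and concatenate embeddings
        h = np.clip(M1 @ x + b1, -1.0, 1.0)   # HardTanh
        return M2 @ h + b2                    # one score per output tag

    scores = score_window(np.array([4, 17, 256, 3, 99]))  # e.g., ids for "cat sat on the mat"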

Collobert et al. Pairwise Ranking Loss: X is the training set of 11-word windows, D is the vocabulary. What is going on here? (loss C on handout) 29

Collobert et al. Pairwise Ranking Loss: X is the training set of 11-word windows, D is the vocabulary. What is going on here? Make the actual text window have a higher score than all windows with the center word replaced by w. 30
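Written out, the ranking criterion used in Collobert et al. (2011), with f_θ(x) the network's score for window x and x^{(w)} the window with its center word replaced by w, is:

    \theta \mapsto \sum_{x \in \mathcal{X}} \sum_{w \in \mathcal{D}} \max\big(0,\ 1 - f_\theta(x) + f_\theta(x^{(w)})\big)

Minimizing this pushes every true window to score at least 1 higher than every corrupted version of it.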

Collobert et al. Pairwise Ranking Loss: X is the training set of 11-word windows, D is the vocabulary. This still sums over the entire vocabulary, so it should be as slow as log loss. Why can it be faster? when using SGD, summation → sample 31
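A minimal sketch of the SGD trick on this slide: instead of summing the hinge term over all of D, each update samples a single corrupted center word. The helpers score(theta, window) and grad(theta, window) are assumed (hypothetical), with theta and the returned gradients as numpy arrays:

    import random

    def sgd_step(theta, windows, vocab, score, grad, lr=0.01):
        x = random.choice(windows)               # an 11-word training window (list of word ids)
        w = random.choice(vocab)                 # a single sampled replacement word
        mid = len(x) // 2
        x_corrupt = x[:mid] + [w] + x[mid + 1:]  # replace the center word
        # hinge: the true window should score at least 1 higher than the corrupted one
        if 1.0 - score(theta, x) + score(theta, x_corrupt) > 0:
            theta = theta - lr * (grad(theta, x_corrupt) - grad(theta, x))
        return theta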