Deep Learning for Natural Language Processing! (1/2)

Deep Learning for Natural Language Processing! (1/2) Alexis Conneau, PhD student @ Facebook AI Research, Master MVA, 2018 1

Introduction Applications Sentence classification Sentiment analysis Answer selection 2

Introduction Applications Machine translation French English 3

Introduction Applications Image captioning Making Facebook visual content accessible to visually impaired 4

Introduction Motivations for this course Need for scientists who can deal with text data Deep Learning has changed Computer Vision but also NLP Deep Learning for NLP is a very active field of Research 5

Introduction Motivations for this course Text data at Facebook: some numbers Facebook: 1.2 billion daily active users 510,000 comments per second 283,000 status updates per second 6

Introduction Motivations for this course Text data at Facebook: some numbers Messenger and WhatsApp: 60 billion messages a day, 3 times more than SMS More than 30,000 bots created on the Messenger bot platform 7

Introduction Motivations for this course Text data at Facebook: some numbers More than 17 billion photos sent per month on Messenger Messages appear in contexts (conversations, captions) 8

Introduction Motivations for this course Text data at Facebook: some challenges Informal language: handle spelling mistakes/sms language Text classification: provide relevant content to FB users Machine Translation: connect people all around the world Image captioning: give blind people access to FB content Chatbot: Messenger conversational agents for companies Messenger bot Wit.ai 9

Overview What you will learn in this class Class 1 Overview of some classical NLP tasks Word2vec: word embeddings Bag-of-words representations Class 2 Recurrent Neural Networks (RNNs, LSTMs) Language Modelling/Generation Encoders and decoders 10

Outline 01 Overview of some classical NLP tasks 02 Word2vec: word embeddings 03 Bag-of-words representations 11

NLP tasks What is NLP? Natural Language Processing (NLP) can be defined as the automatic processing of human language. Wikipedia's definition: Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. 12

NLP tasks Overview of some classical NLP tasks Understanding a sentence Please, could you order a quarter pounder with cheese and send it to my place, 6 rue Ménars in Paris. Tokenization: can't -> can 't / place, -> place , / Paris. -> Paris . POS tagging: assign a part-of-speech (noun, verb etc.) to each word Parsing: generate the parse tree (grammar structure) of a sentence. NER: named entity (person, location) recognition SRL: semantic role labelling, who did what to whom? 13

NLP tasks Overview of some classical NLP tasks Tokenization Tokenization simply means that spaces have to be inserted between (e.g.) words and punctuation marks. Stanford tokenizer: you don't -> you do n't MOSES tokenizer: you don't -> you don 't 14
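For instance, NLTK's default word tokenizer follows the same Penn-Treebank-style convention as the Stanford tokenizer. A minimal sketch, assuming NLTK is installed and its sentence-splitting model has been downloaded (the resource name may vary across NLTK versions):

```python
# Minimal tokenization sketch with NLTK (Penn Treebank convention: clitics are split off).
import nltk
nltk.download("punkt", quiet=True)  # sentence-splitting model used by word_tokenize

from nltk.tokenize import word_tokenize

print(word_tokenize("You don't say!"))
# -> ['You', 'do', "n't", 'say', '!']
```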

NLP tasks Overview of some classical NLP tasks Part-of-speech (POS) tagging POS tags are categories of words that have similar grammatical properties List of POS tags 15

NLP tasks Overview of some classical NLP tasks Part-of-speech (POS) tagging Goal: assign the correct POS tag to each word Assigning most common tag to each word: ~90% accuracy HMM (2000): 96.5% accuracy (PTB) BiLSTM + CRF (2015): 97.6% accuracy (PTB) 16
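As a hedged illustration of the task (not one of the HMM or BiLSTM+CRF models cited above), NLTK ships an averaged-perceptron tagger that outputs Penn Treebank tags; the model download name may differ between NLTK versions:

```python
# Minimal POS-tagging sketch with NLTK's averaged-perceptron tagger.
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # pretrained tagger model

tokens = ["The", "cat", "is", "chasing", "the", "dog"]
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('chasing', 'VBG'), ('the', 'DT'), ('dog', 'NN')]
```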

NLP tasks Overview of some classical NLP tasks Parsing Berkeley parser * Stanford parser * 17

NLP tasks Overview of some classical NLP tasks Named Entity Recognition (NER) NER: classify named entities into pre-defined categories! (e.g. names of persons, organizations, locations etc) 18

NLP tasks Overview of some classical NLP tasks Semantic Role Labeling (SRL): Who did what to whom? SRL: Assign roles (agent, predicate, theme) to the constituents in sentences List of SRL roles 19

NLP tasks Overview of some classical NLP tasks These tasks are important steps towards making sense of the meaning of a sentence Most of them are not useful by themselves, but they help to solve higher-level tasks (e.g. simple chatbots) 20

Word2vec Deep Learning for NLP What is an embedding? Instead of assigning handcrafted roles to words, can we learn (continuous) representations of words or sentences directly from data? Deep Learning is about learning representations, as opposed to handcrafted features. 21

Outline 01 Overview of some classical NLP tasks 02 Word2vec: word embeddings 03 Bag-of-words representations 22

Word2vec Word2vec: word embeddings What is an embedding? Embeddings are continuous vectors that represent objects Image embeddings.. word embeddings.. sentence embeddings In the embedding space, semantically similar objects are close (dot-product) 23

Word2vec Word2vec: word embeddings What is an embedding? Embeddings can be learned with neural networks They are the final (trained) parameters of a neural network This neural network has to be trained to solve a particular task (but which one?) 24

Word2vec What is an embedding? Example of image embeddings 1) Train your ConvNet on a large supervised image-classification task (ImageNet) 2) Encode your image with the ConvNet -> image embedding 25

Word2vec What is an embedding? Why is it useful? take your image embedding of a cat.. compute its nearest neighbors new classification task?.. image embeddings = image features.. 26

Word2vec Word2vec: word embeddings Word2vec: unsupervised word embeddings Now.. we can also obtain embeddings for words.. sentences.. documents Let s start with words! 27

Word2vec Word2vec: word embeddings Word2vec: unsupervised word embeddings Word2vec* is a fast C++ tool to obtain word embeddings from an unannotated corpus of text * Mikolov et al. (NIPS 2013) Distributed Representations of Words and Phrases and their Compositionality 28

Word2vec Word2vec: word embeddings "You shall know a word by the company it keeps" (Firth, J. R. 1957) Meaning of love seen by a computer wife if the one I <love> will marry me. O graph to anybody. I <love> my husband and he creates this superb <love> story, bringing it g and responding in <love> at the heart of th o bombard Paul with <love> letters. She wrote ce, feeling all the <love> she feels, remembe rning for a foolish <love> she'd allowed to s ying to balance the <love> and the hate in th w why they say that <love> is blind I was a w, and knowledge of <love> which awakens joy. 29

Word2vec Word2vec: word embeddings Word2vec: unsupervised word embeddings Word2vec consists of two models: CBOW: predict center words based on surrounding words SkipGram: predict surrounding words based on center words These tasks of predicting words are just means to an end The end goal is to learn embeddings of words 30
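As a hedged illustration (not the original C++ tool), the gensim library exposes the same CBOW/SkipGram models; the toy corpus and hyperparameters below are only there to show the API, assuming gensim >= 4:

```python
# Minimal sketch: training SkipGram word embeddings with gensim (sg=1; sg=0 would be CBOW).
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "my", "husband"],
    ["she", "wrote", "love", "letters"],
    ["he", "loves", "football"],
]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

vec_love = model.wv["love"]            # the learned embedding of "love"
print(model.wv.most_similar("love"))   # nearest neighbors in the embedding space
```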

Word2vec Word2vec: word embeddings Word2vec: unsupervised word embeddings These tasks of predicting words are just means to an end The end goal is to learn embeddings of words word embedding space 31

Word2vec Word2vec: word embeddings Word2vec: SkipGram model The goal is to predict «feeling» (a surrounding word) from «love». 32

Word2vec Word2vec: word embeddings Word2vec: SkipGram model The «lookup table» transforms «love» into a word vector (=its embedding) 33

Word2vec Word2vec: word embeddings Word2vec: SkipGram model The embedding is sent to a classifier that outputs a vector of size V (=number of words) 34

Word2vec Word2vec: word embeddings Word2vec: SkipGram model softmax(u)_i = e^{u_i} / Σ_{k=1}^{V} e^{u_k} The softmax function transforms the output of the classifier into a probability vector 35

Word2vec Word2vec: word embeddings Word2vec: SkipGram model softmax(u)_i = e^{u_i} / Σ_{k=1}^{V} e^{u_k} The probability assigned to «feeling» is compared to (0,0,0,..,1,..,0,0,0) 36

Word2vec Word2vec: word embeddings Word2vec: SkipGram model softmax(u)_i = e^{u_i} / Σ_{k=1}^{V} e^{u_k} The parameters are trained using SGD and backpropagation 37
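To make these slides concrete, here is a toy numpy sketch of one SkipGram training step (lookup table -> classifier -> softmax -> SGD). It only illustrates the shapes and gradients; real word2vec uses negative sampling or hierarchical softmax for speed, and all sizes and indices below are made up:

```python
# Toy SkipGram step: predict a context word ("feeling") from a center word ("love").
import numpy as np

V, d = 10, 4                            # vocabulary size, embedding dimension
W_in = 0.01 * np.random.randn(V, d)     # "lookup table" of word embeddings
W_out = 0.01 * np.random.randn(d, V)    # output classifier

center, context = 3, 7                  # hypothetical indices of "love" and "feeling"

h = W_in[center]                        # embedding of the center word
u = h @ W_out                           # scores over the vocabulary (size V)
p = np.exp(u - u.max()); p /= p.sum()   # softmax(u)_i = e^{u_i} / sum_k e^{u_k}

loss = -np.log(p[context])              # cross-entropy vs the one-hot target (0,..,1,..,0)
grad_u = p.copy(); grad_u[context] -= 1 # dLoss/du
grad_h = W_out @ grad_u                 # dLoss/dh
W_out -= 0.1 * np.outer(h, grad_u)      # one SGD step on the output weights
W_in[center] -= 0.1 * grad_h            # and on the center-word embedding
```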

Word2vec Word2vec: word embeddings Overview UNSUPERVISED Note: word2vec does not require human annotation 38

Word2vec Word2vec: word embeddings Overview Note: word2vec can encode unigrams and bigrams 39

Can the computer know the meaning of love? Word2vec: word embeddings Word similarity [figure: the vectors v_love, v_affection and v_football in the embedding space] 40

Can the computer know the meaning of love? Word2vec: word embeddings Word similarity [figure: the same vectors with the angles α and β between them] 41

Can the computer know the meaning of love? Word2vec: word embeddings Word similarity cos α = (v_love · v_football) / (‖v_love‖ ‖v_football‖) 42
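A small numpy sketch of this cosine similarity; the three 3-dimensional vectors are made up purely to illustrate the computation:

```python
# Cosine similarity between word vectors: close to 1 for semantically similar words.
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# hypothetical toy embeddings
v_love = np.array([0.9, 0.1, 0.2])
v_affection = np.array([0.8, 0.2, 0.1])
v_football = np.array([0.1, 0.9, 0.7])

print(cosine(v_love, v_affection))  # ~0.99: semantically similar
print(cosine(v_love, v_football))   # ~0.30: less similar
```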

Word2vec Word2vec: word embeddings Word analogy vec(queen) ≈ vec(woman) + (vec(king) − vec(man)) [figure: Man -> Woman and King -> Queen offsets in the embedding space] 43

FastText FastText: word embeddings Adding character-level information https://github.com/facebookresearch/fasttext «FastText»: word embeddings are sums of char-n-gram embeddings v_love = v_lov + v_ove v_loving = v_lov + v_ovi + v_vin + v_ing v_loviiing = v_lov + v_ovi + v_vii + v_iii + v_iin + v_ing * Bojanowski & Grave et al. (TACL 2017) Enriching Word Vectors with Subword Information 44
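A sketch of how such character n-grams could be extracted (here only n = 3, with the word-boundary markers "<" and ">" used in the paper); FastText itself combines several n-gram lengths, hashes them into buckets, and also keeps a vector for the word itself:

```python
# Character n-gram extraction: the FastText word vector is the sum of the embeddings
# of these n-grams (plus the word itself).
def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("love"))    # ['<lo', 'lov', 'ove', 've>']
print(char_ngrams("loving"))  # ['<lo', 'lov', 'ovi', 'vin', 'ing', 'ng>']
```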

Word2vec Multilingual word embeddings Aligning monolingual word embedding spaces pretrained monolingual word embedding spaces 45

Word2vec Multilingual word embeddings Aligning monolingual word embedding spaces W LINEAR MAPPING pretrained monolingual word embedding spaces aligned word embedding spaces * Mikolov et al. (2013) Exploiting Similarities among Languages for Machine Translation 46

Word2vec Multilingual word embeddings Aligning monolingual word embedding spaces https://github.com/facebookresearch/muse W LINEAR MAPPING pretrained monolingual word embedding spaces aligned word embedding spaces W* = argmin_{W ∈ O_d(ℝ)} ‖WX − Y‖_F = UV^T, with UΣV^T = SVD(YX^T) 47
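A numpy sketch of this Procrustes solution, assuming X and Y hold paired source/target word vectors as rows (random stand-ins below; in practice they would come from a seed bilingual dictionary):

```python
# Orthogonal Procrustes: the best rotation W mapping source vectors onto target vectors
# is W = U V^T, where U S V^T is the SVD of Y X^T.
import numpy as np

d, n = 300, 5000                   # embedding dimension, size of the seed dictionary
X = np.random.randn(n, d)          # source-language vectors (one word per row)
Y = np.random.randn(n, d)          # target-language vectors (aligned row by row)

U, _, Vt = np.linalg.svd(Y.T @ X)  # SVD of Y X^T in the slide's column convention
W = U @ Vt                         # orthogonal linear mapping

aligned = X @ W.T                  # source vectors mapped into the target space
```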

Outline 01 Overview of some classical NLP tasks 02 Word2vec: word embeddings 03 Bag-of-words representations 48

BoW Bag of words representations bag-of-words Now, all of this is very nice, but how can it be useful? We can use word embeddings to embed larger chunks of text. 49

BoW Bag of words representations Background: TF-IDF Set of documents: d_1, d_2, ..., d_n Set of labels: y_1, y_2, ..., y_n, with ∀i, y_i ∈ {1, ..., C} How do we get features for documents of text? 50

BoW Bag of words representations Document-term matrix Document-term (sparse) matrix (size: n x V): each row is a document d1, ..., dn (a row ~ a document embedding), each column is a vocabulary word such as obama, the, cat, ..., Alabama, New_York (a column ~ a word embedding), and entry (i, j) counts how many times word j appears in document i (e.g. row d1 = 0 4 2 0 0 0 0). 51

BoW Bag of words representations Term Frequency Inverse Document Frequency (TF-IDF) Words that appear only in a few documents contain more discriminative information Example: if Obama appears in 10 documents out of 10000, these documents will likely be related to politics. tf-idf_{i,j} = tf_{i,j} × idf_j, where tf_{i,j} is the number of times term j appears in document i and idf_j = log(|D| / |{d_i : t_j ∈ d_i}|), with |D| the total number of documents and |{d_i : t_j ∈ d_i}| the number of documents where term j appears. This gives a new matrix of tf-idf weights. 52
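A small numpy sketch of this weighting on a toy corpus (scikit-learn's TfidfVectorizer implements a smoothed variant of the same idea):

```python
# tf-idf on a toy tokenized corpus: tfidf[i, j] = tf[i, j] * log(|D| / df[j]).
import numpy as np

docs = [["obama", "the", "senate"], ["the", "cat", "sat"], ["obama", "the", "the"]]
vocab = sorted({w for d in docs for w in d})

tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
df = (tf > 0).sum(axis=0)            # number of documents containing each term
idf = np.log(len(docs) / df)
tfidf = tf * idf                     # shape: (n documents, vocabulary size V)
```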

BoW Bag of words representations TF-IDF matrix Same layout as the document-term matrix, but with raw counts replaced by tf-idf weights (e.g. row d1 = 0 0.02 0.23 0 0 0 0): rows ~ document embeddings, columns ~ word embeddings. TF-IDF (sparse) matrix (size: n x V) 53

BoW Bag of words representations Latent Semantic Analysis (LSA) DOCUMENT CLASSIFICATION - Latent Semantic Analysis (LSA) 1. Create the TF-IDF matrix (#documents, #words) 2. Perform PCA to reduce the dimension (#documents, p) 3. Learn a classifier (Logistic Regression, SVM, Random Forest, MLP) 54
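A hedged scikit-learn sketch of this three-step pipeline, with TruncatedSVD playing the role of the PCA/dimension-reduction step; `texts` and `labels` are assumed to be the raw documents and their classes:

```python
# LSA document classification: tf-idf matrix -> truncated SVD -> linear classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

lsa_clf = make_pipeline(
    TfidfVectorizer(),                   # 1. (n documents, vocabulary) tf-idf matrix
    TruncatedSVD(n_components=100),      # 2. reduce to p = 100 latent dimensions
    LogisticRegression(max_iter=1000),   # 3. classifier on the LSA features
)
# lsa_clf.fit(texts, labels)
# lsa_clf.predict(["a new document about politics"])
```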

BoW (Continuous) Bag of words representations Transfer Learning pretrained word vectors LSA requires many documents to get decent representations, and there is little modelling of the interaction between words: cat, dog and pet have separate columns 55

BoW Continuous Bag of words representations Transfer Learning pretrained word vectors DOCUMENT CLASSIFICATION - Continuous Bag-of-Words 1. Learn word embeddings on a huge unsupervised corpus (e.g. Wikipedia) 2. Embed documents using the (weighted) average of word embeddings 3. Learn a classifier (Logistic Regression, SVM, Random Forest, MLP) 56
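A numpy sketch of step 2, the (weighted) average of pretrained word embeddings; `word_vectors` is assumed to be a word-to-vector mapping loaded from e.g. fastText or word2vec:

```python
# Continuous bag-of-words document embedding: (weighted) average of word vectors.
import numpy as np

def embed_document(tokens, word_vectors, weights=None):
    vecs, ws = [], []
    for i, tok in enumerate(tokens):
        if tok in word_vectors:                     # skip out-of-vocabulary tokens
            vecs.append(word_vectors[tok])
            ws.append(1.0 if weights is None else weights[i])
    if not vecs:
        return None
    return np.average(np.stack(vecs), axis=0, weights=ws)

# doc_vec = embed_document("the cat is chasing the dog".split(), word_vectors)
# A linear classifier (logistic regression, SVM) is then trained on these document vectors.
```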

BoW Continuous Bag of words representations Transfer Learning - pretrained word vectors In high dimension, the average of word vectors is a vector that stays close to all its components (the information of each word is preserved) (weighted) average of word embeddings 57

Word2vec Embeddings Nearest neighbors can also be useful for text Nearest neighbors Embed all your sentences From a query sentence, extract the most similar sentence 58

BoW (Continuous) Bag of words representations Transfer Learning pretrained word vectors Continuous bag-of-words representations: average of word vectors 59

BoW (Continuous) Bag of words representations Transfer Learning pretrained word vectors 1) Use pre-trained word embeddings 60

FastText FastText classification tool https://github.com/facebookresearch/fasttext FastText is an open-source tool that provides: a fast and easy-to-use text classification tool (based on bag-of-words) a fast algorithm to learn word embeddings (char-based word2vec) 61
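A hedged example of the supervised classifier through fastText's Python bindings (pip package `fasttext`); the training file name, label and printed scores below are illustrative, not real outputs:

```python
# FastText supervised text classification. "train.txt" is a hypothetical file with one
# example per line, e.g.: __label__POSITIVE this movie was great
import fasttext

model = fasttext.train_supervised(input="train.txt", epoch=10, wordNgrams=2)
print(model.predict("this movie was surprisingly good"))
# -> (('__label__POSITIVE',), array([0.97]))   # predicted label and probability (illustrative)
```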

Overview What you will learn in this class Class 1 Overview of some classical NLP tasks Word2vec: word embeddings Bag-of-words representations Class 2 Recurrent Neural Networks (RNNs, LSTMs) Language Modelling/Generation Encoders and decoders 62

BoW Beyond bag-of-words Bag-of-words representations are limited (word order, context, ...) The cat is chasing the dog. versus The dog is chasing the cat. 63

BoW Beyond bag-of-words Bag-of-words are limited (word order, context, ) Goal: capture more structure of input sentence Approach: sentence as a sequence of words 64

Neural Networks Next class: RNNs Three main types of neural networks: Multi-layer perceptron (MLP) Convolutional neural networks (CNNs) Recurrent Neural Networks (RNNs) handle variable-length sequences 65

Tools DL4NLP Tools for Data Science 66

Wrapping up Important tools for NLP projects Python «NLTK» package Stanford parser/tokenizer MOSES tokenizer Pre-trained English word embeddings https://fasttext.cc/docs/en/english-vectors.html -> crawl-300d-2m.vec.zip 2 million word vectors Wikipedia corpora https://sites.google.com/site/rmyeid/projects/polyglot -> Wikipedia dumps in many languages Multilingual word embeddings https://github.com/facebookresearch/muse#download 67

Thank You! 68