Deep Learning. Mohammad Ebrahim Khademi. Lecture 14: Natural Language Processing


OUTLINE
- Introduction to Natural Language Processing
- Word Vectors
- SVD Based Methods
- Iteration Based Methods
- Word2vec
- Language Models (Unigrams, Bigrams, etc.)
- Continuous Bag of Words Model (CBOW)
- Skip-Gram Model
- Negative Sampling & Hierarchical Softmax

What is so special about NLP? Human language is a system specifically constructed to convey meaning, and is not produced by a physical manifestation of any kind. It is very different from vision or any other machine learning task. Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified (an idea or thing). Natural language is a discrete/symbolic/categorical system.

Examples of tasks. The goal of NLP is to design algorithms that allow computers to "understand" natural language in order to perform some task. Easy: Spell Checking, Keyword Search, Finding Synonyms.

Examples of tasks. Medium: Parsing information from websites, documents, etc. Hard: Machine Translation (e.g. translating Chinese text to English), Semantic Analysis (what is the meaning of a query statement?), Coreference (e.g. what does "he" or "it" refer to given a document?), Question Answering (e.g. answering Jeopardy questions).

How to represent words? The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, cosine, Euclidean, etc.).
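For instance, once words live in a vector space, similarity can be measured with the cosine of the angle between their vectors. A minimal sketch (the 4-dimensional vectors and the words chosen here are made up purely for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors: u.v / (|u| |v|)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical embeddings, for illustration only.
king  = np.array([0.8, 0.1, 0.7, 0.3])
queen = np.array([0.7, 0.2, 0.8, 0.4])
apple = np.array([0.1, 0.9, 0.0, 0.2])

print(cosine_similarity(king, queen))  # close to 1: similar words
print(cosine_similarity(king, apple))  # much smaller: dissimilar words
```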

OUTLINE Introduction to Natural Language Processing Word Vectors SVD Based Methods Iteration Based Methods Word2vec Language Models (Unigrams, Bigrams, etc.) Continuous Bag of Words Model (CBOW) Skip-Gram Model Negative Sampling & Hierarchical Softmax 12/24/2017 M. E. Khademi Deep Learning (Lecture14-NLP) 8

Word Vectors. There are an estimated 13 million tokens for the English language, but are they all completely unrelated? We want to encode each word token into some vector that represents a point in some sort of "word" space. Perhaps there actually exists some N-dimensional space that is sufficient to encode all the semantics of our language.

Word Vectors. Arguably the simplest word vector is the one-hot vector: each word is represented as a vector of all 0s with a single 1 at the index of that word in the vocabulary. In this notation, |V| is the size of our vocabulary.
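A minimal sketch of this representation (the toy vocabulary is an assumption for illustration; a real vocabulary is far larger). Note that the dot product of any two distinct one-hot vectors is zero, which is exactly the lack of similarity discussed next:

```python
import numpy as np

# Toy vocabulary; in practice |V| can run into the millions.
vocab = ["aardvark", "a", "at", "king", "queen", "zebra"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """All zeros except a single 1 at the word's index in the vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("king"))                     # [0. 0. 0. 1. 0. 0.]
print(one_hot("king") @ one_hot("queen"))  # 0.0 -- no notion of similarity
```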

Word Vectors. We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. We can try to reduce the size of this space from R^{|V|} to something smaller and thus find a subspace that encodes the relationships between words.

SVD Based Methods. For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a USV^T decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.

Word-Document Matrix. As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. We build a word-document matrix X in the following manner: loop over billions of documents and, for each time word i appears in document j, add one to entry X_ij. This is obviously a very large matrix (R^{|V| x M}) and it scales with the number of documents (M).
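A minimal sketch of building such a matrix on a toy corpus (the three "documents" below are just the example sentences used in the next slide; a real X would have billions of columns):

```python
import numpy as np

documents = [
    "I enjoy flying",
    "I like NLP",
    "I like deep learning",
]
vocab = sorted({w for doc in documents for w in doc.split()})
word_to_index = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how many times word i appears in document j.
X = np.zeros((len(vocab), len(documents)))
for j, doc in enumerate(documents):
    for w in doc.split():
        X[word_to_index[w], j] += 1

print(X.shape)  # (|V|, M) = (7, 3)
```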

Window based Co-occurrence Matrix. The matrix X stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. Let our corpus contain just three sentences and the window size be 1:
1) I enjoy flying.
2) I like NLP.
3) I like deep learning.

Window based Co-occurrence Matrix. For this corpus with window size 1, X is a symmetric |V| x |V| count matrix: for example, "I" co-occurs twice with "like" and once with "enjoy", "like" co-occurs once each with "NLP" and "deep", "deep" with "learning", "enjoy" with "flying", and "learning", "NLP" and "flying" each co-occur once with the sentence-final ".".
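A minimal sketch that builds this window-based co-occurrence matrix for the three example sentences (treating the period as its own token, which is one reasonable tokenization choice):

```python
import numpy as np

# The three example sentences, tokenized with "." as its own token.
corpus = [
    ["I", "enjoy", "flying", "."],
    ["I", "like", "NLP", "."],
    ["I", "like", "deep", "learning", "."],
]
window = 1

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within `window` words of word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for pos, word in enumerate(sent):
        context = sent[max(0, pos - window):pos] + sent[pos + 1:pos + 1 + window]
        for ctx in context:
            X[idx[word], idx[ctx]] += 1

print(X[idx["I"], idx["like"]])  # 2.0
```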

Applying SVD to the co-occurrence matrix: X = U S V^T.

Reducing dimensionality: keep only the first k columns of U (those corresponding to the k largest singular values) and use them as k-dimensional word vectors; k can be chosen by the fraction of the total singular value mass the first k values capture.
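Continuing the sketch above, numpy's SVD can be applied directly to the toy matrix:

```python
import numpy as np

# X and idx come from the window co-occurrence sketch above.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                               # target dimensionality of the word vectors
word_vectors = U[:, :k]             # one k-dimensional row per vocabulary word

# Fraction of the total singular value mass captured by the first k values.
print(S[:k].sum() / S.sum())
print(word_vectors[idx["like"]])    # the 2-d embedding of "like"
```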

Problems:
- The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
- The matrix is extremely sparse since most words do not co-occur.
- The matrix is very high dimensional in general (roughly 10^6 x 10^6).
- Quadratic cost to train (i.e. to perform SVD).

Some solutions:
- Ignore function words such as "the", "he", "has", etc.
- Apply a ramp window, i.e. weight the co-occurrence count based on the distance between the words in the document.
As we see in the next section, iteration-based methods solve many of these issues in a far more elegant manner.

Iteration Based Methods - Word2vec. We can try to create a model that learns one iteration at a time and eventually is able to encode the probability of a word given its context. The idea is to design a model whose parameters are the word vectors, then train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error.
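That loop can be sketched generically as follows (the names theta, J and grad_J are placeholders, not part of the lecture; the specific objective and its gradient come from the word2vec models below):

```python
def train(theta, batches, J, grad_J, lr=0.05, epochs=10):
    """Generic iterative training loop: run the model, evaluate the error, update the parameters."""
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in batches:
            total_loss += J(theta, batch)              # run the model and evaluate the error
            theta = theta - lr * grad_J(theta, batch)  # update rule penalizing the responsible parameters
        print(f"epoch {epoch}: loss = {total_loss:.4f}")
    return theta
```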

Iteration Based Methods - Word2vec. In this class, we will present a simpler, more recent, probabilistic method by [Mikolov et al., 2013]: word2vec. Word2vec is a software package that actually includes:
- 2 algorithms: continuous bag-of-words (CBOW) and skip-gram.
- 2 training methods: negative sampling and hierarchical softmax.
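As a practical aside (not part of the slides), these same choices are exposed as parameters in off-the-shelf implementations such as gensim. A minimal sketch, assuming gensim >= 4.0, where sg switches between CBOW and skip-gram and hs/negative switch between hierarchical softmax and negative sampling:

```python
from gensim.models import Word2Vec

sentences = [
    ["I", "enjoy", "flying"],
    ["I", "like", "NLP"],
    ["I", "like", "deep", "learning"],
]

# sg=0 -> CBOW, sg=1 -> skip-gram;
# hs=1 -> hierarchical softmax, hs=0 with negative > 0 -> negative sampling.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=0, hs=0, negative=5, epochs=50)

print(model.wv["NLP"].shape)                  # (50,)
print(model.wv.similarity("NLP", "learning"))
```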

Language Models (Unigrams, Bigrams, etc.). We need to create a model that will assign a probability to a sequence of tokens, e.g. "The cat jumped over the puddle." A good language model will give this sentence a high probability because it is a completely valid sentence, syntactically and semantically. Mathematically, we can call this probability on any given sequence of n words: P(w_1, w_2, ..., w_n).

Unigram model. We can take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent: P(w_1, w_2, ..., w_n) = P(w_1) P(w_2) ... P(w_n). However, we know this is a bit ludicrous, because the next word is highly contingent upon the previous sequence of words, and a silly sentence made of individually frequent words might actually score highly.

Bigram model. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it: P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_{n-1}). Again this is certainly a bit naive, since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence.
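A minimal sketch of estimating these probabilities from raw counts (the tiny corpus and its second sentence are made-up stand-ins; a real model would use a large corpus plus smoothing for unseen n-grams):

```python
from collections import Counter

corpus = [
    ["the", "cat", "jumped", "over", "the", "puddle"],
    ["the", "dog", "jumped", "over", "the", "fence"],   # hypothetical extra sentence
]
unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
total_words = sum(unigrams.values())

def p_unigram(sentence):
    """P(w1..wn) = prod_i P(wi), assuming fully independent words."""
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / total_words
    return p

def p_bigram(sentence):
    """P(w1..wn) ~ P(w1) * prod_i count(w_{i-1}, wi) / count(w_{i-1})."""
    p = unigrams[sentence[0]] / total_words
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

sent = ["the", "cat", "jumped", "over", "the", "puddle"]
print(p_unigram(sent), p_bigram(sent))
```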

Continuous Bag of Words Model (CBOW). In this approach we treat the surrounding context words, e.g. {"The", "cat", "over", "the", "puddle"}, as the input and try to predict or generate the center word, "jumped". The model keeps an input (context) vector and an output (center) vector for every word; the context vectors are averaged, scored against every candidate center word, and passed through a softmax to produce a predicted distribution ŷ over the vocabulary, as sketched below.
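A minimal forward-pass sketch with randomly initialized parameters (the matrix names V_in and U_out, the toy vocabulary and the embedding size n = 5 are assumptions for illustration; training would adjust these parameters to minimize the loss defined next):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "cat", "jumped", "over", "puddle", "I", "like", "NLP"]
idx = {w: i for i, w in enumerate(vocab)}
V_size, n = len(vocab), 5

V_in = rng.normal(size=(n, V_size))   # input (context) word vectors, one column per word
U_out = rng.normal(size=(V_size, n))  # output (center) word vectors, one row per word

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_forward(context_words):
    """Average the context vectors, score every vocabulary word, apply softmax."""
    v_hat = np.mean([V_in[:, idx[w]] for w in context_words], axis=0)
    scores = U_out @ v_hat            # one score per candidate center word
    return softmax(scores)            # predicted distribution y_hat over the vocabulary

y_hat = cbow_forward(["the", "cat", "over", "the", "puddle"])
print(y_hat.shape, round(y_hat.sum(), 6))  # (8,) 1.0
```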


Continuous Bag of Words Model (CBOW). Here, we use a popular choice of distance/loss measure, cross entropy H(ŷ, y). The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function: H(ŷ, y) = - Σ_j y_j log(ŷ_j), summing j over the |V| words in the vocabulary. Let us concern ourselves with the case at hand, which is that y is a one-hot vector with a 1 at the index c of the true center word. Thus we know that the above loss simplifies to simply: H(ŷ, y) = - log(ŷ_c).
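Continuing the forward-pass sketch above (reusing y_hat, idx and V_size from it), the general loss and its one-hot simplification agree:

```python
import numpy as np

def cross_entropy(y_hat, y):
    """H(y_hat, y) = -sum_j y_j * log(y_hat_j)."""
    return -np.sum(y * np.log(y_hat))

# y is one-hot at the index of the true center word ("jumped" in the example),
# so the loss reduces to -log(y_hat[c]).
c = idx["jumped"]
y = np.zeros(V_size)
y[c] = 1.0

print(cross_entropy(y_hat, y))  # general formula
print(-np.log(y_hat[c]))        # identical: the one-hot simplification
```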

References
- Stanford Natural Language Processing with Deep Learning course (Lecture Notes 1)
- Stanford Natural Language Processing with Deep Learning course (Lecture Notes 2)
