Natural Language Processing with Deep Learning CS224N/Ling284


Natural Language Processing with Deep Learning CS224N/Ling284. Lecture 6: Language Models and Recurrent Neural Networks. Abigail See

Overview. Today we will: introduce a new NLP task, Language Modeling, which motivates us to introduce a new family of neural networks, Recurrent Neural Networks (RNNs). These are two of the most important ideas for the rest of the class!

Language Modeling. Language Modeling is the task of predicting what word comes next. For example: "the students opened their ____" (books? minds? laptops? exams?). More formally: given a sequence of words x^(1), x^(2), ..., x^(t), compute the probability distribution of the next word x^(t+1): P(x^(t+1) | x^(t), ..., x^(1)), where x^(t+1) can be any word in the vocabulary V. A system that does this is called a Language Model.

Language Modeling. You can also think of a Language Model as a system that assigns probability to a piece of text. For example, if we have some text x^(1), ..., x^(T), then the probability of this text (according to the Language Model) is: P(x^(1), ..., x^(T)) = P(x^(1)) × P(x^(2) | x^(1)) × ... × P(x^(T) | x^(T-1), ..., x^(1)) = ∏_{t=1}^{T} P(x^(t) | x^(t-1), ..., x^(1)). Each conditional factor is what our LM provides.

You use Language Models every day! (For example, predictive typing on your phone and search autocomplete are Language Models in action.)

n-gram Language Models. Question: how do we learn a Language Model? Answer (pre-deep learning): learn an n-gram Language Model! Definition: an n-gram is a chunk of n consecutive words. For example, in "the students opened their": unigrams: "the", "students", "opened", "their"; bigrams: "the students", "students opened", "opened their"; trigrams: "the students opened", "students opened their"; 4-grams: "the students opened their". Idea: collect statistics about how frequent different n-grams are, and use these to predict the next word, as sketched below.
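
To make the counting idea concrete, here is a minimal sketch (not from the lecture) of collecting n-gram statistics from a tokenized corpus; the corpus here is just the slide's example sentence:

    from collections import Counter

    def extract_ngrams(tokens, n):
        # Slide a window of length n over the token list.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    corpus = "as the proctor started the clock the students opened their exams".split()

    unigram_counts = Counter(extract_ngrams(corpus, 1))
    bigram_counts = Counter(extract_ngrams(corpus, 2))
    fourgram_counts = Counter(extract_ngrams(corpus, 4))

    print(bigram_counts[("students", "opened")])   # how often this bigram occurred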

n-gram Language Models. First we make a simplifying assumption: x^(t+1) depends only on the preceding n-1 words: P(x^(t+1) | x^(t), ..., x^(1)) ≈ P(x^(t+1) | x^(t), ..., x^(t-n+2)) (assumption), which equals the probability of the n-gram divided by the probability of the (n-1)-gram (definition of conditional probability). Question: how do we get these n-gram and (n-1)-gram probabilities? Answer: by counting them in some large corpus of text (a statistical approximation): ≈ count(x^(t+1), x^(t), ..., x^(t-n+2)) / count(x^(t), ..., x^(t-n+2)).

n-gram Language Models: Example. Suppose we are learning a 4-gram Language Model and want to predict the word after "as the proctor started the clock, the students opened their". With a 4-gram model we discard everything except the last 3 words and condition only on "students opened their". For example, suppose that in the corpus: "students opened their" occurred 1000 times; "students opened their books" occurred 400 times, so P(books | students opened their) = 0.4; "students opened their exams" occurred 100 times, so P(exams | students opened their) = 0.1. Should we have discarded the proctor context?
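
Those probabilities fall straight out of the counts; a minimal sketch with the slide's counts hard-coded as hypothetical values:

    # Hypothetical counts, taken from the slide's example statistics.
    count_trigram = {("students", "opened", "their"): 1000}
    count_fourgram = {
        ("students", "opened", "their", "books"): 400,
        ("students", "opened", "their", "exams"): 100,
    }

    def p_next(word, context):
        # P(word | context) = count(context + word) / count(context)
        return count_fourgram.get(context + (word,), 0) / count_trigram[context]

    context = ("students", "opened", "their")
    print(p_next("books", context))   # 0.4
    print(p_next("exams", context))   # 0.1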

Sparsity Problems with n-gram Language Models. Sparsity Problem 1: what if "students opened their w" never occurred in the data? Then w has probability 0! (Partial) solution: add a small δ to the count for every w in the vocabulary. This is called smoothing. Sparsity Problem 2: what if the context "students opened their" itself never occurred in the data? Then we can't calculate the probability for any w! (Partial) solution: just condition on "opened their" instead. This is called backoff. Note: increasing n makes sparsity problems worse; typically we can't have n bigger than 5. A sketch of both fixes follows.
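
As an illustration only (not the lecture's code), here is one way add-δ smoothing and a single backoff step could look, with hypothetical count tables:

    # Hypothetical corpus statistics (next-word counts keyed by their context).
    fourgram_counts = {
        ("students", "opened", "their"): {"books": 400, "exams": 100},
    }
    trigram_counts = {
        ("opened", "their"): {"books": 500, "exams": 200, "eyes": 300},
    }
    vocab = ["books", "exams", "eyes", "minds", "laptops"]

    def p_smoothed(word, context, counts, delta=1.0):
        # Add-delta smoothing: pretend every vocabulary word occurred delta extra times.
        context_counts = counts.get(context, {})
        total = sum(context_counts.values()) + delta * len(vocab)
        return (context_counts.get(word, 0) + delta) / total

    def p_with_backoff(word, context):
        # If the 3-word context was never seen, back off to the 2-word context.
        if context in fourgram_counts:
            return p_smoothed(word, context, fourgram_counts)
        return p_smoothed(word, context[1:], trigram_counts)

    print(p_with_backoff("books", ("students", "opened", "their")))   # uses the 4-gram stats
    print(p_with_backoff("books", ("pupils", "opened", "their")))     # unseen context: backs off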

Storage Problems with n-gram Language Models. Storage: we need to store the count for every n-gram we saw in the corpus. Increasing n or increasing the corpus increases the model size!

n-gram Language Models in practice. You can build a simple trigram Language Model over a 1.7 million word corpus (Reuters: business and financial news) in a few seconds on your laptop.* Conditioning on "today the ___" gives the probability distribution: company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039. Sparsity problem: not much granularity in the probability distribution. Otherwise, it seems reasonable! * Try for yourself: https://nlpforhackers.io/language-models/

Generating text with an n-gram Language Model. You can also use a Language Model to generate text by repeated sampling. Condition on "today the", get the probability distribution (company 0.153, bank 0.153, price 0.077, italian 0.039, emirate 0.039), and sample a word: say "price".

Generating text with an n-gram Language Model. Condition on "today the price" (for a trigram model, only the last two words matter), get the distribution (of 0.308, for 0.050, it 0.046, to 0.046, is 0.031), and sample: "of".

Generating text with an n-gram Language Model. Condition on "today the price of", get the distribution (the 0.072, 18 0.043, oil 0.043, its 0.036, gold 0.018), and sample: "gold".

Generating text with an n-gram Language Model. So far we have generated: "today the price of gold".

Generating text with an n-gram Language Model. Continuing this process yields: "today the price of gold per ton, while production of shoe lasts and shoe industry, the bank intervened just after it considered and rejected an imf demand to rebuild depleted european stocks, sept 30 end primary 76 cts a share." Surprisingly grammatical, but incoherent! We need to consider more than three words at a time if we want to model language well. But increasing n worsens the sparsity problem and increases model size. (A sketch of this sampling loop follows.)
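
For concreteness, a minimal sketch (not from the lecture) of generating text from a trigram model by repeated sampling; the probability table is a small hypothetical stand-in for counts estimated from a real corpus:

    import random

    # Hypothetical trigram model: P(next word | previous two words).
    trigram_probs = {
        ("today", "the"): {"company": 0.153, "bank": 0.153, "price": 0.077, "italian": 0.039, "emirate": 0.039},
        ("the", "price"): {"of": 0.308, "for": 0.050, "it": 0.046, "to": 0.046, "is": 0.031},
        ("price", "of"): {"the": 0.072, "oil": 0.043, "its": 0.036, "gold": 0.018},
    }

    def sample_next(context):
        dist = trigram_probs.get(context)
        if dist is None:
            return None  # unseen context: a real model would smooth or back off
        words = list(dist)
        weights = [dist[w] for w in words]
        return random.choices(words, weights=weights, k=1)[0]

    text = ["today", "the"]
    for _ in range(5):
        nxt = sample_next(tuple(text[-2:]))   # condition on the last two words
        if nxt is None:
            break
        text.append(nxt)
    print(" ".join(text))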

How to build a neural Language Model? Recall the Language Modeling task. Input: a sequence of words x^(1), x^(2), ..., x^(t). Output: the probability distribution of the next word, P(x^(t+1) | x^(t), ..., x^(1)). How about a window-based neural model? We saw this applied to Named Entity Recognition in Lecture 3: a window around "Paris" in "museums in Paris are amazing" is used to classify the center word as LOCATION.

A fixed-window neural Language Model. As before, for "as the proctor started the clock the students opened their ___" we discard the earlier context and keep only a fixed window of the last few words ("the students opened their").

A fixed-window neural Language Model. The architecture, bottom to top: words / one-hot vectors ("the students opened their"), concatenated word embeddings e = [e^(1); e^(2); e^(3); e^(4)], a hidden layer h = f(W e + b_1), and the output distribution ŷ = softmax(U h + b_2) over the vocabulary (books, laptops, ..., a, zoo).

A fixed-window neural Language Model. Improvements over an n-gram LM: no sparsity problem, and we don't need to store all observed n-grams. Remaining problems: the fixed window is too small; enlarging the window enlarges the weight matrix W; and the window can never be large enough! Also, different window positions are multiplied by completely different weights in W, so there is no symmetry in how the inputs are processed. We need a neural architecture that can process input of any length.
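
For concreteness, a minimal sketch of what such a fixed-window neural LM could look like (hypothetical sizes and names, PyTorch purely for illustration; not the lecture's code):

    import torch
    import torch.nn as nn

    class FixedWindowLM(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=100, window=4, hidden_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.hidden = nn.Linear(window * embed_dim, hidden_dim)   # W, b_1
            self.output = nn.Linear(hidden_dim, vocab_size)           # U, b_2

        def forward(self, window_ids):
            # window_ids: (batch, window) integer word indices, e.g. "the students opened their"
            e = self.embed(window_ids)                  # (batch, window, embed_dim)
            e = e.reshape(e.shape[0], -1)               # concatenate the window's embeddings
            h = torch.tanh(self.hidden(e))              # hidden layer
            return torch.softmax(self.output(h), -1)    # distribution over the next word

    model = FixedWindowLM()
    probs = model(torch.randint(0, 10000, (1, 4)))      # random word ids as a stand-in
    print(probs.shape)                                   # torch.Size([1, 10000])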

Recurrent Neural Networks (RNN): a family of neural architectures. Core idea: apply the same weights repeatedly. The architecture takes an input sequence of any length x^(1), ..., x^(t), computes a chain of hidden states h^(0), h^(1), ..., h^(t), each from the previous one using the same weights, and can produce (optional) outputs at each step.

A RNN Language Model. Bottom to top: words / one-hot vectors x^(t) ("the students opened their"), word embeddings e^(t) = E x^(t), hidden states h^(t) = σ(W_h h^(t-1) + W_e e^(t) + b_1) where h^(0) is the initial hidden state, and the output distribution ŷ^(t) = softmax(U h^(t) + b_2) over the vocabulary (books, laptops, ..., a, zoo). Note: this input sequence could be much longer, but this slide doesn't have space!

A RNN Language Model. RNN advantages: it can process input of any length; computation for step t can (in theory) use information from many steps back; the model size doesn't increase for longer input; and the same weights are applied on every timestep, so there is symmetry in how inputs are processed. RNN disadvantages: recurrent computation is slow, and in practice it is difficult to access information from many steps back. More on these later in the course. (A forward-pass sketch follows.)
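
For concreteness, a minimal sketch of such an RNN-LM forward pass (hypothetical sizes and names, PyTorch for brevity; not the lecture's code):

    import torch
    import torch.nn as nn

    class RNNLM(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)            # E
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)  # W_e, W_h, b_1 (tanh)
            self.output = nn.Linear(hidden_dim, vocab_size)             # U, b_2

        def forward(self, word_ids, h0=None):
            # word_ids: (batch, seq_len) integer word indices
            e = self.embed(word_ids)              # word embeddings e^(t)
            h, h_last = self.rnn(e, h0)           # hidden states h^(1..t), same weights every step
            logits = self.output(h)               # one score vector per step
            return logits, h_last

    model = RNNLM()
    logits, _ = model(torch.randint(0, 10000, (1, 4)))    # e.g. "the students opened their"
    probs = torch.softmax(logits[:, -1], dim=-1)           # distribution over the next word
    print(probs.shape)                                      # torch.Size([1, 10000])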

Training a RNN Language Model. Get a big corpus of text, which is a sequence of words x^(1), ..., x^(T). Feed it into the RNN-LM and compute the output distribution ŷ^(t) for every step t, i.e. predict the probability distribution of every word given the words so far. The loss function on step t is the cross-entropy between the predicted probability distribution ŷ^(t) and the true next word y^(t) = x^(t+1) (as a one-hot vector): J^(t)(θ) = −∑_{w∈V} y_w^(t) log ŷ_w^(t) = −log ŷ^(t)_{x^(t+1)}. Average this to get the overall loss for the entire training set: J(θ) = (1/T) ∑_{t=1}^{T} J^(t)(θ).

Training a RNN Language Model. Example with the corpus "the students opened their exams": at step 1 the loss J^(1) is the negative log probability the predicted distribution assigns to "students"; at step 2, to "opened"; at step 3, to "their"; at step 4, to "exams"; and so on. The overall loss is the average: J(θ) = (J^(1) + J^(2) + J^(3) + J^(4) + ...) / T.

Training a RNN Language Model. However, computing the loss and gradients across the entire corpus is too expensive! In practice, we treat x^(1), ..., x^(T) as a sentence (or a document). Recall: Stochastic Gradient Descent allows us to compute the loss and gradients for a small chunk of data and update. So we compute the loss for a sentence (actually a batch of sentences), compute gradients, update the weights, and repeat, as sketched below.
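
A minimal sketch of that training loop under stated assumptions (random stand-in data, hypothetical hyperparameters, PyTorch purely for illustration; this is not the lecture's or the assignment's code):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab_size, embed_dim, hidden_dim = 10000, 100, 200
    embed = nn.Embedding(vocab_size, embed_dim)
    rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
    out = nn.Linear(hidden_dim, vocab_size)
    params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
    optimizer = torch.optim.SGD(params, lr=0.1)

    # Hypothetical training data: each row is one sentence of word ids.
    batch = torch.randint(0, vocab_size, (32, 20))

    for step in range(100):
        inputs, targets = batch[:, :-1], batch[:, 1:]      # predict each next word
        h, _ = rnn(embed(inputs))
        logits = out(h)                                    # (batch, seq_len-1, vocab)
        loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()                                    # backpropagation through time
        optimizer.step()                                   # SGD update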

Backpropagation for RNNs. Question: what is the derivative of J^(t)(θ) w.r.t. the repeated weight matrix W_h? Answer: the gradient w.r.t. a repeated weight is the sum of the gradient w.r.t. each time it appears: ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i). Why?

Multivariable Chain Rule (simple version): if f depends on t through intermediate variables x(t) and y(t), then d/dt f(x(t), y(t)) = (∂f/∂x)(dx/dt) + (∂f/∂y)(dy/dt). Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Backpropagation for RNNs: proof sketch. In our example, view J^(t) as a function of each appearance W_h|_(1), ..., W_h|_(t) of the weight matrix, each of which equals W_h. Applying the multivariable chain rule, and using the fact that ∂(W_h|_(i))/∂W_h = 1, we get ∂J^(t)/∂W_h = ∑_{i=1}^{t} ∂J^(t)/∂W_h |_(i). Source: https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/differentiating-vector-valued-functions/a/multivariable-chain-rule-simple-version

Backpropagation for RNNs. Question: how do we calculate this in practice? Answer: backpropagate over timesteps i = t, ..., 0, summing gradients as you go. This algorithm is called "backpropagation through time". (A small check of the sum-over-appearances claim follows.)
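
To see the sum-over-appearances fact concretely, here is a small illustrative check (not from the lecture) using PyTorch autograd on a toy two-step recurrence:

    import torch

    torch.manual_seed(0)
    W = torch.randn(3, 3, requires_grad=True)
    x1, x2, h0 = torch.randn(3), torch.randn(3), torch.randn(3)

    # Tied case: the same W is used at both timesteps.
    h1 = torch.tanh(W @ h0 + x1)
    h2 = torch.tanh(W @ h1 + x2)
    h2.sum().backward()
    grad_tied = W.grad.clone()

    # Untied case: two copies with the same values, so each use's gradient is separate.
    W1 = W.detach().clone().requires_grad_(True)
    W2 = W.detach().clone().requires_grad_(True)
    h1 = torch.tanh(W1 @ h0 + x1)
    h2 = torch.tanh(W2 @ h1 + x2)
    h2.sum().backward()

    print(torch.allclose(grad_tied, W1.grad + W2.grad))   # True: sum over appearances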

Generating text with a RNN Language Model. Just like an n-gram Language Model, you can use a RNN Language Model to generate text by repeated sampling; the sampled output becomes the next step's input. For example, starting from "my", sampling step by step can produce "my favorite season is spring".
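
A minimal sketch of that sampling loop (untrained toy model, hypothetical vocabulary, PyTorch for illustration only):

    import torch
    import torch.nn as nn

    vocab = ["<s>", "my", "favorite", "season", "is", "spring", "winter"]
    embed = nn.Embedding(len(vocab), 16)
    rnn = nn.RNNCell(16, 32)
    out = nn.Linear(32, len(vocab))

    def generate(max_len=5):
        h = torch.zeros(1, 32)
        word = torch.tensor([vocab.index("my")])
        generated = ["my"]
        for _ in range(max_len):
            h = rnn(embed(word), h)                                      # advance the hidden state
            probs = torch.softmax(out(h), dim=-1)
            word = torch.multinomial(probs, num_samples=1).squeeze(1)    # sample the next word
            generated.append(vocab[word.item()])                         # sampled output is next input
        return " ".join(generated)

    print(generate())   # untrained, so the output is random; a trained RNN-LM gives fluent text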

Generating text with a RNN Language Model. Let's have some fun! You can train a RNN-LM on any kind of text, then generate text in that style. RNN-LM trained on Obama speeches (example output not reproduced here). Source: https://medium.com/@samim/obama-rnn-machine-generated-political-speeches-c8abd18a2ea0

Generating text with a RNN Language Model. RNN-LM trained on Harry Potter (example output not reproduced here). Source: https://medium.com/deep-writing/harry-potter-written-by-artificial-intelligence-8a9431803da6

Generating text with a RNN Language Model. RNN-LM trained on recipes (example output not reproduced here). Source: https://gist.github.com/nylki/1efbaa36635956d35bcc

Generating text with a RNN Language Model. RNN-LM trained on paint color names; this one is a character-level RNN-LM (it predicts what character comes next). Source: http://aiweirdness.com/post/160776374467/new-paint-colors-invented-by-neural-network

Evaluating Language Models. The standard evaluation metric for Language Models is perplexity: the inverse probability of the corpus according to the Language Model, normalized by the number of words: perplexity = ∏_{t=1}^{T} (1 / P_LM(x^(t+1) | x^(t), ..., x^(1)))^(1/T). This is equal to the exponential of the cross-entropy loss J(θ): perplexity = exp(J(θ)). Lower perplexity is better!
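
As a tiny illustration (hypothetical per-word probabilities, not from the lecture), both definitions give the same number:

    import math

    # Hypothetical probabilities the model assigned to each true next word.
    probs = [0.2, 0.1, 0.05, 0.4]

    cross_entropy = -sum(math.log(p) for p in probs) / len(probs)    # average negative log prob
    perplexity = math.exp(cross_entropy)
    # Same value computed directly as the normalized inverse probability:
    perplexity_direct = math.prod(1 / p for p in probs) ** (1 / len(probs))
    print(round(perplexity, 4), round(perplexity_direct, 4))         # both ≈ 7.0711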

RNNs have greatly improved perplexity. Moving from an n-gram model to increasingly complex RNNs, perplexity improves (lower is better). Source: https://research.fb.com/building-an-efficient-neural-language-model-over-a-billion-words/

Why should we care about Language Modeling? Language Modeling is a benchmark task that helps us measure our progress on understanding language. Language Modeling is also a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text: predictive typing, speech recognition, handwriting recognition, spelling/grammar correction, authorship identification, machine translation, summarization, dialogue, etc.

Recap. Language Model: a system that predicts the next word. Recurrent Neural Network: a family of neural networks that take sequential input of any length, apply the same weights on each step, and can optionally produce output on each step. Recurrent Neural Network Language Model: we've shown that RNNs are a great way to build a LM. But RNNs are useful for much more!

RNNs can be used for tagging, e.g. part-of-speech tagging and named entity recognition: the output at each step is the tag for that word, as in "the/DT startled/JJ cat/NN knocked/VBN over/IN the/DT vase/NN".

RNNs can be used for sentence classification, e.g. sentiment classification: run the RNN over the sentence ("overall I enjoyed the movie a lot"), compute a sentence encoding from the hidden states, and classify it (here: positive). How do we compute the sentence encoding?

RNNs can be used for sentence classification. Basic way: use the final hidden state as the sentence encoding.

RNNs can be used for sentence classification. Usually better: take the element-wise max or mean of all hidden states, as sketched below.
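
A minimal sketch of the pooling idea (hypothetical sizes, PyTorch for illustration only; not the lecture's code):

    import torch
    import torch.nn as nn

    class RNNSentenceClassifier(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=100, hidden_dim=200, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, word_ids):
            h, h_last = self.rnn(self.embed(word_ids))   # hidden states for every word
            # Sentence encoding: element-wise mean of all hidden states
            # (use h_last.squeeze(0) instead for the "final hidden state" variant).
            sentence_encoding = h.mean(dim=1)
            return self.classifier(sentence_encoding)    # e.g. positive vs. negative

    model = RNNSentenceClassifier()
    logits = model(torch.randint(0, 10000, (1, 7)))      # "overall I enjoyed the movie a lot"
    print(logits.shape)                                   # torch.Size([1, 2])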

RNNs can be used as an encoder module, e.g. in question answering, machine translation, and many other tasks! Example (question answering): Context: "Ludwig van Beethoven was a German composer and pianist. A crucial figure ..." Question: "what nationality was Beethoven?" Answer: "German". Here the RNN acts as an encoder for the Question (the hidden states represent the Question); the encoder is part of a larger neural system.

RNN-LMs can be used to generate text, e.g. in speech recognition, machine translation, and summarization. Example (speech recognition): the input audio is used as conditioning for an RNN-LM, which generates the transcript word by word starting from <START>: "what's the weather". This is an example of a conditional language model. We'll see Machine Translation in much more detail later.

A note on terminology. The RNN described in this lecture is the "vanilla" RNN. Next lecture, you will learn about other RNN flavors like GRU and LSTM, and about multi-layer RNNs. By the end of the course, you will understand phrases like "stacked bidirectional LSTM with residual connections and self-attention".

Next time: problems with RNNs (vanishing gradients), which motivate fancy RNN variants: LSTM, GRU, multi-layer, bidirectional.