
Neural models in NLP. Natural Language Processing: Lecture 4, 28.09.2017. Kairit Sirts

The goal of today's lecture: explain word embeddings; explain the recurrent neural models used in NLP. 2

Log-linear language model: y is the next word to predict; x is the context sequence (words, annotations, etc.); v are the model parameters; f(x, y) is the feature vector for the input-output pair (x, y). 3
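For reference, a log-linear language model with the symbols defined above is usually written as follows (the formula itself did not survive the transcription; this is the standard formulation, with the denominator summing over all candidate next words y'):

```latex
p(y \mid x; v) = \frac{\exp\big(v \cdot f(x, y)\big)}{\sum_{y'} \exp\big(v \cdot f(x, y')\big)}
```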

The problem with log-linear models: feature engineering. Developing feature templates; deciding which features are relevant to which problems; experimenting with subsets of features; features can be very complex. 4

What if we could let the model learn the relevant features automatically? Neural networks 5

1-hot representation
         the girl with flowers is cute are were flower
The       1    0    0     0     0   0    0    0     0   0 0
girl      0    1    0     0     0   0    0    0     0   0 0
with      0    0    1     0     0   0    0    0     0   0 0
the       1    0    0     0     0   0    0    0     0   0 0
flowers   0    0    0     1     0   0    0    0     0   0 0
is        0    0    0     0     1   0    0    0     0   0 0
cute      0    0    0     0     0   1    0    0     0   0 0
flower    0    0    0     0     0   0    0    0     1   0 0

What is the similarity between the vectors for flower and flowers?
         the girl with flowers is cute are were flower
flowers   0    0    0     1     0   0    0    0     0   0 0
flower    0    0    0     0     0   0    0    0     1   0 0

Features as distributed representations Deep Learning: What is meant by a distributed representation? https://www.quora.com/deep-learning-what-is-meant-by-a-distributed-representation/answer/rangan-majumder 8

Distributed word representations
         f1  f2  f3  f4
flower    6   3   0   4
flowers   1   7   2   8
What is the cosine similarity between flower and flowers now? 9
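To make the contrast concrete, here is a small sketch (not from the slides) that computes the cosine similarity for both representations with numpy; the one-hot vectors are the rows shown two slides earlier and the distributed vectors are the ones in the table above.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors for "flowers" and "flower": they share no non-zero
# dimension, so their cosine similarity is exactly 0.
flowers_onehot = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
flower_onehot  = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
print(cosine(flowers_onehot, flower_onehot))     # 0.0

# Distributed vectors from the table (features f1..f4).
flower_dist  = np.array([6, 3, 0, 4])
flowers_dist = np.array([1, 7, 2, 8])
print(cosine(flower_dist, flowers_dist))         # roughly 0.70
```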

Learning distributed word representations. The girl with the flowers is cute. She has the flowers in her hand. I picked these flowers myself. The girl with a flower is cute. She has a flower in her hand. I picked this flower myself. Both words appear in very similar contexts: flowers in "with the ___ is cute", "has the ___ in her", "picked the ___ myself"; flower in "with a ___ is cute", "has a ___ in her", "picked a ___ myself". 10

http://metaoptimize.s3.amazonaws.com/cw-embeddings-acl2010/embeddings-mostcommon.embedding_size=50.png 11

Word2Vec Mikolov et al., 2013. Efficient Estimation of Word Representations in Vector Space 13

CBOW (continuous bag of words): w(t-2), w(t-1), w(t+1), w(t+2) are one-hot vectors, each selecting a row in the parameter matrix; C is the set of context vectors; c is the size of the context window; the projection is linear; d is the embedding size. 14
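A common way to write the CBOW prediction, as a sketch: the context rows of the embedding matrix are averaged and fed through a softmax output layer. Here C denotes the context-embedding matrix (C_{w} is its row for word w) and W the output weight matrix; these symbol names are assumptions, not taken from the slide.

```latex
h_t = \frac{1}{2c} \sum_{\substack{-c \le j \le c \\ j \ne 0}} C_{w_{t+j}},
\qquad
p(w_t \mid w_{t-c}, \dots, w_{t+c}) = \mathrm{softmax}(W h_t)
```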

Skip-gram model: predict the context words given the middle word; w(t) is a one-hot vector. Maximize the probability of the context words given w(t). 15

Training word embeddings. General principle: maximize the probability of the middle word given the context words (CBOW), or of the context words given the middle word (skip-gram). In the case of skip-gram, given T training words, maximize the probability of the context words (equivalently, minimize its negative log), as sketched below. 16
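The standard skip-gram objective from Mikolov et al. (2013), which is presumably what the missing formula showed: with T training words and a context window of size c, maximize the average log-probability of the context words, where the probability of a context word c given the middle word w uses a softmax over "input" embeddings v and "output" embeddings u (this notation is assumed here).

```latex
\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(c \mid w) = \frac{\exp(u_c \cdot v_w)}{\sum_{c' \in V} \exp(u_{c'} \cdot v_w)}
```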

Training word embeddings. Typically trained with gradient descent; you will learn more sophisticated methods in other courses. Initialize the parameter vectors/matrices (somehow), then repeat the gradient update until convergence; the update involves the set of all trainable parameters and the learning rate. 17
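The repeated update is the usual gradient step; writing θ for the set of all trainable parameters, η for the learning rate, and L for the training loss (these symbols are assumed, matching the labels left on the slide):

```latex
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta)
```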

Softmax vs log-linear model. The softmax is a log-linear model; compare the two forms below. 18
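A sketch of the correspondence (the exact notation of the slide is not preserved): the log-linear model scores an output with a shared weight vector over features, while a softmax layer scores each output class with its own weight vector; both normalize with the same exponential form.

```latex
\text{Log-linear: } p(y \mid x) = \frac{\exp\big(v \cdot f(x, y)\big)}{\sum_{y'} \exp\big(v \cdot f(x, y')\big)}
\qquad
\text{Softmax: } p(y \mid x) = \frac{\exp(w_y \cdot x)}{\sum_{y'} \exp(w_{y'} \cdot x)}
```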

The gradient of a log-linear model: the empirical feature count minus the expected feature count. 19
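Written out, this is the standard result the two labels refer to: for each parameter, the gradient of the log-likelihood is the observed (empirical) value of the corresponding feature minus its expectation under the model.

```latex
\frac{\partial}{\partial v_j} \log p(y \mid x) = f_j(x, y) - \sum_{y'} p(y' \mid x) \, f_j(x, y')
```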

The gradients in the skip-gram model: c denotes the context word, w the middle word. 20
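Instantiating the same empirical-minus-expected form for the skip-gram softmax, with v_w the input embedding of the middle word and u_c the output embedding of the context word (notation assumed): the gradient is the observed context vector minus the model-expected context vector.

```latex
\frac{\partial}{\partial v_w} \log p(c \mid w) = u_c - \sum_{c' \in V} p(c' \mid w) \, u_{c'}
```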

The problem with softmax gradients: computing them is computationally very expensive. Why? The gradients always include a sum over the whole vocabulary, which makes computation very inefficient. 21

Negative sampling. The general idea: maximize the probability that the observed (word, context) pairs came from the training data, instead of maximizing the probability of the context given the word. Previously we maximized the probability of the context word given the middle word; now we maximize the probability that each (word, context) pair is a true pair from the data. 22

Skip-gram objective with negative sampling. Maximize the objective sketched below, where the negatives are a set of randomly sampled context words; in practice, the number of negative samples per positive sample is between 2 and 20. 23
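In the standard formulation (Mikolov et al., 2013), for each observed pair (w, c) with a set N of k randomly sampled negative context words, the term being maximized is:

```latex
\log \sigma(u_c \cdot v_w) + \sum_{c' \in N} \log \sigma(-u_{c'} \cdot v_w),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```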

Tools for training word embeddings: word2vec; Gensim (includes both CBOW and skip-gram implementations); GloVe (optimizes the prediction of co-occurrence counts between words); Polyglot; dependency-based word embeddings. 24
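As an illustration of the Gensim tool mentioned above, a minimal sketch of training skip-gram embeddings with negative sampling; the corpus here is a toy placeholder. Note that in Gensim 4.x the embedding-size argument is vector_size, while older 3.x versions call it size.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data).
sentences = [
    ["the", "girl", "with", "the", "flowers", "is", "cute"],
    ["she", "has", "the", "flowers", "in", "her", "hand"],
    ["i", "picked", "these", "flowers", "myself"],
    ["the", "girl", "with", "a", "flower", "is", "cute"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding dimension d (use size=50 in Gensim 3.x)
    window=2,         # context window size c
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # number of negative samples per positive pair
    min_count=1,      # keep even rare words in this tiny corpus
)

print(model.wv["flower"])                        # learned embedding vector
print(model.wv.similarity("flower", "flowers"))  # cosine similarity
```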

Further reading on word embeddings: Mikolov et al., 2013. Distributed Representations of Words and Phrases and their Compositionality; Mikolov et al., 2013. Efficient Estimation of Word Representations in Vector Space; Goldberg and Levy, 2014. word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method; Pennington et al., 2014. GloVe: Global Vectors for Word Representation; Al-Rfou et al., 2013. Polyglot: Distributed Word Representations for Multilingual NLP; Levy and Goldberg, 2014. Dependency-Based Word Embeddings. 25

Regularities between word embeddings Vector Representations of Words: https://www.tensorflow.org/tutorials/word2vec 26

Word embedding models as neural networks. Input: the one-hot vector of the input word. Output: the prediction of the context word (softmax), or whether the (context, word) pair belongs to the data (negative sampling). The word embeddings are the rows of the input weight matrix; the row corresponding to the input word is selected by its one-hot vector. Figure: CS231n Convolutional Neural Networks for Visual Recognition: http://cs231n.github.io/assets/nn1/neural_net.jpeg 27

Recurrent Neural Networks http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/ 28

RNN Language Model https://www.linkedin.com/pulse/what-i-learned-from-deep-learning-summer-school-2016-hamid-palangi 29

Machine Translation with RNN http://cs224d.stanford.edu/lectures/cs224d-lecture8.pdf 30

RNN music generation Music Language Modeling with Recurrent Neural Networks: http://yoavz.com/music_rnn/ 31

Sequence Models. The Unreasonable Effectiveness of Recurrent Neural Networks: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ 32

Recurrent Neural Networks. Starting from an initial state, a nonlinear function is applied at each step to the previous state and the current input, beginning with the start symbol <s>, and so on. 33
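The recurrence the slide refers to can be written as follows; this is the standard simple-RNN formulation, with h_0 the initial state, g a nonlinear function (e.g. tanh), x_t the input vector at step t, and W, U, b the parameters (these symbol names are assumed, not copied from the slide).

```latex
h_t = g(W h_{t-1} + U x_t + b), \qquad t = 1, 2, \dots
```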

Non-linear activation functions http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/ 34
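For reference (the linked figure is not reproduced here), the most common non-linear activation functions are:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\mathrm{ReLU}(x) = \max(0, x)
```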

Cross-entropy loss function https://theneuralperspective.com/2016/10/02/02-logistic-regression/ 35
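For reference, the cross-entropy loss for a predicted distribution ŷ and a one-hot target y, and its language-model form averaged over the positions of a sequence (standard formulations, not copied from the slide):

```latex
L(y, \hat{y}) = -\sum_{i} y_i \log \hat{y}_i,
\qquad
L_{\text{LM}} = -\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_1, \dots, w_{t-1})
```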

Training neural networks. Typically with stochastic or mini-batch gradient descent. (Full-batch) GD: gradients are computed over all training items. Mini-batch GD: at each step, gradients are computed over a small number (a mini-batch) of training samples, for instance 20, 32, or 128. Stochastic GD: gradients are computed from a single training item. Gradients are computed using back-propagation (BP), an algorithm for efficient application of the chain rule. There are several variants of gradient descent that set the learning rates in a clever way: RMSProp, AdaGrad, AdaDelta, Momentum, Adam. 36
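A minimal sketch of the mini-batch loop on a toy least-squares problem (numpy only; the model, data, and batch size are illustrative and not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true + noise
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)       # parameters to learn
lr = 0.05             # learning rate
batch_size = 32       # mini-batch size (e.g. 20, 32, 128, ...)

for epoch in range(20):
    order = rng.permutation(len(X))            # shuffle the training items
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # gradient on the mini-batch
        w -= lr * grad                               # gradient descent step

print(w)   # close to w_true after training
```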

Gated units. RNNs are supposed to remember long contexts, but in practice they don't. Gated units such as the LSTM or GRU include gates that control: how much of the next input is read in; how much of the previous hidden state is remembered or forgotten; and how much of the cell state is used in the output. Figure 12 from Herath et al., 2016. Going Deeper into Action Recognition: A Survey. 37
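One common formulation of the LSTM updates referred to above, with input gate i, forget gate f, output gate o, cell state c, and element-wise multiplication ⊙ (the notation is assumed; the cited figure is not reproduced):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```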

Tools for creating and training neural networks: Python libraries that perform symbolic gradient computation, such as Keras, TensorFlow, Theano, PyTorch, DyNet. The field is developing rapidly. 38

RNN LM and word embeddings. The inputs x are one-hot vectors; the rows of the parameter matrix U are the word embeddings. Training embeddings with word2vec or a similar model is faster than with an RNNLM. Pretrained word embeddings can be used to initialise the U matrix in the RNNLM: transfer learning. 39
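A minimal sketch of that initialisation step, assuming the pretrained vectors are available as a plain Python dict from word to numpy vector (the function and variable names here are illustrative, not from the lecture):

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build the RNNLM input matrix U, one row per vocabulary word.

    Rows for words with a pretrained vector are copied from it;
    the remaining rows are initialised randomly.
    """
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            U[i] = pretrained[word]
    return U

# Illustrative usage with toy data:
vocab = ["<s>", "the", "girl", "flower", "flowers"]
pretrained = {"flower": np.ones(4), "flowers": np.full(4, 0.9)}
U = build_embedding_matrix(vocab, pretrained, dim=4)
print(U.shape)   # (5, 4); this matrix is then fine-tuned with the RNNLM
```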

Further reading: Understanding LSTM Networks; Mikolov et al., 2013. Linguistic Regularities in Continuous Space Word Representations. 40

Recap. Word embeddings are dense distributed representations of words. Word embeddings are trained from (word, context) pairs using neural models. Word embeddings can be viewed as automatically learned feature vectors. Recurrent neural networks are neural sequence models often used in NLP. Pretrained word embeddings can be used to initialize the embedding layer of recurrent neural models with textual input. 41