Natural Language Processing: Part II. Overview of Natural Language Processing (L90): ACS Lecture 9. Ann Copestake, Computer Laboratory, University of Cambridge, October 2017.

Distributional semantics and deep learning: outline. Neural networks in pictures; word2vec; Visualization of NNs; Some general comments on deep learning for NLP. VQA has been moved to lecture 12 (insufficient time today). Some slides adapted from Aurelie Herbelot.

Neural networks in pictures: Outline. Neural networks in pictures; word2vec; Visualization of NNs; Some general comments on deep learning for NLP.

Neural networks in pictures: Perceptron. Early model (1962): no hidden layers, just a linear classifier with a summation output. [Figure: inputs x_1, x_2, x_3 with weights w_1, w_2, w_3 feed a single unit that outputs yes/no.] The dot product of an input vector x and a weight vector w is compared to a threshold θ: output yes if w · x > θ, otherwise no.
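
To make the decision rule concrete, here is a minimal NumPy sketch of the perceptron described above; the weights and threshold are illustrative values, not anything from the lecture.

```python
import numpy as np

def perceptron_output(x, w, theta):
    """Return True ("yes") if the dot product of input x and weights w exceeds threshold theta."""
    return float(np.dot(w, x)) > theta

# Illustrative values: three inputs and three weights, as in the slide's figure.
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.2, 0.3])
print(perceptron_output(x, w, theta=0.6))  # True: 0.5 + 0.3 = 0.8 > 0.6
```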

Neural networks in pictures: Restricted Boltzmann Machines. Boltzmann machine: hidden layer, arbitrary interconnections between units; not effectively trainable. Restricted Boltzmann Machine (RBM): one visible (input) and one hidden layer, no intra-layer links. [Figure: visible and hidden units connected by weights w_1, ..., w_6, plus a bias b.]

Neural networks in pictures: Restricted Boltzmann Machines. Hidden layer (note that one hidden layer can model an arbitrary function, but is not necessarily trainable). RBM layers allow for efficient implementation: the weights can be described by a matrix, giving fast computation. One popular deep learning architecture is a stack of RBMs, where the output from one RBM is the input to the next. RBMs can be trained separately and then fine-tuned in combination. The layers allow for efficient implementations and successive approximations to concepts.
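
A minimal sketch (my notation, not from the slides) of why the matrix view gives fast computation: the hidden-unit activation probabilities of an RBM for a whole visible vector come out of a single matrix multiplication.

```python
import numpy as np

def hidden_probs(v, W, b):
    """Hidden-unit activation probabilities of an RBM: sigmoid(W v + b).
    v: visible vector; W: weight matrix (hidden x visible); b: hidden biases."""
    return 1.0 / (1.0 + np.exp(-(W @ v + b)))

rng = np.random.default_rng(0)
v = rng.integers(0, 2, size=6).astype(float)   # 6 binary visible units
W = rng.normal(scale=0.1, size=(4, 6))         # 4 hidden units
b = np.zeros(4)
print(hidden_probs(v, W, b))
```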

Neural networks in pictures Combining RBMs: deep learning https://deeplearning4j.org/restrictedboltzmannmachine Copyright 2016. Skymind. DL4J is distributed under an Apache 2.0 License.

Neural networks in pictures: Sequences. Combined RBMs etc. cannot handle sequence information well (we can pass them sequences encoded as vectors, but the input vectors are fixed length). So a different architecture is needed for sequences, and hence for most language and speech problems. RNN: recurrent neural network. Long short-term memory (LSTM): a development of the RNN, more effective for (some?) language applications.

Neural networks in pictures: Recurrent Neural Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Neural networks in pictures: RNN language model (Mikolov et al., 2010).

Neural networks in pictures: RNN as a language model. Trained on a very large corpus to predict the next word. Input vector: the vector for the word at time t concatenated with the vector output by the context layer at t-1. One-hot vector: one dimension per word (i.e., an index). Input embeddings: a distributional model (with dimensionality reduction). Embeddings may be externally created (from another corpus) or learned for the specific application.
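
As a concrete illustration (not Mikolov et al.'s implementation; shapes and names are assumed), one step of a simple Elman-style RNN language model in NumPy: the current word vector and the previous context layer produce a new context layer, which is mapped to a softmax distribution over the vocabulary.

```python
import numpy as np

def rnn_lm_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One step of a simple (Elman-style) RNN language model.
    x_t: current word vector; h_prev: context (hidden) state from t-1.
    Returns the new context state and a probability distribution over the vocabulary."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)      # new context layer
    scores = W_hy @ h_t                            # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return h_t, probs / probs.sum()                # softmax over next-word candidates

# Illustrative sizes: vocabulary of 10 (one-hot input), hidden size 8.
V, H = 10, 8
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(H, V)), rng.normal(size=(H, H)), rng.normal(size=(V, H))
x_t = np.zeros(V); x_t[3] = 1.0                    # one-hot vector for the current word
h_prev = np.zeros(H)
h_t, next_word_probs = rnn_lm_step(x_t, h_prev, W_xh, W_hh, W_hy)
print(next_word_probs.sum())                       # ~1.0
```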

Neural networks in pictures External embeddings for prediction Jurafsky and Martin, third edition web.stanford.edu/~jurafsky/slp3/

Neural networks in pictures Learned embeddings for prediction Jurafsky and Martin, third edition web.stanford.edu/~jurafsky/slp3/

Neural networks in pictures: Other neural language models. LSTMs etc. capture long-term dependencies: compare 'She shook her head.' with 'She decided she did not want any more tea, so shook her head when the waiter reappeared.' (Not the same as long-distance dependency in linguistics.) LSTMs are now standard for speech, but there is lots of experimentation for other language applications.

Neural networks in pictures: Multimodal architectures. Input to an NN is just a vector: we can combine vectors from different sources, e.g., features from a CNN for visual recognition concatenated with word embeddings. Multimodal systems: captioning, visual question answering (VQA). Will be discussed further in lecture 12.

word2vec: Outline. Neural networks in pictures; word2vec; Visualization of NNs; Some general comments on deep learning for NLP.

word2vec: Embeddings. Embeddings: distributional models with dimensionality reduction, based on prediction. word2vec: as originally described (Mikolov et al. 2013), an NN model using a two-layer network (i.e., not deep!) to perform dimensionality reduction. Two possible architectures: given some context words, predict the target (CBOW); given a target word, predict the contexts (skip-gram). Very computationally efficient, good all-round model (good hyperparameters already selected).
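
To make the CBOW/skip-gram distinction concrete, here is an illustrative sketch (not the word2vec implementation) of the training pairs each architecture would extract from a toy sentence: CBOW pairs the surrounding context words with the target, skip-gram pairs the target with each context word.

```python
def training_pairs(tokens, window=2):
    """Illustrative (context, target) pairs for CBOW and (target, context) pairs for skip-gram."""
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
        cbow.append((context, target))                 # CBOW: predict target from all context words
        skipgram.extend((target, c) for c in context)  # skip-gram: predict each context word from target
    return cbow, skipgram

cbow, skipgram = training_pairs("the cat sat on the mat".split())
print(cbow[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]
```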

word2vec The Skip-gram model

word2vec: Features of word2vec representations. A representation is learnt at the reduced dimensionality straightaway: we are outputting vectors of a chosen dimensionality (a parameter of the system). Usually a few hundred dimensions: dense vectors. The dimensions are not interpretable: it is impossible to look into characteristic contexts. For many tasks, word2vec (skip-gram) outperforms standard count-based vectors. But this is mainly due to the hyperparameters, and these can be emulated in standard count models (see Levy et al.).

word2vec: What word2vec is famous for. BUT... see Levy et al. and Levy and Goldberg for discussion.

word2vec: The actual components of word2vec. A vocabulary (which words do I have in my corpus?); a table of word probabilities; negative sampling: tell the network what not to predict; subsampling: don't look at all words and all contexts.

word2vec: Negative sampling. Instead of doing a full softmax (the final stage in an NN model to get probabilities; very expensive), word2vec is trained using logistic regression to discriminate between real and fake examples: whenever considering a word-context pair, also give the network some contexts which were not actually observed with the word. These are sampled from the vocabulary; words that are more frequent in the corpus are more likely to be sampled. The number of negative samples will affect the results.
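
For reference, the per-pair training objective of skip-gram with negative sampling as given by Mikolov et al. (2013); the notation here is assumed: v_w is the target word vector, v_c the observed context vector, σ the logistic function, and the k negative contexts c_i are drawn from a noise distribution P_n (in the released code, the unigram distribution raised to the power 3/4, which is why more frequent words are sampled more often).

```latex
\log \sigma(v_c \cdot v_w) \;+\; \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\left[ \log \sigma(-v_{c_i} \cdot v_w) \right]
```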

word2vec Softmax (CBOW) https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html

word2vec Negative sampling (CBOW) https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html

word2vec: Subsampling. Instead of considering all words in the sentence, transform it by randomly removing words from it: for example, the previous sentence might become 'considering all sentence transform randomly words'. The subsampling function makes it more likely to remove a frequent word. Note that word2vec does not use a stop list. Note that subsampling affects the effective window size around the target (i.e., the word2vec window size is not fixed). Also: the weights of elements in the context window vary.
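
The discard probability given in Mikolov et al. (2013) is 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t a small threshold (around 1e-5); the released code uses a slight variant. A minimal Python sketch with made-up frequencies:

```python
import math, random

def subsample(tokens, freqs, t=1e-5):
    """Randomly drop frequent words. freqs maps each word to its relative corpus frequency.
    Discard probability from Mikolov et al. (2013): 1 - sqrt(t / f(w))."""
    kept = []
    for w in tokens:
        p_discard = max(0.0, 1.0 - math.sqrt(t / freqs[w]))
        if random.random() >= p_discard:
            kept.append(w)
    return kept

# Illustrative frequencies: 'the' is very frequent and usually dropped; rare words are kept.
freqs = {"the": 0.05, "cat": 1e-5, "sat": 1e-4, "on": 0.01, "mat": 1e-5}
print(subsample("the cat sat on the mat".split(), freqs))
```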

word2vec: Using word2vec. Use predefined vectors or create your own; can be used as input to an NN model. Many researchers use the gensim Python library: https://radimrehurek.com/gensim/. Emerson and Copestake (2016) find significantly better performance on some tests using parsed data. Levy et al.'s papers are very helpful in clarifying word2vec behaviour. Bayesian version: Barkan (2016), https://arxiv.org/ftp/arxiv/papers/1603/1603.06571.pdf
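
A minimal sketch of training word2vec with the gensim library mentioned above; the corpus is a toy example and the parameter values are illustrative (the dimensionality parameter is called vector_size in gensim 4.x and size in the versions current when this lecture was given).

```python
from gensim.models import Word2Vec

# Toy corpus: gensim expects an iterable of tokenised sentences.
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "dogs and cats are animals".split(),
]

# sg=1 selects skip-gram (sg=0 gives CBOW); negative=5 enables negative sampling;
# sample=1e-5 controls subsampling of frequent words.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5, sample=1e-5)

vec = model.wv["cat"]                         # the learned 50-dimensional vector for "cat"
print(model.wv.most_similar("cat", topn=3))   # nearest neighbours in the embedding space
```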

word2vec: doc2vec (Le and Mikolov, 2014). Learn a vector to represent a 'document': a sentence, paragraph, or short document. Where skip-gram is trained by predicting context words given an input word, distributed bag of words (dbow) is trained by predicting context words given a document vector; the order of document words is ignored. There is also dmpv, analogous to CBOW, which is sensitive to document word order. Options: 1. start with random word vector initialization; 2. run skip-gram first; 3. use pretrained embeddings (Lau and Baldwin, 2016).

word2vec: doc2vec (Le and Mikolov, 2014). The learned document vector is effective for various tasks, including sentiment analysis. Lots and lots of possible parameters. Some initial difficulty in reproducing results, but Lau and Baldwin (2016) give a careful investigation of doc2vec, demonstrating its effectiveness.
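
A minimal gensim sketch of doc2vec along the same lines (toy documents, illustrative parameters; in older gensim versions the learned document vectors are accessed via model.docvecs rather than model.dv).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training document is a TaggedDocument: a token list plus a tag identifying it.
docs = [
    TaggedDocument("the cat sat on the mat".split(), tags=[0]),
    TaggedDocument("the dog sat on the log".split(), tags=[1]),
    TaggedDocument("dogs and cats are animals".split(), tags=[2]),
]

# dm=0 selects dbow (predict context words from the document vector);
# dm=1 selects dmpv, the CBOW-like mode that is sensitive to word order.
model = Doc2Vec(docs, vector_size=50, min_count=1, dm=0, epochs=40)

print(model.dv[0])                                   # learned vector for document 0
print(model.infer_vector("a cat on a mat".split()))  # infer a vector for an unseen document
```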

Visualization of NNs: Outline. Neural networks in pictures; word2vec; Visualization of NNs; Some general comments on deep learning for NLP.

Visualization of NNs: Finding out what NNs are really doing. Careful investigation of models (sometimes including going through code), describing them as non-neural models (Omer Levy, word2vec). Building proper baselines (e.g., Zhou et al., 2015 for VQA). Selected and targeted experimentation (examples in lecture 12). Visualization.

Visualization of NNs: t-SNE example: Lau and Baldwin (2016), arxiv.org/abs/1607.05368
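
A sketch of how such a t-SNE plot can be produced with scikit-learn and matplotlib; the vectors here are random stand-ins for the embeddings a trained model (e.g., the gensim sketches above) would provide.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in embeddings: one row per word (or document).
words = ["cat", "dog", "mat", "log", "tea", "coffee"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 50))

# Project the 50-dimensional vectors to 2D for plotting.
# Perplexity must be smaller than the number of points; real plots use hundreds of items.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.savefig("tsne_words.png")
```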

Visualization of NNs: Heatmap example: Li et al. (2015), arxiv.org/abs/1506.01066

Some general comments on deep learning for NLP: Outline. Neural networks in pictures; word2vec; Visualization of NNs; Some general comments on deep learning for NLP.

Some general comments on deep learning for NLP: Deep learning, positives. Really important change in the state of the art for some applications, e.g., language models for speech. Multi-modal experiments are now much more feasible. Models learn structure without hand-crafting of features. Structure learned for one task (e.g., prediction) is applicable to others with limited training data. Lots of toolkits, etc. Huge space of new models, far more research going on in NLP, far more industrial research...

Some general comments on deep learning for NLP: Deep learning, negatives. Models are made as powerful as possible, to the point that they are barely possible to train or use (http://www.deeplearningbook.org, section 16.7). Tuning hyperparameters is a matter of much experimentation. The statistical validity of results is often questionable. Many myths, massive hype and almost no publication of negative results: but there are some NLP tasks where deep learning is not giving much improvement in results. Weird results: e.g., '33rpm' normalized to 'thirty two revolutions per minute' (https://arxiv.org/ftp/arxiv/papers/1611/1611.00068.pdf). Adversarial examples (lecture 12, maybe).

Some general comments on deep learning for NLP: New methodology required for NLP? The perspective here is applied machine learning... Methodological issues are fundamental to deep learning: e.g., subtle biases in training data will be picked up. Old tasks and old data are possibly no longer appropriate. The lack of a predefined interpretation of the latent variables is what makes the models more flexible and powerful... but the models are usually not interpretable by humans after training: serious practical and ethical issues.