TOPICS IN NATURAL LANGUAGE PROCESSING

DEEP LEARNING FOR NLP
Shashi Narayan
ILCC, School of Informatics, University of Edinburgh

Overview
- What is Deep Learning?
- Why do we need to study deep learning?
- Deep Learning: Basics
- Deep Learning in Application

Neural Networks and Deep Learning

Standard machine learning relies on human-designed representations and input features; the learning algorithm then only optimizes model weights to make the best final prediction.

Image: https://content-static.upwork.com/blog/uploads/sites/3/2017/06/27095812/image-16.png

Representation learning automatically discovers the features or representations needed from the data. Deep learning algorithms learn multiple levels of representation of increasing complexity or abstraction.


Why do we need to study deep learning?

Representation Learning
Human-designed representations and input features are task-dependent, time-consuming and expensive to build, and often under- or over-specified. Deep learning provides a way to do representation learning instead.

Distributed and Continuous Representation
Traditional NLP systems are fragile because they rely on discrete symbolic representations.

Example: document classification. Image: https://media.licdn.com/mpr/mpr/shrinknp 800 800/p/8/005/0a3/00e/1488735.png

With a bag-of-words representation, a classifier models p(c_i) = f(bag of unigrams, bigrams, ...). This suffers from the curse of dimensionality and has no notion of semantic similarity: US and USA are unrelated symbols, and (Cricket -> Sports) tells us nothing about (Football -> Sports).

Deep learning instead uses and learns continuous word representations, e.g. word_i = [0.11, 0.22, 0.21, ..., 0.52, 0.19] (a 256-dimensional vector). This avoids the curse of dimensionality, introduces a notion of semantic similarity, and allows unsupervised feature and weight learning.
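As a minimal sketch of why dense vectors help, the snippet below (plain numpy, with made-up toy words and values) contrasts one-hot symbols, where every pair of distinct words is orthogonal, with small dense embeddings, where related words can be close:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot (symbolic) representations: distinct words are always orthogonal.
vocab = ["us", "usa", "cricket"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(cosine(one_hot["us"], one_hot["usa"]))      # 0.0 -- no notion of similarity

# Toy dense embeddings (values invented for illustration): related words are close.
emb = {
    "us":      np.array([0.90, 0.10, 0.00]),
    "usa":     np.array([0.85, 0.15, 0.05]),
    "cricket": np.array([0.00, 0.20, 0.90]),
}
print(cosine(emb["us"], emb["usa"]))       # close to 1 -- similar
print(cosine(emb["us"], emb["cricket"]))   # much smaller -- dissimilar
```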

Distributional similarity. Image: http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2016/05/word-embeddings.png

Image: https://www.tensorflow.org/images/linear-relationships.png

Hierarchical Representation
Deep learning allows multiple levels of hierarchical representation of increasing complexity or abstraction.

Image: https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/assets/deepconcept.png

This matches compositionality in natural language: sentences are composed of words and phrases.

Deep learning is establishing the state of the art!
- Computer Vision: e.g., image recognition
- Natural Language Processing: e.g., language modelling, neural machine translation, dialogue generation and natural language understanding
- Speech Processing: e.g., speech recognition
- Retail, marketing, healthcare, finance, ...

Deep Learning: But Why Now?
Image: http://beamandrew.github.io//images/deep learning 101/nn timeline.jpg

- Availability of large-scale, high-quality labelled datasets
- Availability of faster machines: parallel computing with GPUs and multi-core CPUs
- Better understanding of regularization techniques: dropout, batch normalization and data augmentation
- Availability of open-source machine learning frameworks: TensorFlow, Theano, DyNet, Torch and PyTorch
- Better activation functions (e.g., ReLU), optimizers (e.g., Adam) and architectures (e.g., highway networks)

Deep Learning: Basics

Basic Unit: Neuron
y = f(w^T x + b)
Sigmoid activation: f(z) = 1 / (1 + e^{-z})

A single neuron acts as a logistic regression model.
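A minimal sketch of this unit in numpy (the weights and input below are made-up toy values):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single neuron: y = f(w^T x + b) with sigmoid f."""
    return sigmoid(w @ x + b)

x = np.array([1.0, 0.5, -0.3])   # toy input features
w = np.array([0.2, -0.4, 0.1])   # toy weights
b = 0.05
print(neuron(x, w, b))           # a value in (0, 1), like a logistic regression output
```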

Neural Networks: Multiple logistic regressions

Training a Neural Network: The Backprop Algorithm
Backpropagation is an application of the chain rule: the rate of change of f with respect to a variable x is the sum, over intermediate variables z_i, of the rate of change of f with respect to z_i multiplied by the rate of change of z_i with respect to x:
∂f/∂x = Σ_i (∂f/∂z_i) (∂z_i/∂x)
The intermediate variables are the activations in different parts of the network: the derivative of the output with respect to a parameter is the derivative of the output with respect to the corresponding activation times the derivative of that activation with respect to the parameter, and this is applied recursively through the layers.
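As a hedged illustration (not the lecture's own code), the sketch below differentiates y = sigmoid(w*x + b) by hand via the chain rule and checks the result numerically; all values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: z = w*x + b, y = sigmoid(z)
w, b, x = 0.7, -0.2, 1.5
z = w * x + b
y = sigmoid(z)

# Backward pass via the chain rule:
# dy/dw = (dy/dz) * (dz/dw), where dy/dz = y*(1 - y) and dz/dw = x
dy_dz = y * (1.0 - y)
dy_dw = dy_dz * x
dy_db = dy_dz * 1.0

# Numerical check with finite differences
eps = 1e-6
num_dy_dw = (sigmoid((w + eps) * x + b) - sigmoid((w - eps) * x + b)) / (2 * eps)
print(dy_dw, num_dy_dw)   # should be nearly identical
print(dy_db)
```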

What Happens When Deep is Really Deep?
Vanishing gradients: even large changes in the weights, especially in the early layers, make only small changes to the final output.
Exploding gradients: result in very large updates to the model weights during training.

Symptoms:
- Slow convergence: the model is unable to get traction on the training data.
- Unstable model: the model loss goes to 0 or NaN during training.

How to tackle vanishing and exploding gradients?
- Rectified linear activations (ReLU)
- Gradient clipping
- Long Short-Term Memory networks (LSTMs)
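One of these remedies, gradient clipping, is simple enough to sketch directly; the snippet below (plain numpy, illustrative threshold) rescales a list of gradients whenever their global norm exceeds a chosen maximum:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

# Toy example: one 'exploding' gradient dominates the global norm.
grads = [np.array([100.0, -80.0]), np.array([0.5, 0.1])]
print(clip_by_global_norm(grads, max_norm=5.0))
```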

Neural Networks: Questions
- Why do we need non-linear activations?
- Is the backprop algorithm guaranteed to find the best solution? If not, why not?
- Why do neural networks still perform better than other models on various tasks?

Deep Learning in Application

Buckets of Deep Learning (Andrew Ng)
1. Traditional fully-connected feed-forward networks / multi-layer perceptrons (classification)
2. Convolutional neural networks (vision; mainly spatial data, e.g., images)
3. Sequence models: recurrent neural networks (RNNs), long short-term memory networks (LSTMs), gated recurrent units (language)
4. Future of AI: unsupervised learning, reinforcement learning, etc.

Recurrent Neural Network
h_t = f(W_1 x_t + W_2 h_{t-1} + b)
The internal state h memorises the context up to that point.
Applications: language modelling, neural machine translation, natural language generation and many more.
Image: http://colah.github.io/posts/2015-08-understanding-lstms/img/rnn-unrolled.png
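A minimal sketch of this recurrence in numpy (toy dimensions and random inputs, with tanh as the non-linearity f):

```python
import numpy as np

def rnn_forward(xs, W1, W2, b):
    """Run a simple (Elman) RNN: h_t = tanh(W1 @ x_t + W2 @ h_{t-1} + b)."""
    h = np.zeros(W2.shape[0])
    states = []
    for x in xs:                            # one step per input token/vector
        h = np.tanh(W1 @ x + W2 @ h + b)
        states.append(h)
    return states                           # h_1, ..., h_T

rng = np.random.default_rng(0)
d_in, d_hid, T = 4, 3, 5                    # toy sizes: input dim, hidden dim, sequence length
W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
W2 = rng.normal(scale=0.1, size=(d_hid, d_hid))
b = np.zeros(d_hid)
xs = [rng.normal(size=d_in) for _ in range(T)]
print(rnn_forward(xs, W1, W2, b)[-1])       # final state summarises the whole sequence
```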

Training Recurrent Architectures
h_t = f(W_1 x_t + W_2 h_{t-1} + b)
Unroll the inputs and the outputs of the network into a long sequence (or larger structure) and use the back-propagation algorithm.
Vanishing gradient problem??
Image: http://colah.github.io/posts/2015-08-understanding-lstms/img/rnn-unrolled.png

Long Short-Term Memory (LSTM)
Input gate, output gate and forget gate.
Image: taken from Chung et al. (2014)
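As a rough illustration of the gating (a sketch of one common formulation, not necessarily the exact variant on the slide), one LSTM step in numpy might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the stacked parameters of the
    input (i), forget (f), output (o) gates and candidate cell (g)."""
    z = W @ x + U @ h_prev + b                 # shape (4 * d_hid,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g                     # forget old memory, write new content
    h = o * np.tanh(c)                         # expose part of the memory as the state
    return h, c

d_in, d_hid = 4, 3                             # toy sizes
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4 * d_hid, d_in))
U = rng.normal(scale=0.1, size=(4 * d_hid, d_hid))
b = np.zeros(4 * d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, U, b)
print(h, c)
```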

Gated Recurrent Units (GRUs)
Image: taken from Chung et al. (2014)

Sequence-to-Sequence Models
The encoder encodes the input sentence into a vector; the decoder then generates the output sentence, one word at a time.
Applications: machine translation and dialogue generation.
Image: https://cdn-images-1.medium.com/max/2000/1so-sp58t4bre9ehazhsega.png
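A very small sketch of this encode-then-decode loop (toy RNN cells in numpy, random untrained parameters, greedy decoding; nothing here is the lecture's actual model):

```python
import numpy as np

rng = np.random.default_rng(2)
d_emb, d_hid, vocab = 4, 3, 6                   # toy embedding dim, hidden dim, vocab size
E = rng.normal(scale=0.1, size=(vocab, d_emb))  # shared toy embedding table
We, Ue = rng.normal(scale=0.1, size=(d_hid, d_emb)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wd, Ud = rng.normal(scale=0.1, size=(d_hid, d_emb)), rng.normal(scale=0.1, size=(d_hid, d_hid))
Wo = rng.normal(scale=0.1, size=(vocab, d_hid)) # output projection to the vocabulary

def encode(src_ids):
    """Encoder RNN: compress the source sentence into a single vector."""
    h = np.zeros(d_hid)
    for i in src_ids:
        h = np.tanh(We @ E[i] + Ue @ h)
    return h

def decode(h, bos=0, eos=1, max_len=5):
    """Decoder RNN: generate one word at a time, greedily, until EOS."""
    out, prev = [], bos
    for _ in range(max_len):
        h = np.tanh(Wd @ E[prev] + Ud @ h)
        prev = int(np.argmax(Wo @ h))           # greedy choice of the next word id
        if prev == eos:
            break
        out.append(prev)
    return out

print(decode(encode([2, 3, 4])))                # word ids only; untrained, so the output is arbitrary
```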

Sequence-to-Sequence Models with Attention
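The attention mechanism itself is compact enough to sketch: given a decoder state and the encoder states, compute a softmax over dot-product scores and take a weighted sum (a generic formulation, not necessarily the exact variant shown on the slide):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention(decoder_state, encoder_states):
    """Dot-product attention: weight each encoder state by its relevance
    to the current decoder state and return the weighted sum (context vector)."""
    H = np.stack(encoder_states)             # (T, d)
    scores = H @ decoder_state               # (T,) dot-product relevance scores
    weights = softmax(scores)                # normalised attention weights
    context = weights @ H                    # (d,) context vector
    return context, weights

rng = np.random.default_rng(3)
enc = [rng.normal(size=3) for _ in range(4)]  # toy encoder states
ctx, w = attention(rng.normal(size=3), enc)
print(w, ctx)                                 # the weights sum to 1
```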

Hierarchical Sequence-to-Sequence Models
Application: document modelling.

Cautions
- Requires large amounts of training data
- Hyper-parameter tuning and non-convex optimization
- Model interpretability is a growing issue
- Encoding the structure of language: not everything is a sequence

Summary
Deep learning is extremely powerful at learning feature representations and higher-level abstractions. It is also very simple to get started with: many off-the-shelf packages implementing neural networks are available.