DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING. Junyang LIN


DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING Junyang LIN linjunyang@pku.edu.cn https://justinlin610.github.io

Deep Learning: A Sub-field of Machine Learning. Artificial Intelligence is a pretty large field; Machine Learning (Perceptron, Logistic Regression, SVM, K-means) is a sub-field of it; and Deep Learning (MLP, CNN, RNN, GAN) is in turn a sub-field of Machine Learning.

Deep Learning is Becoming Popular

Deep Learning is Powerful

Deep Learning is Powerful (Machine Translation)

Deep Learning is Powerful (Summarization) https://arxiv.org/abs/1706.02459

Deep Learning is Powerful (Object Recognition)

Deep Learning is Powerful (Face Generation)

Deep Learning is Powerful (Pokemon Generation)

Deep Learning is Powerful (Cat Generation)

Deep Learning is Powerful (Cat Generation)

History of Deep Learning

Ups and Downs
1958: Perceptron Learning Algorithm (Rosenblatt; linear model, limited)
1980s: Multi-layer Perceptron (MLP; non-linear, not fancy)
1986: Backpropagation (G. Hinton et al., but not efficient when deep)
1990s: SVM vs Neural Network (Yann LeCun, CNN)
2006: RBM initialization (G. Hinton et al., breakthrough)
2009: GPU
2011: started to become popular in speech recognition
2012: AlexNet won ILSVRC (the Deep Learning era started)
2014: started to become very popular in NLP (Y. Bengio, RNN)

Great Figures

What is Deep Learning?

The Essence of Machine Learning: define a function set, evaluate the performance of the functions, and pick the best one.

Linear Regression (Housing Price): z = wx + b, f(z) = max(0, z). (Plot: price y against size x.)

Perceptron: z = w_1 x_1 + w_2 x_2 + b = w^T x + b, f(z) = sign(z); output y is +1 or -1. (Plot: weight x_2 against height x_1.)

Logistic Regression: z = w^T x + b, g(x) = σ(z) = 1 / (1 + e^(-z)), with g(z) in (0, 1); output y is 1 or 0. (Plot: weight x_2 against height x_1.)
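
A minimal NumPy sketch (not from the slides) of the three linear models above: all share the pre-activation z = w^T x + b and differ only in the output function f. The weights, bias and the (height, weight) example values below are made up for illustration.

```python
import numpy as np

def pre_activation(w, x, b):
    # z = w^T x + b, shared by all three models
    return np.dot(w, x) + b

def linear_regression(w, x, b):
    return pre_activation(w, x, b)           # identity output (the slide clips it with max(0, z))

def perceptron(w, x, b):
    return np.sign(pre_activation(w, x, b))  # output in {-1, +1}

def logistic_regression(w, x, b):
    z = pre_activation(w, x, b)
    return 1.0 / (1.0 + np.exp(-z))          # sigma(z), a value in (0, 1)

w, b = np.array([0.4, -0.3]), 0.1            # made-up parameters
x = np.array([1.70, 62.0])                   # (height, weight), made-up example
print(perceptron(w, x, b), logistic_regression(w, x, b))
```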

Housing Price Prediction: a small network where the inputs x_1 (zip code), x_2 (size), x_3 (#bedrooms) and x_4 (wealth) feed hidden units a_1 (walkability), a_2 (family size) and a_3 (school quality), which feed the output o (predicted price), compared against y (the real price).

Should you study linguistics? A network where the inputs x_1 (#Chomsky), x_2 (#Halliday), x_3 (#Hu Zhuanglin) and x_4 (#Lakoff) feed hidden units a_1 (Syntax), a_2 (SFL) and a_3 (Cognitive Linguistics), which feed the output o: 1 for Yes, 0 for No; y is 1 or 0.

Standard NN (MLP): inputs x_1, x_2, x_3, x_4 (input layer) feed activation units a_1, a_2, a_3 (hidden layer), which feed the output o (output layer), compared against y. Now we have defined a fully-connected feedforward neural network, which is in fact a function set.

Deep means many hidden layers: the inputs x_1, ..., x_4 (input layer) pass through hidden layer 1 (a_1^1, a_2^1, a_3^1), hidden layer 2 (a_1^2, a_2^2, a_3^2), ..., hidden layer n (a_1^n, a_2^n, a_3^n) to the output layer o, compared against y.
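
A toy NumPy sketch of the forward pass of such a fully-connected feedforward network. The layer sizes (4 inputs, hidden layers of 3 units, 1 output) mirror the diagrams above but are otherwise arbitrary, and the random weights are placeholders rather than learned values.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def mlp_forward(x, params):
    """Forward pass of a fully-connected feedforward network.

    params is a list of (W, b) pairs, one per layer; adding more pairs
    simply makes the network deeper.
    """
    a = x
    for W, b in params[:-1]:
        a = relu(W @ a + b)            # hidden layers with a non-linear activation
    W_out, b_out = params[-1]
    return W_out @ a + b_out           # linear output layer

rng = np.random.default_rng(0)
params = [(rng.normal(size=(3, 4)), np.zeros(3)),   # input layer -> hidden layer 1
          (rng.normal(size=(3, 3)), np.zeros(3)),   # hidden layer 1 -> hidden layer 2
          (rng.normal(size=(1, 3)), np.zeros(1))]   # hidden layer 2 -> output
x = rng.normal(size=4)
print(mlp_forward(x, params))
```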

Activation Unit: if there were no non-linear operation in the activation unit, the whole model would be a linear model, so a multi-layer NN would be equivalent to a single-layer NN. This is why we need a non-linear activation function in the activation unit.

Activation Function (ReLU is preferable):
Sigmoid: f(x) = σ(x) = 1 / (1 + e^(-x))
Tanh: f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
ReLU: f(x) = ReLU(x) = max(0, x)
(The slide plots all three functions.)
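
The same three functions as a small NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes inputs into (0, 1)

def tanh(x):
    return np.tanh(x)                 # squashes inputs into (-1, 1)

def relu(x):
    return np.maximum(0, x)           # max(0, x), the preferable choice on the slide

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```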

Deep means many hidden layers: ResNet, 152 layers.

How can you find the best function? (Plot: weight x_2 against height x_1.) Oh my god! No line can best separate the data! Don't worry! A neural network can help you solve the problem!

XNOR: a two-layer network with hidden units a_1 and a_2 can compute o = XNOR(x_1, x_2), whose truth table is 0,0 -> 1; 0,1 -> 0; 1,0 -> 0; 1,1 -> 1. (The slide gives concrete weights and biases for each connection.)
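
The slide's own weights are not legible in this transcription, so the sketch below uses one classic hand-set construction (a_1 acts as AND, a_2 as NOR, the output as OR) to show that a two-layer network can compute XNOR:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor(x1, x2):
    # Hand-chosen weights (not the slide's): a1 ~ AND(x1, x2), a2 ~ NOR(x1, x2),
    # o ~ OR(a1, a2), which equals XNOR(x1, x2).
    a1 = sigmoid(20 * x1 + 20 * x2 - 30)
    a2 = sigmoid(-20 * x1 - 20 * x2 + 10)
    o = sigmoid(20 * a1 + 20 * a2 - 10)
    return int(round(o))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xnor(x1, x2))   # 1 when the inputs agree, 0 otherwise
```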

Gradient Descent: loss function L = (1/N) Σ_{i=1}^{N} (1/2)(ŷ_i - y_i)^2; objective: minimize the total loss. Update rule: w <- w - α ∂L/∂w (here α is the learning rate, which controls the size of each step). (Plot: price y against size x.)
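
A short NumPy sketch of this loss and update rule, fitting price = w * size + b on made-up (size, price) data; the learning rate alpha and the number of steps are arbitrary choices.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, steps=2000):
    """Minimize L = (1/N) * sum(0.5 * (y_hat - y)^2) for y_hat = w * x + b."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        y_hat = w * x + b
        dw = np.mean((y_hat - y) * x)   # dL/dw
        db = np.mean(y_hat - y)         # dL/db
        w -= alpha * dw                 # w <- w - alpha * dL/dw
        b -= alpha * db
    return w, b

# Made-up (size, price) data roughly following price = 2 * size + 1
sizes = np.array([1.0, 2.0, 3.0, 4.0])
prices = np.array([3.1, 4.9, 7.2, 8.8])
print(gradient_descent(sizes, prices))
```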

Gradient Descent

Backpropagation (Chain Rule): the gradient of the loss flows backwards from the output ŷ through hidden layers n, ..., 2, 1 to the inputs x_1, ..., x_4, applying the chain rule layer by layer.
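
As an illustration of the chain rule (not the slide's own code), a manual backward pass through a one-hidden-layer network with a sigmoid hidden layer and squared-error loss; all sizes, weights and targets are made-up toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, W2, b2):
    # Forward pass: x -> a1 = sigmoid(W1 x + b1) -> y_hat = W2 . a1 + b2
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    y_hat = W2 @ a1 + b2
    # Backward pass, chain rule step by step, for L = 0.5 * (y_hat - y)^2
    d_yhat = y_hat - y          # dL/dy_hat
    dW2 = d_yhat * a1           # dL/dW2
    db2 = d_yhat
    da1 = d_yhat * W2           # dL/da1
    dz1 = da1 * a1 * (1 - a1)   # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = np.outer(dz1, x)      # dL/dW1
    db1 = dz1
    return dW1, db1, dW2, db2

rng = np.random.default_rng(1)
x, y = rng.normal(size=4), 1.0
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=3), 0.0
print(backprop(x, y, W1, b1, W2, b2)[0])
```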

Deep Learning Frameworks

Word Embedding

Discrete Representation. Commonest linguistic idea: signifier and signified (Saussure). One-hot encoding can represent a word: a vector with a single 1 and a lot of 0s. For example: hotel. http://web.stanford.edu/class/cs224n

Problems with Discrete Representation: it has no relation to the meaning of the word. Similar words should have word vectors with a large inner product, but any two different one-hot vectors are orthogonal, so their inner product is always 0. We need a better way to represent word meaning. http://web.stanford.edu/class/cs224n
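
A tiny sketch of the problem with a hypothetical three-word vocabulary: the one-hot vectors of two similar words ("motel" and "hotel") always have inner product 0.

```python
import numpy as np

vocab = ["motel", "hotel", "cat"]        # made-up toy vocabulary

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0           # a single 1, all other entries 0
    return v

# "motel" and "hotel" are similar words, but their one-hot vectors are
# orthogonal, so the inner product gives no hint of the similarity.
print(np.dot(one_hot("motel"), one_hot("hotel")))   # 0.0
```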

Distributed Representation. "You shall know a word by the company it keeps." (J. R. Firth, 1957). Word embeddings build distributed representations for words. Two of the most famous word embedding methods are: Word2Vec (Skip-Gram, CBOW) and GloVe (Global Vectors).

Skip-Grams http://web.stanford.edu/class/cs224n

http://web.stanford.edu/class/cs224n
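
A minimal sketch of how Skip-Gram sets up its training data: every (center word, context word) pair inside a window becomes one training example, and the model then learns word vectors by predicting the context from the center word. The window size and the example sentence below are arbitrary; the actual vector learning is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for all words within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
print(skipgram_pairs(sentence)[:6])
```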

Popular Networks

Convolutional Neural Network

Convolutional Neural Network (CNN): fully-connected feedforward neural network vs. convolutional neural network. http://cs231n.github.io/convolutional-networks/#comp

Convolutional Layer http://cs231n.github.io/convolutional-networks/#comp

Max Pooling http://cs231n.github.io/convolutional-networks/#comp
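
A plain NumPy sketch of the two operations above: a "valid" convolution (implemented as cross-correlation, as deep learning libraries do) followed by non-overlapping max pooling. The 6x6 input and the 2x2 kernel are made-up toy values.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)   # toy 6x6 "image"
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # made-up 2x2 filter
print(max_pool(conv2d(image, kernel)))
```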

Activations of CNN http://cs231n.github.io/convolutional-networks/#comp

Recurrent Neural Network

Any Problem in a Fully-connected Network? A feedforward network sees one word at a time with no memory of the context, so the same input word "Beijing" always gets the same output, even though it should be tagged Destination in "reach Beijing" and Departure in "leave Beijing" (versus Other for the remaining words).

Recurrent Neural Network (RNN) leave Beijing reach Beijing http://web.stanford.edu/class/cs224n

RNN vs Language Model http://web.stanford.edu/class/cs224n
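
A minimal NumPy sketch of a simple (Elman-style) RNN forward pass: the hidden state h is carried from one time step to the next, which is what lets the same word produce different outputs after "leave" versus "reach". All dimensions and weights below are arbitrary toy values.

```python
import numpy as np

def rnn_forward(inputs, Wxh, Whh, Why, bh, by):
    """Run a simple recurrent network over a sequence of input vectors."""
    h = np.zeros(Whh.shape[0])                 # initial hidden state
    outputs = []
    for x in inputs:                           # one vector per time step
        h = np.tanh(Wxh @ x + Whh @ h + bh)    # new state depends on input AND previous state
        outputs.append(Why @ h + by)           # e.g. scores over slot labels or next words
    return outputs, h

rng = np.random.default_rng(2)
d_in, d_h, d_out = 5, 4, 3                     # toy dimensions
Wxh = rng.normal(size=(d_h, d_in))
Whh = rng.normal(size=(d_h, d_h))
Why = rng.normal(size=(d_out, d_h))
bh, by = np.zeros(d_h), np.zeros(d_out)
sequence = [rng.normal(size=d_in) for _ in range(3)]   # a 3-step input sequence
print(rnn_forward(sequence, Wxh, Whh, Why, bh, by)[0][-1])
```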

Why Deep Learning?

Machine Learning vs Deep Learning. Machine Learning: human-designed representations + input features; pick the best weights. Deep Learning: representation learning + raw inputs; pick the best algorithm.

Advantages of Deep Learning: feature engineering is hard and, to some extent, ineffective, incomplete or over-specified; it is really hard work. Deep learning can learn features, which are easy to adapt and fast to learn: flexible, universal and learnable. And we now have more data and more powerful machines.

Advantages of Deep Learning. From Andrew Ng's course Deep Learning.

Future for Deep Learning? Unsupervised learning may be the most important research area in the future, since it is pretty easy to obtain a large amount of unlabeled data while labelled data are far fewer and pretty expensive. Transfer learning can help us adapt pre-trained models to new tasks. Generative models, such as the GAN (Generative Adversarial Network). Abandon it?! (Well, Hinton said we should drop BP.)

Personal Ideas About What We Can Do: It seems that linguists' contribution to NLP has become trivial and deep learning does not really need us, but things may not be that bad. Still, the results of many NLP tasks, like machine translation, are not satisfactory, and machines cannot really understand semantic meaning, let alone pragmatic meaning. And there are more significant problems for scientists to solve in today's world than improving the performance of algorithms, though that is vital too.

References
Book: Goodfellow I., Bengio Y., Courville A. Deep Learning [M]. MIT Press, 2016. http://www.deeplearningbook.org/ https://github.com/exacity/deeplearningbook-chinese
Articles:
LeCun Y., Bengio Y., Hinton G. Deep learning [J]. Nature, 2015, 521(7553): 436-444.
Schmidhuber J. Deep learning in neural networks: An overview [J]. Neural Networks, 2015, 61: 85-117.
Goldberg Y. A Primer on Neural Network Models for Natural Language Processing [J]. J. Artif. Intell. Res. (JAIR), 2016, 57: 345-420.

Talk is Cheap, Show me the Code!