DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING. Junyang LIN

1 DEEP LEARNING FOR NATURAL LANGUAGE PROCESSING Junyang LIN

2 Deep Learning: A Sub-field of Machine Learning
Artificial Intelligence: a pretty large field
Machine Learning: Perceptron, Logistic Regression, SVM, K-means
Deep Learning: MLP, CNN, RNN, GAN

3 Deep Learning is Becoming Popular

4 Deep Learning is Powerful

5 Deep Learning is Powerful (Machine Translation)

6 Deep Learning is Powerful (Summarization)

7 Deep Learning is Powerful (Object Recognition)

8 Deep Learning is Powerful (Face Generation)

9 Deep Learning is Powerful (Pokemon Generation)

10 Deep Learning is Powerful (Cat Generation)

11 Deep Learning is Powerful (Cat Generation)

12 History of Deep Learning

13 Ups and Downs
1958: Perceptron Learning Algorithm (Rosenblatt; linear model, limited)
1980s: Multi-layer Perceptron (MLP; non-linear, not fancy)
1986: Backpropagation (G. Hinton et al.; not efficient when deep)
1990s: SVM vs Neural Network (Yann LeCun, CNN)
2006: RBM Initialization (G. Hinton et al.; breakthrough)
2009: GPU
2011: Started to be popular in Speech Recognition
2012: AlexNet won ILSVRC (Deep Learning Era started)
2014: Started to become very popular in NLP (Y. Bengio, RNN)

14 Great Figures

15 What is Deep Learning?

16 The Essence of Machine Learning: (1) define a function set; (2) evaluate the performance of the functions; (3) pick the best one.

17 Linear Regression (Housing Price). Model: $z = wx + b$, $f(z) = \max(0, z)$, where $x$ is the size and $y$ is the price. [Figure: price (y) plotted against size (x).]
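As a minimal sketch of the unit on this slide, the function below computes $z = wx + b$ and then applies $\max(0, z)$, so a predicted price can never be negative. The name `predict_price` and the values of `w` and `b` are hypothetical, chosen only for illustration; the slide gives no numbers.

```python
def predict_price(size, w, b):
    """Linear unit z = w * size + b followed by max(0, z)."""
    z = w * size + b
    return max(0.0, z)


# Hypothetical parameters, not taken from the slides.
w, b = 2.0, -50.0

print(predict_price(100.0, w, b))  # 150.0
print(predict_price(10.0, w, b))   # 0.0 (z = -30 is clipped to zero)
```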

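18 Perceptron. Inputs: height ($x_1$) and weight ($x_2$). $z = w_1 x_1 + w_2 x_2 + b = w^T x + b$, $f(z) = \mathrm{sign}(z)$, so the output $y$ is $+1$ or $-1$. [Figure: data points plotted by height ($x_1$) and weight ($x_2$).]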
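The perceptron on this slide can be sketched roughly as follows. `predict` implements $f(z) = \mathrm{sign}(w^T x + b)$; `train` is the classic mistake-driven perceptron update, which the slide does not spell out, so treat it as an assumption about what was presented. The data points are made up and centred around zero.

```python
def sign(z):
    """Return +1 for z >= 0 and -1 otherwise (treating z == 0 as +1 is an assumption)."""
    return 1 if z >= 0 else -1


def predict(x, w, b):
    """f(z) = sign(w^T x + b)."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sign(z)


def train(samples, labels, epochs=10):
    """Perceptron rule: on every mistake, move w and b towards the true label."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if predict(x, w, b) != y:
                w = [w_i + y * x_i for w_i, x_i in zip(w, x)]
                b += y
    return w, b


# Made-up, linearly separable (height, weight) points.
samples = [(2.0, 1.0), (3.0, 2.0), (-1.0, -1.5), (-2.0, -1.0)]
labels = [1, 1, -1, -1]

w, b = train(samples, labels)
print([predict(x, w, b) for x in samples])  # [1, 1, -1, -1]
```

Because the model is linear, this only works when a single line can separate the two classes.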

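19 Logistic Regression. Inputs: height ($x_1$) and weight ($x_2$). $z = w^T x + b$, $g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$, with $g(z) \in (0, 1)$; the label $y$ is $+1$ or $0$. [Figure: data points plotted by height ($x_1$) and weight ($x_2$).]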
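A small sketch of the logistic regression unit: the same linear score $z = w^T x + b$ as the perceptron, but squashed into $(0, 1)$ by the sigmoid so it can be read as a probability. The weights, the input, and the 0.5 threshold are illustrative assumptions.

```python
import math


def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), always strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of the positive class for input vector x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return sigmoid(z)


def classify(x, w, b, threshold=0.5):
    """Return 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(x, w, b) >= threshold else 0


# Hypothetical parameters and input.
w, b = [1.0, -2.0], 0.5
x = [2.0, 0.5]  # z = 2.0 - 1.0 + 0.5 = 1.5

print(predict_proba(x, w, b))
print(classify(x, w, b))
```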

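20 Housing Price Prediction. [Figure: a network with inputs zip code ($x_1$), size ($x_2$), #bedroom ($x_3$), and wealth ($x_4$); hidden units walkability ($a_1$), family size ($a_2$), and school quality ($a_3$); weights $w_1$ to $w_7$; and output price ($o$), which is compared with the real price ($y$).]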

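21 Should you study linguistics? [Figure: a network with inputs #Chomsky ($x_1$), #Halliday ($x_2$), #Hu Zhuanglin ($x_3$), and #Lakoff ($x_4$); hidden units Syntax ($a_1$), SFL ($a_2$), and Cognitive Linguistics ($a_3$); weights $w_1$ to $w_7$; and output $o$ ($+1$ for Yes, $0$ for No), which is compared with the label $y$ ($+1$ or $0$).]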

22 Standard NN (MLP). [Figure: input layer ($x_1$ to $x_4$), hidden layer of activation units ($a_1$ to $a_3$), and output layer ($o$, compared with $y$).] Now we have defined a fully-connected feedforward neural network, which is in fact a function set.

23 Deep means many hidden layers. [Figure: input layer ($x_1$ to $x_4$), hidden layers 1 to $n$ with units $a_1^1, a_2^1, a_3^1$ through $a_1^n, a_2^n, a_3^n$, and output layer ($o$, compared with $y$).]
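The fully-connected feedforward network described on the last two slides can be sketched as a loop over layers, where each layer multiplies by a weight matrix, adds a bias, and applies an activation. The helper names (`dense`, `forward`), the ReLU choice, and all weights are assumptions made for illustration; adding more entries to `layers` makes the network deeper.

```python
def relu(z):
    return max(0.0, z)


def identity(z):
    return z


def dense(x, W, b, activation):
    """One fully-connected layer: every output unit sees every input."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


def forward(x, layers):
    """Feed x through each (W, b, activation) triple in turn."""
    a = x
    for W, b, activation in layers:
        a = dense(a, W, b, activation)
    return a


# Hypothetical 4-3-1 network: 4 inputs, 3 hidden units, 1 output.
layers = [
    ([[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 1.0, 1.0]], [0.5], identity),
]

print(forward([1.0, 2.0, 3.0, 4.0], layers))  # [3.5]
```

Choosing different weights picks a different function from the same function set, which is what training does.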

24 Activation Unit. If there is no operation in the activation unit, the whole model is a linear model, so a multi-layer NN is equivalent to a single-layer NN. This is why we need a non-linear activation function in the activation unit. [Figure: network with inputs $x_1$ to $x_4$, activation units $a_1$ to $a_3$, and output $o$.]
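The claim can be checked numerically. Without an activation, two layers compute $W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$, which is a single linear layer. The sketch below uses made-up weights to show that both forms give the same output.

```python
def linear(x, W, b):
    """Compute W x + b for a weight matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


# Hypothetical weights for two stacked layers with no activation.
W1, b1 = [[1.0, 2.0], [3.0, 4.0]], [1.0, -1.0]
W2, b2 = [[0.5, -1.0]], [2.0]

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = linear(b1, W2, b2)

x = [1.0, 1.0]
two_layers = linear(linear(x, W1, b1), W2, b2)
one_layer = linear(x, W, b)
print(two_layers, one_layer)  # [-2.0] [-2.0]
```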

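25 Activation Function. Sigmoid function: $f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$. Tanh function: $f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. ReLU function: $f(x) = \mathrm{ReLU}(x) = \max(0, x)$. [Figure: plots of $f(x)$ against $x$ for the three functions, one of them marked "Preferable".]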
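For reference, the three activation functions written out directly from their definitions. This is only a sketch for scalar inputs; the sample points are arbitrary.

```python
import math


def sigmoid(x):
    """Squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real number into (-1, 1); written out from the definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    """Passes positive values through and clips negative values to 0."""
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 4), round(tanh(x), 4), relu(x))
```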

26 Deep means many hidden layers: ResNet, 152 layers

27 How can you find the best function? "Oh my god! No line can best separate the data!" "Don't worry! A neural network can help you solve the problem!" [Figure: data points plotted by height ($x_1$) and weight ($x_2$).]

28 XNOR. [Figure: a network with inputs $x_1$ and $x_2$, hidden units $a_1$ and $a_2$, and output $o$ that computes XNOR.]
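XNOR is the standard example of a function that no single line can separate but a network with one hidden layer can compute. A minimal sketch, assuming step activations and hand-picked weights (the weights below are my own illustration, not the values from the slide): one hidden unit computes AND, the other computes NOR, and the output unit ORs them.

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


def xnor(x1, x2):
    a1 = step(x1 + x2 - 1.5)    # AND: fires only when both inputs are 1
    a2 = step(-x1 - x2 + 0.5)   # NOR: fires only when both inputs are 0
    return step(a1 + a2 - 0.5)  # OR of the two hidden units


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))
```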

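29 Gradient Descent. Loss function: $L = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} (\hat{y}_i - y_i)^2$. Objective: minimize the total loss. Update rule: $w \leftarrow w - \alpha \frac{\partial L}{\partial w}$ (here $\alpha$ is the learning rate, which controls the size of each step). [Figure: price (y) plotted against size (x).]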
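A minimal sketch of gradient descent for a plain linear model $\hat{y} = wx + b$, assuming the loss is the mean of $\frac{1}{2}(\hat{y}_i - y_i)^2$, so that $\partial L / \partial w$ is the mean of $(\hat{y}_i - y_i)\, x_i$ and $\partial L / \partial b$ is the mean of $(\hat{y}_i - y_i)$. The data, the learning rate, and the step count are made up for illustration.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y_hat = w * x + b by repeatedly stepping against the gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n  # dL/dw
        grad_b = sum(errors) / n                             # dL/db
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Synthetic data generated from y = 2x + 1, so the best fit is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

A learning rate that is too large makes the updates overshoot and diverge; one that is too small makes convergence slow.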

30 Gradient Descent

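31 Backpropagation (Chain Rule). [Figure: a deep network with inputs $x_1$ to $x_4$, hidden layers 1 to $n$ with units $a_1^1, a_2^1, a_3^1$ through $a_1^n, a_2^n, a_3^n$, and output $o$ producing the prediction $\hat{y}$.]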
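To make the chain rule concrete, here is a sketch for the smallest possible network: one input, one sigmoid hidden unit $a = \sigma(w_1 x + b_1)$, and a linear output $\hat{y} = w_2 a + b_2$, with loss $L = \frac{1}{2}(\hat{y} - y)^2$. The architecture, the loss, and the parameter values are assumptions chosen to keep the derivation short. Each gradient is the upstream gradient multiplied by a local derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Return the hidden activation and the prediction."""
    w1, b1, w2, b2 = params
    a = sigmoid(w1 * x + b1)
    y_hat = w2 * a + b2
    return a, y_hat


def loss(x, y, params):
    _, y_hat = forward(x, params)
    return 0.5 * (y_hat - y) ** 2


def backward(x, y, params):
    """Gradients of the loss with respect to [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    a, y_hat = forward(x, params)
    d_y_hat = y_hat - y        # dL/dy_hat
    d_w2 = d_y_hat * a         # dy_hat/dw2 = a
    d_b2 = d_y_hat             # dy_hat/db2 = 1
    d_a = d_y_hat * w2         # dy_hat/da = w2
    d_z = d_a * a * (1.0 - a)  # sigmoid'(z) = a * (1 - a)
    d_w1 = d_z * x             # dz/dw1 = x
    d_b1 = d_z                 # dz/db1 = 1
    return [d_w1, d_b1, d_w2, d_b2]


# Hypothetical parameters and one training example.
params = [0.5, -0.2, 1.5, 0.1]
print(backward(2.0, 1.0, params))
```

The gradients can be sanity-checked against finite differences of `loss`, which is a common way to catch backpropagation bugs.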

32 Deep Learning Frameworks

33 Word Embedding

34 Discrete Representation. The commonest linguistic idea: signifier and signified (Saussure). One-hot encoding can represent a word as a vector with a single 1 and 0s everywhere else. For example: Hotel

35 Problems with Discrete Representation. A one-hot vector has no relation to the meaning of the word. Vectors of similar words should have a large inner product, but the inner product of any two different one-hot vectors is 0. We need a better solution to represent word meaning.
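The problem is easy to see in code. In the sketch below the three-word vocabulary is made up; with one-hot vectors, "hotel" is exactly as dissimilar to "motel" as it is to "cat", because every pair of different words has inner product 0.

```python
def one_hot(word, vocab):
    """Vector with a 1 at the word's position in the vocabulary and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocab]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy vocabulary.
vocab = ["hotel", "motel", "cat"]
hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)
cat = one_hot("cat", vocab)

print(hotel)              # [1, 0, 0]
print(dot(hotel, motel))  # 0
print(dot(hotel, cat))    # 0
print(dot(hotel, hotel))  # 1
```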

36 Distributed Representation. "You shall know a word by the company it keeps." (J. R. Firth, 1957) Word embedding can build distributed representations for words. Two of the most famous word embedding methods are Word2Vec (Skip-Grams, CBOW) and GloVe (Global Vectors).

37 Skip-Grams
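Only the slide title survives here, so the following is a rough sketch rather than the slide's content. The skip-gram model in Word2Vec is trained to predict the words surrounding a centre word, so its training data is a list of (centre, context) pairs taken from a sliding window. The function name `skipgram_pairs` and the window size are illustrative assumptions, and the sketch covers only pair generation, not the model itself.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within `window` of the centre."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "you shall know a word".split()
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```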

39 Popular Networks

40 Convolutional Neural Network

41 Convolutional Neural Network (CNN). [Figure: a fully-connected feedforward neural network compared with a convolutional neural network.]

42 Convolutional Layer
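As a hedged sketch of what a convolutional layer computes (the slide itself is only a title), the function below slides a small kernel over a 2D input and takes a weighted sum at each position. It assumes stride 1, no padding, a single channel, and no bias; as in most deep learning libraries, the kernel is not flipped. The input and kernel values are made up.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]  # responds to change along the main diagonal

print(conv2d(image, kernel))  # [[-4, -4], [-4, -4]]
```

The same few kernel weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a fully-connected one.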

43 Max Pooling
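A minimal sketch of max pooling, assuming non-overlapping square windows (window size equal to the stride) on a single-channel feature map. Each window is replaced by its largest value, which shrinks the map while keeping the strongest responses. The input values are arbitrary.

```python
def max_pool(feature_map, size=2):
    """Replace each non-overlapping size x size window with its maximum."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(0, cols - size + 1, size)
        ]
        for i in range(0, rows - size + 1, size)
    ]


feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 5],
               [0, 1, 9, 8],
               [2, 6, 7, 3]]

print(max_pool(feature_map))  # [[4, 5], [6, 9]]
```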

44 Activations of CNN

45 Recurrent Neural Network

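46 Any Problem in Fully-connected Network? [Figure: a fully-connected network with inputs $x_1$ to $x_4$, weights $W_1$ to $W_4$, hidden units $a_1$ to $a_3$, and outputs $o_1$ to $o_3$ for the labels $N_1$ (Destination), $N_2$ (Departure), and $N_3$ (Other); the example input word is "Beijing".]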

47 Recurrent Neural Network (RNN). Examples: "leave Beijing" vs "reach Beijing".
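The point of the example is that the same word, "Beijing", should be labelled differently depending on the word before it. A recurrent network handles this by carrying a hidden state from one word to the next. The sketch below assumes a simple (Elman-style) cell, $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$, with one-hot inputs; the weights are arbitrary and untrained, so it only shows that the hidden state after "Beijing" depends on the preceding word.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return [
        math.tanh(
            sum(w * v for w, v in zip(W_x[k], x))
            + sum(w * v for w, v in zip(W_h[k], h_prev))
            + b[k]
        )
        for k in range(len(b))
    ]


def run(words, vocab, W_x, W_h, b):
    """Feed one-hot encoded words through the cell, starting from a zero state."""
    h = [0.0] * len(b)
    for word in words:
        x = [1.0 if w == word else 0.0 for w in vocab]
        h = rnn_step(x, h, W_x, W_h, b)
    return h


vocab = ["leave", "reach", "Beijing"]
# Hypothetical, untrained weights: 3 inputs, 2 hidden units.
W_x = [[1.0, -1.0, 0.5],
       [0.0, 0.0, 0.5]]
W_h = [[0.8, 0.0],
       [0.5, 0.3]]
b = [0.0, 0.0]

h_leave = run(["leave", "Beijing"], vocab, W_x, W_h, b)
h_reach = run(["reach", "Beijing"], vocab, W_x, W_h, b)
print(h_leave)
print(h_reach)
```

A fully-connected network that sees only the current word would produce the same output for "Beijing" in both phrases.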

48 RNN vs Language Model

49 Why Deep Learning?

50 Machine Learning vs Deep Learning
Machine Learning: human-designed representations + input features + pick the best weights
Deep Learning: representation learning + raw inputs + pick the best algorithm

51 Advantages of Deep Learning. Feature engineering is hard work, and hand-designed features are often ineffective, incomplete, or over-specified. Deep learning can learn features, which are easy to adapt and fast to learn. It is flexible, universal, and learnable. It benefits from more data and more powerful machines.

52 Advantages of Deep Learning. [Figure from Andrew Ng's course "Deep Learning".]

53 Future for Deep Learning? Unsupervised learning may be the most important research area in the future, since it is easy to obtain large amounts of unlabelled data, while labelled data are far scarcer and expensive. Transfer learning can help us reuse pre-trained models for new tasks. Generative models, such as GAN (Generative Adversarial Network). Abandon it?! (Well, Hinton said we should drop BP.)

54 Personal Ideas About What We Can Do. It seems that linguists' contribution to NLP has become trivial and that deep learning does not really need us, but things may not be that bad. Results on many NLP tasks, such as machine translation, are still unsatisfactory, and machines cannot really understand semantic meaning, let alone pragmatic meaning. There are more significant problems for scientists to solve in today's world than improving the performance of algorithms, vital though that is.

56 Talk is Cheap, Show me the Code!
