Deep Learning Introduction and Natural Language Processing Applications



1 Deep Learning Introduction and Natural Language Processing Applications GMU CSI 899 Jim Simpson, PhD 9/18/2017

2 Agenda
- Fundamentals
  - Linear and Logistic Regression
  - Logistic Regression to Neural Networks
  - Neural Networks to Deep Learning
  - Representation Learning with Deep Neural Networks
- Natural Language Processing Applications
  - Word Embeddings/Vectors
  - Word2Vec
  - Language Models
  - Long-Short-Term-Memory Recurrent Neural Networks
- Additional Reading

3 Definitions
- Deep Learning models are Neural Networks with more than one hidden layer
- Neural Networks are two-dimensional arrays of Logistic Regressors, loosely inspired by how neurons are connected in the mammalian brain
- Deep Learning vs Traditional Machine Learning
  - Deep Learning can learn complex non-linear relationships in the data
  - It can do this without explicit manual feature engineering
  - It adapts to all types of data (even unstructured images and natural language)


4 Regression Analysis Overview
- Linear Regression
  - Dependent Variable (Predictions): Continuous
  - Simple Case: Equation of a line, $y = \beta_0 + \beta_1 x$
  - Example: Home prices from Square Footage
- Logistic Regression
  - Dependent Variable (Predictions): Categorical
  - Simple Case: Sigmoid function, $y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
  - Example: Benign/Malignant from Tumor Size
- Multiple Linear Regression: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
- Multiple Logistic Regression: $\operatorname{logit}(Y) = \ln\left(\frac{Y}{1 - Y}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
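The two simple cases on this slide can be sketched in a few lines of NumPy. This is only an illustration: the coefficient values and the example inputs (square footage, tumor size in cm) are made up, and the function names are hypothetical.

```python
import numpy as np

def linear(x, b0, b1):
    """Simple linear regression prediction: y = b0 + b1 * x (continuous output)."""
    return b0 + b1 * x

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic(x, b0, b1):
    """Simple logistic regression: probability of the positive class."""
    return sigmoid(b0 + b1 * x)

def logit(p):
    """Log-odds, the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
sqft = np.array([1000.0, 2000.0])
price = linear(sqft, b0=50_000.0, b1=150.0)

tumor_cm = np.array([1.0, 3.0, 5.0])
p_malignant = logistic(tumor_cm, b0=-6.0, b1=2.0)

# Taking the logit of the probabilities recovers the linear predictor.
log_odds = logit(p_malignant)
```

The last line mirrors the logit form of the equation: the log-odds of the prediction are linear in the input even though the probability itself is not.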

5 Visual Representation of the Linear Model
- Two input dimensions are combined linearly to form a single-dimension output: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
- [Diagram: inputs $X_1$ and $X_2$, weighted by $\beta_1$ and $\beta_2$, are summed to give the output $Y$]

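6 Logistic Regression to Neural Networks
- Add extra steps between input and output
  - [Diagram: inputs $X_1$, $X_2$ with weights $\beta_1$, $\beta_2$ are summed into hidden unit $H_1$, which feeds the output $Y$ through weight $W_1$]
- With multiple dimensions
  - [Diagram: inputs $X_1$, $X_2$ connect to hidden units $H_1$, $H_2$ through weights $\beta_{1,1}$, $\beta_{1,2}$, $\beta_{2,1}$, $\beta_{2,2}$; the hidden units connect to the output $Y$ through weights $W_1$, $W_2$]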

7 Neural Networks with Hidden Units
- Add non-linearity through layering activation functions, e.g. $f(x) = \tanh(x)$
- [Diagram: inputs $X_1$, $X_2$ connect through weights $\beta_{1,1}$, $\beta_{1,2}$, $\beta_{2,1}$, $\beta_{2,2}$ to hidden units $H_1$, $H_2$, each applying $f(x)$; the hidden units connect to the output $Y$ through weights $W_1$, $W_2$]
- Advantages
  - Adding Hidden Units allows us to capture complex interactions between the variables, whereas the linear model treated each variable's contribution independently
  - The non-linearity on the Hidden Units warps the feature space in a way that is hard to visualize but very beneficial
  - Being able to choose the number of Hidden Units allows us to change the dimensionality of the problem, potentially making classification far easier in a higher-dimensional space
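A minimal sketch of the forward pass through the two-input, two-hidden-unit network, under some simplifying assumptions not stated on the slide: no bias terms, a linear output unit, and hand-picked weights. It illustrates why the activation function matters: with an identity activation the two layers collapse into a single linear model.

```python
import numpy as np

def forward(x, beta, w, f=np.tanh):
    """Forward pass: x (2,) -> hidden units H = f(beta @ x) -> output Y = w @ H."""
    hidden = f(beta @ x)
    return w @ hidden

# Hypothetical weights, chosen only for illustration.
beta = np.array([[1.0, -1.0],
                 [0.5,  2.0]])   # input-to-hidden weights (beta_{i,j})
w = np.array([1.0, 1.0])         # hidden-to-output weights (W_1, W_2)
x = np.array([0.5, -1.0])

y_tanh = forward(x, beta, w)                       # non-linear hidden units
y_identity = forward(x, beta, w, f=lambda z: z)    # no non-linearity
y_collapsed = (w @ beta) @ x                       # equivalent single linear model
```

Here `y_identity` equals `y_collapsed`, while `y_tanh` does not, which is the sense in which the non-linearity adds expressive power.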

8 Neural Networks to Deep Learning
- Deep Learning uses Neural Networks with multiple hidden layers
- The number of neurons per layer and the number of layers become hyper-parameters
- Input dimension: e.g. the number of pixels in an image
- Output classes: e.g. the numbers 0-9 in digit recognition
- [Diagrams: inputs $X_1$, $X_2$ pass through a first and a second hidden layer to the output $Y$]
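As a rough sketch of how layer count and layer width act as hyper-parameters, the network below is built entirely from a list of sizes. The specific sizes (784 input pixels, two hidden layers of 64 and 32 units, 10 output classes), the tanh activation, and the softmax output are assumptions for illustration; the weights are random and untrained.

```python
import numpy as np

def init_layers(sizes, seed=0):
    """Create one (weights, bias) pair per pair of consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def softmax(z):
    """Turn raw scores into a probability distribution over classes."""
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_proba(x, layers):
    """Apply tanh hidden layers, then a softmax output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(W @ h + b)
    W, b = layers[-1]
    return softmax(W @ h + b)

# Hyper-parameters: input pixels, two hidden layers, ten digit classes.
sizes = [784, 64, 32, 10]
layers = init_layers(sizes)

x = np.linspace(0.0, 1.0, 784)     # stand-in for a flattened 28x28 image
probs = predict_proba(x, layers)   # one probability per digit class
```

Changing `sizes` to, say, `[784, 128, 10]` gives a shallower, wider network without touching the rest of the code.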

9 Learning Non-linear Decision Boundaries: Logistic Regression without Feature Engineering
- Logistic Regression without manual feature engineering is NOT able to separate blue dots from orange dots


10 Learning Non-linear Decision Boundaries: Logistic Regression with Manual Feature Engineering
- Adding additional hand-derived features allows logistic regression to separate blue dots from orange dots
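The effect of hand-derived features can be reproduced with a small experiment. The sketch below is an assumption-laden stand-in for the playground demo: it generates its own two-class "disk inside a ring" data, fits logistic regression by plain gradient descent, and compares raw inputs (x, y) against inputs augmented with the squared features x² and y². The data generator, learning rate, and step count are all illustrative choices.

```python
import numpy as np

def make_circle_data(n_per_class=200, seed=0):
    """Class 0: points inside radius 1. Class 1: points in a ring of radius 2 to 3."""
    rng = np.random.default_rng(seed)
    radius = np.concatenate([rng.uniform(0.0, 1.0, n_per_class),
                             rng.uniform(2.0, 3.0, n_per_class)])
    angle = rng.uniform(0.0, 2.0 * np.pi, 2 * n_per_class)
    xy = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    labels = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return xy, labels

def add_bias(features):
    """Prepend a column of ones so the first weight acts as the intercept."""
    return np.column_stack([np.ones(len(features)), features])

def fit_logistic(features, labels, lr=0.1, steps=3000):
    """Fit logistic regression weights by batch gradient descent on the log loss."""
    X = add_bias(features)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - labels) / len(labels)
    return w

def accuracy(features, labels, w):
    """Fraction of points on the correct side of the decision boundary."""
    return np.mean((add_bias(features) @ w > 0) == (labels == 1))

xy, labels = make_circle_data()

# Raw features: the decision boundary can only be a straight line.
raw_acc = accuracy(xy, labels, fit_logistic(xy, labels))

# Engineered features: x^2 and y^2 let a linear model draw a circular boundary.
engineered = np.column_stack([xy, xy ** 2])
eng_acc = accuracy(engineered, labels, fit_logistic(engineered, labels))
```

The model itself is unchanged between the two runs; only the input representation differs, which is the point of the slide.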


11 Learning Non-linear Decision Boundaries: Neural Network without Manual Feature Engineering
- A very simple neural network can separate the two without any manual feature engineering

12 Learning Non-linear Decision Boundaries: Deep Neural Network
- [Figure: TensorFlow Playground (http://playground.tensorflow.org/), spiral dataset, ReLU activation, L2 regularization]

13 Deep Learning Frameworks


14 Natural Language Processing (NLP) Tasks and Recurrent Neural Networks
- NLP Applications
  - Sentiment Analysis
  - Machine Translation
  - Question Answering
  - Dialogue Agents
  - Language Generation
- Common Across All Applications
  - Recurrent Neural Networks (RNNs)
  - Word Embeddings/Vectors
- Recommended Resource: Stanford CS224d/n: Natural Language Processing with Deep Learning

15 Word Embeddings
- Problem: consider the sentence "I made her duck" (ambiguous: "duck" can be a noun or a verb)
- Approach: the Distributional Hypothesis, "You shall know a word by the company it keeps" (J. R. Firth)
- Solution: Word Embeddings/Vectors

16 Word Vectors from Singular Value Decomposition of the Co-Occurrence Matrix
- Given a corpus with these three sentences: "I like deep learning." "I like NLP." "I enjoy flying."
- Steps: build the Co-Occurrence Matrix, then apply Singular Value Decomposition
- Problems:
  - Computation scales quadratically for an $n \times m$ matrix: $O(mn^2)$
  - Hard to add new words or documents
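A short NumPy sketch of this pipeline on the three-sentence corpus. Two details are assumptions rather than something the slide states: the co-occurrence window is one word on each side, and the final period is treated as its own token.

```python
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
sentences = [s.split() for s in corpus]

# Vocabulary in order of first appearance.
vocab = []
for sent in sentences:
    for word in sent:
        if word not in vocab:
            vocab.append(word)
index = {word: i for i, word in enumerate(vocab)}

# Symmetric co-occurrence counts within a window of one word on each side.
window = 1
X = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for pos, word in enumerate(sent):
        lo, hi = max(0, pos - window), min(len(sent), pos + window + 1)
        for other in range(lo, hi):
            if other != pos:
                X[index[word], index[sent[other]]] += 1

# Full SVD, then keep the two leading dimensions as 2-d word vectors.
U, s, Vt = np.linalg.svd(X)
word_vectors = U[:, :2] * s[:2]
```

Adding a new sentence changes `X` and forces the whole decomposition to be recomputed, which is the "hard to add new words or documents" problem in practice.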

17 Word Vectors: Main Idea of Word2Vec
- Instead of capturing co-occurrence counts directly, predict the surrounding words of every word, in a window of length $c$ around it
- Objective function: maximize the log probability of any context word given the current center word:
- Simplest first formulation for the conditional probability:
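The two equations for this slide did not survive extraction. For reference, the standard skip-gram formulation from Mikolov et al. is reproduced below; it is assumed, not confirmed, that the slide showed this form. Here $T$ is the number of words in the corpus, $c$ the window length, $v_c$ the vector of the center word, $u_o$ the vector of a context (outside) word, and $V$ the vocabulary size.

```latex
% Objective: average log probability of the context words around each center word
J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)

% Simplest formulation of the conditional probability: a softmax over the vocabulary
p(o \mid c) = \frac{\exp\left(u_o^{\top} v_c\right)}{\sum_{w=1}^{V} \exp\left(u_w^{\top} v_c\right)}
```

The sum over the whole vocabulary in the denominator is what makes this formulation expensive to train, and it is the motivation for negative sampling.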

18 Word2Vec: Skip-Gram with Negative Sampling
- Word2Vec embeds each word into a low-dimensional vector space using:
  - Skip-Gram: train for center word $w_I$ at time $t$ in a local context window of length $c$
  - Negative Sampling: a clever way to frame the problem as a supervised classification problem
    - Maximize the probability of a true pair (the center word and a word in its context window)
    - Minimize the probability of a couple of random pairs (the center word and a random word outside the context window)
- This simple logistic regression problem moves the vectors for the true pair closer together and the vectors for the random pairs apart
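A minimal sketch of one stochastic gradient step for skip-gram with negative sampling, using the usual loss $-\log \sigma(u_o^\top v_c) - \sum_k \log \sigma(-u_k^\top v_c)$. The vectors, the learning rate, and the function name `sgns_step` are hypothetical; real implementations sample negatives from a smoothed unigram distribution, which is omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center, context, negatives, lr=0.1):
    """One gradient step on the negative-sampling loss.

    center: (d,) vector of the center word
    context: (d,) vector of a true context word
    negatives: (k, d) vectors of k randomly sampled words
    Returns updated copies of all three plus the loss before the update.
    """
    pos = sigmoid(context @ center)        # predicted P(true pair)
    neg = sigmoid(negatives @ center)      # predicted P(random pair), shape (k,)
    loss = -np.log(pos) - np.sum(np.log(1.0 - neg))

    grad_center = (pos - 1.0) * context + neg @ negatives
    grad_context = (pos - 1.0) * center
    grad_negatives = np.outer(neg, center)

    return (center - lr * grad_center,
            context - lr * grad_context,
            negatives - lr * grad_negatives,
            loss)

# Hypothetical 4-dimensional vectors, chosen only for illustration.
center = np.array([0.1, 0.2, -0.1, 0.0])
context = np.array([0.2, -0.1, 0.1, 0.1])
negatives = np.array([[-0.1, 0.1, 0.2, 0.0],
                      [0.0, -0.2, 0.1, 0.1]])

new_center, new_context, new_negatives, loss_before = sgns_step(center, context, negatives)
_, _, _, loss_after = sgns_step(new_center, new_context, new_negatives)
```

After the step, the dot product of the true pair is larger and the dot products with the random words are smaller, which is the "closer together" and "apart" behaviour the slide describes.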

19 Reduced Dimensional (300-d to 2-d) Word Vectors Trained on English Wikipedia
- [Figures: Relationships, Superlatives, Named Entities]
- Images using GloVe, from Richard Socher

20 Language Model using Word Vectors
- A language model assigns probabilities to sentences (sequences of words)
  - By predicting the next word in a sentence given the history of previous words
  - Example: "coffee with cream and sugar"
- It is a classification problem where the target class at each iteration is the next word
- The model is trained to predict a probability distribution over the vocabulary
- The loss or error is the distance between the prediction and the target
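The classification view of next-word prediction can be sketched as a softmax over the vocabulary followed by a cross-entropy loss. Cross-entropy is assumed here as the "distance" between prediction and target, since the slide does not name one. The five-word vocabulary and the scores are made up; in a real model the scores would come from the network given the history "coffee with".

```python
import numpy as np

vocab = ["coffee", "with", "cream", "and", "sugar"]

def softmax(scores):
    """Convert raw scores into a probability distribution over the vocabulary."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def cross_entropy(probs, target_index):
    """Loss for one prediction: negative log probability of the true next word."""
    return -np.log(probs[target_index])

# Hypothetical scores for the word following "coffee with".
scores = np.array([0.0, 0.0, 2.0, 0.0, 0.0])
probs = softmax(scores)

target = vocab.index("cream")
loss = cross_entropy(probs, target)

# A model with no preference (all scores equal) pays log(len(vocab)).
uniform_loss = cross_entropy(softmax(np.zeros(len(vocab))), target)
```

Summing this loss over every position in a sentence gives the negative log probability the model assigns to the whole sentence.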


21 Language Model using Neural Networks
- Trained using Recurrent Neural Networks (RNNs):
  - Neural networks with feedback loops, allowing information to persist
  - A natural architecture for working with sequences
- With Long Short-Term Memory units (LSTMs):
  - RNNs with more complex units
  - Designed to capture both long-term and short-term dependencies
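To make "feedback loops" and "more complex units" concrete, here is a sketch of a single LSTM step in NumPy, following the common formulation with input, forget, and output gates. The weight layout (one stacked matrix for all four gates), the dimensions, and the random untrained weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    x: (D,) input vector (e.g. a word embedding)
    h_prev, c_prev: (H,) previous hidden state and cell state
    W: (4*H, D+H) stacked gate weights, b: (4*H,) stacked gate biases
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])       # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])   # output gate: how much memory to expose
    g = np.tanh(z[3 * H:])        # candidate values to write
    c = f * c_prev + i * g        # long-term memory (cell state)
    h = o * np.tanh(c)            # short-term output (hidden state)
    return h, c

# Run a short sequence through the cell; the states carry information forward.
D, H, steps = 3, 4, 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4 * H, D + H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(steps, D)):
    h, c = lstm_step(x, h, c, W, b)
```

The feedback loop is the reuse of `h` and `c` from one step as inputs to the next; the forget gate is what lets the cell state persist over long spans.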

22 Single Cell Visualization of Language Model trained on Linux Source Code

23 Single Cell Visualization of Language Model trained on Linux Source Code

24 Additional Reading
- Papers
  - Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems.
  - Lin, Henry W., and Max Tegmark. "Critical Behavior from Deep Dynamics: A Hidden Dimension in Natural Language." arXiv preprint (2016).
- Blog Posts
  - Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks
  - Chris Olah: Understanding LSTM Networks
  - Chris Olah: Attention and Augmented Recurrent Neural Networks
- Code
  - Keras
  - TensorFlow

25 Research Ideas
- Uncertainty of Predictions in Recurrent Neural Networks
  - Gal, Yarin. "Uncertainty in Deep Learning." PhD thesis, University of Cambridge.
  - Tom Wiecki: Bayesian Deep Learning
  - Uber Engineering: Application Motivation
- Distributed Deep Learning of Recurrent Neural Networks
  - Scaling Out using Spark and Scaling Up using TensorFlow/Keras

26 BACKUP


27 Computer Vision Tasks and Convolutional Neural Networks
- Computer Vision Applications
  - Image Classification
  - Object Detection
  - Semantic Segmentation
  - Image Captioning
  - Style Transfer
  - Image Generation
- Common Across All Applications
  - Convolutional Neural Networks
- Recommended Resource: Stanford CS231n: Convolutional Neural Networks for Visual Recognition

28 Convolutional Neural Networks

29 Convolutional Neural Networks
