CS 510: Intelligent and Learning Systems

Kimberly Mathews

1 CS 510: Intelligent and Learning Systems

2 Class Information Class web page: Class mailing list: Please write your email address on the signup sheet. My office hours: T, Th 2-3 or by appointment (FAB ) Machine Learning mailing list:

3 Class Description Course is a 1-credit graduate seminar for students who have already taken a course in Artificial Intelligence or Machine Learning. Purpose of the course: Each term: Deep dive into one recent area of AI/ML. Not a general AI/ML course. Students will read and discuss recent papers in the Artificial Intelligence and Machine Learning literature. Each student will be responsible for presenting at least one paper during the term. Course will be offered each term, and students may take it multiple times. CS MS students who take this course for three terms may count it as one of the courses for the "Artificial Intelligence and Machine Learning" master's track requirement.

4 Course Topics Fall 2015: Neural Network Approaches to Natural Language Processing Winter 2016: TBD Spring 2016: TBD Previous topics: Spring 2015: Reinforcement Learning Winter 2015: Graphical Models

5 Course work One or more papers (from the recent literature) will be assigned per week for everyone in the class to read. Each week one or two students will be assigned as discussion leaders for the week's papers. I will hand out a question list with each paper assignment. You need to complete it and hand it in on the day we cover the paper. (Just to make sure you read the paper). Expected homework load: ~3-4 hours per week on average See class web page:

6 Grading Class attendance and participation You may miss up to one class if necessary with no penalty (or do make-up assignment) Discussion leader presentation Weekly questions - You can skip at most one week of this with no penalty

7 Class Introductions

8 How to present a paper/video Start early. Use slides. Explain any unfamiliar terminology in the paper. Don't go through the paper exhaustively; just talk about the most important parts. Graphs / images are welcome. Outside information is welcome (demos, short videos, explanations). Discussion questions for class are welcome! If you don't understand something in the paper, let the class help out.

9 Natural Language Processing Examples of Tasks

10 Background on Neural Networks You need to thoroughly understand multilayer perceptrons and the gradient descent / backpropagation algorithm. Optional reading: see web page.

11 First reading assignment and question sheet

12 Discussion Leader Schedule

13 Dense vector representations for words and documents Bengio et al (2002): A neural probabilistic language model Le & Mikolov (2014): Distributed representations of sentences and documents

14 A Neural Probabilistic Language Model (Bengio et al., 2002) Statistical language modeling: One Task: Given a sequence of words in a language, predict the probability of the next word. E.g., "The cat is on the w": $P(w \mid \text{The cat is on the})$ is a function over all words $w$ in the vocabulary. Problem: Curse of dimensionality: too many sequences!

15 One popular solution: n-grams: $P(w_k \mid w_i\, w_j)$ (trigram probability). If n is small, doesn't capture larger context. If n is larger, back to curse of dimensionality. Doesn't capture semantic similarity of words or similarity of grammatical roles. E.g., "The cat was walking in the bedroom" vs. "The dog was running in a room".
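The limitation described above can be seen in a minimal count-based n-gram sketch. The function names `train_ngram` and `ngram_prob` are made up for this illustration, and the corpus is just the two example sentences from the slide; real n-gram models also add smoothing, which is omitted here.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Count how often each word follows every (n-1)-word context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def ngram_prob(counts, context, word):
    """Maximum-likelihood estimate of P(word | context); 0.0 if the context is unseen."""
    seen = counts.get(tuple(context))
    if not seen:
        return 0.0
    return seen[word] / sum(seen.values())

corpus = ("the cat was walking in the bedroom . "
          "the dog was running in a room .").split()
counts = train_ngram(corpus, n=3)

# "cat was walking" occurs in the corpus, "dog was walking" does not.
p_cat = ngram_prob(counts, ("cat", "was"), "walking")
p_dog = ngram_prob(counts, ("dog", "was"), "walking")
print(p_cat, p_dog)
```

Because the model only counts exact word sequences, nothing learned about "cat" transfers to "dog", even though the two words play the same role in both sentences.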

16 Solution proposed in this paper: Learn a distributed representation (dense, continuous-valued vectors) for words that captures larger context. Each word is represented by a feature vector in $\mathbb{R}^m$. Simultaneously learn this distributed representation and the probability function for word sequences. The learned representation automatically places semantically similar words close together in vector space.
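"Close together in vector space" is often measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical (they are not learned and not taken from the paper); they only show what the comparison looks like.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors, chosen by hand for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.3],
    "dog": [0.8, 0.2, 0.35],
    "bedroom": [0.1, 0.9, 0.2],
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_bedroom = cosine_similarity(vectors["cat"], vectors["bedroom"])
print(sim_cat_dog > sim_cat_bedroom)
```

In a trained model the vectors would come from the learned feature matrix rather than being written by hand.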

17 Details Training set: Sequence $w_1, \dots, w_T$, where each $w_t \in V$. Objective: Learn model $f(w_{t-n}, \dots, w_{t-1}, w_t; \theta) = \hat{P}(w_t \mid w_1^{t-1})$. Test set: Word sequences $s = (w_{k-n}, \dots, w_k)$ from documents not in the training set. Performance metric: Perplexity on the test set.

18 Decompose $f(w_{t-n}, \dots, w_{t-1}, w_t; \theta) = \hat{P}(w_t \mid w_1^{t-1})$ into two parts (to be learned): 1. Feature vectors: Mapping $C: i \mapsto C(i)$, where $i \in V$ and $C(i) \in \mathbb{R}^m$. $C$ is a $|V| \times m$ matrix whose rows are the feature vectors associated with words in the vocabulary $V$. 2. Probability model: Mapping $g: (C(w_{t-n}), \dots, C(w_{t-1})) \mapsto \mathbb{R}^{|V|}$. That is, the output of $g$ is a vector whose $i$th element gives the probability $\hat{P}(w_t = i \mid w_1^{t-1})$.

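20 Training: Find parameters (weights) $\theta$ of the network that maximize the log-likelihood: $L = \frac{1}{T} \sum_{t} \log f(w_{t-n}, \dots, w_{t-1}, w_t; \theta) + R(\theta)$, where $T$ is the number of training examples and $R$ is a regularization term.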

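21 Neural network computes softmax at output layer: $\hat{P}(w_t \mid w_{t-n}, \dots, w_{t-1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$, where $y = b + Wx + U \tanh(d + Hx)$ and $x = (C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n}))$.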
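A forward pass through such a network can be sketched as below. This assumes the architecture described in the Bengio et al. paper: the context words' feature vectors are concatenated into x, a tanh hidden layer computes tanh(d + Hx), and the output scores are y = b + Wx + U tanh(d + Hx), followed by a softmax. The sizes and the helper `rand_matrix` are arbitrary choices for this illustration, and the weights are random rather than trained.

```python
import math
import random

def forward(context_ids, C, H, d, U, W, b):
    """Return P(next word = i | context) for every word i in the vocabulary."""
    # x: concatenation of the feature vectors of the context words
    x = [value for i in context_ids for value in C[i]]
    # hidden layer: tanh(d + Hx)
    hidden = [math.tanh(d_j + sum(h * x_k for h, x_k in zip(h_row, x)))
              for h_row, d_j in zip(H, d)]
    # output scores: y = b + Wx + U * hidden
    y = [b_i
         + sum(w * x_k for w, x_k in zip(w_row, x))
         + sum(u * h_j for u, h_j in zip(u_row, hidden))
         for b_i, w_row, u_row in zip(b, W, U)]
    # softmax, shifted by max(y) for numerical stability
    top = max(y)
    exps = [math.exp(score - top) for score in y]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab_size, m, n_context, n_hidden = 5, 3, 2, 4   # arbitrary toy sizes
C = rand_matrix(vocab_size, m, rng)               # one feature vector per word
H = rand_matrix(n_hidden, n_context * m, rng)
d = [0.0] * n_hidden
U = rand_matrix(vocab_size, n_hidden, rng)
W = rand_matrix(vocab_size, n_context * m, rng)
b = [0.0] * vocab_size

probs = forward([1, 3], C, H, d, U, W, b)
print(sum(probs))
```

The output is a vector with one probability per vocabulary word, so the cost of the softmax grows with the vocabulary size.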

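22 Training Stochastic gradient ascent update: $\theta \leftarrow \theta + \varepsilon \, \frac{\partial \log \hat{P}(w_t \mid w_{t-n}, \dots, w_{t-1})}{\partial \theta}$, where $\varepsilon$ is the learning rate.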
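One stochastic gradient ascent step can be illustrated on a deliberately simplified model: a single softmax layer with scores y = b + Wx, not the full network from the paper. For this layer the gradient of log P(target | x) with respect to y_i is (1 if i is the target, else 0) minus p_i. The function names and the toy numbers are assumptions made for this sketch.

```python
import math

def softmax(y):
    top = max(y)
    exps = [math.exp(score - top) for score in y]
    total = sum(exps)
    return [e / total for e in exps]

def scores(W, b, x):
    return [b_i + sum(w * x_k for w, x_k in zip(row, x)) for row, b_i in zip(W, b)]

def log_prob(W, b, x, target):
    return math.log(softmax(scores(W, b, x))[target])

def sgd_ascent_step(W, b, x, target, lr):
    """In-place update: theta <- theta + lr * d log P(target | x) / d theta."""
    p = softmax(scores(W, b, x))
    for i in range(len(b)):
        grad_y = (1.0 if i == target else 0.0) - p[i]
        b[i] += lr * grad_y
        for k in range(len(x)):
            W[i][k] += lr * grad_y * x[k]

# Toy setup: 3 output words, 2 input features, all weights start at zero.
W = [[0.0, 0.0] for _ in range(3)]
b = [0.0, 0.0, 0.0]
x = [1.0, -0.5]
target = 2

before = log_prob(W, b, x, target)
sgd_ascent_step(W, b, x, target, lr=0.1)
after = log_prob(W, b, x, target)
print(before, after)
```

The update moves the parameters uphill on the log-likelihood of the observed word, so its log-probability rises after the step.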

23 Evaluating learned model (or comparing different learned models) Obviously we want the model to assign higher probability to likely sentences than to unlikely sentences. Need to see how this works on the test set. Could evaluate accuracy of the language model in the context of a task (e.g., spam detection): extrinsic evaluation. But this is time-consuming.

24 Evaluating learned model (or comparing different learned models) Alternative: intrinsic evaluation. Most common intrinsic evaluation method: perplexity. Intuition: A better language model is one that assigns a higher probability to sentences that actually occur in the test data.

25 Perplexity is the inverse probability of the test set, normalized by the number of words.

26 Measuring Perplexity of Learned Model Let the test set be $W = w_1, \dots, w_N$. $\text{Perplexity}(W) = P(w_1, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, \dots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}$. This is the geometric mean of $1 / P(w_t \mid w_1^{t-1})$. Minimizing perplexity is the same as maximizing probability. Note that there is an alternative (equivalent) information-theoretic definition in terms of entropy.
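Perplexity can be computed from the per-word conditional probabilities a model assigns to the test set. The sketch below works in log space to avoid underflow on long sequences; the function name `perplexity` and the probability lists are made up for illustration.

```python
import math

def perplexity(word_probs):
    """Perplexity given P(w_i | w_1 ... w_{i-1}) for each test word, in order."""
    n = len(word_probs)
    log_likelihood = sum(math.log(p) for p in word_probs)
    # exp(-(1/N) * sum of log-probabilities) equals P(w_1 ... w_N) ** (-1/N)
    return math.exp(-log_likelihood / n)

# A model that always spreads probability evenly over 10 words.
uniform = perplexity([0.1] * 5)

# Two hypothetical models scored on the same 3-word test set.
weaker = perplexity([0.1, 0.2, 0.1])
stronger = perplexity([0.4, 0.5, 0.3])
print(uniform, weaker, stronger)
```

A model that is uniform over k words has perplexity k, which is why perplexity is often read as an effective branching factor; the model that assigns higher probability to the test words gets the lower perplexity.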

27 Some experiments from Bengio et al., 2002

29 This general technique (with some added tricks) is now called word2vec (extended to doc2vec) and is used by Google, Facebook, etc. for their NLP applications. See

30 From http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

31 From http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

32 From http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
