CS 510: Intelligent and Learning Systems
Class Information Class web page: http://web.cecs.pdx.edu/~mm/ils/fall2015.htm Class mailing list: ils2015@cs.pdx.edu Please write your email on the signup sheet My office hours: T, Th 2-3 or by appointment (FAB 120-24) Machine Learning mailing list: machinelearning@cs.pdx.edu
Class Description This course is a 1-credit graduate seminar for students who have already taken a course in Artificial Intelligence or Machine Learning. Purpose of the course: Each term, a deep dive into one recent area of AI/ML. Not a general AI/ML course. Students will read and discuss recent papers from the Artificial Intelligence and Machine Learning literature. Each student will be responsible for presenting at least one paper during the term. The course will be offered each term, and students may take it multiple times. CS MS students who take this course for three terms may count it as one of the courses for the "Artificial Intelligence and Machine Learning" master's track requirement.
Course Topics Fall 2015: Neural Network Approaches to Natural Language Processing Winter 2016: TBD Spring 2016: TBD Previous topics: Spring 2015: Reinforcement Learning Winter 2015: Graphical Models
Course work One or more papers (from the recent literature) will be assigned per week for everyone in the class to read. Each week one or two students will be assigned as discussion leaders for the week's papers. I will hand out a question list with each paper assignment. You need to complete it and hand it in on the day we cover the paper. (Just to make sure you read the paper). Expected homework load: ~3-4 hours per week on average See class web page: http://web.cecs.pdx.edu/~mm/ils/fall2015.html
Grading Class attendance and participation You may miss up to one class if necessary with no penalty (or do a make-up assignment) Discussion leader presentation Weekly questions - You can skip at most one week of these with no penalty
Class Introductions
How to present a paper/video Start early Use slides Explain any unfamiliar terminology in the paper Don't go through the paper exhaustively; just talk about the most important parts Graphs / images are welcome Outside information is welcome (demos, short videos, explanations) Discussion questions for the class are welcome! If you don't understand something in the paper, let the class help out
Natural Language Processing Examples of Tasks
Background on Neural Networks You need to thoroughly understand multilayer perceptrons and the gradient descent / backpropagation algorithm. Optional reading: see the web page
First reading assignment and question sheet
Discussion Leader Schedule
Dense vector representations for words and documents Bengio et al. (2002): A neural probabilistic language model Le & Mikolov (2014): Distributed representations of sentences and documents
A Neural Probabilistic Language Model (Bengio et al., 2002) Statistical language modeling: One task: Given a sequence of words in a language, predict the probability of the next word. E.g., "The cat is on the ___": P(w | "The cat is on the") is a function over all words w in the vocabulary. Problem: Curse of dimensionality: too many possible sequences!
One popular solution: n-grams. E.g., P(w_k | w_i, w_j): trigram probability. If n is small, doesn't capture larger context. If n is larger, back to the curse of dimensionality. Doesn't capture semantic similarity of words or similarity of grammatical roles. E.g., "The cat was walking in the bedroom" vs. "The dog was running in a room"
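To make the n-gram idea concrete, here is a minimal count-based trigram model in Python (not from the paper; the toy corpus and the maximum-likelihood estimate are illustrative assumptions). It also shows the generalization problem: any trigram unseen in training gets probability 0.

```python
from collections import defaultdict

# Toy corpus (made up for illustration).
corpus = "the cat was walking in the bedroom . the dog was running in a room .".split()

# Count trigrams and their two-word contexts.
trigram_counts = defaultdict(int)
context_counts = defaultdict(int)
for w_i, w_j, w_k in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w_i, w_j, w_k)] += 1
    context_counts[(w_i, w_j)] += 1

def trigram_prob(w_i, w_j, w_k):
    """Maximum-likelihood estimate of P(w_k | w_i, w_j)."""
    if context_counts[(w_i, w_j)] == 0:
        return 0.0
    return trigram_counts[(w_i, w_j, w_k)] / context_counts[(w_i, w_j)]

print(trigram_prob("was", "walking", "in"))   # 1.0 in this tiny corpus
print(trigram_prob("was", "walking", "the"))  # 0.0 -- unseen trigram, no generalization
```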
Solution proposed in this paper: Learn a distributed representation (dense, continuous-valued vectors) for words that captures larger context. Each word is represented by a feature vector in R^m. Simultaneously learn this distributed representation and a probability function for word sequences. The learned representation automatically places semantically similar words close together in vector space.
Details Training set: A sequence w_1, …, w_T where each w_t ∈ V. Objective: Learn a model f(w_{t-n}, …, w_{t-1}, w_t) = P̂(w_t | w_1^{t-1}). Test set: Word sequences s = (w_{k-n}, …, w_k) from documents not in the training set. Performance metric: Perplexity on the test set.
Decompose f(w_{t-n}, …, w_{t-1}, w_t) = P̂(w_t | w_1^{t-1}) into two parts (to be learned): 1. Feature vectors: A mapping C: i → C(i), where i ∈ V and C(i) ∈ R^m. C is a |V| x m matrix whose rows are the feature vectors associated with words in the vocabulary V. 2. Probability model: A mapping g: (C(w_{t-n}), …, C(w_{t-1})) → R^|V|. That is, the output of g is a vector whose ith element gives the probability P̂(w_t = i | w_1^{t-1}).
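A minimal numpy sketch of this two-part decomposition, assuming toy dimensions and a single tanh hidden layer with a softmax output; the paper's optional direct input-to-output connections are omitted, and all names and sizes here are illustrative, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n, h = 10, 5, 3, 8     # vocab size, feature dim, context length, hidden units (toy values)

# 1. Feature vectors: C is a |V| x m matrix; C[i] is the feature vector for word i.
C = rng.normal(scale=0.1, size=(V, m))

# 2. Probability model g: maps the concatenated context feature vectors to a
#    distribution over the vocabulary (one tanh hidden layer + softmax).
H = rng.normal(scale=0.1, size=(h, n * m)); d = np.zeros(h)
U = rng.normal(scale=0.1, size=(V, h));     b = np.zeros(V)

def g(context_ids):
    x = np.concatenate([C[i] for i in context_ids])  # (C(w_{t-n}), ..., C(w_{t-1}))
    y = b + U @ np.tanh(d + H @ x)                   # one unnormalized score per word in V
    e = np.exp(y - y.max())
    return e / e.sum()                               # softmax: P_hat(w_t = i | context)

p = g([1, 4, 7])             # arbitrary context word indices
print(p.shape, p.sum())      # (10,) 1.0
```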
Training: Find parameters (weights) θ of the network that maximize the log-likelihood: L(θ) = (1/T) Σ_t log f(w_{t-n}, …, w_{t-1}, w_t; θ) + R(θ), where T is the number of training examples and R is a regularization term.
The neural network computes a softmax at the output layer: P̂(w_t = i | w_1^{t-1}) = e^{y_i} / Σ_j e^{y_j}, where y = b + Wx + U tanh(d + Hx) (the direct-connection weights W are optional and may be fixed to 0) and x = (C(w_{t-n}), …, C(w_{t-1})) is the concatenation of the context word feature vectors.
Training Stochastic gradient ascent update: θ ← θ + ε · ∂ log P̂(w_t | w_1^{t-1}) / ∂θ, where ε is the learning rate.
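A sketch of one such stochastic gradient step in PyTorch, under assumed toy sizes and hyperparameters (not the paper's exact configuration): minimizing cross-entropy, i.e. the negative log-likelihood, with SGD is equivalent to the gradient-ascent update on log P̂ above, and weight decay stands in for the regularizer R.

```python
import torch
import torch.nn as nn

V, m, n, h = 10, 5, 3, 8                     # toy sizes, as in the numpy sketch above

class NPLM(nn.Module):
    """Simplified Bengio-style neural probabilistic language model (sketch)."""
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, m)          # feature vectors C
        self.hidden = nn.Linear(n * m, h)
        self.out = nn.Linear(h, V)

    def forward(self, context):              # context: (batch, n) word indices
        x = self.C(context).view(context.size(0), -1)
        return self.out(torch.tanh(self.hidden(x)))   # unnormalized scores y

model = NPLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-5)

context = torch.tensor([[1, 4, 7]])          # w_{t-3}, w_{t-2}, w_{t-1}
target = torch.tensor([2])                   # w_t

# One stochastic gradient step: cross-entropy = -log P_hat(w_t | context),
# so descending on it performs the gradient-ascent update on the log-likelihood.
loss = nn.functional.cross_entropy(model(context), target)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```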
Evaluating the learned model (or comparing different learned models) Obviously we want the model to assign higher probability to likely sentences than to unlikely sentences. Need to see how this works on the test set. Could evaluate accuracy of the language model in the context of a task (e.g., spam detection): extrinsic evaluation. But this is time-consuming.
Evaluating the learned model (or comparing different learned models) Alternative: intrinsic evaluation. Most common intrinsic evaluation method: perplexity. Intuition: A better language model is one that assigns higher probability to the sentences that actually occur in the test data.
Perplexity measures the inverse probability of the test set, normalized by the number of words.
Measuring Perplexity of a Learned Model Let the test set be W = w_1, ..., w_N. Perplexity(W) = P(w_1, ..., w_N)^{-1/N} = (1 / P(w_1, ..., w_N))^{1/N} = (∏_{i=1}^{N} 1 / P(w_i | w_1, ..., w_{i-1}))^{1/N}. This is the geometric mean of 1 / P(w_i | w_1, ..., w_{i-1}). Minimizing perplexity is the same as maximizing probability. Note that there is an alternative (equivalent) information-theoretic definition in terms of entropy.
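A small Python sketch of this computation; the per-word probabilities below are made up for illustration, and working in log space avoids underflow on long test sets.

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence, given P(w_i | w_1, ..., w_{i-1}) for each word.

    Computed in log space: PP(W) = exp(-(1/N) * sum_i log P(w_i | w_1, ..., w_{i-1})),
    which equals the geometric mean of the inverse conditional probabilities.
    """
    N = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / N)

# Made-up per-word probabilities for a 4-word test "sentence".
print(perplexity([0.2, 0.1, 0.25, 0.05]))   # ~8.0: lower probability, higher perplexity
print(perplexity([0.5, 0.4, 0.6, 0.3]))     # ~2.3: higher probability, lower perplexity
```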
Some experiments from Bengio et al., 2002
This general technique (with some added tricks) is now called word2vec (extended to doc2vec) and is used by Google, Facebook, etc. for their NLP applications. See https://code.google.com/p/word2vec/
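If you want to experiment with word vectors yourself, one common option (an assumption here, not something the slide specifies) is the gensim library. The toy sentences are made up, and parameter names follow gensim 4.x (older versions used `size` instead of `vector_size`).

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration).
sentences = [
    ["the", "cat", "was", "walking", "in", "the", "bedroom"],
    ["the", "dog", "was", "running", "in", "a", "room"],
    ["a", "dog", "was", "walking", "in", "the", "room"],
]

# Train a skip-gram word2vec model; hyperparameters are illustrative only.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=200, seed=0)

print(model.wv["cat"].shape)          # (50,): dense vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours in the learned vector space
```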
From http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf