CS 510: Intelligent and Learning Systems


Class Information Class web page: http://web.cecs.pdx.edu/~mm/ils/fall2015.htm Class mailing list: ils2015@cs.pdx.edu Please write your email on the signup sheet. My office hours: T, Th 2-3, or by appointment (FAB 120-24). Machine Learning mailing list: machinelearning@cs.pdx.edu

Class Description The course is a 1-credit graduate seminar for students who have already taken a course in Artificial Intelligence or Machine Learning. Purpose of the course: each term, a deep dive into one recent area of AI/ML; this is not a general AI/ML course. Students will read and discuss recent papers from the Artificial Intelligence and Machine Learning literature. Each student will be responsible for presenting at least one paper during the term. The course will be offered each term, and students may take it multiple times. CS MS students who take this course for three terms may count it as one of the courses for the "Artificial Intelligence and Machine Learning" master's track requirement.

Course Topics Fall 2015: Neural Network Approaches to Natural Language Processing. Winter 2016: TBD. Spring 2016: TBD. Previous topics: Spring 2015: Reinforcement Learning; Winter 2015: Graphical Models.

Course work One or more papers (from the recent literature) will be assigned per week for everyone in the class to read. Each week, one or two students will be assigned as discussion leaders for the week's papers. I will hand out a question list with each paper assignment; you need to complete it and hand it in on the day we cover the paper (just to make sure you read the paper). Expected homework load: ~3-4 hours per week on average. See the class web page: http://web.cecs.pdx.edu/~mm/ils/fall2015.html

Grading Class attendance and participation: you may miss up to one class if necessary with no penalty (or do a make-up assignment). Discussion leader presentation. Weekly questions: you can skip at most one week of these with no penalty.

Class Introductions

How to present a paper/video Start early. Use slides. Explain any unfamiliar terminology in the paper. Don't go through the paper exhaustively; just talk about the most important parts. Graphs / images are welcome. Outside information is welcome (demos, short videos, explanations). Discussion questions for the class are welcome! If you don't understand something in the paper, let the class help out.

Natural Language Processing Examples of Tasks

Background on Neural Networks You need to thoroughly understand multilayer perceptrons and the gradient descent / backpropagation algorithm. Optional reading: see the web page.
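As a quick refresher on that background, here is a minimal NumPy sketch of a one-hidden-layer perceptron trained by gradient descent with backpropagation; the toy data, layer sizes, and learning rate are all illustrative and not part of the course materials.

```python
import numpy as np

# Minimal one-hidden-layer MLP trained with gradient descent / backpropagation.
# Toy data, layer sizes, and learning rate are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                                # toy inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)     # XOR-like labels

n_hidden, lr = 8, 0.5
W1 = rng.normal(scale=0.5, size=(2, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1)); b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)               # hidden activations
    p = sigmoid(h @ W2 + b2)               # predicted probabilities
    # Backward pass: gradients of the mean cross-entropy loss
    d_out = (p - y) / len(X)
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_hid = (d_out @ W2.T) * (1 - h ** 2)  # backprop through tanh
    dW1, db1 = X.T @ d_hid, d_hid.sum(axis=0)
    # Gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", ((p > 0.5) == y).mean())
```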

First reading assignment and question sheet

Discussion Leader Schedule

Dense vector representations for words and documents Bengio et al. (2002): A neural probabilistic language model. Le & Mikolov (2014): Distributed representations of sentences and documents.

A Neural Probabilistic Language Model (Bengio et al., 2002) Statistical language modeling. One task: given a sequence of words in a language, predict the probability of the next word. E.g., "The cat is on the ___": $P(w \mid \text{The cat is on the})$ is a function over all words $w$ in the vocabulary. Problem: the curse of dimensionality: there are too many possible sequences!

One popular solution: n-grams, e.g. the trigram probability $P(w_k \mid w_i, w_j)$. If $n$ is small, this doesn't capture larger context; if $n$ is larger, we are back to the curse of dimensionality. It also doesn't capture semantic similarity of words or similarity of grammatical roles, e.g., "The cat was walking in the bedroom" vs. "The dog was running in a room".
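To make the n-gram idea and its sparsity problem concrete, here is a minimal Python sketch of maximum-likelihood trigram estimation on a toy corpus; the corpus and helper names are purely illustrative.

```python
from collections import Counter

# Maximum-likelihood trigram estimates P(w_k | w_i, w_j) from a toy corpus.
corpus = "the cat was walking in the bedroom the dog was running in a room".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))

def p_trigram(w_i, w_j, w_k):
    """P(w_k | w_i, w_j) = count(w_i, w_j, w_k) / count(w_i, w_j)."""
    denom = bigram_counts[(w_i, w_j)]
    return trigram_counts[(w_i, w_j, w_k)] / denom if denom else 0.0

print(p_trigram("was", "walking", "in"))   # 1.0: the only continuation seen
print(p_trigram("walking", "in", "a"))     # 0.0: plausible but unseen, so no generalization
```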

Solution proposed in this paper: learn a distributed representation (dense, continuous-valued vectors) for words that captures larger context. Each word is represented by a feature vector in $\mathbb{R}^m$. Simultaneously learn this distributed representation and a probability function for word sequences. The learned representation automatically places semantically similar words close together in vector space.

Details Training set: a sequence $w_1, \ldots, w_T$ where each $w_t \in V$. Objective: learn a model $f(w_{t-n}, \ldots, w_{t-1}, w_t) = \hat{P}(w_t \mid w_1^{t-1})$. Test set: word sequences $s = (w_{k-n}, \ldots, w_k)$ from documents not in the training set. Performance metric: perplexity on the test set.

Decompose $f(w_{t-n}, \ldots, w_{t-1}, w_t) = \hat{P}(w_t \mid w_1^{t-1})$ into two parts (to be learned): 1. Feature vectors: a mapping $C : i \mapsto C(i)$, where $i \in V$ and $C(i) \in \mathbb{R}^m$; $C$ is a $|V| \times m$ matrix whose rows are the feature vectors associated with the words in the vocabulary $V$. 2. Probability model: a mapping $g : (C(w_{t-n}), \ldots, C(w_{t-1})) \to \mathbb{R}^{|V|}$; that is, the output of $g$ is a vector whose $i$th element gives the probability $\hat{P}(w_t = i \mid w_1^{t-1})$.

Training: find parameters (weights) $\theta$ of the network that maximize the regularized log-likelihood $L = \frac{1}{T} \sum_{t} \log f(w_{t-n}, \ldots, w_{t-1}, w_t; \theta) + R(\theta)$, where $T$ is the number of training examples and $R$ is a regularization term.

The neural network computes a softmax at the output layer: $\hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$, where $y = b + Wx + U \tanh(d + Hx)$ and $x = (C(w_{t-1}), \ldots, C(w_{t-n}))$ is the concatenation of the input word feature vectors.

Training Stochastic gradient ascent update: $\theta \leftarrow \theta + \epsilon \, \frac{\partial \log \hat{P}(w_t \mid w_{t-1}, \ldots, w_{t-n})}{\partial \theta}$, where $\epsilon$ is the learning rate.
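Putting the architecture and the update together, here is a minimal PyTorch-style sketch of a Bengio-style neural probabilistic language model; the vocabulary size, layer sizes, and toy batch are illustrative, and the paper's direct input-to-output connection ($Wx$) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes: vocab size, embedding dim, context length, hidden units.
V, m, n, h = 1000, 30, 3, 50

class NPLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.C = nn.Embedding(V, m)        # the matrix C of word feature vectors
        self.hidden = nn.Linear(n * m, h)  # d + Hx
        self.out = nn.Linear(h, V)         # b + U tanh(.)  (direct Wx term omitted)

    def forward(self, context):            # context: (batch, n) word indices
        x = self.C(context).view(context.size(0), -1)  # concat C(w_{t-n}),...,C(w_{t-1})
        y = self.out(torch.tanh(self.hidden(x)))
        return F.log_softmax(y, dim=1)     # log P(w_t = i | context) for every i in V

model = NPLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

context = torch.randint(0, V, (4, n))      # toy batch of contexts
target = torch.randint(0, V, (4,))         # toy next words
loss = F.nll_loss(model(context), target)  # negative log-likelihood
opt.zero_grad(); loss.backward(); opt.step()  # descend -log-likelihood = ascend log-likelihood
print("toy batch NLL:", loss.item())
```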

Evaluating the learned model (or comparing different learned models) Obviously we want the model to assign higher probability to likely sentences than to unlikely ones, and we need to see how well this works on the test set. We could evaluate the accuracy of the language model in the context of a task (e.g., spam detection): extrinsic evaluation. But this is time-consuming.

Evaluating the learned model (or comparing different learned models) Alternative: intrinsic evaluation. The most common intrinsic evaluation method is perplexity. Intuition: a better language model is one that assigns a higher probability to sentences that actually occur in the test data.

Perplexity is the inverse probability of the test set, normalized by the number of words.

Measuring Perplexity of the Learned Model Let the test set be $W = w_1, \ldots, w_N$. Then $\mathrm{Perplexity}(W) = P(w_1, \ldots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, \ldots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$. This is the geometric mean of $1 / P(w_i \mid w_1^{i-1})$. Minimizing perplexity is the same as maximizing probability. Note that there is an alternative (equivalent) information-theoretic definition in terms of entropy.
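For concreteness, here is a tiny Python sketch that computes this perplexity from a list of per-word conditional probabilities, working in log space for numerical stability; the probabilities are made up for illustration.

```python
import math

# Perplexity from per-word conditional probabilities P(w_i | w_1..w_{i-1}).
def perplexity(word_probs):
    n = len(word_probs)
    log_prob = sum(math.log(p) for p in word_probs)  # log P(w_1,...,w_N) via the chain rule
    return math.exp(-log_prob / n)                   # P(W)^(-1/N)

print(perplexity([0.2, 0.1, 0.25, 0.05]))  # geometric mean of the inverse probabilities, ~7.95
```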

Some experiments from Bengio et al., 2002

This general technique (with some added tricks) is now called word2vec (extended to doc2vec) and is used by Google, Facebook, etc. for their NLP applications. See https://code.google.com/p/word2vec/
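For reference, a minimal usage sketch with the gensim library (assuming gensim version 4+, where the embedding dimensionality parameter is named vector_size); the toy sentences and hyperparameters are illustrative.

```python
from gensim.models import Word2Vec

# Minimal word2vec usage sketch (assumes gensim >= 4). Toy corpus for illustration only.
sentences = [
    ["the", "cat", "was", "walking", "in", "the", "bedroom"],
    ["the", "dog", "was", "running", "in", "a", "room"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["cat"].shape)              # dense 50-dimensional vector for "cat"
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the learned vectors
```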

Figures from http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf