TTIC 31190: Natural Language Processing

Kevin Gimpel
Winter 2016
Lecture 10: Neural Networks for NLP

Announcements
- Assignment 2 due Friday
- project proposal due Tuesday, Feb. 16
- midterm on Thursday, Feb. 18

Roadmap
- classification
- words
- lexical semantics
- language modeling
- sequence labeling
- neural network methods in NLP
- syntax and syntactic parsing
- semantic compositionality
- semantic parsing
- unsupervised learning
- machine translation and other applications

What is a neural network?
- just think of a neural network as a function: it has inputs and outputs
- the term "neural" typically means a particular type of functional building block ("neural layers"), but the term has expanded to mean many things

Classifier Framework
- classify(x) = argmax_y score(x, y, θ)
- linear model score function: score(x, y, θ) = θ · f(x, y)
- we can also use a neural network for the score function!

Neural Layer
- neural layer = affine transform + nonlinearity: h = g(Wx + b)
  - affine transform: Wx + b
  - nonlinearity: g
- this is a single layer of a neural network
- the input vector is x; the vector of hidden units is h
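A minimal numpy sketch of a single layer (the dimensions, parameter values, and the choice of tanh are illustrative, not from the slides):

```python
import numpy as np

def layer(x, W, b, g=np.tanh):
    """Single neural layer: affine transform followed by a nonlinearity."""
    return g(W @ x + b)

# toy example: 4-dimensional input, 3 hidden units
rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input vector x
W = rng.standard_normal((3, 4))     # weight matrix of the affine transform
b = rng.standard_normal(3)          # bias vector of the affine transform
h = layer(x, W, b)                  # vector of hidden units h = g(Wx + b)
print(h)
```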

Nonlinearities
- most common: elementwise application of a function g to each entry in the vector
- examples follow

tanh: g(x) = tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x})

(logistic) sigmoid: g(x) = σ(x) = 1 / (1 + e^{−x})

rectified linear unit (ReLU): g(x) = max(0, x)
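As a quick illustration, the three nonlinearities applied elementwise to a made-up vector (a sketch, not from the slides):

```python
import numpy as np

def tanh(v):    return np.tanh(v)
def sigmoid(v): return 1.0 / (1.0 + np.exp(-v))
def relu(v):    return np.maximum(0.0, v)

v = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(v))     # squashes each entry to (-1, 1)
print(sigmoid(v))  # squashes each entry to (0, 1)
print(relu(v))     # zeroes out negative entries
```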

2-Layer Network
- s = W^(2) g(W^(1)x + b^(1)) + b^(2)
- this is a 2-layer neural network
- the input vector is x; the output vector s is a vector of label scores

2-layer neural network for sentiment classification

Use the softmax function to convert scores into probabilities: p(y = j | x) = exp(s_j) / Σ_{j'} exp(s_{j'})
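Putting the last three slides together, a hedged sketch of a 2-layer network whose label scores are pushed through a softmax (all dimensions and parameter values here are invented for illustration):

```python
import numpy as np

def softmax(s):
    """Convert a vector of scores into probabilities."""
    e = np.exp(s - s.max())   # subtract max for numerical stability
    return e / e.sum()

def two_layer_scores(x, W1, b1, W2, b2, g=np.tanh):
    """s = W2 g(W1 x + b1) + b2: a 2-layer network producing label scores."""
    h = g(W1 @ x + b1)        # hidden layer
    return W2 @ h + b2        # one score per label

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                          # input feature vector
W1, b1 = rng.standard_normal((5, 10)), np.zeros(5)
W2, b2 = rng.standard_normal((2, 5)), np.zeros(2)    # 2 labels, e.g. pos/neg
probs = softmax(two_layer_scores(x, W1, b1, W2, b2))
print(probs)   # sums to 1
```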

Why Nonlinearities?
- 2-layer network: h = g(W^(1)x + b^(1)), s = W^(2)h + b^(2)
- written in a single equation: s = W^(2) g(W^(1)x + b^(1)) + b^(2)
- if g is linear, then we can rewrite the above as a single affine transform
- can you prove this? (use distributivity of matrix multiplication)

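The proof is short; a sketch in LaTeX, assuming g is the identity (any linear g works the same way):

```latex
\begin{align*}
\mathbf{s} &= W^{(2)}\, g\!\left(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right) + \mathbf{b}^{(2)} \\
           &= W^{(2)}\left(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right) + \mathbf{b}^{(2)}
              && \text{($g$ is the identity)} \\
           &= \underbrace{W^{(2)} W^{(1)}}_{W'}\,\mathbf{x}
              + \underbrace{W^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}}_{\mathbf{b}'}
              && \text{(distributivity)}
\end{align*}
```

So without a nonlinearity, the 2-layer network collapses to the single affine transform W'x + b'.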

Understanding the Score Function
- the score for label 2 is s_2 = (row 2 of W^(2)) · h + b^(2)_2
- b^(2)_2 is entry 2 of the bias vector b^(2)
- the row vector corresponding to row 2 of W^(2) holds that label's weights

Parameter Sharing
- W^(2) and b^(2): parameters NOT shared between labels (each label gets its own row of W^(2) and entry of b^(2))
- W^(1) and b^(1): parameters shared between labels

Observation
- with linear models: when using linear models for, say, sentiment classification, every feature included a label, so no parameters were shared between labels
- with neural networks: we now have parameters shared across labels!
  - we still have some parameters that are devoted to particular labels
  - to define x, we design features that only look at the input (not at the labels)

Defining Input Features
- say we're doing sentiment classification and we want to use a neural network
- what should x be?
  - it has to be independent of the label
  - it has to be a fixed-length vector
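One common way to satisfy both requirements is a bag-of-words count vector over a fixed vocabulary. A hedged sketch (the vocabulary and review text are invented):

```python
import numpy as np

# hypothetical fixed vocabulary; x has one entry per vocabulary word
vocab = ["good", "bad", "great", "boring", "plot", "acting"]
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(tokens):
    """Fixed-length, label-independent feature vector: word counts."""
    x = np.zeros(len(vocab))
    for t in tokens:
        if t in index:
            x[index[t]] += 1.0
    return x

x = bag_of_words("great acting but boring plot".split())
print(x)   # length is always len(vocab), regardless of review length
```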

Empirical Risk Minimization with Surrogate Loss Functions
- given training data: {⟨x_i, y_i⟩} for i = 1, ..., |T|, where each y_i is a label
- we want to solve the following: min_θ Σ_i loss(x_i, y_i, θ)
- many possible loss functions to consider optimizing

Loss Functions (name: loss; where used)
- cost ("0-1"): cost(y, classify(x, θ)); intractable, but underlies direct error minimization
- perceptron: max_{y'} score(x, y', θ) − score(x, y, θ); perceptron algorithm (Rosenblatt, 1958)
- hinge: max_{y'} (score(x, y', θ) + cost(y, y')) − score(x, y, θ); support vector machines, other large-margin algorithms
- log: log Σ_{y'} exp score(x, y', θ) − score(x, y, θ); logistic regression, conditional random fields, maximum entropy models

(Sub)gradients of Losses for Linear Models (entry j of the (sub)gradient of each loss, for score(x, y, θ) = θ · f(x, y))
- cost ("0-1"): not subdifferentiable in general
- perceptron: f_j(x, ŷ) − f_j(x, y), where ŷ = argmax_{y'} score(x, y', θ)
- hinge: f_j(x, ŷ) − f_j(x, y), where ŷ = argmax_{y'} (score(x, y', θ) + cost(y, y'))
- log: Σ_{y'} p_θ(y' | x) f_j(x, y') − f_j(x, y)
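To make the log-loss row concrete, a sketch of the log loss and its gradient for a linear model (the feature matrix here is made up; row y' holds f(x, y')):

```python
import numpy as np

def log_loss_and_grad(theta, feats, y):
    """Log loss and its gradient for a linear model.

    feats: matrix whose row y' is the feature vector f(x, y').
    y: index of the gold label.
    """
    scores = feats @ theta                     # score(x, y', theta) per label
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores) / np.exp(scores).sum()  # p_theta(y' | x)
    loss = -np.log(p[y])
    grad = feats.T @ p - feats[y]              # E_p[f(x, y')] - f(x, y)
    return loss, grad

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 6))   # 3 labels, 6 features (toy values)
theta = np.zeros(6)
loss, grad = log_loss_and_grad(theta, feats, y=1)
print(loss, grad)
```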

Learning with Neural Networks
- we can use any of our loss functions from before, as long as we can compute (sub)gradients
- algorithm for doing this efficiently: backpropagation
- it's basically just the chain rule of derivatives

Computation Graphs
- a useful way to represent the computations performed by a neural model (or any model!)
- why useful? makes it easy to implement automatic differentiation (backpropagation)
- many neural net toolkits let you define your model in terms of computation graphs (Theano, Torch, TensorFlow, CNTK, CNN, PENNE, etc.)

Backpropagation
- backpropagation has become associated with neural networks, but it's much more general
- I also use backpropagation to compute gradients in linear models for structured prediction

A simple computation graph: represents the expression a + 3

A slightly bigger computation graph: represents the expression (a + 3)^2 + 4a^2

Operators can have more than 2 operands: still represents the expression (a + 3)^2 + 4a^2

more concise: the same expression, (a + 3)^2 + 4a^2, drawn as a more compact graph
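A toy computation-graph sketch (the Node class is invented for illustration, not from any toolkit): forward computes (a + 3)^2 + 4a^2, and backward applies the chain rule to recover the derivative with respect to a.

```python
class Node:
    """A value in a computation graph, with a gradient accumulator."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient)
        self.grad = 0.0

    def backward(self, upstream=1.0):
        """Chain rule: push upstream gradient times local gradient to parents."""
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

def add(u, v):    return Node(u.value + v.value, [(u, 1.0), (v, 1.0)])
def mul(u, v):    return Node(u.value * v.value, [(u, v.value), (v, u.value)])
def square(u):    return Node(u.value ** 2, [(u, 2.0 * u.value)])

a = Node(2.0)
const3, const4 = Node(3.0), Node(4.0)
expr = add(square(add(a, const3)), mul(const4, square(a)))  # (a+3)^2 + 4a^2
expr.backward()
print(expr.value)  # 41.0 at a = 2
print(a.grad)      # 2(a+3) + 8a = 26.0 at a = 2
```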

Overfitting & Regularization
- when we can fit any function, overfitting becomes a big concern
- overfitting: learning a model that does well on the training set but doesn't generalize to new data
- there are many strategies to reduce overfitting (we'll use the general term "regularization" for any such strategy)
- you used early stopping in Assignment 1, which is one kind of regularization

Empirical Risk Minimization
- given training data: {⟨x_i, y_i⟩} for i = 1, ..., |T|, where each y_i is a label
- we want to solve the following: min_θ Σ_i loss(x_i, y_i, θ)

Regularized Empirical Risk Minimization
- given training data: {⟨x_i, y_i⟩} for i = 1, ..., |T|, where each y_i is a label
- we want to solve the following: min_θ Σ_i loss(x_i, y_i, θ) + λ R(θ)
- λ is the regularization strength; R(θ) is the regularization term

Regularization Terms
- most common: penalize large parameter values
- intuition: large parameters might be instances of overfitting
- examples:
  - L2 regularization: R(θ) = ||θ||₂² (also called Tikhonov regularization or ridge regression)
  - L1 regularization: R(θ) = ||θ||₁ (also called basis pursuit or LASSO)

Regularization Terms
- L2 regularization: differentiable, widely used
- L1 regularization: not differentiable (but is subdifferentiable); leads to sparse solutions (many parameters become zero!)
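A small sketch of both penalties and their (sub)gradients (the λ value and parameter vector are arbitrary):

```python
import numpy as np

lam = 0.1                      # regularization strength (arbitrary choice)
theta = np.array([0.5, -2.0, 0.0, 3.0])

l2_penalty = lam * np.sum(theta ** 2)     # differentiable everywhere
l2_grad    = lam * 2.0 * theta

l1_penalty = lam * np.sum(np.abs(theta))  # not differentiable at 0
l1_subgrad = lam * np.sign(theta)         # a valid subgradient (0 at theta_j = 0)

print(l2_penalty, l2_grad)
print(l1_penalty, l1_subgrad)
```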

Dropout
- popular regularization method for neural networks
- randomly drop out (set to zero) some of the vector entries in the layers
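A minimal sketch of dropout applied to a hidden layer at training time. This is the common "inverted dropout" variant, which rescales the surviving entries; the keep probability is a typical but arbitrary choice:

```python
import numpy as np

def dropout(h, keep_prob=0.5, rng=np.random.default_rng(0)):
    """Randomly zero entries of h; rescale survivors so the expected value
    is unchanged (inverted dropout). At test time, use h unchanged."""
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

h = np.ones(8)                 # pretend hidden-layer activations
print(dropout(h))              # roughly half the entries are zeroed
```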

Optimization Algorithms
- you used stochastic gradient descent (SGD) in Assignment 1
- but there are many other choices:
  - AdaGrad
  - AdaDelta
  - Adam
  - SGD with momentum
- we don't have time to go through these in class, but you should try using them! (most toolkits have implementations of these and others)
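As a taste of the last item on the list, a hedged sketch of SGD with momentum on a toy quadratic objective (all hyperparameters are invented):

```python
import numpy as np

def sgd_momentum(grad_fn, theta, lr=0.1, mu=0.9, steps=100):
    """SGD with momentum: the velocity accumulates an exponentially
    decaying average of past gradients, which smooths the updates."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad_fn(theta)
        theta = theta + v
    return theta

# toy objective: loss(theta) = ||theta - 3||^2, gradient 2(theta - 3)
grad_fn = lambda th: 2.0 * (th - 3.0)
print(sgd_momentum(grad_fn, np.zeros(2)))   # converges near [3, 3]
```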