Linear Models Continued: Perceptron & Logistic Regression


Linear Models Continued: Perceptron & Logistic Regression CMSC 723 / LING 723 / INST 725 Marine Carpuat Slides credit: Graham Neubig, Jacob Eisenstein

Linear Models for Classification: feature function representation; weights

Naïve Bayes recap

The Perceptron

The perceptron: a linear model for classification, and an algorithm to learn feature weights given labeled data. It is an online, error-driven algorithm.
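The online, error-driven training loop described above can be sketched as follows. This is a minimal illustration, not the lecture's exact code; it assumes dense NumPy feature vectors and labels in {-1, +1}, and makes a fixed number of passes over the data.

```python
import numpy as np

def perceptron_train(X, y, n_epochs=10):
    """Online, error-driven training of a binary perceptron.
    X: (n_examples, n_features) array; y: labels in {-1, +1}."""
    w = np.zeros(X.shape[1])           # initialize all feature weights to zero
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1   # current prediction
            if y_hat != y_i:                           # update only on errors
                w += y_i * x_i
    return w

# Toy linearly separable data: the algorithm converges to a separating w.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```

Note that weights change only when the current prediction is wrong; correct examples leave w untouched, which is what "error-driven" means.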

Multiclass perceptron
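The multiclass variant keeps one weight vector per class and predicts the highest-scoring class; on an error it boosts the true class and penalizes the predicted one. A minimal sketch (the one-hot toy data is illustrative, not from the lecture):

```python
import numpy as np

def multiclass_perceptron_train(X, y, n_classes, n_epochs=10):
    """Multiclass perceptron: one weight vector per class, predict the argmax.
    On an error, add the features to the true class and subtract them
    from the wrongly predicted class."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(np.argmax(W @ x_i))  # highest-scoring class
            if y_hat != y_i:
                W[y_i] += x_i     # move the true class's score up
                W[y_hat] -= x_i   # move the predicted class's score down
    return W

# Three toy one-hot examples, one per class.
X = np.eye(3)
y = np.array([0, 1, 2])
W = multiclass_perceptron_train(X, y, n_classes=3)
```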

Understanding the perceptron. What's the impact of the update rule on the parameters? The perceptron algorithm will converge if the training data is linearly separable (proof: see A Course in Machine Learning, Ch. 4). Practical issues: How to initialize? When to stop? How to order the training examples?

When to stop? One technique: stop when the accuracy on held-out data starts to decrease (early stopping). This requires splitting the data into 3 sets: training/development/test.
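The early-stopping recipe above can be sketched generically. The two callbacks (`train_step`, `evaluate`) are hypothetical placeholders supplied by the caller: one runs a single training epoch, the other returns accuracy on the held-out development set.

```python
def train_with_early_stopping(train_step, evaluate, max_epochs=50):
    """Early stopping: halt when held-out (development) accuracy stops improving.
    `train_step()` runs one epoch; `evaluate()` returns dev-set accuracy."""
    best_acc, best_epoch = -1.0, 0
    for epoch in range(max_epochs):
        train_step()
        acc = evaluate()
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        else:
            break  # dev accuracy decreased: stop early
    return best_acc, best_epoch
```

In practice one would also save the model parameters at the best epoch, since later epochs may have overfit.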

ML fundamentals aside: overfitting/underfitting/generalization

Training error is not sufficient. We care about generalization to new examples. A classifier can classify training data perfectly, yet classify new examples incorrectly: because training examples are only a sample of the data distribution (a feature might correlate with the class by coincidence), or because training examples could be noisy (e.g., an accident in labeling).

Overfitting. Consider a model θ and its: error rate over the training data, error_train(θ); true error rate over all data, error_true(θ). We say θ overfits the training data if error_train(θ) < error_true(θ).

Evaluating on test data. Problem: we don't know error_true(θ)! Solution: we set aside a test set, some examples that will be used for evaluation. We don't look at them during training! After learning a classifier θ, we calculate error_test(θ).
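Setting aside a test set can be sketched as a random split. This is an illustrative helper, not the lecture's code; it assumes NumPy arrays and a fixed seed for reproducibility.

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Set aside a random test portion that training never looks at."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle example indices
    n_test = int(len(X) * test_frac)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```

error_test is then computed on the held-out portion only, as a proxy for the unknown error_true.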

Overfitting, another way of putting it: a classifier θ is said to overfit the training data if there is another hypothesis θ′ such that θ has a smaller error than θ′ on the training data, but θ has a larger error than θ′ on the test data.

Underfitting/Overfitting. Underfitting: the learning algorithm had the opportunity to learn more from the training data, but didn't. Overfitting: the learning algorithm paid too much attention to idiosyncrasies of the training data; the resulting classifier doesn't generalize.

Back to the Perceptron

Averaged Perceptron improves generalization
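One common way to implement the averaged perceptron is to accumulate the weight vector after every example and return its mean, which smooths out the last few updates and tends to generalize better than the final weights alone. A minimal sketch under those assumptions:

```python
import numpy as np

def averaged_perceptron_train(X, y, n_epochs=10):
    """Averaged perceptron: return the mean of the weight vector over
    training, rather than its final value."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:  # error (or on the boundary)
                w += y_i * x_i
            w_sum += w                     # accumulate after every example
    return w_sum / (n_epochs * len(X))
```

(Efficient implementations avoid the per-example accumulation with a second, lazily updated vector, but the averaged result is the same.)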

What objective/loss does the perceptron optimize? The zero-one loss function. What are the pros and cons compared to the Naïve Bayes loss?

Logistic Regression

Perceptron & Probabilities. What if we want a probability p(y|x)? The perceptron only gives us a prediction y. Let's illustrate this with binary classification. (Illustrations: Graham Neubig)

The logistic function: a softer function than the perceptron's hard threshold; it can account for uncertainty; and it is differentiable.
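The logistic (sigmoid) function squashes any real-valued score into (0, 1), so it can be read as a probability p(y=1|x) = logistic(w·x):

```python
import numpy as np

def logistic(z):
    """The logistic (sigmoid) function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))
```

Its differentiability is what makes gradient-based training possible: its derivative has the convenient closed form logistic(z) * (1 - logistic(z)).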

Logistic regression: how to train? Train based on conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i.

Stochastic gradient ascent (or descent): an online training algorithm for logistic regression and other probabilistic models. Update the weights for every training example, moving in the direction given by the gradient, with the size of the update step scaled by the learning rate.
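The per-example update above can be sketched for binary logistic regression. This is an illustrative implementation, assuming labels in {0, 1}, where the gradient of the conditional log-likelihood for one example is (y - p) * x:

```python
import numpy as np

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=100):
    """Stochastic gradient ascent on the conditional log-likelihood.
    X: (n_examples, n_features); y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            p = 1.0 / (1.0 + np.exp(-np.dot(w, x_i)))  # p(y=1 | x)
            w += lr * (y_i - p) * x_i   # gradient step, scaled by learning rate
    return w
```

Unlike the perceptron, every example updates the weights (by an amount proportional to how wrong the predicted probability is), not just the misclassified ones.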

What you should know: the standard supervised learning set-up for text classification; the difference between train vs. test data, and how to evaluate; 3 examples of supervised linear classifiers (Naïve Bayes, perceptron, logistic regression); learning as optimization: what is the objective function optimized?; the difference between generative vs. discriminative classifiers; smoothing, regularization; overfitting, underfitting.

An online learning algorithm

Perceptron weight update. If y = 1, increase the weights for the features in x. If y = -1, decrease the weights for the features in x.