(Sub)Gradient Descent

(Sub)Gradient Descent
CMSC 422, Marine Carpuat (marine@cs.umd.edu)
Figures credit: Piyush Rai

Logistics
- The midterm is on Thursday 3/24 during class time. Closed book/internet/etc.; one page of notes allowed.
- It will include short questions (similar to quizzes) and 2 problems that require applying what you've learned to new settings.
- Topics: everything up to this week, including linear models and gradient descent, plus the homeworks and project 1.
- The next HW is due on Tuesday 3/22 by 1:30pm.
- Office hours: Tuesday 3/22 after class.
- Please take the survey before the end of break!

What you should know (1)
Decision Trees
- What a decision tree is, and how to induce it from data
Fundamental Machine Learning Concepts
- The difference between memorization and generalization
- What inductive bias is, and what its role in learning is
- What underfitting and overfitting mean
- How to take a task and cast it as a learning problem
- Why you should never ever touch your test data!!

What you should know (2)
New Algorithms
- K-NN classification
- K-means clustering
Fundamental ML concepts
- How to draw decision boundaries
- What decision boundaries tell us about the underlying classifiers
- The difference between supervised and unsupervised learning

What you should know (3)
The perceptron model/algorithm
- What is it? How is it trained? Pros and cons? What guarantees does it offer?
- Why we need to improve it using voting or averaging, and the pros and cons of each solution
Fundamental Machine Learning Concepts
- The difference between online and batch learning
- What error-driven learning is

What you should know (4)
Be aware of practical issues when applying ML techniques to new problems
- How to select an appropriate evaluation metric for imbalanced learning problems
- How to learn from imbalanced data using α-weighted binary classification, and what the error guarantees are

What you should know (5)
- What reductions are and why they are useful
- Implement, analyze, and prove error bounds of algorithms for
  - Weighted binary classification
  - Multiclass classification (OVA, AVA, tree)
- Understand algorithms for
  - Stacking for collective classification
  - Ranking

What you should know (6)
Linear models: an optimization view of machine learning
- Pros and cons of various loss functions
- Pros and cons of various regularizers

Today's topic: how to optimize linear model objectives using gradient descent (and subgradient descent) [CIML Chapter 6]

Casting Linear Classification as an Optimization Problem
- Objective function = loss function + regularizer
- The loss function measures how well the classifier fits the training data
- The regularizer prefers solutions that generalize well
- Indicator function: 1 if (.) is true, 0 otherwise
- This loss function is called the 0-1 loss (reconstructed below)
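The formula on this slide is an image that did not survive transcription. A plausible reconstruction of the regularized 0-1 loss objective, in the notation of CIML Chapter 6 (the names w, b, \lambda, and R are assumed, not taken from the slide):

\min_{w,\, b} \; \sum_{n=1}^{N} \mathbf{1}\big[\, y_n \left( w \cdot x_n + b \right) \le 0 \,\big] \; + \; \lambda \, R(w, b)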

Gradient descent
- A general solution for our optimization problem
- Idea: take iterative steps, updating the parameters in the direction opposite the gradient (the direction of steepest descent)

Gradient descent algorithm
Inputs (from the algorithm box): the objective function to minimize, the number of steps, and the step size
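The algorithm box on this slide is an image; here is a minimal Python sketch of the generic routine it describes (the names gradient_descent, grad_f, eta, and num_steps are illustrative, not from the slides):

import numpy as np

def gradient_descent(grad_f, z0, eta=0.1, num_steps=100):
    """Minimize an objective by gradient descent.

    grad_f    : function returning the gradient of the objective at z
    z0        : initial parameter vector
    eta       : step size
    num_steps : number of update steps
    """
    z = np.asarray(z0, dtype=float)
    for _ in range(num_steps):
        z = z - eta * grad_f(z)  # step opposite the gradient
    return z

# Example: minimize f(z) = (z - 3)^2, whose gradient is 2(z - 3)
z_star = gradient_descent(lambda z: 2 * (z - 3), z0=[0.0])
print(z_star)  # approaches [3.0]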

Illustrating gradient descent in the 1-dimensional case

Gradient Descent: 2 questions
- When to stop?
- How to choose the step size?

Gradient Descent: 2 questions
- When to stop?
  - When the gradient gets close to zero
  - When the objective stops changing much
  - When the parameters stop changing much
  - Early stopping: when performance on a held-out dev set plateaus
- How to choose the step size?
  - Start with large steps, then take smaller steps (one possible schedule is sketched below)
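The slides do not give a concrete schedule; a common choice matching "large steps, then smaller steps" (an assumption, not from the slides) is a decaying step size such as

\eta^{(k)} = \frac{\eta_0}{1 + k}

where \eta_0 is the initial step size and k the step counter; \eta_0 / \sqrt{k + 1} is another standard option.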

Now let's calculate gradients for multivariate objectives
Consider the following learning objective. What do we need to do to run gradient descent?
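The objective on this slide is an image. Based on CIML Chapter 6, which this lecture follows, it is plausibly the exponential-loss objective with an L2 regularizer (a reconstruction, not verbatim from the slide):

\mathcal{L}(w, b) = \sum_{n=1}^{N} \exp\big( -y_n \left( w \cdot x_n + b \right) \big) + \frac{\lambda}{2} \lVert w \rVert^2

To run gradient descent we need \partial \mathcal{L} / \partial b and \nabla_w \mathcal{L}, computed on the next two slides.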

(1) Derivative with respect to b
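The formula is an image in the original. Assuming the exponential-loss objective reconstructed above, the derivative is:

\frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} y_n \exp\big( -y_n \left( w \cdot x_n + b \right) \big)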

(2) Gradient with respect to w
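Under the same assumption, the gradient picks up an extra term from the regularizer:

\nabla_w \mathcal{L} = -\sum_{n=1}^{N} y_n \, x_n \exp\big( -y_n \left( w \cdot x_n + b \right) \big) + \lambda w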

Subgradients
- Problem: some objective functions are not differentiable everywhere (e.g., the hinge loss and the l1 norm)
- Solution: subgradient optimization
- Let's ignore the problem and just try to apply gradient descent anyway!! We will just differentiate by parts
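For reference, the standard definition (not spelled out in the transcribed slide): for a convex function f, a vector g is a subgradient at x if

f(y) \ge f(x) + g \cdot (y - x) \quad \text{for all } y.

Wherever f is differentiable, the gradient is the unique subgradient.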

Example: subgradient of hinge loss
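The worked example is an image in the original. For the hinge loss on example (x_n, y_n),

\ell_{\text{hinge}}(w, b) = \max\big( 0, \; 1 - y_n \left( w \cdot x_n + b \right) \big),

a valid subgradient with respect to w is

g_w = \begin{cases} 0 & \text{if } y_n \left( w \cdot x_n + b \right) \ge 1 \\ -y_n x_n & \text{otherwise,} \end{cases}

and analogously -y_n for b; at the kink y_n (w \cdot x_n + b) = 1, both cases (and anything in between) are valid subgradients.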

Subgradient Descent for Hinge Loss
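The algorithm box is an image in the original. Here is a minimal Python sketch under the assumptions above (hinge loss plus an L2 regularizer on w; the function name and the parameters eta, lam, and num_steps are illustrative, not from the slides):

import numpy as np

def subgradient_descent_hinge(X, y, eta=0.1, lam=0.01, num_steps=100):
    """Minimize sum_n max(0, 1 - y_n (w.x_n + b)) + (lam/2) ||w||^2
    by subgradient descent. X is (N, D); y holds +1/-1 labels."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    for _ in range(num_steps):
        margins = y * (X @ w + b)
        active = margins < 1  # examples where the hinge loss is positive
        # Subgradient: -y_n x_n (and -y_n for b) on active examples, 0 elsewhere
        gw = -(y[active][:, None] * X[active]).sum(axis=0) + lam * w
        gb = -y[active].sum()
        w -= eta * gw
        b -= eta * gb
    return w, b

# Toy usage: two linearly separable points
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])
w, b = subgradient_descent_hinge(X, y)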

Summary
- Gradient descent is a generic algorithm to minimize objective functions
- It works well as long as the functions are well behaved (i.e., convex)
- Subgradient descent can be used at points where the derivative is not defined
- The choice of step size is important
- Optional: can we do better? For some objectives, we can find closed-form solutions (see CIML 6.6)