Multilabel Classification and Deep Learning

Multilabel Classification and Deep Learning Critical Review of RNNs: http://arxiv.org/abs/1506.00019 Learning to Diagnose: http://arxiv.org/abs/1511.03677 Conditional Generative RNNs: http://arxiv.org/abs/1511.03683 Zachary Chase Lipton

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Classifying Multilabel Time Series with RNNs

Supervised Learning General problem: desire a labeling function f : X → Y. ERM principle: choose the model f̂ in hypothesis class H that minimizes loss on the training sample S ∈ (X × Y)^n. Most research assumes the simplest case X = ℝ^d, Y = {0, 1}. Real world much messier

Binary Classification y ∈ {0, 1}

Multiclass Classification y ∈ {c_1, c_2, ..., c_L}

Multilabel Classification y ⊆ {c_1, c_2, ..., c_L}

Why Multilabel? Superset of both BC and MC: BC when L = 1, MC when |y| = 1. Natural for many real problems: clinical diagnosis, predicting purchases, auto-tagging news articles, activity recognition, object detection. Easy to formulate: take L tasks and slap them together

Naive Baseline Binary relevance: separately train L classifiers f_l : X → {0, 1}. Pros: simple to execute, easy to understand, a strong baseline. Cons: computational cost scales with L; leaves some information on the table (correlations between labels)
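
A minimal sketch of binary relevance (assuming scikit-learn; the arrays X and Y are placeholders, not a dataset from the talk): one independent logistic regression per label column.

```python
# Binary relevance: one independent binary classifier per label.
# Assumes X is an (n_samples, n_features) feature matrix and
# Y is an (n_samples, L) 0/1 label matrix (both hypothetical).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_binary_relevance(X, Y):
    """Train L independent classifiers, one per label column of Y."""
    models = []
    for l in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, l])            # each label is its own binary task
        models.append(clf)
    return models

def predict_binary_relevance(models, X):
    """Stack per-label predictions into an (n_samples, L) binary matrix."""
    return np.column_stack([m.predict(X) for m in models])
```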

Challenges Efficiency: develop classifiers whose time and space complexity do not scale with the number of labels. Performance: make use of the extra labels to achieve better accuracy and generalization. Evaluation: how do we evaluate a multilabel classifier's performance across 10s, 100s, 1000s, or even 1M labels?

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Classifying Multilabel Time Series with RNNs

Why not accuracy? Often extreme class imbalance: when a blind classifier gets 99.99% accuracy, it can be optimal to be uninformative. Varying base rates across labels, e.g. in the MeSH dataset: "Human" applies to 71% of articles, "platypus" to < .0001%

F1 Score Easy to calculate from the confusion matrix. Harmonic mean of precision and recall: F1 = 2·tp / (2·tp + fp + fn), where precision = tp / (tp + fp) and recall = tp / (tp + fn)

F1 given fixed base rate

Compared to Accuracy

Expected F1 for Uninformative Classifier

Multilabel Variations Micro F1: calculated over all entries.
            Label 1   Label 2   Label 3   Label 4
Example 1     TP        FP        FN        TN
Example 2     FP        FP        FN        TP
Example 3     FN        TP        FN        FP
Example 4     TN        TP        TP        TN

Macro F1 Macro: F1 calculated separately for each label and then averaged.
            Label 1   Label 2   Label 3   Label 4
Example 1     TP        FP        FN        TN
Example 2     FP        FP        FN        TP
Example 3     FN        TP        FN        FP
Example 4     TN        TP        TP        TN
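
To make the micro/macro distinction concrete, here is a small sketch (assuming scikit-learn; the toy label and prediction matrices are made up, not the table above): micro-F1 pools TP/FP/FN over all (example, label) entries, while macro-F1 computes F1 per label and averages, so rare labels count equally.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel ground truth and predictions (hypothetical values).
Y_true = np.array([[1, 0, 0, 1],
                   [0, 0, 1, 1],
                   [0, 1, 0, 0]])
Y_pred = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 1],
                   [1, 1, 1, 1]])   # a "blind" classifier that predicts every label

print("micro-F1:", f1_score(Y_true, Y_pred, average="micro"))  # pooled over all entries
print("macro-F1:", f1_score(Y_true, Y_pred, average="macro"))  # per-label F1, then averaged
```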

Characterizing the Optimal Threshold The threshold can be expressed in terms of the conditional probabilities of scores given labels. When scores are calibrated probabilities, the optimal threshold is precisely half the F1 score it achieves.

Problems with F1 Sensitive to thresholding strategy: hard to tell who has the best algorithm and who is just smart about thresholding. Micro-F1 is biased towards common labels; macro-F1 is biased against them

Some alternatives Any threshold implies a cost sensitivity: when you know the cost, specify it and use weighted accuracy. AUC exhibits the same dynamic range for every label (a blind classifier gets 0.5, a perfect one gets 1). Macro-averaged AUC scores may give a better sense of performance across all labels. But high AUC for rare labels can be misleading: a classifier can achieve an AUC of .99 and still produce useless results for IR
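
A quick sketch of label-wise and macro-averaged AUC (assuming scikit-learn; the toy label matrix and scores below are made up): AUC is computed per label from real-valued scores, so it is threshold-free, and macro-averaging weights every label equally.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and real-valued scores for 2 labels, 4 examples.
Y_true = np.array([[1, 0],
                   [0, 1],
                   [1, 1],
                   [0, 0]])
Y_scores = np.array([[0.9, 0.7],
                     [0.3, 0.8],
                     [0.7, 0.6],
                     [0.2, 0.1]])

print("per-label AUC:", roc_auc_score(Y_true, Y_scores, average=None))
print("macro-AUC:    ", roc_auc_score(Y_true, Y_scores, average="macro"))
```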

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Classifying Multilabel Time Series with RNNs

The problem With many labels, binary relevance models can be huge and slow: 10k labels × 1M features = 10^10 weights, about 80 GB of parameters at 8 bytes each. We want compact models: fast to train and evaluate, cheap to store

Linear Regression The bulk of computation is label agnostic (computing the inverse (XᵀX)⁻¹): w = (XᵀX)⁻¹ Xᵀ b for a single label, W = (XᵀX)⁻¹ Xᵀ B for all labels at once. Can do this especially fast when we reduce the dimensionality of X via SVD. Problem: unsupervised dimensionality reduction -> lose the signal of rare features -> mess up rare labels
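
A minimal numpy sketch of this shared computation (array names and sizes are made up): the Gram matrix, which involves only X, is formed once and reused for every label column. Solving the linear system is used here instead of forming the explicit inverse, which is the numerically preferred equivalent.

```python
import numpy as np

n, d, L = 1000, 50, 20                 # examples, features, labels (placeholder sizes)
X = np.random.randn(n, d)              # design matrix
B = np.random.randn(n, L)              # one target column per label

gram = X.T @ X                         # X^T X, shared across all labels
XtB = X.T @ B                          # X^T B, one column per label
W = np.linalg.solve(gram, XtB)         # W = (X^T X)^{-1} X^T B, all labels at once
```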

Sparsity For auto-tagging tasks, features are often high-dimensional sparse bags of words or n-grams. Datasets for web-scale information retrieval tasks are large in the number of examples, so SGD is the default optimization procedure. Absent regularization, the gradient is sparse and training is fast; regularization destroys the sparsity of the gradient. When the numbers of features and labels are large, dense stochastic updates are computationally infeasible

Regularization Goals: achieve model sparsity, prevent overfitting. ℓ1 regularization induces sparse models; ℓ2 regularization is thought to achieve more accurate models in practice. Elastic net balances the two

Balancing Regularization with Efficiency To regularize while maintaining efficiency, we can use a lazy updating scheme, first described by Carpenter (2008): for each feature, remember the last time it was nonzero; when a feature is next nonzero at some step t+k, perform a closed-form update covering the skipped steps. We derive lazy updates for elastic net regularization for both standard SGD and FoBoS (Duchi & Singer)
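
As a simplified illustration of lazy updating, here is a sketch that handles only ℓ2 decay with a constant learning rate and logistic loss; the elastic-net and decreasing-rate cases from the talk require the closed-form updates on the next slide. All names and the data format (a dict of feature index to value per example) are assumptions for the sketch, not the talk's implementation.

```python
import numpy as np

def lazy_l2_sgd(examples, n_features, lr=0.1, lam2=1e-4):
    """SGD with l2 regularization where inactive weights are decayed lazily.

    examples: iterable of (x, y) with x a sparse dict {feature_index: value}
    and y in {0, 1}. With constant lr, skipping k regularization-only steps
    is equivalent to multiplying the weight by (1 - lr*lam2)**k once.
    """
    w = np.zeros(n_features)
    last_step = np.zeros(n_features, dtype=int)   # step at which each weight is current
    decay = 1.0 - lr * lam2

    t = 0
    for x, y in examples:
        t += 1
        # Bring only the active weights current: apply all skipped decay steps at once.
        for j in x:
            w[j] *= decay ** (t - 1 - last_step[j])
            last_step[j] = t - 1
        # Logistic-loss gradient step on the active features (includes this step's l2 term).
        margin = sum(w[j] * v for j, v in x.items())
        p = 1.0 / (1.0 + np.exp(-margin))
        for j, v in x.items():
            w[j] -= lr * ((p - y) * v + lam2 * w[j])
            last_step[j] = t
    # Bring every weight current before returning.
    w *= decay ** (t - last_step)
    return w
```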

Lazy Updates for Elastic Net
Theorem 1. To bring the weight w_j current from time j to time k using SGD, the constant-time update is
w_j^(k) = sgn(w_j^(j)) [ |w_j^(j)| · P(k−1)/P(j−1) − λ1 · P(k−1) · (B(k−1) − B(j−1)) ]_+
where P(t) = (1 − η(t) λ2) P(t−1) with base case P(−1) = 1, and B(t) = Σ_{τ=0}^{t} η(τ)/P(τ−1) with base case B(−1) = 0.
Theorem 2. A constant-time lazy update for FoBoS with elastic net regularization and a decreasing learning rate, bringing a weight current at time k from time j, is
w_j^(k) = sgn(w_j^(j)) [ |w_j^(j)| · Φ(k−1)/Φ(j−1) − λ1 · Φ(k−1) · (Ψ(k−1) − Ψ(j−1)) ]_+
where Φ(t) = Φ(t−1)/(1 + η(t) λ2) with base case Φ(−1) = 1, and Ψ(t) = Ψ(t−1) + η(t)/Φ(t) with base case Ψ(−1) = 0.

Empirical Validation On the two largest datasets in the Mulan repository of multilabel datasets, we can train to convergence on a laptop in just minutes. rcv1: 490× speedup; bookmarks: 20× speedup

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Classifying Multilabel Time Series with RNNs

Performance Efficiency is nice, but we'd also like performance. Neural networks can learn shared representations across labels; this both regularizes each label's model and exploits correlations between labels. In extreme multilabel settings, they may use significantly fewer parameters than logistic regression

Neural Network

Training w/ Backpropagation Goal: calculate the derivative of the loss function with respect to each parameter (weight) in the model, then update the weights by following the gradient: w ← w − η · ∂L/∂w

Forward Pass

Backward Pass
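
As a rough stand-in for the forward- and backward-pass slides, here is a minimal numpy sketch for a one-hidden-layer network with sigmoid units and squared loss; the shapes and loss choice are illustrative, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, W2):
    # Forward pass: input -> hidden -> output.
    h = sigmoid(W1 @ x)                               # hidden activations
    y_hat = sigmoid(W2 @ h)                           # output activations
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # Backward pass: chain rule from the loss back to each weight matrix.
    delta_out = (y_hat - y) * y_hat * (1 - y_hat)     # dL/d(output pre-activation)
    dW2 = np.outer(delta_out, h)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)      # dL/d(hidden pre-activation)
    dW1 = np.outer(delta_hid, x)
    return loss, dW1, dW2

# One gradient-following step on random data (placeholder sizes).
rng = np.random.default_rng(0)
W1 = 0.5 * rng.normal(size=(8, 4))
W2 = 0.5 * rng.normal(size=(3, 8))
loss, dW1, dW2 = forward_backward(rng.normal(size=4), np.array([1.0, 0.0, 1.0]), W1, W2)
W1 -= 0.1 * dW1
W2 -= 0.1 * dW2
```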

Multilabel MLP
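
A multilabel MLP typically ends in L independent sigmoid outputs trained with binary cross-entropy, so each output is a per-label probability. A minimal sketch, assuming TensorFlow/Keras and placeholder layer sizes (not the architecture from the talk):

```python
import tensorflow as tf

n_features, n_labels = 10_000, 500     # placeholder dimensions

model = tf.keras.Sequential([
    # Shared hidden representation across all labels.
    tf.keras.layers.Dense(256, activation="relu", input_shape=(n_features,)),
    # One sigmoid unit per label: labels are predicted independently given the shared features.
    tf.keras.layers.Dense(n_labels, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# model.fit(X_train, Y_train, ...) with Y_train an (n_examples, n_labels) 0/1 matrix
```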

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Classifying Multilabel Time Series with RNNs

To Model Sequential Data: Recurrent Neural Networks

Recurrent Net (Unfolded)

LSTM Memory Cell (Hochreiter & Schmidhuber, 1997)

LSTM Forward Pass
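
Written out in common notation (σ is the logistic sigmoid, ⊙ is elementwise multiplication), the standard LSTM forward pass at step t, with input x_t and previous hidden state h_{t-1}, is:

```latex
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ih} h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_{fx} x_t + W_{fh} h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_{ox} x_t + W_{oh} h_{t-1} + b_o) && \text{(output gate)}\\
g_t &= \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g)  && \text{(candidate input)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t         && \text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t)                      && \text{(hidden output)}
\end{aligned}
```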

LSTM (full network)

Unstructured Input

Modeling Problems Examples: 10,401 episodes Features: 13 time series (sensor data, lab tests) Complications: Irregular sampling, missing values, varying-length sequences

How to model sequences? Markov models, Conditional Random Fields. Problem: cannot model long-range dependencies

Simple Formulation

Target Replication

Auxiliary Targets

Results

Outline Introduction to Multilabel Learning Evaluation Efficient Learning & Sparse Models Deep Learning for Multilabel Classification Jointly Learning to Generate and Classify Beer Reviews

RNN Language Model

Past Supervised Approaches relied upon Encoder-Decoder Model

Bridging Long Time Intervals with Concatenated Inputs

Example A.5 FRUIT/VEGETABLE BEER <STR>On tap at the brewpub. A nice dark red color with a nice head that left a lot of lace on the glass. Aroma is of raspberries and chocolate. Not much depth to speak of despite consisting of raspberries. The bourbon is pretty subtle as well. I really don't know that I find a flavor this beer tastes like. I would prefer a little more carbonization to come through. It's pretty drinkable, but I wouldn't mind if this beer was available. <EOS>

Character-based Classification

Love the Strong Hoppy Flavor

Thanks! Contact: zlipton@cs.ucsd.edu zacklipton.com Critical Review of RNNs: http://arxiv.org/abs/1506.00019 Learning to Diagnose: http://arxiv.org/abs/1511.03677 Conditional Generative RNNs: http://arxiv.org/abs/1511.03683