Deep multi-task learning with evolving weights


Deep multi-task learning with evolving weights
ESANN 2016
Soufiane Belharbi, Romain Hérault, Clément Chatelain, Sébastien Adam
soufiane.belharbi@insa-rouen.fr
LITIS lab., DocApp team - INSA de Rouen, France
27 April, 2016

Context: Training deep neural networks
- Deep neural networks are interesting models (complex/hierarchical features, complex mappings) and improve performance.
- Training deep neural networks is difficult: vanishing gradient, more parameters, need for more data.
- Some solutions: the pre-training technique [Y. Bengio et al. 06, G. E. Hinton et al. 06], use of unlabeled data.

Context: Semi-supervised learning
- General case: Data = { labeled data (expensive in money and time, few), unlabeled data (cheap, abundant) }, e.g. medical images.
- Semi-supervised learning: exploit the unlabeled data to improve generalization.

Context: Pre-training and semi-supervised learning
- The pre-training technique can exploit the unlabeled data.
- It is a sequential transfer learning performed in 2 steps:
  1. Unsupervised task (x: labeled and unlabeled data)
  2. Supervised task ((x, y): labeled data)

Pre-training technique and semi-supervised learning: Layer-wise pre-training with auto-encoders
[Figure: a DNN to train, with inputs x_1, ..., x_5 and outputs ŷ_1, ŷ_2.]

Step 1: Unsupervised layer-wise training
- Train layer by layer sequentially, using only x (labeled or unlabeled).
[Figures: each hidden layer is trained in turn as an auto-encoder that reconstructs its own input (x_i -> x̂_i for the first layer, then h_1,i -> ĥ_1,i, h_2,i -> ĥ_2,i, ...).]
At each layer:
- When to stop training?
- What hyper-parameters to use?
- How to make sure that the training improves the supervised task?

Step 2: Supervised training
- Train the whole network on (x, y) with back-propagation.
[Figure: the full DNN with inputs x_1, ..., x_5 and outputs ŷ_1, ŷ_2.]
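As a concrete illustration of the two-step procedure above, here is a minimal, self-contained sketch in PyTorch. The framework, layer sizes, activations, learning rates and epoch counts are assumptions for illustration only, not the authors' exact setup.

# Hypothetical sketch of greedy layer-wise pre-training (step 1) followed by
# supervised fine-tuning (step 2). All sizes and hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x_unlabeled = torch.rand(1000, 784)          # stand-in for unlabeled inputs
x_labeled = torch.rand(100, 784)             # stand-in for labeled inputs
y_labeled = torch.randint(0, 10, (100,))     # stand-in labels

sizes = [784, 256, 128]
hidden = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]
output = nn.Linear(sizes[-1], 10)

# Step 1: train each hidden layer as an auto-encoder, using only x.
h = x_unlabeled
for layer in hidden:
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.SGD(list(layer.parameters()) + list(decoder.parameters()), lr=0.1)
    for _ in range(20):
        recon = decoder(torch.sigmoid(layer(h)))
        loss = F.mse_loss(recon, h)
        opt.zero_grad(); loss.backward(); opt.step()
    h = torch.sigmoid(layer(h)).detach()     # output of this layer feeds the next one

# Step 2: fine-tune the whole stack on (x, y) with back-propagation.
model = nn.Sequential(*[m for l in hidden for m in (l, nn.Sigmoid())], output)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(50):
    loss = F.cross_entropy(model(x_labeled), y_labeled)
    opt.zero_grad(); loss.backward(); opt.step()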

Pre-training technique: Pros and cons
Pros:
- Improves generalization
- Can exploit unlabeled data
- Provides a better initialization than random
- Makes it possible to train deep networks, circumventing the vanishing gradient problem
Cons:
- Adds more hyper-parameters
- No good stopping criterion during the pre-training phase: a good criterion for the unsupervised task may not be good for the supervised task

Proposed solution
- Why is pre-training difficult in practice? Because the transfer learning is sequential.
- Possible solution: parallel transfer learning.
- Why in parallel? Interaction between the tasks, fewer hyper-parameters to tune, a single stopping criterion.

Proposed approach: Parallel transfer learning, tasks combination
Train cost = supervised task + unsupervised (reconstruction) task.
l labeled samples, u unlabeled samples, w_sh: shared parameters.
Reconstruction (auto-encoder) task:
  J_r(D; w' = {w_sh, w_r}) = Σ_{i=1}^{l+u} C_r(R(x_i; w'), x_i).
Supervised task:
  J_s(D; w = {w_sh, w_s}) = Σ_{i=1}^{l} C_s(M(x_i; w), y_i).
Weighted tasks combination:
  J(D; {w_sh, w_s, w_r}) = λ_s J_s(D; {w_sh, w_s}) + λ_r J_r(D; {w_sh, w_r}),
  with λ_s, λ_r ∈ [0, 1] (importance weights) and λ_s + λ_r = 1.
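To make the notation concrete, a minimal PyTorch-style sketch of this weighted two-task cost follows. The framework and the specific losses (MSE for C_r, cross-entropy for C_s) are assumptions; the slides only specify the generic costs.

# Hypothetical sketch of J = λ_s·J_s + λ_r·J_r with a shared encoder (w_sh),
# a classification head (w_s) and a reconstruction head (w_r).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNet(nn.Module):
    def __init__(self, d_in=784, d_hid=256, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())  # w_sh
        self.classifier = nn.Linear(d_hid, n_classes)                       # w_s
        self.decoder = nn.Linear(d_hid, d_in)                               # w_r

    def forward(self, x):
        h = self.encoder(x)
        return self.classifier(h), self.decoder(h)

def combined_cost(model, x_all, x_lab, y_lab, lam_s, lam_r):
    """J = λ_s * J_s (labeled samples only) + λ_r * J_r (labeled + unlabeled)."""
    _, recon = model(x_all)                  # reconstruction task on all samples
    j_r = F.mse_loss(recon, x_all)
    logits, _ = model(x_lab)                 # supervised task on labeled samples
    j_s = F.cross_entropy(logits, y_lab)
    return lam_s * j_s + lam_r * j_r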

Proposed approach: Tasks combination with evolving weights
Weighted tasks combination:
  J(D; {w_sh, w_s, w_r}) = λ_s J_s(D; {w_sh, w_s}) + λ_r J_r(D; {w_sh, w_r}),
  λ_s, λ_r ∈ [0, 1]: importance weights, λ_s + λ_r = 1.
Problems:
- How to fix λ_s, λ_r?
- At the end of the training, only J_s should matter.
Tasks combination with evolving weights (our contribution):
  J(D; {w_sh, w_s, w_r}) = λ_s(t) J_s(D; {w_sh, w_s}) + λ_r(t) J_r(D; {w_sh, w_r}),
  t: learning epoch, λ_s(t), λ_r(t) ∈ [0, 1]: importance weights, λ_s(t) + λ_r(t) = 1.

Proposed approach: Tasks combination with evolving weights
  J(D; {w_sh, w_s, w_r}) = λ_s(t) J_s(D; {w_sh, w_s}) + λ_r(t) J_r(D; {w_sh, w_r}).
Exponential schedule:
  λ_r(t) = exp(-t / σ), σ: slope;
  λ_s(t) = 1 - λ_r(t).
[Figure: λ_r(t) decays from 1 toward 0 and λ_s(t) grows from 0 toward 1 as the training epochs t increase.]
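A minimal plain-Python sketch of this schedule (σ = 40 matches the value reported in the experimental protocol below; the function name is hypothetical):

import math

def evolving_weights(t, sigma=40.0):
    """Exponential schedule: λ_r(t) = exp(-t/σ), λ_s(t) = 1 - λ_r(t)."""
    lam_r = math.exp(-t / sigma)
    return 1.0 - lam_r, lam_r                # (λ_s, λ_r)

# λ_r starts at 1 (reconstruction dominates) and decays toward 0, so that
# only the supervised cost J_s matters at the end of training.
print(evolving_weights(0))                   # (0.0, 1.0)
print(evolving_weights(200))                 # roughly (0.993, 0.007)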

Proposed approach: Tasks combination with evolving weights, optimization
Algorithm 1: Training our model for one epoch
1: D is the shuffled training set, B a mini-batch.
2: for B in D do
3:   Make a gradient step toward J_r using B (update w').
4:   B_s: the labeled examples of B.
5:   Make a gradient step toward J_s using B_s (update w).
6: end for
[R. Caruana 97, J. Weston 08, R. Collobert 08, Z. Zhang 15]
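A hedged sketch of one training epoch following Algorithm 1, reusing the hypothetical MultiTaskNet and evolving_weights helpers sketched above. The alternation (a reconstruction step on the whole mini-batch, then a supervised step on its labeled part) follows the algorithm; scaling each step by the current λ is an interpretation of how the evolving weights enter the gradient steps.

# Sketch of Algorithm 1 for one epoch. `batches` yields (x, y, labeled_mask);
# y is only meaningful where labeled_mask is True.
import torch
import torch.nn.functional as F

def train_one_epoch(model, opt, batches, epoch, sigma=40.0):
    lam_s, lam_r = evolving_weights(epoch, sigma)
    for x, y, labeled_mask in batches:
        # Gradient step toward J_r on the whole mini-batch B (labeled + unlabeled).
        _, recon = model(x)
        loss_r = lam_r * F.mse_loss(recon, x)
        opt.zero_grad(); loss_r.backward(); opt.step()

        # Gradient step toward J_s on the labeled examples B_s of the mini-batch.
        if labeled_mask.any():
            logits, _ = model(x[labeled_mask])
            loss_s = lam_s * F.cross_entropy(logits, y[labeled_mask])
            opt.zero_grad(); loss_s.backward(); opt.step()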

Results: Experimental protocol
Objective: compare training a DNN using different approaches:
- No pre-training (baseline)
- With pre-training (stairs schedule)
- Parallel transfer learning (proposed approach)
Studied evolving-weight schedules: stairs (pre-training), linear until t_1, linear, exponential.
[Figure: λ_r and λ_s as functions of the training epochs t for each schedule.]
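For comparison, the other studied schedules can be sketched in the same way. Their exact forms are assumptions inferred from the schedule figure (stairs behaves like classical pre-training; t_1 = 100 matches the reported setting):

# Hypothetical sketches of the stairs and linear schedules; each returns (λ_s, λ_r).
def stairs(t, t1=100):
    lam_r = 1.0 if t < t1 else 0.0           # pure reconstruction, then pure supervision
    return 1.0 - lam_r, lam_r

def linear_until_t1(t, t1=100):
    lam_r = max(0.0, 1.0 - t / t1)           # linear decay, finished at t_1
    return 1.0 - lam_r, lam_r

def linear(t, total_epochs=5000):
    lam_r = max(0.0, 1.0 - t / total_epochs) # linear decay over the whole training
    return 1.0 - lam_r, lam_r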

Results: Experimental protocol
- Task: classification (MNIST)
- Number of hidden layers K: 1, 2, 3, 4
- Optimization: 5000 epochs, batch size 600
- Options: no regularization, no adaptive learning rate
- Hyper-parameters of the evolving schedules: t_1 = 100, σ = 40
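Wiring the previous sketches together with these reported settings (batch size 600, 5000 epochs, σ = 40) could look as follows; the synthetic data and the optimizer are stand-ins, not the MNIST setup used in the paper.

# Hypothetical driver loop reusing MultiTaskNet, evolving_weights and
# train_one_epoch from the sketches above.
import torch

model = MultiTaskNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.rand(6000, 784)                    # stand-in data, not MNIST
y = torch.randint(0, 10, (6000,))
labeled = torch.arange(6000) % 6 == 0        # e.g. 1000 labeled samples, rest unlabeled

batches = [(x[i:i + 600], y[i:i + 600], labeled[i:i + 600])
           for i in range(0, 6000, 600)]

for epoch in range(5000):
    train_one_epoch(model, opt, batches, epoch, sigma=40.0)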

Results: Shallow networks (K = 1, l = 1E2)
[Figure: evaluation of the evolving-weight schedules with l = 100 labeled samples, K = 1. Classification error on the MNIST test set (%, roughly 28 to 32.5) vs. size of unlabeled data u (0 to 49,900), for baseline, stairs 100, lin 100, lin and exp 40.]

Results: Shallow networks (K = 1, l = 1E3)
[Figure: evaluation of the evolving-weight schedules with l = 1000 labeled samples, K = 1. Classification error on the MNIST test set (%, roughly 11 to 14.5) vs. size of unlabeled data u (0 to 49,900), for baseline, stairs 100, lin 100, lin and exp 40.]

Results: Deep networks, exponential schedule (l = 1E3)
[Figure: evaluation of the exp 40 evolving-weight schedule with l = 1000 labeled samples. Classification error on the MNIST test set (%, roughly 10 to 13) vs. size of unlabeled data u (0 to 49,900), for K = 2, 3, 4.]

Conclusion and perspectives: Conclusion
- An alternative method to pre-training: parallel transfer learning with evolving weights.
- Improves generalization easily.
- Reduces the number of hyper-parameters (t_1, σ).

Conclusion and perspectives: Perspectives
- Evolve the importance weights according to the train/validation error.
- Explore other evolving schedules (toward an automatic schedule).
- Adjust the learning rate: Adadelta, Adagrad, RMSProp.

Questions
Thank you for your attention. Questions?