Transfer Learning Pei-Hao (Eddy) Su 1 and Yingzhen Li 2 1 Dialogue Systems Group and 2 Machine Learning Group January 29, 2015 Transfer Learning 1 / 41

Outline 1 Motivation 2 Historical points 3 Definition 4 Case studies Transfer Learning 2 / 41

Outline 1 Motivation 2 Historical points 3 Definition 4 Case studies Transfer Learning 3 / 41

Standard Supervised Learning Task: Most ML tasks assume that the training and test data are drawn from the same data space and the same distribution. Transfer Learning 4 / 41

NLP tasks: POS tagging, NER, category labelling. Modified from Gao et al.'s presentation at KDD'08. Transfer Learning 5 / 41

Combining the sources gives a better result. Modified from Gao et al.'s presentation at KDD'08. Transfer Learning 6 / 41

Motivation: Traditional ML tasks assume the training and test data are drawn from the same data space and the same distribution. Insufficient labelled data results in poor prediction performance. Lots of (un-)related data already exist from various sources. Starting from scratch is always time-consuming. Transferring knowledge from other sources may help! Transfer Learning 7 / 41

Motivation (Taylor et al., JMLR'09) Transfer Learning 8 / 41

Outline 1 Motivation 2 Historical points 3 Definition 4 Case studies Transfer Learning 9 / 41

Psychology and Education: In 1901, Thorndike and Woodworth explored how individuals transfer learning between contexts that share similar characteristics. In 1992, Perkins and Salomon published Transfer of Learning, which defined different types of transfer. Examples: skill learning: C/C++ → Python; language acquisition: German → English. Transfer Learning 10 / 41

Machine Learning Transfer Learning 11 / 41

Machine Learning: Explanation-Based Neural Network Learning: A Lifelong Learning Approach [Thrun, PhD '95, NIPS '96]. Multitask Learning [Caruana, ICML '93 & '96, PhD '97]. Workshops: Learning to Learn: Knowledge Consolidation and Transfer in Inductive Systems [NIPS '95]; Inductive Transfer: 10 Years Later [NIPS '05]; Structural Knowledge Transfer for Machine Learning [ICML '06]; Transfer Learning for Complex Tasks [AAAI '08]; Lifelong Learning [AAAI '11]; Theoretically Grounded Transfer Learning [ICML '13]; Second Workshop on Transfer and Multi-Task Learning: Theory Meets Practice [NIPS '14];... Transfer Learning 12 / 41

Outline 1 Motivation 2 Historical points 3 Definition 4 Case studies Transfer Learning 13 / 41

Definition. Notation: A domain D consists of (1) a data space X and (2) a marginal distribution P(X) over X. A task T (given D = {X, P(X)}) consists of (1) a label space Y and (2) a predictive function f : X → Y, learned to approximate the underlying conditional P(Y | X). Transfer Learning 14 / 41

Definition. Assume we have only one source S and one target T. Transfer Learning (TL): Given a source domain D_S and learning task T_S, and a target domain D_T and learning task T_T, transfer learning aims to improve the learning of the target predictive function f_T(·) in D_T using the knowledge in D_S and T_S, where D_S ≠ D_T (either X_S ≠ X_T or P_S(X) ≠ P_T(X)) or T_S ≠ T_T (either Y_S ≠ Y_T or P(Y_S | X_S) ≠ P(Y_T | X_T)). Transfer Learning 15 / 41
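Purely as an illustration of the two conditions in this definition (not part of the original slides), a toy Python sketch with hypothetical Domain and Task containers:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Domain:
    feature_space: Any   # the data space X
    marginal: Any        # the marginal distribution P(X)

@dataclass
class Task:
    label_space: Any     # the label space Y
    conditional: Any     # the predictive distribution P(Y | X)

def is_transfer_setting(d_s: Domain, t_s: Task, d_t: Domain, t_t: Task) -> bool:
    """TL applies when the domains differ or the tasks differ."""
    domains_differ = (d_s.feature_space != d_t.feature_space) or (d_s.marginal != d_t.marginal)
    tasks_differ = (t_s.label_space != t_t.label_space) or (t_s.conditional != t_t.conditional)
    return domains_differ or tasks_differ
```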

Example: Category labelling Transfer Learning 16 / 41

Example: Category labelling Transfer Learning 17 / 41

Example: Category labelling Transfer Learning 18 / 41

ML vs. TL (Langley '06, Yang et al. '13) Transfer Learning 19 / 41

Outline 1 Motivation 2 Historical points 3 Definition 4 Case studies Transfer Learning 20 / 41

Transfer in practice: The rest of the talk gives an intuition, with examples, of when to transfer, what to transfer, and how to transfer. Transfer Learning 21 / 41

When to transfer: Domain relatedness. Transfer learning is applicable when the domains are related. Standard machine learning assumes source = target. Transferring knowledge from an unrelated domain can be harmful: negative transfer [Rosenstein et al., NIPS'05 Workshop]. Ben-David et al. proposed a bound on the target-domain error. Reference: Ben-David et al. Analysis of Representations for Domain Adaptation. NIPS'06. Transfer Learning 22 / 41

When to transfer (Ben-David et al.): In a standard binary classification supervised learning task, given X, Y = {0, 1} and samples from P(x, y), we aim to learn f : X → [0, 1] which captures P(y | x). Often we decompose the problem into: (1) determine a feature mapping Φ : X → Z; (2) learn a hypothesis h : Z → {0, 1} on the dataset {(Φ(x), y)}.
In the transfer learning scenario:
Theorem (simplified version of Thm. 1 & 2). Given X = X_S = X_T and P_S(x), P_T(x) the distributions of the source and target domain, let Φ : X → Z be a fixed mapping function and H a hypothesis space. For any hypothesis h ∈ H trained on the source domain,
ε_T(h) ≤ ε_S(h) + d_H(P̃_S, P̃_T) + ε_S(h*) + ε_T(h*),
where P̃_S, P̃_T are the distributions induced on Z by P_S and P_T, and h* = argmin_{h∈H} (ε_S(h) + ε_T(h)) is the best hypothesis obtainable by joint training. Transfer Learning 23 / 41
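The divergence term d_H is usually estimated empirically; a common surrogate, the "proxy A-distance" quoted on the EA++ results slide below, trains a classifier to tell source samples from target samples and sets the distance to 2(1 − 2·err). A minimal sketch, assuming scikit-learn and placeholder feature matrices Xs, Xt:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def proxy_a_distance(Xs, Xt):
    """Estimate the proxy A-distance 2*(1 - 2*err), where err is the
    held-out error of a linear classifier separating source from target."""
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])  # domain labels
    X_tr, X_te, d_tr, d_te = train_test_split(X, d, test_size=0.5, random_state=0)
    clf = LinearSVC().fit(X_tr, d_tr)
    err = 1.0 - clf.score(X_te, d_te)
    return 2.0 * (1.0 - 2.0 * err)
```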

Domain adaptation. Approach 1: mixture of general and specific components. Can we learn hypotheses for both the general and the specific components? References: Daume III. Frustratingly Easy Domain Adaptation. ACL'07. Daume III et al. Co-regularization Based Semi-supervised Domain Adaptation. NIPS'10. Transfer Learning 24 / 41

EasyAdapt (Daume III). Binary classification problem: X_S = X_T = R^d, Y_S = Y_T = {−1, +1}. Goal: obtain a classifier f_T : X_T → Y_T; in the SVM context, learn a hypothesis h_T ∈ R^d. However: too little training data is available on (X_T, Y_T) for robust training; moreover P(x_S) ≠ P(x_T) and P(x_S, y_S) ≠ P(x_T, y_T), so directly applying a trained hypothesis h_S gives bad results. How can we use (x_S, y_S) ∼ P(x_S, y_S) to improve the learning of h_T? Transfer Learning 25 / 41

EasyAdapt (Daume III). EasyAdapt algorithm: define two mappings Φ_S, Φ_T : R^d → R^{3d}: Φ_S(x_S) = (x_S, x_S, 0) and Φ_T(x_T) = (x_T, 0, x_T). Training: learn a hypothesis h = (w_g, w_s, w_t) ∈ R^{3d} on the transformed dataset {(Φ_S(x_S), y_S)} ∪ {(Φ_T(x_T), y_T)}. Test: apply h_T = w_g + w_t to x_T (and h_S = w_g + w_s to x_S). Transfer Learning 26 / 41
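A minimal sketch of this augmentation in Python (assuming scikit-learn; the array names are placeholders): the only algorithm-specific part is the feature mapping, after which any linear classifier can be trained on the combined data.

```python
import numpy as np
from sklearn.svm import LinearSVC

def phi_source(X):   # Phi_S(x) = (x, x, 0)
    return np.hstack([X, X, np.zeros_like(X)])

def phi_target(X):   # Phi_T(x) = (x, 0, x)
    return np.hstack([X, np.zeros_like(X), X])

def easy_adapt_fit(Xs, ys, Xt, yt):
    """Xs, ys: labelled source data; Xt, yt: (few) labelled target data."""
    X_aug = np.vstack([phi_source(Xs), phi_target(Xt)])
    y_aug = np.concatenate([ys, yt])
    return LinearSVC().fit(X_aug, y_aug)   # learns h = (w_g, w_s, w_t)

def predict_target(clf, Xt_new):
    # target points are mapped with phi_target, so the prediction uses w_g + w_t
    return clf.predict(phi_target(Xt_new))
```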

EA++ (Daume III et al.). Use unlabelled data to improve training: we want h_S and h_T to agree on unlabelled data x_U, i.e. h_S · x_U = h_T · x_U ⟺ w_s · x_U = w_t · x_U ⟺ h · (0, x_U, −x_U) = 0. So we define a mapping Φ_U : R^d → R^{3d} for unlabelled data, Φ_U(x_U) = (0, x_U, −x_U), and train the hypothesis h on the augmented and transformed dataset {(Φ_S(x_S), y_S)} ∪ {(Φ_T(x_T), y_T)} ∪ {(Φ_U(x_U), 0)}. Transfer Learning 27 / 41
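Continuing the sketch above, the unlabelled augmentation simply adds rows with target 0. With the hinge loss this requires the modified loss from the paper, so purely as an illustration the snippet below uses a ridge (squared-loss) learner, for which a zero-target row directly penalises (w_s − w_t) · x_U; phi_source and phi_target are reused from the EasyAdapt sketch above, and all names are placeholders.

```python
import numpy as np
from sklearn.linear_model import Ridge

def phi_unlabelled(X):   # Phi_U(x) = (0, x, -x)
    return np.hstack([np.zeros_like(X), X, -X])

def ea_plus_plus_fit(Xs, ys, Xt, yt, Xu):
    """Xu holds unlabelled target points; labels ys, yt are in {-1, +1}."""
    X_aug = np.vstack([phi_source(Xs), phi_target(Xt), phi_unlabelled(Xu)])
    y_aug = np.concatenate([ys, yt, np.zeros(len(Xu))])  # 0-targets encode the agreement constraint
    return Ridge(alpha=1.0).fit(X_aug, y_aug)  # predict with sign(model.predict(phi_target(x)))
```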

EA++ (Daume III et al.). Results for two domain pairs: (a) DVD and BOOKS (proxy A-distance = 0.7616), (b) KITCHEN and APPAREL (proxy A-distance = 0.0459). SOURCEONLY / TARGETONLY(-FULL): trained on the source / target (full) labelled samples. ALL: trained on the combined labelled samples. EA / EA++: trained in the augmented feature space (EA++ also uses unlabelled target data). Transfer Learning 28 / 41

Feature transfer. Approach 2: shared lower-level features. A DNN's first layer learns Gabor filters or colour blobs when trained on images; do instances in the source and target domains share the same lower-level features? Reference: Yosinski et al. How transferable are features in deep neural networks? NIPS'14. Transfer Learning 29 / 41
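In practice, this kind of feature transfer is often done by copying the lower layers of a network trained on the source task and retraining only the top layers on the target task. A minimal PyTorch sketch under that assumption (the ResNet backbone, ImageNet weights, and class count are placeholders, not the setup used in the paper):

```python
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # placeholder for the target task

# backbone pretrained on the source task (here: ImageNet weights shipped with torchvision)
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# freeze the transferred lower-level features
for p in model.parameters():
    p.requires_grad = False

# replace the top layer and train only it on the target data
model.fc = nn.Linear(model.fc.in_features, num_target_classes)
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
# ...a standard training loop on the target dataset follows
```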

Feature transfer¹. Lee et al. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. ICML'09. ¹ Adapted from Ruslan Salakhutdinov's tutorial at MLSS'14, Beijing. Transfer Learning 30 / 41

Feature transfer (Yosinski et al.) Transfer Learning 31 / 41

Feature transfer (Yosinski et al.) Test 1 (similar datasets): random A/B splits of the ImageNet dataset (similar source and target domain training/testing instances) Transfer Learning 32 / 41

Feature transfer (Yosinski et al.) Test 2 (very different datasets): man-made/natural object split (dissimilar source and target domain training/testing instances) Transfer Learning 33 / 41

Joint representation. Approach 3: joint feature representation. Data have many domain-specific characteristics, but might be related at a high level; our brain might work like this as well. Reference: Srivastava and Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. NIPS'12, JMLR 15 (2014). Transfer Learning 34 / 41

Joint representation (Srivastava et al.). MIR Flickr Dataset: http://press.liacs.nl/mirflickr/. For images: 1M data points; 25K labelled instances in 38 classes (10K for training, 5K for validation, 10K for testing); inputs are the concatenation of PHOW and MPEG-7 features. For texts: word-count vectors over the 2K most frequently used tags (very sparse); 18% of the training images have missing text. Transfer Learning 35 / 41

Joint representation (Srivastava et al.). For images: a 2-layer deep Boltzmann machine (DBM) with Gaussian input units (v_mi ∈ R; abbreviate W_m^(k)(i, j) as W_ij^(k)):
P(v_m, h_m^(1), h_m^(2)) ∝ exp( −Σ_i (v_mi − b_i)² / (2σ_i²) + Σ_{i,j} (v_mi / σ_i) W_ij^(1) h_mj^(1) + Σ_{j,l} h_mj^(1) W_jl^(2) h_ml^(2) )
Transfer Learning 36 / 41

Joint representation (Srivastava et al.). For texts: a 2-layer DBM with a replicated softmax model (v_ti counts the occurrences of word i; abbreviate W_t^(k)(i, j) as W_ij^(k)):
P(v_t, h_t^(1), h_t^(2)) ∝ exp( Σ_i v_ti b_i + Σ_{i,j} v_ti W_ij^(1) h_tj^(1) + Σ_{j,l} h_tj^(1) W_jl^(2) h_tl^(2) )
Transfer Learning 36 / 41

Joint representation (Srivastava et al.). Combining the domain-specific models into a multimodal DBM:
P(v_m, v_t, h; θ) ∝ exp( −E(h_m^(2), h_t^(2), h^(3)) − E(v_m, h_m^(1), h_m^(2)) − E(v_t, h_t^(1), h_t^(2)) )
Transfer Learning 36 / 41
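To make the exponent of the image pathway concrete, here is a small numpy sketch (shapes and names are placeholders) that evaluates the unnormalized log-probability defined above:

```python
import numpy as np

def image_pathway_log_potential(v_m, h1, h2, b, sigma, W1, W2):
    """Unnormalized log-probability (the exponent above) of the Gaussian-input
    image DBM: v_m in R^D, h1 in {0,1}^J, h2 in {0,1}^L, W1 (D x J), W2 (J x L)."""
    gauss_term = -np.sum((v_m - b) ** 2 / (2.0 * sigma ** 2))
    vis_hid = (v_m / sigma) @ W1 @ h1   # sum_{i,j} (v_i / sigma_i) W1_ij h1_j
    hid_hid = h1 @ W2 @ h2              # sum_{j,l} h1_j W2_jl h2_l
    return gauss_term + vis_hid + hid_hid
```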

Joint representation (Srivastava et al.). Training: first pre-train the domain-specific DBMs with contrastive divergence (CD), then jointly train the multimodal model with persistent CD (PCD); use a mean-field variational approximation when computing the data-driven hidden-unit moments. Transfer Learning 36 / 41
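As a rough sketch of the pre-training step, a CD-1 update for a single binary RBM layer could look as follows (the paper pre-trains modified RBMs for a DBM, so this only conveys the basic idea):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_vis, b_hid, lr=0.01, rng=None):
    """One CD-1 step for a binary RBM on a data batch V of shape (N, D)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # positive phase: hidden activations driven by the data
    p_h = sigmoid(V @ W + b_hid)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # negative phase: one Gibbs step back to the visibles and up again
    p_v = sigmoid(h @ W.T + b_vis)
    p_h_neg = sigmoid(p_v @ W + b_hid)
    # gradient estimates and parameter updates
    n = V.shape[0]
    W += lr * (V.T @ p_h - p_v.T @ p_h_neg) / n
    b_vis += lr * (V - p_v).mean(axis=0)
    b_hid += lr * (p_h - p_h_neg).mean(axis=0)
    return W, b_vis, b_hid
```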

Joint representation (Srivastava et al.). Results. Figure: classification with data from both the image and text domains. Figure: classification with data from the image domain only. Transfer Learning 37 / 41

Joint representation (Srivastava et al.) Results: Figure: Retrieval results for multi/image domain queries Transfer Learning 37 / 41

Conclusions. In this talk, we showed that transfer learning adapts knowledge from other sources to improve target-task performance, and that domains can be related to each other in different ways. In the future: managing large-scale data that do not lack in size but may lack in quality; managing data that may continuously change over time. Transfer Learning 38 / 41

Open Questions²: What are the limits of existing multi-task learning methods when the number of tasks grows while each task is described by only a small number of samples ("big T, small n")? What is the right way to leverage noisy data gathered from the Internet as a reference for a new task? How can an automatic system process a continuous stream of information over time and progressively adapt for life-long learning? Can deep learning help to learn the right representation (e.g., a task-similarity matrix) in kernel-based transfer and multi-task learning? How can similarities across languages help us adapt to different domains in natural language processing tasks?... ² nips.cc/conferences/2014/program/event.php?id=4282 Transfer Learning 39 / 41

Thank you Transfer Learning 40 / 41

References
1. Pan and Yang. A Survey on Transfer Learning. IEEE TKDE 2010.
2. Pan and Yang. Transfer Learning. MLSS 2011.
3. Taylor et al. Transfer Learning for Reinforcement Learning Domains: A Survey. JMLR 2009.
4. Langley. Transfer of Learning in Cognitive Systems. ICML 2006.
5. Perkins et al. Transfer of Learning. IEE 1992.
6. Thrun. Explanation-Based Neural Network Learning: A Lifelong Learning Approach. PhD thesis, 1995.
7. Caruana. Multitask Learning. PhD thesis, 1997.
8. Ben-David et al. Analysis of Representations for Domain Adaptation. NIPS 2006.
9. Daume III. Frustratingly Easy Domain Adaptation. ACL 2007.
10. Daume III et al. Co-regularization Based Semi-supervised Domain Adaptation. NIPS 2010.
11. Yosinski et al. How Transferable Are Features in Deep Neural Networks? NIPS 2014.
12. Lee et al. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations. ICML 2009.
13. Srivastava and Salakhutdinov. Multimodal Learning with Deep Boltzmann Machines. NIPS 2012.
Transfer Learning 41 / 41