Curriculum Learning. Yoshua Bengio, U. Montreal; Jérôme Louradour, A2iA; Ronan Collobert, Jason Weston, NEC. ICML, June 16th, 2009, Montreal. Acknowledgment: Myriam Côté.

Curriculum Learning. Guided learning helps training humans and animals: shaping, education. Start from simpler examples / easier tasks (Piaget 1952, Skinner 1958).

The Dogma in question. It is best to learn from a training set of examples sampled from the same distribution as the test set. Really?

Question: Can machine learning algorithms benefit from a curriculum strategy? Cognition journal: (Elman 1993) vs (Rohde & Plaut 1999); (Krueger & Dayan 2009).

Convex vs Non-Convex Criteria. Convex criteria: the order of presentation of examples should not matter to the convergence point, but could influence convergence speed. Non-convex criteria: the order and selection of examples could yield a better local minimum.

Deep Architectures. Theoretical arguments: deep architectures can be exponentially more compact than shallow ones representing the same function. Cognitive and neuroscience arguments. Many local minima. Guiding the optimization by unsupervised pre-training yields much better local minima, otherwise not reachable. Good candidate for testing curriculum ideas.

Deep Training Trajectories (Erhan et al., AISTATS 2009). Figure: training trajectories with random initialization vs. unsupervised guidance.

Starting from Easy Examples. Figure: a progression from the easiest examples / lower-level abstractions (1, 2) to the most difficult examples / higher-level abstractions (3).

Continuation Methods. Figure: start from a smoothed objective whose minimum is easy to find, then track local minima through gradually less smoothed objectives until reaching the final solution.

Curriculum Learning as Continuation. A sequence of training distributions, from the easiest examples / lower-level abstractions to the most difficult examples / higher-level abstractions. Initially peaked on easier / simpler examples; gradually give more weight to more difficult ones until reaching the target distribution.
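One way to picture this reweighting is the small sampling sketch below; the per-example difficulty scores, the hard threshold, and the lambda schedule are illustrative assumptions, not the method used in the experiments.

```python
import numpy as np

# Sampling view of the continuation idea: draw training examples from a
# reweighted distribution that gradually moves toward the target one.
# The difficulty scores, hard threshold, and lambda schedule are illustrative
# assumptions, not the method used in the experiments.
rng = np.random.default_rng(0)
difficulty = rng.uniform(size=1000)         # hypothetical per-example difficulty in [0, 1]

for lam in (0.25, 0.5, 0.75, 1.0):          # lam = 1 reaches the target distribution
    weights = np.where(difficulty <= lam, 1.0, 0.0)
    probs = weights / weights.sum()
    batch = rng.choice(len(difficulty), size=128, p=probs)
    # ... train on the examples indexed by `batch`
```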

How to order examples? The right order is not known. Three series of experiments (a sketch of the first ordering follows below):
1. Toy experiments with a simple order: larger margin first; less noisy inputs first
2. Simpler shapes first, more varied ones later
3. Smaller vocabulary first
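For the "larger margin first" ordering, here is a minimal sketch on synthetic linearly separable data; the data generator, the margin defined through a known separator, and the growth schedule are illustrative assumptions rather than the exact toy setup from the talk.

```python
import numpy as np

# "Larger margin first" on toy linearly separable data (illustrative sketch).
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)                  # hypothetical true separator
X = rng.normal(size=(1000, 10))
y = np.sign(X @ w_true)

margin = y * (X @ w_true)                     # larger margin = easier example
order = np.argsort(-margin)                   # easiest (largest margin) first

# Present examples easy-to-hard, growing the training pool over time.
for frac in (0.25, 0.5, 1.0):
    subset = order[: int(frac * len(order))]
    X_cur, y_cur = X[subset], y[subset]
    # ... run a few epochs of the learner (e.g. a perceptron) on (X_cur, y_cur)
```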

Larger Margin First: Faster Convergence

Cleaner First: Faster Convergence

Shape Recognition. First: easier, basic shapes. Second = target: more varied geometric shapes.

Shape Recognition Experiment. A 3-hidden-layer deep net known to involve local minima (unsupervised pre-training finds much better solutions). 10,000 training / 5,000 validation / 5,000 test examples. Procedure (sketched in code below):
1. Train for k epochs on the easier shapes
2. Switch to the target training set (more variations)
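A minimal sketch of this two-step procedure, assuming placeholder names (train_one_epoch, easy_shapes, target_shapes, n_total) that are not from the talk.

```python
# Two-stage shape-recognition curriculum: k epochs on the easier "basic shapes"
# set, then switch to the target set with more variations.
# train_one_epoch, easy_shapes, target_shapes, k and n_total are placeholders.
def curriculum_training(model, easy_shapes, target_shapes, k, n_total, train_one_epoch):
    for epoch in range(n_total):
        data = easy_shapes if epoch < k else target_shapes  # switch at epoch k
        train_one_epoch(model, data)
    return model
```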

Shape Recognition Results. Figure: performance as a function of the switch epoch k.

Language Modeling Experiment. Objective: compute the score of the next word given the previous ones (ranking criterion). Architecture of the deep neural network (Bengio et al. 2001; Collobert & Weston 2008).

Language Modeling Results. Gradually increase the vocabulary size (visible as dips in the curve). Train on Wikipedia, keeping only sentences containing only words in the current vocabulary (sketched below).
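A sketch of this vocabulary-growth curriculum: at each stage, keep only sentences whose words all belong to the current (growing) vocabulary. The stage sizes and helper names (sentences, words_by_frequency) are placeholders, not the exact schedule from the experiments.

```python
# Vocabulary-growth curriculum for language modeling: at each stage, keep only
# sentences whose words are all in the current (growing) vocabulary.
# `sentences` are lists of words; `words_by_frequency` is sorted by frequency.
def vocabulary_stages(sentences, words_by_frequency, sizes=(5000, 10000, 20000)):
    stages = []
    for size in sizes:
        vocab = set(words_by_frequency[:size])
        stages.append([s for s in sentences if all(w in vocab for w in s)])
    return stages   # train on each stage in turn before moving to the next
```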

Conclusion. Yes, machine learning algorithms can benefit from a curriculum strategy.

Why?
- Faster convergence to a minimum: wasting less time on noisy or harder-to-predict examples.
- Convergence to better local minima: curriculum = a particular continuation method that finds better local minima of a non-convex training criterion; acts like a regularizer, with its main effect on the test set.

Perspectives. How could we define better curriculum strategies? We should try to understand the general principles that make some curricula work better than others, such as emphasizing harder examples and riding on the frontier.

THANK YOU! Questions? Comments?

Training Criterion: Ranking Words

C_s = \frac{1}{|D|} \sum_{w \in D} \max\left(0,\; 1 - f(s) + f(s_w)\right)

with s a word sequence, f(s) the score of the next word given the previous ones, w a word of the vocabulary, s_w the sequence s with its last word replaced by w, and D the considered word vocabulary.
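A minimal sketch of this criterion in code, assuming a scoring function playing the role of f and a helper that builds s_w by replacing the last word of s; both names (score, replace_last_word) are placeholders, not from the talk.

```python
# Pairwise ranking criterion C_s = (1/|D|) * sum_{w in D} max(0, 1 - f(s) + f(s_w)).
# `score` plays the role of f and `replace_last_word` builds s_w; both are placeholders.
def ranking_cost(s, vocabulary, score, replace_last_word):
    total = 0.0
    for w in vocabulary:
        s_w = replace_last_word(s, w)   # sequence s with its last word replaced by w
        total += max(0.0, 1.0 - score(s) + score(s_w))
    return total / len(vocabulary)
```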

Curriculum = Continuation Method? Examples z drawn from P(z) are weighted by 0 \le W_\lambda(z) \le 1. The sequence of distributions Q_\lambda(z) \propto W_\lambda(z) P(z) is called a curriculum if:
- the entropy of these distributions increases (larger domain): H(Q_\lambda) < H(Q_{\lambda+\epsilon}) for all \epsilon > 0
- W_\lambda(z) is monotonically increasing in \lambda: W_{\lambda+\epsilon}(z) \ge W_\lambda(z) for all z and all \epsilon > 0
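As an added illustration (not part of the slides), one simple weighting scheme compatible with this definition splits the examples into an "easy" set E and its complement, and lets the hard examples in gradually as \lambda grows:

```latex
% Illustrative weighting (an assumption for exposition, not from the slides).
W_\lambda(z) =
  \begin{cases}
    1       & \text{if } z \in E \quad (\text{easy examples}) \\
    \lambda & \text{otherwise}
  \end{cases}
\qquad
Q_\lambda(z) \propto W_\lambda(z)\, P(z), \qquad \lambda \in [0,1].
% W_\lambda(z) is non-decreasing in \lambda for every z, and Q_1 = P recovers
% the target distribution; the entropy condition H(Q_\lambda) < H(Q_{\lambda+\epsilon})
% then holds in the typical case where the hard examples add diversity to E.
```

The two-step shape-recognition experiment above can be read as the limiting case of this scheme, where the weight on the harder shapes jumps from 0 to 1 at the switch epoch k.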