Supervised Learning with Neural Networks and Machine Translation with LSTMs


Supervised Learning with Neural Networks and Machine Translation with LSTMs
Ilya Sutskever, in collaboration with: Minh-Thang Luong, Quoc Le, Oriol Vinyals, Wojciech Zaremba
Google Brain

Deep Neural Networks
1. Can perform an astonishingly wide range of computations
2. Can be learned automatically
[Diagram: powerful models ∩ learnable models = deep neural networks]

Powerful models are necessary
A weak model will never get good performance.
Examples of weak models:
- Single-layer logistic regression
- Linear SVM
- Small neural nets
- Small conv nets
A neural network needs to be large and deep to be powerful.
[Diagram: powerful models ∩ learnable models = deep neural networks]

Why are deep nets powerful?
A single neuron can implement boolean logic, and thus general computation and computers:
- OR: weights +1, +1, bias -0.5
- AND: weights +1, +1, bias -1.5
- NOT: weight -1, bias +0.5
A mid-sized neural network with 2 hidden layers can sort N N-bit numbers.
Intuitively, sorting requires log N parallel steps.
Backpropagation can find this circuit.
Neurons are more economical than boolean logic.
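To make the boolean-logic claim concrete, here is a minimal sketch (my own illustration, not from the talk) of a single threshold neuron computing OR, AND, and NOT with the weights and biases shown on the slide:

```python
# A single threshold neuron: it fires (outputs 1) when the weighted sum of its
# inputs plus a bias is positive. With the weights from the slide it reproduces
# the basic boolean gates, so networks of neurons can implement arbitrary
# boolean circuits.

def neuron(inputs, weights, bias):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def OR(a, b):
    return neuron([a, b], weights=[+1, +1], bias=-0.5)

def AND(a, b):
    return neuron([a, b], weights=[+1, +1], bias=-1.5)

def NOT(a):
    return neuron([a], weights=[-1], bias=+0.5)

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "OR:", OR(a, b), "AND:", AND(a, b), "NOT a:", NOT(a))
```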

The Deep Learning Hypothesis
Human perception is fast: neurons fire at most 100 times a second, and humans solve perception in 0.1 seconds, so our neurons fire at most 10 times.
Anything a human can do in 0.1 seconds, a big 10-layer neural network can do, too!
20-layer neural networks can be trained well in practice; two years ago we could only train 10-layer networks.

Implication
DNNs, once trained, should do well on all perception problems: vision, speech, emotion, face recognition, instinct.
If there exists a human expert who can solve a hard problem in a fraction of a second, then a large deep neural network could do so, too:
- Instantaneous translation
- Speed reading
- Identifying the obvious thing to do in a complicated situation or a game

Learning
Powerful models are useless unless we can train them.
Supervised backpropagation works! It is not clear why.
20-layer neural nets are easily trainable with backprop.
[Diagram: powerful models ∩ learnable models = deep neural networks]

Learning Algorithm
While not done:
- Pick an example (x, y)
- Run the network on x to get a prediction p
- Use the gradient to bring p slightly closer to y
Theory must make nontrivial assumptions about the dataset.
[Diagram: powerful models ∩ learnable models = deep neural networks]
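The loop above is just stochastic gradient descent. As a deliberately tiny illustration, here is a hedged sketch of that loop for logistic regression with NumPy; the data, learning rate, and model are my own placeholders, not the talk's:

```python
import numpy as np

# Stochastic gradient descent in the form described on the slide:
# repeatedly pick an example, run the model, and nudge the parameters
# so the prediction moves slightly closer to the target.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                       # toy inputs
y = (X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) > 0).astype(float)     # toy targets

w = np.zeros(5)
b = 0.0
lr = 0.1

for step in range(10000):
    i = rng.integers(len(X))                    # pick an example (x, y)
    x_i, y_i = X[i], y[i]
    p = 1.0 / (1.0 + np.exp(-(x_i @ w + b)))    # run the network on x to get a prediction p
    grad = p - y_i                              # gradient of the log loss w.r.t. the logit
    w -= lr * grad * x_i                        # bring p slightly closer to y
    b -= lr * grad

accuracy = np.mean(((1 / (1 + np.exp(-(X @ w + b)))) > 0.5) == y)
print("training accuracy:", accuracy)
```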

How to solve hard problems?
- Use a lot of good AND labelled training data
- Use a big deep neural network
Success is the only possible outcome.
[Diagram: powerful models ∩ learnable models = deep neural networks]

The deep learning hypothesis is true!
Not a controversial statement. Big deep nets get the best results ever on:
- Speech recognition
- Object recognition
- Language modelling

Summary
Big deep nets with 10-20 layers can do great things.
Supervised backpropagation can train 10-20-layer nets.
Ergo: we can do a whole lot if we have a large, good supervised training set.

Deep nets can't solve all problems???

Deep nets can't solve all problems
DNNs couldn't solve problems where the input and the output are very structured:
- Sequence to sequence
- Graph to graph
So far, we've addressed the most basic form of supervised learning.

Key limitation
Inputs and outputs must be fixed-size vectors.
Great for images: the input is a big image of a fixed size, and the output is a 1-of-N encoding of the category.
Bad news for machine translation, question answering, squiggle recognition.
The enemy: unit-specific connections between input and output.

Our contribution: solving the sequence-to-sequence problem
It's a fundamental capability.
Applications: MT, Q&A, ASR, squiggle recognition.
Nice feature: our approach has minimal innovation.
We demonstrate that the approach is viable by matching state-of-the-art results on machine translation.
State of the art in MT is strong, so the approach should do well on many other tasks too.

Recurrent Neural Networks (RNNs)
RNNs can work with sequences.
[Diagram: an RNN unrolled over time, t=1 to t=6; each timestep has an input, hidden, and output layer]
Key idea: each timestep has a layer with the same weights.
Problem is solved.

Recurrent Neural Networks (RNNs)
RNNs are neural networks that can process sequences well.
- Very expressive models
- Backpropagation is applicable
- Fun fact: recurrent neural networks were trained in the original backpropagation paper in 1986
- They have trouble learning long-term dependencies: the vanishing gradient problem (Hochreiter, 1991; Bengio et al., 1994)
- There are ways to train RNNs, but they are complicated
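For concreteness, here is a minimal NumPy sketch (my own, with made-up sizes) of the vanilla RNN recurrence the slides describe; the repeated multiplication by the same recurrent matrix is what makes gradients vanish or explode over long sequences:

```python
import numpy as np

# Vanilla RNN: the same weights are applied at every timestep.
# h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 8, 16, 20

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

xs = rng.normal(size=(seq_len, input_dim))     # a toy input sequence
h = np.zeros(hidden_dim)                       # initial hidden state

for x_t in xs:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)     # the hidden state is overwritten each step

print("final hidden state:", h[:4], "...")
# Backpropagating through this loop multiplies gradients by W_hh (and tanh
# derivatives) once per timestep, which is why long-range dependencies are
# hard to learn: the product tends to shrink (vanish) or blow up (explode).
```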

Long Short-Term Memory (LSTM)
Modify / hack the RNN architecture so that the vanishing gradient problem goes away, without sacrificing expressive power.
A model that achieves this purpose will be useful.

Long Short-Term Memory (LSTM)
RNNs overwrite the hidden state; LSTMs add to the hidden state.
Addition has nice gradients: all terms in a sum get a nice gradient.
The LSTM is good at noticing long-range correlations because it uses sums instead of overwriting.
Main advantage: it requires little tuning. Hugely important.

RNN
[Diagram: unrolled RNN over t=1 to t=6; at each timestep the hidden state is recomputed from the input and the previous hidden state, overwriting it]

LSTM
[Diagram: unrolled LSTM over t=1 to t=6; at each timestep new information is added (+) into the running state rather than overwriting it]

The heart of the LSTM: the memory cell
[Diagram: the memory cell M is updated additively; input, forget, and output gates (I, F, O) multiplicatively control what is written to, kept in, and read from the cell]
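As a concrete reference for the cell above, here is a hedged NumPy sketch of a standard LSTM step (the usual formulation with input, forget, and output gates; the exact gate wiring in the talk's diagram may differ in details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W: (4H, D), U: (4H, H), b: (4H,).
    The rows of W, U, b are split into input gate, forget gate,
    candidate write, and output gate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*H:1*H])          # input gate: how much new content to write
    f = sigmoid(z[1*H:2*H])          # forget gate: how much old content to keep
    g = np.tanh(z[2*H:3*H])          # candidate content
    o = sigmoid(z[3*H:4*H])          # output gate: how much of the cell to expose
    c = f * c_prev + i * g           # additive update of the memory cell
    h = o * np.tanh(c)               # hidden state read out from the cell
    return h, c

# Toy usage with random weights.
rng = np.random.default_rng(0)
D, H = 8, 16
W = rng.normal(scale=0.1, size=(4*H, D))
U = rng.normal(scale=0.1, size=(4*H, H))
b = np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(20, D)):
    h, c = lstm_step(x, h, c, W, U, b)
print(h[:4])
```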

Sequence to sequence
Requiring length of input sequence = length of output sequence is bad; it is not good for either ASR or MT.
Existing strategies for mapping sequences to sequences have an HMM-like component:
- Normal ASR approaches have a big, complicated transducer
- Connectionist Temporal Classification (CTC) assumes monotonic alignments and uses an HMM
But we want something simpler and more generic, applicable to any sequence-to-sequence problem, including MT, where words can be reordered in many ways.

Main idea
Neural nets are excellent at learning very complicated functions.
Coerce a neural network / LSTM to read one sequence and produce another.
Learning will take care of the rest.

Main idea
[Diagram: the LSTM reads the input sequence A B C D; it then emits the target sequence X Y Z Q, with each emitted symbol fed back in as the next input]

That's it!
The LSTM needs to read the entire input sequence, and then produce the target sequence from memory.
The input sequence is stored in a single LSTM hidden state.
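A minimal sketch of this read-then-write scheme, written with PyTorch for brevity (PyTorch postdates this work; the sizes, names, and greedy decoder here are my own simplifications, not the authors' implementation):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder LSTM reads the source; its final (h, c) state initializes a
    separate decoder LSTM, which emits the target one token at a time."""

    def __init__(self, src_vocab, tgt_vocab, dim=256, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, layers, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        # Read the whole (reversed) source; keep only the final state.
        _, state = self.encoder(self.src_emb(src))
        # Produce the target from that state (teacher forcing at train time).
        dec, _ = self.decoder(self.tgt_emb(tgt_in), state)
        return self.out(dec)                      # logits over the target vocabulary

    @torch.no_grad()
    def greedy_decode(self, src, bos, eos, max_len=50):
        _, state = self.encoder(self.src_emb(src))
        tok = torch.full((src.size(0), 1), bos, dtype=torch.long)
        result = []
        for _ in range(max_len):
            dec, state = self.decoder(self.tgt_emb(tok), state)
            tok = self.out(dec[:, -1]).argmax(-1, keepdim=True)
            result.append(tok)
            if (tok == eos).all():
                break
        return torch.cat(result, dim=1)

# Toy usage: a batch of 2 source sentences of length 7 (already reversed).
model = Seq2Seq(src_vocab=1000, tgt_vocab=800)
src = torch.randint(0, 1000, (2, 7))
print(model.greedy_decode(src, bos=1, eos=2).shape)
```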

Relevant Related Work
Independently and simultaneously, Kyunghyun Cho et al. (Bengio's lab) invented essentially the same approach, with a model that's related to the LSTM.
More interestingly and impressively, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio developed a model that can learn to attend to different parts of the input sentence:
- No need to remember the entire input sequence
- Maximal benefit with smaller hidden states

Step 1: can the LSTM reconstruct the input sentence?
Can this scheme learn the identity function?
[Diagram: the LSTM reads A B C D and is trained to output A B C D]
Answer: it can, and it can do it very easily. It just does it effortlessly and perfectly.

Step 2: small dataset experiments: EuroParl French to English
- Low-entropy parliament language
- 20M words in total
- Small vocabulary
- Sentence length no longer than 25
The net was doing something non-trivial. We were inspired.

Digression: decoding
Formally, given an input sentence, the LSTM defines a distribution over output sentences.
Therefore, we should produce the sentence with the highest probability.
But there are exponentially many sentences; how do we find it?
We use a simple greedy strategy.

Decoding in a nutshell
- Proceed left to right
- Maintain N partial translations
- Expand each translation with possible next words
- Discard all but the top N new partial translations
[Diagram: with N = 2, the partial hypotheses "I" and "My" are expanded to "I decided", "I thought", "I tried", "My decision", "My thinking", "My direction", sorted, and pruned back to the top 2: "I decided" and "My decision"]

Why does simple beam search work?
The LSTM is trained to predict the next word given the previous words.
Maintain a list of partial translations; extend each partial translation, evaluate each extension, and discard all but the top k.
Most of the search improvement is obtained with a beam of size 2 (a full 1 BLEU point).
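To pin the procedure down, here is a hedged sketch of beam search over an LSTM language model. The `next_logprobs(prefix)` function is a hypothetical stand-in for one decoder step returning the log-probability of each next word; everything else is plain Python:

```python
import math

def beam_search(next_logprobs, bos, eos, beam_size=2, max_len=50):
    """Keep the beam_size highest-scoring partial translations, extend each
    with every possible next word, and prune back to beam_size.

    next_logprobs(prefix) -> dict {word: log P(word | prefix)}  (hypothetical)
    """
    beams = [([bos], 0.0)]                      # (partial translation, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in next_logprobs(prefix).items():
                candidates.append((prefix + [word], score + lp))
        # Sort all extensions and keep only the top beam_size.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

# Toy usage with a fake model that always prefers to stop.
vocab = ["<eos>", "I", "decided", "My", "decision"]
def fake_next_logprobs(prefix):
    return {w: math.log(0.5) if w == "<eos>" else math.log(0.125) for w in vocab}

print(beam_search(fake_next_logprobs, bos="<bos>", eos="<eos>"))
```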

Model for big experiments
- 160K-word input vocabulary
- 80K-word output vocabulary
- 4 layers of 1000-dimensional LSTMs
- Different LSTMs for the input and output languages
- 384M parameters

The model
[Diagram: the encoder reads A B C D from a 160k-word input vocabulary; the decoder produces the output through an 80k-way softmax over 1000 dims ("this is very big!"); each layer has 1000 LSTM cells, giving 2000 dims per timestep and 2000 x 4 = 8k dims per sentence]

Parallelization
Parallelization is important, and more parallelization is better (ongoing work).
8 GPUs; speed: 6,700 words per second.
Idea: one layer per GPU.

Learning parameters
The learning parameters are very simple and straightforward:
- Learning rate = 0.7 / batch_size
- Parameters initialized uniformly in [-0.08, 0.08]
- Norm of the gradient clipped to 5
- Learning rate halved every 0.5 epochs after 5 epochs
- No momentum (may be a mistake)
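A small sketch of how those choices fit together in a training loop; `compute_gradient` is a hypothetical placeholder for backprop through the seq2seq model, and the loop structure (including the exact halving schedule) is my own reading of the stated recipe:

```python
import numpy as np

def init_param(shape, rng):
    return rng.uniform(-0.08, 0.08, size=shape)        # uniform init in [-0.08, 0.08]

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

def train(params, batches, compute_gradient, batch_size=128,
          epochs=8, halve_after=5.0, halve_every=0.5):
    base_lr = 0.7 / batch_size                          # learning rate = 0.7 / batch_size
    lr = base_lr
    epochs_seen = 0.0
    per_batch = 1.0 / len(batches)
    for _ in range(epochs):
        for batch in batches:
            grads = compute_gradient(params, batch)     # hypothetical backprop step
            grads = clip_by_global_norm(grads, max_norm=5.0)
            params = [p - lr * g for p, g in zip(params, grads)]   # plain SGD, no momentum
            epochs_seen += per_batch
            # After 5 epochs, halve the learning rate every half epoch.
            if epochs_seen > halve_after:
                lr = base_lr * 0.5 ** ((epochs_seen - halve_after) // halve_every + 1)
    return params

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = [init_param((3,), rng)]
    batches = [None] * 100
    # Toy objective: pull the single parameter vector toward zero.
    grad_fn = lambda ps, batch: [2.0 * ps[0]]
    print(train(params, batches, grad_fn)[0])
```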

[Diagram, repeated across several animation slides: the same model as above, partitioned across GPUs. The 80k softmax (by 1000 dims, "this is very big!") is split across 4 GPUs; the LSTM layers (1000 cells each, 2000 dims per timestep, 2000 x 4 = 8k dims per sentence) run on the other GPUs, one layer per GPU; the input vocabulary is 160k.]

Representations
[Diagram: the encoder compresses the input sentence A B C D into a single vector ("this thing"), from which the decoder produces X Y Z]

Representations
[Figures: visualizations of the learned sentence representations]

Reversing source sentences
It is natural to train the LSTM on sentences in the source language followed by the target language: a, b, c → α, β, γ.
It turns out that reversing the words in the source sentence substantially improves performance. Why?
Because we introduce many short-term dependencies into the dataset that make the learning problem easier.

Reversing source sentences
With the source reversed, c, b, a → α, β, γ:
- a is very close to α, and b is fairly close to β (close = close in time)
- Backpropagation can notice that a is connected to α and establish this connection
- That makes it easier to learn that b is connected to β
- Natural bootstrapping: backprop can notice the short-term dependencies first, and slowly extend them to long-range dependencies
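The trick itself is a one-line preprocessing step. A small illustrative sketch (the example sentences are my own):

```python
# Reverse only the source sentence; the target is left as-is. After reversal,
# the first source word is the last token the encoder sees, so it sits right
# next to the first target word the decoder must produce.

def make_training_pair(source_tokens, target_tokens):
    return list(reversed(source_tokens)), target_tokens

src = ["je", "suis", "etudiant"]          # "a, b, c"
tgt = ["i", "am", "a", "student"]         # "alpha, beta, gamma, ..."
print(make_training_pair(src, tgt))
# (['etudiant', 'suis', 'je'], ['i', 'am', 'a', 'student'])
```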

Results on a big dataset
Corpus: WMT'14 English→French: 384M French words, 303M English words; 70K test words.
BLEU score: larger is better.

  Bahdanau et al.            28.45
  One LSTM                   30.6
  Phrase-based SMT baseline  33.3
  Ensemble of 5 LSTMs        34.8
  State of the art           37.0

We are doing OK, but we're still far from the state of the art.

Performance vs sentence length

Performance on rare words

Big Problem
Performance deteriorates on sentences that have many rare words.
The vocabulary is limited, so many words cannot be translated for that reason.
This is the most obvious flaw, and worth fixing.

Solving the rare word problem
Example:
  input: I trained a veryrareword
  output: I trained a <unk>
"veryrareword" is a very rare word, but the model could know where it came from.

Our solution
- Use a traditional word alignment algorithm
- Replace each <unk> in the target translation with an <unk-d>, where d is an integer
- d indicates the position of the aligned word in the source sentence, if it exists

Our solution
The model no longer needs to have every word in its vocabulary.
It only needs to know where the word originated from!

Procedure
Train time:
- Annotate the unknown tokens with an alignment algorithm
- Train the LSTM to emit the annotations
Test time:
- Use the LSTM to produce the annotated translations
- Translate each annotated unknown word with a word dictionary
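A hedged sketch of the test-time postprocessing step. The exact `<unk-d>` token format, the regular expression, and the dictionary here are my own illustrative assumptions about how the annotation could be consumed:

```python
import re

UNK = re.compile(r"<unk-(\d+)>")   # assumed annotation format: <unk-d>

def postprocess(translation_tokens, source_tokens, dictionary):
    """Replace each <unk-d> with the dictionary translation of the d-th
    source word; fall back to copying the source word itself."""
    output = []
    for tok in translation_tokens:
        m = UNK.fullmatch(tok)
        if m:
            src_word = source_tokens[int(m.group(1))]
            output.append(dictionary.get(src_word, src_word))
        else:
            output.append(tok)
    return output

# Toy usage (all names and words are made up for illustration).
source = ["je", "connais", "Villeneuve"]
model_output = ["i", "know", "<unk-2>"]
word_dictionary = {"je": "i", "connais": "know"}   # no entry for the name
print(postprocess(model_output, source, word_dictionary))
# ['i', 'know', 'Villeneuve']  -- the rare word is copied from the source
```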

Example


It works really well! We match state-of-the-art

Using only 1/3 of the training data!
The model takes many days to train, so we trained on 1/3 of the data.
Additional models are being trained on the entire dataset; results are expected to get better.
Much larger datasets exist, and much larger neural nets should do much better.

Depth helps

Depth helps
Most gains are due to the larger hidden state.
Better models benefit more from the postprocessing. Why? Because better models compute the annotations more accurately.

Conclusions
We match the state of the art in MT and will likely exceed it.
Near future: train a huge, giant model on much more translation data.
Apply these models to other sequence-to-sequence problems.
This work brings us closer to the complete solution to the problem of supervised learning.

Conclusions
If you have a big dataset,
and you train a very big neural network,
then success is guaranteed!

Questions?

Thank you!