Bayesian Reasoning and Deep Learning
Shakir Mohamed, DeepMind
shakirm.com | @shakir_za
9 October 2015

Abstract

Deep learning and Bayesian machine learning are currently two of the most active areas of machine learning research. Deep learning provides a powerful class of models and an easy framework for learning that now provides state-of-the-art methods for applications ranging from image classification to speech recognition. Bayesian reasoning provides a powerful approach for information integration, inference and decision making that has established it as the key tool for data-efficient learning, uncertainty quantification and robust model composition, widely used in applications ranging from information retrieval to large-scale ranking. Each of these research areas has shortcomings that can be effectively addressed by the other, pointing towards a needed convergence of the two; the complementary aspects of these two research areas are the focus of this talk. Using the tools of auto-encoders and latent variable models, we shall discuss some of the ways in which our machine learning practice is enhanced by combining deep learning with Bayesian reasoning. This is an essential, and ongoing, convergence that will only continue to accelerate, and it provides some of the most exciting prospects, some of which we shall discuss, for contemporary machine learning research.

Deep Learning + Bayesian Reasoning = Better Machine Learning

Deep Learning: a framework for constructing flexible models
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximations, and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.

Bayesian Reasoning: a framework for inference and decision making
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, leading to expensive computation or long simulation times.

Two Streams of Machine Learning

Deep Learning
+ Rich non-linear models for classification and sequence prediction.
+ Scalable learning using stochastic approximation, and conceptually simple.
+ Easily composable with other gradient-based methods.
- Only point estimates.
- Hard to score models, do model selection and complexity penalisation.

Bayesian Reasoning
+ Unified framework for model building, inference, prediction and decision making.
+ Explicit accounting for uncertainty and variability of outcomes.
+ Robust to overfitting; tools for model selection and composition.
- Mainly conjugate and linear models.
- Potentially intractable inference, computationally expensive or long simulation times.

Outline: Bayesian Reasoning + Deep Learning
Complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea? Review of deep learning; limitations of maximum likelihood and MAP estimation.
2. How can we achieve this convergence? Case study using auto-encoders and latent variable models; approximate Bayesian inference.
3. What else can we do? Semi-supervised learning, classification, better inference and more.

A (Statistical) Review of Deep Learning: Generalised Linear Regression

η = w⊤x + b,    p(y|x) = p(y | g(η); θ)

The basic function η can be any linear function, e.g., affine or convolution. g(·) is an inverse link function that we'll refer to as an activation function; this gives generalised regression.

Target      | Regression  | Link g⁻¹(μ)                 | Inverse link g(η)             | Activation
Real        | Linear      | Identity                    | Identity                      |
Binary      | Logistic    | Logit log(μ/(1-μ))          | Sigmoid 1/(1+exp(-η))         | Sigmoid
Binary      | Probit      | Inv Gauss CDF Φ⁻¹(μ)        | Gauss CDF Φ(η)                | Probit
Binary      | Gumbel      | Compl. log-log log(-log(μ)) | Gumbel CDF exp(-exp(-η))      |
Binary      | Logistic    |                             | Hyperbolic tangent tanh(η)    | Tanh
Categorical | Multinomial | Multin. logit               | Softmax exp(η_i)/Σ_j exp(η_j) | Softmax
Counts      | Poisson     | log(μ)                      | exp(η)                        |
Counts      | Poisson     | √μ                          | η²                            |
Non-neg.    | Gamma       | Reciprocal 1/μ              | 1/η                           |
Sparse      | Tobit       | max                         | max(0, η)                     | ReLU
Ordered     | Ordinal     | Cum. logit                  | σ(θ_k − η)                    |

Maximum likelihood estimation: optimise the negative log-likelihood L = −log p(y | g(η); θ).
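
To make the table concrete, here is a minimal NumPy sketch (not from the talk) of generalised linear regression for the binary/logistic row: the sigmoid is the inverse link and the training loss is the Bernoulli negative log-likelihood. All names and the toy data are illustrative.

```python
import numpy as np

def sigmoid(eta):
    # Inverse link / activation for the Bernoulli (logistic) case.
    return 1.0 / (1.0 + np.exp(-eta))

def neg_log_likelihood(w, b, x, y):
    # Generalised linear regression: eta = w^T x + b, p(y|x) = Bernoulli(y | sigmoid(eta)).
    eta = x @ w + b
    mu = sigmoid(eta)
    return -np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Toy data: 5 points, 3 features, binary targets.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5)
w, b = np.zeros(3), 0.0
print(neg_log_likelihood(w, b, x, y))  # 5 * log(2) at the uninformative start
```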

A (Statistical) Review of Deep Learning: Recursive Generalised Linear Regression

Recursively compose the basic linear functions; this gives a deep neural network:

E[y] = h_L ∘ … ∘ h_1 ∘ h_0(x)

A general framework for building non-linear, parametric models. Problem: overfitting of the MLE, leading to limited generalisation.
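
A small sketch of the recursive composition, under the same illustrative conventions as above: each stage is an affine map followed by an activation, and stacking the stages gives the deep network E[y] = h_L ∘ … ∘ h_0(x). Layer sizes and initialisation are arbitrary choices for the example.

```python
import numpy as np

def layer(params, x, activation=np.tanh):
    # One generalised linear stage: affine map followed by an inverse link (activation).
    W, b = params
    return activation(x @ W + b)

def deep_glm(all_params, x):
    # Recursive composition h_L(... h_1(h_0(x)) ...): a deep neural network.
    h = x
    for params in all_params[:-1]:
        h = layer(params, h)
    W, b = all_params[-1]
    return h @ W + b  # final linear stage; pair with a likelihood as before

rng = np.random.default_rng(1)
sizes = [4, 8, 8, 1]  # input -> two hidden layers -> output
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
print(deep_glm(params, rng.normal(size=(3, 4))).shape)  # (3, 1)
```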

A (Statistical) Review of Deep Learning: Regularisation Strategies for Deep Networks

Regularisation is essential to overcome the limitations of maximum likelihood estimation (regularisation, penalised regression, shrinkage). A wide range of regularisation techniques is available:
- Large data sets
- Input noise/jittering and data augmentation/expansion
- L2/L1 regularisation (weight decay, Gaussian prior) - see the sketch after this list
- Binary or Gaussian dropout
- Batch normalisation
- More robust loss functions using MAP estimation instead
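
As a concrete illustration of the weight-decay and MAP items (a sketch, not the talk's code): adding a quadratic weight penalty to the negative log-likelihood is exactly MAP estimation with a zero-mean Gaussian prior on the weights; the name `lam` and its value are illustrative, playing the role of the prior precision.

```python
import numpy as np

def l2_map_objective(nll_value, w, lam=1e-2):
    # Negative log-posterior up to a constant: NLL(w) + (lam / 2) * ||w||_2^2.
    # The quadratic penalty is the negative log of a N(0, (1/lam) I) prior on w,
    # i.e. the familiar L2 / weight-decay regulariser.
    return nll_value + 0.5 * lam * float(np.sum(w ** 2))
```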

More Robust Learning: MAP estimators and their limitations

The power of MAP estimators is that they provide some robustness to overfitting, but they create sensitivities to the choice of parameterisation.
1. These sensitivities affect gradients and can make learning hard. Remedies: invariant MAP estimators, and exploiting natural gradients, trust-region methods and other improved optimisation.
2. There is still no way to measure the confidence of our model; we can generate frequentist confidence intervals and bootstrap estimates.

Towards Bayesian Reasoning

The proposed solutions have not fully dealt with the underlying issues, which arise as a consequence of:
- reasoning only about the most likely solution, and
- not maintaining knowledge of the underlying variability (and averaging over it).

Given this powerful model class and invaluable tools for regularisation and optimisation, let us develop a pragmatic Bayesian approach for probabilistic reasoning in deep networks: Bayesian reasoning over some, but not (yet) all, parts of our models.

Outline: Bayesian Reasoning + Deep Learning
Complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea? Review of deep learning; limitations of maximum likelihood and MAP estimation.
2. How can we achieve this convergence? Case study using auto-encoders and latent variable models; approximate Bayesian inference.
3. What else can we do? Semi-supervised learning, classification, better inference and more.

Dimensionality Reduction and Auto-encoders

Unsupervised learning and auto-encoders: a generic tool for dimensionality reduction and feature extraction. Minimise reconstruction error using an encoder z = f(y) and a decoder y* = g(z).
+ Non-linear dimensionality reduction using deep networks for the encoder and decoder.
+ Easy to implement as a single computational graph and to train using SGD.
- No natural handling of missing data.
- No representation of the variability of the representation space.

Loss functions:  L = −log p(y | g(z))  or  L = ‖y − g(f(y))‖²₂
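
A minimal sketch of the deterministic auto-encoder described above, assuming a single tanh encoder layer and a linear decoder purely for brevity; it evaluates L = ‖y − g(f(y))‖²₂ on a toy batch. All weights and shapes are illustrative.

```python
import numpy as np

def encoder(y, We, be):
    return np.tanh(y @ We + be)   # z = f(y)

def decoder(z, Wd, bd):
    return z @ Wd + bd            # y* = g(z)

def reconstruction_loss(y, We, be, Wd, bd):
    # L = || y - g(f(y)) ||_2^2, averaged over the batch.
    y_star = decoder(encoder(y, We, be), Wd, bd)
    return float(np.mean(np.sum((y - y_star) ** 2, axis=1)))

rng = np.random.default_rng(2)
y = rng.normal(size=(10, 6))      # 10 data points, 6 dimensions
We, be = rng.normal(scale=0.1, size=(6, 2)), np.zeros(2)
Wd, bd = rng.normal(scale=0.1, size=(2, 6)), np.zeros(6)
print(reconstruction_loss(y, We, be, Wd, bd))
```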

Dimensionality Reduction and Auto-encoders

Some questions about auto-encoders (encoder z = f(y), decoder y* = g(z)):
- What is the model we are interested in?
- Why use an encoder?
- How do we regularise?

It is best to be explicit about the probabilistic model of interest and the mechanism we use for inference.

Density Estimation and Latent Variable Models

Latent variable models: a generic and flexible model class for density estimation that specifies a generative process giving rise to the data. Latent Gaussian models include probabilistic PCA, factor analysis (FA) and Bayesian exponential family PCA (BXPCA).

BXPCA, for n = 1, …, N:
Latent variable:    z ~ N(z | μ, Σ)
Observation model:  η = Wz + b,   y ~ Expon(y | η)   (exponential-family natural parameters η)

We can use our knowledge of deep learning to design even richer models.

Deep Generative Models

A rich extension of the previous model using deep neural networks, e.g., non-linear factor analysis, non-linear Gaussian belief networks, deep latent Gaussian models (DLGMs).

DLGM, for n = 1, …, N:
Latent variables (stochastic layers):  z_l ~ N(z_l | f_l(z_{l+1}), Σ_l),   f_l(z) = σ(W_l h(z) + b_l)
Deterministic layers:                  h_i(x) = σ(A_i x + c_i)
Observation model:                     η = W h_1 + b,   y ~ Expon(y | η)

Non-exponential-family observation models can also be used.
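
A hedged sketch of ancestral sampling from a two-layer DLGM of this form, assuming unit-variance Gaussian stochastic layers, tanh non-linearities and a Bernoulli observation model purely for illustration; the shapes and weights are arbitrary choices of the example, not the talk's exact model.

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_dlgm(n, dz2=2, dz1=4, dy=8):
    # Ancestral sampling from a two-layer DLGM (illustrative parameter shapes).
    W2, b2 = rng.normal(scale=0.5, size=(dz2, dz1)), np.zeros(dz1)
    W1, b1 = rng.normal(scale=0.5, size=(dz1, dy)), np.zeros(dy)
    z2 = rng.normal(size=(n, dz2))          # z_2 ~ N(0, I)
    mu1 = np.tanh(z2 @ W2 + b2)             # f_1(z_2) = sigma(W z_2 + b)
    z1 = mu1 + rng.normal(size=mu1.shape)   # z_1 ~ N(f_1(z_2), I)
    eta = z1 @ W1 + b1                      # observation-model parameters
    y = rng.binomial(1, sigmoid(eta))       # e.g. Bernoulli observations
    return y

print(sample_dlgm(5).shape)  # (5, 8)
```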

Deep Latent Gaussian Models

Our inferential tasks are:
1. Explain the data:       p(z | y, W) ∝ p(y | z, W) p(z)
2. Make predictions:       p(y* | y) = ∫ p(y* | z, W) p(z | y, W) dz
3. Choose the best model:  p(y | W) = ∫ p(y | z, W) p(z) dz

Variational Inference

Use tools from approximate inference to handle the intractable integrals: find the member q*(z) of an approximation class that minimises KL[q(z|y) ‖ p(z|y)] to the true posterior.

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]

Reconstruction cost: the expected log-likelihood measures how well samples from q(z) are able to explain the data y.
Penalty: the explanation of the data q(z) should not deviate too far from your beliefs p(z) - Occam's razor. The penalty is derived from your model and does not need to be designed.
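
A minimal sketch of this free energy for a factorised Gaussian q(z) and a standard normal prior (assumptions of the example): the reconstruction term is estimated by Monte Carlo and the KL term is available in closed form. The unit-variance Gaussian likelihood and the linear `decode` function are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)

def free_energy(y, mu_q, log_var_q, decode, n_samples=10):
    # F(y, q) = E_{q(z)}[log p(y|z)] - KL[q(z) || p(z)],
    # with q(z) = N(mu_q, diag(exp(log_var_q))) and p(z) = N(0, I).
    std_q = np.exp(0.5 * log_var_q)
    recon = 0.0
    for _ in range(n_samples):
        z = mu_q + std_q * rng.normal(size=mu_q.shape)   # sample z ~ q(z)
        y_mean = decode(z)
        recon += -0.5 * np.sum((y - y_mean) ** 2)        # Gaussian log-likelihood, up to a constant
    recon /= n_samples
    kl = 0.5 * np.sum(np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    return recon - kl

W = rng.normal(scale=0.3, size=(2, 5))
print(free_energy(rng.normal(size=5), np.zeros(2), np.zeros(2), lambda z: z @ W))
```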

Amortised Variational Inference

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]
(approximate posterior; reconstruction; penalty)

Approximate posterior distribution q(z): the best match to the true posterior p(z|y), one of the unknown inferential quantities of interest to us.

Inference network: q is an encoder or inverse model, q(z|y). The parameters of q are now a set of global parameters used for inference over all data points - test and train. This amortises (spreads) the cost of inference over all data: encoders provide an efficient mechanism for amortised posterior inference.
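
A sketch of what such an inference network might look like (the names and the single-hidden-layer form are assumptions): one set of global weights maps any data point y to the mean and log-variance of its Gaussian q(z|y), which can then be plugged into the free-energy computation above.

```python
import numpy as np

def inference_network(y, W_h, b_h, W_mu, b_mu, W_lv, b_lv):
    # Amortised q(z|y): one set of global weights maps any y to (mu, log_var),
    # instead of optimising separate variational parameters per data point.
    h = np.tanh(y @ W_h + b_h)
    mu = h @ W_mu + b_mu
    log_var = h @ W_lv + b_lv
    return mu, log_var
```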

Auto-encoders and Inference in DGMs

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]

Model (decoder): the likelihood p(y|z). Inference (encoder): the variational distribution q(z|y). Stochastic encoder-decoder systems implement variational inference.

This specific combination - variational inference in latent variable models using inference networks - is the variational auto-encoder. But don't forget what your model is, and what inference you use.

What Have We Gained?

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]

+ Transformed auto-encoders into more interesting deep generative models.
+ A rich new class of density estimators built with non-linear models.
+ A principled approach for deriving loss functions that automatically include appropriate penalty functions.
+ Explained how an encoder enters into our models and why this is a good idea.
+ Able to answer all our desired inferential questions.
+ Knowledge of the uncertainty associated with our latent variables.

What Have We Gained? (continued)

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]

+ Able to score our models and do model selection using the free energy.
+ Can impute missing data under any missingness assumption.
+ Can still combine with natural gradients and improved optimisation tools.
+ Easy implementation - a single computational graph and simple Monte Carlo gradient estimators.
+ Computational complexity the same as any large-scale deep learning system.

A true marriage of Bayesian reasoning and deep learning.

Data Visualisation

[Figure: MNIST handwritten digits (28x28). A DLGM with a 2D latent space: samples from the 2D latent model, and class labels plotted in the 2D latent space.]

Visualising MNIST in 3D

[Figure: DLGM with a 3D latent space visualising MNIST.]

Data Simulation

[Figure: data samples generated by the DLGM.]

Missing Data Imputation

[Figure: original data with unobserved pixels, and DLGM-inferred images with 10% and 50% of pixels observed.]

Outline: Bayesian Reasoning + Deep Learning
Complementary strengths that we should expect to be successfully combined.
1. Why is this a good idea? Review of deep learning; limitations of maximum likelihood and MAP estimation.
2. How can we achieve this convergence? Auto-encoders and latent variable models; approximate and variational inference.
3. What else can we do? Semi-supervised learning, recurrent networks, classification, better inference and more.

Semi-supervised Learning

We can extend the marriage of Bayesian reasoning and deep learning to the problem of semi-supervised classification.

[Graphical model: semi-supervised DLGM with label y (prior π), latent variable z (parameters μ, Σ), weights W and data x, for n = 1, …, N.]

Analogical Reasoning

[Figure: analogies generated by the semi-supervised DLGM.]

Generative Models with Attention

We can also combine other tools from deep learning to design even more powerful generative models: recurrent networks and attention (DRAW).

[Figures from the DRAW paper: the conventional variational auto-encoder compared with the DRAW architecture; MNIST generation sequences without attention (the network first generates a very blurry image that is subsequently refined) and with attention (the network constructs each digit by tracing the lines, much like a person with a pen); generated MNIST images with two digits; generated SVHN images alongside their nearest training images.]

Uncertainty on Model Parameters

Bayesian neural networks: place distributions over the network weights, rather than using point estimates.

[Graphical model: inputs x mapped through hidden layers h_1, h_2 to outputs y, with weights W_1, W_2, W_3 treated as random variables, for n = 1, …, N.]

In Review

Deep learning is a framework for building highly flexible non-linear parametric models, but regularisation and accounting for uncertainty and lack of knowledge are still needed.

Bayesian reasoning is a general framework for inference that allows us to account for uncertainty, and a principled approach to regularisation and model scoring.

We combined Bayesian reasoning with auto-encoders and showed just how much can be gained by a marriage of these two streams of machine learning research.

[Diagram: model/decoder p(y|z) generating y from z; inference network/encoder q(z|y) mapping data y to z.]

Thanks to many people: Danilo Rezende, Ivo Danihelka, Karol Gregor, Charles Blundell, Theophane Weber, Andriy Mnih, Daan Wierstra (Google DeepMind), Durk Kingma, Max Welling (U. Amsterdam).

Thank You.

Some References: Probabilistic Deep Learning

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. ICML.
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. ICLR.
Mnih, A., & Gregor, K. (2014). Neural variational inference and learning in belief networks. ICML.
Gregor, K., et al. (2014). Deep autoregressive networks. ICML.
Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. NIPS, pp. 3581-3589.
Gregor, K., Danihelka, I., Graves, A., & Wierstra, D. (2015). DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623.
Rezende, D. J., & Mohamed, S. (2015). Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
Hernández-Lobato, J. M., & Adams, R. P. (2015). Probabilistic backpropagation for scalable learning of Bayesian neural networks. arXiv preprint arXiv:1502.05336.
Gal, Y., & Ghahramani, Z. (2015). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142.

What is a Variational Method?

Variational principle: a general family of methods for approximating complicated densities by a simpler class of densities, by minimising KL[q(z|y) ‖ p(z|y)] between a member of the approximation class and the true posterior to obtain q*(z). These are deterministic approximation procedures with bounds on the probabilities of interest; we fit the variational parameters.

From Importance Sampling to Variational Inference

Integral problem:      log p(y) = log ∫ p(y|z) p(z) dz
Proposal:              log p(y) = log ∫ p(y|z) p(z) (q(z)/q(z)) dz
Importance weight:     log p(y) = log ∫ q(z) p(y|z) (p(z)/q(z)) dz
Jensen's inequality:   log ∫ p(x) g(x) dx ≥ ∫ p(x) log g(x) dx
Therefore:             log p(y) ≥ ∫ q(z) log(p(y|z) p(z)/q(z)) dz
                                 = ∫ q(z) log p(y|z) dz − ∫ q(z) log(q(z)/p(z)) dz
Variational lower bound:   F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]

Minimum Description Length (MDL)

F(y, q) = E_{q(z)}[log p(y|z)] − KL[q(z) ‖ p(z)]
(stochastic encoder; data code-length; hypothesis code)

Stochastic encoder-decoder systems implement variational inference. Regularity in our data that can be explained with latent variables implies that the data is compressible. In the MDL view, inference is seen as a problem of compression: we must find the ideal, shortest message for our data y - the marginal likelihood - and we must introduce an approximation to this ideal message. Encoder: the variational distribution q(z|y); decoder: the likelihood p(y|z).

Denoising Auto-encoders (DAE)

F(y, q) = E_{q(z)}[log p(y|z)] − penalty(z, y)
(stochastic encoder; reconstruction; penalty)

Stochastic encoder-decoder systems implement variational inference. The DAE is a mechanism for finding representations or features of data (i.e., latent-variable explanations). Encoder: the variational distribution q(z|y); decoder: the likelihood p(y|z). The variational approach requires you to be explicit about your assumptions: the penalty is derived from your model and does not need to be designed.

Amortising the Cost of Inference

Variational EM, repeat:
  E-step: for n = 1, …, N, update the variational parameters of q(z_n) using ∇_φ E_{q(z)}[log p_θ(y_n | z_n)] − ∇_φ KL[q(z_n) ‖ p(z_n)]
  M-step: update θ using (1/N) Σ_n ∇_θ log p_θ(y_n | z_n)

Instead of solving this optimisation for every data point n, we can use a model: an inference network. q is an encoder or inverse model whose parameters are a set of global parameters used for inference over all data points - test and train - sharing (amortising) the cost of inference over all data. This combines easily with mini-batches and Monte Carlo expectations, and we can jointly optimise the variational and model parameters: no need for alternating optimisation.

Implementing your Variational Algorithm

Avoid deriving pages of gradient updates for variational inference: variational inference turns integration into optimisation of E_q[−log p(y|z) + log q(z) − log p(z)].
Automated tools: differentiation (Theano, Torch7, Stan); message passing (Infer.NET).
Use stochastic gradient descent and other preconditioned optimisation; the same code can run on GPUs or on distributed clusters. Probabilistic models are modular and can easily be combined.

[Diagram: forward pass through the prior p(z), model p(x|z) and inference network q(z|x), computing log p(z), log p(x|z) and the entropy H[q(z)]; backward pass propagating gradients through the same components.]

Ideally, we want probabilistic programming using variational inference.

Stochastic Backpropagation

A Monte Carlo method that works with continuous latent variables.

Original problem:              ∇_θ E_{q(z)}[f(z)]
Reparameterisation:            z ~ N(μ, σ²)  ⇔  z = μ + σε,  ε ~ N(0, 1)
Backpropagation with Monte Carlo:  ∇_θ E_{N(ε|0,1)}[f(μ + σε)] = E_{N(ε|0,1)}[∇_{θ={μ,σ}} f(μ + σε)]

Can use any likelihood function and avoids the need for additional lower bounds. A low-variance, unbiased estimator of the gradient; we can use just one sample from the base distribution. Possible for many distributions with location-scale or other known transformations, such as the CDF.
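
A small sketch of the reparameterised estimator for a one-dimensional Gaussian, assuming we can evaluate f'(z); the quadratic f used in the check, and the sample size, are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def reparameterised_gradients(mu, sigma, f_grad, n_samples=1000):
    # grad_{mu, sigma} E_{z ~ N(mu, sigma^2)}[f(z)]
    #   = E_{eps ~ N(0,1)}[ f'(mu + sigma*eps) * d(mu + sigma*eps)/d{mu, sigma} ].
    eps = rng.normal(size=n_samples)
    z = mu + sigma * eps
    g = f_grad(z)                          # f'(z) evaluated at the samples
    return np.mean(g), np.mean(g * eps)    # (d/dmu, d/dsigma)

# Check with f(z) = z^2, so E[f] = mu^2 + sigma^2 and the true gradients are (2*mu, 2*sigma).
print(reparameterised_gradients(1.0, 0.5, lambda z: 2.0 * z))  # approx (2.0, 1.0)
```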

Monte Carlo Control Variate Estimators

A more general Monte Carlo approach that can be used with both discrete and continuous latent variables.

Property of the score function:  ∇_φ log q_φ(z|y) = ∇_φ q_φ(z|y) / q_φ(z|y)

Original problem:  ∇_φ E_{q_φ(z)}[log p(y|z)]
Score ratio:       = E_{q_φ(z)}[log p(y|z) ∇_φ log q_φ(z|y)]
MCCV estimate:     E_{q_φ(z)}[(log p(y|z) − c) ∇_φ log q_φ(z|y)]

c is known as a control variate and is used to control the variance of the estimator.
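
A sketch of the score-function estimator with a scalar control variate, using a Bernoulli q_θ(z) so that it also covers discrete latent variables; the function f, the value of c and the sample size are illustrative choices, not the talk's settings.

```python
import numpy as np

rng = np.random.default_rng(6)

def score_function_gradient(theta, f, c=0.0, n_samples=5000):
    # grad_theta E_{q_theta(z)}[f(z)] = E_q[(f(z) - c) * grad_theta log q_theta(z)],
    # for q_theta(z) = Bernoulli(z | sigmoid(theta)). The constant c leaves the
    # estimator unbiased (since E_q[grad log q] = 0) but can reduce its variance.
    p = 1.0 / (1.0 + np.exp(-theta))
    z = rng.binomial(1, p, size=n_samples)
    score = z - p            # d/dtheta of log Bernoulli(z | sigmoid(theta))
    return np.mean((f(z) - c) * score)

# With f(z) = z, E_q[f] = sigmoid(theta), so the true gradient is p * (1 - p) ≈ 0.244 here.
theta = 0.3
print(score_function_gradient(theta, lambda z: z.astype(float), c=0.0))
print(score_function_gradient(theta, lambda z: z.astype(float), c=0.5))  # same mean, lower variance
```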