Variational inference / Variational Bayesian methods

Loreen Barber
5 years ago

1 Variational inference / Variational Bayesian methods

2 Outline
- What is it
- How does it work
  - Closeness
  - Variational family
  - Optimization
- Example
- Drawbacks

3 Variational inference
- Allows us to rewrite statistical problems as optimization problems
- Very popular in statistical machine learning (deep learning)
[Figure: network diagram with Input, Hidden, and Output layers; nodes S1 to S5 and O1 to O5]

5 Variational inference
- First used by Peterson & Anderson (1987) for a neural network
- Neal & Hinton (1993) made connections to the EM algorithm
- Also used in Bayesian inference for:
  - Complex models (intractable integrals)
  - Large data
  - Model selection

6 How does it work
- $p(z \mid x)$: posterior probability
- $p(x \mid z)$: likelihood
- $p(z)$: prior probability
- Z: hidden / latent variable
- X: observed variable
What if we don't know how to sample from $p(z \mid x)$, or computing $p(z \mid x)$ is very complicated?

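7 How does it work: what is close?
Kullback-Leibler (KL) divergence: $\mathrm{KL}(q(z) \,\|\, p(z \mid x))$
$$q^*(z) = \arg\min_{q(z) \in \mathcal{Q}} \mathrm{KL}(q(z) \,\|\, p(z \mid x))$$
$$\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z \mid x)]$$
$\mathcal{Q}$ = family of densities over the latent variables, where each $q(z) \in \mathcal{Q}$ is a candidate approximation. All expectations are taken with respect to $q(z)$.
$$\mathrm{KL}(q(z) \,\|\, p(z \mid x)) = \mathbb{E}[\log q(z)] - \mathbb{E}[\log p(z, x)] + \log p(x)$$
(Since $\log p(x)$ does not depend on $z$, $\mathbb{E}[\log p(x)] = \log p(x)$.)
Evidence lower bound (ELBO):
$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z, x)] - \mathbb{E}[\log q(z)]$$
The ELBO is the negative KL divergence plus a constant, $\log p(x)$.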
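The relationship between the ELBO, the KL divergence, and log p(x) can be checked numerically. The sketch below assumes a hypothetical toy model in which z takes three values and x is a single fixed observation; the numbers are made up for illustration and are not from the slides.

```python
import numpy as np

# Hypothetical toy model (illustration only): z has 3 states, x is fixed.
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.10, 0.40, 0.70])     # p(x | z) at the observed x
joint = prior * lik                    # p(z, x)
evidence = joint.sum()                 # p(x)
posterior = joint / evidence           # p(z | x)

def kl(q, p):
    """KL(q || p) for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def elbo(q):
    """ELBO(q) = E_q[log p(z, x)] - E_q[log q(z)]."""
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q = np.array([0.2, 0.3, 0.5])          # an arbitrary candidate density
gap = np.log(evidence) - elbo(q)       # should equal KL(q || posterior)
```

Here `gap` matches `kl(q, posterior)`, so maximizing the ELBO over q is the same as minimizing the KL divergence to the posterior, without ever needing p(x).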

8 How does it work: what is close? Optimal variational density
Evidence lower bound (ELBO):
$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z, x)] - \mathbb{E}[\log q(z)]$$
$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(z)] + \mathbb{E}[\log p(x \mid z)] - \mathbb{E}[\log q(z)]$$
$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(x \mid z)] - \mathrm{KL}(q(z) \,\|\, p(z))$$
- First term, the expected log-likelihood: encourages densities that place their mass on configurations of the latent variables that explain the observed data
- Second term, the divergence between the variational density and the prior: encourages densities close to the prior
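The split of the ELBO into an expected log-likelihood term and a KL-to-prior term can be verified on a small discrete example. The sketch assumes a made-up three-state latent variable; `prior`, `lik`, and `q` are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical toy quantities (illustration only).
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.10, 0.40, 0.70])     # p(x | z) at the observed x
q = np.array([0.2, 0.3, 0.5])          # candidate variational density

# ELBO written with the joint: E_q[log p(z, x)] - E_q[log q(z)]
elbo_joint = float(np.sum(q * (np.log(prior * lik) - np.log(q))))

# ELBO written as expected log-likelihood minus KL(q || prior)
expected_loglik = float(np.sum(q * np.log(lik)))
kl_to_prior = float(np.sum(q * (np.log(q) - np.log(prior))))
elbo_split = expected_loglik - kl_to_prior
```

Both forms give the same number, which makes the trade-off explicit: fit the data, but stay close to the prior.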

9 How does it work: variational family $\mathcal{Q}$
The mean-field variational family:
$$q(z) = \prod_{j=1}^{m} q_j(z_j)$$
The latent variables are mutually independent, and each latent variable is governed by a distinct factor in the variational density.

10 How does it work: optimization
CAVI: coordinate ascent variational inference
- For latent variable $z_j$, fix all other variational factors
- Optimize the factor $q_j(z_j)$ given the fixed values of the other factors and the data $x$
- Continue to the next latent variable
- Iterate through all of them
- Repeat until converged
At convergence we have reached a local optimum of the ELBO. CAVI is the variational-inference analogue of the Gibbs sampler.
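For reference, the coordinate update that each CAVI step solves has a closed form under the mean-field family. The expression below follows the presentation in Blei et al. (2017); the notation $\mathbb{E}_{-j}$ (expectation over all factors except the j-th) is introduced here and does not appear on the slide.

```latex
q_j^*(z_j) \;\propto\; \exp\left\{ \mathbb{E}_{-j}\left[ \log p(z_j, z_{-j}, x) \right] \right\},
\qquad
\mathbb{E}_{-j}[\,\cdot\,] \text{ taken with respect to } \prod_{l \neq j} q_l(z_l)
```

Each update can only increase the ELBO, which is why the iteration converges, though only to a local optimum.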

11 Example: Gaussian mixture
- Bayesian mixture of unit-variance univariate Gaussians
- K mixture components = K Gaussian distributions with means $\mu = (\mu_1, \ldots, \mu_K)$
- Common prior $p(\mu_k)$
- Full hierarchical model: [equations shown on slide; see Blei et al., 2017]
- Latent variables are $z = (\mu, c)$: the K class means and n class assignments
(Blei et al., 2017)
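The hierarchical model itself did not survive in this transcript. Assuming the slide followed the cited paper, the model in Blei et al. (2017) reads as follows, where $\sigma^2$ is a prior-variance hyperparameter and each $c_i$ is a one-hot indicator vector of length K:

```latex
\begin{aligned}
\mu_k &\sim \mathcal{N}(0, \sigma^2), & k &= 1, \ldots, K, \\
c_i &\sim \mathrm{Categorical}(1/K, \ldots, 1/K), & i &= 1, \ldots, n, \\
x_i \mid c_i, \mu &\sim \mathcal{N}(c_i^\top \mu, 1), & i &= 1, \ldots, n.
\end{aligned}
```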

12 Example: Gaussian mixture, mean-field variational family (Blei et al., 2017)
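The slide's equation is missing from this transcript. Assuming it matched the cited paper, the mean-field family for the mixture in Blei et al. (2017) uses a Gaussian factor for each component mean (variational mean $m_k$, variance $s_k^2$) and a categorical factor for each assignment (probabilities $\varphi_i$):

```latex
q(\mu, c) = \prod_{k=1}^{K} q(\mu_k;\, m_k, s_k^2) \; \prod_{i=1}^{n} q(c_i;\, \varphi_i)
```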

13 Example: Gaussian mixture, CAVI algorithm (Blei et al., 2017)
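The algorithm on this slide is not preserved in the transcript. As a rough sketch, the coordinate updates given in Blei et al. (2017) for the unit-variance univariate mixture can be written in a few lines of NumPy. The function name `cavi_gmm`, the quantile-based initialization, the stopping rule on the change in means, and the default prior variance `sigma2` are assumptions of this sketch and are not taken from the slides.

```python
import numpy as np

def cavi_gmm(x, K, sigma2=10.0, n_iter=200, tol=1e-8):
    """CAVI sketch for a Bayesian mixture of unit-variance 1-D Gaussians.

    Variational parameters: m, s2 (mean and variance of each q(mu_k)),
    phi (n x K assignment probabilities of each q(c_i)).
    """
    # Assumption: deterministic start at evenly spaced data quantiles.
    m = np.quantile(x, (np.arange(K) + 0.5) / K)
    s2 = np.ones(K)
    for _ in range(n_iter):
        # Update q(c_i): phi_ik proportional to exp(E[mu_k] x_i - E[mu_k^2] / 2)
        logits = np.outer(x, m) - 0.5 * (s2 + m**2)
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Update q(mu_k): Gaussian with these mean and variance
        nk = phi.sum(axis=0)
        m_new = (phi * x[:, None]).sum(axis=0) / (1.0 / sigma2 + nk)
        s2 = 1.0 / (1.0 / sigma2 + nk)
        # Assumption: stop when the variational means stop moving
        # (the paper monitors the ELBO instead).
        converged = np.max(np.abs(m_new - m)) < tol
        m = m_new
        if converged:
            break
    return m, s2, phi

# Usage on simulated data with well-separated components.
rng = np.random.default_rng(1)
true_means = np.array([-5.0, 0.0, 5.0])
labels = rng.integers(0, 3, size=600)
x = rng.normal(true_means[labels], 1.0)
m, s2, phi = cavi_gmm(x, K=3)
```

With well-separated components the fitted means `m` land near the true means. On harder data the result depends on initialization, since CAVI only finds a local optimum.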

14 Example: Gaussian mixture. Simulate a two-dimensional Gaussian mixture with 5 components (Blei et al., 2017)
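A simulation of this kind might look like the sketch below. The sample size, the spread of the component means, and the identity covariance are assumptions made for illustration; the slide does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 1000                                  # 5 components; n is assumed

# Assumed settings: component means drawn from N(0, 5^2 I), unit covariance.
means = rng.normal(0.0, 5.0, size=(K, 2))
labels = rng.integers(0, K, size=n)             # uniform mixing proportions
x = means[labels] + rng.normal(size=(n, 2))     # two-dimensional observations
```

The array `x` can then be passed to a variational mixture fit, with `labels` kept aside to judge how well the inferred assignments recover the truth.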

15 Implementation
R packages:
- VBmix: variational algorithms and methods for fitting mixture models (Windows binary unavailable on CRAN)
- VBLPCM: Variational Bayes Latent Position Cluster Model for networks
- locus (unofficial): large-scale variational inference for combined covariate and response selection in sparse regression models
Stan 2.7 and up (2015) supports several variational inference methods

16 Why (not)
- Fast, and easier to scale to large data
- Statistical properties are less well understood
- Less accurate than MCMC: generally underestimates the variance of the posterior density
- Deriving the equations used to iteratively update the parameters often requires a large amount of work (compared to, e.g., Gibbs sampling)
- Active field of research
(Blei et al., 2017)
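The variance underestimation can be seen in a case with a closed-form answer. When the target is a correlated Gaussian and the approximation is a fully factorized (mean-field) Gaussian fitted by minimizing KL(q || p), a standard result is that each factor's variance equals the reciprocal of the corresponding diagonal entry of the target's precision matrix. The sketch below uses a hypothetical correlation of 0.9 and is not taken from the slides.

```python
import numpy as np

rho = 0.9                                   # assumed posterior correlation
Sigma = np.array([[1.0, rho], [rho, 1.0]])  # "true" posterior covariance
Lambda = np.linalg.inv(Sigma)               # precision matrix

true_var = np.diag(Sigma)                   # true marginal variances
mf_var = 1.0 / np.diag(Lambda)              # optimal mean-field variances
```

Here the true marginal variances are 1, while the mean-field variances are 1 - rho**2 = 0.19. The stronger the posterior correlation, the more the factorized approximation understates the uncertainty.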

17 References
- Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2).
- Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2). (Book)
- Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association (just accepted).
- Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. London: University of London. (Thesis)

18 Questions?
