Variational inference / Variational Bayesian methods


Outline
- What is it
- How does it work
- Closeness
- Variational family
- Optimization
- Example
- Drawbacks

Variational inference allows us to rewrite statistical problems as optimization problems. It is very popular in statistical machine learning (deep learning). [Figure: layered graphical model with an input layer, hidden states S1-S5, and output observations O1-O5]

Variational inference: first used by Peterson & Anderson (1987) for neural networks; Neal & Hinton (1993) made connections to the EM algorithm. Also used in Bayesian inference for complex models (intractable integrals), large data, and model selection.

How does it work: $p(Z \mid X)$ is the posterior probability, $p(X \mid Z)$ the likelihood, and $p(Z)$ the prior probability, where $Z$ is the hidden/latent variable and $X$ the observed variable. What if we don't know how to sample from $p(Z \mid X)$, or computing $p(Z \mid X)$ is very complicated?

How does it work: what is "close"? We measure closeness with the Kullback-Leibler (KL) divergence. Let $\mathcal{Q}$ be a family of densities over the latent variables, where each $q(z) \in \mathcal{Q}$ is a candidate approximation. The optimal variational density is

$$q^*(z) = \arg\min_{q(z) \in \mathcal{Q}} \mathrm{KL}(q(Z) \,\|\, p(Z \mid X))$$

Expanding the divergence:

$$\mathrm{KL}(q(Z) \,\|\, p(Z \mid X)) = \mathbb{E}[\log q(Z)] - \mathbb{E}[\log p(Z \mid X)] = \mathbb{E}[\log q(Z)] - \mathbb{E}[\log p(Z, X)] + \log p(X)$$

Evidence lower bound (ELBO):

$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(Z, X)] - \mathbb{E}[\log q(Z)]$$

i.e., the negative KL divergence, plus a constant: $\log p(X)$.
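For a small discrete latent variable, every quantity above is computable exactly, so the identity $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}(q(Z)\,\|\,p(Z \mid X))$ can be checked numerically. A minimal sketch (the prior and likelihood numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: Z has 3 states, X is one observed data point.
prior = np.array([0.5, 0.3, 0.2])      # p(Z)
lik = np.array([0.1, 0.6, 0.3])        # p(X | Z) evaluated at the observed X

joint = prior * lik                    # p(Z, X)
log_px = np.log(joint.sum())           # log evidence: the hard part in general
posterior = joint / joint.sum()        # exact p(Z | X)

# An arbitrary variational density q(Z) over the 3 states.
q = rng.dirichlet(np.ones(3))

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # E_q[log p(Z,X)] - E_q[log q(Z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # KL(q(Z) || p(Z|X))

# The two always sum to log p(X), so maximizing the ELBO minimizes the KL.
print(elbo + kl, log_px)
```

Because $\log p(X)$ is fixed by the data, maximizing the ELBO over $q$ is equivalent to minimizing the KL divergence, which is why the intractable posterior never needs to be evaluated during optimization.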

How does it work: what is "close"? The optimal variational density maximizes the evidence lower bound (ELBO), which can be rewritten as

$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(Z)] + \mathbb{E}[\log p(X \mid Z)] - \mathbb{E}[\log q(Z)] = \mathbb{E}[\log p(X \mid Z)] - \mathrm{KL}(q(Z) \,\|\, p(Z))$$

The expected likelihood term encourages densities that place their mass on configurations of the latent variables that explain the observed data; the second term, the difference between the variational density and the prior, encourages densities close to the prior.

How does it work: the variational family $\mathcal{Q}$. The mean-field variational family:

$$q(Z) = \prod_{j=1}^{m} q_j(z_j)$$

The latent variables are mutually independent, and each latent variable is governed by a distinct factor in the variational density.
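The independence assumption has a visible cost even in the simplest case. For a Gaussian target with precision matrix $\Lambda$, the mean-field solution is known in closed form: each factor $q_j$ is Gaussian with variance $1/\Lambda_{jj}$ (a standard result for factorized Gaussians). A sketch with made-up numbers, showing how a strong correlation makes the mean-field variances much smaller than the true marginals:

```python
import numpy as np

# Toy Gaussian "posterior" p(z1, z2) = N(0, Sigma) with strong correlation.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)        # precision matrix

# Closed-form mean-field solution for a Gaussian target:
# q_j(z_j) is Gaussian with variance 1 / Lam[j, j].
mf_var = 1.0 / np.diag(Lam)

print(np.diag(Sigma))             # true marginal variances: [1. 1.]
print(mf_var)                     # mean-field variances: about [0.19 0.19]
```

This is the same variance-underestimation behavior noted in the drawbacks below: ignoring posterior correlations forces the factorized approximation to be too narrow.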

How does it work: optimization. CAVI: coordinate ascent variational inference.
- For latent variable $Z_j$, fix all other variational factors.
- Optimize $q_j(z_j)$ given the fixed values of the other factors and the data $X$.
- Continue to the next latent variable, and iterate through all of them.
- Continue until converged: we have now reached a local optimum of the ELBO.
CAVI is the variational-inference analogue of the Gibbs sampler.

Example: Gaussian mixture. A Bayesian mixture of unit-variance univariate Gaussians: K mixture components, i.e., K Gaussian distributions with means $\mu = (\mu_1, \ldots, \mu_K)$ and a common prior $p(\mu_k)$. Full hierarchical model (Blei et al., 2017):

$$\mu_k \sim \mathcal{N}(0, \sigma^2), \quad k = 1, \ldots, K$$
$$c_i \sim \mathrm{Categorical}(1/K, \ldots, 1/K), \quad i = 1, \ldots, n$$
$$x_i \mid c_i, \mu \sim \mathcal{N}(c_i^\top \mu, 1), \quad i = 1, \ldots, n$$

The latent variables are $Z = (\mu, c)$: the K class means and the n class assignments.

Example: Gaussian mixture, mean-field variational family (Blei et al., 2017):

$$q(\mu, c) = \prod_{k=1}^{K} q(\mu_k; m_k, s_k^2) \prod_{i=1}^{n} q(c_i; \varphi_i)$$

where each $q(\mu_k; m_k, s_k^2)$ is a Gaussian factor on a component mean and each $q(c_i; \varphi_i)$ is a categorical factor on a class assignment.

Example: Gaussian mixture, CAVI algorithm (Blei et al., 2017). [Algorithm shown in figure]

Example: Gaussian mixture. Simulate a two-dimensional Gaussian mixture with 5 components (Blei et al., 2017). [Results shown in figure]

Implementation. R packages:
- VBmix: variational algorithms and methods for fitting mixture models (Windows binary unavailable on CRAN)
- VBLPCM: Variational Bayes Latent Position Cluster Model for networks
- locus (unofficial): large-scale variational inference for combined covariate and response selection in sparse regression models
STAN 2.7 and up (2015) can do various methods of variational inference.

Why (not)? Pros: fast, and easier to scale to large data. Cons:
- Statistical properties are less well understood.
- Less accurate compared to MCMC: it generally underestimates the variance of the posterior density.
- Deriving the equations used to iteratively update the parameters often requires a large amount of work (compared to, e.g., Gibbs sampling).
Active field of research (Blei et al., 2017).

References
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183-233.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 1-305. (Book)
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, University College London.
http://blog.evjang.com/2016/08/variational-bayes.html

Questions?