Variational inference / Variational Bayesian methods


Outline
- What is it
- How does it work
- Closeness
- Variational family
- Optimization
- Example
- Drawbacks

Variational inference allows us to rewrite statistical problems as optimization problems. It is very popular in statistical machine learning (deep learning). [Figure: layered graphical model with an input layer, hidden states S1-S5, and output observations O1-O5]

Variational inference: first used by Peterson & Anderson (1987) for neural networks; Neal & Hinton (1993) made connections to the EM algorithm. Also used in Bayesian inference for complex models (intractable integrals), large data, and model selection.

How does it work: $p(Z \mid X)$ is the posterior probability, $p(X \mid Z)$ the likelihood, and $p(Z)$ the prior probability, where $Z$ is the hidden/latent variable and $X$ the observed variable. What if we don't know how to sample from $p(Z \mid X)$, or computing $p(Z \mid X)$ is very complicated?

How does it work: what is "close"? We measure closeness with the Kullback-Leibler (KL) divergence. Let $\mathcal{Q}$ be a family of densities over the latent variables, where each $q(z) \in \mathcal{Q}$ is a candidate approximation. The optimal variational density is

$$q^*(z) = \arg\min_{q(z) \in \mathcal{Q}} \mathrm{KL}(q(Z) \,\|\, p(Z \mid X))$$

Expanding the divergence:

$$\mathrm{KL}(q(Z) \,\|\, p(Z \mid X)) = \mathbb{E}[\log q(Z)] - \mathbb{E}[\log p(Z \mid X)] = \mathbb{E}[\log q(Z)] - \mathbb{E}[\log p(Z, X)] + \log p(X)$$

Evidence lower bound (ELBO):

$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(Z, X)] - \mathbb{E}[\log q(Z)]$$

i.e., the negative KL divergence, plus a constant: $\log p(X)$.
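For a small discrete latent variable, every quantity above is computable exactly, so the identity $\log p(X) = \mathrm{ELBO}(q) + \mathrm{KL}(q(Z)\,\|\,p(Z \mid X))$ can be checked numerically. A minimal sketch (the prior and likelihood numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete model: Z has 3 states, X is one observed data point.
prior = np.array([0.5, 0.3, 0.2])      # p(Z)
lik = np.array([0.1, 0.6, 0.3])        # p(X | Z) evaluated at the observed X

joint = prior * lik                    # p(Z, X)
log_px = np.log(joint.sum())           # log evidence: the hard part in general
posterior = joint / joint.sum()        # exact p(Z | X)

# An arbitrary variational density q(Z) over the 3 states.
q = rng.dirichlet(np.ones(3))

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # E_q[log p(Z,X)] - E_q[log q(Z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # KL(q(Z) || p(Z|X))

# The two always sum to log p(X), so maximizing the ELBO minimizes the KL.
print(elbo + kl, log_px)
```

Because $\log p(X)$ is fixed by the data, maximizing the ELBO over $q$ is equivalent to minimizing the KL divergence, which is why the intractable posterior never needs to be evaluated during optimization.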

How does it work: what is "close"? The optimal variational density maximizes the evidence lower bound (ELBO), which can be rewritten as

$$\mathrm{ELBO}(q) = \mathbb{E}[\log p(Z)] + \mathbb{E}[\log p(X \mid Z)] - \mathbb{E}[\log q(Z)] = \mathbb{E}[\log p(X \mid Z)] - \mathrm{KL}(q(Z) \,\|\, p(Z))$$

The expected likelihood term encourages densities that place their mass on configurations of the latent variables that explain the observed data; the second term, the difference between the variational density and the prior, encourages densities close to the prior.

How does it work: the variational family $\mathcal{Q}$. The mean-field variational family:

$$q(Z) = \prod_{j=1}^{m} q_j(z_j)$$

The latent variables are mutually independent, and each latent variable is governed by a distinct factor in the variational density.
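The independence assumption has a visible cost even in the simplest case. For a Gaussian target with precision matrix $\Lambda$, the mean-field solution is known in closed form: each factor $q_j$ is Gaussian with variance $1/\Lambda_{jj}$ (a standard result for factorized Gaussians). A sketch with made-up numbers, showing how a strong correlation makes the mean-field variances much smaller than the true marginals:

```python
import numpy as np

# Toy Gaussian "posterior" p(z1, z2) = N(0, Sigma) with strong correlation.
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Lam = np.linalg.inv(Sigma)        # precision matrix

# Closed-form mean-field solution for a Gaussian target:
# q_j(z_j) is Gaussian with variance 1 / Lam[j, j].
mf_var = 1.0 / np.diag(Lam)

print(np.diag(Sigma))             # true marginal variances: [1. 1.]
print(mf_var)                     # mean-field variances: about [0.19 0.19]
```

This is the same variance-underestimation behavior noted in the drawbacks below: ignoring posterior correlations forces the factorized approximation to be too narrow.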

How does it work: optimization. CAVI: coordinate ascent variational inference.
- For latent variable $Z_j$, fix all other variational factors.
- Optimize $q_j(z_j)$ given the fixed values of the other factors and the data $X$.
- Continue to the next latent variable, and iterate through all of them.
- Continue until converged: we have now reached a local optimum of the ELBO.
CAVI is the variational-inference analogue of the Gibbs sampler.

Example: Gaussian mixture. A Bayesian mixture of unit-variance univariate Gaussians: K mixture components, i.e., K Gaussian distributions with means $\mu = (\mu_1, \ldots, \mu_K)$ and a common prior $p(\mu_k)$. Full hierarchical model (Blei et al., 2017):

$$\mu_k \sim \mathcal{N}(0, \sigma^2), \quad k = 1, \ldots, K$$
$$c_i \sim \mathrm{Categorical}(1/K, \ldots, 1/K), \quad i = 1, \ldots, n$$
$$x_i \mid c_i, \mu \sim \mathcal{N}(c_i^\top \mu, 1), \quad i = 1, \ldots, n$$

The latent variables are $Z = (\mu, c)$: the K class means and the n class assignments.

Example: Gaussian mixture, mean-field variational family (Blei et al., 2017):

$$q(\mu, c) = \prod_{k=1}^{K} q(\mu_k; m_k, s_k^2) \prod_{i=1}^{n} q(c_i; \varphi_i)$$

where each $q(\mu_k; m_k, s_k^2)$ is a Gaussian factor on a component mean and each $q(c_i; \varphi_i)$ is a categorical factor on a class assignment.

Example: Gaussian mixture, CAVI algorithm (Blei et al., 2017). [Algorithm shown in figure]

Example: Gaussian mixture. Simulate a two-dimensional Gaussian mixture with 5 components (Blei et al., 2017). [Results shown in figure]

Implementation. R packages:
- VBmix: variational algorithms and methods for fitting mixture models (Windows binary unavailable on CRAN)
- VBLPCM: Variational Bayes Latent Position Cluster Model for networks
- locus (unofficial): large-scale variational inference for combined covariate and response selection in sparse regression models
STAN 2.7 and up (2015) can do various methods of variational inference.

Why (not)? Pros: fast, and easier to scale to large data. Cons:
- Statistical properties are less well understood.
- Less accurate compared to MCMC: it generally underestimates the variance of the posterior density.
- Deriving the equations used to iteratively update the parameters often requires a large amount of work (compared to, e.g., Gibbs sampling).
Active field of research (Blei et al., 2017).

References
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183-233.
Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2), 1-305. (Book)
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. PhD thesis, University College London.
http://blog.evjang.com/2016/08/variational-bayes.html

Questions?