Foundations of Graphical Models


David M. Blei
Columbia University

Probabilistic modeling is a mainstay of modern machine learning and statistics research, providing essential tools for analyzing the vast amounts of data that have become available in science, government, industry, and everyday life. This course will cover the mathematical and algorithmic foundations of this field, as well as methods underlying the current state of the art.

[ What kinds of problems with data do you care about? ]

Over the last century, many problems have been solved (at least partially) with probabilistic models. These include:

- Group genes into clusters
- Filter email that is likely to be spam
- Transcribe speech from a recorded signal
- Identify recurring patterns in gene sequences
- Uncover hidden topics in collections of texts
- Predict what someone will purchase based on his or her purchase history
- Track an object's position via radar measurements
- Determine the structure of the evolutionary tree of a set of species
- Identify the ancestral populations embedded in the human population
- Diagnose a disease from its symptoms
- Decode an original message from a noisy transmission
- Understand the phase transitions in a physical system of electrons
- Find the communities embedded in a massive social network
- Locate politicians on the political spectrum based on their voting records

For each of these applications of probabilistic modeling, someone determined a statistical model, fit that model to observed data, and used the fitted model to solve the task at hand.

As one might expect from the diversity of applications listed above, each model was developed and studied within a different intellectual community. Over the past two decades, scholars working in the field of machine learning have sought to unify such data analysis activities. Their focus has been on developing tools for devising, analyzing, and implementing probabilistic models in generality. These efforts have led to the body of work on probabilistic graphical models, a marriage of graph theory and probability theory. Graphical models provide a language for expressing assumptions about data, and a suite of efficient algorithms for reasoning and computing with those assumptions. As a consequence, graphical models research has forged connections between signal processing, coding theory, computational biology, natural language processing, computer vision, and many other fields. Knowledge of graphical models is essential to academics working in machine learning and statistics, and is of increasing importance to those in the other scientific and engineering fields to which these methods have been applied.

Example: Latent Dirichlet allocation

To give you an idea of what applied probabilistic modeling is, I will quickly describe latent Dirichlet allocation (LDA) (Blei et al., 2003), which is a kind of probabilistic topic model. (If you have seen me speak, you have probably heard about LDA.) Basically, LDA is a model of large document collections that can be used to automatically extract the hidden topics that pervade them and how each document expresses those topics. It has become a widely used method for modeling digital content, and is an example of a successfully deployed probabilistic model. (I developed LDA with Andrew Ng and Michael Jordan in the late nineties. Note it was my final project in a class like this. Andrew Ng was the TA; Michael Jordan was the professor.)

[ Show slides about LDA and probabilistic modeling in my research group. ]
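To make this concrete, here is a minimal sketch of fitting an LDA-style topic model to a toy corpus. It uses scikit-learn's LatentDirichletAllocation purely for illustration; the corpus, the number of topics, and every setting below are invented for this sketch and are not prescribed by the course.

```python
# A minimal, illustrative sketch of fitting LDA to a toy corpus with scikit-learn.
# The corpus, number of topics, and settings are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna sequence expression protein",
    "protein cell expression gene regulation",
    "election senate votes policy congress",
    "policy votes president election campaign",
]

# Represent each document as a vector of word counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Fit LDA with two hidden topics; fit_transform returns per-document topic proportions.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Inspect the most probable words under each topic.
vocab = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:][::-1]]
    print(f"topic {k}:", top_words)
```

The output is exactly the two pieces described above: the hidden topics (each summarized by its most probable words) and, for each document, the proportions with which it expresses those topics.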

Box's Loop

[ This text was taken, largely unchanged, from Blei (2014). ]

[ Figure 1 is a diagram of Box's loop: data feed into "build model" (mixtures and mixed-membership; time series; generalized linear models; factor models; Bayesian nonparametrics), then "infer hidden quantities" (Markov chain Monte Carlo; variational inference; Laplace approximation), then "criticize model" (performance on a task; prediction on unseen data; posterior predictive checks), which leads either to "apply model" (predictive systems; data exploration; data summarization) or back around through "revise model". ]

Figure 1: Building and computing with models is part of an iterative process for solving data analysis problems. This is Box's loop, an adaptation of the perspective of Box (1976).

Our perspective is that building and using probabilistic models is part of an iterative process for solving data analysis problems. First, formulate a simple model based on the kinds of hidden structure that you believe exist in the data. Then, given a data set, use an inference algorithm to approximate the posterior, the conditional distribution of the hidden variables given the data, which points to the particular hidden patterns that your data exhibit. Finally, use the posterior to test the model against the data, identifying the important ways that it succeeds and fails. If satisfied, use the model to solve the problem; if not satisfied, revise the model according to the results of the criticism and repeat the cycle. Figure 1 illustrates this process.

We call this process Box's loop. It is an adaptation (an attempt at revival, really) of the ideas of George Box and collaborators in their papers from the 1960s and 1970s (Box and Hunter, 1962, 1965; Box and Hill, 1967; Box, 1976, 1980). Box focused on the scientific method, understanding nature by iterative experimental design, data collection, model formulation, and model criticism. But his general approach just as easily applies to other applications of probabilistic modeling. It applies to engineering, where the goal is to use a model to build a system that performs a task, such as information retrieval or item recommendation. And it applies to exploratory data analysis, where the goal is to summarize, visualize, and hypothesize about observational data, i.e., data that we observe but that are not part of a designed experiment.

Why revive this perspective now? The future of data analysis lies in close collaborations between domain experts and modelers. Box's loop cleanly separates the tasks of articulating domain assumptions into a probability model, conditioning on data and computing with that model, evaluating it in realistic settings, and using the evaluation to revise it. It is a powerful methodology for guiding collaborative efforts in solving data analysis problems.

As machine learning researchers and statisticians, our research goal is to make Box's loop easy to implement, and modern research has radically changed each component in the half-century since Box conceived of it. We have developed intuitive grammars for building models, scalable algorithms for computing with a wide variety of models, and general methods for understanding the performance of a model to guide its revision. This course gives a curated view of the state-of-the-art research for implementing Box's loop.

In the first step of the loop, we build (or revise) a probability model. Probabilistic graphical models (Pearl, 1988; Dawid and Lauritzen, 1993; Jordan, 2004) form a field of research that connects graph theory to probability theory and provides an elegant language for building models. With graphical models, we can clearly articulate what kinds of hidden structures are governing the data and construct complex models from simpler components, like clusters, sequences, and hierarchies, to tailor our models to the data at hand. This language gives us a palette with which to posit and revise our models.

The observed data enter the picture in the second step of Box's loop. Here we compute the posterior distribution, the conditional distribution of the hidden patterns given the observations, to understand how the hidden structures we assumed are manifested in the data.[1] Most useful models are difficult to compute with, however, and researchers have developed powerful approximate posterior inference algorithms for approximating these conditionals. Techniques like Markov chain Monte Carlo (MCMC) (Metropolis et al., 1953; Hastings, 1970; Geman and Geman, 1984) and variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008) make it possible for us to examine large data sets with sophisticated statistical models. Moreover, these algorithms are modular: recurring components in a graphical model lead to recurring subroutines in their corresponding inference algorithms. This has led to more recent work in efficient generic algorithms, which can be easily applied to a wide class of models (Gelfand and Smith, 1990; Bishop et al., 2003).

[1] In a way, we take a Bayesian perspective because we treat all hidden quantities as random variables and investigate them through their conditional distribution given observations. However, we prefer the more general language of latent variables, which can be either parameters shared by the whole data set or local hidden structure attached to individual data points (or something in between). Further, in performing model criticism we will step out of the Bayesian framework to ask whether the model we assumed has good properties in the sampling sense.
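To give a feel for what these inference algorithms do in the simplest possible setting, here is a toy random-walk Metropolis sampler (my own illustration, not course material). It approximates the posterior of a single Gaussian mean under a standard normal prior and a known unit observation variance; real models call for the more careful and more scalable algorithms cited above, but the propose/accept structure is the same.

```python
# Toy random-walk Metropolis sampler for the posterior of a Gaussian mean.
# Assumes a N(0, 1) prior on mu and N(mu, 1) observations; illustrative only,
# since conjugacy gives this posterior in closed form.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=50)    # simulated observations

def log_joint(mu):
    log_prior = -0.5 * mu ** 2                    # log N(mu | 0, 1), up to a constant
    log_lik = -0.5 * np.sum((data - mu) ** 2)     # log N(data | mu, 1), up to a constant
    return log_prior + log_lik

mu, samples = 0.0, []
for _ in range(5000):
    proposal = mu + rng.normal(scale=0.2)         # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_joint(proposal) - log_joint(mu):
        mu = proposal                             # accept; otherwise keep the current mu
    samples.append(mu)

posterior = np.array(samples[1000:])              # discard burn-in
print("posterior mean ~", posterior.mean(), " posterior sd ~", posterior.std())
```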

Finally, we close the loop, studying how our models succeed and fail to guide the process of revision. Here again is an opportunity for a revival. With new methods for quickly building and computing with sophisticated models, we can make better use of techniques like predictive sample reuse (Geisser, 1975) and posterior predictive checks (Box, 1980; Rubin, 1984; Gelman et al., 1996). These are general techniques for assessing model fitness, contrasting the predictions that a model makes against the observed data. Understanding a model's performance in the ways that matter to the task at hand, an activity called model criticism, is essential to solving modern data analysis problems.
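As a sketch of what a posterior predictive check can look like in practice (again my own toy illustration, with an invented data set and test statistic): draw replicated data sets from the posterior predictive distribution and ask whether a statistic of the observed data looks typical among the replications.

```python
# Toy posterior predictive check for an iid Beta-Bernoulli model.
# The data, prior, and test statistic are invented for illustration only.
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.7, size=100)                # "observed" binary data

# Beta(1, 1) prior with a Bernoulli likelihood gives a Beta posterior (conjugacy).
a_post, b_post = 1 + y.sum(), 1 + len(y) - y.sum()

def longest_run(x):
    """Length of the longest run of ones, a statistic the iid model may miss."""
    best = run = 0
    for v in x:
        run = run + 1 if v == 1 else 0
        best = max(best, run)
    return best

t_obs = longest_run(y)

# Simulate replicated data sets from the posterior predictive distribution.
t_rep = []
for _ in range(2000):
    theta = rng.beta(a_post, b_post)              # draw a parameter from the posterior
    y_rep = rng.binomial(1, theta, size=len(y))   # draw a replicated data set
    t_rep.append(longest_run(y_rep))

# Posterior predictive p-value: how often the replications are as extreme as the data.
ppp = np.mean(np.array(t_rep) >= t_obs)
print("observed longest run:", t_obs, " posterior predictive p-value:", ppp)
```

If the observed statistic sits far in the tail of the replicated statistics, the check flags a way in which the model fails to capture the data, and that failure guides the next revision of the model.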

Course Topics

Here is an organized list of what we will cover in this course. Note that the readings will often go beyond what we can cover in lecture.

The basics of graphical models
- Basic concepts in probability; the semantics of graphical models
- D-separation and conditional independence in graphical models
- Message passing, tree propagation, and a word about the junction tree

Learning about data with probability models
- Probability models, data, and statistical concepts
- Bayesian mixtures of Gaussians and why we need approximate inference
- Markov chain Monte Carlo sampling and the Gibbs sampler
- The exponential family, conjugacy, and mixtures of exponential families
- Variational inference (and a word about expectation maximization)

The building blocks of complex models
- Mixtures and mixed membership models (including topic models)
- Matrix factorization: Gaussian, Poisson, exponential family
- Time series models: Hidden Markov models and state-space models
- Spatial models
- Regression: Linear and logistic, generalized linear models, regularization
- Bayesian nonparametrics I: Clustering models
- Bayesian nonparametrics II: Latent feature models
- Bayesian nonparametrics III: Gaussian processes

Advanced topics
- Scalable inference with stochastic variational inference
- Model checking and revision with posterior predictive checks
- Hierarchical models, multi-level models, empirical Bayes
- Causal inference and probabilistic modeling (a rabbit hole)

Additional Discussion

Programming languages. I expect you to be implementing and experimenting with the methods that we study. There are no programming assignments, so it is up to you to do this in advance of your final project. For prototyping and developing algorithms, I like the programming language R and embellishments like RStudio. This is not the only choice; I know that many like to use Python, Julia, and probably others I do not know about. (Matlab seems to have fallen out of favor.) On the backend, to make things fast, I use C. But it seems that my collaborators mostly use C++.

Stan is a probabilistic programming language that is actively developed here at Columbia by Andrew Gelman, Bob Carpenter, and colleagues. It lets you specify a probabilistic model programmatically and then compile it down to an inference algorithm, an executable that takes data as input and returns estimates of the posterior distribution. I encourage you to try it out at some point during the semester.

Solving real problems involves many hours of data wrangling, working with online APIs and otherwise cleaning and manipulating data so that it is easy to analyze. For this important activity, you will need to be fluent in a scripting language. I recommend Python.

Applications. We will focus on methods. We will mention applications, especially as motivating concrete examples, but there will not be reading about applications. Each of you will be doing a project connected to an application, and so you are expected to read and absorb that material on your own. A student doing a project about recommendation systems should read about the state of the art in probabilistic recommendation systems; a student doing a project about population genetics should read about that field.

Building and using models. This course is about how to build and compute with probabilistic models that are tailored to the problem at hand. (Note that it is not a course that gives a cookbook of methods and when to use them.) Returning to the figure about Box's loop, we are going to focus on the model building piece and the inference piece. What components are in my toolbox with which to build models? How do I compose them together? What algorithms are available to compute with the resulting model, and what are their properties? How do I derive an algorithm for the model I want to work with? Two of the other pieces of the picture, getting the right data and using the results of inference, are equally important, but are specific to the problems that you will be individually working on. The final piece, revising models (and building them in the first place), is a fuzzy and difficult problem. We will discuss it toward the end of the semester, but building and diagnosing models is more of a craft at this point, one learned through experience.

References

Bishop, C., Spiegelhalter, D., and Winn, J. (2003). VIBES: A variational inference engine for Bayesian networks. In Neural Information Processing Systems. Cambridge, MA.

Blei, D. (2014). Build, compute, critique, repeat: Data analysis with latent variable models. Annual Review of Statistics and Its Application, 1:203–232.

Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Box, G. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799.

Box, G. (1980). Sampling and Bayes inference in scientific modeling and robustness. Journal of the Royal Statistical Society, Series A, 143(4):383–430.

Box, G. and Hill, W. (1967). Discrimination among mechanistic models. Technometrics, 9(1):57–71.

Box, G. and Hunter, W. (1962). A useful method for model-building. Technometrics, 4(3):301–318.

Box, G. and Hunter, W. (1965). The experimental study of physical mechanisms. Technometrics, 7(1):23–42.

Dawid, A. and Lauritzen, S. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics, 21(3):1272–1317.

Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70:320–328.

Gelfand, A. and Smith, A. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85:398–409.

Gelman, A., Meng, X., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6:733–807.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Jordan, M. (2004). Graphical models. Statistical Science, 19(1):140–155.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.

Rubin, D. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172.

Wainwright, M. and Jordan, M. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.