Parameter and Structure Learning in Graphical Models

Advanced Signal Processing 2 SE Parameter and Structure Learning in Graphical Models 02.05.2005 Stefan Tertinek turtle@sbox.tugraz.at

Outline Review: Graphical models (DGM, UGM) Learning issues (approaches, observations etc.) Parameter learning: Frequentist approach (Likelihood function, MLE) Bayesian approach (Bayes rule, MAP) Detailed example: Gaussian density estimation Structure learning: Search-and-score approach Conclusion 2

Review: Graphical Models (GM) GM = Probability theory + Graph theory Tool for dealing with uncertainty and complexity Notion of modularity Representation of a GM: a graph is a pair G = (V, E), with a set of nodes V and a set of edges E Lack of edges: Conditional independence! Factorisation of the joint probability distribution Fewer parameters -> learning easier 3

Review: Directed Graphical Model = Bayesian network, belief network Uses Bayes rule for inference DAG: Directed acyclic graph (causal dependencies) Parent-child relationship: Directed local Markov property Joint probability distribution: Factored representation 4
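
The factored representation mentioned above is the standard Bayesian-network factorization over the parents of each node; the slide's own formula did not survive the transcription, so the following is the standard form:

```latex
p(x_1, \dots, x_n) = \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr)
```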

Review: Undirected Graphical Model = Markov random field, Markov networks Global and local Markov property Joint probability distribution: 5
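
The joint distribution of a UGM factorizes over cliques; the formula is likewise missing from the transcript, so here is the standard form with clique potentials psi_C and partition function Z:

```latex
p(x) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z = \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```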

Parameter Vs. Structure Learning Parameter learning = parameter estimation: Discrete variables: the CPD is a table (e.g. for a binary variable) Continuous variables: the CPD is a parametric density (e.g. a Gaussian) Structure learning = model selection: inferring the graph G 6

Full Vs. Partial Observations Fully observed variables (= complete data): Data is available on all variables in the network Partially observed variables (= incomplete data): Missing data Hidden variables General assumption: Missing at random Learning is harder (no closed-form solution for the likelihood) 7

Frequentists Vs. Bayesians 1/2 The Frequentists: Probability is an objective quantity A parameter θ is an unknown but fixed quantity (p(x|θ) is a family of distributions indexed by θ) Consider various estimators for θ and choose the best one (low bias, low variance) Likelihood: Consider p(D|θ) as a function of θ for fixed data D (inverts the relationship between them) Advantage: Mathematically / computationally simple 8

Frequentists Vs. Bayesians 2/2 The Bayesians: Probability is a person's degree of belief and therefore subjective A parameter θ is a random variable with a prior distribution (the model p(x|θ) is treated as a CPD) Update the degree of belief in θ using Bayes rule (inverts the relationship between data and parameter) Data is a quantity to be conditioned on Advantages: Works well when the amount of data is small relative to the number of parameters Can be used for model selection 9

What will we focus on? Learning issues along four dimensions: Approach (Frequentist vs. Bayesian), Model (DGM vs. UGM), Variables (fully observed vs. partially observed), and learning Task (parameter vs. structure) 10

Overview: Learning Approaches Complete data, known structure: parameter estimation (ML, MAP) Complete data, unknown structure: optimization over structures Incomplete data, known structure: parametric optimization (EM, gradient descent, stochastic sampling methods) Incomplete data, unknown structure: optimization over structures and parameters (structural EM) 11

Where are we? Review: Graphical models (DGM, UGM) Learning issues (approaches, observations etc.) Parameter learning: Frequentist approach (Likelihood function, MLE) Bayesian approach (Bayes rule, MAP) Detailed example: Gaussian density estimation Structure learning: Search-and-score approach Conclusion 12

Learning Parameters From Data 1/2 Given: a structure G that is known and fixed (a DAG), and a data set D Goal: learn the conditional probability distribution of each node (Figure: an example network structure over variables A-E, a data set of observed configurations, and the node parameters to be estimated.) 13

Learning Parameters From Data 2/2 Maximum likelihood estimation: Parameter values are fixed but unknown Estimate these values by maximizing the probability of obtaining the samples observed Bayesian estimation: Parameters are random variables having some known prior distribution Observing new samples converts the prior to a posterior density 14

Frequentist Approach 1/5 Given: a data set D of M observations Assumption: the observations are independently and identically distributed according to the joint probability distribution (i.i.d. samples) Aim: use the data set to estimate the unknown parameter vector θ 15

Frequentist Approach 2/5 Define the likelihood function: due to the i.i.d. assumption it factorizes over the observations Maximum likelihood estimation: choose the parameter vector that maximizes the likelihood function, i.e. the one most likely to have generated the data Trick: maximize the log-likelihood instead 16
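
The likelihood, log-likelihood and maximum likelihood estimator referred to here take the standard form under the i.i.d. assumption (the slide's own formulas did not survive the transcription):

```latex
L(\theta; D) = \prod_{m=1}^{M} p\bigl(x^{(m)} \mid \theta\bigr),
\qquad
\ell(\theta; D) = \sum_{m=1}^{M} \log p\bigl(x^{(m)} \mid \theta\bigr),
\qquad
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta}\, \ell(\theta; D)
```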

Frequentist Approach 3/5 Detailed example Given: a network structure, a choice of representation for the parameters, and a data set The log-likelihood function factorizes due to the graph structure 17

Frequentist Approach 4/5 Assume parameter independence: θ_i are the parameters associated with node i The problem reduces to learning three separate small DAGs 18

Frequentist Approach 5/5 Generalizing for any Bayes net The likelihood decomposes according to the structure of the graph Independent estimation problems: Maximize each likelihood function separately 19
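
For fully observed discrete data this decomposition means each node's CPT can be estimated independently by simple counting. A minimal Python sketch, assuming complete discrete data and a known structure; the data format, function name, and the tiny A -> B example are illustrative, not from the slides:

```python
from collections import defaultdict

def mle_cpts(data, parents):
    """Maximum likelihood CPTs for a discrete Bayes net with complete data.

    data:    list of dicts mapping variable name -> observed value
    parents: dict mapping variable name -> tuple of parent names
    Returns: dict mapping (var, parent_config) -> {value: probability}
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sample in data:
        for var, pa in parents.items():
            pa_config = tuple(sample[p] for p in pa)
            counts[(var, pa_config)][sample[var]] += 1

    cpts = {}
    for key, value_counts in counts.items():
        total = sum(value_counts.values())
        cpts[key] = {v: c / total for v, c in value_counts.items()}
    return cpts

# Tiny hypothetical example with structure A -> B.
data = [{"A": 0, "B": 1}, {"A": 0, "B": 0}, {"A": 1, "B": 1}, {"A": 1, "B": 1}]
parents = {"A": (), "B": ("A",)}
print(mle_cpts(data, parents))
```

Because the log-likelihood splits into one term per node, each CPT is fitted from its own counts; no joint optimization is needed.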

Bayesian Approach 1/2 Assumptions: 1) θ is a quantity whose variation can be described by a prior probability distribution 2) The samples in the data set are drawn independently from a density whose form is assumed to be known but whose parameters are not known exactly 20

Bayesian Approach 2/2 Given the data, the prior distribution can be updated to form the posterior distribution using Bayes rule Link between the Frequentist and Bayesian view: posterior ∝ likelihood x prior Maximum a posteriori (MAP) estimate: MAP = MLE if the prior is uniform 21
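
The Bayes-rule update and the MAP estimator referred to here are standard; the slide's formulas are missing from the transcript, so:

```latex
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)
```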

Gaussian Density Estimation 1/7 Univariate Gaussian distribution with parameter vector θ = (μ, σ²) Given: multiple observations x^(1), ..., x^(M) which are i.i.d. (an assumption that is not strictly necessary) Aim: estimate θ based on the observations using a Frequentist and a Bayesian approach 22

Gaussian Density Estimation 2/7 FREQUENTIST APPROACH Graphical model: the Frequentists do not condition on the data Use maximum likelihood estimation The joint probability is written as the product of local probabilities 23

Gaussian Density Estimation 3/7 The log-likelihood function is maximized with respect to the parameters μ and σ² For a Gaussian distribution: the MLE of the mean is the sample mean, and the MLE of the variance is the sample variance 24
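
A small numerical check of this result (illustrative, not from the slides), using NumPy: the closed-form MLE coincides with the sample mean and the (biased) sample variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 1.5
x = rng.normal(true_mu, true_sigma, size=1000)  # i.i.d. Gaussian samples

# Closed-form maximum likelihood estimates for a univariate Gaussian:
mu_ml = x.mean()                     # sample mean
var_ml = ((x - mu_ml) ** 2).mean()   # sample variance (biased MLE, divides by M)

print(f"mu_ML  = {mu_ml:.3f}  (true {true_mu})")
print(f"var_ML = {var_ml:.3f}  (true {true_sigma**2})")
```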

Gaussian Density Estimation 4/7 BAYESIAN APPROACH The Bayesians: the data is conditionally independent given the parameters Choose a prior distribution Assume the variance σ² is a known constant Goal: find the posterior over the mean μ Modeling decision: what prior should we take for μ? 25

Gaussian Density Estimation 5/7 Take the prior distribution over μ to be Gaussian (hierarchical Bayesian modeling) Hyperparameters: a fixed mean μ_0 and variance σ_0² for the prior Graphical model: the data is assumed to be conditionally independent given the parameters 26

Gaussian Density Estimation 6/7 Multiply the prior with the likelihood to obtain the posterior The posterior distribution is again Gaussian: its mean is a linear combination of the sample mean and the prior mean, and the inverses of the data variance and the prior variance add 27
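
The slide's formulas are not in the transcript; the standard conjugate-Gaussian result it describes (likelihood with known variance σ², Gaussian prior N(μ_0, σ_0²) on the mean, M observations with sample mean x̄) is:

```latex
p(\mu \mid D) = \mathcal{N}\bigl(\mu \mid \mu_M, \sigma_M^2\bigr),
\qquad
\mu_M = \frac{M \sigma_0^2\, \bar{x} + \sigma^2 \mu_0}{M \sigma_0^2 + \sigma^2},
\qquad
\frac{1}{\sigma_M^2} = \frac{M}{\sigma^2} + \frac{1}{\sigma_0^2},
\qquad
\bar{x} = \frac{1}{M}\sum_{m=1}^{M} x^{(m)}
```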

Gaussian Density Estimation 7/7 Interpretation of the result: μ_M is our best guess for the mean after observing the data, and σ_M² is the uncertainty about this guess μ_M always lies between the sample mean and the prior mean μ_0 If σ_0 = 0, then μ_M = μ_0 and no amount of data can change our prior opinion If σ_0 is very large (we are very uncertain about our prior guess), then μ_M approaches the sample mean As M grows, σ_M² -> 0 and μ_M -> μ_ML (for large data sets the two approaches provide the same result) 28
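
A small numerical illustration of the large-sample claim (assumed setup: known σ, Gaussian prior on the mean; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0             # known data standard deviation
mu0, sigma0 = 0.0, 2.0  # prior mean and standard deviation for mu
true_mu = 3.0

for M in (1, 10, 100, 10000):
    x = rng.normal(true_mu, sigma, size=M)
    xbar = x.mean()  # maximum likelihood estimate of the mean
    post_var = 1.0 / (M / sigma**2 + 1.0 / sigma0**2)
    post_mean = post_var * (M * xbar / sigma**2 + mu0 / sigma0**2)
    print(f"M={M:6d}  mu_ML={xbar:6.3f}  posterior mean={post_mean:6.3f}")
```

As M grows, the posterior mean tracks the sample mean, matching the statement that the Frequentist and Bayesian answers agree for large data sets.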

Where are we? Review: Graphical models (DGM, UGM) Learning issues (approaches, observations etc.) Parameter learning: Frequentist approach (Likelihood function, MLE) Bayesian approach (Bayes rule, MAP) Detailed example: Gaussian density estimation Structure learning: Search-and-score approach Conclusion 29

Learning Structure From Data Given: a data set D and possibly prior knowledge about the network structure G Goal: learn the full network structure G (parameter learning often appears as a sub-problem) (Figure: an example data set of observed configurations over variables A-E.) 30

First Approach How could we learn a structure? Naive approach: enumerate all possible network structures and choose the one which maximizes some criterion Problem: enumeration becomes infeasible as the number of nodes grows; e.g. 10 nodes already lead to roughly 4.2 x 10^18 possible structures Unless we have prior (expert) knowledge to eliminate some possible structures, we need statistically efficient search strategies 31

Equivalent Probability Models Given: a GM with 3 nodes (binary random variables) Number of possible structures: 25 Two different structures can define the same joint distribution: using Bayes rule, one factorization can be rewritten into the other, so the two graphs are equivalent probability models 32
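
The specific pair of structures on the slide did not survive the transcription; a standard illustration of the same point uses Bayes rule to rewrite the factorization of the chain A -> B -> C into that of A <- B -> C, so the two DAGs define identical joint distributions:

```latex
p(a)\,p(b \mid a)\,p(c \mid b) \;=\; p(b)\,p(a \mid b)\,p(c \mid b)
\quad\Longleftrightarrow\quad
(A \to B \to C) \;\equiv\; (A \leftarrow B \to C)
```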

Search-And-Score Approach 1/2 Idea: define a score function for measuring model quality (e.g. penalized likelihood) and use a search algorithm to find a (local) maximum of the score Scoring function: statistically motivated, assigns a score to a graph Goal: find the structure with the best score, given the data set 33
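
One common penalized-likelihood score is BIC. The sketch below is illustrative rather than from the slides: it scores a candidate DAG on fully observed discrete data with the decomposable score BIC = log-likelihood - (d/2) log M, where d is the number of free parameters.

```python
import math
from collections import defaultdict

def bic_score(data, parents, card):
    """BIC score of a candidate DAG for complete discrete data.

    data:    list of dicts mapping variable name -> observed value
    parents: dict mapping variable name -> tuple of parent names (candidate structure)
    card:    dict mapping variable name -> number of states of that variable
    """
    M = len(data)
    score = 0.0
    for var, pa in parents.items():
        # Count N(var = v, pa = config) and N(pa = config).
        joint, marg = defaultdict(int), defaultdict(int)
        for s in data:
            config = tuple(s[p] for p in pa)
            joint[(config, s[var])] += 1
            marg[config] += 1
        # Log-likelihood contribution of this node under its MLE CPT.
        loglik = sum(n * math.log(n / marg[config]) for (config, _), n in joint.items())
        # Number of free parameters of this node's CPT.
        n_configs = 1
        for p in pa:
            n_configs *= card[p]
        d = n_configs * (card[var] - 1)
        score += loglik - 0.5 * d * math.log(M)
    return score

# Hypothetical comparison of two candidate structures over A and B.
data = [{"A": 0, "B": 0}] * 40 + [{"A": 1, "B": 1}] * 40 + [{"A": 0, "B": 1}] * 20
card = {"A": 2, "B": 2}
print(bic_score(data, {"A": (), "B": ()}, card))      # A and B independent
print(bic_score(data, {"A": (), "B": ("A",)}, card))  # A -> B
```

A search procedure (e.g. greedy edge additions and removals) would call such a score on each candidate structure and keep the best one.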

Search-And-Score Approach 2/2 Frequentist way: maximize the likelihood of the data Bayesian score: proportional to the posterior probability of a network structure given the data Use search methods to find the optimal structure 34
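
The Bayesian score formula is missing from the transcript; in its standard form the posterior of a structure combines a structure prior with the marginal likelihood obtained by integrating out the parameters:

```latex
p(G \mid D) \;\propto\; p(G)\, p(D \mid G),
\qquad
p(D \mid G) = \int p(D \mid G, \theta)\, p(\theta \mid G)\, d\theta
```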

Where are we? Review: Graphical models (DGM, UGM) Learning issues (approaches, observations etc.) Parameter learning: Frequentist approach (Likelihood function, MLE) Bayesian approach (Bayes rule, MAP) Detailed example: Gaussian density estimation Structure learning: Search-and-score approach Conclusion 35

Conclusion Parameter learning: Frequentist approach: use the maximum likelihood estimate Bayesian approach: use the maximum a posteriori estimate The two approaches are equivalent for large data sets Structure learning: Search-and-score approach: optimize according to some scoring function and use search methods to find the optimal structure 36

References Heckerman, D. (1995). A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research. Buntine, W. (1996). A Guide to the Literature on Learning Probabilistic Networks from Data. IEEE Transactions on Knowledge and Data Engineering. Krause, P. J. (1998). Learning Probabilistic Networks. Knowledge Engineering Review, 13, 321-351. Aksoy, S. Lecture slides, CS 551 Pattern Recognition, http://www.cs.bilkent.edu.tr/~saksoy/courses/cs551/index.html 37