Learning Bayes Networks


Learning Bayes Networks (6.034). Based on Russell & Norvig, Artificial Intelligence: A Modern Approach, 2nd ed., 2003, and D. Heckerman, A Tutorial on Learning with Bayesian Networks, in Learning in Graphical Models, M. Jordan, ed., MIT Press, Cambridge, MA, 1999.

Statistical Learning Task
Given a set of observations (evidence), find {any/good/best} hypothesis that describes the domain and can predict the data and, we hope, data not yet seen.
The ML section of the course introduced various learning methods: nearest neighbors, decision (classification) trees, naive Bayes classifiers, perceptrons, ...
Here we introduce methods that learn (non-naive) Bayes networks, which can exhibit more systematic structure.

Characteristics of Learning BN Models
Benefits:
Handle incomplete data
Can model causal chains of relationships
Combine domain knowledge and data
Can avoid overfitting
Two main uses:
Find the (best) hypothesis that accounts for a body of data
Find a probability distribution over hypotheses that permits us to predict/interpret future data

An Example
Surprise Candy Corp. makes two flavors of candy: cherry and lime. Both flavors come in the same opaque wrapper.
Candy is sold in large bags, which have one of the following distributions of flavors but are visually indistinguishable:
h1: 100% cherry
h2: 75% cherry, 25% lime
h3: 50% cherry, 50% lime
h4: 25% cherry, 75% lime
h5: 100% lime
Relative prevalence of these types of bags is (0.1, 0.2, 0.4, 0.2, 0.1).
As we eat our way through a bag of candy, predict the flavor of the next piece; actually, a probability distribution.

Bayesian Learning
Calculate the probability of each hypothesis given the data: P(hi | d) = α P(d | hi) P(hi), where α normalizes over the hypotheses.
To predict the probability distribution over an unknown quantity X: P(X | d) = Σi P(X | hi) P(hi | d).
If the observations d are independent, then P(d | hi) = Πj P(dj | hi).
E.g., suppose the first 10 candies we taste are all lime.
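A minimal sketch in Python of this calculation for the candy example (the hypothesis priors and lime proportions are taken from the example above; the function names are just illustrative):

```python
# Bayesian learning for the Surprise Candy example: posterior over the five
# bag hypotheses, and the predictive probability of the next candy, after
# observing a run of lime candies.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h1) .. P(lime | h5)

def posterior(num_limes):
    """P(hi | d) after num_limes independent lime observations."""
    unnorm = [p * (q ** num_limes) for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def p_next_lime(num_limes):
    """P(next = lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(q * w for q, w in zip(p_lime, posterior(num_limes)))

for k in range(11):
    print(k, [round(p, 3) for p in posterior(k)], round(p_next_lime(k), 3))
```

Running this reproduces the two curves on the next slide: the posterior mass shifts toward h5 as limes accumulate, and the predictive probability of lime rises toward 1.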

Learning Hypotheses and Predicting from Them
[Figure: (a) posterior probabilities P(h1 | d) ... P(h5 | d) after k lime candies; (b) probability that the next candy is lime; both plotted against the number of samples in d. Images by MIT OpenCourseWare.]
MAP prediction: predict just from the most probable hypothesis.
After 3 limes, h5 is most probable, hence we predict lime, even though, by (b), it's only 80% probable.

Observations
The Bayesian approach asks for prior probabilities on hypotheses! A natural way to encode bias against complex hypotheses is to make their prior probability very low.
Choosing hMAP to maximize P(d | hi) P(hi) is equivalent to minimizing -log2 P(d | hi) - log2 P(hi); from our earlier discussion of entropy as a measure of information, these two terms are the number of bits needed to describe the data given the hypothesis and the number of bits needed to specify the hypothesis.
Thus MAP learning chooses the hypothesis that maximizes compression of the data: the Minimum Description Length principle.
Assuming uniform priors on hypotheses makes MAP yield hML, the maximum likelihood hypothesis, which maximizes P(d | hi).
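To make the equivalence concrete, here is a small check (Python, reusing the candy hypotheses from above; the variable names are assumptions for illustration) that maximizing P(d | hi) P(hi) and minimizing the two description-length terms select the same hypothesis:

```python
import math

# MAP selection two ways: maximize P(d|h)P(h), or minimize the description
# length -log2 P(d|h) - log2 P(h).  Both pick the same hypothesis.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
k = 3                                        # three lime candies observed

def likelihood(i):
    return p_lime[i] ** k                    # P(d | hi) for k i.i.d. lime observations

map_by_prob = max(range(5), key=lambda i: likelihood(i) * priors[i])
map_by_bits = min((i for i in range(5) if likelihood(i) > 0),
                  key=lambda i: -math.log2(likelihood(i)) - math.log2(priors[i]))
print(map_by_prob, map_by_bits)              # both print 4, i.e. h5, after 3 limes
```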

ML Learning (Simplest)
Surprise Candy Corp. is taken over by new management, who abandon their former bagging policies but continue to mix together a fraction θ of cherry and (1-θ) of lime candies in large bags.
Their policy is now represented by a parameter θ in [0,1], and we have a continuous set of hypotheses, hθ.
Assume we taste N candies, of which c are cherry and l = N - c are lime, so P(d | hθ) = θ^c (1-θ)^l.
For convenience, we maximize the log likelihood, L(θ) = c log θ + l log(1-θ).
Setting the derivative dL/dθ = c/θ - l/(1-θ) = 0 gives θ = c/N. Surprise!
But we need a Laplace correction for small data sets.
[Network: a single Flavor node with P(F=cherry) = θ.]
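A minimal sketch of the resulting estimators (Python; the two-pseudo-count form of the Laplace correction shown here is one common choice, assumed rather than taken from the slide):

```python
def theta_ml(c, n):
    """Maximum-likelihood estimate: maximizing c*log(theta) + (n-c)*log(1-theta)
    gives theta = c / n."""
    return c / n

def theta_laplace(c, n, k=2):
    """Laplace (add-one) correction: one pseudo-count per flavor value
    (k = number of flavor values), which avoids 0 and 1 on tiny samples."""
    return (c + 1) / (n + k)

print(theta_ml(3, 4), theta_laplace(3, 4))   # 0.75 vs. about 0.667
```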

ML Parameter Learning
Suppose the new SCC management decides to give a hint of the candy flavor by (probabilistically) choosing wrapper colors.
Now we unwrap N candies, of which c are cherries, with rc in red wrappers and gc in green, and l are limes, with rl in red wrappers and gl in green.
[Network: Flavor -> Wrapper, with P(F=cherry) = θ, P(W=red | F=cherry) = θ1, and P(W=red | F=lime) = θ2.]
With complete data, ML learning decomposes into n learning problems, one for each parameter: θ = c/N, θ1 = rc/c, θ2 = rl/l.
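A sketch of this decomposition on toy data (Python; the data format and function name are assumptions for illustration):

```python
# With complete data the likelihood decomposes, so each parameter is just a
# ratio of counts: theta from the flavor counts, theta1 and theta2 from the
# wrapper counts within each flavor.

def ml_params(candies):
    """candies: list of (flavor, wrapper) pairs, e.g. ('cherry', 'red')."""
    n = len(candies)
    c  = sum(1 for f, w in candies if f == 'cherry')
    rc = sum(1 for f, w in candies if f == 'cherry' and w == 'red')
    rl = sum(1 for f, w in candies if f == 'lime'   and w == 'red')
    l  = n - c
    theta  = c / n          # P(F = cherry)
    theta1 = rc / c         # P(W = red | F = cherry)
    theta2 = rl / l         # P(W = red | F = lime)
    return theta, theta1, theta2

print(ml_params([('cherry', 'red'), ('cherry', 'green'),
                 ('lime', 'green'), ('lime', 'green')]))
```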

Use a BN to Learn Parameters
If we extend BNs to continuous variables (essentially, replace sums by integrals), then a BN showing the dependence of the observations on the parameters lets us compute (the distributions over) the parameters using just the normal rules of Bayesian inference.
This is efficient if all observations are known; we need sampling methods if not.
[Figure: the parameter nodes θ, θ1, θ2 (parameter independence) are parents of the Flavor and Wrapper nodes in each of Sample 1 ... Sample N, with P(F=cherry) = θ and P(W=red | F) equal to θ1 for cherry and θ2 for lime.]
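One concrete way to carry out this inference with complete data is to put independent conjugate Beta priors on θ, θ1, θ2; the Beta choice is an assumption here, not something stated on the slide, but it makes the posterior update a simple count-based formula:

```python
# Bayesian parameter learning with independent Beta priors (a common conjugate
# choice).  With complete data, each parameter's posterior is again a Beta,
# updated by the corresponding counts.

def beta_posterior(successes, failures, a=1.0, b=1.0):
    """Return the posterior Beta parameters (a', b') and the posterior mean."""
    a2, b2 = a + successes, b + failures
    return (a2, b2), a2 / (a2 + b2)

data = [('cherry', 'red'), ('cherry', 'green'), ('lime', 'green')]
c  = sum(1 for f, w in data if f == 'cherry')
rc = sum(1 for f, w in data if f == 'cherry' and w == 'red')
rl = sum(1 for f, w in data if f == 'lime'   and w == 'red')
l  = len(data) - c

print(beta_posterior(c, l))        # posterior over theta  = P(F = cherry)
print(beta_posterior(rc, c - rc))  # posterior over theta1 = P(W = red | cherry)
print(beta_posterior(rl, l - rl))  # posterior over theta2 = P(W = red | lime)
```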

Learning Structure
In general, we are trying to determine not only the parameters for a known structure but in fact which structure is best (or the probability of each structure, so that we can average over structures to make a prediction).

Structure Learning
Recall that a Bayes network is fully specified by
a DAG G that gives the (in)dependencies among variables, and
the collection of parameters θ that define the conditional probability tables, one for each of the variables given its parents.
We then define the Bayesian score as P(G | D) ∝ P(D | G) P(G), where the marginal likelihood is P(D | G) = ∫ P(D | G, θ) p(θ | G) dθ.
First term: the usual (marginal) likelihood calculation
Second term: the parameter priors p(θ | G)
Third term: the penalty for complexity of the graph, supplied by the prior P(G)
This defines a search problem over all possible graphs and parameters.
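The slide's exact score did not survive transcription, so as a stand-in here is a sketch of the widely used BIC approximation to the Bayesian score (log-likelihood at the ML parameters minus a complexity penalty) for discrete data; names and data format are assumptions:

```python
import math
from collections import Counter

def bic_score(data, parents, arities):
    """BIC approximation to the Bayesian score for a discrete Bayes net.
    data: list of dicts {var: value}; parents: {var: [parent vars]};
    arities: {var: number of values}.  Score = logL(ML params) - (d/2) log N."""
    n = len(data)
    score = 0.0
    for x, pa in parents.items():
        # counts of (parent configuration, value of x) and of parent configurations
        joint = Counter((tuple(row[p] for p in pa), row[x]) for row in data)
        marg  = Counter(tuple(row[p] for p in pa) for row in data)
        for (cfg, _), cnt in joint.items():
            score += cnt * math.log(cnt / marg[cfg])   # ML log-likelihood term
        q = 1
        for p in pa:
            q *= arities[p]                            # number of parent configurations
        score -= 0.5 * (arities[x] - 1) * q * math.log(n)   # complexity penalty
    return score

# Toy comparison: wrapper color depends on flavor in this data, so F -> W
# should score higher (roughly -20.7 vs. -21.9) than the independent model.
rows = [{'F': f, 'W': w} for f, w in
        [('cherry', 'red')] * 6 + [('cherry', 'green')] * 2 +
        [('lime', 'green')] * 5 + [('lime', 'red')] * 1]
arities = {'F': 2, 'W': 2}
print(bic_score(rows, {'F': [], 'W': ['F']}, arities))
print(bic_score(rows, {'F': [], 'W': []}, arities))
```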

Searching for Models
How many possible DAGs are there for n variables? There are 2^(n(n-1)) possible directed graphs on n vars (each ordered pair of variables either has an arc or not), but not all of them are DAGs.
To get a closer estimate, imagine that we order the variables so that the parents of each variable come before it in the ordering. Then there are n! possible orderings, and the j-th variable can have any subset of the previous j-1 variables as parents.
If we can choose a particular ordering, say based on prior knowledge of the models, then we need consider merely 2^(n(n-1)/2) models.
If we restrict Par(X) to no more than k parents, each variable has only about O(n^k) candidate parent sets to consider; this is actually practical.
Search actions: add, delete, or reverse an arc.
Hill-climb on P(D | G) or on P(G | D).
All the usual tricks in search: simulated annealing, random restart, ...
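A skeleton of this local search (Python; `score` is an abstract callable, for instance the BIC sketch above wrapped to accept an arc set, and the representation of a graph as a set of (parent, child) arcs is an assumption for illustration):

```python
def acyclic(graph, variables):
    """Kahn-style check: repeatedly remove nodes that have no parents."""
    remaining, arcs = set(variables), set(graph)
    while remaining:
        roots = {v for v in remaining if not any(c == v for _, c in arcs)}
        if not roots:
            return False
        remaining -= roots
        arcs = {(p, c) for p, c in arcs if p in remaining and c in remaining}
    return True

def neighbors(graph, variables):
    """Yield graphs one arc-change away: add, delete, or reverse an arc,
    keeping only acyclic candidates."""
    for a in variables:
        for b in variables:
            if a == b:
                continue
            if (a, b) in graph:
                yield graph - {(a, b)}                      # delete arc a -> b
                rev = (graph - {(a, b)}) | {(b, a)}         # reverse arc a -> b
                if acyclic(rev, variables):
                    yield rev
            elif (b, a) not in graph:
                added = graph | {(a, b)}                    # add arc a -> b
                if acyclic(added, variables):
                    yield added

def hill_climb(variables, score, start=frozenset()):
    """Greedy ascent on the structure score, starting (as in the ALARM example)
    from the fully independent graph; prone to local maxima, so random restarts
    or simulated annealing are typically layered on top."""
    current, best = start, score(start)
    improved = True
    while improved:
        improved = False
        for g in neighbors(current, variables):
            s = score(g)
            if s > best:
                current, best, improved = g, s, True
    return current, best
```

In practice one would plug a decomposable score such as the BIC sketch into `score` (via a small adapter from arc sets to parent dictionaries) and cache per-family scores, since each arc change touches only one or two conditional probability tables.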

Caution about Hidden Variables
Suppose you are given a dataset containing data on patients' smoking, diet, exercise, chest pain, fatigue, and shortness of breath.
You would probably learn a model like the one below left.
If you can hypothesize a hidden variable (not in the data set), e.g., heart disease, the learned network might be much simpler, such as the one below right.
But there are potentially infinitely many such variables.
[Figures: left, a network over Smoking (S), Diet (D), Exercise (E), Chest pain (C), Fatigue (F), and Shortness of breath (B) in which the symptoms depend directly on the lifestyle variables; right, the same variables with a hidden Heart disease (H) node between the two layers.]
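A rough parameter count showing why the hidden-variable network can be simpler, assuming all variables are binary and that without H each symptom depends directly on all three lifestyle variables (the slide does not give arities or the exact arc sets, so these numbers are only illustrative):

```python
# Rough CPT-size arithmetic under the stated assumptions.
# Without H: each of the three symptoms conditions directly on S, D, E.
no_hidden = 3 * 1 + 3 * 2**3        # 3 root priors + 3 CPTs of 2^3 rows = 27
# With H: H conditions on S, D, E; each symptom conditions only on H.
with_hidden = 3 * 1 + 2**3 + 3 * 2  # 3 priors + one 8-row CPT + 3 two-row CPTs = 17
print(no_hidden, with_hidden)       # 27 vs. 17 parameters
```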

Re-Learning the ALARM Network from 10,000 Samples
[Figure: (a) the original 37-node ALARM network; (b) the starting network, with complete independence among the variables; (c) the sampled data, 10,000 cases over x1 ... x37; (d) the learned network, with an arc marked as deleted relative to the original. Images by MIT OpenCourseWare.]

MIT OpenCourseWare http://ocw.mit.edu HST.950J / 6.872 Biomedical Computing Fall 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.