Statistical Learning. CS 486/686 Introduction to AI University of Waterloo


Motivation: Things you know
Agents model uncertainty in the world and the utility of different courses of action
- Bayes nets are models of probability distributions: a graph structure annotated with probabilities
- Bayes nets for realistic applications have hundreds of nodes
Where do these numbers come from?

Pathfinder (Heckerman, 1991)
Medical diagnosis for lymph node disease
Large net: 60 diseases, 100 symptoms and test results, 14,000 probabilities
Built by medical experts:
- 8 hours to determine the variables
- 35 hours for the network topology
- 40 hours for the probability table values

Knowledge acquisition bottleneck
In many applications, the Bayes net structure and parameters are set by experts in the field
- Experts are scarce and expensive, and can be inconsistent or non-existent
But data is cheap and plentiful (usually)
Goal of learning:
- Build models of the world directly from data
- We will focus on learning probabilistic models

Candy Example (from R&N)
Favourite candy sold in two flavours: lime and cherry
Same wrapper for both flavours
Sold in bags with different ratios:
- 100% cherry
- 75% cherry, 25% lime
- 50% cherry, 50% lime
- 25% cherry, 75% lime
- 100% lime

Candy Example
You bought a bag of candy but do not know its flavour ratio
After eating k candies:
- What is the flavour ratio of the bag?
- What will be the flavour of the next candy?

Statistical Learning
Hypothesis H: a probabilistic theory about the world
- h1: 100% cherry
- h2: 75% cherry, 25% lime
- h3: 50% cherry, 50% lime
- h4: 25% cherry, 75% lime
- h5: 100% lime
Data D: evidence about the world
- d1: 1st candy is cherry
- d2: 2nd candy is lime
- d3: 3rd candy is lime
- ...

Bayesian learning
Prior: P(H)
Likelihood: P(d | H)
Evidence: d = <d1, d2, ..., dn>
Bayesian learning:
- Compute the probability of each hypothesis given the data
- P(H | d) = α P(d | H) P(H)

Bayesian learning
Suppose we want to make a prediction about some unknown quantity x (e.g. the flavour of the next candy)
Predictions are weighted averages of the predictions of the individual hypotheses
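In symbols (the equation itself is not in the transcript, but this is the standard weighted-average form, written in the slides' notation):

P(x | d) = Σi P(x | hi) P(hi | d)

Each hypothesis's prediction P(x | hi) is weighted by its posterior probability P(hi | d).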

Candy Example
Assume prior P(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
Assume candies are i.i.d.: P(d | hi) = Πj P(dj | hi)
Suppose the first 10 candies are all lime:
- P(d | h1) = 0^10 = 0
- P(d | h2) = 0.25^10 ≈ 0.00000095
- P(d | h3) = 0.5^10 ≈ 0.00098
- P(d | h4) = 0.75^10 ≈ 0.056
- P(d | h5) = 1^10 = 1

Candy Example: Posterior
Posteriors given that the data is really generated from h5

Candy Example: Prediction
Probability that the next candy is lime, given that the data is really generated from h5
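A minimal Python sketch (not from the course materials) that reproduces the numbers behind these posterior and prediction curves, using the prior and per-hypothesis lime probabilities from the earlier slide:

```python
# Candy example: posterior over h1..h5 and predictive probability of lime
# after observing n lime candies in a row (i.i.d. assumption).

prior = [0.1, 0.2, 0.4, 0.2, 0.1]     # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | h_i)

def posterior(n):
    """P(h_i | d) after n lime observations, via Bayes rule."""
    unnorm = [p * (q ** n) for p, q in zip(prior, p_lime)]
    alpha = 1.0 / sum(unnorm)
    return [alpha * u for u in unnorm]

def predict_lime(n):
    """P(next is lime | d) = sum_i P(lime | h_i) P(h_i | d)."""
    return sum(q * p for q, p in zip(p_lime, posterior(n)))

for n in range(11):
    print(n, [round(p, 4) for p in posterior(n)], round(predict_lime(n), 4))
```

At n = 0 the prediction is 0.5; as lime observations accumulate, the posterior mass shifts onto h5 and the prediction approaches 1.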

Bayesian learning
Good news:
- Optimal: given the prior, no other prediction is correct more often than the Bayesian one
- No overfitting: the prior penalizes complex hypotheses (complex hypotheses are unlikely)
Bad news:
- Intractable if the hypothesis space is large
Solution:
- Approximations: Maximum a posteriori (MAP)

Maximum a posteriori (MAP)
Idea: make predictions using the most probable hypothesis hMAP
Compare to Bayesian learning, which makes predictions using all hypotheses weighted by their probabilities
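Written out (standard definitions; the slide's own equations are not in the transcript):

hMAP = argmax_h P(h | d) = argmax_h P(d | h) P(h)
P(x | d) ≈ P(x | hMAP)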

MAP Candy Example

MAP Properties
MAP prediction is less accurate than Bayesian prediction
- MAP relies on only one hypothesis
MAP and Bayesian predictions converge as the amount of data increases
No overfitting
- The prior penalizes complex hypotheses
Finding hMAP may be intractable:
- hMAP = argmax_h P(h | d)
- Optimization may be hard!

MAP computation
Optimization:
- The product of probabilities makes the objective highly nonlinear
- Take the log to turn the product into a sum
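Concretely, for i.i.d. data (a standard rewriting, not copied from the slide):

hMAP = argmax_h P(h) Πi P(di | h) = argmax_h [ log P(h) + Σi log P(di | h) ]

Maximizing the sum of logs is much easier than maximizing the product directly.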

Maximum Likelihood (ML)
Idea: simplify MAP by assuming a uniform prior (i.e. P(hi) = P(hj) for all i, j)
Make predictions using hML only:
- P(x | d) = P(x | hML)

ML Properties
ML prediction is less accurate than Bayesian and MAP predictions
ML, MAP and Bayesian predictions converge as the amount of data increases
Subject to overfitting
- Does not penalize complex hypotheses
Finding hML is often easier than finding hMAP:
- hML = argmax_hj Σi log P(di | hj)

Learning with complete data
Parameter learning with complete data:
- The parameter learning task is to find numerical parameters for a probability model whose structure is fixed
Example: learning the CPTs for a Bayes net with a given structure

Simple ML Example
Hypothesis hθ:
- P(cherry) = θ and P(lime) = 1 − θ
- θ is our parameter
Data d:
- N candies (c cherry and l = N − c lime)
What should θ be?

Simple ML example
Likelihood of this particular data set:
- P(d | hθ) = θ^c (1 − θ)^l
Log likelihood:
- L(d | hθ) = log P(d | hθ) = c log θ + l log(1 − θ)

Simple ML example
Find the θ that maximizes the log likelihood
The ML hypothesis asserts that the actual proportion of cherries in the bag is equal to the observed proportion
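The maximization step, spelled out (it follows directly from the log likelihood above; the slide's own algebra is not in the transcript):

dL/dθ = c/θ − l/(1 − θ) = 0  ⟹  θ = c / (c + l) = c / N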

More complex ML example
Hypothesis h_θ,θ1,θ2:
- θ = P(cherry), θ1 = P(red wrapper | cherry), θ2 = P(red wrapper | lime)
Data:
- c cherries: Gc with green wrappers, Rc with red wrappers
- l limes: Gl with green wrappers, Rl with red wrappers

More complex ML example
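The likelihood shown on this slide is not in the transcript; following the R&N wrapper example (and consistent with the solution on the next slide), it presumably factors as:

P(d | h_θ,θ1,θ2) = θ^c (1 − θ)^l · θ1^Rc (1 − θ1)^Gc · θ2^Rl (1 − θ2)^Gl

log P(d | h) = c log θ + l log(1 − θ) + Rc log θ1 + Gc log(1 − θ1) + Rl log θ2 + Gl log(1 − θ2)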

More Complex ML
Optimize by taking partial derivatives and setting them to zero:
- θ = c / (c + l)
- θ1 = Rc / (Rc + Gc)
- θ2 = Rl / (Rl + Gl)

ML Comments
This approach can be extended to any Bayes net
With complete data:
- The ML parameter learning problem decomposes into separate learning problems, one for each parameter!
- The parameter values for a variable, given its parents, are just the observed frequencies of the variable's values for each setting of the parent values!

A problem: Zero probabilities
What happens if we observed zero cherry candies?
- θ would be set to 0
- Is this a good prediction?
Instead of the raw ML estimate θ = c/N, use a smoothed estimate (next slide)

Laplace Smoothing
Given observations x from N trials, estimate the parameters θ
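The smoothing formula itself is not in the transcript; the standard Laplace (add-one) estimate for a parameter with K possible outcomes is:

θk = (xk + 1) / (N + K)

For the two-flavour candy example this gives θ = (c + 1) / (N + 2), so observing zero cherries yields a small nonzero probability instead of 0.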

Naïve Bayes model
Want to predict a class C based on attributes Ai
Parameters:
- θ = P(C = true)
- θj,1 = P(Aj = true | C = true)
- θj,2 = P(Aj = true | C = false)
Assumption: the Ai's are independent given C
(Network structure: C is the parent of A1, A2, ..., An)

Naïve Bayes Model
With observed attribute values x1, x2, ..., xn:
- P(C | x1, x2, ..., xn) = α P(C) Πi P(xi | C)
From ML we know what the parameters should be:
- Observed frequencies (with possible Laplace smoothing)
Just need to choose the most likely class C
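In practice the most likely class is computed in log space to avoid numerical underflow (this detail is not on the slide):

C* = argmax_C [ log P(C) + Σi log P(xi | C) ]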

Naïve Bayes comments
Naïve Bayes scales well
Naïve Bayes tends to perform well
- Even though the assumption that attributes are independent given the class often does not hold
Application:
- Text classification

Text classification
An important practical problem, occurring in many applications:
- Information retrieval, spam filtering, news filtering, building web directories
Simplified problem description:
- Given: a collection of documents, classified as interesting or not interesting by people
- Goal: learn a classifier that can look at the text of new documents and assign a label, without human intervention

Data representation
Consider all possible significant words that can occur in documents
- Do not include stopwords
- Stem words: map words to their root
For each root, introduce a binary feature
- Specifying whether the word is present or not in the document

Example
The document "Machine learning is fun":
- After removing the stopword "is" and stemming, the features for the roots of "machine", "learning" and "fun" are set to 1; all other features are 0

Use the Naïve Bayes Assumption
Words are independent of each other, given the class y of the document
How do we get the probabilities?

Use the Naïve Bayes Assumption
Use ML parameter estimation!
- Count word occurrences over the collection of documents
- Use Bayes rule to compute class probabilities for unseen documents
- Laplace smoothing is very useful here
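A minimal sketch of the pipeline described above (hypothetical code, not from the course): binary word-presence features, ML parameter estimates with Laplace smoothing, and log-space classification. Stopword removal and stemming are skipped for brevity, and the tiny document collection is made up.

```python
import math
from collections import defaultdict

def tokenize(doc):
    return set(doc.lower().split())          # binary "word present" features

def train(docs, labels):
    """ML estimates with Laplace smoothing: P(y) and P(word present | y)."""
    vocab = set(w for d in docs for w in tokenize(d))
    class_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for d, y in zip(docs, labels):
        class_counts[y] += 1
        for w in tokenize(d):
            word_counts[y][w] += 1
    priors = {y: n / len(docs) for y, n in class_counts.items()}
    # P(w present | y) = (count + 1) / (N_y + 2)   <- Laplace smoothing
    cond = {y: {w: (word_counts[y][w] + 1) / (class_counts[y] + 2)
                for w in vocab}
            for y in class_counts}
    return vocab, priors, cond

def classify(doc, vocab, priors, cond):
    """Pick argmax_y log P(y) + sum over features of log P(feature | y)."""
    words = tokenize(doc)
    scores = {}
    for y in priors:
        s = math.log(priors[y])
        for w in vocab:
            p = cond[y][w]
            s += math.log(p) if w in words else math.log(1 - p)
        scores[y] = s
    return max(scores, key=scores.get)

docs = ["machine learning is fun", "deep learning tutorial",
        "buy cheap candy now", "cheap candy sale today"]
labels = ["interesting", "interesting", "not", "not"]
model = train(docs, labels)
print(classify("fun learning project", *model))   # expected: "interesting"
```

Because the features are binary, absent words also contribute log(1 − P(w | y)) terms, matching the θj,1 / θj,2 parameterisation of the Naïve Bayes model above.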

Observations
We may not always be able to find θ analytically
- Use gradient search to find a good value of θ
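As a toy illustration (hypothetical code, not from the course), gradient ascent on the candy log likelihood L(θ) = c log θ + l log(1 − θ); here the closed form θ = c/N exists, so the numerical answer just serves as a check:

```python
c, l = 7, 3                      # observed cherry and lime counts

def grad(theta):
    # d/dtheta [c*log(theta) + l*log(1 - theta)]
    return c / theta - l / (1.0 - theta)

theta = 0.5                      # initial guess
lr = 0.001                       # step size
for _ in range(5000):
    theta += lr * grad(theta)
    theta = min(max(theta, 1e-6), 1 - 1e-6)   # keep theta in (0, 1)

print(theta, c / (c + l))        # both should be close to 0.7
```

The same idea applies when the log likelihood has no closed-form maximizer: compute its gradient and take small steps uphill.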

Conclusions
What you should know:
- Bayesian learning, MAP, ML
- How to learn parameters in Bayes nets
- The Naïve Bayes assumption
- Laplace smoothing