CS 6347 Lecture 11: Basics of Machine Learning


CS 6347 Lecture 11 Basics of Machine Learning

The Course So Far

What we've seen:
- How to compactly model/represent joint distributions using graphical models
- How to solve basic inference problems
  - Exactly: variable elimination & belief propagation
  - Approximately: LP relaxations, duality, loopy belief propagation, mean field, sampling

Next Goal

Where we are going: given independent samples from a joint distribution, we want to estimate the graphical model that produced them.
- In practice, we typically have no idea what joint distribution describes the data
- There might be lots of hidden variables (i.e., data that we can't or didn't observe)
- We want the best model, for some notion of "best"

Machine Learning

We need a principled approach to solving these types of problems:
- How do we determine which model is better than another?
- How do we measure the performance of our model on the tasks that we care about?

Many approaches to machine learning rephrase a learning problem as that of optimizing some objective that captures the quantities of interest.

Spam Filtering

Given a collection of emails E_1, ..., E_n and labels L_1, ..., L_n ∈ {spam, not spam}, we want to learn a model that detects whether or not an email is spam.

How might we evaluate the model that we learn?

This is an example of what is called a supervised learning problem: we are presented with labeled data, and our goal is to correctly predict the labels of unseen data.

Performance Measures

Classification: given a set of unseen emails, correctly label them as spam/not spam.
- Classification error is defined to be the number of misclassified emails (under the model)
- Two types of error: training and test
  - Training error: the number of misclassified emails in the labeled training set
  - Test error: the number of misclassified emails in the unseen set
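The training/test distinction above can be sketched in a few lines of Python. The label lists here are hypothetical stand-ins (1 = spam, 0 = not spam); a real model would produce the predictions.

```python
# Minimal sketch of training vs. test classification error.
# All labels and predictions below are hypothetical placeholders.
def classification_error(predicted, actual):
    """Number of misclassified examples under the model."""
    return sum(p != a for p, a in zip(predicted, actual))

train_labels = [1, 0, 1, 1, 0]
train_preds  = [1, 0, 1, 0, 0]   # model's predictions on the training set
test_labels  = [0, 1, 1, 0]
test_preds   = [1, 1, 0, 0]      # model's predictions on unseen emails

train_error = classification_error(train_preds, train_labels)  # 1 mistake
test_error  = classification_error(test_preds, test_labels)    # 2 mistakes
```

A model that overfits typically shows a low training error but a much higher test error.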

Performance Measures

Other prediction/inference tasks: choose a loss function that reflects the task you want to solve.
- Density estimation: estimate the full joint distribution
  - Error could be defined using the KL divergence between the learned model and the true model
- Structure estimation: estimate the structure of the joint distribution (i.e., what independence properties it asserts)
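For the density-estimation loss, the KL divergence between a true distribution p and a learned model q over a discrete space is D(p || q) = Σ_x p(x) log(p(x)/q(x)). A small sketch, with both distributions hypothetical three-outcome examples:

```python
import math

# KL divergence D(p || q) for discrete distributions given as lists of
# probabilities; terms with p(x) = 0 contribute nothing.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

true_dist    = [0.5, 0.3, 0.2]   # hypothetical true distribution p
learned_dist = [0.4, 0.4, 0.2]   # hypothetical learned model q

loss = kl_divergence(true_dist, learned_dist)
```

The loss is zero exactly when the learned model matches the true distribution, and positive otherwise, which makes it a natural objective for density estimation.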

Machine Learning Terminology

- Overfitting: the learned model caters too much to the data on which it was trained. In the worst case, the learned model corresponds exactly to the training set and assigns probability zero to all unobserved samples.
- Generalization: the model should apply beyond the training set to unseen samples (independent samples from the true distribution).
- Cross-validation: a method of holding out some of the training data in order to limit overfitting and improve generalization.
- Regularization: encode a soft constraint that prefers simpler models.
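The hold-out idea behind cross-validation can be sketched as a single split of the labeled data (the `emails` list below is a hypothetical stand-in for labeled training examples):

```python
import random

# Minimal hold-out sketch: reserve a fraction of the labeled training
# data as a validation set for estimating generalization.
def holdout_split(data, held_out_fraction=0.2, seed=0):
    """Shuffle a copy of data and split it into (train, validation)."""
    rng = random.Random(seed)
    shuffled = data[:]           # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]

emails = list(range(100))        # hypothetical labeled examples
train, validation = holdout_split(emails)
# 80 training examples, 20 held out, with no overlap between the two
```

Full k-fold cross-validation repeats this with k different held-out folds and averages the validation errors.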

Bias-Variance Tradeoff

The true model may not be a member of the family of models that we learn:
- Even with unlimited data, we will not recover the true solution
- This limitation is known as bias
- We can always choose more complicated models, at the expense of computation time

With only a few samples, many models might be a good fit:
- Small changes in the samples may result in significantly different models
- This type of limitation is referred to as variance

The Learning Problem

Given i.i.d. samples x^(1), ..., x^(M) from some probability distribution, find the graphical model that best represents the samples from some family of graphical models.

This could entail:
- Structure learning: if the graph structure is unknown, we would need to learn it
- Parameter learning: learn the parameters of the model (the parameters usually control the allowable potential functions)

Maximum Likelihood Estimation

Fix a family of parameterized distributions; each choice of the parameters produces a different distribution.
- Example: for the coloring problem on a graph G, we could treat the weights as parameters

Given samples x^(1), ..., x^(M) from some unknown distribution and parameters θ, the likelihood of the data is defined to be

    l(θ) = ∏_m p(x^(m) | θ)

Goal: find the θ that maximizes the log-likelihood.
- Example: given samples of colorings of a graph G, find the weights that maximize the likelihood of observing these colorings

Simple MLE

A biased coin is described by a single parameter b, which corresponds to the probability of seeing heads.

Given the set of samples H, H, H, H, T, use MLE to estimate b.

(worked out on the board)
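The board derivation comes out to the empirical frequency of heads: maximizing l(b) = b^h (1-b)^t by setting the derivative of the log-likelihood to zero gives b = h/(h+t). A one-liner check:

```python
# MLE for the biased-coin example with samples H, H, H, H, T.
samples = ["H", "H", "H", "H", "T"]
heads = samples.count("H")

# The log-likelihood h*log(b) + t*log(1-b) is maximized at the
# empirical frequency of heads.
b_mle = heads / len(samples)
print(b_mle)  # 0.8
```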

Bayesian Inference

MLE assumes that there exists some joint distribution p(x, θ) over possible observations and choices of the parameters, but it only works with the conditional distribution p(x | θ).
- In practice, this is much easier than dealing with the whole joint distribution
- In the coin-flipping example:
  - If we are told the bias, we can compute the probability that a coin comes up heads
  - To compute the joint probability, p(x | θ) p(θ), we would need to choose a probability distribution over the biases

Bayesian Inference

We could also consider the posterior probability distribution of the parameters given the evidence:

    p(θ | x) = p(x | θ) p(θ) / p(x)

Here p(x | θ) is the likelihood, p(θ) is the prior, and p(x) is the evidence. The prior captures our previous knowledge about the parameters.

Bayesian inference computes the posterior probability distribution over θ given the observed samples.

MAP inference maximizes the posterior probability over θ.
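For a one-dimensional parameter like the coin bias, the posterior can be approximated numerically on a grid. The sketch below assumes the coin example (4 heads, 1 tail) with a Beta(2, 2) prior; the normalizing evidence p(x) is handled by normalizing over the grid.

```python
import numpy as np

# Grid-based posterior for the coin bias b, assuming samples H,H,H,H,T
# and a Beta(2, 2) prior (both choices illustrative).
b = np.linspace(0.001, 0.999, 999)        # grid over possible biases
prior = b**(2 - 1) * (1 - b)**(2 - 1)     # Beta(2, 2), unnormalized
likelihood = b**4 * (1 - b)**1            # p(data | b) for 4 heads, 1 tail

posterior = prior * likelihood
posterior /= posterior.sum()              # normalize: divides out p(x)

map_b  = b[np.argmax(posterior)]          # MAP estimate (mode)
mean_b = (b * posterior).sum()            # posterior mean
```

Analytically the posterior is Beta(6, 3), so the mode is 5/7 ≈ 0.714 and the mean is 2/3; the grid estimates should match to within the grid spacing.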

Simple MAP Inference

A biased coin is described by a single parameter b, which corresponds to the probability of seeing heads.

Given the set of samples H, H, H, H, T, use MAP inference to estimate b.

What prior distribution should we pick for p(b)?
- Uniform on [0, 1]
- Beta distribution: p(b) ∝ b^(α−1) (1−b)^(β−1)

(worked out on the board)

Beta Distribution

[Figure: Beta distribution density curves for various α, β; source: Wikipedia]

Simple MAP Inference

MAP inference with a uniform prior is equivalent to maximum likelihood estimation.

The prior can be viewed as a certain kind of regularization: it prefers parameters that occur with high probability under the prior.
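The equivalence can be checked with the closed-form MAP estimate for the coin example. With a Beta(α, β) prior and h heads out of h + t flips, the posterior is Beta(α + h, β + t), whose mode gives b_MAP = (h + α - 1)/(h + t + α + β - 2). The sketch below assumes the 4-heads, 1-tail data from the earlier slides:

```python
# Closed-form MAP estimate for a coin bias under a Beta(alpha, beta)
# prior, using the posterior mode of Beta(alpha + heads, beta + tails).
def map_estimate(heads, tails, alpha, beta):
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

b_uniform = map_estimate(4, 1, alpha=1, beta=1)  # uniform prior = Beta(1, 1)
b_beta    = map_estimate(4, 1, alpha=2, beta=2)  # Beta(2, 2) pulls toward 1/2

# b_uniform == 0.8, matching the MLE; b_beta == 5/7, shrunk toward 1/2
```

The Beta(2, 2) estimate sits between the MLE (0.8) and the prior mode (0.5), which is the regularization effect described above.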