Lecture 1. Introduction. Probability Theory

Lecture 1. Introduction. Probability Theory
COMP90051 Machine Learning, Semester 2, 2017
Lecturer: Trevor Cohn
Adapted from slides provided by Ben Rubinstein

Why Learn Learning?

Motivation
"We are drowning in information, but we are starved for knowledge" - John Naisbitt, Megatrends.
Data = raw information. Knowledge = patterns or models behind the data.

Solution: Machine Learning
Hypothesis: pre-existing data repositories contain a lot of potentially valuable knowledge. Mission of learning: find it.
Definition of learning: (semi-)automatic extraction of valid, novel, useful and comprehensible knowledge (in the form of rules, regularities, patterns, constraints or models) from arbitrary sets of data.

Applications of ML are Deep and Prevalent
* Online ad selection and placement
* Risk management in finance, insurance, security
* High-frequency trading
* Medical diagnosis
* Mining and natural resources
* Malware analysis
* Drug discovery
* Search engines

Draws on Many Disciplines
* Artificial Intelligence
* Statistics
* Continuous optimisation
* Databases
* Information Retrieval
* Communications/information theory
* Signal Processing
* Computer Science Theory
* Philosophy
* Psychology and neurobiology

Job$
Many companies across all industries hire ML experts:
* Data Scientist
* Analytics Expert
* Business Analyst
* Statistician
* Software Engineer
* Researcher

About this Subject
(Refer to the subject outline on GitHub, linked from the LMS, for more information.)

Vital Statistics
Lecturers:
* Weeks 1, 9-12: Trevor Cohn (DMD8., tcohn@unimelb.edu.au), A/Prof & Future Fellow, Computing & Information Systems; Statistical Machine Learning, Natural Language Processing
* Weeks 2-8: Andrey Kan (andrey.kan@unimelb.edu.au), Research Fellow, Walter and Eliza Hall Institute; ML, Computational immunology, Medical image analysis
Tutors: Yasmeen George (ygeorge@student.unimelb.edu.au), Nitika Mathur (nmathur@student.unimelb.edu.au), Yuan Li (yuanl4@student.unimelb.edu.au)
Weekly you should attend 2x Lectures, 1x Workshop.
Contact / Office Hours: Thursdays 1-2pm, 7.03 DMD Building
Website: https://trevorcohn.github.io/comp90051-2017/

About Me (Trevor)
PhD 2007, UMelbourne; 10 years abroad in the UK:
* Edinburgh University, in Language group
* Sheffield University, in Language & Machine Learning groups
Expertise: basic research in machine learning; Bayesian inference; graphical models; deep learning; applications to structured problems in text (translation, sequence tagging, structured parsing, modelling time series)

Subject Content
The subject will cover topics from: foundations of statistical learning, linear models, non-linear bases, kernel approaches, neural networks, Bayesian learning, probabilistic graphical models (Bayes Nets, Markov Random Fields), cluster analysis, dimensionality reduction, regularisation and model selection.
We will gain hands-on experience with all of this via a range of toolkits, workshop pracs, and projects.

Subject Objectives
* Develop an appreciation for the role of statistical machine learning, both in terms of foundations and applications
* Gain an understanding of a representative selection of ML techniques
* Be able to design, implement and evaluate ML systems
* Become a discerning ML consumer

Textbooks
Primary reference:
* Bishop (2007) Pattern Recognition and Machine Learning
Other good general references:
* Murphy (2012) Machine Learning: A Probabilistic Perspective [read free ebook using ebrary at http://bit.ly/29shaqs]
* Hastie, Tibshirani, Friedman (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction [free at http://www-stat.stanford.edu/~tibs/elemstatlearn]

Textbooks
Reference for the PGM component:
* Koller, Friedman (2009) Probabilistic Graphical Models: Principles and Techniques

Assumed Knowledge (Week 2 Workshop revises COMP90049)
Programming
* Required: proficiency at programming, ideally in Python
* Ideal: exposure to the scientific libraries numpy, scipy, matplotlib etc. (similar in functionality to MATLAB & aspects of R)
Maths
* Familiarity with formal notation, e.g. Pr(x) = Σ_y Pr(x, y)
* Familiarity with probability (Bayes rule, marginalisation)
* Exposure to optimisation (gradient descent)
ML: decision trees, naïve Bayes, kNN, k-means

Assessment
Assessment components:
* Two projects: one released early (w3-4), one late (w7-8); you will have ~3 weeks to complete each. The first project is fairly structured (20%); the second project includes a competition component (30%)
* Final exam
Breakdown:
* 50% exam
* 50% project work
A 50% hurdle applies to both the exam and the ongoing assessment.

Machine Learning Basics

Terminology
Input to a machine learning system can consist of:
* Instance: measurements about individual entities/objects, e.g. a loan application
* Attribute (aka feature, explanatory variable): a component of the instance, e.g. the applicant's salary, number of dependents, etc.
* Label (aka response, dependent variable): an outcome that is categorical, numeric, etc., e.g. forfeit vs. paid off
* Example: an instance coupled with a label, e.g. <(100k, 3), "forfeit">
* Model: a discovered relationship between attributes and/or the label
(A minimal representation of this terminology in Python is sketched below.)
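
As a sketch only, one way to hold the loan-application example in code is a list of attribute tuples paired with labels; the variable names here are hypothetical and not part of the subject materials.

# A minimal sketch of instances, attributes, labels and examples,
# using the loan-application illustration from the slide.
# Each instance: (salary, number_of_dependents)
instances = [(100_000, 3), (55_000, 0), (72_000, 2)]

# Each label: the observed outcome for the corresponding instance
labels = ["forfeit", "paid off", "paid off"]

# An "example" is an instance coupled with its label
examples = list(zip(instances, labels))
print(examples[0])   # ((100000, 3), 'forfeit')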

Supervised vs Unsupervised Learning
* Supervised learning: labelled data; the model is used to predict labels on new instances
* Unsupervised learning: unlabelled data; the model is used to cluster related instances, project to fewer dimensions, or understand attribute relationships

Architecture of a Supervised Learner
[Diagram: train data (examples) feeds a Learner, which produces a Model; test data (instances) is fed to the Model, whose predicted labels are compared against the true test labels in Evaluation.]

Evaluation (Supervised Learners)
How you measure quality depends on your problem! Typical process:
* Pick an evaluation metric comparing label vs prediction
* Procure an independent, labelled test set
* Average the evaluation metric over the test set
Example evaluation metrics: accuracy, contingency table, precision-recall, ROC curves.
When data is scarce, cross-validate. (A minimal accuracy computation is sketched below.)
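
The sketch below shows the "average the metric over the test set" step with accuracy as the metric; the labels are made up for illustration, and in a real system y_pred would come from a trained model applied to held-out test instances.

import numpy as np

# Hypothetical true labels and model predictions on a small test set
y_true = np.array(["forfeit", "paid off", "paid off", "forfeit"])
y_pred = np.array(["forfeit", "paid off", "forfeit", "forfeit"])

# Accuracy = fraction of test instances where label and prediction agree
accuracy = np.mean(y_true == y_pred)
print(f"test accuracy = {accuracy:.2f}")   # 0.75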

Data is noisy (almost always)
Example: given a student's mark for Knowledge Technologies (KT), predict their mark for Machine Learning (ML).
[Scatter plot of training data: KT mark on the x-axis, ML mark on the y-axis; synthetic data :)]
(A least-squares fit to data of this kind is sketched below.)
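
As a sketch under assumed synthetic data (the slide's actual data is not reproduced here), fitting a line to noisy KT marks to predict ML marks could look like this:

import numpy as np

# Generate synthetic, noisy (KT mark, ML mark) pairs
rng = np.random.default_rng(0)
kt = rng.uniform(50, 100, size=30)               # KT marks
ml = 0.8 * kt + 15 + rng.normal(0, 5, size=30)   # noisy ML marks

# Fit ml ~ w * kt + b by least squares (degree-1 polynomial fit)
w, b = np.polyfit(kt, ml, deg=1)
print(f"predicted ML mark for KT=95: {w * 95 + b:.1f}")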

Types of models
* ŷ = f(x): e.g. KT mark was 95, ML mark is predicted to be 95
* P(y|x): e.g. KT mark was 95, ML mark is likely to be in (92, 97)
* P(x, y): e.g. probability of having (KT = x, ML = y)

Probability Theory
A brief refresher

Basics of Probability Theory
A probability space:
* Set Ω of possible outcomes
* Set F of events (subsets of outcomes)
* Probability measure P: F → R
Example: a die roll
* Ω = {1, 2, 3, 4, 5, 6}
* F = { ∅, {1}, ..., {6}, {1,2}, ..., {5,6}, ..., {1,2,3,4,5,6} }
* P(∅) = 0, P({1}) = 1/6, P({1,2}) = 1/3, ...
(A small numeric check of these values is sketched below.)
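
A minimal sketch, assuming a fair die, that checks the event probabilities on the slide by treating an event as a subset of Ω and using the uniform measure P(A) = |A| / |Ω|:

from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event), len(omega))

print(P(set()))    # 0    (the empty event, ∅)
print(P({1}))      # 1/6
print(P({1, 2}))   # 1/3
print(P(omega))    # 1    (matches axiom 3: P(Ω) = 1)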

Axioms of Probability
1. P(f) ≥ 0 for every event f in F
2. P(∪_f f) = Σ_f P(f) for all collections* of pairwise disjoint events
3. P(Ω) = 1
* We won't delve further into advanced probability theory, which starts with measure theory. But to be precise, additivity is over collections of countably-many events.

Random Variables (r.v.'s)
A random variable X is a numeric function of the outcome, X(ω) ∈ R.
P(X ∈ A) denotes the probability of the outcome being such that X falls in the range A.
Example: X = winnings on a $5 bet on an even die roll
* X maps 1, 3, 5 to -5; X maps 2, 4, 6 to 5
* P(X=5) = P(X=-5) = 1/2
(A small numeric check is sketched below.)
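
The sketch below encodes the $5-bet random variable directly as a function of the outcome and checks the two probabilities on the slide (a fair die is assumed):

from fractions import Fraction

omega = [1, 2, 3, 4, 5, 6]
# X maps odd outcomes to -5 (lose the bet) and even outcomes to +5 (win)
X = {w: (5 if w % 2 == 0 else -5) for w in omega}

def P(pred):
    """P(X in A) under a fair die, with A given as a predicate on X's value."""
    favourable = [w for w in omega if pred(X[w])]
    return Fraction(len(favourable), len(omega))

print(P(lambda x: x == 5))    # 1/2
print(P(lambda x: x == -5))   # 1/2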

Discrete vs. Continuous Distributions
Discrete distributions
* Govern r.v.'s taking discrete values
* Described by the probability mass function p(x), which is P(X=x)
* P(X ≤ x) = Σ_{a ≤ x} p(a)
* Examples: Bernoulli, Binomial, Multinomial, Poisson
Continuous distributions
* Govern real-valued r.v.'s
* Cannot talk about a PMF, but rather a probability density function p(x)
* P(X ≤ x) = ∫_{-∞}^{x} p(a) da
* Examples: Uniform, Normal, Laplace, Gamma, Beta, Dirichlet
(A minimal scipy.stats illustration follows.)
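
As a short illustration (the particular distributions and parameters are chosen here for convenience, not taken from the slides), scipy.stats makes the PMF/PDF distinction concrete:

from scipy import stats

# Discrete: Binomial(n=10, p=0.5)
binom = stats.binom(n=10, p=0.5)
print(binom.pmf(3))    # P(X = 3): a genuine probability
print(binom.cdf(3))    # P(X <= 3) = sum of pmf(0..3)

# Continuous: standard Normal
norm = stats.norm(loc=0, scale=1)
print(norm.pdf(0.0))   # density at 0 (~0.3989), NOT a probability
print(norm.cdf(0.0))   # P(X <= 0) = 0.5, the integral of the density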

Expectation
Expectation E[X] is the r.v. X's average value
* Discrete: E[X] = Σ_x x P(X = x)
* Continuous: E[X] = ∫ x p(x) dx
Properties
* Linear: E[aX + b] = aE[X] + b; E[X + Y] = E[X] + E[Y]
* Monotone: X ≥ Y implies E[X] ≥ E[Y]
Variance: Var(X) = E[(X - E[X])^2]
[Figure: plot of a density p(x) over x from -4 to 4]
(A small numeric illustration follows.)
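
A minimal sketch, reusing the $5-bet random variable as an assumed example, that computes the discrete expectation and variance and checks linearity by simulation:

import numpy as np

values = np.array([-5, 5])
probs = np.array([0.5, 0.5])

E_X = np.sum(values * probs)                  # E[X] = Σ_x x P(X = x) = 0
Var_X = np.sum((values - E_X) ** 2 * probs)   # E[(X - E[X])^2] = 25
print(E_X, Var_X)

# Linearity: E[aX + b] = a E[X] + b, checked by Monte Carlo
rng = np.random.default_rng(1)
samples = rng.choice(values, size=100_000, p=probs)
print(np.mean(3 * samples + 2))               # close to 3 * 0 + 2 = 2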

Independence and Conditioning
X, Y are independent if
* P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
* Similarly for densities: p_{X,Y}(x, y) = p_X(x) p_Y(y)
* Intuitively: knowing the value of Y reveals nothing about X
* Algebraically: the joint on X, Y factorises!
Conditional probability
* P(A|B) = P(A ∩ B) / P(B)
* Similarly for densities: p(y|x) = p(x, y) / p(x)
* Intuitively: the probability that event A will occur given we know event B has occurred
* X, Y independent is equivalent to P(Y = y | X = x) = P(Y = y)
(A small numeric illustration follows.)
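
The sketch below uses a made-up joint PMF over two binary random variables to show both ideas at once: the joint factorising into its marginals (independence) and a conditional computed by dividing by a marginal.

import numpy as np

# Joint PMF P(X=x, Y=y): rows index X, columns index Y
joint = np.array([[0.12, 0.28],
                  [0.18, 0.42]])

p_x = joint.sum(axis=1)   # marginal P(X=x)
p_y = joint.sum(axis=0)   # marginal P(Y=y)

# Independence check: does the joint equal the outer product of marginals?
print(np.allclose(joint, np.outer(p_x, p_y)))   # True for this joint

# Conditional P(Y=y | X=0) = P(X=0, Y=y) / P(X=0)
print(joint[0] / p_x[0])   # equals p_y, as independence implies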

Inverting Conditioning: Bayes' Theorem
In terms of events A, B:
* P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A)
* P(A|B) = P(B|A) P(A) / P(B)
A simple rule that lets us swap the conditioning order; Bayesian statistical inference makes heavy use of it.
* Marginals: probabilities of individual variables
* Marginalisation: summing away all but the r.v.'s of interest
(A small worked example follows.)
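
A worked example under assumed numbers (a hypothetical diagnostic test, not from the slides), applying both marginalisation and Bayes' theorem:

# Hypothetical diagnostic-test numbers, purely for illustration
p_disease = 0.01                # P(A): prior probability of disease
p_pos_given_disease = 0.95      # P(B|A): test sensitivity
p_pos_given_healthy = 0.05      # P(B|not A): false positive rate

# Marginal P(B) via marginalisation (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")   # ~0.161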

Summary
* Why study machine learning?
* Machine learning basics
* Review of probability theory