CSC412/2506 Probabilistic Learning and Reasoning. Introduction


Today
- Course information
- Overview of ML with examples
- Ungraded, anonymous background quiz
Thursday: basics of ML vocabulary (cross-validation, objective functions, overfitting, regularization) and basics of probability manipulation

Course Website
www.cs.toronto.edu/~duvenaud/courses/csc412
Contains all course information, slides, etc.

Evaluation
- Assignment 1: due Feb 10th, worth 15%
- Assignment 2: due March 3rd, worth 15%
- Assignment 3: due March 24th, worth 20%
- 1-hour midterm: Feb 23rd, worth 20%
- Project: due April 10th, worth 30%
Late penalty: 15% per day of lateness, up to 4 days.

Related Courses
- CSC411: list of methods (K-NN, decision trees), more focus on computation
- STA302: linear regression and classical stats
- ECE521: similar material, more focus on computation
- STA414: mostly the same material, slightly more introductory, more emphasis on theory than coding, exam instead of project
- CSC321: neural networks; about 30% overlap

Textbooks + Resources
No required textbook. Useful references:
- Christopher M. Bishop (2006), Pattern Recognition and Machine Learning
- Kevin Murphy (2012), Machine Learning: A Probabilistic Perspective
- Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009), The Elements of Statistical Learning
- David MacKay (2003), Information Theory, Inference, and Learning Algorithms
- Ian Goodfellow, Yoshua Bengio, Aaron Courville (2016), Deep Learning

Stats vs Machine Learning
Statistician: look at the data, consider the problem, and design a model we can understand. Analyze methods to give guarantees. Want to make few assumptions.
ML: we only care about making good predictions! Let's make a general procedure that works for lots of datasets. There is no way around making assumptions, so let's just make the model large enough to hopefully include something close to the truth. We can't use bounds in practice, so we evaluate empirically to choose model details. Sometimes we end up with interpretable models anyway.

Types of Learning
- Supervised learning: given input-output pairs (x, y), the goal is to predict the correct output for a new input.
- Unsupervised learning: given unlabeled data instances x1, x2, x3, ..., build a statistical model of x, which can be used for making predictions and decisions.
- Semi-supervised learning: we are given only a limited number of (x, y) pairs, but lots of unlabeled x's.
All of these are just special cases of estimating distributions from data: p(y|x), p(x), p(x, y).
- Active learning and RL: we also get to choose actions that influence future information and reward. Can just use basic decision theory.

Finding Structure in Data
Observed data: a vector of word counts on each webpage. Latent variables: hidden topics. Example corpus: 804,414 newswire stories.

Matrix Factorization
Collaborative filtering / matrix factorization as a hierarchical Bayesian model. The rating value of user i for item j is modeled via a latent user feature (preference) vector and a latent item feature vector; these latent variables are inferred from the observed ratings. Prediction: predict a rating r*_ij for user i and query movie j from the posterior over latent variables, using Bayesian inference (MCMC or SVI).
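To make the factorization concrete, here is a minimal numpy sketch. It is plain maximum-likelihood matrix factorization fit by stochastic gradient descent, not the Bayesian version described above (no posterior over the latent vectors); the toy ratings, latent dimension k, and learning rate are illustrative assumptions.

    import numpy as np

    # Toy data: (user i, item j, rating r_ij) triples.
    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (2, 1, 1.0)]
    num_users, num_items, k = 3, 2, 4   # k = latent feature dimension

    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((num_users, k))  # latent user preference vectors
    V = 0.1 * rng.standard_normal((num_items, k))  # latent item feature vectors

    lr = 0.05
    for epoch in range(200):
        for i, j, r in ratings:
            err = U[i] @ V[j] - r  # prediction error on this rating
            U[i], V[j] = U[i] - lr * err * V[j], V[j] - lr * err * U[i]

    r_star = U[0] @ V[1]  # predicted rating r*_ij for user 0, item 1

The Bayesian treatment replaces the point estimates U and V with distributions inferred by MCMC or SVI, which is what yields predictive uncertainty.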

Finding Structure in Data
Collaborative filtering / matrix factorization / product recommendation. Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings. Examples of learned ``genres'':
- Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
- Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
- Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
Matrix factorization was part of the winning solution in the Netflix contest (1 million dollar prize).

Impact of Deep Learning
- Speech recognition
- Computer vision
- Recommender systems
- Language understanding
- Drug discovery and medical image analysis

Multimodal Data
Examples of predicted image tags:
- mosque, tower, building, cathedral, dome, castle
- ski, skiing, skiers, skiiers, snowmobile
- kitchen, stove, oven, refrigerator, microwave
- bowl, cup, soup, cups, coffee
- beach
- snow

Caption Generation

Density estimation using Real NVP. Dinh et al., 2016

Nguyen A, Dosovitskiy A, Yosinski J, Brox T, Clune J (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in Neural Information Processing Systems 29

Pixel Recurrent Neural Networks Aaron van den Oord, Nal Kalchbrenner, Koray Kavukcuoglu

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks Alec Radford, Luke Metz, Soumith Chintala

Course Themes
- Start with a simple model and add to it. Linear regression or PCA is a special case of almost everything.
- A few Lego bricks are enough to build most models: Gaussians, categorical variables, linear transforms, neural networks.
- The exact form of each distribution/function shouldn't matter much.
- Your model should have a million parameters in it somewhere (the real world is messy!).
- Model checking is hard and important. Learning algorithms are especially hard to debug.

Computation
Later assignments will involve a bit of programming. You can use whatever language you want, but Python + Numpy is recommended. For fitting and inference in high-dimensional models, gradient-based methods are basically the only game in town. Lots of methods conflate the model and the fitting algorithm; we will try to separate these.
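As a minimal sketch of that separation (illustrative, not course code): the fitting algorithm below is generic gradient descent that knows nothing about the model except its gradient, while the model is just a loss function, here linear regression with a hand-derived gradient on synthetic data.

    import numpy as np

    def gradient_descent(grad_fn, params, lr=0.1, steps=500):
        # Generic fitting algorithm: only needs the model's gradient.
        for _ in range(steps):
            params = params - lr * grad_fn(params)
        return params

    # Model: linear regression with squared-error loss on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

    def grad_loss(w):
        # Hand-derived gradient of the mean squared error.
        return 2.0 * X.T @ (X @ w - y) / len(y)

    w_hat = gradient_descent(grad_loss, np.zeros(3))

Swapping in a different model means swapping grad_loss; the optimizer is untouched.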

ML as a bag of tricks
Fast special cases: K-means, kernel density estimation, SVMs, boosting, random forests, K-nearest neighbors.
Extensible family: mixture of Gaussians, latent variable models, Gaussian processes, deep neural nets, Bayesian neural nets??

Regularization as a bag of tricks
Fast special cases: early stopping, ensembling, L2 regularization, gradient noise, dropout, expectation-maximization.
Extensible family: stochastic variational inference.
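For instance, L2 regularization is nothing more than one extra term in the objective. A small sketch (the penalty weight lam is an illustrative hyperparameter):

    import numpy as np

    def regularized_loss(w, X, y, lam=0.1):
        # Squared-error data fit plus an L2 penalty on the weights.
        residuals = X @ w - y
        return np.mean(residuals ** 2) + lam * np.sum(w ** 2)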

A language of models
Hidden Markov models, mixtures of Gaussians, logistic regression: these are simply sentences, examples from a language of models. We will try to show the larger family and point out common special cases.

Examples from the language of models:
- Gaussian mixture model [1]
- Linear dynamical system [2]
- Hidden Markov model [3]
- Switching LDS [4]
- Mixture of Experts [5]
- Driven LDS [2]
- IO-HMM [6]
- Factorial HMM [7]
- Canonical correlation analysis [8,9]
- Admixture / LDA / NMF [10]

[1] Palmer, Wipf, Kreutz-Delgado, and Rao. Variational EM algorithms for non-Gaussian latent variable models. NIPS 2005.
[2] Ghahramani and Beal. Propagation algorithms for variational Bayesian learning. NIPS 2001.
[3] Beal. Variational algorithms for approximate Bayesian inference, Ch. 3. U of London Ph.D. Thesis 2003.
[4] Ghahramani and Hinton. Variational learning for switching state-space models. Neural Computation 2000.
[5] Jordan and Jacobs. Hierarchical Mixtures of Experts and the EM algorithm. Neural Computation 1994.
[6] Bengio and Frasconi. An Input Output HMM Architecture. NIPS 1995.
[7] Ghahramani and Jordan. Factorial Hidden Markov Models. Machine Learning 1997.
[8] Bach and Jordan. A probabilistic interpretation of Canonical Correlation Analysis. Tech. Report 2005.
[9] Archambeau and Bach. Sparse probabilistic projections. NIPS 2008.
[10] Hoffman, Bach, Blei. Online learning for Latent Dirichlet Allocation. NIPS 2010.

Courtesy of Matthew Johnson
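As one such "sentence", the density of a Gaussian mixture model takes only a few lines to evaluate. A sketch assuming spherical components with a shared scale, for brevity:

    import numpy as np
    from scipy.special import logsumexp

    def gmm_log_density(x, weights, means, sigma):
        # log p(x) = logsumexp_k [ log pi_k + log N(x | mu_k, sigma^2 I) ]
        d = x.shape[-1]
        log_components = (np.log(weights)
                          - 0.5 * np.sum((x - means) ** 2, axis=-1) / sigma ** 2
                          - 0.5 * d * np.log(2 * np.pi * sigma ** 2))
        return logsumexp(log_components)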

AI as a bag of tricks
Russell and Norvig's parts of AI: machine learning, natural language processing, knowledge representation, automated reasoning, computer vision, robotics.
Extensible family: deep probabilistic latent-variable models + decision theory.

Advantages of probabilistic latent-variable models
- Data-efficient learning: automatic regularization, can take advantage of more information.
- Composable models: e.g. incorporate a data-corruption model. This is different from composing feedforward computations.
- Handle missing and corrupted data (without the standard hack of just guessing the missing values using averages).
- Predictive uncertainty: necessary for decision-making.
- Conditional predictions (e.g. if Brexit happens, the value of the pound will fall).
- Active learning: which data would be expected to increase our confidence about a prediction?
Cons: an intractable integral over the latent variables.
Examples: medical diagnosis, image modeling.

Probabilistic graphical models:
+ structured representations
+ priors and uncertainty
+ data and computational efficiency
- rigid assumptions may not fit
- feature engineering
- top-down inference
Deep learning:
+ flexible
+ feature learning
+ recognition networks
- neural net goo
- difficult parameterization
- can require lots of data

The unreasonable easiness of deep learning
Recipe:
- Define an objective function (i.e. the probability of the data given the params).
- Optimize the params to maximize the objective.
Gradients are computed automatically; you just define the model by some computation.
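The whole recipe fits in a few lines once an automatic differentiation library supplies the gradients. A sketch using the autograd package (one option among several; the data and step size are illustrative), fitting the mean and scale of a Gaussian by maximum likelihood:

    import autograd.numpy as np
    import autograd.numpy.random as npr
    from autograd import grad

    x = npr.randn(200)  # observed data

    def neg_log_likelihood(params):
        mu, log_sigma = params[0], params[1]
        # Gaussian negative log-density, up to an additive constant.
        return np.mean(0.5 * ((x - mu) / np.exp(log_sigma)) ** 2 + log_sigma)

    nll_grad = grad(neg_log_likelihood)  # gradients computed automatically

    params = np.array([0.0, 0.0])
    for _ in range(500):
        params = params - 0.1 * nll_grad(params)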

Differentiable models
- Model distributions implicitly by a variable pushed through a deep net: y = f_θ(x)
- Approximate an intractable distribution by a tractable distribution parameterized by a deep net: p(y|x) = N(y | μ = f_θ(x), Σ = g_θ(x))
- Optimize all parameters using stochastic gradient descent.
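A sketch of that second construction, again with autograd (the network sizes, data, and learning rate are illustrative assumptions): one small net outputs the mean of p(y|x) and another its log standard deviation, and both train by gradient descent on the log-likelihood.

    import autograd.numpy as np
    import autograd.numpy.random as npr
    from autograd import grad

    def init_net(hidden=10):
        # Tiny one-hidden-layer net: scalar input, scalar output.
        return [0.1 * npr.randn(1, hidden), np.zeros(hidden),
                0.1 * npr.randn(hidden, 1), np.zeros(1)]

    def net(p, x):
        W1, b1, W2, b2 = p
        return np.dot(np.tanh(np.dot(x, W1) + b1), W2) + b2

    def neg_log_lik(params, x, y):
        f_p, g_p = params  # one net for the mean, one for the scale
        mu, log_sigma = net(f_p, x), net(g_p, x)
        return np.mean(0.5 * ((y - mu) / np.exp(log_sigma)) ** 2 + log_sigma)

    x = npr.randn(100, 1)
    y = np.sin(3 * x) + 0.1 * npr.randn(100, 1)

    params = (init_net(), init_net())
    nll_grad = grad(neg_log_lik)
    for _ in range(1000):
        g = nll_grad(params, x, y)
        params = tuple([w - 0.01 * dw for w, dw in zip(p, gp)]
                       for p, gp in zip(params, g))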

Modeling idea: graphical models on latent variables, neural network models for observations Composing graphical models with neural networks for structured representations and fast inference. Johnson, Duvenaud, Wiltschko, Datta, Adams, NIPS 2016

[Figure: mapping between data space and latent space, contrasting unsupervised and supervised learning. Courtesy of Matthew Johnson]

Learning outcomes
- Know standard algorithms (the bag of tricks), when to use them, and their limitations. For basic applications and baselines.
- Know the main elements of the language of deep probabilistic models (distributions, expectations, latent variables, neural networks) and how to combine them. For custom applications and research.
- Know standard computational tools (Monte Carlo, stochastic optimization, regularization, automatic differentiation). For fitting models.

Tentative list of topics
- Linear methods for regression and classification, Bayesian linear regression
- Probabilistic generative and discriminative models, regularization methods
- Stochastic optimization (practically important)
- Neural networks
- Model comparison and marginal likelihood (conceptually important)
- Stochastic variational inference
- Time series and recurrent models
- Mixture models, graphical models and Bayesian networks
- Kernel methods, Gaussian processes, support vector machines

Quiz

Machine-learning-centric History of Probabilistic Models
- 1940s-1960s: Motivating probability and Bayesian inference
- 1980s-2000s: Bayesian machine learning with MCMC
- 1990s-2000s: Graphical models with exact inference
- 1990s-present: Bayesian nonparametrics with MCMC (Indian buffet process, Chinese restaurant process)
- 1990s-2000s: Bayesian ML with mean-field variational inference
- 2000s-present: Probabilistic programming
- 2000s-2013: Deep undirected graphical models (RBMs, pretraining)
- 2010s-present: Stan - Bayesian data analysis with HMC
- 2000s-2013: Autoencoders, denoising autoencoders
- 2000s-present: Invertible density estimation
- 2013-present: Stochastic variational inference, variational autoencoders
- 2014-present: Generative adversarial nets, Real NVP, Pixelnet
- 2016-present: Lego-style deep generative models (attend, infer, repeat)