Lecture 1: Introduction


CSC2515 Spring 2014: Introduction to Machine Learning
Lecture 1: Introduction
All lecture slides will be available as .pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/csc2515_winter15.html
Many of the figures are provided by Chris Bishop from his textbook, Pattern Recognition and Machine Learning.

Admin Details
Permanent tutorial time/place: Thursdays 2-3, Haultain 401
Do I have the appropriate background?
Linear algebra: vector/matrix manipulations, properties
Calculus: partial derivatives
Probability: common distributions; Bayes' Rule
Statistics: mean/median/mode; maximum likelihood
Sheldon Ross: A First Course in Probability
Related Courses

Textbooks
Christopher Bishop: Pattern Recognition and Machine Learning, 2006.
Other recommended texts:
Kevin Murphy: Machine Learning: A Probabilistic Perspective
David MacKay: Information Theory, Inference, and Learning Algorithms

Do the readings!
Requirements
Assignments: two assignments, worth 10% each. Programming: take Matlab/Python code and extend it. Derivations: pen(cil)-and-paper.
Test: two-hour exam on the last day of class, checking that you understand the main concepts in the course; worth 35% of the course mark.
Project: proposal due Jan 26; presentations the week of March 23 (date might change); write-up due April 3rd (date might change); worth 45% of the course mark.

What is Machine Learning?
Learning systems are not directly programmed to solve a problem; instead they develop their own program based on:
examples of how they should behave, and
trial-and-error experience trying to solve the problem.
This differs from standard CS: we want to implement an unknown function, but only have access to sample input-output pairs (training examples).
Learning simply means incorporating information from the training examples into the system.

Why Study Learning?
Develop enhanced computer systems: automatically adapt to the user, customize; it is often difficult to acquire the necessary knowledge directly.
Improve understanding of human and biological learning: computational analysis provides concrete theory and predictions; there is an explosion of methods to analyze brain activity during learning.
The timing is good: ever-growing amounts of data available, cheap and powerful computers, and a suite of algorithms and theory already developed.

A classic example of a task that requires machine learning: What makes a 2?

Why use learning?
It is very hard to write programs that solve problems like recognizing a handwritten digit. What distinguishes a 2 from a 7? How does our brain do it?
Instead of writing a program by hand, we collect examples that specify the correct output for a given input.
A machine learning algorithm then takes these examples and produces a program that does the job.
The program produced by the learning algorithm may look very different from a typical hand-written program. It may contain millions of numbers.
If we do it right, the program works for new cases as well as the ones we trained it on.

Two classic examples of tasks that are best solved by using a learning algorithm

Learning algorithms are useful in other tasks:
Recognizing patterns: facial identities, expressions; handwritten or spoken words
Digital images and videos: locating, tracking, and identifying objects
Driving a car
Recognizing anomalies: unusual sequences of credit card transactions
Spam filtering, fraud detection: the enemy adapts, so we must adapt too
Recommendation systems: noisy data, commercial pay-off (Amazon, Netflix)
Information retrieval: find documents or images with similar content

Data Explosion: Text
Large text datasets: 1,000,000 words in 1967; 1,000,000,000,000 words in 2006
Successful applications: speech recognition, machine translation
Lots of labeled data; memorization is useful

Really Big Data

Human learning Josh Tenenbaum

Types of learning task
Supervised: the correct output is known for each training example; learn to predict the output when given an input vector.
Classification: 1-of-N output (speech recognition, object recognition, medical diagnosis)
Regression: real-valued output (predicting market prices, customer rating)
Unsupervised learning: create an internal representation of the input, capturing regularities/structure in the data. Examples: form clusters; extract features. How do we know if a representation is good?
Reinforcement learning: learn actions to maximize payoff. There is not much information in a payoff signal, and the payoff is often delayed. An important area not covered here, with many applications: games, SmartHouse.

Supervised Learning
Classification: outputs are categorical (1-of-N); inputs can be anything. Goal: select the correct class for new inputs. Examples: speech, object recognition, medical diagnosis.
Regression: outputs are continuous; inputs can be anything (typically continuous). Goal: predict outputs accurately for new inputs. Examples: predicting market prices, a customer's rating of a movie.
Temporal prediction: given input values and corresponding class labels/outputs at previous time points, perform classification/regression on new input values at future time points.
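
To make the classification/regression distinction concrete, here is a minimal numpy sketch on invented toy data: a 1-nearest-neighbour rule for 1-of-N classification and a least-squares fit for regression. The data and function names are illustrative assumptions, not part of the course material.

```python
# Toy illustration of the two supervised settings above (hypothetical data).
# Classification: predict a category; regression: predict a real value.
import numpy as np

# --- Classification with a 1-nearest-neighbour rule ---
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array([0, 0, 1, 1])               # 1-of-N labels (here N = 2)

def predict_class(x):
    # assign the label of the closest training example
    dists = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(dists)]

print(predict_class(np.array([0.95, 1.05])))   # expected: 1

# --- Regression by least squares ---
x = np.linspace(0, 1, 20)
y = 3.0 * x + 0.5 + 0.1 * np.random.randn(20)  # noisy linear relationship
A = np.column_stack([x, np.ones_like(x)])      # design matrix [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]    # slope and intercept
print(w, b)                                    # roughly 3.0 and 0.5
```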

Unsupervised Learning
Clustering: inputs are vectors or categorical. Goal: group data cases into a finite number of clusters so that within each cluster all cases have very similar inputs.
Compression: inputs are typically vectors. Goal: deliver an encoder and decoder such that the size of the encoder's output is much smaller than the original input, but the composition of the encoder followed by the decoder is very similar to the original input.
Outlier detection: inputs can be anything. Goal: select highly unusual cases from new and given data.
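
As one concrete instance of the clustering goal above, here is a minimal k-means sketch in numpy on synthetic data. The two-blob data, the choice k = 2, and the fixed number of iterations are illustrative assumptions; k-means is just one clustering method among many.

```python
# A minimal k-means clustering sketch (pure numpy, synthetic data).
import numpy as np

rng = np.random.default_rng(0)
# two blobs of points in 2-D
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(3.0, 0.3, size=(50, 2))])

k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centres
for _ in range(10):
    # assign each case to its nearest centre
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move each centre to the mean of its assigned cases
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centers)   # approximately [0, 0] and [3, 3]
```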

Machine Learning & Data Mining
Data mining: typically using very simple machine learning techniques on very large databases, because computers are too slow to do anything more interesting with ten billion examples.
Previously used in a negative sense: a misguided statistical procedure of looking for all kinds of relationships in the data until one is finally found.
Now the lines are blurred: many ML problems involve tons of data, but problems with an AI flavor (e.g., recognition, robot navigation) are still the domain of ML.

Machine Learning & Statistics
ML uses statistical theory to build models; the core task is inference from a sample.
A lot of ML is rediscovery of things statisticians already knew, often disguised by differences in terminology. But the emphasis is very different:
A good piece of statistics: a clever proof that a relatively simple estimation procedure is asymptotically unbiased.
A good piece of ML: a demo that a complicated algorithm produces impressive results on a specific task.
We can view ML as applying computational techniques to statistical problems, but it goes beyond typical statistics problems, with different aims (speed vs. accuracy).

Cultural gap (Tibshirani)
Machine Learning                              Statistics
network, graphs                               model
weights                                       parameters
learning                                      fitting
generalization                                test set performance
supervised learning                           regression/classification
unsupervised learning                         density estimation, clustering
large grant: $1,000,000                       large grant: $50,000
conference location: Snowbird, French Alps    conference location: Las Vegas in August

Representing the structure of a set of documents using Latent Semantic Analysis (a form of PCA)
Each document is converted to a vector of word counts. This vector is then mapped to two coordinates and displayed as a colored dot. The colors represent the hand-labeled classes. When the documents are laid out in 2-D, the classes are not used, so we can judge how good the algorithm is by seeing whether the classes are separated.
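
A minimal sketch of that pipeline, assuming a tiny invented corpus: count words per document, then use a truncated SVD (a form of PCA) to map each count vector to two coordinates for plotting. The corpus, vocabulary handling, and centring step are illustrative assumptions.

```python
# Sketch of the LSA/PCA projection described above: each document becomes a
# word-count vector, and an SVD maps it to 2 coordinates for plotting.
import numpy as np

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "stocks fell as markets slid",
        "markets rallied as stocks rose"]

vocab = sorted({w for d in docs for w in d.split()})
counts = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# centre the counts and keep the top-2 singular directions (a form of PCA)
centred = counts - counts.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)
coords_2d = centred @ Vt[:2].T        # one (x, y) point per document

print(coords_2d)   # documents about similar topics end up near each other
```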

Representing the structure of a set of documents using a neural network

Using Variables to Represent the World
We use mathematical variables to encode everything we know about the task: inputs, outputs, and internal states. Variables may be discrete/categorical or continuous/vector-valued.
Discrete quantities take on one of a fixed set of values, e.g., {0,1}, {email, spam}, {sunny, overcast, raining}.
Continuous quantities take on real values, e.g., 1.6632, [3.3, -1.8, 120.4].
We generally have repeated measurements of the same quantities.
Conventions:
i, j index components/variables/dimensions
n, m index cases/records
x_i^(n): value of the i-th input variable on the n-th case
y_j^(m): value of the j-th output variable on the m-th case
x^(n): vector of inputs for the n-th case
X = {x^(1), x^(2), ..., x^(N)} is all the inputs
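
In code, these conventions usually map onto a single array whose rows are cases and whose columns are input variables; the short sketch below (with invented shapes) shows how x_i^(n) and x^(n) would be indexed.

```python
# How the notation above maps onto arrays in practice (illustrative data):
# rows index cases n, columns index input variables i, so x_i^(n) is X[n, i].
import numpy as np

N, d = 4, 3                                       # N cases, d input variables
X = np.arange(N * d, dtype=float).reshape(N, d)   # all inputs, one row per case

n, i = 2, 1
print(X[n])        # x^(n): the input vector for the n-th case
print(X[n, i])     # x_i^(n): the i-th input variable on the n-th case
```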

Initial Case Study
What grade will I get in this course?
Data: entry survey and marks from previous years
Process the data
Split into training set and test set
Determine the representation of the input features and the output
Choose the form of the model: linear regression
Decide how to evaluate the system's performance: objective function
Set the model parameters to optimize performance
Evaluate on the test set: generalization
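
Here is a minimal end-to-end sketch of that case study, with synthetic stand-ins for the survey features and grades: split the data, fit linear regression by least squares, and measure generalization as test-set error. All numbers and feature meanings are invented for illustration.

```python
# Synthetic version of the case study: features -> grade via linear regression.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 2))                     # e.g. two standardised survey features
true_w = np.array([5.0, 8.0])
y = 70 + X @ true_w + rng.normal(0, 3, size=n)  # final grade with noise

# split into training and test sets
idx = rng.permutation(n)
train, test = idx[:80], idx[80:]

# fit linear regression by least squares (objective: squared error)
A = np.column_stack([X[train], np.ones(len(train))])
w = np.linalg.lstsq(A, y[train], rcond=None)[0]

# evaluate generalization on the held-out test set
A_test = np.column_stack([X[test], np.ones(len(test))])
pred = A_test @ w
print("test MSE:", np.mean((pred - y[test]) ** 2))
```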

Hypothesis Space
We now have a representation for inputs and outputs. How do we represent a supervised learning machine?
One way to think about a supervised learning machine is as a device that explores a hypothesis space. Each setting of the parameters in the machine is a different hypothesis about the function that maps input vectors to output vectors.
If the data is noise-free, each training example rules out a region of hypothesis space. If the data is noisy, each training example scales the posterior probability of each point in the hypothesis space in proportion to how likely the training example is given that hypothesis.
The art of supervised machine learning is in deciding how to represent the inputs and outputs, and selecting a hypothesis space that is powerful enough to represent the relationship between inputs and outputs but simple enough to be searched.

Searching a hypothesis space
The obvious method is to first formulate a loss function and then adjust the parameters to minimize the loss function. This allows the optimization to be separated from the objective function that is being optimized.
Bayesians do not search for a single set of parameter values that do well on the loss function. They start with a prior distribution over parameter values and use the training data to compute a posterior distribution over the whole hypothesis space.
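
A minimal sketch of the first (non-Bayesian) approach: batch gradient descent that adjusts the parameters of a linear model to minimize a squared-error loss. The toy data, learning rate, and iteration count are illustrative assumptions.

```python
# "Adjust the parameters to minimise the loss": gradient descent on a
# squared-error loss for a linear model (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)

w = np.zeros(3)            # one point in the hypothesis space
lr = 0.1                   # step size (an assumption, not tuned)
for _ in range(200):
    residual = X @ w - y
    grad = 2 * X.T @ residual / len(y)   # gradient of mean squared error
    w -= lr * grad

print(w)   # approaches [1.0, -2.0, 0.5]
```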

Some Loss Functions
Squared difference between actual and target real-valued outputs.
Number of classification errors: problematic for optimization because the derivative is not smooth.
Negative log probability assigned to the correct answer: this is usually the right function to use. In some cases it is the same as squared error (regression with Gaussian output noise); in other cases it is very different (classification with discrete classes needs cross-entropy error).
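
The three losses written out for a single case, as a small numpy sketch; the function names and example values are illustrative.

```python
# The three losses mentioned above, for a single case (illustrative inputs).
import numpy as np

def squared_error(y_pred, y_true):
    # for real-valued outputs
    return np.sum((y_pred - y_true) ** 2)

def classification_error(y_pred_label, y_true_label):
    # 0/1 loss: not smooth, so hard to optimise directly
    return float(y_pred_label != y_true_label)

def cross_entropy(p_pred, true_class):
    # negative log probability assigned to the correct class
    return -np.log(p_pred[true_class])

print(squared_error(np.array([1.2]), np.array([1.0])))   # 0.04
print(classification_error(2, 1))                        # 1.0
print(cross_entropy(np.array([0.1, 0.7, 0.2]), 1))       # ~0.357
```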