Machine Learning ICS 273A. Instructor: Max Welling

Machine Learning ICS 273A Instructor: Max Welling

Class: What is Expected? Homework: required (answers will be provided). A project: see the webpage. Quizzes: a quiz every Friday; bring a scantron form (buy one in the UCI bookstore). A final. Programming in MATLAB or R.

Syllabus. Introduction: overview, examples, goals. Classification 1: decision trees, random forests, boosting, k-nearest neighbors, Naïve Bayes, over-fitting, bias-variance trade-off, cross-validation. Classification 2: neural networks: perceptron, logistic regression, multi-layer networks, back-propagation. Classification 3: kernel methods & support vector machines. Clustering & dimensionality reduction: (kernel) k-means, (kernel) PCA, kernel design, nonlinear dimension reduction, (kernel) Fisher linear discriminant analysis, (kernel) canonical correlation analysis. Algorithm evaluation, hypothesis testing. Weeks 9/10: project presentations.

Machine Learning, according to various definitions: The ability of a machine to improve its performance based on previous results. The process by which computer systems can be directed to improve their performance over time. A subspecialty of artificial intelligence concerned with developing methods for software to learn from experience or extract knowledge from examples in a database. The ability of a program to learn from experience, that is, to modify its execution on the basis of newly acquired information. Machine learning is an area of artificial intelligence concerned with the development of techniques which allow computers to "learn". More specifically, machine learning is a method for creating computer programs by the analysis of data sets. Machine learning overlaps heavily with statistics, since both fields study the analysis of data, but unlike statistics, machine learning is also concerned with the algorithmic complexity of computational implementations...

Some Examples: ZIP code recognition, loan application classification, signature recognition, voice recognition over the phone, credit card fraud detection, spam filters, collaborative filtering (suggesting other products at Amazon.com), marketing, stock market prediction, expert-level chess and checkers systems, biometric identification (fingerprints, DNA, iris scan, face), machine translation, web search, document & information retrieval, camera surveillance, robosoccer, and so on and so on...

Why is this cool/important? Modern technologies generate data at an unprecedented scale. The amount of data doubles every year. "One petabyte is equivalent to the text in one billion books, yet many scientific instruments, including the Large Synoptic Survey Telescope, will soon be generating several petabytes annually." (2020 Computing: Science in an exponential world, Nature, published online 22 March 2006.) Computers dominate our daily lives: science, industry, the military, our social interactions, etc. We can no longer eyeball the images captured by some satellite for interesting events, or check every webpage for some topic. We need to trust computers to do the work for us.

Types of Learning (we will be concerned with these topics in this class). Supervised learning: labels are provided, there is a strong learning signal; e.g. classification, regression. Semi-supervised learning: only part of the data have labels; e.g. a child growing up. Reinforcement learning: the learning signal is a (scalar) reward and may come with a delay; e.g. trying to learn to play chess, a mouse in a maze. Unsupervised learning: there is no direct learning signal; we are simply trying to find structure in data; e.g. clustering, dimensionality reduction.

Ingredients. Data: what kind of data do we have? Prior assumptions: what do we know a priori about the problem? Representation: how do we represent the data? Model / hypothesis space: what hypotheses are we willing to entertain to explain the data? Feedback / learning signal: what kind of learning signal do we have (delayed, labels)? Learning algorithm: how do we update the model (or set of hypotheses) from feedback? Evaluation: how well did we do; should we change the model?

Histograms and Scatter Plots Visualize your data before you start modeling it!
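
A minimal sketch of this advice in Python (NumPy and matplotlib assumed; the features here are synthetic, purely for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    x1 = rng.normal(0.0, 1.0, size=300)              # a made-up feature
    x2 = 0.5 * x1 + rng.normal(0.0, 0.5, size=300)   # a second, correlated feature

    fig, (ax_hist, ax_scat) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(x1, bins=30)                        # histogram: distribution of one feature
    ax_hist.set_xlabel("x1")
    ax_scat.scatter(x1, x2, s=10)                    # scatter plot: relation between two features
    ax_scat.set_xlabel("x1")
    ax_scat.set_ylabel("x2")
    plt.show()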

Supervised Learning I. Example: imagine you want to classify monkeys versus humans. Data: 100 monkey images and 200 human images, labeled with which is which, where x represents the greyscale values of the image pixels and y=0 means monkey while y=1 means human. Task: here is a new image: monkey or human?

1-nearest neighbor (your first ML algorithm!) Idea: 1. Find the picture in the database which is closest to your query image. 2. Check its label. 3. Declare the class of your query image to be the same as that of the closest picture. (Figure: a query image shown next to its closest database image.)
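
A minimal sketch of this idea in Python (NumPy assumed; the function and variable names are illustrative, not from the course):

    import numpy as np

    def one_nearest_neighbor(query, train_images, train_labels):
        # train_images: (N, D) array of greyscale pixel vectors x_n
        # train_labels: (N,) array of labels y_n (0 = monkey, 1 = human)
        # 1. Find the database picture closest to the query (Euclidean distance here).
        distances = np.linalg.norm(train_images - query, axis=1)
        nearest = np.argmin(distances)
        # 2. + 3. Check its label and declare it to be the class of the query.
        return train_labels[nearest]

Calling one_nearest_neighbor(x_new, X_train, y_train) then returns 0 or 1 for the new image.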

kNN Decision Surface. (Figure: the decision curve separating the two classes.)

Distance Metric How do we measure what it means to be close? Depending on the problem we should choose an appropriate distance metric.
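
For illustration, the 1-NN sketch above used the Euclidean distance; other metrics (the examples below are assumptions, not prescribed by the slides) can be swapped in:

    import numpy as np

    def euclidean(a, b):
        # straight-line distance between two feature vectors
        return np.sqrt(np.sum((a - b) ** 2))

    def manhattan(a, b):
        # sum of absolute coordinate differences
        return np.sum(np.abs(a - b))

For images, one might also normalize pixel intensities first so that overall brightness does not dominate what "close" means.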

Remarks on NN methods. We only need to construct a classifier that works locally for each query. Hence: we don't need to construct a classifier everywhere in space. Classifying is done at query time. This can be computationally taxing at a time when you might want to be fast. Memory inefficient (you have to keep all data around). Curse of dimensionality: if many features are irrelevant or noisy, distances are always large. Very flexible, not many prior assumptions. k-NN variants are robust against bad examples.

Non-parametric Methods. Non-parametric methods keep all the data cases/examples in memory. A better name is: instance-based learning. As the data-set grows, the complexity of the decision surface grows. Sometimes, non-parametric methods have some parameters to tune... Very few assumptions (we let the data speak).

Logistic Regression / Perceptron (your second ML algorithm!) Fits a soft decision boundary between the classes. (Figures: the soft boundary in 1 dimension and in 2 dimensions.)

The logit / sigmoid: h(x) = 1 / (1 + exp(−(W·x + b))). The bias b determines the offset; the weight W determines the angle and the steepness.
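
As a one-dimensional sketch (assuming the standard logistic form; the parameter names follow the slides' W and b):

    import numpy as np

    def sigmoid_1d(x, w, b):
        # w sets the angle/steepness of the transition, b shifts its offset.
        return 1.0 / (1.0 + np.exp(-(w * x + b)))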

Objective. We interpret h(x) as the probability of classifying a data case as positive. We want to maximize the total (log-)probability of the data vectors: O = Σ_{y_n=1} log h(x_n) + Σ_{y_n=0} log (1 − h(x_n)).
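
A sketch of this objective as a log-likelihood in Python (the log form is an assumption consistent with the gradients on the next slide):

    import numpy as np

    def log_likelihood(W, b, X, y):
        # Total log-probability of the labels y (0/1) under h(x) = sigmoid(W.x + b).
        h = 1.0 / (1.0 + np.exp(-(X @ W + b)))
        return np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))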

Algorithm in detail. Repeat until convergence (gradient ascent on O): ∂O/∂W = Σ_{y_n=1} (1 − h(x_n)) x_n − Σ_{y_n=0} h(x_n) x_n and ∂O/∂b = Σ_{y_n=1} (1 − h(x_n)) − Σ_{y_n=0} h(x_n), where the first sum runs over the positive examples (y_n=1) and the second over the negative examples (y_n=0).
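
A minimal sketch of these updates in Python (NumPy assumed; the step size eta and iteration count are illustrative choices, not from the slides):

    import numpy as np

    def train_logistic(X, y, eta=0.1, n_iters=1000):
        # X: (N, D) data matrix, y: (N,) labels in {0, 1}
        N, D = X.shape
        W, b = np.zeros(D), 0.0
        for _ in range(n_iters):
            h = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # h(x_n) for every example
            grad_W = X.T @ (y - h)   # = sum_{y_n=1} (1 - h(x_n)) x_n - sum_{y_n=0} h(x_n) x_n
            grad_b = np.sum(y - h)   # = sum_{y_n=1} (1 - h(x_n))     - sum_{y_n=0} h(x_n)
            W = W + (eta / N) * grad_W   # step uphill on the objective O
            b = b + (eta / N) * grad_b
        return W, b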

A Note on Stochastic GD For very large problems it is more efficient to compute the gradient using a small (random) subset of the data. For every new update you pick a new random subset. Towards convergence, you decrease the stepsize. Why is this more efficient? The gradient is an average over many data-points. If your parameters are very bad, every data-point will tell you to move in the same direction, so you need only a few data-points to find that direction. Towards convergence you need all the data-points. A small step-size effectively averages over many data-points.
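
A sketch of the same update with small random subsets (the batch size and the decreasing step-size schedule are illustrative assumptions):

    import numpy as np

    def train_logistic_sgd(X, y, eta0=0.5, batch_size=32, n_epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        N, D = X.shape
        W, b, step = np.zeros(D), 0.0, 0
        for _ in range(n_epochs):
            # a fresh random subset for every update
            for idx in np.array_split(rng.permutation(N), max(1, N // batch_size)):
                h = 1.0 / (1.0 + np.exp(-(X[idx] @ W + b)))
                eta = eta0 / (1.0 + 0.01 * step)   # decrease the step size towards convergence
                W = W + (eta / len(idx)) * (X[idx].T @ (y[idx] - h))
                b = b + (eta / len(idx)) * np.sum(y[idx] - h)
                step += 1
        return W, b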

Parametric Methods. Parametric methods fit a finite set of parameters to the data. Unlike non-parametric methods, this implies a maximum complexity for the model. Assumption-heavy: by choosing the parameterized model you impose your prior assumptions (this can be an advantage when you have sound assumptions!). The classifier is built off-line. Classification is fast at query time. Easy on memory: samples are summarized through model parameters.

Hypothesis Space. A hypothesis h: X → [0,1] for a binary classifier is a function that maps all possible input values to either class 0 or class 1. E.g. for 1-NN, the hypothesis h(x) is given by the label of the training example closest to x. The hypothesis space H is the space of all hypotheses that you are willing to consider/search over. For instance, for logistic regression, H is given by all classifiers of the form h(x) = 1 / (1 + exp(−(W·x + b))), parameterized by W and b.

Inductive Bias. The assumptions one makes to generalize beyond the training data. Examples: 1-NN: the label is the same as that of the closest training example. Logistic regression: the classification function is a smooth function of the form h(x) = 1 / (1 + exp(−(W·x + b))). Without inductive bias (i.e. without assumptions) there is no generalization possible! (You have not expressed any preference for unseen data.) Learning is hence converting your prior assumptions + the data into a classifier for new data.

Generalization. Consider the following regression problem: predict the real value on the y-axis from the real value on the x-axis. You are given 6 examples: {X_i, Y_i}. What is the y-value for a new query point X*?

Generalization. (Figures: several different curves fit through the same six training points.)

Generalization. Which curve is best?

Generalization. Ockham's razor: prefer the simplest hypothesis consistent with the data.

Generalization. Learning is concerned with accurate prediction of future data, not accurate prediction of the training data. (The single most important sentence you will see in this course.)

Cross-validation. How do we ensure good generalization, i.e. avoid over-fitting on our particular data sample? You are ultimately interested in good performance on new (unseen) test data. To estimate that, split off a (smallish) subset of the training data (called the validation set). Train without the validation data and test on the validation data. Repeat this over multiple splits of the data and average the results. Reasonable split: 90% train, 10% validation; average over the 10 splits.
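
A minimal sketch of this 90%/10% procedure in Python (train_fn and error_fn are placeholders for whatever classifier is being validated; the names are illustrative):

    import numpy as np

    def cross_validate(X, y, train_fn, error_fn, n_splits=10, seed=0):
        # Each of the n_splits folds serves once as the validation set.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))
        errors = []
        for val_idx in np.array_split(order, n_splits):
            train_idx = np.setdiff1d(order, val_idx)
            model = train_fn(X[train_idx], y[train_idx])              # train without validation data
            errors.append(error_fn(model, X[val_idx], y[val_idx]))    # test on validation data
        return float(np.mean(errors))                                 # average over the splits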