Transcription:

Administration
Registration
Hw1 is due tomorrow night. Hw2 will be out tomorrow night. Please start working on it as soon as possible. Come to sections with questions.
No lectures next week! Please watch the corresponding videos: check the schedule page across from the corresponding dates.
I will not have office hours this week. Questions: please go to the TAs' office hours and the discussion session.
Extensions: you don't need to email me about extensions to the Hw. You have 96 hours of extension time.

Projects
Project proposals are due on Friday 3/10/17.
We will give you an approval to continue with your project, possibly along with comments and/or a request to modify/augment/do a different project. There may also be a mechanism for peer comments.
We encourage team projects; a team can be up to 3 people.
Please start thinking and working on the project now. Your proposal is limited to 1-2 pages, but needs to include references and, ideally, some of the ideas you have developed in the direction of the project (maybe even some preliminary results).
Any project that has a significant Machine Learning component is good. You can do experimental work, theoretical work, a combination of both, or a critical survey of results in some specialized topic.
The work has to include some reading. Even if you do not do a survey, you must read (at least) two related papers or book chapters and relate your work to them.
Originality is not mandatory but is encouraged. Try to make it interesting!

Examples
KDD Cup 2013: "Author-Paper Identification": given an author and a small set of papers, we are asked to identify which papers are really written by the author. https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge
Author Profiling: given a set of documents, profile the author: identification, gender, native language, ...
Caption Control: Is it gibberish? Spam? High quality text?
Adapt an NLP program to a new domain
Work on making learned hypotheses (e.g., linear threshold functions, NN) more comprehensible: explain the prediction
Develop a (multi-modal) People Identifier
Compare Regularization methods: e.g., Winnow vs. L1 Regularization
Large scale clustering of documents + name the cluster
Deep Networks: convert a state of the art NLP program to a deep network, efficient, architecture.
Try to prove something

A Guide
Today: take a more general perspective and think more about learning, learning protocols, quantifying performance, etc. This will motivate some of the ideas we will see next.
Learning Algorithms
  Search: (Stochastic) Gradient Descent with LMS
  Decision Trees & Rules
Importance of hypothesis space (representation)
How are we doing?
  Simplest: quantification in terms of cumulative # of mistakes
  More later
Perceptron
How to deal better with large feature spaces & sparsity? Winnow
Variations of Perceptron
Dealing with overfitting
Closing the loop: back to Gradient Descent
Dual Representations & Kernels
Multilayer Perceptron
Beyond Binary Classification? Multi-class classification and Structured Prediction
More general way to quantify learning performance (PAC)
New Algorithms (SVM, Boosting)

Quantifying Performance
We want to be able to say something rigorous about the performance of our learning algorithm.
We will concentrate on discussing the number of examples one needs to see before we can say that our learned hypothesis is good.

Learning Conjunctions
There is a hidden (monotone) conjunction the learner (you) is to learn:
$f = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
How many examples are needed to learn it? How?
Protocol I: The learner proposes instances as queries to the teacher.
Protocol II: The teacher (who knows $f$) provides training examples.
Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels ($f(x)$).

Learning Conjunctions
Protocol I: The learner proposes instances as queries to the teacher.
Since we know we are after a monotone conjunction:
Is $x_{100}$ in? <(1,1,1,...,1,0), ?> $f(x)=0$ (conclusion: Yes)
Is $x_{99}$ in? <(1,1,...,1,0,1), ?> $f(x)=1$ (conclusion: No)
Is $x_1$ in? <(0,1,...,1,1,1), ?> $f(x)=1$ (conclusion: No)
A straightforward algorithm requires n = 100 queries, and will produce as a result the hidden conjunction (exactly):
$h = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
What happens here if the conjunction is not known to be monotone? If we know of a positive example, the same algorithm works.
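The query strategy of Protocol I can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `learn_by_queries` and the `teacher` oracle are hypothetical, and the teacher is hard-coded to the running example.

```python
def learn_by_queries(f, n):
    """Return the 1-based indices of the variables in a hidden monotone conjunction.

    f is a membership oracle (the teacher); exactly n queries are made.
    """
    relevant = []
    for i in range(n):
        query = [1] * n
        query[i] = 0                 # all ones except position i
        if f(query) == 0:            # switching x_i off falsifies f, so x_i is in f
            relevant.append(i + 1)
    return relevant

# Hypothetical teacher for the running example f = x2 ^ x3 ^ x4 ^ x5 ^ x100.
HIDDEN = [2, 3, 4, 5, 100]

def teacher(x):
    return int(all(x[i - 1] == 1 for i in HIDDEN))

learned = learn_by_queries(teacher, 100)
```

Each query differs from the all-ones instance in a single position, which is why monotonicity (or a known positive example to start from) is needed.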

Learning Conjunctions
Protocol II: The teacher (who knows $f$) provides training examples.
<(0,1,1,1,1,0,...,0,1), 1> (We learned a superset of the good variables)
To show you that all these variables are required:
<(0,0,1,1,1,0,...,0,1), 0> need $x_2$
<(0,1,0,1,1,0,...,0,1), 0> need $x_3$
...
<(0,1,1,1,1,0,...,0,0), 0> need $x_{100}$
Modeling teaching is tricky.
A straightforward algorithm requires k = 6 examples to produce the hidden conjunction (exactly):
$f = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
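A minimal sketch of the teaching protocol, assuming the teacher sends one positive example with exactly the relevant variables on, followed by one negative example per relevant variable. The helper names `teaching_set` and `learn_from_teacher` are hypothetical.

```python
def teaching_set(relevant, n):
    """Teacher: one positive example, then one negative example per relevant variable."""
    positive = [1 if i + 1 in relevant else 0 for i in range(n)]
    examples = [(positive, 1)]
    for i in sorted(relevant):
        negative = positive.copy()
        negative[i - 1] = 0          # switching off x_i alone makes the label 0
        examples.append((negative, 0))
    return examples

def learn_from_teacher(examples):
    """Learner: keep a variable only if a negative example shows it is required."""
    positive = examples[0][0]
    confirmed = set()
    for x, label in examples[1:]:
        diff = [i for i in range(len(x)) if x[i] != positive[i]]
        if label == 0 and len(diff) == 1 and positive[diff[0]] == 1:
            confirmed.add(diff[0] + 1)
    return sorted(confirmed)

examples = teaching_set({2, 3, 4, 5, 100}, 100)
recovered = learn_from_teacher(examples)
```

The learner here depends on knowing the teacher's convention, which is one way to read the remark that modeling teaching is tricky.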

Learning Conjunctions
Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels ($f(x)$).
<(1,1,1,1,1,1,...,1,1), 1>
<(1,1,1,0,0,0,...,0,0), 0>
<(1,1,1,1,1,0,...,0,1,1), 1>
<(1,0,1,1,1,0,...,0,1,1), 0>
<(1,1,1,1,1,0,...,0,0,1), 1>
<(1,0,1,0,0,0,...,0,1,1), 0>
<(1,1,1,1,1,1,...,0,1), 1>
<(0,1,0,1,0,0,...,0,1,1), 0>

Learning Conjunctions
Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels ($f(x)$).
Algorithm: Elimination
  Start with the set of all literals as candidates.
  Eliminate a literal that is not active (0) in a positive example.
Start: $h = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge \dots \wedge x_{100}$
<(1,1,1,1,1,1,...,1,1), 1>
<(1,1,1,0,0,0,...,0,0), 0> learned nothing
<(1,1,1,1,1,0,...,0,1,1), 1> $h = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{99} \wedge x_{100}$
<(1,0,1,1,0,0,...,0,0,1), 0> learned nothing
<(1,1,1,1,1,0,...,0,0,1), 1> $h = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
<(1,0,1,0,0,0,...,0,1,1), 0>
<(1,1,1,1,1,1,...,0,1), 1>
<(0,1,0,1,0,0,...,0,1,1), 0>
Final hypothesis: $h = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
Is that good? Performance? # of examples?

Learning Conjunctions
Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels ($f(x)$).
Final hypothesis: $h = x_1 \wedge x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
With the given data, we only learned an approximation to the true concept.
Is it good? Performance? # of examples?
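The Elimination algorithm for monotone conjunctions can be sketched as below. The six-variable data set is a made-up miniature, not the slide's data: it assumes a hidden target $x_2 \wedge x_3 \wedge x_5$ and is chosen so that, as on the slide, $x_1$ is never seen at 0 in a positive example.

```python
def eliminate(examples, n):
    """Elimination for monotone conjunctions; returns 1-based surviving variables."""
    candidates = set(range(1, n + 1))       # start with all variables
    for x, label in examples:
        if label == 1:                       # only positive examples teach us anything
            candidates -= {i for i in candidates if x[i - 1] == 0}
    return sorted(candidates)

# Hypothetical data, labeled by the assumed target x2 ^ x3 ^ x5 over n = 6 variables.
data = [
    ((1, 1, 1, 1, 1, 1), 1),   # nothing eliminated
    ((1, 1, 1, 0, 0, 0), 0),   # negative: ignored
    ((1, 1, 1, 0, 1, 1), 1),   # eliminates x4
    ((1, 0, 1, 1, 1, 0), 0),   # negative: ignored
    ((1, 1, 1, 0, 1, 0), 1),   # eliminates x6
]
hypothesis = eliminate(data, 6)
```

The result keeps $x_1$ even though it is not in the assumed target, so the hypothesis is a superset of the true conjunction: it never labels a negative instance as positive, but may reject some positive ones.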

Two Directions
Can continue to analyze the probabilistic intuition:
  Never saw $x_1 = 0$ in positive examples, maybe we'll never see it?
  And if we will, it will be with small probability, so the concepts we learn may be pretty good.
  Good: in terms of performance on future data
  PAC framework
Mistake Driven Learning algorithms
  (Now, we can only reason about #(mistakes), not #(examples))
  Update your hypothesis only when you make mistakes.
  Good: in terms of how many mistakes you make before you stop, happy with your hypothesis.
  Note: not all on-line algorithms are mistake driven, so the performance measure could be different.

On-Line Learning
Two new learning algorithms (learn a linear function over the feature space):
  Perceptron (+ many variations)
  Winnow
General Gradient Descent view
Issues:
  Importance of Representation
  Complexity of Learning
  Idea of Kernel Based Methods
  More about features

Motivation
Consider a learning problem in a very high dimensional space $\{x_1, x_2, x_3, \dots, x_{1000000}\}$.
And assume that the function space is very sparse (every function of interest depends on a small number of attributes):
$f = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge \dots \wedge x_{100}$
Middle Eastern deserts are known for their sweetness
Can we develop an algorithm that depends only weakly on the space dimensionality and mostly on the number of relevant attributes?
How should we represent the hypothesis?

On-Line Learning
Of general interest; simple and intuitive model.
Robot in an assembly line, language learning, ...
Important in the case of very large data sets, when the data cannot fit in memory: streaming data.
Evaluation: We will try to make the smallest number of mistakes in the long run.
What is the relation to the real goal? Generate a hypothesis that does well on previously unseen data.

Model: On-Line Learning
Not the most general setting for on-line learning. Not the most general metric (Regret: cumulative loss; Competitive analysis).
Instance space: $X$ (dimensionality $n$)
Target: $f: X \to \{0,1\}$, $f \in C$, concept class (parameterized by $n$)
Protocol: learner is given $x \in X$; learner predicts $h(x)$, and is then given $f(x)$ (feedback)
Performance: learner makes a mistake when $h(x) \neq f(x)$
$M_A(f, S)$: number of mistakes algorithm $A$ makes on sequence $S$ of examples, for the target function $f$
$M_A(C) = \max_{f \in C,\, S} M_A(f, S)$
$A$ is a mistake bound algorithm for the concept class $C$ if $M_A(C)$ is a polynomial in $n$, the complexity parameter of the target concept.
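The protocol above (predict, receive feedback, count mistakes) can be written as a small harness. This is a sketch: `run_online` and the rote learner are hypothetical names, and the rote learner is only a placeholder used to exercise the loop, not an algorithm from the lecture.

```python
def run_online(predict, update, f, S):
    """Return M_A(f, S): the number of mistakes made on the sequence S."""
    mistakes = 0
    for x in S:
        guess = predict(x)       # learner predicts h(x)
        truth = f(x)             # then the true label f(x) is revealed
        if guess != truth:
            mistakes += 1
        update(x, truth)         # learner may revise its hypothesis
    return mistakes

# Placeholder learner: remembers seen labels, predicts 0 on unseen instances.
memory = {}

def rote_predict(x):
    return memory.get(x, 0)

def rote_update(x, label):
    memory[x] = label

target = lambda x: int(x[0] and x[1])        # assumed target: x1 AND x2
S = [(0, 1), (1, 1), (0, 1), (1, 1)]
online_mistakes = run_online(rote_predict, rote_update, target, S)
```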

On-Line/Mistake Bound Learning
We could ask: how many mistakes to get to $\epsilon$-$\delta$ (PAC) behavior?
Instead, looking for exact learning. (easier to analyze)
No notion of distribution; a worst case model
Memory: get example, update hypothesis, get rid of it (??)
Drawbacks:
  Too simple
  Global behavior: not clear when the mistakes will be made
Advantages:
  Simple
  Many issues arise already in this setting
  Generic conversion to other learning models
  Equivalent to PAC for natural problems (?)

Generic Mistake Bound Algorithms
Is it clear that we can bound the number of mistakes?
Let $C$ be a finite concept class. Learn $f \in C$.
CON: In the $i$th stage of the algorithm:
  $C_i$: all concepts in $C$ consistent with all $i-1$ previously seen examples.
  Choose randomly $f \in C_i$ and use it to predict the next example.
Clearly, $C_{i+1} \subseteq C_i$ and, if a mistake is made on the $i$th example, then $|C_{i+1}| < |C_i|$, so progress is made.
The CON algorithm makes at most $|C| - 1$ mistakes.
Can we do better?
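A minimal sketch of CON over a small finite class. The class used here (all monotone conjunctions over three variables) and the helper names `conj` and `con` are assumptions made for illustration.

```python
import random
from itertools import combinations, product

def conj(indices):
    """Monotone conjunction over the given 0-based variable indices."""
    return lambda x: int(all(x[i] for i in indices))

def con(concepts, f, S, rng):
    """Run CON on the sequence S; return the number of mistakes."""
    consistent = list(concepts)              # C_1 = C
    mistakes = 0
    for x in S:
        h = rng.choice(consistent)           # any concept still consistent
        truth = f(x)
        if h(x) != truth:
            mistakes += 1
        # C_{i+1}: keep only concepts that agree with the revealed label
        consistent = [c for c in consistent if c(x) == truth]
    return mistakes

# Hypothetical class: the 8 monotone conjunctions over 3 variables.
C = [conj(s) for k in range(4) for s in combinations(range(3), k)]
target = conj((0, 2))
S = list(product([0, 1], repeat=3))
con_mistakes = con(C, target, S, random.Random(0))
```

Every mistake removes at least the concept that was used to predict, which is where the $|C| - 1$ bound comes from.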

The Halving Algorithm
Let $C$ be a concept class. Learn $f \in C$.
Halving: In the $i$th stage of the algorithm:
  $C_i$: all concepts in $C$ consistent with all $i-1$ previously seen examples.
  Given an example $e_i$, consider the value $f_j(e_i)$ for all $f_j \in C_i$ and predict by majority.
  Predict 1 if $|\{f_j \in C_i;\ f_j(e_i) = 0\}| < |\{f_j \in C_i;\ f_j(e_i) = 1\}|$
Clearly $C_{i+1} \subseteq C_i$ and, if a mistake is made on the $i$th example, then $|C_{i+1}| \le \frac{1}{2}|C_i|$.
The Halving algorithm makes at most $\log(|C|)$ mistakes.
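The Halving algorithm can be sketched the same way, by brute-force enumeration of a small class. The class (monotone conjunctions over three variables) and the names `conj` and `halving` are illustrative assumptions; ties are broken toward predicting 0.

```python
from itertools import combinations, product
from math import log2

def conj(indices):
    """Monotone conjunction over the given 0-based variable indices."""
    return lambda x: int(all(x[i] for i in indices))

def halving(concepts, f, S):
    """Predict by majority vote of the consistent concepts; return #mistakes."""
    consistent = list(concepts)
    mistakes = 0
    for x in S:
        ones = sum(c(x) for c in consistent)
        guess = 1 if ones > len(consistent) - ones else 0
        truth = f(x)
        if guess != truth:
            mistakes += 1        # the (weak) majority was wrong: at least half are removed
        consistent = [c for c in consistent if c(x) == truth]
    return mistakes

# Hypothetical class: the 8 monotone conjunctions over 3 variables.
C = [conj(s) for k in range(4) for s in combinations(range(3), k)]
S = list(product([0, 1], repeat=3))
halving_mistakes = halving(C, conj((0, 2)), S)
```

Enumerating $C$ explicitly is what makes Halving expensive in general; the sketch works only because the class has 8 members.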

The Halving Algorithm
Hard to compute.
In some cases Halving is optimal (C: the class of all Boolean functions).
In general, to be optimal, instead of guessing in accordance with the majority of the valid concepts, we should guess according to the concept group that gives the least number of expected mistakes (even harder to compute).

Learning Conjunctions
Can mistakes be bounded in the non-finite case? Can this bound be achieved?
There is a hidden conjunction the learner is to learn:
$f = x_2 \wedge x_3 \wedge x_4 \wedge x_5 \wedge x_{100}$
The number of conjunctions: $|C| = 3^n$, so $\log(|C|) = O(n)$.
The algorithm makes n mistakes.
Learn k-conjunctions: assume that only $k \ll n$ attributes occur in the conjunction.
The number of k-conjunctions: $|C(n,k)| = 2^k \binom{n}{k} \approx 2^k n^k$, so $\log(|C|) \approx k \log n$.
Can we learn efficiently with this number of mistakes?
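To get a feel for the two counts, the logarithms can be evaluated for concrete sizes. The values n = 100 and k = 5 are assumptions chosen to match the running example; the count $3^n$ reflects that each variable appears positively, negated, or not at all.

```python
from math import comb, log2

n, k = 100, 5                       # assumed sizes, matching the running example

# All conjunctions: 3^n of them, so log2 |C| = n * log2(3).
log_all = n * log2(3)

# k-conjunctions: choose k variables, then a sign for each: 2^k * C(n, k).
log_k = k + log2(comb(n, k))

# The slide's approximation: log |C| is about k log n.
approx_k = k * log2(n)
```

For these sizes the Halving bound drops from roughly 158 mistakes to roughly 31, which is what motivates looking for an efficient algorithm whose mistake bound depends mostly on k.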

Representation
Assume that you want to learn conjunctions. Should your hypothesis space be the class of conjunctions?
Theorem: Given a sample on n attributes that is consistent with a conjunctive concept, it is NP-hard to find a pure conjunctive hypothesis that is both consistent with the sample and has the minimum number of attributes.
[David Haussler, AIJ 88: "Quantifying Inductive Bias: AI Learning Algorithms and Valiant's Learning Framework"]
The same holds for disjunctions.
Intuition: Reduction to the minimum set cover problem. Given a collection of sets that cover X, define a set of examples so that learning the best (dis/con)junction implies a minimal cover.
Consequently, we cannot learn the concept efficiently as a (dis/con)junction.
But we will see that we can do that if we are willing to learn the concept as a Linear Threshold function.
In a more expressive class, the search for a good hypothesis sometimes becomes combinatorially easier.

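Linear Functions
$f(x) = 1$ if $w_1 x_1 + w_2 x_2 + \dots + w_n x_n \ge \theta$, and $0$ otherwise.
Disjunctions: $y = x_1 \vee x_3 \vee x_5$ is $y = (1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 \ge 1)$
At least m of n: $y$ = at least 2 of $\{x_1, x_3, x_5\}$ is $y = (1 \cdot x_1 + 1 \cdot x_3 + 1 \cdot x_5 \ge 2)$
Exclusive-OR: $y = (x_1 \wedge \neg x_2) \vee (\neg x_1 \wedge x_2)$
Non-trivial DNF: $y = (x_1 \wedge x_2) \vee (x_3 \wedge x_4)$
(The last two cannot be written as a single linear threshold function.)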
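The disjunction and the at-least-2-of-3 function can be checked exhaustively against their linear threshold forms. This is a small sketch; the helper `ltu` and the choice of five input variables are assumptions.

```python
from itertools import product

def ltu(w, theta):
    """Linear threshold unit: 1 if w . x >= theta, else 0."""
    return lambda x: int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

# Weights 1 on x1, x3, x5 and 0 elsewhere (0-based positions 0, 2, 4).
disjunction = ltu([1, 0, 1, 0, 1], 1)      # x1 v x3 v x5
at_least_2 = ltu([1, 0, 1, 0, 1], 2)       # at least 2 of {x1, x3, x5}

# Compare with the Boolean definitions on all 32 inputs.
agree = all(
    disjunction(x) == int(x[0] or x[2] or x[4])
    and at_least_2(x) == int(x[0] + x[2] + x[4] >= 2)
    for x in product([0, 1], repeat=5)
)
```

Only the threshold changes between the two functions; the weight vector is the same.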

[Figure: linear separators of the form $w \cdot x = \theta$ and $w \cdot x = 0$, with negative examples marked by "-".]

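Footnote About the Threshold
On the previous slide, Perceptron has no threshold.
But we don't lose generality: take $x \Leftarrow \langle x, 1 \rangle$ and $w \Leftarrow \langle w, -\theta \rangle$, so that $w \cdot x \ge \theta$ becomes $\langle w, -\theta \rangle \cdot \langle x, 1 \rangle \ge 0$.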
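The threshold trick can be verified numerically: append a constant 1 to every instance and $-\theta$ to the weight vector. The weights and threshold below are arbitrary assumed values.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w, theta = [2.0, -1.0, 0.5], 0.75          # arbitrary illustrative values
w_aug = w + [-theta]                        # <w, -theta>

same_decision = all(
    (dot(w, x) >= theta) == (dot(w_aug, list(x) + [1]) >= 0)   # <x, 1>
    for x in product([0, 1], repeat=3)
)
```

After this change of representation the learner only needs to handle separators through the origin, at the cost of one extra dimension.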

Perceptron learning rule
On-line, mistake driven algorithm.
Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule.
(Perceptron == Linear Threshold Unit)
[Figure: a linear threshold unit with weighted inputs ($w_1, \dots, w_6$), threshold $T$, and output $y$.]

Perceptron learning rule
We learn $f: X \to \{-1, +1\}$ represented as $f = \mathrm{sgn}(w \cdot x)$, where $X = \{0,1\}^n$ or $X = R^n$ and $w \in R^n$.
Given labeled examples: $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$
1. Initialize $w = 0 \in R^n$
2. Cycle through all examples
   a. Predict the label of instance $x$ to be $y' = \mathrm{sgn}(w \cdot x)$
   b. If $y' \neq y$, update the weight vector: $w = w + r\,y\,x$ ($r$: a constant, the learning rate).
      Otherwise, if $y' = y$, leave weights unchanged.
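The rule above translates almost line by line into Python. This sketch makes two assumptions not fixed by the slide: $\mathrm{sgn}(0)$ is taken to be $+1$, and training stops after the first pass with no mistakes (or after `max_epochs`). The data set is an illustrative OR function with a constant third feature acting as the threshold.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgn(z):
    return 1 if z >= 0 else -1               # assumption: sgn(0) = +1

def perceptron(examples, r=1.0, max_epochs=100):
    """Cycle through the examples, updating w only on mistakes."""
    w = [0.0] * len(examples[0][0])          # 1. initialize w = 0
    for _ in range(max_epochs):              # 2. cycle through all examples
        mistakes = 0
        for x, y in examples:
            if sgn(dot(w, x)) != y:          # mistake: w = w + r * y * x
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                    # a clean pass: w separates the data
            break
    return w

# Illustrative data: y = +1 iff x1 OR x2; the last feature is a constant 1.
or_data = [((0, 0, 1), -1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
w_or = perceptron(or_data)
```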

Perceptron in action
[Figures from Bishop 2006, showing one update step: the current decision boundary $w \cdot x = 0$ and current weight vector $w$; the next item $x$ to be classified (with $y = +1$); $x$ as a vector, added to $w$; the new weight vector $w$ and the new decision boundary $w \cdot x = 0$. Positive and negative examples are marked.]

Perceptron learning rule
If $x$ is Boolean, only weights of active features are updated. Why is this important?
1. Initialize $w = 0 \in R^n$
2. Cycle through all examples
   a. Predict the label of instance $x$ to be $y' = \mathrm{sgn}(w \cdot x)$
   b. If $y' \neq y$, update the weight vector to $w = w + r\,y\,x$ ($r$: a constant, the learning rate).
      Otherwise, if $y' = y$, leave weights unchanged.
$w \cdot x > 0$ is equivalent to $\frac{1}{1 + \exp\{-(w \cdot x)\}} > \frac{1}{2}$
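One possible reading of why updating only active features matters: with Boolean $x$, the update $w + r\,y\,x$ changes a weight only where $x_i = 1$, so both prediction and update cost time proportional to the number of active features rather than to $n$. The sketch below assumes a sparse representation (a dict of weights and a set of active indices); the names are hypothetical.

```python
def sparse_score(w, active):
    """w . x for Boolean x, given only the indices where x_i = 1."""
    return sum(w.get(i, 0.0) for i in active)

def sparse_update(w, active, y, r=1.0):
    """Perceptron update w = w + r*y*x, touching only the active features."""
    for i in active:
        w[i] = w.get(i, 0.0) + r * y
    return w

weights = {}                                   # implicit zero vector of any dimension
active = {3, 999999}                           # two active features out of a million
sparse_update(weights, active, y=+1)
score = sparse_score(weights, active)
```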

Perceptron Learnability
Obviously can't learn what it can't represent (???)
  Only linearly separable functions
Minsky and Papert (1969) wrote an influential book demonstrating Perceptron's representational limitations.
  Parity functions can't be learned (XOR)
  In vision, if patterns are represented with local features, can't represent symmetry, connectivity
Research on Neural Networks stopped for years.
Rosenblatt himself (1959) asked, "What pattern recognition problems can be transformed so as to become linearly separable?"

[Figure: $(x_1 \wedge x_2) \vee (x_3 \wedge x_4)$, with the two conjuncts labeled $y_1$ and $y_2$.]

Perceptron Convergence
Perceptron Convergence Theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron learning algorithm will converge.
How long would it take to converge?
Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
How to provide robustness, more expressivity?
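The cycling behavior can be observed on XOR, which is not linearly separable. The sketch below assumes $\mathrm{sgn}(0) = +1$, learning rate 1, a fixed presentation order, and a constant third feature for the threshold; it stops as soon as the weights at the end of a pass have been seen before.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgn(z):
    return 1 if z >= 0 else -1                 # assumption: sgn(0) = +1

def train_until_repeat(examples, r=1, max_epochs=1000):
    """Return (converged, epochs): stop on a clean pass or a repeated weight vector."""
    w = [0] * len(examples[0][0])
    seen = {tuple(w)}
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x, y in examples:
            if sgn(dot(w, x)) != y:
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            return True, epoch                 # consistent weights found
        if tuple(w) in seen:
            return False, epoch                # same weights again: a cycle
        seen.add(tuple(w))
    return False, max_epochs

xor_data = [((0, 0, 1), -1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), -1)]
or_data = [((0, 0, 1), -1), ((0, 1, 1), 1), ((1, 0, 1), 1), ((1, 1, 1), 1)]
xor_converged, _ = train_until_repeat(xor_data)
or_converged, _ = train_until_repeat(or_data)
```

Because the presentation order is fixed, a repeated end-of-pass weight vector means every later pass repeats as well, which is the infinite loop the theorem describes.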