When Dictionary Learning Meets Classification


When Dictionary Learning Meets Classification

Teresa Bufford (UCLA, tdbufford@ucla.edu), Yuxin Chen (Dalhousie University, yuxinchen612@gmail.com), Mitchell Horning (Harvey Mudd College, mhorning@hmc.edu), Liberty Shee (UCLA, lshee@g.ucla.edu)

Supervised by: Prof. Yohann Tero (UCLA, tero@math.ucla.edu)

August 9, 2013

Abstract

This report details and extends the implementation of the methods proposed by Sprechmann et al. [16], which aim at clustering signals. Two setups are considered: supervised and unsupervised clustering. For unsupervised clustering, the algorithms proposed in [16] combine spectral clustering and dictionary learning. Two unsupervised algorithms are proposed, and a variant of each is given. Thus, we have implemented the five algorithms of [16], as well as a k-means variant. Five of them are described thoroughly in this report, and a complete Matlab code is available at http://dev.ipol.im/~tero/code_archive_[8.9.13].zip, which should allow one to easily reproduce all our results. Our experiments agree with [16] for the supervised clustering case. Despite our efforts, we were unable to reproduce the unsupervised clustering results on MNIST using the similarity measure $S_1$; this is discussed in Section 7. Overall, our experiments agree with [16]. The unsupervised dictionary learning with k-means algorithm gave slightly better results than [16] for the unsupervised MNIST experiment (digits $\{0,\dots,4\}$). We have also shown, through experimentation, that the supervised clustering is robust with respect to Gaussian noise.

Contents

1 Introduction
2 Dictionary Learning
3 Related Work
4 Algorithms
  4.1 Supervised Clustering
  4.2 Semisupervised Clustering
  4.3 Unsupervised Clustering (by Signals)
  4.4 Unsupervised Clustering (by Atoms)
  4.5 Unsupervised Clustering (by Atoms/Split Initialization)
  4.6 Unsupervised Clustering (by Signals), kmeans
5 Experiments and Results
  5.1 Sprechmann Results
  5.2 Supervised
    5.2.1 Non-centered and Non-normalized
    5.2.2 Centered and Normalized
  5.3 Semisupervised
  5.4 Unsupervised - Signals
    5.4.1 Non-centered and Non-normalized
    5.4.2 Centered and Normalized
    5.4.3 Changes over Refinements
  5.5 Unsupervised - Atoms
    5.5.1 Non-centered and Non-normalized
    5.5.2 Centered and Normalized
    5.5.3 Changes over Refinements
  5.6 Unsupervised - Atoms, split in 2 each iteration
    5.6.1 Non-centered and Non-normalized
    5.6.2 Centered and Normalized
  5.7 Unsupervised (kmeans) - Signals
    5.7.1 Non-centered and Non-normalized
    5.7.2 Centered and Normalized
  5.8 Gaussian Noise Experiments
    5.8.1 Classifying Pure Gaussian Noise
    5.8.2 Adding Noise to Test Images
    5.8.3 Adding Noise to Training and Test Images
6 Code and Toolboxes
7 Discussion
8 Conclusion

1 Introduction

Image recognition and classification is a common problem studied in computer vision research. The disparity between a human's and a computer's ability to recognize and classify images is the basis behind CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart), which determine whether or not a user is human based on the user's ability to identify objects in an image. The Asirra (Animal Species Image Recognition for Restricting Access) challenge, proposed at ACM CCS 2007, is a CAPTCHA that specifically uses cat and dog images to determine whether the user is human, by testing the ability to distinguish between the two classes of animals. According to [4], Asirra can be solved by humans 99.6% of the time in under 30 seconds. Teaching a computer to classify cats and dogs is difficult, due to both large intra-class variation (e.g., the physical differences between breeds of cats and dogs, and the flexibility of the animals, which allows even a single animal to appear in many shapes and sizes) and small inter-class variation (e.g., the similarity of cats and dogs in general shape and size). In [6], the authors describe a classifier that combines support-vector machines (SVMs) to obtain an 82.7% accuracy in distinguishing and classifying the images of cats and dogs used in Asirra.

Our research has centered on using dictionary learning methods to classify cat and dog image signals. In dictionary learning methods, a dictionary is constructed from training signals and used to classify test signals. To measure the viability of the various methods, we first concentrated on classifying a simpler data set: the MNIST dataset of handwritten digits. This report details the experiments that have been reimplemented so far, following the algorithms published in [16] for supervised dictionary learning (training signals labeled) and unsupervised dictionary learning (training signals unlabeled).

2 Dictionary Learning

The cornerstone of dictionary learning is to find a sparse representation $\alpha \in \mathbb{R}^K$ of a signal $x \in \mathbb{R}^n$ for a dictionary $D \in \mathbb{R}^{n \times K}$ such that the reconstructed image $D\alpha$ is as close as possible to $x$. In the following, this is done by solving the problem

$$\arg\min_{\alpha \in \mathbb{R}^K} \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1 \qquad (1)$$

where $x \in \mathbb{R}^n$ is the signal being classified, $D \in \mathbb{R}^{n \times K}$ is the dictionary whose columns in $\mathbb{R}^n$ are called atoms (or features), and $\alpha \in \mathbb{R}^K$ is the coefficient vector of the signal. The existence of this minimum will be justified shortly. The parameter $\lambda$ balances the trade-off between the reconstruction error $\|x - D\alpha\|_2^2$ and the sparsity of the decomposition, measured by $\|\alpha\|_1$.

Given a collection of dictionaries $D_1, \dots, D_N \in \mathbb{R}^{n \times K}$, classifying a signal $x \in \mathbb{R}^n$ consists of:

1. using sparse coding to compute $\alpha_1, \dots, \alpha_N \in \mathbb{R}^K$, the representations of $x$ in each dictionary $D_i$;
2. comparing the cost of the representations $\alpha_i$ in each dictionary and assigning $x$ to the least costly dictionary $D_i$ in the sense of (1).

In other words, in order to classify a signal $x$ as one of $N$ possible classes using a trained collection of dictionaries $\{D_1, \dots, D_N\}$, we calculate

$$E_i(x) = \min_{\alpha \in \mathbb{R}^K} \|x - D_i\alpha\|_2^2 + \lambda \|\alpha\|_1 \qquad (2)$$

for $i = 1, \dots, N$. Assuming that $\arg\min_{i \in \{1,\dots,N\}} E_i(x)$ is unique, we define $i^* = \arg\min_{i \in \{1,\dots,N\}} E_i(x)$ and assign $x$ to class $i^*$.

It remains to show that the energy defined in (2),

$$E(x) = \min_{\alpha \in \mathbb{R}^K} \left( \|x - D\alpha\|_2^2 + \lambda \|\alpha\|_1 \right) : \mathbb{R}^n \to [0, \infty),$$

where $\lambda \ge 0$, is a well-defined function.
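As an illustration of the classification rule (2), the following is a minimal Matlab sketch using the SPAMS solver mexLasso; the helper name classify_by_energy and all variable names are ours for illustration and are not taken from the released code, and the factor of 2 applied to lambda reflects SPAMS mode 2 minimizing $(1/2)\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$ rather than (1) directly:

    % Classify each test signal (the columns of Y, n x l) with dictionaries Ds{1..N}.
    % Assumes the SPAMS Matlab interface (mexLasso) is on the path.
    function labels = classify_by_energy(Y, Ds, lambda)
        N = numel(Ds);                       % number of classes / dictionaries
        l = size(Y, 2);                      % number of test signals
        E = zeros(N, l);                     % E(i,h) = energy of signal h in dictionary D_i
        param.lambda = lambda / 2;           % SPAMS mode 2 halves the squared error
        param.mode = 2;
        for i = 1:N
            A = mexLasso(Y, Ds{i}, param);   % sparse codes of all test signals in D_i
            R = Y - Ds{i} * A;               % reconstruction residuals
            E(i, :) = sum(R.^2, 1) + lambda * sum(abs(A), 1);   % energy (2)
        end
        [~, labels] = min(E, [], 1);         % assign each signal to the least costly dictionary
    end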

Proof. Since $\|x - D\alpha\|_2 \ge 0$ and $\|\alpha\|_1 \ge 0$ for all $\alpha \in \mathbb{R}^K$ and $x \in \mathbb{R}^n$, we have $E(\mathbb{R}^n) \subseteq [0, \infty)$. It remains to show that $G(\alpha) = \|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1 : \mathbb{R}^K \to [0, \infty)$, with $\lambda \ge 0$, has a global minimum. Because any locally optimal point of a convex optimization problem is globally optimal, we want to show that $G$ is convex. First we note that the domain of $G$ is $\mathbb{R}^K$, which is closed and convex. Let $x \in \mathbb{R}^n$, $\alpha_1, \alpha_2 \in \mathbb{R}^K$ and $t \in [0, 1]$; by definition, we need to show that $G(t\alpha_1 + (1-t)\alpha_2) \le tG(\alpha_1) + (1-t)G(\alpha_2)$:

$$
\begin{aligned}
G(t\alpha_1 + (1-t)\alpha_2) &= \|x - tD\alpha_1 - (1-t)D\alpha_2\|_2^2 + \lambda\|t\alpha_1 + (1-t)\alpha_2\|_1 \\
&= \|t(x - D\alpha_1) + (1-t)(x - D\alpha_2)\|_2^2 + \lambda\|t\alpha_1 + (1-t)\alpha_2\|_1 \\
&\le t\|x - D\alpha_1\|_2^2 + (1-t)\|x - D\alpha_2\|_2^2 + \lambda t\|\alpha_1\|_1 + \lambda(1-t)\|\alpha_2\|_1 \\
&\qquad \text{(by positive scalability and the triangle inequality of norms, convexity of $s \mapsto s^2$, and $t \in [0,1]$, $\lambda \ge 0$)} \\
&= t\left(\|x - D\alpha_1\|_2^2 + \lambda\|\alpha_1\|_1\right) + (1-t)\left(\|x - D\alpha_2\|_2^2 + \lambda\|\alpha_2\|_1\right) \\
&= tG(\alpha_1) + (1-t)G(\alpha_2).
\end{aligned}
$$

Moreover, $G$ is continuous and, for $\lambda > 0$, coercive because of the term $\lambda\|\alpha\|_1$ (while for $\lambda = 0$ the least-squares minimum is attained), so the minimum is in fact attained. Because we are minimizing a convex function over the closed convex set $\mathbb{R}^K$, $E(x) = \min_{\alpha \in \mathbb{R}^K}\left(\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1\right)$ exists and is a well-defined function.

3 Related Work

Several works on the accurate classification of cats and dogs have been published in the last half-decade. Here we detail some of the more interesting approaches that we have encountered.

[9] introduces a new approach that localizes the features used for classification at object parts, applying appearance-based sliding-window detectors and a probabilistic consensus of geometric models to detect fine-grained images in which instances from different classes share common parts but vary in shape and appearance. [14] is responsible for the creation of the Oxford-IIIT-Pet dataset that we have been using, and concerns itself with the problem of fine-grained object categorization; the authors create a model that combines shape and appearance/texture features for the discrimination. [17] explains the process of object detection with simple rectangular Haar-like features, rather than with pixels, to provide rapid and accurate detection; the reasons for using rectangular features over pixels are that information is lost when using pixels alone and that the system operates faster on features than on pixels. In [5], a discriminative approach to object detection is introduced; this approach induces classifiers directly from training data, without a data model. In general, the system learns a pose-specific binary classifier and applies it many times to different sub-windows of an image, checking whether the target is in the sub-window. [1] introduces a Nearest-Neighbor image classification method that is faster than other image classification methods; the authors propose a Nearest-Neighbor-based classifier called Naive-Bayes Nearest-Neighbor, which uses image-to-class distances without descriptor quantization.

We are also interested in previous research in dictionary learning. [13] extends the generalization of discriminative image understanding tasks, such as texture segmentation and feature selection, by proposing a multi-scale method that minimizes least-squares reconstruction errors and discriminative cost functions under $\ell_0$ or $\ell_1$ regularization constraints. [13] learns multiple dictionaries that are simultaneously reconstructive and discriminative and uses the reconstruction errors of these dictionaries on image patches to derive a pixelwise classification. The method proposes the following novelties: first, redundant non-parametric dictionaries are learned; second, the sparse local representations are learned with an explicit discriminative goal.

[8] presents a label-consistent K-SVD algorithm to learn a discriminative dictionary for sparse coding. It is a supervised algorithm that incorporates a discriminative sparse-coding error criterion and an optimal classification performance criterion into the objective function and optimizes it. Because the learned dictionary provides discriminative sparse representations of the signals, good accuracy on object classification is achieved even with a simple multi-class linear classifier. [19] extends the K-SVD algorithm to learn an overcomplete dictionary from a set of labeled training face images; the authors also propose a corresponding classification algorithm based on the learned dictionary by incorporating the classification stage directly into the dictionary-learning procedure. [1] provides formulations of dictionary learning algorithms that are adapted to performing tasks other than data reconstruction. Specifically, they provide a dictionary learning algorithm for multi-class classification which gives state-of-the-art results. Along with this, they investigate a theory for dictionary learning algorithms and show that the problem is smooth under a set of three assumptions. Thirdly, they support the use of semi-supervised dictionary learning algorithms, which can make use of unlabeled data along with labeled data when learning the dictionaries. [7] applies modularity to classifying the MNIST digits 0 to 9 and achieves a 3.6% misclassification rate.

4 Algorithms

4.1 Supervised Clustering

For supervised dictionary learning, the MNIST training images were clustered according to their digit label $\{0,\dots,9\}$. Each cluster was used to construct its respective dictionary $\{D_0,\dots,D_9\}$, all of which were then used to classify the test images.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the dictionaries), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set digitvector (the digits to be classified);
    Set the decisions center data and l2 normalize data to true or false;
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
Step 2. Train the dictionaries $\{D_0,\dots,D_{N-1}\} \subset \mathbb{R}^{n \times K}$ independently;
Step 3. Classify each test image $y_h$:
    for h = 1 to l do
        compute $E_i(y_h) = \min_{\alpha \in \mathbb{R}^K} \|y_h - D_i\alpha\|_2^2 + \lambda\|\alpha\|_1$ for $i = 0,\dots,N-1$;
        take $i^* = \arg\min_{i \in \{0,\dots,N-1\}} E_i(y_h)$; $y_h$ is classified as class $i^*$;
Step 4. Get the results: compute the misclassification rate; show the error histograms.

Algorithm 1: Supervised Dictionary Learning on MNIST images
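A minimal Matlab sketch of Steps 1-2 (per-class dictionary training) using the SPAMS routine mexTrainDL; the preprocessing lines, variable names, and the iteration count are our own placeholders, not the exact settings of the released code:

    % X: n x m matrix of training signals (columns); labels: 1 x m vector in {0,...,N-1}.
    function Ds = train_supervised(X, labels, N, K, lambda)
        % Optional preprocessing (center data / l2 normalize data):
        %   X = X - mean(X, 1);              % zero-mean each signal
        %   X = X ./ sqrt(sum(X.^2, 1));     % unit l2 norm per signal
        param.K = K;                 % atoms per class dictionary
        param.lambda = lambda / 2;   % SPAMS mode-2 scaling of the l1 penalty
        param.mode = 2;
        param.iter = 100;            % placeholder number of dictionary-learning iterations
        Ds = cell(1, N);
        for i = 0:N-1
            Xi = X(:, labels == i);           % training images of class i
            Ds{i+1} = mexTrainDL(Xi, param);  % train D_i independently (Step 2)
        end
    end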

4.2 Semisupervised Clustering

For semisupervised dictionary learning, a percentage of the MNIST training images had labels that were known to be correct, and the remaining percentage was assigned random labels. The training images were then clustered according to their digit label $\{0,\dots,9\}$. Each cluster was used to construct its respective dictionary $\{D_0,\dots,D_9\}$, all of which were then used to classify the test images.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the dictionaries), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set digitvector (the digits to be classified) and perturbation percent (the percentage of training images to be assigned random labels);
    Set the decisions center data and l2 normalize data to true or false;
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
    Assign random labels from digitvector to perturbation percent of the MNIST training images;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
Step 2. Train the dictionaries $\{D_0,\dots,D_{N-1}\} \subset \mathbb{R}^{n \times K}$ independently;
Step 3. Refine the initial set of dictionaries:
    for iteration = 1 to max iter refinement do
        classify the m training images using the current dictionaries by minimizing Equation (1):
        for j = 1 to m do
            compute $E_i(x_j) = \min_{\alpha \in \mathbb{R}^K} \|x_j - D_i\alpha\|_2^2 + \lambda\|\alpha\|_1$ for $i = 0,\dots,N-1$;
            take $i^* = \arg\min_{i \in \{0,\dots,N-1\}} E_i(x_j)$; $x_j$ is classified as class $i^*$;
        for each $i = 0,\dots,N-1$, train $D_i \in \mathbb{R}^{n \times k_2}$ using the training images classified to class $i$;
Step 4. Get the results: classify the test images; compute the misclassification rate; show the error histograms.

Algorithm 2: Semisupervised Dictionary Learning on MNIST images
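The refinement loop of Step 3 is shared by Algorithms 2-6. A minimal Matlab sketch, reusing the classify_by_energy helper from Section 2 (again, function and variable names are ours and the iteration count is a placeholder):

    % Ds: 1 x N cell array of current dictionaries; X: n x m training signals.
    function Ds = refine_dictionaries(Ds, X, lambda, max_iter_refinement, k2)
        param.K = k2;                % atoms per refined dictionary
        param.lambda = lambda / 2;
        param.mode = 2;
        param.iter = 100;            % placeholder
        for it = 1:max_iter_refinement
            labels = classify_by_energy(X, Ds, lambda);   % relabel training signals by (1)
            for i = 1:numel(Ds)
                Xi = X(:, labels == i);
                if ~isempty(Xi)
                    Ds{i} = mexTrainDL(Xi, param);        % retrain D_i on its current cluster
                end
            end
        end
    end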

4.3 Unsupervised Clustering (by Signals)

For unsupervised dictionary learning (by image signals), spectral clustering was used to cluster the MNIST training images. In the initialization step, $A = [\alpha_1,\dots,\alpha_m] \in \mathbb{R}^{K \times m}$, where $K$ denotes the number of atoms in $D$ and $m$ denotes the number of training images. If two signals belong to the same cluster, they are expected to have decompositions that use similar atoms. For unsupervised clustering by signals, the similarity matrix is defined as $S_1 := |A|^T |A| \in \mathbb{R}^{m \times m}$, where $|A|$ denotes the element-wise absolute value of $A$ and $|A|^T$ its transpose. Each cluster was used to train its respective dictionary. These dictionaries were then refined and used to classify the test images. Since the unsupervised algorithm does not guarantee that the dictionaries are correctly ordered upon creation, we associate each dictionary with the class (label) that produces the highest number of correctly labeled training images.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the initial dictionary), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set max iter refinement (the number of refinements of the initial set of dictionaries) and dictionary sizes refinement (the number of atoms $k_2$ in the refined dictionaries);
    Set digitvector (the digits to be classified);
    Set the decisions center data and l2 normalize data to true or false;
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
Step 2. Create an initial set of dictionaries:
    Train a dictionary $D \in \mathbb{R}^{n \times K}$ from all training images;
    Construct $A = [\alpha_1,\dots,\alpha_m]$, where $\alpha_j$ is the minimum-energy sparse representation of $x_j$;
    Construct the similarity matrix to be used in spectral clustering, $S_1 := |A|^T |A|$;
    Perform spectral clustering on $G_1 := \{X, S_1\}$ to assign each signal to one of the $N$ classes;
    For each $i = 0,\dots,N-1$, train $D_i \in \mathbb{R}^{n \times k_2}$ using the training images assigned to class $i$, giving our initial set of dictionaries;
Step 3. Refine the initial set of dictionaries:
    for iteration = 1 to max iter refinement do: classify the m training images using the current dictionaries by minimizing Equation (1) (for $j = 1,\dots,m$, compute $E_i(x_j)$ for $i = 0,\dots,N-1$ and classify $x_j$ to class $i^* = \arg\min_{i} E_i(x_j)$); then, for each $i = 0,\dots,N-1$, retrain $D_i \in \mathbb{R}^{n \times k_2}$ on the training images classified to class $i$;
Step 4. Reorder the dictionaries; classify the test images; compute the misclassification rate; produce the error histograms.

Algorithm 3: Unsupervised Dictionary Learning by Signal Clusterization on MNIST images
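A Matlab sketch of Step 2, assuming SPAMS (mexTrainDL, mexLasso) and the Statistics Toolbox (kmeans); spectral_cluster is our own plain normalized-cuts helper, written for illustration only, and is not part of SPAMS or of the released code:

    % One dictionary for all training signals, sparse codes |A|, and the graph G1.
    param.K = K; param.lambda = lambda/2; param.mode = 2; param.iter = 100;  % placeholders
    D0 = mexTrainDL(X, param);
    A  = abs(full(mexLasso(X, D0, param)));          % |A|, K x m
    S1 = A' * A;                                     % m x m signal-similarity matrix
    signal_class = spectral_cluster(S1, N);          % cluster the signals into N classes

    function idx = spectral_cluster(S, N)
        % Normalized spectral clustering of a similarity matrix S.
        % (For a large S, replace eig by eigs on a sparse Laplacian.)
        d = sum(S, 2);
        L = eye(size(S,1)) - diag(1./sqrt(d)) * S * diag(1./sqrt(d));  % normalized Laplacian
        [V, E] = eig((L + L')/2);                    % symmetrize against round-off
        [~, order] = sort(diag(E), 'ascend');
        U = V(:, order(1:N));                        % N smallest eigenvectors
        U = U ./ sqrt(sum(U.^2, 2) + eps);           % row-normalize
        idx = kmeans(U, N);                          % one cluster label per row of S
    end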

4.4 Unsupervised Clustering (by Atoms)

For unsupervised dictionary learning (by atoms, i.e., dictionary columns), spectral clustering was used to cluster the dictionary atoms. In the initialization step, $A = [\alpha_1,\dots,\alpha_m] \in \mathbb{R}^{K \times m}$, where $K$ denotes the number of atoms in $D$ and $m$ denotes the number of training images. If two signals belong to the same cluster, they are expected to have decompositions that use similar atoms. For unsupervised clustering by atoms, the similarity matrix is defined as $S_2 := |A|\,|A|^T \in \mathbb{R}^{K \times K}$, where $|A|$ denotes the element-wise absolute value of $A$ and $|A|^T$ its transpose. Each cluster was used to train its respective dictionary. These dictionaries were then refined and used to classify the test images. Since the unsupervised algorithm does not guarantee that the dictionaries are correctly ordered upon creation, we associate each dictionary with the class (label) that produces the highest number of correctly labeled training images.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the initial dictionary), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set max iter refinement (the number of refinements of the initial set of dictionaries);
    Set digitvector (the digits to be classified);
    Set the decisions center data and l2 normalize data to true or false;
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
Step 2. Create an initial set of dictionaries:
    Train a dictionary $D \in \mathbb{R}^{n \times K}$ from all training images;
    Construct $A = [\alpha_1,\dots,\alpha_m]$, where $\alpha_j$ is the minimum-energy sparse representation of $x_j$;
    Construct the similarity matrix to be used in spectral clustering, $S_2 := |A|\,|A|^T \in \mathbb{R}^{K \times K}$;
    Perform spectral clustering on $G_2 := \{D, S_2\}$ to extract $N$ classes of atoms;
    Collect the atoms of class $i$, for $i = 0,\dots,N-1$, into $D_i \in \mathbb{R}^{n \times k_i}$ to form the initial set of dictionaries;
Step 3. Refine the initial set of dictionaries:
    for iteration = 1 to max iter refinement do: classify the m training images using the current dictionaries by minimizing Equation (1) (for $j = 1,\dots,m$, compute $E_i(x_j)$ for $i = 0,\dots,N-1$ and classify $x_j$ to class $i^* = \arg\min_{i} E_i(x_j)$); then, for each $i = 0,\dots,N-1$, retrain $D_i \in \mathbb{R}^{n \times k_2}$ on the training images classified to class $i$;
Step 4. Reorder the dictionaries; classify the test images; compute the misclassification rate; produce the error histograms.

Algorithm 4: Unsupervised Dictionary Learning by Atom Clusterization on MNIST images
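The only change with respect to the previous sketch is that the graph is built over atoms rather than signals, and the initial sub-dictionaries are obtained by slicing the columns of the global dictionary. A short Matlab sketch, reusing D0, A, and the illustrative spectral_cluster helper from the Section 4.3 sketch:

    % Cluster the atoms of D0 via S2 = |A|*|A|' (K x K) and split D0 into
    % initial sub-dictionaries, one per class of atoms.
    S2 = A * A';                           % A already holds |alpha| codes, so S2 is K x K
    atom_class = spectral_cluster(S2, N);  % one class label per atom (column of D0)
    Ds = cell(1, N);
    for i = 1:N
        Ds{i} = D0(:, atom_class == i);    % D_i collects the atoms assigned to class i
    end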

4.5 Unsupervised Clustering (by Atoms/Split Initialization)

Here we detail an alternative initialization proposed by [16], in which we start with a single, unpartitioned dictionary and then iteratively split one of the current partitions into two new ones. At each stage we tentatively split each current partition and keep the split that causes the largest decrease in energy, giving us one more partition than we had before. This process is continued until the desired number of clusters is reached. The algorithm is detailed below.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the initial dictionary), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set max iter refinement (the number of refinements of the initial set of dictionaries);
    Set digitvector (the digits to be classified);
    Set the decisions center data and l2 normalize data to true or false;
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
Step 2. Cluster the atoms by repeated splitting:
    Train a dictionary $D \in \mathbb{R}^{n \times K}$ from all training images;
    for i = 1 to (total number of digit classes) - 1 do
        for p = 1 to i do
            construct $A = [\alpha_1,\dots,\alpha_m]$, where $\alpha_j$ is the minimum-energy sparse representation of $x_j$ in $D_p$;
            form the similarity matrix $S_2 = |A|\,|A|^T$;
            apply spectral clustering on $G_2 := \{D_p, S_2\}$ to split the atoms of $D_p$ into two dictionaries $D_{i+1}$ and $D_{i+2}$;
            compute $E_p$ (the total energy for split $p$);
        take $p^* = \arg\min_{p \in \{1,\dots,i\}} E_p$;
        set $D_1,\dots,D_{p^*-1}, D_{p^*+1},\dots,D_i, D_{i+1}, D_{i+2}$ (where $D_{i+1}, D_{i+2}$ come from splitting $D_{p^*}$) as the new $D_1,\dots,D_{i+1}$;
Step 3. Refine the set of dictionaries:
    for iteration = 1 to max iter refinement do: classify the m training images using the current dictionaries by minimizing Equation (1) (for $j = 1,\dots,m$, compute $E_i(x_j)$ for $i = 0,\dots,N-1$ and classify $x_j$ to class $i^* = \arg\min_{i} E_i(x_j)$); then, for each $i = 0,\dots,N-1$, retrain $D_i \in \mathbb{R}^{n \times k_2}$ on the training images classified to class $i$;
Step 4. Reorder the dictionaries; classify the test images; compute the misclassification rate; produce the error histograms.

Algorithm 5: Unsupervised Dictionary Learning by Atoms/Split Initialization on MNIST images
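A compact Matlab sketch of the greedy split loop of Step 2, under our reading of the algorithm; it reuses the param struct and the illustrative spectral_cluster helper from the Section 4.3 sketch, and split_energy is our own helper that scores a candidate partition with the total energy (1), which may differ in detail from the released code:

    % Ds starts as {D0}: a single dictionary trained on all training signals.
    Ds = {D0};
    for i = 1:N-1                          % grow from 1 to N partitions
        bestE = inf;
        for p = 1:i                        % tentatively split each current partition
            Ap = abs(full(mexLasso(X, Ds{p}, param)));   % codes of all signals in D_p
            S2 = Ap * Ap';                               % atom similarity for D_p
            c  = spectral_cluster(S2, 2);                % split the atoms of D_p in two
            candidate = [Ds(1:p-1), Ds(p+1:i), {Ds{p}(:, c==1)}, {Ds{p}(:, c==2)}];
            Ep = split_energy(X, candidate, lambda);     % total energy after this split
            if Ep < bestE, bestE = Ep; bestDs = candidate; end
        end
        Ds = bestDs;                       % keep the split with the largest energy decrease
    end

    function E = split_energy(X, Ds, lambda)
        % Sum over signals of the minimum energy (1) over the candidate dictionaries.
        param.lambda = lambda/2; param.mode = 2;
        Evals = zeros(numel(Ds), size(X,2));
        for q = 1:numel(Ds)
            Aq = mexLasso(X, Ds{q}, param);
            Evals(q,:) = sum((X - Ds{q}*Aq).^2, 1) + lambda*sum(abs(Aq), 1);
        end
        E = sum(min(Evals, [], 1));
    end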

4.6 Unsupervised Clustering (by Signals), kmeans

We have found that the performance reported for the unsupervised algorithm of [16] can be improved by simply doing away with the initialization given by spectral clustering and instead performing simple k-means clustering.

Input: labeled MNIST training images $x_j \in \mathbb{R}^n$, $j = 1,\dots,m$; labeled MNIST test images $y_h \in \mathbb{R}^n$, $h = 1,\dots,l$; $N$, the number of classes.
Output: dictionaries $\{D_0,\dots,D_{N-1}\}$; misclassification rate; error histogram.

Step 1. Set parameters and extract data:
    Set the SPAMS parameters (see the SPAMS documentation): param.K (number of atoms in the initial dictionary), param.lambda = $\lambda/2$, param.iter = 1, param.mode = 2, param.lambda2 = 0;
    Set max iter refinement (the number of refinements of the initial set of dictionaries);
    Set digitvector (the digits to be classified);
    Load the images from the MNIST dataset and use the MNIST labels to extract the data corresponding to the indicated digitvector;
Step 2. Create an initial set of dictionaries:
    Perform k-means clustering on the set of training images to obtain an initial set of $N$ clusters of images;
    if center data == true then subtract from each $x_j$ and $y_h$ its respective mean, so that the mean of each signal is 0;
    if l2 normalize data == true then divide each $x_j$ and $y_h$ by its respective $\ell_2$ norm, so that $\|x_j\|_2 = 1$ and $\|y_h\|_2 = 1$;
    For each class $i = 0,\dots,N-1$, train the dictionary $D_i$ on the images in that class, giving the initial dictionaries $D_0,\dots,D_{N-1}$;
Step 3. Refine the initial set of dictionaries:
    for iteration = 1 to max iter refinement do: classify the m training images using the current dictionaries by minimizing Equation (1) (for $j = 1,\dots,m$, compute $E_i(x_j)$ for $i = 0,\dots,N-1$ and classify $x_j$ to class $i^* = \arg\min_{i} E_i(x_j)$); then, for each $i = 0,\dots,N-1$, retrain $D_i \in \mathbb{R}^{n \times k_2}$ on the training images classified to class $i$;
Step 4. Reorder the dictionaries; classify the test images; compute the misclassification rate; produce the error histograms.

Algorithm 6: Unsupervised Dictionary Learning by Signal Clusterization (k-means initialization) on MNIST images
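A minimal Matlab sketch of the k-means initialization of Step 2, using the Statistics Toolbox kmeans and SPAMS mexTrainDL; variable names and the iteration count are our own placeholders:

    % Initialize the clusters directly with k-means on the raw training signals.
    cluster_idx = kmeans(X', N);          % kmeans expects observations as rows
    param.K = k2; param.lambda = lambda/2; param.mode = 2; param.iter = 100;  % placeholders
    Ds = cell(1, N);
    for i = 1:N
        Ds{i} = mexTrainDL(X(:, cluster_idx == i), param);   % initial D_i per cluster
    end
    % The dictionaries are then refined exactly as in Step 3 of Algorithms 3-5.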

5 Experiments and Results

We performed experiments on the MNIST handwritten digits dataset using three types of algorithms: supervised dictionary learning, semisupervised dictionary learning, and unsupervised dictionary learning. Unsupervised dictionary learning was further split into two methods: spectral clustering (clustering signals, clustering atoms in one step, clustering atoms in multiple steps) and k-means. We ran the tests for the digit sets $\{0,1\}$, $\{2,3\}$, $\{0,\dots,4\}$, $\{0,\dots,5\}$, and $\{0,\dots,9\}$. For each algorithm (unless otherwise specified), we repeated the experiment with:
    param.K = 5, non-centered and non-normalized image signals;
    param.K = 8, non-centered and non-normalized image signals;
    param.K = 5, centered and normalized image signals;
    param.K = 8, centered and normalized image signals.

Following [16], for Algorithm 3 we set dictionary sizes refinement = 2 for the dictionaries once we are in the refinement step. The following parameters were used for all of our experiments: param.lambda = 0.05, param.lambda2 = 0, param.iter = 1 (unless otherwise specified), param.mode = 2. (Note: our param.lambda = 0.05 is equivalent to the $\lambda = 0.1$ of [16]; see the explanation in Section 6.)

5.1 Sprechmann Results

As a point of comparison, consider the following numbers from Sprechmann's [16] and Ramirez's [15] results:

Table 1: Sprechmann's [16] and Ramirez's [15] results for Supervised Dictionary Learning (Algorithm 1) and Unsupervised Dictionary Learning with signal clustering (Algorithm 3). We were unable to replicate their results with unsupervised dictionary learning.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Supervised | Unknown | {0,...,9} | 8 | 1.26%
Unsupervised - Signals | Unknown | {0,...,4} | 5 | 1.44%
Unsupervised - Signals | Unknown | {0,...,5} | 3 | 6.9%

Explanation of the table numbers. After explaining the algorithm for supervised dictionary learning, the algorithm is applied as follows: Table 1 of [16] reports a misclassification rate of 1.26% and refers to Section 3 of the same document, which reads "MNIST... the usual training/testing split. In Table 1 we present the obtained results... used a penalty parameter λ = 0.1... for a dictionary with k = 8."

After proposing the algorithm for unsupervised dictionary learning via signal clustering, the algorithm is applied as follows: Section 5 of [16] reads "We clustered the digits [from] 0 to 4 (K = 5) using the testing set of MNIST... used an initial dictionary of k = 5 atoms... We used G_1 for initialization... using during the iterations dictionaries of 2... had a misclassification rate of 1.44%."

Table 2 of [15] reports a misclassification rate of 6.9% and refers to Section 4.3 of the same document, which reads "We first clustered the digits [from] 0 to 5 (K = 6) from the testing set of MNIST... The size of the initial dictionaries are... k = 3 for MNIST... The initial clustering of the data was done using spectral clustering on the graph G_1."
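For reference, the corresponding SPAMS parameter struct in Matlab would look roughly as follows; the halving of lambda comes from SPAMS mode 2 minimizing $(1/2)\|x - D\alpha\|_2^2 + \lambda\|\alpha\|_1$ while Equation (1) uses the unscaled squared error, and values marked as placeholders are our assumptions rather than the exact settings of the released code:

    % SPAMS settings shared by the experiments (see Section 6 for the lambda scaling).
    param.K       = 5;      % or 8: number of atoms per dictionary
    param.lambda  = 0.05;   % equivalent to lambda = 0.1 in the energy (1), i.e. 0.1 / 2
    param.lambda2 = 0;
    param.iter    = 100;    % placeholder (assumption): dictionary-learning iterations
    param.mode    = 2;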

5.2 Supervised

5.2.1 Non-centered and Non-normalized

Table 2: Results from Supervised Dictionary Learning (Algorithm 1) with K = 5, non-centered and non-normalized image signals. 3.33% is far from Sprechmann's supervised result of 1.26% with K = 8 for digits {0,...,9} (see Table 1). The gap is slightly smaller when we run our experiments with K = 8 (see Table 3), and significantly smaller when we run the experiments with centered and normalized data (see Tables 4 and 5). It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification rate.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Supervised | False | {0, 1} | 5 | 0.19%
Supervised | False | {2, 3} | 5 | 0.54%
Supervised | False | {0,...,4} | 5 | 0.95%
Supervised | False | {0,...,5} | 5 | 1.34%
Supervised | False | {0,...,9} | 5 | 3.33%

Figure 1 (per-class error histograms, one panel per digit set): Misclassification rate of results from Supervised Dictionary Learning (Algorithm 1) with K = 5, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 is misclassified significantly more often than the other digits, roughly a 75% increase over the next highest.
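The misclassification rates and per-class error plots reported in this and the following subsections can be computed along these lines; this is our own Matlab sketch, and the reading of the "repartition of the errors" plots as the fraction of all errors contributed by each digit is one plausible interpretation, not taken from the released code:

    % labels_true, labels_pred: 1 x l vectors of digit labels for the test images.
    misclassified = (labels_pred ~= labels_true);
    misclassification_rate = 100 * mean(misclassified);      % in percent
    digits = unique(labels_true);
    errors_per_class = zeros(size(digits));
    for c = 1:numel(digits)
        % fraction of all errors contributed by test images of digit digits(c)
        errors_per_class(c) = sum(misclassified & labels_true == digits(c)) ...
                              / max(sum(misclassified), 1);
    end
    bar(digits, errors_per_class);    % "repartition of the errors" per class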

Table 3: Results from Supervised Dictionary Learning (Algorithm 1) with K = 8, non-centered and non-normalized image signals. 3.11% is far from Sprechmann's result of 1.26% with K = 8 (see Table 1). The gap is significantly smaller when we run the experiments with centered and normalized data (see Tables 4 and 5). It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification rate.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Supervised | False | {0, 1} | 8 | 0.24%
Supervised | False | {2, 3} | 8 | 0.39%
Supervised | False | {0,...,4} | 8 | 0.78%
Supervised | False | {0,...,5} | 8 | 1.19%
Supervised | False | {0,...,9} | 8 | 3.11%

Figure 2 (per-class error histograms, one panel per digit set): Misclassification rate of results from Supervised Dictionary Learning (Algorithm 1) with K = 8, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 is misclassified significantly more often than the other digits, roughly twice as often as the next highest.

5.2.2 Centered and Normalized

Table 4: Results from Supervised Dictionary Learning (Algorithm 1) with K = 5, centered and normalized image signals. 1.96% is close to Sprechmann's supervised result of 1.26% with K = 8 for digits {0,...,9} (see Table 1). It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification rate.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Supervised | True | {0, 1} | 5 | 0.0%
Supervised | True | {2, 3} | 5 | 0.29%
Supervised | True | {0,...,4} | 5 | 0.41%
Supervised | True | {0,...,5} | 5 | 0.63%
Supervised | True | {0,...,9} | 5 | 1.96%

Figure 3 (per-class error histograms, one panel per digit set): Misclassification rate of results from Supervised Dictionary Learning (Algorithm 1) with K = 5, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 is misclassified significantly more often than the other digits, at least twice as often. When centering and normalizing the image signals, the comparative rate of misclassification for 0 increases, almost doubling (see Figure 1).

Table 5: Results from Supervised Dictionary Learning (Algorithm 1) with K = 8, centered and normalized image signals. 1.89% is close to Sprechmann's supervised result of 1.26% with K = 8 for digits {0,...,9} (see Table 1). It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification; in fact, there is no misclassification at all in {0,1}.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Supervised | True | {0, 1} | 8 | 0.0%
Supervised | True | {2, 3} | 8 | 0.24%
Supervised | True | {0,...,4} | 8 | 0.41%
Supervised | True | {0,...,5} | 8 | 0.6%
Supervised | True | {0,...,9} | 8 | 1.89%

Figure 4 (per-class error histograms, one panel per digit set): Misclassification rate of results from Supervised Dictionary Learning (Algorithm 1) with K = 8, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 is misclassified significantly more often than the other digits, at least twice as often. When centering and normalizing the image signals, the comparative rate of misclassification for 0 increases, almost doubling (see Figure 2).

5.3 Semisupervised

Table 6: Misclassification of the dictionary set in the final iteration for digits 0-4, perturbation percentages 2 to 9 (increments of 1). All misclassification rates are low, ranging between 0.85% and 1.5%. Misclassification rates mostly increase as the perturbation percent increases, though there are times when it decreases sharply (at 3% and 7%), likely due to the random nature of the perturbation.

Cluster Type | Centered & Normalized | Digits | K | Perturbation % | Misclassification
Semisupervised | True | {0,...,4} | 2 | 2 | 0.9535%
Semisupervised | True | {0,...,4} | 2 | 3 | 0.8757%
Semisupervised | True | {0,...,4} | 2 | 4 | 0.973%
Semisupervised | True | {0,...,4} | 2 | 6 | 1.2259%
Semisupervised | True | {0,...,4} | 2 | 5 | 1.72%
Semisupervised | True | {0,...,4} | 2 | 7 | 0.8951%
Semisupervised | True | {0,...,4} | 2 | 8 | 1.72%
Semisupervised | True | {0,...,4} | 2 | 9 | 1.425%

Figure 5 (rows: perturbations of 4%, 7%, and 9%; columns: energy, average misclassification rate, and changes in training-signal classification at each iteration): Semisupervised results for digits 0-4. Iteration 0 is the initial dictionary set; each iteration i (i > 0) is the ith refinement. Change i is the change between dictionary sets i-1 and i (dictionary set i is the ith refinement, dictionary set 0 is the initial set). Left to right: energy, misclassification, changes in training image classification. Note that as the perturbation increases, the payoff from the refinement process increases.

Table 7: Misclassification of the dictionary set in the final iteration for digits 0-9, perturbation percentages 2 to 9 (increments of 1). The misclassification rates fall within a consistent range of 4%-6%, mostly increasing as the perturbation percent increases, though there was a decrease in misclassification for a perturbation of 7%.

Cluster Type | Centered & Normalized | Digits | K | Perturbation % | Misclassification
Semisupervised | True | {0,...,9} | 2 | 2 | 4.47%
Semisupervised | True | {0,...,9} | 2 | 3 | 4.67%
Semisupervised | True | {0,...,9} | 2 | 4 | 4.62%
Semisupervised | True | {0,...,9} | 2 | 5 | 4.81%
Semisupervised | True | {0,...,9} | 2 | 6 | 5.28%
Semisupervised | True | {0,...,9} | 2 | 7 | 4.74%
Semisupervised | True | {0,...,9} | 2 | 8 | 5.16%
Semisupervised | True | {0,...,9} | 2 | 9 | 6.9%

Figure 6 (rows: perturbations of 4%, 7%, and 9%; columns: energy, average misclassification rate, and changes in training-signal classification at each iteration): Semisupervised results for digits 0-9. Iteration 0 is the initial dictionary set; each iteration i (i > 0) is the ith refinement. Change i is the change between dictionary sets i-1 and i (dictionary set i is the ith refinement, dictionary set 0 is the initial set). Left to right: energy, misclassification, changes in training image classification. Misclassification corresponds better to changes in training image classifications than energy does. Note that as the perturbation increases, the payoff from the refinement process increases.

5.4 Unsupervised - Signals

5.4.1 Non-centered and Non-normalized

Table 8: Results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 5, non-centered and non-normalized image signals. 31.4% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,...,4} (see Table 1). There is a high jump in the misclassification rate when the number of digits classified increases from two to five. It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification rate. Interestingly, it is also more difficult to distinguish {0,...,5} than {0,...,9}, though the latter has more classes.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Unsupervised - Signals | False | {0, 1} | 5 | 3.36%
Unsupervised - Signals | False | {2, 3} | 5 | 8.37%
Unsupervised - Signals | False | {0,...,4} | 5 | 31.4%
Unsupervised - Signals | False | {0,...,5} | 5 | 49.8%
Unsupervised - Signals | False | {0,...,9} | 5 | 42.45%

Figure 7 (per-class error histograms, one panel per digit set): Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 5, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, over three times as often. This is also seen in clustering atoms (see Figure 12).

Table 9: Results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 8, non-centered and non-normalized image signals. 38.24% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,...,4} (see Table 1). The misclassification rate for 2 and 3 is significantly larger than the one for 0 and 1. Interestingly, it is also more difficult to distinguish {0,...,5} than {0,...,9}, though the latter has more classes.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Unsupervised - Signals | False | {0, 1} | 8 | 2.13%
Unsupervised - Signals | False | {2, 3} | 8 | 13.3%
Unsupervised - Signals | False | {0,...,4} | 8 | 38.24%
Unsupervised - Signals | False | {0,...,5} | 8 | 67.7%
Unsupervised - Signals | False | {0,...,9} | 8 | 45.93%

Figure 8 (per-class error histograms, one panel per digit set): Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 8, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 3 is misclassified significantly more often than the other digits. When classifying digits 0 to 5, 0 is misclassified significantly more often than the other digits, over three times as often. When classifying digits 0 to 9, 3 and 8 are misclassified significantly more often than the other digits, around twice as often.

5.4.2 Centered and Normalized

Table 10: Results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 5, centered and normalized image signals. 27.36% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,...,4} (see Table 1). The misclassification rate for 2 and 3 is significantly larger than the one for 0 and 1. There is a high jump in the misclassification rate when the number of digits classified increases from two to five; the misclassification rate also jumps by around 15% when the digits being classified increase from {0,...,4} to {0,...,5}. Interestingly, it is also more difficult to distinguish {0,...,5} than {0,...,9}, though the latter has more classes.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Unsupervised - Signals | True | {0, 1} | 5 | 0.23%
Unsupervised - Signals | True | {2, 3} | 5 | 3.87%
Unsupervised - Signals | True | {0,...,4} | 5 | 27.36%
Unsupervised - Signals | True | {0,...,5} | 5 | 43.41%
Unsupervised - Signals | True | {0,...,9} | 5 | 38.14%

Figure 9 (per-class error histograms, one panel per digit set): Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 5, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, more than three times as often. When classifying digits 0 to 5, 3 is misclassified significantly more often than the other digits, over twice as often. When classifying digits 0 to 9, 8 is misclassified significantly more often than the other digits, over twice as often.

Table 11: Results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 8, centered and normalized image signals. 27.4% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,...,4} (see Table 1). There is a high jump in the misclassification rate when the number of digits classified increases from two to five. It can also be seen that {2,3} is more difficult to distinguish than {0,1}, from the former's higher misclassification rate. Interestingly, it is also more difficult to distinguish {0,...,5} than {0,...,9}, though the latter has more classes.

Cluster Type | Centered & Normalized | Digits | K | Misclassification
Unsupervised - Signals | True | {0, 1} | 8 | 0.28%
Unsupervised - Signals | True | {2, 3} | 8 | 3.53%
Unsupervised - Signals | True | {0,...,4} | 8 | 27.4%
Unsupervised - Signals | True | {0,...,5} | 8 | 36.28%
Unsupervised - Signals | True | {0,...,9} | 8 | 34.87%

Figure 10 (per-class error histograms, one panel per digit set): Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering (Algorithm 3) with K = 8, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, over three times as often. This is also seen in clustering signals with K = 5 (see Figure 9). When classifying digits 0 to 9, 3 is misclassified significantly more often than the other digits, over twice as often.

5.4.3 Changes over Refinements

Figure 11 (all panels for digits 0-4; columns: energy, average misclassification rate, and changes in training-signal classification at each refinement iteration): Results from multiple refinement iterations for Unsupervised Dictionary Learning with signal clustering (Algorithm 3). Top to bottom: K = 5, non-centered and non-normalized image signals; K = 8, non-centered and non-normalized image signals; K = 5, centered and normalized image signals; K = 8, centered and normalized image signals. Left to right: energy, misclassification, changes in training image classification. Energy decreases gradually, roughly exponentially, with refinements. Misclassification increases slightly with refinements in all cases except K = 8 with centered and normalized image signals. Changes in the training images are small (one or no changes per refinement).

5.5 Unsupervised - Atoms

5.5.1 Non-centered and Non-normalized

Table 12: Results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 5, non-centered and non-normalized image signals. 32.63% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,..., 4} (see Table 1). There is a large jump in misclassification rate when the number of digits classified increases from two to five.

Cluster Type           Centered & Normalized   Digits        K   Misclassification
Unsupervised - Atoms   False                   {0, 1}        5   3.92%
Unsupervised - Atoms   False                   {2, 3}        5   3.72%
Unsupervised - Atoms   False                   {0,..., 4}    5   32.63%
Unsupervised - Atoms   False                   {0,..., 5}    5   36.26%
Unsupervised - Atoms   False                   {0,..., 9}    5   48.51%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 12.]

Figure 12: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 5, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, up to four times more often. This is also seen in clustering signals (see Figure 7). When classifying digits 0 to 5, 4 is misclassified significantly more often than the other digits, approximately an 85% increase over the next highest (compare to the centered and normalized signals in Table 14, where 2 is misclassified significantly more often). When classifying digits 0 to 9, 6 is misclassified significantly less often than the other digits, less than half as often as the next lowest.

Table 13: Results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 8, non-centered and non-normalized image signals. 29.5% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,..., 4} (see Table 1). There is a large jump in misclassification rate when the number of digits classified increases from two to five.

Cluster Type           Centered & Normalized   Digits        K   Misclassification
Unsupervised - Atoms   False                   {0, 1}        8   3.31%
Unsupervised - Atoms   False                   {2, 3}        8   3.62%
Unsupervised - Atoms   False                   {0,..., 4}    8   29.5%
Unsupervised - Atoms   False                   {0,..., 5}    8   33.43%
Unsupervised - Atoms   False                   {0,..., 9}    8   55.37%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 13.]

Figure 13: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 8, non-centered and non-normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, over twice as often. When classifying digits 0 to 5, 4 is misclassified significantly more often than the other digits, over twice as often. This is also seen in clustering signals (see Figure 8). When classifying digits 0 to 9, 0 is misclassified significantly more often than the other digits, over three times as often. Meanwhile, 3-5 are misclassified significantly less often than with K = 5 (see Table 12).

5.5.2 Centered and Normalized

Table 14: Results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 5, centered and normalized image signals. 24.89% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,..., 4} (see Table 1). There is a large jump in misclassification rate when the number of digits classified increases from two to five.

Cluster Type           Centered & Normalized   Digits        K   Misclassification
Unsupervised - Atoms   True                    {0, 1}        5   0.5%
Unsupervised - Atoms   True                    {2, 3}        5   1.8%
Unsupervised - Atoms   True                    {0,..., 4}    5   25.8%
Unsupervised - Atoms   True                    {0,..., 5}    5   34.64%
Unsupervised - Atoms   True                    {0,..., 9}    5   33.87%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 14.]

Figure 14: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 5, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, over four times as often. This is also seen in clustering signals (see Figure 9). When classifying digits 0 to 5, 2 is misclassified significantly more often than the other digits, over twice as often (compare to the non-centered and non-normalized signals in Table 12, where 4 is misclassified significantly more often).

Table 15: Results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 8, centered and normalized image signals. 27.18% is significantly different from Sprechmann's unsupervised result of 1.44% with K = 5 for digits {0,..., 4} (see Table 1). There is a large jump in misclassification rate when the number of digits classified increases from two to five.

Cluster Type           Centered & Normalized   Digits        K   Misclassification
Unsupervised - Atoms   True                    {0, 1}        8   0.9%
Unsupervised - Atoms   True                    {2, 3}        8   1.8%
Unsupervised - Atoms   True                    {0,..., 4}    8   25.47%
Unsupervised - Atoms   True                    {0,..., 5}    8   36.58%
Unsupervised - Atoms   True                    {0,..., 9}    8   32.9%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 15.]

Figure 15: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering (Algorithm 4) with K = 8, centered and normalized image signals, w.r.t. classes. When classifying digits 0 to 4, 2 and 3 are misclassified significantly more often than the other digits, over four times as often. This is also seen in clustering signals (see Figure 10). When classifying digits 0 to 5, 2 is misclassified significantly more often than the other digits, over twice as often (compare to the non-centered and non-normalized signals in Table 12, where 4 is misclassified significantly more often).

5.5.3 Changes over Refinements

Figure 16: Results from multiple refinement iterations for Unsupervised Dictionary Learning with atom clustering (Algorithm 4). Top to bottom: K = 5, non-centered and non-normalized image signals; K = 8, non-centered and non-normalized image signals; K = 5, centered and normalized image signals; K = 8, centered and normalized image signals. Left to right: energy, misclassification, changes in training image classification. Energy decreases gradually, roughly exponentially, with refinements. Misclassification increases with refinements in all cases except K = 8, non-centered and non-normalized image signals. Changes in training images decrease sporadically with refinements.

[Figure: 4 x 3 grid of plots for digits 0-4, NCNN and CN settings. Columns: "Energy at Each Iteration", "Average misclassification Rate at Each Iteration", and "Changes in Training Signals at Each Iteration" (x-axis: "Change to refinement #", where change 1 = change to first refinement).]

5.6 Unsupervised - Atoms, split in 2 each iteration

5.6.1 Non-centered and Non-normalized

Table 16: Results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 5, non-centered and non-normalized image signals. Performs worse than the other unsupervised algorithms (Tables 8 and 12).

Cluster Type                                  Centered & Normalized   Digits       K   Misclassification
Unsupervised - Atoms (split initialization)   False                   {0, 1}       5   4.11%
Unsupervised - Atoms (split initialization)   False                   {2, 3}       5   4.26%
Unsupervised - Atoms (split initialization)   False                   {0,..., 4}   5   63.46%
Unsupervised - Atoms (split initialization)   False                   {0,..., 5}   5   70.87%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 16.]

Figure 17: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 5, non-centered and non-normalized image signals, w.r.t. classes.

Table 17: Results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 8, non-centered and non-normalized image signals. Performs worse than the other unsupervised algorithms (Tables 9 and 13).

Cluster Type                                  Centered & Normalized   Digits       K   Misclassification
Unsupervised - Atoms (split initialization)   False                   {0, 1}       8   3.17%
Unsupervised - Atoms (split initialization)   False                   {2, 3}       8   4.5%
Unsupervised - Atoms (split initialization)   False                   {0,..., 4}   8   65.83%
Unsupervised - Atoms (split initialization)   False                   {0,..., 5}   8   71.96%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 17.]

Figure 18: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 8, non-centered and non-normalized image signals, w.r.t. classes.

5.6.2 Centered and Normalized

Table 18: Results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 5, centered and normalized image signals. 62.6776% is significantly different from Sprechmann's result of 1.44% with K = 5 (see Table 1). Performs worse than the other unsupervised algorithms (Tables 10 and 14).

Cluster Type                                  Centered & Normalized   Digits       K   Misclassification
Unsupervised - Atoms (split initialization)   True                    {0, 1}       5   0.19%
Unsupervised - Atoms (split initialization)   True                    {2, 3}       5   1.57%
Unsupervised - Atoms (split initialization)   True                    {0,..., 4}   5   62.68%
Unsupervised - Atoms (split initialization)   True                    {0,..., 5}   5   68.81%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 18.]

Figure 19: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 5, centered and normalized image signals, w.r.t. classes. 0 is the most misclassified digit, misclassified over three times as often as the other digits. 1 is the least misclassified, at least a quarter less often.

Table 19: Results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 8, centered and normalized image signals. 62.378% is significantly different from Sprechmann's result of 1.44% with K = 5 (see Table 1). Performs worse than the other unsupervised algorithms (Tables 11 and 15).

Cluster Type                                  Centered & Normalized   Digits       K   Misclassification
Unsupervised - Atoms (split initialization)   True                    {0, 1}       8   0.19%
Unsupervised - Atoms (split initialization)   True                    {2, 3}       8   1.86%
Unsupervised - Atoms (split initialization)   True                    {0,..., 4}   8   62.31%
Unsupervised - Atoms (split initialization)   True                    {0,..., 5}   8   67.83%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 19.]

Figure 20: Misclassification rate of results from Unsupervised Dictionary Learning with atom clustering, one dictionary split each step (Algorithm 5), with K = 8, centered and normalized image signals, w.r.t. classes. 0 is the most misclassified digit, misclassified over three times as often as the other digits. 1 is the least misclassified, at least a quarter less often.

5.7 Unsupervised (kmeans) - Signals

5.7.1 Non-centered and Non-normalized

Table 20: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6), with K = 5 and param.iter = 2, non-centered and non-normalized image signals. 1.67% comes close to Sprechmann's unsupervised result of 1.44% with K = 5 (see Table 1).

Cluster Type                      Centered & Normalized   Digits       K   Iterations   Misclassification
Unsupervised - Signals (kmeans)   False                   {0,..., 4}   5   2            1.67%
Unsupervised - Signals (kmeans)   False                   {0,..., 5}   5   2            13.35%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 20.]

Figure 21: Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5, non-centered and non-normalized image signals, w.r.t. classes. Left: 2 is the most misclassified, approximately twice as often as the rest. Right: 3 and 5 are the most misclassified digits, approximately eight times more often than the rest. (Compare to the centered and normalized case, Figure 23.)

Table 21: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6), with K = 5 and param.iter = 2, non-centered and non-normalized image signals. 2.14% approaches Sprechmann's unsupervised result of 1.44% with K = 5 (see Table 1).

Cluster Type                      Centered & Normalized   Digits       K   Iterations   Misclassification
Unsupervised - Signals (kmeans)   False                   {0,..., 4}   5   2            2.14%
Unsupervised - Signals (kmeans)   False                   {0,..., 5}   5   2            14.71%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 21.]

Figure 22: Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5, centered and normalized image signals, w.r.t. classes. Left: 2 is the most misclassified, approximately 3% more than the second most. Right: 3 and 5 are misclassified often. Unlike the other cases of classifying {0,..., 5}, 1 is also misclassified often.

5.7.2 Centered and Normalized

Table 22: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6), with K = 5 and param.iter = 2, centered and normalized image signals. 1.13% beats Sprechmann's unsupervised result of 1.44% with K = 5 (see Table 1). There is a large increase from adding one class (going from classifying digits {0,..., 4} to {0,..., 5}).

Cluster Type            Centered & Normalized   Digits       K   Iterations   Misclassification
Unsupervised - kmeans   True                    {0,..., 4}   5   2            1.13%
Unsupervised - kmeans   True                    {0,..., 5}   5   2            9.4%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 22.]

Figure 23: Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5, centered and normalized image signals, w.r.t. classes. Left: 2 is the most misclassified, approximately twice as often as the rest. Right: 3 and 5 are the most misclassified digits, approximately eight times more often than the rest. (Compare to the non-centered and non-normalized case, Figure 21.)

[Figure: three plots for digits 0-4, CN: "Energy at Each Iteration", "Average misclassification Rate at Each Iteration", and "Changes in Training Signals at Each Iteration" (x-axis: "Change to refinement #", where change 1 = change to first refinement).]

Figure 24: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5 and param.iter = 2, centered and normalized image signals. Left to right: energy, misclassification, changes in training image classification. Iteration 0 is the initial dictionary set; each iteration i (i > 0) is the ith refinement. Change i is the change between dictionary sets i-1 and i (dictionary set i is the ith refinement, dictionary set 0 is the initial set). Misclassification corresponds better to changes in training image classifications than energy does, unlike in the spectral clustering cases.

Table 23: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6), with K = 5 and param.iter = 2, centered and normalized image signals. 0.54% beats Sprechmann's result of 1.44% with K = 5 (see Table 1).

Cluster Type            Centered & Normalized   Digits       K   Iterations   Misclassification
Unsupervised - kmeans   True                    {0,..., 4}   5   2            0.54%
Unsupervised - kmeans   True                    {0,..., 5}   5   2            8.94%

[Figure: bar plots of the repartition of the errors as a function of the class, one panel per digit set in Table 23.]

Figure 25: Misclassification rate of results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5, centered and normalized image signals, w.r.t. classes. Left: 2 is the most misclassified, approximately twice as often as the rest. Right: 3 and 5 are the most misclassified digits, approximately eight times more often than the rest.

[Figure: three plots for digits 0-4, CN: "Energy at Each Iteration", "Average misclassification Rate at Each Iteration", and "Changes in Training Signals at Each Iteration" (x-axis: "Change to refinement #", where change 1 = change to first refinement).]

Figure 26: Results from Unsupervised Dictionary Learning with signal clustering, clustering via kmeans (Algorithm 6) with K = 5 and param.iter = 2, centered and normalized image signals. Left to right: energy, misclassification, changes in training image classification. Iteration 0 is the initial dictionary set; each iteration i (i > 0) is the ith refinement. Change i is the change between dictionary sets i-1 and i (dictionary set i is the ith refinement, dictionary set 0 is the initial set). Misclassification corresponds better to changes in training image classifications than energy does, unlike in the spectral clustering cases.
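To make the kmeans variant concrete, the sketch below illustrates only the clustering-and-training step under stated assumptions: it uses MATLAB's kmeans (Statistics Toolbox) and SPAMS's mexTrainDL, the names X, param, and Dset are placeholders, and the refinement and classification steps of Algorithm 6 are omitted; the provided script in the Unsupervised Clustering kmeans folder contains the actual implementation.

    % Minimal sketch of the kmeans-based initialization (assumed variable names).
    % X: m x n matrix whose columns are centered and normalized training signals.
    C = 5;                                 % one cluster per digit class (here digits 0-4)
    idx = kmeans(X', C, 'MaxIter', 200);   % kmeans expects observations as rows
    Dset = cell(C, 1);
    for c = 1:C
        Dset{c} = mexTrainDL(X(:, idx == c), param);   % one dictionary per cluster
    end
    % Each training signal can then be re-assigned to the dictionary with the
    % lowest sparse-coding energy and the dictionaries re-trained (refinement).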

5.8 Gaussian Noise Experiments

5.8.1 Classifying Pure Gaussian Noise

For this experiment, we generate one thousand images of pure Gaussian noise and use dictionaries trained on the MNIST images to classify each image of noise as a digit 0-9. The purpose is to analyze how robust our algorithm is with respect to noisy images. The dictionaries used are trained with supervised dictionary learning, with 8 atoms per dictionary. We perform the experiment with and without centering and normalizing the training images, and we perform it for several different variances of the Gaussian noise: σ^2 = 0.1, 0.5, and 1.0. The histograms contain the classification rates for how often a signal of Gaussian noise was classified as each digit. The histograms for noise variances 0.1 and 1.0 are similar to the results displayed here for noise variance 0.5.

Table 24: Classification rate histograms for pure Gaussian noise (noise variance = 0.5). Left: dictionaries trained on centered and normalized data. Right: dictionaries trained on not centered and not normalized data.

5.8.2 Adding Noise to Test Images

For this experiment, we consider the case where there are clean training images but noisy test images. Gaussian noise is added to the MNIST test images, and then the supervised dictionaries with 8 atoms for digits 0-9 are used to classify the modified test images. The purpose, again, is to analyze the robustness of our algorithm with respect to noisy images. We perform the experiment with the Gaussian noise variance set to σ^2 = 0.1, 0.5, and 1.0, and we perform it with and without centering and normalizing the data. For the test images, we center and normalize after the noise has been added to the image. The histograms contain the misclassification rates for each digit. The histograms for noise variances 0.1 and 1.0 are similar to the results displayed here for noise variance 0.5.

5.8.3 Adding Noise to Training and Test Images

For this experiment, we consider the case where there is noise in both the training and test data. The same procedure as above is used again, with the modification that this time Gaussian noise is added to the MNIST training images before training the dictionaries. We perform this experiment first with the variance of the Gaussian noise set to be the same for both training and test images, and then with the training variance equal to twice the test variance.
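As a sketch of the procedure just described (not the exact code of the provided Gaussian Noise scripts), the MATLAB fragment below generates the pure-noise signals, applies the optional centering and normalization, and classifies each signal against per-digit dictionaries; the names D and param, and the exact classification rule shown, are assumptions.

    % Sketch of the pure-Gaussian-noise experiment (assumed variable names).
    m      = 28*28;                        % dimension of an MNIST image signal
    nNoise = 1000;                         % one thousand noise images
    sigma2 = 0.5;                          % noise variance (0.1, 0.5, or 1.0)
    X = sqrt(sigma2) * randn(m, nNoise);   % pure Gaussian noise signals

    % Optional preprocessing, applied after the noise is generated (or added):
    X = X - repmat(mean(X, 1), m, 1);              % center each signal
    X = X ./ repmat(sqrt(sum(X.^2, 1)), m, 1);     % unit l2 norm per signal

    % Classify each column by the per-digit dictionary D{c} (c = 1..10 for
    % digits 0-9) giving the smallest sparse-coding cost.
    R = zeros(10, nNoise);
    for c = 1:10
        Alpha  = mexLasso(X, D{c}, param);         % sparse codes w.r.t. D{c}
        R(c,:) = sum((X - D{c}*Alpha).^2, 1) ...
               + param.lambda * sum(abs(Alpha), 1);
    end
    [~, labels] = min(R, [], 1);                   % labels(i) = predicted digit + 1

The same skeleton covers the noisy-test-image experiment by adding sqrt(sigma2)*randn noise to the clean test signals instead of generating pure noise.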

Figure 27: Visualization of noisy MNIST images.

Table 25: Misclassification histograms for noise in test images only (test noise variance = 0.5). Left: centered and normalized, misclassification rate 8.21%. Right: not centered and not normalized, misclassification rate 87.5%.

Table 26: Misclassification rates for noise in test images only

Test noise variance             0.1      0.5      1.0
Centered & Normalized           2.31%    8.21%    21.98%
Not Centered & Not Normalized   57.5%    87.5%    88.77%

Table 27: Misclassification rates for training noise variance equal to test noise variance

Training and test noise variance   0.1      0.5      1.0
Centered & Normalized              3.2%     17.33%   38.92%
Not Centered & Not Normalized      11.8%    57.56%   75.87%

Table 28: Misclassification rates for training noise variance equal to 2 times test noise variance

Noise variances                 training: 0.1   training: 0.5   training: 1.0
                                test: 0.05      test: 0.25      test: 0.5
Centered & Normalized           2.8%            9.15%           20.63%
Not Centered & Not Normalized   5.66%           31.97%          59.8%

6 Code and Toolboxes

The following folders of MATLAB code have been provided:

MNIST: MNIST database and MNIST MATLAB helper functions
spams-matlab: Sparse Modeling Software (SPAMS) toolbox [11, 12]
Spectral Clustering: AFFECT MATLAB toolbox for clustering dynamic data [18]
Dictionary Learning:
    Supervised Clustering
    Semisupervised Clustering
    Unsupervised Clustering atoms
    Unsupervised Clustering signals
    Unsupervised Clustering atoms split initialization
    Unsupervised Clustering kmeans
Gaussian Noise: experiments adding noise to MNIST

The experiments can be replicated with the provided MATLAB code using the following steps:

I. Copy the contents of the MNIST, SPAMS, and Spectral Clustering folders into the desired Dictionary Learning subfolder.
II. Set up the SPAMS toolbox by following the directions in HOW TO INSTALL.txt.
III. Choose which experiment to perform, set the parameters, and run the corresponding MATLAB script. Parameters can be set in STEP 0 of each script.
    i. Supervised Dictionary Learning: run SCRIPT SupervisedDictionary (in Supervised Clustering folder)
    ii. Semisupervised Dictionary Learning: run SCRIPT SemisupervisedDictionary (in Semisupervised Clustering folder)
    iii. Unsupervised Dictionary Learning (clustering by atoms, spectral clustering): run SCRIPT UnsupervisedDictionary atoms (in Unsupervised Clustering atoms folder)
    iv. Unsupervised Dictionary Learning (clustering by signals, spectral clustering): run SCRIPT UnsupervisedDictionary signals (in Unsupervised Clustering signals folder)
    v. Unsupervised Dictionary Learning (clustering by atoms, split-by-two each iteration, spectral clustering): run SCRIPT UnsupervisedDictionary atoms split initialization (in Unsupervised Clustering atoms split initialization folder)
    vi. Unsupervised Dictionary Learning (clustering by signals, kmeans): run SCRIPT UnsupervisedDictionary kmeans (in Unsupervised Clustering kmeans folder)
    vii. Classifying pure Gaussian noise: run SCRIPT classifygaussiannoise (in Gaussian Noise folder)
    viii. Adding noise to MNIST training/test images and classifying: run SCRIPT classifymnistwithnoise (in Gaussian Noise folder)

NOTE: We mimicked Equation (1) when using SPAMS by setting param.mode = 2, which aims at solving

    \min_D \frac{1}{n} \sum_{i=1}^{n} \left( \frac{1}{2}\|x_i - D\alpha_i\|_2^2 + \lambda\|\alpha_i\|_1 + \lambda_2\|\alpha_i\|_2^2 \right),    (3)

setting lambda_2 = 0 and scaling lambda by a factor of 1/2.

7 Discussion

Our experiments agree with [16] (Table 1) for supervised dictionary learning (Table 4 and Table 5), as long as we center and normalize the data. It is possible to improve the results by increasing the number of iterations per dictionary training (param.iter), though doing so increases running time tremendously. We were unable to reproduce their result with the algorithms described in [16] using unsupervised dictionary learning, as can be seen in Table 11 and Table 15.
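For reference, a minimal MATLAB sketch of the SPAMS setup implied by the note above is given below. Xtrain, the number of atoms, lambda, and the iteration count are placeholder values (the actual values are those set in STEP 0 of each script); only param.mode = 2, lambda_2 = 0, and the factor-1/2 scaling of lambda are taken from the note.

    % Minimal sketch: train one dictionary with SPAMS so that the solved
    % problem matches Equation (3) with lambda_2 = 0 (placeholder values).
    lambda        = 0.15;          % assumed regularization weight before scaling
    param.K       = 8;             % atoms per dictionary (K in the tables)
    param.mode    = 2;             % SPAMS formulation corresponding to Equation (3)
    param.lambda  = 0.5 * lambda;  % lambda scaled by a factor of 1/2
    param.lambda2 = 0;             % lambda_2 = 0
    param.iter    = 100;           % param.iter: iterations per dictionary training

    D = mexTrainDL(Xtrain, param);  % Xtrain: columns are training signals of one class
    A = mexLasso(Xtrain, D, param); % sparse codes alpha_i, stored as columns of A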

Nevertheless, we deduce that the dictionary learning process (utilizing SPAMS [11, 12]) works as it should. Thus, the disparity between the supervised and unsupervised clustering results lies in the spectral clustering process. We tried multiple spectral clustering toolboxes ([2, 3, 18]), none of which produced desirable results. This is illustrated by Figure 28, which displays the initial clusterings used to produce the unsupervised dictionaries. These figures were produced with the SPAMS displaypatches function.

Figure 28: Clusters from centered and normalized signals of digits 0 to 4, K = 5. Left: initial cluster of atoms, result of Algorithm 4; Right: initial cluster of signals, result of Algorithm 3. As can be seen, the clusters in both figures are not meaningful because each of them mixes 2, 3, and a few other digits together.

[Figure: two patch-grid images of the initial clusters, produced with displaypatches.]

As can be seen in Figure 28, spectral clustering was unable to differentiate between 2 and 3, resulting in a high misclassification rate. This can further be seen, when clustering atoms, in Table 14, where there was an average misclassification rate of 0.946% when clustering {0, 1}, compared to a rate of 1.5671% when clustering {2, 3}. Moreover, for clustering signals as in Table 10, there was an average misclassification rate of 0.1891% when clustering {0, 1}, compared to a rate of 4.9951% when clustering {2, 3}.

One may wonder why {2, 3} is more difficult to classify than {0, 1}. Our first results point out a significant difference between the misclassification rates achieved with the dictionaries associated with these two sets. We first examined whether there were any specific atoms of our large unclustered dictionary that were used disproportionately often in constructing sparse representations of the images. If there were such an atom, its appearance in multiple classes might have caused many false classifications. Upon further examination, no such atom was found (see Figure 29).

We then moved on to other potential problems. Concerning the initial partition obtained in [16], the authors do not justify or explain the proposed choices of similarity measures. A very simple mathematical argument allows us to understand the undesired effects of these choices. Assume that the matrix A (see Section 4) contains only positive entries. Under that condition, S_1 := |A^T A| = A^T A. This means that two columns a_1, a_2 of A in R^K, or equivalently two sparse signal representations, are similar if their standard inner product ⟨a_1, a_2⟩ = ‖a_1‖_2 ‖a_2‖_2 cos(a_1, a_2) on R^K is large. Thus, given two pairs of vectors a_1, a_2 and b_1, b_2 in R^K such that cos(a_1, a_2) = cos(b_1, b_2), the similarity measure proposed in [16] favors the pair whose product of norms is bigger. As a corollary, nearly orthogonal vectors can be scored as more similar than co-linear vectors if the product of their norms is big enough. This surprising fact is not discussed in [16]. To cope with it, we tried to normalize the matrix A so that each of its columns has unit l2 norm. For an unknown reason, the norms of the columns of A are, with MNIST training images, almost constant (see Figure 30 for an example).
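The norm effect described above can be checked with a small numerical example (made-up vectors, not taken from our data): a nearly orthogonal pair with large norms receives a larger score under an |A^T A|-type similarity than a co-linear pair with small norms, and normalizing the columns to unit l2 norm removes the effect.

    % Numerical illustration of the norm effect in the similarity |A'*A|.
    a1 = [10; 0.5];  a2 = [0.5; 10];   % nearly orthogonal, large norms
    b1 = [1; 1];     b2 = [2; 2];      % co-linear, small norms
    abs(a1' * a2)                      % = 10  -> scored as "very similar"
    abs(b1' * b2)                      % = 4   -> scored lower, despite co-linearity
    % Normalizing to unit l2 norm leaves only the angular information:
    abs((a1/norm(a1))' * (a2/norm(a2)))   % ~ 0.10
    abs((b1/norm(b1))' * (b2/norm(b2)))   % = 1.00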

Figure 29: Histogram of atom usage in sparse signal decomposition for dictionaries of 8 atoms. Left: non-centered, non-normalized. Right: centered, normalized.

[Figure: normalized histogram of the l2 norms of the columns of A; x-axis: norm interval, y-axis: empirical probability.]

Figure 30: The normalized histogram of the l2 norm of the columns of A. Most of the column norms accumulate in the interval [3, 4], which accounts for around 66% of them.

Another point that is not discussed in [16] is the absolute value. For example, the two orthogonal vectors a_1 = (1, 1)^T and a_2 = (1, -1)^T become co-linear when their entries are replaced by their absolute values. Surprisingly, again, dropping the absolute value and using A^T A directly for S_1 did not dramatically improve our results. This is explained, empirically, by how few of the entries of A are negative: 135245 out of 10^6 entries. The most puzzling sentences of [16] are the last few before the concluding remarks: We observed that best results are obtained for all the experiments when the initial dictionaries in the learning stage are constructed by randomly selecting signals from the training set. If the size of the dictionary compared to the dimension of the data