Applied Machine Learning
Assignment 1

Professor: Aude Billard
Assistants: Guillaume de Chambrier, Nadia Figueroa, Joao Abrantes
Contacts: aude.billard@epfl.ch, guillaume.dechambrier@epfl.ch, nadia.figueroafernandez@epfl.ch, joao.abrantes@epfl.ch

Winter Semester 2015

1 Goals

The goal of this assignment is to familiarize yourself with the Principal Component Analysis (PCA) technique presented in class and to acquaint you with the importance of choosing one's dataset well in order to obtain the best performance from an algorithm.

1.1 Structure of the practicals

Part I of this assignment consists of ungraded exercises that familiarize you with the machine learning software we will use for this and future practical sessions. Part II consists of a set of graded exercises. The percentage of marks carried by each exercise is indicated next to it. For the graded part, you will have to submit a written report in which you answer each of the listed questions.

In this and all future practical sessions, you will use the MLDemos toolkit (http://lasa.epfl.ch/teaching/lectures/ml_msc/mldemos_master.zip), which provides a collection of machine learning algorithms that you can apply to hand-made as well as real-world datasets.

Recall that practical sessions are performed by teams of 3 persons. If you do not yet have a partner, let the assistants know and they will assign you to a team.

1.1.1 PCA

During the first practice session on PCA, 3 different data sets will be analyzed step by step by the class as a whole with the help of the assistants. The objective is to draw your attention to the different types of situations and caveats you might encounter when performing PCA. You should also pay attention to the different techniques for visualizing high-dimensional data provided in MLDemos. This section will be ungraded. The second section of the assignment requires you to form your own data set via MLDemos and carry out the same analysis in your respective groups; it will be graded.
1.1.2 Grading & Submission

This assignment will be graded through a report, which you must hand in no later than October 16th, 18h00. All reports should be submitted online at the course webpage http://lasa.epfl.ch/teaching/lectures/ml_msc/#submission. The submission form is located at the bottom of the page and indicates which submission is currently open. You should select your group and upload a .pdf file of at most 10 MB. You may upload multiple times, in which case only the latest file will be graded.

Delays will be penalized: 1 point will be subtracted for each day of delay. The first day late counts starting one hour after the deadline. This report counts for 5% of the total grade of the course.

Practicals are conducted in teams of two. Unless told otherwise, we assume that the work has been shared equally by the members of the team, and hence all members will be given the same grade. More information on the assignment and on the way the report should be written is given below.
2 Part I: Principal Component Analysis (ungraded)

For this first practical you will focus on choosing a suitable projection of the data through Principal Component Analysis. Such a projection aims at improving the separability of the data and at reducing the dimensionality of the dataset.

Throughout the practical session, you will be working on synthetic and real data. Synthetic data can be created and analysed with ML methods in MLDemos. It helps to visualize how changes in the data affect the results of a learning algorithm. However, synthetic data seldom help to grasp the issues that arise when using realistic, and hence noisy, data. You are advised to start with synthetic data during the practical sessions, but then move to real data. The report should focus solely on results obtained with your own real datasets.

The first, non-graded, step will require you to investigate different PCA projections for 3 pre-chosen data sets, namely iris, biotac and ads. Your task is to determine which projection is best suited for the purpose of dimensionality reduction and class separability. You will then be asked to discuss the influence those particular choices may have in improving or degrading the performance of the classification or clustering process.

2.1 Getting started

1. Download MLDemos and datasets

The software (downloadable at http://lasa.epfl.ch/teaching/lectures/ml_msc/mldemos_master.zip) provides a graphical interface for visualizing the data and algorithms you will use throughout this year. The datasets for the in-class part of this practical can be downloaded from http://lasa.epfl.ch/teaching/lectures/ml_msc/practical1_data.rar. If you are using an EPFL computer, you are advised to decompress the MLDemos zip file in the desktop folder to avoid file path issues.

For each dataset, carry out the following tasks and answer the questions.

2.
Load your data set

Launch mldemos.exe and load a *.data file (drag and drop the file into MLDemos, or go to File > Import > Data (csv,text)). The data is displayed in its original space.

Figure 1: Display of the first two dimensions of the iris data set.

Figure 2: Choose the dimensions your data are displayed on (1) and the way they are displayed (2).
3. Interpret high-dimensional data

By default, the data is projected on the first two dimensions (either in the original space or in the projected space). You can change this by selecting the dimensions you want to display your data on at the bottom left of the main window (Figure 2, (1)). As shown in Figure 3, there are other ways of visualizing your data that display more than two dimensions at the same time. Choose from the list (Figure 2, (2)) between Scatterplots (plots every pairwise combination of dimensions next to each other; expect some slowdown if you have a big dataset), Parallel coordinates (each datapoint is a line passing through each dimension) and BubblePlots (displays a third dimension by varying the size of each datapoint).

Figure 3: Different data displays are available. From left to right: Standard, Scatterplots, Parallel coordinates and Bubble plots methods.

4. Project your data

Figure 4: How to project your data: open the Algorithms window (1), select the Projections tab (2) and choose PCA (3). Click on Project (4).

To project your data with PCA, click on the Algorithms button (Figure 4, (1)) and go to the Projections tab (Figure 4, (2)). Select Principal Component Analysis and click on Project. Your data is now projected onto the eigenvectors of its covariance matrix. Note that you can project your data back to its original space by clicking the
Revert button. The graph of the reconstruction error when projected on each eigenvector (Figure 5) and a table with each component's (percentage) variance are shown. This information can be useful to get an idea of how much information is stored in each eigencomponent. The Eigen button will display the eigenvectors in a new window.

Figure 5: Reconstruction error and component variance (1). Eigen button (2) to display the eigenvectors in a new window.

2.2 In-class questions

1. Using the visualization type: correlations, how many eigenvectors would you expect to need to achieve a 99% reconstruction error (iris and biotac datasets)?
2. What qualitative difference do you see in the data projected onto the first eigenvectors as opposed to the later ones (for all datasets)?

2.2.1 Data set description

iris

This data set was introduced by Sir Ronald Fisher as a test dataset for discriminant analysis. It is a multivariate data set consisting of three different flower types (Iris Setosa, Iris Versicolour, Iris Virginica). Each flower is represented by a 4-dimensional vector:

1 Sepal length
2 Sepal width
3 Petal length
4 Petal width

The data set in question was taken from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/iris.

biotac

The biotac data set consists of two classes, where each sample has 19 features. The data was recorded during a simple sweeping motion on a table top in which all the biotac receptor patches (see Figure 6) were in contact with the table. Two sweeping motions were performed, which correspond to the two classes. The first sweep was from left to right, whilst the second was from right to left.
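The projection step and the reconstruction-error graph described in Section 2.1 can also be explored outside MLDemos. The following is a minimal NumPy sketch and not MLDemos's own implementation: the synthetic 4-dimensional data, the function names `pca_fit` and `reconstruction_error`, and the choice of relative squared error as the error measure are all assumptions made for illustration.

```python
import numpy as np


def pca_fit(X):
    """Return the mean, eigenvalues and eigenvectors (as columns) of the
    covariance of X, sorted by decreasing eigenvalue. X has one sample per row."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order]


def reconstruction_error(X, mean, eigvecs, k):
    """Relative squared error after keeping only the first k eigenvectors
    (an assumed error measure, chosen for this sketch)."""
    W = eigvecs[:, :k]
    Z = (X - mean) @ W        # projection onto the first k eigenvectors
    X_hat = Z @ W.T + mean    # back-projection to the original space
    return np.sum((X - X_hat) ** 2) / np.sum((X - mean) ** 2)


# Hypothetical data: 150 samples, 4 features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

mean, eigvals, eigvecs = pca_fit(X)
explained = eigvals / eigvals.sum()  # fraction of variance per component
errors = [reconstruction_error(X, mean, eigvecs, k) for k in range(1, 5)]

for k, (v, e) in enumerate(zip(explained, errors), start=1):
    print(f"component {k}: variance {100 * v:5.1f}%  error with first {k}: {100 * e:5.1f}%")
```

With this particular error measure, the error after keeping k components equals one minus the cumulative variance fraction of those components, so the variance table and the error graph carry the same information in two forms.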
Figure 6: (a) Biotac finger, meant to be as close as possible to the sensing capabilities of a human finger. (b) Each circle represents a sensor (numbered 1 to 19); the spatial locations of the patches map onto the skin of the finger.

3 Part II: Principal Component Analysis (graded)

Face and object images present an interesting source of data as they live in a high-dimensional space (images often have several thousand dimensions). In this part, groups should use original images that they have gathered through the internet or other sources (personal photos, etc.). Note that in the report that you will submit, the images must be a mix of different object types, e.g. mugs, pens, faces etc.

When creating the image dataset, keep in mind that you will have to split each dataset into different classes later on in the practicals. Therefore, make sure to have enough samples for whichever types of objects you choose to have in your dataset. The minimal size for a dataset should be 50-60 samples (i.e. 25-30 samples per class), but you will realize that a bigger dataset can help your understanding of how the algorithm works. The system should be able to process up to a couple of thousand samples.

Figure 7: PCAFaces plugin GUI for creating image-based datasets and projecting the results into MLDemos.

Follow these steps to create your own dataset with images.

1. Launch MLDemos and select Plugins > Input / Output > PCAFaces from the menu.
2. An interface should pop up (see Figure 7). If you have a camera attached to your computer, its feed should open on the left-hand side of the interface and allow you to select a region of the image that can be captured multiple times (e.g. different faces or different facial expressions). Alternatively, an image can be loaded and sub-regions of that image selected as samples.
3. Use the button marked with >> to add the selected regions to the dataset.
4. Once you have selected enough samples (all samples are gathered on the right-hand side of the interface), you can assign labels to each sample by left/right-clicking on them. You can left/right-shift-click a sample to change the class label of all samples below. Ctrl+clicking on a sample will remove it from the dataset. Save the dataset once you are satisfied with the results. The dataset is saved as an image, which you can open and edit with any imaging software (and which you could, for example, include in your report).
5. In the PCAFaces window, select the eigenvectors to project your data on at the bottom right of the window.

Figure 8: Selecting the eigenvectors in the PCAFaces window will determine onto which two eigenvectors the data is projected in the main window.

4 Report

Write a report of maximum 4 pages (single column, 10pt minimum) in PDF format. Pages beyond the fourth one will be ignored. The best way to write the report is to fill it in as you go during the practical session. Just jotting down some quick notes while you experiment will save you hours once you work on the report itself.

A qualitative evaluation should contain images (e.g. screenshots) which exemplify the concepts you want to explain (e.g. an image of a good projection and an image of a bad one). Make sure to include only a subset of all the plots you may have visualized during the practical. Choose the ones that are the most representative.
Make sure that there is no redundancy in the information conveyed by the graphs, so that each graph presents a different concept. Each graph/image should be accompanied by a caption that explains the content of the image. Bad captions are captions that contain solely the figure number! An example of a good caption would typically read as follows:

Figure 2: The left plot shows the e1 and e2 projection of 10 images of human faces, typical of those shown in Figure 1.

In the main text, refer to all figures using their figure numbers. Bad captions and lack of clear references to pictures in the text will be penalized.
4.1 Format

In this first report, we expect solely a qualitative assessment of the performance and behavior of the system. Your report will be graded on the following aspects:

1. Description of your data set. This should include the number of classes, the number of samples per class and the dimensionality of your data. You may provide illustrations depicting a typical member from each class. (20%)
2. The following discussions regarding the PCA algorithm:
(a) Discuss the effectiveness of using PCA as a preprocessing step before classification. Think in terms of the separability of the data in the projected space. (20%)
(b) Can you find one or more projections of the data that would make the classes separable? If this is the case, can you decipher which feature of the data was extracted by the projection and whether these features correspond to your expectations? If you did not manage to find a suitable pair or group of projections to separate the data, discuss why this is the case. (30%)
(c) What happens if you do not use all samples to train PCA, e.g., if all objects used in PCA have similar shape/color etc.? (You can do this by right/left + clicking on the samples in the dataset window.) Repeat this process 3 times by selecting different subgroups of images, and discuss how the choice of training set affects the choice of PCA features and the separability of the data. (30%)
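Question 2(c) relies on the fact that the eigenvectors are computed from the training samples only, while any sample can afterwards be projected onto them. A minimal NumPy sketch of that mechanism follows. It does not reproduce the PCAFaces plugin: the random "images", their 8x8 size, the two-class layout and the helper `fit_pca_basis` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def fit_pca_basis(X_train, k):
    """Mean and first k principal directions (as columns) of the training rows."""
    mean = X_train.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the covariance eigenvectors,
    # already sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k].T


rng = np.random.default_rng(1)
# Hypothetical dataset: 60 flattened 8x8 "images" (64 dimensions), two classes of 30.
class_a = rng.normal(loc=0.0, size=(30, 64))
class_b = rng.normal(loc=1.0, size=(30, 64))
X = np.vstack([class_a, class_b])

# Basis trained on all samples versus on class A only.
mean_all, W_all = fit_pca_basis(X, k=2)
mean_a, W_a = fit_pca_basis(class_a, k=2)

# Every sample is projected, whichever subset was used for training.
Z_all = (X - mean_all) @ W_all
Z_a = (X - mean_a) @ W_a

# Distance between the two class means along the first eigenvector.
gap_all = abs(Z_all[:30, 0].mean() - Z_all[30:, 0].mean())
gap_a = abs(Z_a[:30, 0].mean() - Z_a[30:, 0].mean())

print("projection shapes:", Z_all.shape, Z_a.shape)
print("class-mean gap along e1 (trained on all):", gap_all)
print("class-mean gap along e1 (trained on A):  ", gap_a)
```

Comparing the two printed gaps for different training subsets is one way to see how the choice of training set changes the directions that PCA extracts.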