Applied Machine Learning Assignment 1

Professor: Aude Billard
Assistants: Guillaume de Chambrier, Nadia Figueroa, Joao Abrantes
Contacts: aude.billard@epfl.ch, guillaume.dechambrier@epfl.ch, nadia.figueroafernandez@epfl.ch, joao.abrantes@epfl.ch
Winter Semester 2015

1 Goals

The goal of this assignment is to familiarize yourself with the Principal Component Analysis (PCA) technique presented in class and to show you how much the choice of dataset matters for getting the best performance out of an algorithm.

1.1 Structure of the practicals

Part I of this assignment consists of ungraded exercises to familiarize yourself with the machine learning software we will use for this and future practical sessions. Part II consists of a set of graded exercises. The percentage of marks carried is indicated next to each exercise. For the graded part, you will have to submit a written report in which you answer each of the listed questions.

In this and all future practical sessions, you will use the MLDemos toolkit (http://lasa.epfl.ch/teaching/lectures/ml_msc/mldemos_master.zip), which provides a collection of machine learning algorithms that you can apply to hand-made as well as real-world datasets.

Recall that practical sessions are performed by teams of 3 persons. If you do not yet have a partner, let the assistants know and they will assign you to a team.

1.1.1 PCA

During the first practice session on PCA, 3 different data sets will be analyzed step by step by the class as a whole with the help of the assistants. The objective is to draw your attention to the different types of situations and caveats you might encounter when performing PCA. You should also pay attention to the different techniques for visualizing high-dimensional data provided in MLDemos. This section will be ungraded.

The second section of the assignment will require you to form your own data set via MLDemos and carry out the same analysis in your respective groups. This section will be graded.

1.1.2 Grading & Submission

This assignment will be graded through a report which you must hand in no later than October 16th, 18h00. All reports should be submitted online at the course webpage http://lasa.epfl.ch/teaching/lectures/ml_msc/#submission. The submission form is located at the bottom of the page and indicates which submission is currently open. You should select your group and upload a .pdf file of no more than 10 MB. You may upload multiple times, in which case only the latest file will be graded.

Delays will be penalized: 1 point will be subtracted for each day of delay. The first day late counts starting one hour after the deadline.

This report counts for 5% of the total grade of the course. Practicals are conducted in teams of two. Unless told otherwise, we assume that the work has been shared equally by the members of the team, and hence all members will be given the same grade. More information on the assignment and on the way the report should be written is given below.

2 Part I: Principal Component Analysis (ungraded)

For this first practical you will focus on choosing a suitable projection of the data through Principal Component Analysis. Such a projection aims at improving the separability of the data and at reducing the dimensionality of the dataset.

Throughout the practical session, you will be working on synthetic and real data. Synthetic data can be created and analysed with ML methods by using MLDemos. It helps to visualize how changes in the data affect the results of a learning algorithm. However, synthetic data seldom help to grasp some of the issues that arise when using realistic and hence noisy data. You are advised to start with synthetic data during the practical sessions, but then move to using real data. The report should focus solely on results obtained when using your own real datasets.

The first, non-graded, step will require you to investigate different PCA projections for 3 pre-chosen data sets, namely iris, biotac and ads. Your task is to determine which projection is best suited for the purpose of dimensionality reduction and class separability. You will then be asked to discuss the influence those particular choices may have in improving or degrading the performance of the classification or clustering process.

2.1 Getting started

1. Download MLDemos and datasets
The software (downloadable at http://lasa.epfl.ch/teaching/lectures/ml_msc/mldemos_master.zip) provides a graphical interface for visualizing the data and algorithms you will use throughout this year. The datasets for the in-class part of this practical can be downloaded from http://lasa.epfl.ch/teaching/lectures/ml_msc/practical1_data.rar. If you are using an EPFL computer, it is advised to decompress the MLDemos zip file in the desktop folder to avoid folder/file path issues. For each dataset, carry out the following tasks and answer the questions.

2. Load your data set
Launch mldemos.exe and load a *.data file (drag and drop the file in MLDemos, or go to file > Import > Data (csv,text)). The data is displayed in its original space.

Figure 1: Display of the first two dimensions of the iris data set.

Figure 2: Choose the dimensions your data are displayed on (1) and the way they are displayed (2).

3. Interpret high-dimensional data
By default, the data is projected on the first two dimensions (either in the original space or in the projected space). You can change this by selecting the dimensions you want to display your data on: select the dimensions on the bottom left of the main window (Figure 2, (1)). As shown in Figure 3, there are other ways of visualizing your data that display more than two dimensions at the same time. Choose from the list (Figure 2, (2)) between:
Scatterplots (plot every combination of 2x2 dimensions next to each other; expect some slowdown if you have a big dataset),
Parallel coordinates (each datapoint is a line passing through each dimension),
BubblePlots (display a third dimension by varying the size of each datapoint).

Figure 3: Different data displays are available. From left to right: Standard, Scatterplots, Parallel coordinates and Bubble plots methods.

4. Project your data
To project the data with PCA, click on the Algorithms button (Figure 4, (1)) and go to the Projections tab (Figure 4, (2)). Select Principal Component Analysis and click on Project. Your data is now projected onto the eigenvectors of its covariance matrix. Note that you can project your data back to its original space by clicking the Revert button. The graph of the reconstruction error when projected on each eigenvector is shown (Figure 5), together with a table of the (percentage) variance of each component. This information can be useful to get an idea of how much information is stored in each eigencomponent. The Eigen button will display the eigenvectors in a new window.

Figure 4: How to project your data: open the Algorithms window (1), select the Projections tab (2) and choose PCA (3). Click on Project (4).

Figure 5: Reconstruction error and component variance (1). Eigen button (2) to display the eigenvectors in a new window.

2.2 In-class questions

1. Using the visualization type "correlations", how many eigenvectors would you expect to need to achieve a 99% reconstruction error (iris and biotac datasets)?
2. What qualitative difference do you see in the data projected onto the first eigenvectors as opposed to the later ones (for all datasets)?

2.2.1 Data set description

iris
This data set was introduced by Sir Ronald Fisher as a test dataset for discriminant analysis. It is a multivariate data set consisting of three different flower types (Iris Setosa, Iris Versicolour, Iris Virginica). Each flower is represented by a 4-dimensional vector:
1 Sepal length
2 Sepal width
3 Petal length
4 Petal width
The data set in question was taken from the UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/iris.

biotac
The biotac data set consists of two classes where each sample has 19 features. The data was recorded during a simple sweeping motion on a table top in which all the biotac receptor patches (see Figure 6) were in contact with the table. Two sweeping motions were performed, which correspond to the two classes. The first sweep was from left to right, whilst the second was from right to left.
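To make step 4 of Section 2.1 and in-class question 1 more concrete, here is a minimal NumPy sketch of what a PCA projection computes: the eigenvectors of the data covariance matrix, the share of variance carried by each one, and the reconstruction error when only the first k eigenvectors are kept. It is not MLDemos code. The function names `pca_fit` and `reconstruction_error` are hypothetical, the data is a synthetic 4-dimensional stand-in rather than the iris or biotac files, and reading question 1 as "how many eigenvectors retain 99% of the variance" is an assumption.

```python
import numpy as np


def pca_fit(X):
    """Return (mean, eigenvalues, eigenvectors) of the data covariance.

    Eigenvalues are sorted in decreasing order; eigenvectors are the
    matching columns of the returned matrix.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order]


def reconstruction_error(X, mean, eigvecs, k):
    """Mean squared distance between each sample and its reconstruction
    from the first k eigenvectors."""
    W = eigvecs[:, :k]
    Z = (X - mean) @ W        # coordinates in the projected space
    X_hat = Z @ W.T + mean    # mapped back to the original space
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))


# Synthetic stand-in (assumption): 150 samples in 4 correlated dimensions,
# generated from 2 latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.05 * rng.normal(size=(150, 4))

mean, eigvals, eigvecs = pca_fit(X)
explained = np.cumsum(eigvals) / np.sum(eigvals)  # cumulative variance share
errors = [reconstruction_error(X, mean, eigvecs, k) for k in range(5)]
k99 = int(np.searchsorted(explained, 0.99) + 1)   # components for >= 99%

print("cumulative variance:", np.round(explained, 4))
print("reconstruction error for k = 0..4:", np.round(errors, 4))
print("components needed for 99% of the variance:", k99)
```

The reconstruction error can only decrease as more eigenvectors are kept, and it reaches zero (up to rounding) when all of them are used, which is the shape one would expect to see in the Figure 5 graph.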

Figure 6: (a) Biotac finger, meant to be as close as possible to the sensing capabilities of a human finger. (b) Each circle represents a sensor (numbered 1 to 19); the spatial locations of the patches map onto the skin of the finger.

3 Part II: Principal Component Analysis (graded)

Face and object images present an interesting source of data as they live in a high-dimensional space (images often have several thousand dimensions). In this part, groups should use original images that they have gathered through the internet or other sources (personal photos, etc.). Note that in the report that you will submit, the images must be a mix of different object types, e.g. mugs, pens, faces, etc.

When creating the image dataset, keep in mind that you will have to split each dataset into different classes later on in the practicals. Therefore make sure to have enough samples for whichever types of objects you choose to have in your dataset. The minimal size for a dataset should be 50-60 samples (i.e. 25-30 samples per class), but you will find that a bigger dataset can help your understanding of how the algorithm works. The system should be able to process up to a couple of thousand samples.

Figure 7: PCAFaces plugin GUI for creating image-based datasets and projecting the results into MLDemos.

Follow these steps to create your own dataset with images.

1. Launch MLDemos and select Plugins > Input / Output > PCAFaces from the menu.

2. An interface should pop up (see Figure 7). If you have a camera attached to your computer, it should open up on the left-hand side of the interface and allow you to select a region of the image that can be captured multiple times (e.g. different faces or different facial expressions). Alternatively, an image can be loaded and sub-regions of that image selected as samples.
3. Use the button marked with >> to add the selected regions to the dataset.
4. Once you have selected enough samples (all samples will be gathered in the right-hand side of the interface), you can assign labels to each sample by left/right-clicking on them. You can left/right-shift-click a sample to change the class label of all samples below. Ctrl+clicking on a sample will remove it from the dataset. Save the dataset once you're satisfied with the results. The dataset is saved as an image which you can open and edit with any imaging software (and which you could, for example, include in your report).
5. In the PCAFaces window, select the eigenvectors to project your data on in the bottom right of the window.

Figure 8: Selecting the eigenvectors in the PCAFaces window will determine onto which two eigenvectors the data is projected in the main window.

4 Report

Write a report of maximum 4 pages (single column, 10pt minimum) in PDF format. Pages beyond the fourth one will be ignored.

The best way to write the report is to fill it in as you go during the practical session. Just jotting down some quick notes while you experiment will save you hours once you work on the report itself.

A qualitative evaluation should contain images (e.g. screenshots) which exemplify the concepts you want to explain (e.g. an image of a good projection and an image of a bad one). Make sure to include only a subset of all the plots you may have visualized during the practical. Choose the ones that are the most representative. Make sure that there is no redundancy in the information conveyed by the graphs, so that each graph presents a different concept.

Each graph/image should be accompanied by a caption that explains the content of the image. Bad captions are captions that contain solely the figure number! An example of a good caption would typically read as follows: "Figure 2: The left plot shows the e1 and e2 projection of 10 images of human faces, typical of those shown in Figure 1." In the main text, refer to all figures using their figure numbers. Bad captions and lack of clear references to pictures in the text will be penalized.

4.1 Format

In this first report, we expect solely a qualitative assessment of the performance and behavior of the system. Your report will be graded on the following aspects:

1. Description of your data set. This should include the number of classes, the number of samples per class and the dimensionality of your data. You may provide illustrations depicting a typical member of each class. (20%)
2. Discussion of the following points regarding the PCA algorithm:
(a) Discuss the effectiveness of using PCA as a preprocessing step before classification. Think in terms of the separability of the data in the projected space. (20%)
(b) Can you find one or more projections of the data that would make the classes separable? If this is the case, can you decipher which feature of the data was extracted by the projection, and whether these features correspond to your expectations? If you did not manage to find a suitable pair or group of projections to separate the data, discuss why this is the case. (30%)
(c) What happens if you do not use all samples to train PCA, e.g. if all objects used in PCA have a similar shape, color, etc.? (You can do this by right/left-clicking on the samples in the dataset window.) Repeat this process 3 times by selecting different subgroups of images, and discuss how the choice of training set affects the choice of PCA features and the separability of the data. (30%)
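
Question 2(c) can also be explored outside MLDemos. The sketch below is one possible way to do so, under several assumptions: the "images" are synthetic 8x8 arrays rather than real photos, the helper names `fit_pca`, `project` and `class_separation` are hypothetical, and the separation score (distance between class means divided by the average within-class spread) is an illustrative choice, not a measure defined in this assignment. It flattens each image into a vector, trains PCA once on all samples and once on a single class, and compares how well the two classes separate in each projected space.

```python
import numpy as np


def fit_pca(train, n_components):
    """PCA via SVD of the centred training matrix (rows = flattened images).

    Returns the training mean and a matrix whose columns are the leading
    eigenvectors of the training covariance.
    """
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components].T


def project(X, mean, components):
    """Project samples onto the given eigenvectors."""
    return (X - mean) @ components


def class_separation(Z, labels):
    """Illustrative score: distance between the two class means divided by
    the average within-class standard deviation. Larger = better separated."""
    a, b = Z[labels == 0], Z[labels == 1]
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    within = 0.5 * (a.std(axis=0).mean() + b.std(axis=0).mean())
    return float(between / within)


# Synthetic stand-in for an image dataset (assumption): 30 noisy 8x8 images
# per class; class 0 is bright on the left half, class 1 on the right half.
rng = np.random.default_rng(1)
side = 8
templates = np.zeros((2, side, side))
templates[0, :, : side // 2] = 1.0
templates[1, :, side // 2:] = 1.0
labels = np.repeat([0, 1], 30)
images = templates[labels] + 0.3 * rng.normal(size=(60, side, side))
X = images.reshape(len(images), -1)  # 60 samples x 64 dimensions

# PCA trained on all samples.
mean_all, comps_all = fit_pca(X, 2)
score_all = class_separation(project(X, mean_all, comps_all), labels)

# PCA trained on class 0 only, then applied to every sample.
mean_sub, comps_sub = fit_pca(X[labels == 0], 2)
score_sub = class_separation(project(X, mean_sub, comps_sub), labels)

print("separation, PCA trained on all samples:", round(score_all, 2))
print("separation, PCA trained on class 0 only:", round(score_sub, 2))
```

When the training subset contains only one kind of object, the leading eigenvectors describe variation within that class and need not capture the difference between classes, so one would expect the second score to be lower on data like this. With real images the outcome depends on the subset chosen, which is what the question asks you to discuss.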