CSCI 374 Machine Learning and Data Mining
Oberlin College, Fall 2016
Homework #1: Decision Trees

Important Dates
Assigned: September 21
Snapshot 1: September 28 (11:59 PM)
Snapshot 2: October 5 (11:59 PM)
Final Due Date: October 10 (11:59 PM)

Assignment
In this assignment, you will practice:
1) implementing machine learning algorithms from scratch,
2) experimenting with those algorithms on a variety of provided data sets with different properties,
3) analyzing the results of those experiments to evaluate the performance of the different implemented learning algorithms with respect to different data sets, and
4) writing a technical report detailing (i) how your implementation works, (ii) your experimental setup, (iii) the results of your experiments, and (iv) any implications or lessons learned from your implementation and results.

In particular, you will implement two or three of the machine learning algorithms discussed in class for learning decision tree representations of a supervised learning classifier: ID3, C4.5, and (optionally) CART. By implementing the algorithms yourself (rather than re-using existing implementations), you will gain a better understanding of how decision trees are learned and used, as well as the differences between the various algorithms for learning decision trees and their relative advantages and disadvantages.

Acceptable Programming Languages
You can use either the Java or Python programming language to complete this assignment.

Data Sets
For this assignment, you will use three pre-defined data sets in CSV files, which can be downloaded from the Course Content/Homework 1 folder on Blackboard:

1) monks1.csv: A data set describing two classes of robots using all nominal attributes and a binary label. This data set has a simple rule set for determining the label: if head_shape = body_shape or jacket_color = red, then yes, else no. This data set is useful for debugging your implementations and verifying their correctness. Monks1 was one of the first machine learning challenge problems (http://www.mli.gmu.edu/papers/91-95/91-28.pdf). This data set comes from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/monk%27s+problems

2) opticaldigit.csv: A data set for optical character recognition of numeric digits from processed pixel data. Each instance represents a different 32x32 pixel image of a handwritten numeric digit (from 0 through 9). Each image was partitioned into 64 4x4 pixel segments, and the number of pixels with non-background color was counted in each segment. These 64 counts (ranging from 0-16) are the 64 attributes in the data set, and the label is the number from 0-9 represented by the image. This data set is more complex than the Monks1 data set, but still contains only nominal attributes and a nominal label. This data set comes from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/optical+recognition+of+handwritten+digits

3) hypothyroid.csv: A data set describing patient health data using a mix of nominal and continuous attributes that can be used to diagnose the health of a patient's thyroid into four possible labels. This data set is more complex in the types of attributes and the number of labels than the other two data sets. This data set comes from Weka 3.8: http://www.cs.waikato.ac.nz/ml/weka/

The file format for each of these data sets is as follows:
- The first row contains a comma-separated list of the names of the label and attributes
- Each successive row represents a single instance
- The first entry (before the first comma) of each instance is the label to be learned, and all other entries (following the commas) are attribute values. Some attributes are strings (representing nominal values), some are integers, and others are real numbers. Each label is a string.
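As a point of reference, reading a data set in this format might look like the following Python sketch (the function name and the `(label, attributes)` tuple representation are illustrative choices, not requirements):

```python
import csv

def read_instances(path):
    """Read a CSV data set where the first row names the label and the
    attributes, and each later row is one instance with the label first."""
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    header = rows[0]  # the label name, then the attribute names
    instances = []
    for row in rows[1:]:
        label, attributes = row[0], row[1:]  # label precedes the first comma
        instances.append((label, attributes))
    return header, instances
```

Note that all values are read as strings here; handling integer and real-valued attributes (as in hypothyroid.csv) is left to your implementation.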
Program Behavior
Your program should behave as follows:
1) It should take three parameters as input:
   a. The path to a file containing a data set (e.g., monks1.csv)
   b. The name of the algorithm to use for training (see below for more details)
   c. A random seed as an integer
2) Next, the program should read in the data set as a set of instances
3) The instances should be split into training and test sets (using the random seed input to the program)
4) The training set should be fed into the specified machine learning algorithm to construct a decision tree fitting the training data
5) The learned decision tree should be evaluated using the test set created in Step 3.
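The seeded split in Step 3, the evaluation bookkeeping, and the confusion matrix output (whose exact format Step 6 below specifies) might be sketched as follows. This is only an illustration, assuming Python; the helper names, the 2/3-1/3 split ratio, and the use of `random.Random` are all hypothetical choices, not requirements of the assignment:

```python
import math
import random

def split_instances(instances, seed, train_fraction=2/3):
    """Shuffle with the given seed so the same seed always yields the
    same training and test sets, then cut at the given fraction."""
    rng = random.Random(seed)
    shuffled = instances[:]          # copy so the original order survives
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def confusion_matrix(labels, pairs):
    """Count (actual, predicted) pairs into a nested dict keyed by label."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for actual, predicted in pairs:
        counts[actual][predicted] += 1
    return counts

def write_confusion_matrix(counts, labels, path):
    """Write the matrix in the output format described in Step 6: a header
    of predicted labels ending in a comma, then one row of counts per
    actual label, each row ending with that label's name."""
    with open(path, "w") as f:
        f.write(",".join(labels) + ",\n")
        for actual in labels:
            row = [str(counts[actual][p]) for p in labels]
            f.write(",".join(row + [actual]) + "\n")

def accuracy_with_interval(counts, labels):
    """Test-set accuracy plus a 95% confidence interval, using the usual
    normal approximation: p +/- 1.96 * sqrt(p * (1 - p) / n)."""
    n = sum(counts[a][p] for a in labels for p in labels)
    correct = sum(counts[l][l] for l in labels)
    p = correct / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, (p - half_width, p + half_width)
```

The determinism of `split_instances` is what makes the seed requirement work: rerunning with the same seed reproduces the same training set, the same tree, and the same confusion matrix. The confidence-interval helper is the kind of calculation Experiment 1 below asks for.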

6) The confusion matrix counted during Step 5 should be output as a file with its name following the pattern results_<dataset>_<algorithm>_<seed>.csv (e.g., results_monks1_id3_12345.csv).

The file format for your output file should be as follows:
- The first row should be a comma-separated list of the possible labels in the data set, representing the list of possible predictions of the decision tree. This row should end in a comma.
- The second row should be a comma-separated list of the counts of instances predicted as each of the different labels whose true label is the first possible label, ending with the name of the first possible label (and not a final comma).
- The third row should be a comma-separated list of the counts of instances predicted as each of the different labels whose true label is the second possible label, ending with the name of the second possible label (and not a final comma).
- Etc. for the remaining possible labels.

For example, the confusion matrix below (rows give the actual label, columns give the predicted label):

                 Predicted Label
                 Yes     No
Actual    Yes    200     100
Label     No     50      250

would be output as:

Yes,No,
200,100,Yes
50,250,No

The output of your program should be consistent with the random seed. That is, if the same seed is input twice, your program should learn the exact same tree and output the exact same confusion matrix. You are free to output other files, too, if you wish (e.g., a file describing the learned tree).

Experiments
There are two options for completing this assignment:

Option #1: Implement each of the ID3, C4.5, and CART algorithms, then use the three data sets to conduct the following experiments:

1) Pick a single random seed (include it in your report) and run each learning algorithm on each data set (except do not run ID3 on the hypothyroid data set, since it contains numeric data), then compare the resulting performance of the learned decision trees from each algorithm. For each data set, how do the accuracies compare? Remember to use 95% confidence intervals in your comparisons.

2) Pick one data set, then learn 30 different decision trees with each algorithm, and calculate the average accuracy per algorithm across the 30 runs. To do so, use 30 different random seeds to generate 30 different training sets and 30 different trees. Then, compare the average accuracy across those 30 runs with the confidence intervals found in Experiment 1 above to answer the following questions for each algorithm:
   a. How close was the average accuracy across the 30 runs to the original accuracy found in Experiment 1?
   b. Does the average accuracy fall within or outside the confidence interval found in Experiment 1?
   c. Are the average accuracies across algorithms closer or farther apart than the original accuracies computed for Experiment 1?

Only calculate standard errors and confidence intervals in Experiment 1, not for your 30 additional runs in Experiment 2. The goal of Experiment 1 is to investigate how the different algorithms compare on different data sets and to gain practice evaluating their differences. The goal of Experiment 2 is to gain additional understanding of how confidence intervals measure the performance of machine learning algorithms.

For Option #1, the names of the algorithms to use as input to your program should be ID3, C4.5, and CART.

Option #2: Implement the ID3 algorithm, as well as three variants of C4.5: (1) full C4.5, (2) C4.5 without pruning, and (3) C4.5 without using SplitInformation when determining the best attribute (i.e., only use Gain, as in ID3). Then, using the three data sets, conduct the same two experiments as in Option #1, except consider all three variants of C4.5 (and leave out CART) in both Experiments 1 and 2. In particular, add the following analyses:

- For the monks1.csv and opticaldigit.csv data sets, draw the root and children of the trees found by ID3 and by C4.5 without pruning. Compare any similarities or differences between the trees.
- For the monks1.csv and opticaldigit.csv data sets, compare the attributes found at the top of the tree in ID3 and in the most accurate rules found by (full) C4.5. Do the same attributes appear in both? What differences do you find?
- For all three data sets, compare full C4.5 to C4.5 without pruning to evaluate the benefits of pruning on total accuracy on the test set.
- For all three data sets, compare full C4.5 to C4.5 without SplitInformation to evaluate any possible benefits to total accuracy on the test set from considering SplitInformation when choosing the best attribute for each node.
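The last comparison above hinges on the difference between the two attribute-selection criteria: ID3 ranks attributes by information Gain alone, while C4.5 divides Gain by SplitInformation to get a GainRatio that penalizes many-valued attributes. A sketch of these quantities (the function names and the `(label, attributes)` instance representation are illustrative assumptions):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy of a list of values, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain(instances, attribute_index):
    """Information gain of splitting (label, attributes) instances on one attribute."""
    labels = [label for label, _ in instances]
    remainder = 0.0
    for value in {attrs[attribute_index] for _, attrs in instances}:
        subset = [label for label, attrs in instances
                  if attrs[attribute_index] == value]
        remainder += len(subset) / len(instances) * entropy(subset)
    return entropy(labels) - remainder

def split_information(instances, attribute_index):
    """Entropy of the attribute's own value distribution; large for
    attributes with many distinct values."""
    return entropy([attrs[attribute_index] for _, attrs in instances])

def gain_ratio(instances, attribute_index):
    # Note: a real C4.5 implementation must guard against a zero
    # SplitInformation (an attribute with a single value).
    return gain(instances, attribute_index) / split_information(instances, attribute_index)
```

The C4.5NSI variant amounts to selecting attributes with `gain` where full C4.5 would use `gain_ratio`.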

For Option #2, the names of the algorithms to use as input to your program should be ID3, C4.5, C4.5NP (for C4.5 without pruning), and C4.5NSI (for C4.5 without SplitInformation).

Snapshots
Since the homework assignment spans multiple weeks, there are two intermediate deadlines to help you make sure you complete the entire assignment on time:

Snapshot 1 (due Wednesday September 28 at 11:59 PM): your program should be capable of:
- Inputting the program parameters described above
- Reading a data set into a set of instances
- Splitting the data set into training and test sets (using the random seed)
- Running the ID3 algorithm
- Outputting the confusion matrix from testing the learned tree

Snapshot 2 (due Wednesday October 5 at 11:59 PM): your program should additionally be capable of:
- Running the C4.5 algorithm

For each snapshot, your code (and the associated Makefile and README described below) should be organized in a ZIP file and turned in on Blackboard. Your ZIP file should be named <OCCSUserName>_SnapshotX.zip. For example, Alice Student's second snapshot would be named astudent_snapshot2.zip.

Final Handin
Before the assignment due date (Monday October 10 at 11:59 PM), you will turn in:
1) A ZIP file (named as your OCCS username) containing:
   a. Your source code
   b. A Makefile for compiling your source code
   c. A README file
2) Your technical report as a PDF file, named the same as your ZIP file.

Your Makefile must be able to compile your source code into an executable program that behaves as described above. Your README file should describe the different source code files used by your program, as well as instructions for running your program and finding its output file(s).

Your technical report should contain:
- An introduction describing the assignment and the contents of the report (provide the reader with the background needed to understand the rest of the report)
- A description of your implementation (what did you create?)
- A description of your experimental setup (what did you run and for what purpose?)
- A discussion of the results (what did you find, why did you find that, and what are the implications?)
- A conclusion summarizing the report and assignment

Grading
The homework will be graded as follows:
- Snapshot 1: 5%
- Snapshot 2: 5%
- Implementation Correctness and Documentation: 50%
- Report: 40%

Honor Code
Each student is to complete this assignment individually. However, since the assignment is a mini-project in scope, students are encouraged to collaborate with one another to discuss the abstract design and processes of their implementations. For example, please feel free to discuss the pseudocode for each learning algorithm to help each other work through issues in understanding exactly how the learning algorithms work. You might also want to discuss the processes used to generate the training and test sets from the read-in data set, or how to work with the input and output files.

At the same time, since this is an individual assignment, no code can be shared between students, nor can students look at each other's code. All discussions should be limited to abstract details and not implementation-specific concerns. For example, do not discuss the code used in the classes that represent a decision tree, nor the lines of code used to build the trees from training data. Furthermore, the source code of existing machine learning libraries (e.g., Weka for Java, scikit-learn for Python) must not be consulted. Any violation of the above will be considered an Honor Code violation. If you have any questions about what is permissible and what is not, please discuss them with the professor.
Please also feel free to stop by office hours to discuss the homework assignment if you have any questions or concerns.