
1 Introduction

Sipina: speeding up decision tree induction via local sampling.

Caution: to follow this tutorial, you must use version 3.1 of Sipina Research. Please check the version number in the title bar.

During the decision tree learning process, the algorithm selects the best variable according to a goodness-of-fit measure each time it tries to split a node. This computation can take a long time, particularly for continuous descriptors, for which the optimal cut point must also be detected.

For all its decision tree algorithms, Sipina provides a local sampling option for the search of the best splitting attribute on a node. The idea is the following: on each node, we draw a random sample of size n, and all the computations are made on this sample. Of course, if the number of examples on the node is lower than n, Sipina uses all the available examples; this typically happens when we grow a very large tree with a high number of nodes. We described this approach in a paper (Chauchat and Rakotomalala, IFCS 2000) [1]. This tutorial shows how to implement it with Sipina.

We will see that using a sample on each node drastically reduces the computation time without loss of accuracy. The resulting tree, however, can differ from the one built on the entire dataset.

[1] J.H. Chauchat and R. Rakotomalala, «A new sampling strategy for building decision trees from large databases», Proc. of IFCS 2000, pp. 199-204, 2000.
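To fix ideas, here is a minimal Python sketch of this node-local sampling. The names are illustrative, not Sipina's actual code:

```python
import random

def best_split_with_local_sampling(examples, attributes, n, goodness_of_fit):
    # Use all examples when the node holds no more than n of them;
    # otherwise draw a random sample of size n.
    sample = examples if len(examples) <= n else random.sample(examples, n)
    # The best splitting attribute (and, for continuous descriptors, its
    # optimal cut point) is searched on the sample only.
    return max(attributes, key=lambda att: goodness_of_fit(sample, att))
```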

But is it really a problem? We know that decision trees are very unstable: trained on a different dataset, they can be very different (for instance, they may not use the same splitting attributes). On the other hand, this does not mean that the models classify the examples of the test set differently. One way to check this is to compare the generalization error rates.

In this tutorial, we show the interest of the local sampling strategy for decision tree induction:

First, we build the tree on the whole dataset using the usual approach, and we measure the computation time and the test error rate.

Then, we use the local sampling strategy with n = 5000 and compare the performances (error rate and computation time). This sample size is in fact very conservative: the sufficient sample size depends on various factors (complexity of the underlying concept, density of the examples in the representation space, etc.). On our dataset, we will see that about a hundred examples are enough when searching for the best splitting attribute on a node, without loss of accuracy.

Last, we compare the computation time with the decision tree implementation of Tanagra [2], which uses another strategy to speed up the learning process when handling continuous attributes.

2 Dataset

We use the famous WAVEFORM dataset (Breiman et al., 1984) [3]. The class attribute has 3 values and there are 21 continuous descriptors. We generated 2,000,000 examples; half are used for the learning process, the other half for the evaluation of the classifier. The data file is in the ARFF (Weka) file format [4]. We used a smaller version of this dataset previously [5]; it will be interesting to compare the behavior of Sipina and Tanagra on the same dataset (same class attribute and descriptors) but with four times as many examples.

[2] http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html
[3] http://archive.ics.uci.edu/ml/datasets/waveform+database+generator+(version+1)
[4] http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/wave2m.zip
[5] http://data-mining-tutorials.blogspot.com/2008/11/decision-tree-and-large-dataset.html
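The UCI page cited in footnote [3] describes Breiman's generator; as an alternative to downloading the 2,000,000-row ARFF file, here is a sketch of that generator in Python. It follows my reading of the standard description (three shifted triangular base waves, each class mixing two of them), so treat the details as assumptions:

```python
import numpy as np

def make_waveform(n_samples, seed=1):
    rng = np.random.default_rng(seed)
    i = np.arange(1, 22)                          # the 21 descriptors V1..V21
    # Triangular base waves peaking at positions 11, 15 and 7.
    waves = np.stack([np.maximum(6 - np.abs(i - c), 0) for c in (11, 15, 7)])
    pairs = np.array([[0, 1], [0, 2], [1, 2]])    # base-wave pair per class
    y = rng.integers(0, 3, n_samples)             # class attribute (3 values)
    u = rng.random((n_samples, 1))                # random mixing coefficient
    a, b = pairs[y].T
    # Convex combination of the two waves, plus N(0, 1) noise on each column.
    X = u * waves[a] + (1 - u) * waves[b] + rng.normal(size=(n_samples, 21))
    return X, y

X, y = make_waveform(2_000_000)                   # ~340 MB of float64
```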

3 Analysis with SIPINA

3.1 Preparing the learning process

After launching Sipina, we click on the FILE / OPEN menu, choose the WEKA (ARFF) file format, and select the WAVE2M.ARFF data file. During the loading, the progress bar may seem frozen; it does not matter, the process is running. After 18 seconds, the values are displayed in the visualization grid: we have 22 columns and 2,000,000 rows.

Next, we define the types of the variables and specify the train and test sets. To define the types of the variables, we click on the ANALYSIS / DEFINE CLASS ATTRIBUTES menu. Using drag and drop, we set ONDE as CLASS and the others (V1 to V21) as ATTRIBUTES.

The resulting selection appears in the left part of the main window. To subdivide the examples, we click on the ANALYSIS / SELECT ACTIVE EXAMPLES menu and select the RANDOM SAMPLING 50/50 option.

3.2 The usual decision tree learning algorithm

Selecting the method and its settings. To specify the learning method, we click on the INDUCTION METHOD / STANDARD ALGORITHM menu.

We select the IMPROVED CHAID approach in the dialog box. When we click OK, the settings of the method appear; we do not modify anything for the moment. Note that the number of levels of the tree is limited to 4 (MAX DEPTH = 4). This setting has no theoretical justification; it simply controls the size of the tree, so as to obtain an easy-to-read model.

Training process. We launch the analysis by clicking on the ANALYSIS / LEARNING menu.
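For readers who want to follow along in code, here is a rough scikit-learn analogue of this protocol, continuing from the generator sketch above. scikit-learn implements CART rather than Improved CHAID, and its depth convention may differ from Sipina's, so the tree and its cut points will not match; only the experimental setup carries over:

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 50/50 random split, as with ANALYSIS / SELECT ACTIVE EXAMPLES in Sipina.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

# Depth-limited tree, mimicking MAX DEPTH = 4 (CART, not Improved CHAID).
t0 = time.perf_counter()
tree_full = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print(f"training time: {time.perf_counter() - t0:.1f} s")
```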

We obtain a tree with 8 leaves and, indeed, 4 levels (the first level being the root of the tree). The computation time is 93 seconds. Looking at the details of the tree (Figure 1), we note that the splitting attribute for the root node is V07; the nodes on the second level are then split with the same attribute (V11), but not with the same cut points (3.57 on the left, 3.36 on the right); and so on.

Figure 1 Decision tree, without the local sampling strategy

Evaluation on the test set. To compute the error rate on the test set, we click on the ANALYSIS / TEST menu. We can compute the confusion matrix on the training sample or on the test sample; we choose the second option (Figure 2). We obtain the confusion matrix and the error rate (34%, Figure 3). We keep this value in mind: it is the reference against which we will evaluate the local sampling approach for decision tree learning.

Figure 2 Assessment on the test set
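In the scikit-learn sketch, the equivalent evaluation takes one call each to confusion_matrix and accuracy_score (reusing the names defined above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = tree_full.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print("test error rate:", 1.0 - accuracy_score(y_test, y_pred))
```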

Figure 3 Confusion matrix and test error rate

3.3 Local sampling strategy for decision tree induction

We quit the current analysis by clicking on the WINDOW / CLOSE ALL menu. We now want to implement the local sampling strategy during the learning process. Again, we select the INDUCTION METHOD / STANDARD ALGORITHM menu and, as above, choose the IMPROVED CHAID method. In the settings dialog, the SAMPLING tab is the important one here: rather than using all the examples on a node, we perform a sampling with n = 5000. For each node, all the calculations (detecting the right cut point, selecting the best splitting attribute) are performed on the local sample. We validate this choice and launch a new analysis by clicking on the ANALYSIS / LEARNING menu.

We obtain a tree after 3 seconds! The computation time is drastically reduced. Sipina displays the class distribution on the whole dataset for each node, but the computations related to the tree induction are performed on the samples.

Figure 4 Decision tree with the local sampling strategy (n = 5000)

The resulting tree (Figure 4) is a little different from the previous one (Figure 1). The first three levels of the trees are identical; from the 4th level on, there are some differences. But what about their behaviors? To check this, we apply the new tree to the test set [6]. We click on the ANALYSIS / TEST menu and select the test set (INACTIVE EXAMPLES OF THE DATABASE). We obtain a new confusion matrix; the test error rate is 34%. The local sampling approach preserves the classifier performance while drastically reducing the computation time. The approach is very attractive.

[6] Other approaches are possible: for instance, comparing the predictions of the two trees on the test set, or checking whether the examples falling in the same leaf of the first tree also fall in the same leaf of the second tree.
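The first idea in footnote [6] can be tried directly in the scikit-learn sketch. scikit-learn has no per-node sampling option, so as a crude stand-in we fit a second tree on a global subsample of 5,000 examples and measure how often the two trees agree on the test set:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Global subsample of n = 5000 (an approximation of, not an equivalent to,
# Sipina's node-local sampling).
idx = np.random.default_rng(2).choice(len(X_train), size=5000, replace=False)
tree_sampled = DecisionTreeClassifier(max_depth=4).fit(X_train[idx],
                                                       y_train[idx])

# Agreement: same predicted class, regardless of whether it is correct.
agreement = np.mean(tree_full.predict(X_test) == tree_sampled.predict(X_test))
print(f"prediction agreement on the test set: {agreement:.3f}")
```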

3.4 Decreasing the sample size

The main difficulty is to determine the appropriate sample size. Sipina uses a very conservative default value (n = 5000), but for most databases we can use lower values. On the WAVEFORM dataset, we observe that about a hundred instances are sufficient to perform the calculations. We varied the sample size (n = 100, 200, 400, 800, 1600, 3200, 6400) and measured both the test error rate (Figure 6) and the computation time (Figure 5).

Figure 5 Computation time (ms.) according to the sample size n

Figure 6 Test error rate (%) according to the sample size n

As we can see, we obtain satisfactory results from n = 100 on this dataset. Note that the learning algorithm and its settings can also influence the appropriate sample size: in our paper (Chauchat and Rakotomalala, IFCS 2000), we used the C4.5 algorithm, which builds deeper trees, and in that context the error rate becomes stable from n = 300. As for the computation time, it increases linearly with the sample size.

4 Analysis with TANAGRA

Tanagra takes another approach: the continuous descriptors are sorted before the decision tree learning process, and an index of the sorted examples is kept in memory for each continuous descriptor. For our dataset, with 1,000,000 learning examples and 21 descriptors, this requires about 21 × 4 bytes × 1,000,000 ≈ 80 MB of memory. I chose this approach because memory was abundant when I developed Tanagra (2004), compared with the time when I programmed Sipina (1998). The aim is to speed up the learning process while still using all the available examples. It is suitable for most datasets, but it can become a problem when we handle a very large database with a high number of continuous descriptors.
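The text describes this data structure only at a high level; the following is a minimal sketch of the presorted-index idea under that description (my assumption, not Tanagra's actual source). X is the (1,000,000 × 21) training matrix:

```python
import numpy as np

def build_sorted_index(X):
    # One int32 index per continuous attribute, computed once before the
    # induction: 21 attributes x 4 bytes x 1,000,000 rows ~ 80 MB.
    return [np.argsort(X[:, j]).astype(np.int32) for j in range(X.shape[1])]

def scan_node_in_order(X, y, j, node_mask, sorted_index):
    # Visit the node's examples in ascending order of attribute j without
    # re-sorting; the split criterion can be updated incrementally between
    # consecutive values to find the optimal cut point.
    for i in sorted_index[j]:
        if node_mask[i]:
            yield X[i, j], y[i]
```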

4.1 Diagram creation and data importation

After launching Tanagra (version 1.4.33 or later), we create a diagram by clicking on the FILE / NEW menu and select the WAVE2M.ARFF data file.

In the next step, we want to subdivide the database into train and test samples. We use the SAMPLING component (INSTANCE SELECTION tab). We open the settings dialog by clicking on the contextual PARAMETERS menu and use half of the database for the training phase.

To define the status of the variables, we use the DEFINE STATUS component, via the shortcut in the toolbar.

We set ONDE as TARGET and the others as INPUT.

4.2 ID3 learning algorithm

We use the ID3 algorithm (SPV LEARNING tab) because it is similar to the IMPROVED CHAID of Sipina: we can limit the depth of the tree. We insert the component into the diagram, click on the SUPERVISED PARAMETERS menu, and set MAX DEPTH OF THE TREE = 4.

We validate the choice and launch the learning process by clicking on the VIEW menu. The execution time is 21 seconds. With the same settings, we obtain a tree with 4 levels (8 leaves); the preliminary computation of the sorted index dramatically reduces the execution time while still using all the available examples on each node (93 sec. for Sipina, 21 sec. for Tanagra).

4.3 Test error rate

We insert the DEFINE STATUS component again to evaluate the tree on the test set. We set ONDE as TARGET and the predicted values supplied by the tree (PRED_SPVINSTANCE_1) as INPUT. Then we insert the TEST component (SPV LEARNING ASSESSMENT tab) into the diagram and click on the PARAMETERS menu.

We check that the confusion matrix is computed on the test sample (UNSELECTED instances), then click on the VIEW menu. The test error rate is 32%. This value is similar to the one supplied by Sipina; it is not identical because, even though Sipina and Tanagra use the same train/test proportion, they do not draw the same instances for the train and test samples.

5 Conclusion

As we have seen in this tutorial, working on a sample at each node dramatically decreases the execution time without loss of accuracy: the generalization error rate of the resulting classifier is similar to the one computed on the whole set of examples. The local sampling strategy can therefore be very useful when we handle a very large database. The main results are summarized in the following table:

                                     Execution time (sec.)   Test error rate (%)
Sipina without local sampling                 93                     34
Tanagra                                       21                     32
Sipina with sampling (n = 5000)                3                     34

The improvements in Tanagra speed up the calculations compared with the traditional implementation in Sipina (without sampling). With the local sampling strategy, the execution time is reduced even more strongly (30 times!), and the generalization error rate is not modified.

Let us be honest: the configuration in this tutorial is very favorable. We create a small tree, and all the descriptors are continuous. On the contrary, when we create a very large tree (with the C4.5 method, for instance) or when most of the descriptors are discrete, the improvement is less spectacular. Of course, the user can evaluate the efficiency of this approach on his own dataset.

Finally, this tutorial deals with local sampling. Another approach is global sampling: we draw a sample before the usual learning process. Determining the right sample size remains the main problem in this context (see for instance Quinlan's windowing approach, 1993; or John's dynamic sampling approach, 1996).
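For contrast with the node-local strategy, global sampling amounts to one line in the scikit-learn sketch above; the sample size used here (100,000) is purely illustrative, and choosing it well up front is precisely the open problem just mentioned:

```python
from sklearn.tree import DecisionTreeClassifier

# train_test_split already shuffled X_train, so a head slice is a random sample.
n = 100_000
tree_global = DecisionTreeClassifier(max_depth=4).fit(X_train[:n], y_train[:n])
```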