Decision Trees and Cost Estimating

Decision Trees and Cost Estimating
Josh Wilson, Booz Allen Hamilton

Agenda
- Motivation
- Integration of data science methods within the cost estimating field
- Obligatory data science slide
- Decision trees: definition & explanation, strengths & weaknesses, extensions
- Applicability to cost estimating: data challenges
- Example: Can we predict installation cost overruns?
- Conclusions

Motivation
- Background in cost estimating
- Interest in data science
- Exploring the application of data science to cost estimating

Data Science? http://www.prooffreader.com/2016/09/battle-of-data-science-venndiagrams.html

Decision Trees: First, a clarification
There are two types of decision trees:
- Decision trees for decision analysis: model decisions and their consequences (https://en.wikipedia.org/wiki/decision_tree). These trees are NOT the topic of this presentation.
- Decision trees for prediction: map observations to outcomes (https://en.wikipedia.org/wiki/decision_tree_learning). These trees ARE the topic of this presentation.

Decision Trees: What are they?
- A nonparametric supervised learning method
  - Nonparametric = makes no assumptions about the underlying data distributions
  - Supervised = the model learns from examples where the outcome is known
- Can be used for classification or regression
  - Classification if we are trying to predict a categorical outcome
  - Regression if we are trying to predict a continuous outcome
- Makes predictions by learning simple if-then-else decision rules from the data: recursively partition the data into subgroups and apply a simple prediction model in each
- Example: predicting passenger survival on the Titanic
  - If sex is female, predict the passenger survived; else
  - If age > 9.5, predict the passenger died; else (and so on)

Decision Trees: How do they work? (the basic idea)
- At each step, split the data to maximize the homogeneity of the target variable within the resulting subgroups, i.e. separate the different outcomes as cleanly as we can
- The algorithm scans all possible splits and chooses the best one
- The process continues on the resulting subgroups until a stopping condition is reached:
  - The maximum number of levels is reached
  - All subgroups are smaller than some specified threshold size
  - No possible split improves the result
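The split search described above can be sketched in a few lines of plain Python: compute an impurity score (Gini here) for each candidate threshold on a feature, and keep the threshold that minimizes the weighted impurity of the two resulting subgroups. The ages and outcomes below are toy values for illustration, not real Titanic data.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Scan every threshold on one numeric feature; return (threshold, weighted impurity)."""
    n = len(values)
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        if not left or not right:
            continue  # a split must put data on both sides
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (t, score)
    return best

# Toy example: ages of six passengers and their outcomes.
ages = [4, 8, 9, 30, 45, 60]
outcomes = ["survived", "survived", "survived", "died", "died", "died"]
print(best_split(ages, outcomes))  # (9, 0.0): splitting at age 9 separates the classes perfectly
```

A real tree repeats this search over every feature at every node, which is why the greedy scan is cheap enough to run recursively.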

Decision Trees: How do they work? (good vs. bad splits)
- A good split separates the classes
- A bad split leaves the classes still impure

Decision Trees: How do they work? (Titanic example)
We can predict survival using Titanic passenger demographic info:
- If sex is female, predict the passenger survived; else
- If the (male) passenger's age > 9.5, predict the passenger died; else
- If the (male, child) passenger is traveling with 3+ family members, predict the passenger died; else
- Predict the passenger survived
(sibsp = number of siblings/spouses, i.e. family members, onboard)
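The rules above are already a complete predictor; written out as code they make the point that a fitted tree is nothing more than nested if-then-else. The rule structure and thresholds (age > 9.5, 3+ family members) come from the slide; the function and argument names are illustrative.

```python
def predict_titanic(sex, age, sibsp):
    """Predict Titanic survival using the three decision rules from the slide."""
    if sex == "female":
        return "survived"
    if age > 9.5:      # male adults
        return "died"
    if sibsp >= 3:     # male children traveling with 3+ family members
        return "died"
    return "survived"  # male children in small families

print(predict_titanic("female", 30, 0))  # survived
print(predict_titanic("male", 40, 1))    # died
print(predict_titanic("male", 5, 4))     # died
print(predict_titanic("male", 5, 0))     # survived
```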

Decision Trees: Strengths
- Easy to interpret, explain, and visualize
- Require little data preparation or cleaning
- Can handle both numerical and categorical input data
- Robust to outliers and missing data
- Handle nonlinear relationships and correlated variables
- Ignore useless variables
- Automate modeling of variable interactions, e.g. perhaps age is important if you're male, but not if you're female

Decision Trees: Weaknesses
- Susceptible to overfitting
  - Overfitting = the model captures random peculiarities of the training data and does not generalize well to new data
- Splitting decisions tend to favor categorical variables with many levels
  - Consider a full name variable in a tree to predict Titanic survival: it can split the training data perfectly yet tells us nothing about new passengers
- The greedy algorithm makes the best current decision, which may be bad in the long term

Decision Trees: Extensions
Ensemble method = a prediction based on multiple individual models
- Random Forests
  - An ensemble of many individual decision trees, each built from a subset of the data and/or features
  - Generalize to new data better than single trees
- Boosted Trees
  - An ensemble method where each new tree is built to improve the performance of the existing ensemble (the sum of the trees so far), e.g. by increasing the weight of incorrectly classified data points
  - The overall prediction is based on the individual trees weighted by accuracy
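The combination step of the two ensemble styles above can be sketched without fitting any trees at all: a random forest combines trees by majority vote, and one common way boosted ensembles combine trees is an accuracy-weighted vote. The "trees" here are stand-ins, just lists of class labels already predicted.

```python
from collections import Counter

def majority_vote(predictions):
    """Random-forest-style combination: the most common prediction wins."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Boosting-style combination: each tree's vote is weighted by its accuracy."""
    totals = {}
    for label, w in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + w
    return max(totals, key=totals.get)

# Three trees vote "over", "under", "over": the forest predicts "over".
print(majority_vote(["over", "under", "over"]))        # over
# A more accurate tree (weight 0.90) outvotes a weaker one (0.55).
print(weighted_vote(["over", "under"], [0.55, 0.90]))  # under
```

The point of both schemes is the same: individual trees overfit in different ways, so averaging their votes cancels much of the noise.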

Decision Trees: Applicability to Cost Estimating
- Another method to predict cost, or things useful for predicting cost. Examples:
  - Efforts likely to result in cost over/under runs
  - Categories of SW code growth
- Less impacted by certain types of cost estimating challenges:
  - Messy data: mixture of numeric/categorical? Outliers? Missing values? Inconsistent units across different variables?
  - Time constraints: Which independent variables are useful? Which are correlated?

Example: Can we predict installation cost overruns? Data / Background
Presented at the ICEAA 2017 Professional Development & Training Workshop - www.iceaaonline.com/portland2017
- Raw installation data is from the SPIDER database (SPIDER = SPAWAR PEO C4I Information Data Enterprise Repository)
- Data for >6k install efforts from a single program office
- 141 columns of data: mostly text/categorical, some numeric, some dates
  - Descriptors of the effort: ship type, location, system, type of install, etc.
  - Cost estimates: includes the initial estimate and the actual cost if completed
  - Key event dates: ship availability, planned installation dates, etc.
- Lots of missing data: eliminating every row with any missing value leaves 0 rows

Example: Can we predict installation cost overruns? General Process
- Data preprocessing
  - Filtered data to remove incomplete efforts
  - Removed various ID number columns
  - Converted dates to number of days prior to ship availability
- Defined the target variable Cost Growth Category as:
  - Over Low if 0% < Cost Growth % < 40%
  - Over High if Cost Growth % > 40%
  - Under Low if -40% < Cost Growth % < 0%
  - Under High if Cost Growth % < -40%
- Split the data into training and test datasets
- Built various models to predict Cost Growth Category
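The target-variable definition can be written as a small binning function. Two assumptions are labeled here: the Under High bound is read as Cost Growth % < -40% (the symmetry of the other three categories implies it), and values exactly on a boundary are assigned to the nearer-zero category, a choice the slide leaves open. Growth is expressed as a fraction (0.25 = 25%).

```python
def cost_growth_category(growth):
    """Bin a cost-growth fraction into the four categories from the slide.

    Boundary handling (exact 0%, +/-40%) is an assumption; the slide's
    intervals are open and do not say where boundary values go.
    """
    if growth > 0.40:
        return "Over High"
    if growth > 0.0:
        return "Over Low"
    if growth >= -0.40:
        return "Under Low"
    return "Under High"

print(cost_growth_category(0.25))   # Over Low
print(cost_growth_category(0.55))   # Over High
print(cost_growth_category(-0.10))  # Under Low
print(cost_growth_category(-0.60))  # Under High
```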

Example: Can we predict installation cost overruns? Confusion Matrix for Characterizing Classification Errors
- Confusion matrix = a visualization of predicted versus actual outcomes
- A model is good if the matrix has high values along the diagonal and low values elsewhere
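A confusion matrix is just a count of (actual, predicted) pairs, one row per actual class and one column per predicted class. A minimal sketch with toy labels (not the actual SPIDER results):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows = actual class, columns = predicted class, cells = counts."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["over", "under"]
actual    = ["over", "over", "under", "under", "under"]
predicted = ["over", "under", "under", "under", "over"]
for row in confusion_matrix(actual, predicted, labels):
    print(row)
# [1, 1]
# [1, 2]
```

The diagonal cells (1 and 2) are the correct predictions; the off-diagonal cells are the two kinds of misclassification.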

Example: Can we predict installation cost overruns? Naïve Results: Baseline for Comparison
- What if we always predict the most common outcome from our training data?
- Then we correctly predict that outcome but miss everything else: 31% prediction accuracy
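The naive baseline is worth writing down because every model must beat it to be useful: always predict the most common training outcome and measure accuracy. The class counts below are toy values chosen so the most common class is 31% of the data, matching the slide's baseline; they are not the real SPIDER distribution.

```python
from collections import Counter

def baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for y in test_labels if y == most_common)
    return hits / len(test_labels)

# Toy distribution: the most common category covers 31 of 100 efforts.
train = (["Over Low"] * 31 + ["Over High"] * 25
         + ["Under Low"] * 24 + ["Under High"] * 20)
test = train  # evaluating on the same distribution, purely for illustration
print(baseline_accuracy(train, test))  # 0.31
```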

Example: Can we predict installation cost overruns? Current Results: Boosted Tree Model
- Almost 60% prediction accuracy
- Highest accuracy for the extreme cases (i.e. high underruns and high overruns)
- Most important features = ship avail duration, lead time for ship check, drawings, system test

Example: Can we predict installation cost overruns? Next Steps
- Find other sources of complementary data: performer? Weather/temperature/season?
  - In general, having more/better data is much better than having a better model!
- Feature engineering: number of concurrent installations?
- Direct prediction of install cost (i.e. regression instead of classification)

Conclusions
Decision trees are a viable tool for the cost estimator:
- Easy to interpret and explain
- Robust to common deficiencies in data quality
- Little overhead for variable screening
- Ensemble methods address the weaknesses of single-tree models
- A good method to expose non-technical people to data science approaches

Way Forward
- The learning curve can be a challenge, but self-study resources are available:
  - Python: http://scikit-learn.org/stable/modules/tree.html
  - R: http://www.statmethods.net/advstats/cart.html
  - Titanic tutorials: https://www.kaggle.com/c/titanic#tutorials
- Other methods that may be appropriate when considering decision trees:
  - Naïve Bayes
  - k-nearest Neighbors (k-NN)
  - Logistic Regression / Linear Regression
  - Support Vector Machines (SVM)

Questions?

BACKUP

All Model Accuracy Results
- Most Common Occurrence (Naïve Model) = 31%
- Logistic Regression = 38%
- Logistic Regression + PCA Transform = 48%
- Single Decision Tree Classifier = 50%
- Support Vector Classifier = 50%
- Random Forest Classifier = 55%
- Gradient Boosted Tree Classifier = 59%

Decision Trees: Impurity Functions
Various decision tree algorithms have been implemented, and various impurity metrics are used to measure node homogeneity (where p_i is the proportion of class i in a node):
- ID3, C4.5, and C5.0 use entropy / information gain: H = -sum_i p_i * log2(p_i)
- CART uses Gini impurity for classification: G = 1 - sum_i p_i^2
- CART uses variance reduction for regression: choose the split that most reduces sum (y - mean(y))^2 within the resulting subgroups
Any strictly concave function of the class proportions can serve as an impurity measure.
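The three impurity metrics above are all one-liners over a node's class proportions (or, for regression, its target values). As a sanity check, a pure node scores 0 on every metric, and a 50/50 two-class node scores 1 bit of entropy and 0.5 Gini impurity.

```python
import math

def entropy(probs):
    """Entropy of a node's class proportions, in bits (ID3 / C4.5 / C5.0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """Gini impurity of a node's class proportions (CART classification)."""
    return 1.0 - sum(p * p for p in probs)

def variance(values):
    """Variance of a node's target values (CART regression)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

print(entropy([0.5, 0.5]))        # 1.0 (maximally impure two-class node)
print(gini_impurity([0.5, 0.5]))  # 0.5
print(gini_impurity([1.0, 0.0]))  # 0.0 (pure node)
print(variance([7.0, 7.0]))       # 0.0 (pure regression node)
```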