Optimal Task Assignment within Software Development Teams
Caroline Frost
Stanford University, CS221, Autumn 2016

Introduction

The number of administrative tasks, documentation, and processes grows with the size of the code base and the size of the software development team. These new organizational needs increase overhead for an organization and slow down the software development process. One such need is assigning work to engineers. Usually, a project manager takes on this task, matching the best engineer to a given job. The best engineer can mean several different things depending on the context. It can, naturally, be the engineer with the most expertise in the domain. It can also be an engineer with some expertise who is available to take on another task. Alternatively, it could be the engineer who has expressed interest in developing a new skill by taking on a task outside their comfort zone. There may also be several engineers who could all be considered the best, with the assignment made arbitrarily from that set. This project attempts to automate the project manager's process of assigning tasks to engineers.

Related Work

Automating the assignment process involves several subproblems. Scheduling and duration estimation are important parts of assigning the correct engineer, as is understanding an engineer's current relationship with a codebase. Choosing the correct engineer also requires minimizing the cost of interactions across design teams, an important characteristic of complex engineered systems [1]. A group at Pakistan's National University of Sciences and Technology published a paper in 2016 presenting a mathematical model that minimizes cost and time and balances load for software development teams using multiple levels of clustering; the group focused on reducing intercommunication cost and used clustering both to predict work division and to estimate duration [2]. Optimal task assignment is NP-hard [3], yet converting the problem into a maximization problem, in which large communication and execution penalties are identified and avoided, appears to give better results when tradeoffs between communication and task allocation are considered. Another paper treats optimal allocation as minimizing the completion time of a project and focuses on interleaving tasks to minimize the time an employee or team is blocked [4]. Most studies in the space largely ignore which workers are best suited for particular tasks, instead treating workers as interchangeable. Recently, a machine learning framework that factors in programmer attributes was proposed for software development, with a twist: a simple EEG device detected programmer mood, and the final task assignment was solved with a PDTS solver [5]. Additionally, a study published in Empirical Software Engineering in 2015 applied a stacked generalization learner to 50,000 bug reports, reaching accuracies between 50% and 89% [6]. Most efforts have focused on Naive Bayes and Support Vector Machine classifiers over other machine learning techniques for classifying tasks. This project instead uses neural networks, in a way that may scale easily to multiple organizations.

Task definition

For this project, the optimal engineer is defined as the engineer with the most domain expertise. Given a group of tasks and the assigned engineer for each, it is straightforward to ascertain engineers' areas of expertise; by contrast, there is no data on which new skills each engineer wants to gain, and it is difficult to tell how large a task's scope is or how much time it will take to complete. The project manager on the software development team is treated as the oracle in this project: they know the expertise of the developers and the scope of all tasks, so this project assumes the project manager chooses the single optimal assignment each time. Unfortunately, if a commit message is vague, it is difficult for a model to assign an engineer to the task. Fortunately, in a public, open-source repository there are incentives for committers to describe their work in more detail, since many others will read their messages. Additionally, if the author of a commit is not the engineer with the highest level of knowledge but instead a seemingly random choice, this will confuse the model; the project manager may have had good reasons for assigning that particular engineer, but those reasons are often not logged online. A baseline model would always predict the engineer with the highest number of contributions for a given task; on the final data set used in this project, this gives about 10% accuracy. This project uses a few different machine learning algorithms to learn the correct assignment. With enough data, these models may be able to build reasonably good profiles for each engineer despite the data limitations.
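This baseline fits in a few lines with scikit-learn's DummyClassifier; the sketch below is illustrative only, and assumes a feature matrix X and an array of commit author labels y, which are built in the next section.

    # Minimal majority-class baseline: always predict the engineer with the
    # highest number of contributions. X and y are assumed to exist already.
    from sklearn.dummy import DummyClassifier

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    print("baseline accuracy:", baseline.score(X, y))  # about 10% on this data set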

Data collection and feature extraction

GitHub hosts a large, open-source collection of task assignment data in the form of commits. This project took the past five years of commit history from ten popular projects in the Docker organization on GitHub, namely "docker", "machine", "docker.github.io", "notary", "compose", "vpnkit", "swarmkit", "datakit", "libnetwork", and "infrakit". The set of available engineers includes only engineers who had direct access to commit to those repositories, and each engineer/commit pair represents the optimal engineer for the task contained in that commit. Each commit contains the author's name, the date, the files changed, the repository changed, the languages associated with the changed files, and the commit message. The commit messages carry a significant amount of information; tokenizing by words and by character sequences of different lengths both look promising. Qualitatively, the top single words used in commit messages, such as "commit", "completion", "fix", and "make", are not very useful. Top two-word phrases are more useful for describing the task and extracting features, and three-word phrases more useful still. However, there were significantly more instances of two-word phrases than of three-word phrases, indicating that two-word phrases may group tasks in a way that favors generalization.
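A feature set along these lines can be built with scikit-learn's CountVectorizer; the following is a minimal sketch under the assumption that commit messages have already been pulled into a list, using the 1,500-phrase vocabulary size that appears in the feature list below.

    # Binary bag-of-bigrams features from commit messages. The two example
    # messages are placeholders standing in for the real Docker commit history.
    from sklearn.feature_extraction.text import CountVectorizer

    messages = ["fix race condition in network driver",
                "add unit tests for volume mounts"]
    vectorizer = CountVectorizer(ngram_range=(2, 2),  # two-word phrases only
                                 max_features=1500,   # 1,500 most common phrases
                                 binary=True)         # presence/absence, not counts
    X_phrases = vectorizer.fit_transform(messages)    # sparse binary matrix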

Support vector machine and multi-layer perceptron algorithms were run on different subsets of the complete feature set: 1) the date of the commit, 2) the files changed, 3) the repository of the commit, 4) the languages used in the changed files, and 5) 1,500 popular two-word phrases, all codified as binary features during feature extraction. Pulling commit data from the past five years across the ten repositories yielded more than 44,000 commits from more than 13,000 possible authors. If the data is cleaned to contain only authors who have committed more than 500 times, 27,000 commits across 26 authors remain.

Experiments and results

The first algorithm used to make these predictions was unsupervised k-means clustering, with the two hundred most common two-word phrases and the languages as features, for a total of 252 features, over more than 500 data points. I allowed as many clusters as available engineers: perfect clustering would have grouped together all tasks assigned to each particular engineer. Testing the model on the set used to train it gave an accuracy of about 30%; with 15-fold cross-validation over the roughly 500 data points, the mean accuracy was about 10%, with a standard deviation of 7%.

[Figure 1 above; Figure 2 below]
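A minimal sketch of this clustering experiment follows, assuming the 252-column feature matrix X and author labels y from above; scoring each cluster by its majority author is one plausible way to compare clusters against the true assignments, since the scoring method is not spelled out here.

    # k-means with one cluster per engineer, scored by majority vote.
    import numpy as np
    from sklearn.cluster import KMeans

    y = np.asarray(y)
    n_engineers = len(np.unique(y))
    km = KMeans(n_clusters=n_engineers, random_state=0).fit(X)

    preds = np.empty_like(y)
    for c in range(n_engineers):
        members = km.labels_ == c
        if members.any():
            authors, counts = np.unique(y[members], return_counts=True)
            preds[members] = authors[np.argmax(counts)]   # cluster's majority author
    print("training-set accuracy:", np.mean(preds == y))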

This was not particularly surprising, as engineers within a team tend to resemble one another; clustering might be more accurate with k set to the number of teams within the organization. It also suggested that the other algorithms might confuse engineers with similar skills, limiting the upper bound on accurately identifying the correct engineer. The next step was to apply a support vector machine (SVM) using scikit-learn. A one-versus-rest (OVR) SVM model was trained on more than 44,000 data points (Figure 1). Increasing the feature count from 252 to 852 by adding more common-phrase features made a small impact on accuracy, with a greater impact when testing on the training set, suggesting overfitting. Adding features identifying the repository of the task increased accuracy further still and brought the number of features to 863 (Figure 1). Across these three models, the 252-feature SVM had the lowest standard deviation under 5-fold cross-validation at 0.23%, though not by much: the 852- and 863-feature SVMs had standard deviations of 0.28% and 0.36% respectively (Figure 3). A closer look at the errors showed that all of these models had difficulty classifying engineers who did not commit very often. There is a core group of 26 engineers who have each contributed more than 500 times within the Docker organization, so the data was cleaned to contain only commits from these 26 people, bringing the total down to roughly 27,000 data points. Using the prior 863-feature set, both a one-versus-rest and a one-versus-one (OVO) SVM model were trained on these 27,000 data points. Under 5-fold cross-validation, the 863-feature one-versus-rest model generalized better, both in accuracy (Figure 2) and in standard deviation (Figure 3). In an attempt to further improve the one-versus-rest SVM, additional features describing the next 700 most popular phrases were added; running 5-fold cross-validation with this model increased accuracy to about 30% (Figure 2).

[Figure 3, left; Figure 4, right]
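The OVR/OVO comparison might look like the sketch below, again assuming the cleaned feature matrix X and labels y; LinearSVC is an assumed base estimator, since the kernel used is not stated.

    # One-vs-rest vs. one-vs-one SVMs compared with 5-fold cross-validation.
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    for name, model in [("OVR", OneVsRestClassifier(LinearSVC())),
                        ("OVO", OneVsOneClassifier(LinearSVC()))]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")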

Training a neural network was feasible given the large amount of data available, and a neural network in the form of a multilayer perceptron (MLP) improved this statistic further. A three-layer MLP with relatively small layer sizes of 100, 50, and 10 was trained. Using 863 features, the MLP outperformed the SVM by nearly 10%; using 1,562 features, it outperformed the SVM by more than 5%. To train neural networks more quickly, I switched from scikit-learn to TensorFlow at this point. For faster testing, I created a random training set of more than 23,700 data points and a random test set of more than 3,700 data points, with no data point shared between the two sets. Using TensorFlow's deep neural network (DNN) classifier to train a three-layer network with layer sizes of 10,000, 1,000, and 100, then testing the trained model on the held-out test set, yielded an accuracy of 39% (Figure 4). To increase performance further, I used principal component analysis (PCA) to shrink the feature set first to 1,000 features, then to 800, and finally to 500. Post-PCA, the 1,000-feature DNN model had the highest accuracy, at 41% (Figure 4).

[Figure 5, below]
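A rough sketch of the PCA-plus-DNN pipeline follows, assuming dense arrays X_train, y_train, X_test, and y_test with integer labels 0-25 for the 26 core engineers; it uses tf.keras rather than the TensorFlow DNN classifier API of 2016, and the epoch and batch-size choices are illustrative.

    # PCA down to 1,000 components feeding a 10000/1000/100 network.
    import tensorflow as tf
    from sklearn.decomposition import PCA

    pca = PCA(n_components=1000).fit(X_train)
    X_train_p, X_test_p = pca.transform(X_train), pca.transform(X_test)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10000, activation="relu", input_shape=(1000,)),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(26, activation="softmax"),  # one unit per engineer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train_p, y_train, epochs=10, batch_size=128)
    print(model.evaluate(X_test_p, y_test))  # [loss, accuracy] on the test set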

Error analysis

There were two large sources of error in the model. First, classification confusion concentrated in several areas, as seen in the confusion matrix for the test set under the 1,000-feature DNN model (Figure 5). For example, engineers A and C were consistently confused with B during classification. This suggests that engineers in certain disciplines tend to resemble one another, and that there can be several correct engineers for a task. The second large source of error stemmed from using common two-word phrases as features. A few engineers tended to use phrases that did not describe the task and occurred only in their own commit messages. These phrases acted as a signature, and this source of error may explain the higher F1 scores for engineer X (Figure 6). It seemed that the model was learning engineers' writing styles, allowing it to identify the right engineer, perhaps unfairly.

[Figure 6, below]

Conclusion and future work

Although the final 41% accuracy of the 1,000-feature DNN model is a large improvement over the initial 16% accuracy of the 252-feature SVM model, there is still a lot of work to be done in this space. From a macro viewpoint, this model is not ready for production: many software development teams detail projects on other sites, so GitHub data is not necessarily relevant to them. On a more micro level, this project would benefit from additional work on the commit message features, perhaps by tokenizing the words using TensorFlow's word2vec. Finally, it would be interesting to separate classifying an engineer by writing style from identifying an engineer by task experience; this may require finding a cleaner data set.

References

[1] Braha, Dan. "Partitioning Tasks to Product Development Teams." Massachusetts Institute of Technology, 2002.
[2] Iftikhar, Sundas, et al. "Optimal Task Allocation Algorithm for Cost Minimization and Load Balancing of GSD Teams." National University of Sciences and Technology, Pakistan, 2016.
[3] Kopidakis, Y., M. Lamari, and V. Zissimopoulos. "On the Task Assignment Problem: Two New Efficient Heuristic Algorithms." Journal of Parallel and Distributed Computing, 1997.
[4] Jalote, Pankaj, and Gourav Jain. "Assigning Tasks in a 24-Hour Software Development Model." Indian Institute of Technology, 2016.
[5] Joseph, Harry Raymond. "Software Programmer Management: A Machine Learning and Human Computer Interaction Framework for Optimal Task Assignment." TUM, Germany, 2015.
[6] Jonsson, Leif, et al. "Automated Bug Assignment: Ensemble-based Machine Learning in Large Scale Industrial Contexts." Empirical Software Engineering, 2015.