Optimal Task Assignment within Software Development Teams
Caroline Frost
Stanford University, CS221, Autumn 2016

Introduction

The number of administrative tasks, documentation, and processes grows with the size of the code base and the size of the software development team. These new organizational needs increase overhead for an organization and slow down the software development process. One such need is assigning work to engineers. Usually, a project manager takes on this task, matching the best engineer to a given job. The best engineer can mean several different things depending on the context. It can, naturally, be the engineer with the most expertise in the domain. It can also be an engineer with some expertise who is available to take on another task. Alternatively, it could be the engineer who has expressed interest in developing a new skill by taking on a task outside their comfort zone. There may also be several engineers who could all be considered the best, with the assignment made arbitrarily from that set. This project attempts to automate the project manager's process of assigning tasks to engineers.

Related Work

Automating the assignment process involves several subproblems. Scheduling and duration estimation are important parts of assigning the correct engineer, as is understanding an engineer's current relationship with a codebase. Choosing the correct engineer also requires minimizing the cost of interactions across design teams, an important characteristic of complex engineered systems [1]. A group at Pakistan's National University of Sciences and Technology published a paper in 2016 presenting a mathematical model that minimizes cost and time and balances load for software development teams using multiple levels of clustering; the group focused on reducing intercommunication cost and used clustering both to predict work division and to estimate duration [2]. Optimal task assignment is NP-hard [3], yet converting the problem into a maximization problem, in which large communication and execution penalties are identified and avoided, appears to give better results when tradeoffs between communication and task allocation are considered. Another paper treats optimal allocation as minimizing the completion time of a project and focuses on interleaving tasks to minimize the time an employee or team is blocked [4]. Most studies in the space largely ignore which workers are best suited for particular tasks, instead treating workers as interchangeable. Recently, a machine learning framework that factors in programmer attributes was proposed for software development, with a twist: a simple EEG device detected programmer mood, and the final task assignment was solved with a PDTS solver [5]. Additionally, a study published in Empirical Software Engineering in 2015 applied a stacked generalization learner to 50,000 bug reports, reaching accuracies between 50% and 89% [6]. Most efforts have focused on Naive Bayes and Support Vector Machine classifiers over other machine learning techniques for classifying tasks. This project instead uses neural networks, in a way that may scale easily to multiple organizations.

Task definition

For this project, the optimal engineer is defined as the engineer with the most domain expertise. Given a group of tasks and the assigned engineer for each, it is straightforward to ascertain engineers' areas of expertise; by contrast, there is no data on which new skills each engineer wants to gain, and it is difficult to tell how large a task's scope is or how much time it will take to complete. The project manager on the software development team is treated as the oracle in this project: they know the expertise of the developers and the scope of all tasks, so this project assumes the project manager chooses the single optimal assignment each time. Unfortunately, if a commit message is vague, it is difficult for a model to assign an engineer to the task. Fortunately, in a public, open-source repository there are incentives for committers to describe their work in more detail, since many others will read their messages. Additionally, if the author of a commit is not the engineer with the highest level of knowledge but instead a seemingly random choice, this will confuse the model; the project manager may have had good reasons for assigning that particular engineer, but those reasons are often not logged online. A baseline model would always predict the engineer with the highest number of contributions for a given task; on the final data set used in this project, this gives about 10% accuracy. This project uses a few different machine learning algorithms to learn the correct assignment. With enough data, these models may be able to build reasonably good profiles for each engineer despite the data limitations.
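This baseline fits in a few lines with scikit-learn's DummyClassifier; the sketch below is illustrative only, and assumes a feature matrix X and an array of commit author labels y, which are built in the next section.

    # Minimal majority-class baseline: always predict the engineer with the
    # highest number of contributions. X and y are assumed to exist already.
    from sklearn.dummy import DummyClassifier

    baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
    print("baseline accuracy:", baseline.score(X, y))  # about 10% on this data set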

Data collection and feature extraction

GitHub hosts a large, open-source collection of task assignment data in the form of commits. This project took the past five years of commit history from ten popular projects in the Docker organization on GitHub, namely "docker", "machine", "docker.github.io", "notary", "compose", "vpnkit", "swarmkit", "datakit", "libnetwork", and "infrakit". The set of available engineers includes only engineers who had direct access to commit to those repositories, and each engineer/commit pair represents the optimal engineer for the task contained in that commit. Each commit contains the author's name, the date, the files changed, the repository changed, the languages associated with the changed files, and the commit message. The commit messages carry a significant amount of information; tokenizing by words and by character sequences of different lengths both look promising. Qualitatively, the top single words used in commit messages, such as "commit", "completion", "fix", and "make", are not very useful. Top two-word phrases are more useful for describing the task and extracting features, and three-word phrases more useful still. However, there were significantly more instances of two-word phrases than of three-word phrases, indicating that two-word phrases may group tasks in a way that favors generalization.
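A feature set along these lines can be built with scikit-learn's CountVectorizer; the following is a minimal sketch under the assumption that commit messages have already been pulled into a list, using the 1,500-phrase vocabulary size that appears in the feature list below.

    # Binary bag-of-bigrams features from commit messages. The two example
    # messages are placeholders standing in for the real Docker commit history.
    from sklearn.feature_extraction.text import CountVectorizer

    messages = ["fix race condition in network driver",
                "add unit tests for volume mounts"]
    vectorizer = CountVectorizer(ngram_range=(2, 2),  # two-word phrases only
                                 max_features=1500,   # 1,500 most common phrases
                                 binary=True)         # presence/absence, not counts
    X_phrases = vectorizer.fit_transform(messages)    # sparse binary matrix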

Support vector machine and multi-layer perceptron algorithms were run on different subsets of the complete feature set: 1) the date of the commit, 2) the files changed, 3) the repository of the commit, 4) the languages used in the changed files, and 5) 1,500 popular two-word phrases, all codified as binary features during feature extraction. Pulling commit data from the past five years across the ten repositories yielded more than 44,000 commits from more than 13,000 possible authors. If the data is cleaned to contain only authors who have committed more than 500 times, 27,000 commits across 26 authors remain.

Experiments and results

The first algorithm used to make these predictions was unsupervised k-means clustering, with the two hundred most common two-word phrases and the languages as features, for a total of 252 features, over more than 500 data points. I allowed as many clusters as available engineers: perfect clustering would have grouped together all tasks assigned to each particular engineer. Testing the model on the set used to train it gave an accuracy of about 30%; with 15-fold cross-validation over the roughly 500 data points, the mean accuracy was about 10%, with a standard deviation of 7%.

[Figure 1 above; Figure 2 below]
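A minimal sketch of this clustering experiment follows, assuming the 252-column feature matrix X and author labels y from above; scoring each cluster by its majority author is one plausible way to compare clusters against the true assignments, since the scoring method is not spelled out here.

    # k-means with one cluster per engineer, scored by majority vote.
    import numpy as np
    from sklearn.cluster import KMeans

    y = np.asarray(y)
    n_engineers = len(np.unique(y))
    km = KMeans(n_clusters=n_engineers, random_state=0).fit(X)

    preds = np.empty_like(y)
    for c in range(n_engineers):
        members = km.labels_ == c
        if members.any():
            authors, counts = np.unique(y[members], return_counts=True)
            preds[members] = authors[np.argmax(counts)]   # cluster's majority author
    print("training-set accuracy:", np.mean(preds == y))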

This was not particularly surprising, as engineers within a team tend to resemble one another; clustering might be more accurate with k set to the number of teams within the organization. It also suggested that the other algorithms might confuse engineers with similar skills, limiting the upper bound on accurately identifying the correct engineer. The next step was to apply a support vector machine (SVM) using scikit-learn. A one-versus-rest (OVR) SVM model was trained on more than 44,000 data points (Figure 1). Increasing the feature count from 252 to 852 by adding more common-phrase features made a small impact on accuracy, with a greater impact when testing on the training set, suggesting overfitting. Adding features identifying the repository of the task increased accuracy further still and brought the number of features to 863 (Figure 1). Across these three models, the 252-feature SVM had the lowest standard deviation under 5-fold cross-validation at 0.23%, though not by much: the 852- and 863-feature SVMs had standard deviations of 0.28% and 0.36% respectively (Figure 3). A closer look at the errors showed that all of these models had difficulty classifying engineers who did not commit very often. There is a core group of 26 engineers who have each contributed more than 500 times within the Docker organization, so the data was cleaned to contain only commits from these 26 people, bringing the total down to roughly 27,000 data points. Using the prior 863-feature set, both a one-versus-rest and a one-versus-one (OVO) SVM model were trained on these 27,000 data points. Under 5-fold cross-validation, the 863-feature one-versus-rest model generalized better, both in accuracy (Figure 2) and in standard deviation (Figure 3). In an attempt to further improve the one-versus-rest SVM, additional features describing the next 700 most popular phrases were added; running 5-fold cross-validation with this model increased accuracy to about 30% (Figure 2).

[Figure 3, left; Figure 4, right]
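The OVR/OVO comparison might look like the sketch below, again assuming the cleaned feature matrix X and labels y; LinearSVC is an assumed base estimator, since the kernel used is not stated.

    # One-vs-rest vs. one-vs-one SVMs compared with 5-fold cross-validation.
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    for name, model in [("OVR", OneVsRestClassifier(LinearSVC())),
                        ("OVO", OneVsOneClassifier(LinearSVC()))]:
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")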

Training a neural network was feasible given the large amount of data available, and a neural network in the form of a multilayer perceptron (MLP) improved this statistic further. A three-layer MLP with relatively small layer sizes of 100, 50, and 10 was trained. Using 863 features, the MLP outperformed the SVM by nearly 10%; using 1,562 features, it outperformed the SVM by more than 5%. To train neural networks more quickly, I switched from scikit-learn to TensorFlow at this point. For faster testing, I created a random training set of more than 23,700 data points and a random test set of more than 3,700 data points, with no data point shared between the two sets. Using TensorFlow's deep neural network (DNN) classifier to train a three-layer network with layer sizes of 10,000, 1,000, and 100, then testing the trained model on the held-out test set, yielded an accuracy of 39% (Figure 4). To increase performance further, I used principal component analysis (PCA) to shrink the feature set first to 1,000 features, then to 800, and finally to 500. Post-PCA, the 1,000-feature DNN model had the highest accuracy, at 41% (Figure 4).

[Figure 5, below]
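A rough sketch of the PCA-plus-DNN pipeline follows, assuming dense arrays X_train, y_train, X_test, and y_test with integer labels 0-25 for the 26 core engineers; it uses tf.keras rather than the TensorFlow DNN classifier API of 2016, and the epoch and batch-size choices are illustrative.

    # PCA down to 1,000 components feeding a 10000/1000/100 network.
    import tensorflow as tf
    from sklearn.decomposition import PCA

    pca = PCA(n_components=1000).fit(X_train)
    X_train_p, X_test_p = pca.transform(X_train), pca.transform(X_test)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(10000, activation="relu", input_shape=(1000,)),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(100, activation="relu"),
        tf.keras.layers.Dense(26, activation="softmax"),  # one unit per engineer
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train_p, y_train, epochs=10, batch_size=128)
    print(model.evaluate(X_test_p, y_test))  # [loss, accuracy] on the test set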

Error analysis

There were two large sources of error in the model. First, classification confusion concentrated in several areas, as seen in the confusion matrix for the test set under the 1,000-feature DNN model (Figure 5). For example, engineers A and C were consistently confused with B during classification. This suggests that engineers in certain disciplines tend to resemble one another, and that there can be several correct engineers for a task. The second large source of error stemmed from using common two-word phrases as features. A few engineers tended to use phrases that did not describe the task and occurred only in their own commit messages. These phrases acted as a signature, and this source of error may explain the higher F1 scores for engineer X (Figure 6). It seemed that the model was learning engineers' writing styles, allowing it to identify the right engineer, perhaps unfairly.

[Figure 6, below]

Conclusion and future work

Although the final 41% accuracy of the 1,000-feature DNN model is a large improvement over the initial 16% accuracy of the 252-feature SVM model, there is still a lot of work to be done in this space. From a macro viewpoint, this model is not ready for production: many software development teams detail projects on other sites, so GitHub data is not necessarily relevant to them. On a more micro level, this project would benefit from additional work on the commit message features, perhaps by tokenizing the words using TensorFlow's word2vec. Finally, it would be interesting to separate classifying an engineer by writing style from identifying an engineer by task experience; this may require finding a cleaner data set.

References

[1] Braha, Dan. "Partitioning Tasks to Product Development Teams." Massachusetts Institute of Technology, 2002.
[2] Iftikhar, Sundas, et al. "Optimal Task Allocation Algorithm for Cost Minimization and Load Balancing of GSD Teams." National University of Sciences and Technology, Pakistan, 2016.
[3] Kopidakis, Y., M. Lamari, and V. Zissimopoulos. "On the Task Assignment Problem: Two New Efficient Heuristic Algorithms." Journal of Parallel and Distributed Computing, 1997.
[4] Jalote, Pankaj, and Gourav Jain. "Assigning Tasks in a 24-Hour Software Development Model." Indian Institute of Technology, 2016.
[5] Joseph, Harry Raymond. "Software Programmer Management: A Machine Learning and Human Computer Interaction Framework for Optimal Task Assignment." TUM, Germany, 2015.
[6] Jonsson, Leif, et al. "Automated Bug Assignment: Ensemble-based Machine Learning in Large Scale Industrial Contexts." Empirical Software Engineering, 2015.