A Replicate Empirical Comparison between Pair Development and Software Development with Inspection

Monvorath Phongpaibul, Barry Boehm
Center for Systems and Software Engineering, University of Southern California
{phongpai,

Abstract

In 2005, we studied development effort and quality effects in a comparison between software development with Fagan's inspection and pair development. Three experiments were conducted in Thailand: two classroom experiments and one industry experiment. We found that in the classroom experiments the pair development group had less average development effort than the inspection group, with the same or a higher level of quality. The industry experiment's result showed pair development to have slightly more effort but about 40% fewer major defects. However, since this set of experiments was conducted in Thailand, the results might differ if we conducted the experiment in other countries, due to the impact of cultural differences. To investigate this, we conducted another experiment with Computer Science graduate students at USC in Fall 2006. Unfortunately, the majority of the graduate students who participated in the experiment were from India, a country whose culture is not much different from Thailand's [18],[19]. As a result, we cannot compare the impact of cultural differences in this paper. However, the results showed that the experiment can be replicated in other countries where the cultures are similar.

1. Introduction

Both software inspection and pair programming are effective verification techniques. Software inspection is one of the practices in traditional software development, while pair programming is one of the practices in agile development. Numerous studies have shown the success of software inspection in large-scale software development over the past three decades [1],[4],[10],[14],[20],[26]. Since software inspection requires discipline and structure, the cost of achieving similarly high quality for smaller and less critical software may be too high. Although Pair Programming (PP) is a newer and less structured approach, it has had a strong impact on the success of agile software development projects over the past five years [3],[11],[13],[15],[16],[23],[27],[28],[29]. Many agile studies have shown success in delivering a quality product within a limited time frame, and agile practitioners credit PP as a major contributor to the success of agile projects. In Wernick and Hall's study [25], they suggested that pair programming practices might successfully be applied in the traditional software development process. We define a traditional software development process that performs pair programming as a verification technique as Pair Development [18],[19].

In 2005, we conducted three controlled experiments (two classroom experiments and one industry experiment) to compare the commonalities and differences between software development with inspection and pair development in Thailand [18],[19]. We found that in the classroom experiments the average development effort of the pair development group was less than that of the inspection group, with equal or improved product quality. The industry experiment's result showed pair development to have slightly more effort but about 40% fewer major defects. However, since this set of experiments was conducted in Thailand, the results might differ if we conducted the experiment in other countries, due to the impact of cultural differences [17].
To investigate whether the results can be replicated in other countries, we conducted another controlled experiment with graduate students at the University of Southern California. The objective of this paper is to replicate the comparison experiment between pair development and software development with inspection that was conducted previously in Thailand. We investigated the differences between software inspection and pair development in terms of effort and quality. The controlled experiment was performed from September 2006 to December 2006. Either pair development or Fagan's inspection was used as the peer review process, and only one peer review approach was assigned to each group. The experiment results are similar

to our previous experiments, showing that the total development effort of the pair development group was less than that of the inspection group with the same product quality.

The paper is structured as follows. Background knowledge about Fagan's inspection, pair development, and Cost of Software Quality (CoSQ) is reviewed in section 2. Section 3 describes the design of our experiment. Section 4 presents the results from the experiment. Limitations and threats to validity are discussed in section 5, and conclusions in section 6.

2. Background knowledge

2.1. Fagan's inspection

For consistency with the Thailand experiments, we used Fagan's inspection. There are four different roles in Fagan's inspection: moderator, author, reader, and tester. Each role has a different function. The moderator leads the inspection. The author is the owner of the artifact being inspected, who verifies the understanding of the artifact and confirms the validity of the tester's or reader's defects. The reader paraphrases and interprets the artifact from his or her understanding. The tester considers the testability, traceability, and interfaces of the artifact. Michael Fagan at IBM originally developed inspection in the early 1970s [6]. Fagan's study has shown that inspection can identify up to 80 percent of all software defects. The studies in [1],[4],[5],[10],[14],[20],[26] also presented positive results for inspection.

2.2. Pair programming and pair development

Pair programming is one of the practice areas in the Extreme Programming (XP) methodology for improving the quality of the system. XP practitioners consider pair programming a continuous review. As defined by Laurie Williams, "Pair programming is a style of programming in which two programmers work side-by-side at one computer, continuously collaborating on the same design, algorithm, code, or test. One of the pair, called the driver, types at the computer or writes down a design. The other partner, called the navigator, has many jobs. One is to observe the work of the driver, looking for defects in the work of the driver. The navigator has a much more objective point of view and is the strategic, long-range thinker. Additionally, the driver and the navigator can brainstorm on demand at any time. An effective pair programming relationship is very active. The driver and the navigator communicate, if only through utterances, at least every 45 to 60 seconds. Periodically, it's also very important to switch roles between the driver and the navigator." [29]

In [15],[16],[23],[27],[28],[29], empirical data showed that pair programming improves the quality of the product, reduces the time spent in the development life cycle, and increases the happiness of developers. In the controlled experiments investigating the benefits of pair programming [27],[28], the pairs spent approximately 15% more working hours to complete their assignments; however, the final artifacts of the pairs had 15% fewer defects than artifacts produced by individuals. Cockburn and Williams [3] also reported that more than 90% of the developers enjoyed the work and were more confident in their work because of pair programming. In 2005, Muller reported an empirical comparison between pair programming and individual code review [12]. His results showed that pair programming and a single developer with code review are interchangeable in terms of development cost. In [25], Wernick and Hall suggested that pair programming practices might successfully be applied in the traditional software development process.
As in the Thailand study [18],[19], our study extended pair programming to serve as the peer review process for Pair Development, in which pairs develop almost every artifact during the development life cycle, including the project plan, vision document, system requirement definition, system architecture definition, source code, test plan, and test cases. The one activity for which we do not recommend relying on pairing alone is requirements negotiation, since the requirements should represent the needs and concerns of all stakeholders.

2.3. Cost of Software Quality (CoSQ)

Cost of quality (CoQ) is commonly used in manufacturing to present the economic trade-offs involved in delivering good-quality products. Basically, it is the framework used to discuss how much good quality and poor quality cost. It is an accounting technique that has been adapted to software development, where we call it Cost of Software Quality (CoSQ). The total development cost decomposes as

TDC = C_production + C_quality, where C_quality = C_prevention + C_appraisal + C_IF + C_EF

(IF and EF denote internal and external failures, respectively).

Figure 1: Software production cost and CoSQ.

By definition [22], CoSQ is a measure of the costs specifically associated with the non-achievement of software product quality, encompassing all requirements established by the company, its customer contracts, and society. Figure 2 presents the model of

Cost of Software Quality (CoSQ). Total Development Cost (TDC) comprises the cost the team spends on producing the system, called production cost, and the cost the team spends on assuring the quality of the system, called CoSQ. CoSQ is composed of four categories of cost: prevention costs, appraisal costs, internal failure (I-failure) costs, and external failure (E-failure) costs.

Figure 2: Model of Cost of Software Quality

Prevention costs and appraisal costs are conformance costs, which are amounts spent on conforming to requirements. Prevention costs are related to the effort needed to prevent defects before they happen; examples are the cost of training, the cost spent on process improvement, data collection costs, and the cost of defining standards. Appraisal costs include expenses associated with assuring conformance to quality standards; examples are inspection or peer review costs, auditing of products, and the cost of testing. Internal failure costs and external failure costs are non-conformance costs; they include all expenses incurred when things go wrong. Internal failures occur before the product is delivered or shipped to the customer; in software development, internal failure costs can be the cost of reworking defects, re-inspection, re-review, and re-testing. External failures arise after the product is delivered or shipped to the customer; external failure costs include post-release technical support costs, maintenance costs, recall costs, and litigation costs. In our study, only the effort of the developers is taken into account in our CoSQ analysis.

3. Research methodology

In this section, the research methodology and research design are discussed. Section 3.1 describes the subjects of the experiment. Section 3.2 illustrates the experimental design. The data collection and the hypotheses are discussed in sections 3.3 and 3.4. To understand the differences between software development with inspection and pair development, controlled experiments were designed and conducted. The research framework on pair programming illustrated in [7] was used to design the experiment; it is the same framework we used in our previous experiments. The experiments compare software development with Fagan's inspection and pair development. The Fagan's inspection teams were the control group and the pair development teams were the experimental group. The dependent variables of this experiment are time, cost, and quality. Since the team sizes were equal, the effect for calendar time is the same as for effort.

3.1. Subjects

An experiment was conducted as part of a directed research course at USC. The experiment took place in Fall 2006 (August to December 2006). The participants were 56 graduate students in computer science. The experiment was part of a team project, which was the main part of the course. The project's objective was to provide additional features to an existing system. In the first month of the course, the students were required to participate in a two-hour weekly meeting where they learned how to use the existing system, how the system was designed, and what new features they had to develop. In addition, the students were trained in how to perform their verification technique, either inspection or pair development. All students were informed about the experiment at the beginning of the course. Many of the USC Computer Science graduate students are international students.
We were expecting to have a very diverse pool of graduate students in the course. Unfortunately, all but one of the student participants were from India. We explain the threat to validity due to these circumstances in section 5. To avoid bias, we clarified to the students that the objective of the experiment was not to explore which technique was better, but to understand the differences between the two techniques. All students were informed that the number of defects found during the project and the effort spent on developing the project were not part of the grading criteria. We based the grading on the quality of the delivered product and process compliance.

3.2. Design

Students were divided into four-person teams. To avoid schedule clashes, we allowed students to set up their own teams with other students who had compatible schedules; randomizing teams would have increased the probability of schedule

mismatches. With a schedule clash, teams would not be able to meet and thus no work would get done. There were a total of 14 teams. Seven teams were randomly assigned to the pair development group (PD group) and seven teams were randomly assigned to the inspection group (FI group). After validating the data, we dropped five teams from our experiment for three main reasons: invalid data, outlier data, or violation of academic integrity.

At the beginning of the semester, the students were required to fill out a background survey. Table 1 summarizes the average GPA (Grade Point Average) and experience of the teams. The average GPA of the pair development group is 3.37 and the average GPA of the inspection group is

Table 1: Teams' average GPA and experience (columns: Team #, Average GPA, Average years of experience, Average level of C and C++ knowledge; rows: pair development teams P1-P5 and inspection teams I1-I4)

The average level of experience is measured by the number of years the students have been working in industry. The average industry experience in the pair development group is 0.85 years and the average industry experience in the inspection group is 0.81 years. There is one team in the pair development group that has the lowest GPA of all the teams, and another team in the inspection group that has the highest GPA but no industry experience. We initially thought these teams would be outlier data points. However, since these two teams performed neither the best nor the worst in the experiment, we did not drop them. In addition, the experiment required C and C++ knowledge. The average level of C and C++ knowledge is measured by familiarity with the language; the students rated themselves on a scale from 1 (never heard of it) to 10 (expert). All of the students rated themselves between 7 and 9, so all students had a similar background in C/C++.

Students were required to work in teams to develop CodeCount for Visual Basic (VB CodeCount). VB CodeCount is a new CodeCount tool to be added to the USC CodeCount toolset. The USC CodeCount toolset is a collection of tools designed to automate the collection of source code sizing information [24]. The USC CodeCount toolset spans multiple programming languages such as Ada, JavaScript, C and C++, Perl, and SQL. It provides information on two possible measures of Source Lines of Code (SLOC): physical SLOC and logical SLOC. The USC CodeCount toolset is the tool that our center provides to affiliates from the aerospace and defense industries to use in their projects.

Table 2: Experiment schedule
Training: Aug 23 - Sep 12: meeting, team formation, training session
Requirement: Sep 13 - Sep 26: identify requirements, develop shared vision, develop use case specification, plan project, verify major documents
Design: Sep 27 - Oct 10: define VB physical and logical SLOC definitions and VB keyword list, design the system, verify documents
Implementation: Oct 11 - Nov 14: implementation, code verification, unit test
Testing: Nov 15 - Dec 13: system test, test case generation, verify test cases
Delivery: Dec 13: final delivery
UAT: Dec 14 - Dec 21: User Acceptance Test (UAT)

The physical SLOC definition is programming language syntax independent, which enables it to collect other useful information such as comments, blank lines, and overall size, all independent of information content. The logical SLOC definition depends on the programming language. The logical SLOC definition is compatible with the SEI's Code Counting Standard [8].
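To make the distinction between the two measures concrete, a minimal sketch for C-like source is shown below. This is not the USC CodeCount implementation: the logical rule here (one statement per semicolon) is a deliberately crude stand-in for the much more detailed SEI counting rules the teams actually had to implement.

```python
# Minimal sketch, not the USC CodeCount implementation: physical SLOC counts
# non-blank, non-comment lines, while the logical count here uses a crude
# proxy (one statement per ';') in place of the full SEI counting rules.
def count_sloc(source: str) -> tuple[int, int]:
    physical = logical = 0
    in_block_comment = False
    for raw in source.splitlines():
        line = raw.strip()
        if in_block_comment:                      # still inside a /* ... */ comment
            if "*/" not in line:
                continue
            line = line.split("*/", 1)[1].strip()
            in_block_comment = False
        if not line or line.startswith("//"):     # blank or whole-line comment
            continue
        if line.startswith("/*"):
            if "*/" not in line:
                in_block_comment = True
                continue
            line = line.split("*/", 1)[1].strip()
            if not line:
                continue
        physical += 1                             # a non-blank, non-comment line
        logical += line.count(";")                # crude statement count
    return physical, logical

sample = """\
int x = 1; int y = 2;   // one physical line, two logical statements
int result =
    x + y;              // two physical lines, one logical statement
// a comment-only line is counted by neither measure
"""
print(count_sloc(sample))                          # -> (3, 3)
```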
In our experiment, the teams used the SEI's Code Counting Standard as the template to develop the physical and logical SLOC definitions. The students were required to develop VB CodeCount in C/C++ and to follow the USC CodeCount architecture. To avoid a threat to validity due to differing knowledge of the development process, all teams were required to follow the course schedule. Table 2 shows the experiment schedule and the major activities in each phase. However, some teams deviated from the plan, since they had to go back and rework artifacts from the previous phase. The experiment was conducted over a period of 13 weeks (excluding the training and UAT phases). The development life cycle was composed of four phases: a 2-week

requirement phase, a 2-week design phase, a 5-week implementation phase (broken into two iterations), and a 4-week testing phase. Every other week the teams were required to meet with the instructor to track their progress. At the end of each phase, the teams were required to meet with the instructor to review the major artifacts of that phase. If there were defects in the artifacts, the teams were required to fix them before they could enter the next phase. After delivery, the instructor generated the test cases for the UAT phase. The final products from every team were tested with these test cases and the results were recorded to compare the level of quality.

3.3. Data collection and data analysis

We developed the inspection data sheets and the pair development data sheets (called quality reports) for data collection. The inspection data sheets are composed of a planning record, individual preparation log, defects list, defect detail report, and inspection summary report. The pair development data sheets are composed of a planning record, time sheet, individual defects list, and pair development summary report. These data sheets contain the results of either inspection or pair development. Besides the data sheets, the teams were required to submit individual task logs every week. For validation purposes, data from the individual task logs and the quality reports were checked for consistency. After the first review, all teams were required to meet with the instructor to discuss the verification technique they were performing (either pair development or inspection). All questions and concerns about the techniques were raised and resolved. The experiment data are analyzed with descriptive analysis and statistical tests. Since the population is small, Student's t-test was used to investigate the hypotheses [21]. The significance level for rejecting the hypotheses is 0.05 for all tests.

3.4. Hypotheses

As stated in Section 1, this paper focuses on the difference between pair development and inspection in terms of the cost of the process and the effects on quality. Total Development Cost (TDC) and the components of Cost of Software Quality (CoSQ) are used for comparing the cost of the process. In addition to CoSQ, we analyzed the difference in TDC between the groups for each development phase (requirement, design, implementation, and testing) to demonstrate the effect on the schedule. The development cost in the requirement phase (DCR) is the number of man-hours spent to identify requirements, develop the vision document, develop the use case specification, plan the project, review major artifacts, meet with the client, and fix the defects found in the requirement phase. The development cost in the design phase (DCD) includes the man-hours spent to define the VB physical and logical SLOC definitions, define the VB keyword list, design the system, review the major artifacts, discuss or research design issues, and fix the design defects. The development cost in the implementation phase (DCI) consists of the man-hours spent to code the system, review the code, discuss system issues, unit test, and fix the defects found in the implementation phase. The development cost in the testing phase (DCT) is the number of man-hours spent to generate test cases (test description, test input), run test cases, record the test log, review the test artifacts, and fix the test defects. The effects on quality are determined by the number of defects found in the testing phase and the number of un-passed test cases at UAT. The list of hypotheses and their results are shown and discussed in the next section.
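To make the analysis procedure concrete, the sketch below shows how per-task effort logs of the kind described above could be rolled up into CoSQ categories per team and compared between the two groups with a two-sample t-test at the 0.05 level. This is an illustrative sketch only, not the authors' actual analysis scripts; the log format and hour values are made-up placeholders.

```python
# Illustrative sketch, not the authors' analysis scripts. Weekly task-log
# entries are rolled up into CoSQ categories per team, then the PD and FI
# group means are compared with a two-sample t-test at the 0.05 level.
from scipy import stats

# Hypothetical log entries: (team, phase, CoSQ category, man-hours)
task_logs = [
    ("P1", "implementation", "production", 12.0),
    ("P1", "implementation", "appraisal",   3.5),  # continuous review while pairing
    ("I1", "implementation", "production", 11.0),
    ("I1", "implementation", "appraisal",   6.0),  # inspection preparation and meeting
    ("I1", "implementation", "failure",     4.0),  # rework of logged defects
    # ... in the experiment, one entry per task per week for every team
]

def team_totals(logs, category=None):
    """Sum man-hours per team for one CoSQ category (or TDC when category is None)."""
    totals = {}
    for team, _phase, cat, hours in logs:
        if category is None or cat == category:
            totals[team] = totals.get(team, 0.0) + hours
    return totals

def compare_groups(logs, pd_teams, fi_teams, category=None, alpha=0.05):
    """Two-sample t-test of PD vs. FI per-team totals; reject H0 when p < alpha."""
    totals = team_totals(logs, category)
    pd = [totals.get(t, 0.0) for t in pd_teams]
    fi = [totals.get(t, 0.0) for t in fi_teams]
    t_stat, p_value = stats.ttest_ind(pd, fi)
    return p_value, p_value < alpha
```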
4. Experiment results

In this section, the hypotheses and their results are discussed. Since the number of subjects (5 teams in the pair development group and 4 teams in the inspection group) is small, the t-test was used to analyze the difference between the means of the pair development group and the inspection group [21]. The 5% significance level was used in hypothesis testing.

4.1. Total Development Cost (TDC)

H1: There is no difference in Total Development Cost (TDC) in man-hours between the pair development and inspection groups.

Total Development Cost (TDC) is the total number of man-hours all team members spent on developing the project. Each week, students submitted their effort report, in man-hours, to an online repository. The effort reports were cross-checked against the quality reports for consistency. In the effort report, students identified the tasks they performed, the time they spent on each task, and the date each task was performed. All the effort reports were then accumulated into TDC. Figure 3 shows the TDC for the five pair development teams and the four inspection teams; the y-axis is TDC in man-hours. Notably, all the teams in the pair development group spent less effort to develop the project than all the teams in the inspection group. The average TDC for the pair development group is man-hours and the average TDC for the inspection group is man-hours. It is interesting to note that team P4, which has the

highest level of C and C++ knowledge (8.25), is the team with the lowest TDC (179.43). In Table 3, the mean, standard deviation, and p-value of TDC are shown for both groups. The TDC mean of the pair development group is man-hours lower than the mean of the inspection group. The p-value between the two groups is below 0.05, so we can reject the hypothesis that there is no difference between the two groups in TDC.

Figure 3: Results of TDC (TDC in man-hours for each of the PD and FI teams)

The result of the 2006 experiment is similar to the classroom experiments in 2005, in which the pair development group spent less development effort than the inspection group. We will discuss the reasons for this result when we analyze the components of CoSQ in the next section.

Table 3: Statistical results of TDC (columns: Sample, Mean, Standard Deviation, P-Value; rows: TDC in man-hours for the PD Group and the FI Group)

4.2. Cost of Software Quality (CoSQ)

H2: There is no difference in failure costs in man-hours between the pair development and inspection groups.
H3: There is no difference in appraisal costs in man-hours between the pair development and inspection groups.
H4: There is no difference in production costs in man-hours between the pair development and inspection groups.

In our study, we use CoSQ as the measurement to compare the cost of developing quality software between pair development and software development with inspection. Basically, CoSQ represents the distribution of the actual cost across the categories (see section 2.3) needed to produce a quality product. For example, if the internal failure cost is high, it implies that the development process is ineffective or the product is of poor quality, since there is a lot of rework that wastes effort. From Table 4, all of the teams in the pair development group spent slightly more production cost than the inspection group (around 5 man-hours more). However, the appraisal cost and failure cost of the pair development group tend to be lower than those of the inspection group. As a result, the inspection teams took more effort to develop the system than the pair development teams, as described in section 4.1. These results are consistent with the experiments conducted in 2005. We do not show the prevention cost since every team spent the same training time (10.5 man-hours per team).

Table 4: Results of CoSQ (columns: Team #, Production Cost, Appraisal Cost, Failure Cost, TDC; rows: teams P1-P5 and I1-I4)

There are two main reasons that the pair development group has lower appraisal costs and failure costs than the inspection group. First, continuous review is an invariant of pair development. While the driver is working on the product, the observer is actively reviewing the product at the same time. Possible defects can be removed as soon as they are generated; hence, there are lower rework costs, which means lower failure costs. Second, the cost of performing Fagan's inspection is high due to the structure of the inspection process, especially for relatively small projects. However, Fagan's inspection has been well documented as an effective verification technique.

Table 5 shows the mean, standard deviation, and standard error values, in man-hours, for the production costs, appraisal costs, and failure costs. As indicated, on average the pair development group spent more effort on creating software than the inspection group. This result is different from the classroom experiments in 2005, in which on average the pair development group and the inspection group spent approximately the same number of man-hours on production cost.

However, it is consistent with the 2005 industry experiment, in which the pair development group spent more effort on production cost than the inspection group. This can be attributed to the overhead during pair sessions, in which the pair needed to prepare and plan before pair execution began. Appraisal costs are the cost of assuring the quality of the product. In our study, appraisal costs refer to two major costs: the cost of reviewing and the cost of testing. On average, the pair development group spent man-hours in continuous review and man-hours in testing. The inspection group spent on average man-hours in inspecting and man-hours in testing. The results in Table 5 show that there are significant differences in production costs, appraisal costs, and failure costs between the pair development group and the inspection group.

Table 5: Statistical results of production costs, appraisal costs, and failure costs (columns: Group, Activities, Mean, Standard Deviation, Difference between means, Standard Error, P-Value; rows: production costs for the PD Group (pair driving, meeting, individual development) and the FI Group (individual development, meeting); appraisal costs for the PD Group (continuous review as pair navigator/observer, and testing) and the FI Group (inspection and testing); failure costs for the PD Group (rework, pair and individual) and the FI Group (rework))

4.3. Development costs per phase

H5: There is no difference in development costs (man-hours) in the requirement phase (DCR) between the pair development and inspection groups.
H6: There is no difference in development costs (man-hours) in the design phase (DCD) between the pair development and inspection groups.
H7: There is no difference in development costs (man-hours) in the implementation phase (DCI) between the pair development and inspection groups.
H8: There is no difference in development costs (man-hours) in the testing phase (DCT) between the pair development and inspection groups.

In this section we analyze the distribution of development costs per phase. From Table 6, in every phase the pair development group spent less effort than the inspection group, except during the design phase, where the development costs of both groups were about the same. This is because all of the students were new to the system. For the pair development group, it reflects the impact of pairing a non-expert with another non-expert [28]. Due to their limited knowledge of the system, the pairs had to spend a lot of time on design, and even then the students could not be sure whether their designs were correct. For the inspection group, the time spent on inspection was also not effective, since few design defects were removed. At the end of the design phase, most of the teams were asked by the instructor to go back and fix all the defects: the pair development group was required to do pair rework and the inspection group was required to re-inspect the design.

Table 6: Development costs per phase (columns: Team #, DCR, DCD, DCI, DCT; rows: PD Group teams P1-P5 and FI Group teams I1-I4)

As mentioned at the beginning, since the team sizes were equal, the effect for calendar time is the same as for effort. The experiment results showed that the pair development group had about 36% less effort in the requirement phase, 4% less effort in the design phase,

26% less effort in the implementation phase, and 16% less effort in the testing phase than the inspection group, which implies that the pair development group required 36%, 4%, 26%, and 16% less calendar time in the respective phases. This means pair development offers the option of starting each phase earlier and reducing the calendar time. As a result, the system is delivered to the market sooner and the return on investment is increased. Figure 4 illustrates the difference in calendar time between the pair development group and the inspection group.

Figure 4: Effect on calendar time

Table 7 reports the comparative average development costs per phase. On average, the pair development group took less effort per phase than the inspection group. The p-values show that there are significant differences in the development costs of the requirement, implementation, and testing phases between the pair development group and the inspection group. However, there is no difference between the two groups in the development costs of the design phase (p-value = 0.35).

Table 7: Statistical results of development costs (columns: Development Costs, Group, Mean, Standard Deviation, P-Value; rows: DCR, DCD, DCI, and DCT, each reported for the PD Group and the FI Group)

4.4. Product quality

H9: There is no difference in the number of un-passed test cases in the testing phase between the pair development and inspection groups.
H10: There is no difference in the number of un-passed test cases in UAT between the pair development and inspection groups.

Table 8: Results of product quality (columns: Number, Group, Mean, Standard Deviation, P-Value; rows: un-passed test cases in the testing phase and un-passed test cases in UAT, each reported for the PD Group and the FI Group)

Product quality was determined by the number of un-passed test cases in the testing phase and in UAT. During the testing phase, all teams generated test cases and recorded the results. All of the defects found during this phase were required to be fixed before product delivery at the end of the semester. After product delivery, the instructor tested the final product of every team with the same set of test cases that had been prepared during the semester. We call this phase the User Acceptance Test (UAT). The students had no knowledge of the test cases. All the un-passed test cases were recorded and classified by defect type. The Student's t-test analysis in Table 8 indicates that there is no evidence of a statistically significant difference in quality (number of un-passed test cases in the testing phase and in UAT) between the two groups.

4.5. Comparison with Thailand results

Table 9 summarizes the results from all of the experiments, which were conducted in Thailand (E1, E2, and E3) and the US (E4). E1, E2, and E4 were classroom experiments and E3 was an industrial experiment. The classroom experiments from both Thailand and the US shared the common result that the pair development group took less effort to develop the system than the inspection group, with equal or fewer defects in the final product. On average, the pair development teams spent about 20% to 30% less effort than the inspection teams, which also implies that the pair development groups shortened the calendar time by 20% to 30%. For the Thailand industrial experiment (E3), even though the pair development group took about 4% more effort, the final product had 40% fewer major defects. This difference may be due to the industry project being more complex than the classroom project.
For the inspection team, achieving the same quality as the pair development team may require more TDC than the pair development team spent.

This result also showed that, with a limited timeframe, the pair development team produced a higher level of product quality than the inspection team.

Table 9: Summary of results from all experiments (columns: Team, TDC, #Test Defects; rows: PD Group and FI Group for E1 (Thailand 05), E2 (Thailand 05), E3 (Thailand 05; 11 major defects for the PD Group and 18 for the FI Group), and E4 (US 06))

5. Validity threats and control

5.1. Cultural differences

All of our previous experiments were conducted in Thailand, where the culture is very different from that of the United States or European countries. For example, Thai people socialize more than Western people do. Pair development requires continuous interaction between the developers, which may not be a practical approach in the United States, where people generally require more personal space [17]. To compare the cultural differences, this experiment was conducted at USC, a university in the United States. Unfortunately, all of the students in the study were from India, a country whose culture is not much different from Thailand's [18],[19]. As a result, we cannot compare the impact of cultural differences in this paper. However, the results of the paper indicate that across countries where the cultures are similar, the results of the experiment can be replicated.

5.2. Non-equality of team experiences

Since the subjects in the inspection group and the pair development group are not from the same population, we needed to make sure that the subjects in both groups had the same level of C/C++ knowledge. In our experiment, we could not use the background survey to assist in team formation since most of the students had different class schedules, so we allowed the students to set up their own teams with other students who had compatible schedules. However, the data on the students in both groups indicated that they had essentially the same level of C and C++ knowledge. For the team projects, students only needed to know C and C++ to design and implement the system. From our data, GPA and industry experience did not have an effect on team success.

5.3. Non-representativeness of subjects

The participants in the experiments are not representative of people who work in the software industry, since most of the graduate students have at most two years of industry experience and none of them had experience in software inspection, peer review, or pair programming. To mitigate this threat, we provided extensive training in the practices and the software verification techniques (either Fagan's inspection or pair development) prior to the start of the experiment.

5.4. Non-representativeness of team size

The team size of the projects is smaller than that of typical industry projects; four-person teams are not representative of team sizes in the US. However, about 60% of US projects comprise fewer than 10 people, while about 60% of US software effort is spent on projects with over 50 people [2].

5.5. Non-representativeness of the size of project

The size and complexity of the project are quite small compared to the size and complexity of projects in industry. As mentioned above, the USC CodeCount toolset is a sizing tool that many developers from industry, including the defense industry, use in their projects. Since this project needed a quick development time, it is representative of the three-to-four-month rapid development projects found in industry.
6. Conclusion

In this paper, experiments comparing pair development and software development with inspection are discussed. The objective of this experiment was to replicate the 2005 Thailand experiment at a US university in order to study the impact of cultural differences on the results. Unfortunately, we were not able to compare the impact of cultural differences in this paper, since all of the participants were from India, where the culture is similar to Thailand's [9]. The results of the paper indicate that the results of the experiment can be replicated when the developers have a similar cultural background. The common result from all of the experiments is that the pair development group spent less total development effort than the inspection group to produce the same or a higher level of quality, due to the reduction in appraisal costs and failure costs. This is the advantage of the early feedback cycle of pair

development, in which the defects are found while the developers are producing the artifacts. In the 2006 experiment, the TDC of the pair development group is 22% less than that of the inspection group, with the same quality. In addition to TDC, the development costs per phase of the pair development group were less than those of the inspection group. Since the team sizes were equal, the effect for calendar time is the same as for effort: for example, the pair development team's 26% effort reduction in the implementation phase implies a 26% reduction in calendar time. Hence, pair development offers the option of starting the next phase earlier and reducing the overall calendar time. In the end, the system is delivered to the market sooner and the return on investment is increased. From the results of the 2005 and 2006 experiments, we can conclude that in small projects in which the developers are from similar cultures, such as Thailand and India, pair development can generally perform verification more effectively than software inspection. However, future empirical assessments are still necessary to fully understand the commonalities and differences between pair development and software development with inspection in different environments, such as larger project sizes, larger team sizes, and developers from different cultures.

7. References

[1] Ackerman, A.F., Buchwald, L.S., and Lewski, F.H., Software Inspection: An Effective Verification Process, IEEE Software, Vol. 6, No. 3, May 1989, pp.
[2] Boehm, B., and Turner, R., Balancing Agility and Discipline, Addison-Wesley.
[3] Cockburn, A., and Williams, L., The Costs and Benefits of Pair Programming, Extreme Programming and Flexible Processes in Software Engineering (XP2000).
[4] Dion, R., Process Improvement and the Corporate Balance Sheet, IEEE Software, Vol. 10, No. 4, July 1993, pp.
[5] Fagan, M.E., Advances in Software Inspections, IEEE Trans. Software Eng., Vol. 12, No. 7, July 1986, pp.
[6] Fagan, M.E., Design and Code Inspections to Reduce Errors in Program Development, IBM Syst. J., Vol. 15, No. 3, 1976, pp.
[7] Gallis, H., Arisholm, E., and Dyba, T., An Initial Framework for Research on Pair Programming, Proceedings, ISESE.
[8] Humphrey, W.S., A Discipline for Software Engineering, Addison-Wesley.
[9] Hofstede, G., Culture's Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations, Sage, Thousand Oaks, CA, 2001.
[10] Kelly, J.C., Sherif, J.S., and Hops, J., An Analysis of Defect Densities Found During Software Inspections, Journal of Systems and Software, Vol. 17, No. 2, Feb. 1992, pp.
[11] McDowell, C., Werner, L., Bullock, H., and Fernald, F., The Effects of Pair-Programming on Performance in an Introductory Programming Course, Proceedings of the 33rd SIGCSE, 2002, pp.
[12] Muller, M.M., Two Controlled Experiments Concerning the Comparison of Pair Programming to Peer Review, Journal of Systems and Software (JSS), Vol. 78, pp.
[13] Muller, M.M., and Tichy, W.F., Case Study: Extreme Programming in a University Environment, Proceedings, ICSE, 2001, pp.
[14] Myers, W., Shuttle Code Achieves Very Low Error Rate, IEEE Software, Vol. 5, No. 5, Sept.
[15] Nagappan, N., Williams, L., Wiebe, E., Miller, C., Balik, S., Ferzli, M., and Petlick, M., Pair Learning: With an Eye Toward Future Success, Extreme Programming/Agile Universe.
[16] Nawrocki, J., and Wojciechowski, A., Experimental Evaluation of Pair Programming, Proceedings, ESCOM, 2001, pp.
[17] Phongpaibul, M., Improving Quality Through Software Process Improvement in Thailand: Initial Analysis, Proceedings of 3-WoSQ, ICSE 2005, May 17.
[18] Phongpaibul, M., An Empirical Comparison Between Pair Development and Software Inspection in Thailand, ISESE 2006, Sept.
[19] Phongpaibul, M., Experimental and Analytical Comparisons between Software Development with Fagan's Inspection and Pair Development, Qualifying Report, University of Southern California, April.
[20] Russell, G.W., Experience with Inspection in Ultralarge-Scale Development, IEEE Software, Vol. 8, No. 1, Jan. 1991, pp.
[21] Siegel, A.F., Statistics and Data Analysis, John Wiley & Sons, Singapore.
[22] Slaughter, S.A., Harter, D.E., and Krishnan, M.S., Evaluating the Cost of Software Quality, Communications of the ACM, Vol. 41, No. 8, August 1998, pp.
[23] Succi, G., Marchesi, M., Pedrycz, W., and Williams, L., Preliminary Analysis of the Effects of Pair Programming on Job Satisfaction, 4th International Conference on Extreme Programming and Agile Processes in Software Engineering (XP2002).
[24] USC CodeCount Toolset.
[25] Wernick, P., and Hall, T., The Impact of Using Pair Programming on System Evolution: a Simulation-Based Study, Proceedings of ICSM '04, IEEE.
[26] Wheeler, D.A., Brykczynski, B., and Meeson, R.N. Jr., Software Inspection: An Industry Best Practice, IEEE CS Press, Los Alamitos, CA.
[27] Williams, L., The Collaborative Software Process, PhD Dissertation.
[28] Williams, L., and Kessler, R.R., Pair Programming Illuminated, Addison-Wesley.
[29] Williams, L., Wiebe, E., Yang, K., Ferzli, M., and Miller, C., In Support of Pair Programming in the Introductory Computer Science Course, Computer Science Education, September.