TEACHING AND LEARNING PARALLEL PROCESSING THROUGH PERFORMANCE ANALYSIS USING PROBER


Luiz E. S. Ramos 1, Luís F. W. Góes 2, Carlos A. P. S. Martins 3

Abstract: In this paper we analyze the teaching and learning of parallel processing through performance analysis using a software tool called Prober. This tool is a functional and performance analyzer of parallel programs that we proposed and developed during an undergraduate research project. Our teaching and learning approach consists of a practical class where students receive explanations about some concepts of parallel processing and about the use of the tool. They perform some simple, guided performance tests on parallel programs and analyze the results using Prober as a single aid tool. Finally, the students answer a self-assessment questionnaire about their background, their knowledge of parallel processing concepts and the usability of Prober. Our main goal is to show that students can learn concepts of parallel processing in a clearer, faster and more efficient way using our approach.

Index Terms: Parallel Processing, Performance Analysis, Single Aid Tool, Teaching and Learning Approach.

INTRODUCTION

Nowadays, clusters of workstations are widely used in academic, industrial and commercial areas. Usually built with commodity-off-the-shelf hardware components and freeware or shareware available from the net, they are a low-cost, high-performance alternative to supercomputers [4]. Many universities around the world are building clusters with existing hardware from their laboratories and free software. The primary motivation for building clusters is not always specifically to use them to teach and learn parallel processing; they generally end up becoming single-task parallel systems [2]. For this reason, and also for economic reasons, many universities in developing countries do not offer undergraduate parallel processing courses.
Usually, they do not integrate parallel processing concepts into their existing courses either. Parallel processing courses generally focus on parallel programming. There are two main parallel programming models: shared variable and message passing [2][4][9][13]. In the shared variable model, variables are placed in the local memory of a multiprocessor, where all processors can access them. Programs based on this model normally use libraries (like OpenMP, POSIX, Windows and Java threads) as support [4][9]. In the message passing model, on the other hand, each machine has its own private memory, so the machines must exchange messages over a network in order to request remote data and to synchronize [8][9][11]. Programs written for multicomputers (like clusters of workstations) normally use message passing libraries (like PVM and MPI) as support for this model [4][7][9]. This model introduces new problems and concepts that should be covered in an undergraduate parallel processing course.

The performance analysis and evaluation of parallel programs are other important issues. Both can be carried out using available benchmarks (like Linpack and the NAS Parallel Benchmark) or performance analyzers (like XPVM) [4][5].

In the traditional approach to teaching and learning parallel processing, the students first take theoretical classes in which topics like parallel algorithms, programming and architectures are discussed [4]. Only after those classes do the students implement parallel programs to solve some computational problems (e.g.: sorting, image processing, numerical algorithms etc.). So it may take several weeks of theoretical classes before they verify the advantages and disadvantages of parallel processing through practice [8]. During the functional and performance analysis of the programs, the students may spend too much time learning how to use different tools.
Those tools do not prevent the students from having to perform some manual tasks, which should be avoided [4][8]. All those problems may frustrate students. Moreover, this teaching and learning approach could be more efficient, more productive and less exhausting.

Motivation

Our university offers a parallel processing course that uses the traditional approach to teach and learn parallel processing. In this approach, the students must study the theory of parallel processing for weeks before they verify the advantages and disadvantages of parallelism through practice. The students could be more motivated to learn and better prepared to use parallel processing if they had a practical class before the theoretical classes. In this class, they would learn some parallel processing concepts through the performance analysis of parallel programs. Moreover, in the functional and performance analysis process, the students spend most of their time on secondary activities like configuring the parallel environment, manually registering the values of performance metrics, learning how to use different software tools (like statistic tools, compilers, benchmarks and daemons), etc.

1 Luiz E. S. Ramos, Pontifical Catholic University of Minas Gerais, Informatics Institute, Belo Horizonte, MG, Brazil, luizedu@pucmg.br
2 Luís F. W. Góes, Pontifical Catholic University of Minas Gerais, Informatics Institute, Belo Horizonte, MG, Brazil, lfwg@pucmg.br
3 Carlos A. P. S. Martins, Pontifical Catholic University of Minas Gerais, Informatics Institute and Post-Graduation Program in Electrical Engineering, Belo Horizonte, MG, Brazil, capsm@pucminas.br

For this reason, we proposed and developed a tool called Prober, a functional and performance analyzer of parallel programs [12]. It was created during an undergraduate research project to aid students, teachers and programmers in the configuration of parallel environments and in the testing and analysis of parallel programs. Since the first implementation, the tool has been updated and extended with some new features. Based on the good results reached in [12], in which Prober was only tested as a software tool, we are now using it as a single aid tool for the teaching and learning of parallel processing.

The main differences between the traditional method and our proposed approach are: the inclusion of a practical class before the theoretical classes, in which students learn the concepts, advantages and disadvantages of parallel processing in a real environment through performance analysis; the use of a single tool to aid students in the testing and performance analysis of parallel programs; and the use of a Windows/PC cluster, a low-cost and popular environment. In our experiment we describe and analyze this practical class, which will be part of a method to teach and learn parallel processing.

Objectives and Goals

Our main objective is to analyze the teaching and learning of parallel processing through the performance analysis of parallel programs using a software tool called Prober. As a secondary objective, we are interested in the usability of Prober as a single aid tool for the performance analysis of parallel programs. Our main goal is to show that students can learn parallel processing concepts in a clearer, faster and more efficient way using our approach.
This approach consists of a practical class where students receive explanations about some concepts of parallel processing and about the use of the tool. They perform some simple, guided performance tests on parallel programs and analyze the results using Prober as a single aid tool. Finally, the students answer a self-assessment questionnaire about their background, their knowledge of parallel processing concepts and the usability of Prober.

PROBER PRESENTATION

Many academic and commercial software tools have been designed to deal with parallel processing challenges. Those tools can be used to support parallel processing (e.g.: message passing libraries and parallel compilers) or to manage, test and analyze parallel systems and programs (e.g.: performance and functional analyzers, monitors, debuggers, job management systems and benchmarks) [3][5][10].

Prober is a functional and performance analysis tool for parallel programs that can also be used to teach and learn parallel processing. It combines features from system monitors, performance analyzers, benchmarks and job management systems. It can monitor the execution of parallel programs and simultaneously collect the values of some performance metrics (e.g.: response time). After that, it makes those values available in binary files, which can be visualized through its graphical interface. The tool was implemented for the MS Windows environment in the C++ language (widely used in academia) and compiled with Borland C++ Builder 4.0, available at our university. Future versions are planned to run on multiple platforms (e.g.: Linux, Solaris).

In order to calculate the average response time of a program, it is important to make repeated time measurements to reach a reliable result. For this reason, Prober provides an automatic iterative execution mechanism through which the user can repeatedly measure the response time of a program [1].
The main features of the present version of Prober are: the collection of performance data; the automatic storage of the collected data in binary files; the conversion of binary files into text files (readable by the user); the execution of different programs in sequence (batch execution); the measurement of the response time of code segments (using a special support library); the interpretation of user scripts (specified in submission files); the automatic variation of the values of the numerical arguments of a program; the generation of graphs (response time, speedup and efficiency); the automatic calculation of statistics; and a new graphical interface.

Prober has its own script language and provides a mechanism for editing and interpreting scripts (saved as submission files). A user creates a script using the graphical interface of Prober or an ordinary text editor. In this script, the user must specify: the path of the target program to be tested, its arguments, the number of execution iterations, and the variation (a constant that tells Prober to automatically change the values of variable arguments). In an iterative execution, Prober can measure the global response time of an entire program or the time spent in the execution of different internal code blocks (using a special support library). During the iterative execution, a work directory is created and the measured values are stored in a binary file (located in that directory). This file can be converted into a readable text file, which makes Prober interoperable with other statistical tools. After the execution of a script (submission file), the user can apply the graph generation and statistics calculation operations to the binary files. The first can be used to build a graph (response time, speedup or efficiency) based on the minimum, the maximum or the geometric mean of the values stored in the binary file.

Figure 1 shows the performance graph and the statistics collected from an iterative execution (e.g.: arithmetic and geometric means and standard deviation).

FIGURE 1
VISUALIZATION OF STATISTICS AND THE RESPONSE TIME GRAPH IN PROBER

Prober has several functionalities, but in this paper we only focus on the ones that are essential for teaching and learning parallel processing.

EXPERIMENTAL METHOD

Our experiment is based on a 120-minute practical class, in which the students receive general explanations about the context and some of the concepts of parallel processing. They are also told about the stages of the experiment to be performed and about the use of Prober. Then they perform some simple, guided performance tests and analyze the results. Finally, the students answer a self-assessment questionnaire about their background, their knowledge of parallel processing concepts and the usability of Prober.

The class is composed of an introductory lecture, a practical experiment and a self-assessment questionnaire. It is important to highlight that this is a guided class, and the topics presented in the introductory lecture are discussed and explained again whenever necessary. Before each stage starts, it is explained in detail; each explanation takes approximately 5 minutes.

The main objectives of the introductory lecture are: to introduce the context of parallel processing to the students; to present the main concepts of the performance analysis of parallel programs; and to explain the stages of the experiment and how to use Prober. The introduction to the parallel processing context covers parallel architectures, some complex computational problems, and the parallel programming models (message passing and shared variable). The main performance analysis concepts discussed are: response time, throughput, granularity, speedup, efficiency and scalability.
The three stages of the practical experiment are then presented (configuration, test and analysis). Each stage is performed both with and without the tool. Finally, we present the functionalities of the tool and how to use them. This lecture covers those topics superficially and lasts 30 minutes.

The practical experiment is a key component of the class. In this experiment, the students should learn, employ and verify the concepts of parallel processing in practice by doing performance analysis of parallel programs. They should also find problems, advantages and disadvantages in the use of parallel processing. The teacher must observe the motivation and satisfaction of the students during the experiment, and measure the time the students spend to accomplish each stage (with and without the tool). After that, the teacher should evaluate the productivity of the students, the usability of the tool and the importance of the guidance.

The practical experiment is divided into three stages (configuration, test and analysis). These stages are necessary for the execution and performance analysis of parallel programs. The division of the practical experiment is inherited from an earlier work [12], in which we customized each stage in order to test a subset of Prober's functionalities and to evaluate its usability. In this work, one of the objectives is the evaluation of some new functionalities of Prober. Moreover, in each stage the students aggregate knowledge, reaffirm concepts, and discover and discuss different problems, while the teacher can evaluate the motivation, productivity and satisfaction of the students. For these reasons, we kept the same division of the experiment presented in [12]. The first stage consists of defining, creating and configuring the execution sets.
The students receive explanations about the Pi calculation program and are also instructed on how to test parallel programs with and without Prober. They must create a set of program executions, specifying the program arguments and the number of times those executions should be repeated. In this stage, the students have a first contact with the concepts of scalability and granularity.

In the second stage, the students must repeatedly run the set of program executions created in stage one. The manual execution method (without the tool) is interactive, so the students can observe the response time of each execution. Prober, on the other hand, automatically runs the specified programs and collects the response time of each execution, but the response times cannot be visualized during the run. The best solution would combine both methods (the result of each execution would be visualized as soon as it is collected). In this stage the students learn the concepts of response time and granularity. They also observe that there may be time differences among repeated executions of the same program. Thus, they learn that it is important to execute a program repeatedly in order to obtain a more reliable result. This stage also highlights the importance of a tool for the automatic execution of parallel programs and the collection of performance metrics.

In the third stage, the students must generate the speedup and efficiency graphs. They analyze those graphs and realize that the use of parallelism may or may not be suitable for a given problem. This is essential to their education, because they must know that real parallel systems do not always have an ideal (optimal) behavior (due to instability in the network or the operating system, etc.). Besides, they should learn that the size of the processing grain influences the performance of a parallel program, and that the scalability and speedup of a program are not always ideal (ideally, the efficiency remains constant and the speedup grows linearly). This is the most important stage, because the students analyze the obtained results and construct the concepts of speedup and efficiency. They also learn how to calculate these metrics. The entire practical experiment takes about 70 minutes to be accomplished. However, this time may vary depending on the number of participating students and their level of knowledge of parallel processing.

The last component of the practical class is the self-assessment questionnaire, which should be answered by each student at the end of the class. The questionnaire is composed of 14 questions and takes about 20 minutes to answer. The questions are related to the background of the student (course, year etc.), their knowledge of parallel processing before and after the class, the difficulties in performing the tasks with and without the tool, the learning of concepts (speedup, efficiency, etc.), the need for guidance during the tests and analysis, the motivation of the students to improve their knowledge of parallel processing, and the evaluation of Prober and its functionalities. By analyzing this questionnaire we obtain an assessment of the motivation and satisfaction levels of the students, the learning of parallel processing concepts and the need for guidance in performing the tasks. We can also verify the efficiency of our approach and the usability of Prober.
Experimental Setup

We formed a group of undergraduate volunteer students to take part in the class. We used a cluster of four Pentium III 933 MHz machines interconnected by a Fast Ethernet switch, running MS Windows 98 and WPVM to support parallelism. WPVM (Windows Parallel Virtual Machine) is a message passing library based on PVM, developed for the Windows environment by the Dependable Systems Group of the University of Coimbra [1].

We chose to perform the tests on a parallel implementation of the Pi calculation problem through numerical integration. This problem was chosen because of its simplicity and its high level of parallelism. The Pi calculation algorithm was implemented in C++ using the WPVM library and requires three parameters to run: the number of integration intervals, the number of employed processes and the number of machines. Due to our limitations (class duration and number of available machines), we decided to employ from one to four processes (each one on a different machine), also varying the number of intervals (1, 10 and 100 million). We repeated each execution five times in order to get reliable response times. We consider that an execution is repeated whenever it uses exactly the same arguments as before.

EXPERIMENTAL RESULTS

During the presentation of the results we will analyze the executions of the Pi calculation program (100 million intervals) and discuss what the students learned in each stage of the experiment. During the introductory lecture, some doubts emerged. They were related to the stages of the experiment and the use of the tool, not to the theory of parallel processing itself. We believe that those doubts emerged because the students were anxious to execute the programs and because they already had a basic knowledge of parallel processing. In any case, this introductory lecture is informative and instructive (it is only a first contact).
As a first practical task, the students initialized and configured their WPVM daemons, creating the parallel virtual environment. In the first stage we explained the purpose of the Pi program and showed the students how to use and test it. The students had to create a set of executions, specifying: the number of processes involved in the calculation, the number of intervals to be calculated and the number of times the test should be repeated. In the manual method, the students created an MS-DOS batch file (composed of several command lines) in order to execute the program repeatedly. They spent an average time of 6.75 minutes. Using Prober, the students had to create a script file through its graphical interface. When the tool interprets the script shown in Figure 2, it executes the Pi program 20 times in a row (automatically varying the number of processes from one to four and executing each variation five times). Using the tool, the time spent in this stage was reduced to 3.5 minutes, and the stage was less exhausting.

FIGURE 2
CONFIGURATION OF THE SUBMISSION FILE TO EXECUTE THE PI PROGRAM THROUGH PROBER

At the end of this stage, the students noticed that: the number of processes influences the application scalability (performance may grow when more processes are used); the number of intervals influences the grain size (amount of processing); and the number of execution repetitions influences the reliability of the results (more executions mean more reliability).

In the second stage, the students executed the batch/script file and collected response times. We used two different Pi implementations: an interactive one (used with the manual method) and a batch one (used with Prober). As WPVM does not allow simultaneous program executions, students had to wait for one another before they could run their own tests (they waited about 16.25 minutes each). We highlight that this wait time is proportional to the number of participating students; if the group is too big, it is possible to divide it into subgroups. In the manual method, the students ran their batch files and registered the response time of the program on a piece of paper at the end of each execution. They spent an average time of 11.5 minutes. Using Prober, the students only had to select their submission files and start the execution. The tool automatically executed the program, collected the response times and stored them in binary files. This task took only 2 minutes per student, and the average wait time was reduced to 3 minutes.

At the end of this stage, the students verified in practice that: the grain size influences the response time (network collisions and small-sized grains harm the response time); there is a difference between the response times obtained from different executions of the same program (programs should be executed many times to obtain more reliable results); and the response time changes according to the variation of the number of processes and the size of the grain. In our example, the response time decreased as the number of processes increased. This happened because the processing time prevailed over the communication and synchronization overheads among processes (coarse grain).
In the third stage, the students created some performance graphs and calculated some performance statistics. Using the manual method, students had to transcribe their results to MS Excel and use it to generate speedup and efficiency graphs and to calculate geometric means, arithmetic means, standard deviations, and minimum and maximum values. This part of the experiment depended on the skill of each student in manipulating MS Excel. The fastest student finished the task in 8 minutes and the slowest spent 14 minutes; the average time was 12.5 minutes. Prober generated the performance graphs and calculated the performance statistics automatically, so the students could spend most of the time analyzing results (meeting one of our primary objectives).

The students received explanations about the calculation of speedup and efficiency. The speedup is the sequential response time (one process) divided by the parallel response time (more than one process); it is the gain of the parallel execution over the sequential one. The efficiency is the speedup divided by the number of processes (or machines); it measures the mean utilization of each process. In the case of the Pi calculation example, the obtained speedup was small and the efficiency was low because the communication overhead prevailed over processing (a small number of intervals was used). So, after the explanation, the students learned those concepts correctly and also reaffirmed their assumption that parallelism is not always advantageous.

FIGURE 3
SPEEDUP GRAPH USING 100 MILLION INTERVALS

In Figure 3, we show the speedup graph for the given example. The maximum speedup reached was 2.8, using four machines and 100 million intervals. The ideal speedup in this case would be 4, but this value is difficult to achieve. With this example, the students observed this difficulty and became motivated to repeat the test using different numbers of intervals, trying to obtain the ideal speedup.
The practical experiment demonstrated that the students employed, verified and learned some concepts of parallel processing. They also verified that parallel processing is not always a good solution, but that in many situations it can reduce the response time of a program. Students observed that the response time of a program may vary from one execution to another because of the overhead generated by factors such as the network and the operating system.

The use of Prober apparently made the students more motivated and satisfied during the experiment. They spent about 30.75 minutes (average total time) using the manual method and 6.5 minutes using Prober, so the average total time spent to execute the practical experiment was reduced by 78.8% with Prober. Consequently, the students learned better and increased their productivity, and the use of Prober proved to be highly efficient. During the experiment, we noticed that orientation was very important because students had doubts while executing the tasks. Future versions of Prober will be more focused on the learning process: they will have self-explained examples, better help options and other learning functionalities.

The self-assessment questionnaire showed that the students had the same profile. They attended the last year of an undergraduate computer science course and had a basic

understanding of parallel processing (one of them had an intermediary understanding). We observed that it was harder for the students to execute the tests than to analyze the results without orientation (above all when they used the manual method). Only one student said that he was not able to perform the analysis without orientation using either method. Only one of the students had a high level of difficulty while using the manual method (the others had a medium level); however, all of them had a low level of difficulty while using Prober.

After the class, half of the students stated that their knowledge of parallel processing had increased (the others reported no improvement). In spite of this, all students said that the concepts of speedup, efficiency and scalability had become clearer to them, so they were able to learn some concepts of parallel processing. In their opinion, the least important functionalities of Prober were the file conversion from binary to text and the graphical interface; the most important were the automatic collection and storage of data, the batch execution, the creation of submission files and the generation of graphics. The students suggested some new features for Prober, such as remote access through the Internet, more hot keys and more options in the generation of graphics. All the students became more motivated to test and analyze parallel programs using Prober, and they recommend it to people interested in learning parallel processing.

CONCLUSION

Generally, the complexity of computational problems tends to increase, demanding high performance processing. Sequential computers seem to be reaching their processing speed limits, so parallel machines emerge as a good solution to fulfill the performance requirements. For this reason, the teaching of parallel processing has become an essential subject in computer science courses, and students should be encouraged to learn and employ the concepts of parallelism.
In this work we showed that students can learn parallel processing concepts in a clearer, faster and more efficient way using our approach. We analyzed the teaching and learning of parallel processing through a practical class, where students performed oriented performance tests on parallel programs and analyzed their results. They learned some parallel processing concepts (e.g. speedup and scalability) and became satisfied and more motivated to study and learn parallel processing. The use of Prober as a single aid tool for the performance analysis of parallel programs was approved. These facts confirm that our approach to teaching and learning parallel processing is efficient and could be used in a parallel processing course.

FUTURE WORK

As future work, we intend to apply this experiment to different groups of students, applying a different teaching method to each group: the manual method, the proposed approach (which uses Prober) and an approach that uses other software tools. The performance of each group will be analyzed through objective pre-tests and post-tests. Based on this, we will create a new method for teaching and learning parallel processing that includes the practical class; Prober will aid in the application of the method. We also intend to implement new functionalities in future versions of Prober, with learning purposes, based on the suggestions made by the students.

ACKNOWLEDGMENTS

We would like to thank the Department of Mechanical Engineering for lending the laboratory, PIBIC, CNPq, ProPPG (research project FIP 2001/29P), LSDC and the Informatics Institute for their support. A special thanks to the volunteer students and our English reviewers.

REFERENCES

[1] Alves, A., Silva, L., Carreira, J., Silva, J.
"WPVM: Parallel Computing for the People", Proceedings of HPCN'95, High Performance Computing and Networking Conference, Springer Verlag Lecture Notes in Computer Science, pp. 582-587, Italy, 1995.

[2] Andersen, P. "The Texas Tech Tornado Cluster: A Linux/MPI Cluster for Parallel Programming Education and Research", ACM Crossroads Electronic Magazine, 1999.

[3] Baker, M. A., Fox, G. C., Yau, I. W. "Cluster Management Software", NHSE Review, 1995.

[4] Buyya, R., Apon, A., Jin, H., Mache, J. "Cluster Computing in the Classroom: Topics, Guidelines and Experiences", 1st IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001.

[5] Dongarra, J., Browne, S., London, K. "Review of Performance Analysis Tools for MPI Parallel Programs", 1997. (http://www.cs.utk.edu/~browne/perftools-review/)

[6] Foster, I. "Designing and Building Parallel Programs", on-line book, 1995. (http://www-unix.mcs.anl.gov/dbpp/text/book.html)

[7] Geist, A. "PVM and MPI: What Else Is Needed for Cluster Computing?", 7th European PVM/MPI Users' Group Meeting, 2000.

[8] Guha, K. R., Hartman, J. "Teaching Parallel Processing: Where Architecture and Language Meet", 22nd ASEE/IEEE Frontiers in Education Conference, 1992.

[9] Hartman, J., Sanders, D. "Teaching Parallel Processing using Free Resources", 26th ASEE/IEEE Frontiers in Education Conference, 1996.

[10] Hassaine, O. "Issues in Selecting a Job Management System", CPRE Engineering-HPC Sun BluePrints OnLine, 2002.

[11] Jin, L. "Teaching Parallel Processing to Undergraduate Students", 23rd ASEE/IEEE Frontiers in Education Conference, 1993.

[12] Ramos, L. E. S., Góes, L. F. W., Martins, C. A. P. S. "Prober: Uma Ferramenta de Análise Funcional e de Desempenho de Programas Paralelos e Configuração de Cluster", 2º Workshop de Sistemas Computacionais de Alto Desempenho, Brazil, 2001. (in Portuguese)

[13] White, K. "Experiments in Parallel Processing for Undergraduate Students", 29th ASEE/IEEE Frontiers in Education Conference, 1999.