Case Study: Nikos Tryfonidis, Aristotle University of Thessaloniki
PRACE Autumn School in HPC Programming Techniques, 25-28 November 2014, Athens, Greece

You are given a scientific code, parallelized with MPI. Question: are there any possible performance benefits from a mixed-mode (MPI + OpenMP) implementation? We will go through the steps of preparing, implementing and evaluating the addition of threads to the code.

MG: a general-purpose computational fluid dynamics code (~20,000 lines), written in C and parallelized with MPI, using a communication library written by the author. Developed by Mantis Numerics and provided by Prof. Sam Falle (director of the company and author of the code). MG has been used professionally for research in astrophysics and for simulations of liquid CO2 in pipelines, non-ideal detonations, groundwater flow, etc.

Outline:
1. Preparation: code description, initial benchmarks.
2. Implementation: introduction of threads into the code, applying some interesting OpenMP concepts: parallelizing linked-list traversals, OpenMP Tasks, avoiding race conditions.
3. Results and conclusions.

Code Description

Step 1: Inspect the code and discuss it with the author.
Step 2: Run some initial benchmarks to get an idea of the program's (pure MPI) runtime and scaling.
Step 3: Use profiling to gain some insight into the code's hotspots and bottlenecks.

The computational domain consists of cells connected by joins (1D example: 1st cell, 1st join, 2nd cell, 2nd join, ..., last join, last cell). The code performs its computational work by looping through all cells and joins. Cells are distributed to the MPI processes using a 1D decomposition: each process gets a contiguous group of cells and joins, with halo communication between neighbouring processes.

The computational hotspot of the code is the "step" function (~500 lines). "step" determines the stable time step and then advances the solution over that time step. It mainly consists of halo communication and multiple separate loops over cells and joins:
- 1st-order step: halo communication (calls to MPI), then loops through cells and joins (computational work).
- 2nd-order step: halo communication (calls to MPI), then multiple loops through cells, halo cells and joins (heavier computational work).
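The slides do not show MG's actual data structures. As a minimal sketch only, assuming cells and joins are kept in singly linked lists, the traversal pattern used inside "step" might look like the following (the type names "cell" and "join", the fields, and "do_cell_work" are all illustrative placeholders, not the real code):

    #include <stddef.h>

    /* Hypothetical cell/join structures: each element carries its data
       and a pointer to the next element, forming a singly linked list. */
    typedef struct cell {
        double u[5];          /* conserved variables (placeholder) */
        struct cell *next;    /* next cell in the list, NULL at the end */
    } cell;

    typedef struct join {
        struct cell *left, *right;  /* cells on either side of the join */
        struct join *next;
    } join;

    /* Trivial stand-in for the real per-cell update. */
    void do_cell_work(cell *c)
    {
        for (int k = 0; k < 5; k++)
            c->u[k] += 1.0;
    }

    /* Traversal pattern used throughout "step": visit every cell in turn. */
    void loop_over_cells(cell *first_cell)
    {
        cell *pointer = first_cell;
        while (pointer != NULL) {
            do_cell_work(pointer);
            pointer = pointer->next;
        }
    }

Loops over joins follow the same pattern; this while-loop structure is what the rest of the case study has to parallelize.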
Initial Benchmarks and Profiling

Initial benchmarks were run, using a test case suggested by the code author. A 3D computational domain was used. Various domain sizes were tested (100³, 200³ and 300³ cells), for 10 computational steps. Representative performance results are shown here.

Figure 1: Execution time (in seconds) versus number of MPI processes (size: 300³ cells).
Figure 2: Speedup versus number of MPI processes (all sizes).

Profiling of the code was done using CrayPAT. Four profiling runs were performed, with different numbers of processors (2, 4, 128, 256) and a grid size of 200³ cells. The most relevant result of the profiling runs for the purpose of this presentation is the percentage of time spent in MPI functions.

Figure 3: Percentage of time spent in MPI communication, for 2, 4, 128 and 256 processors (200³ cells).

The performance of the code is seriously affected by increasing the number of processors; performance actually becomes worse after a certain point. Profiling shows that MPI communication dominates the runtime at high processor counts.

Why might mixed mode help? A smaller number of MPI processes means:
- Fewer calls to MPI.
- Cheaper MPI collective communications. MG uses a lot of these (written in its communication library).
- Fewer halo cells (less data communicated, less memory required). Note: the simple 1D decomposition of the domain requires more halo cells per MPI process than 2D or 3D decompositions would, so mixed mode, by requiring fewer halo cells, helps here.

Possible drawbacks:
- The addition of OpenMP code may require extra synchronization (barriers, critical regions, etc.) for the threads, which is bad for performance.
- With only one thread (the master) used for communication, we will not be using the system's maximum bandwidth potential.

The Actual Work

All loops in the "step" function are linked-list traversals! Linked-list example (pseudocode):

    pointer = first cell
    while (pointer != NULL) {
        /* do work on current pointer / cell */
        pointer = next cell
    }

Linked-list traversals use a while loop; iterations continue until the final element of the linked list is reached. In other words, the next element that the loop will work on is not known until the end of the current iteration: there are no well-defined loop boundaries.

Manual Parallelization of Linked List Traversals

A straightforward way to parallelize a linked-list traversal is to transform the while loop into a for loop, which can then be parallelized with an OpenMP for construct:
1. Count the number of cells (one loop needed).
2. Allocate a pointer array of the appropriate size.
3. Point to every cell (one loop needed).
4. Rewrite the original while loop as a for loop.

BEFORE:

    pointer = first cell
    while (pointer != NULL) {
        /* do work on current pointer / cell */
        pointer = next cell
    }

AFTER:

    /* 1. count the cells */
    counter = 0
    pointer = first cell
    while (pointer != NULL) {
        counter += 1
        pointer = next cell
    }

    /* 2.-3. allocate pointer array (size: counter) and point to every cell */
    allocate pointer_array[counter]
    pointer = first cell
    for (i = 0; i < counter; i++) {
        pointer_array[i] = pointer
        pointer = next cell
    }

    /* 4. the original while loop, rewritten as a for loop */
    for (i = 0; i < counter; i++) {
        pointer = pointer_array[i]
        /* do work */
    }

After verifying that the code still produces correct results, we are ready to introduce OpenMP to the for loops we wrote. In similar fashion to plain OpenMP, we must pay attention to:
- the data scope of the variables;
- data dependencies that may lead to race conditions.

    #pragma omp parallel shared(cptr_ptr, ...)            \
                         private(t, cptr, ...)            \
                         firstprivate(cptr_counter, ...)  \
                         default(none)
    {
        #pragma omp for schedule(type, chunk)
        for (t = 0; t < cptr_counter; t++) {
            cptr = cptr_ptr[t];
            /* Do Work */
            /* ( ... ) */
        }
    }
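The listing above keeps the slide's placeholders ("..."). As a concrete, self-contained sketch of the same manual technique, reusing the illustrative "cell" type and do_cell_work() from the earlier sketch (again, these names are assumptions, not MG's actual code):

    #include <stdlib.h>

    void traverse_cells_parallel(cell *first_cell)
    {
        /* 1. count the list elements */
        int cptr_counter = 0;
        for (cell *p = first_cell; p != NULL; p = p->next)
            cptr_counter++;

        /* 2.-3. allocate a pointer array and make each entry point to one cell */
        cell **cptr_ptr = malloc(cptr_counter * sizeof *cptr_ptr);
        int i = 0;
        for (cell *p = first_cell; p != NULL; p = p->next)
            cptr_ptr[i++] = p;

        /* 4. the traversal is now a for loop with known bounds,
              which OpenMP can divide among the threads */
        #pragma omp parallel for schedule(static) default(none) shared(cptr_ptr, cptr_counter)
        for (int t = 0; t < cptr_counter; t++) {
            do_cell_work(cptr_ptr[t]);
        }

        free(cptr_ptr);
    }

The counting and pointer-filling loops are serial overhead paid once per traversal; only the final loop, which carries the actual computational work, is shared among the threads.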
After introducing OpenMP to the code and verifying correctness, performance tests took place in order to evaluate the code's performance as a plain OpenMP code. Tests were run for different problem sizes, using different numbers of threads (1, 2, 4, 8).

Figure 4: Execution time versus number of threads, for second-order step loops (size: 200³ cells).
Figure 5: Speedup versus number of threads, for second-order step loops (size: 200³ cells).

Almost ideal speedup is obtained for up to 4 threads. With 8 threads, the two heaviest loops continue to show decent speedup. Similar results were obtained for the smaller problem size (100³ cells), only with less speedup. In mixed mode, cells will be distributed among the processes, so it will be interesting to see whether we still get speedup there.

Parallelization of Linked List Traversals Using OpenMP Tasks

OpenMP Tasks are a feature introduced with OpenMP 3.0. The task construct basically wraps up a block of code and its corresponding data and schedules it for execution by a thread. Tasks allow the parallelization of a wider variety of loops, making OpenMP more flexible. The task construct is the right tool for parallelizing a while loop with OpenMP: each iteration of the while loop can become a task. Using tasks is an elegant method for our case, leading to cleaner code with minimal additions.

BEFORE:

    pointer = first cell
    while (pointer != NULL) {
        /* do work on current pointer / cell */
        pointer = next cell
    }

AFTER:

    #pragma omp parallel
    {
        #pragma omp single
        {
            pointer = first cell
            while (pointer != NULL) {
                #pragma omp task firstprivate(pointer)
                {
                    /* do work on current pointer / cell */
                }
                pointer = next cell
            }
        }
    }

(One thread creates the tasks; each task must capture its own copy of the pointer, hence firstprivate.)

Using OpenMP tasks, we were able to parallelize the linked-list traversal just by adding OpenMP directives: fewer additions to the code, an elegant method. The usual OpenMP work still applies: data scope and dependencies need to be resolved.

Figure 6: Execution time versus number of threads, for second-order step loops, using tasks (size: 200³ cells).
Figure 7: Speedup versus number of threads, for second-order step loops, using tasks (size: 200³ cells).
Figure 8: OpenMP task creation and dispatch overhead versus number of threads¹.

¹ J. M. Bull, F. Reid and N. McDonnell, "A Microbenchmark Suite for OpenMP Tasks," in Proceedings of the 8th International Workshop on OpenMP (IWOMP 2012), Rome, Italy, June 11-13, 2012.

For the current code, performance tests show that creating and dispatching the tasks requires roughly the same time as completing them, with one thread. With more threads it gets much worse (remember the logarithmic axis in the previous graph). The problem is the very large number of tasks, none of them heavy enough to justify the large overheads. Despite being elegant and clear, OpenMP tasks are clearly not the way to go here. We could try different strategies, e.g. grouping several list elements into each task (a sketch of this idea is shown below), but that would cancel the benefits of tasks (elegance and clarity). Manual parallelization of the linked-list traversals will therefore be used for our mixed-mode MPI+OpenMP implementation with this particular code: it may be ugly and inelegant, but it gets things done. In defense of tasks: if the code had been written with the intent of using OpenMP tasks, things could have been different.
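The grouping strategy mentioned above was not pursued in this study; purely as an illustration, and reusing the same hypothetical "cell" type and do_cell_work() as before, coarser-grained tasks might look like this (CHUNK is an arbitrary tuning parameter):

    #define CHUNK 64   /* cells per task: arbitrary choice for this sketch */

    void traverse_cells_grouped_tasks(cell *first_cell)
    {
        #pragma omp parallel
        {
            #pragma omp single
            {
                cell *p = first_cell;
                while (p != NULL) {
                    /* gather up to CHUNK cells for a single task */
                    cell *batch[CHUNK];
                    int n = 0;
                    while (p != NULL && n < CHUNK) {
                        batch[n++] = p;
                        p = p->next;
                    }
                    /* each task gets its own copies of batch[] and n */
                    #pragma omp task firstprivate(batch, n)
                    {
                        for (int k = 0; k < n; k++)
                            do_cell_work(batch[k]);
                    }
                }
            }
        }
    }

Fewer, heavier tasks reduce the creation and dispatch overhead, but the code is no longer the one-directive-per-loop transformation that made tasks attractive in the first place.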
Avoiding Race Conditions Without Losing The Race

Additional synchronization required by OpenMP can prove very harmful for the performance of the mixed-mode code. While race conditions need to be avoided at all costs, this must be done in the least expensive way possible.

At a certain point, the code needs to find the maximum value of an array. While trivial in serial, with OpenMP this is a race condition waiting to happen. The part of the loop to be parallelized with OpenMP:

    for (i = 0; i < n; i++) {
        if (a[i] > max) {
            max = a[i];
        }
    }

What happens if (when) two or more threads try to write to "max" at the same time? There are two ways to tackle this:
1. Critical regions.
2. Manually (temporary shared arrays).

With a critical region we can easily avoid the race condition. However, critical regions are very bad for performance. (Question: include the whole loop in the critical region or not?)

    for (i = 0; i < n; i++) {
        #pragma omp critical
        if (a[i] > max) {
            max = a[i];
        }
    }

Now only one thread at a time can be inside the critical block.

The manual alternative uses a temporary shared array. Example with a shared array of 16 values and 4 threads:

    Thread 0: 1 4 8 5    Thread 1: 2 7 6 3    Thread 2: 4 3 1 9    Thread 3: 5 1 2 2

Each thread writes its own maximum to the corresponding element of a temporary shared array:

    Temporary shared array: 8 7 9 5

A single thread then picks out the total maximum: 9.

Benchmarks were carried out, measuring execution time for the "find maximum" loop only. Three cases were tested: a single "find max" instruction inside a critical region; the whole "find max" loop inside a critical region; and temporary arrays.

Figure 9: Execution time versus number of threads (size: 200³ cells).
Figure 10: Speedup versus number of threads (size: 200³ cells).

The temporary-array method is clearly the winner. However, additional code is needed for this method, and smaller problem sizes give smaller performance gains with more threads (nothing we can do about that, though).
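As an illustration of the temporary-array approach described above (a minimal sketch, not MG's actual implementation; the function name array_max and variable names are assumptions), each thread records its own maximum in a shared array indexed by thread number, and a single thread then reduces those partial results. For reference, OpenMP 3.1 and later also provide a built-in reduction(max: ...) clause that achieves the same effect.

    #include <omp.h>

    /* Find the maximum of a[0..n-1] using one temporary slot per thread.
       Assumes n > 0. */
    double array_max(const double *a, int n)
    {
        int nthreads = omp_get_max_threads();
        double local_max[nthreads];   /* temporary shared array, one entry per thread */

        /* safe initial value, also covers slots of threads that may not run */
        for (int t = 0; t < nthreads; t++)
            local_max[t] = a[0];

        #pragma omp parallel default(none) shared(a, n, local_max)
        {
            int tid = omp_get_thread_num();
            double my_max = a[0];

            /* each thread scans its share of the array without any locking */
            #pragma omp for schedule(static)
            for (int i = 0; i < n; i++) {
                if (a[i] > my_max)
                    my_max = a[i];
            }
            local_max[tid] = my_max;   /* no race: each thread writes its own slot */
        }

        /* a single thread picks out the overall maximum */
        double max = a[0];
        for (int t = 0; t < nthreads; t++) {
            if (local_max[t] > max)
                max = local_max[t];
        }
        return max;
    }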
Mixed-Mode Performance

The code was tested in mixed mode with 2, 4 and 8 threads per MPI process, with the same variation in problem size as before (100³, 200³, 300³ cells). Representative results are shown here.

Figure 11: Time versus number of threads, 2 threads per MPI process.
Figure 12: Time versus number of threads, 4 threads per MPI process.
Figure 13: Time versus number of threads, 8 threads per MPI process.
Figures 14-16: Speedup versus number of threads, all combinations.

Mixed mode outperforms the original MPI-only implementation for the higher processor numbers tested, while MPI-only performs better than (or about the same as) mixed mode for the lower processor numbers tested. Mixed mode with 4 threads per MPI process is the best choice for the problem sizes tested.

Figure 17: Memory usage versus number of PEs, 8 threads per MPI process (200³ cells).

Was Mixed-Mode Any Good Here?

For the problem sizes and processor numbers tested:
- Mixed mode performed better than, or on a par with, pure MPI.
- At higher processor numbers, mixed mode achieves speedup where pure MPI slows down.
- Mixed mode required significantly less memory.

Any possible performance benefits from a mixed-mode implementation for this code? Answer: yes. For larger numbers of processors (> 256), a mixed-mode implementation of this code provides speedup instead of slowdown and uses less memory.