Parallel Computing Explained
Parallel Code Tuning

Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009

Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
  5.1 Sequential Code Limitation
  5.2 Parallel Overhead
  5.3 Load Balance
    5.3.1 Loop Schedule Types
    5.3.2 Chunk Size

Parallel Code Tuning

This chapter describes several of the most common techniques for parallel tuning, the types of programs that benefit from them, and the details of implementing them. The majority of the chapter deals with improving load balance.

Sequential Code Limitation

Sequential code is the part of a program that cannot be run on multiple processors. Common reasons why code cannot be made data parallel are:
- The code is not in a do loop.
- The do loop contains a read or write statement.
- The do loop contains a dependency.
- The do loop has an ambiguous subscript.
- The do loop contains a call to a subroutine or a reference to a function subprogram.

Sequential Code Fraction

As shown by Amdahl's Law, if the sequential fraction is too large, speedup is limited. If you suspect that too much sequential code is the problem, you can calculate the sequential fraction of the code using the Amdahl's Law formula.

Measuring the Sequential Code Fraction
- Decide how many processors to use; this is p.
- Run and time the program with 1 processor to get T(1).
- Run and time the program with p processors to get T(p).
- Form the ratio of the two timings, SP = T(1)/T(p); this is the measured speedup.
- Substitute SP and p into the Amdahl's Law formula

      f = (1/SP - 1/p) / (1 - 1/p)

  and solve for f, the fraction of sequential code.

For example, with p = 4, T(1) = 100 s, and T(4) = 40 s, the speedup is SP = 2.5 and f = (1/2.5 - 1/4) / (1 - 1/4) = 0.2, i.e. about 20% of the code is sequential.

Decreasing the Sequential Code Fraction

The compiler's optimization reports list which loops could not be parallelized and why. You can use these reports as a guide to improving the performance of do loops by:
- removing dependencies
- removing I/O
- removing calls to subroutines and function subprograms

Parallel Overhead

Parallel overhead is the processing time spent
- creating threads
- spinning or blocking threads
- starting and ending parallel regions
- synchronizing at the end of parallel regions

When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.

Measuring Parallel Overhead

To get a rough underestimate of parallel overhead:
- Run and time the code using 1 processor.
- Parallelize the code.
- Run and time the parallel code using only 1 processor.
- Subtract the two timings.

Reducing Parallel Overhead

To reduce parallel overhead:
- Don't parallelize all the loops.
- Don't parallelize small loops. To benefit from parallelization, a loop needs roughly 1000 floating-point operations or 500 statements. You can use the IF clause on the OpenMP directive to parallelize a loop only when it is large enough to be worthwhile:

      !$OMP PARALLEL DO IF(n > 500)
      do i = 1, n
         ... body of loop ...
      end do
      !$OMP END PARALLEL DO

- Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead, and often more of the code runs in parallel.
- Don't use more threads than you need.
- Parallelize at the highest level possible.

Load Balance

Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible. The sketch below shows a loop whose iterations do unequal amounts of work.
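This is a minimal illustration, not taken from the original slides; the array names, bounds, and loop body are hypothetical. The inner trip count shrinks as i grows, so early iterations of the parallel loop do far more work than late ones:

      program imbalance
        implicit none
        integer, parameter :: n = 1000
        integer :: i, j
        real :: a(n), b(n)
        a = 0.0
        b = 1.0
        !$OMP PARALLEL DO PRIVATE(j)
        do i = 1, n
           do j = i, n              ! inner trip count shrinks as i grows,
              a(i) = a(i) + b(j)    ! so early iterations do the most work
           end do
        end do
        !$OMP END PARALLEL DO
        print *, a(1), a(n)
      end program imbalance

With a block assignment of iterations, the thread holding the early iterations does most of the work, while the thread holding the late iterations finishes quickly and waits.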
Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other. If the processors have different workloads, some of them sit idle while the others are still working.

Measuring Load Balance

On the SGI Origin, to measure load balance, use the perfex tool, a command-line interface to the R10000 hardware counters. The command

      perfex -e16 -mp a.out > results

reports per-thread cycle counts. Compare the cycle counts to find load-balance problems. The master thread (thread 0) always uses more cycles than the slave threads, but if the counts are vastly different, it indicates load imbalance.

On Linux systems, the threads' CPU times can be compared with ps. A thread with an unusually high or low time compared to the others may not be working efficiently (a high CPU time could be the result of a thread spinning while it waits for the other threads to catch up):

      ps uH

Improving Load Balance

To improve load balance, try changing the way that loop iterations are allocated to threads by
- changing the loop schedule type
- changing the chunk size

These methods are discussed in the following sections.

Loop Schedule Types

On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive:
- Static
- Dynamic
- Guided
- Runtime (the schedule is read from the OMP_SCHEDULE environment variable when the program runs)

If you don't specify a schedule type, the default is used.

Default Schedule Type

The default schedule type allocates iterations to threads in contiguous blocks. For 20 iterations on 4 threads:

      Thread 0: iterations  1 -  5
      Thread 1: iterations  6 - 10
      Thread 2: iterations 11 - 15
      Thread 3: iterations 16 - 20

Static Schedule Type

The static schedule type is useful when some of the iterations do more work than others. With the static schedule type and a chunk size, chunks of iterations are allocated to the threads in a round-robin fashion (without a chunk size, static scheduling reduces to the block-style default shown above).

An Example

Suppose you are computing on the upper triangle of a 100 x 100 matrix, using 2 threads named t0 and t1. With default scheduling, t0 gets the first 50 columns and t1 gets the last 50; because the columns of the upper triangle grow longer from left to right, t1 does far more work, and the workloads are uneven. With static scheduling and a small chunk size, the columns of the matrix are given to the threads in a round-robin fashion, resulting in better load balance.

Dynamic Schedule Type

With the dynamic schedule type, iterations are allocated to threads at runtime. Each thread is given a chunk of iterations; when a thread finishes its work, it enters a critical section where it is given another chunk of iterations to work on. This type is useful when you don't know the iteration count or work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.

Guided Schedule Type

The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks: the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section compared to the dynamic schedule type, so it gives good load balance at a lower overhead cost.

Chunk Size

The word chunk refers to a grouping of iterations; chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, it defaults to 1 (except that plain static scheduling, as noted above, falls back to block allocation). Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads as:

      Thread 0: iterations 1-2,  9-10, 17-18
      Thread 1: iterations 3-4, 11-12, 19-20
      Thread 2: iterations 5-6, 13-14
      Thread 3: iterations 7-8, 15-16

The schedule type and chunk size are specified as follows:

      !$OMP PARALLEL DO SCHEDULE(type, chunk)
      ...
      !$OMP END PARALLEL DO

where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer (RUNTIME takes no chunk argument).
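To make the syntax concrete, here is a sketch of the upper-triangle computation from the earlier example using an interleaved static schedule. The matrix contents, the column sums, and the chunk size of 1 are illustrative assumptions, not taken from the original slides:

      program triangle
        implicit none
        integer, parameter :: n = 100
        integer :: i, j
        real :: a(n,n), colsum(n)
        a = 1.0
        colsum = 0.0
        !$OMP PARALLEL DO SCHEDULE(STATIC, 1) PRIVATE(i)
        do j = 1, n
           do i = 1, j                   ! column j of the upper triangle has j entries
              colsum(j) = colsum(j) + a(i,j)
           end do
        end do
        !$OMP END PARALLEL DO
        print *, colsum(1), colsum(n)
      end program triangle

With 2 threads, SCHEDULE(STATIC, 1) deals the columns out alternately (t0 gets the odd columns, t1 the even ones), so each thread receives a mix of short and long columns. Changing the directive to SCHEDULE(DYNAMIC, 1) or SCHEDULE(GUIDED) requires no change to the loop body.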
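The runtime schedule type is convenient for experimenting with schedule types and chunk sizes without recompiling. A sketch (the environment-variable settings below are examples):

      !$OMP PARALLEL DO SCHEDULE(RUNTIME)
      do i = 1, n
         ... body of loop ...
      end do
      !$OMP END PARALLEL DO

The schedule is then taken from the OMP_SCHEDULE environment variable when the program starts, for example:

      setenv OMP_SCHEDULE "DYNAMIC,4"      # csh
      export OMP_SCHEDULE="GUIDED"         # sh/bash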