Parallel Computing Explained
Parallel Code Tuning
Slides Prepared from the CI-Tutor Courses at NCSA
http://ci-tutor.ncsa.uiuc.edu/
By
S. Masoud Sadjadi
School of Computing and Information Sciences
Florida International University
March 2009
Agenda
1 Parallel Computing Overview
2 How to Parallelize a Code
3 Porting Issues
4 Scalar Tuning
5 Parallel Code Tuning
5.1 Sequential Code Limitation
5.2 Parallel Overhead
5.3 Load Balance
5.3.1 Loop Schedule Types
5.3.2 Chunk Size
Parallel Code Tuning
• This chapter describes several of the most common techniques for parallel tuning, the types of programs that benefit from them, and the details for implementing them.
• The majority of this chapter deals with improving load balance.
Sequential Code Limitation
• Sequential code is the part of a program that cannot be run on multiple processors. Some of the reasons why it cannot be made data parallel are:
  - The code is not in a do loop.
  - The do loop contains a read or write.
  - The do loop contains a dependency (see the sketch after this list).
  - The do loop has an ambiguous subscript.
  - The do loop has a call to a subroutine or a reference to a function subprogram.
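As an illustration of the dependency case, here is a minimal sketch (a hypothetical loop, not from the original slides) of a loop-carried dependency that keeps a do loop sequential:

program sequential_loop
  implicit none
  integer, parameter :: n = 10
  real(8) :: a(n), b(n)
  integer :: i

  a(1) = 0.0d0
  b = 1.0d0

  ! Loop-carried dependency: iteration i reads a(i-1), the value
  ! written by the previous iteration, so the iterations cannot
  ! run independently and this loop remains sequential code.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a(n)
end program sequential_loop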
• Sequential Code Fraction
  - As shown by Amdahl's Law, if the sequential fraction is too large, it limits the speedup you can obtain. If you think too much sequential code is the problem, you can calculate the sequential fraction of the code using the Amdahl's Law formula.
• Measuring the Sequential Code Fraction
  - Decide how many processors to use; this is p.
  - Run and time the program with 1 processor to give T(1).
  - Run and time the program with p processors to give T(p).
  - Form the ratio of the 2 timings, SP = T(1)/T(p).
  - Substitute SP and p into the Amdahl's Law formula:
    f = (1/SP - 1/p) / (1 - 1/p)
  - Solve for f; this is the fraction of sequential code.
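As a worked example with hypothetical timings: suppose p = 4, T(1) = 100 seconds, and T(4) = 40 seconds. Then:
  SP = T(1)/T(p) = 100/40 = 2.5
  f = (1/2.5 - 1/4) / (1 - 1/4) = (0.40 - 0.25) / 0.75 = 0.20
so about 20% of the run time is sequential code.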
• Decreasing the Sequential Code Fraction
  - The compiler's optimization reports list which loops could not be parallelized and why. You can use this report as a guide to improve performance on do loops by:
    - removing dependencies (see the sketch after this list)
    - removing I/O
    - removing calls to subroutines and function subprograms
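As one example of removing a dependency: a summation loop carries a dependency on its accumulator, but OpenMP's REDUCTION clause eliminates it by giving each thread a private partial sum. A minimal sketch with hypothetical data:

program fix_dependency
  implicit none
  integer, parameter :: n = 1000
  real(8) :: a(n), total
  integer :: i

  a = 1.0d0
  total = 0.0d0

  ! Without the REDUCTION clause, the dependency on "total" would
  ! block parallelization.  With it, each thread accumulates a
  ! private partial sum, and the partial sums are combined when
  ! the loop ends.
!$OMP PARALLEL DO REDUCTION(+:total)
  do i = 1, n
     total = total + a(i)
  end do
!$OMP END PARALLEL DO

  print *, 'total =', total
end program fix_dependency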
Parallel Overhead
• Parallel overhead is the processing time spent:
  - creating threads
  - spinning and blocking threads
  - starting and ending parallel regions
  - synchronizing at the end of parallel regions
• When the computational work done by the parallel processes is too small, the overhead time needed to create and control the parallel processes can be disproportionately large, limiting the savings due to parallelism.
• Measuring Parallel Overhead
  - To get a rough underestimate of parallel overhead:
    - Run and time the code using 1 processor.
    - Parallelize the code.
    - Run and time the parallel code using only 1 processor.
    - Subtract the 2 timings.
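The same measurement can be made inside one program. Here is a minimal sketch (hypothetical loop) that times the loop with and without an OpenMP parallel region, pinned to 1 thread, and subtracts the timings:

program measure_overhead
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  real(8) :: a(n), t0, t1, tserial, tparallel
  integer :: i

  ! Time the loop without any OpenMP directives.
  t0 = omp_get_wtime()
  do i = 1, n
     a(i) = sqrt(real(i, 8))
  end do
  t1 = omp_get_wtime()
  tserial = t1 - t0

  ! Time the same loop inside a parallel region on only 1 thread.
  call omp_set_num_threads(1)
  t0 = omp_get_wtime()
!$OMP PARALLEL DO
  do i = 1, n
     a(i) = sqrt(real(i, 8))
  end do
!$OMP END PARALLEL DO
  t1 = omp_get_wtime()
  tparallel = t1 - t0

  ! The difference is a rough underestimate of the parallel overhead.
  print *, 'overhead (s):', tparallel - tserial
end program measure_overhead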
• Reducing Parallel Overhead
  - Don't parallelize all the loops.
  - Don't parallelize small loops. To benefit from parallelization, a loop needs about 1000 floating point operations or 500 statements in the loop. You can use the IF modifier in the OpenMP directive to control when loops are parallelized:

!$OMP PARALLEL DO IF(n > 500)
do i = 1, n
   ... body of loop ...
end do
!$OMP END PARALLEL DO

  - Use task parallelism instead of data parallelism. It doesn't generate as much parallel overhead, and often more code runs in parallel.
  - Don't use more threads than you need.
  - Parallelize at the highest level possible (see the sketch after this list).
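For instance, in a loop nest, parallelizing the outer loop creates one parallel region for the whole nest, whereas parallelizing the inner loop would start and end a region on every outer iteration. A minimal sketch (hypothetical loop nest):

program outer_level
  implicit none
  integer, parameter :: n = 500
  real(8) :: a(n, n)
  integer :: i, j

  ! One parallel region covers the entire nest.  Putting the
  ! directive on the inner loop instead would pay the region
  ! start/end overhead n times.
!$OMP PARALLEL DO PRIVATE(i)
  do j = 1, n
     do i = 1, n
        a(i, j) = real(i + j, 8)
     end do
  end do
!$OMP END PARALLEL DO

  print *, a(n, n)
end program outer_level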
Load Balance
• Load balance is the even assignment of subtasks to processors so as to keep each processor busy doing useful work for as long as possible.
• Load balance is important for speedup because the end of a do loop is a synchronization point where threads need to catch up with each other.
• If processors have different workloads, some of the processors will sit idle while others are still working.
• Measuring Load Balance
  - On the SGI Origin, to measure load balance, use the perfex tool, which is a command line interface to the R10000 hardware counters. The command

perfex -e16 -mp a.out > results

    reports per-thread cycle counts. Compare the cycle counts to determine load balance problems. The master thread (thread 0) always uses more cycles than the slave threads. If the counts are vastly different, it indicates load imbalance.
• On Linux systems, the per-thread CPU times can be compared with ps. A thread with an unusually high or low time compared to the others may not be working efficiently (a high CPU time could be the result of a thread spinning while waiting for other threads to catch up):

ps uH
• Improving Load Balance
  - To improve load balance, try changing the way that loop iterations are allocated to threads by:
    - changing the loop schedule type
    - changing the chunk size
  - These methods are discussed in the following sections.
Loop Schedule Types
• On the SGI Origin2000 computer, 4 different loop schedule types can be specified by an OpenMP directive. They are:
  - Static
  - Dynamic
  - Guided
  - Runtime
• If you don't specify a schedule type, the default will be used.
• Default Schedule Type
  - The default schedule type allocates iterations to threads in contiguous blocks: 20 iterations on 4 threads are allocated as iterations 1-5 to thread 0, 6-10 to thread 1, 11-15 to thread 2, and 16-20 to thread 3.
• Static Schedule Type
  - The static schedule type is used when some of the iterations do more work than others. With the static schedule type, iterations are allocated in a round-robin fashion to the threads.
• An Example
  - Suppose you are computing on the upper triangle of a 100 x 100 matrix, using 2 threads, named t0 and t1. With default scheduling, the workloads are uneven: t0 gets the short columns at the left of the triangle, t1 gets the long columns at the right, and t1 does most of the work.
  - Whereas with static scheduling, the columns of the matrix are given to the threads in a round-robin fashion, so each thread gets a mix of short and long columns, resulting in better load balance (see the sketch below).
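A minimal sketch of this example, assuming a simple arithmetic fill for the triangle update (the actual computation on the slides is not shown):

program triangle
  implicit none
  integer, parameter :: n = 100
  real(8) :: m(n, n)
  integer :: i, j

  m = 0.0d0

  ! Column j of the upper triangle has j iterations of inner work,
  ! so later columns are more expensive.  SCHEDULE(STATIC, 1)
  ! deals the columns to the threads round-robin, so each thread
  ! gets a mix of cheap and expensive columns.
!$OMP PARALLEL DO SCHEDULE(STATIC, 1) PRIVATE(i)
  do j = 1, n
     do i = 1, j
        m(i, j) = 1.0d0 / real(i + j, 8)
     end do
  end do
!$OMP END PARALLEL DO

  print *, m(1, n)
end program triangle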
• Dynamic Schedule Type
  - The iterations are dynamically allocated to threads at runtime. Each thread is given a chunk of iterations. When a thread finishes its work, it goes into a critical section where it is given another chunk of iterations to work on.
  - This type is useful when you don't know the iteration count or the work pattern ahead of time. Dynamic gives good load balance, but at a high overhead cost.
• Guided Schedule Type
  - The guided schedule type is dynamic scheduling that starts with large chunks of iterations and ends with small chunks of iterations. That is, the number of iterations given to each thread depends on the number of iterations remaining. The guided schedule type reduces the number of entries into the critical section compared to the dynamic schedule type. Guided gives good load balance at a low overhead cost. A sketch of both schedule clauses follows.
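Here is a minimal sketch (hypothetical, irregular workload) using the dynamic schedule; swapping the clause for SCHEDULE(GUIDED) gives the guided behavior described above:

program irregular
  implicit none
  integer, parameter :: n = 1000
  real(8) :: w(n)
  integer :: i, j

  ! Iteration i does i units of work, so the load is irregular.
  ! DYNAMIC hands out fixed chunks (here 4 iterations) on demand;
  ! SCHEDULE(GUIDED) would hand out shrinking chunks and enter
  ! the critical section fewer times.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4) PRIVATE(j)
  do i = 1, n
     w(i) = 0.0d0
     do j = 1, i
        w(i) = w(i) + 1.0d0 / real(j, 8)
     end do
  end do
!$OMP END PARALLEL DO

  print *, w(n)
end program irregular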
Chunk Size
• The word chunk refers to a grouping of iterations, and chunk size means how many iterations are in the grouping. The static and dynamic schedule types can be used with a chunk size. If a chunk size is not specified, then the chunk size is 1.
• Suppose you specify a chunk size of 2 with the static schedule type. Then 20 iterations are allocated on 4 threads in round-robin pairs: thread 0 gets iterations 1-2, 9-10, and 17-18; thread 1 gets 3-4, 11-12, and 19-20; thread 2 gets 5-6 and 13-14; and thread 3 gets 7-8 and 15-16.
• The schedule type and chunk size are specified as follows:

!$OMP PARALLEL DO SCHEDULE(type, chunk)
…
!$OMP END PARALLEL DO

  where type is STATIC, DYNAMIC, or GUIDED and chunk is any positive integer.
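As a concrete instance of the directive above (hypothetical loop body), static scheduling with a chunk size of 2:

program chunked
  implicit none
  integer, parameter :: n = 20
  real(8) :: a(n)
  integer :: i

  ! STATIC with a chunk size of 2: the 20 iterations are grouped
  ! in pairs (1-2, 3-4, ...) and the pairs are dealt to the
  ! threads round-robin, as described above.
!$OMP PARALLEL DO SCHEDULE(STATIC, 2)
  do i = 1, n
     a(i) = real(i, 8)**2
  end do
!$OMP END PARALLEL DO

  print *, a(n)
end program chunked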