Application of Mixed-Mode Programming in a Real Case Study

Nikos Tryfonidis,
Aristotle University of Thessaloniki
PRACE Autumn School in HPC Programming Techniques
25-28 November 2014
Athens, Greece

You are given a scientific code, parallelized
with MPI.

Question: Are there any possible performance benefits from a mixed-mode implementation?

We will go through the steps of preparing,
implementing and evaluating the addition of
threads to the code.

MG: a general-purpose Computational Fluid Dynamics code (~20,000 lines), written in C and parallelized with MPI, using a communication library written by the author.

Developed by Mantis Numerics and provided by Prof. Sam Falle (director of the company and author of the code).

MG has been used professionally for research in Astrophysics and for simulations of liquid CO₂ in pipelines, non-ideal detonations, groundwater flow, etc.
1. Preparation: Code description, initial benchmarks.
2. Implementation: Introduction of threads into the code. Application of some interesting OpenMP concepts:
    Parallelizing linked list traversals
    OpenMP Tasks
    Avoiding race conditions
3. Results - Conclusion
Code Description

Step 1: Inspection of the code, discussion with
the author.

Step 2: Run some initial benchmarks to get an
idea of the program’s (pure MPI) runtime and
scaling.

Step 3: Use profiling to gain some insight into
the code’s hotspots/bottlenecks.

Computational domain: Consists of cells
(yellow boxes) and joins (arrows).
[Diagram (1D example): alternating cells and joins: 1st join, 1st cell, 2nd join, 2nd cell, ..., last join, last cell]

The code performs computational work by
looping through all cells and joins.

Cells are distributed to all MPI Processes,
using a 1D decomposition (each Process gets a
contiguous group of cells and joins).
[Diagram: neighbouring processes Proc. 1 and Proc. 2 exchanging halo data across their boundary]
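A minimal sketch of the halo exchange pattern for a 1D decomposition, written with plain MPI point-to-point calls (MG actually does this through the author's communication library, so the function and variable names below are purely illustrative):

#include <mpi.h>

/* Each process holds nlocal interior values plus one halo slot at each
   end: local[0] (left halo) and local[nlocal+1] (right halo). */
void halo_exchange(double *local, int nlocal, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send rightmost interior value to the right neighbour,
       receive the left halo from the left neighbour. */
    MPI_Sendrecv(&local[nlocal], 1, MPI_DOUBLE, right, 0,
                 &local[0],      1, MPI_DOUBLE, left,  0,
                 comm, MPI_STATUS_IGNORE);

    /* Send leftmost interior value to the left neighbour,
       receive the right halo from the right neighbour. */
    MPI_Sendrecv(&local[1],        1, MPI_DOUBLE, left,  1,
                 &local[nlocal+1], 1, MPI_DOUBLE, right, 1,
                 comm, MPI_STATUS_IGNORE);
}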

Computational hotspot of the code: “step”
function (~500 lines).

“step” determines the stable time step and
then advances the solution over that time
step.

Mainly consists of halo communication and
multiple loops over cells and joins (separate).
1st Order Step:
 Halo Communication (Calls to MPI)
 Loops through Cells and Joins (Computational Work)

2nd Order Step:
 Halo Communication (Calls to MPI)
 Loops through Cells, Halo Cells and Joins: Multiple Loops (Heavier Computational Work)
Initial Benchmarks and Profiling

Initial benchmarks were run, using a test case suggested by the code author.

A 3D computational domain was used.
Various domain sizes were tested (100³, 200³
and 300³ cells), for 10 computational steps.

Representative performance results will be
shown here.
Figure 1: Execution time (in seconds) versus number
of MPI Processes (size: 300³)
Figure 2: Speedup versus number of MPI Processes
(all sizes).

Profiling of the code was done using CrayPAT.

Four profiling runs were performed, with
different numbers of processors (2, 4, 128,
256) and a grid size of 200³ cells.

Most relevant result of the profiling runs for
the purpose of this presentation: percentage
of time spent in MPI functions.
Figure 3: Percentage of time spent in MPI communication,
for 2, 4, 128 and 256 processors (200³ cells)

The performance of the code is seriously
affected by increasing the number of
processors.

Performance actually becomes worse after a
certain point.

Profiling shows that MPI communication
dominates the runtime for high processor
counts.

A smaller number of MPI Processes means:
 Fewer calls to MPI.
 Cheaper MPI collective communications. MG uses a lot of these (implemented in the communication library).
 Fewer halo cells (less data communicated, less memory required).
Note: Simple 1D decomposition of domain requires
more halo cells per MPI process than 2D or 3D
domain decompositions. Mixed-Mode, requiring
fewer halo cells, helps here.
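To make the note concrete, a back-of-the-envelope comparison (my own illustration with assumed values, not figures from the talk) of halo cells per interior process for an N³ grid on P processes:

#include <math.h>
#include <stdio.h>

int main(void)
{
    const double N = 200.0;   /* grid cells per dimension (the 200^3 case) */
    const double P = 64.0;    /* number of MPI processes */

    /* 1D slabs: each interior process needs two full N x N halo planes. */
    double halo_1d = 2.0 * N * N;

    /* Idealised 3D blocks (cube root of P per dimension):
       six halo faces of (N / P^(1/3))^2 cells each. */
    double side    = N / cbrt(P);
    double halo_3d = 6.0 * side * side;

    printf("1D decomposition: %.0f halo cells per process\n", halo_1d);  /* 80000 */
    printf("3D decomposition: %.0f halo cells per process\n", halo_3d);  /* 15000 */
    return 0;
}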

Addition of OpenMP code: possible additional synchronization (barriers, critical regions, etc.) needed for threads, which is bad for performance!

Only one thread (the master) is used for communication, which means we will not be using the system's maximum bandwidth potential.
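Since only the master thread makes MPI calls, the matching thread-support level to request is MPI_THREAD_FUNNELED. The talk does not show MG's initialization code, so the following is only a generic sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for FUNNELED support: multiple threads exist, but only the
       master thread will ever call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "MPI library does not support MPI_THREAD_FUNNELED\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ... OpenMP parallel regions here; MPI calls happen only outside
       them, or from the master thread ... */

    MPI_Finalize();
    return 0;
}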
The Actual Work

All loops in “step” function are linked list
traversals!
Linked List example (pseudocode):

pointer = first cell
while (pointer != NULL) {
    - Do Work on current pointer / cell -
    pointer = next cell
}

Linked list traversals use a while loop.

Iterations continue until the final element of the linked list is reached.

In other words:
 The next element that the loop will work on is not known until the end of the current iteration.
 No well-defined loop boundaries!
Manual Parallelization of Linked List Traversals

A straightforward way to parallelize a linked list traversal: transform the while loop into a for loop. The for loop can then be parallelized with OpenMP.

1. Count the number of cells (1 loop needed)
2. Allocate an array of pointers of appropriate size
3. Point to every cell (1 loop needed)
4. Rewrite the original while loop as a for loop
BEFORE:

pointer = first cell
while (pointer != NULL) {
    - Do Work on current pointer / cell -
    pointer = next cell
}

AFTER:

pointer = first cell
while (pointer != NULL) {
    counter += 1
    pointer = next cell
}

Allocate pointer array (size of counter)

pointer = first cell
for (i=0; i<counter; i++) {
    pointer_array[i] = pointer
    pointer = next cell
}

for (i=0; i<counter; i++) {
    pointer = pointer_array[i]
    - Do Work -
}
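The same transformation written as compilable C, using a hypothetical cell type (the struct and field names are illustrative, not MG's actual data structures):

#include <stdlib.h>

struct cell {
    struct cell *next;   /* next cell in the linked list */
    double       value;  /* placeholder for the cell's data */
};

void traverse_as_for_loop(struct cell *first)
{
    /* 1. Count the cells. */
    int counter = 0;
    for (struct cell *p = first; p != NULL; p = p->next)
        counter++;

    /* 2. Allocate an array of pointers and 3. point to every cell. */
    struct cell **cell_ptr = malloc(counter * sizeof *cell_ptr);
    int i = 0;
    for (struct cell *p = first; p != NULL; p = p->next)
        cell_ptr[i++] = p;

    /* 4. The original while loop, rewritten as a for loop
       (this is the loop that later gets an OpenMP directive). */
    for (i = 0; i < counter; i++) {
        struct cell *p = cell_ptr[i];
        p->value += 1.0;   /* stand-in for the real computational work */
    }

    free(cell_ptr);
}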

After verifying that the code still produces
correct results, we are ready to introduce
OpenMP to the “for” loops we wrote.

As with plain OpenMP, we must pay attention to:
 The data scope of the variables.
 Data dependencies that may lead to race conditions.
#pragma omp parallel shared(cptr_ptr, ...)           \
                     private(t, cptr, ...)           \
                     firstprivate(cptr_counter, ...) \
                     default(none)
{
    #pragma omp for schedule(type, chunk)
    for (t = 0; t < cptr_counter; t++) {

        cptr = cptr_ptr[t];

        /* Do Work */
        /* ( . . . ) */
    }
}

After introducing OpenMP to the code and verifying correctness, performance tests were carried out in order to evaluate its performance as a plain OpenMP code.

Tests were run for different problem sizes,
using different numbers of threads (1,2,4,8).
Figure 4: Execution time versus number of threads, for second-order step loops (size: 200³ cells)
Figure 5: Speedup versus number of threads, for second-order step loops (size: 200³ cells)


Almost ideal speedup for up to 4 threads.

With 8 threads, the two heaviest loops continue to show decent speedup.

Similar results for the smaller problem size (100³ cells), only with less speedup.

In mixed mode, cells will be distributed across MPI processes: it will be interesting to see whether we still get speedup there.
Parallelization of Linked List Traversals Using OpenMP Tasks

OpenMP Tasks: a feature introduced with
OpenMP 3.0.

The Task construct basically wraps up a block
of code and its corresponding data, and
schedules it for execution by a thread.

OpenMP Tasks allow the parallelization of a wider variety of loops, making OpenMP more flexible.

The Task construct is the right tool for
parallelizing a “while” loop with OpenMP.

Each iteration of the “while” loop can be a
Task.

Using Tasks is an elegant method for our
case, leading to cleaner code with minimal
additions.
BEFORE:

pointer = first cell
while (pointer != NULL) {
    - Do Work on current pointer / cell -
    pointer = next cell
}

AFTER:

#pragma omp parallel
{
    #pragma omp single
    {
        pointer = first cell
        while (pointer != NULL) {
            #pragma omp task
            {
                - Do Work on current pointer / cell -
            }
            pointer = next cell
        }
    }
}
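A compilable C version of the same pattern, again with a hypothetical cell type rather than MG's actual data structures; note that the pointer is captured firstprivate by each task:

#include <omp.h>

struct cell {
    struct cell *next;
    double       value;
};

void traverse_with_tasks(struct cell *first)
{
    #pragma omp parallel
    {
        /* One thread creates the tasks... */
        #pragma omp single
        {
            for (struct cell *p = first; p != NULL; p = p->next) {
                /* ...and any idle thread in the team executes them. */
                #pragma omp task firstprivate(p)
                {
                    p->value += 1.0;   /* stand-in for the real work */
                }
            }
        } /* implicit barrier: all tasks complete before leaving */
    }
}

The single construct makes one thread generate the tasks while the others execute them; the implicit barrier at the end of the region guarantees that all tasks have completed.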

Using OpenMP Tasks, we were able to
parallelize the linked list traversal by just
adding OpenMP directives!

Fewer additions to the code, elegant method.

Usual OpenMP work still applies: data scope
and dependencies need to be resolved.
Figure 6: Execution time versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)
Figure 7: Speedup versus number of threads, for second-order step loops, using Tasks (size: 200³ cells)
Figure 8: OpenMP Task creation and dispatch overhead versus number of threads¹.

1. J.M. Bull, F. Reid, N. McDonnell, "A Microbenchmark Suite for OpenMP Tasks", 8th International Workshop on OpenMP (IWOMP 2012), Rome, Italy, June 11-13, 2012, Proceedings.

For the current code, performance tests show that creating and dispatching the Tasks takes roughly as long as completing them, even with one thread.

With more threads, it gets much worse (remember the logarithmic axis in the previous graph).

The problem: a very large number of Tasks, none of them heavy enough to justify the considerable overheads.

Despite being elegant and clear, OpenMP
Tasks are clearly not the way to go.

Could try different strategies (e.g. grouping
Tasks together), but that would cancel the
benefits of Tasks (elegance and clarity).

Manual Parallelization of linked list traversals
will be used for our mixed-mode MPI+OpenMP
implementation with this particular code.

It may be ugly and inelegant, but it can get
things done.

In defense of Tasks: If the code had been
written with the intent of using OpenMP Tasks,
things could have been different.
Avoiding Race Conditions Without Losing The Race

Additional synchronization required by
OpenMP can prove to be very harmful for the
performance of the mixed-mode code.

While race conditions need to be avoided at
all costs, this must be done in the least
expensive way possible.

At a certain point, the code needs to find the maximum value of an array.

While trivial in serial, with OpenMP this is a race condition waiting to happen.

Part of the loop to be parallelized with OpenMP:

for (i = 0; i < n; i++) {
    if (a[i] > max) {
        max = a[i];
    }
}

What happens if (when) 2 or more threads try to write to "max" at the same time?

Two ways to tackle this:
1. Critical Regions
2. Manually (Temporary Shared Arrays)



With a Critical Region we can easily avoid the race condition.

However, Critical Regions are very bad for performance.

Question: should the whole loop go inside the critical region, or only the comparison?

for (i = 0; i < n; i++) {
    #pragma omp critical
    if (a[i] > max) {
        max = a[i];
    }
}

Now only one thread at a time can be inside the critical block.
Example (Shared Array, 4 Threads):

Data, split across threads:
Thread 0: 1 4 8 5   Thread 1: 2 7 6 3   Thread 2: 4 3 1 9   Thread 3: 5 1 2 2

Temp. Shared Array: 8 7 9 5
(each thread writes its own maximum to the corresponding element)

Single Thread: 9
(a single thread then picks out the total maximum)
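A minimal sketch of the temporary-array method in C (variable and function names are illustrative, not taken from MG; assumes n >= 1):

#include <stdlib.h>
#include <omp.h>

double find_max(const double *a, int n)
{
    int nthreads = omp_get_max_threads();
    double *temp_max = malloc(nthreads * sizeof *temp_max);

    /* Initialise every slot so unused slots cannot win the final comparison. */
    for (int t = 0; t < nthreads; t++)
        temp_max[t] = a[0];

    #pragma omp parallel shared(a, temp_max)
    {
        int tid = omp_get_thread_num();
        double local_max = a[0];

        /* Each thread finds the maximum of its own share of iterations... */
        #pragma omp for
        for (int i = 0; i < n; i++)
            if (a[i] > local_max)
                local_max = a[i];

        /* ...and writes it to its own slot: no race condition. */
        temp_max[tid] = local_max;
    }

    /* A single thread picks out the total maximum. */
    double max = temp_max[0];
    for (int t = 1; t < nthreads; t++)
        if (temp_max[t] > max)
            max = temp_max[t];

    free(temp_max);
    return max;
}

For completeness: OpenMP 3.1 and later also offer a built-in max reduction for C (reduction(max:...)), which achieves the same effect with a single clause; the temporary-array approach shown above works with any OpenMP version.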

Benchmarks were carried out, measuring
execution time for the “find maximum” loop
only.

Three cases tested:
 Critical Region, with only the "find max" instruction inside
 Critical Region, with the whole "find max" loop inside
 Temporary Arrays
Figure 9: Execution time versus number of threads (size: 200³ cells).
Figure 10: Speedup versus number of threads (size: 200³ cells).

The temporary array method is clearly the
winner.

However:
 Additional code is needed for this method.
 Smaller problem sizes give smaller performance gains for more threads (nothing we can do about that, though).
Mixed-Mode Performance

The code was tested in mixed-mode with 2, 4
and 8 threads per MPI Process.

Same variation in problem size as before
(100³, 200³, 300³ cells).

Representative results will be shown here.
Figure 11: Time versus number of threads, 2 threads per MPI Proc.
Figure 12: Time versus number of threads, 4 threads per MPI Proc.
Figure 13: Time versus number of threads, 8 threads per MPI Proc.
Figure 14: Speedup versus number of threads, all combinations
Figure 15: Speedup versus number of threads, all combinations
Figure 16: Speedup versus number of threads, all combinations

Mixed-Mode outperforms the original MPI-only implementation for the higher processor numbers tested.

MPI-only performs better than (or almost the same as) mixed mode for the lower processor numbers tested.

Mixed-Mode with 4 threads/MPI Process is the best choice for the problem sizes tested.
Figure 17: Memory usage versus number of PEs, 8 threads per MPI Process (200³ cells)
Was Mixed-Mode Any Good Here?

For the problem sizes and processor numbers tested: Mixed-Mode performed better than, or on par with, pure MPI.

Higher processor numbers: Mixed-Mode
manages to achieve speedup where pure MPI
slows down.

Mixed-Mode required significantly less memory.

Any possible performance benefits from a
Mixed-Mode implementation for this code?

Answer:
Yes, for larger numbers of processors (> 256), a mixed-mode implementation of this code:
 Provides speedup instead of slow-down.
 Uses less memory.