Chap 3 - Performance

DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING
PARALLEL PROGRAMMING AND PERFORMANCE
Programming and Performance Analysis of Parallel Computers
Programming Models for Parallel Computing
First, a look at an example of parallelizing a real-world
task, taken from Solving Problems on Concurrent Processors
by Fox et al.
Hadrian’s Wall was built by the ancient Romans to keep
the marauding Scottish barbarians out of Roman England.
It was originally 120 km long and 2 meters high.
How would you build such a huge structure in the shortest
possible time? Clearly the sequential approach — a single
bricklayer building the entire wall — would be too slow.
We can get modest task parallelism by having different
specialist workers concurrently:
• making the bricks
• delivering the bricks
• laying the bricks
But unless we can replicate these workers, this only gives
us a threefold speedup of the process.
Data Parallelism
To really speed up completion of the task, need many
workers laying the bricks concurrently. In general there are
two ways to do this:
• pipelining (vectorization)
• replication (parallelization)
Concurrent execution of a task requires assigning different
sections of the problem (processes and data) to different
processors.
We will concentrate on data parallelism, where the
processors work concurrently on their own section of the
whole data set, or domain.
Splitting up the domain between processors is known as
domain decomposition. Each processor has its own
sub-domain, or grain, to work on. Parallelism may be
described as fine-grained (lots of very small domains) or
coarse-grained (a smaller number of larger domains).
Deciding how the domain decomposition is done is a key
issue in implementing efficient parallel processing. For the
most efficient (least time) execution of the task, need to:
• minimize communication of data between processors
• distribute workload equally among the processors
(known as load balancing)
Vectorization (Pipelining)
One possible approach is to allocate a different bricklayer
to each row of the wall, i.e. a horizontal decomposition of
the problem domain.
This is a pipelined approach — each bricklayer has to wait
until the row underneath them has been started, so there is
some inherent inefficiency.
Once all rows have been started (the pipeline is full) all the
bricklayers (processors) are working efficiently at the same
time, until the end of the task, when there is some
overhead (idle workers) while the upper levels are
completed (the pipeline is flushed).
Parallelization (Replication)
Another approach is to do a vertical decomposition of the
problem domain, so each bricklayer gets a vertical section
of the wall to complete.
In this case, the workers must communicate and
synchronize their actions at the edges where the
sub-domains meet. In general this communication and
synchronization will incur some overhead, so there is some
inefficiency.
However each worker has an inner section of wall within
their sub-domain that is completely independent of the
others, which they can build just as efficiently as if there
were no other workers.
As long as the time taken to build this inner section is
much longer than the time taken up by the communication
and synchronization overhead, then the parallelization will
be efficient and give good speedup over using a single
worker.
Parallel I/O
For large tasks with lots of data, need efficient means to
pass the appropriate data to each processor. In building a
wall, need to keep each bricklayer supplied with bricks.
The simple approach has a single host processor connected
to the outside network and handling all I/O, passing data
to the other processors through the internal communications
network of the machine. The host processor is then a
sequential bottleneck for I/O, and bandwidth is limited to
a single network connection.
A better approach is to do I/O in parallel, so that each node
(or group of nodes) has a direct I/O channel to a disk array,
as in the sketch below.
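As a hedged illustration (not from the original notes), here is a
minimal MPI-IO sketch of this second approach, where every process
writes its own block of a shared file directly instead of funnelling
data through a single host; the file name and data sizes are made up
for the example:

    /* Each rank writes its own block of a shared file directly,
       avoiding a single-host I/O bottleneck. Illustrative only. */
    #include <mpi.h>

    #define NLOC 1000   /* local elements per process (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local[NLOC];
        for (int i = 0; i < NLOC; i++) local[i] = rank;  /* dummy data */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "wall.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* every process writes at its own offset, in parallel */
        MPI_Offset offset = (MPI_Offset)rank * NLOC * sizeof(double);
        MPI_File_write_at(fh, offset, local, NLOC, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Finalize();
        return 0;
    }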
Domain Decompositions
Some standard domain decompositions of a regular 2D grid
(array) include the following (a code sketch after the list
shows how each scheme maps indices to processors):
BLOCK — contiguous chunks of rows or columns of data
on each processor.
BLOCK-BLOCK — block decomposition in both
dimensions.
CYCLIC — data is assigned to processors like cards
dealt to poker players, so neighbouring points are on
different processors.
This can be good for load balancing applications with
varying workloads that have certain types of
communication, e.g. very little, or a lot (global sums or
all-to-all), or strided.
BLOCK-CYCLIC — a block decomposition in one
dimension, cyclic in the other.
SCATTERED — points are scattered randomly across
processors.
This can be good for load balancing applications with little
(or lots of) communication. The human brain seems to
work this way – neighboring sections may control widely
separated parts of the body.
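A small sketch (not from the notes; the sizes are made up) of how
each scheme maps a global index to a processor, using a 1D array for
simplicity:

    /* Which processor owns global index i under each decomposition?
       n = array size, p = processors, b = block size for BLOCK-CYCLIC. */
    #include <stdio.h>

    int owner_block(int i, int n, int p)        { return i / ((n + p - 1) / p); }
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }

    int main(void) {
        int n = 16, p = 4, b = 2;
        for (int i = 0; i < n; i++)
            printf("i=%2d  block=%d  cyclic=%d  block-cyclic=%d\n",
                   i, owner_block(i, n, p), owner_cyclic(i, p),
                   owner_block_cyclic(i, b, p));
        return 0;
    }

A BLOCK-BLOCK decomposition applies the block mapping independently
in each dimension; SCATTERED would use a (pseudo-)random permutation
instead of a formula.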
Domain Decompositions Illustrated
[Figure: BLOCK, BLOCK-BLOCK, CYCLIC, and BLOCK-CYCLIC
decompositions. The squares represent the 2D data array, the
colours represent the processors where the data elements are
stored.]
Static Load Balancing
For maximum efficiency, domain decomposition should give
equal work to each processor. In building the wall, can just
give each bricklayer an equal-length segment.
But things can become much more complicated:
• What if some bricklayers are faster than others? (this is
like an inhomogeneous cluster of different workstations)
• What if there are guard towers every few hundred
meters, which require more work to construct? (in some
applications, more work is required in certain parts of
the domain)
If we know in advance
1. the relative speed of the processors, and
2. the relative amount of processing required for each part
of the problem
then we can do a domain decomposition that takes this
into account, so that different processors may have
different sized domains, but the time to process them will
be about the same. This is static load balancing, and
can be done at compile-time.
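For instance, a toy sketch (the speeds and work units are assumed,
not from the notes) of a compile-time decomposition that gives
faster workers proportionally longer wall segments:

    /* Split W work units in proportion to known relative speeds. */
    #include <stdio.h>

    int main(void) {
        double speed[] = {1.0, 2.0, 1.0, 4.0};  /* relative speeds (assumed) */
        int p = 4, W = 120;                     /* e.g. 120 km of wall */
        double total = 0.0;
        for (int i = 0; i < p; i++) total += speed[i];
        for (int i = 0; i < p; i++)
            printf("worker %d gets %.1f units\n", i, W * speed[i] / total);
        return 0;
    }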
For some applications, maintaining load balance while
simultaneously minimising communication can be a very
difficult optimisation problem.
Irregular Domain Decomposition
[Figure: airflow over an aeroplane wing modelled on an
irregular triangulated mesh. The grid is finest in areas
where there is the most change in the airflow (e.g.
turbulent regions) and coarsest where the flow is more
regular (laminar). The domain is distributed among
processors, indicated by different colours.]
Dynamic Load Balancing
In some cases we do not know in advance one (or both) of:
• the effective performance of the processors – may be
sharing the processors with other applications, so the
load and available CPU may vary
• the amount of work required for each part of the domain
– many applications are adaptive or dynamic, and the
workload is only known at runtime (e.g. dynamic
irregular mesh for CFD, or varying convergence rates for
PDE solvers in different sections of a regular grid)
In this case we need to dynamically change the domain
decomposition by periodically repartitioning the data
between processors. This is dynamic load balancing,
and it can involve substantial overheads in:
• Figuring out how to best repartition the data – may
need to use a fast method that gives a good (but not
optimal) domain decomposition, to reduce computation.
• Moving the data between processors – could restrict this
to local (neighbouring processor) moves to reduce
communication.
Usually repartition as infrequently as possible (e.g. every
few iterations instead of every iteration). There is a
tradeoff between performance improvement and
repartitioning overhead.
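One common pattern (a sketch with invented numbers, not the notes'
method) is to measure per-processor times and repartition only when
the imbalance exceeds a threshold, checking only every few
iterations:

    /* Repartition when measured imbalance (max/mean time) is too large,
       checking only every CHECK_EVERY iterations to amortize the cost. */
    #include <stdio.h>

    #define P 4
    #define CHECK_EVERY 5
    #define THRESHOLD 1.2

    int main(void) {
        double t[P] = {1.0, 1.1, 1.0, 1.6};  /* simulated iteration times */
        for (int iter = 1; iter <= 20; iter++) {
            if (iter % CHECK_EVERY != 0) continue;
            double tmax = t[0], tsum = t[0];
            for (int i = 1; i < P; i++) {
                if (t[i] > tmax) tmax = t[i];
                tsum += t[i];
            }
            double imbalance = tmax / (tsum / P);
            if (imbalance > THRESHOLD) {
                printf("iteration %d: imbalance %.2f, repartitioning\n",
                       iter, imbalance);
                /* ...repartition and move data between processors here... */
                t[3] = 1.05;  /* pretend repartitioning evened out the load */
            }
        }
        return 0;
    }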
Causes of Inefficiency in Parallel Programs
• Communication Overhead
In some cases communication can be overlapped with
computation, i.e. needed data can be prefetched at the
same time as useful computation is being done. But
often processors must wait for data, causing inefficiency
(like a cache miss on a sequential machine, but worse).
• Load Imbalance Overhead
Any load imbalance will cause some processors to be idle
some of the time. In some problems the domain will
change during the computation, requiring expensive
dynamic load balancing.
• Algorithmic Overhead
The best algorithm to solve a problem in parallel is
often slightly different (or in some cases very different)
to the best sequential algorithm. Any excess cycles
caused by the difference in algorithm (more operations
or more iterations) gives an algorithmic overhead.
Note that to calculate speedup honestly, should not use
the time for the parallel algorithm on 1 processor, but
the time for the best sequential algorithm.
• Sequential Overhead
The program may have parts that are not parallelizable,
so that each processor replicates the same calculation.
Parallel Speedup
If N workers or processors are used, expect to be able to
finish the task N times faster.
Speedup = Time on 1 processor / Time on N processors
Speedup will (usually) be at most N on N processors.
(Question: How can it be more?)
N.B. Usually better to plot speedup vs number of
processors, rather than time taken (1/speedup) vs #procs,
since it is much harder to judge deviation from a 1/x curve
than from a straight line.
Any problem of fixed size will have a maximum speedup,
beyond which adding more processors will not reduce (and
will usually increase) the time taken.
[Figure: speedup vs number of processors, comparing linear
(perfect) speedup with a typical actual speedup curve.]
Superlinear Speedup
Answer to the previous question...
It is possible to get speedups greater than N on N
processors if the parallel algorithm is very efficient (no load
imbalance or algorithmic overhead, and any communication
is overlapped with computation), and if splitting the data
among processors allows a greater proportion of the data to
fit into cache memory on each processor, resulting in faster
sequential performance of the program on each processor.
Same idea applies to programs that would be out-of-core
on 1 processor (i.e. require costly paging to disk) but can
have all data in core memory if they are spread over
multiple processors.
Some parallel implementations of tree search applications
(e.g. branch and bound optimisation algorithms) can also
have superlinear speedup, by exploring multiple branches
in parallel, allowing better pruning and thus finding the
solution faster. However the same effect can be achieved by
running multiple processes (or threads) on a single
processor.
Parallel Efficiency
Efficiency measures the actual speedup relative to the
maximum speedup, given as a fraction or a percentage.
It is often quoted instead of speedup.
Efficiency = Speedup / N
Efficiency will (usually) be at most 1 (or 100%).
It is still a function of the number of processors N, but it
varies less strongly than speedup, since it is scaled by N.
Parallel programs are rarely 100% efficient - there is
usually some overhead f from the parallel implementation,
so that the time taken for the parallel program is
Time on N procs = (Time on 1 proc / N) * (1 + f)

Speedup = N / (1 + f)

Efficiency = 1 / (1 + f) ≈ 1 − f   (for small f)
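As a numeric illustration of these formulas (the timings are
invented for the example):

    /* Speedup, efficiency, and the implied overhead f from timings. */
    #include <stdio.h>

    int main(void) {
        double t1 = 100.0, tN = 8.0;  /* assumed times in seconds */
        int N = 16;
        double speedup = t1 / tN;            /* 12.5 */
        double efficiency = speedup / N;     /* about 0.78 */
        double f = 1.0 / efficiency - 1.0;   /* from Efficiency = 1/(1+f) */
        printf("speedup %.2f, efficiency %.1f%%, overhead f = %.2f\n",
               speedup, efficiency * 100.0, f);
        return 0;
    }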
Alternative Definitions of Speedup and Efficiency
Previous definitions of speedup and efficiency assume that
times are measured for a fixed problem size.
This is artificial – in practice, often use more processors to
solve larger problems in the same time, rather than the
same-sized problem in less time.
Also, cannot use this definition to assess scalability of the
program or algorithm to larger numbers of processors.
Cannot expect to obtain good speedups as amount of data
per processor is reduced to zero, no matter how good the
parallel algorithm might be.
Alternative definitions of speedup and efficiency use:
• Constant problem size (standard definition)
• Constant domain size on each processor
• Constant time for solution (adjust size accordingly)
When quoting speedups, be sure to say which of these you
are measuring.
Vector Speedup and Efficiency
For vector machines, efficiency is largest when the vector
size is largest, so the startup time to fill the pipeline is
amortized.
For large vectors (i.e. lots of parallelism), may get close to
maximum performance, while for very small vectors, may
get little or no speedup.
Performance of vector and parallel machines for a vector of
length N is given approximately by the empirical formula
r_N = r_∞ / (n_1/2 / N + 1)   flops
where
• r_∞ – maximum performance of the machine
• n_1/2 – vector length that gives half the maximum
performance
are two parameters that quantify the performance of a
given vector or parallel machine for a particular program.
For good performance, N should be much larger than n_1/2.
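A quick sketch (with an assumed n_1/2 of 100, chosen for
illustration) of the fraction of peak performance this formula
predicts for various vector lengths:

    /* r_N / r_inf = 1 / (n_half/N + 1) for several vector lengths N. */
    #include <stdio.h>

    int main(void) {
        double n_half = 100.0;               /* assumed n_1/2 */
        int lengths[] = {10, 100, 1000, 10000};
        for (int i = 0; i < 4; i++) {
            double N = lengths[i];
            printf("N=%6.0f  r_N/r_inf=%.3f\n", N, 1.0 / (n_half / N + 1.0));
        }
        return 0;   /* N = n_1/2 gives exactly half the peak */
    }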
Amdahl’s Law
Suppose there is some fraction f of the calculation that
cannot be vectorized or parallelized.
Then the time T_N on N processors will be

T_N = (1 − f) T_1 / N + f T_1

Speedup = N / ((1 − f) + N f) ≤ 1 / f
So if 5% of the computational time in a program is spent
doing work that is not parallelizable, the maximum
speedup of the program is 20.
This bound on the speedup is known as Amdahl’s Law.
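A short sketch verifying this bound numerically for f = 0.05 (the
processor counts are arbitrary):

    /* Amdahl's Law: speedup = N / ((1 - f) + N*f), bounded by 1/f. */
    #include <stdio.h>

    int main(void) {
        double f = 0.05;                      /* sequential fraction */
        int procs[] = {1, 10, 100, 1000, 10000};
        for (int i = 0; i < 5; i++) {
            int N = procs[i];
            printf("N=%5d  speedup=%.2f\n", N, N / ((1.0 - f) + N * f));
        }
        return 0;   /* approaches the 1/f = 20 bound as N grows */
    }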
It was originally believed that this would mean that
massive parallelism would not be viable, since most
applications were thought to have some sequential
component that would limit the speedup.
However there are many applications which can be fully
parallelised, or have a sequential component that is very
small compared to the computation that can be done in
parallel, allowing large speedups.
Minimizing Communication Overhead
A crucial problem in parallel programming is to minimize
the ratio of communication to computation; otherwise
programs cannot be efficient.
Want to:
• Minimize the amount of communication
• Overlap communications with computation where
possible
• Reduce total latency costs by sending a few large
messages, rather than many small messages.
At the hardware level, can reduce latency by using fast
(but expensive) communications.
At the systems software level, can reduce latency by using
lightweight message passing protocols, such as Active
Messages.
But most of the real work needs to be done by the
programmer and/or the compiler.
• Keep data local
• Calculate using local data while getting remote data
• Buffer communications to send fewer messages (e.g. send
all edge data at once rather than one point at a time, as in
the sketch below)
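For example, a hedged MPI sketch (ranks arranged in a ring; the
sizes and data are made up) of sending a whole sub-domain edge in one
message, paying the per-message latency once instead of once per
point:

    /* Exchange a whole edge of NLOC points in one message per neighbour. */
    #include <mpi.h>
    #include <stdio.h>

    #define NLOC 1000   /* points on one edge of the sub-domain (assumed) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double edge[NLOC], recvbuf[NLOC];
        for (int i = 0; i < NLOC; i++) edge[i] = rank;  /* dummy edge data */

        /* neighbours in a ring: send right, receive from the left */
        int right = (rank + 1) % size, left = (rank + size - 1) % size;
        MPI_Sendrecv(edge, NLOC, MPI_DOUBLE, right, 0,
                     recvbuf, NLOC, MPI_DOUBLE, left, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d got %d edge points from rank %d\n", rank, NLOC, left);
        MPI_Finalize();
        return 0;
    }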
Benchmarking
The peak performance of a computer is the maximum
number of operations it can perform per second. This is
usually:
Flops = number of pipes (flops/cycle) * clock speed
(cycles/sec) * number of processors
Many applications do not achieve anywhere near the peak
performance, particularly on high-performance computers.
Many standard benchmarks have been developed to
determine actual performance of a computer over a range
of applications, e.g. SPECmarks, transactions per second.
Benchmarks aimed at parallel and vector HPC machines
include:
• LINPACK matrix solver (from the SCALAPACK parallel
linear algebra library), used to rank the Top 500 list
• NAS Benchmarks (kernels for some NASA fluid
dynamics applications)
• PARKBENCH Benchmarks (PARallel Kernels and
BENCHmarks)
• Others at BenchWeb (www.netlib.org/benchweb/)
The best benchmark is of course to run your own
application (or its compute-intensive kernel) on the
machine, e.g. with a timing harness like the sketch below.
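A minimal timing-harness sketch (kernel() is a placeholder for your
application kernel, not something from the notes):

    /* Time a kernel across all processes with MPI_Wtime. */
    #include <mpi.h>
    #include <stdio.h>

    void kernel(void) { /* ...compute-intensive application work... */ }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        MPI_Barrier(MPI_COMM_WORLD);      /* start all processes together */
        double t0 = MPI_Wtime();
        kernel();
        MPI_Barrier(MPI_COMM_WORLD);      /* wait for the slowest process */
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)
            printf("%d procs: %.3f s (compare with the time for the best "
                   "sequential algorithm on 1 processor)\n", nprocs, elapsed);
        MPI_Finalize();
        return 0;
    }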
Using Benchmarks
Care is needed in obtaining and analyzing results from
benchmarks.
• Do runs with different data sizes and numbers of
processors.
• May need to do multiple runs and either average the
results or take the best result.
• Analyze performance, speedup, efficiency, scalability as
functions of number of procs and different data sizes.
• Should have a dedicated machine - someone else running
a job on one node of a cluster could reduce performance by
a factor of 2!
• I/O and initialization time may affect results - there may
be a large difference between CPU time and total wall-clock
time.
• Percentage of time spent on I/O and initialization may
be very different for benchmark (short run time) vs real
application (long run time), so should they be excluded?
Or is I/O an important part of the application?
• Need to be careful in comparing results on different
machines - may need different compiler optimization flags,
a different data decomposition, or even code modifications
to perform well on a different architecture.