Parallel Concepts: An Introduction

The Goal of Parallelization
• Reduction of elapsed time of a program
• Reduction in turnaround time of jobs
[Figure: CPU time vs elapsed time for 1 processor and for 4 processors; the 4-processor run reduces the elapsed time (start to finish) but adds communication overhead, so the total CPU time increases.]
• Overhead:
  – total increase in CPU time
  – communication
  – synchronization
  – additional work in algorithm
  – non-parallel part of the program (one processor works, others spin idle)
• Overhead vs elapsed time is better expressed as Speedup and Efficiency
Speedup and Efficiency
Both measure the parallelization properties of a program
• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:
  S(p) = T(1)/T(p)
  E(p) = S(p)/p
• For ideal parallel speedup we get:
  T(p) = T(1)/p
  S(p) = T(1)/T(p) = p
  E(p) = S(p)/p = 1 or 100%
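For example (hypothetical timings, not from the slides): if T(1) = 100 s and T(8) = 16 s, then S(8) = 100/16 = 6.25 and E(8) = 6.25/8 ≈ 0.78, i.e. 78% efficiency.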
[Figure: Speedup and Efficiency plotted against the number of processors, showing the ideal curve together with super-linear, saturation and disaster behaviour.]
• Scalable programs remain efficient for a large number of processors
Amdahl’s Law
This rule states the following for parallel programs:
The non-parallel fraction of the code (i.e. the overhead) imposes the upper limit on the scalability of the code.
• The non-parallel (serial) fraction s of the program includes the
communication and synchronization overhead
(1)  1 = s + f                      ! program has serial and parallel fractions
(2)  T(1) = T(parallel) + T(serial)
          = T(1)*(f + s)
          = T(1)*(f + (1-f))
(3)  T(p) = T(1)*(f/p + (1-f))
(4)  S(p) = T(1)/T(p)
          = 1/(f/p + 1-f)
          < 1/(1-f)                 ! for p -> inf.
• Thus, the maximum parallel Speedup S(p) for a program that has
parallel fraction f:
(5)  S(p) < 1/(1-f)
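For instance, with a parallel fraction f = 0.95 the speedup can never exceed 1/(1-0.95) = 20, no matter how many processors are used. A minimal C sketch of formula (4); the values of f and p below are example choices for illustration only:

#include <stdio.h>

/* Amdahl's law, formula (4): S(p) = 1 / (f/p + (1 - f)) */
static double amdahl_speedup(double f, int p) {
    return 1.0 / (f / p + (1.0 - f));
}

int main(void) {
    const double f[] = {0.50, 0.90, 0.95, 0.99};   /* example parallel fractions */
    const int    p[] = {2, 4, 8, 32, 1024};        /* example processor counts   */

    for (size_t i = 0; i < sizeof f / sizeof f[0]; i++) {
        for (size_t j = 0; j < sizeof p / sizeof p[0]; j++)
            printf("f=%.2f  p=%5d  S=%7.2f\n", f[i], p[j], amdahl_speedup(f[i], p[j]));
        printf("\n");
    }
    return 0;
}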
The Practicality of Parallel Computing
[Figure: Speedup (1.0-8.0) vs percentage of parallel code for P = 2, 4 and 8 processors; the best hand-tuned codes of the 1970s, 1980s and 1990s lie in the ~99% parallel range. Source: David J. Kuck, High Performance Computing, Oxford Univ. Press, 1996.]
• In practice, making programs parallel is not as difficult as it
may seem from Amdahl’s law
• It is clear that a program has to spend a significant portion (most) of its run time in the parallel region
Fine-Grained vs Coarse-Grained
• Fine-grain parallelism (typically loop level)
  – can be done incrementally, one loop at a time
  – does not require deep knowledge of the code
  – a lot of loops have to be parallel for decent speedup
  – potentially many synchronization points (at the end of each parallel loop)
• Coarse-grain parallelism (see the code sketch below)
  – make larger loops parallel at a higher call-tree level, potentially enclosing many small loops
  – more code is parallel at once
  – fewer synchronization points, reducing overhead
  – requires deeper knowledge of the code
[Figure: call tree rooted at MAIN with subroutines A-O; coarse-grained parallelism is applied near the top of the tree, fine-grained parallelism at the leaf loops p, q, r, s and t.]
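A minimal OpenMP C sketch of the two approaches (the routines and loop bodies are illustrative assumptions, not the subroutines A-O of the call tree above):

#include <stdio.h>
#define N 1000000

static double a[N], b[N];

/* Fine-grained: each small loop is parallelized on its own, so every
   loop pays for its own fork/join and implicit barrier. */
static void fine_grained(void) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < N; i++) a[i] = i * 0.5;

    #pragma omp parallel for
    for (i = 0; i < N; i++) b[i] = a[i] + 1.0;
}

/* Coarse-grained: one parallel region encloses both loops, so the
   thread team is created once for the whole unit of work. */
static void coarse_grained(void) {
    int i;
    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++) a[i] = i * 0.5;

        #pragma omp for
        for (i = 0; i < N; i++) b[i] = a[i] + 1.0;
    }
}

int main(void) {
    fine_grained();
    coarse_grained();
    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}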
Other Impediments to Scalability
Load imbalance:
[Figure: four threads p0-p3 between start and finish; the elapsed time is set by the longest-running thread.]
• the time to complete a parallel execution of a code segment is determined by the longest running thread
• unequal work load distribution leads to some processors being idle, while others work too much
• with coarse-grain parallelization, more opportunities for load imbalance exist (see the sketch below)
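As an illustration (the work function is hypothetical, not from the slide): in the OpenMP C sketch below, iteration i does roughly i times as much work as the first one, so with the usual static block distribution the low-numbered threads finish early and wait at the barrier while the last thread keeps computing.

#include <stdio.h>
#include <omp.h>

#define N 1000

/* triangular workload: iteration i costs roughly i times as much as
   iteration 1, so a static block distribution overloads the last thread */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i * 1000; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main(void) {
    double total = 0.0;

    #pragma omp parallel
    {
        double busy = 0.0;                 /* per-thread time spent computing */

        #pragma omp for reduction(+:total)
        for (int i = 0; i < N; i++) {
            double t0 = omp_get_wtime();
            total += work(i);
            busy += omp_get_wtime() - t0;
        }

        printf("thread %d busy for %.3f s\n", omp_get_thread_num(), busy);
    }

    printf("total = %f\n", total);
    return 0;
}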
Too many synchronization points:
• the compiler will put synchronization points at the start and exit of each parallel region
• if too many small loops have been made parallel, synchronization overhead will compromise scalability
Parallel Programming Models
Classification of Programming models:
Feature           Originally sequential          Data parallel            Message passing       Shared variable
Examples          Autoparallelising compilers    F90, HPF                 MPI, PVM              OpenMP, pthreads
Control flow      Single                         Single                   Multiple              Multiple
Address space     Single                         Single                   Multiple              Single
Communication     Implicit                       Implicit                 Explicit              Implicit
Synchronisation   Implicit                       Implicit                 Implicit/explicit     Explicit
Data allocation   Implicit                       Implicit/semi-explicit   Explicit              Implicit/semi-explicit

• Control flow - number of explicit threads of execution
• Address space - access to global data from multiple threads
• Communication - data transfer part of language or library
• Synchronization - mechanism to regulate access to data
• Data allocation - control of the data distribution to execution threads
Computing π with DPL
π = ∫₀¹ 4 dx/(1+x²) ≈ Σ(0≤i<N) 4/(N(1+((i+0.5)/N)²))
PROGRAM PIPROG
  INTEGER, PARAMETER :: N = 1000000
  INTEGER :: I
  REAL (KIND=8) :: PI, W = 1.0_8/N
  PI = SUM( (/ (4.0_8*W/(1.0_8+((I+0.5_8)*W)**2), I = 0, N-1) /) )
  PRINT *, PI
END PROGRAM PIPROG
Notes:
  – essentially sequential form
  – automatic detection of parallelism
  – automatic work sharing
  – all variables shared by default
  – number of processors specified outside of the code
  – compile with: f90 -apo -O3 -mips4 -mplist
Computing π with Shared Memory
π = ∫₀¹ 4 dx/(1+x²) ≈ Σ(0≤i<N) 4/(N(1+((i+0.5)/N)²))
#include <stdio.h>
#define n 1000000

int main(void)
{
  double l, ls = 0.0, w = 1.0/n;
  int i;
  #pragma omp parallel private(i,l)
  {
    #pragma omp for reduction(+:ls)
    for (i = 0; i < n; i++) {
      l = (i+0.5)*w;
      ls += 4.0/(1.0+l*l);
    }
    /* implicit barrier after the for: the reduction into ls is complete */
    #pragma omp master
    printf("pi is %f\n", ls*w);
  }
  return 0;
}
Notes:
  – essentially sequential form
  – automatic work sharing
  – all variables shared by default
  – directives to request parallel work distribution
  – number of processors specified outside of the code
Computing π with Message Passing
π = ∫₀¹ 4 dx/(1+x²) ≈ Σ(0≤i<N) 4/(N(1+((i+0.5)/N)²))

#include <stdio.h>
#include <mpi.h>
#define N 1000000

int main(int argc, char **argv)
{
  double pi, l, ls = 0.0, w = 1.0/N;
  int i, mid, nth;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &mid);
  MPI_Comm_size(MPI_COMM_WORLD, &nth);

  /* cyclic distribution of the N intervals over the nth processes */
  for (i = mid; i < N; i += nth) {
    l = (i+0.5)*w;
    ls += 4.0/(1.0+l*l);
  }

  MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (mid == 0) printf("pi is %f\n", pi*w);

  MPI_Finalize();
  return 0;
}
Notes:
  – thread identification first
  – explicit work sharing
  – all variables are private
  – explicit data exchange (reduce)
  – all code is parallel
  – number of processors is specified outside of the code
Computing π with POSIX Threads
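A minimal sketch, assuming a POSIX-threads version of the same midpoint sum along the lines of the other examples (the thread count, helper names and per-thread partial-sum array are illustrative assumptions):

#include <stdio.h>
#include <pthread.h>

#define N    1000000
#define NTH  4                  /* number of threads: illustrative choice */

static double partial[NTH];     /* one partial sum per thread */

/* each thread handles a cyclic slice of the N intervals */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    double w = 1.0 / N, l, ls = 0.0;
    for (int i = id; i < N; i += NTH) {
        l = (i + 0.5) * w;
        ls += 4.0 / (1.0 + l * l);
    }
    partial[id] = ls;
    return NULL;
}

int main(void)
{
    pthread_t tid[NTH];
    int ids[NTH];
    double pi = 0.0, w = 1.0 / N;

    for (int t = 0; t < NTH; t++) {        /* explicit thread creation */
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NTH; t++)          /* explicit synchronization */
        pthread_join(tid[t], NULL);

    for (int t = 0; t < NTH; t++)          /* combine the partial sums */
        pi += partial[t];

    printf("pi is %f\n", pi * w);
    return 0;
}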
Comparing Parallel Paradigms
• Automatic parallelization combined with explicit Shared Memory programming (compiler directives) is used on machines with global memory
– Symmetric Multi-Processors, CC-NUMA, PVP
– These methods are collectively known as Shared Memory Programming (SMP)
– The SMP programming model supports both loop-level and coarse-level parallelism:
• the coarse level parallelism has to be specified explicitly
• loop level parallelism can be found by the compiler (implicitly)
– Explicit Message Passing Methods are necessary with machines that have
no global memory addressability:
• clusters of all sorts, NOW & COW
– Message Passing Methods require coarse level parallelism to be scalable
• Choosing a programming model is largely a matter of the application, personal preference and the target machine; it has nothing to do with scalability.
• Scalability limitations:
  – communication overhead
  – process synchronization
• Scalability is mainly a function of the hardware and of (your) implementation of the parallelism.
Summary
• The serial part or the communication overhead of the code limits the scalability of the code (Amdahl's Law)
• Programs have to be >99% parallel to use large (>30 proc) machines
• Several Programming Models are in use today:
– Shared Memory programming (SMP) (with Automatic Compiler
parallelization, Data-Parallel and explicit Shared Memory models)
– Message Passing model
• Choosing a Programming Model is largely a matter of the application,
personal choice and target machine. It has nothing to do with scalability.
– Don’t confuse the algorithm with its implementation
• Machines with a global address space can run applications based on both SMP and Message Passing programming models