Intro. to Parallel Programming

Lecture 29: Parallel Programming Overview
Fall 2011
Parallel Programming Paradigms --Various Methods

There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.

• Message Passing - the user makes calls to libraries to explicitly share information between processors.
• Data Parallel - data partitioning determines parallelism.
• Shared Memory - multiple processes sharing common memory space.
• Remote Memory Operation - set of processes in which a process can access the memory of another process without its participation.
• Threads - a single process having multiple (concurrent) execution paths.
• Combined Models - composed of two or more of the above.

Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given appropriate operating system support. An effective implementation is one which closely matches its target hardware and provides the user ease in programming.
Parallel Programming Paradigms: Message Passing
The message passing model is defined as:
• a set of processes using only local memory
• processes communicate by sending and receiving messages
• data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive; see the sketch below)

Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
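As an illustration (not from the original slides), here is a minimal matched send/receive pair written against MPI, the most common such library; the buffer contents, tag, and rank values are arbitrary:

    /* A send on one process must be matched by a receive on the other. */
    #include <mpi.h>

    void exchange(int rank) {
        double x = 3.14;
        if (rank == 0)
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);    /* send to rank 1   */
        else if (rank == 1)
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                           /* matching receive */
    }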
Parallel Programming Paradigms: Data Parallel

The data parallel model is defined as:
• Each process works on a different part of the same data structure
• Commonly a Single Program Multiple Data (SPMD) approach
• Data is distributed across processors
• All message passing is done invisibly to the programmer
• Commonly built "on top of" one of the common message passing libraries

Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.

The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.
Implementation of Message Passing: MPI

• Message Passing Interface, often called MPI.
• A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, software writers, and application scientists.
• Available to both Fortran and C programs.
• Available on a wide variety of parallel machines.
• Target platform is a distributed memory system.
• All inter-task communication is by message passing.
• All parallelism is explicit: the programmer is responsible for parallelizing the program and implementing the MPI constructs.
• Programming model is SPMD (Single Program Multiple Data); a minimal skeleton is sketched below.
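A minimal SPMD skeleton, as a sketch (the surrounding program text is ours, but every call is standard MPI): each process runs the same executable and learns its identity from its rank.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes? */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Typically compiled with mpicc and launched with mpirun -np 4 ./a.out, the same program prints four different greetings, one per process.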
Implementations: F90 / High Performance Fortran (HPF)

• Fortran 90 (F90) - ISO / ANSI standard extensions to Fortran 77.
• High Performance Fortran (HPF) - extensions to F90 to support data parallel programming.
• Compiler directives allow programmer specification of data distribution and alignment.
• New compiler constructs and intrinsics allow the programmer to do computations and manipulations on data with different distributions.
Steps for Creating a Parallel Program
1. If you are starting with an existing serial program, debug the serial code completely.
2. Identify the parts of the program that can be executed concurrently:
   • Requires a thorough understanding of the algorithm
   • Exploit any inherent parallelism which may exist
   • May require restructuring of the program and/or algorithm; may require an entirely new algorithm
3. Decompose the program:
   • Functional Parallelism
   • Data Parallelism
   • Combination of both
4. Code development:
   • Code may be influenced/determined by machine architecture
   • Choose a programming paradigm
   • Determine communication
   • Add code to accomplish task control and communications
5. Compile, Test, Debug
6. Optimization:
   • Measure Performance
   • Locate Problem Areas
   • Improve them
Amdahl’s Law

• Speedup due to enhancement E is

    Speedup w/ E = Exec time w/o E / Exec time w/ E

• Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected

    ExTime w/ E = ExTime w/o E * ((1-F) + F/S)

    Speedup w/ E = 1 / ((1-F) + F/S)
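The formula is easy to check numerically; here is a tiny helper (ours, not from the slides) that evaluates it:

    /* Amdahl's Law: f is the fraction of the task that is enhanced (parallelized),
       s is the speedup factor applied to that fraction. */
    double speedup(double f, double s) {
        return 1.0 / ((1.0 - f) + f / s);
    }

For example, speedup(0.5, 10) is about 1.82: even a 10x enhancement on half the work less than doubles overall performance.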
Examples: Amdahl’s Law

• Amdahl's Law tells us that to achieve linear speedup with 100 processors (e.g., speedup of 100), none of the original computation can be scalar!

• To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.

• What speedup could we achieve from 100 processors if 30% of the original program is scalar?

    Speedup w/ E = 1 / ((1-F) + F/S)
                 = 1 / (0.3 + 0.7/100)
                 ≈ 3.3

• The serial program/algorithm might need to be restructured to allow for efficient parallelization.
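Plugging the last example into the speedup() sketch from the previous slide:

    #include <stdio.h>
    double speedup(double f, double s);   /* assumed from the earlier sketch */

    int main(void) {
        printf("speedup = %.2f\n", speedup(0.70, 100));   /* prints about 3.26 */
        return 0;
    }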

Decomposing the Program


• There are three methods for decomposing a problem into smaller tasks to be performed in parallel: Functional Decomposition, Domain Decomposition, or a combination of both.

• Functional Decomposition (Functional Parallelism)
   - Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution
   - Good to use when there is no static structure or fixed determination of the number of calculations to be performed

• Domain Decomposition (Data Parallelism)
   - Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
   - Good to use for problems where:
     – data is static (factoring and solving large matrix or finite difference calculations)
     – dynamic data structure is tied to a single entity where the entity can be subsetted (large multibody problems)
     – domain is fixed but computation within various regions of the domain is dynamic (fluid vortices models)

• There are many ways to decompose data into partitions to be distributed (see the sketch below):
   - One Dimensional Data Distribution
     – Block Distribution
     – Cyclic Distribution
   - Two Dimensional Data Distribution
     – Block Block Distribution
     – Block Cyclic Distribution
     – Cyclic Block Distribution
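To make the one-dimensional distributions concrete, here is a small sketch (function names and conventions are ours) of the index-to-processor mapping for block and cyclic distributions of n elements over p processors:

    /* Block: element i goes to the processor owning the contiguous chunk of
       size ceil(n/p) that contains i.  Cyclic: elements are dealt out round-robin. */
    int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }
    int cyclic_owner(int i, int p)        { return i % p; }

For n = 8 and p = 2, the block map assigns elements 0-3 to processor 0 and 4-7 to processor 1, while the cyclic map alternates 0, 1, 0, 1, ...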
Functional Decomposition of a Program
• Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution
• Good to use when there is no static structure or fixed determination of the number of calculations to be performed
Functional Decomposition of a Program
[Figure: a sequential program executes Subtasks 1 through 6 one after another; the parallel version of the same program runs the subtasks concurrently where their dependences allow.]
Domain Decomposition (Data Parallelism)
• Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
• There are many ways to decompose data into partitions to be distributed
Summing 100,000 Numbers on 100 Processors

• Start by distributing 1000 elements of vector A to each of the local memories and summing each subset in parallel

    sum = 0;
    for (i = 0; i < 1000; i = i + 1)
        sum = sum + Al[i];                /* sum local array subset */

• The processors then coordinate in adding together the sub sums (Pn is the number of the processor, send(x,y) sends value y to processor x, and receive() receives a value)

    half = 100;
    limit = 100;
    repeat
        half = (half+1)/2;                /* dividing line */
        if (Pn >= half && Pn < limit) send(Pn - half, sum);
        if (Pn < (limit/2)) sum = sum + receive();
        limit = half;
    until (half == 1);                    /* final sum in P0's sum */
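In practice this coordination step is usually a single library call. A hedged MPI sketch of the same computation (variable and function names are ours), assuming the 1000-element local slice is already in a_local[]:

    #include <mpi.h>

    double parallel_sum(const double *a_local, int n_local, MPI_Comm comm) {
        double local = 0.0, total = 0.0;
        for (int i = 0; i < n_local; i++)
            local += a_local[i];                     /* sum local array subset */
        /* tree reduction equivalent to the half/limit loop above; result on rank 0 */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
        return total;                                /* valid only on rank 0 */
    }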
An Example with 10 Processors
[Figure: ten processors P0 through P9, each holding a partial sum. Starting with half = 10 (and limit = 10), the upper half of the active processors send their partial sums to the lower half, which add them in; the active range is halved each step until the final total is in P0's sum.]
Domain Decomposition (Data Parallelism)

• Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution
• There are many ways to decompose data into partitions to be distributed
• For example, the matrix multiplication A x B = C requires a dot-product calculation for each c(i,j) element (see the snippet below)

[Figure: c(i,j) is the dot product of the ith row of A and the jth column of B.]
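As a plain serial reminder of what each processor ends up computing for its portion of C (a sketch, not from the slides; row-major storage assumed):

    /* c(i,j) = dot product of row i of A and column j of B, both n x n. */
    double c_ij(const double *A, const double *B, int n, int i, int j) {
        double s = 0.0;
        for (int k = 0; k < n; k++)
            s += A[i*n + k] * B[k*n + j];
        return s;
    }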
Cannon's Matrix Multiplication
2-D block decompose A and B and arrange the submatrices on the process mesh as:

    A00 B00 | A01 B11 | A02 B22 | A03 B33
    A11 B10 | A12 B21 | A13 B32 | A10 B03
    A22 B20 | A23 B31 | A20 B02 | A21 B13
    A33 B30 | A30 B01 | A31 B12 | A32 B23
Slave Algorithm:
    Receive initial submatrices from master
    Determine IDs (ranks) of mesh neighbors
    Multiply local submatrices
    For (sqrt(# of processors) - 1) more times do
        Shift A submatrices West
        Shift B submatrices North
        Multiply local submatrices
    end for
    Send resulting submatrices back to master
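The shift-multiply loop maps naturally onto MPI's Cartesian topology routines. The following is only a sketch under several assumptions: a square process mesh, square n x n submatrices already skew-distributed as in the arrangement above, and c zero-initialized by the caller; the function and variable names are ours.

    #include <mpi.h>
    #include <math.h>

    static void local_matmul(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }

    void cannon(double *a, double *b, double *c, int n, MPI_Comm comm) {
        int p, dims[2], periods[2] = {1, 1};
        MPI_Comm_size(comm, &p);
        int q = (int)sqrt((double)p);              /* q x q process mesh */
        dims[0] = dims[1] = q;

        MPI_Comm cart;
        MPI_Cart_create(comm, 2, dims, periods, 1, &cart);

        int left, right, up, down;
        MPI_Cart_shift(cart, 1, -1, &right, &left);  /* A moves West  */
        MPI_Cart_shift(cart, 0, -1, &down, &up);     /* B moves North */

        for (int step = 0; step < q; step++) {
            local_matmul(a, b, c, n);                /* multiply local submatrices */
            if (step == q - 1) break;                /* q multiplies in total */
            MPI_Sendrecv_replace(a, n*n, MPI_DOUBLE, left, 0, right, 0,
                                 cart, MPI_STATUS_IGNORE);
            MPI_Sendrecv_replace(b, n*n, MPI_DOUBLE, up, 1, down, 1,
                                 cart, MPI_STATUS_IGNORE);
        }
        MPI_Comm_free(&cart);
    }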
Review: Multiprocessor Basics

• Q1 – How do they share data?
• Q2 – How do they coordinate?
• Q3 – How scalable is the architecture? How many processors?

                                                    # of Proc
    Communication model   Message passing           8 to 2048
                          Shared address: NUMA      8 to 256
                          Shared address: UMA       2 to 64
    Physical connection   Network                   8 to 256
                          Bus                       2 to 36
Review: Bus Connected SMPs (UMAs)
[Figure: four processors, each with its own cache, connected by a single bus to shared memory and I/O.]

• Caches are used to reduce latency and to lower bus traffic
• Must provide hardware for cache coherence and process synchronization
• Bus traffic and bandwidth limits scalability (< ~36 processors)
Network Connected Multiprocessors
[Figure: three processor+cache nodes, each with its own memory, connected by an Interconnection Network (IN).]

• Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with message passing communication with sends and receives
• The interconnection network supports interprocessor communication
Communication in Network Connected Multi’s


• Implicit communication via loads and stores
   - hardware designers have to provide coherent caches and process synchronization primitives
   - lower communication overhead
   - harder to overlap computation with communication
   - more efficient to access remote data by address only when it is demanded, rather than to send it in case it might be used (such a machine has distributed shared memory (DSM))

• Explicit communication via sends and receives
   - simplest solution for hardware designers
   - higher communication overhead
   - easier to overlap computation with communication
   - easier for the programmer to optimize communication
Cache Coherency in NUMAs

• For performance reasons we want to allow shared data to be stored in caches
• Once again we have multiple copies of the same data with the same address in different processors
   - bus snooping won't work, since there is no single bus on which all memory references are broadcast
• Directory-based protocols (sketched below)
   - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
   - directory entries can be distributed (sharing status of a block always in a single known location) to reduce contention
   - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
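A directory entry can be pictured as little more than a state tag plus a sharer set. A hypothetical sketch (field names are ours, not a real protocol implementation):

    #include <stdint.h>

    /* One directory entry per main-memory block. */
    typedef struct {
        enum { UNCACHED, SHARED, MODIFIED } state;  /* current block state            */
        uint64_t sharers;                           /* one bit per processor holding a copy */
    } dir_entry_t;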
IN Performance Metrics


• Network cost
   - number of switches
   - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
   - width in bits per link, length of link
• Network bandwidth (NB) – represents the best case
   - bandwidth of each link * number of links
• Bisection bandwidth (BB) – represents the worst case
   - divide the machine in two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
• Other IN performance issues
   - latency on an unloaded network to send and receive messages
   - throughput – maximum # of messages transmitted per unit time
   - # routing hops worst case, congestion control and delay
Bus IN
[Figure: processor nodes attached to one bidirectional network switch (the bus).]

• N processors, 1 switch, 1 link (the bus)
• Only 1 simultaneous transfer at a time
   - NB = link (bus) bandwidth * 1
   - BB = link (bus) bandwidth * 1
Ring IN

• N processors, N switches, 2 links/switch, N links
• N simultaneous transfers
   - NB = link bandwidth * N
   - BB = link bandwidth * 2
• If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
Fully Connected IN

• N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
• N simultaneous transfers
   - NB = link bandwidth * (N*(N-1))/2
   - BB = link bandwidth * (N/2)^2
Crossbar (Xbar) Connected IN

• N processors, N^2 switches (unidirectional), 2 links/switch, N^2 links
• N simultaneous transfers
   - NB = link bandwidth * N
   - BB = link bandwidth * N/2
Hypercube (Binary N-cube) Connected IN
[Figure: 2-cube and 3-cube hypercube topologies.]

• N processors, N switches, log N links/switch, (N log N)/2 links
• N simultaneous transfers
   - NB = link bandwidth * (N log N)/2
   - BB = link bandwidth * N/2
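The NB and BB expressions from the last few slides are easy to tabulate in code; the helper functions below (ours, not from the slides) return counts in units of one link's bandwidth:

    #include <math.h>

    /* Best-case (NB) and bisection (BB) bandwidth, in units of link bandwidth. */
    double nb_ring(double n)      { return n; }
    double bb_ring(double n)      { return 2; }
    double nb_fully(double n)     { return n * (n - 1) / 2; }
    double bb_fully(double n)     { return (n / 2) * (n / 2); }
    double nb_hypercube(double n) { return n * log2(n) / 2; }
    double bb_hypercube(double n) { return n / 2; }

For N = 64: ring NB = 64, BB = 2; hypercube NB = 192, BB = 32; fully connected NB = 2016, BB = 1024.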
2D and 3D Mesh/Torus Connected IN
• N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links
• N simultaneous transfers
   - NB = link bandwidth * 4N (2D torus) or link bandwidth * 6N (3D torus)
   - BB = link bandwidth * 2 * N^(1/2) (2D) or link bandwidth * 2 * N^(2/3) (3D)
Fat Tree

• Trees are good structures. People in CS use them all the time. Suppose we wanted to make a tree network.

[Figure: a binary tree network with leaf nodes A, B, C, D.]

• Any time A wants to send to C, it ties up the upper links, so that B can't send to D.
   - The bisection bandwidth on a tree is horrible: 1 link, at all times
• The solution is to 'thicken' the upper links.
   - More links as the tree gets thicker increases the bisection bandwidth
   - Rather than design a bunch of N-port switches, use pairs
Fat Tree

• N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
• N simultaneous transfers
   - NB = link bandwidth * N log N
   - BB = link bandwidth * 4
SGI NUMAlink Fat Tree
www.embedded-computing.com/articles/woodacre
IN Comparison

For a 64 processor system
                            Bus    Ring    Torus    6-cube    Fully connected
    Network bandwidth        1
    Bisection bandwidth      1
    Total # of switches      1
    Links per switch
    Total # of links         1
Network Connected Multiprocessors
                        Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
    SGI Origin          R16000                        128        fat tree                      800
    Cray T3E            Alpha 21164      300MHz       2,048      3D torus                      600
    Intel ASCI Red      Intel            333MHz       9,632      mesh                          800
    IBM ASCI White      Power3           375MHz       8,192      multistage Omega              500
    NEC ES              SX-5             500MHz       640*8      640-xbar                      16000
    NASA Columbia       Intel Itanium2   1.5GHz       512*20     fat tree, Infiniband
    IBM BG/L            Power PC 440     0.7GHz       65,536*2   3D torus, fat tree, barrier
IBM BlueGene
                     512-node proto        BlueGene/L
    Peak Perf        1.0 / 2.0 TFlops/s    180 / 360 TFlops/s
    Memory Size      128 GByte             16 / 32 TByte
    Foot Print       9 sq feet             2500 sq feet
    Total Power      9 KW                  1.5 MW
    # Processors     512 dual proc         65,536 dual proc
    Networks         3D Torus, Tree,       3D Torus, Tree,
                     Barrier               Barrier
    Torus BW         3 B/cycle             3 B/cycle
A BlueGene/L Chip
[Figure: block diagram of a BlueGene/L compute chip. Two 700 MHz PowerPC 440 cores, each with 32K/32K L1 caches, a small 2KB L2, and a double FPU, share a 4MB embedded-DRAM L3 cache (128B lines, 8-way associative, ECC), a 16KB multiport SRAM buffer, and a 144-bit DDR memory controller (256MB, 5.5 GB/s). On-chip link interfaces connect to the 3D torus network (6 in, 6 out, 1.4 Gb/s links), the fat tree network (3 in, 3 out, 2.8 Gb/s links), the global barrier network (4 global barriers), and Gbit Ethernet.]
Networks of Workstations (NOWs) Clusters

• Clusters of off-the-shelf, whole computers with multiple private address spaces
• Clusters are connected using the I/O bus of the computers
   - lower bandwidth than multiprocessors that use the memory bus
   - lower speed network links
   - more conflicts with I/O traffic
• Clusters of N processors have N copies of the OS, limiting the memory available for applications
• Improved system availability and expandability
   - easier to replace a machine without bringing down the whole system
   - allows rapid, incremental expandability
• Economy-of-scale advantages with respect to costs
Commercial (NOW) Clusters
                      Proc             Proc Speed   # Proc    Network
    Dell PowerEdge    P4 Xeon          3.06GHz      2,500     Myrinet
    eServer IBM SP    Power4           1.7GHz       2,944
    VPI BigMac        Apple G5         2.3GHz       2,200     Mellanox Infiniband
    HP ASCI Q         Alpha 21264      1.25GHz      8,192     Quadrics
    LLNL Thunder      Intel Itanium2   1.4GHz       1,024*4   Quadrics
    Barcelona         PowerPC 970      2.2GHz       4,536     Myrinet
Summary

• Flynn's classification of processors - SISD, SIMD, MIMD
   - Q1 – How do processors share data?
   - Q2 – How do processors coordinate their activity?
   - Q3 – How scalable is the architecture (what is the maximum number of processors)?
• Shared address multis – UMAs and NUMAs
   - Scalability of bus connected UMAs limited (< ~36 processors)
   - Network connected NUMAs more scalable
• Interconnection Networks (INs)
   - fully connected, xbar
   - ring
   - mesh
   - n-cube, fat tree
• Message passing multis
• Cluster connected (NOWs) multis