Parallel Computing
Final Exam Review
What is Parallel computing?
Parallel computing involves performing
parallel tasks using more than one computer.
Real-life example with related principles: shelving books in a library
Single worker
p workers, each stacking n/p books, but with an arbitration problem (many workers try to stack the next book on the same shelf).
p workers, each stacking n/p books, but without an arbitration problem (each worker works on a different set of shelves).
Important Issues in Parallel Computing
Task/Program Partitioning.
How to split a single task among the processors so that each processor performs the same amount of work, and all processors work collectively to complete the task.
Data Partitioning.
How to split the data evenly among the processors in such a way that processor interaction is minimized.
How we allow communication among different processors and how we arbitrate communication-related conflicts.
Design of parallel computers so that we
resolve the above issues.
Design, analysis and evaluation of parallel
algorithms run on these machines.
Portability and scalability issues related to
parallel programs and algorithms
Tools and libraries used in such systems.
Units of Measure in HPC
• High Performance Computing (HPC) units are:
—Flop: floating point operation
—Flops/s: floating point operations per second
—Bytes: size of data (a double precision floating point number is 8)
• Typical sizes are millions, billions, trillions…
Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
• See for current list of fastest machines
What is a parallel computer?
A parallel computer is a collection of processors that
cooperatively solve computationally intensive
problems faster than other computers.
Parallel algorithms allow the efficient programming of
parallel computers.
This way the waste of computational resources can
be avoided.
Parallel computer vs. Supercomputer
A supercomputer refers to a general-purpose computer that can solve computationally intensive problems faster than traditional computers.
A supercomputer may or may not be a parallel computer.
Flynn’s taxonomy of computer
architectures (control mechanism)
Depending on their instruction and data streams, computer architectures can be classified into the following groups.
(1) SISD (Single Instruction Single Data): This is a sequential computer.
(2) SIMD (Single Instruction Multiple Data): This is a parallel machine like the TM CM-200. SIMD machines are suited for data-parallel programs where the same set of instructions is executed on a large data set.
Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
(3) MISD (Multiple Instructions Single Data) : Some consider a systolic array
a member of this group.
(4) MIMD (Multiple Instructions Multiple Data) : All other parallel machines.
A MIMD architecture can be an MPMD or an SPMD. In a Multiple Program
Multiple Data organization, each processor executes its own program as
opposed to a single program that is executed by all processors on a Single
Program Multiple Data architecture.
Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
Note: Some consider the CM-5 a combination of MIMD and SIMD, as it contains control hardware that allows it to operate in a SIMD mode.
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
Taxonomy based on Address-Space
Organization (memory distribution)
Message-Passing Architecture
In a distributed memory machine each processor has its own memory. Each processor
can access its own memory faster than it can access the memory of a remote
processor (NUMA for Non-Uniform Memory Access). This architecture is also
known as message-passing architecture and such machines are commonly referred to
as multicomputers.
Examples: Cray T3D/T3E, IBM SP1/SP2, workstation clusters.
Shared-Address-Space Architecture
Provides hardware support for read/write to a shared address space. Machines built
this way are often called multiprocessors.
(1) A shared memory machine has a single address space shared by all processors
(UMA, for Uniform Memory Access).
The time taken by a processor to access any memory word in the system is identical.
Examples: SGI Power Challenge, SMP machines.
(2) A distributed shared memory system is a hybrid between the two previous
ones. A global address space is shared among the processors but is distributed
among them. Example: SGI Origin 2000
Note: The existence of caches in shared-memory parallel machines causes cache coherence problems when a cached variable is modified by one processor and the shared variable is requested by another processor. cc-NUMA stands for cache-coherent NUMA architectures (e.g., Origin 2000).
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
Message Passing vs. Shared Address Space Platforms
Message passing requires little hardware support, other than a network.
Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).
Taxonomy based on processor granularity
Granularity sometimes refers to the power of individual processors. It is sometimes also used to denote the degree of parallelism.
(1) A coarse-grained architecture consists of (usually few) powerful processors (e.g., old Cray machines).
(2) A fine-grained architecture consists of (usually many inexpensive) processors (e.g., TM CM-200, CM-2).
(3) A medium-grained architecture is between the two (e.g., CM-5).
Process Granularity refers to the amount of computation
assigned to a particular processor of a parallel machine for a
given parallel program. It also refers, within a single program,
to the amount of computation performed before communication
is issued. If the amount of computation is small (low degree of
concurrency) a process is fine-grained. Otherwise granularity
is coarse.
Taxonomy based on processor synchronization
(1) In a fully synchronous system a global clock is used to synchronize all operations performed by the processors.
(2) An asynchronous system lacks any synchronization facilities. Processor synchronization needs to be explicit in a user’s program.
(3) A bulk-synchronous system comes in between a
fully synchronous and an asynchronous system.
Synchronization of processors is required only at
certain parts of the execution of a parallel program.
Physical Organization of Parallel
Platforms – ideal architecture (PRAM)
The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a
parallel computer.
A PRAM consists of a collection of (sequential) processors that can synchronously access a global shared memory in unit time. Each processor can thus access the shared memory as fast (and efficiently) as it can access its own local memory.
The main advantage of the PRAM is its simplicity in capturing parallelism and abstracting away communication and synchronization issues related to parallel computing.
Processors are considered to be in abundance and unlimited in number. The resulting PRAM
algorithms thus exhibit unlimited parallelism (number of processors used is a function of
problem size).
The abstraction thus offered by the PRAM is a fully synchronous collection of processors and a
shared memory which makes it popular for parallel algorithm design.
It is, however, this abstraction that also makes the PRAM unrealistic from a practical point
of view.
Full synchronization offered by the PRAM is too expensive and time demanding in parallel
machines currently in use.
Remote memory (i.e. shared memory) access is considerably more expensive in real machines
than local memory access
UMA machines with unlimited parallelism are difficult to build.
Four Subclasses of PRAM
Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants.
ER (Exclusive Read) or EW (Exclusive Write) PRAMs do not allow concurrent access to the shared memory.
Concurrent access is allowed, however, for CR (Concurrent Read) or CW (Concurrent Write) PRAMs.
Combining the rules for read and write access, there are four PRAM variants:
CREW
Multiple read accesses to a memory location are allowed. Multiple write accesses to a memory location are serialized.
EREW
Access to a memory location is exclusive. No concurrent read or write operations are allowed.
Weakest PRAM model
ERCW
Multiple write accesses to a memory location are allowed. Multiple read accesses to a memory location are serialized.
Can simulate an EREW PRAM
CRCW
Allows multiple read and write accesses to a common memory location.
Most powerful PRAM model
Can simulate both EREW PRAM and CREW PRAM
Resolving concurrent write access
(1) In the arbitrary PRAM, if multiple processors write into a single shared memory cell, then an arbitrary processor succeeds in writing into this cell.
(2) In the common PRAM, processors must write the same value into the shared memory cell.
(3) In the priority PRAM, the processor with the highest priority (smallest or largest indexed processor) succeeds in writing.
(4) In the combining PRAM, if more than one processor writes into the same memory cell, the result written into it depends on the combining operator. If it is the sum operator, the sum of the values is written; if it is the maximum operator, the maximum is written.
Note: An algorithm designed for the common PRAM can be executed on a priority or
arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM
algorithm when run on a priority PRAM.
Interconnection Networks for
Parallel Computers
Interconnection networks carry data between
processors and to memory.
Interconnects are made of switches and links (wires, fiber).
Interconnects are classified as static or dynamic.
Static networks
Consist of point-to-point communication links among processing nodes
Also referred to as direct networks
Dynamic networks
Built using switches (switching element) and links
Communication links are connected to one another
dynamically by the switches to establish paths among
processing nodes and memory banks.
Static and Dynamic
Interconnection Networks
Classification of interconnection networks: (a) a static
network; and (b) a dynamic network.
Network Topologies
Bus-Based Networks
The simplest network, consisting of a shared medium (bus) that is common to all the nodes.
The distance between any two nodes in the network is constant (O(1)).
Ideal for broadcasting information among nodes.
Scalable in terms of cost, but not scalable in terms of performance.
The bounded bandwidth of a bus places limitations on the overall performance as the number of nodes increases.
Typical bus-based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.
Network Topologies
Crossbar Networks
Employs a grid of switches or switching nodes to connect p
processors to b memory banks.
Nonblocking network:
The connection of a processing node to a memory bank does not block the connection of any other processing nodes to other memory banks.
The total number of switching nodes required is Θ(pb). (It is reasonable to assume b ≥ p.)
Scalable in terms of performance.
Not scalable in terms of cost.
Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies:
Multistage Network
Multistage Networks
Intermediate class of networks between bus-based network and crossbar network
Blocking networks: access to a memory bank by a processor may disallow access
to another memory bank by another processor.
More scalable than the bus-based network in terms of performance, more scalable
than crossbar network in terms of cost.
The schematic of a typical multistage interconnection network.
Network Topologies:
Multistage Omega Network
Omega network
Consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks).
Each stage consists of an interconnection pattern that connects p inputs and p outputs:
Perfect shuffle (left rotation): input i is connected to output j, where
$$ j = \begin{cases} 2i, & 0 \le i \le p/2 - 1 \\ 2i + 1 - p, & p/2 \le i \le p - 1 \end{cases} $$
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the outputs.
Cross-over connection: the inputs to the switching node are crossed over and then sent out.
Has p/2 × log p switching nodes
Network Topologies:
Multistage Omega Network
A complete Omega network with the perfect shuffle
interconnects and switches can now be illustrated:
A complete omega network connecting eight inputs and eight outputs.
An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p).
Network Topologies:
Multistage Omega Network – Routing
Let s be the binary representation of the source and d be that of the destination processor.
The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; otherwise, it is routed in crossover mode.
This process is repeated for each of the log p switching stages, using the next most significant bit at each stage.
Note that this is not a non-blocking network.
Network Topologies:
Multistage Omega Network – Routing
An example of blocking in omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies - Fixed
Connection Networks (static)
Completely-connected Network
Star-Connected Network
Linear array
2d-array or 2d-mesh or mesh
Complete Binary Tree (CBT)
2d-Mesh of Trees
Static Interconnection Network
One can view an interconnection network as a graph whose nodes
correspond to processors and its edges to links connecting
neighboring processors. The properties of these interconnection
networks can be described in terms of a number of criteria.
(1) Set of processor nodes V . The cardinality of V is the number of
processors p (also denoted by n).
(2) Set of edges E linking the processors. An edge e = (u, v) is
represented by a pair (u, v) of nodes. If the graph G = (V,E) is directed,
this means that there is a unidirectional link from processor u to v. If
the graph is undirected, the link is bidirectional. In almost all networks
that will be considered in this course communication links will be
bidirectional. The exceptions will be clearly distinguished.
(3) The degree d_u of node u is the number of links containing u as an endpoint. If graph G is directed we distinguish between the out-degree of u (number of pairs (u, v) ∈ E, for any v ∈ V) and, similarly, the in-degree of u. The degree d of graph G is the maximum of the degrees of its nodes, i.e. d = max_u d_u.
Evaluating Static Interconnection
Network (cont.)
(4) The diameter D of graph G is the maximum of the lengths of the shortest paths linking any two nodes of G. A shortest path between u and v is the path of minimal length linking u and v. We denote the length of this shortest path by d_uv. Then, D = max_{u,v} d_uv. The diameter of a graph G denotes the maximum delay (in terms of number of links traversed) that will be incurred when a packet is transmitted from one node to the other of the pair that contributes to D (i.e. from u to v or the other way around, if D = d_uv). Of course such a delay would hold only if messages follow shortest paths (the case for most routing algorithms).
(5) Latency is the total time to send a message including software overhead. Message latency is the time to send a zero-length message.
(6) Bandwidth is the number of bits transmitted in unit time.
(7) Bisection width is the number of links that need to be removed from G to split the nodes into two sets of about the same size (±1).
Network Topologies - Fixed
Connection Networks (static)
Completely-connected Network
Each node has a direct communication link to every other
node in the network.
Ideal in the sense that a node can send a message to
another node in a single step.
Static counterpart of crossbar switching networks
Star-Connected Network
One processor acts as the central processor. Every other
processor has a communication link connecting it to this
central processor.
Similar to bus-based network.
The central processor is the bottleneck.
Network Topologies: Completely Connected
and Star Connected Networks
Example of an 8-node completely connected network.
(a) A completely-connected network of eight nodes;
(b) a star connected network of nine nodes.
Network Topologies - Fixed
Connection Networks (static)
Linear array
In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and right.
Extension: ring or 1-D torus (linear array with wraparound).
2d-array or 2d-mesh or mesh
The processors are ordered to form a 2-dimensional structure (square) so that each processor is connected to its four neighbors (north, south, east, west), except perhaps for the processors on the boundary.
Extension of linear array to two dimensions: each dimension has √p nodes, with a node identified by a two-tuple (i,j).
3d-array or 3d-mesh
A generalization of a 2d-mesh in three dimensions. Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2).
Complete Binary Tree (CBT)
There is only one path between any pair of nodes.
Static tree network: have a processing element at each node of the tree.
Dynamic tree network: nodes at intermediate levels are switching nodes and the leaf
nodes are processing elements.
Communication bottleneck at higher levels of the tree.
Solution: increasing the number of communication links and switching nodes closer to the root.
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with
wraparound link.
Network Topologies:
Two- and Three Dimensional Meshes
Two and three dimensional meshes: (a) 2-D mesh with no
wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c)
a 3-D mesh with no wraparound.
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b)
a dynamic tree network.
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.
Evaluating Network Topologies
Linear array
|V| = N, |E| = N − 1, d = 2, D = N − 1, bw = 1 (bisection width).
2d-array or 2d-mesh or mesh
For |V| = N, we have a √N × √N mesh structure, with |E| ≤ 2N = O(N), d = 4, D = 2√N − 2, bw = √N.
3d-array or 3d-mesh
Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2).
Complete Binary Tree (CBT) on N = 2^n leaves
For a complete binary tree on N leaves, we define the level of a node to be its distance from the root. The root is of level 0 and the number of nodes of level i is 2^i. Then, |V| = 2N − 1, |E| ≤ 2N − 2 = O(N), d = 3, D = 2 lg N, bw = 1.
Network Topologies - Fixed
Connection Networks (static)
2d-Mesh of Trees
An N^2-leaf 2d-MOT consists of N^2 nodes ordered as in an N × N 2d-array (but without the links). The N rows and N columns of the 2d-MOT form N row CBTs and N column CBTs respectively.
For such a network, |V| = N^2 + 2N(N − 1), |E| = O(N^2), d = 3, D = 4 lg N, bw = N. The 2d-MOT possesses an interesting decomposition property: if the 2N roots of the CBTs are removed, we get four N/2 × N/2 2d-MOTs.
The 2d-Mesh of Trees (2d-MOT) combines the advantages of 2d-meshes and binary trees. A 2d-mesh has large bisection width but large diameter (√N). On the other hand, a binary tree on N leaves has small bisection width but small diameter. The 2d-MOT has small diameter and large bisection width.
A 3d-MOT can be defined similarly.
Network Topologies - Fixed
Connection Networks (static)
Hypercube
The hypercube is the major representative of a class of networks that are called hypercubic networks. Other such networks are the butterfly, the shuffle-exchange graph, the de Bruijn graph, cube-connected cycles, etc.
Each vertex of an n-dimensional hypercube is represented by a binary string of length n. Therefore there are |V| = 2^n = N vertices in such a hypercube. Two vertices are connected by an edge if their strings differ in exactly one bit position.
Let u = u_1 u_2 ... u_i ... u_n. An edge is a dimension i edge if it links two nodes that differ in the i-th bit position.
This way vertex u is connected to vertex u^i = u_1 u_2 ... ū_i ... u_n with a dimension i edge. Therefore |E| = (N lg N)/2 and d = lg N = n.
The hypercube is the first network examined so far whose degree is not a constant but a very slowly growing function of N. The diameter of the hypercube is D = lg N. A path from node u to node v can be determined by correcting the bits of u to agree with those of v, starting from dimension 1 in a “left-to-right” fashion. The bisection width of the hypercube is bw = N/2. This is a result of the following property of the hypercube: if all edges of dimension i are removed from an n-dimensional hypercube, we get two hypercubes, each one of dimension n − 1.
Network Topologies:
Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
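Static Interconnection Networks
Table: comparison of static networks by characteristics including cost (no. of links). Rows include: complete binary tree, linear array, 2-D mesh (no wraparound), 2-D wraparound mesh, and wraparound k-ary d-cube.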
Performance Metrics for
Parallel Systems
Number of processing elements p
Execution Time
Parallel runtime: the time that elapses from the moment a
parallel computation starts to the moment the last processing
element finishes execution.
Ts: serial runtime
Tp: parallel runtime
Total Parallel Overhead T0
Total time collectively spent by all the processing elements minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element: T0 = p·Tp − Ts.
Performance Metrics for
Parallel Systems
Speedup S:
The ratio of the serial runtime of the best sequential algorithm for
solving a problem to the time taken by the parallel algorithm to
solve the same problem on p processing elements.
Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n)
Theoretically, speedup can never exceed the number of processing elements p (S ≤ p).
Proof: Assume the speedup is greater than p. Then each processing element spends less than time Ts/p solving the problem. In this case, a single processing element could emulate the p processing elements and solve the problem in fewer than Ts units of time. This is a contradiction because speedup, by definition, is computed with respect to the best sequential algorithm.
Superlinear speedup: In practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by a serial algorithm is greater than that of its parallel formulation, or due to hardware features that put the serial implementation at a disadvantage.
Example for Superlinear speedup
Superlinear speedup:
Example 1: Superlinear effects from caches: With a problem instance of size A and a 64KB cache, the cache hit rate is 80%. Assuming a cache latency of 2ns and a DRAM latency of 100ns, the memory access time is 2*0.8 + 100*0.2 = 21.6ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS. With a problem instance of size A/2 and a 64KB cache, the cache hit rate is higher, i.e., 90%; 8% of the data comes from local DRAM and the other 2% comes from remote DRAM with a latency of 400ns, so the memory access time is 2*0.9 + 100*0.08 + 400*0.02 = 17.8ns. The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. Then the speedup will be 112.36/46.3 = 2.43.
Example for Superlinear speedup
Superlinear speedup:
Example 2: Superlinear effects due to exploratory decomposition: consider exploring the leaf nodes of an unstructured tree. Each leaf has a label associated with it and the objective is to find a node with a specified label, say ‘S’. The solution node is the rightmost leaf in the tree. A serial formulation of this problem based on depth-first tree traversal explores the entire tree, i.e. all 14 nodes, taking 14 time units. Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree is explored by processing element 1. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 time units. Then the speedup is 14/5 = 2.8.
Performance Metrics for
Parallel Systems(cont.)
Efficiency E
Ratio of speedup to the number of processing elements (E = S/p).
A measure of the fraction of time for which a processing element is usefully employed.
Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), E = Θ(1/log n)
Cost (also called Work or processor-time product) W
Product of parallel runtime and the number of processing elements used.
Example: adding n numbers on n processing elements: W = Θ(n log n).
Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on a parallel computer has the same asymptotic growth (in Θ terms) as a function of the input size as the fastest-known sequential algorithm on a single processing element.
Problem Size W2
The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element.
W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.
Parallel vs Sequential Computing:
Amdahl’s Law
Theorem 0.1 (Amdahl’s Law) Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is S ≤ 1/(f + (1 − f)/p).
Proof. Let T be the sequential running time for the named computation. fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of (1 − f)T/p at best. The running time of the parallel program on p processors is the sum of the execution times of the sequential and parallel components, that is, at least fT + (1 − f)T/p. The maximum allowable speedup is therefore S ≤ T/(fT + (1 − f)T/p) and the result is proven.
Amdahl’s Law
Amdahl used this observation to advocate the building of ever more powerful sequential machines, as one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl’s Law is that the sequential component of a program is a constant fraction of the whole program. In many instances, as problem size increases, the fraction of computation that is inherently sequential decreases. In many cases even a speedup of 10 is quite significant by itself.
In addition Amdahl’s law is based on the concept that parallel
computing always tries to minimize parallel time. In some cases
a parallel computer is used to increase the problem size that can
be solved in a fixed amount of time. For example in weather
prediction this would increase the accuracy of say a three-day
forecast or would allow a more accurate five-day forecast.
Parallel vs Sequential Computing:
Gustafson’s Law
Theorem 0.2 (Gustafson’s Law) Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, and let the sequential segment be constant. The scaled speedup of the algorithm is then S = (fT + (1 − f)pT)/(fT + (1 − f)T) = f + (1 − f)p.
For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl’s law gives S ≤ 10.26.
Figure: running time T(f + (1 − f)p) on 1 processor versus running time T on p processors.
Amdahl’s Law assumes that problem size is fixed when it deals with scalability. Gustafson’s Law assumes that running time is fixed.
Brent’s Scheduling Principle
Suppose we have an efficient parallel algorithm with unlimited parallelism, i.e. an algorithm that runs on zillions of processors. In practice zillions of processors may not be available. Suppose we have only p processors. A question that arises is what we can do to “run” the efficient zillion-processor algorithm on our limited machine.
One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine.
Theorem 0.3 (Brent’s Principle) Let a parallel algorithm require m operations and run in parallel time t. Then running this algorithm on a limited-processor machine with only p processors would require time m/p + t.
Proof: Let $m_i$ be the number of computational operations at the i-th step, i.e. $\sum_{i=1}^{t} m_i = m$. If we assign the p processors on the i-th step to work on these $m_i$ operations, they can conclude in time $\lceil m_i/p \rceil \le m_i/p + 1$. Thus the total running time on p processors would be
$$\sum_{i=1}^{t} \lceil m_i/p \rceil \le \sum_{i=1}^{t} \left( m_i/p + 1 \right) = t + \sum_{i=1}^{t} m_i/p = t + m/p.$$
The Message Passing Interface (MPI):
The Message-Passing Interface (MPI) is an attempt to create a standard to allow tasks executing on multiple processors to communicate through some standardized communication primitives.
It defines a standard library for message passing that one can use to develop message-passing programs using C or Fortran.
The MPI standard defines both the syntax and the semantics of this functional interface to message passing.
MPI comes in a variety of flavors: freely available ones such as LAM-MPI and MPICH, and commercial versions such as Critical Software’s WMPI.
It supports message passing on a variety of platforms, from Linux-based or Windows-based PCs to supercomputers and multiprocessor systems.
After the introduction of MPI, whose functionality includes a set of 125 functions, a revision of the standard took place that added C++ support, external memory accessibility and also Remote Memory Access (similar to BSP’s put and get capability) to the standard. The resulting standard is known as MPI-2 and has grown to almost 241 functions.
The Message Passing Interface
A minimum set
A minimum set of MPI functions is described below. MPI functions use the prefix MPI_, and after the prefix the remaining keyword starts with a capital letter.
A brief explanation of the primitives can be found in the textbook (beginning page 242). A more elaborate presentation is available in the optional book.
Function Class                   Standard Function   Operation
Initialization and Termination   MPI_Init            Start of SPMD code
                                 MPI_Finalize        End of SPMD code
Abnormal Stop                    MPI_Abort           One process halts all
Process Control                  MPI_Comm_size       Number of processes
                                 MPI_Comm_rank       Identifier of calling process
                                 MPI_Wtime           Local (wall-clock) time
Message Passing                  MPI_Send            Send message to remote proc.
                                 MPI_Recv            Receive mess. from remote proc.
General MPI program
#include <mpi.h>
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);   /* no MPI functions called before this */
    /* ... program body ... */
    MPI_Finalize();           /* no MPI functions called after this */
    return 0;
}
MPI and MPI-2
Initialization and Termination
#include <mpi.h>
int MPI_Init (int *argc, char ***argv);
int MPI_Finalize(void);
Multiple processes from the same source are created by calling MPI_Init, and these processes are safely terminated by calling MPI_Finalize.
The arguments of MPI_Init are pointers to the parameters of the main function (argc and argv); on return they hold the command line arguments minus the ones that were used/processed by the MPI implementation.
Thus command line processing should only be performed in the program after the execution of this function call.
On success MPI_SUCCESS is returned; otherwise an implementation-dependent error code is returned.
Definitions are available in <mpi.h>.
MPI and MPI-2
int MPI_Abort(MPI_Comm comm, int errcode);
Note that MPI_Abort aborts an MPI program cleanly.
The first argument is a communicator and the second argument is an integer error code.
A communicator is a collection of processes that can send messages to each other. A default communicator is MPI_COMM_WORLD, which consists of all the processes running when program execution begins.
MPI and MPI-2
Communicators; Process Control
Under MPI, a communication domain is a set of processors that are allowed to communicate with each other. Information about such a domain is stored in a communicator that uniquely identifies the processors that participate in a communication operation.
A default communication domain is all the processors of a parallel execution; it is called MPI_COMM_WORLD. By using multiple communicators between possibly overlapping groups of processors we make sure that messages do not interfere with each other.
#include <mpi.h>
int MPI_Comm_size ( MPI_Comm comm, int *size);
int MPI_Comm_rank ( MPI_Comm comm, int *rank);
MPI_Comm_size ( MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank ( MPI_COMM_WORLD, &pid );
return the number of processors nprocs and the processor id pid of the calling
process.
Communicators; Process Control:
A "hello world!" program in MPI is the following one.
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
    int nprocs, mypid;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &mypid);
    printf("Hello world from process %d of total %d\n", mypid, nprocs);
    MPI_Finalize();
}
Submit job to Grid Engine
(1)- Create an MPI program, e.g. example.c.
(2)- Compile the program using mpicc -O2 example.c -o example
(3)- Open a job script named submit in an editor: vi submit
(4)- Put the following into submit:
#$ -S /bin/sh
#$ -N example
#$ -q default
#$ -pe lammpi 4
#$ -cwd
mpirun C ./example
(5)- Make the script executable: chmod a+x submit
(6)- Submit the job: qsub submit
(7)- Run qstat to check the status of the job.
Message-Passing primitives
#include <mpi.h>
/* Blocking send and receive */
int MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Status *stat);
/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag, MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);
buf - initial address of send/receive buffer
count - number of elements in send buffer (nonnegative integer) or maximum number of elements in receive buffer.
dtype - datatype of each send/receive buffer element (handle)
dest,src - rank of destination/source (integer)
Wild-card: MPI_ANY_SOURCE for recv only. No wildcard for dest.
tag - message tag (integer). Values 0...32767 are always valid; the upper bound of a given implementation is MPI_TAG_UB.
Wild-card: MPI_ANY_TAG for recv only; send must specify tag.
comm - communicator (handle)
stat - status object (Status); it returns the source and tag of the message that was
actually received. It can be the MPI constant MPI_STATUS_IGNORE if the return status is not desired.
Struct of status: fields MPI_SOURCE, MPI_TAG and MPI_ERROR.
Data type correspondence
between MPI and C
MPI_CHAR --> signed char
MPI_SHORT --> signed short int
MPI_INT --> signed int
MPI_LONG --> signed long int
MPI_UNSIGNED_CHAR --> unsigned char
MPI_UNSIGNED_SHORT --> unsigned short int
MPI_UNSIGNED --> unsigned int
MPI_UNSIGNED_LONG --> unsigned long int
MPI_FLOAT --> float
MPI_DOUBLE --> double
MPI_LONG_DOUBLE --> long double
Message-Passing primitives
The MPI_Send and MPI_Recv functions are blocking, that is they
do not return unless it is safe to modify or use the contents of
the send/receive buffer.
MPI also provides for non-blocking send and receive primitives.
These are MPI_Isend and MPI_Irecv, where the I stands for
immediate.
These functions allow a process to post that it wants to send to
or receive from another process, and then allow the process to
call a function (e.g. MPI_Wait) to complete the send-receive pair.
Non-blocking send-receives allow for the overlapping of
computation/communication. Thus MPI_Wait plays the role of
synchronizer: the send/receive calls only post the request, and
the transfer is guaranteed to be complete only when MPI_Wait returns.
Message-Passing primitives: tag
A tag is simply an integer argument that is passed to a communication function and that
can be used to uniquely identify a message. For example, in MPI if process A sends a
message to process B, then in order for B to receive the message, the tag used in A's call to
MPI_Send must be the same as the tag used in B's call to MPI_Recv. Thus, if the
characteristics of two messages sent by A to B are identical (i.e., same count and datatype),
then A and B can distinguish between the two by using different tags.
For example, suppose A is sending two floats, x and y, to B. Then the processes can be sure
that the values are received correctly, regardless of the order in which A sends and B
receives, provided different tags are used:
/* Assume system provides some buffering */
if (my_rank == A) {
tag = 0;
MPI_Send(&x, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
tag = 1;
MPI_Send(&y, 1, MPI_FLOAT, B, tag, MPI_COMM_WORLD);
} else if (my_rank == B) {
tag = 1;
MPI_Recv(&y, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status);
tag = 0;
MPI_Recv(&x, 1, MPI_FLOAT, A, tag, MPI_COMM_WORLD, &status); }
MPI Message-Passing primitives:
Now suppose one message from process A to process B is being sent by a
library, and another, with identical characteristics, is being sent by the
user's code. Unless the library developer insists that user programs
refrain from using certain tag values, distinguishing the two by tag cannot be
made to work. Clearly, partitioning the set of possible tags is at best an
inconvenience: if one wishes to modify an existing user code so that it
can use a library that partitions tags, each message passing function in
the entire user code must be checked.
The solution that was ultimately decided on was the communicator.
Formally, a communicator is a pair of objects: the first is a group or
ordered collection of processes, and the second is a context, which can
be viewed as a unique, system-defined tag. Every communication
function in MPI takes a communicator argument, and a communication
can succeed only if all the processes participating in the communication
use the same communicator argument. Thus, a library can either
require that its functions be passed a unique library-specific
communicator, or its functions can create their own unique
communicator. In either case, it is straightforward for the library
designer and the user to make certain that their messages are not
confused.
For example, suppose now that the user's code is sending a float, x, from process A to process B, while the library is sending a float, y, from
A to B:
/* Assume system provides some buffering */
void User_function(int my_rank, float* x) {
    MPI_Status status;
    if (my_rank == A) {
        /* MPI_COMM_WORLD is pre-defined in MPI */
        MPI_Send(x, 1, MPI_FLOAT, B, 0, MPI_COMM_WORLD);
    } else if (my_rank == B) {
        MPI_Recv(x, 1, MPI_FLOAT, A, 0, MPI_COMM_WORLD, &status);
    }
}
void Library_function(float* y) {
    MPI_Comm library_comm;
    MPI_Status status; int my_rank;
    /* Create a communicator with the same group */
    /* as MPI_COMM_WORLD, but a different context */
    MPI_Comm_dup(MPI_COMM_WORLD, &library_comm);
    /* Get process rank in new communicator */
    MPI_Comm_rank(library_comm, &my_rank);
    if (my_rank == A) {
        MPI_Send(y, 1, MPI_FLOAT, B, 0, library_comm);
    } else if (my_rank == B) {
        MPI_Recv(y, 1, MPI_FLOAT, A, 0, library_comm, &status);
    }
}
int main(int argc, char* argv[]) {
    ...
    if (my_rank == A) {
        User_function(A, &x);
        ...
        Library_function(&y);
    } else if (my_rank == B) {
        Library_function(&y);
        ...
        User_function(B, &x);
    }
    ...
}
MPI Message-Passing primitives:
User-defined datatypes
The second main innovation in MPI, user-defined datatypes, allows programmers to exploit
this power, and as a consequence, to create messages consisting of logically unified sets of
data rather than only physically contiguous blocks of data.
Loosely, an MPI datatype is a sequence of displacements in memory together with a
collection of basic datatypes (e.g., int, float, double, and char). Thus, an MPI-datatype
specifies the layout in memory of data to be collected into a single message or data to be
distributed from a single message.
For example, suppose we specify a sparse matrix entry with the following definition.
typedef struct {
double entry;
int row, col;
} mat_entry_t;
MPI provides functions for creating a variable that stores the layout in memory of a variable
of type mat_entry_t. One does this by first defining an MPI datatype
MPI_Datatype mat_entry_mpi_t;
to be used in communication functions, and then calling various MPI functions to initialize
mat_entry_mpi_t so that it contains the required layout. Then, if we define
mat_entry_t x;
we can send x by simply calling
MPI_Send(&x, 1, mat_entry_mpi_t, dest, tag, comm);
and we can receive x with a similar call to MPI_Recv.
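The "sequence of displacements plus basic datatypes" view can be made concrete without any MPI calls. The following is a minimal sketch in plain C that records the layout of mat_entry_t using offsetof. The names field_kind_t, field_layout_t and mat_entry_layout are hypothetical helpers introduced only for this illustration; in an MPI program the same displacements would typically be passed to MPI_Type_create_struct and the result committed with MPI_Type_commit.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double entry;
    int row, col;
} mat_entry_t;

/* Hypothetical stand-ins for the basic MPI datatypes (not MPI names). */
typedef enum { KIND_DOUBLE, KIND_INT } field_kind_t;

typedef struct {
    size_t displacement;  /* byte offset of the block inside the struct */
    field_kind_t kind;    /* basic type stored at that offset */
    int count;            /* number of consecutive elements of that type */
} field_layout_t;

/* Fills layout[0..1] with the two blocks of mat_entry_t and returns
   the number of blocks. row and col are treated as one block of two ints. */
int mat_entry_layout(field_layout_t layout[2]) {
    /* The single int block is only valid if col directly follows row. */
    assert(offsetof(mat_entry_t, col) ==
           offsetof(mat_entry_t, row) + sizeof(int));

    layout[0].displacement = offsetof(mat_entry_t, entry);
    layout[0].kind = KIND_DOUBLE;
    layout[0].count = 1;

    layout[1].displacement = offsetof(mat_entry_t, row);
    layout[1].kind = KIND_INT;
    layout[1].count = 2;
    return 2;
}
```

The displacements are computed by the compiler rather than written by hand, so any padding inserted after the double is accounted for automatically.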
MPI: An example with the
blocking operations
#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be multiple of nprocs to avoid problems.
// Parallel sum of 1 , 2 , 3, ... , N
int main(int argc,char **argv){
  int pid,nprocs,j;
  long long i, sum, start, end, total; // 64-bit: the result does not fit in an int
  MPI_Status status;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&pid );
  sum = 0; total = 0;
  start = (N/nprocs)*pid +1 ; // Each processor sums its own block
  end =(N/nprocs)*(pid+1);
  for(i=start;i<=end;i++) sum += i;
  if (pid != 0 ) {
    MPI_Send(&sum,1,MPI_LONG_LONG,0,1,MPI_COMM_WORLD);
  }
  else {
    for (j=1;j<nprocs;j++) {
      MPI_Recv(&total,1,MPI_LONG_LONG,j,1,MPI_COMM_WORLD,&status);
      sum = sum + total;
    }
  }
  if (pid == 0 ) {
    printf(" The sum from 1 to %d is %lld \n",N,sum);
  }
  MPI_Finalize();
  return 0;
}
// Note: Program neither compiled nor run!
Non-Blocking send and Receive
/* Non-Blocking send and receive */
int MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag,
MPI_Comm comm, MPI_Request *req);
int MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int src, int tag,
MPI_Comm comm, MPI_Request *req);
int MPI_Wait(MPI_Request *preq, MPI_Status *stat);
#include "mpi.h"
int MPI_Wait ( MPI_Request *request, MPI_Status *status)
Waits for an MPI send or receive to complete
Input Parameter
request (handle)
Output Parameter
status object (Status) . May be MPI_STATUS_IGNORE.
MPI: An example with the nonblocking operations
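#include <stdio.h>
#include <mpi.h>
#define N 10000000 // Choose N to be multiple of nprocs to avoid problems.
// Parallel sum of 1 , 2 , 3, ... , N
int main(int argc,char **argv){
  int pid,nprocs,j;
  long long i, sum, start, end, total; // 64-bit: the result does not fit in an int
  MPI_Status status;
  MPI_Request request;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&pid );
  sum = 0; total = 0;
  start = (N/nprocs)*pid +1 ; // Each processor sums its own block
  end = (N/nprocs)*(pid+1);
  for(i=start;i<=end;i++) sum += i;
  if (pid != 0 ) {
    // MPI_Send(&sum,1,MPI_LONG_LONG,0,1,MPI_COMM_WORLD);
    MPI_Isend(&sum,1,MPI_LONG_LONG,0,1,MPI_COMM_WORLD,&request);
    MPI_Wait(&request,&status); // sum must not be modified before this returns
  }
  else {
    for (j=1;j<nprocs;j++) {
      MPI_Irecv(&total,1,MPI_LONG_LONG,j,1,MPI_COMM_WORLD,&request);
      MPI_Wait(&request,&status); // total is valid only after this returns
      sum = sum + total;
    }
  }
  if (pid == 0 ) {
    printf(" The sum from 1 to %d is %lld \n",N,sum);
  }
  MPI_Finalize();
  return 0;
} // Note: Program neither compiled nor run!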
MPI Basic Collective Operations
One simple collective operation:
int MPI_Bcast(void * message, int count, MPI_Datatype
datatype, int root, MPI_Comm comm)
The routine MPI_Bcast sends data from one process to
all others
[Figure: MPI_Bcast among four processes. Process 0 holds the data and writes it to all processes; once the data is written, the data is present in Processes 1, 2 and 3 and they unblock.]
Simple Program that Demonstrates MPI_Bcast:
#include <mpi.h>
#include <stdio.h>
int main (int argc, char *argv[]){
  int k,id,p,size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &id);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if(id == 0)
    k = 20;
  else
    k = 10;
  for(p=0; p<size; p++){
    if(id == p)
      printf("Process %d: k= %d before\n",id,k);
  }
  //note MPI_Bcast must be put where all other processes
  //can see it.
  MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);
  for(p=0; p<size; p++){
    if(id == p)
      printf("Process %d: k= %d after\n",id,k);
  }
  MPI_Finalize();
  return 0;
}
Simple Program that Demonstrates MPI_Bcast:
The output would look like the following (the order of lines from different processes may vary between runs):
Process 0: k= 20 before
Process 0: k= 20 after
Process 3: k= 10 before
Process 3: k= 20 after
Process 2: k= 10 before
Process 2: k= 20 after
Process 1: k= 10 before
Process 1: k= 20 after
Parallel Algorithm Assumptions
Convention: In this subject we name processors arbitrarily
either 0, 1, . . . , p − 1 or 1, 2, . . . , p.
The input to a particular problem would reside in the cells of the
shared memory. We assume, in order to simplify the exposition
of our algorithms, that a cell is wide enough (in bits or bytes) to
accommodate a single instance of the input (eg. a key or a
floating point number). If the input is of size n, the first n cells
numbered 0, . . . , n − 1 store the input.
We assume that the number of processors of the PRAM is n or a
polynomial function of the size n of the input. Processor indices
are 0, 1, . . . , n − 1.
PRAM Algorithm:
Matrix Multiplication
A simple algorithm for multiplying two n × n matrices on a CREW
PRAM with time complexity T = O(lg n) and P = n^3 follows. For
convenience, processors are indexed as triples (i, j, k), where i, j, k =
1, . . . , n. In the first step processor (i, j, k) concurrently reads a_ij and
b_jk and performs the multiplication a_ij b_jk. In the following steps, for all i,
k the results (i, ∗, k) are combined, using the parallel sum algorithm, to
form c_ik = Σ_j a_ij b_jk. After lg n steps, the result c_ik is thus computed.
The same algorithm also works on the EREW PRAM with the same time
and processor complexity. Only the first step of the CREW algorithm needs to
be changed. We avoid concurrency by broadcasting element a_ij to
processors (i, j, ∗) using the broadcasting algorithm of the EREW PRAM
in O(lg n) steps. Similarly, b_jk is broadcast to processors (∗, j, k).
The above algorithm also shows how an n-processor EREW PRAM can
simulate an n-processor CREW PRAM with an O(lg n) slowdown.
Matrix Multiplication
1. a_ij to all (i,j,*) procs
   b_jk to all (*,j,k) procs
2. a_ij*b_jk at (i,j,k) proc
3. parallel sum_j a_ij*b_jk over (i,*,k) procs: O(lg n), n procs participate
4. c_ik = sum_j a_ij*b_jk
T=O(lg n), P=O(n^3), W=O(n^3 lg n), W2=O(n^3)
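The steps above can be simulated sequentially to check the data flow. This is a minimal sketch, not a PRAM implementation: the function name pram_matmul and the fixed size N are assumptions made for the illustration, indices are 0-based, and each inner loop body stands for the work of one processor (i, j, k) in one parallel step.

```c
#include <assert.h>

#define N 4  /* matrix dimension; assumed to be a power of two */

/* Sequential simulation of the n^3-processor PRAM algorithm.
   prod[i][j][k] plays the role of the cell owned by processor (i, j, k). */
void pram_matmul(double a[N][N], double b[N][N], double c[N][N]) {
    static double prod[N][N][N];
    int i, j, k, stride;

    /* Step 2: processor (i, j, k) computes a_ij * b_jk in one parallel step. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                prod[i][j][k] = a[i][j] * b[j][k];

    /* Step 3: parallel sum over j. Each value of stride is one parallel
       round, so there are lg N rounds. In a round, cell j is written and
       cell j + stride is read, and these sets are disjoint. */
    for (stride = 1; stride < N; stride *= 2)
        for (i = 0; i < N; i++)
            for (k = 0; k < N; k++)
                for (j = 0; j + stride < N; j += 2 * stride)
                    prod[i][j][k] += prod[i][j + stride][k];

    /* Step 4: c_ik is now held by processor (i, 0, k). */
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            c[i][k] = prod[i][0][k];
}
```

Counting the rounds of the stride loop gives the lg n factor in T = O(lg n).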
PRAM Algorithm:
Logical AND operation
Problem. Let X1 . . .,Xn be binary/boolean values. Find X = X1 ∧ X2 ∧ . . .
∧ Xn.
The sequential problem accepts a P = 1, T = O(n), W = O(n) direct solution.
An EREW PRAM algorithm solution for this problem works the same
way as the PARALLEL SUM algorithm and its performance is P = O(n),
T = O(lg n),W = O(n lg n) along with the improvements in P and W
mentioned for the PARALLEL SUM algorithm.
In the remainder we will investigate a CRCW PRAM algorithm. Let
binary value Xi reside in the shared memory location i. We can find X =
X1 ∧ X2 ∧ . . . ∧ Xn in constant time on a CRCW PRAM. Processor 1 first
writes a 1 in shared memory cell 0. If Xi = 0, processor i writes a 0 in
memory cell 0. The result X is then stored in this memory cell.
The result stored in cell 0 is 1 (TRUE) unless a processor writes a 0 in
cell 0; then one of the Xi is 0 (FALSE) and the result X should be FALSE,
as it is.
Logical AND operation
begin Logical AND (X1 . . .Xn)
1. Proc 1 writes 1 in cell 0.
2. if Xi = 0 processor i writes 0 into cell 0.
end Logical AND
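The two steps of Logical AND can be checked with a sequential simulation. This is a minimal sketch; the function name crcw_and is an assumption for the illustration, inputs are 0-indexed, and the loop stands for what all processors do simultaneously in step 2.

```c
#include <assert.h>

/* Sequential simulation of the constant-time CRCW AND.
   x[0..n-1] hold the boolean inputs; the return value plays the
   role of shared memory cell 0. */
int crcw_and(const int x[], int n) {
    int cell0;
    int i;
    assert(n >= 1);
    cell0 = 1;                 /* step 1: one processor writes 1 into cell 0 */
    for (i = 0; i < n; i++)    /* step 2: conceptually all processors at once */
        if (x[i] == 0)
            cell0 = 0;         /* every concurrent writer writes the same value 0 */
    return cell0;
}
```

Because every concurrent writer stores the same value, the result does not depend on which write wins.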
Exercise Give an O(1) CRCW algorithm for LOGICAL OR.
Parallel Operations with Multiple
Outputs – Parallel Prefix
Problem definition: Given a set of n values x0, x1, . . . , xn−1 and an
associative operator, say +, the parallel prefix problem is to compute
the following n results/“sums”.
0: x0,
1: x0 + x1,
2: x0 + x1 + x2,
n − 1: x0 + x1 + . . . + xn−1.
Parallel prefix is also called prefix sums or scan. It has many uses in
parallel computing such as in load-balancing the work assigned to
processors and compacting data structures such as arrays.
We shall prove that computing ALL THE SUMS is no more difficult
than computing the single sum x0 + . . . + xn−1.
Parallel Prefix Algorithm 1: divide-and-conquer
[Figure: a parallel prefix "box" for the 8 inputs x0 ... x7 is built from 2 parallel prefix boxes for 4 inputs each. The rightmost output of Box 1 is taken and combined with the outputs of Box 2.]
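The box construction can be written as a short recursive routine. This is a minimal sequential sketch (the name prefix_dc is an assumption): the two recursive calls are independent and would run in parallel, and the final loop is a single parallel step if each output of Box 2 has its own processor and the rightmost output of Box 1 may be read concurrently.

```c
#include <assert.h>

/* In-place inclusive prefix sums of x[lo..hi-1] by divide and conquer. */
void prefix_dc(long x[], int lo, int hi) {
    int mid, j;
    long carry;
    assert(lo <= hi);
    if (hi - lo <= 1)
        return;                 /* a box with one input outputs it unchanged */
    mid = lo + (hi - lo) / 2;
    prefix_dc(x, lo, mid);      /* Box 1 */
    prefix_dc(x, mid, hi);      /* Box 2, independent of Box 1 */
    carry = x[mid - 1];         /* rightmost output of Box 1 */
    for (j = mid; j < hi; j++)  /* combine it with every output of Box 2 */
        x[j] = carry + x[j];
}
```

Under those assumptions the parallel time satisfies T(n) = T(n/2) + O(1), which gives T(n) = O(lg n).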
Parallel Prefix Algorithm 2:
An algorithm for parallel prefix on an EREW PRAM
would require lg n phases. In phase i, processor j
reads the contents of cells j and j − 2^i (if it exists),
combines them and stores the result in cell j.
The EREW PRAM algorithm that solves the parallel
prefix problem has performance P = O(n), T = O(lg n),
W = O(n lg n), and W2 = O(n).
Parallel Prefix Algorithm 2:
When we write x1 + . . . + x5 we mean
x1 + x2 + x3 + x4 + x5.
Cell:    1    2       3           4           5           6           7           8
Step 0:  x1   x2      x3          x4          x5          x6          x7          x8
Step 1:  x1   x1+x2   x2+x3       x3+x4       x4+x5       x5+x6       x6+x7       x7+x8
Step 2:  x1   x1+x2   x1+...+x3   x1+...+x4   x2+...+x5   x3+...+x6   x4+...+x7   x5+...+x8
Step 3:  x1   x1+x2   x1+...+x3   x1+...+x4   x1+...+x5   x1+...+x6   x1+...+x7   x1+...+x8
Parallel Prefix Algorithm 2:
Example 2
For visualization purposes, each step is written in two lines: first the pair of values being combined, then the result.
When we write [1 : 5] we mean x1 + x2 + x3 + x4 + x5.
We write below [1:2] to denote x1+x2
[i:j] to denote xi + ... + xj
[i:i] is xi NOT xi+xi!
[1:2][3:4]=[1:2]+[3:4]= (x1+x2) + (x3+x4) = x1+x2+x3+x4
A * indicates value above remains the same in subsequent steps
Cell: 1      2           3           4           5           6           7           8
0     x1     x2          x3          x4          x5          x6          x7          x8
0     [1:1]  [2:2]       [3:3]       [4:4]       [5:5]       [6:6]       [7:7]       [8:8]
1     *      [1:1][2:2]  [2:2][3:3]  [3:3][4:4]  [4:4][5:5]  [5:5][6:6]  [6:6][7:7]  [7:7][8:8]
1.    [1:1]  [1:2]       [2:3]       [3:4]       [4:5]       [5:6]       [6:7]       [7:8]
2     *      *           [1:1][2:3]  [1:2][3:4]  [2:3][4:5]  [3:4][5:6]  [4:5][6:7]  [5:6][7:8]
2.    [1:1]  [1:2]       [1:3]       [1:4]       [2:5]       [3:6]       [4:7]       [5:8]
3     *      *           *           *           [1:1][2:5]  [1:2][3:6]  [1:3][4:7]  [1:4][5:8]
3.    [1:1]  [1:2]       [1:3]       [1:4]       [1:5]       [1:6]       [1:7]       [1:8]
      x1     x1+x2       x1+...+x3   x1+...+x4   x1+...+x5   x1+...+x6   x1+...+x7   x1+...+x8
Parallel Prefix Algorithm 2:
// We write below [1:2] to denote X[1]+X[2]
//   [i:j] to denote X[i]+X[i+1]+...+X[j]
//   [i:i] is X[i] NOT X[i]+X[i]
//   [1:2][3:4]=[1:2]+[3:4]= (X[1]+X[2])+(X[3]+X[4])=X[1]+X[2]+X[3]+X[4]
// Input : M[j]= X[j]=[j:j] for j=1,...,n.
// Output: M[j]= X[1]+...+X[j] = [1:j] for j=1,...,n.
1. i=1;
   // At this step M[j]= [j:j]=[j+1-2**(i-1):j]
2. while (2**(i-1) < n ) {
3.   j=pid();
4.   if (j-2**(i-1) >0 ) {
       // Before this step M[j] = [j+1-2**(i-1):j]
       // Before this step M[j-2**(i-1)]= [j-2**(i-1)+1-2**(i-1):j-2**(i-1)]
5.     b=M[j-2**(i-1)];
6.     a=M[j];
7.     M[j]=b+a;
       // After this step M[j]= M[j-2**(i-1)]+M[j]
       //   = [j-2**(i-1)+1-2**(i-1):j-2**(i-1)] [j+1-2**(i-1):j]
       //   = [j-2**(i-1)+1-2**(i-1):j] = [j+1-2**i:j]
8.   }
9.   i=i+1;
10. }
At step 5, memory location j − 2^(i−1) is read provided that j − 2^(i−1) ≥ 1. This is true for all times i ≤ tj =
⌊lg(j − 1)⌋ + 1. For i > tj the test of line 4 fails and lines 5-8 are not executed.
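The phases of Algorithm 2 can be simulated sequentially as follows. This is a minimal sketch under stated assumptions: the name prefix_doubling and the capacity MAX_N are hypothetical, cells are 0-indexed instead of 1-indexed, and a snapshot of the array models the fact that on a PRAM all reads of a phase happen before its writes.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 64  /* hypothetical capacity limit of this sketch */

/* Inclusive prefix sums of m[0..n-1] using the doubling scheme of
   Algorithm 2. The variable offset equals 2^(i-1) in phase i. */
void prefix_doubling(long m[], int n) {
    long old[MAX_N];
    int offset, j;
    assert(n >= 1 && n <= MAX_N);
    for (offset = 1; offset < n; offset *= 2) {
        /* Snapshot: every processor reads the values from before the phase. */
        memcpy(old, m, (size_t)n * sizeof(long));
        for (j = offset; j < n; j++)          /* work of "processor j" */
            m[j] = old[j - offset] + old[j];  /* left operand first */
    }
}
```

Without the snapshot, a sequential loop would read cells already updated in the same phase and produce wrong sums.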
Parallel Prefix Algorithm based
on Complete Binary Tree
Consider the following variation of parallel prefix on n
inputs that works on a complete binary tree with n
leaves (assume n is a power of two).
Action by nodes
[1] Non-leaf : If it receives l and r from its left and right children,
it computes l + r and sends it up, and it sends l down to its right
child.
[2] Root : As step [1], except that nothing is sent up.
[3] Non-leaf : If it gets p from its parent, it transmits it to its
left/right children.
[4] Leaf : If it holds l and receives p from its parent, it sets l = p
+ l (in this order) [note p is the left argument, l is the right
one; order matters]
Parallel Prefix Algorithm based on
Complete Binary Tree: Example
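[Figure: example run on a complete binary tree, showing the partial sums sent between nodes (e.g. x1+x2, x1+x2+x3+x4) and the values held after receiving, up to x1+...+x8.]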
PRAM Algorithms:
Maximum finding
Problem. Let X1, . . ., XN be N keys. Find X = max{X1, X2, . . ., XN}.
The sequential problem accepts a P = 1, T = O(N), W = O(N) direct solution.
An EREW PRAM algorithm solution for this problem works the same way
as the PARALLEL SUM algorithm and its performance is P = O(N), T =
O(lg N), W = O(N lg N), W2 = O(N) along with the improvements in P and
W mentioned for the PARALLEL SUM algorithm.
In the remainder we will investigate a CRCW PRAM algorithm. Let key
Xi reside in the local memory of processor i.
The CRCW PRAM algorithm MAX1 to be presented has performance T =
O(1), P = O(N^2), and work W2 = W = O(N^2).
The second algorithm to be presented in the following pages utilizes what
is called a doubly-logarithmic depth tree and achieves T = O(lg lg N), P =
O(N) and W = W2 = O(N lg lg N).
The third algorithm is a combination of the EREW PRAM algorithm and
the CRCW doubly-logarithmic depth tree-based algorithm and requires T
= O(lg lg N), P = O(N) and W2 = O(N).
PRAM Algorithm Maximum Finding
begin Max1 (X1 . . .XN)
1. in proc (i, j) if Xi ≥ Xj then xij = 1;
2. else xij = 0;
3. Yi = xi1 ∧ . . . ∧ xiN ;
4. Processor i reads Yi ;
5. if Yi = 1 processor i writes i into cell 0.
end Max1
In the algorithm, we rename processors so that pair
(i, j) could refer to processor j × N + i. Variable Yi is
equal to 1 if and only if Xi is the maximum.
The CRCW PRAM algorithm MAX1 has performance T
= O(1), P = O(N^2), and work W2 = W = O(N^2).
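Max1 can be checked with a sequential simulation. This is a minimal sketch with hypothetical names (crcw_max_index, MAX_KEYS) and 0-based indices. The double loop stands for the N^2 processors (i, j) of steps 1-3, and the last loop for the concurrent writes of step 5.

```c
#include <assert.h>

#define MAX_KEYS 32  /* hypothetical capacity limit of this sketch */

/* Sequential simulation of MAX1. Returns the index written into cell 0. */
int crcw_max_index(const int x[], int n) {
    int y[MAX_KEYS];
    int i, j, cell0 = -1;
    assert(n >= 1 && n <= MAX_KEYS);
    for (i = 0; i < n; i++) {
        y[i] = 1;
        for (j = 0; j < n; j++)   /* processor (i, j): x_ij = (X_i >= X_j) */
            if (!(x[i] >= x[j]))
                y[i] = 0;         /* Y_i is the AND of x_i1 ... x_iN */
    }
    for (i = 0; i < n; i++)
        if (y[i] == 1)
            cell0 = i;            /* with equal maxima some writer wins; here the last */
    return cell0;
}
```

When several keys are equal to the maximum, several processors write different indices into cell 0, so the simulation assumes a CRCW variant in which an arbitrary writer succeeds.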
Traditional PRAM Algorithm vs. Architecture
Independent Parallel Algorithm Design
Under the PRAM model, synchronization is ignored and thus
treated as free, as PRAM processors work synchronously. The model also
ignores communication, as in the PRAM the cost of accessing the
shared memory is as small as the cost of accessing local registers
of the PRAM.
In practice, however, the exchange of data can significantly impact the
efficiency of parallel programs by introducing interaction delays
during their execution.
It takes roughly ts + m tw time for a simple exchange of an m-word
message between two processes running on different nodes of an
interconnection network with cut-through routing.
ts: latency or the startup time for the data transfer
tw: per-word transfer time, which is inversely proportional to the
available bandwidth between the nodes.
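As a worked illustration of the ts + m tw model, take the hypothetical values ts = 50 µs and tw = 0.01 µs per word (assumed numbers chosen for the arithmetic, not measurements of any machine):

```latex
\begin{align*}
T(m)    &= t_s + m\,t_w \\
T(1)    &= 50 + 1 \cdot 0.01 = 50.01\ \mu\text{s} \\
T(10^6) &= 50 + 10^6 \cdot 0.01 = 10\,050\ \mu\text{s}
\end{align*}
```

For the one-word message the startup term ts accounts for almost all of the time, while for the million-word message the bandwidth term m tw dominates.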
Basic Communication Operations – One-to-all broadcast and all-to-one reduction
Assume that p processes participate in the operation
and the data to be broadcast or reduced contains m
words.
A one-to-all broadcast or all-to-one reduction
procedure involves log p point-to-point simple
message transfers, each at a time cost of ts + m tw.
Therefore, the total time taken by the procedure is
T = (ts + m tw) log p
This is true for all interconnection networks.
All-to-all Broadcast and Reduction
Linear Array and Ring:
p different messages circulate in the p-node ensemble.
If communication is performed circularly in a single direction, then each node receives all (p-1)
pieces of information from all other nodes in (p-1) steps.
So the total time is: T = (ts + m tw)(p-1)
2-D Mesh:
Based on the linear array algorithm, treating each row and column of the mesh as a linear array.
Two phases:
Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear
array. In this phase, all nodes collect the √p messages corresponding to the √p nodes of their respective rows. Each
node consolidates this information into a single message of size m√p. The time for this phase is:
T1 = (ts + m tw)(√p - 1)
Phase two: columnwise all-to-all broadcast of the consolidated messages. By the end of this phase,
each node obtains all p pieces of m-word data that originally resided on different nodes. The time for this
phase is
T2 = (ts + m√p tw)(√p - 1)
The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the
times spent in the individual phases:
T = T1 + T2 = 2 ts (√p - 1) + tw m (p - 1)
Hypercube:
The operation takes log p steps, and the message size doubles in each step:
T = \sum_{i=1}^{\log p} (t_s + 2^{i-1} t_w m) = t_s \log p + t_w m (p - 1)
Traditional PRAM Algorithm vs. Architecture
Independent Parallel Algorithm Design
As an example of how traditional PRAM algorithm design differs from
architecture independent parallel algorithm design, an example
algorithm for broadcasting in a parallel machine is introduced.
Problem: In a parallel machine with p processors numbered 0, . . . ,
p − 1, one of them, say processor 0, holds a one-word message. The
problem of broadcasting involves the dissemination of this message
to the local memory of the remaining p − 1 processors.
The performance of a well-known exclusive PRAM algorithm for
broadcasting is analyzed below in two ways under the assumption
that no concurrent operations are allowed. One follows the
traditional (PRAM) analysis that minimizes parallel running time. The
other takes into consideration the issues of communication and
synchronization. This leads to a modification of the PRAM-based
algorithm to derive an architecture independent algorithm for
broadcasting whose performance is consistent with observations of
broadcasting operations on real parallel machines.
Broadcasting: MPI Algorithm 1
Algorithm. Without loss of generality let us assume that p is a
power of two. The message is broadcast in lg p rounds of
communication by binary replication. In round i = 1, . . . , lg p,
each processor j with index j < 2^(i−1) sends the message it
currently holds to processor j + 2^(i−1) (on a shared memory
system, this may mean copying information into a cell read by
this processor). The number of processors with the message at
the end of round i is thus 2^i.
Analysis of Algorithm. Under the PRAM model the algorithm
requires lg p communication rounds and as many parallel steps
to complete. This cost, however, ignores synchronization, which is
free, as PRAM processors work synchronously. It also ignores
communication, as in the PRAM the cost of accessing the shared
memory is as small as the cost of accessing local registers of the
PRAM.
Broadcasting: MPI Algorithm 1
Under the MPI cost model each communication round is assigned a
cost of max {ts, tw · 1} as each processor in each round sends or
receives at most one message containing the one-word message. The
cost of the algorithm is lg p · max {ts, tw · 1}, as there are lg p rounds
of communication.
As the information communicated by any processor is small in size, it
is likely that latency issues prevail in the transmission time (i.e. the
bandwidth based cost tw · 1 is insignificant compared to the
latency/synchronization reflecting term ts).
In high latency machines the dominant term would be ts lg p rather
than tw lg p. Even though each communication round would last for at
least ts time units, only a small fraction tw of it is used for actual
communication. The remainder is wasted.
It then makes sense to increase communication round utilization so
that each processor sends the one-word message to as many
processors as it can accommodate within a round.
The total time is: lg p *(ts+tw)
Broadcasting: MPI Algorithm 2
Input: p processors numbered 0 . . .p − 1. Processor 0
holds a message of length equal to one word.
Output: The problem of broadcasting involves the
dissemination of this message to the remaining p − 1 processors.
Algorithm 2. In one superstep, processor 0 sends the
message to be broadcast to processors 1, . . . , p − 1 in
turn (a “sequential”-looking algorithm).
Analysis of Algorithm 2.
The communication time of Algorithm 2 is 1 · max{ts, (p
− 1) · tw} (in a single superstep, the message is
replicated p − 1 times by processor 0).
The total time is ts+(p-1)tw
Broadcasting: MPI Algorithm 3
Architecture independent Algorithm 3
Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an Algorithm 3.
The main observation is that up to L/g words (here, ts/tw words) can be sent in a superstep at a cost of ts. Then,
it makes sense for each processor to send L/g messages to other processors. Let k − 1 be
the number of messages a processor sends to other processors in a broadcasting step. The
number of processors with the message at the end of a broadcasting superstep would be k
times larger than that at the start. We call k the degree of replication of the broadcast.
In each round, every processor sends the message to k−1 other processors. In round i = 0,
1, . . ., each processor j with index j < k^i sends the message to k − 1 distinct processors
numbered j + k^i · l, where l = 1, . . . , k−1. At the end of round i (the (i+1)-st overall round),
the message is broadcast to k^i · (k−1) + k^i = k^(i+1) processors. The number of rounds required is
the minimum integer r such that k^r ≥ p. The number of rounds necessary for full
dissemination is thus decreased to log_k p, and the total cost becomes log_k p · max {ts, (k − 1)tw}.
At the end of each superstep the number of processors possessing the message is k
times more than that of the previous superstep. During each superstep each
processor sends the message to exactly k−1 other processors.
Algorithm 3 consists of a number of rounds between 1 (and it becomes Algorithm 2)
and lg p (and it becomes Algorithm 1).
The total time is: log_k p · (ts + (k-1)tw)
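The round count and total time of Algorithm 3 can be evaluated with a small routine. This is a minimal sketch with hypothetical function names (broadcast_rounds, broadcast_cost); it counts rounds by tracking how many processors hold the message, and it charges ts + (k-1) tw per round as in the total-time line above.

```c
#include <assert.h>

/* Number of rounds Algorithm 3 needs to reach p processors with
   degree of replication k (k >= 2). */
int broadcast_rounds(int p, int k) {
    int holders = 1;   /* processors 0..holders-1 hold the message */
    int rounds = 0;
    assert(p >= 1 && k >= 2);
    while (holders < p) {
        /* Each holder j sends to j + holders*l for l = 1..k-1, so the
           number of holders grows by a factor k. Targets >= p do not
           exist, so the count is capped at p (this also avoids overflow). */
        holders = (holders > p / k) ? p : holders * k;
        rounds++;
    }
    return rounds;
}

/* Total time under the model of the text: rounds * (ts + (k-1)*tw). */
double broadcast_cost(int p, int k, double ts, double tw) {
    return broadcast_rounds(p, k) * (ts + (k - 1) * tw);
}
```

For example, with p = 8 the call broadcast_rounds(8, 2) gives the 3 rounds of Algorithm 1, and broadcast_rounds(8, 8) gives the single round of Algorithm 2.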
Broadcasting: MPI Algorithm 3
Broadcast (0, p, k)
  my_pid = pid(); mask_pid = 1;
  while (mask_pid < p) {
    if (my_pid < mask_pid) {
      for (i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
        target_pid = my_pid + j;
        if (target_pid < p)
          mpi_put(target_pid, &M, &M, 0, sizeof(M));
          (or mpi_send…)
      }
    } else if ((my_pid >= mask_pid) && (my_pid < k * mask_pid)) {
      mpi_get() or mpi_Recv…
    }
    mask_pid = mask_pid * k;
  }
Broadcasting n > p words: Algorithm 4
Now suppose that the message to be broadcast consists of not
a single word but is of size n > p. Algorithm 4 may be a better
choice than the previous algorithms, as in those one of the processors
sends or receives substantially more than n words of
information (n tw >> ts).
There is a broadcasting algorithm, call it Algorithm 4, that
requires only two communication rounds and is optimal (for the
communication model abstracted by ts and tw) in terms of the
amount of information (up to a constant) each processor sends
or receives.
Algorithm 4. Two-phase broadcasting
The idea is to split the message into p pieces, have processor 0
send piece i to processor i in the first round, and in the second
round have processor i replicate the i-th piece p − 1 times by sending
a copy to each of the remaining p − 1 processors (see attached
figure).
The total time is: p times one-to-one + one all-to-all broadcast
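The two rounds can be costed explicitly. The following is a sketch under one assumption that is not stated in the slides: a round is charged ts plus tw times the largest number of words any single processor sends or receives in that round, in the style of the total-time lines of Algorithms 1 to 3.

```latex
\begin{align*}
T_{\text{round 1}} &= t_s + (p-1)\,\frac{n}{p}\,t_w
  && \text{processor 0 sends } p-1 \text{ pieces of } n/p \text{ words} \\
T_{\text{round 2}} &= t_s + (p-1)\,\frac{n}{p}\,t_w
  && \text{each processor sends and receives } p-1 \text{ pieces} \\
T_4 &= 2\,t_s + 2\,\frac{p-1}{p}\,n\,t_w \;<\; 2\,t_s + 2\,n\,t_w \\
T_2 &= t_s + (p-1)\,n\,t_w
  && \text{Algorithm 2 with an } n\text{-word message}
\end{align*}
```

Under this assumption no processor handles more than about 2n words in Algorithm 4, whereas processor 0 sends (p − 1)n words in Algorithm 2.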
Matrix Computations
SPMD program design stipulates that processors execute a single program on different pieces of
data. For matrix related computations it makes sense to distribute a matrix evenly among the p
processors of a parallel computer. Such a distribution should also take into consideration the
storage of the matrix by, say, the compiler, so that locality issues are also addressed
(filling cache lines efficiently to speed up computation). There are various ways to divide a
matrix. Some of the most common ones are described below.
One way to distribute a matrix is by using block distributions. Split an array into blocks of size
n/p1 × n/p2 so that p = p1 × p2 and assign the i-th block to processor i. This distribution is
suitable for matrices as long as the amount of work for different elements of the matrix is the
same.
The most common block distributions are:
• column-wise (block) distribution. Split matrix into p column stripes so that n/p
consecutive columns form the i-th stripe that will be stored in processor i. This is p1 = 1 and
p2 = p.
• row-wise (block) distribution. Split matrix into p row stripes so that n/p consecutive rows
form the i-th stripe that will be stored in processor i. This is p1 = p and p2 = 1.
• block or square distribution. This is the case p1 = p2 = √p, i.e. the blocks are of size
n/√p × n/√p and block i is stored in processor i.
There are certain cases (e.g. LU decomposition, Cholesky factorization), where the amount of
work differs for different elements of a matrix. For these cases block distributions are not
suitable.
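The three block distributions differ only in the choice of p1 and p2, which a single owner function makes explicit. This is a minimal sketch: the name block_owner is hypothetical, indices are 0-based, and it assumes that blocks are numbered row by row and that p1 and p2 divide n (the slides do not fix a numbering).

```c
#include <assert.h>

/* Owner of element (i, j) of an n x n matrix under a p1 x p2 block
   distribution with blocks of size (n/p1) x (n/p2).
   Assumption: blocks are numbered row by row, starting from 0. */
int block_owner(int i, int j, int n, int p1, int p2) {
    int block_row, block_col;
    assert(p1 >= 1 && p2 >= 1 && n % p1 == 0 && n % p2 == 0);
    assert(i >= 0 && i < n && j >= 0 && j < n);
    block_row = i / (n / p1);   /* which stripe of n/p1 rows */
    block_col = j / (n / p2);   /* which stripe of n/p2 columns */
    return block_row * p2 + block_col;
}
```

With n = 8 and p = 4, the row-wise distribution is block_owner(i, j, 8, 4, 1), the column-wise one is block_owner(i, j, 8, 1, 4), and the square one is block_owner(i, j, 8, 2, 2).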
Matrix block distributions
Matrix-Vector Multiplication
Sequential Alg: the running time is O(n^2).
n^2 multiplications and additions
for i=0 to n-1 do
  y[i] = 0
  for j=0 to n-1 do
    y[i] = y[i] + A[i,j] * x[j]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Assume p=n (p – no. of processors).
Step 1: Initial partition of matrix and vector:
Matrix distribution: Each process gets one complete row of the matrix.
Vector distribution: The n*1 vector is distributed such that each process
owns one of its elements.
Step 2: All-to-all broadcast
Every process has one element of the vector, but every process needs the
entire vector.
Step 3: computation
Process Pi computes
y[i] = \sum_{j=0}^{n-1} A[i,j] \, x[j]
Running time:
All-to-all broadcast: θ(n) on any architecture
Multiplication of a single row of A with vector x is θ(n)
Total running time is θ(n).
Total work is θ(n^2) – cost-optimal
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Assume p<n (p – no. of processors).
Three Steps:
Initial partition of matrix and vector:
Each process initially stores n/p complete rows of the matrix and a portion of the vector of
size n/p
All-to-all broadcast:
Among p processes and involving messages of size n/p
Computation:
Each process multiplies n/p rows of the matrix with the vector x to produce n/p elements
of the result vector.
Running Time:
All-to-all broadcast:
T=(ts+ n/p tw)(p-1) on any architecture
T=ts log p + n/p tw(p-1) on hypercube
Computation: T=n* n/p =θ(n^2/p)
Total running time T= θ(n^2/p+ts log p + n tw)
Total work: W=θ(n^2+ts p log p + n p tw) – cost-optimal for p = O(n)
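The computation step of the p < n case can be sketched as the work done by one process. This is a sequential illustration without MPI calls: the function name matvec_rowwise_local is hypothetical, the matrix is stored row-major in a flat array, p is assumed to divide n, and x is the full vector, i.e. the state after the all-to-all broadcast.

```c
#include <assert.h>

/* Rowwise 1-D block partitioning, work of a single process.
   Process `rank` (0 <= rank < p) owns rows rank*(n/p) .. (rank+1)*(n/p)-1
   of the n x n matrix a and computes the matching n/p entries of y = A x. */
void matvec_rowwise_local(const double *a, const double *x, double *y,
                          int n, int p, int rank) {
    int rows, i, j;
    assert(p >= 1 && n % p == 0 && rank >= 0 && rank < p);
    rows = n / p;
    for (i = rank * rows; i < (rank + 1) * rows; i++) {
        double sum = 0.0;
        for (j = 0; j < n; j++)          /* n multiply-adds per row */
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}
```

Each call performs n * n/p multiply-add steps, matching the computation term θ(n^2/p) above; calling it for every rank from 0 to p-1 fills the whole result vector.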
Matrix-Vector Multiplication:
Columnwise 1-D Partitioning
Similar to rowwise 1-D Partitioning
Matrix-Vector Multiplication:
2-D Partitioning
Assume p=n^2
Step 1: Initial partitioning
Each process gets one element of the matrix.
The vector is distributed only among the processes on the diagonal, each of which owns one element.
Step 2: broadcast
The ith element of the vector should be available to the ith element of each row of the matrix. So
this step consists of n simultaneous one-to-all broadcast operations, one in each column of
processes.
Step 3: computation
Each process multiplies its matrix element with the corresponding element of x.
Step 4: All-to-one reduction of partial results.
The products computed for each row must be added, leaving the sums in the last column
of processes.
Running time:
One-to-all broadcast: θ(log n)
Computation in each process: θ(1)
All-to-one reduction: θ(log n)
Total running time: θ(log n)
Total work: θ(n^2 log n) – not cost-optimal
Matrix-Vector Multiplication:
2-D Partitioning
Assume p<n^2, with the processes arranged as a √p × √p logical mesh.
Step 1: Initial partitioning
Each process gets an (n/√p) × (n/√p) block of the matrix.
The vector is distributed only among the processes on the diagonal, each of which owns n/√p elements.
Step 2: columnwise one-to-all broadcast
The ith group of vector elements should be available to the ith process of each row of the mesh. So this
step consists of √p simultaneous one-to-all broadcast operations, one in each column of processes.
Step 3: computation
Each process multiplies its (n/√p) × (n/√p) block of the matrix with the corresponding n/√p elements of x.
Step 4: All-to-one reduction of partial results.
The products computed for each row must be added, leaving the sums in the last column of processes.
Running time:
Columnwise one-to-all broadcast: T= (ts + (n/√p) tw) log √p
Computation in each process: T= (n/√p) * (n/√p) = n^2/p
All-to-one reduction: T= (ts + (n/√p) tw) log √p
Total running time: T= n^2/p + 2 (ts + (n/√p) tw) log √p = n^2/p + (ts + (n/√p) tw) log p
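The four steps of 2-D partitioning can be sketched as a sequential simulation. The sketch assumes a square logical mesh of √p × √p processes, so that each process holds an (n/√p) × (n/√p) block; the name `matvec_2d` and the dictionary keyed by mesh coordinates are illustrative choices, not part of the original material.

```python
import math


def matvec_2d(A, x, p):
    """Simulate y = A x with block 2-D partitioning on a sqrt(p) x sqrt(p) mesh.

    Hypothetical helper; assumes p is a perfect square and sqrt(p) divides n.
    """
    n = len(A)
    q = math.isqrt(p)
    assert q * q == p and n % q == 0, "sketch assumes a square mesh dividing n"
    b = n // q  # block side length, n / sqrt(p)

    # Step 1: diagonal process (c, c) owns the c-th group of b vector elements.
    diag_x = {c: x[c * b:(c + 1) * b] for c in range(q)}

    # Step 2: one-to-all broadcast in each column: (c, c) sends to every (r, c).
    x_at = {(r, c): diag_x[c] for r in range(q) for c in range(q)}

    # Step 3: every process multiplies its block by its piece of x.
    partial = {}
    for r in range(q):
        for c in range(q):
            block = [row[c * b:(c + 1) * b] for row in A[r * b:(r + 1) * b]]
            partial[(r, c)] = [sum(a * v for a, v in zip(brow, x_at[(r, c)]))
                               for brow in block]

    # Step 4: all-to-one reduction along each row of the mesh; the sums
    # would end up in the last column of processes.
    y = []
    for r in range(q):
        acc = [0] * b
        for c in range(q):
            acc = [s + t for s, t in zip(acc, partial[(r, c)])]
        y.extend(acc)
    return y


M = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
v = [1, 0, -1, 2]
print(matvec_2d(M, v, p=4))  # [6, 14, 22, 30]
```

With p = n^2 each block is a single element, which reproduces the one-element-per-process case.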
Matrix-Vector Multiplication:
1-D Partitioning vs. 2-D Partitioning
Matrix-vector multiplication is faster with block 2-D partitioning of the matrix than with block 1-D
partitioning for the same number of processes.
If the number of processes is greater than n, then
the 1-D partitioning cannot be used.
If the number of processes is less than or equal to
n, 2-D partitioning is preferable.
Matrix Distributions : Block cyclic
In block cyclic distributions the rows (similarly for columns) are split into
q groups of n/q consecutive rows per group, where potentially q > p, and
the i-th group is assigned to a processor in a cyclic fashion.
• column-cyclic distribution. This is a one-dimensional cyclic distribution.
Split matrix into q column stripes so that n/q consecutive columns form the i-th
stripe that will be stored in processor i %p. The symbol % is the mod
(remainder of the division) operator. Usually q > p. Sometimes the term
wrapped-around column distribution is used for the case where n/q = 1, i.e. q
= n.
• row-cyclic distribution. This is a one-dimensional cyclic distribution. Split
matrix into q row stripes so that n/q consecutive rows form the i-th stripe that
will be stored in processor i %p. The symbol % is the mod (remainder of the
division) operator. Usually q > p. Sometimes the term wrapped-around row
distribution is used for the case where n/q = 1, i.e. q = n.
• scattered distribution. Let p = qi · qj processors be divided into qj groups,
each group Pj consisting of qi processors. In particular, Pj = {j·qi + l | 0 ≤ l ≤ qi − 1}.
Processor j·qi + l is called the l-th processor of group Pj. This way matrix
element (i, j), 0 ≤ i, j < n, is assigned to the (i mod qi)-th processor of group
P(j mod qj). A scattered distribution refers to the special case qi = qj = √p.
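The owner of a row, column, or matrix element under these distributions follows directly from the definitions. The sketch below uses hypothetical helper names (`cyclic_owner`, `scattered_owner`) and assumes q divides n.

```python
def cyclic_owner(index, n, q, p):
    """Processor owning row (or column) `index` when the n rows (columns) are
    cut into q stripes of n/q consecutive lines, stripe i going to i % p."""
    assert n % q == 0, "sketch assumes q divides n"
    stripe = index // (n // q)
    return stripe % p


def scattered_owner(i, j, qi, qj):
    """Processor owning element (i, j): the (i mod qi)-th processor of
    group P(j mod qj), i.e. processor (j mod qj) * qi + (i mod qi)."""
    return (j % qj) * qi + (i % qi)


# 8 columns, q = 4 stripes of 2 columns, p = 2 processors.
print([cyclic_owner(c, n=8, q=4, p=2) for c in range(8)])  # [0, 0, 1, 1, 0, 0, 1, 1]

# Wrapped-around case: q = n, so single columns are dealt out round-robin.
print([cyclic_owner(c, n=8, q=8, p=2) for c in range(8)])  # [0, 1, 0, 1, 0, 1, 0, 1]

# Scattered distribution on p = 4 processors (qi = qj = 2).
print(scattered_owner(2, 3, qi=2, qj=2))  # 2
```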
Block cyclic distributions
Scattered Distribution
Matrix Multiplication – Serial algorithm
Matrix Multiplication
The algorithm for matrix multiplication described below was presented in the
seminal work of Valiant. It works for p ≤ n^2.
Three steps:
Initial partitioning: Matrices A and B are partitioned into p blocks Ai,j and Bi,j
(0 ≤ i,j < √p) of size n/√p × n/√p each. These blocks are mapped onto a √p ×
√p logical mesh of processes. The processes are labeled from P0,0 to P√p-1,√p-1.
All-to-all broadcasting: Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix. Computing submatrix Ci,j requires all submatrices Ai,k
and Bk,j for 0 ≤ k < √p. To acquire all the required blocks, an all-to-all broadcast of
matrix A's blocks is performed in each row of processes, and an all-to-all broadcast
of matrix B's blocks is performed in each column.
Computation: After Pi,j acquires Ai,0, Ai,1, …, Ai,√p-1 and B0,j, B1,j, …, B√p-1,j,
it performs the submatrix multiplication and addition step of line 7 and line 8
in Alg 8.3.
Running time:
All-to-all broadcast (among the √p processes of a row or column, with messages of size n^2/p):
T=(ts + (n^2/p) tw)(√p-1) on any architecture
T=ts log √p + (n^2/p) tw (√p-1) on hypercube
Computation (√p multiplications of (n/√p) × (n/√p) submatrices):
T=√p*(n/√p)^3=n^3/p.
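The block algorithm can be sketched as a sequential simulation in which each "process" P(i,j) gathers a block row of A and a block column of B and then accumulates its block of C. The name `block_matmul` and the data layout are assumptions made for this illustration; the sketch assumes p is a perfect square and √p divides n.

```python
import math


def block_matmul(A, B, p):
    """Simulate C = A B on a sqrt(p) x sqrt(p) logical mesh of processes.

    Hypothetical helper; assumes p is a perfect square and sqrt(p) divides n.
    """
    n = len(A)
    q = math.isqrt(p)
    assert q * q == p and n % q == 0, "sketch assumes a square mesh dividing n"
    m = n // q  # block side length, n / sqrt(p)

    def block(M, i, j):
        return [row[j * m:(j + 1) * m] for row in M[i * m:(i + 1) * m]]

    # Initial partitioning: process (i, j) stores blocks A(i,j) and B(i,j).
    A_blk = {(i, j): block(A, i, j) for i in range(q) for j in range(q)}
    B_blk = {(i, j): block(B, i, j) for i in range(q) for j in range(q)}

    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            # All-to-all broadcasts: P(i,j) ends up with block row i of A
            # and block column j of B.
            row_of_A = [A_blk[(i, k)] for k in range(q)]
            col_of_B = [B_blk[(k, j)] for k in range(q)]

            # Computation: C(i,j) = sum over k of A(i,k) * B(k,j).
            Cij = [[0] * m for _ in range(m)]
            for k in range(q):
                for r in range(m):
                    for c in range(m):
                        Cij[r][c] += sum(row_of_A[k][r][t] * col_of_B[k][t][c]
                                         for t in range(m))
            for r in range(m):
                C[i * m + r][j * m:(j + 1) * m] = Cij[r]
    return C


print(block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], p=4))  # [[19, 22], [43, 50]]
```

Each simulated process performs √p block products of m^3 multiply-adds each, which is the n^3/p computation term.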
Matrix Multiplication
The input matrices A and B are divided into p block-submatrices, each
one of dimension m × m, where m = n/√p. We call this distribution of
the input among the processors block distribution. This way, element
A(i, j), 0 ≤ i < n, 0 ≤ j < n, belongs to the (⌊j/m⌋·√p + ⌊i/m⌋)-th block,
which is subsequently assigned to the memory of the same-numbered
processor.
Let Ai (respectively, Bi) denote the i-th block of A (respectively, B)
stored in processor i. With these conventions the algorithm can be
described in Figure 1. The following Proposition describes the
performance of the aforementioned algorithm.
Sorting Algorithm
Rearranging a list of numbers into
increasing (strictly, nondecreasing) order.
Potential Speedup
O(n log n) is optimal for any sequential sorting algorithm
that does not use special properties of the numbers.
The best we can expect based upon a sequential sorting
algorithm using n processors is
Optimal parallel time complexity = O(n log n)/n = O(log n)
This has been obtained, but the constant hidden in the order
notation is extremely large.
Compare-and-Exchange Sorting Algorithms:
Compare and Exchange
Form the basis of several, if not most, classical sequential
sorting algorithms.
Two numbers, say A and B, are compared. If A > B, A and B are
exchanged, i.e.:
if (A > B) {
    temp = A;
    A = B;
    B = temp;
}
Message Passing Method
P1 sends A to P2 and P2 sends B to P1.
Then both processes perform compare operations.
P1 keeps the larger of A and B and P2 keeps the smaller of A and B.
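The symmetric exchange can be sketched without a real message-passing library by modelling each "send" as a plain assignment. The function name `compare_exchange` is hypothetical, and the keep-larger and keep-smaller roles follow the description above.

```python
def compare_exchange(a, b):
    """P1 holds a, P2 holds b. Returns (value kept by P1, value kept by P2)."""
    # Both processes send their value to the partner (modelled as assignments).
    received_by_p1 = b
    received_by_p2 = a

    # Both processes then compare locally, with no further communication.
    p1_keeps = max(a, received_by_p1)  # P1 keeps the larger value
    p2_keeps = min(b, received_by_p2)  # P2 keeps the smaller value
    return p1_keeps, p2_keeps


print(compare_exchange(3, 9))  # (9, 3)
print(compare_exchange(9, 3))  # (9, 3)
```

Both processes do the same comparison redundantly, which costs one extra compare but avoids a second round of messages.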
Merging Two Sublists
Bubble Sort
First, the largest number is moved to the end of the list
by a series of compares and exchanges,
starting at the opposite end.
The actions are repeated with subsequent numbers,
stopping just before the previously positioned number.
In this way, the larger numbers move
(“bubble”) toward one end.
Bubble Sort
Time Complexity
Number of compare and exchange operations:
$\sum_{i=1}^{n-1} i = \frac{n(n-1)}{2}$
This indicates a time complexity of O(n^2), given that
a single compare-and-exchange operation has a
constant complexity, O(1).
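A minimal sequential bubble sort that also counts its compare-and-exchange operations illustrates the quadratic bound. The function name and the returned counter are illustrative additions.

```python
def bubble_sort(values):
    """Return (sorted copy, number of compare-and-exchange operations)."""
    a = list(values)
    n = len(a)
    count = 0
    # After each pass the largest remaining number sits at position `end`,
    # so the next pass stops one position earlier.
    for end in range(n - 1, 0, -1):
        for i in range(end):
            count += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a, count


data = [4, 2, 7, 8, 5, 1, 3, 6]
result, ops = bubble_sort(data)
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(ops)     # 28, i.e. n(n-1)/2 for n = 8
```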
Parallel Bubble Sort
An iteration could start before the previous iteration has finished, provided it
does not overtake the previous bubbling action.
Odd-Even (Transposition) Sort
Variation of bubble sort.
Operates in two alternating phases, even phase and
odd phase.
Even phase
Even-numbered processes exchange numbers with
their right neighbor.
Odd phase
Odd-numbered processes exchange numbers with
their right neighbor.
Sequential Odd-Even Transposition
Parallel Odd-Even Transposition Sort
Sorting eight numbers
Parallel Odd-Even Transposition Sort
Consider the one item per processor case.
There are n iterations; in each iteration, each
processor does one compare-exchange –which
can all be done in parallel.
The parallel run time of this formulation is Θ(n).
This is cost optimal with respect to the base
serial algorithm but not the optimal serial algorithm.
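The one-item-per-processor formulation can be sketched sequentially: the inner loop below visits the pairs that would all compare-exchange simultaneously in one parallel phase. The function name is illustrative, and phase 0 is taken to be the even phase.

```python
def odd_even_transposition_sort(values):
    """Sort a copy of `values` using n alternating even and odd phases."""
    a = list(values)
    n = len(a)
    for phase in range(n):
        # Even phase: pairs (0,1), (2,3), ...  Odd phase: pairs (1,2), (3,4), ...
        start = phase % 2
        # On a parallel machine every pair in this loop is handled at once.
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a


print(odd_even_transposition_sort([4, 2, 7, 8, 5, 1, 3, 6]))
# [1, 2, 3, 4, 5, 6, 7, 8]
```

With n processors each phase takes constant time, so the n phases give the Θ(n) parallel run time, for a process-time product of Θ(n^2).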
Parallel Odd-Even Transposition Sort
Consider a block of n/p elements per processor.
The first step is a local sort.
In each subsequent step, the compare exchange
operation is replaced by the compare split operation.
There are p phases with each phase performing
Θ(n/p) compares and Θ(n/p) communication.
The parallel run time of the formulation is
$T_P = \Theta\left(\frac{n}{p}\log\frac{n}{p}\right) + \Theta(n) + \Theta(n)$
(local sort, comparisons, and communication, respectively).
The parallel formulation is cost-optimal for p = O(log n).
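The block formulation can be sketched as follows, assuming p divides n. The helper names `compare_split` and `block_odd_even_sort` are illustrative; the merge uses the standard-library `heapq.merge`, which runs in time linear in the two block sizes.

```python
import heapq


def compare_split(low_block, high_block):
    """Merge two sorted blocks; the left partner keeps the smaller half,
    the right partner keeps the larger half."""
    merged = list(heapq.merge(low_block, high_block))
    k = len(low_block)
    return merged[:k], merged[k:]


def block_odd_even_sort(values, p):
    """Odd-even transposition sort with n/p elements per simulated process."""
    n = len(values)
    assert n % p == 0, "sketch assumes p divides n"
    b = n // p

    # First step: each process sorts its own block locally.
    blocks = [sorted(values[k * b:(k + 1) * b]) for k in range(p)]

    # p phases; neighbouring processes do a compare-split instead of a
    # single compare-exchange.
    for phase in range(p):
        for k in range(phase % 2, p - 1, 2):
            blocks[k], blocks[k + 1] = compare_split(blocks[k], blocks[k + 1])
    return [v for blk in blocks for v in blk]


print(block_odd_even_sort([4, 2, 7, 8, 5, 1, 3, 6], p=4))
# [1, 2, 3, 4, 5, 6, 7, 8]
```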
Quicksort
Very popular sequential sorting algorithm that performs
well with average sequential time complexity of O(n log n).
First, the list is divided into two sublists. All numbers in one
sublist are arranged to be smaller than all numbers in the other
sublist.
This is achieved by first selecting one number, called a pivot,
against which every other number is compared. If the
number is less than the pivot, it is placed in one sublist.
Otherwise, it is placed in the other sublist.
The pivot could be any number in the list, but often the first
number in the list is chosen. The pivot itself could be placed in
one sublist, or the pivot could be separated and placed
in its final position.
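A short sequential sketch of this scheme, choosing the first number as the pivot and placing the pivot directly in its final position between the two sublists. It builds new lists rather than partitioning in place, which is a simplification made here for readability.

```python
def quicksort(values):
    """Return a sorted copy of `values` using first-element pivots."""
    if len(values) <= 1:
        return list(values)
    pivot = values[0]
    # Numbers less than the pivot go to one sublist, the rest to the other.
    smaller = [v for v in values[1:] if v < pivot]
    larger = [v for v in values[1:] if v >= pivot]
    # The pivot is separated and placed in its final position.
    return quicksort(smaller) + [pivot] + quicksort(larger)


print(quicksort([4, 2, 7, 8, 5, 1, 3, 6]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```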
Example of the quicksort algorithm
sorting a sequence of size n= 8.
Parallel Quicksort
Let's start with recursive decomposition - the list is
partitioned by a single process and then each of the
subproblems is handled by a different processor.
The time for this algorithm is lower-bounded by Ω(n)!
–Not cost optimal as the process-time product is Ω(n^2).
Can we parallelize the partitioning step - in particular,
if we can use n processors to partition a list of length
n around a pivot in O(1) time, we have a winner.
This is difficult to do on real machines, though.
Parallel Quicksort
Using tree allocation of processes
Parallel Quicksort
With the pivot being withheld in processes:
Fundamental problem with all tree
constructions: the initial division is done by a single
processor, which will seriously limit speed.
The tree in quicksort will not, in general, be
perfectly balanced.
Pivot selection is very important to make
quicksort operate fast.
Parallelizing Quicksort: PRAM
We assume a CRCW (concurrent read, concurrent write) PRAM
with concurrent writes resulting in an arbitrary write succeeding.
The formulation works by creating pools of processors. Every
processor is assigned to the same pool initially and has one
element.
Each processor attempts to write its element to a common
location (for the pool).
Each processor tries to read back the location. If the value read
back is greater than the processor's value, it assigns itself to the
'left' pool, else, it assigns itself to the 'right' pool.
Each pool performs this operation recursively.
Note that the algorithm generates a tree of pivots. The depth of
the tree is the expected parallel runtime. The average value is
O(log n).
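The pool-splitting idea can be sketched as a sequential simulation. Several things here are assumptions of the sketch rather than part of the PRAM formulation: the "arbitrary write succeeding" is modelled by a seeded random choice, pools that would work in parallel are processed by ordinary recursion, and ties are sent to the right pool. The function name `pram_quicksort` is hypothetical.

```python
import random


def pram_quicksort(values, seed=0):
    """Simulate the CRCW PRAM pivot-tree construction.

    Returns (sorted values, depth of the pivot tree).
    """
    rng = random.Random(seed)

    def build(pool):
        # `pool` is a list of processor ids; processor i holds values[i].
        if not pool:
            return None, 0
        # All processors in the pool write concurrently; one arbitrary
        # write succeeds. That processor's element becomes the pivot.
        winner = rng.choice(pool)
        pivot = values[winner]
        # Every other processor reads the location back and picks a pool.
        left = [i for i in pool if i != winner and pivot > values[i]]
        right = [i for i in pool if i != winner and not pivot > values[i]]
        left_tree, left_depth = build(left)
        right_tree, right_depth = build(right)
        return (left_tree, winner, right_tree), 1 + max(left_depth, right_depth)

    def inorder(node):
        if node is None:
            return []
        left_tree, winner, right_tree = node
        return inorder(left_tree) + [values[winner]] + inorder(right_tree)

    tree, depth = build(list(range(len(values))))
    return inorder(tree), depth


ordered, depth = pram_quicksort([33, 21, 13, 54, 82, 33, 40, 72])
print(ordered)  # [13, 21, 33, 33, 40, 54, 72, 82]
print(depth)    # depends on which writes win; between 4 and 8 for 8 elements
```

An in-order traversal of the pivot tree yields the sorted sequence, and the tree depth is the number of parallel rounds the PRAM would need.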
Parallelizing Quicksort: PRAM
A binary tree generated by the execution of the quicksort algorithm.
Each level of the tree represents a different array-partitioning
iteration. If pivot selection is optimal, then the height of the tree is
Θ(log n), which is also the number of iterations.
Parallelizing Quicksort: PRAM Formulation
The execution of the PRAM algorithm on the array shown in (a).
Thank you!