
Data Structures and Algorithms
in Parallel Computing
Lecture 10
Numerical algorithms
• Algorithms that use numerical approximation to solve
mathematical problems
• They do not seek exact solutions because this would be
nearly impossible in practice
• Much work has been done on parallelizing numerical algorithms for:
– Matrix operations
– Particle physics
– Systems of linear equations
Note: https://courses.engr.illinois.edu/cs554/fa2015/notes/index.html
Matrix operations
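• Inner product: $x^T y = \sum_{i=1}^{n} x_i y_i$
• Outer product: $Z = x y^T$, with $z_{ij} = x_i y_j$
• Matrix vector product: $y = A x$, with $y_i = \sum_{j=1}^{n} a_{ij} x_j$
• Matrix matrix product: $C = A B$, with $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$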
Inner product
• Assign (n/k)/p coarse-grain tasks to each of p processes, for
total of n/p components of x and y per process
• Communication: sum reduction over n/k coarse-grain tasks
• Isoefficiency:
– How the amount of computation performed must scale with
processor number to keep efficiency constant
– 1D mesh: Θ(p^2)
– 2D mesh: Θ(p^{3/2})
– Hypercube: Θ(p log p)
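The block partitioning and sum reduction described above can be sketched in plain Python. This is a sequential simulation, not a real message-passing program: the function name `parallel_inner_product` and the list `partial` standing in for per-process memory are assumptions made for illustration.

```python
def parallel_inner_product(x, y, p):
    """Simulate p processes, each owning a contiguous block of x and y."""
    n = len(x)
    assert len(y) == n and p >= 1
    # Block partition: process r owns indices [bounds[r], bounds[r + 1]).
    bounds = [r * n // p for r in range(p + 1)]
    # Local phase: each process sums the products of its own components.
    partial = [
        sum(x[i] * y[i] for i in range(bounds[r], bounds[r + 1]))
        for r in range(p)
    ]
    # Communication phase: tree-style sum reduction in about log2(p) rounds.
    stride = 1
    while stride < p:
        for r in range(0, p - stride, 2 * stride):
            partial[r] += partial[r + stride]  # process r receives from r + stride
        stride *= 2
    return partial[0]  # process 0 ends up holding the full inner product
```

The doubling `stride` loop mimics the logarithmic-depth reduction one would expect on a hypercube; on a 1D mesh the same sum would need more communication steps, which is what the differing isoefficiency functions reflect.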
Outer product
• At most 2n fine-grain tasks store components of x and y:
– for some j, task (i,j) stores x_i and task (j,i) stores y_i, or
– task (i,i) stores both x_i and y_i, i = 1,...,n
• Communication:
– For i = 1,...,n, task that stores x_i broadcasts it to all
other tasks in ith task row
– For j = 1,...,n, task that stores y_j broadcasts it to all
other tasks in jth task column
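The row and column broadcasts above can be illustrated with a small sequential sketch. The name `outer_product_tasks` and the `recv_x` / `recv_y` grids (standing in for the value each task holds after the broadcasts) are hypothetical, and x and y are assumed to have the same length n.

```python
def outer_product_tasks(x, y):
    """Simulate an n-by-n grid of fine-grain tasks computing x * y^T."""
    n = len(x)
    assert len(y) == n
    # recv_x[i][j] and recv_y[i][j] are what task (i, j) holds after the broadcasts.
    recv_x = [[None] * n for _ in range(n)]
    recv_y = [[None] * n for _ in range(n)]
    for i in range(n):      # x_i is broadcast to every task in task row i
        for j in range(n):
            recv_x[i][j] = x[i]
    for j in range(n):      # y_j is broadcast to every task in task column j
        for i in range(n):
            recv_y[i][j] = y[j]
    # Local phase: task (i, j) multiplies the two values it received.
    return [[recv_x[i][j] * recv_y[i][j] for j in range(n)] for i in range(n)]
```

Note that no reduction is needed here: once the broadcasts finish, each task computes its entry independently.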
1D mapping
• Column wise
• Row wise
• Each task holding either x or y components must broadcast
them to neighbors
• Isoefficiency: Θ(p^2)
2D mapping
• Isoefficiency: Θ(p^2)
Matrix vector product
• At most 2n fine-grain tasks store components of x and y, say
– For some j, task (j,i) stores x_i and task (i,j) stores y_i, or
– Task (i,i) stores both x_i and y_i, i = 1,...,n
• Communication
– For j = 1,...,n, task that stores x_j broadcasts
it to all other tasks in jth task column
– For i = 1,...,n, sum reduction over ith task
row gives y_i
Matrix vector product
• Steps
1. Broadcast x_j to tasks (k,j), k = 1,...,n
2. Compute the local product a_ij x_j, the contribution of task (i,j) to y_i
3. Sum-reduce these products across tasks (i,k), k = 1,...,n, to obtain y_i
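The three steps above can be traced in a short sequential sketch. The function name `matvec_tasks` and the `local` grid (one entry per fine-grain task) are assumptions for illustration, and A is assumed to be a square n-by-n list of lists.

```python
def matvec_tasks(A, x):
    """Simulate the fine-grain broadcast / multiply / reduce scheme for y = A x."""
    n = len(x)
    # Step 1: x_j is broadcast down task column j, so task (i, j) holds x[j].
    # Step 2: task (i, j) forms its local product a_ij * x_j.
    local = [[A[i][j] * x[j] for j in range(n)] for i in range(n)]
    # Step 3: a sum reduction across task row i yields y_i.
    return [sum(local[i]) for i in range(n)]
```

In a real implementation steps 1 and 3 are the only ones that communicate; step 2 is purely local to each task.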
2D mapping
• Isoefficiency: Θ(p^2)
1D column mapping
• Isoefficiency: Θ(p^2)
1D row mapping
• Isoefficiency: Θ(p^2)
Matrix matrix product
• Matrix-matrix product can be viewed as:
– n^2 inner products, or
– sum of n outer products, or
– n matrix-vector products
• Each viewpoint yields different algorithm
• One way to derive parallel algorithms for matrix-matrix product is to apply parallel algorithms
already developed for inner product, outer
product, or matrix-vector product
• Here, we will instead investigate parallel algorithms for this
problem directly
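To make the three viewpoints concrete, the sketch below computes the same product in three loop orderings. The function names are hypothetical, and A and B are assumed to be square n-by-n lists of lists.

```python
def matmul_inner(A, B):
    """View 1: each c_ij is the inner product of row i of A and column j of B."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_outer(A, B):
    """View 2: C is the sum of n outer products (column k of A) * (row k of B)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]  # rank-one update number k
    return C

def matmul_matvec(A, B):
    """View 3: column j of C is the matrix-vector product A * (column j of B)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for j in range(n):
        col = [B[k][j] for k in range(n)]
        y = [sum(A[i][k] * col[k] for k in range(n)) for i in range(n)]
        for i in range(n):
            C[i][j] = y[i]
    return C
```

All three perform the same n^3 multiplications; they differ only in which loop is outermost, which is what changes the data each task needs and therefore the communication pattern.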
Matrix matrix product
• At most 3n^2 fine-grain tasks store entries of A, B, or C,
say task (i,j,j) stores a_ij, task (i,j,i) stores b_ij, and task
(i,j,k) stores c_ij for i,j = 1,...,n and some fixed k
• (i,j,k) = (row, column, layer)
• Communication
– Broadcast entries of jth column of A horizontally along
each task row in jth layer
– Broadcast entries of ith row of B vertically along each task
column in ith layer
– For i,j = 1,...,n, result c_ij is given by sum reduction over
tasks (i,j,k), k = 1,...,n
Matrix matrix product
• Steps
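1. Broadcast a_ik to tasks (i,q,k), q = 1,...,n
2. Broadcast b_kj to tasks (q,j,k), q = 1,...,n
3. Compute the local product a_ik b_kj, the contribution of task (i,j,k) to c_ij
4. Sum-reduce these products across tasks (i,j,q), q = 1,...,n, to obtain c_ij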
• Task grouping
– Agglomerate fine-grain tasks to reduce the number of processors needed
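The fine-grain 3D scheme can be traced with a sequential sketch in which a nested list plays the role of the n-by-n-by-n task grid. The name `matmul_3d_tasks` and the `local` array are assumptions for illustration, and A and B are assumed to be square.

```python
def matmul_3d_tasks(A, B):
    """Simulate the (row, column, layer) task grid for C = A B."""
    n = len(A)
    # After the two broadcasts, task (i, j, k) holds a_ik and b_kj
    # and multiplies them locally.
    local = [[[A[i][k] * B[k][j] for k in range(n)]
              for j in range(n)]
             for i in range(n)]
    # A sum reduction across the layers k = 1..n gives c_ij.
    return [[sum(local[i][j]) for j in range(n)] for i in range(n)]
```

The sketch uses n^3 task slots, which is why grouping tasks onto fewer processors matters in practice.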
Particle systems
• Many physical systems can be modeled as a
collection of interacting particles
– Atoms in molecule
– Planets in solar system
– Stars in galaxy
– Galaxies in clusters
• Particles exert mutual forces on each other
– Gravitational
– Electrostatic
N-body model
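• Newton’s second law: $F = m a$
• Force between two particles at positions $x_i$ and $x_j$: $f(x_i, x_j)$
• Overall force on ith particle: $F(x_i) = \sum_{j=1, j \ne i}^{n} f(x_i, x_j)$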
• Cost is O(n^2) due to particle-particle interactions
• Can be reduced to O(n log n) or O(n) through
– Hierarchical trees
– Multipole methods
• These methods trade some accuracy for the lower cost
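A direct evaluation of all pairwise forces shows where the O(n^2) cost comes from. The sketch below assumes a 2D gravitational force law with constant `G`; the function name `direct_forces` and the choice of two dimensions are illustrative assumptions, not taken from the lecture.

```python
import math

def direct_forces(pos, mass, G=1.0):
    """Return the net force on each particle by summing over all other particles."""
    n = len(pos)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):          # n - 1 interactions per particle: O(n^2) total
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.hypot(dx, dy)
            # Assumed gravitational law: magnitude G * m_i * m_j / r^2,
            # directed from particle i toward particle j.
            s = G * mass[i] * mass[j] / r ** 3
            forces[i][0] += s * dx
            forces[i][1] += s * dy
    return forces
```

The double loop visits every ordered pair, so doubling n quadruples the work; tree and multipole methods avoid this by not visiting distant pairs individually.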
Trivial parallelism
• High parallelism, but the total work is prohibitive and memory
requirements may be expensive
• Two steps
– Broadcast position of each particle along rows and columns
– Reduce forces diagonally (to home of particle) and perform time integration
• Agglomeration can reduce communication in rows or columns
Reducing complexity
• Forces have infinite range, but with declining strength
• Three major options
– Perform full computation at O(n^2) cost
– Discard forces from particles beyond certain
range, introducing error that is bounded away
from zero
– Approximate long-range forces, exploiting
behavior of force and/or features of problem
• Monopole representation (or tree code)
• Method:
– Aggregate distant particles into cells and represent
effect of all particles in a cell by monopole (first term
in multipole expansion) evaluated at center of cell
• Replace influence of far away particles with aggregate
approximation of force
– Use larger cells at greater distances
– Approximation is relatively crude
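The monopole idea can be illustrated by replacing a distant cell of particles with a single aggregate particle. The sketch below assumes a 2D gravitational force law and places the aggregate at the cell's center of mass rather than its geometric center; both choices, and the names `pair_force`, `monopole`, and `force_from_cell`, are assumptions for illustration.

```python
import math

def pair_force(p, q, mp, mq, G=1.0):
    """Assumed gravitational force on a particle at p due to a particle at q."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    r = math.hypot(dx, dy)
    s = G * mp * mq / r ** 3
    return (s * dx, s * dy)

def monopole(cell_pos, cell_mass):
    """Aggregate a cell into its total mass and center of mass."""
    M = sum(cell_mass)
    cx = sum(m * p[0] for p, m in zip(cell_pos, cell_mass)) / M
    cy = sum(m * p[1] for p, m in zip(cell_pos, cell_mass)) / M
    return M, (cx, cy)

def force_from_cell(target, m_target, cell_pos, cell_mass):
    """Approximate the cell's force on a distant target with one interaction."""
    M, center = monopole(cell_pos, cell_mass)
    return pair_force(target, center, m_target, M)
```

For a cell far from the target, the single aggregate interaction stands in for one interaction per particle in the cell; the error grows as the cell gets larger relative to its distance, which is why larger cells are used only at greater distances.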
Parallel Approach
• Divide domain into patches, with each patch assigned to a process
• Tree code replaces communication with all processes by
communication with fewer processes
• To avoid accuracy problem of monopole expansion, use full
multipole expansion
What’s next?
• Discuss some recent papers on parallel
algorithms dealing with classes of problems
discussed during this lecture