Data Structures and Algorithms in Parallel Computing
Lecture 10

Numerical algorithms
• Algorithms that use numerical approximation to solve mathematical problems
• They do not seek exact solutions, because exact solutions are often impossible to obtain in practice
• Much work has been done on parallelizing numerical algorithms
– Matrix operations
– Particle physics
– Systems of linear equations
– …
Note: https://courses.engr.illinois.edu/cs554/fa2015/notes/index.html

Matrix operations
• Inner product: x^T y = Σ_{i=1..n} x_i y_i
• Outer product: x y^T, the n×n matrix with entries x_i y_j
• Matrix-vector product: y = Ax, with y_i = Σ_{j=1..n} a_ij x_j
• Matrix-matrix product: C = AB, with c_ij = Σ_{k=1..n} a_ik b_kj

Inner product
• Assign (n/k)/p coarse-grain tasks to each of p processes, for a total of n/p components of x and y per process
• Communication: sum reduction over the n/k coarse-grain tasks
• Isoefficiency: how the amount of computation performed must scale with the number of processors to keep efficiency constant
– 1D mesh: Θ(p²)
– 2D mesh: Θ(p^(3/2))
– Hypercube: Θ(p log p)

Outer product
• At most 2n fine-grain tasks store components of x and y:
– for some j, task (i,j) stores x_i and task (j,i) stores y_i, or
– task (i,i) stores both x_i and y_i, i = 1,...,n
• Communication:
– For i = 1,...,n, the task that stores x_i broadcasts it to all other tasks in the ith task row
– For j = 1,...,n, the task that stores y_j broadcasts it to all other tasks in the jth task column

1D mapping
• Column-wise or row-wise
• Each task holding x or y components must broadcast them to its neighbors
• Isoefficiency: Θ(p²)

2D mapping
• Isoefficiency: Θ(p²)

Matrix-vector product
• At most 2n fine-grain tasks store components of x and y, say either
– for some j, task (j,i) stores x_i and task (i,j) stores y_i, or
– task (i,i) stores both x_i and y_i, i = 1,...,n
• Communication
– For j = 1,...,n, the task that stores x_j broadcasts it to all other tasks in the jth task column
– For i = 1,...,n, a sum reduction over the ith task row gives y_i

Matrix-vector product
• Steps
1. Broadcast x_j to tasks (k,j), k = 1,...,n
2. Compute y_i = a_ij x_j
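The three steps above can be simulated serially to check the algebra. In this sketch (hypothetical helper, not message-passing code), each "task" (i,j) owns one entry a[i,j]; the column broadcast is modeled by simply reading x[j], and the row-wise sum reduction produces y_i:

```python
import numpy as np

def matvec_2d_tasks(A, x):
    """Simulate the fine-grain 2D matrix-vector algorithm.
    Step 1: broadcast x_j down the jth task column.
    Step 2: each task (i,j) computes the product a[i,j] * x[j].
    Step 3: sum-reduce each task row to obtain y_i."""
    n = A.shape[0]
    partial = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # Task (i,j) has received x[j] via the column broadcast.
            partial[i, j] = A[i, j] * x[j]
    # Step 3: row-wise sum reduction gives y.
    return partial.sum(axis=1)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
print(matvec_2d_tasks(A, x))  # same as A @ x -> [3. 7.]
```

In a real distributed implementation the loop body would run concurrently on n² processes, with the broadcast and reduction done by collective communication.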
3. Reduce y_i across tasks (i,k), k = 1,...,n

2D mapping
• Isoefficiency: Θ(p²)

1D column mapping
• Isoefficiency: Θ(p²)

1D row mapping
• Isoefficiency: Θ(p²)

Matrix-matrix product
• The matrix-matrix product can be viewed as:
– n² inner products, or
– a sum of n outer products, or
– n matrix-vector products
• Each viewpoint yields a different algorithm
• One way to derive parallel algorithms for the matrix-matrix product is to apply the parallel algorithms already developed for the inner product, outer product, or matrix-vector product
• We will investigate parallel algorithms for this problem directly

Matrix-matrix product
• At most 3n² fine-grain tasks store entries of A, B, or C, say task (i,j,j) stores a_ij, task (i,j,i) stores b_ij, and task (i,j,k) stores c_ij for i,j = 1,...,n and some fixed k
• (i,j,k) = (row, column, layer)
• Communication
– Broadcast entries of the jth column of A horizontally along each task row in the jth layer
– Broadcast entries of the ith row of B vertically along each task column in the ith layer
– For i,j = 1,...,n, the result c_ij is given by a sum reduction over tasks (i,j,k), k = 1,...,n

Matrix-matrix product
• Steps
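The 3D fine-grain scheme above can likewise be checked with a serial sketch (illustrative code, not the parallel implementation): task (i,j,k) receives a[i,k] and b[k,j] via the two broadcasts, forms their product, and c_ij is the sum reduction over the layer index k:

```python
import numpy as np

def matmul_3d_tasks(A, B):
    """Simulate the fine-grain 3D matrix-matrix algorithm with n^3 tasks.
    Task (i,j,k) computes a[i,k] * b[k,j]; the reduction across the
    layer dimension k yields c[i,j]."""
    n = A.shape[0]
    layers = np.empty((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                layers[i, j, k] = A[i, k] * B[k, j]  # one task's work
    # Reduce c_ij across tasks (i,j,k), k = 1,...,n.
    return layers.sum(axis=2)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])
print(matmul_3d_tasks(A, B))  # matches A @ B
```

Task grouping corresponds to replacing single entries here with subblocks of A and B, which reduces the number of processes and amortizes communication.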
1. Broadcast a_ik to tasks (i,q,k), q = 1,...,n
2. Broadcast b_kj to tasks (q,j,k), q = 1,...,n
3. Compute c_ij = a_ik b_kj
4. Reduce c_ij across tasks (i,j,q), q = 1,...,n
• Task grouping reduces the number of processors required

Particle systems
• Many physical systems can be modeled as a collection of interacting particles
– Atoms in a molecule
– Planets in the solar system
– Stars in a galaxy
– Galaxies in clusters
• Particles exert mutual forces on each other
– Gravitational
– Electrostatic

N-body model
• Newton's second law: F = ma
• Force between two particles (gravitational case): f_ij = G m_i m_j (x_j − x_i) / ||x_j − x_i||³
• Overall force on the ith particle: F_i = Σ_{j≠i} f_ij

Complexity
• O(n²) due to particle-particle interactions
• Can be reduced to O(n log n) or O(n) through
– Hierarchical trees
– Multipole methods
• At the cost of a penalty in accuracy

Trivial parallelism
• Highly parallel, but the total work is prohibitive and the memory requirements may be expensive
• 2 steps
– Broadcast the position of each particle along rows and columns
– Reduce the forces diagonally (to the home task of each particle) and perform time integration
• Agglomeration can reduce communication in rows or columns

Reducing complexity
• Forces have infinite range, but with declining strength
• Three major options
– Perform the full computation at O(n²) cost
– Discard forces from particles beyond a certain range, introducing an error that is bounded away from zero
– Approximate long-range forces, exploiting the behavior of the force and/or features of the problem

Approach
• Monopole representation, or tree code
• Method:
– Aggregate distant particles into cells and represent the effect of all particles in a cell by a monopole (the first term in the multipole expansion) evaluated at the center of the cell
• Replace the influence of far-away particles with an aggregate approximation of the force
– Use larger cells at greater distances
– The approximation is relatively crude

Parallel approach
• Divide the domain into patches, with each patch assigned to a process
• A tree code replaces communication with all processes by communication with fewer processes
• To avoid the accuracy problem of the monopole expansion, use the full multipole expansion
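The O(n²) all-pairs force evaluation that tree codes approximate can be written down directly. Below is a minimal serial sketch of the gravitational case (the function name and the two-particle test data are illustrative), applying f_ij = G m_i m_j (x_j − x_i)/||x_j − x_i||³ over every pair:

```python
import numpy as np

def all_pairs_forces(pos, mass, G=6.674e-11):
    """Direct O(n^2) N-body force evaluation: the total force on
    particle i is the sum over all j != i of the pairwise force
    G * m_i * m_j * (x_j - x_i) / |x_j - x_i|^3."""
    n = len(mass)
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = pos[j] - pos[i]          # displacement from i to j
            r = np.linalg.norm(d)        # pairwise distance
            forces[i] += G * mass[i] * mass[j] * d / r**3
    return forces

# Two unit masses one unit apart: equal and opposite attractive forces.
pos = np.array([[0.0, 0.0], [1.0, 0.0]])
mass = np.array([1.0, 1.0])
print(all_pairs_forces(pos, mass))
```

A tree code replaces the inner loop over all j with a loop over nearby particles plus a small number of aggregated cells, which is where the O(n log n) cost comes from.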
What's next?
• Discuss some recent papers on parallel algorithms dealing with the classes of problems covered in this lecture