The MAP3S Static-and-Regular Mesh Simulation and Wavefront

advertisement
The MAP3S
Static-and-Regular Mesh
Simulation and Wavefront
Parallel-Programming Patterns
By Robert Niewiadomski, José Nelson Amaral, and Duane Szafron
Department of Computing Science
University of Alberta
Edmonton, Alberta, Canada
Pattern-based parallel-programming
• Observation:
– Many seemingly different parallel programs have a common parallel
computation-communication-synchronization pattern.
• A Parallel-programming pattern instance:
– Is a parallel program that adheres to a certain parallel computationcommunication-synchronization pattern.
– Consists of engine-side code and user-side code:
• Engine-side code:
– Is complete and handles all communication and synchronization.
• User-side code:
– Is incomplete and handles all computation.
– User completes the incomplete portions.
• MAP3S targets distributed-memory systems.
MAP3S
• MAP3S = MPI/C Advanced Pattern-based Parallel Programming System
Technical expertise
Engine
designer
Domain knowledge
Pattern
designer
Application
developer
Pattern-based parallel-programming
• The MAP3S usage scheme:
Select Pattern
Create
Specification File
(p.e. dimensions of mesh, data dependences, etc)
Generate
Pattern Instance
(automatic by pattern-instance generator)
Write User-side
Code
(domain-specific computation code)
The Simulation and Wavefront
computations
• The computations operate on a k-dimensional mesh of elements.
• Simulation:
– Multiple mesh instances M0, M1, … are computed.
– In iteration i = 0, elements of M0 are initialized.
– In iteration i > 0, certain elements of Mi are computed using elements of Mi - 1
that were initialized/computed in previous iteration.
– Execution proceeds until a terminating condition is met.
– Example: cellular-automata computations.
• Wavefront:
– Single mesh instance M is computed.
– In iteration i = 0,certain elements of M are initialized.
– In iteration i > 0, elements of M whose data dependences are satisfied are
computed.
– Execution proceeds until there are no elements to compute.
– Example: dynamic-programming computations.
Mesh-blocks
• A k-dimensional mesh is logically partitioned into kdimensional sub-meshes called mesh-blocks.
0
1
2
3
4
5
0
6
7
8
9 10 11
6
A
1
2
7
8
B
3
4
C
5
9 10 11
12 13 14 15 16 17
12 13 14 15 16 17
18 19 20 21 22 23
18 19 20 21 22 23
24 25 26 27 28 29
24 25 26 27 28 29
30 31 32 33 34 35
30 31 32 33 34 35
D
G
E
H
F
I
• Computation proceeds at granularity of mesh-blocks.
User-side code: Simulation
Prelude: process command-line
arguments.
Prelude
Prelude
Prologue: initialize first mesh, possibly at
granularity of mesh blocks.
Prologue
Prologue
BodyLocal
BodyLocal
BodyGlobal: decide whether to compute
another mesh or to terminate.
BodyGlobal
BodyGlobal
Epilogue: process last computed mesh,
possibly at granularity of mesh blocks.
Epilogue
Epilogue
BodyLocal: compute next mesh at
granularity of mesh blocks.
User-side code: Simulation
Prelude: process command-line arguments.
Prelude
Prelude
Prologue
Prologue
BodyLocal
BodyLocal
BodyGlobal: decide whether to compute
another mesh or to terminate.
BodyGlobal
BodyGlobal
Epilogue: process last computed mesh,
possibly at granularity of mesh blocks.
Epilogue
Epilogue
Prologue: initialize first mesh, possibly at
granularity of mesh blocks.
BodyLocal: compute next mesh at
granularity of mesh blocks.
User-side code: Simulation
Prelude: process command-line arguments.
Prelude
Prelude
Prologue
Prologue
BodyLocal
BodyLocal
BodyGlobal: decide whether to compute
another mesh or to terminate.
BodyGlobal
BodyGlobal
Epilogue: process last computed mesh,
possibly at granularity of mesh blocks.
Epilogue
Epilogue
Prologue: initialize first mesh, possibly at
granularity of mesh blocks.
BodyLocal: compute next mesh at
granularity of mesh blocks.
User-side code: Simulation
Prelude: process command-line arguments.
Prelude
Prelude
Prologue
Prologue
BodyLocal
BodyLocal
BodyGlobal: decide whether to compute
another mesh or to terminate.
BodyGlobal
BodyGlobal
Epilogue: process last computed mesh,
possibly at granularity of mesh blocks.
Epilogue
Epilogue
Prologue: initialize first mesh, possibly at
granularity of mesh blocks.
BodyLocal: compute next mesh at
granularity of mesh blocks.
User-side code: Simulation
Prelude: process command-line arguments.
Prelude
Prelude
Prologue
Prologue
BodyLocal
BodyLocal
BodyGlobal: decide whether to compute
another mesh or to terminate.
BodyGlobal
BodyGlobal
Epilogue: process last computed mesh,
possibly at granularity of mesh blocks.
Epilogue
Epilogue
Prologue: initialize first mesh, possibly at
granularity of mesh blocks.
BodyLocal: compute next mesh at
granularity of mesh blocks.
User-side code: Wavefront
Prelude: process command-line arguments
Prologue: initialize mesh, possibly at
granularity of mesh blocks.
Body: continue computing of mesh at
granularity of mesh blocks
Epilogue: process mesh, possibly at
granularity of mesh blocks.
Prelude
Prelude
Prologue
Prologue
Body
Body
Epilogue
Epilogue
Data-dependency specification
• The computation of an element depends on the values of certain other
elements.
• In MAP3S, the user specifies these data-dependencies using conditional
shape-lists at pattern-instance generation time.
– Syntax: given an element p(c0, c1,…, ck - 1), if a certain condition is met, then,
the computation of p requires the values of all the elements falling into the
specified k-dimensional volumes of the k-dimensional mesh, each of which is
specified relative to position (c0, c1,…, ck - 1).
– Here is a simple example (expressing dependences for the green element):
0 1 2 3
0
1
2
3
{“x > 0 && y > 0”,
{([“x - 1”, ”x - 1”], [“y - 1”, “y - 1”]),
([“x - 1”, “x - 1”], [“y”, “y”]),
([“x”, “x”], [“y - 1”, ”y - 1”])}};
Data-dependency specification
• The strengths of conditional shape-lists:
• user is not limited to pre-defined data-dependency specifications,
• user is able to express irregular data-dependency specifications.
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
9
{"y<=x", {(["0","y-1"],["y","y"]),
(["x","x"],["0","y-1"])}};
• In this example, conditional shape-lists specify the data-dependencies of
the Lower/Upper Matrix-Decomposition Wavefront computation.
Data-dependency specification
• The strengths of conditional shape-lists:
• user is not limited to pre-defined data-dependency specifications,
• user is able to express irregular data-dependency specifications.
0 1 2 3 4 5 6 7 8 9
0
1
2
3
4
5
6
7
8
9
{"y>x", {(["0","x-1"],["y","y"]),
(["x","x"],["0","x-1"]),
(["x","x"], ["x","x"])}};
{"y<=x", {(["0","y-1"],["y","y"]),
(["x","x"],["0","y-1"])}};
• In this example, conditional shape-lists specify the data-dependencies of
the Lower/Upper Matrix-Decomposition Wavefront computation.
Direct mesh-access
• In the user-code all the mesh elements can be accessed directly.
0
0
1
2
3
4
1
2
3
4
5
void computeMeshBlock(mesh, xMin, xMax, yMin, yMax) {
for(x = xMin; x<= xMax; x++) {
for(y = yMin; y<= yMax; y++) {
mesh[x][y] = f(mesh[x-1][y-1], mesh[x][y-1],
mesh[x-1][y]);
}
}
}
5
• With direct mesh-access, the user does not need to refactor their
sequential-code w.r.t. mesh access. In contrast, with indirect mesh-access
a refactoring is necessary, since input elements are accessed in auxiliary
data-structures.
Engine-side code
Engine-side code in the Wavefront pattern.
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
0
3
4
9
10 11
12 13
14 15
16 17
18 19
20 21
22 23
24 25
26 27
28 29
30 31
32 33
34 35
6
A
1
2
7
8
D
G
B
E
H
C
5
F
I
• Element-level data-dependencies --- specified by the user --- are
automatically extended to mesh-block-level data-dependencies.
Engine-side code
• The mesh-block-level data-dependencies are utilized
to establish a parallel-computation schedule.
0
6
0
0
3
4
9
10 11
12 13
14 15
16 17
2
18 19
20 21
22 23
8
24 25
26 27
28 29
30 31
32 33
34 35
6
A
1
2
7
8
D
G
B
E
H
C
5
6
F
I
4
B
A
A
1
7
1
7
3
12 13
9
18 19
D
5
14 15
24 25
10 11
20 21
30 31
C
G
E
16 17
26 27
28 29
22 23
32 33
34 35
F
H
I
Engine-side code
CPU 0
• The parallel computation-schedule is
refined with mesh-blocks being assigned
among the processors in a round-robin
fashion (shown).
• The parallel computation-schedule is then
complemented with a parallel
communication schedule (not shown).
• The engine-side code executes user-side
code in accordance to the parallel
computation and communication
schedule.
CPU 1
0
6
0
6
2
8
4
B
A
A
1
7
1
7
3
12 13
9
18 19
D
5
14 15
24 25
10 11
20 21
30 31
C
G
E
16 17
26 27
28 29
22 23
32 33
34 35
F
H
I
Engine-side code
CPU 0
• Execution of user-side code by
the engine-side code when using
sequential prologue and
epilogue.
CPU 1
Prelude
Prologue(A)
0 1
A
6 7
Body(A)
2 3
B
8 9
4 5
C
10 11
16 17
F
22 23
Time
0 1
A
6 7
12 13
D
18 19
14 15
E
20 21
26 27
H
32 33
24 25
G
30 31
28 29
I
34 35
Body(B)
Body(D)
Body(C)
Body(E)
Body(G)
Epilogue(A,B,C,D,E,G)
Mesh representation
• The mesh can be represented using
either the dense mesh-representation or
the sparse mesh-representation.
• Sparse representation can have better
locality and can distribute the memory
footprint of the mesh among nodes.
0
8
16
24
32
40
48
56
1
9
17
25
33
41
49
57
2
10
18
26
34
42
50
58
3
11
19
27
35
43
51
59
4
12
20
28
36
44
52
60
5
13
21
29
37
45
53
61
6
14
22
30
38
46
54
62
7
15
23
31
39
47
55
63
2D-mesh in dense mesh-representation.
0
8
1 16 17 32 33 48 49 2 3 18 19 34 35 50 51 4 5 20 21 36 37 52 53 6 7 22 23 38 39 54 55
9 24 25 40 41 56 57 10 11 26 27 42 43 58 59 12 13 28 29 44 45 60 61 14 15 30 31 46 47 62 63
2D mesh in sparse mesh-representation.
Mesh representation
•A mesh memory-footprint can be as much a problem as performance. The
combination of parallel prologue and epilogue, and sparse-mesh representation, both
minimizes and distributes the mesh-storage memory-footprint.
CPU 0
Do not store dead
mesh-blocks
0 1
A
6 7
12 13
D
18 19
24 25
G
30 31
2 3
B
8 9
14 15
E
20 21
26 27
H
32 33
4 5
C
10 11
16 17
F
22 23
28 29
I
34 35
0 1
A
6 7
12 13
D
18 19
24 25
G
30 31
2 3 4 5
B
C
8 9 10 11
14 15
E
20 21
0 1 2 3 4 5
A
B
C
6 7 8 9 10 11
12 13
D
18 19
24 25
G
30 31
CPU 1
Original
Only store non-owned meshblocks that are used by owned
mesh-blocks.
0 1
A
6 7
12 13
D
18 19
24 25
G
30 31
2 3
B
8 9
14 15
E
20 21
26 27
H
32 33
4 5
C
10 11
16 17
F
22 23
28 29
I
34 35
0 1
A
6 7
12 13
D
18 19
24 25
G
30 31
2 3 4 5
B
C
8 9 10 11
14 15
E
20 21
0 1
A
6 7
12 13
D
18 19
24 25
G
30 31
2 3
B
8 9
14 15
E
20 21
• Memory-footprint reduction varies. It is most effective for large simulation
computations.
Experimental evaluation
• Problems:
– 2D problems:
• GoL: game-of-life (Simulation)
• LUMD: lower/upper matrix-decomposition (Wavefront)
– 3D problems:
• RTA: room-temperature annealing (Simulation)
• MSA: multiple-sequence alignment (Wavefront)
• Hardware:
– GigE:a 16-node cluster with Gigabit Ethernet
– IB: a 128-node cluster with InfiniBand (limited to 64)
Experimental evaluation
• Speedup on GigE:
– x-axis is # of nodes.
– y-axis is speedup.
• Performance gains on
LUMD and MSA are worse
than on GoL and RTA:
– LUMD has non-uniform
computation intensity,
which limits parallelism.
– MSA has limited
computation granularity,
which increases relative
overhead of
communication and
synchronization.
GoL (2D Simulation)
RTA (3D Simulation)
LUMD (2D Wavefront)
MSA (3D Wavefront)
Experimental evaluation
• Speedup on IB:
– x-axis is # of nodes.
– y-axis is speedup.
• Performance gains on
LUMD and MSA are worse
than on GoL and RTA:
– See GigE.
GoL (2D Simulation)
RTA (3D Simulation)
LUMD (2D Wavefront)
MSA (3D Wavefront)
Experimental evaluation
• Capability:
– Use of sparse-mesh representation distributes the mesh memoryfootprint across multiple nodes.
• Allows for handling of meshes whose memory-footprint exceeds memory capacity
of a single node.
– Using 16 nodes on GigE:
Problem Instance
Global Mesh
Memory-Footprint (GB)
Maximum Local
MeshMemory-Footprint
(GB)
GoL (131,072x 131,072)
32
3.0
RTA (1,024x 1,024 x 1,024)
32
4.4
LUMD (40,132x 40,132)
12
3.0
MSA (2,048 x 2,048 x 2,048)
32
3.0
• LUMD mesh memory-footprint effectiveness is limited due to computation
of elements being dependent on a larger number of elements.
Experimental evaluation
• Capability:
– Use of sparse-mesh representation distributes the mesh memoryfootprint across multiple nodes.
• Allows for handling of meshes whose memory-footprint exceeds memory capacity
of a single node.
– Using 16 nodes on GigE:
Problem Instance
Global Mesh
Memory-Footprint (GB)
Maximum Local
MeshMemory-Footprint
(GB)
GoL (131,072x 131,072)
32
3.0
RTA (1,024x 1,024 x 1,024)
32
4.4
LUMD (40,132x 40,132)
12
3.0
MSA (2,048 x 2,048 x 2,048)
32
3.0
• LUMD mesh memory-footprint effectiveness is limited due to computation
of elements being dependent on a larger number of elements.
Experimental evaluation
• What we learned:
– Dense meshes + large computation granularity:
• MAP3S delivers speedups in the range of 10 to 12 on 16 nodes;
• an in the range of 10 to 43 on 64 nodes;
– Sparse meshes:
• smaller speedups
• memory consumption is reduced by 20% to 50% (per node)
The End
Download