Distributed Stencil Computations: Conway's Game of Life
Assignment 1, 5DV050, Spring 2012
Due on 17/4 at 16:00
Version 1.0
1 Introduction
Conway's famous Game of Life is a two-dimensional cellular automaton in which digital cells
live, die, and reproduce. Despite the simple rules that govern each cell, very complicated
multi-cellular life forms tend to emerge. Even self-reproducing life forms can be found in
Conway's Game of Life. The game is peculiar in that it has zero players, and it was presented
in a 1970 issue of Scientific American in a column by Martin Gardner called
Mathematical Games [1].
The rules of the game are simple enough. The cells live on an infinite two-dimensional
regular grid and the simulation advances in discrete time steps which we will refer to as
generations. Each grid point is either occupied by a living cell or empty. The state
of a grid point (x, y) at time t + 1 depends only on the state of the grid point and its eight
immediate neighbors at time t. The state of a grid point is determined by the following rules:
1. An empty grid point with exactly three occupied neighbors at time t becomes occupied
at time t + 1.
2. An occupied grid point with fewer than two occupied neighbors at time t becomes empty
at time t + 1.
3. An occupied grid point with more than three occupied neighbors at time t becomes empty
at time t + 1.
4. If none of the rules above apply, then the state of the grid point does not change from
time t to time t + 1.
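
For concreteness, the four rules collapse into a single update function. Below is a minimal sketch in C; the function name and the 0/1 encoding of empty/occupied are our own conventions, not something prescribed by the assignment.

    /* Next state of a grid point given its current state (1 = occupied,
     * 0 = empty) and the number of occupied neighbors at time t. */
    int next_state(int occupied, int neighbors)
    {
        if (!occupied)
            return neighbors == 3;              /* rule 1: birth */
        if (neighbors < 2 || neighbors > 3)
            return 0;                           /* rules 2 and 3: death */
        return 1;                               /* rule 4: survival */
    }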
Let us look at a small example to clarify the rules. Figure 1 illustrates four consecutive
generations of a five-cellular life form called a glider. When transitioning from the first to
the second generation, two cells die since they have only one neighboring cell each. However,
two new cells are spawned since the corresponding grid points have three neighboring cells
each. When transitioning to the third generation, two more cells die, but this time one of
the cells dies because it has four neighboring cells. After four time steps, the glider assumes
its original shape but has been moved one grid point down and to the right. The glider will
continue to move in the south–east direction indefinitely unless it comes into contact with
other cells.
Figure 1: A glider: a perpetually moving life form in Conway's Game of Life. After four
generations, the shape of the life form repeats but its position has shifted one step down and
to the right. The number of neighboring cells is shown for each cell and for each empty grid
point with exactly three neighboring cells.
2 Boundary conditions
In theory, the grid is infinite and the number of cells is potentially unbounded. In a practical simulation, we might therefore want to limit ourselves to an m-by-n rectangular grid.
However, we must carefully define the boundary conditions. There are two common choices.
Fixed boundary conditions. The grid points at the boundary of the finite grid behave as
if all the neighboring grid points that lie outside the finite grid were empty.
Periodic boundary conditions. The grid wraps around in both dimensions, which means
that the left boundary is adjacent to the right boundary, and the top boundary is
adjacent to the bottom boundary.
Fixed boundary conditions effectively introduce a barrier with a significant effect on all cells
close to the boundary. Periodic boundary conditions do not suffer from artifacts near the
boundary but, on the other hand, allow life forms to self-interact once they grow large
enough to wrap around. In this assignment, we will use periodic boundary conditions.
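
Under periodic boundary conditions, a neighbor index that falls outside the grid simply wraps around to the other side. A minimal sketch of the index arithmetic in C (the helper name is ours):

    /* Map a possibly out-of-range index into [0, m) under periodic
     * wrap-around; valid for i >= -m, which covers the one-point
     * neighborhoods used here. */
    static inline int wrap(int i, int m)
    {
        return (i + m) % m;
    }

With this helper, the left neighbor of column 0 is column m - 1, and the right neighbor of column m - 1 is column 0.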
3 Buffers
Time stepping is conceptually performed for all grid points simultaneously, but in practice we
update the points sequentially on each process. In order not to overwrite any point before we
are done with it, we use two buffers for each local grid. We read from one buffer, write
to the other, and then switch the roles of the two buffers before advancing to the
next time step. It is possible to get away with much less buffer space, but implementing
this is optional.
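
A sketch of how the two buffers might be managed in C with a simple pointer swap; the names and the step() kernel are hypothetical, not part of the given skeleton:

    void step(const int *src, int *dst);   /* hypothetical kernel: read src, write dst */

    /* Advance the simulation for a number of generations using two
     * buffers whose roles are swapped after every step. */
    void simulate(int *buf0, int *buf1, int generations)
    {
        int *current = buf0;   /* buffer we read from */
        int *next    = buf1;   /* buffer we write to */
        for (int t = 0; t < generations; t++) {
            step(current, next);
            int *tmp = current;    /* O(1) pointer swap, no copying */
            current  = next;
            next     = tmp;
        }
    }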
4 Motivation
We are studying how to parallelize the Game of Life not for its own sake but rather because the
computational structure has many similarities with more serious applications in science and
engineering. Computing the number of cells adjacent to any given grid point is equivalent to
computing a matrix–vector product y ← Ax, where A is a sparse matrix (i.e., most elements
of A are zero) and x is a representation of the grid as a long vector. In fact, any sparse
matrix–vector multiplication can be seen as a computation on a grid. Sometimes the grid is
regular, as in our case, but sometimes the grid is irregular or unstructured.
The benefit of using the Game of Life as an example is that it is much easier to spot artifacts
caused by bugs in the program, so we can focus our efforts on the issues most relevant to parallel
computing and scalability analysis.
Below is a list of topics that are, at least partially, covered by this assignment.
Data layouts We compare a one-dimensional distribution with a two-dimensional distribution and show that the former is inherently less scalable than the latter due to more
inter-process communication and a lower degree of concurrency.
Scalability analyses We analyze the scalability using a simple theoretical model and compare with measurements obtained through experiments.
Communication hiding It is possible to hide communication behind computation in order
to reduce the communication cost and/or make the algorithm more latency tolerant.
This part is optional.
Memory access patterns The two-dimensional grid is eventually stored in a one-dimensional
memory. The mapping from two to one dimension can have a large impact on the performance of the computation since it affects the interaction with the memory hierarchy.
We will analyze this impact through experiments.
Regularization techniques CPUs work best on regular computations with few and/or predictable branches and with regular memory access patterns. We use ghost regions
around each local grid as a regularization technique to eliminate special cases at the boundary of
the local grids (see the allocation sketch after this list).
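
As an illustration of the ghost-region idea, here is a minimal allocation sketch in C; the helper and macro names are ours, and the skeleton code you are given may organize its buffers differently:

    #include <stdlib.h>

    /* Allocate an h-by-w local grid surrounded by a one-point ghost
     * border, stored in row-major order. */
    int *alloc_local_grid(int h, int w)
    {
        /* (h + 2) * (w + 2) points including the ghost border */
        return calloc((size_t)(h + 2) * (w + 2), sizeof(int));
    }

    /* Index of interior point (i, j) with 0 <= i < h, 0 <= j < w.
     * The kernel can visit every interior point with the same code,
     * since all eight neighbors always exist in the buffer. */
    #define IDX(i, j, w) (((i) + 1) * ((w) + 2) + ((j) + 1))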
5 Parallel simulation
Consider the simulation of the Game of Life on a square n-by-n finite grid with periodic
boundary conditions on a distributed memory machine with p processes. In order to update a
grid point we need to access all neighboring grid points. Hence, we will try to map neighboring
grid points to the same process in an effort to reduce communication costs.
Figure 2: Illustration of two different data layouts of an n × n grid on four processes. (a)
One-dimensional block row layout. (b) Two-dimensional block layout.
Perhaps the simplest data layout is the one-dimensional block row (or column) layout
illustrated in Figure 2(a). We split the rows of the grid into p blocks of approximately n/p
rows each and assign each block to a distinct process.
A slightly more complicated data layout is the two-dimensional block layout illustrated
in Figure 2(b). We arrange the processes in a logical two-dimensional mesh of size
$\sqrt{p}$-by-$\sqrt{p}$ and split the rows and columns into $\sqrt{p}$ blocks of equal size
each. Then we assign each one of the $\sqrt{p} \cdot \sqrt{p} = p$ intersections of the row
and column blocks to a distinct process.
One of the aims of this assignment is to understand why the one-dimensional distribution
is inherently less scalable than the two-dimensional distribution. We will see later that there
are two reasons for this. First, the one-dimensional distribution leads to more inter-process
communication. Second, the one-dimensional distribution cannot support as many processes
as the two-dimensional distribution ($n$ versus $n^2$). We say that the scalability of the
one-dimensional distribution is limited by the bandwidth and by the degree of concurrency.
6 Scalability analysis
6.1 One-dimensional distribution
Consider a square n × n grid distributed over p processes with a one-dimensional block row
distribution as illustrated in Figure 2(a). Each process holds a block row consisting of roughly
n/p rows and exactly n columns. In order for a process to compute the number of neighboring
cells for the topmost row in its local grid, it must have access to the bottommost row from
the block above. Similarly, it must also have access to the topmost row of the block below to
compute its bottommost row. Assuming that the communication links are bidirectional and
that each process has a dedicated link to each neighboring process, then the time to exchange
these boundary points is given by
ts + tw n,
(1)
i.e., identical to the time required for a single point-to-point message of length n. Here, ts is
the start-up or latency cost and tw is the word transfer time or inverse bandwidth.
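
For illustration, the exchange of the two boundary rows might look as follows with MPI_Sendrecv; the variable names and ghosted buffer layout are our own assumptions, and the ranks up and down are the periodic neighbors in the block-row layout:

    #include <mpi.h>

    /* Exchange ghost rows with the processes above and below in the
     * 1D block-row layout. grid points to the local buffer including
     * the ghost border: h interior rows, each of w + 2 entries. */
    void exchange_1d(int *grid, int h, int w, int up, int down)
    {
        int rowlen = w + 2;
        /* send our topmost row up, receive our bottom ghost row from below */
        MPI_Sendrecv(grid + 1 * rowlen, rowlen, MPI_INT, up, 0,
                     grid + (h + 1) * rowlen, rowlen, MPI_INT, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* send our bottommost row down, receive our top ghost row from above */
        MPI_Sendrecv(grid + h * rowlen, rowlen, MPI_INT, down, 1,
                     grid + 0 * rowlen, rowlen, MPI_INT, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }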
After exchanging boundaries, the processes update their local grids in parallel without
further communication. The execution time for this phase is assumed proportional to the
number of local grid points, i.e., the computation time is given by
\[
t_c \frac{n^2}{p}, \tag{2}
\]
where $t_c$ is the time required to update one grid point. Combining the times for communication and computation, we find that the parallel execution time, as a function of $n$ and $p$, is
\[
T_P(n, p) = t_c \frac{n^2}{p} + t_s + t_w n. \tag{3}
\]
Assuming that the sequential execution time, as a function of $n$, is
\[
T_S(n) = t_c n^2, \tag{4}
\]
we find that the parallel overhead, as a function of $n$ and $p$, is
\[
T_O(n, p) = p T_P(n, p) - T_S(n) = t_s p + t_w n p. \tag{5}
\]
Let us now analyze the scalability of the one-dimensional data distribution. The parallel
efficiency can be written as
\[
E_P(n, p) = \frac{T_S(n)}{p T_P(n, p)} = \frac{T_S(n)}{T_S(n) + T_O(n, p)} = \frac{1}{1 + T_O(n, p)/T_S(n)}. \tag{6}
\]
The final expression in (6) tells us that when we increase the number of processes, the efficiency
drops, since the overhead increases with $p$ whereas the sequential execution time stays constant.
On the other hand, when we increase the dimension of the grid we expect the efficiency to
increase, since the sequential execution time grows faster than the parallel overhead ($n^2$ versus
$n$).
According to (6), we can maintain a constant efficiency as we increase the number of
processes and the size of the grid simultaneously if we keep the ratio
\[
\frac{T_O(n, p)}{T_S(n)} \tag{7}
\]
constant. Requiring that the ratio be constant might lead to complicated formulas that are
difficult to interpret. Since we are doing the analysis to better understand the behavior of the
algorithm, we will relax the constraint and require only that the ratio does not increase, which
translates into a constant or increasing efficiency. Thus, it suffices to find $n$ as a function of
$p$ such that the ratio (7) does not grow.
Let us now find a function $n = n(p)$ such that
\[
\frac{T_O(n, p)}{T_S(n)} = O(1). \tag{8}
\]
For our particular example, we substitute (5) and (4) and get
\[
\frac{t_s p + t_w n p}{t_c n^2} = O(1). \tag{9}
\]
Let us split the left-hand side of (9) into a sum of two fractions:
\[
\frac{t_s p}{t_c n^2} + \frac{t_w n p}{t_c n^2} = O(1). \tag{10}
\]
It suffices to show that both terms in (10) do not grow, i.e., that they are in $O(1)$.
The first (latency) term in (10) leads to
\[
n^2 = \Omega(p) \tag{11}
\]
or, equivalently,
\[
n = \Omega(\sqrt{p}). \tag{12}
\]
We conclude that due to the latency of the network, $n$ must grow at least as fast as $\sqrt{p}$ to
maintain or increase the efficiency.
The second (bandwidth) term in (10) similarly leads to
\[
n^2 = \Omega(np) \tag{13}
\]
or, equivalently,
\[
n = \Omega(p). \tag{14}
\]
We conclude that due to the bandwidth of the network, $n$ must grow at least as fast as $p$ to
maintain or increase the efficiency. Note that since $p$ grows faster than $\sqrt{p}$, the bandwidth is
the limiting factor for the scalability of the one-dimensional distribution.
Apart from latency and bandwidth, the scalability can also be limited by the degree of
concurrency, i.e., the number of processes that can be used simultaneously. For the one-dimensional distribution, we cannot use more than $n$ processes since each process must have
at least one row in its local grid. Formally, we say that
\[
p = O(n) \tag{15}
\]
or, equivalently,
\[
n = \Omega(p). \tag{16}
\]
Note that this is, in fact, the same limitation as was imposed by the bandwidth term. We conclude that the scalability of the one-dimensional distribution is limited both by the bandwidth
and by the degree of concurrency.
Having to grow $n$ as fast as $p$ might not seem like a big problem, but it actually means
that we need to use more and more memory per process. Since the physical memory of a real
parallel computer grows linearly with $p$, we will eventually run out of memory. The memory
required per process to store the local grid is $n^2/p$, but since $n$ must grow at least as fast as $p$, the
memory requirement per process grows like $p$ ($n^2/p$ with $n = p$ gives $p^2/p = p$). There are
several implications of this result. Firstly, we may run out of main memory when scaling to
large numbers of processes. Secondly, the local grid might rapidly grow too large to fit in the
cache memory, forcing the application to read and write to the much slower main
memory, which degrades the performance of the local computations.
6.2 Two-dimensional distribution
After investigating the one-dimensional distribution in detail, let us now turn our attention to
the two-dimensional block distribution. For simplicity, we assume a square $\sqrt{p} \times \sqrt{p}$ process
mesh. When time stepping, each process must now exchange its boundary grid points with
up to eight neighboring processes. (Note that it is possible to exchange boundary points in
two phases such that communication is necessary only with four neighboring processes.) The
number of points on the boundary is (roughly) $4n/\sqrt{p}$. The latency term remains constant
but the bandwidth term improves from $\Theta(n)$ to $\Theta(n/\sqrt{p})$. If we redo the scalability analysis
with respect to the new bandwidth term we find that
\[
n^2 = \Omega(n\sqrt{p}) \tag{17}
\]
or, equivalently, that $n$ must grow as fast as $\sqrt{p}$ to maintain efficiency. Note the improvement
from $p$ to $\sqrt{p}$ as a consequence of changing the data distribution.
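
One convenient way to obtain such a periodic process mesh and its axis neighbors is MPI's Cartesian topology support; a sketch under the assumption of a square mesh (the skeleton code may set this up differently):

    #include <mpi.h>

    /* Create a periodic sqrt_p-by-sqrt_p Cartesian communicator and
     * look up the four axis neighbors. */
    MPI_Comm make_mesh(int sqrt_p, int *up, int *down, int *left, int *right)
    {
        MPI_Comm mesh;
        int dims[2]    = {sqrt_p, sqrt_p};  /* square process mesh */
        int periods[2] = {1, 1};            /* periodic in both dimensions */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &mesh);
        MPI_Cart_shift(mesh, 0, 1, up, down);     /* neighbors along dim 0 */
        MPI_Cart_shift(mesh, 1, 1, left, right);  /* neighbors along dim 1 */
        return mesh;
    }

With this setup, the diagonal neighbors never need to be addressed explicitly if the boundary exchange is done in the two phases mentioned above.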
What about the degree of concurrency? With the two-dimensional distribution, we can
use as many as $n^2$ processes. Formally, this imposes the scalability limitation
\[
p = O(n^2) \tag{18}
\]
or, equivalently,
\[
n = \Omega(\sqrt{p}). \tag{19}
\]
Note, again, the improvement from $p$ to $\sqrt{p}$.
What about the memory requirements? The memory required on each process is still
$n^2/p$, but since we only have to increase $n$ as fast as $\sqrt{p}$ to maintain efficiency, the memory
required per process is constant ($n^2/p$ with $n = \sqrt{p}$ gives $(\sqrt{p})^2/p = 1$).
6.3 Conclusion
In order to maintain efficiency with the one-dimensional distribution we have to increase the
size of the grid linearly with $p$, which means that the memory required per process grows like
$p$ as well. In contrast, the two-dimensional distribution only requires us to increase the size
of the grid linearly with $\sqrt{p}$, which means that the memory required per process is constant.
We have found that the factors that limit the scalability of the one-dimensional distribution
are the bandwidth and the degree of concurrency. The latency is of subordinate significance.
7 The assignment
The assignment consists of three equal parts: implementation, experiments, and analysis.
Your code should be written such that it adequately documents the implementation part;
it therefore suffices to give a high-level description of the implementation in the report. A
few paragraphs explaining the most important aspects should be enough. Your report should
focus on presenting and discussing the experiments and the analysis.
7.1 Implementation
Given a skeleton code written in C with MPI, your task is to implement a distributed memory
version of Conway's Game of Life using periodic boundary conditions and a square grid.
The implementation should support both the one-dimensional and the two-dimensional data
layouts. (Note that the one-dimensional layout is a special case of the two-dimensional layout.)
The user should be allowed to choose the size of the grid and the size of the process mesh,
both of which you may assume are square, using command line arguments.
You may not assume that each process receives the same number of grid points, i.e., do
not assume that either $p$ or $\sqrt{p}$ evenly divides the grid size $n$.
The code you have been given uses ghost cells to regularize the computation at the boundary of the local grid. Your parallel code should do the same and there should be no special
cases for the boundary.
7.2 Experiments
The given code contains two versions of the local grid kernel. Both versions store the grid
in a row-major format, i.e., the rows of the local grid are stored consecutively one after the
other. The first version traverses the local grid row by row, which leads to a unit-stride
memory access pattern that is ideal for the memory hierarchy and promotes both cache
reuse and hardware prefetching. The second version traverses the grid column by column,
which leads to a large-stride memory access pattern that is detrimental for many reasons
but primarily because of increased effective memory access latency. Your task is to use the
supplied benchmarking function to evaluate the two kernel versions for local grids of different
sizes. Discuss your results and try to explain what you observe.
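
Schematically, the two kernel versions differ only in the order of the loop nest; a simplified sketch (the actual kernels in the given code include the neighbor counting and the ghost border):

    void update(int i, int j);   /* hypothetical per-point update */

    /* Row-by-row traversal: the inner loop moves with unit stride
     * through memory, which promotes cache reuse and prefetching. */
    void step_rowwise(int h, int w)
    {
        for (int i = 0; i < h; i++)
            for (int j = 0; j < w; j++)
                update(i, j);
    }

    /* Column-by-column traversal: each inner iteration jumps a full
     * row ahead in memory, defeating the cache and the prefetcher. */
    void step_colwise(int h, int w)
    {
        for (int j = 0; j < w; j++)
            for (int i = 0; i < h; i++)
                update(i, j);
    }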
Your second task is to experimentally evaluate the memory-constrained scalability of the
one-dimensional and two-dimensional distributions. Start with a square grid of size $n_1 = 1000$
on one process and gradually increase the number of processes. Scale the problem size such
that the local grid has as close to $n_1^2$ grid points as possible to ensure that the memory required
per process remains approximately constant. The theoretical analysis from Section 6 predicts
that the two distributions should behave quite differently. But can you actually observe this
difference in your experimental data? Use the analysis as a guide to decide on how best to
present your data in order to support your conclusions.
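
One way to pick the global grid size for a given process count is to invert the memory constraint $n^2/p \approx n_1^2$; a small sketch of this (our own formula, assuming a square grid and mesh):

    #include <math.h>

    /* Grid size n such that n^2 / p is as close to n1^2 as possible,
     * i.e., n is approximately n1 * sqrt(p); here n1 = 1000 as
     * suggested above. */
    int scaled_grid_size(int n1, int p)
    {
        return (int)lround(n1 * sqrt((double)p));
    }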
7.3 Analysis
Conway’s original Game of Life takes place on a two-dimensional grid. However, one can
readily imagine extensions to higher-dimensional spaces. In particular, your task will be to
analyze a three-dimensional version of the game. Consider an n × n × n cubic grid distributed over p processes. We will consider three natural distributions (see Figure 3): (a) A
one-dimensional distribution in which each process receives a slice. (b) A two-dimensional
distribution in which each process receives a tube. (c) A three-dimensional distribution in
which each process receives a block. Your task is to analyze the three distributions along the
lines of the analysis in Section 6. Which distributions are inherently non-scalable in the sense
that the memory required per process to maintain efficiency is unbounded? In each case,
which factors limit the scalability—latency, bandwidth, and/or degree of concurrency?
Figure 3: Three natural distributions of a three-dimensional cubic grid. Slice (1D), tube (2D),
and block (3D).
References
[1] Gardner, M., The fantastic combinations of John Conway’s new solitaire game “life”,
Scientific American, Volume 223, pp. 120–123, October 1970.
[2] Wikipedia article on Conway's Game of Life, http://en.wikipedia.org/wiki/Conway's_Game_of_Life.