Distributed Stencil Computations: Conway’s Game of Life
Assignment 1, 5DV050, Spring 2012
Due on 17/4 at 16:00
Version 1.0

1 Introduction

Conway’s famous Game of Life is a two-dimensional cellular automaton in which digital cells live, die, and reproduce. Despite the simple rules that govern each cell, very complicated multi-cellular life forms tend to emerge. Even self-reproducing life forms can be found in Conway’s Game of Life. The peculiar Game of Life has zero players and was presented in a 1970 issue of Scientific American in a column by Martin Gardner called Mathematical Games [1].

The rules of the game are simple enough. The cells live on an infinite two-dimensional regular grid and the simulation advances in discrete time steps which we will refer to as generations. Each grid point is either occupied by a living cell or it is empty. The state of a grid point (x, y) at time t + 1 depends only on the state of the grid point and its eight immediate neighbors at time t. The state of a grid point is determined by the following rules:

1. An empty grid point with exactly three occupied neighbors at time t becomes occupied at time t + 1.
2. An occupied grid point with fewer than two occupied neighbors at time t becomes empty at time t + 1.
3. An occupied grid point with more than three occupied neighbors at time t becomes empty at time t + 1.
4. If none of the rules above apply, then the state of the grid point does not change from time t to time t + 1.

Let us look at a small example to clarify the rules. Figure 1 illustrates four consecutive generations of a five-cellular life form called a glider. When transitioning from the first to the second generation, two cells die since they have only one neighboring cell each. However, two new cells are spawned since the corresponding grid points have three neighboring cells each. When transitioning to the third generation, two more cells die, but this time one of the cells dies because it has four neighboring cells. After four time steps, the glider assumes its original shape but has moved one grid point down and to the right. The glider will continue to move in the south–east direction indefinitely unless it comes into contact with other cells.

Figure 1: A glider: a perpetually moving life form in Conway’s Game of Life. After four generations, the shape of the life form repeats but its position has shifted one step down and to the right. The number of neighboring cells is shown for each cell and for each empty grid point with exactly three neighboring cells.
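In code, the transition rule for a single grid point can be sketched as follows. This is a minimal illustration; the function name and the 0/1 representation are ours and are not taken from the supplied skeleton code.

    /* Sketch of the per-point transition rule; illustrative names,
     * not from the skeleton code. A grid point is 1 if occupied by
     * a living cell and 0 if empty. */
    int next_state(int occupied, int occupied_neighbors)
    {
        if (!occupied)
            return occupied_neighbors == 3;  /* rule 1: birth            */
        if (occupied_neighbors < 2)
            return 0;                        /* rule 2: underpopulation  */
        if (occupied_neighbors > 3)
            return 0;                        /* rule 3: overpopulation   */
        return 1;                            /* rule 4: no change        */
    }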
2 Boundary conditions

In theory, the grid is infinite and the number of cells is potentially unbounded. In a practical simulation, we might therefore want to limit ourselves to an m-by-n rectangular grid. However, we must then carefully define the boundary conditions. There are two common choices.

Fixed boundary conditions. The grid points at the boundary of the finite grid behave as if all the neighboring grid points that lie outside the finite grid were empty.

Periodic boundary conditions. The grid wraps around in both dimensions, which means that the left boundary is adjacent to the right boundary, and the top boundary is adjacent to the bottom boundary.

Fixed boundary conditions effectively introduce a barrier with a significant effect on all cells close to the boundary. Periodic boundary conditions do not suffer from artifacts near the boundary but, on the other hand, allow life forms to self-interact once they grow large enough to wrap around. In this assignment, we will use periodic boundary conditions.

3 Buffers

Time stepping is conceptually performed for all grid points simultaneously, but in practice we update the points sequentially on each process. In order not to overwrite any point before we are done with it, we use two buffers for each local grid. We read from one buffer and write to the other, and then switch the roles of the two buffers before advancing to the next time step. It is possible to get away with much less buffer space, but implementing this is optional.
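The switch of roles is typically implemented as a pointer swap rather than a copy. A minimal sketch, with illustrative names that are not taken from the skeleton code:

    #include <stddef.h>

    /* Illustrative sketch, not from the skeleton code: advance the
     * local grid nsteps generations using two buffers. step_kernel
     * reads one generation and writes the next; rows and cols include
     * any ghost layers. */
    typedef void (*step_fn)(unsigned char *dst, const unsigned char *src,
                            size_t rows, size_t cols);

    void time_step(unsigned char *buf0, unsigned char *buf1,
                   size_t rows, size_t cols, int nsteps, step_fn step_kernel)
    {
        unsigned char *current = buf0;     /* holds generation t       */
        unsigned char *next    = buf1;     /* receives generation t+1  */

        for (int step = 0; step < nsteps; step++) {
            step_kernel(next, current, rows, cols);
            unsigned char *tmp = current;  /* swap roles: no copying   */
            current = next;
            next    = tmp;
        }
    }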
4 Motivation

We are studying how to parallelize the Game of Life not for its own sake but rather because the computational structure has many similarities with more serious applications in science and engineering. Computing the number of cells adjacent to any given grid point is equivalent to computing a matrix–vector product y ← Ax, where A is a sparse matrix (i.e., most elements of A are zero) and x is a representation of the grid as a long vector. In fact, any sparse matrix–vector multiplication can be seen as a computation on a grid. Sometimes the grid is regular, as in our case, but sometimes the grid is irregular or unstructured.

The benefit of using the Game of Life as an example is that it is much easier to spot artifacts caused by bugs in the program. We can focus our efforts on the issues most relevant to parallel computing and scalability analysis. Below is a list of topics that are, at least partially, covered by this assignment.

Data layouts. We compare a one-dimensional distribution with a two-dimensional distribution and show that the former is inherently less scalable than the latter due to more inter-process communication and a lower degree of concurrency.

Scalability analyses. We analyze the scalability using a simple theoretical model and compare with measurements obtained through experiments.

Communication hiding. It is possible to hide communication behind computation in order to reduce the communication cost and/or make the algorithm more latency tolerant. This part is optional.

Memory access patterns. The two-dimensional grid is eventually stored in a one-dimensional memory. The mapping from two to one dimension can have a large impact on the performance of the computation since it affects the interaction with the memory hierarchy. We will analyze this impact through experiments.

Regularization techniques. CPUs work best on regular computations with few and/or predictable branches and with regular memory access patterns. We use ghost regions around each local grid as a regularization technique to eliminate special cases at the boundary of the local grids.

5 Parallel simulation

Consider the simulation of the Game of Life on a square n-by-n finite grid with periodic boundary conditions on a distributed memory machine with p processes. In order to update a grid point we need to access all neighboring grid points. Hence, we will try to map neighboring grid points to the same process in an effort to reduce communication costs.

Figure 2: Illustration of two different data layouts of an n × n grid on four processes. (a) One-dimensional block row layout. (b) Two-dimensional block layout.

Perhaps the simplest data layout is the one-dimensional block row (or column) layout illustrated in Figure 2(a). We split the rows of the grid into p blocks of approximately n/p rows each and assign each block to a distinct process.

A slightly more complicated data layout is the two-dimensional block layout illustrated in Figure 2(b). We arrange the processes in a logical two-dimensional mesh of size √p-by-√p and split the rows and the columns into √p blocks of equal size each. Then we assign each one of the √p · √p = p intersections of the row and column blocks to a distinct process.

One of the aims of this assignment is to understand why the one-dimensional distribution is inherently less scalable than the two-dimensional distribution. We will see later that there are two reasons for this. First, the one-dimensional distribution leads to more inter-process communication. Second, the one-dimensional distribution cannot support as many processes as the two-dimensional distribution (n versus n^2). We say that the scalability of the one-dimensional distribution is limited by the bandwidth and by the degree of concurrency.
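Both layouts require splitting the n rows (and, for the two-dimensional layout, the n columns) into contiguous blocks of nearly equal size, even when the number of blocks does not evenly divide n (see Section 7.1). A minimal sketch of one common convention, with illustrative names not taken from the skeleton code:

    /* Illustrative block-partition helper, not from the skeleton code.
     * Splits n items into nblocks contiguous blocks whose sizes differ
     * by at most one: the first n % nblocks blocks get one extra item.
     * Computes the first index and size of block b (0 <= b < nblocks). */
    void block_range(int n, int nblocks, int b, int *first, int *size)
    {
        int base = n / nblocks;  /* minimum block size                  */
        int rem  = n % nblocks;  /* number of blocks with an extra item */
        *size  = base + (b < rem ? 1 : 0);
        *first = b * base + (b < rem ? b : rem);
    }

For example, splitting n = 10 rows over 4 processes yields blocks of sizes 3, 3, 2, 2 starting at rows 0, 3, 6, and 8.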
6 Scalability analysis (2D)

6.1 One-dimensional distribution

Consider a square n × n grid distributed over p processes with a one-dimensional block row distribution as illustrated in Figure 2(a). Each process holds a block row consisting of roughly n/p rows and exactly n columns. In order for a process to compute the number of neighboring cells for the topmost row in its local grid, it must have access to the bottommost row of the block above. Similarly, it must also have access to the topmost row of the block below to compute its bottommost row. Assuming that the communication links are bidirectional and that each process has a dedicated link to each neighboring process, the time to exchange these boundary points is given by

    t_s + t_w n,                                                    (1)

i.e., identical to the time required for a single point-to-point message of length n. Here, t_s is the start-up or latency cost and t_w is the word transfer time or inverse bandwidth.

After exchanging boundaries, the processes update their local grids in parallel without further communication. The execution time for this phase is assumed to be proportional to the number of local grid points, i.e., the computation time is given by

    t_c n^2 / p,                                                    (2)

where t_c is the time required to update one grid point. Combining the times for communication and computation, we find that the parallel execution time, as a function of n and p, is

    T_P(n, p) = t_c n^2 / p + t_s + t_w n.                          (3)

Assuming that the sequential execution time, as a function of n, is

    T_S(n) = t_c n^2,                                               (4)

we find that the parallel overhead, as a function of n and p, is

    T_O(n, p) = p T_P(n, p) − T_S(n) = t_s p + t_w n p.             (5)

Let us now analyze the scalability of the one-dimensional data distribution. The parallel efficiency can be written as

    E_P(n, p) = T_S(n) / (p T_P(n, p))
              = T_S(n) / (T_S(n) + T_O(n, p))
              = 1 / (1 + T_O(n, p) / T_S(n)).                       (6)

The final expression in (6) tells us that when we increase the number of processes the efficiency drops, since the overhead increases with p whereas the sequential execution time is constant. On the other hand, when we increase the dimension of the grid we expect the efficiency to increase, since the sequential execution time grows faster than the parallel overhead (n^2 versus n). According to (6), we can maintain a constant efficiency as we increase the number of processes and the size of the grid simultaneously if we keep the ratio

    T_O(n, p) / T_S(n)                                              (7)

constant.

Requiring that the ratio be constant might lead to complicated formulas that are difficult to interpret. Since we are doing the analysis to better understand the behavior of the algorithm, we will relax the constraint and require only that the ratio does not increase, which translates into a constant or increasing efficiency. Thus, it suffices to find n as a function of p such that the ratio (7) does not grow. Let us now find a function n = n(p) such that

    T_O(n, p) / T_S(n) = O(1).                                      (8)

For our particular example, we substitute (5) and (4) and get

    (t_s p + t_w n p) / (t_c n^2) = O(1).                           (9)

Let us split the left-hand side of (9) into a sum of two fractions:

    t_s p / (t_c n^2) + t_w n p / (t_c n^2) = O(1).                 (10)

It suffices to show that neither term in (10) grows, i.e., that both are in O(1). The first (latency) term in (10) leads to

    n^2 = Ω(p)                                                      (11)

or, equivalently,

    n = Ω(√p).                                                      (12)

We conclude that due to the latency of the network, n must grow at least as fast as √p to maintain or increase the efficiency. The second (bandwidth) term in (10) similarly leads to

    n^2 = Ω(n p)                                                    (13)

or, equivalently,

    n = Ω(p).                                                       (14)

We conclude that due to the bandwidth of the network, n must grow at least as fast as p to maintain or increase the efficiency. Note that since p grows faster than √p, the bandwidth is the limiting factor for the scalability of the one-dimensional distribution.

Apart from latency and bandwidth, the scalability can also be limited by the degree of concurrency, i.e., the number of processes that can be used simultaneously. For the one-dimensional distribution, we cannot use more than n processes since each process must have at least one row in its local grid. Formally, we say that

    p = O(n)                                                        (15)

or, equivalently,

    n = Ω(p).                                                       (16)

Note that this is, in fact, the same limitation as was imposed by the bandwidth term. We conclude that the scalability of the one-dimensional distribution is limited both by the bandwidth and by the degree of concurrency.

Having to grow n as fast as p might not seem like a big problem, but it actually means that we need to use more and more memory per process. Since the physical memory of a real parallel computer grows linearly with p, we will eventually run out of memory. The memory required per process to store the local grid is n^2/p, but since n must grow at least as fast as p, the memory requirement per process grows like p (n^2/p with n = p gives p^2/p = p). There are several implications of this result. Firstly, we may run out of main memory when scaling to large numbers of processes. Secondly, the local grid might rapidly grow too large to fit in the cache memory, thereby forcing the application to read and write to the much slower main memory. This will degrade the performance of the local computations.

6.2 Two-dimensional distribution

After investigating the one-dimensional distribution in detail, let us now turn our attention to the two-dimensional block distribution. For simplicity, we assume a square √p × √p process mesh. When time stepping, each process must now exchange its boundary grid points with up to eight neighboring processes. (Note that it is possible to exchange boundary points in two phases such that communication is necessary only with four neighboring processes.) The number of points on the boundary is (roughly) 4n/√p. The latency term remains constant but the bandwidth term improves from Θ(n) to Θ(n/√p).
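For concreteness, the time model for the two-dimensional distribution becomes approximately (these explicit formulas are our own addition, following the same assumptions as (3) and (5); constant factors in the latency term, which depend on how the exchange is organized, are suppressed)

    T_P(n, p) ≈ t_c n^2 / p + t_s + 4 t_w n / √p,

and hence the overhead is

    T_O(n, p) = p T_P(n, p) − T_S(n) ≈ t_s p + 4 t_w n √p.

Dividing the bandwidth part of this overhead by T_S(n) = t_c n^2 and requiring the result to be O(1) yields the condition derived below.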
If we redo the scalability analysis with respect to the new bandwidth term, we find that

    n^2 = Ω(n √p)                                                   (17)

or, equivalently, that n must grow as fast as √p to maintain efficiency. Note the improvement from p to √p as a consequence of changing the data distribution.

What about the degree of concurrency? With the two-dimensional distribution, we can use as many as n^2 processes. Formally, this imposes the scalability limitation

    p = O(n^2)                                                      (18)

or, equivalently,

    n = Ω(√p).                                                      (19)

Note, again, the improvement from p to √p.

What about the memory requirements? The memory required on each process is still n^2/p, but since we only have to increase n as √p to maintain efficiency, the memory required per process is constant (n^2/p with n = √p gives (√p)^2/p = 1).

6.3 Conclusion

In order to maintain efficiency with the one-dimensional distribution we have to increase the size of the grid linearly with p, which means that the memory required per process grows like p as well. In contrast, the two-dimensional distribution only requires us to increase the size of the grid linearly with √p, which means that the memory required per process is constant. We have found that the factors that limit the scalability of the one-dimensional distribution are the bandwidth and the degree of concurrency. The latency is of subordinate significance.

7 The assignment

The assignment consists of three equal parts: implementation, experiments, and analysis. Your code should be written such that it adequately documents the implementation part; it therefore suffices to give a high-level description of the implementation in the report. A few paragraphs explaining the most important aspects should be enough. Your report should focus on presenting and discussing the experiments and the analysis.

7.1 Implementation

Given a skeleton code written in C with MPI, your task is to implement a distributed memory version of Conway’s Game of Life using periodic boundary conditions and a square grid. The implementation should support both the one-dimensional and the two-dimensional data layouts. (Note that the one-dimensional layout is a special case of the two-dimensional layout.) The user should be allowed to choose the size of the grid and the size of the process mesh, both of which you may assume are square, using command line arguments.

You may not assume that each process receives the same number of grid points, i.e., do not assume that either p or √p evenly divides the grid size n. The code you have been given uses ghost cells to regularize the computation at the boundary of the local grid. Your parallel code should do the same, and there should be no special cases for the boundary.

7.2 Experiments

The given code contains two versions of the local grid kernel. Both versions store the grid in a row-major format, i.e., the rows of the local grid are stored consecutively one after the other. The first version traverses the local grid row by row, which leads to a unit-stride memory access pattern that is ideal for the memory hierarchy and promotes both cache reuse and hardware prefetching. The second version traverses the grid column by column, which leads to a large-stride memory access pattern that is detrimental for many reasons, but primarily because of the increased effective memory access latency. Your task is to use the supplied benchmarking function to evaluate the two kernel versions for local grids of different sizes. Discuss your results and try to explain what you observe.
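To make the difference concrete, the following sketch shows just the two loop orders over a row-major array. It is our own illustration and deliberately simpler than the supplied kernels; grid[i*cols + j] is the point in row i and column j.

    /* Version 1: row-by-row traversal; consecutive iterations touch
     * consecutive addresses (unit stride). */
    void touch_by_rows(unsigned char *grid, int rows, int cols)
    {
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                grid[i * cols + j] ^= 1;   /* unit-stride accesses */
    }

    /* Version 2: column-by-column traversal; consecutive iterations
     * touch addresses cols elements apart, which defeats cache reuse
     * and hardware prefetching for large grids. */
    void touch_by_cols(unsigned char *grid, int rows, int cols)
    {
        for (int j = 0; j < cols; j++)
            for (int i = 0; i < rows; i++)
                grid[i * cols + j] ^= 1;   /* stride of cols elements */
    }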
Your second task is to experimentally evaluate the memory-constrained scalability of the one-dimensional and two-dimensional distributions. Start with a square grid of size n_1 = 1000 on one process and gradually increase the number of processes. Scale the problem size such that the local grid has as close to n_1^2 grid points as possible to ensure that the memory required per process remains approximately constant. The theoretical analysis from Section 6 predicts that the two distributions should behave quite differently. But can you actually observe this difference in your experimental data? Use the analysis as a guide when deciding how best to present your data in order to support your conclusions.

7.3 Analysis

Conway’s original Game of Life takes place on a two-dimensional grid. However, one can readily imagine extensions to higher-dimensional spaces. In particular, your task will be to analyze a three-dimensional version of the game. Consider an n × n × n cubic grid distributed over p processes. We will consider three natural distributions (see Figure 3):

(a) A one-dimensional distribution in which each process receives a slice.
(b) A two-dimensional distribution in which each process receives a tube.
(c) A three-dimensional distribution in which each process receives a block.

Your task is to analyze the three distributions along the lines of the analysis in Section 6. Which distributions are inherently non-scalable in the sense that the memory required per process to maintain efficiency is unbounded? In each case, which factors limit the scalability: latency, bandwidth, and/or the degree of concurrency?

Figure 3: Three natural distributions of a three-dimensional cubic grid. Slice (1D), tube (2D), and block (3D).

References

[1] Gardner, M., The fantastic combinations of John Conway’s new solitaire game “life”, Scientific American, Volume 223, pp. 120–123, October 1970.

[2] Wikipedia article on Conway’s Game of Life, http://en.wikipedia.org/wiki/Conway's_Game_of_Life.