Flops, Memory, Parallel Processing and Gaussian Elimination (Lecture Notes for 10-1-2013)

We will discuss three important factors in the efficiency of Gaussian elimination: floating point operations (flops), memory access and parallel processing. The topics span thirty-plus years in the development of efficient implementations of Gaussian elimination and the associated pivoting schemes.

FLOPS, LINPACK and the 1980's

LINPACK was one of the first publicly available, high quality linear algebra packages. It was developed with an NSF grant in the 1980's by, among other researchers, Cleve Moler, the creator of Matlab. Given the computer architecture of that time, the number of floating point operations in an algorithm was the most important measure of its efficiency. As we have seen earlier, for large n, the flop counts for Gaussian elimination with no pivoting (genp), partial pivoting (gepp), rook pivoting (gerp) and complete pivoting (gecp) applied to an n by n matrix are:

   Method   Flops for elimination   Flops for pivot search     Total flops (leading order term)
   genp     (2/3)n^3                0                          (2/3)n^3
   gepp     (2/3)n^3                (1/2)n^2                   (2/3)n^3
   gerp     (2/3)n^3                approximately (3/2)n^2     (2/3)n^3
   gecp     (2/3)n^3                (1/3)n^3                   n^3

The conclusion from the last column is that, based on flop count, genp, gepp and gerp should take about the same amount of time and gecp should take about 50% longer. In practice genp can lead to large error growth, so we won't pursue genp further here.

We can check the theory in the above table for the other routines by doing experiments. When comparing run times it is better to use code produced by a compiled computer language rather than an interpreted language such as Matlab. When timing interpreted code, one is timing the work to interpret the code as well as the computations. Since a compiled language (such as C, C++, Fortran, etc.) is translated to machine language before execution, the timing focuses on the computations, not the translation. Fortunately, in the "Try Your Own Pivot Scheme" assignment, C code implementing gecp is available through the Matlab Mex utility (see item 26 at the course web page). By making a small change in the gecp source code (in the pivot search section of the code we can change the "for (j = k; j <= nminus1; ++j)" loop to "for (j = k; j <= k; ++j)") we can modify the gecp code to implement gepp. I also have Fortran code, again accessible in Matlab via the Mex utility, that implements rook pivoting to solve Ax = b. All this code is in "LINPACK" style in that the flop count was a key focus in the code development.

We can try out the methods in Matlab (using a midrange, dual processor Acer laptop):

>> n = 2000; A = randn(n,n); b = randn(n,1);
>> tic, x = gecp(A,b); toc                 % complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp_unblocked(A,b); toc       % rook pivoting with similar style code
Elapsed time is 12.953847 seconds.
>> tic, x = geppinC(A,b); toc              % partial pivoting
Elapsed time is 12.884968 seconds.

These runs support the conclusions of the above theory: the complete pivoting code takes about 50% (58% in the above runs) longer than the partial or rook pivoting code, and the run times for the partial pivoting code and the rook pivoting code are about the same!
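Although the compiled Mex codes are the right tools for timing, the structure of a LINPACK-style (unblocked) Gaussian elimination with partial pivoting is easy to see in a few lines of Matlab. The following sketch is illustrative only (the function name gepp_unblocked_sketch is mine, not part of the course code, and an interpreted version like this is far too slow for timing comparisons):

   function x = gepp_unblocked_sketch(A, b)
   % Unblocked (LINPACK-style) Gaussian elimination with partial pivoting.
   % The elimination does about (2/3)n^3 flops; the pivot search adds only
   % about (1/2)n^2 comparisons, consistent with the table above.
   n = size(A, 1);
   for k = 1:n-1
       % pivot search: largest magnitude entry in column k, on or below the diagonal
       [~, p] = max(abs(A(k:n, k)));
       p = p + k - 1;
       A([k p], :) = A([p k], :);       % swap rows of A
       b([k p])    = b([p k]);          % and of the right-hand side
       % elimination: a rank-1 update of the trailing submatrix
       rows = k+1:n;
       A(rows, k) = A(rows, k) / A(k, k);                     % multipliers
       A(rows, rows) = A(rows, rows) - A(rows, k) * A(k, rows);
       b(rows) = b(rows) - A(rows, k) * b(k);
   end
   % back substitution with the upper triangular part of A
   x = zeros(n, 1);
   for k = n:-1:1
       x(k) = (b(k) - A(k, k+1:n) * x(k+1:n)) / A(k, k);
   end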
MANAGING MEMORY ACCESS, LAPACK and the 1990's

If we solve the same problem using the built-in backslash in Matlab we get the following result:

>> tic, x = A\b; toc
Elapsed time is 0.900737 seconds.

The Matlab solution implements Gaussian elimination with partial pivoting, but it is more than 13 times faster than any of the earlier runs. What is going on?

In the late 1980's and early 1990's the use of cache memory became increasingly common. Cache memory is relatively expensive but fast compared to a computer's main RAM memory. It became apparent that it is important to keep calculations within the fast cache memory as much as possible; otherwise, the time to access the slower main RAM memory outside the cache dominates the calculation times. To address this issue a new publicly available linear algebra package, called LAPACK, was developed with NSF funding. This package required recoding most of the LINPACK algorithms, typically using block factorizations, which divide the algorithm into pieces so that the calculations can be kept in cache memory as much as possible. The current code for Matlab's backslash does Gaussian elimination with partial pivoting using such a block algorithm. As we see above, this leads to a huge decrease in run times, more than a factor of 13.

The idea of block factorization (we will ignore pivoting initially for clarity) is based on the following block factorization of an n by n matrix A:

   (0):   [ A11  A12 ]   [ L11   0  ] [ U11  U12 ]
          [ A21  A22 ] = [ L21  L22 ] [  0   U22 ]

where, for a block size b, the matrices L11, U11 and A11 are b by b, L21 and A21 are (n-b) by b, U12 and A12 are b by (n-b), and L22, U22 and A22 are (n-b) by (n-b). Let

   A1 = [ A11 ]   and   L1 = [ L11 ] .
        [ A21 ]              [ L21 ]

We will call the n by b matrices A1 and L1 "tall skinny matrices" since we will assume that b is much smaller than n. For example b might be 100 and n 2000 or larger.

As can be seen by multiplying L times the first block column of U, equation (0) implies that L1 U11 = A1, or that

   (1): L1 and U11 are the LU factors of A1.

We can therefore calculate L1 and U11 by doing an LU factorization of the tall skinny matrix A1. This calculation can be done in cache memory, assuming b is chosen small enough. Next, multiplying the first block row of L times U implies that L11 U12 = A12. We can therefore calculate U12 using

   (2): U12 = (L11)^(-1) A12.

This calculation can also be done in cache since A12 and U12 are "short (only b rows), fat (n-b columns)" matrices. Finally, if in (0) we multiply the second block row of L by the second block column of U, it follows that L21 U12 + L22 U22 = A22, or that L22 U22 = A22 - L21 U12 = Â22. We break this up into two parts:

   (3): Â22 = A22 - L21 U12

and

   (4): L22 and U22 are the LU factors of Â22.

The matrices in (4) are typically too large to fit in cache memory. For example if n = 2000 and b = 100, then A22 and Â22 are 1900 by 1900 (more generally (n-b) by (n-b)). However, it is easy to separate the calculations in (3) into smaller pieces that do fit into cache memory. The structure of (3) is indicated by

   (5): Â22 = A22 - L21 U12,   an (n-b) by (n-b) matrix minus the product of a tall skinny matrix and a short fat matrix.

The calculations for a submatrix of Â22, for example the elements in rows 1001 to 1100 and columns 1001 to 1100, require only the corresponding entries of A22, the elements of rows 1001 to 1100 of L21 and the elements of columns 1001 to 1100 of U12. The submatrix sizes can be selected so that these calculations fit into cache memory.

The calculations in (4) can be done by repeating the above procedure in a loop. On the first pass through the loop we calculate the first block column of L and the first block row of U by applying the above procedure to the 2000 by 2000 matrix A (in our example). Next we apply the procedure to the 1900 by 1900 matrix Â22, which produces the next block column of L and block row of U. Continuing, we eventually factor the whole matrix.
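The following Matlab sketch implements the loop just described, without pivoting, for a general block size. It is illustrative only (the function names block_lu_nopivot and panel_lu_nopivot are mine, not the course's GEPP_BLOCK code), but it follows steps (1)-(4) directly:

   function [L, U] = block_lu_nopivot(A, bs)
   % Block LU factorization of a square matrix A with block size bs,
   % following steps (1)-(4) above (no pivoting).  Illustrative sketch only.
   n = size(A, 1);
   L = zeros(n);  U = zeros(n);
   for k = 1:bs:n
       j = min(k+bs-1, n);                  % current block is rows/columns k..j
       r = j+1:n;                           % trailing rows/columns
       % (1): LU factorization of the tall skinny panel A(k:n, k:j)
       [Lp, Up] = panel_lu_nopivot(A(k:n, k:j));
       L(k:n, k:j) = Lp;                    % this is [L11; L21]
       U(k:j, k:j) = Up;                    % this is U11
       % (2): U12 = inv(L11)*A12, computed with a triangular solve
       U(k:j, r) = L(k:j, k:j) \ A(k:j, r);
       % (3): the Schur complement  Ahat22 = A22 - L21*U12
       A(r, r) = A(r, r) - L(r, k:j) * U(k:j, r);
       % (4): the next pass of the loop factors Ahat22
   end

   function [L, U] = panel_lu_nopivot(P)
   % Unblocked LU factorization (no pivoting) of a tall skinny panel P (m by b, m >= b).
   [m, b] = size(P);
   for k = 1:b
       P(k+1:m, k) = P(k+1:m, k) / P(k, k);                           % multipliers
       P(k+1:m, k+1:b) = P(k+1:m, k+1:b) - P(k+1:m, k) * P(k, k+1:b);
   end
   L = P;                                          % rows b+1:m already hold L21
   L(1:b, 1:b) = tril(P(1:b, 1:b), -1) + eye(b);   % L11 is unit lower triangular
   U = triu(P(1:b, 1:b));                          % U11

For most random test matrices no pivoting happens to be numerically acceptable, so a quick check such as

>> A = randn(2000);  [L, U] = block_lu_nopivot(A, 100);  norm(A - L*U, 1) / norm(A, 1)

should return a small number.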
The GEPP_BLOCK code in item 25 at the course web site implements these ideas in Matlab. GEPP_BLOCK does a block factorization with partial pivoting; this requires careful bookkeeping to keep track of the permutations, which is included in GEPP_BLOCK, but we will not discuss the details here. A block implementation of rook pivoting is also possible, but due to the more complicated search in rook pivoting there will be more "cache misses" (where one needs to access memory outside of cache) than in partial pivoting. Finally, it appears that it is not possible to do a block implementation of complete pivoting, since complete pivoting requires a search of the entire unfactored portion of A at every step of the factorization.

Here are some timings for the same 2000 by 2000 matrix used earlier, using compiled code (C and Fortran):

>> tic, x = gecp(A,b); toc      % unblocked complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp(A,b); toc      % blocked rook pivoting
Elapsed time is 1.367741 seconds.
>> tic, x = A\b; toc            % blocked partial pivoting
Elapsed time is 0.900737 seconds.

In this experiment blocked rook pivoting requires about 50% more time than blocked partial pivoting, and complete pivoting is more than 20 times slower than partial pivoting.

Note that the most time consuming part of the blocked Gaussian elimination algorithm is the matrix multiplication in (3) or (5). Most computer manufacturers supply very efficient, machine language implementations of basic matrix operations such as matrix multiplication. Such routines are called the Basic Linear Algebra Subprograms, or the BLAS. The blocked Gaussian elimination algorithm facilitates structuring the code so that it can call BLAS routines.

The above discussion represents the state of the art until very recently. For example, Matlab's backslash for square linear systems calls an implementation of blocked Gaussian elimination with partial pivoting. However, computers are evolving again:

PARALLEL PROCESSING, 2011 AND BEYOND

Computers with multicore processors and parallel processing add-on cards are becoming increasingly common. For example, the NVIDIA Tesla cards contain hundreds or even thousands of processors. Currently the more advanced parallel processing cards cost a few thousand dollars, but the prices will continue to fall. Soon massively parallel processing will no longer be only the domain of supercomputers; parallel processing will become ubiquitous. Our algorithms will need to adapt.

Parts of the block Gaussian elimination algorithm are easy to parallelize. For example, it is easy to divide up the calculations in (5) into independent pieces, as described earlier, by focusing on the calculation of submatrices of Â22. Each piece can then be sent to a separate processor, and these processors can independently and simultaneously do their portions of the calculations (a small sketch of this splitting is given below). Since this procedure is relatively straightforward, the calculations in (3) or (5) are sometimes called "embarrassingly parallel." Similarly, the calculations in (2) are easy to do in parallel. If the calculations in (2) and (3) are done in parallel by hundreds or more processors, these calculations may be done quickly and the bottleneck in the computations may become the "tall skinny" LU factorization in step (1).
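Before turning to the tall skinny factorization, here is a small Matlab sketch of how the update in (3)/(5) separates into independent pieces. The function name schur_update_tiled and the tile size ts are illustrative choices, not from the course code; each tile touches only the matching rows of L21 and columns of U12, so a tile can be kept in cache, and the tiles are independent of one another (they are the natural units to hand to BLAS routines or to separate processors):

   function Ahat22 = schur_update_tiled(A22, L21, U12, ts)
   % Compute Ahat22 = A22 - L21*U12 one ts by ts tile at a time.
   % Each tile depends only on the matching rows of L21 and columns of U12,
   % so the tiles can be computed in any order: in cache, or on separate processors.
   [m, n] = size(A22);
   Ahat22 = zeros(m, n);
   for i = 1:ts:m
       ri = i:min(i+ts-1, m);                   % tile rows
       for j = 1:ts:n
           cj = j:min(j+ts-1, n);               % tile columns
           Ahat22(ri, cj) = A22(ri, cj) - L21(ri, :) * U12(:, cj);
       end
   end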
If this tall skinny LU factorization in (1) is done with partial pivoting, the problem is not "embarrassingly parallel"; it is difficult to do the pivoting in parallel efficiently. A key factor in the computation time is the communication between processors, that is, moving data; this often dominates the arithmetic computations. Let us look at one part of the communications - the number of messages that need to be passed between processors - for partial pivoting of a tall skinny matrix. A separate, important issue is the volume of data passed between processors, but for simplicity we will not discuss that here. In some environments, the time to establish a new data connection between processors, that is to initiate a new message, can be a bottleneck.

Below is a potential division of a tall skinny matrix, which we call A1, among separate processors:

        [ A11 ]
        [ A21 ]
   A1 = [ A31 ]
        [ A41 ]
        [ ... ]
        [ Am1 ]

Here we have pictured m blocks in the matrix, and we assume that each block is a b by b matrix, so that the complete tall skinny matrix is n by b with n = mb. Assume that each block is assigned to a separate processor.

Locating a pivot element in partial pivoting is a column by column operation. We begin by finding the largest magnitude element in the first column. To do that we need to collect information from every processor. For example, each processor could calculate the largest magnitude element of column one of its submatrix, and then the m local maximum elements can be compared to determine the overall largest magnitude element. This would require sending m messages, one for each processor. The selection of pivot elements for a factorization of the entire n by b matrix would require the same number of messages for each column, or a total of n = mb messages. There will be other information passed between processors for the elimination, but we will focus only on the pivot search. We conclude that:

   With this approach the pivot search in partial pivoting of the n by b matrix A1 requires n = mb messages.

Recall that n can be large, thousands or more. There are other algorithms for implementing partial pivoting in parallel, but none of them minimizes the number of messages as well as an alternate pivoting scheme:

Tournament pivoting is a new (first published in 2011) pivoting scheme. Rather than working column by column as partial pivoting does, tournament pivoting selects pivot elements by having "competitions" between pairs of blocks of A1, or pairs of sets of rows of A1, in a "tournament" which chooses an overall winner consisting of the best b rows of A1. These best rows are pivoted to the top of A1 and Gaussian elimination with no pivoting is applied to the pivoted A1, so that the winning rows of the tournament are the pivot rows.

The principle motivating the competitions between blocks of A1 is that Gaussian elimination with partial pivoting tends to move rows that are linearly independent to the top of the matrix. We won't try to prove this, but we can look at a specific example. Suppose the second row of a matrix is linearly dependent on the first row. When we do the elimination and add a multiple of the first row to the second row, the new row two will be entirely zero. When we select the pivot element for column two we search column two on or below the diagonal for the element furthest from zero. Therefore the current row two, which has a zero on the diagonal, won't remain at row two. So the linearly dependent row two won't remain at the top of the factorization.
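We can check this behavior in Matlab on a small example (the numbers below are my own, chosen only for illustration):

>> A = [ 4  2  6 ;
         2  1  3 ;       % row 2 = (1/2)*(row 1): linearly dependent on row 1
         1  5  2 ];
>> [L, U, P] = lu(A);    % Gaussian elimination with partial pivoting: P*A = L*U

Here lu returns the permutation P = [1 0 0; 0 0 1; 0 1 0]: after row two is zeroed out by the first elimination step, the column two pivot search moves it down, so the dependent row ends up at the bottom of the factorization rather than near the top.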
With this in mind we can recast the goal of pivoting as selecting rows of the n by b matrix A1 so that the first b rows of the pivoted matrix are linearly independent. We can now do this via a "tournament" involving the blocks of A1 (pictured again):

        [ A11 ]
        [ A21 ]
   A1 = [ A31 ]
        [ A41 ]
        [ ... ]
        [ Am1 ]

1. The rows of A11 and A21 participate in a competition by applying an LU factorization with partial pivoting to the 2b by b matrix B = [ A11 ; A21 ]. The b rows moved to the top are the "winners"; that is, they are the rows of B that are more linearly independent, as determined by Gaussian elimination. Let W be the b by b matrix containing the winning rows.

2. For k = 3 to m, have the rows of Ak1 (the "challengers") compete with the rows of W (the current "champions"). The competition is done in the same way as step 1: apply Gaussian elimination with partial pivoting to the 2b by b matrix B = [ W ; Ak1 ]. The b rows that are moved to the top are the new winners and become the current "champions" W.

The result will be the b most linearly independent rows (as determined by the tournament) of A1, as desired. (A Matlab sketch of this tournament is given at the end of these notes.)

We should add that there are many other ways to carry out the "tournament" to choose the b "winners." The above tournament is a sequential style tournament where the current team of champions successively meets new challengers, one after the other. A real-world example of this is the Survivor TV show, where a player who is sent to an "exile" island meets a new challenger every week; the winner goes on to meet the next challenger. In the literature this is called a "flat tree" tournament. Another style of tournament is "tennis style." A tennis style tournament is divided into rounds. In the first round all m teams play in m/2 "games." The m/2 winning teams (or sets of rows, in our case) are sent to the next round. In the second round there are m/4 games (or, in our case, LU factorizations of matrices of size 2b by b) and m/4 sets of rows advance. This continues through quarterfinals, semifinals and finals, where the b "best" rows are selected. This is called a "binary tree" tournament in the literature. A binary tree tournament takes less time than a flat tree tournament (in a real life tournament, or on a computer that can run simultaneous games, i.e. a parallel processing computer).

What is the communication cost of a tournament to determine the best, or most linearly independent, rows of the n by b matrix A1? Recall that for simplicity we are focusing only on the number of messages. In a tournament we need to pass information between processors at the beginning of each game played. There are only m - 1 games played in a tournament of m teams (this is simple to see in the flat tree case). Therefore:

   The pivot search in tournament pivoting applied to the n by b matrix A1 requires initiating m - 1 = n/b - 1 messages.

This is smaller by a factor of roughly b than the n = mb messages required by the algorithm we described earlier for partial pivoting. The paper "CALU: A Communication Optimal LU Factorization Algorithm," by Laura Grigori, James Demmel, and Hua Xiang, SIAM J. Matrix Analysis and Applications, Vol. 32, pp. 1317-1350, 2011, proves that tournament pivoting is communication optimal in the sense that the number of messages required by the algorithm and the number of words in these messages achieve, within a modest factor, theoretical lower bounds on the amount of communication.
The authors conclude: "The reason to consider CALU is that it does an optimal amount of communication, and asymptotically less than Gaussian elimination with partial pivoting (GEPP), and so will be much faster on platforms where communication is expensive."
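To make the flat tree tournament concrete, here is a small Matlab sketch of the pivot row selection. It is illustrative only: the function name tournament_rows is mine, everything runs on a single processor, and real CALU implementations instead distribute the blocks across processors and pass messages between them. The sketch uses Matlab's lu with the 'vector' option to obtain the row permutation from each "game":

   function winners = tournament_rows(A1, b)
   % Flat tree tournament: select b candidate pivot rows of the tall skinny
   % n by b matrix A1 (n = m*b).  Returns the indices of the winning rows.
   n = size(A1, 1);
   m = n / b;                                  % number of b by b blocks ("teams")
   winners = 1:b;                              % the rows of A11 start as champions
   for k = 2:m
       challengers = (k-1)*b + (1:b);          % rows of the block Ak1
       candidates  = [winners, challengers];   % 2b candidate rows for this game
       B = A1(candidates, :);                  % the 2b by b matrix for the game
       [~, ~, p] = lu(B, 'vector');            % partial pivoting: B(p,:) = L*U
       winners = candidates(p(1:b));           % the b rows moved to the top win
   end

The b rows A1(winners, :) would then be pivoted to the top of A1 and Gaussian elimination with no pivoting applied, as described above.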