Flops, Memory and Parallel Processing and Gaussian elimination (Lecture Notes for 10-1-2013)
We will discuss three important factors in the efficiency of Gaussian elimination: floating point
operations (flops), memory access and parallel processing. The topics span more than thirty years in the
development of efficient implementations of Gaussian elimination and associated pivoting schemes.
FLOPS, LINPACK and the 1980’s
LINPACK is the first publicly available, high quality linear algebra package. It was developed with an NSF
grant in the 1980’s by, among other researchers, Cleve Moler, the creator of Matlab. For the computer
architectures of that time, the number of floating point operations in an algorithm was the most
important measure of its efficiency. As we have seen earlier, for large n, the leading order flop
counts for Gaussian elimination with no pivoting (genp), partial pivoting (gepp), rook pivoting (gerp) and
complete pivoting (gecp) applied to an n by n matrix are:
Method   Flops from elimination   Flops for pivot search      Total flops (leading order term)
genp     (2/3)n^3                 0                           (2/3)n^3
gepp     (2/3)n^3                 (1/2)n^2                    (2/3)n^3
gerp     (2/3)n^3                 approximately (3/2)n^2      (2/3)n^3
gecp     (2/3)n^3                 (1/3)n^3                    n^3
The conclusion from the last column is that, based on flop count, genp, gepp and gerp should take about
the same amount of time, while gecp should take about 50% longer. In practice genp can lead to large
error growth, so we won’t pursue genp further here.
We can check the theory in the above table for the other routines by doing experiments. When
comparing run times it is better to use code produced by a compiled computer language rather than an
interpreted language such as Matlab. When timing code in an interpreted language, one is timing the
work to interpret the code as well as the computations. Since a compiled language (such as C, C++,
Fortran, etc.) is translated to machine language before execution, the timing focuses on the
computations, not the translation.
Fortunately, in the “Try Your Own Pivot Scheme” assignment, C code implementing gecp is available
through the Matlab Mex utility (see item 26 at the course web page). By making a small change in the
gecp source code (in the pivot search section of the code we can change the “for (j = k; j <= nminus1;
++j)” loop to “for (j = k; j <= k; ++j)”) we can modify the gecp code to implement gepp. I also have
Fortran code, again accessible in Matlab via the Mex utility, that implements rook pivoting to solve Ax =
b. All this code is in “LINPACK” style in that the flop count was a key focus in the code development.
We can try out the methods in Matlab (using a midrange, dual processor Acer laptop):
>> n=2000; A=randn(n,n); b=randn(n,1);
>> tic, x = gecp(A,b); toc               % complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp_unblocked(A,b); toc     % rook pivoting with similar style code
Elapsed time is 12.953847 seconds.
>> tic, x = geppinC(A,b); toc            % partial pivoting
Elapsed time is 12.884968 seconds.
These runs support the conclusions of the above theory. The complete pivoting code takes about 50%
(58% in the above runs) longer than the partial or rook pivoting code and the run times for the partial
pivoting code and rook pivoting code are about the same!
MANAGING MEMORY ACCESS, LAPACK and the 1990’s
If we solve the same problem using the built-in backslash operator in Matlab we get the following results:
>> tic,x=A\b;toc
Elapsed time is 0.900737 seconds.
The Matlab solution implements Gaussian elimination with partial pivoting, but is more than 13 times
faster than any of the earlier runs. What is going on?
In the late 1980’s and early 1990’s the use of cache memory became increasingly popular. Cache
memory is relatively expensive but fast compared to the main RAM memory in a computer. It became
apparent that it is important to keep calculations within the fast cache memory as much as possible.
Otherwise, the time to access the slow main RAM memory outside the cache would dominate the
calculation times.
To address this issue a new publicly available linear algebra package, called LAPACK, was developed with
NSF funding. This package required recoding most of the LINPACK algorithms, typically using block
factorizations, which divide the algorithm into pieces so that the calculations can be kept in cache
memory as much as possible. The current code for Matlab’s backslash does Gaussian elimination
with partial pivoting using such a block algorithm. As we see above, this leads to a huge decrease in run
times, more than a factor of 13.
The idea of block factorization (we will ignore pivoting initially for clarity) is based on the following block
factorization of an n by n matrix A:

(0):   [ L11    0  ] [ U11  U12 ]   =   [ A11  A12 ]
       [ L21   L22 ] [  0   U22 ]       [ A21  A22 ]

where, for a block size b, the matrices L11, U11 and A11 are b by b, L21 and A21 are (n-b) by b,
U12 and A12 are b by (n-b), and L22, U22 and A22 are (n-b) by (n-b). Let A1 = [A11; A21] and
L1 = [L11; L21] (the blocks stacked one above the other). We will call the n by b matrices A1 and L1
“tall skinny matrices” since we will assume that b is much smaller than n. For example b might be 100
and n 2000 or larger.
As can be seen by multiplying L times the first block column of U, equation (0) implies that L1 U11 = A1, or
that:

(1): L1 and U11 are the LU factors of A1.

We can therefore calculate L1 and U11 by doing an LU factorization of the tall skinny matrix A1. This
calculation can be done in cache memory, assuming b is chosen small enough.
Next, multiplying the first block row of L times U implies that L11 U12 = A12. We can therefore
calculate U12 using:

(2): U12 = (L11)^(-1) A12.

This calculation can also be done in cache since A12 and U12 are “short, fat” matrices (they have only b
rows but n - b columns).
Finally, if in (0) we multiply the second block row of L by the second block column of U, it follows that
L21 U12 + L22 U22 = A22, or that L22 U22 = A22 - L21 U12, which we call Â22. We break this up into two parts:

(3): Â22 = A22 - L21 U12

and

(4): L22 and U22 are the LU factors of Â22.
The matrices in (4) are typically too large to fit in cache memory. For example, if n = 2000 and b = 100,
then A22 and Â22 are 1900 by 1900 (more generally (n-b) by (n-b)). However, it is easy to separate the
calculations in (3) into smaller pieces that do fit into cache memory. The structure of (3) is indicated by

(5):   Â22  =  A22  -  L21 · U12

[Figure: equation (5) drawn with blocks, highlighting one submatrix of Â22 along with the corresponding
submatrix of A22, the corresponding rows of the tall skinny matrix L21 and the corresponding columns of
the short fat matrix U12.]
The calculations for a submatrix of Â22, for example the elements in rows 1001 to 1100 and columns
1001 to 1100, require only the corresponding entries of A22, the elements in rows 1001 to 1100 of L21
and the elements in columns 1001 to 1100 of U12. The submatrix sizes can be selected so that these
calculations fit into cache memory.
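As a rough illustration (a Matlab sketch of my own, not the course code; the names A22, L21, U12,
A22hat, n2 and the tile size tb are mine), the update in (3)/(5) can be computed one tile at a time so
that each tile, together with the rows of L21 and the columns of U12 it needs, can stay in cache:

% n2 = n - b is the size of A22; tb is a tile size chosen to fit in cache (e.g. 100).
A22hat = zeros(n2, n2);
for i = 1:tb:n2
    for j = 1:tb:n2
        ib = min(i + tb - 1, n2);
        jb = min(j + tb - 1, n2);
        % Each tile of A22hat needs only the matching tile of A22,
        % rows i..ib of L21 and columns j..jb of U12.
        A22hat(i:ib, j:jb) = A22(i:ib, j:jb) - L21(i:ib, :) * U12(:, j:jb);
    end
end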
The calculations in (4) can be done by repeating the above procedure in a loop. On the first pass through
the loop we calculate the first block row and block column of L and U by applying the above procedure to the
2000 by 2000 matrix A (in our example). Next we apply the procedure to the 1900 by 1900 matrix Â22,
which produces the next block row and block column of L and U. Continuing, we eventually factor the
whole matrix.
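Here is a minimal Matlab sketch of the loop just described, with pivoting still ignored for clarity. This is my
own illustrative code, not the course's GEPP_BLOCK routine; the function name blocklu_nopivot and the
variable names are mine, and for simplicity the update (3) is done as one matrix multiply rather than tile
by tile.

function [L, U] = blocklu_nopivot(A, bs)
% Blocked LU factorization of A without pivoting, block size bs (steps (1)-(4) above).
n = size(A, 1);
for k = 1:bs:n
    kb = min(k + bs - 1, n);                 % the current panel is columns k..kb
    % (1): unblocked LU (no pivoting) of the tall skinny panel A(k:n, k:kb)
    for j = k:kb
        A(j+1:n, j) = A(j+1:n, j) / A(j, j);                         % multipliers (column j of L)
        A(j+1:n, j+1:kb) = A(j+1:n, j+1:kb) - A(j+1:n, j) * A(j, j+1:kb);
    end
    % (2): U12 = (L11)^(-1) A12, a triangular solve with the unit lower triangular L11
    L11 = tril(A(k:kb, k:kb), -1) + eye(kb - k + 1);
    A(k:kb, kb+1:n) = L11 \ A(k:kb, kb+1:n);
    % (3): A22hat = A22 - L21 * U12, the matrix multiply that dominates the work
    A(kb+1:n, kb+1:n) = A(kb+1:n, kb+1:n) - A(kb+1:n, k:kb) * A(k:kb, kb+1:n);
    % (4): the next pass of the loop factors the updated trailing block
end
L = tril(A, -1) + eye(n);                    % unit lower triangular factor
U = triu(A);                                 % upper triangular factor
end

As a quick check on a matrix that should not need pivoting (for example A = randn(n) + 2*n*eye(n)),
[L, U] = blocklu_nopivot(A, 100) should give a small relative residual norm(L*U - A)/norm(A).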
The GEPP_BLOCK code in item 25 at the course web site implements these ideas in Matlab code.
GEPP_BLOCK does a block factorization with partial pivoting. This requires careful bookkeeping to keep
track of the permutations. This is included in GEPP_BLOCK, but we will not discuss the details here.
A block implementation of rook pivoting is also possible, but due to the more complicated search in rook
pivoting there will be more “cache misses” (cases where one needs to access memory outside the cache) than
in partial pivoting. Finally, it appears that it is not possible to do a block implementation of complete
pivoting, since complete pivoting requires a search of the entire unfinished portion of A at every step of
the factorization.
Here are some timings for the same 2000 by 2000 matrix used earlier, using compiled code (C and
Fortran):
>> tic, x = gecp(A,b); toc     % unblocked complete pivoting
Elapsed time is 20.461706 seconds.
>> tic, x = gerp(A,b); toc     % blocked rook pivoting
Elapsed time is 1.367741 seconds.
>> tic, x = A\b; toc           % blocked partial pivoting
Elapsed time is 0.900737 seconds.
In this experiment blocked rook pivoting requires about 50% more time than blocked partial pivoting,
and unblocked complete pivoting is more than 20 times slower than blocked partial pivoting.
Note that the most time consuming part of the blocked Gaussian elimination algorithm is the matrix
multiplication in (3) or (5). Most computer manufacturers supply very efficient, machine language
implementations of simpler matrix operations such as matrix multiplication. Such routines are called
the Basic Linear Algebra Subprograms, or the BLAS. The blocked Gaussian elimination algorithm facilitates
structuring the code so that it can call BLAS routines.
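For example, the update in (3) is exactly the operation performed by the level 3 BLAS matrix multiply
routine DGEMM (in double precision), which computes C <- alpha*A*B + beta*C; taking alpha = -1,
beta = 1, A = L21 and B = U12, a single DGEMM call starting from C = A22 produces Â22.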
The above discussion represents the state of the art until very recently. For example, Matlab’s backslash
for square linear systems calls an implementation of blocked Gaussian elimination with partial pivoting.
However, computers are evolving again:
PARALLEL PROCESSING, 2011 AND BEYOND
Computers with multicore processors and parallel processing add-on computer cards are becoming
increasingly common. For example, the NVIDIA Tesla cards contain hundreds or even thousands of
processors. Currently, the more advanced parallel processing cards cost a few thousand dollars, but the
prices will continue to fall. Soon massively parallel processing will not only be the domain of
supercomputers; parallel processing will become ubiquitous. Our algorithms will need to adapt.
Parts of the block Gaussian elimination algorithm are easy to parallelize. For example, it is easy to divide
up the calculations in (5) into independent pieces, as described earlier, by focusing on the calculation of
submatrices of Â22. Each piece can then be sent to a separate processor. These processors can
independently and simultaneously do their portion of the calculations. Since this procedure is relatively
straightforward, the calculations in (3) or (5) are sometimes called “embarrassingly parallel.” Similarly,
the calculations in (2) are easy to do in parallel.
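As a rough sketch (mine, not the course code; it assumes Matlab's Parallel Computing Toolbox is
available, that tb divides n - b evenly, and uses my own variable names), each block of rows of Â22 can
be updated independently by a separate worker:

m2 = (n - b) / tb;                      % number of row blocks of size tb
parts = cell(m2, 1);
parfor i = 1:m2
    rows = (i-1)*tb + 1 : i*tb;
    % Each worker computes its own independent piece of (5).
    parts{i} = A22(rows, :) - L21(rows, :) * U12;
end
A22hat = vertcat(parts{:});             % reassemble the updated trailing block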
If the calculations in (2) and (3) are done in parallel by hundreds or more processors, these calculations
may be done quickly and the bottleneck in the computations may become the “tall skinny” LU
factorization in step (1).
If this tall skinny LU factorization in (1) is done with partial pivoting, the problem is not “embarrassingly
parallel”; it is difficult to do the pivoting in parallel efficiently. A key factor in the computation time is
the communication between processors, that is, moving data. This often dominates the arithmetic
computations. Let us look at one part of the communication, the number of messages that need to be
passed between processors, for partial pivoting of a tall skinny matrix. A separate, important issue is
the volume of data passed between processors, but for simplicity we will not discuss that here. In some
environments, the time to establish a new data connection between processors, that is, to initiate a new
message, can be a bottleneck.
[Figure: the tall skinny n by b matrix A1 partitioned into m stacked b by b blocks A11, A21, A31, A41, ..., Am1,
one block per processor.]

The figure pictures a potential division of a tall skinny matrix, which we call A1, among
separate processors. In this case we have pictured m blocks in the matrix and assume that
each block is a b by b matrix, so that the complete tall skinny matrix is n by b with n = mb.
Assume that each block is assigned to a separate processor. Locating a pivot element in
partial pivoting is a column by column operation. We can begin by finding the largest
magnitude element in the first column. To do that we need to collect information from every
processor. For example, each parallel processor could calculate the largest magnitude
element of column one of its submatrix, and then the m local maximum elements can be
compared to determine the largest magnitude element. This would require sending m
messages, one for each processor. The selection of pivot elements for a factorization of the
entire n by b matrix would require the same number of messages for each column, or a total
of n = mb messages. There will be other information passed between processors for the
elimination, but we will focus only on the pivot search. We conclude that

•	With this approach the pivot search in partial pivoting of the n by b matrix A1 requires
	n = mb messages.
Recall that n can be large, thousands or more. There are other algorithms for implementing partial
pivoting in parallel, but none of them minimizes the number of messages as well as the alternate pivoting
scheme described next.
Tournament pivoting is a new (first published in 2011) pivoting scheme. Rather than working column by
column as partial pivoting does, tournament pivoting selects pivot elements by having “competitions”
between pairs of blocks of A1, or pairs of sets of rows of A1, in a “tournament” which chooses an overall
winner consisting of the best b rows of A1. These best rows are pivoted to the top of A1 and Gaussian
elimination with no pivoting is applied to the pivoted A1, so that the winning rows of the
tournament are the pivot rows.
The principle motivating the competitions between blocks of A1 is that

•	Gaussian elimination with partial pivoting tends to move rows that are linearly independent
	to the top of the matrix.
We won’t try to prove this, but we can look at a specific example. Suppose the second row of a matrix is
linearly dependent on the first row. When we do the elimination and add a multiple of the first row to
the second row, the new row two will be entirely zero. When we select the pivot element for column
two we search column two on or below the diagonal for the element furthest from zero. Therefore the
current row two, which has a zero on the diagonal, won’t remain at row two. So the linearly dependent
row two won’t remain at the top of the factorization.
With this in mind we can recast the goal:

•	The goal of pivoting is to select rows of the n by b matrix A1 so that the first b rows of the
	pivoted matrix are linearly independent.
We can now do this via a “tournament” involving the blocks A11, A21, ..., Am1 of A1 (pictured above):
1. The rows of A11 and A21 participate in a competition, carried out by applying an LU
   factorization with partial pivoting to the 2b by b matrix B = [A11; A21]. The rows moved to
   the top are the “winners”; that is, they are the rows of B that are more linearly independent, as
   determined by Gaussian elimination. Let W be the b by b matrix containing the winning rows.
2. For k = 3 to m, have the rows of Ak1 (the “challengers”) compete with the rows of
   W (the current “champions”). The competition is done in the same way as in step 1: apply
   Gaussian elimination with partial pivoting to the 2b by b matrix B = [W; Ak1]. The b rows
   that are moved to the top are the new winners, the current “champions”.
The result will be the b most linearly independent rows (as determined by the tournament) of A1, as
desired.
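A minimal Matlab sketch of this flat tree tournament (my own code, not the CALU implementation; the
function name tournament_winners is mine, and it assumes A1 has exactly m*b rows) could look like:

function W = tournament_winners(A1, b)
% Select b "winning" rows of the tall skinny matrix A1 by a flat tree tournament.
m = size(A1, 1) / b;                     % number of b by b blocks (teams)
W = A1(1:b, :);                          % the first block starts as the champions
for k = 2:m
    B = [W; A1((k-1)*b+1 : k*b, :)];     % champions on top of the k-th block of challengers
    [~, ~, p] = lu(B, 'vector');         % partial pivoting picks the winning rows of B
    W = B(p(1:b), :);                    % rows pivoted to the top become the new champions
end
end

Here the loop iteration k = 2 is step 1 above (A11 versus A21), and the remaining iterations are step 2.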
We should add that there are many other ways to carry out the “tournament” to choose b “winners.”
The above tournament is a sequential style tournament where the current team of champions
successively meets new challengers one after the other. A real world example of this is on the Survivor
TV shows, where a player who is sent to an “exile” island meets a new challenger every week. The winner
goes on to meet the next challenger. In the literature this is called a “flat tree” tournament.
Another style of tournament is “tennis style.” A tennis style tournament is divided into rounds. In the
first round all m teams play in m/2 “games.” The m/2 winning teams (or sets of rows in our case) are
sent to the next round. In the second round there are m/4 games (or, in our case, LU factorizations of
matrices of size 2b by b) and m/4 sets of rows advance. This continues through the quarterfinals, semifinals
and finals, where the b “best” rows are selected. This is called a “binary tree” tournament in the
literature. A binary tree tournament takes less time than a flat tree tournament (in a real life
tournament, or on a computer that can run simultaneous games, i.e. a parallel processing computer).
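For example, with m = 20 blocks (n = 2000 and b = 100, as before), either style of tournament plays
m - 1 = 19 games in total, but a binary tree tournament can organize them into only ceil(log2(20)) = 5
rounds of simultaneous games, while the flat tree tournament plays its 19 games one after another.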
What is the communication cost of a tournament to determine the best, or most linearly independent,
rows of the n by b matrix A1? Recall that for simplicity we are only focusing on the number of
messages. In a tournament we need to pass information between processors at the beginning of each
game played. There are only m - 1 games played in a tournament of m teams (this is simple to see in the
flat tree case). Therefore

•	The pivot search in tournament pivoting applied to the n by b matrix A1 requires initiating
	m - 1 ≈ n/b messages.
This is roughly a factor of b fewer messages than required by the algorithm we described earlier for
partial pivoting.
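For example, with n = 2000 and b = 100, so that m = 20, the column by column partial pivoting search
initiates about n = mb = 2000 messages, while the tournament initiates only m - 1 = 19.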
The paper “CALU: A Communication Optimal LU Factorization Algorithm,” by Laura Grigori, James
Demmel, and Hua Xiang, SIAM J. Matrix Analysis and Applications, Vol. 32, pp. 1317-1350, 2011, proves
that tournament pivoting is communication optimal in that the number of messages required by the
algorithm and the number of words in these messages achieve, within a modest factor, theoretical
lower bounds on the amount of communication. The authors conclude:

•	“The reason to consider CALU is that it does an optimal amount of communication, and
	asymptotically less than Gaussian elimination with partial pivoting (GEPP), and so will be much
	faster on platforms where communication is expensive.”