Compiler Blockability of Numerical Algorithms*

Steve Carr
Ken Kennedy
Department of Computer Science
Rice University
Houston TX 77251-1892

*Research supported by DARPA through ONR Grant N00014-91-J-1989 and by NSF Grant CCR-9120008.

Abstract

mizations don Over the past decade, have focused on on a single not kept increasing chip. speed ante is leading cated memory explicitly programs. comptler codes paper technology the while our to retaining well are on par- investigation the Our results memory to write a sublanguage will Introduction The trend precisely, able programmers need increasing Unfortunately, hand barked many in same rate. number The result of cycles for 10 to 20 machine Although lems, cycles cache it performs working tion sets has codes led hierarchy. We believe The user are specific task of specializing There compiler dence. fall in is a long this is a step not be creating a program Fortran of in the algorithms latency of machine. to a target in history wrong Instead, I compiler “What included to the the memory- obviate the this versions used compiler algo- the address a compiler the need of real-world the best block automatically we does with this preliminary generate study, information assist algorithms the to in ex- contributed some algorithms performing block To have extends point algorithms algorithms 1]. define we em- if a compiler block of several to the approach, point be needed to derive course loops, The mation exhibit codes hand-blocked ver- known this transformation ical algorithms we have an and found common These LU to many as index-set show dependence used have decomposition a wide can be automatically algebra, been and success- without results splitting. that al- on triangular in linear methods of these pat- transformation be successfully to block key discovered can be used to analyze complex we have are quite loops. applied oting. can study, that that that which fully opti- of this approach In addition, gorithms indepen- efforts programs terns.
architecture enough of this paper are competitive block 114 1063-9535I92 $3.00@ 1992 lEEE question In the programs machine achieve Sorensen versions from algorithmic use of sophisticated achieve and enough [D DSvdV9 and would trapezoidal of the to en- sions?” situatheir the our In that with 77 This [CK89]. in order prob- in the memory that block that This should and LA PACK. rithms to restructure performance will in a natu- to determine generate Fortran results compiler. optimizations The size. technology the corresponding Dongarra chip. these the viability from at the calculations cache to a particular to the on in common. ameliorate the design increase quite programmers direction. should to to improve that an for processors. optimization. automatically LAPACK increasing access—a on scientific than many been is now helps poorly larger by hand has a memory that success on scalar form pos- contend an algorithm good it programs same compiler on an experiment could a natural power is not We the to express To investigate reveal microprocessor speed enhanced recently, made need for performance. computational memory 77. achieve performance for point is toward Fortran to machine-independent hierarchy has vector management More research, in high-performance of memory-hierarchy pressed 1 technology possible to aban- More machine-independent it be for scientists programming. vectorization in ral, tnto opttmizations be expressed good compli- obviate use of compder can ~ to machme-spectjic describes algorithms tmbal- programmers programming. through numerical Thts to perform leading designed between to use more turn, sible speeds haue speed. to make it possible machine-language advanced power imbalance destgners systems, This an In strategies computational memory machine machine-specific form and design memory is haerarchzes. memory that result restructuring ticular the Unfortunately, The pace. 
computation microprocessor Algorithms* piv- is a transfor- Our results class with of numer- optimized for a particular machine’s expressed in a natural covered that erations memory specialized commute pilers to block We ground our we present essary to block the scribe a study tions to derive their corresponding that of the cannot a set of language sion of form. block of reuse: occurs that occurs same cache been in a accessed of a loop. accesses as some and a reference iteration a reference line a for temporal when has previously when Spa- data previous that access. In loop, we pro- the the expres- and DO 10 I = I,E A(I) = A(I-5) 10 algo- a machine-independent work types or a previous the following from For those to allow related reuse management, of as an opportunity transforma- by a compiler, in we review two reuse data current is in the we de- are accesses tial memory-hierarchy Temporal in the nec- in LAPACK extensions algorithms Finally, Then, algorithms. be blocked pose are to can be thought There spatial. optimization. of these algorithms applied reuse. to c)f back- that reuse dependence loop a review memory application Cache When com- thought automatically. point 2.2 op- can enable transformations the block which previously with to LAPACK are dis- methods. related Next, they we have about another were presentation material rithms one that by automatic begin when In addition, knowledge with codes be unlockable hierarchy form. present a reference value defined ence to B(I) ments to + B(I) A ( I-5 ) has by A ( I ) has spatial of B will temporal 5 iterations likely reuse earlier. of The reuse since consecutive be in the same cache the referele- line. summary. 2.3 2 Background 2.1 To fundamental same tool namely tool used available dependence. 
the first A if there statement reference to the compiler in vectorization two statements improve and to the second, the same exists a control memory and both location path a loop ing sets the first statement second dence, also If the first the writes reads from called a flow statement second it, from statements WL91]. source and for to location it, from there an output write to the and antidepen- location, Assume the of the cache. statements A dependence the source ent iterations carried and can that the loop be loop and location, and the used [CK87, to at describe the HK91]. of arrays Sections such for iteration every A [W0187, between it when the is more the reuse nest. is not interchange is blocking seciion of an mining, or set tially, of loop JS doesn’t rows, extra diagonals. of the 115 not as shown below. ,M) occurs. J-loop register loop to [LRW9 to effect N, a pre-loop loop the 1]. strip-mine-andUnroll-and- instead of as an application and the occurs if JS is less than [CC K90]. is completely instead addition, of cache is no interference blocking be seen In of B out analogous interchange divide iterations size strip-mine-and- J-loop HHJ(J+JS-1 A now mine-and-interchange com- the B, but + B(JJ) and there can the inner for JS of JS values for and than exists reuse, is unroll-and-jam used greater reuse to the transformation portion as elements, reuse reuse reference describe result so that loop of M is much = A(I) temporal jam by a particular the = 1,11 size of the cache [AK87]. information, this in cache A’s temporal A(I) Temporal are on differ- dependence work- capture + B(J) Temporal DO 10 I there if the references dependence dependence is accessed substructures columns the by a loop of the of the of references mon sink enhance analysis array is carried = A(I) DO 10 J = I,H, is dependence. by an outer To from whose to distance following DO 10 JJ = J, read the to reside the value 10 If both achieves of a dependence is applied dependence. is an input that A. 
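Strip-mine-and-interchange, as described for the A(I) = A(I) + B(J) nest, can be illustrated with a minimal sketch. It is written in Python rather than the paper's Fortran 77, uses 0-based indices, and the function names `point_loop` and `strip_mine_and_interchange` are hypothetical. The strip size `js` stands in for JS, and `min()` stands in for the MIN function that handles a final partial strip.

```python
def point_loop(a, b):
    """Original nest: J outer, I inner, so each B(J) sweeps all of A."""
    a = list(a)
    for j in range(len(b)):
        for i in range(len(a)):
            a[i] = a[i] + b[j]
    return a


def strip_mine_and_interchange(a, b, js):
    """J is strip-mined by js and the strip loop is moved innermost."""
    a = list(a)
    n = len(b)
    for j in range(0, n, js):
        for i in range(len(a)):
            # min() guards the last strip when js does not divide n evenly
            for jj in range(j, min(j + js, n)):
                a[i] = a[i] + b[jj]
    return a
```

In the blocked form, only `js` consecutive elements of `b` are touched between successive sweeps of `a`, which is the reuse the transformation is meant to capture. For each `a[i]`, the additions still occur in increasing `j` order, so the two versions produce identical results.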
To exploit for there blocks cache ac- space Strip-mine-and-interchange It shortens A(I) interchange statements into for that iteration DO 10 I = I,M dence. If both reuse. the DO 10 J = 1,11 10 is an enough the datum depen- the location grouped sink of loops in cache, be Consider and is a true dependence. reads writes to the behavior fit small temporal Por89, likely [Kuc78]. there than can are occurs. If the memory data is a transformation between flow of available parallelization— dependence exists is the blocking the cess more Dependence The Iteration-space unrolling. unrolled Essenafter unroll-and-jam. is used of a MIN function cache of strip stripWhen to handle . the 3 Index-set splitting J Iteration-space plied blocking as shown safety the constraints of blocking. splitting creates each new loop changed and pletely a partial be from iterating over iteration space. M application called Index-set one original split- loop nonintersecting order iteration space As an example of index-set following loop. is un- is still E J=aII+~ with portions Execution original the t ap- Sometimes a transformation applied. loops executed. consider permit can the be directly section. cases, multiple of the original ting, these always previous only In index-set ting in cannot com- 1 11 splitFigure 1+1 S-1 1: Upper Left N 11 Triangular Iteration Space DO 10 I = I,H 10 A(I) = A(I) + B(I) set I 11-loop The index of can be split at 100 to iteration obtain DO 10 I = I, MIH(M, IOO) A(I) = A(I) 20 A(I) = A(I) Although this loop forms. 
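The index-set split of the A(I) = A(I) + B(I) loop at iteration 100 can be sketched as follows. This is an illustrative Python rendering, not the paper's Fortran 77; `original_loop` and `split_loop` are hypothetical names, and the code uses 0-based indices, so `split_at` counts how many leading iterations go to the first loop.

```python
def original_loop(a, b):
    """Single loop over the whole iteration space."""
    a = list(a)
    for i in range(len(a)):
        a[i] = a[i] + b[i]
    return a


def split_loop(a, b, split_at=100):
    """Same iterations, executed by two loops over disjoint index sets."""
    a = list(a)
    n = len(a)
    first_end = min(n, split_at)      # corresponds to MIN(N, 100)
    for i in range(0, first_end):     # first portion of the iteration space
        a[i] = a[i] + b[i]
    for i in range(first_end, n):     # remainder; empty when n <= split_at
        a[i] = a[i] + b[i]
    return a
```

Every original iteration runs exactly once and in the original order, so the split is always safe. Its value lies in the fact that each new loop can then be transformed separately.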
enable This the eration H does section loops nothing the blocking uses index-set of triangular and and with by itself, of splitting to trapezoidal complex formula cases iteration the iteration space iteration-space I it- appears bound dependence in the Interchanging gions loops requires preserve Therefore, loop for with then The gular general integer form constants the when blocking We of one type below, be symbolic) Figure 1 gives trian- space loops, iteration the of this loop. intersection a > 0. space Therefore, at the point interchanging line the the II J=@II+~ (1, aI+/3) the of loops must iteraand with inner loops unrolling the more above region following cre- triangular )+@ and one loop region that iter- the line. Since is known, loop is of the it can qest. IS = 1,1+1 S-2 J unrolled H) upon to determine therefore, the first H loop body Depending the values loop nest triangular to handle other be possible ~, to eliminate triangles iteration it may triangular to completely the unroll-and-jam common Trapezoidal of a and the size of the it may ditionally, 3.2 MIN(a(I+IS-2)+(3, body DO 10 J = a(I+IS-l)+~, also reun- overhead. Ad- can be extended [Car92]. spaces the be han- requires over region the two ex- Since of J at a(I+IS-1)+~ rectangular to give be of iterations making J, iterates be possible roll To interchange of the of the can 100P 10 gion; description with DO 20 J = aII+@, IS a graphical length DO 10 I = I,E, /3 are literal and of lower is begin 20 a and of the as follows. by the splitting rectangular DO 20 II of strip-mined where the function instead number J=cY(I+IS-I the be unrolled the vary that the line the spaces. bound defined region, loop requires for- to handle a linear unroll-and-jam Index-set ates over DO 10 II = 1,1+1 S-1 DO 10 J = cYII+~, Fl [OOP body 10 dled. also extended where upper space one loop below strip-mine-and-interchange (D may DO 10 I = I,E, tion bounds to I+ IS-i) strip-mine-and-interchange iteration ates W0187]. 
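The bound used when interchanging a strip-mined triangular nest can be checked with a small sketch. It assumes the nest has the form DO I = 1,N / DO J = alpha*I+beta, M with integer alpha > 0, which is our reading of the garbled listing. The Python below uses hypothetical function names and records the (II, J) pairs visited, so that the point and blocked iteration spaces can be compared directly.

```python
def triangular_point(n, m, alpha, beta):
    """Iterations of: DO I = 1,N / DO J = alpha*I+beta, M."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(alpha * i + beta, m + 1)]


def triangular_blocked(n, m, alpha, beta, strip):
    """Strip-mine I by `strip`, then move the strip loop II inside J."""
    visited = []
    for i in range(1, n + 1, strip):
        last = min(i + strip - 1, n)          # end of this strip
        for j in range(alpha * i + beta, m + 1):
            # II may run only while alpha*II + beta <= J, i.e. II <= (J-beta)/alpha
            upper = min((j - beta) // alpha, last)
            for ii in range(i, upper + 1):
                visited.append((ii, j))
    return visited
```

The blocked version visits the same set of iterations in a different order. Whether that reordering is legal for a particular loop body is a separate dependence question that this sketch does not address.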
we derive iteration for bounds [W0186, regions re- 0 and to triangular difficult. it to unroll-and-jam. is given an This [Car92]. innermost applied. a triangular loop Below, loop rectangular, directly of the loop the triangular derivation extend loop over modification. on triangular the and iterate of determining performed is not be modification semantics blocking bound mula the the that a loop cannot < the Triangular spaces of blocking with 1+1S–1. H = I, MIH((J-~)/cI, a a trapezoidal If > nest, can be trivially where tended Triangular region ~ body IOOP This complex patterns. 3.1 loop DO 10 II 10 enable blocking spaces until DO 10 J = cd#3, MIH(M, 100)+1), + B(I) can a trapezoidal ~ DO 10 I = l, E,IS transformation application over of the following + B(I) DO 20 I = MAX(l, its bound gives 10 to iterate upper While the the the common 116 previous method non-rectangular-shaped applies to iteration many of spaces, there are still handle. some In linear ential equation iteration ple, codes, spaces where important algebra, loops occur. L is assumed with it will partial not double-precision the differ- trapezoidal-shaped the following to be a constant, with The MIN function one triangular = N. Because gions can be handled split into with blocking two one rectangular separated at rectangular already, separate the to point triangular the index lower at the each following bound, need point new loop L, of the not :1 = loop that that, after an iteration space that consider can In some space ~ can of two gular bound regions. ting As is a linear variablel extended loop that handle in a trapeIt may tion be this more loop complex which of A (K) dard dependence the rection The value split in similar to example, complete from the loop loops that The ration and blocking using can index-set regions consider be previous execution unroll- a MIN bounds time. 
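Splitting a trapezoidal iteration space into one triangular and one rectangular region can be sketched as follows. The sketch assumes the nest DO I = 1,N / DO J = L, MIN(alpha*I+beta, M) with integer alpha > 0. It is in Python, the function names are hypothetical, and the split point (M - beta)/alpha is taken from the surrounding text.

```python
def trapezoid_point(n, m, low, alpha, beta):
    """Iterations of: DO I = 1,N / DO J = L, MIN(alpha*I+beta, M)."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(low, min(alpha * i + beta, m) + 1)]


def trapezoid_split(n, m, low, alpha, beta):
    """Split I where alpha*I+beta crosses M, removing MIN from the inner bound."""
    split = min(n, (m - beta) // alpha)
    visited = []
    # Triangular region: the linear bound alpha*I+beta is the smaller one.
    for i in range(1, split + 1):
        for j in range(low, alpha * i + beta + 1):
            visited.append((i, j))
    # Rectangular region: the constant bound M is the smaller one.
    for i in range(max(1, split + 1), n + 1):
        for j in range(low, m + 1):
            visited.append((i, j))
    return visited
```

After the split, the first nest can be handled by the triangular method and the second by ordinary rectangular blocking. The iteration order is unchanged, so the split itself raises no safety question.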
two and After and-j am and placement there on both can fol- would result the in four by A(K) section does exist loops come performing a transformation loops, from 20% we ran an A read by the the In section from partial of K can be split space blocking so that where locations space where determine the split subscript expression A(K) and of the one loop and A (II) one loop they point that the access disjoint creates defines the over them scalar on arrays I Figure reof 117 2: Data N for A set it- common the iter- section 1+1 S-1 Space the larger A 1 index loops, explo- splitting, to these this index-set re- locations. separate oil the 1+1S over iterates access that loop, iterates functions program’s ref- and the section k)e index- of the called reveals I to N. Therefore, for sec- source Consider be blocked. constitute the dependence array for means the blocking. I to 1+1 S-1 goes from not This analyzing apply of the or di- exists at II- Stan- as distance accessed true goes from Undefini- by the [WO182]. are to the recurrence However, that II-loop distribution. such the backward by A(K) To allow of two [Car92]. to remove the the N. ation bound function Con- position. ) carried with that is potential 2. The currence ,HIIJ(I,El) (K)*F2(I-K) lower but loop. between abstractions, arrays of blocking, use of A ( II report to A(II) written split- the sink that erence [Car92]. convolution and Figure trian- the is prevented. memory splitting can each program the compiler the to the innermost defined of the eration MAX function of the iteration the below. interchange vectors, every series. DO 10 I = 0,1i3 DO 10 K = MAX(O,I-U2) 10 F3(I) = F3(I)+DT*F1 set loop, the example is a recurrence and ex- an computes there preventing of the outer-loop example, computes within iteration-space loop, produces As in rhomboidal regions the shape patterns be interchanged tions to the case for triangular another lowing time To only for mined complete fortunately, series. 
function resulting to rhomboidal similar times regions. patterns difficulties dependence blocking lower presents the strip must blocked. time the sider be To value. loop dependence cases, it is not that also DO 10 I = 0,1J3 DO 10 K = I, f41H(I+li2,111) 10 F3(I) = F3(I)+DT*FI (K)*F2(I-K) The 1000 triangular M splitting, be the following convolution induction kernel in the of mea- DO 10 I = l,Ii DO 10 II = I, 1+1 S-1 T(II) = A(II) DO IOK= II, Ii A(K) = A(K) + T(II) 10 inner index-set Complex 3.3 separately. nests be a constant any function adjoint each timing re- set of I can be DO 10 I = l,141JJ(M, (lJ-(3)/a) DO1OJ= L,aI+~ [OOP body 10 DO 20 I = HAX(i,lIIIH(lJ, (E-@) /a)+l), DO 20 J = L,M [OOP body 20 ample, over execution is a table For and where blocked. nest we iterated of the Below &io. a >0. region the and regions applied gives The [CCK90]. RS/6000 H) defines region aI+~ zoidal 75% REALS on an IBM surements, exam- and results body /OOp Splitting that and Consider DO 10 I= 1,11 DO 10 J = L, MIH(aI+~, 10 loops seismic To the is loop is guarded necessary 1%-ocedme IndexSetSp [it trix For each transformation-preventing repeat the following or a region steps is created 1. Calcufate the sections the preventing 2. that Intersect until failure may be blocked. of the source the sections 3. If the intersection 4. Set the subscript and union and sink using to the boundary common induction 5. Split 6. Repeat sections and variable. the index steps 10 20 stop. If the section the disjoint 5 if there loop be at this point. original equal cessed to the by the the equation In the boundary source and is solved above K. Splitting sink the of the for the inner example, at this between let point K = sections dependence induction 1+1 S-1 code possible may method the guard into executed. Instead, and solve for be used loop and The 11-loop the can 10 and loop The nest induction method just used mining representation of enough to relate variable values. 
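The effect of splitting the K-loop at the strip boundary in the T(II)/A(K) example can be sketched in Python. This is an illustration under assumptions: the listing is garbled in the source, so the placement of statements after distribution reflects our reading of it. The names `recurrence_point` and `recurrence_blocked` are hypothetical, and indices are 0-based.

```python
def recurrence_point(a):
    """T(II) = A(II); then A(K) = A(K) + T(II) for K = II..N."""
    a = list(a)
    n = len(a)
    t = [0] * n
    for ii in range(n):
        t[ii] = a[ii]
        for k in range(ii, n):
            a[k] = a[k] + t[ii]
    return a


def recurrence_blocked(a, strip):
    """K is split at the end of the strip; the second part is interchanged."""
    a = list(a)
    n = len(a)
    t = [0] * n
    for i in range(0, n, strip):
        end = min(i + strip, n)
        # Part 1: K stays inside the strip, where the recurrence on A lives.
        for ii in range(i, end):
            t[ii] = a[ii]
            for k in range(ii, end):
                a[k] = a[k] + t[ii]
        # Part 2: K beyond the strip touches a section disjoint from the
        # A(II) reads, so K can be moved outside the II loop.
        for k in range(end, n):
            for ii in range(i, end):
                a[k] = a[k] + t[ii]
    return a
```

The second part reads only `t` within the strip and writes `a[k]` outside it, which is why the recurrence no longer blocks the interchange there.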
solving the only that to record this of linear previous to be inspected KC is initialized KLB is the the be the to index array we have notation the each false code patterns, the effects values the The idea of an outer-loop is true and inner-loop the nest is code loop is inserted bounds On the true the within variable following for branch code of the loop bound the information of the is inserted, to false, to be inspected of an executed range. branch of the inspected to store guard, the upper the fol- bound of true on range. (FLAG) THEE KUB(KC) = K-1 FLAG = .FALSE. EKDI F allows performance of the shapes of control It innermost equations. to iteration-space be considered. of the am [AK87]. values. is inserted executed that last may flow be the and dependence on blocking case that the the IF-inspection must After an inner of the guard of the loop, upper loop inserting transformed 118 value iteration to store also IF-conversion IF-inspection, IF 1]. IF-inspection In addition out in cho- [HK9 representation On lowing Note 4 of to 1, FLAG is initialized lower re- ( . HOT. FLAG) THEE KC= KC+ I KLB(KC) = K FLAG = . TRUE. EMDIF ex- must the guard would in instructions the guard to be transformed. where blocking. that enhance those this called Then, is to replicate due to a decrease guard for which IF-inspection, IndexSetSplit upon 90 array for and IF precision representation to greatly the depends in in the correctness However, the is executed. To effect corresponding partial The the IF-statement and the executed dependence method locations 5, we show systems in when by not loop an increase at run-time K is the induction applicable in to handle to Fortran IndexSetSplat A(K) sections. The sen is equivalent maybe the on 10. of IndexSetSplzt effectiveness In Section and state- be completed variables 3 presents strip around can in the preventing A(II) Figure after The described induction (e.g., ample). 
distributed statement involved different positions be to B would executed to preserve unroll-and-j variable loop executed blocking surrounding the references have now 20 and unroll-and-jam were a combination allow is to inspect inner that techniques, to keep still the loop ments unma- routine checked be unsafely and sparse-matrix can yields be degradation and and and references never iteration. parallelism DO 10 II = 1,1+1 S-1 T(II) = A(II) DO 20 K = I ,1+1 S-1 A(K) = A(K) + T(II) DO 10 K = I+IS, M A(K) = A(K) + T(II) 10 prevent following the BLAS the innermost in a performance loop-level variable. ignored statements ac- DO 10 I = I,M 20 from K-1oop, would it for each unrolled sult set the code. One move IndexSetSplit that Therefore, in the are multiple on the introduced unrolled 3: Procedure is take were performed guard. boundaries. Figure that IF-statement were and solve for the inner-loop set of the inner 4 and then of the larger between code to Consider = I,E DO 20K= I,E IF (B(K, J) .Eq. 1.0) GOTO20 DO IO 1=1, M C(I, J) = C(I, J) + A(I, K) * B(K, J) COHTI!WE of symbolic are equal expression IF-statement [DDSvdV91]. dependence. and union an DO 20J information. equal multiply SGEMM dependence by computation. bound could requiring of the be a test last range code, the of FLAG after the body. the inspection is distributed around loop the inspection to be code the FLAG = . FALSE. DO 10J = l,H KC=O DD 20K= l,I! IF (B(J, K) .ME. 1.0) THEM IF ( . IJOT. FLAG) THEM KC= KC+l KLB(KC) = K FLAG = . TRUE. EllDIF ELSE IF (FLAG) THEN KUB(KC) = K-1 FLAG = FALSE. EEDIF EEDIF COHTIFNJE IF (FLAG) THEE KUB(KC) = M FLAG = . FALSE. EIVDIF DO 10 Kli = l,KC DO 10 K = KLB(K19) ,KUB(KIJ) DO 10 I = l,FJ C(I, J) = C(I, J) + A(I, K) * B(K, J) 20 10 LAPACK at To its adapt 4: Matrix Multiply After the independent form To it independent ability systems of linear veloped in Section block can Our shows dexSetSplit, loop where ated. The is shown new the cuted result 4. 
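The idea behind IF-inspection for the guarded matrix multiply can be sketched in Python. The guard is assumed to skip zero entries of B, since the comparison constant is garbled in the source. `inspect_ranges` plays the role of the KLB/KUB range arrays, and all function names are hypothetical.

```python
def inspect_ranges(flags):
    """Return inclusive (lower, upper) index ranges where flags are True."""
    ranges = []
    start = None
    for k, on in enumerate(flags):
        if on and start is None:
            start = k                       # a new executed range begins
        elif not on and start is not None:
            ranges.append((start, k - 1))   # the current range just ended
            start = None
    if start is not None:                   # a range still open after the loop
        ranges.append((start, len(flags) - 1))
    return ranges


def guarded_matmul(a, b, c):
    """Original form: the guard sits between the K and I loops."""
    c = [row[:] for row in c]
    for j in range(len(b[0])):
        for k in range(len(b)):
            if b[k][j] == 0:
                continue
            for i in range(len(a)):
                c[i][j] += a[i][k] * b[k][j]
    return c


def if_inspected_matmul(a, b, c):
    """Inspect the guard first, then run guard-free loops over the ranges."""
    c = [row[:] for row in c]
    for j in range(len(b[0])):
        ranges = inspect_ranges([b[k][j] != 0 for k in range(len(b))])
        for lower, upper in ranges:
            for k in range(lower, upper + 1):
                for i in range(len(a)):
                    c[i][j] += a[i][k] * b[k][j]
    return c
```

Once the guard has been removed from the nest, the K and I loops over each range form an ordinary rectangular nest to which unroll-and-jam can be applied.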
K-loop the than loop operations, transformations original nest for trix ble below, UJ is the is the are cost caused better f’reguencg result on ran arrays into is exe- slight in- show this, it on an IBM often we the B (K, J) loop unroll-and-jam ta- i and = 3 .33s 3.84s 2.25s 1.48 3.08s 3.71s 2.13s 1.45 The goal LINPACK have of of linear LAPACK and better is EISPACK cache replace with performance. the block algorithm, without QR de- pivoting of LU obtained triangular matrix. into by matrix This multiplying decomposition two matrices, of elementary lower L= Mk-l. as follows . . M;l, (1 is an can A by matrix triangular Mk. and decomposition the ries IF- Using this sition without 20 algorithms algorithms 10 Unfortunately, 119 equation, The computes shown equations to lower triangular U= Speedup UJ+IF 10% Systems that U, such that 1, UJ+lF after 2.5% 5 in UJ The perfor- rotations. is a form L and derived. Original with blockable. matrices, a sewhere [Ste73], .. MIA after inspection. Frequency is about the memory A is decomposed the matrix be RS/6000 Iri elimination where upper ma- unroll-and-jam the innermost In- pivoting QR decomposition is not decomposition L is a unit new IF-inspected how piv- method information non-LAPACK its without the partial and from algorithm. A = LU, can be the of REALS. shows of performing the To our Gaussian the ranges. loop large, LU 5,1 is executed transforming of performing the guard result over those inner locality. 
and on 300x300 is cre- by IF-inspecticm after data example 540 moving loop unroll-and-jam multiply model the point with if best known the in LAPACK, to improve Givens de- multiply executes loop how with techniques using and blocksolving is “blockable” LU decomposition Householder of a fourth the for iteration on matrix within which the executed the guarded counteracted performed over was KN-loop executes over crease in run-time more loop The where ranges in executes of IF-inspection of ranges the If that innermost in Figure number and nest the the derive algorithm commutative composition a new using algorithm IndezSetSplit also shows can a machine- examines LU decomposition using technology algorithms one found is a blockable mance and section An that the in machine-independent study study space the pro- handling LAPACK automatically algorithm, oting compiler equations 3. that a machine- details. express this hand to obtain in compiler of LAPACK’S corresponding IF-inspection to form, of three a compiler the whether possible another, we believe kernel optimization investigate make to subroutine each with ~er- independence. machine contrast, express machine-specific additional machine-specific LAPACK In performance. 
should one perform on each grammers achieved of machine from must optimization high have expense kernels a programmer blockable Figure desimers formance point Mk and below an algorithm pivoting after using algorithm, statement strip for Gaussian where 10 applies LU decompo- elimination statement Mk is 20 to A, is mining, DO 10 K = l, M-l, KS DO 10 KK = K, K+KS-l u DO 20 I = I(K+l, A(I, KK) = A(I, KK) / A(KK, KK) DO 10 J = KK+l, W DO 10 I = KK+I, E A(I, J) = A(I, J) - A(I, KK) * A(KK, J) DO 10 K = l, M-I, KS DO 20 KK = K, HIM(K+KS-l, M-1) DO 30 I = KK+l, H 30 A(I, KK) = A(I, KK)/A(KK, KK) DO 20 J = KK+I >K+KS-I DO 20 I = KK+l,li 20 A(I, J) = A(I, J) - A(I, KK) * A(KK, J) DO 10 J = K+KS, E DO 10 I = K+l,ll DO 10 KK = K, MIM(MIH(K+KS-l, M-1) ,1-1) 10 A(I, J) = A(I, J) - A(I, KK) * A(KK, J) N Eiizzi 10 K+KS-l 20 K K N Figure Figure 5: Sections of A in LU locality, Unfortunately, this performance rithm the block completed version, We show To improve a number applies array its cache a block algo- of updates them on only how To complete To a portion to attain to to the table below, is we performed “2” attain algorithm the the loop that loop there blocking KK-1ooP nest that that statement surrounds prevents 20 between the 5 shows entire unless accessed by the section the space to create tion loop of J can loop where the A ( 1, J ) are disjoint in statement 20. only from executes over of the Now, triangular this in the point, interchange innermost the best tained. Therefore, Not does this only LU block can position block on REALS. could version as well, 10, the the not itera- have with been similar decomposition algorithm exhibit partial pivoting has is discover IndexSetSplit, with algorithm fit the form partial a new sections Spltt to the algorithm the is added handled the without same cannot of code 7 for In recurrence the exists after LU partial that IndexSetSplit. in Figure for using be said (see Figure pivoting). 
by potential pivoting does Consider IndexSet- applying 7. the 6). been At The ob- reference reference blockable. better can DO 10 KK = K, K+KS-l D030 J= I,M TAU = A(KK, J) 25 A(KK, J) = A(IHAX, J) 30 A(I?4AX, J) = TAU DO 10 J = KK+KS,H DO 10 I = KK+l, Ii 10 A(I, J) = A(I, J) - A(I, KK) * A(KK, J) to put (see Figure compiler pivoting the following by by A ( 1, KK) be used the partial pivoting below. algorithm with in LU decomposition decomposition DO 10 KK = K, K+KS-l DO 10 J = K+KS,lJ DO 10 I = KK+I, H 10 A(I, J) = A(I, J) - A(I, KK) * A(KK, J) KK-loop transformations Sorensen ver- was run double-precision decomposition algorithm when itera- accessed accessed is shown blocking J D K+ KS- 1 locations those loop of the point LU Although section 10. Since statement the memory This experiment scalar the A accessed The a portion at that The and producing ~ is 20 is a subset surrounding be split a new space array KK-loop. statement for In addition, be- splitting by A ( 1, J ) in statement exists of the index-set in code, 54o using final 6. and by the KK-1OOP index-set of the of the A (1, KK) accessed recurrence tion the sections execution the to the 5.2 Figure RS/6000 that in Figure In version unroll-and-jam blocked point However, done. for to as “2+”. referred the by Sorensen. A ( 1, KK ) in statement 10 carried distribution algorithm decompo- with to the Sorensen trapezoidal sion to LU version J-loop parallel. improvements. around 10 before position. to the to our applied by hand performance refers replacement Note de- around and statement J) in statement LU be distributed to the innermost is a recurrence 20 and A(I, of strip-mined must surrounds ing interchanged refers its a hand-coded “1” as the 10 can be made IndexSetSplit compared and of the inner the block and an IBM the composition, Decomposition parallelism statement We applied sition IndexSeLSplit. using LU it also has increased surrounds algorithm all at once [DDSvdV91]. 
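The relationship between the point and block forms of LU decomposition without pivoting can be checked with a small sketch. It follows our reading of the Figure 6 loop structure, with an added `min()` on the in-block column bound so the last block stays inside the matrix. It uses Python with exact `Fraction` arithmetic, and the names `point_lu` and `block_lu` are hypothetical.

```python
from fractions import Fraction


def point_lu(matrix):
    """In-place style LU (no pivoting): L below the diagonal, U on and above."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] = a[i][j] - a[i][k] * a[k][j]
    return a


def block_lu(matrix, ks):
    """Same factorization, with updates to trailing columns delayed per block."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    for k in range(0, n - 1, ks):
        kend = min(k + ks, n - 1)   # pivot steps in this block: k .. kend-1
        jend = min(k + ks, n)       # columns in this block:     k .. jend-1
        for kk in range(k, kend):
            for i in range(kk + 1, n):
                a[i][kk] = a[i][kk] / a[kk][kk]
            for j in range(kk + 1, jend):          # update only in-block columns
                for i in range(kk + 1, n):
                    a[i][j] = a[i][j] - a[i][kk] * a[kk][j]
        for j in range(jend, n):                   # delayed trailing updates
            for i in range(k + 1, n):
                for kk in range(k, min(kend, i)):  # triangular bound: kk < i
                    a[i][j] = a[i][j] - a[i][kk] * a[kk][j]
    return a
```

Exact rational arithmetic is used here only so that the two versions can be compared for equality. The sketch assumes that no zero pivot is encountered.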
that cache strip-mine-and-interchange for the K-loop nest, the and of the poor developed groups A together portion best loop have essentially the matrix exhibits matrices. scientists that a block algorithm on large performance, 6: Block Decomposition sections. data loops 120 to A ( IMAX, to A (1, would J ) in statement J ) in statement Distributing convert the the true KK-loop 25 and 10 access around dependence from the the same both J- A (I, J) nal DO 10 K = I, U-l c c .,. c pick pivot --- partial I14AX are identical. pivoting 20 10 A(I, Figure 7: LU to A ( IMAX, direction. The This of a block is not rules would ing recurrence derived block level parallelism algorithm by sion, row point and block concept for row interchanges to understand values to the sequence of values location. the is different these block of values from the memory that that point 5.3 The key tion of the both triangular a the Gaussian are com- array. alone dependence It columns, is used be written re- 8: Block where a location Partial on REALS. is the multiplicalower zeros below each of matrices that have to solve class, a system having of lin- orthonormal decomposition independent [Ste73]. columns, then A can in the form the Q has orthonormal is upper fi- One triangular class of matrices that of the form 1 — 2VVT. algorithm column. QR algorithm fits the the I and = R elements, properties of Q is transformations Householder QR elementary below detailed consists V~ = reflector k= for values For a more and QQT diagonal the to obtainAk+~ Vk eliminates the kth for applying toAk positive or Householder reflectors point columns, with elementary The K+KS ,Ii =K+l ,E KK = K, IIIM(HIII(K+KS-l, E-l) ,1-1) J) = A(I, J) - A(I, KK) * A(KK, J) with was run A = QR, I–2VkV~ LU in QR uniquely the 1,. ... n– 1. diagonal in discussion of the of Vk, see Stewart the computation [Ste73]. 
Although Figure unroll-and-jam reveals decomposition, although class such a particular Each 10 where 8 and of elementary introduce be used One If A has linearly is not algorithm DO 10 J = DO 10 I DO 10 A(I, after below, Figure experiment elimination Any can of iteratively point in To compiler double-precision that element. updates version, however, the table A by a series matrix property through is, QR matrices diagonal may DO 10 K = l, E-l, KS DO 20 KK = K, HIE(K+KS-l,M-1) c C ... c to equations. of LU so it decompo- be profitable?” algorithm Householder computa- before locations. pass row ver- this pass through version each ear A data code are al- compilers, given 54o using split in LU question the This wholeinvolv- matching of making algorithm that Rs/6000 The dependence this. to is that of the The knowledge replacement. understand (rows) to refers scalar index-set available consider the and recurrence of pattern profitability sophisticated, the 6. bllock occur in in the matching is fol- computations and whole-column sufficient In but Data 1]. same performed compiler operations. sequence may be recognized. increase com- loop- in Figure The can are upgrade permutations the situation the to under- pattern 25 of the to believe to To that in commercially refers with columns to install Forms “Will an IBM whole row sition “1 +“ be the done and prevent- interchange In ready “l” can still in which the 10 and be ignored. more [D D$lvdV9 algorithm are locations mutative maps 8) row interchanges the the statements dicase. have to prove could de- can be equipped on permutations. both updates see the potential existence the increased is updated. in different key the ignores update versions, reverse of data the KK-loop each updates) the non-pivoting independently. 
column (column lation that a whole-column tions in preclude to the in the version, multiple occur to one would is reasonable Pivoting of a dependence also exhibits is updated particular * A(K, J) Partial (see Figure found point element seem and distributes This K) preservation algorithm mathematically lowed the compiler, ing - A(I, reversing row to recognize with similar a block the for the with decomposition blockable. operations mutable an understanding LU a compiler that column an antidependence analogue However, J) Decomposition prohibit rection. = A(I, J ) into pendence In J) stand Without operations, Fortunately, Do30J=l, rJ TAU = A(K, J) A(K, J) = A(IMAX, J) A(IHAX, J) = TAU DO 20 I = K+l>II A(I, K) = A(I, K) / A(K, K) DO 10 J = K+I, IJ DO 10 I = K+l, IJ 25 30 the values of commutative Pivoting position, tion 121 pivoting the of the best original is not block necessary algorithm algorithm. The for QR decom- is not an aggrega- block application of a number computation of elementary and storage original algorithm the step first then involves does [DDSvdV91]. not both exist in DO 10L= I,H DO 10 J = L+I, H IF (A(J, L) .Eq. 0.0) GOTO 10 DEli = DSQRT(A(L, L)* A(L, L) + A(J, L)* A(J,L)) C = A(L, L)/DEM S = A(J, L)/DEli DO 10 K = L,E Al = A(L, K) A2 = A(J, K) A(L, K) = C*AI + S*A2 A(J, K) = -S*A1 + C*A2 COMTIMUE 10 the Given is to factor (:::) and reflectors that =(::: :::)(R~’) solve Figure 9: QR Decomposition with Givens Rotations where Q= ~_–2-;v#(l -Qv2v;)(1-2v~ give v;) .— The difficulty putation for the compiler of I – 2VTVT computation that algorithm. block stride-one make comes because did not To illustrate, in the it involves exist in consider the the com- space original and to the innermost would necessitate the point case where the size is 2. 
5.4 QR Decomposition with Givens Rotations

Another form of orthogonal matrix that can be used in QR decomposition is the Givens rotation matrix [Sew90]. We know of no block algorithm for QR decomposition with Givens rotations, so instead we show that IndexSetSplit and IF-inspection have wider applicability. Consider the Fortran code for Givens QR shown in Figure 9 [Sew90].

      DO 10 L = 1, N
        DO 10 J = L+1, N
          IF (A(J,L) .EQ. 0.0) GOTO 10
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C = A(L,L)/DEN
          S = A(J,L)/DEN
          DO 10 K = L, N
            A1 = A(L,K)
            A2 = A(J,K)
            A(L,K) = C*A1 + S*A2
            A(J,K) = -S*A1 + C*A2
 10   CONTINUE

Figure 9: QR Decomposition with Givens Rotations

The references to A in the innermost loop have a long stride between successive accesses, resulting in poor cache performance. Moving the J-loop to the innermost position would give stride-one access to the references to A(J,K) and make the references to A(L,K) invariant with respect to the innermost loop. In this case, the interchange requires distribution of the J-loop around the IF-block and the K-loop. Examining the dependences reveals a recurrence, consisting of a true dependence and an antidependence between the definition of A(L,K) and the use of A(L,L), that prevents distribution. However, the recurrence exists only for the element A(L,L). Index-set splitting of the K-loop at L, together with scalar expansion of C and S, allows distribution, IF-inspection and interchange of the J-loop and the K-loop. The result is shown in Figure 10.

      DO 10 L = 1, N
        DO 20 J = L+1, N
          IF (A(J,L) .EQ. 0.0) GOTO 20
          DEN = DSQRT(A(L,L)*A(L,L) + A(J,L)*A(J,L))
          C(J) = A(L,L)/DEN
          S(J) = A(J,L)/DEN
          A1 = A(L,L)
          A2 = A(J,L)
          A(L,L) = C(J)*A1 + S(J)*A2
          A(J,L) = -S(J)*A1 + C(J)*A2
C
C         IF-Inspection Code
C
 20     CONTINUE
        DO 10 K = L+1, N
          DO 10 JN = 1, JC
            DO 10 J = JLB(JN), JUB(JN)
              A1 = A(L,K)
              A2 = A(J,K)
              A(L,K) = C(J)*A1 + S(J)*A2
 10           A(J,K) = -S(J)*A1 + C(J)*A2

Figure 10: Optimized Givens QR

Below is a table of the performance of the point and optimized versions of Givens QR.

Size         Point    Optimized   Speedup
300 x 300    6.86s    3.37s       2.04
500 x 500    84.0s    15.3s       5.49
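The equivalence of the point and optimized Givens codes can be checked with a small Python sketch. It is a transliteration of Figures 9 and 10 under stated assumptions, not the paper's code: the matrix is square, the inspector records runs of consecutive J values for which the guard is true (playing the role of JLB, JUB and JC), and the names `givens_point`, `givens_inspected` and `runs` are illustrative. Python lists are not column-major, so the sketch shows only that the two loop orders compute the same result, not the cache behavior.

```python
import math


def givens_point(a):
    """Point Givens QR (Figure 9 loop order), in place."""
    n = len(a)
    for l in range(n):
        for j in range(l + 1, n):
            if a[j][l] == 0.0:
                continue
            den = math.sqrt(a[l][l] * a[l][l] + a[j][l] * a[j][l])
            c = a[l][l] / den
            s = a[j][l] / den
            for k in range(l, n):
                a1, a2 = a[l][k], a[j][k]
                a[l][k] = c * a1 + s * a2
                a[j][k] = -s * a1 + c * a2


def givens_inspected(a):
    """Givens QR after index-set splitting at K = L, scalar expansion of
    C and S, IF-inspection, and interchange of the J and K loops."""
    n = len(a)
    c = [0.0] * n
    s = [0.0] * n
    for l in range(n):
        runs = []                      # inspector output: [lower, upper] per run
        for j in range(l + 1, n):
            if a[j][l] == 0.0:
                continue
            den = math.sqrt(a[l][l] * a[l][l] + a[j][l] * a[j][l])
            c[j] = a[l][l] / den
            s[j] = a[j][l] / den
            a1, a2 = a[l][l], a[j][l]  # peeled iteration K = L
            a[l][l] = c[j] * a1 + s[j] * a2
            a[j][l] = -s[j] * a1 + c[j] * a2
            if runs and runs[-1][1] == j - 1:
                runs[-1][1] = j        # extend the current run
            else:
                runs.append([j, j])    # start a new run
        for k in range(l + 1, n):      # executor: J loops are now innermost
            for lo, hi in runs:
                for j in range(lo, hi + 1):
                    a1, a2 = a[l][k], a[j][k]
                    a[l][k] = c[j] * a1 + s[j] * a2
                    a[j][k] = -s[j] * a1 + c[j] * a2
```

The guard depends only on column L, which no later rotation in the same L iteration changes for rows below J, so the inspector can evaluate it once and the executor can replay the recorded runs for every K.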
6 Language Extensions

The examination of QR decomposition with Householder transformations shows that some block algorithms cannot be derived from their corresponding point algorithms by a compiler. In order to maintain our goal of machine-independent coding styles, we need to allow the expression of block algorithms in a machine-independent form. Specifically, the compiler needs to be directed to pick the machine-dependent blocking factor for an algorithm automatically.

To this end, we present a preliminary proposal for two looping constructs to guide the compiler's choice of blocking factor. These constructs are BLOCK DO and IN DO. BLOCK DO specifies a DO-loop whose blocking factor is chosen by the compiler. IN DO specifies a DO-loop that executes over the region defined by the corresponding BLOCK DO, and it guides the compiler to the regions that it should analyze to determine the blocking factor. The bounds of an IN DO are optional. If they are not expressed, they are assumed to start at the first value in the specified block and end at the last value. To allow indexing within a block region, LAST returns the last index value in a block. For example, LAST(K) returns the last index value of the current block of the K-loop. If LU decomposition were not a blockable algorithm, it could be coded as in Figure 11 to achieve machine independence.

      BLOCK DO K = 1, N-1
        IN K DO KK
          DO I = KK+1, N
            A(I,KK) = A(I,KK)/A(KK,KK)
          ENDDO
          DO J = KK+1, LAST(K)
            DO I = KK+1, N
              A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
            ENDDO
          ENDDO
        ENDDO
        DO J = LAST(K)+1, N
          DO I = K+1, N
            IN K DO KK = K, MIN(LAST(K),I-1)
              A(I,J) = A(I,J) - A(I,KK) * A(KK,J)
            ENDDO
          ENDDO
        ENDDO
      ENDDO

Figure 11: Block LU in Extended Fortran

The principal advantage of the extensions is that the programmer can express a non-blockable algorithm in a natural block form while leaving the machine-dependent details, namely the choice of blocking factor, to the compiler. In the case of LAPACK, the language extensions could be used, when necessary, to code the algorithms for a source-level library that is independent of the choice of machine. Then, compiler technology could be used to port the library from machine to machine and still retain good performance. By doing so, the machine-dependency problem of LAPACK would be removed, making it more accessible to new architectures.
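One way to read the semantics of BLOCK DO, IN DO and LAST is to treat the blocking factor as a value supplied from outside the algorithm. The Python sketch below is an interpretation under that assumption, not a definition of the proposed constructs: the parameter `bs` stands in for the factor the compiler would choose, `last` plays the role of LAST(K), and the function names are hypothetical. It transliterates the loop structure of Figure 11 and compares it with point LU decomposition without pivoting.

```python
def lu_point_nopivot(a):
    """Point LU decomposition without pivoting, in place."""
    n = len(a)
    for k in range(n - 1):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]


def lu_block_do(a, bs):
    """Block LU in the shape of Figure 11; bs is the blocking factor
    (assumed here to be passed in rather than chosen by a compiler)."""
    n = len(a)
    for k in range(0, n - 1, bs):                # BLOCK DO K = 1, N-1
        last = min(k + bs - 1, n - 2)            # LAST(K)
        for kk in range(k, last + 1):            # IN K DO KK
            for i in range(kk + 1, n):
                a[i][kk] /= a[kk][kk]
            for j in range(kk + 1, last + 1):    # columns inside the block
                for i in range(kk + 1, n):
                    a[i][j] -= a[i][kk] * a[kk][j]
        for j in range(last + 1, n):             # columns after the block
            for i in range(k + 1, n):
                for kk in range(k, min(last, i - 1) + 1):  # IN K DO KK = K, MIN(LAST(K),I-1)
                    a[i][j] -= a[i][kk] * a[kk][j]
```

Because the blocking factor appears only as `bs`, the same source gives the same factors for any block size, which is the machine independence the extensions are meant to provide.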
7 Previous Work

Wolfe has done a significant amount of work on iteration-space blocking [Wol86, Wol87, Wol89]. In particular, he discusses strip-mine-and-interchange for triangular-shaped iteration spaces. He shows by example the use of index-set splitting to handle a trapezoidal region, but he does not present general compiler algorithms for it, nor does he extend the technique to unroll-and-jam. We take this work a step further by developing general compiler algorithms that handle triangular and trapezoidal regions.

Irigoin and Triolet describe a dependence abstraction called a dependence cone that is used for blocking iteration spaces [IT88]. However, their technique does not work on non-perfectly nested loops, nor does it handle partially blockable loops, which are common in linear algebra codes.

Wolf and Lam present a framework for applying blocking and ordering transformations to perfectly nested loops for memory performance and parallelism [WL91]. Their framework is not applicable to non-perfectly nested loops, nor does it include index-set splitting.

8 Summary

We have set out to determine whether a compiler can automatically restructure computations well enough to avoid the need for hand blocking. To that end, we have examined a collection of programs in LAPACK for which we were able to examine both the block and the corresponding point algorithm. For each of these programs, we determined whether a plausible compiler technology could succeed in obtaining the block version from the point algorithm.

The results of this study are encouraging. We have found that many of the problems introduced by complex dependence patterns can be overcome by the use of the transformation known as "index-set splitting". In many cases, index-set splitting yields codes that exhibit performance at least as good as the best block algorithms produced by the LAPACK developers. In addition, we have shown that knowledge about which operations commute can enable a compiler to succeed in blocking codes that could not be blocked by any compiler based strictly on dependence analysis.

Unfortunately, our success is not universal. For QR decomposition, the block algorithm has no corresponding point algorithm, because the block algorithm requires additional computation to compensate for block sizes larger than one. If automatic blocking is to succeed for such algorithms, a machine-independent expression of block algorithms, such as that proposed in Section 6, must be developed.

Our goal has been to find compiler techniques that make it possible for the user to express algorithms naturally in a machine-independent form with the expectation of good memory-hierarchy performance. We have demonstrated that there exist readily implementable methods that can automatically block linear algebra codes. Currently, we have implemented IndexSetSplit and the handling of triangular, trapezoidal and rhomboidal regions in an experimental system. In the future, we plan to add knowledge of commutativity, and we will apply the resulting system to a collection of scientific programs in order to better understand the breadth of coverage of these methods. In addition, we will continue to investigate language extensions that make it possible to express block methods in a machine-independent style.

Given that future machine designs are certain to have increasingly complex memory hierarchies, compilers will need to adopt increasingly sophisticated memory-management strategies so that programmers can remain free to concentrate on program logic. We have established that, with a few key additional methods, a compiler can do a good job of memory-hierarchy management on many linear algebra algorithms. The results of this paper represent a significant step toward that goal.
Acknowledgments

Preston Briggs, Uli Kremer, Kathryn McKinley and Rebecca made many helpful suggestions during the preparation of this document. Danny Sorensen gave us guidance in understanding the point and block versions of many of these algorithms. To all of these people go our heartfelt thanks.

References

[AK87] J.R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.

[Car92] S. Carr. Memory-Hierarchy Management. PhD thesis, Rice University, Department of Computer Science, 1992.

[CCK90] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation, White Plains, NY, June 1990.

[CK87] D. Callahan and K. Kennedy. Analysis of interprocedural side effects in a parallel programming environment. In Proceedings of the First International Conference on Supercomputing, Athens, Greece. Springer-Verlag, 1987.

[CK89] S. Carr and K. Kennedy. Blocking linear algebra codes for memory hierarchies. In Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific Computing, Chicago, IL, December 1989.

[DDSvdV91] J.J. Dongarra, I.S. Duff, D.C. Sorensen, and H.A. van der Vorst. Solving Linear Systems on Vector and Shared-Memory Computers. SIAM, Philadelphia, 1991.

[HK91] P. Havlak and K. Kennedy. An implementation of interprocedural bounded regular section analysis. IEEE Transactions on Parallel and Distributed Systems, 2(3):350-360, July 1991.

[IT88] F. Irigoin and R. Triolet. Supernode partitioning. In Conference Record of the Fifteenth ACM Symposium on the Principles of Programming Languages, pages 319-328, January 1988.

[KKP+81] D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth ACM Symposium on the Principles of Programming Languages, 1981.

[Kuc78] D. Kuck. The Structure of Computers and Computations, Volume 1. John Wiley and Sons, New York, 1978.

[LRW91] M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

[Por89] A.K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[Sew90] G. Sewell. Computational Methods of Linear Algebra. Ellis Horwood, England, 1990.

[Ste73] G.W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.

[WL91] M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, June 1991.

[Wol82] M. Wolfe. Optimizing Supercompilers for Supercomputers. PhD thesis, University of Illinois, October 1982.

[Wol86] M. Wolfe. Advanced loop interchange. In Proceedings of the 1986 International Conference on Parallel Processing, August 1986.

[Wol87] M. Wolfe. Iteration space tiling for memory hierarchies. In Proceedings of the Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.

[Wol89] M. Wolfe. More iteration space tiling. In Proceedings of the Supercomputing '89 Conference, 1989.