Compiler Cache Optimizations for SR11000
Ichiro Kyushima
Hitachi, Ltd., Systems Development Laboratory
2006/10/31

Outline of Talk
- Overview of SR11000 System
- Overview of SR11000 Compiler
- Cache Optimizations on SR11000 Compiler
  - Software Prefetch
  - Loop Distribution
  - Loop Blocking
    - for Single Loop Nest (Loop Tiling)
    - for Multiple Loop Nests (Strip-Mining)
- Summary

Hitachi's HPC Systems & Fortran Compiler
[Timeline chart: peak performance (GFlops, logarithmic scale) vs. year, '82 to '05]
- S-810: first Japanese vector supercomputer
- S-820: single CPU peak performance 3 GFlops
- S-3800: single CPU peak performance 8 GFlops (fastest in the world)
- SR2201: first commercially available distributed-memory parallel processor
- SR8000: first HPC machine with combined vector & scalar processing
- SR11000 Model H1, J1, K1 & K2: single-node peak performance over 100 GFlops with a multi-GHz processor
Fortran compiler's optimizing facilities: Automatic Vectorization, Automatic Pseudo Vectorization, Automatic Parallelization, Optimization for Cache Memory
Programming language specifications: FORTRAN77 (JIS X 3001-1994), Fortran90 (ISO/IEC 1539:1991, JIS X 3001-1:1998), Fortran95 (ISO/IEC 1539-1:1997), ISO/IEC 1539-1:2004

Super-Technical Server SR11000
- High-performance SMP node (134.4 GFlops*)
- POWER Architecture CPU (POWER5+ 2.1 GHz*), 16-way SMP
- High system scalability: max 512 nodes (68.8 TFlops*)
*: SR11000 model K1

Optimizing Compiler Lineup
- Optimizing FORTRAN: FORTRAN77 (ISO 1539:1980), Fortran90 (ISO/IEC 1539:1991), Fortran95 (ISO/IEC 1539-1:1997)
- Optimizing C: C (ISO/IEC 9899)
- Optimizing C++: Standard C++ (ISO/IEC 14882:1998)
Features
- Parallelization for SMP system
  - automatic parallelization
  - user-specified parallelization (Hitachi's own directives, OpenMP)
- Cache Optimizations (today's topic)
- Instruction-level Optimizations (for POWER CPU)

Compiler Structure
- Front Ends: FORTRAN, C, and C++ sources are processed by their respective front ends.
- Common Back End (IL: Intermediate Language)
  - on the source-level IL: loop transformations for parallelization, parallelization for SMP, loop transformations for cache optimization, traditional optimizations
  - on the instruction-level IL: instruction-level optimizations, code generation to object code

Compiler Cache Optimizations
- For a large-scale scientific program to run efficiently on a cache-based machine, effective use of cache memory is the key point.
1. Memory latency hiding
   - cache prefetch (hardware/software)
   - loop distribution (to reduce data streams)
2. Reduction of cache misses
   - loop transformations for improving data locality: loop interchange, outer loop unrolling, loop fusion, loop blocking
     - loop tiling (for a single loop nest)
     - strip-mining (across loop nests)

Hardware Prefetch of POWER
- Cache misses on contiguous lines trigger hardware prefetch.
- Each CPU core can detect at most 8 data streams.
- Problem: when the number of data streams exceeds 8, not all streams are prefetched (see the sketch below).
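The following sketch is not from the original slides; the subroutine name and the choice of exactly ten arrays are assumptions made only for illustration. It shows the kind of loop the problem statement refersns to: a reduction over ten arrays reads ten sequential data streams, more than the eight that the hardware prefetcher can track per core.

      ! Illustrative sketch only: the reduction kernel from the later
      ! "Effectiveness of Software Prefetch" slide, written out for ten
      ! arrays.  Ten arrays read sequentially means ten data streams,
      ! exceeding the eight streams the POWER hardware prefetcher tracks,
      ! so without compiler help some streams are not prefetched.
      subroutine ten_stream_sum(m, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, s)
        implicit none
        integer, intent(in) :: m
        real(8), intent(in), dimension(m) :: a1, a2, a3, a4, a5
        real(8), intent(in), dimension(m) :: a6, a7, a8, a9, a10
        real(8), intent(out) :: s
        integer :: i
        s = 0.0d0
        do i = 1, m
          s = s + a1(i) + a2(i) + a3(i) + a4(i) + a5(i) &
                + a6(i) + a7(i) + a8(i) + a9(i) + a10(i)
        end do
      end subroutine ten_stream_sum

The two solutions on the next slides, software prefetch and loop distribution, target exactly this situation.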
Solution (1) - Software Prefetch
- Insert a prefetch (dcbt) instruction for every data stream in the loop.
- dcbt (data cache block touch): prefetches the cache line specified by its operand into the L1 cache.

  original:
      do i=1,m
        ... A1(i) ...
        ... A2(i) ...
        ...
        ... An(i) ...
      enddo

  with software prefetch (d: prefetch distance):
      do i=1,m
        dcbt(A1(i+d))
        dcbt(A2(i+d))
        ...
        dcbt(An(i+d))
        ... A1(i) ...
        ... A2(i) ...
        ...
        ... An(i) ...
      enddo

- Loop unrolling is also applied to remove redundant dcbt instructions.

Effectiveness of Software Prefetch (1)
  kernel:
      do i=1,m
        S = S + A1(i) + ... + An(i)
      end do
[Chart: relative memory throughput vs. number of data streams (1 to 16), comparing software prefetch and hardware prefetch]

Solution (2) - Loop Distribution
- Splits a loop into multiple loops so that each loop has no more than 8 data streams.

  original:
      do i=1,m
        ... A1(i) ...
        ... A2(i) ...
        ...
        ... An(i) ...
      enddo

  after loop distribution (each loop references at most 8 streams):
      do i=1,m
        ... A1(i) ...
        ...
        ... A8(i) ...
      enddo
      do i=1,m
        ... A9(i) ...
        ...
        ... A16(i) ...
      enddo
      do i=1,m
        ... A17(i) ...
        ...
        ... An(i) ...
      enddo

Loop Tiling
- Improves cache reusability within a loop nest.

  matrix multiplication:
      do k=1,N
        do j=1,N
          do i=1,N
            C(i,k) = C(i,k)+A(i,j)*B(j,k)
          enddo
        enddo
      enddo
  reference range of array A: A(1:N,1:N) at k=1 is reused as A(1:N,1:N) at k=2

  tiling applied:
      do jj=1,N,s
        do ii=1,N,s
          do k=1,N
            do j=jj,min(jj+s-1,N)
              do i=ii,min(ii+s-1,N)
                C(i,k) = C(i,k)+A(i,j)*B(j,k)
              enddo
            enddo
          enddo
        enddo
      enddo
  reference range of array A: the s-by-s tile A(ii:ii+s-1, jj:jj+s-1) at k=1 is reused at k=2
-> Effective reuse of in-cache data

Loop Tiling - Compiler Support
- The compiler selects the target loop nest and the tile size automatically.
- Tiling directives are also available to specify the loop and the tile size.

  example of tiling directives (*soption tiling specifies the target loop; *soption tilesize specifies the tile size):
      *soption tiling
            do k=1,N
      *soption tilesize(100)
            do j=1,N
      *soption tilesize(200)
            do i=1,N
              C(i,k) = C(i,k)+A(i,j)*B(j,k)
            enddo
            enddo
            enddo

  generated code (the jj and ii control loops are generated):
      do jj=1,N,100
        do ii=1,N,200
          do k=1,N
            do j=jj,min(jj+99,N)
              do i=ii,min(ii+199,N)
                C(i,k) = C(i,k)+A(i,j)*B(j,k)
              enddo
            enddo
          enddo
        enddo
      enddo

Effectiveness of Loop Tiling
- Matrix multiplication (double precision), executed on 16 CPUs, for N=3000, N=5000, and N=10000.
[Chart: speedup vs. tile size (8 to 2048); the best tile size and the compiler-selected tile size are marked]

Strip-Mining
- Improves cache reusability between loops (across loop nests).
- Available by user directive only; the *soption stripmine / *soption end stripmine pair specifies the range of loops (a hand-written sketch follows this slide).

  original with directives:
      *soption stripmine(100,1)
            do i=1,M1
              do j=1,N
                ... A(j,i) ...
              enddo
            enddo
            do i=1,M2
              do j=1,N
                ... A(j,i) ...
              enddo
            enddo
      *soption end stripmine

  transformed:
      do jj=1,N,100
        do i=1,M1
          do j=jj,min(jj+99,N)
            ... A(j,i) ...
          enddo
        enddo
        do i=1,M2
          do j=jj,min(jj+99,N)
            ... A(j,i) ...
          enddo
        enddo
      enddo
-> Effective reuse of in-cache data
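Before the NPB example below, here is a self-contained, hand-written Fortran sketch of the transformed code above. It is not from the slides: the subroutine name, the row-sum computation placed in the loop bodies, the array sizes, and the strip width of 100 are assumptions made only so the fragment compiles and runs.

      ! Hand-written sketch of the strip-mined loop pair shown above.
      ! The row-sum work, array bounds (m1, m2 <= m), and the strip
      ! width of 100 are illustrative assumptions.
      subroutine stripmined_sums(a, n, m, m1, m2, rowsum1, rowsum2)
        implicit none
        integer, intent(in) :: n, m, m1, m2
        real(8), intent(in) :: a(n, m)
        real(8), intent(out) :: rowsum1(m1), rowsum2(m2)
        integer :: i, j, jj

        rowsum1 = 0.0d0
        rowsum2 = 0.0d0
        do jj = 1, n, 100                   ! outer strip loop over rows of a
          do i = 1, m1                      ! first original loop nest
            do j = jj, min(jj+99, n)
              rowsum1(i) = rowsum1(i) + a(j, i)
            end do
          end do
          do i = 1, m2                      ! second nest reuses the same
            do j = jj, min(jj+99, n)        ! strip of rows jj..jj+99 while
              rowsum2(i) = rowsum2(i) + a(j, i)   ! it is still in cache
            end do
          end do
        end do
      end subroutine stripmined_sums

Compared with running the two nests back to back over the full j range, each strip of a is brought into cache once and consumed by both nests before moving on.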
Strip-Mining: Example (NPB2.3/SP compute_rhs)

      *soption stripmine(4,3)
            do m = 1, 5
              do k = 0, grid_points(3)-1
                do j = 0, grid_points(2)-1
                  do i = 0, grid_points(1)-1
                    rhs(i,j,k,m) = forcing(i,j,k,m)
                  end do
                end do
              end do
            end do
            :
            (about 300 lines, 12 loops)
            :
            do k = 1, grid_points(3)-2
              do j = 1, grid_points(2)-2
                do i = 1, grid_points(1)-2
                  wijk = ws(i,j,k)
                  wp1  = ws(i,j,k+1)
                  wm1  = ws(i,j,k-1)
                  rhs(i,j,k,1) = rhs(i,j,k,1) + dz1tz1 *
     >              (u(i,j,k+1,1) - 2.0d0*u(i,j,k,1) + u(i,j,k-1,1)) -
     >              tz2 * (u(i,j,k+1,4) - u(i,j,k-1,4))
                  ...(snip)...
                end do
              end do
            end do
      *soption end stripmine

[Chart: relative performance of compute_rhs; strip-mining achieves 1.23 relative to the original]

Summary
Cache Optimizations on SR11000 Compiler
- Optimizations for prefetch
  - hardware prefetch can detect at most 8 data streams
  - software prefetch complements hardware prefetch
  - loop distribution is applied to reduce the number of data streams
- Loop tiling
  - improves reusability of cached data within one loop nest (see the sketch below)
  - the target loop and tile size are selected by the compiler
  - user tuning is possible by directives
- Strip-mining
  - improves reusability of cached data across loop nests
  - the user specifies the range and target loop by directives
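As a closing illustration (not part of the original slides), the tiled matrix-multiplication kernel from the loop-tiling slides can be written by hand as the following self-contained subroutine. The subroutine name and the idea of passing the tile size s as an argument are assumptions; s plays the role of the *soption tilesize value.

      ! Hand-written equivalent of the tiled matrix multiplication shown
      ! in the loop-tiling slides; the tile size s is a tuning parameter
      ! supplied by the caller (assumption for this sketch).
      subroutine tiled_matmul(a, b, c, n, s)
        implicit none
        integer, intent(in) :: n, s
        real(8), intent(in) :: a(n, n), b(n, n)
        real(8), intent(inout) :: c(n, n)
        integer :: i, j, k, ii, jj

        do jj = 1, n, s
          do ii = 1, n, s
            do k = 1, n
              do j = jj, min(jj+s-1, n)
                do i = ii, min(ii+s-1, n)
                  ! only the s-by-s tile a(ii:ii+s-1, jj:jj+s-1) is touched
                  ! inside the k loop, so it stays in cache and is reused
                  ! for every k = 1..n
                  c(i, k) = c(i, k) + a(i, j) * b(j, k)
                end do
              end do
            end do
          end do
        end do
      end subroutine tiled_matmul

It computes the same C as the untiled triple loop; only the traversal order changes, which is what keeps the tile of A resident in cache across the k iterations.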