Compiler Cache Optimizations for SR11000
Ichiro Kyushima
Hitachi, Ltd., Systems Development Laboratory
2006/10/31
Outline of Talk
- Overview of SR11000 System
- Overview of SR11000 Compiler
- Cache Optimizations on SR11000 Compiler
  - Software Prefetch
  - Loop Distribution
  - Loop Blocking
    - for Single Loop Nest (Loop Tiling)
    - for Multiple Loop Nests (Strip-Mining)
- Summary

Copyright(c) Hitachi, Ltd. 2006. All rights reserved.
Hitachi's HPC Systems & Fortran Compiler

[Chart: peak performance (GFlops, log scale from 1 to 100,000) of Hitachi HPC systems, 1982-2005]
- S-810: first Japanese vector supercomputer
- S-820: single CPU peak performance 3 GFlops
- S-3800: single CPU peak performance 8 GFlops (fastest in the world)
- SR2201: first commercially available distributed memory parallel processor
- SR8000: first HPC machine with combined vector & scalar processing
- SR11000 Model H1, J1, K1 & K2: single node peak performance over 100 GFlops with multi-GHz processor

[Timeline: Fortran compiler's optimizing facilities (Automatic Vectorization, Automatic Pseudo Vectorization, Automatic Parallelization, Optimization for Cache Memory) and programming language specifications (Fortran77, Fortran90, Fortran95; JIS X 3001-1994, JIS X 3001-1:1998; ISO/IEC 1539:1991, ISO/IEC 1539-1:1997, ISO/IEC 1539-1:2004)]
Super-Technical Server SR11000
- High-performance SMP node (134.4 GFlops*)
  - POWER Architecture CPU (POWER5+ 2.1 GHz*)
  - 16-way SMP
- High system scalability
  - max 512 nodes (68.8 TFlops*)
*: SR11000 model K1
Optimizing Compiler Lineup
- Optimizing FORTRAN
  - FORTRAN77 (ISO 1539:1980)
  - Fortran90 (ISO/IEC 1539:1991)
  - Fortran95 (ISO/IEC 1539-1:1997)
- Optimizing C
  - C (ISO/IEC 9899)
- Optimizing Standard C++
  - C++ (ISO/IEC 14882:1998)

Features:
- Parallelization for SMP system
  - automatic parallelization
  - user-specified parallelization (Hitachi's own directives, OpenMP)
- Cache Optimizations (today's topic)
- Instruction-level Optimizations (for POWER CPU)
Compiler Structure
- Front ends: FORTRAN, C, and C++ sources are each processed by their own front end, which feeds a common back end.
- Common back end (IL: Intermediate Language):
  - on source-level IL: Loop Transformations for Parallelization, Parallelization for SMP, Loop Transformations for Cache Optimization, Traditional Optimizations
  - on instruction-level IL: Instruction-Level Optimizations, Code Generation → object code
Compiler Cache Optimizations
For large-scale scientific programs to run efficiently on cache-based machines, effective use of cache memory is the key.
1. Memory latency hiding
  - cache prefetch (hardware/software)
  - loop distribution (to reduce data streams)
2. Reduction of cache misses
  - loop transformations for improving data locality
    - loop interchange
    - outer loop unrolling
    - loop fusion
    - loop blocking
      - loop tiling (for single loop nest)
      - strip-mining (across loop nests)
Hardware Prefetch of POWER
- Cache misses on contiguous lines trigger hardware prefetch.
- Each CPU core can detect at most 8 data streams.
Problem: when a loop has more than 8 data streams, not all streams are prefetched.
Solution (1) - Software Prefetch
- Insert a prefetch (dcbt) instruction for each data stream in a loop.
  - dcbt (data cache block touch): prefetch the cache line specified by its operand into the L1 cache.

Before:
      do i=1,m
        … A1(i) …
        … A2(i) …
        …
        … An(i) …
      enddo

After:
      do i=1,m
        dcbt(A1(i+d))
        dcbt(A2(i+d))
        …
        dcbt(An(i+d))
        … A1(i) …
        … A2(i) …
        …
        … An(i) …
      enddo

Loop unrolling is also applied to remove redundant dcbt instructions.
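The transformation above can be sketched in C (the slide's loops are Fortran). This is a minimal sketch assuming GCC/Clang, where `__builtin_prefetch` stands in for the POWER `dcbt` instruction; the function name `sum_streams`, the two-stream loop body, and the prefetch distance `d` are illustrative, not compiler output:

```c
#include <stddef.h>

/* Software prefetch sketch: each data stream is prefetched d iterations
   ahead of its use, mirroring the dcbt(A1(i+d)) insertion on the slide.
   The caller is assumed to provide arrays with at least m+d elements so
   the prefetch addresses stay within the allocation. */
double sum_streams(const double *a1, const double *a2, size_t m, size_t d)
{
    double s = 0.0;
    for (size_t i = 0; i < m; i++) {
        /* hint the cache to fetch the lines needed d iterations from now */
        __builtin_prefetch(&a1[i + d]);
        __builtin_prefetch(&a2[i + d]);
        s += a1[i] + a2[i];
    }
    return s;
}
```

As on the slide, a real compiler would also unroll the loop so that only one prefetch per cache line (rather than one per element) is issued.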
Effectiveness of Software Prefetch (1)

Benchmark loop:
      do i=1,m
        S=S+A1(i)+...+An(i)
      end do

[Chart: relative memory throughput (0.4-1.1) vs. number of data streams n (1-16); software prefetch sustains roughly constant throughput across all stream counts, while hardware prefetch throughput degrades as the number of streams grows beyond its limit]
Solution (2) - Loop Distribution
- Split a loop into multiple loops so that each loop has no more than 8 data streams.

Before:
      do i=1,m
        … A1(i) …
        … A2(i) …
        …
        … An(i) …
      enddo

After (each loop ≤ 8 streams):
      do i=1,m
        ... A1(i) ...
        ...
        ... A8(i) ...
      enddo
      do i=1,m
        ... A9(i) ...
        ...
        ... A16(i) ...
      enddo
      do i=1,m
        ... A17(i) ...
        ...
        ... An(i) ...
      enddo
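A minimal C sketch of the same distribution (the slide's loops are Fortran; the names `sum_fused`/`sum_distributed`, the 12-stream count, and the split points are illustrative). The fused loop touches 13 streams (12 inputs plus the output), more than the 8 the hardware tracks; after distribution each loop touches at most 8:

```c
#include <stddef.h>

#define NSTREAMS 12  /* more input streams than the 8 hardware prefetchers */

/* Before: one loop reads all 12 input streams (plus out) per iteration. */
void sum_fused(double *out, double *a[NSTREAMS], size_t m)
{
    for (size_t i = 0; i < m; i++) {
        double s = 0.0;
        for (int k = 0; k < NSTREAMS; k++)
            s += a[k][i];
        out[i] = s;
    }
}

/* After loop distribution: two loops, each touching at most 8 streams
   (7 inputs + out), so hardware prefetch can track every stream. */
void sum_distributed(double *out, double *a[NSTREAMS], size_t m)
{
    for (size_t i = 0; i < m; i++) {      /* streams a[0..6] + out */
        double s = 0.0;
        for (int k = 0; k < 7; k++)
            s += a[k][i];
        out[i] = s;
    }
    for (size_t i = 0; i < m; i++) {      /* streams a[7..11] + out */
        double s = out[i];
        for (int k = 7; k < NSTREAMS; k++)
            s += a[k][i];
        out[i] = s;
    }
}
```

The distributed version performs the additions for each `out[i]` in the same order as the fused one, so the results are identical.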
Loop Tiling
- Improves cache reusability within a single loop nest.

Matrix multiplication, original:
      do k=1,N
        do j=1,N
          do i=1,N
            C(i,k) = C(i,k)+A(i,j)*B(j,k)
          enddo
        enddo
      enddo
Reference range of array A: for k=1, A(1:N,1:N); for k=2, A(1:N,1:N) again. A is reused across k iterations, but the whole N x N range must stay cached for this reuse to pay off.

Tiling applied:
      do jj=1,N,s
        do ii=1,N,s
          do k=1,N
            do j=jj,min(jj+s-1,N)
              do i=ii,min(ii+s-1,N)
                C(i,k) = C(i,k)+A(i,j)*B(j,k)
              enddo
            enddo
          enddo
        enddo
      enddo
Reference range of array A: for each k, only the s x s tile A(ii:ii+s-1,jj:jj+s-1), which fits in cache, so in-cache data is reused effectively across k iterations.
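The tiled loop structure can be sketched in C (the slide's code is Fortran; `N`, the tile size `S`, and the function names are illustrative). Both versions compute `C(i,k) += A(i,j)*B(j,k)`; the tiled one restricts `j` and `i` to `S x S` tiles so each `k` iteration touches only a small block of `A`:

```c
#include <string.h>

#define N 64
#define S 16  /* tile size s */

/* Naive matrix multiplication, same loop order as the slide. */
void matmul(double C[N][N], double A[N][N], double B[N][N])
{
    memset(C, 0, sizeof(double) * N * N);
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                C[i][k] += A[i][j] * B[j][k];
}

/* Tiled version: the jj/ii control loops strip the j and i ranges into
   S x S tiles, so for each k only an S x S block of A is re-read. */
void matmul_tiled(double C[N][N], double A[N][N], double B[N][N])
{
    memset(C, 0, sizeof(double) * N * N);
    for (int jj = 0; jj < N; jj += S)
        for (int ii = 0; ii < N; ii += S)
            for (int k = 0; k < N; k++)
                for (int j = jj; j < jj + S && j < N; j++)
                    for (int i = ii; i < ii + S && i < N; i++)
                        C[i][k] += A[i][j] * B[j][k];
}
```

For each fixed `(i,k)` the tiled version still sums over `j` in increasing order, so the result matches the naive version exactly.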
Loop Tiling - Compiler Support
- The compiler selects the target loop nest and tile size automatically.
- Tiling directives are also available: *soption tiling specifies the target loop, and *soption tilesize(n) specifies the tile size for the following loop.

Example of tiling directives:
*soption tiling
      do k=1,N
*soption tilesize(100)
        do j=1,N
*soption tilesize(200)
          do i=1,N
            C(i,k) = C(i,k)+A(i,j)*B(j,k)
          enddo
        enddo
      enddo

Generated code (jj and ii are the generated control loops):
      do jj=1,N,100
        do ii=1,N,200
          do k=1,N
            do j=jj,min(jj+99,N)
              do i=ii,min(ii+199,N)
                C(i,k) = C(i,k)+A(i,j)*B(j,k)
              enddo
            enddo
          enddo
        enddo
      enddo
Effectiveness of Loop Tiling
- Matrix multiplication (double precision), executed on 16 CPUs.

[Chart: speedup (0.0-1.8) vs. tile size (8 to 2048) for N=3000, N=5000, and N=10000; the best tile size and the compiler-selected tile size are marked for each N]
Strip-Mining
- Improves cache reusability between loops.
- Applied by user directive only.

The *soption stripmine(100,1) and *soption end stripmine directives specify the range of loops:
*soption stripmine(100,1)
      do i=1,M1
        do j=1,N
          ... A(j,i) ...
        enddo
      enddo
      do i=1,M2
        do j=1,N
          ... A(j,i) ...
        enddo
      enddo
*soption end stripmine

Generated code (effective reuse of in-cache data):
      do jj=1,N,100
        do i=1,M1
          do j=jj,min(jj+99,N)
            ... A(j,i) ...
          enddo
        enddo
        do i=1,M2
          do j=jj,min(jj+99,N)
            ... A(j,i) ...
          enddo
        enddo
      enddo
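A C sketch of the same transformation (the slide's code is Fortran; the function names, the accumulator array `a` standing in for the reused data, and the constants are illustrative, with `STRIP` mirroring the width 100 from `stripmine(100,1)`):

```c
#define N 256
#define M1 4
#define M2 3
#define STRIP 100  /* strip width, as in the stripmine(100,1) directive */

/* Before: two separate loop nests each sweep the full j range of a[],
   so by the time the second nest runs, the data may have been evicted. */
void nests_original(double a[N], double x[M1][N], double y[M2][N])
{
    for (int i = 0; i < M1; i++)
        for (int j = 0; j < N; j++)
            a[j] += x[i][j];
    for (int i = 0; i < M2; i++)
        for (int j = 0; j < N; j++)
            a[j] += y[i][j];
}

/* After strip-mining: a jj control loop wraps both nests, so each
   STRIP-element slice of a[] is reused by the second nest while it
   is still in cache. */
void nests_stripmined(double a[N], double x[M1][N], double y[M2][N])
{
    for (int jj = 0; jj < N; jj += STRIP) {
        int jend = jj + STRIP < N ? jj + STRIP : N;
        for (int i = 0; i < M1; i++)
            for (int j = jj; j < jend; j++)
                a[j] += x[i][j];
        for (int i = 0; i < M2; i++)
            for (int j = jj; j < jend; j++)
                a[j] += y[i][j];
    }
}
```

Each `a[j]` receives its additions in the same order in both versions, so the transformation preserves the result exactly.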
Strip-Mining: Example
- NPB2.3/SP compute_rhs

*soption stripmine(4,3)
      do m = 1, 5
        do k = 0, grid_points(3)-1
          do j = 0, grid_points(2)-1
            do i = 0, grid_points(1)-1
              rhs(i,j,k,m) = forcing(i,j,k,m)
            end do
          end do
        end do
      end do
      :
   (about 300 lines, 12 loops)
      :
      do k = 1, grid_points(3)-2
        do j = 1, grid_points(2)-2
          do i = 1, grid_points(1)-2
            wijk = ws(i,j,k)
            wp1 = ws(i,j,k+1)
            wm1 = ws(i,j,k-1)
            rhs(i,j,k,1) = rhs(i,j,k,1) + dz1tz1 *
     >        (u(i,j,k+1,1) - 2.0d0*u(i,j,k,1) +
     >        u(i,j,k-1,1)) - tz2 * (u(i,j,k+1,4) - u(i,j,k-1,4))
            ...(snip)...
          end do
        end do
      end do
*soption end stripmine

[Chart: relative performance of compute_rhs; original 1.0 vs. 1.23 with strip-mining]
Summary
- Cache Optimizations on the SR11000 Compiler
  - Optimizations for prefetch
    - hardware prefetch can detect at most 8 data streams
    - software prefetch complements hardware prefetch
    - loop distribution is applied to reduce data streams
  - Loop tiling
    - improves reusability of cached data within one loop nest
    - target loop and tile size are selected by the compiler
    - user tuning is possible via directives
  - Strip-mining
    - improves reusability of cached data across loop nests
    - user specifies the range and target loop by directives
Related documents