Cache Performance Measurement and Optimization

CS61: Systems Programming and Machine Organization
Harvard University, Fall 2009
Lecture 14: Cache Performance Measurement and Optimization
Prof. Matt Welsh
October 20, 2009
Topics for today
● Cache performance metrics
● Discovering your cache's size and performance
● The “Memory Mountain”
● Matrix multiply, six ways
● Blocked matrix multiplication
Cache Performance Metrics
Miss Rate
● Fraction of memory references not found in cache (# misses / # references)
● Typical numbers:
  ● 3-10% for L1
  ● Can be quite small (e.g., < 1%) for L2, depending on size and locality.
Hit Time
● Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
● Typical numbers:
  ● 1-2 clock cycles for L1
  ● 5-20 clock cycles for L2
Miss Penalty
● Additional time required because of a miss
  ● Typically 50-200 cycles for main memory
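These three metrics combine into an average memory access time. A back-of-the-envelope example using the typical numbers above (my arithmetic, not from the slide):

  average access time = hit time + miss rate x miss penalty
                      ≈ 2 cycles + 0.05 x 100 cycles = 7 cycles per reference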
Writing Cache Friendly Code
• Repeated references to variables are good (temporal locality)
• Stride-1 reference patterns are good (spatial locality)
• Examples: cold cache, 4-byte words, 4-word cache blocks
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Miss rate = 1/4 = 25%
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Miss rate = 100%

Row-major traversal (sum_array_rows) misses only on the first word of each 4-word block, hence 1/4 = 25%; column-major traversal (sum_array_cols) lands in a different row, and hence a different block, on every access, hence 100%.
Determining cache characteristics
● Say I gave you a machine and didn't tell you anything about its cache size or speeds.
● How would you figure these values out?
Determining cache characteristics
● Say I gave you a machine and didn't tell you anything about its cache size or speeds.
● How would you figure these values out?
● Idea: Write a program to measure the cache's behavior and performance.
  ● Program needs to perform memory accesses with different locality patterns.
  ● Simple approach:
    ● Allocate array of size “W” words
    ● Loop over the array with stride index “S” and measure speed of memory accesses
    ● Vary W and S to estimate cache characteristics
[Diagram: an array of W = 32 words accessed with stride S = 4 words]
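A minimal sketch of the measurement loop (the complete, timed test program appears at the end of these notes; the name scan and its W and S parameters are mine, the real program uses test(elems, stride)):

/* Touch every S-th word of a W-word array. */
volatile int sink;
void scan(int *data, int W, int S) {
    int i, sum = 0;
    for (i = 0; i < W; i += S)
        sum += data[i];
    sink = sum;  /* keep the compiler from optimizing the loop away */
}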
Determining cache characteristics
[Same diagram: an array of W = 32 words accessed with stride S = 4 words]
● What happens as you vary W and S?
● Changing W varies the total amount of memory accessed by the program.
  ● As W gets larger than one level of the cache, performance of the program will drop.
● Changing S varies the spatial locality of each access.
  ● If S is less than the size of a cache line, sequential accesses will be fast.
  ● If S is greater than the size of a cache line, sequential accesses will be slower.
● See end of lecture notes for example C program to do this.
Varying Working Set
Keep stride constant at S = 1, and vary W from 1KB to 8MB
● Shows size and read throughputs of different cache levels and memory
[Figure: read throughput (MB/sec, 0-1200) vs. working set size (1 KB - 8 MB). The curve shows an L1 cache region, an L2 cache region, and a main memory region, with throughput dropping at each boundary. Annotations: “Why the dropoff in performance here?” and “Why is this dropoff not so drastic?”]
Varying stride
Keep working set constant at W = 256 KB, vary stride from 1-16
● Shows the cache block size.
[Figure: read throughput (MB/s, 0-800) vs. stride (s1 - s16, in words). Throughput drops gradually as the stride grows, then levels off once there is one access per cache line. Annotation: “Why the gradual drop in performance?”]
The Memory Mountain
Pentium III – 550 MHz
16 KB L1 cache, 512 KB L2 cache

[Figure: 3-D surface of read throughput (MB/sec, 0-1200) vs. working set size (2 KB - 8 MB) and stride (s1 - s15, in words). Ridges of temporal locality run along the L1, L2, and main-memory plateaus; slopes of spatial locality run along the stride axis. The L1 dropoff occurs at 16 KB; the L2 dropoff occurs at 512 KB.]
X86-64 Memory Mountain
Pentium Nocona Xeon x86-64, 3.2 GHz
12 Kuop on-chip L1 trace cache
16 KB on-chip L1 d-cache
1 MB off-chip unified L2 cache

[Figure: 3-D surface of read throughput (MB/s, 0-6000) vs. working set size (4 KB - 128 MB) and stride (s1 - s29, in words), showing L1, L2, and memory ridges of temporal locality and slopes of spatial locality.]
Opteron Memory Mountain
AMD Opteron, 2 GHz

[Figure: 3-D surface of read throughput (MB/s, 0-3000) vs. working set size (4 KB - 128 MB) and stride (s1 - s29, in words), showing L1, L2, and memory regions.]
Matrix Multiplication Example
● Matrix multiplication is heavily used in numeric and scientific applications.
  ● It's also a nice example of a program that is highly sensitive to cache effects.
● Multiply two N x N matrices
  ● O(N³) total operations
  ● Read N values for each source element
  ● Sum up N values for each destination
/* ijk -- variable sum held in register */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
Matrix Multiplication Example
/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
4*3 + 2*2 + 7*5 = 51

  [ 4 2 7 ]   [ 3 0 1 ]   [ 51 . . ]
  [ 1 8 2 ] x [ 2 4 5 ] = [  . . . ]
  [ 6 0 1 ]   [ 5 9 1 ]   [  . . . ]
Miss Rate Analysis for Matrix Multiply
Assume:
● Line size = 32B (big enough for four 64-bit “double” values)
● Matrix dimension (N) is very large
● Cache is not even big enough to hold multiple rows
Analysis Method:
● Look at access pattern of inner loop

[Diagram: inner-loop access pattern - A traversed along row i (index k), B traversed down column j (index k), C at element (i, j).]
Layout of C Arrays in Memory (review)
C arrays allocated in row-major order
● Each row in contiguous memory locations
Stepping through columns in one row:
● for (i = 0; i < N; i++)
      sum += a[0][i];
● Accesses successive elements
● Compulsory miss rate: (8 bytes per double) / (block size of cache)
Stepping through rows in one column:
● for (i = 0; i < n; i++)
      sum += a[i][0];
● Accesses distant elements -- no spatial locality!
● Compulsory miss rate = 100%
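A small, self-contained sketch (mine, not from the slide) that makes the row-major rule concrete: a[i][j] in a double a[M][N] array sits (i*N + j) elements past &a[0][0], so walking a row is stride-1 while walking a column jumps a whole row per step.

#include <assert.h>
#define M 4
#define N 5
double a[M][N];

int main(void) {
    /* a[2][3] is the (2*N + 3) = 13th element after the start of the array */
    assert(&a[2][3] == &a[0][0] + 2*N + 3);
    /* consecutive elements of a row are adjacent (8 bytes apart);           */
    /* consecutive elements of a column are N doubles (40 bytes) apart       */
    assert((char *)&a[1][0] - (char *)&a[0][0] == N * sizeof(double));
    return 0;
}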
Matrix Multiplication (ijk)
/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
Inner loop:
  A (i,*): row-wise   B (*,j): column-wise   C (i,j): fixed
Misses per inner-loop iteration:   A = 0.25   B = 1.0   C = 0.0
Assume a cache line size of 32 bytes: compulsory miss rate = 8 bytes per double / 32 bytes = 1/4 = 0.25 for the row-wise walk of A. B is walked down a column, so every access touches a new line (1.0); C is not referenced inside the innermost loop (0.0).
Matrix Multiplication (jik)
/* jik */
for (j=0; j<n; j++) {
for (i=0; i<n; i++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
Inner loop:
  A (i,*): row-wise   B (*,j): column-wise   C (i,j): fixed
Misses per inner-loop iteration:   A = 0.25   B = 1.0   C = 0.0
Same as ijk. Just swapped i and j.
Matrix Multiplication (kij)
/* kij */
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}
Inner loop:
  A (i,k): fixed   B (k,*): row-wise   C (i,*): row-wise
Misses per inner-loop iteration:   A = 0.0   B = 0.25   C = 0.25
Now we suffer 0.25 compulsory misses per iteration for “B” and “C” accesses.
Also need to store back “temporary” result c[i][j] on each innermost loop iteration!
Matrix Multiplication (ikj)
/* ikj */
for (i=0; i<n; i++) {
for (k=0; k<n; k++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}
Inner loop:
  A (i,k): fixed   B (k,*): row-wise   C (i,*): row-wise
Misses per inner-loop iteration:   A = 0.0   B = 0.25   C = 0.25
Same as kij.
Matrix Multiplication (jki)
/* jki */
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}
Inner loop:
  A (*,k): column-wise   B (k,j): fixed   C (*,j): column-wise
Misses per inner-loop iteration:   A = 1.0   B = 0.0   C = 1.0
Matrix Multiplication (kji)
/* kji */
for (k=0; k<n; k++) {
for (j=0; j<n; j++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}
Inner loop:
  A (*,k): column-wise   B (k,j): fixed   C (*,j): column-wise
Misses per inner-loop iteration:   A = 1.0   B = 0.0   C = 1.0
Summary of Matrix Multiplication
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}
ijk or jik:
  • 2 loads, 0 stores
  • misses/iter = 1.25
kij or ikj:
  • 2 loads, 1 store
  • misses/iter = 0.5
jki or kji:
  • 2 loads, 1 store
  • misses/iter = 2.0
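These totals are just the per-matrix numbers from the preceding slides added together: ijk/jik: 0.25 (A) + 1.0 (B) + 0.0 (C) = 1.25; kij/ikj: 0.0 + 0.25 + 0.25 = 0.5; jki/kji: 1.0 + 0.0 + 1.0 = 2.0.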
Pentium Matrix Multiply Performance
[Figure: cycles per inner-loop iteration vs. array size (n = 25 to 400) for all six loop orderings.
  jki, kji: 3 mem accesses, 2 misses/iter
  kij, ikj: 3 mem accesses, 0.5 misses/iter
  ijk, jik: 2 mem accesses, 1.25 misses/iter]
● Versions with the same number of memory accesses and the same miss rate perform about the same.
● Lower misses/iter tends to do better.
● The ijk and jik versions are fastest here, despite a higher miss rate than the kij and ikj versions (they make only two memory accesses per iteration and do no store in the inner loop).
Using blocking to improve locality
Blocked matrix multiplication
● Break matrix into smaller blocks and perform independent multiplications on each block.
● Improves locality by operating on one block at a time.
● Best if each block can fit in the cache!
Example: Break each matrix into four sub-blocks

  [ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
  [ A21 A22 ] x [ B21 B22 ] = [ C21 C22 ]
Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
C11 = A11B11 + A12B21
C12 = A11B12 + A12B22
C21 = A21B11 + A22B21
C22 = A21B12 + A22B22
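Why treating sub-blocks like scalars is legitimate (a quick check, not on the original slide): each element of C11 is c[i][j] = sum over k of a[i][k]*b[k][j] with i and j in the top-left quadrant; splitting the sum into k in the left half (the A11B11 contribution) and k in the right half (the A12B21 contribution) gives exactly C11 = A11B11 + A12B21, and likewise for the other three quadrants.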
Blocked Matrix Multiply (bijk)
for (jj=0; jj<n; jj+=bsize) {
for (i=0; i<n; i++)
for (j=jj; j < min(jj+bsize,n); j++)
c[i][j] = 0.0;
for (kk=0; kk<n; kk+=bsize) {
for (i=0; i<n; i++) {
for (j=jj; j < min(jj+bsize,n); j++) {
sum = 0.0;
for (k=kk; k < min(kk+bsize,n); k++) {
sum += a[i][k] * b[k][j];
}
c[i][j] += sum;
}
}
}
}
Blocked matrix multiply operation
[Diagram: A x B = C, with a bsize x bsize block of B highlighted spanning rows kk..kk+bsize and columns jj..jj+bsize.]
Step 1: Pick location of block in matrix B
● Block slides across matrix B left-to-right, top-to-bottom, by “bsize” units at a time
Blocked matrix multiply operation
[Diagram: a 1 x bsize row sliver of A (row i, columns kk..kk+bsize) shown against the fixed block of B and row i of C, columns jj..jj+bsize.]
Step 2: Slide row sliver across matrix A
● Hold block in matrix B fixed.
● Row sliver slides from top to bottom across matrix A
● Row sliver spans columns [ kk ... kk+bsize ]
Blocked matrix multiply operation
[Diagram: a column sliver of the B block (column j, rows kk..kk+bsize) paired with the fixed row sliver of A; the result updates element (i, j) of C.]
Step 3: Slide column sliver across block in matrix B
● Row sliver in matrix A stays fixed.
● Column sliver slides from left to right across the block
● Column sliver spans rows [ kk ... kk+bsize ] in matrix B
Blocked matrix multiply operation
[Diagram: the row sliver of A (row i) and the column sliver of the B block (column j) whose dot product updates cell (i, j) of C.]
Step 4: Iterate over row and column slivers together
● Compute dot product of both vectors of length “bsize”
● Dot product is added to contents of cell (i,j) in matrix C
Locality properties
[Diagram: same picture as the previous slide -- block of B, row sliver of A, cell (i, j) of C.]
What is the locality of this algorithm?
● Iterate over all elements of the block N times (once for each row sliver in A)
● Row sliver in matrix A accessed “bsize” times (once for each column sliver in B)
If the block and row slivers fit in the cache, performance should rock!
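A rough sizing check (my arithmetic, not from the slide): the inner loops touch one bsize x bsize block of B plus 1 x bsize slivers of A and C, i.e. (bsize² + 2*bsize) doubles. With bsize = 25, as used on the next slide, that is about (625 + 50) * 8 bytes ≈ 5.3 KB, which fits comfortably in the Pentium III's 16 KB L1 cache.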
Pentium Blocked Matrix Multiply Performance
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
● Relatively insensitive to array size.
[Figure: cycles per inner-loop iteration vs. array size (n = 25 to 400) for the six unblocked orderings plus bijk and bikj with bsize = 25; the blocked versions sit at the bottom of the plot, roughly flat across array sizes.]
Cache performance test program
/* The test function */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;  /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride)
{
    uint64_t start_cycles, end_cycles, diff;
    int elems = size / sizeof(int);

    test(elems, stride);                        /* warm up the cache */
    start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
    test(elems, stride);                        /* Run test */
    end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
    diff = end_cycles - start_cycles;           /* Compute time */
    return (size / stride) / (diff / CPU_MHZ);  /* convert cycles to MB/s */
}
Cache performance main routine
#define CPU_MHZ   (2.8 * 1024.0 * 1024.0)  /* e.g., 2.8 GHz */
#define MINBYTES  (1 << 10)                /* Working set size ranges from 1 KB */
#define MAXBYTES  (1 << 23)                /* ... up to 8 MB */
#define MAXSTRIDE 16                       /* Strides range from 1 to 16 */
#define MAXELEMS  (MAXBYTES/sizeof(int))

int data[MAXELEMS];  /* The array we'll be traversing */

int main()
{
    int size;    /* Working set size (in bytes) */
    int stride;  /* Stride (in array elements) */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride));
        printf("\n");
    }
    exit(0);
}
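Each printed row corresponds to one working set size and each column to one stride; plotting that grid of throughput numbers produces the memory mountain surfaces shown earlier. (init_data and get_cpu_cycle_counter are helper routines assumed to be defined elsewhere in the lecture code; on x86 the cycle counter would typically be read with the rdtsc instruction.)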