Memory Operation and Performance

This lecture explains the memory architecture so that you can write
programs that take advantage of it and run faster. It covers:
Memory Systems
Caches
Virtual Memory (VM)
Caches – fast memory between the CPU and main memory
Cache Design Parameters
A Diversity of Caches – the different levels of caches
Looking at the Caches
Cache-aware Programming (column-major vs. row-major access;
row-major access makes the program faster)
A Diversity of Caches
Multiple Levels of Caches (L1, L2, and L3 caches; L1 is faster than
L2, and L2 is faster than L3. L1 is more expensive per byte than L2.)
On-Chip Caches (a single level of cache, L1, on the CPU chip)
Instruction and Data Caches (the cache separates data and
instructions; an instruction is likely to be reused soon after it is
fetched.)
Instruction and Data Cache – from
http://www.kids-online.net/learn/clickjr/details/cpu.html
On the right are two photos of a CPU (Central Processing Unit). The
bottom photo shows the CPU chip from the outside. The top photo is a
large "road map" of the inside of the CPU, showing the data cache and
the instruction cache.
Multiple Levels of Caches
Modern computer systems don't have just one
cache between the CPU and memory.
There is usually a hierarchy of caches. The caches
are usually called L1, L2, etc.—which is shorthand
for Level 1, Level 2, and so on.
The L1 cache is the cache within the CPU, and is therefore the
fastest and smallest, but the most expensive per byte.
The last cache (usually L2 or L3) is the cache that loads data
directly from the DRAM main memory; it is the least expensive per byte.
L1 and L2 Cache
Level 1 cache memory is memory that is included in the CPU itself.
Level 2 cache memory is memory outside of the CPU.
The photo below shows Level 2 cache memory on the processor.
Example – shows the benefit of cache memory at different levels
/* Assumes n is a power of two */
void merge_sort(int *data, int n) {
    if (n == 1) return;
    int half = n >> 1;               /* divide the data into two halves */
    merge_sort(data, half);          /* sort the first half             */
    merge_sort(data + half, half);   /* sort the second half            */
    merge(data, data + half, half);  /* merge the two sorted halves     */
}
(There is no need to memorise this code. The idea is to divide the
data into two halves, sort each half, and merge them.)
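The merge step itself is not shown on the slide. Below is a minimal
sketch of a merge matching the call above, under the assumption that
the two sorted runs are adjacent in memory and each has length n:

#include <stdlib.h>
#include <string.h>

void merge(int *left, int *right, int n) {
    int *tmp = malloc(2 * n * sizeof(int));  /* scratch buffer          */
    int i = 0, j = 0, k = 0;
    while (i < n && j < n)                   /* take the smaller head   */
        tmp[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < n) tmp[k++] = left[i++];      /* drain any leftovers     */
    while (j < n) tmp[k++] = right[j++];
    memcpy(left, tmp, 2 * n * sizeof(int));  /* right == left + n, so this
                                                copies the merged run back */
    free(tmp);
}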
Graph of Merge Sort – the graph shows the access times in nanoseconds
(ns) for the L1 cache (T1), L2 cache (T2), L3 cache (T3), and main
memory (Tm).
Merge sort in a fast, small cache (in L1 only) – look at the total time.
Merge sort in a slow cache (in L3 only) – look at the total time; it
is longer.
On-Chip Caches
Multilevel caches can improve the performance of a computer.
However, there is usually no major difference between having a single
L3-sized cache and three separate caches.
That difference is not as significant as the difference between a
single large cache and a single small one.
Instruction and Data Caches
Programs access and fetch instructions in much more
predictable ways than they do data.
For instance, instruction fetches exhibit much more spatial locality
than data accesses, because an instruction fetch is very likely to be
followed soon by the fetch of the instruction next to it. For example,
if the program is executing a++;, there is a high chance that it will
next execute b++; and c = a + b * 3;:

a++;
b++;
c = a + b * 3;
Even when a branch or jump instruction makes this untrue,
it is very likely that the instruction fetched next will be one
that has already been fetched recently.
Diagram: multiple levels of caches. Note that L1 is within the CPU chip.
Looking at the cache design
We can deduce many things about the cache
design of a particular computer by carefully
examining its memory performance.
We can design a benchmark program whose locality we control, such as:

int data[MAXSIZE];
for (i = 0; i < repeat; i++) {
    for (j = 0; j < N; j++) {   /* the inner loop needs its own index, j */
        dummy = data[j];
    }
}
Explanation to the program
This loop accesses a chunk of memory repeatedly.
By varying N, we vary the temporal locality of the
accesses.
For example, for N == 4, each value data[i] will be accessed every 4
iterations, but if N is 16, each data[i] will be accessed only every
16 iterations.
A cache of size 16 would cause the benchmark to perform much more
poorly for N == 32 than for N == 8, because for N == 32 each data[i]
would have been evicted (removed) from the cache before it was
accessed again.
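A self-contained sketch of such a benchmark follows. The timing code,
sizes, and repeat counts are illustrative assumptions, not from the
lecture:

#include <stdio.h>
#include <time.h>

#define MAXSIZE (1 << 20)

int data[MAXSIZE];
volatile int dummy;  /* volatile: stops the compiler optimising the reads away */

int main(void) {
    int N;
    for (N = 256; N <= MAXSIZE; N <<= 1) {
        long repeat = (1L << 26) / N;   /* keep total accesses constant */
        clock_t start = clock();
        long r;
        int j;
        for (r = 0; r < repeat; r++)
            for (j = 0; j < N; j++)
                dummy = data[j];
        printf("N = %7d  time = %.3f s\n", N,
               (double)(clock() - start) / CLOCKS_PER_SEC);
    }
    return 0;
}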
Control the spatial locality
Here, stride controls the amount of spatial locality:

int data[MAXSIZE];
for (i = 0; i < repeat; i++) {
    for (j = 0; j < N; j += stride) {   /* inner loop again uses its own index */
        dummy = data[j];
    }
}
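A sketch of how the stride might be swept, reusing data, dummy, and
the headers from the benchmark sketch above (the specific values are
assumptions). With a larger stride, fewer of the words brought in with
each cache line are actually used:

void stride_sweep(int N, long repeat) {
    int stride;
    for (stride = 1; stride <= 64; stride <<= 1) {
        clock_t start = clock();
        long r;
        int j;
        for (r = 0; r < repeat; r++)
            for (j = 0; j < N; j += stride)
                dummy = data[j];
        printf("stride = %2d  time = %.3f s\n", stride,
               (double)(clock() - start) / CLOCKS_PER_SEC);
    }
}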
Result of benchmark
Graph (y-axis: transfer rate in MB/s). You can see that the
performance is not simply proportional to the L1 cache size. Why? The
cache is effective for sizes between 512 bytes and 4K.
Interpretation of result
We immediately notice that memory
performance comes in three discrete steps.
In the best-performing step, the program is accessing so little data
that all of its references fit in the L1 cache, and the rest of the
hierarchy is almost never required.
In the next step down, the references no
longer fit in the L1 but fit in the L2 cache, and
access to main memory is almost never
required.
Try to fit in L1, L2, etc.
Graph showing the size of L1 versus performance (transfer rate).
The effect of stride (step size).
Cache-Aware Programming
This section shows how to optimise performance. The problem areas are:
Instruction Cache Overflow
Cache Collisions
Unused Cache Lines
Insufficient Temporal Locality
Example (1) – 4 ms (assume 1M)
Example (2) – 3 ms (assume 1M)
Example (3) – 3 ms (assume 512K)
Example (4) – 2.5 ms (assume 512K)
Example (5) – 2.3 ms (assume 256K)
Instruction Cache – a program with a complicated for loop
Below is a program involving three complicated operations:
for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
    <Complicated operation on B[i]>
    <Complicated operation on C[i]>
}
It is better to separate it into three loops, so that each complicated
operation can make full use of the cache memory (instruction cache):
for (i = 0; i < MAX; i++) {
    <Complicated operation on A[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on B[i]>
}
for (i = 0; i < MAX; i++) {
    <Complicated operation on C[i]>
}
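A concrete sketch of the same transformation; the operations here are
hypothetical stand-ins (in practice each would be a large body of code
that competes for instruction-cache space):

#define MAX 100000
double A[MAX], B[MAX], C[MAX];

void process_split(void) {
    int i;
    /* Each loop runs one operation to completion, so its code stays
       resident in the instruction cache for all MAX iterations. */
    for (i = 0; i < MAX; i++) { A[i] = A[i] * A[i] + 1.0; }
    for (i = 0; i < MAX; i++) { B[i] = B[i] * 3.0 - 2.0; }
    for (i = 0; i < MAX; i++) { C[i] = C[i] * 0.25 + 1.0; }
}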
Cache Collisions
Cache collisions can also cause our
programs to execute slowly.
A cache collision occurs when a cache line is evicted (switched out)
even though the cache is not full.
It happens when a cache set is full: the system has to decide which
line in the set to remove (switch out).
Program with cache collisions
Below is a program involving arrays a, b, and c.

int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N];
for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
Reason for the Cache Collision
It is possible that the compiler may allocate a,
b, and c to memory addresses that map to
the same cache set.
In this case, the assignment c[i] = a[i] + b[i]
will cause three cache misses in every
iteration of the loop, because the cache will
be constantly evicting the cache line that the
CPU requires next.
Every iteration thus causes three misses, because c[], a[], and b[]
map to the same cache set.
Graph showing the cache collision
The solution is to offset the memory locations:

#define CACHELINESIZE <Cache line size of system>
#define COFFSET ((2 * CACHELINESIZE) / sizeof(int))
int a[N];
<other stuff...>
int b[N];
<other stuff...>
int c[N + COFFSET];
for (i = 0; i < N; i++) {
    c[i + COFFSET] = a[i] + b[i];
}
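For example (assuming 32-byte cache lines and 4-byte ints, which are
typical values not stated on the slide), COFFSET = (2 * 32) / 4 = 16,
so the accesses to c[] are shifted by 64 bytes and the accessed
element of c[] lands in a different cache set from a[i] and b[i].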
Graph showing cache after change
Under-used Cache Lines
Suppose the cache line is 32 bytes wide, as it often is. If a program
is reading contiguous 4-byte integers, the reference to the first will
cause the first eight integers (integers 0–7) to be loaded into the
cache.
The reference to the 9th will cause integers 8–15 to be loaded, and so
on. The hit ratio, even on a cold cache, will be at least 7/8, or 0.875.
Now consider a program that reads integers with a
stride of eight or more. This means that the
program reads the first integer, then the 9th (or
higher), then the 17th etc.
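Continuing the arithmetic above: with a stride of eight, every
reference touches a new 32-byte line, so on a cold cache every access
is a miss. The hit ratio falls from 7/8 to 0, and only one of the
eight integers loaded with each line is ever used.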
Graph showing the effect: cache misses.
Example of a matrix
int data[M][N];
for (i = 0 ; i < N; i++) {
    for (j = 0; j < M; j++) {
        sum += data[j][i];   /* inner loop walks down a column: stride of N ints */
    }
}
Row-major and Column-major
The loop above accesses the matrix in column-major order.
Accessing the data row by row instead is faster: it accesses [0][0],
[0][1], [0][2], ..., and these are already in the cache, because
reading [0][0] loads the whole cache line, [0][0] up to [1][3] (for a
4-column matrix).
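A minimal sketch of the row-major version (same declarations as above;
only the loop order changes):

int data[M][N];
for (j = 0; j < M; j++) {        /* outer loop over rows              */
    for (i = 0; i < N; i++) {    /* inner loop walks along a row:     */
        sum += data[j][i];       /* consecutive addresses, so each    */
    }                            /* cache line is used fully          */
}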
Changing the order of the iterations is not always better. Below is an
example.
// Whichever loop order we choose, one array is accessed sequentially
// (here transposed[i][j]) while the other (original[j][i]) is
// accessed with a large stride.
int original[M][N];
int transposed[N][M];
for (i = 0; i < N; i++) {
    for (j = 0; j < M; j++) {
        transposed[i][j] = original[j][i];
    }
}
Effect of rotating the shape
This is the effect of the previous program: it rotates (transposes)
the image.
Insufficient Temporal Locality
int original[M][N];
int transposed[N][M];
for (k = 0; k < N / m; k++) {          /* block row of transposed     */
    for (l = 0; l < M / n; l++) {      /* block column of transposed  */
        for (i = k * m; i < (k + 1) * m; i++) {
            for (j = l * n; j < (l + 1) * n; j++) {
                transposed[i][j] = original[j][i];
            }
        }
    }
}
Blocked transpose gets around these cache misses.
The block dimensions m and n are chosen so that each block fits into
the cache, and are determined by the cache line size (say, 32 bytes).
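For example (assumed numbers, consistent with the 32-byte lines used
earlier): 4-byte ints give 8 integers per cache line, so choosing
m = n = 8 makes each 8 x 8 block of original span exactly 8 cache
lines, and every line brought in is fully reused before it can be
evicted.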
Virtual Memory (VM)
The term virtual memory refers to a combination of hardware and
operating-system software that solves several computing problems.
It receives a single name because it is a single
mechanism, but it meets several goals:
To simplify memory management and program
loading by providing virtual addresses.
To allow multiple large programs to be run
without the need for large amounts of RAM, by
providing virtual storage.
Virtual Addresses
Segmentation – groups pages together, with different sizes.
Memory Protection – because more than one process is supported, each
process's memory must be protected from corruption by the others.
Paging – uses the same unit size (the page) on disk and in memory, and
loads pages into memory or writes them back to disk. Computers hold
several programs in memory at the same time; a sketch of the
translation step follows.
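A minimal sketch of paged address translation (the page size, table
size, and flat table layout are illustrative assumptions):

#define PAGE_SIZE 4096
#define NUM_PAGES 1024

unsigned frame_of[NUM_PAGES];   /* virtual page number -> physical frame */

unsigned translate(unsigned vaddr) {
    unsigned vpage  = vaddr / PAGE_SIZE;   /* which virtual page        */
    unsigned offset = vaddr % PAGE_SIZE;   /* position inside the page  */
    return frame_of[vpage] * PAGE_SIZE + offset;
}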
Virtual Memory – Explanation
Diagram: the sequence of virtual-memory operation when the program
size is larger than main memory; pages move between the disk and main
memory.
Two contradictory facts about VM:
The compiler determines the address at
which a program will execute, by hard-wiring
a lot of addresses of variables and
instructions into the machine code it
generates.
The location of the program is not
determined until the program is executed and
may be anywhere in main memory.
Solution to the contradictory facts
Code Relocation: Have the compiler generate addresses relative to a
base address, and change the base address when the program is
executed. This means that the address of each reference is calculated
explicitly by adding the relative address to the base address.
Drawback: this addition must be performed for every memory reference.
Address Translation: At run time, give programs the illusion that
there are no other programs in memory. Compilers can then generate any
absolute address they wish. Two programs may contain references to the
same address without interference.
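As a worked example (addresses assumed for illustration): if the
compiler emits a reference to relative address 0x124 and the program
is loaded at base address 0x40000, the reference resolves at run time
to 0x40000 + 0x124 = 0x40124; loading the same program at a different
base simply changes the sum.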
Virtual and Physical Addresses
The addresses issued by the compiler are
called virtual addresses.
The addresses that result from the
translation are called physical addresses,
because they refer to an actual memory chip.
Multiple programs without relocation
Diagram: Program A shares some memory locations belonging to Program B.
Relocatable code can share memory
Diagram: Program A uses only the memory locations belonging to itself.
Summary
Caches: L1 (within the CPU), L2, and L3
Data cache and instruction cache
Programming: column-major vs. row-major order; row-major access
enhances performance.
Virtual memory: when memory is too small to hold the whole program,
pages are loaded into memory as needed.