DATA CACHE PREFETCHING SCHEMES FOR MULTIMEDIA
AND IMAGE PROCESSING APPLICATIONS
ALAA R. ALAMELDEEN
alaa@cs.wisc.edu
DEEPAK JINDAL
jindal@cs.wisc.edu
SANJEEV KULKARNI
sanjeevk@cs.wisc.edu
Dept. of Computer Sciences, University of Wisconsin-Madison
ABSTRACT
Data prefetching is an important approach to reducing the data cache miss rate caused by compulsory and capacity
misses. Prefetching can be performed in hardware or in software. The basic idea is to keep track of a program's data access
patterns and to use this information to anticipate the location of the data that will be accessed next. Image processing and
multimedia applications are among the fastest-growing application areas in computer systems. They are memory-intensive
and time-critical, but they have predictable data access patterns. These properties make them ideal candidates for data
prefetching schemes.
In this paper, we study several hardware-based data prefetching schemes and their effect on the performance of
image processing and multimedia applications. We introduce a modification to the correlated prefetching scheme presented
by Chen and Baer that removes its reliance on compiler support. We describe a set of multimedia and image processing
benchmark programs. Simulation experiments comparing the different hardware prefetching schemes show satisfactory
results for most of the image processing and multimedia applications without hurting performance on other SPEC
benchmarks.
KEY WORDS
Prefetching, hardware-based prefetching, data cache, reference prediction, memory access patterns, image processing,
multimedia.
1. INTRODUCTION
Because of the growing speed gap between processors and memory, faster processors cannot be exploited
efficiently unless there is some way of tolerating memory latency. The introduction of cache memories as an intermediate
storage level between main memory and the processor was a breakthrough that helped to bridge this gap. However, memory
latency cannot be eliminated entirely. Upon a cache miss, the processor has to wait until the data is fetched into the cache and
delivered to the processor. This wait time cannot be fully hidden even by out-of-order processors.
Several techniques have been proposed to reduce the data access penalty that results from cache misses. These
include [5]:
- Increasing the cache size, which is limited by the on-chip area available for the cache.
- Increasing cache associativity to reduce conflict misses, which may increase the cache hit time [6].
- Jouppi’s victim cache [8], which reduces conflict misses by storing blocks replaced from the cache in a small associative
victim cache. The benefit of this technique diminishes as cache size and/or associativity increase.
- Data prefetching, which reduces compulsory and capacity cache misses by attempting to have data in the cache before it is
actually required, whether the data has been referenced before and replaced (capacity miss) or not (compulsory miss). Data
prefetching is defined as moving data from a higher level of the memory hierarchy to a lower level of the memory hierarchy
before the data is actually needed [14].
Data prefetching can be done in hardware or in software. Hardware-based data prefetching, explained in more detail
in the next section, adds a hardware unit that fetches data into the data cache before it is needed, basing its decisions on
previously observed data access patterns. Software-based prefetching [2, 9, 10, 11, 16], on the other hand, is based on
analysis of the static program. An intelligent compiler can insert special instructions that prefetch data many cycles ahead of
their use by other instructions. These techniques can issue prefetches for more references and can handle complex access
patterns (which cannot be done in hardware because of the required complexity). This comes at the cost of added prefetch
instructions (including a change to the instruction set architecture) as well as additional address calculations in the program
(which may reduce its speed). Many machines include some architectural support for software-based prefetching; one
example is the HP PA-8000 [14].
Multimedia and image processing applications are growing application areas in modern computing environments, a
trend reinforced by Internet technology and its services such as video-on-demand. Inspired by this trend, most modern
architectures include some architectural support for these applications, such as Intel MMX [7]. The main problem with these
applications is that they are memory intensive, processing large volumes of data, which makes them rather slow. For example,
a typical 24-bit true-color image of dimensions 500x500 requires approximately 750 Kbytes of storage, which cannot fit in an
L1 cache of practical size. However, the memory access patterns of most of these applications are quite regular and
predictable, which makes them well suited to data prefetching schemes.
The rest of this paper is organized as follows. Section 2 surveys some hardware-based data prefetching schemes,
focusing on Chen and Baer's basic and lookahead prefetching schemes [3]. Section 3 explains Chen and Baer's correlated
reference prediction scheme. Section 4 introduces our modification to the correlated prefetching scheme. Section 5 describes
the image processing benchmarks used to compare the performance of the different hardware-based prefetching schemes.
Section 6 describes our simulation model and performance metrics. Section 7 presents the results and their interpretation.
Section 8 provides conclusions and suggestions for future work.
2. HARDWARE-BASED DATA PREFETCHING
Hardware-based data prefetching schemes dynamically predict the memory address to prefetch based on the
addresses of previous memory accesses. These techniques can be classified into two main classes:
i. On-chip schemes, which observe the addresses of all data references made by the processor. Examples include Chen and
Baer’s basic, lookahead and correlated schemes [3].
ii. Off-chip schemes, which observe only the addresses that miss in the L1 cache, since these are usually the only addresses
available outside the processor chip. These include Jouppi’s stream buffers [8], as enhanced by Palacharla and Kessler [12].
There are arguments for both approaches. On-chip schemes can be more accurate in predicting which data to
prefetch, and can thus remove a large fraction of the compulsory and capacity cache misses. On the other hand, they consume
precious chip area that could otherwise be used for a larger cache, and they increase the required memory bandwidth.
Off-chip schemes do not frequently replace useful data in the data cache and require much less memory bandwidth, but their
prefetching accuracy is usually not as good as that of on-chip schemes.
2.1 Memory Access Patterns
Prefetching schemes, both on-chip and off-chip, base their prediction for the next address to be needed by an
instruction on some regularity in the memory access pattern. For example, in a program segment consisting of nested loops,
the memory access patterns can be divided into the following four main categories [3]:
1. Scalar, which is a simple variable reference that does not change with respect to the loop index.
2. Zero stride, which is a reference inside an inner loop with a subscript expression that does not change with respect to the
inner loop index, but may change with respect to the outer loop (e.g. a reference to A[i] inside an inner loop indexed by j).
3. Constant stride, which is a reference inside a loop with a subscript expression that increases with a constant rate with
respect to the loop index (e.g. the reference A[i] inside the loop indexed by i).
4. Irregular, which is any pattern other than the previous three (e.g., linked-list traversals, tree traversals, etc.).
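As a concrete illustration (a hypothetical fragment, not drawn from any of the benchmarks used later), the loop nest below
contains all three regular patterns: sum is a scalar reference, B[i] is a zero-stride reference with respect to the inner loop, and
A[i][j] is a constant-stride reference with respect to the inner loop.

    #include <stddef.h>

    #define N 64
    #define M 64

    /* Hypothetical loop nest illustrating the regular access patterns above. */
    long access_patterns(const int A[N][M], const int B[N])
    {
        long sum = 0;                        /* scalar: same location every iteration */
        for (size_t i = 0; i < N; i++) {
            int row_bias = B[i];             /* zero stride w.r.t. j, constant stride w.r.t. i */
            for (size_t j = 0; j < M; j++)
                sum += A[i][j] + row_bias;   /* constant stride w.r.t. j */
        }
        return sum;
    }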
Hardware-based techniques implement special circuits that detect regularity in the memory access patterns of an
instruction or a group of instructions and base their predictions on these patterns. Some techniques even extend their
predictions to patterns classified as irregular, such as linked-list traversals [13].
2.2 Jouppi’s Stream Buffer Scheme
This scheme is a variation of the One Block Lookahead (OBL) prefetching scheme [15]. The OBL scheme
prefetches cache block i+1 whenever cache block i is referenced. Jouppi [8] suggested an off-chip scheme in which several
data streams can be prefetched using FIFO stream buffers. A stream buffer is allocated whenever a data cache miss occurs on
an address, and the buffer then prefetches the following addresses. Each stream buffer entry stores the next address to be
referenced, the cache block and tag, and a valid bit. An incrementer is used to generate the next prefetch address. Note that
stream buffers are allocated only on data cache misses, so they require much less memory bandwidth than most on-chip
schemes.
Palacharla and Kessler [12] extended Jouppi’s scheme to detect and prefetch non-unit strides. They added fields
containing the last instruction address (PC), the stride and a state to each stream buffer. The last instruction address keeps
track of the last reference to the stream buffer. The stride field, which is the difference between the last two references to the
stream buffer, determines the next address to prefetch. The state field determines whether to continue prefetching; it is
updated based on the success of previous predictions.
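As a rough sketch of the per-buffer bookkeeping described above (the structure layout, field names and state encoding are
our own illustration, not taken from [8] or [12]):

    /* One stream buffer extended with PC, stride and state fields. */
    typedef enum { SB_ALLOCATED, SB_STEADY } sb_state_t;

    typedef struct {
        unsigned long last_pc;    /* address of the last instruction that referenced this buffer */
        unsigned long last_addr;  /* last data address seen by this buffer */
        long          stride;     /* difference between the last two references */
        sb_state_t    state;      /* whether prefetching should continue */
        /* FIFO of prefetched blocks (tag, data, valid bit) omitted for brevity */
    } stream_buffer_t;

    /* Update on a reference that maps to this buffer: recompute the stride and
     * keep prefetching only while the observed stride matches the recorded one. */
    static void sb_update(stream_buffer_t *sb, unsigned long pc, unsigned long addr)
    {
        long observed = (long)(addr - sb->last_addr);
        sb->state     = (observed == sb->stride) ? SB_STEADY : SB_ALLOCATED;
        sb->stride    = observed;
        sb->last_pc   = pc;
        sb->last_addr = addr;
        /* next prefetch candidate while in SB_STEADY: addr + sb->stride */
    }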
2.3 Chen and Baer’s Reference Prediction Schemes
Chen and Baer [3] introduced three variations of on-chip hardware-based data prefetching. These variations handle
the scalar, zero-stride and constant-stride access patterns. The basic scheme builds a reference prediction table (RPT) with an
entry for each instruction in the program that references memory (i.e., loads and stores). An RPT entry consists of the
instruction address (PC), the previous data address referenced by this instruction, and the stride, which is the difference
between the last two referenced data addresses. In addition, an RPT entry contains a state field that records the success of
previous prefetches for this entry. Data prefetching is triggered when the program counter reaches an instruction that has a
corresponding entry in the RPT. If the state of that entry indicates that data accesses can be predicted, the data at address
(current address + stride) is prefetched into the cache. The state field of an RPT entry is updated on every memory reference
by the instruction whose address is stored in that entry, according to the state transition diagram shown in figure 1. All states
are predictable except the No-pred state. Whenever data from a certain address is to be prefetched, the address is added to an
Outstanding Request List (ORL). Prefetches are issued from the ORL in order as long as no demand cache misses occur.
Demand cache misses have precedence over ORL prefetches, and a reference is not prefetched if it would cause an exception
(e.g., a page fault).
[Figure 1: State Transition Diagram for Basic Reference Prediction Scheme. States: init, steady, transient and No-pred;
transitions are taken on correct and incorrect predictions.]
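To make the basic scheme concrete, the sketch below shows an RPT entry and its update rule as we read the description
above and the transitions of figure 1; it is an illustration only, not Chen and Baer’s implementation.

    typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } rpt_state_t;

    typedef struct {
        unsigned long tag;        /* address (PC) of the load/store instruction */
        unsigned long prev_addr;  /* previous data address referenced by it     */
        long          stride;     /* difference between the last two addresses  */
        rpt_state_t   state;      /* state from figure 1                        */
    } rpt_entry_t;

    /* Update the entry on a memory reference by its instruction; returns the
     * address to prefetch, or 0 if the entry is currently not predictable.    */
    static unsigned long rpt_update(rpt_entry_t *e, unsigned long addr)
    {
        int         correct = (addr == e->prev_addr + (unsigned long)e->stride);
        rpt_state_t old     = e->state;

        switch (old) {                        /* transitions of figure 1 */
        case INIT:      e->state = correct ? STEADY    : TRANSIENT; break;
        case TRANSIENT: e->state = correct ? STEADY    : NO_PRED;   break;
        case STEADY:    e->state = correct ? STEADY    : INIT;      break;
        case NO_PRED:   e->state = correct ? TRANSIENT : NO_PRED;   break;
        }
        /* On a misprediction the stride is recomputed, except in the steady
         * state, where the old stride is kept (cf. section 4).                */
        if (!correct && old != STEADY)
            e->stride = (long)(addr - e->prev_addr);
        e->prev_addr = addr;

        return (e->state != NO_PRED) ? addr + (unsigned long)e->stride : 0;
    }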
This scheme decreases the data access penalty on nearly all benchmarks, but it has a drawback arising from its
strategy of prefetching the predicted reference for the next execution of the current instruction. If the next occurrence of the
same instruction is too close (as in small loops), the prefetched data may not arrive in time to be used. On the other hand, if
the next occurrence is too far away (as in large loops), the predicted data may arrive too early and replace a useful block in
the data cache, or may even be replaced before it is used. Because of these drawbacks, Chen and Baer introduced the
lookahead reference prediction scheme.
In the lookahead reference prediction scheme, an advanced copy of the program counter, the lookahead program
counter (LA_PC), runs ahead of the actual program counter and is used to trigger data prefetches. LA_PC is kept ahead of
the actual PC by slightly more cycles than the memory latency. If LA_PC matches the address in an RPT entry, the next
predicted reference is added to the ORL. This increases the probability that the prefetched data will be in the cache just
before it is actually needed. The lookahead scheme adds another field to each RPT entry, called times, which indicates the
number of iterations by which LA_PC is ahead of the actual PC. Updating LA_PC requires branch prediction to predict the
outcomes of branches. When a branch misprediction is discovered while executing the normal instruction stream, LA_PC is
reset to PC and all requests on the ORL are removed. When PC stalls because of a cache miss, LA_PC can still advance and
issue prefetches to the ORL.
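The self-contained sketch below illustrates how the ORL and the times field might interact in the lookahead scheme; the data
structures, the flush policy and the omission of LA_PC distance control are our simplifying assumptions, not details from [3].

    #include <stddef.h>

    #define ORL_SIZE 30            /* matches the ORL size used in our runs */

    typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } rpt_state_t;

    typedef struct {
        unsigned long tag, prev_addr;
        long          stride;
        int           times;       /* iterations LA_PC is ahead of the real PC */
        rpt_state_t   state;
    } la_rpt_entry_t;

    typedef struct {
        unsigned long addr[ORL_SIZE];
        size_t        head, count;
    } orl_t;

    static int orl_push(orl_t *o, unsigned long addr)
    {
        if (o->count == ORL_SIZE)
            return 0;                             /* ORL full: drop the request */
        o->addr[(o->head + o->count) % ORL_SIZE] = addr;
        o->count++;
        return 1;
    }

    /* LA_PC matched this entry: queue a prefetch one further iteration ahead.
     * (Keeping LA_PC the right distance ahead of PC is not shown.)            */
    static void lookahead_trigger(la_rpt_entry_t *e, orl_t *orl)
    {
        if (e->state == NO_PRED)
            return;
        unsigned long predicted =
            e->prev_addr + (unsigned long)((long)(e->times + 1) * e->stride);
        if (orl_push(orl, predicted))
            e->times++;
    }

    /* On a branch misprediction, LA_PC is reset to PC and the ORL is emptied. */
    static void lookahead_reset(orl_t *orl)
    {
        orl->head = orl->count = 0;
    }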
Chen and Baer also described another scheme, the correlated reference prediction scheme, which is going to be
discussed and modified in the next sections.
2.4 Other Schemes
Joseph and Grunwald [4] introduced an off-chip scheme that uses the miss address stream as its prediction source.
They assume that the miss reference stream can be approximated by a Markov model, which is constructed dynamically
while a program executes. The Markov model thus captures the past activity of the programs running on a system and uses
this information to predict future references. The model is approximated by limiting the number of states as well as the
out-degree of each state, so that it can be implemented in hardware. The main disadvantage of this scheme is its high
hardware complexity [4].
Roth et al. [13] introduced a Dependence-Based prefetching scheme, which dynamically identifies the load
instructions that access a linked list structure, and constructs a description of the steps the program has followed to traverse
the structure. Based on the assumption that the program will continue to follow the same steps, a prefetch engine takes this
description and speculatively executes the load instructions in parallel with the original program’s execution. Details of this
scheme can be found in [13].
3. CORRELATED PREFETCHING SCHEME
From the results published for the above schemes, on-chip schemes appear to yield much better performance (i.e.,
lower miss rates and data access penalties) than off-chip schemes. Chen and Baer [3] introduced another scheme, the
correlated reference prediction scheme, which handles two levels of stride changes. They introduced this scheme to handle
triangular loops. A typical triangular loop is:
    for (i = 0; i < n; i++)
        for (j = 0; j < i; j++)
            A[i][j] = .....;
References in the inner loop are constant strides, but the frequent stride changes on each transition to the outer loop cause a
lot of useless prefetches in the lookahead scheme. In image processing applications, if processing is done for only part of the
image, a change of stride occurs whenever a line of an image is processed and the processing moves to the next line.
The key idea behind Chen and Baer’s correlated reference prediction scheme is to keep track of adjacent accesses
in inner loops (as in the basic and lookahead schemes), as well as of accesses correlated by changes in loop level. They
assume that loop branches are backward branches; thus, a not-taken branch triggers the correlated scheme to use the higher
level of strides, since a not-taken branch implies the end of a loop.
Their implementation of the correlated scheme is restricted to two levels of nested loops. The additional hardware,
in comparison with the lookahead scheme, is:
- A 2-bit shift register that stores the branch history of the last two loop branches.
- An additional (prev_addr, stride) pair in each entry of the RPT.
The use of branch history depends on some support from the compiler, which is expected to flag loop branches to
distinguish them from other branches (such as if-then-else branches inside the loop). An RPT entry in the correlated scheme
contains the following fields:
- tag: the load/store instruction address
- p_addr0, stride0: previous address and stride used when the inner-loop branch is not taken
- p_addr1, stride1: previous address and stride for the normal instruction reference inside the inner loop
- times: the number of inner-loop iterations by which LA_PC is ahead of the actual PC
- state: same as in figure 1
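Written as a C structure (our illustration; the field names follow the list above), a correlated RPT entry might look like this:

    typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } rpt_state_t;

    /* RPT entry for the correlated scheme: one (prev_addr, stride) pair for the
     * references inside the inner loop and one for the references made across
     * outer-loop iterations.                                                   */
    typedef struct {
        unsigned long tag;       /* load/store instruction address               */
        unsigned long p_addr0;   /* previous address when the inner-loop branch
                                    is not taken (outer-loop level)              */
        long          stride0;   /* stride across outer-loop iterations          */
        unsigned long p_addr1;   /* previous address for the normal reference
                                    inside the inner loop                        */
        long          stride1;   /* stride inside the inner loop                 */
        int           times;     /* inner-loop iterations LA_PC is ahead of PC   */
        rpt_state_t   state;     /* same states as in figure 1                   */
    } corr_rpt_entry_t;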
The results presented by Chen and Baer [3] show a slight improvement of the correlated scheme over the lookahead
scheme, and our intuition was that the improvement would be more significant for image processing and multimedia
applications.
4. MODIFIED CORRELATED PREFETCHING SCHEME
There are two main disadvantages to the original correlated scheme. The compiler support that Chen and Baer
assumed in order to obtain their results cannot be provided by hardware alone. Also, the hardware implementation depends
on whether the loops are compiled with forward or backward branches. Both of these disadvantages greatly limit the
applicability of the scheme as a purely hardware-based prefetching scheme.
In this paper we propose a modified correlated reference prediction scheme that eliminates the need for compiler
support. The basic idea behind the scheme is the same as in the original: whenever a new outer-loop iteration starts, we issue
a prefetch for the next outer-loop iteration. The difference is in how the start of a new outer-loop iteration is detected. In the
original correlated scheme, LA_PC is used to prefetch the data accessed inside the inner loop. When the inner loop
terminates, there is a branch misprediction and LA_PC is reset to the current PC. When the data-referencing instruction is
encountered again, the scheme notes that the branch history (which records loop branches only) has changed and issues a
prefetch for the next iteration of the outer loop.
This is where our scheme differs from the original. When a new iteration of the outer loop starts, we always observe
a sudden change in the previous stride, unless the access is adjacent (as when traversing a full array). This change in stride is
used as the indication of a new outer-loop iteration. The original scheme relies on branch history to determine when a new
iteration starts; the change in stride makes branch history unnecessary for this purpose.
This gives our scheme a major advantage over the original, since it completely eliminates the need for compiler
support. No extra hardware is needed to implement the new scheme, because the change in stride is already computed for the
state transitions. When a change in stride1 is encountered in the steady state, the state is set to init and stride1 is left
unchanged, as in the original scheme. The condition p_addr0 + stride0 == current_addr is then evaluated. If it is true, the last
prediction was correct, and a prefetch for the next outer-loop iteration is issued for the address current_addr + stride0. If the
condition is false, stride0 is set to current_addr – p_addr0 and no prefetch is issued. In both cases p_addr0 is set to
current_addr.
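A minimal sketch of this update rule (our illustration, not the simulator code); the corr_rpt_entry_t fields repeat, in
abbreviated form, the entry layout listed in section 3, and issue_prefetch is a stub standing in for placing an address on the
ORL.

    typedef enum { INIT, TRANSIENT, STEADY, NO_PRED } rpt_state_t;

    typedef struct {
        unsigned long tag, p_addr0, p_addr1;
        long          stride0, stride1;
        int           times;
        rpt_state_t   state;
    } corr_rpt_entry_t;

    /* Stub standing in for adding an address to the ORL. */
    static void issue_prefetch(unsigned long addr) { (void)addr; }

    /* Called when the inner-loop reference mispredicts in the steady state,
     * i.e., when a change in stride1 signals a new outer-loop iteration.      */
    static void outer_loop_update(corr_rpt_entry_t *e, unsigned long current_addr)
    {
        e->state = INIT;                      /* stride1 is left unchanged      */

        if (e->p_addr0 + (unsigned long)e->stride0 == current_addr) {
            /* The outer-loop stride was predicted correctly: prefetch the
             * corresponding reference of the next outer-loop iteration.       */
            issue_prefetch(current_addr + (unsigned long)e->stride0);
        } else {
            /* Learn the new outer-loop stride; no prefetch this time.         */
            e->stride0 = (long)(current_addr - e->p_addr0);
        }
        e->p_addr0 = current_addr;
    }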
This scheme also performs slightly better than the original. If LA_PC reaches the instruction again before PC
reaches the branch instruction of the inner loop, the branch history is still unchanged, so the original scheme would again try
to prefetch for the next outer-loop iteration, whereas our scheme issues a prefetch for the next inner-loop iteration.
5. IMAGE PROCESSING AND MULTIMEDIA WORKLOAD
In this paper, we are especially concerned with image processing and multimedia applications. We selected a set of
applications that are representative of a typical image processing and multimedia workload on a modern computer system.
Since there is currently no well-established benchmark suite for image processing and multimedia, we used the following
programs to estimate the effectiveness of the prefetching schemes:
1. ijpeg: the SPECint95 JPEG compression program. We used the same input for our runs as specified in the SPECint95
benchmark suite: the program is run on a 32 KB PPM image that is converted to the JPEG format. This program is important
because JPEG is the most widely used image format on the world wide web.
2. filter: a synthetic program that applies a basic low-pass (blurring) filter to an image. We used an input image of size 32 KB,
to which the basic averaging low-pass filter was applied; the intensity of each image point is averaged with that of all its
neighbors.
3. mpeg2encode: encodes an image sequence into the MPEG-2 compressed format. We use this program as an estimate of
the workload of movie-recording programs. Although it does not have a very regular access pattern, it is included because it
is representative of the multimedia workload.
4. mpeg2decode: decodes an MPEG-2 movie into separate images. We use this program as an estimate of the workload of
media-player programs that display a sequence of images from an MPEG file. This is the most frequently used type of
multimedia program.
In addition to the image processing and multimedia workload, we also used two programs from the SPEC95
benchmark suite: compress and tomcatv. The data access predictability of these programs is summarized in Table 1, which
shows the percentage of accesses for which the accessed RPT entry is in each state of the state transition diagram (figure 1).
The table shows that most of these applications are predictable by any reasonable prefetching scheme.
%           ijpeg      filter     mpeg2encode  mpeg2decode  compress   tomcatv
init        31.0567     0.0454    12.9842      18.3751      13.4662    62.6204
transient    3.2055     0.0001     2.2568       2.4559       0.1789     1.3875
steady      55.7563    99.9544    78.6153      78.2543      85.9282    34.6807
no-pred      9.9815     0.0000     6.1437       0.9147       0.4267     1.3114

Table 1: Percentage of RPT accesses for which the accessed entry is in each state of figure 1
6. SIMULATION RUNS
6.1 Simulation Assumptions
To perform simulation runs on the above applications, we modified the cache simulator that is part of version 3.0 of
the SimpleScalar tool set [1]. This version of SimpleScalar performs fast, flexible and accurate simulation of Alpha binaries.
We added to this tool set implementations of Chen and Baer's basic and lookahead schemes, Palacharla and Kessler's stream
buffers, and our modified correlated reference prediction scheme.
In our implementation of the basic, lookahead and correlated prefetching schemes, we used the same assumptions as
used by Chen and Baer [3]. These assumptions are:
1. A perfect pipeline, meaning that all data dependences that would cause pipeline stalls are ignored. This assumption does
not favor prefetching schemes, since a non-perfect pipeline would provide additional cycles for prefetched data to arrive in
the data cache before it is needed.
2. A perfect instruction cache, meaning that stalls caused by instruction cache and instruction TLB misses are ignored. This
assumption also does not favor prefetching schemes, for the same reason stated above.
3. The non-overlapped memory model, one of the memory models implemented by Chen and Baer (in addition to the
overlapped and pipelined models). This is the most restrictive model, since it implies that no memory request can be initiated
until the previous (and only) outstanding request has completed. This means that if we place a memory request on the ORL
while another request is still outstanding, the new request cannot be served until the data for the outstanding request arrives
from memory. This model is typical of an on-chip cache backed by a second-level cache [3].
In our implementation of the stream buffer technique, we gave the off-chip approach the maximum possible benefit
by assuming an effectively infinite number of perfect stream buffers. A perfect stream buffer has its prefetched data ready
when it is required, ignoring memory latency. In practice we simulated 512 stream buffers, a number large enough for almost
all applications to be considered infinite. Stream buffers are allocated in FIFO order.
6.2 Simulation Settings
We performed two sets of experiments. In the first set, we measured cache performance without prefetching as well
as the performance of the different prefetching schemes under the same settings. The objective of this set of experiments is
to compare the benefits of prefetching for different cache sizes across the benchmark programs. In this set of experiments we
kept the cache line size, the ORL size and the number of BTB entries constant. The settings we used are:

- Cache line size = 32 bytes
- Direct-mapped L1 cache
- ORL size = 30 entries
- Number of BTB entries = 512
- L2 cache access latency = 6 cycles
- L2 cache: 256 KB, 4-way set associative, 64-byte lines, LRU replacement policy
- Memory access latency = 30 cycles
We also performed a second set of experiments to determine the impact of different block sizes and ORL sizes on
the performance of basic, lookahead and correlated prefetching. The cache size was kept fixed at 16 KB. For the block-size
experiment, we used block sizes of 16, 32 and 64 bytes. For the ORL experiment, we used 1, 2, 4, 8 and 16 ORL entries. All
of these experiments were conducted on ijpeg.
6.3 Simulation Metrics
We used three metrics to compare different prefetching techniques.
6.3.1 L1 Cache Miss Rate
This metric measures how well a prefetching scheme reduces cache misses. In the basic, lookahead and correlated
prediction schemes, an access is counted as a hit if the data request for this access has already been issued from the ORL,
even if it has not yet completed. However, a reference that matches an outstanding request in the ORL that has not yet been
issued is counted as a miss. For the stream buffer scheme, a hit on any prefetched block in a stream buffer is counted as a
cache hit.
6.3.2 L1 Cache Replacement Rate
This is the rate at which blocks are being replaced from the data cache. This is an indication of the number of cache
prefetches and thus the increase in memory bandwidth required by the prefetching schemes. On-chip prefetching schemes
usually have higher cache replacement rates than a cache without prefetching.
6.3.3 Memory Cycles Per Instruction (MCPI)
This metric represents the average number of stall cycles per instruction contributed by memory accesses. Since we
assume that the processor can execute one instruction per cycle (a perfect pipeline) and that the instruction cache is ideal, the
only extra contribution to CPI comes from the data access penalty. The MCPI caused by the data access penalty is calculated
as:
MCPI = (Total Number of Memory Wait Cycles) / (Number of Instructions Executed)
We use MCPI because it reflects the actual stall time observed by the processor, taking both processor execution
and cache behavior into account. However, MCPI could be reduced considerably for prefetching schemes under a more
relaxed memory model (such as the overlapped or pipelined models [3]).
7. RESULTS AND DISCUSSION
7.1 L1 Cache Miss Rate
Figure 2 shows the cache miss rates of the benchmark programs for the different prefetching schemes. For the
multimedia and image processing applications, all prefetching schemes improve the cache miss rate over the no-prefetch
case. The only exception is the filter program for a cache size of 8 KB; in this program, the higher replacement rate due to
prefetching, combined with the fact that the image size is a multiple of the cache size, causes many of these replacements to
evict useful data from the cache. For larger cache sizes, however, the prefetching schemes reduce compulsory misses, which
greatly reduces the miss rate compared with the no-prefetch scheme. For a 32 KB cache, for instance, the no-prefetch scheme
has a 2.28% miss rate and the stream buffer scheme a 1.71% miss rate, whereas all on-chip schemes have a miss rate of
almost zero.
Note that the improvement from the stream buffer scheme is almost constant relative to the no-prefetch scheme
across nearly all programs, whereas the performance of the on-chip techniques varies with the program. Since the stream
buffer scheme does not often replace useful data in the cache (it is activated only on cache misses), it performs better for
small cache sizes in ijpeg, filter and mpeg2encode.
For the two other SPEC95 applications, the prefetching schemes greatly reduce the cache miss rate compared with
the no-prefetch scheme. The reduction is larger for compress than for tomcatv because of the higher predictability of
compress accesses, as shown in table 1.
7.2 L1 Cache Replacement Rate
Figure 3 shows the cache replacement rates of different benchmark programs for different prefetching schemes. For
all cache sizes, the prefetching schemes produce some prefetch requests, and so they have a higher replacement rate in
[Figure 2: Cache Miss Rate for Various Benchmark Programs. One panel per program (ijpeg, filter, mpeg2encode,
mpeg2decode, compress, tomcatv); x-axis: cache size (8K-128K); curves: N (no prefetch), S (stream buffers), B (basic),
L (lookahead), C (correlated).]
comparison with the no-prefetch scheme. The off-chip stream buffer scheme also has a lower replacement rate than the
on-chip schemes because it generates fewer prefetches.
[Figure 3: Cache Replacement Rate for Various Benchmark Programs. One panel per program (ijpeg, filter, mpeg2encode,
mpeg2decode, compress, tomcatv); x-axis: cache size (8K-128K); curves: N, S, B, L, C as in figure 2.]
7.3 Memory Cycles Per Instruction (MCPI)
Figure 4 shows the data access penalty (in memory stall cycles per instruction) of the benchmark programs for the
different prefetching schemes. This figure is the weakest point in our results because of the non-overlapped memory model,
which cannot serve more than one request at a time and therefore limits the benefits of prefetching. Nevertheless, for the two
most frequently used image processing and multimedia programs (ijpeg and mpeg2decode), MCPI is greatly reduced by all
prefetching schemes compared with the no-prefetch scheme. In these two applications the correlated scheme performs best,
followed by the lookahead, basic and stream buffer schemes. The same improvement is seen for compress. These results are
consistent with our expectations for these applications. For filter and mpeg2encode, as well as for tomcatv, the limitations of
the non-overlapped memory model, together with the high replacement rate, make the prefetching schemes perform worse
than the no-prefetch scheme for small cache sizes. The miss rates indicate that removing the non-overlapped model's
limitations would yield an improvement under the overlapped and pipelined memory models, since the reduction in miss rate
implies that the prefetching schemes make correct predictions.
7.4 Effect of Different Cache Block Sizes
Figure 5 shows the effect of increasing the block size for the ijpeg program. The figure shows that the improvement
from prefetching is consistent across all data cache block sizes. Note, however, that for a program like ijpeg, which accesses
an image larger than the cache, the improvement produced by prefetching is larger for smaller block sizes. An increase in
block size from 16 to 32 bytes produces a larger reduction in MCPI than the introduction
[Figure 4: MCPI (Memory Cycles Per Instruction) for Various Benchmark Programs. One panel per program; x-axis: cache
size (8K-128K); schemes N, B, L, C.]
[Figure 5: Effect of Different Cache Block Sizes (16, 32 and 64 bytes) on Prefetching Schemes, for a 16K cache running
ijpeg; schemes N, B, L, C.]
of prefetching. For typical block sizes, however, prefetching greatly improves MCPI.
7.5 Effect of Different ORL Sizes
Figure 6 shows the effect of different ORL sizes on MCPI. Because of the constraints of the non-overlapped memory model,
varying the ORL size has an insignificant effect on the prefetching schemes, since only one prefetch can be issued to memory
at a time.
[Figure 6: Effect of Different ORL Sizes (1, 2, 4, 8 and 16 entries) on MCPI for ijpeg; schemes N, B, L, C.]
7.6 Summary of Results
The previous results show that a small on-chip hardware prefetching unit can improve performance for image
processing and multimedia applications. The magnitude of this improvement, however, is limited by the restrictions of the
non-overlapped memory model. As shown in table 1, image processing and multimedia applications are highly predictable.
The large reduction in cache miss rates for nearly all prefetching schemes means that prefetching can be used effectively for
these types of applications.
For mpeg2decode and compress, there is a large reduction in MCPI for all prefetching schemes compared with the
no-prefetch scheme. This is because these programs have many non-sequential data accesses that, without prefetching,
generate many misses. Since the working set of compress is large relative to practical L1 cache sizes, increasing the cache
size does not improve MCPI much; MCPI can only be improved by reducing compulsory and capacity misses, as the
prefetching schemes do.
On-chip prefetching schemes perform better than the off-chip stream buffer scheme in most cases, even with an
infinite pool of perfect stream buffers. This is because stream buffers base their prediction on the miss address stream,
whereas on-chip schemes can use all the addresses issued by the processor. This improvement comes at the expense of a
higher cache replacement rate and hence higher memory bandwidth.
8. CONCLUSIONS AND FUTURE WORK
We studied several data prefetching schemes with a focus on their impact on image processing and multimedia
applications. We extended the correlated reference prediction scheme proposed by Chen and Baer [3] into a scheme that
requires neither compiler support nor branch history. The proposed scheme provides the same performance improvement as
the original correlated scheme. We compared the performance of this scheme with that of a cache system without
prefetching, off-chip stream buffer prefetching, and the on-chip basic and lookahead reference prediction schemes.
For the multimedia and image processing workload we used, as well as for the other SPEC95 benchmark programs,
the prefetching schemes significantly reduced the cache miss rate, indicating the potential of prefetching for most
applications. The reduction in miss rate came at the expense of a higher cache replacement rate, which implies higher
memory bandwidth. Memory cycles per instruction is also reduced by the prefetching schemes for most applications.
In all our experiments we used the non-overlapped memory model, in which only one memory access request can
be in progress at any time. A possible extension is to simulate the overlapped model, in which multiple requests can be active
at the same time, as done by Chen and Baer [3]. This could increase the benefits of prefetching by removing the misses that
result from outstanding prefetch requests that have not yet been issued.
These prefetching schemes also need to be evaluated on out-of-order processors, to determine whether the reduction
in MCPI would be as significant there as it is for in-order processors and would therefore justify adding an on-chip
prefetching unit to out-of-order processors.
Another interesting direction would be to study prefetching schemes with multiple levels of cache. Multiple cache
levels have a higher tolerance for prefetches that are generated far ahead of when they are required; we expect such
prefetches to reduce the latency seen by the L1 cache by prefetching data from main memory into the L2 cache.
ACKNOWLEDGMENTS
We would like to thank Todd Austin for his hint about SimpleScalar, Collin McCurdy for his helpful comments, and Prof.
Mark Hill for his helpful pointers to research papers.
REFERENCES
[1] Doug Burger and Todd M. Austin, “The SimpleScalar Tool Set, Version 2.0,” Technical Report #1342, University of
Wisconsin-Madison Computer Sciences Department, June 1997.
[2] William Y. Chen, Scott A. Mahlke, Pohua P. Chang and Wen-mei W. Hwu, “Data Access Microarchitectures for
Superscalar Processors with Compiler-Assisted Data Prefetching,” Proc. 24th Annual International Symposium on
Microarchitecture, pp. 69-73, November 1991.
[3] Tien-Fu Chen and Jean-Loup Baer, “Effective Hardware-Based Data Prefetching for High-Performance Processors,”
IEEE Transactions on Computers, Vol. 44, No. 5, pp. 609-623, May 1995.
[4] Douglas Joseph and Dirk Grunwald, “Prefetching Using Markov Predictors,” Proc. of the 24th Annual International
Symposium on Computer Architecture, pp. 252-263, June 1997.
[5] John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,” Morgan Kaufmann
Publishers, Inc., 1996.
[6] Mark D. Hill, “A Case for Direct Mapped Caches,” IEEE Computer, Vol. 21, No. 12, pp. 25-40, December 1988.
[7] Intel Corporation, “MMX Technology Overview,” http://developer.intel.com/technology/itj/q31997/articles/art_2.htm.
[8] Norm P. Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and
Prefetch Buffers,” Proc. 17th International Symposium on Computer Architecture, pp. 364-373, June 1990.
[9] T. Mowry, M. S. Lam and A. Gupta, “Design and Evaluation of a Compiler Algorithm for Prefetching,” Proc. Fifth
International Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 62-73, 1992.
[10] Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, and Robert R. Roediger, “SPAID: Software Prefetching in
Pointer- and Call-Intensive Environments,” Proc. 28th Annual International Symposium on Microarchitecture, pp. 231-236,
November 1995.
[11] Toshihiro Ozawa, Yasunori Kimura, and Shin’ichiro Nishizaki, “Cache Miss Heuristics and Preloading Techniques for
General-Purpose Programs,” Proc. 28th Annual International Symposium on Microarchitecture, pp. 243-248, November 1995.
[12] Subbarao Palacharla and R.E. Kessler, “Evaluating Stream Buffers as a Secondary Cache Replacement,” Proc. 21st
Annual Symposium on Computer Architecture, pp. 24-33, April 1994.
[13] Amir Roth, Andreas Moshovos, and Gurindar S. Sohi, “Dependence Based Prefetching for Linked Data Structures,”
Proc. 8th International Conf. on Architectural Support for Programming Languages and Operating Systems, October 1998.
[14] Vatsa Santhanam, Edward H. Gornish and Wei-Chung Hsu, “Data Prefetching on the HP PA-8000,” Proc. of the 24th
Annual International Symposium on Computer Architecture, pp. 264-273, June 1997.
[15] A. J. Smith, “Cache Memories,” ACM Computing Surveys, Vol. 14, No. 3, pp. 473-530, September 1982.
[16] Steven P. Vander Wiel and David J. Lilja, “When Caches Aren’t Enough: Data Prefetching Techniques,” IEEE
Computer, Vol. 30, No. 7, pp. 23-30, July 1997.