DATA CACHE PREFETCHING SCHEMES FOR MULTIMEDIA AND IMAGE PROCESSING APPLICATIONS

ALAA R. ALAMELDEEN alaa@cs.wisc.edu
DEEPAK JINDAL jindal@cs.wisc.edu
SANJEEV KULKARNI sanjeevk@cs.wisc.edu
Dept. of Computer Sciences, University of Wisconsin-Madison

ABSTRACT
Data prefetching is an important approach to reducing the data cache miss rate due to compulsory and capacity cache misses. Prefetching can be performed in hardware or in software. The basic idea is to keep track of the data access patterns of a program and to use this information to anticipate the location of data that is going to be accessed next. Image processing and multimedia applications are among the fastest-growing application areas in computer systems. They are memory-intensive and time-critical, but they have predictable data access patterns. These properties make them ideal candidates for data prefetching schemes. In this paper, we study several hardware-based data prefetching schemes and their effect on the performance of image processing and multimedia applications. We introduce a modification to the correlated-prefetching hardware scheme presented by Chen and Baer that removes its dependence on compiler support. We describe a set of multimedia and image processing benchmark programs. Simulation experiments comparing different hardware prefetching schemes show satisfactory results for most of the image processing and multimedia applications without hurting performance for other SPEC benchmarks.

KEY WORDS
Prefetching, hardware-based prefetching, data cache, reference prediction, memory access patterns, image processing, multimedia.

1. INTRODUCTION
Due to the increasing speed gap between processors and memory, the increasing speed of processors cannot be exploited efficiently unless some way of tolerating memory latency exists. The invention of cache memories as an intermediate storage level between main memory and the processor was a breakthrough that helped reduce the gap between processor speed and main memory speed. However, memory latency cannot be totally eliminated. Upon a cache miss, the processor has to wait until the data is fetched both to the cache and to the processor. This wait time cannot be fully eliminated even for out-of-order processors. Several techniques have been proposed to reduce the data access penalty that results from cache misses. These include [5]:
- Increasing cache size, which is limited by the available on-chip area for the cache.
- Increasing cache associativity to reduce conflict misses, which may, however, increase the cache hit time [6].
- Jouppi's victim cache [8], which reduces conflict misses by storing the blocks replaced from the cache in a small associative victim cache. The benefit of this technique diminishes as cache size and/or associativity increase.
- Data prefetching, which reduces compulsory and capacity cache misses by attempting to have data in the cache before it is actually required, whether that data has been referenced before and replaced (capacity miss) or not (compulsory miss).
Data prefetching is defined as moving data from a higher level of the memory hierarchy to a lower level before the data is actually needed [14]. Data prefetching can be done in hardware or in software. Hardware-based data prefetching, explained in more detail in the next section, adds a hardware unit that fetches data to the data cache before it is needed.
Prefetching data to the cache depends on previously observed data access patterns. Software-based prefetching [2, 9, 10, 11, 16], on the other hand, is based on static analysis of the program. An intelligent compiler may insert special instructions that prefetch data many cycles ahead of their use by other instructions. These techniques can issue more prefetches and can handle complex access patterns (which cannot be handled in hardware due to the high complexity). This comes at the cost of added prefetch instructions (including a change in the instruction set architecture) as well as additional program calculations of the prefetched addresses (which may reduce program speed). Many machines include architectural support for software-based prefetching schemes; an example is the HP PA-8000 [14].

Multimedia and image processing applications are growing application areas in modern computing environments, and this trend continues to accelerate with internet technology and its many services, such as video-on-demand. Inspired by this trend, most modern architectures include some architectural support for these applications, such as Intel MMX [7]. The main problem with these applications is that they are memory intensive, processing large volumes of data, which makes them rather slow. For example, a typical 24-bit true color image of dimensions 500x500 requires approximately 750 Kbytes of storage, which cannot fit in L1 caches of practical sizes. However, the memory access patterns of most of these applications are quite regular and predictable, which makes them well suited for data prefetching schemes.

The rest of this paper is organized as follows. Section 2 surveys some hardware-based data prefetching schemes, with a focus on the Chen and Baer basic and lookahead prefetching schemes [3]. Section 3 explains the Chen and Baer correlated reference prediction scheme. Section 4 introduces our modification to the correlated prefetching scheme. Section 5 describes the image processing benchmarks used to compare the performance of different hardware-based prefetching schemes. Section 6 describes our simulation model and performance metrics. Section 7 presents the results and their explanations. Section 8 provides conclusions and suggestions for future work.

2. HARDWARE-BASED DATA PREFETCHING
Hardware-based data prefetching schemes dynamically predict the memory address to prefetch based on the addresses of previous memory accesses. These techniques can be classified into two main classes:
i. On-chip schemes, which rely on the addresses of all data references issued by the processor. Examples include Chen and Baer's basic, lookahead and correlated schemes [3].
ii. Off-chip schemes, which rely on the addresses that result in L1 cache misses, since these are usually the only addresses available outside the processor chip. These include Jouppi's stream buffers [8] as enhanced by Palacharla and Kessler [12].
There are arguments for both approaches. On-chip schemes can be more accurate in predicting prefetched data, thus removing a large portion of the compulsory and capacity cache misses. On the other hand, on-chip schemes consume precious chip area that could be used to build a larger cache, and they increase the required memory bandwidth.
Off-chip schemes do not frequently replace useful data in the data cache and require much less memory bandwidth, but their prefetching accuracy is usually not as good as that of on-chip schemes.

2.1 Memory Access Patterns
Prefetching schemes, both on-chip and off-chip, base their prediction of the next address needed by an instruction on some regularity in the memory access pattern. For example, in a program segment consisting of nested loops, the memory access patterns can be divided into the following four main categories [3]:
1. Scalar, which is a simple variable reference that does not change with respect to the loop index.
2. Zero stride, which is a reference inside an inner loop with a subscript expression that does not change with respect to the inner loop index, but may change with respect to the outer loop (e.g. a reference to A[i] inside an inner loop indexed by j).
3. Constant stride, which is a reference inside a loop with a subscript expression that increases at a constant rate with respect to the loop index (e.g. the reference A[i] inside the loop indexed by i).
4. Irregular, which is any pattern other than the previous three (e.g., linked list traversals, tree traversals, etc.).
Hardware-based techniques implement special circuits to detect regularity in the memory access patterns of an instruction or a group of instructions, and base their predictions on these patterns. Some techniques even extend their predictions to patterns classified as irregular, such as linked-list traversal accesses [13].

2.2 Jouppi's Stream Buffer Scheme
This scheme is a variation of the One Block Lookahead (OBL) prefetching scheme [15]. The OBL scheme prefetches cache block i+1 whenever cache block i is referenced. Jouppi [8] suggested an off-chip scheme in which several data streams can be prefetched using FIFO stream buffers. A stream buffer is allocated whenever there is a data cache miss on an address, and the buffer proceeds by prefetching the next address to be referenced. Each stream buffer stores the next address to be referenced, the cache block and tag, and a valid bit. An incrementor is used to generate the next prefetch address. Note that stream buffers are only allocated on a data cache miss, which greatly reduces the memory bandwidth required compared with most on-chip schemes. Palacharla and Kessler [12] extended Jouppi's scheme to detect and prefetch non-unit strides. They added fields containing the last instruction address (PC), the stride and a state to each stream buffer. The last instruction address keeps track of the last reference to the stream buffer. The stride field, which is the difference between the last two references to the stream buffer, determines the next address to be prefetched. The state field determines whether to proceed with prefetching, and it is updated based on the success of previous predictions.
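To make the stride-detecting stream buffer concrete, the following C sketch shows one buffer entry and a possible update rule consistent with the description above; the field names, the single steady bit, and the issue_prefetch hook are our own illustrative assumptions rather than details taken from [12].

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t last_pc;    /* instruction address (PC) of the last reference to this buffer */
        uint64_t last_addr;  /* data address of the last reference to this buffer */
        int64_t  stride;     /* difference between the last two references */
        bool     steady;     /* state: prefetch only while predictions keep succeeding */
    } stream_buf_t;

    /* Simulator hook (assumed): request one cache block from the next level. */
    static void issue_prefetch(uint64_t addr) { (void)addr; }

    /* Called when a reference is directed to this stream buffer. */
    static void stream_buf_update(stream_buf_t *sb, uint64_t pc, uint64_t addr)
    {
        int64_t new_stride = (int64_t)(addr - sb->last_addr);

        if (new_stride == sb->stride && sb->stride != 0) {
            sb->steady = true;                   /* stride confirmed: keep prefetching */
            issue_prefetch(addr + sb->stride);   /* next expected block */
        } else {
            sb->steady = false;                  /* misprediction: relearn the stride */
            sb->stride = new_stride;
        }
        sb->last_pc   = pc;
        sb->last_addr = addr;
    }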
2.3 Chen and Baer's Reference Prediction Schemes
Chen and Baer [3] introduced three variations of on-chip hardware-based data prefetching. These variations handle the scalar, zero-stride and constant-stride access patterns. The basic scheme depends on the construction of a reference prediction table (RPT) that has an entry for each instruction in the program that references memory (i.e., loads and stores). An entry in the RPT consists of the instruction address (PC), the previous data address referenced by this instruction, and the stride, which is the difference between the last two referenced data addresses. In addition, an RPT entry contains a state field that records the success of previous prefetches for this entry. Data prefetching is triggered when the program counter reaches an instruction that has a corresponding entry in the RPT. If the state of the corresponding RPT entry indicates that data accesses can be predicted, the data at address (current address + stride) is prefetched to the cache. The state field of an RPT entry is updated on any memory reference by the instruction whose address is stored in that entry, according to the state transition diagram shown in figure 1. All states are predictable except the No-pred state. Whenever data from a certain address is to be prefetched, it is added to an Outstanding Request List (ORL). Data is prefetched from the ORL in order as long as no demand cache misses occur. Demand cache misses have precedence over ORL prefetches, and a reference is not prefetched if it results in an exception (e.g., a page fault).

Figure 1: State Transition Diagram for Basic Reference Prediction Scheme

This scheme decreases the data access penalty on nearly all benchmarks, but it has a drawback due to its strategy of prefetching the predicted reference for the next execution of the current instruction. If the next occurrence of the same instruction is too close (as in small loops), the prefetched data may not arrive in time for the instruction to use it. On the other hand, if the next occurrence of the instruction is too far away (as in large loops), the predicted data may arrive too early and replace a useful block in the data cache, or may even be replaced before being used. Due to both of these drawbacks, Chen and Baer introduced the lookahead reference prediction scheme.

In the lookahead reference prediction scheme, an advanced copy of the program counter, the lookahead program counter (LA_PC), runs ahead of the actual program counter and is used to trigger data prefetches. LA_PC is ahead of the actual PC by slightly more cycles than the memory latency. If LA_PC matches an address in an RPT entry, the next estimated reference is added to the ORL. This increases the probability that the prefetched data will be in the cache just in time, before it is actually needed. The lookahead prediction scheme adds another field to the RPT entries, called times, which indicates the number of iterations LA_PC is ahead of the actual PC. To update LA_PC, branch prediction is required to predict the outcomes of branches. When a branch misprediction is discovered while executing the normal instruction stream, LA_PC is reset back to PC and all the requests on the ORL are removed. When PC stalls because of a cache miss, LA_PC can still advance and issue prefetches to the ORL. Chen and Baer also described another scheme, the correlated reference prediction scheme, which is discussed and modified in the next sections.
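The following C sketch summarizes an RPT entry and its per-reference update for the basic scheme, based on the description above and on our reading of figure 1; the exact transition and stride-update details are assumptions and may differ from the implementation in [3].

    #include <stdint.h>

    typedef enum { RPT_INIT, RPT_TRANSIENT, RPT_STEADY, RPT_NO_PRED } rpt_state_t;

    typedef struct {
        uint64_t pc;          /* tag: address of the load/store instruction */
        uint64_t prev_addr;   /* previous data address referenced by this instruction */
        int64_t  stride;      /* difference between the last two data addresses */
        rpt_state_t state;    /* prediction confidence, as in figure 1 */
    } rpt_entry_t;

    /* Simulator hook (assumed): append an address to the Outstanding Request List. */
    static void orl_enqueue(uint64_t addr) { (void)addr; }

    /* Called each time the instruction at e->pc references data address addr. */
    static void rpt_update(rpt_entry_t *e, uint64_t addr)
    {
        int correct = (e->prev_addr + (uint64_t)e->stride == addr);

        switch (e->state) {
        case RPT_STEADY:                          /* stride is kept on a misprediction */
            e->state = correct ? RPT_STEADY : RPT_INIT;
            break;
        case RPT_INIT:
            if (correct) e->state = RPT_STEADY;
            else { e->state = RPT_TRANSIENT; e->stride = (int64_t)(addr - e->prev_addr); }
            break;
        case RPT_TRANSIENT:
            if (correct) e->state = RPT_STEADY;
            else { e->state = RPT_NO_PRED; e->stride = (int64_t)(addr - e->prev_addr); }
            break;
        case RPT_NO_PRED:
            if (correct) e->state = RPT_TRANSIENT;
            else e->stride = (int64_t)(addr - e->prev_addr);
            break;
        }
        e->prev_addr = addr;

        /* All states except No-pred are predictable: queue the next expected address. */
        if (e->state != RPT_NO_PRED)
            orl_enqueue(addr + (uint64_t)e->stride);
    }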
2.4 Other Schemes
Joseph and Grunwald [4] introduced an off-chip scheme that uses the miss address stream as a prediction source. They assume that the miss reference stream can be approximated by an observed Markov model, which is constructed dynamically while a program is being executed. This means that the Markov model captures the past activity of all programs running on a system and uses this information to predict future references. The Markov model is approximated by limiting the number of states as well as the out-degree of each state so that it can be implemented in hardware. The main disadvantage of this scheme is its high hardware complexity [4].

Roth et al. [13] introduced a dependence-based prefetching scheme, which dynamically identifies the load instructions that access a linked list structure and constructs a description of the steps the program has followed to traverse the structure. Based on the assumption that the program will continue to follow the same steps, a prefetch engine takes this description and speculatively executes the load instructions in parallel with the original program's execution. Details of this scheme can be found in [13].

3. CORRELATED PREFETCHING SCHEME
From the analysis of results in the published papers for the above schemes, on-chip schemes seem to yield much better performance (i.e., miss rates and data access penalties) than off-chip schemes. Chen and Baer [3] introduced another scheme, the correlated reference prediction scheme, which handles two levels of stride changes. They introduced this scheme to handle triangular loops. A typical triangular loop is:

    for (i = 0; i < n; i++)
        for (j = 0; j < i; j++)
            A[i][j] = .....;

References in the inner loop have constant strides, but the frequent stride changes on each transition to the outer loop cause many useless prefetches in the lookahead scheme. In image processing applications, if processing is done on only part of the image, a change of stride occurs whenever a line of the image has been processed and processing moves to the next line (a small example of this pattern appears at the end of this section).

The key idea behind Chen and Baer's correlated reference prediction scheme is to keep track of adjacent accesses in inner loops (as in the basic and lookahead schemes) as well as the accesses correlated by changes in the loop level. They assume that loop branches are backward branches. Thus, a not-taken branch triggers the correlated scheme to use the higher level of strides, since a not-taken branch implies the end of a loop. Their implementation of the correlated scheme is restricted to only two levels of nested loops. The additional hardware in comparison with the lookahead scheme is:
- A 2-bit shift register that stores the branch history of the last two loop branches.
- An additional (prev_addr, stride) pair in each entry of the RPT.
The use of branch history depends on support from the compiler, which is expected to flag loop branches to distinguish them from other branches (such as if-then-else branches inside the loop). Any RPT entry using the correlated scheme contains the following fields:
- tag: load/store instruction address
- p_addr0, stride0: previous address and stride used when the inner loop branch is not taken
- p_addr1, stride1: previous address and stride for the normal instruction references inside the inner loop
- times: the number of iterations (of the inner loop) LA_PC is ahead of the actual PC
- state: same as in figure 1
The results presented by Chen and Baer [3] show a slight improvement for the correlated scheme over the lookahead scheme, and our intuition was that the improvement would be more significant for image processing and multimedia applications.
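As a concrete, hypothetical image-processing instance of this access pattern (sizes and names are ours), consider processing only a sub-window of an image stored in row-major order: the inner loop has a constant stride, and the stride changes at every row boundary, which is exactly the behaviour the correlated scheme targets.

    /* Hypothetical example: brighten a 100x100 window of a 500x500 grayscale
     * image stored in row-major order.  Inside the inner loop the reference
     * img[i * WIDTH + j] advances with a constant stride of 1 byte; at every
     * outer-loop transition the address jumps by WIDTH - 99 bytes instead,
     * the second-level stride the correlated scheme tries to predict. */
    #define WIDTH  500
    #define HEIGHT 500

    void brighten_window(unsigned char img[HEIGHT * WIDTH])
    {
        for (int i = 0; i < 100; i++)        /* outer loop: rows of the window */
            for (int j = 0; j < 100; j++)    /* inner loop: constant stride */
                img[i * WIDTH + j] += 10;
    }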
4. MODIFIED CORRELATED PREFETCHING SCHEME
There are two main disadvantages of the original correlated scheme. First, the compiler support that Chen and Baer assumed in order to obtain their results means that the scheme cannot be implemented solely in hardware. Second, the hardware implementation depends on whether loops are compiled with forward or backward branches. Both of these disadvantages greatly limit the applicability of the scheme as a hardware-based prefetching scheme. In this paper we propose a modified correlated reference prediction scheme that eliminates the need for compiler support.

The basic idea behind the scheme is the same as in the original scheme: whenever a new outer loop iteration starts, we issue a prefetch for the next outer loop iteration. The difference is in how we detect the start of the new outer loop iteration. In the original correlated scheme, LA_PC is used to prefetch the data accessed inside the inner loop. When the inner loop terminates, there is a branch misprediction and LA_PC is reset to the current PC. When the data-referencing instruction is encountered again, the original scheme notes that the branch history (which is maintained for loop branches only) has changed and issues a prefetch for the next iteration of the outer loop.

This is where our scheme differs from the original. When a new iteration of the outer loop starts, we will always observe a sudden change in the previous stride unless the accesses are adjacent (as when traversing a full array). This change in stride is used as the indication of a new outer loop iteration. The original scheme relies on branch history to determine when a new iteration starts; the change in stride makes branch history unnecessary for this purpose. This gives our scheme a big advantage over the original scheme, as it completely eliminates the need for compiler support. We also do not need any extra hardware to implement the new scheme, since the change in stride is already computed for state transitions.

When a change in stride1 is encountered in the steady state, the state is set to initial and stride1 is left unchanged, as in the original scheme. The condition p_addr0 + stride0 == current_addr is then evaluated. If it is true, the last prediction was correct, and a new prefetch for the next outer loop iteration is issued for the address current_addr + stride0. If the condition is false, stride0 is set to current_addr - p_addr0 and no prefetch is issued. In both cases, p_addr0 is set to current_addr (see the sketch at the end of this section).

This scheme will also perform slightly better than the original scheme. If LA_PC reaches the instruction again before PC reaches the branch instruction of the inner loop, the branch history will remain unchanged, so the original scheme will again try to prefetch for the next outer loop iteration, whereas our scheme will issue a prefetch for the next inner loop iteration.
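The sketch below renders this outer-loop update in C. The RPT fields follow the list in section 3; the state enumeration and helper names are illustrative rather than taken from our simulator source.

    #include <stdint.h>

    typedef enum { RPT_INIT, RPT_TRANSIENT, RPT_STEADY, RPT_NO_PRED } rpt_state_t;

    typedef struct {
        uint64_t tag;        /* load/store instruction address */
        uint64_t p_addr0;    /* previous address at an outer-loop transition */
        int64_t  stride0;    /* outer-loop stride */
        uint64_t p_addr1;    /* previous address inside the inner loop */
        int64_t  stride1;    /* inner-loop stride */
        int      times;      /* iterations of the inner loop LA_PC is ahead of PC */
        rpt_state_t state;   /* as in figure 1 */
    } corr_rpt_entry_t;

    /* Simulator hook (assumed): append an address to the Outstanding Request List. */
    static void orl_enqueue(uint64_t addr) { (void)addr; }

    /* Invoked when the reference to current_addr breaks the inner-loop stride
     * (current_addr != p_addr1 + stride1) while the entry is in the steady state. */
    static void on_outer_loop_transition(corr_rpt_entry_t *e, uint64_t current_addr)
    {
        e->state = RPT_INIT;    /* back to initial; stride1 is left unchanged */

        if (e->p_addr0 + (uint64_t)e->stride0 == current_addr) {
            /* Outer-loop prediction was correct: prefetch for the next outer iteration. */
            orl_enqueue(current_addr + (uint64_t)e->stride0);
        } else {
            /* Relearn the outer-loop stride; no prefetch is issued this time. */
            e->stride0 = (int64_t)(current_addr - e->p_addr0);
        }
        e->p_addr0 = current_addr;
    }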
5. IMAGE PROCESSING AND MULTIMEDIA WORKLOAD
In this paper, we are especially concerned with image processing and multimedia applications. We have selected a set of applications that is representative of a typical image processing and multimedia workload on a modern computer system. Since there is currently no well-established benchmark suite for image processing and multimedia, we used the following programs to estimate the effectiveness of prefetching schemes:
1. ijpeg: The SPECint95 JPEG compression program. We used the same input for our runs as specified in the SPECint95 benchmark suite. The program is run on a 32 KB PPM image, which is converted to the JPEG format. This is important since JPEG is the most widely used image format on the world wide web.
2. filter: A synthetic program that applies a basic low-pass (blurring) image filter. We used an input image of size 32 KB, to which a basic averaging low-pass filter was applied: the intensity of each image point is averaged with that of all its neighbors.
3. mpeg2encode: The program encodes an image sequence into the MPEG-2 compressed format. We use this program as an estimator of the workload of movie-recording programs. Though this program does not have a particularly regular access pattern, it has to be included since it is representative of the multimedia workload.
4. mpeg2decode: The program decodes an MPEG-2 movie into separate images. We use this program as an estimator of the workload of media player programs that display a sequence of images from an MPEG file. This is the most frequently used type of multimedia program.
In addition to the image processing and multimedia workload, we also used two programs from the SPEC95 benchmark suite: compress and tomcatv. The data access predictability of these programs is summarized in Table 1, which shows the percentage of RPT accesses for which the accessed entry is in each state of the state transition diagram (figure 1). The table shows that most of these applications are predictable by any reasonable prefetching scheme.

%           ijpeg     filter    mpeg2encode  mpeg2decode  compress  tomcatv
init        31.0567    0.0454     12.9842      18.3751     13.4662   62.6204
transient    3.2055    0.0001      2.2568       2.4559      0.1789    1.3875
steady      55.7563   99.9544     78.6153      78.2543     85.9282   34.6807
no-pred      9.9815    0.0000      6.1437       0.9147      0.4267    1.3114

Table 1: Percentage of RPT accesses in which the accessed entry is in each state of figure 1

6. SIMULATION RUNS
6.1 Simulation Assumptions
To perform simulation runs on the above applications, we modified the cache simulator that is part of version 3.0 of the SimpleScalar tool set [1]. This version of SimpleScalar performs fast, flexible and accurate simulation of Alpha binaries. We added to this tool set an implementation of Chen and Baer's basic and lookahead schemes, Palacharla and Kessler's stream buffers, and our modified correlated reference prediction scheme. In our implementation of the basic, lookahead and correlated prefetching schemes, we used the same assumptions as Chen and Baer [3]:
1. A perfect pipeline, which means that all data dependencies that would result in pipeline stalls are ignored. This assumption is not to the benefit of prefetching schemes, since a non-perfect pipeline would provide additional cycles for the prefetched data to be ready in the data cache when it is needed.
2. A perfect instruction cache. This means that we ignore stalls that result from instruction cache and instruction TLB misses. This assumption is also not to the benefit of prefetching schemes, for the same reason stated above.
3. The non-overlapped memory model. This is one of the memory models implemented by Chen and Baer, in addition to the overlapped and pipelined models. It is the most restrictive model, since it implies that no memory request can be initiated until the previous (and only) outstanding request has completed. This means that if we place a memory request on the ORL while another memory request is still outstanding, we have to wait until the data of the outstanding request arrives from memory before the new request starts to be served. This model is typical of an on-chip cache backed up by a second-level cache [3].
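To make the non-overlapped model concrete, the sketch below shows how a simulator can serialize ORL prefetches behind the single outstanding request, with demand misses taking precedence; this is an illustrative sketch with our own names and data structures, not actual SimpleScalar code.

    #include <stdint.h>
    #include <stdbool.h>

    #define MEM_LATENCY 30                 /* memory access latency in cycles (section 6.2) */
    #define ORL_SIZE    30                 /* ORL size used in our runs (section 6.2) */

    typedef struct { uint64_t addr[ORL_SIZE]; int head, tail; } orl_t;

    static uint64_t mem_busy_until = 0;    /* completion cycle of the single outstanding request */

    static bool orl_empty(const orl_t *o) { return o->head == o->tail; }

    /* Called every simulated cycle: issue at most one ORL prefetch, and only when
     * no demand miss is waiting and the single memory port is idle. */
    static void try_issue_prefetch(orl_t *o, uint64_t now, bool demand_miss_pending)
    {
        if (demand_miss_pending || now < mem_busy_until || orl_empty(o))
            return;                              /* demand misses and the outstanding request come first */

        uint64_t addr = o->addr[o->head];
        o->head = (o->head + 1) % ORL_SIZE;      /* the ORL is drained in FIFO order */
        (void)addr;                              /* the block at addr arrives MEM_LATENCY cycles later */
        mem_busy_until = now + MEM_LATENCY;
    }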
In our implementation of the stream buffer technique, we exploited the maximum gain possible from an off-chip technique by assuming an infinite number of perfect stream buffers. A perfect stream buffer has its prefetched data ready when it is required, ignoring memory latency. In practice we simulated 512 stream buffers, which is large enough for almost all applications to be considered infinite. Stream buffers are allocated in FIFO order.

6.2 Simulation Settings
We performed two sets of experiments. In the first set, we measured the cache performance without prefetching as well as the performance of the different prefetching schemes under the same settings. The objective of this set of experiments is to compare the benefits of prefetching for different cache sizes across the benchmark programs. In this set of experiments we kept the cache line size, the size of the ORL and the number of BTB entries constant. The settings we used are:
- Cache line size = 32 bytes
- Direct-mapped L1 cache
- ORL size = 30
- Number of BTB entries = 512
- L2 cache access latency = 6 cycles
- L2 cache: 256 KB, 4-way set associative, 64-byte lines, LRU replacement policy
- Memory access latency = 30 cycles
We also performed a second set of experiments to determine the impact of various block sizes and numbers of ORL entries on the performance of basic, lookahead and correlated prefetching. The cache size was kept fixed at 16K. For the block size experiment, we used block sizes of 16, 32 and 64 bytes. For the ORL experiment, we used 1, 2, 4, 8, and 16 entries in the ORL. All these experiments were conducted for ijpeg.

6.3 Simulation Metrics
We used three metrics to compare the different prefetching techniques.

6.3.1 L1 Cache Miss Rate
This metric measures the success of a prefetching scheme in reducing cache misses. In the basic, lookahead and correlated prediction methods, an access is considered a hit even if the data request for this access has been issued from the ORL but has not yet completed. However, a reference to an outstanding request in the ORL that has not been issued is considered a miss. For the stream buffers scheme, a hit on any prefetched block in a stream buffer is considered a cache hit.

6.3.2 L1 Cache Replacement Rate
This is the rate at which blocks are replaced in the data cache. It is an indication of the number of cache prefetches and thus of the increase in memory bandwidth required by the prefetching schemes. On-chip prefetching schemes usually have higher cache replacement rates than a cache without prefetching.

6.3.3 Memory Cycles Per Instruction (MCPI)
This metric represents the average number of stall cycles per instruction contributed by memory accesses. Since we assume that the processor can execute one instruction per cycle (perfect pipelining) and that we have an ideal instruction cache, the only extra contribution to CPI comes from the data access penalty. MCPI caused by the data access penalty is calculated as:

    MCPI = (Total Number of Memory Wait Cycles) / (Number of Instructions Executed)

The reason for using MCPI is that it reflects the actual stall time observed by the processor, taking both processor execution and cache behaviour into account. However, MCPI could be greatly reduced for prefetching schemes if we used a more relaxed memory model (like the overlapped or pipelined models [3]).
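For reference, the three metrics reduce to simple ratios over the simulator's counters, as in the sketch below; the counter names are ours rather than SimpleScalar's, and the per-access normalization of the replacement rate is an assumption since the text does not state a denominator.

    /* Illustrative counter names (not SimpleScalar's) for computing the metrics. */
    typedef struct {
        unsigned long long accesses;         /* L1 data cache accesses */
        unsigned long long misses;           /* accesses counted as misses under the rules above */
        unsigned long long replacements;     /* blocks evicted from the L1 data cache */
        unsigned long long mem_wait_cycles;  /* total cycles the processor stalls for data */
        unsigned long long instructions;     /* instructions executed */
    } counters_t;

    static double miss_rate(const counters_t *c) { return (double)c->misses / (double)c->accesses; }
    static double repl_rate(const counters_t *c) { return (double)c->replacements / (double)c->accesses; }
    static double mcpi(const counters_t *c)      { return (double)c->mem_wait_cycles / (double)c->instructions; }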
7. RESULTS AND DISCUSSION
7.1 L1 Cache Miss Rate
Figure 2 shows the cache miss rate of the benchmark programs for the different prefetching schemes. For multimedia and image processing applications, all prefetching schemes improve the cache miss rate over the no-prefetching scheme. The only exception is the filter program for a cache size of 8 KB. In this program, the higher replacement rate due to prefetching, together with the fact that the image size is a multiple of the cache size, causes many of these replacements to evict useful data from the cache. For larger cache sizes, however, prefetching schemes reduce compulsory cache misses, which greatly reduces the miss rate in comparison with the no-prefetch scheme. For a 32 KB cache, for instance, the no-prefetch scheme has a 2.28% miss rate and the stream buffer scheme has a 1.71% miss rate, whereas all on-chip schemes have a miss rate of almost zero.

Note that the performance improvement from the stream buffers scheme is almost constant relative to the no-prefetch scheme in nearly all programs, whereas the performance of the on-chip techniques varies with the program. Since the stream buffer scheme does not replace useful data in the cache very often (it is activated only on cache misses), it performs better for small cache sizes in ijpeg, filter and mpeg2encode. In the two other SPEC95 applications, prefetching schemes greatly reduce the cache miss rate in comparison with the no-prefetch scheme. This reduction is larger for compress than for tomcatv due to the higher predictability of compress accesses, as shown in table 1.

Figure 2: Cache Miss Rate for Various Benchmark Programs

7.2 L1 Cache Replacement Rate
Figure 3 shows the cache replacement rates of the benchmark programs for the different prefetching schemes. For all cache sizes, the prefetching schemes issue some prefetch requests, and so they have a higher replacement rate in comparison with the no-prefetch scheme. The off-chip stream buffer scheme has a lower replacement rate than the on-chip schemes because it generates fewer prefetches.

Figure 3: Cache Replacement Rate for Various Benchmark Programs

7.3 Memory Cycles Per Instruction (MCPI)
Figure 4 shows the data access penalty (in memory stall cycles per instruction) of the benchmark programs for the different prefetching schemes. This figure is the weakest point in our results due to the use of the restrictive non-overlapped memory model.
The non-overlapped memory model can serve only one request at a time, which limits the benefits of prefetching. However, in the two most frequently used image processing and multimedia programs (i.e., ijpeg and mpeg2decode), MCPI is greatly reduced by all prefetching schemes in comparison with the no-prefetch scheme. In these two applications, the correlated scheme performs best, followed by the lookahead, basic and stream buffer schemes. The same improvement is seen in compress. These results are consistent with our expectations for these applications. For filter and mpeg2encode, as well as for tomcatv, the limitations of the non-overlapped memory model, together with the high replacement rate, make the prefetching schemes perform worse than the no-prefetch scheme for small cache sizes. The miss rate results suggest that removing the non-overlapped model limitations would yield an improvement under the overlapped and pipelined memory models, since the reduction in miss rate implies that the prefetching schemes make correct predictions.

Figure 4: MCPI (Memory Cycles Per Instruction) for Various Benchmark Programs

7.4 Effect of Different Cache Block Sizes
Figure 5 shows the effect of increasing the block size for the ijpeg program. The figure shows that the improvement from prefetching schemes is consistent across all data cache block sizes. Note, however, that for a program like ijpeg, which accesses an image larger than the cache, the improvement produced by the prefetching schemes is larger for smaller block sizes. An increase in block size from 16 to 32 bytes produces a larger reduction in MCPI than the introduction of prefetching. For regular block sizes, however, prefetching greatly improves MCPI.

Figure 5: Effect of Different Cache Block Sizes (16, 32 and 64 bytes) on Prefetching Schemes

7.5 Effect of Different ORL Sizes
Figure 6 shows the effect of different ORL sizes on MCPI. Due to the constraints of the non-overlapped memory model, the ORL size has an insignificant effect on the prefetching schemes, because only one prefetch can be issued to memory at a time.

Figure 6: Effect of Different ORL Sizes on MCPI

7.6 Summary of Results
The preceding results illustrate that a small on-chip hardware prefetching unit can improve performance in image processing and multimedia applications. The magnitude of this improvement, however, is limited by the restrictions of the non-overlapped memory model. As shown in table 1, image processing and multimedia applications are highly predictable. The large reduction in cache miss rates for nearly all prefetching schemes means that prefetching can be used effectively for these types of applications.
In mpeg2decode and compress, there is a large reduction in MCPI for all prefetching schemes in comparison with the no-prefetching scheme. This is because these programs have many non-sequential data accesses that, without prefetching, generate many misses. Since the working set of compress is large relative to practical L1 cache sizes, an increase in cache size does not improve MCPI; it can only be improved by reducing compulsory and capacity misses, as the prefetching schemes do. On-chip prefetching schemes perform better in most cases than the off-chip stream buffer scheme, even with an effectively infinite pool of perfect stream buffers. This is because stream buffers base their prediction on the miss address stream, whereas on-chip schemes can use all the addresses issued by the processor. This improvement comes at the expense of increasing the cache replacement rate and hence the memory bandwidth.

8. CONCLUSIONS AND FUTURE WORK
We studied several data prefetching schemes with a focus on their impact on image processing and multimedia applications. We extended the correlated reference prediction scheme proposed by Chen and Baer [3] to a scheme that does not require any compiler support or branch history. The proposed scheme provides the same performance improvement as the original correlated scheme. We assessed the performance of this scheme as well as the performance of a cache system without prefetching, off-chip stream buffer prefetching, and the on-chip basic and lookahead reference prediction schemes. For the multimedia and image processing workload we used, as well as for other SPEC95 benchmark programs, prefetching schemes significantly reduced the cache miss rate, indicating the potential of prefetching for most applications. The reduction in cache miss rate came at the expense of a higher cache replacement rate, which implies higher memory bandwidth. Memory cycles per instruction is also reduced by prefetching schemes for most applications.

In all our experiments we used the non-overlapped memory model, in which only one memory access request can be in progress at any time. A possible extension is to simulate the overlapped model, in which multiple requests can be active at a time, as done by Chen and Baer [3]. This could increase the benefits of prefetching by removing the misses that result from outstanding prefetch requests that have not yet been issued. These prefetching schemes also need to be evaluated on out-of-order processors. What needs to be determined is whether the reduction in MCPI would be as significant for out-of-order processors as it is for in-order processors, to justify adding an on-chip prefetching unit to out-of-order processors. Another interesting direction would be to study prefetching schemes with multiple levels of cache. Multiple cache levels have a higher tolerance for prefetches that are generated well ahead of when they are required, and we expect such prefetches to reduce the latency seen by the L1 cache by prefetching data into the L2 cache from main memory.

ACKNOWLEDGMENTS
We would like to thank Todd Austin for his hint about SimpleScalar, Collin McCurdy for his helpful comments, and Prof. Mark Hill for his helpful pointers to research papers.

REFERENCES
[1] Doug Burger and Todd M. Austin, "The SimpleScalar Tool Set, Version 2.0," Technical Report #1342, University of Wisconsin-Madison Computer Sciences Department, June 1997.
[2] William Y. Chen, Scott A.
Mahlke, Pohua P. Chang and Wen-mei W. Hwu, "Data Access Microarchitectures for Superscalar Processors with Compiler-Assisted Data Prefetching," Proc. 24th Annual International Symp. on Microarchitecture, pp. 69-73, November 1991.
[3] Tien-Fu Chen and Jean-Loup Baer, "Effective Hardware-Based Data Prefetching for High-Performance Processors," IEEE Transactions on Computers, Vol. 44, No. 5, pp. 609-623, May 1995.
[4] Dirk Grunwald and Douglas Joseph, "Prefetching Using Markov Predictors," Proc. 24th Annual International Symposium on Computer Architecture, pp. 252-263, June 1997.
[5] John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers, Inc., 1996.
[6] Mark D. Hill, "A Case for Direct Mapped Caches," IEEE Computer, Vol. 21, No. 12, pp. 25-40, December 1988.
[7] Intel Corporation, "MMX Technology Overview," http://developer.intel.com/technology/itj/q31997/articles/art_2.htm.
[8] Norm P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th International Symposium on Computer Architecture, pp. 364-373, June 1990.
[9] T. Mowry, M. S. Lam and A. Gupta, "Design and Evaluation of a Compiler Algorithm for Prefetching," Proc. Fifth International Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 62-73, 1992.
[10] Mikko H. Lipasti, William J. Schmidt, Steven R. Kunkel, and Robert R. Roediger, "SPAID: Software Prefetching in Pointer- and Call-Intensive Environments," Proc. 28th Annual International Symposium on Microarchitecture, pp. 231-236, November 1995.
[11] Toshihiro Ozawa, Yasunori Kimura, and Shin'ichiro Nishizaki, "Cache Miss Heuristics and Preloading Techniques for General-Purpose Programs," Proc. 28th Annual International Symposium on Microarchitecture, pp. 243-248, November 1995.
[12] Subbarao Palacharla and R. E. Kessler, "Evaluating Stream Buffers as a Secondary Cache Replacement," Proc. 21st Annual International Symposium on Computer Architecture, pp. 24-33, April 1994.
[13] Amir Roth, Andreas Moshovos, and Gurindar S. Sohi, "Dependence Based Prefetching for Linked Data Structures," Proc. 8th International Conf. on Architectural Support for Programming Languages and Operating Systems, October 1998.
[14] Vatsa Santhanam, Edward H. Gornish and Wei-Chung Hsu, "Data Prefetching on the HP PA-8000," Proc. 24th Annual International Symposium on Computer Architecture, pp. 264-273, June 1997.
[15] A. J. Smith, "Cache Memories," ACM Computing Surveys, Vol. 14, No. 3, pp. 473-530, September 1982.
[16] Steven P. Vander Wiel and David J. Lilja, "When Caches Aren't Enough: Data Prefetching Techniques," IEEE Computer, Vol. 30, No. 7, pp. 23-30, July 1997.