
Fetch Directed Prefetching – A Study

Gokul Nadathur

Nitin Bahadur

Sambavi Muthukrishnan

{gokul, bnitin, sambavi}@cs.wisc.edu

Dec 8, 2000

University of Wisconsin-Madison

Abstract

This report presents our work on the fetch directed prefetching architecture proposed by Reinman et al. [Reinman 99]. The fetch directed architecture decouples the branch predictor from the main pipeline so that the branch predictor can run ahead of the pipeline across pipeline stalls and i-cache misses. Other components of the architecture help keep the next instruction blocks ready for the fetch unit. We also explored prefetching using stream buffers over the base architecture. We implemented both on Simplescalar 2.0. Our simulations on a subset of the Spec95 integer benchmarks show that fetch directed prefetching improves performance, and that it gives a greater improvement in IPC than the base Simplescalar architecture with stream buffers. We observed a peak improvement of 8.8% for fetch directed prefetching and 6.29% for stream buffers among the benchmarks we ran. Our statistics show that these modest gains can be improved further by multi-porting and/or pipelining the cache hierarchy. Thus the fetch directed architecture with intelligent prefetching hides i-cache stall latencies, improving fetch bandwidth and performance.


1. Introduction

As computer architecture advances, most modern processors aim at achieving better performance through improved instruction level parallelism. This places a large demand on the instruction fetch mechanism and the memory/cache hierarchy. Also, with the advent of superscalar processors that can execute multiple instructions per cycle, there is a need for better branch prediction (perhaps multiple branch predictions per cycle). The rate at which the BTB and branch predictor can be cycled thus has an effect on the number of instructions that may actually be executed, since it limits the fetch bandwidth. Unless the instruction fetch and memory hierarchy respond to the needs of the execution engine, the true benefits of the advances in ILP cannot be realized.

This argument motivates our study of instruction prefetching techniques. When instructions are prefetched, there is a possibility of hiding the memory latencies due to cache misses. We study two techniques: stream buffers and fetch directed prefetching [Jouppi 90, Reinman 99].

While stream buffers approach the problem of improving fetch bandwidth simply by prefetching instructions/cache lines sequentially on a cache miss, the fetch directed architecture improves upon this by modifying the baseline architecture to provide a better idea of what should be prefetched. The fetch directed architecture works by letting the branch predictor work independently of the fetch unit. This enables a prediction each cycle (dependent, of course, on the cycle time of the branch predictor and the BTB), so the branch predictor is never stalled by instruction cache misses (when the fetch unit is stalled). Similarly, the fetch unit is not stalled by any latency in branch prediction. The predictions produced each cycle may be used to determine what to prefetch (with some suitable filtering mechanism).

The fetch directed architecture also proposes some modifications to the Branch Target Buffer which enable better space utilization of the BTB by grouping together all addresses which fall through (even branches predicted as fall through) and not wasting BTB entries on such fall-through branches.

The main goals of our project were to study the fetch directed architecture proposed in [Reinman 99], to study prefetching in this context, to modify the SimpleScalar tool set for this architecture, and to revalidate the authors' conclusions from our simulation results. We also compare the performance of the fetch directed prefetching scheme with the stream buffers technique.

This paper is organized as follows. Section 2 describes the fetch directed architecture in detail. Section 3 describes the stream buffers and the fetch directed prefetching which we have implemented. The results of our simulation studies are presented in Section 4 and we finally conclude in Section 5.

2. Fetch Directed Prefetching Architecture

The fetch directed architecture decouples the branch predictor from the instruction cache. The decoupled branch predictor runs ahead of the execution PC, making predictions. The other components of the architecture make use of the predictions to supply the fetch engine with the instruction addresses to be executed next. Any prefetching mechanism can be added to the fetch directed architecture to make use of the instruction addresses, which are available before they are used. Since stream buffers [Jouppi 90] show consistently better performance, we add stream prefetching to the base design to yield an architecture which combines the use of future predictions and prefetching to supply the fetch engine with the correct instructions and hide the L1 i-cache miss latencies. The main components of the Fetch Directed Prefetching Architecture are:

1) The decoupled branch predictor

2) Fetch Target Buffer (FTB)


3) Fetch Target Queue (FTQ)

4) Prefetch Filtration Mechanism

5) Prefetch Instruction Queue (PIQ)

6) Prefetch Buffers and Prefetch Mechanism

Figure 1: Fetch directed prefetching architecture (branch predictor, FTB, FTQ, prefetch filtration mechanism, PIQ, prefetch buffer, L2 cache, instruction fetch).

The fetch directed prefetching architecture is shown in Figure 1. The fetch target buffer (FTB) is similar to the branch target buffer (BTB). The fetch target queue (FTQ) contains a list of instruction address blocks which should be used next by the fetch stage. The prefetch instruction queue (PIQ) contains a list of i-cache blocks which should be prefetched by the prefetch unit. Each component of the architecture is described in detail in the following sections.

2.1 Decoupled branch predictor

Motivation for a decoupled predictor

The decoupled branch predictor is the focal point of the fetch directed architecture. The predictor is decoupled from the main pipeline, which enables it to run ahead of the pipeline. The advantage of this is that the branch predictor is able to keep working even when the fetch unit is stalled. The branch predictor produces a useful prediction every cycle which may be used in later cycles by the fetch unit. This scheme, however, has the problem that the history seen by the branch predictor may not be up to date, since it makes predictions well in advance of the fetch unit using any of those predictions.

Changes to SimpleScalar

The changes made to the Simplescalar tool set to incorporate a decoupled branch predictor were the following. Changes were needed to the main pipeline to extract the branch predictor out of it. Essentially, the original main pipeline called the branch predictor for every fetch address. Instead, the fetch addresses are now specified by the Fetch Target Queue (Section 2.3). An entry is popped off the queue to specify a range of addresses to fetch. When this range has been completely fetched and put into the dispatch queue, the next range is read off the FTQ. Each range in an FTQ entry corresponds to a fetch target block, starting at a taken branch target (or the starting address of the program) and ending at a taken branch or at the size limit of an FTQ entry.
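The sketch below illustrates how such a range-driven fetch stage might look. It is a minimal illustration only; the identifiers (ftq_range, fetch_one, INSN_SIZE, and so on) are our own and not the actual SimpleScalar symbols.

    /* Minimal sketch of a fetch stage driven by FTQ ranges.  All names are
       hypothetical; fetch_one() stands in for the per-instruction fetch. */
    #include <stdbool.h>

    #define FTQ_SIZE  32
    #define INSN_SIZE 8              /* assumed fixed instruction size */

    struct ftq_range {
        unsigned long start_pc;      /* next address to fetch in this block      */
        unsigned long end_pc;        /* taken-branch end or FTQ-entry size limit */
    };

    static struct ftq_range ftq[FTQ_SIZE];
    static int ftq_head, ftq_count;

    /* Returns false on an i-cache miss or a full dispatch queue. */
    extern bool fetch_one(unsigned long pc);

    void fetch_stage(void)
    {
        if (ftq_count == 0)
            return;                              /* predictor has not filled the FTQ yet */

        struct ftq_range *r = &ftq[ftq_head];
        while (r->start_pc <= r->end_pc) {
            if (!fetch_one(r->start_pc))
                return;                          /* stall: resume this range next cycle */
            r->start_pc += INSN_SIZE;
        }
        ftq_head = (ftq_head + 1) % FTQ_SIZE;    /* whole range dispatched: pop it */
        ftq_count--;
    }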

The other changes were to the branch mispredict and branch misfetch handling. In either case, the entire fetch directed architecture state associated with branch misprediction needs to be flushed. The PC used by the branch predictor and the main pipeline also need to be reset to the correct target PC.

There was a need to maintain the branch predictor update information from the branch prediction stage to the branch predictor update stage (when the correct outcome of the branch becomes known). In the fetch directed architecture, the branch predictor runs ahead of the main pipeline, so this information cannot be stored in the fetch data queue as it is in the baseline architecture. We therefore added an extra queue between the branch predictor and the fetch stage. This queue holds the direction update information for the branches until it can be put into the fetch data queue in the fetch stage.

2.2 Fetch Target Buffer (FTB)

The fetch target buffer stores the fall-through and target addresses of the branch ending a basic block. If the branch ending the basic block is predicted taken, the target address is used for the next cycle's FTB lookup; otherwise the fall-through address is used. Each entry in the FTB contains a tag, the taken target address and the fall-through address of the block. It is similar to the basic block target buffer (BBTB) [Yeh-Pat 92]. The FTB differs from the BBTB in three aspects: a) Basic blocks ending in branches that are not taken are not stored in the FTB. b) The full fall-through address of the basic block is not stored; instead, only the size of the basic block is stored. This reduces the storage requirement for the fall-through address (it has been shown [Reinman 99] that the typical distance between the current fetch address and the BBTB's fall-through address is not large). c) The FTB is not coupled with the instruction cache, so it can continue operation even when the i-cache is stalled on a miss.

Each cycle the FTB is accessed by the branch predictor with an address and a prediction [Reinman2 99]. If an entry is found in the FTB, an entry is inserted into the FTQ starting at the current address and ending at the last instruction of the basic block. Based on the prediction, the target or fall-through address is returned to the branch predictor lookup routine, which uses it to look up the FTB in the next cycle. Thus the branch predictor and FTB together predict the next set of blocks that will be used by the fetch engine. By predicting the blocks before they are used by the fetch engine, the prefetch mechanism can prefetch the instructions in time for use, thus avoiding i-cache miss latencies.
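A rough sketch of an FTB entry and lookup consistent with this description is given below. It is illustrative only: the helper ftq_insert(), the fully associative search, and the FTB-miss path (which follows the sequential-block behavior described in the next paragraph) are our assumptions.

    /* Hypothetical FTB entry: the fall-through address is kept as a block
       size rather than a full address, as described above. */
    #include <stdbool.h>

    #define FTB_ENTRIES 256
    #define CACHE_BLOCK 32

    struct ftb_entry {
        bool          valid;
        unsigned long tag;       /* starting address of the fetch block          */
        unsigned long target;    /* taken target of the branch ending the block  */
        unsigned int  size;      /* fall-through address = tag + size            */
    };

    static struct ftb_entry ftb[FTB_ENTRIES];

    /* Assumed FTQ helper (Section 2.3): enqueue the block [start, start+size). */
    extern void ftq_insert(unsigned long start, unsigned int size);

    /* Returns the address the predictor should use for next cycle's lookup. */
    unsigned long ftb_lookup(unsigned long fetch_pc, bool pred_taken)
    {
        for (int i = 0; i < FTB_ENTRIES; i++) {          /* fully associative search */
            struct ftb_entry *e = &ftb[i];
            if (e->valid && e->tag == fetch_pc) {
                ftq_insert(fetch_pc, e->size);                          /* FTB hit  */
                return pred_taken ? e->target : fetch_pc + e->size;
            }
        }
        /* FTB miss: enqueue up to two sequential cache blocks and continue sequentially. */
        unsigned long aligned = fetch_pc & ~(unsigned long)(CACHE_BLOCK - 1);
        ftq_insert(aligned, 2 * CACHE_BLOCK);
        return aligned + 2 * CACHE_BLOCK;
    }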


An advantage of the FTB is that it can contain fetch blocks which span multiple cache blocks. So on an FTB hit, a single- or multi-block entry from the FTB is inserted into the FTQ (in one or more FTQ entries). If there is an FTB miss, a sequential block starting at the next cache-aligned address and spanning up to 2 cache blocks is inserted into the FTQ. Since not-taken branches do not have entries in the FTB, their lookup will fail and sequential blocks will be inserted into the FTQ. This not only gives correct operation but also reduces the space requirement of the FTB by not storing not-taken branches. On repeated FTB lookup failures, sequential blocks are inserted into the FTQ until a misprediction occurs, at which point the FTB is updated with the correct information and the lookup failures decrease. Since the FTB is of fixed size, entries in the FTB are replaced using an LRU policy. A two-level FTB would help reduce the FTB access time while allowing a larger second-level FTB; we have currently implemented only a single-level FTB.

Changes to Simplescalar

The FTB is used by the branch predictor and it fills in entries into the FTQ. So the decoupled branch predictor was made to use the FTB and to update it on all branch lookups and branch updates. The blocks produced by the FTB are then inserted into the FTQ.

2.3 Fetch Target Queue (FTQ)

The FTQ is a queue of instruction address blocks. It is used to bridge the gap between the branch predictor and the instruction cache [Reinman 99]. The fetch engine dequeues the FTQ to get the instruction addresses to execute. The FTQ provides the buffering necessary to allow the branch predictor and i-cache to operate autonomously. It allows the branch predictor to work ahead of the i-cache when the latter is stalled due to an i-cache miss.

Each FTQ entry contains up to 3 fetch block entries and the target address of the branch ending the basic block. Each fetch block entry represents an L1 i-cache line and contains a valid bit, an enqueued-prefetch bit (whether the entry has already been enqueued for prefetch) and the cache block address.
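An illustrative layout of one FTQ entry, following this description (the field names are ours, not the actual implementation's):

    #include <stdbool.h>

    #define FETCH_BLOCKS_PER_FTQ_ENTRY 3

    struct fetch_block {
        bool          valid;             /* does this slot hold a cache block?  */
        bool          enqueued_prefetch; /* already handed to the PIQ?          */
        unsigned long cache_block_addr;  /* L1 i-cache line address             */
    };

    struct ftq_entry {
        struct fetch_block blocks[FETCH_BLOCKS_PER_FTQ_ENTRY];
        unsigned long      branch_target; /* target of the branch ending the block */
    };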

Changes to Simplescalar

The FTQ is filled by the FTB and provides instruction address blocks to the fetch unit. So the fetch unit had to be modified to make use of the instruction blocks provided by the FTQ. The FTQ also provides some interfaces for use by the prefetch filtering mechanism.

2.4 Prefetch Instruction Queue (PIQ) and Prefetch Filter

The Prefetch Instruction Queue is a queue of addresses waiting to be prefetched by the prefetch engine. Each PIQ entry indicates a cache line which is to be prefetched. Entries are inserted into the PIQ every cycle by the Prefetch Filtration Mechanism. The earlier a prefetch can be initiated before the fetch block reaches the i-cache, the greater the potential to hide the miss latency. At the same time, the farther the fetch block is down the FTQ, the more likely it is to lie on a mispredicted path. So the heuristic [Reinman 99] chooses entries starting from the second FTQ entry and going up to the tenth entry in the FTQ. The first entry in the FTQ is skipped because there would be little benefit in starting to prefetch it when it is so close to being fetched by the i-cache. By searching only 9 entries, we reduce the time required to find matching entries to insert into the PIQ. Based on exact timings, multiple entries can even be inserted into the PIQ in one cycle; currently we enqueue up to 3 entries into the PIQ at a time.
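A minimal sketch of this filtration heuristic, assuming the hypothetical ftq_entry layout above and assumed helpers ftq_peek(), piq_full() and piq_push():

    /* Skip the FTQ head, scan the next nine entries, and enqueue at most
       three not-yet-enqueued cache blocks into the PIQ per cycle. */
    #include <stdbool.h>

    extern struct ftq_entry *ftq_peek(int index);   /* NULL past the FTQ tail */
    extern bool piq_full(void);
    extern void piq_push(unsigned long cache_block_addr);

    void prefetch_filter(void)
    {
        int enqueued = 0;

        for (int i = 1; i < 10 && enqueued < 3; i++) {          /* entries 2..10 */
            struct ftq_entry *e = ftq_peek(i);
            if (e == NULL)
                break;
            for (int b = 0; b < FETCH_BLOCKS_PER_FTQ_ENTRY && enqueued < 3; b++) {
                struct fetch_block *fb = &e->blocks[b];
                if (fb->valid && !fb->enqueued_prefetch && !piq_full()) {
                    piq_push(fb->cache_block_addr);
                    fb->enqueued_prefetch = true;
                    enqueued++;
                }
            }
        }
    }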

Associated with each PIQ entry is a prefetch buffer. We implemented the PIQ and prefetch stream buffers with the same number of entries, but it is possible to use fewer prefetch buffers depending on the memory hierarchy and latencies. The prefetch mechanism dequeues entries from the PIQ and schedules them for prefetch. Thus the filtration mechanism fills the PIQ while the prefetch mechanism empties it.

Changes to Simplescalar

A new mechanism was added which checked the FTQ for any entries which could be enqueued for prefetch and were not already enqueued. The filtration scheme looked at FTQ entries, filtered them based on the above heuristic and enqueued them into the PIQ for prefetch. Whenever the L2 bus was free and a stream buffer was available, the prefetch mechanism dequeued an entry from the PIQ and started prefetching it.
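A sketch of that scheduling step might look like the following; l2_bus_free(), prefetch_buffer_available(), piq_pop() and start_l2_prefetch() are assumed helpers, not actual SimpleScalar routines.

    #include <stdbool.h>

    extern bool piq_empty(void);
    extern unsigned long piq_pop(void);
    extern bool l2_bus_free(void);
    extern bool prefetch_buffer_available(void);
    extern void start_l2_prefetch(unsigned long cache_block_addr);

    /* Called once per cycle after the instruction fetch, so demand fetches
       keep priority over prefetches. */
    void prefetch_schedule(void)
    {
        if (piq_empty() || !l2_bus_free() || !prefetch_buffer_available())
            return;
        start_l2_prefetch(piq_pop());   /* fills a prefetch buffer after the L2 latency */
    }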

2.5 Recovery on mispredictions

On a misprediction, the FTQ, PIQ and prefetch buffer entries are invalidated and the FTB is updated with the correct branch information. Based on their utilizations, the sizes of the FTQ and PIQ can be set accordingly.
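A minimal sketch of this recovery step, with all names hypothetical:

    #include <stdbool.h>

    extern void ftq_flush(void);
    extern void piq_flush(void);
    extern void prefetch_buffers_flush(void);
    extern void ftb_update(unsigned long branch_pc, unsigned long target, bool taken);

    static unsigned long predictor_pc;   /* address the decoupled predictor looks up next */

    void recover_from_mispredict(unsigned long branch_pc,
                                 unsigned long correct_pc, bool taken)
    {
        ftq_flush();                       /* drop predicted-but-unfetched blocks  */
        piq_flush();                       /* drop pending prefetch requests       */
        prefetch_buffers_flush();          /* drop already-prefetched lines        */
        ftb_update(branch_pc, correct_pc, taken);   /* record the correct outcome  */
        predictor_pc = correct_pc;         /* restart prediction on the right path */
    }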

3. Stream Buffers

3.1 Introduction

Figure 2: Stream buffers (the stream buffer sits between the L2 i-cache and the L1 i-cache).

Stream buffers are used to reduce the penalty due to compulsory and capacity misses. A stream buffer is a small cache that is 1 cycle away from the L1 cache, so a miss in L1 that is serviced by the stream buffer incurs a penalty of only 1 cycle. The buffers are probed in parallel with an i-cache access; if the i-cache access misses, then in the next cycle the cache block is fetched into the i-cache from the stream buffer. The buffers can be probed in different ways, and we have explored two probing policies. In the first policy the buffer is a FIFO queue where only the head entry has a comparator, so any probe is compared against the head of the queue. If the entry is present it is serviced; otherwise the entire queue is flushed and prefetching restarts from the next sequential block after the miss. This scheme has the advantage of being simple, and a comparator is required only at the head. The other policy we explored is LRU. It uses a fully associative prefetch buffer with an LRU replacement strategy, adding a replaceable bit to each prefetch buffer entry. A prefetch is initiated only if there is an entry marked as replaceable in the prefetch buffer, choosing the least recently used replaceable entry. On a miss, all entries are marked as replaceable. Entries are also marked as replaceable in the prefetch buffer when the matching cache blocks are fetched during the instruction cache fetch.
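The following sketch illustrates the FIFO probing policy described above. It is a simplification under our own naming (sb_entry, stream_buffer_probe, and so on) rather than the code we added to Simplescalar.

    #include <stdbool.h>

    #define SB_ENTRIES  8
    #define CACHE_BLOCK 32

    struct sb_entry {
        bool          valid;
        unsigned long block_addr;
        /* the cache line data itself would live here in hardware */
    };

    static struct sb_entry sb[SB_ENTRIES];
    static int           sb_head;
    static unsigned long sb_next_prefetch;   /* next sequential line to request from L2 */

    /* Probed in parallel with the L1 i-cache; called when L1 misses.
       Returns true if the stream buffer services the miss (1-cycle penalty). */
    bool stream_buffer_probe(unsigned long miss_addr)
    {
        unsigned long block = miss_addr & ~(unsigned long)(CACHE_BLOCK - 1);

        if (sb[sb_head].valid && sb[sb_head].block_addr == block) {
            sb[sb_head].valid = false;                 /* line moves into the i-cache */
            sb_head = (sb_head + 1) % SB_ENTRIES;
            return true;
        }

        /* Head miss: flush the buffer and restart sequential prefetching
           at the block following the missing line (which is serviced from L2). */
        for (int i = 0; i < SB_ENTRIES; i++)
            sb[i].valid = false;
        sb_head = 0;
        sb_next_prefetch = block + CACHE_BLOCK;
        return false;
    }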

Changes to Simplescalar

The stream buffer logic and policies were implemented separately. Changes were made to the fetch unit of Simplescalar, the L1 i-cache miss handler and the main pipeline loop. Additional configuration parameters were added to configure the stream buffer for different runs. The fetch unit of Simplescalar was modified to probe the stream buffer in parallel with the i-cache lookup. The prefetch was initiated after the fetch in the pipeline to give priority to instruction fetches over prefetches. The L1 miss handler was modified so that a miss in L1 with a corresponding hit in the stream buffer is serviced by the stream buffer. If a miss occurs both in L1 and the stream buffer, then an L2 cache access is made; during such an access the stream buffers are flushed appropriately and the next prefetch address is set. Cases where an instruction fetch is blocked by an ongoing prefetch were also accounted for.

3.2 Prefetch Buffers in Fetch directed architecture

Prefetch buffers are similar to the stream buffers. The main difference is that the addresses are computed by the PIQ and a prefetch is initiated using the address at the head of the queue. Priority is given to outstanding L1 cache fetches. Only FIFO policies were examined in this case. The prefetch buffers are flushed on a branch mispredict to keep them in sync with the FTQ.

Implementation

The implementation of stream buffers was extended to accommodate prefetch buffers. The main change involved the way the prefetch addresses are computed: the prefetch buffer queries the PIQ logic to obtain the address for the next prefetch. The flushing mechanism is now initiated on a branch mispredict.

4. Simulation Results

4.1 Simulation Architecture

The base architecture that we explored was a 4-issue out-of-order processor. The L2 cache was single ported with a transfer bandwidth of 8 bytes/cycle. We used a 4 KB, 4-way set associative L2 cache with 64-byte lines and a 256-byte, 2-way set associative L1 cache with 32-byte lines. The L2 access latency was set to 12 cycles and the memory latency to 30 cycles. These results, along with the statistics related to the fetch directed architecture and prefetching, are presented in the following sections.

4.2 Stream Buffer Results

The implementation of stream buffers was tested on various Spec95 integer benchmarks: compress, li, perl and vortex. Statistics such as IPC, the number of useful prefetches, harmful prefetches, cycles lost due to harmful prefetches and cycles gained due to useful prefetches were collected.

IPC of various policies

We see a consistent improvement in IPC for FIFO prefetching over the baseline architecture, and the LRU strategy performs better than FIFO. The percentage speedups vary across benchmarks: for FIFO from less than 1% to just under 6%, and for LRU from less than 1% to just under 7%. We attribute the low percentage increase to two factors. Firstly, the L2 cache is single ported and the prefetching strategy gives priority to i-cache and d-cache misses, so a prefetch is possible only when the L2 is not servicing either L1 cache. Secondly, some prefetches occupy the bus when the fetch unit needs it. Since the fetch unit has initiated a fetch from L2, the ongoing prefetch is useless and has a detrimental effect by tying up the bus. We studied the actual percentage of such occurrences and the number of cycles lost due to such harmful prefetches (Graph 2).

These statistics explain why the IPC speedup is very low for certain benchmarks. For example, in the case of the li benchmark (Graph 2), there was a 5.17% gain in the number of cycles due to useful prefetches, but the harmful prefetches cost a 2% cycle loss, so the actual gain is reduced to around 3.17% of the total number of cycles. Graph 1 shows the possible improvement in IPC if these harmful prefetches were avoided (the Ideal IPC column). For li, we see a 9% improvement in the Ideal IPC over the IPC using the single-ported, non-pipelined L2 cache. This means that adding ports to the L2 cache or pipelining it would definitely improve IPC in such cases.

Graph 1: Stream buffer prefetching performance. IPC for compress, perl, li and vortex with no prefetching, stream buffers with FIFO, stream buffers with LRU, and ideal IPC with FIFO stream buffers. Parameters: processor issue width = 4; L1 cache: 256 bytes, 32-byte lines, 2-way set associative; L2 latency = 12 cycles; prefetch buffer size = 8; single-ported L2 cache at 8 bytes/cycle.

Graph 2: Study of cycles lost due to prefetches blocking i-cache fetches. Cycles lost due to harmful prefetches and cycles gained due to prefetch hits for compress, li, perl and vortex. The prefetch buffer parameters are the same as in Graph 1.

Variation of IPC with size of prefetch buffers

Graph 3: Variation of IPC with the number of entries in the stream buffers for the FIFO policy. IPC speedup for perl, li, vortex and compress with 2, 4, 8 and 16 stream buffers.

Graph 4: Variation of IPC with the number of stream buffers for the LRU replacement policy. IPC speedup for perl, li, vortex and compress with 2, 4, 8 and 16 stream buffers.

We examined the amount of speedup that could be obtained as we vary the size of the prefetch buffers. The main motivation for this study is the LRU policy, since it requires full associativity, which adds to the hardware cost. We found that adding buffers to the stream buffer does increase IPC, but after a certain number of buffers the IPC reaches an asymptote (Graphs 3 and 4). This behavior is consistent across both FIFO and LRU. The optimal number of buffers should be such that the total size of the stream buffer equals the average size of a basic block. The average basic block size also places an upper bound on the number of blocks that can usefully be prefetched, assuming the entire basic block is serviced by the prefetch buffer. So increasing the size of the prefetch buffer beyond this limit makes no difference, since the additional blocks are never used.


4.3 Fetch Directed Architecture Results

This section presents our results from simulations on the fetch directed architecture.

Graph 5 compares the performance of a Branch Target Buffer versus a Fetch Target Buffer in the decoupled architecture. We plugged both into the decoupled architecture to determine the speedup provided by the Fetch Target Buffer. Our simulations used a fully associative BTB and FTB of 256 entries each. We find that the FTB consistently outperforms the BTB. This is because the FTB can keep the execution engine busy: its predictions span a larger number of instructions (the fetch target block), while the BTB can only predict the target address (equivalent to a fetch target block that contains only one address). Also, since the FTB stores entries only for taken branches, it has a larger reach than a same-sized BTB.

Graph 5: Comparison of IPC obtained using the BTB and the FTB for li, perl and vortex.

Graph 6 presents the results of our simulation studies of the IPC improvement when we introduce prefetching into the decoupled architecture. The speedup is calculated over the base fetch directed architecture with no prefetching. We see a varying range of speedups across benchmarks. The reason is that the prefetching is of use only if the decoupled branch predictor is able to make the right predictions; thus, the nature of the benchmark affects the gain derived from prefetching. We find a peak performance improvement of 8.8% (li) due to prefetching among the benchmarks we ran on the simulator.

Graph 6: Improvement in IPC using stream buffer prefetching over the base fetch directed architecture, for compress, li, perl and vortex.

Graph 7: Comparison of IPC improvement when using stream buffers with the base Simplescalar architecture versus fetch directed prefetching over the base fetch directed architecture, for compress, li, perl and vortex.


Graph 7 compares the stream buffer performance with the performance of the fetch directed prefetching architecture. We find that fetch directed prefetching gives substantial performance improvement in the case of li and vortex. In compress, we see a slight performance drop which again is due to the nature of the benchmark. If a lot of the branch predictions are wrong, the prefetching is mostly useless. For example, if a lot of branches we predict taken are actually not taken, stream buffers will perform better than fetch directed prefetching.

5. Conclusions

The fetch directed architecture decouples the branch predictor from the main pipeline so that it can run ahead and make predictions even when the i-cache is stalled. We simulated the fetch directed architecture on Simplescalar and found that the FTB scheme gives better performance than a BTB. We also coupled the architecture with stream buffers and showed that it enables intelligent prefetching, resulting in a greater improvement in IPC than prefetching over the base Simplescalar architecture. For better results, the memory hierarchy model needs to be changed to include multi-ported and/or pipelined caches.

References

[Reinman 99] Glenn Reinman, Brad Calder, Todd Austin, Fetch Directed Instruction Prefetching, Proceedings of the 32nd Annual Symposium on Microarchitecture (MICRO-32), Nov. 1999.

[Reinman2 99] Glenn Reinman, Todd Austin, Brad Calder, A Scalable Front-End Architecture for Fast Instruction Delivery, Proceedings of the International Symposium on Computer Architecture, May 1999.

[Reinman 00] Glenn Reinman, Brad Calder, Todd Austin, Optimizations Enabled by a Decoupled Front-End Architecture, to be published in IEEE Transactions on Computers.


[Jouppi 90] Norman P. Jouppi, Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers, Proceedings of the International Symposium on Computer Architecture, June 1990.

[Simplescalar 97] D. C. Burger, Todd Austin, The Simplescalar Tool Set, Version 2.0, Technical Report CS-TR-97-1342, University of Wisconsin-Madison, July 1997.

[Yeh-Pat 92] T. Yeh and Y. Patt, A Comprehensive Instruction Fetch Mechanism for a Processor Supporting Speculative Execution, Proceedings of the 25th Annual International Symposium on Microarchitecture, Dec. 1992.

