Access Map Pattern Matching Prefetch: Optimization Friendly Method

Yasuo Ishii (1), Mary Inaba (2), and Kei Hiraki (2)
(1) NEC Corporation, (2) University of Tokyo

Abstract

Recent advances in microarchitecture and compilers have improved performance significantly. However, aggressive optimizations often degrade prefetch accuracy. In general, a prefetch algorithm uses the following information to generate prefetch requests: (a) data addresses, (b) fine-grained memory access ordering, and (c) instruction addresses. Both (b) and (c) can be scrambled by aggressive optimizations. In this paper, we propose a new memory side prefetch algorithm, the AMPM prefetcher, which replaces (b) the conventional memory access ordering with a coarse-grained memory access ordering that we call zone access ordering. For each Czone, the AMPM prefetcher holds the positions that previous memory requests accessed, and it generates prefetch requests based on this history. We evaluate the AMPM prefetcher in the DPC framework. The simulation results show that our prefetcher improves performance by 53%.

1. INTRODUCTION

Recent advances in microarchitecture and compilers have improved performance significantly. For example, out-of-order execution, speculative execution, relaxed consistency, and compiler optimizations such as loop unrolling have improved performance in recent decades. However, these aggressive optimizations often degrade prefetch performance, since conventional prefetching algorithms implicitly assume that the processor and the compiler work without such optimizations. In general, conventional prefetchers use (a) data addresses, (b) fine-grained memory access ordering, and (c) instruction addresses. The (b) memory access ordering can be scrambled by out-of-order execution with relaxed memory consistency. The (c) instruction addresses are duplicated by loop unrolling. In these cases, a conventional prefetcher such as [2] cannot detect the correct address correlation, since prefetch generation depends on whether the correlation between (a) the data address and (b) the memory access order exactly matches the previous history.

In this paper, we propose a new memory side prefetch method, Access Map Pattern Matching (AMPM) prefetch. This method is tolerant of aggressive optimizations, since it replaces (b) the fine-grained memory access ordering with coarse-grained memory access ordering information, which we call zone access ordering.

The AMPM prefetch method is composed of (1) detecting hot zones and holding information about those zones, (2) listing prefetch candidates by pattern matching on the memory access histories, and (3) selecting prefetch requests from the candidates.

The prefetcher detects hot zones from the recent access history and stores the access histories in memory access maps. A memory access map holds only the positions that previous memory requests accessed, without any access ordering. Memory access maps are replaced under a least recently used (LRU) policy.

The pattern matching detects stride address correlations between the address position of the demand request and the address positions stored in the memory access map. This pattern matching does not suffer from aggressive optimizations, since it uses neither the fine-grained memory access order nor instruction addresses.

Memory access maps also hold profiling information. The number of prefetch requests selected from the candidates (the prefetch degree) is decided from this profiled information.

In section 2, we present the algorithm of the AMPM prefetcher. The hardware implementation and complexity of the AMPM prefetcher are discussed in section 3. Other optimizations for the DPC competition are introduced in section 4.
Detailed budget counts and simulation results are presented in section 5. Concluding remarks are presented in section 6.

2. DATA STRUCTURE AND ALGORITHM

An AMPM prefetcher is composed of memory access maps, which hold the memory access histories, and a prefetch generator, which uses a memory access map to make prefetch requests.

A memory access map holds the positions touched by recent memory accesses in its corresponding Czone. The history in the memory access map carries no information about the order of the accesses. The memory access map table contains a number of memory access maps and replaces them under a least recently used (LRU) policy. The LRU replacement realizes coarse-grained history management, since the memory access history of a map is discarded when the LRU map is replaced. The LRU replacement also concentrates the limited hardware resources on the hot zones of the memory accesses. In this paper, we make the total size of the memory access maps close to the L2 capacity.

[Figure 1: Access Map Pattern Matching Prefetch]

The prefetch generator decides appropriate prefetch requests and issues them to the main memory. It detects prefetch candidates by pattern matching on the access history, and it decides the number of requests to issue from the profiled information.

2.1. Memory Access Map

A memory access map is the memory access history of its corresponding Czone. It holds the access information of all addresses in the Czone as a bitmap data structure. The status of each address is stored as a 2-bit state machine with 4 states (Init, Prefetch, Access, Success). The state diagram is shown in Figure 2.
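The 2-bit state machine of the memory access map can be sketched in software as follows. This is only an illustrative model: the numeric encodings, the class and function names, and the exact transition set (Init to Access on a demand access, Init to Prefetch on an issued prefetch, Prefetch to Success on a later demand access) are assumptions inferred from the state names, not details confirmed by the text.

```python
from enum import IntEnum


class State(IntEnum):
    # The 2-bit encodings are an assumption; the paper names only the states.
    INIT = 0
    PREFETCH = 1
    ACCESS = 2
    SUCCESS = 3


def on_demand_access(state):
    """Next state when a demand access reaches the L2 controller."""
    if state == State.INIT:
        return State.ACCESS
    if state == State.PREFETCH:
        return State.SUCCESS  # assumed: a prefetched line was later used
    return state  # ACCESS and SUCCESS are left unchanged


def on_prefetch_issued(state):
    """Next state when a prefetch request is issued to main memory."""
    return State.PREFETCH if state == State.INIT else state


class AccessMap:
    """Bitmap of 2-bit states, one per cache line position in a zone."""

    def __init__(self, num_lines=256):
        self.num_lines = num_lines
        self.bits = 0  # packed storage: 2 bits per position

    def get(self, pos):
        return State((self.bits >> (2 * pos)) & 0b11)

    def _set(self, pos, state):
        self.bits &= ~(0b11 << (2 * pos))   # clear the old 2-bit field
        self.bits |= int(state) << (2 * pos)

    def demand_access(self, pos):
        self._set(pos, on_demand_access(self.get(pos)))

    def prefetch_issued(self, pos):
        self._set(pos, on_prefetch_issued(self.get(pos)))
```

Under these assumed transitions no state ever returns to Init, so a map only fills up over time until it is replaced.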
The state transitions occur when demand accesses reach the L2 controller or when prefetch requests are issued to the main memory. Since the transitions are monotonic, almost all entries of a memory access map eventually become Access or Success unless the map is replaced. In that case, almost all frequently accessed data is already stored in the L2 cache, since the total size of the memory access maps is almost the same as the L2 cache capacity. Because the hot zones have already been fetched into the cache, no additional prefetch requests are needed.

[Figure 2: State Diagram for Memory Access Map]

2.2. Prefetch Generation

2.2.1 Generating Prefetch Requests

The AMPM prefetcher generates prefetch requests when a demand request reaches the L2 cache. The prefetcher reads 3 consecutive memory access maps from the table and concatenates them. The generator detects prefetch candidates by address correlation pattern matching on the concatenated map. The pattern matching is based on stride detection, and the detector generates many prefetch candidates within the memory access map in parallel. Finally, the prefetch generator selects the candidates nearest to the address of the demand request and issues them to the main memory.

For example, when the addresses 0x01, 0x03, and 0x04 have already been accessed and a demand request for 0x05 reaches the L2 controller, the prefetch generator makes 2 candidates: (1) 0x07 and (2) 0x06. It detects the address correlation among (1) {0x01, 0x03, 0x05} and (2) {0x03, 0x04, 0x05}, that is, a stride of 2 for (1) and a stride of 1 for (2).

2.2.2 Adaptive Prefetch Degree

The prefetch generator tunes the prefetch degree to achieve good performance. It controls the degree based on (1) the frequency of prefetch requests, (2) the frequency of L2 conflict misses, (3) the conflict misses of the memory access map, and (4) the prefetch success ratio.
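Returning to the stride pattern matching of section 2.2.1, the candidate detection can be sketched sequentially as below. The function name, the set-based map representation, and the rules that two earlier accesses at distances k and 2k establish a stride k, that only positions not yet accessed are candidates, and that forward candidates precede backward ones at equal distance are all assumptions made for illustration; the hardware evaluates all strides in parallel rather than in a loop.

```python
def prefetch_candidates(accessed, demand, map_size, degree=2):
    """Return up to `degree` prefetch positions, nearest to `demand` first.

    accessed: set of positions treated as already accessed
              (assumed to be the Access and Success states)
    demand:   position of the current demand request
    map_size: number of positions in the (concatenated) access map
    """
    candidates = []
    # Visiting strides in increasing order yields nearest candidates first.
    for k in range(1, map_size // 2 + 1):
        # Forward: demand-2k, demand-k, demand form a stride-k pattern.
        fwd = demand + k
        if (demand - k) in accessed and (demand - 2 * k) in accessed:
            if 0 <= fwd < map_size and fwd not in accessed:
                candidates.append(fwd)
        # Backward: demand+2k, demand+k, demand form a stride-(-k) pattern.
        bwd = demand - k
        if (demand + k) in accessed and (demand + 2 * k) in accessed:
            if 0 <= bwd < map_size and bwd not in accessed:
                candidates.append(bwd)
    return candidates[:degree]


# The example from section 2.2.1: 0x01, 0x03, 0x04 accessed, demand at 0x05.
example = prefetch_candidates({0x01, 0x03, 0x04}, 0x05, map_size=12)
```

For the example from the text, the stride-1 match {0x03, 0x04, 0x05} proposes 0x06 and the stride-2 match {0x01, 0x03, 0x05} proposes 0x07, so `example` holds the same two candidates the paper lists.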
The best prefetch degree with respect to the access frequency is decided by the following evaluation, so that the prefetched data reaches the L2 cache early enough:

    access frequency * prefetch success ratio

For this purpose, every memory access map has a few access counters.

Prefetch requests can also degrade processor performance, since prefetched data pollutes the L2 cache. In cases (2) and (3), our prefetcher restricts the maximum number of prefetch requests to 2.

When a map contains more Success states than Prefetch states, the prefetcher treats Prefetch states as if they were Access or Success, which enables more aggressive prefetching.

3. HARDWARE DESIGN & COMPLEXITY

An overview of the implementation of the AMPM prefetcher is shown in Figure 3. The AMPM prefetcher is composed of a memory access map table and a prefetch request generator. The table holds the memory access maps in a content addressable memory (CAM). The prefetch generator is composed of access map shifters, candidate detectors, priority encoders, and address offset adders.

Requests are generated in the following steps. First, the memory access map table is read using the address of the demand access. The shifter aligns the memory access map so that the position of the demand request is at the edge of the map. Second, the detector produces candidate addresses for the prefetch requests. Finally, the priority encoder selects the requests nearest to the position of the demand access.

3.1. Memory Access Map Table

The memory access map table is implemented as a multi-ported CAM that holds about 64 maps. The complexity of the CAM is almost the same as that of a fully associative TLB. Such a TLB is feasible even though it is placed in the fast clock domain (the processor core domain). The memory access map table, in contrast, is placed in a slower clock domain (the memory controller domain), which means that it is sufficiently implementable.
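The behaviour of the memory access map table (fully associative lookup by zone, LRU replacement) can be modelled in software as in the sketch below. The 64-byte line size, the helper names, and the use of an ordered dictionary in place of a CAM are assumptions for illustration; the 52-entry capacity and the 256 positions per map are taken from the configuration in section 5.1.

```python
from collections import OrderedDict

LINE_BYTES = 64                        # assumed cache line size
LINES_PER_MAP = 256                    # positions per map (section 5.1)
ZONE_BYTES = LINE_BYTES * LINES_PER_MAP


def zone_tag(addr):
    """Tag identifying the zone that contains a byte address."""
    return addr // ZONE_BYTES


def line_position(addr):
    """Position of the address's cache line inside its zone."""
    return (addr % ZONE_BYTES) // LINE_BYTES


class AccessMapTable:
    """Fully associative table of access maps with LRU replacement."""

    def __init__(self, capacity=52):
        self.capacity = capacity
        # Maps zone tag -> per-line state list; least recently used first.
        self.maps = OrderedDict()

    def lookup(self, addr):
        """Return the map for addr's zone, allocating it on a miss."""
        tag = zone_tag(addr)
        if tag in self.maps:
            self.maps.move_to_end(tag)          # mark as most recently used
        else:
            if len(self.maps) >= self.capacity:
                self.maps.popitem(last=False)   # evict the LRU map
            self.maps[tag] = [0] * LINES_PER_MAP  # all positions start at 0
        return self.maps[tag]
```

Evicting a map discards the whole history of its zone at once, which is the coarse-grained history management described in section 2.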
When the memory access map table needs a larger number of entries, it can be implemented as a set-associative structure like a cache memory.

3.2. Prefetch Generator

The prefetch generator is composed of the following parts: access map shifters, candidate detectors, priority encoders, and address offset adders. Each component processes 256-bit data. The shifters and the priority encoders are not small, but they can be implemented with feasible hardware, since fast shifters and priority encoders of 128 bits or more are already in practical use in commercial processors [5]. The candidate detectors can be implemented with many simple combinational circuits, each of which needs only a few gates. The address adders are simple 32-bit to 64-bit adders, which are easily feasible.

3.3. Pipelining for AMPM Prefetcher

The previous subsection showed that the components of the prefetch generator are feasible. However, the unpipelined design is not feasible at much higher clock frequencies, and the prefetch generator has to be pipelined in such cases. Pipeline registers are inserted between the prefetch candidate detectors and the priority encoders. In the pipelined AMPM prefetcher, the priority encoders are used repeatedly until the next prefetch request reaches the pipeline registers. Furthermore, the generator makes only 2 requests per processor cycle. When the next prefetch request arrives at the pipeline stage, the previous prefetch candidates are discarded.

[Figure 3: Implementation of AMPM Prefetcher. Stage 0: access map table lookup for entries -1, +0, and +1 around the request address, merged into one access map. Stage 1: left and right shifts feeding forward and backward prefetch detection and the FWD/BWD prefetch registers. Stage 2: priority encoders and adders producing the FWD/BWD prefetch requests.]

4. OTHER OPTIMIZATIONS

This section introduces the other optimizations for our prefetcher.

4.1.
Processor Side L1 Prefetching

Our prefetcher employs an adaptive stream prefetcher [4] as the processor side prefetcher, since it offers good cost performance. This prefetch method was proposed as a memory side prefetcher, but we found that it also works well for L1 prefetching.

4.2. Miss Status Handling Register

The DPC framework provides an MSHR with 16 entries, but this is not enough: a larger MSHR is required to support enough in-flight prefetch requests. We employ an additional MSHR with 32 entries for handling prefetch requests and use the default MSHR with 16 entries for handling demand requests.

5. EVALUATION

5.1. Configuration

Table 1 shows the storage counts for our prefetcher. The prefetcher employs a pipelined AMPM prefetcher and an adaptive stream prefetcher. The AMPM prefetcher is composed of several pipeline registers and a fully associative memory access map table with 52 entries. Each memory access map has 256 entries. The adaptive stream prefetcher employs 16 stream filters for stream detection and 16 histograms for prediction of the stream length.

5.2. Result

We evaluate the proposed prefetcher in the DPC framework with SPEC CPU2006 as the benchmark. The compile option is "-O3 -fomit-frame-pointer -funroll-all-loops". The simulation skips the first 4000M instructions and evaluates the following 100M instructions. We use the ref inputs for the evaluation. The evaluation results are shown in Figure 4. The AMPM prefetcher improves processor performance by 53%.

[Figure 4: Evaluation Results for SPEC CPU2006. Speedups (vertical axis, 0.00 to 4.50) for CONFIG1, CONFIG2, and CONFIG3.]

6. CONCLUSIONS

In this paper, we proposed an access map pattern matching prefetcher that realizes optimization friendly prefetching. The prefetching algorithm obtains a coarse-grained history from the LRU replacement policy of the memory access map table. Each access map keeps only the positions that previous memory requests accessed, in a bitmap structure. A memory access map can hold many previous requests, since it does not need to hold the order of the memory accesses. Because the positions that previous memory requests accessed are not affected by the access order, the AMPM prefetcher is tolerant of reordered memory accesses. We evaluated an optimized AMPM prefetcher with a 32K bit budget in the DPC framework. The evaluation result shows a 53% performance improvement on SPEC CPU2006.

REFERENCES

[1] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An Adaptive Data Cache Prefetcher. In Proc. of Int. Conf. on Parallel Architecture and Compilation Techniques, 2004.
[2] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In Proc. of Int. Symp. on High Performance Computer Architecture, 2004.
[3] S. P. Vanderwiel and D. J. Lilja. Data Prefetch Mechanisms. ACM Computing Surveys, 32(2):174-199, June 2000.
[4] I. Hur and C. Lin. Memory Prefetching Using Adaptive Stream Detection. In Proc. of Int. Symp. on Microarchitecture, 2006.
[5] S. D. Trong, M. Schmookler, E. M. Schwarz, and M. Kroener. POWER6 Binary Floating-Point Unit. In Proc. of the 18th IEEE Symposium on Computer Arithmetic (ARITH18), 2007.
[6] S. Palacharla and R. Kessler. Evaluating Stream Buffers as a Secondary Cache Replacement. In Proc. of the 21st Ann. Intl. Symp. on Computer Architecture, pp. 96-105, Feb. 2004.

Table 1 Budget Counts of AMPM prefetcher

Structure | Components | Entries | Budget
MSHR | Valid bit (1 bit), Address bit (26 bit) | 16 entries | 0 bit (Default)
Prefetch MSHR | Valid bit (1 bit), Address bit (26 bit), Issue bit (1 bit) | 32 entries, 5 bit pointer | 901 bit
Memory Access Map Table | Address Tag (18 bit), LRU status (6 bit), Access Counter (4 bit), Interval Timer (18 bit), Access Map (256 x 2 bit) | 52 entries + mode register (3 bit) + performance | 29147 bit
Adaptive Stream Filter | Valid bit (1 bit), Address bit (26 bit), Lifetime (10 bit), Stream Length (4 bit), Direction (1 bit) | 16 entries | 672 bit
Histogram | Stream Length Counter (16 bit) | 16 entries, 2 series, 2 direction | 1024 bit
Pipeline Registers | | | 292 bit
Total | | | 32036 bit
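As a cross-check of the budget counts in Table 1, the per-structure totals can be recomputed from the listed field widths and entry counts. The variable names are hypothetical, and reading the 32K bit budget as 32 x 1024 bits is an assumption. The memory access map table total (29147 bit) and the pipeline register total (292 bit) are taken directly from the table, since the table does not itemize every contribution to them.

```python
# Prefetch MSHR: 32 entries of (valid + address + issue) plus a 5-bit pointer.
prefetch_mshr = 32 * (1 + 26 + 1) + 5

# Adaptive stream filter: 16 entries of
# (valid + address + lifetime + stream length + direction).
stream_filter = 16 * (1 + 26 + 10 + 4 + 1)

# Histogram: 16 entries x 2 series x 2 directions of 16-bit counters.
histogram = 16 * 2 * 2 * 16

# Memory access map table: the 52 entries alone, each holding
# (tag + LRU status + access counter + interval timer + 256 x 2-bit map).
map_entries_only = 52 * (18 + 6 + 4 + 18 + 256 * 2)

# Totals quoted from Table 1 rather than derived here.
map_table = 29147
pipeline_registers = 292

# Bits of the map table total not explained by the per-entry fields; the
# table attributes these to the mode register and performance items.
map_table_remainder = map_table - map_entries_only

total = (prefetch_mshr + map_table + stream_filter
         + histogram + pipeline_registers)

budget = 32 * 1024  # assumed meaning of the "32K bit budget"

print(prefetch_mshr, stream_filter, histogram, map_entries_only)
print(map_table_remainder, total, total <= budget)
```

The recomputed values for the prefetch MSHR, the stream filter, and the histogram agree with the table, and the sum of the five budget rows equals the listed total of 32036 bit.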