Access Map Pattern Matching Prefetch: Optimization Friendly Method
Yasuo Ishii (NEC Corporation), Mary Inaba and Kei Hiraki (University of Tokyo)
Abstract
Recent advances in the microarchitecture and the compiler improve performance significantly. However, these aggressive optimizations often degrade prefetch accuracy. Generally, a prefetch algorithm uses the following information to generate prefetch requests: (a) data addresses, (b) fine-grained memory access ordering, and (c) instruction addresses; however, (b) and (c) can be scrambled by the aggressive optimizations.
In this paper, we propose a new memory-side prefetch algorithm, the AMPM prefetcher, which uses a coarse-grained memory access ordering that we call zone access ordering in place of (b), the conventional memory access ordering. The AMPM prefetcher holds, for each Czone, the positions that previous memory requests accessed, and generates prefetch requests based on this history.
We evaluate the AMPM prefetcher in the DPC framework. The simulation results show that our prefetcher improves performance by 53%.
1. INTRODUCTION
Recent advances in the microarchitecture and the compiler have improved performance significantly. For example, out-of-order execution, speculative execution, relaxed consistency, and compiler optimizations such as loop unrolling have improved performance in recent decades. However, these aggressive optimizations often degrade prefetch performance, since conventional prefetching algorithms implicitly require that the processor and the compiler work without any such optimizations.
Generally, conventional prefetchers use (a) data addresses, (b) fine-grained memory access ordering, and (c) instruction addresses. The (b) memory access ordering can be scrambled by out-of-order execution under relaxed memory consistency. The (c) instruction addresses are duplicated by loop unrolling. In these cases, a conventional prefetcher such as [2] cannot detect the correct address correlation when (b) is scrambled or (c) is duplicated, since its prefetch generation depends on whether the correlation of (a) data addresses and (b) memory access order exactly matches the previous history.
In this paper, we propose a new memory-side prefetch method, Access Map Pattern Matching (AMPM) prefetch. This prefetch method is tolerant of aggressive optimizations since it uses coarse-grained memory access ordering information, which we call memory zone ordering, in place of (b), the fine-grained memory access ordering. The AMPM prefetch method is composed of (1) detecting hot zones and holding information about those zones, (2) listing prefetch candidates through pattern matching on the memory access histories, and (3) selecting prefetch requests from the candidates.
The prefetcher detects hot zones from the recent access history and stores the access histories in memory access maps. A memory access map holds only the positions that previous memory requests accessed, without any access ordering. The memory access maps are replaced with a least recently used (LRU) policy. The pattern matching detects stride address correlations between the address position of the demand request and the address positions stored in the memory access map. This pattern matching does not suffer from the aggressive optimizations since it uses neither fine-grained memory access order nor instruction addresses. The memory access maps also hold profiling information, and the number of prefetch requests selected from the candidates (the prefetch degree) is decided from this profiled information.
In Section 2, we present the algorithm of the AMPM prefetcher. The hardware implementation and complexity of the AMPM prefetcher are discussed in Section 3. Other optimizations for the DPC competition are introduced in Section 4. Detailed budget counts and simulation results are presented in Section 5. Concluding remarks are presented in Section 6.
2. DATA STRUCTURE AND ALGORITHM
The AMPM prefetcher is composed of memory access maps, which hold the memory access histories, and a prefetch generator, which uses a memory access map to make prefetch requests.
A memory access map holds the accessed positions of the recent memory accesses in its corresponding Czone. The history in the memory access map has no ordering information about the accesses. The memory access map table contains a number of memory access maps and replaces them with a least recently used (LRU) policy.
[Figure 1: Access Map Pattern Matching Prefetch — demand accesses with stride 2 over addresses 0x00-0x0B drive Init→Access and Init→Prefetch transitions and generate prefetch requests]
The LRU replacement realizes coarse-grained history management, since the memory access history of a map is discarded when the LRU map is replaced. The LRU replacement also concentrates the limited hardware resources on the hot zones of the memory accesses. In this paper, we make the total size of the memory access maps close to the L2 capacity.
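To make this structure concrete, the following C++ sketch models the memory access map table as a software LRU-replaced table. The map count (about 64) and the map size (256 entries) follow the paper's configuration; the list-plus-index structure is an illustrative stand-in for the hardware CAM described in Section 3.1.

```cpp
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>

constexpr int kLinesPerZone = 256;    // map entries per Czone (paper: 256)
constexpr std::size_t kNumMaps = 64;  // resident maps (paper: about 64)

// One access map: a 2-bit state per cache line in the Czone.
struct AccessMap {
    std::uint8_t state[kLinesPerZone] = {};
};

class AccessMapTable {
    using MapList = std::list<std::pair<std::uint64_t, AccessMap>>;
    MapList lru_;                                                 // front = most recently used
    std::unordered_map<std::uint64_t, MapList::iterator> index_;  // zone tag -> list node

public:
    // Returns the map for a Czone, allocating one (and evicting the LRU
    // map, which discards that zone's entire history) when not resident.
    AccessMap& lookup(std::uint64_t zoneTag) {
        auto it = index_.find(zoneTag);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second);  // refresh LRU position
            return lru_.front().second;
        }
        if (lru_.size() >= kNumMaps) {
            index_.erase(lru_.back().first);  // evict the coldest zone
            lru_.pop_back();
        }
        lru_.emplace_front(zoneTag, AccessMap{});
        index_[zoneTag] = lru_.begin();
        return lru_.front().second;
    }
};
```

Evicting the LRU entry discards that zone's whole history at once, which is exactly the coarse-grained history management described above.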
The prefetch generator decides appropriate prefetch requests and issues them to the main memory. It detects the prefetch candidates through pattern matching on the access history, and it decides the number of requests to issue from the profiled information.
2.1. Memory Access Map
A memory access map records the memory access history of its corresponding Czone. It holds the access information of all addresses in the Czone as a bit map data structure. The status of each address is stored as a 2-bit state machine with 4 states (Init, Prefetch, Access, Success). The state diagram is shown in Figure 2. State transitions occur when demand accesses reach the L2 controller or when prefetch requests are issued to the main memory.
Since the transitions are monotonic, almost all entries of the memory access maps will become Access or Success unless the memory access maps are replaced. In this case, almost all frequently accessed data is stored in the L2 cache, since the total size of the memory access maps is almost the same as the L2 cache capacity. Since the hot zones are already fetched into the cache, no additional prefetch requests are needed for them.
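A minimal sketch of the per-address state machine follows. The state names are from the paper, while the exact transition set is inferred from the text and Figure 2 (transitions are monotonic toward Access and Success).

```cpp
#include <cstdint>

enum class LineState : std::uint8_t { Init, Prefetch, Access, Success };

// A demand access reaching the L2 controller promotes the line's state.
inline LineState onDemandAccess(LineState s) {
    if (s == LineState::Init)     return LineState::Access;   // first touch
    if (s == LineState::Prefetch) return LineState::Success;  // the prefetch was useful
    return s;                                                 // already Access or Success
}

// Issuing a prefetch to main memory marks an untouched line as Prefetch.
inline LineState onPrefetchIssue(LineState s) {
    return s == LineState::Init ? LineState::Prefetch : s;
}
```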
2.2. Prefetch Generation
2.2.1 Generating Prefetch Requests
The AMPM prefetcher generates prefetch requests when a demand request reaches the L2 cache. The prefetcher reads 3 consecutive memory access maps from the table and concatenates them. The generator then detects the prefetch candidates through address correlation pattern matching on the concatenated map. The basic concept of the pattern matching is stride detection, and the pattern matching detector generates many prefetch candidates within the memory access map in parallel. Finally, the prefetch generator selects the candidates nearest to the address of the demand request, and the selected requests are issued to the main memory.

[Figure 2: State Diagram for Memory Access Map]
For example, when the addresses 0x01, 0x03, and 0x04 have already been accessed and a demand request for 0x05 reaches the L2 controller, the prefetch generator makes 2 candidates, (1) 0x07 and (2) 0x06, by detecting the address correlations (1) {0x01, 0x03, 0x05}, a stride of +2, and (2) {0x03, 0x04, 0x05}, a stride of +1.
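The following sketch reproduces this candidate generation for the forward direction (the hardware searches backward symmetrically), on a simplified map with one accessed flag per line:

```cpp
#include <vector>

// Forward-direction candidate generation on a simplified access map
// (one accessed flag per cache line). For a demand access at `pos`,
// stride k yields a candidate at pos + k when pos - k and pos - 2k
// were both accessed and pos + k has not been accessed yet.
std::vector<int> forwardCandidates(const std::vector<bool>& accessed, int pos) {
    std::vector<int> candidates;
    const int n = static_cast<int>(accessed.size());
    for (int k = 1; pos + k < n && pos - 2 * k >= 0; ++k) {
        if (accessed[pos - k] && accessed[pos - 2 * k] && !accessed[pos + k])
            candidates.push_back(pos + k);  // smaller k first = nearest first
    }
    return candidates;
}
```

Running it on the example above (lines 0x01, 0x03, and 0x04 accessed, demand at 0x05) returns {0x06, 0x07}, nearest candidate first.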
2.2.2 Adaptive Prefetch Degree
The prefetch generator optimizes the prefetch degree to achieve good performance. The generator controls the degree based on (1) the frequency of the prefetch requests, (2) the frequency of L2 conflict misses, (3) conflict misses in the memory access map table, and (4) the prefetch success ratio.
The best prefetch degree based on the access frequency is decided by evaluating
    access frequency * prefetch success ratio,
so that the prefetched data arrives in the L2 cache early enough. For this purpose, each memory access map has a few access counters.
Prefetch requests can also degrade processor performance, since the prefetched data pollutes the L2 cache. In case (2) or (3), our prefetcher restricts the maximum number of prefetch requests to 2. When the number of Success states in the map is larger than the number of Prefetch states, the prefetcher treats Prefetch states as Access or Success states to enable more aggressive prefetching.
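As a rough illustration, this degree control could be sketched as below. The limit of 2 under conflict misses is from the text, while the exact scaling function and the maximum degree of 8 are illustrative assumptions.

```cpp
#include <algorithm>

// Degree control: cases (2) and (3) cap the degree at 2 (from the text);
// otherwise the degree scales with accessFrequency * successRatio. The
// scaling and the ceiling of 8 are illustrative assumptions.
int prefetchDegree(double accessFrequency, double successRatio,
                   bool l2ConflictMisses, bool mapConflictMisses) {
    if (l2ConflictMisses || mapConflictMisses)
        return 2;                     // restrict prefetching to limit pollution
    const int degree = static_cast<int>(accessFrequency * successRatio);
    return std::clamp(degree, 1, 8);  // always issue at least one request
}
```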
3. HARDWARE DESIGN & COMPLEXITY
The overview of the AMPM prefetcher implementation is shown in Figure 3. The AMPM prefetcher is composed of a memory access map table and a prefetch request generator. The table holds the memory access maps in a content addressable memory (CAM). The prefetch generator is composed of access map shifters, candidate detectors, priority encoders, and address offset adders.
The request generation proceeds in the following steps. First, the memory access map table is read with the address of the demand access, and the shifter aligns the memory access map so that the position of the demand request sits at the edge of the access map. Second, the detector produces the candidate addresses of the prefetch requests. Finally, the priority encoder selects the requests nearest to the position of the demand access.
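A toy bit-vector version of the detect-and-select steps is sketched below, using 32 bits in place of the 256-bit hardware datapath. In hardware all strides are tested by parallel combinational logic, which the loop here merely emulates.

```cpp
#include <bitset>

// `history` is the concatenated access map, already shifted so that bit i
// means "the line i positions before the demand access was accessed".
// Stride-k detection sets candidate bit k when bits k and 2k are both set.
std::bitset<32> forwardCandidateVector(const std::bitset<32>& history) {
    std::bitset<32> candidates;
    for (int k = 1; 2 * k < 32; ++k)
        candidates[k] = history[k] && history[2 * k];
    return candidates;
}

// The priority encoder picks the lowest set bit: the candidate nearest
// to the demand address (prefetch demandAddr + k).
int priorityEncode(const std::bitset<32>& candidates) {
    for (int k = 1; k < 32; ++k)
        if (candidates[k]) return k;
    return -1;  // no candidate found
}
```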
3.1. Memory Access Map Table
The memory access map table is implemented as a multi-ported CAM which holds about 64 maps. The complexity of this CAM is almost the same as that of a fully associative TLB. A TLB is feasible despite its placement in the fast clock domain (the processor core domain); the memory access map table, on the other hand, is placed in the slower clock domain (the memory controller domain). This means that the memory access map table is readily implementable.
When the memory access map has a larger entry size, the memory access map table can be implemented as a set-associative structure, like a cache memory.
3.2. Prefetch Generator
The prefetch generator is composed of the following parts: access map shifters, candidate detectors, priority encoders, and address offset adders. Each component processes 256 bits of data.
The shifters for the memory map and the priority encoders are not small, but they can be implemented with feasible hardware, since fast 128+ bit shifters and priority encoders have already been put into practical use in commercial processors [5].
The candidate detectors can be implemented with many simple combinational circuits, each of which needs only a few gates. The address adders are simple 32-bit to 64-bit adders. All of these components are feasible.
3.3. Pipelining for AMPM Prefetcher
The previous subsections showed that the components of the prefetch generator are feasible. However, the prefetcher may not be feasible at a much higher clock frequency, in which case the prefetch generator has to be pipelined. Pipeline registers are inserted between the prefetch candidate detectors and the priority encoders. In the pipelined AMPM prefetcher, the priority encoders are used repeatedly until the next prefetch request reaches the pipeline registers, and the generator makes only 2 requests per processor cycle. When the next prefetch request reaches the pipeline stage, the previous prefetch candidates are discarded.
[Figure 3: Implementation of AMPM Prefetcher — Stage 0: the request address reads access map table entries +1, +0, and -1, which are merged into one access map; Stage 1: the merged map is shifted left/right around the request address into forward/backward prefetch registers; Stage 2: priority encoders and offset adders produce the FWD/BWD prefetch requests]
4. OTHER OPTIMIZATIONS
This section introduces the other optimizations for our
prefetcher.
4.1. Processor Side L1 Prefetching
Our prefetcher employs an adaptive stream prefetcher [4] as a processor-side prefetcher, since it has good cost performance. This prefetch method was proposed as a memory-side prefetcher, but we found that it also works well for L1 prefetching.
4.2. Miss Status Handling Register
The DPC framework provides an MSHR with 16 entries, but this is not enough: a larger MSHR is required to support enough in-flight prefetch requests. We employ an additional MSHR with 32 entries for handling the prefetch requests and use the default MSHR with 16 entries for handling the demand requests.
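A minimal sketch of this split-MSHR arrangement, with illustrative allocate/release bookkeeping:

```cpp
// Demand misses use the framework's default 16-entry MSHR; prefetches are
// tracked in a separate 32-entry MSHR so they never occupy demand entries.
struct Mshr {
    int capacity;
    int inflight = 0;
    bool allocate() {  // reserve an entry for a new outstanding miss
        if (inflight >= capacity) return false;
        ++inflight;
        return true;
    }
    void release() { --inflight; }  // called when the fill returns
};

Mshr demandMshr{16};    // default MSHR (demand requests only)
Mshr prefetchMshr{32};  // additional MSHR (prefetch requests only)

// A prefetch whose MSHR is full is simply dropped or retried later.
bool tryIssue(bool isPrefetch) {
    return (isPrefetch ? prefetchMshr : demandMshr).allocate();
}
```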
5. EVALUATION
5.1. Configuration
Table 1 shows the storage budget counts for our prefetcher. The prefetcher employs a pipelined AMPM prefetcher and an adaptive stream prefetcher. The AMPM prefetcher is composed of several pipeline registers and a fully associative memory access map table with 52 entries; each memory access map has 256 entries. The adaptive stream prefetcher employs 16 stream filters for stream detection and 16 histograms for predicting the stream length.
5.2. Result
We evaluate the proposed prefetcher in the DPC framework. We use SPEC CPU2006 as the benchmark. The compile option is "-O3 -fomit-frame-pointer -funroll-all-loops". The simulation skips the first 4000M instructions and evaluates the following 100M instructions. We use the ref inputs for the evaluation.

[Figure 4: Evaluation Results for SPEC CPU2006 — speedups (0.00 to 4.50) for CONFIG1, CONFIG2, and CONFIG3]
The evaluation results are shown in Figure 4. The AMPM prefetcher improves processor performance by 53%.
6. CONCLUSIONS
In this paper, we proposed an access map pattern matching prefetcher that realizes optimization-friendly prefetching. The prefetching algorithm obtains a coarse-grained history through the LRU replacement policy of the memory access map table. Each access map keeps only the positions where the previous memory requests accessed, in a bit map structure. The memory access map can hold many previous requests since it does not need to hold the order of the memory accesses. Since the positions that previous memory requests accessed are not affected by this order, the AMPM prefetcher is tolerant of reordered memory accesses.
We evaluated an optimized AMPM prefetcher with a 32K bit budget in the DPC framework. The evaluation results show a 53% performance improvement on SPEC CPU2006.
Table 1: Budget Counts of the AMPM Prefetcher

Component               | Contents per entry                                                                                                      | Entries                                                   | Budget
MSHR (default)          | Valid bit (1 bit), Address (26 bit)                                                                                     | 16 entries                                                | 0 bit
Prefetch MSHR           | Valid bit (1 bit), Address (26 bit), Issue bit (1 bit)                                                                  | 32 entries + 5-bit pointer                                | 901 bit
Memory Access Map Table | Address tag (18 bit), LRU status (6 bit), Access counter (4 bit), Interval timer (18 bit), Access map (256 x 2 bit)     | 52 entries + mode register (3 bit) + performance counters | 29147 bit
Adaptive Stream Filter  | Valid bit (1 bit), Address (26 bit), Lifetime (10 bit), Stream length (4 bit), Direction (1 bit)                        | 16 entries                                                | 672 bit
Stream Length Histogram | Counter (16 bit)                                                                                                        | 16 entries x 2 series x 2 directions                      | 1024 bit
Pipeline Registers      |                                                                                                                         |                                                           | 292 bit
Total                   |                                                                                                                         |                                                           | 32036 bit

REFERENCES
[1] K. J. Nesbit, A. S. Dhodapkar, and J. E. Smith. AC/DC: An adaptive data cache prefetcher. In Proc. of Int. Conf. on Parallel Architecture and Compilation Techniques, 2004.
[2] K. J. Nesbit and J. E. Smith. Data Cache Prefetching Using a Global History Buffer. In Proc. of Int. Symp. on High Performance Computer Architecture, 2004.
[3] S. P. Vanderwiel and D. J. Lilja. Data prefetch mechanisms. ACM Computing Surveys, 32(2):174-199, June 2000.
[4] I. Hur and C. Lin. Memory Prefetching Using Adaptive Stream Detection. In Proc. of Int. Symp. on Microarchitecture, 2006.
[5] S. D. Trong, M. Schmookler, E. M. Schwarz, and M. Kroener. POWER6 Binary Floating-Point Unit. In Proc. of the 18th IEEE Symposium on Computer Arithmetic (ARITH18), 2007.
[6] S. Palacharla and R. Kessler. Evaluating stream buffers as a secondary cache replacement. In Proc. of the 21st Annual Int. Symp. on Computer Architecture, pp. 96-105, 1994.