The Migration Prefetcher

HiPEAC 2012, Paris (France) – January 23, 2012
Javier Lira (Intel-UPC, Spain), javierx.lira@intel.com
Timothy M. Jones (U. of Cambridge, UK), timothy.jones@cl.cam.ac.uk
Carlos Molina (URV, Spain), carlos.molina@urv.net
Antonio González (Intel-UPC, Spain), antonio.gonzalez@intel.com
• CMPs have become the dominant paradigm.
• They incorporate large shared last-level caches:
  ◦ Intel® Nehalem: 24 MBytes
  ◦ IBM® POWER7: 32 MBytes
  ◦ Tilera® Tile-GX: 32 MBytes
• Access latency in large caches is dominated by wire delays.
2



• NUCA divides a large cache into smaller and faster banks.
• The cache access latency consists of the routing latency plus the bank access latency.
• Banks close to the cache controller have lower latencies than farther banks.

[1] Kim et al. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Architectures. ASPLOS'02
3


• In S-NUCA, each address maps to a single bank; in D-NUCA, data can be mapped to multiple banks.
• Migration allows data to adapt to the application's behaviour.
• Migration movements are effective, but about 50% of hits still happen in non-optimal banks.
4

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions
5
• Placement: 16 positions per data
• Access: partitioned multicast
• Migration: gradual promotion
• Replacement: LRU + zero-copy

[2] Beckmann and Wood. Managing Wire Delay in Large Chip-Multiprocessor Caches. MICRO'04
6
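The gradual-promotion policy above can be sketched in a few lines: on a hit, the block swaps places with whatever resides one bank closer to the requesting core, so frequently accessed data slowly migrates towards it. The `bankchain` list and `promote` helper are illustrative names, not part of the design in [2].

```python
# Hedged sketch of D-NUCA "gradual promotion": a hit moves the block
# one bank closer to the requesting core via a zero-copy swap with the
# block currently occupying that position.

def promote(bankchain, hit_index):
    """bankchain[0] is the bank closest to the core; a hit at
    position i swaps the block one step closer."""
    if hit_index > 0:
        bankchain[hit_index - 1], bankchain[hit_index] = (
            bankchain[hit_index], bankchain[hit_index - 1])
    return bankchain

# Repeated hits on block 'A' walk it toward the closest bank:
chain = ["B", "C", "A", "D"]
promote(chain, chain.index("A"))  # A moves from position 2 to 1
promote(chain, chain.index("A"))  # A reaches the closest bank
```

Moving one step per hit (rather than jumping straight to the closest bank) keeps hot blocks from evicting each other out of the best positions on a single access.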
Simulation infrastructure: Simics (Solaris 10, 8 × UltraSPARC IIIi) with GEMS (Ruby, Garnet, Orion).
Workloads: SPEC CPU2006 and PARSEC.

• Number of cores: 8 (UltraSPARC IIIi)
• Frequency: 1.5 GHz
• Main memory size: 4 GBytes
• Memory bandwidth: 512 Bytes/cycle
• Private L1 caches: 8 × 32 KBytes, 2-way
• Shared L2 NUCA cache: 8 MBytes, 128 banks
• NUCA bank: 64 KBytes, 8-way
• L1 cache latency: 3 cycles
• NUCA bank latency: 4 cycles
• Router delay: 1 cycle
• On-chip wire delay: 1 cycle
• Main memory latency: 250 cycles (from core)
7
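Using the router, wire, and bank latencies from the table above, the non-uniform access latency can be modeled with simple arithmetic. The round-trip assumption and the hop counts below are illustrative, not measurements from the paper.

```python
# Minimal NUCA latency model from the simulated parameters: each network
# hop costs 1 cycle in the router plus 1 cycle on the wire, and the bank
# access itself takes 4 cycles. We assume the request and the reply each
# traverse the same number of hops.

ROUTER_DELAY = 1   # cycles per hop
WIRE_DELAY = 1     # cycles per hop
BANK_LATENCY = 4   # cycles

def nuca_access_latency(hops):
    # round trip to the bank and back, plus the bank access
    return 2 * hops * (ROUTER_DELAY + WIRE_DELAY) + BANK_LATENCY

print(nuca_access_latency(1))   # nearby bank: 8 cycles
print(nuca_access_latency(8))   # distant bank: 36 cycles
```

The spread between the two results is exactly the wire-delay effect that motivates migrating hot data towards the requesting core.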

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions
8


• Uses prefetching principles on data migration.
• This is not a traditional prefetcher:
  ◦ It does not bring data from main memory.
  ◦ Its potential benefits are therefore more restricted.
• It requires only simple data correlation.
9
[Figure: the Migration Prefetcher in an 8-core NUCA. The Next Address Table (NAT) maps the current address (A) to the next address observed after it (B) and the bank that served it (bank 5); the predicted data block is prefetched towards the requesting core.]
10
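As a rough illustration of the structure in the figure, the NAT can be modeled as a small table indexed by the low bits of the address, with each entry holding the next address, the bank that served it, and a 1-bit confidence counter. The class and the update policy below are assumptions for illustration, not the paper's hardware.

```python
# Illustrative Next Address Table (NAT): indexed by the low bits of the
# current address, each entry stores the next address seen after it, the
# NUCA bank that last served that address, and a 1-bit confidence
# counter. Field widths and the update policy are assumptions, not RTL.

NAT_BITS = 12                 # 12 addressable bits -> 4096 entries

class NAT:
    def __init__(self):
        self.table = {}       # index -> (next_addr, bank, confidence)

    def _index(self, addr):
        return addr & ((1 << NAT_BITS) - 1)

    def update(self, prev_addr, next_addr, bank):
        idx = self._index(prev_addr)
        entry = self.table.get(idx)
        if entry and entry[0] == next_addr:
            self.table[idx] = (next_addr, bank, 1)  # pattern confirmed
        else:
            self.table[idx] = (next_addr, bank, 0)  # new pattern, low conf

    def predict(self, addr):
        """Return (next_addr, bank) only when the 1-bit confidence is
        set, i.e. the same pair was observed at least twice in a row."""
        entry = self.table.get(self._index(addr))
        if entry and entry[2] == 1:
            return entry[0], entry[1]
        return None

nat = NAT()
nat.update(0xA, 0xB, bank=5)      # first observation: no prefetch yet
assert nat.predict(0xA) is None
nat.update(0xA, 0xB, bank=5)      # pattern confirmed: confidence set
assert nat.predict(0xA) == (0xB, 5)
```

Gating the prediction on the confidence bit matches the slide 11 finding that a single bit is enough to filter one-off patterns.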
• A single confidence bit is effective.
• More than one confidence bit is not worthwhile.

Figure: fraction of prefetching requests that ended up being useful.
11
• With 12-14 addressable bits, about 25% of prefetches are issued with another (aliased) address's information.
• A NAT with 12 addressable bits takes 232 KBytes in total.

Figure: percentage of prefetching requests submitted with another address's information.
12
• Predicting the data location based on its last appearance provides 50% accuracy.
• Accuracy increases when the local bank is also checked.

Figure: percentage of prefetching requests that are found in the NUCA cache.
13

The realistic Migration Prefetcher uses:
◦ 1-bit confidence for data patterns.
◦ A NAT with 12 addressable bits (29 KBytes/table).
◦ Last responder + local bank as the search scheme.

• Total hardware overhead: 264 KBytes.
• Latency: 2 cycles.
14
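The sizes on this slide follow from simple arithmetic: a 12-bit-indexed NAT has 2^12 = 4096 entries, so 29 KBytes per table works out to 58 bits per entry, and one table per core gives 8 × 29 = 232 KBytes (the stated 264 KBytes total presumably includes the prefetcher's other structures; the per-entry field breakdown is not given in the slides).

```python
# Back-of-the-envelope check of the stated NAT sizes. Only the totals
# are verified here; the per-entry field layout is not in the slides.

entries = 2 ** 12                       # 12 addressable bits -> 4096
per_table_bytes = 29 * 1024             # 29 KBytes per table
bits_per_entry = per_table_bytes * 8 / entries
print(bits_per_entry)                   # 58.0 bits per NAT entry

cores = 8
total_kb = cores * 29                   # one NAT per core
print(total_kb)                         # 232 KBytes, as on slide 12
```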

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions
15
16



• NUCA is up to 25% faster with the Migration Prefetcher.
• It reduces NUCA cache latency by 15%, on average.
• It achieves overall performance improvements of 4% on average, and up to 17%.
17



• The prefetcher introduces extra traffic into the network.
• On a prefetch hit, however, it significantly reduces the number of messages.
• Overall, the technique does not increase energy consumption.
18

• Introduction
• Methodology
• The Migration Prefetcher
• Analysis of results
• Conclusions
19

• Existing migration techniques effectively concentrate the most accessed data in banks that are close to the cores.
• Still, about 50% of hits in the NUCA cache occur in non-optimal banks.
• The Migration Prefetcher anticipates migrations based on past access patterns.
• It reduces the average NUCA latency by 15%.
• It outperforms the baseline configuration by 4% on average, and does not increase energy consumption.
20
Questions?