Energy-Efficient Hardware Data Prefetching
Yao Guo, Mahmoud Abdullah Bennaser and Csaba Andras Moritz
CONTENTS
Introduction
Hardware prefetching
Hardware data prefetching methods
Performance speedup
Energy-aware prefetching techniques
PARE
Conclusion
Introduction
Data prefetching is the process of fetching data that the program will need in advance, before the instruction that requires it is executed.
It reduces the apparent memory latency.
Two types
Software prefetching
• Uses the compiler
Hardware prefetching
• Uses additional circuitry
Hardware prefetching
Uses additional circuitry
Prefetch tables store recent load instructions and the relations between load instructions.
Better performance
Energy overhead: the additional hardware has an energy cost
Hardware Data Prefetching Methods
Sequential prefetching
Stride prefetching
Pointer prefetching
Combined stride and pointer prefetching
Sequential Prefetching
One block lookahead (OBL) approach
• Initiate a prefetch for block b+1 when block b is accessed
Prefetch-on-miss
• Prefetch block b+1 whenever an access to block b results in a cache miss
Tagged prefetching
• Associates a tag bit with every memory block
• When a block is demand-fetched, or a prefetched block is referenced for the first time, the next block is prefetched
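The two OBL policies can be compared with a small simulation. This is a minimal sketch, not the evaluated hardware: it assumes an unbounded cache and a trace of block numbers, and the names `simulate_obl` and `policy` are made up for illustration.

```python
def simulate_obl(accesses, policy):
    """Simulate one-block-lookahead prefetching over a trace of block numbers.

    policy is "prefetch_on_miss" or "tagged". Returns (misses, prefetches).
    Assumption: the cache never evicts, which keeps the sketch short.
    """
    cache = {}        # block -> tag bit (True once the block has been referenced)
    misses = 0
    prefetches = 0

    def prefetch(block):
        nonlocal prefetches
        if block not in cache:
            cache[block] = False   # brought in by a prefetch, not yet referenced
            prefetches += 1

    for b in accesses:
        if b not in cache:
            misses += 1
            cache[b] = True        # demand fetch
            prefetch(b + 1)        # both policies prefetch b+1 on a miss
        elif policy == "tagged" and not cache[b]:
            cache[b] = True        # first reference to a prefetched block
            prefetch(b + 1)        # tagged prefetching keeps running ahead
        else:
            cache[b] = True
    return misses, prefetches


trace = list(range(8))             # a purely sequential walk over blocks 0..7
print(simulate_obl(trace, "prefetch_on_miss"))
print(simulate_obl(trace, "tagged"))
```

On a purely sequential trace, prefetch-on-miss still misses on every other block, while tagged prefetching misses only on the first block because each first use of a prefetched block triggers the next prefetch.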
OBL Approaches
[Figure: prefetch-on-miss versus tagged prefetch. Blocks are marked demand-fetched or prefetched; in the tagged scheme each block also carries a tag bit (0 or 1).]
Stride Prefetching
Employs special logic to monitor the processor’s address referencing pattern
Detects constant-stride array references originating from looping structures
Compares successive addresses used by load or store instructions
Reference Prediction Table (RPT)
RPT: 64 entries, 64 bits each
Holds the most recently used memory instructions
Each entry contains
• Address of the memory instruction
• Previous address accessed by the instruction
• Stride value
• State field
Organization of RPT
[Figure: RPT organization. Inputs: PC and effective address. Entry fields: instruction tag, previous address, stride, state. An adder (+) produces the prefetch address.]
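The RPT fields listed above suggest how a stride prefetcher can be modeled in software. The following is a minimal sketch under stated assumptions: it uses a simplified two-state machine ("initial" and "steady") rather than any specific published state machine, indexes the table by PC modulo the table size, and uses the hypothetical names `ReferencePredictionTable` and `access`.

```python
class ReferencePredictionTable:
    """Toy stride prefetcher built around an RPT-like table (illustrative only)."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}   # index -> {"tag", "prev_addr", "stride", "state"}

    def access(self, pc, addr):
        """Record a load/store at `pc` touching `addr`.

        Returns the address to prefetch, or None if no stable stride is known.
        """
        index = pc % self.entries              # assumption: direct-mapped by PC
        entry = self.table.get(index)
        if entry is None or entry["tag"] != pc:
            # First time this instruction is seen: allocate an entry.
            self.table[index] = {"tag": pc, "prev_addr": addr,
                                 "stride": 0, "state": "initial"}
            return None

        stride = addr - entry["prev_addr"]     # compare successive addresses
        if stride != 0 and stride == entry["stride"]:
            entry["state"] = "steady"          # same stride seen twice in a row
        else:
            entry["state"] = "initial"
            entry["stride"] = stride
        entry["prev_addr"] = addr

        if entry["state"] == "steady":
            return addr + entry["stride"]      # effective address + stride
        return None


rpt = ReferencePredictionTable()
for a in (100, 108, 116, 124):                 # one load walking an array, stride 8
    print(rpt.access(pc=0x400, addr=a))
```

The first two accesses only train the entry; from the third access onward the sketch predicts the next element of the array.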
Pointer Prefetching
Effective for pointer-intensive programs
• No constant stride
Dependence-based prefetching
• Detects dependence relationships between loads
Uses two hardware tables
• Potential producer window (PPW)
• Correlation table (CT): stores dependence information
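The idea of detecting dependences between loads can be illustrated with a heavily simplified software model. This sketch is an assumption-laden illustration, not the hardware design: it keeps a small window of recently loaded values (playing the role of the PPW) and a correlation table (CT) mapping a producer load to the offsets its consumers use. The class name, method name, and window size are hypothetical.

```python
from collections import OrderedDict


class DependencePrefetcher:
    """Toy dependence-based prefetcher for linked data structures (illustrative)."""

    def __init__(self, ppw_size=4):
        self.ppw = OrderedDict()   # loaded value -> PC of the load that produced it
        self.ppw_size = ppw_size   # assumption: a very small window
        self.ct = {}               # producer PC -> set of consumer offsets

    def on_load(self, pc, base, offset, value):
        """Observe a load `value = mem[base + offset]` issued at `pc`.

        Returns the list of addresses this sketch would prefetch.
        """
        # If an earlier load produced this base address, record the dependence.
        producer = self.ppw.get(base)
        if producer is not None:
            self.ct.setdefault(producer, set()).add(offset)

        # The loaded value may itself be a pointer used by a later load.
        self.ppw[value] = pc
        self.ppw.move_to_end(value)
        if len(self.ppw) > self.ppw_size:
            self.ppw.popitem(last=False)       # drop the oldest value

        # If this load is a known producer, prefetch what its consumers touch.
        return sorted(value + off for off in self.ct.get(pc, ()))


dp = DependencePrefetcher()
# A list traversal `p = p->next`, with `next` at offset 8 in each node.
print(dp.on_load(pc=0x10, base=1000, offset=8, value=2000))
print(dp.on_load(pc=0x10, base=2000, offset=8, value=3000))
print(dp.on_load(pc=0x10, base=3000, offset=8, value=4000))
```

Once the second load shows that the traversal load consumes its own result, each newly loaded pointer immediately yields a prefetch for the next node's `next` field, even though the node addresses have no constant stride.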
Combined Stride and Pointer Prefetching
Objective: to evaluate a technique that would work for all types of memory access patterns
Uses both array and pointer prefetching
Better performance
Uses all three tables (RPT, PPW, CT)
Performance Speedup
The combined (stride+dep) technique has the best speedup for most benchmarks.
[Figure: speedup (y-axis, 0.8 to 2.4) of no-prefetch, sequential, tagged, stride, dependence, and stride+dep for the benchmarks mcf, parser, art, bzip2, galgel, bh, em3d, health, mst, perim, and their average (avg).]
Energy-aware Prefetching Architecture
[Figure: energy-aware prefetching architecture. A load instruction (LDQ RA RB OFFSET) carries compiler hints. The stages shown are Compiler-Based Selective Filtering and Compiler-Assisted Adaptive Prefetching of regular cache accesses, prefetch filtering using the stride counter, the stride prefetcher and pointer prefetcher, and hardware filtering using the Prefetching Filtering Buffer (PFB). The filtered prefetches access the L1 D-cache (tag-array and data-array), which prefetches from the L2 cache.]
Energy-aware Prefetching Techniques
Compiler-Based Selective Filtering (CBSF)
• Searches the prefetch hardware tables only for selected memory instructions
Compiler-Assisted Adaptive Prefetching (CAAP)
• Selects different prefetching schemes
Compiler-driven Filtering using Stride Counter (SC)
• Reduces prefetching energy
Hardware-based Filtering using PFB (PFB)
Compiler-based selective filtering
Searches the prefetch hardware tables only for selected memory instructions identified by the compiler
Energy is reduced by
• Considering only memory accesses in loops or recursive functions
• Considering only array and linked-data-structure memory accesses
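The saving from selective filtering comes from skipping table searches for instructions the compiler did not mark. A minimal sketch of that accounting follows; the function name, the trace of PCs, and the set of hinted PCs are hypothetical, and no real energy figures are modeled.

```python
def count_table_lookups(trace, hinted_pcs=None):
    """Count prefetch-table searches for a trace of memory-instruction PCs.

    With hinted_pcs=None every memory instruction searches the tables
    (no filtering). Otherwise only compiler-hinted instructions do.
    """
    if hinted_pcs is None:
        return len(trace)
    return sum(1 for pc in trace if pc in hinted_pcs)


trace = [0x10, 0x14, 0x18, 0x10, 0x14, 0x18]   # three instructions, two iterations
hinted = {0x10}                                 # assume only 0x10 is an array access in a loop
print(count_table_lookups(trace))               # unfiltered
print(count_table_lookups(trace, hinted))       # with compiler hints
```

Each skipped lookup is a prefetch-table search that consumes no energy, which is the effect this slide describes.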
Compiler-assisted adaptive prefetching
Selects a different prefetching scheme based on the kind of memory access
• Memory accesses to an array that does not belong to any larger structure are fed only into the stride prefetcher.
• Memory accesses to an array that belongs to a larger structure are fed into both the stride and pointer prefetchers.
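The two routing rules above amount to a small lookup from a compiler hint to a set of prefetchers. The sketch below assumes hypothetical hint names (`standalone_array`, `array_in_structure`) and returns None for any access kind that these two rules do not cover.

```python
# Hypothetical hint names; the actual hint encoding is not specified here.
ROUTES = {
    "standalone_array": {"stride"},                # array not inside a larger structure
    "array_in_structure": {"stride", "pointer"},   # array inside a larger structure
}


def route_access(hint):
    """Return the set of prefetchers that should see an access, or None
    if the hint is not one of the two cases covered by this sketch."""
    return ROUTES.get(hint)


print(route_access("standalone_array"))
print(route_access("array_in_structure"))
```

Keeping array-only accesses away from the pointer prefetcher avoids searching tables that cannot help with that access pattern.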
Compiler-hinted Filtering Using a Runtime SC
Reduces the prefetching energy wasted on memory access patterns with very small strides.
Small strides are not used
A stride can be larger than half the cache line size
Each cache line contains
• Program counter (PC)
• Stride counter
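The effect of discarding small strides can be shown with a simple filter. This sketch makes two assumptions that are not stated on the slide: "small" is taken to mean at most half a cache line, and the line size is 32 bytes. It does not model the runtime stride counter hardware itself, only the filtering decision.

```python
def filter_small_strides(requests, line_size=32):
    """Keep only prefetch requests whose stride exceeds half a cache line.

    Each request is a (pc, address, stride) tuple. The threshold and the
    default line size are assumptions made for illustration.
    """
    threshold = line_size // 2
    return [req for req in requests if abs(req[2]) > threshold]


requests = [
    (0x10, 1000, 4),     # small stride: next access likely in the same line
    (0x14, 2000, 64),    # large stride: kept
    (0x18, 3000, -32),   # large negative stride: kept
    (0x1C, 4000, 16),    # exactly half a line: dropped by this sketch
]
print(filter_small_strides(requests))
```

A prefetch with a very small stride usually targets a line that is already cached, so issuing it spends energy without improving performance.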
PARE: A Power-aware Prefetch Engine
Used for reducing power dissipation
Two ways to reduce power
• Reduces the size of each entry, based on the spatial locality of memory accesses
• Partitions the large table into multiple smaller tables
Hardware Prefetch Table
PARE Hardware Prefetch Table
Breaks up the whole prefetch table into 16 smaller tables
Each table contains 4 entries
Each table also has a group number
Uses only the lower 16 bits of the PC instead of all 32 bits
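The partitioned table can be sketched as 16 small lists, each searched on its own. In this illustration the group number is passed in by the caller, because how it is derived is not given here; the class name, the LRU replacement inside a group, and the `searched_entries` counter are assumptions added to make the effect visible.

```python
class PareTable:
    """Toy model of a prefetch table split into 16 groups of 4 entries."""

    GROUPS = 16
    ENTRIES_PER_GROUP = 4

    def __init__(self):
        self.groups = [[] for _ in range(self.GROUPS)]  # each list holds 16-bit tags
        self.searched_entries = 0   # how many entries were compared in total

    def access(self, pc, group):
        """Look up `pc` in the small table selected by `group`.

        Returns True on a hit. On a miss the tag is inserted, evicting the
        least recently used tag of that group if it is full (an assumption).
        """
        tag = pc & 0xFFFF                       # only the lower 16 bits of the PC
        table = self.groups[group % self.GROUPS]
        self.searched_entries += len(table)     # only this small table is searched
        if tag in table:
            table.remove(tag)
            table.append(tag)                   # move to most-recently-used position
            return True
        if len(table) == self.ENTRIES_PER_GROUP:
            table.pop(0)
        table.append(tag)
        return False


t = PareTable()
print(t.access(0x12345678, group=3))   # miss, inserted
print(t.access(0x12345678, group=3))   # hit
print(t.access(0xABCD5678, group=3))   # also hits: same lower 16 bits
print(t.access(0x12345678, group=4))   # miss: a different small table
```

Each lookup compares at most 4 entries instead of 64, and each comparison uses a 16-bit tag. The third call shows the trade-off of the shorter tag: two PCs that differ only in their upper bits are not distinguished.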
PARE Table Design
Advantages of PARE Hardware Table
Power consumption is reduced
• CAM cell power is reduced
• Tables are small
Total power consumption is reduced
Conclusion
Improve performance
Reduce the energy overhead of hardware data prefetching
Reduce total energy consumption
These goals are met using compiler-assisted and hardware-based energy-aware techniques and a new power-aware prefetch engine.
References
Yao Guo, “Energy-Efficient Hardware Data Prefetching,” IEEE, vol. 19, no. 2, Feb. 2011.
A. J. Smith, “Sequential program prefetching in memory hierarchies,” IEEE Computer, vol. 11, no. 12, pp. 7–21, Dec. 1978.
A. Roth, A. Moshovos, and G. S. Sohi, “Dependence based prefetching for linked data structures,” in Proc. ASPLOS-VIII, Oct. 1998, pp. 115–126.