Counting Stream Registers: An Efficient and Effective Snoop Filter

Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture
Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL),
Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon
(TU Delft), Paolo Ienne (EPFL)
Multicore Embedded Systems
• Increasing number of multiprocessor-based embedded systems.
• Low energy requirement with little compromise on performance.
• Significant energy consumption in the memory subsystem (caches, shared bus, main memory).
Symmetric Multiprocessor System
[Figure: CPU 1, CPU 2, ..., CPU n, each with its own instruction cache (I$) and data cache (D$), connected to a shared memory.]
Cache Coherency Problem
[Figure: CPU 1, CPU 2, ..., CPU n, each with its own I$ and D$, connected to a shared memory.]
Snoopy Hardware Coherence Protocols
[Figure: CPU 1, CPU 2, ..., CPU n, each with its own I$ and D$, connected to a shared memory.]
Snoop misses consume excessive energy.
Snoop Filters
[Figure: CPU 1, CPU 2, ..., CPU n, each with its own I$, D$, and snoop filter (SF), connected to a shared memory.]
A snoop filter lookup costs less energy than a cache lookup.
Snoop Filters in Prior Art
• Include, Exclude and Hybrid JETTY
– Expensive for an embedded system in terms of area.
– Energy consumed by the JETTYs themselves is significant.
• Stream Registers
– Present in IBM's BlueGene supercomputer.
– Inclusive filter.
– Uses a base and mask register pair to track the cache lines.
Stream Registers
Addresses inserted: 0b1001, then 0b1010

State                    Base   Mask   Valid
Initial                  ---    ---    0
After inserting 0b1001   1001   1111   1
After inserting 0b1010   1001   1100   1

• Addresses with 10XX result in a snoop filter hit.
• No general mechanism to remove an address from an SR without compromising correctness.
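The base/mask update in the example above can be sketched in a few lines of Python. This is a minimal illustration, not the BlueGene implementation: the class name, the 4-bit `ADDR_BITS` width, and the choice to keep the base fixed after the first insertion (as in the slide's example) are assumptions made here.

```python
from dataclasses import dataclass

ADDR_BITS = 4  # assumption: toy 4-bit addresses, as in the slide's example


@dataclass
class StreamRegister:
    base: int = 0
    mask: int = 0
    valid: bool = False

    def insert(self, addr: int) -> None:
        """Record the address of a line that was loaded into the cache."""
        full = (1 << ADDR_BITS) - 1
        if not self.valid:
            # First address: every bit is significant.
            self.base, self.mask, self.valid = addr, full, True
        else:
            # Clear the mask bits in which the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & full

    def may_contain(self, addr: int) -> bool:
        """True is a filter hit: the snoop must be forwarded to the cache."""
        return self.valid and (addr & self.mask) == (self.base & self.mask)


sr = StreamRegister()
sr.insert(0b1001)
sr.insert(0b1010)
print(f"base={sr.base:04b} mask={sr.mask:04b}")
print(sr.may_contain(0b1011), sr.may_contain(0b0101))
```

Because mask bits are only ever cleared, the register can widen but never narrow, which is why removing an address is not possible without extra logic.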
Drawbacks of Stream Register-based Snoop Filters
• No efficient way to update the registers when a line is removed from the cache
– Degraded filtering performance over time
– Additional logic units were introduced, but they are not efficient (e.g., cache wrap detection)
Our Contribution
• Counting Stream Registers
– Eliminates cache wrap detection logic
– Counter to track cache lines
– More robust to workload variability
– Better or similar energy savings compared to SRs
Counting Stream Registers
Addresses inserted: 0b1001, then 0b1010

State                    Base   Mask   Counter
Initial                  ---    ---    0
After inserting 0b1001   1001   1111   0x01
After inserting 0b1010   1001   1100   0x02

• Invalidated cache lines can be tracked by decrementing the counter.
• Removes the need for extra logic such as cache wrap detection, active register history, etc.
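A counting stream register can be sketched by replacing the valid bit with a counter. The following is a hedged illustration only: the class and method names and the 4-bit address width are hypothetical, and the rule that a register whose counter returns to zero is treated as empty and restarted by the next insertion is an assumption, not something the slides spell out.

```python
from dataclasses import dataclass

ADDR_BITS = 4  # assumption: toy 4-bit addresses, as in the slide's example


@dataclass
class CountingStreamRegister:
    base: int = 0
    mask: int = 0
    counter: int = 0  # number of cached lines currently covered

    def insert(self, addr: int) -> None:
        """Record a line that was loaded into the cache."""
        full = (1 << ADDR_BITS) - 1
        if self.counter == 0:
            # Empty register: start over with an exact match on this address.
            self.base, self.mask = addr, full
        else:
            # Clear the mask bits in which the new address differs from the base.
            self.mask &= ~(self.base ^ addr) & full
        self.counter += 1

    def remove(self) -> None:
        """Record that a covered line was invalidated or evicted."""
        if self.counter > 0:
            self.counter -= 1

    def may_contain(self, addr: int) -> bool:
        """True is a filter hit: the snoop must be forwarded to the cache."""
        return self.counter > 0 and (addr & self.mask) == (self.base & self.mask)


csr = CountingStreamRegister()
csr.insert(0b1001)
csr.insert(0b1010)
csr.remove()
csr.remove()
print(csr.may_contain(0b1001))  # counter is back at zero, so the snoop is filtered
```

In this sketch the counter is what lets the register recover its precision: once every covered line has left the cache, the widened mask is discarded instead of persisting.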
Snoop Filter Architecture
[Figure 2. CSR-filter hit detection logic. Figure 2 shows the snoop hit mechanism. The bus address is annotated with three fields: a set of cache lines grouped into a page, an index to the direct-mapped snoop filter table, and the bits used for comparison with the base register.]
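The hit detection path described in Figure 2 can be sketched as follows. The field widths (`PAGE_BITS`, `INDEX_BITS`), the low-to-high ordering of the three fields, and the `(base, mask, counter)` tuple layout of a table entry are all assumptions made for illustration; the slides do not give them.

```python
PAGE_BITS = 9   # hypothetical: low bits that group cache lines into a page
INDEX_BITS = 3  # hypothetical: selects one of 8 entries in the filter table


def split_address(addr: int) -> tuple[int, int]:
    """Split a bus address into (table index, bits compared with the base)."""
    index = (addr >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (PAGE_BITS + INDEX_BITS)
    return index, tag


def snoop_hit(table: list[tuple[int, int, int]], addr: int) -> bool:
    """Return True if the snoop must be forwarded to the cache."""
    index, tag = split_address(addr)
    base, mask, counter = table[index]
    # An entry with a zero counter covers no cached line, so it never hits.
    return counter > 0 and (tag & mask) == (base & mask)


# Direct-mapped table of (base, mask, counter) entries, all initially empty.
table = [(0, 0, 0)] * (1 << INDEX_BITS)
table[5] = (0b1010, 0b1111, 1)

addr = (0b1010 << (PAGE_BITS + INDEX_BITS)) | (5 << PAGE_BITS) | 0x1F
print(snoop_hit(table, addr))
```

Indexing by address bits means each snoop consults exactly one entry, so the lookup cost does not grow with the number of registers in the table.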
Experimental Analysis
• Virtex 2 FPGA running OpenRISC soft cores
– Configurable number of processors, associativity and size of the data and instruction caches, cache type, and coherence protocol
• EEMBC Multibench Benchmarks
• CACTI 5.3 energy model
– Total memory subsystem energy accounts for main memory read/write energy, data and instruction cache read/write energy, leakage, and snoop energy
Cache Design Space Exploration
Results: Filtering Percentage
[Figure 4. The CSR-filter consistently achieves a higher filtering percentage than the SR-filter. Plots of Filtering Percentage against Number of Stream Registers; recoverable panel titles: Image Rotation, RGB-to-CMYK Conversion, H.264 Video Encoding, Huffman Encoding and Decoding.]
• CSR achieves a higher filtering % for a smaller number of registers.
• The filtering percentage remains consistent for filters with 32 or more CSRs.
• These results demonstrate that CSR filters are more robust to workload variability than SR-filters, while achieving a better overall filtering percentage.
• The energy analysis considers the following three system configurations:
– WT: Write-through caches without snoop filters.
– WTSR: Write-through caches with SR-filters.
– WTCSR: Write-through caches with CSR-filters.
Analysis: RGB2CMYK Benchmark
[Figure 5. RGB to CMYK conversion with SR-filters (a) and CSR-filters (b). For SR-filters, the filtering rate is inconsistent as the number of SRs increases for a 4MB image; for CSR-filters, the filtering rate is non-decreasing when the number of CSRs increases. Axes: Filtering Percentage and Normalized Energy against Number of Stream Registers.]
Discussion: Energy Consumption
• For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters
• CSR filters are more effective for certain benchmarks (H.264, image rotation)
– Better filtering performance with a smaller number of stream registers.
• Small reduction in overall energy
– Platform limited to 32 MB of off-chip SDRAM
– No complex data sharing and a limited number of multiple producers of the same data
Summary
• Introduced a snoop filter architecture based on counting stream registers
– Lower hardware complexity and the ability to track cache line invalidations
• Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across different workloads.