Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture
June 7th, 2005
Overview of Idea
Coarse-Grain Coherence Tracking:
1. Monitors coherence status of memory at a
multi-line granularity
2. Uses the coarse-grain information to identify
requests that don’t need a coherence
broadcast
3. Sends these requests directly to memory
June 7, 2005
ISCA 2005
2
Problem
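[Figure: system diagram of four processor nodes (P) with DRAM, connected by a Broadcast Network and a Data Network; one node is labeled with its cache ($), NC, and MC blocks]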
Snoop-based systems support a limited
number of processors
– Limited broadcast bandwidth
– Increasing memory latency
Opportunity
• Some data requests don’t need a broadcast
– Requests for non-shared data
– Fetches of unmodified instructions
– Write-backs
• Some non-data requests don’t need to leave the
processor
– Requests to upgrade a copy that is not shared elsewhere
– Requests to flush copies of lines that are not cached elsewhere
Unnecessary Broadcasts
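[Figure: stacked bar chart of the percentage of requests that are unnecessary broadcasts, broken down by request type (Read, I-Fetch, Write, DCB, Write-back), for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean. Values labeled on the chart: 93%, 65%, 62%, 67%.]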
Our Approach
• Identify requests that don’t need a broadcast
• Send data requests directly to memory
– Reduce broadcast traffic
– Reduce latency in some systems
• Avoid sending non-data requests externally
– Further reduce broadcast traffic
– Reduce latency
Coarse-Grain Coherence Tracking
• Memory is divided into coarse-grain regions
– Aligned, power-of-two multiple of cache line size
– Can range from two lines to a physical page
• A cache-like structure is added to each
processor for monitoring coherence at the
granularity of regions
– Region Coherence Array (RCA)
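The region mapping described above comes down to simple bit arithmetic. The sketch below is illustrative only: the 64-byte line size and the function names are assumptions, and 512 bytes is just one of the region sizes evaluated later in the talk.

```python
LINE_BYTES = 64      # assumed cache line size (hypothetical)
REGION_BYTES = 512   # one of the region sizes evaluated later in the talk


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


# Regions are aligned, power-of-two multiples of the cache line size.
assert REGION_BYTES % LINE_BYTES == 0
assert is_power_of_two(REGION_BYTES // LINE_BYTES)


def region_base(addr: int) -> int:
    """Address of the first byte of the region that contains addr."""
    return addr & ~(REGION_BYTES - 1)


def line_in_region(addr: int) -> int:
    """Index of addr's cache line within its region."""
    return (addr % REGION_BYTES) // LINE_BYTES


def same_region(a: int, b: int) -> bool:
    """True if both addresses fall in the same coarse-grain region."""
    return region_base(a) == region_base(b)
```

Because regions are aligned powers of two, finding the region of an address needs only a mask, which is what makes a cache-like lookup structure practical.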
Coarse-Grain Coherence Tracking
• Each entry has an address tag, state, and
count of lines cached by the processor
• The state indicates if the processor and / or
other processors are sharing / modifying lines
in the region
• On cache misses, the region state is read to
determine if a broadcast is necessary
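As a rough illustration of the lookup described above, the following sketch models an RCA entry and the decision made on a cache miss. The three-state `RegionState` enum, the class and method names, and the returned strings are simplifying assumptions made for this example; the protocol in the talk uses more states and can also skip broadcasts in cases this model does not capture.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class RegionState(Enum):
    """Simplified, hypothetical region states (not the talk's actual protocol)."""
    INVALID = "invalid"      # nothing is known about the region
    EXCLUSIVE = "exclusive"  # no other processor caches lines in the region
    SHARED = "shared"        # other processors may cache lines in the region


@dataclass
class RegionEntry:
    tag: int                 # region address tag
    state: RegionState       # coarse-grain coherence state
    line_count: int = 0      # lines of this region cached by this processor


class RegionCoherenceArray:
    """Toy RCA keyed by region tag; capacity and associativity are ignored."""

    def __init__(self) -> None:
        self.entries: Dict[int, RegionEntry] = {}

    def lookup(self, tag: int) -> Optional[RegionEntry]:
        return self.entries.get(tag)

    def route_miss(self, tag: int, is_data_request: bool) -> str:
        """Decide where a request for a line in this region must be sent."""
        entry = self.lookup(tag)
        if entry is None or entry.state is not RegionState.EXCLUSIVE:
            return "broadcast"         # region unknown or possibly shared
        if is_data_request:
            return "direct-to-memory"  # no snoop needed; fetch from memory
        return "local"                 # e.g. an upgrade; nothing leaves the processor
```

In this simplified model only an exclusive region avoids the broadcast, which is the conservative choice: an unknown or possibly shared region always falls back to conventional snooping.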
Coarse-Grain Coherence Tracking
• On snoops, the region state provides a
response for the region
– Piggy-backed onto the conventional response
– Used to update other processors’ region state
• RCA maintains inclusion over caches
– When regions are evicted, their lines are evicted
– RCA must respond correctly if region’s lines cached
– Replacement algorithm uses line count
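The inclusion and replacement points above can be sketched as follows. The slide says only that the replacement algorithm uses the line count; choosing the region with the fewest cached lines, the tie-break by tag, and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Set


def choose_victim(line_counts: Dict[int, int]) -> int:
    """Pick a region to replace from the candidates in a full RCA set.

    Assumed policy: prefer the region with the fewest cached lines, so that
    regions with no cached lines go first. Ties are broken by the smallest
    tag, which is an arbitrary choice for this sketch.
    """
    return min(line_counts, key=lambda tag: (line_counts[tag], tag))


def evict_region(tag: int,
                 line_counts: Dict[int, int],
                 cached_lines: Dict[int, Set[int]]) -> List[int]:
    """Remove a region from the RCA and return the cache lines to evict.

    To maintain inclusion, every line of the evicted region that is still
    in the cache must be evicted as well.
    """
    victims = sorted(cached_lines.pop(tag, set()))
    del line_counts[tag]
    return victims
```

Preferring regions with a low line count keeps the number of cache lines lost to inclusion small, since evicting an empty region costs nothing in the cache.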
Example: Conventional Snooping
• P0 loads 10000₂
  → MISS
• Snoop performed
• Response sent
• Data transfer
[Figure: animation of processors P0 and P1 and memories M0 and M1 on a network. P0 broadcasts "Read: P0, 10000₂", the snoop response is Invalid, and P0's cache entry with tag 0010 goes from Invalid through Pending to Exclusive once the data arrives.]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 10000₂
  → MISS
• Snoop performed
• Response sent
• Data transfer
[Figure: the same system with an RCA next to each cache. P0 broadcasts "Read: P0, 10000₂"; the response is "Invalid, Region Not Shared", so P0 has exclusive access to the region. P0's cache entry (tag 0010) becomes Exclusive and its RCA entry (tag 001) goes from Pending to state DI.]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 11000₂
  → MISS, Region Hit
• Direct request sent
• Data transfer
Exclusive region state, broadcast unnecessary
[Figure: P0 sends "Read: P0, 11000₂" directly to memory without a broadcast; P0's cache entry with tag 0011 becomes Exclusive.]
Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P1 stores 10000₂
  → MISS
• Snoop performed
  → Hits in P0 cache
• Response sent
• Data transfer
Region not exclusive anymore
[Figure: P1 broadcasts "RFO: P1, 10000₂"; the response is "Owned, Region Owned". P1's cache entry with tag 0010 becomes Modified, and the RCA entries for region 001 are shown in state DD.]
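The three-step walkthrough above (broadcast read, direct read, broadcast RFO) can be replayed with a small model. This is a sketch under stated assumptions: the 8-byte line size is inferred from the tags drawn on the slides, region state is reduced to a single "exclusive" flag per region, line-level states and evictions are not modeled, and all class and function names are hypothetical.

```python
from typing import List, Set

LINE_BYTES = 8         # assumed; inferred from the tags drawn on the slides
LINES_PER_REGION = 2   # "two lines per region" in the example


class Processor:
    """Toy processor with a cache and a one-flag-per-region RCA."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.peers: List["Processor"] = []
        self.lines: Set[int] = set()              # line numbers held in the cache
        self.exclusive_regions: Set[int] = set()  # regions known to be unshared

    def snoop(self, line: int, region: int, invalidate: bool) -> bool:
        """Handle a peer's broadcast and return the region response.

        The response is True if this processor still caches lines of the region.
        """
        self.exclusive_regions.discard(region)    # the requester now uses the region
        if invalidate:
            self.lines.discard(line)              # an RFO invalidates our copy
        return any(l // LINES_PER_REGION == region for l in self.lines)

    def access(self, addr: int, write: bool = False) -> str:
        line = addr // LINE_BYTES
        region = line // LINES_PER_REGION
        exclusive = region in self.exclusive_regions
        if line in self.lines and (exclusive or not write):
            return "hit"
        if exclusive:
            self.lines.add(line)
            return "direct"                       # request goes straight to memory
        # Broadcast: every peer snoops and piggy-backs a region response.
        responses = [p.snoop(line, region, invalidate=write) for p in self.peers]
        self.lines.add(line)
        if not any(responses):
            self.exclusive_regions.add(region)    # nobody else caches the region
        return "broadcast"


def run_example() -> List[str]:
    """Replay the slides' sequence and report how each request was handled."""
    p0, p1 = Processor("P0"), Processor("P1")
    p0.peers, p1.peers = [p1], [p0]
    return [
        p0.access(0b10000),              # P0 loads 10000 (binary)
        p0.access(0b11000),              # P0 loads 11000 (binary), same region
        p1.access(0b10000, write=True),  # P1 stores 10000 (binary)
    ]
```

Calling `run_example()` returns one label per request, in order: the first load is broadcast, the second is sent directly to memory, and P1's store is broadcast and ends P0's exclusive access to the region.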
Overhead
• Storage space needed for RCA
– 3-6% storage overhead for cache
• Two bits needed in snoop response for region
response
• Path to memory needed to avoid broadcasts
– Simple with on-chip memory controllers
– May leverage data network
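To see where a storage overhead of this order can come from, the following back-of-the-envelope sketch compares RCA bits with cache bits. Only two inputs come from the talk: the RCA has the same number of sets and ways as the L2, and the region protocol has 7 states (3 bits). The address width, cache size, line size, associativity, and line-state bits are hypothetical values chosen for illustration, so the result should not be read as the authors' calculation.

```python
def ilog2(n: int) -> int:
    """Exact log2 for a power of two."""
    assert n > 0 and (n & (n - 1)) == 0
    return n.bit_length() - 1


def rca_overhead(phys_addr_bits: int = 40,      # assumed
                 cache_bytes: int = 1 << 20,    # assumed 1 MiB L2
                 line_bytes: int = 64,          # assumed
                 ways: int = 2,                 # assumed
                 region_bytes: int = 512,
                 line_state_bits: int = 3,      # assumed
                 region_state_bits: int = 3) -> float:
    """Ratio of RCA storage to cache storage (data + tags + state)."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    index_bits = ilog2(sets)

    # Cache: data bits plus tag and state per line.
    line_tag_bits = phys_addr_bits - ilog2(line_bytes) - index_bits
    cache_bits = lines * (line_bytes * 8 + line_tag_bits + line_state_bits)

    # RCA: same sets and ways as the cache; tag, state, and line count per entry.
    region_tag_bits = phys_addr_bits - ilog2(region_bytes) - index_bits
    lines_per_region = region_bytes // line_bytes
    count_bits = lines_per_region.bit_length()  # counts 0..lines_per_region
    rca_bits = sets * ways * (region_tag_bits + region_state_bits + count_bits)

    return rca_bits / cache_bits
```

The point of the sketch is that an RCA entry carries no data, only a tag, a few state bits, and a small counter, so even with as many entries as the cache has lines it stays a small fraction of the cache's storage.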
Simulator
PHARMsim:
• Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified L2 caches
• Separate address / data networks, similar to Fireplane
• Region Coherence Array with same sets/assoc. as L2
Workloads
• Scientific
– Ocean, Raytrace, Barnes
• Multiprogrammed
– SPECint2000_rate
• Commercial
– TPC-W, TPC-B, TPC-H, SPECweb99,
SPECjbb2000
Broadcasts Avoided
[Figure: stacked bar chart of the percentage of requests for which a broadcast is avoided, broken down by request type (Read, I-Fetch, Write, DCB, Write-back), with bars for 256B, 512B, and 1KB regions and an Oracle, for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean. Values labeled on the chart: 67%, 56%.]
Snoop Traffic Reduction – Peak
[Figure: bar chart of peak broadcasts per 100K cycles (axis 0 to 8000) for the Scientific, Multiprogrammed, and Commercial workloads, comparing "Peak Traffic" with "Peak Traffic with 512B Regions". Values labeled on the chart: 64%, 51%, 38%.]
Snoop Traffic Reduction – Average
[Figure: bar chart of average broadcasts per 100K cycles (axis 0 to 4000) for the Scientific, Multiprogrammed, and Commercial workloads, comparing "Average Traffic" with "Average Traffic with 512B Regions". Values labeled on the chart: 47%, 74%, 86%.]
Execution Time
[Figure: bar chart of normalized execution time (axis 0.0 to 1.0) for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean, comparing the Baseline with 256B, 512B, and 1KB regions. Value labeled on the chart: 91.2%.]
Remaining Opportunity
• With 512B regions, ~10% of requests are still
broadcast unnecessarily
• A third of that 10% is due to region false sharing
• Half of that 10% is due to misses in the RCA
– Potential for prefetching
Inclusion Overhead
[Figure: bar chart of L2 miss ratio (axis 0% to 100%) for the Scientific, Multiprogrammed, and Commercial workloads, comparing the baseline miss rate with the miss rate with 512B regions. Increases labeled on the chart: +0.56%, +0.23%, +0.04%.]
– Regions with no lines cached are replaced first
Conclusion
Coarse-Grain Coherence Tracking:
• Reduces broadcast traffic
– Most data requests sent directly to memory
• Reduces latency
– Many requests not sent to central arbitration point
– Many non-data requests not sent externally
• Improves scalability and performance
The End
Inclusion Evictions
[Figure: stacked bar chart of region evictions (axis 0% to 100%) for the Scientific, Multiprogrammed, and Commercial workloads, split by how many cache lines each region eviction forced out: 0 lines evicted, 1 line evicted, 2 lines evicted.]
Ordering
• Ordering point is now the Region Coherence Array
– A direct request is ordered once it accesses the RCA
• Direct requests are serialized with respect to snoop requests
– A direct request occurs either before or after a snoop
– All must appear to access and update RCA atomically
• No two processors can have exclusive access to a
region at the same time (no races)
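The ordering rule above can be pictured with a software analogy in which a lock stands in for the atomic RCA access. This is only a sketch: real hardware does not use a software lock, and the class and method names are hypothetical.

```python
import threading
from typing import Set


class OrderedRCA:
    """Toy per-processor RCA whose lock acts as the ordering point."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._exclusive: Set[int] = set()  # regions this processor holds exclusively

    def grant_exclusive(self, region: int) -> None:
        """Record that a broadcast found no other sharers of the region."""
        with self._lock:
            self._exclusive.add(region)

    def try_direct_request(self, region: int) -> bool:
        """A direct request is ordered at the moment it reads the RCA.

        Returns True if the request may skip the broadcast.
        """
        with self._lock:
            return region in self._exclusive

    def snoop(self, region: int) -> None:
        """An external request for the region ends exclusive access."""
        with self._lock:
            self._exclusive.discard(region)
```

Because both operations take the same lock, a direct request sees the region state either entirely before or entirely after any snoop, never a partial update.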
Comparison to RegionScout
                                        CGCT                        RegionScout
Optimization                            Latency                     Power
Avoids broadcast for non-shared data    Yes                         Yes
Avoids broadcast for clean data         Yes                         No
Avoids tag lookups on snoops            No                          Yes (like Jetty)
Region state storage                    Inclusive cache             Hash table, small cache
Region state transfer                   2 bits in snoop response    1 bit in snoop response
Region protocol                         7 states                    Effectively 4 states
Execution Time
[Figure: bar chart of normalized execution time (axis 0.0 to 1.0) for the Scientific, Multiprogrammed, and Commercial workloads and their arithmetic mean, comparing "512B Regions" with "512B Regions, half number of sets". Values labeled on the chart: 92.2%, 91.2%.]