Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
International Symposium on Computer Architecture
June 7th, 2005

Overview of Idea
Coarse-Grain Coherence Tracking:
1. Monitors coherence status of memory at a multi-line granularity
2. Uses the coarse-grain information to identify requests that don’t need a coherence broadcast
3. Sends these requests directly to memory

June 7, 2005 ISCA 2005

Problem
[Diagram: four processors (P), each with a cache ($), NC, MC, and DRAM, connected by a Broadcast Network and a Data Network]
Snoop-based systems support a limited number of processors
– Limited broadcast bandwidth
– Increasing memory latency

Opportunity
• Some data requests don’t need a broadcast
– Requests for non-shared data
– Fetches of unmodified instructions
– Write-backs
• Some non-data requests don’t need to leave the processor
– Requests to upgrade copy, but not shared
– Requests to flush copies, but not cached elsewhere

Unnecessary Broadcasts
[Chart: unnecessary broadcasts as a percentage of requests (0% to 100%), broken down into Read, I-Fetch, Write, DCB, and Write-back, for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Bar labels: 93%, 65%, 62%, 67%]

Our Approach
• Identify requests that don’t need a broadcast
• Send data requests directly to memory
– Reduce broadcast traffic
– Reduce latency in some systems
• Avoid sending non-data requests externally
– Further reduce broadcast traffic
– Reduce latency

Coarse-Grain Coherence Tracking
• Memory is divided into coarse-grain regions
– Aligned, power-of-two multiple of cache line size
– Can range from two lines to a physical page
• A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
– Region Coherence Array (RCA)

Coarse-Grain Coherence Tracking
• Each entry has an address tag, state, and count of lines cached by the processor
• The state indicates if the processor and / or other processors are sharing / modifying lines in the region
• On cache misses, the region state is read to determine if a broadcast is necessary

Coarse-Grain Coherence Tracking
• On snoops, the region state provides a response for the region
– Piggy-backed onto the conventional response
– Used to update other processors’ region state
• RCA maintains inclusion over caches
– When regions are evicted, their lines are evicted
– RCA must respond correctly if region’s lines cached
– Replacement algorithm uses line count

Example: Conventional Snooping
• P0 loads 10000₂
• Snoop performed
• Response sent
• Data transfer
[Diagram: processors P0 and P1 and memories M0 and M1 on a network, with each cache's Tag and State shown. P0's load of 10000₂ misses, "Read: P0, 10000₂" is broadcast, the snoop responses are Invalid, and the data is transferred.]

Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 10000₂
• Snoop performed
• Response sent
• Data transfer
[Diagram: as before, with an RCA next to each cache. P0's load of 10000₂ misses, "Read: P0, 10000₂" is broadcast, and the responses are "Invalid, Region Not Shared". P0 has exclusive access to the region.]

Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P0 loads 11000₂
• Direct request sent
• Data transfer
[Diagram: P0's load of 11000₂ is a "MISS, Region Hit". Exclusive region state, broadcast unnecessary. "Read: P0, 11000₂" is sent as a direct request.]

Coarse-Grain Coherence Tracking
Region Coherence Array added; two lines per region
• P1 stores 10000₂
• Snoop performed
• Response sent
• Data transfer
[Diagram: P1's store to 10000₂ misses, and "RFO: P1, 10000₂" is broadcast. The snoop hits in P0's cache, and the response is "Owned, Region Owned". Region not exclusive anymore.]

Overhead
• Storage space needed for RCA
– 3-6% storage overhead for cache
• Two bits needed in snoop response for region response
• Path to memory needed to avoid broadcasts
– Simple with on-chip memory controllers
– May leverage data network

Simulator
PHARMsim:
• Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified L2 caches
• Separate address / data networks
– Similar to Fireplane
• Region Coherence Array with same sets/assoc. as L2

Workloads
• Scientific
– Ocean, Raytrace, Barnes
• Multiprogrammed
– SPECint2000_rate
• Commercial
– TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000

Broadcasts Avoided
[Chart: broadcasts avoided as a percentage of requests (0% to 100%), broken down into Read, I-Fetch, Write, DCB, and Write-back, with bars for 256B, 512B, and 1KB regions and Oracle, for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Bar labels: 67%, 56%]

Snoop Traffic Reduction – Peak
[Chart: Broadcasts / 100K Cycles (0 to 8000), Peak Traffic vs. Peak Traffic with 512B Regions, for Scientific, Multiprogrammed, and Commercial. Bar labels: 64%, 51%, 38%]

Snoop Traffic Reduction – Average
[Chart: Broadcasts / 100K Cycles (0 to 4000), Average Traffic vs. Average Traffic with 512B Regions, for Scientific, Multiprogrammed, and Commercial. Bar labels: 47%, 74%, 86%]

Execution Time
[Chart: Normalized Execution Time (0.0 to 1.0) for Baseline, 256B Regions, 512B Regions, and 1KB Regions, for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Bar label: 91.2%]

Remaining Opportunity
• With 512B regions, ~10% of requests are broadcast unnecessarily
• A third of the 10% are region false sharing
• Half of the 10% miss in RCA
– Potential for prefetching

Inclusion Overhead
[Chart: L2 Miss Ratio (0% to 100%), Baseline miss rate vs. 512B miss rate, for Scientific, Multiprogrammed, and Commercial. Bar labels: +0.56%, +0.23%, +0.04%]
– Regions with no lines cached replaced first

Conclusion
Coarse-Grain Coherence Tracking:
• Reduces broadcast traffic
– Most data requests sent directly to memory
• Reduces latency
– Many requests not sent to central arbitration point
– Many non-data requests not sent externally
• Improves scalability and performance

The End

Inclusion Evictions
[Chart: Region Evictions (0% to 100%), split into 0 lines evicted, 1 line evicted, and 2 lines evicted, for Scientific, Multiprogrammed, and Commercial]

Ordering
• Ordering point is now the Region Coherence Array
– A direct request is ordered once it accesses the RCA
• Direct requests are serialized w.r.t. snoop requests
– A direct request occurs either before, or after a snoop
– All must appear to access and update RCA atomically
• No two processors can have exclusive access to a region at the same time (no races)

Comparison to RegionScout
                                        CGCT                       RegionScout
Optimization                            Latency                    Power
Avoids broadcast for non-shared data    Yes                        Yes
Avoids broadcast for clean data         Yes                        No
Avoids tag lookups on snoops            No                         Yes (like Jetty)
Region state storage                    Inclusive cache            Hash table, small cache
Region state transfer                   2 bits in snoop response   1 bit in snoop response
Region protocol                         7 states                   Effectively 4 states

Execution Time
[Chart: Normalized Execution Time (0.0 to 1.0) for 512B Regions vs. 512B Regions, half number of sets, for Scientific, Multiprogrammed, Commercial, and Arithmetic Mean. Bar labels: 92.2%, 91.2%]
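
Sketch: region lookup on a cache miss
The miss and snoop handling described in the Coarse-Grain Coherence Tracking slides can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the authors' design: it collapses the 7-state region protocol into three hypothetical states (INVALID, EXCLUSIVE, SHARED), uses an unbounded dictionary instead of a set-associative array, and does not model replacement, inclusion evictions, or line-count decrements. The names RegionCoherenceArray, handle_miss, handle_snoop, and region_shared_elsewhere are invented for illustration. The demo replays the three example slides with two 8-byte lines per region, so 10000₂ and 11000₂ fall in the same region.

```python
from dataclasses import dataclass
from enum import Enum


class RegionState(Enum):
    # Hypothetical simplification of the 7-state region protocol.
    INVALID = 0    # region not tracked by this processor
    EXCLUSIVE = 1  # no other processor caches lines in the region
    SHARED = 2     # other processors may cache lines in the region


@dataclass
class RegionEntry:
    state: RegionState = RegionState.INVALID
    line_count: int = 0  # lines of this region cached locally


class RegionCoherenceArray:
    """Toy per-processor RCA: unbounded, no replacement modeled."""

    def __init__(self, region_size=512):
        self.region_size = region_size  # bytes; aligned power of two
        self.entries = {}               # region number -> RegionEntry

    def region_of(self, addr):
        return addr // self.region_size

    def handle_miss(self, addr, region_shared_elsewhere=False):
        """Return 'direct' or 'broadcast' for a local cache miss.

        region_shared_elsewhere stands in for the region bits of the
        snoop response; it is only consulted when a broadcast is sent.
        """
        entry = self.entries.setdefault(self.region_of(addr), RegionEntry())
        if entry.state is RegionState.EXCLUSIVE:
            route = "direct"  # region hit: broadcast unnecessary
        else:
            route = "broadcast"
            entry.state = (RegionState.SHARED if region_shared_elsewhere
                           else RegionState.EXCLUSIVE)
        entry.line_count += 1
        return route

    def handle_snoop(self, addr):
        """Region response to another processor's broadcast.

        Returns True if this processor tracks the region, and
        downgrades the local region state to SHARED in that case.
        """
        entry = self.entries.get(self.region_of(addr))
        if entry is None or entry.state is RegionState.INVALID:
            return False
        entry.state = RegionState.SHARED
        return True


# Replay of the example slides: two 8-byte lines per region (assumed sizes).
p0 = RegionCoherenceArray(region_size=16)
p1 = RegionCoherenceArray(region_size=16)
first = p0.handle_miss(0b10000)    # P0 loads 10000: no region entry yet
second = p0.handle_miss(0b11000)   # P0 loads 11000: same region, exclusive
shared = p0.handle_snoop(0b10000)  # P1's RFO is snooped by P0
third = p1.handle_miss(0b10000, region_shared_elsewhere=shared)
print(first, second, third)
```

In this model the second load is the only request routed directly to memory; once P1's request is snooped, P0's region state is no longer exclusive and later misses in that region are broadcast again.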