Formal Verification and its Impact on the Snooping versus Directory Protocol Debate

Milo M. K. Martin
University of Pennsylvania
milom@cis.upenn.edu

Acknowledgements
• Many thanks to my collaborators
  • Mark Hill, David Wood, Mike Marty @ Wisconsin
  • Dan Sorin @ Duke
  • Alan Hu and Jesse Bingham @ UBC
  • Rajeev Alur, Sebastian Burckhardt @ Penn
• Supported by
  • IBM Graduate Fellowship, Sun, Intel
  • NSF

Milo Martin - ICCD 2005 [2]

Overview
• Multiprocessor cache coherence protocols
  • Allow a multiprocessor to look like a multi-programmed uniprocessor to software
  • Complex, concurrent, and performance critical
  • No consensus on general design approach
  • Multi-decade debate still raging
• Formal verification
  • Used in finding bugs in cache coherence protocols
  • A great success in real-world use of formal verification
• This presentation:
  • Revisiting the debate in the context of formal verification
  • Some observations on protocol design & verification

Caveats
• I’m not a verification expert
  • Primary expertise is computer architecture
  • Especially multiprocessor memory systems
  • Some dabbling in formal verification
• I’m only an academic
  • Limited industrial experience
  • But lots of conversations with designers
• Some of what I will say is controversial
  • Not all of it is new, either

Outline
• Multiprocessors and coherence background
• Formal verification and coherence protocols
• Revisit the snooping vs directory protocol debate
• A new alternative: Token Coherence
• Conclusion

Multiprocessors
• Multiprocessors are becoming ubiquitous
  • All servers, multi-core desktops, multi-core embedded
  • After decades of research and niche deployment
• Why now?
  • Today’s workloads (server and media workloads)
    • SQL and OpenGL most used “parallel languages”
  • Commodity multiprocessor software (e.g., Linux)
  • Power-efficient way to multiply performance
    • E.g., StrongARM 1 GHz → 200 MHz, 30x less power
    • Use 5 cores, 6x power reduction, same net speed
  • Difficult software transition from one to two cores
    • Much easier after that… exciting times

Multiprocessor Hardware
• Provide a shared-memory abstraction
  • Familiar and efficient for programmers
[Figure: processors P1-P4 sharing a single memory system]
[Figure: processors P1-P4, each with a cache, a memory (M1-M4), and a network interface, joined by an interconnection network]
• Cache coherence protocol provides transparency
  • Distributed, complicated, performance critical

Invalidation-based Cache-Coherence
• Goal: provide a “consistent” view of memory
• Permissions in each cache per block
  • One reader/writer (“exclusive block”) -or-
  • Many readers (“shared block”)
• Cache coherence protocols
  • Distributed & complex
  • Correctness critical
  • Performance critical
• Races: the main source of complexity
  • Requests for the same block at the same time

Two classes of multiprocessors
• Snooping multiprocessors
  • Uses broadcast
  • “Virtual bus” interconnect
  + Directly locate data (2 hops)
• Directory-based multiprocessors
  • Directory tracks writer or readers
  + Avoids broadcast
  + Avoids “virtual bus” interconnect
  • Indirection for cache-to-cache (3 hops)
[Figure: a snooping request reaching the data in hops 1 and 2; a directory request passing through memory M in hops 1, 2, and 3]
Method for ordering racing requests is key

Snooping Protocols
• Original designs
  • Bus-based broadcast
• High-speed point-to-point links
  • No (multi-drop) busses
  • Build “virtual bus”
  • Increasingly not globally synchronous
• Other enhancements
  • Split transaction
  • Multiple request and response interconnects
  • Snoop response combining
  • Distribute memory on each processor node

Snooping Example
[Figure, two animation steps: processors P0, P1, P2 and home memory M0 on a totally-ordered “virtual bus” interconnect with a root. Node labels change from Requestor / Requestor / Read/Write to Requestor / Read/Write / No Copy.]
Ordered interconnect orders requests

Directory Protocols
• Send all requests to directory
  • Avoids broadcast
• “Scalable”, but who cares?
  • Most systems sold are modest in size
• Does not require interconnect ordering
• (Bad) alternative names:
  • “CC-NUMA”
  • “Distributed shared memory”
  • “Scalable cache coherence”
  • Why bad names? They don’t capture the fundamental differences
[Figure: a directory request passing through memory M in hops 1, 2, and 3]

Directory Example
[Figure, four animation steps: processors P0, P1, P2 and home memory M0. Two requestors race for a block that a third node holds read/write; the messages shown are Request, Fwd, Data, and Done. The final step shows one Read/Write holder and two nodes with No Copy.]
No ordered interconnect, directory orders requests

The Debate: Snooping v. Directories
Which approach is “better”?
• Debated for 20+ years
• Mostly debated in terms of
  • Scalable performance
  • Performance
• Let’s revisit the debate in terms of
  • Design complexity
  • Verification’s impact on the above

Formal Verification & Coherence Protocols
• Model the protocol at a high level
  • Abstract away some implementation details
  • Capture concurrent races
  • Find protocol bugs (the earlier the better)
  • Alternative: verify implementation vs high-level model
• Multitude of formal techniques
  • Model checking, theorem proving, SAT solvers, etc.
• Apply to scaled-down system
  • Few processors, two data values, two addresses, limited traces, etc.

Explicit Role of Formal Verification
• Post-design verification
  • Used more like traditional design verification
  • Can help find bugs, but many “false bugs”
    • Out-of-date or incomplete specification
    • Or previously found and fixed
  • Many case studies, e.g., [Hu et al., ICCD 1997]
• During-design verification
  • Model creation part of design specification process
  • “Formal verifiers” part of cross-functional design team
  • Find bugs early → easier, cleaner fixes
  • Becoming more common, fewer anecdotes

Implicit Role of Verification
• Once formal verification is part of design…
• Has implicit impact on the actual design
  • A series of bugs might change high-level design
  • Forces deep, systematic thinking about the design
  • Gives designers confidence
  • Just making the model can find bugs (story)
• “Verifiability” becomes a design constraint
  • Designers react to it (story)
  • Encourages modular, cleaner, documented designs

Implicit Role of Verification (continued)
• Is a “verifiable” design a better design?
  • “Principles of good design”, keeps designers honest
  • Avoid problems before “bugs” develop
  • Easier alternative? Just trick the designers
• Design systems to be formally verified?
  • How might doing so affect low-level concurrent protocols?
  • What might such a coherence protocol look like?
  • I’ll talk about one possibility later in the talk…

Two Desirable Coherence Properties
• What properties might a coherence protocol have…
  • To make it “verifiable”
  • To make it simple
  • To make it flexible
• Two desirable decoupling properties
  • Decouple interconnect properties from protocol
  • Decouple consistency from coherence

Decouple Interconnect from Protocol (1 of 2)
• Unordered interconnects
  • Simple, modular interface
  • Deadlock avoidance via virtual networks
  • Constrains design and model the least
• Point-to-point ordered interconnects
  • Disallows adaptive routing
  • Reduces symmetry of model (state space)
  • Not so bad, but better to avoid
• Most directory protocols fall into these categories

Decouple Interconnect from Protocol (2 of 2)
• Totally-ordered interconnects
  • Requires a bus or “virtual bus”, “snoop combining”
  • Sometimes timing sensitive
  • Complicate interface, implementation, modeling
• What protocols require this property?
  • Snooping (all)
    • Is “snooping” defined by broadcast or ordering?
  • Few directory protocols (e.g., GS320)

Decouple Coherence from Consistency
• Memory consistency models
  • Defines “consistent” view of memory
  • Coherence: for a single location
  • Consistency: ordering among multiple locations
• Example:
    Initial state: A = B = 0
    Thread #0          Thread #1
    Store B ← 1        while (A == 0) { /* nothing */ }
    Store A ← 1        Load B
• “Load B” should return?
  • Under sequential consistency, always one
  • Can return zero under weaker models

Enforcing A Memory Consistency Model
• Option #1
  • Coherence protocol provides “coherence invariant”
    • Single reader/writer -or- multiple readers
  • Processor internally allows or disallows reorderings
  • All “sync” instructions internal to processor core
  • Example: Alpha 21364
• Option #2
  • Intertwine and disperse enforcement through system
  • Totally order all requests
  • Send “sync” instructions into memory system
  • Maybe write-through L1 caches in multi-core systems
  • Example: IBM Power4

Decoupling Implications
• For verification
  • Easier to model each piece independently & together
  • Reuse models over time
• For design
  • More compartmentalized
  • Easier incremental improvement over time
  • Reuse of design components

Revisiting Snooping vs Directory Protocols
• Snooping Protocols
  • Simple snooping is seductively simple
    • “Atomic” with simple bus
  • More aggressive implementations are quite complex
  • Violate the two decoupling properties
• Directory Protocols
  • Have the decoupling properties
  • Complex, but in all the ways formal methods can help
  • Better “complexity scalability” over time

Complexity Scaling
[Figure: two plots of complexity versus time, one for snooping and one for directory protocols, broken down into interconnect, protocol, and controller implementation]
• Initial designs
  • Simple bus-based snooping is simple, directory less so
• As design evolves
  • Snooping quickly becomes complex, directory less so
• Caveat: few second-system directory systems

Why Aren’t Directory Protocols More Common?
• Complexity disconnect
  • No evolutionary path to directory protocols
  • Radical design departure
  • Designers are good at incrementally improving working approaches over time
• Scalability trap
  • Previous idea: scalability at all costs!
  • Should only be a means to an end, not an end goal
  • “Scalable cache coherence” is synonymous with directory protocols
  • Often used to bridge between snooping systems
  • Reputation for high latency

My Opinion on the Coherence Debate?
• I now advocate against snooping protocols
  • But for different reasons than others
    • i.e., not performance scalability
  • Main reason: decoupling properties
• A reversal of my previous opinion!
  • Previously, I explored evolving snooping protocols
    • [ASPLOS 2000, HPCA 2002]
• Now, tightly-coupled directory protocols look attractive
  • AMD’s Opteron protocol is interesting
    • “Directory-less” directory protocol
    • Glueless, point-to-point interconnect, non-scalable
• Or, a new alternative…

A New Alternative: Token Coherence [ISCA 2003]
• A protocol designed to be verified formally
  • Fast, simple, flexible, too.
• Decoupling correctness and performance
  • Correctness substrate
    • Safety via token counting
    • Forward progress via persistent requests
  • Separate performance policies
    • Target the common case
• Separate correctness and performance
  • Example of “Better Than Worst-Case Design”

Key Observation: Token Counting
• Explicitly encode permissions with tokens
• At all times, each block has T tokens
  • E.g., one token per processor
• Components exchange tokens & data
  • Tokens: in caches, memory, or in transit
• Controls reading & writing of data
  • One or more to read
  • All tokens to write
Provides safety in all cases

Token Counting Example
[Figure: processors P0-P3, each with L1 I&D and L2 caches, and memories mem 0 through mem 3 on an interconnect, with Load B and Store B requests shown]
• Each memory block initialized with T tokens
• At least one token to read a block
• All tokens to write a block

Guaranteeing Starvation-Freedom
• Handle pathological cases
  • Infrequently invoked
  • Can be slow, inefficient, and simple
• When normal requests fail to succeed (4x)
  • Longer timeout and issue a persistent request
  • Request persists until satisfied
  • Table at each processor
  • “Deactivate” upon completion
• Implementation
  • Arbiter at memory orders persistent requests

Performance Policies
• Opportunities
  • Aggressively target the common case
  • Requests are just “hints” to move data & tokens
• Robust
  • Can’t cause “correctness” violations
  • A null or random policy is correct
  • Rely on correctness substrate
• Examples
  • TokenB - broadcast policy
  • TokenD - performance characteristics of directory
  • TokenM - predictive multicast protocols
  • TokenCMP [HPCA 2005] - multi-level coherence
    • “Flat for correctness, hierarchical for performance”

Ramifications of T.C. on Design Verification
• Divide and conquer complexity
  • Formally verified Token Coherence [HPCA 2005]
  • Difficult to quantify, but promising
• All races handled uniformly (reissuing)
  • E.g., simple replacements (no handshake)
• Local invariants
  • Safety is response-centric; independent of requests
  • Locally enforced with tokens
  • Further innovation → no correctness worries

Token Coherence vs Directory Protocols
• Similarities
  • Decouple interconnect from protocol
  • Decouple coherence from consistency
    • Token Coherence more explicitly gives you a “serial” coherence
• Differences
  • Token Coherence can avoid directory indirection
  • Token Coherence is more flexible, decoupled
  • However, Token Coherence has separate persistent requests, which add complexity
Result: an interesting alternative

Conclusions
• The age of multiprocessors and multi-core chips
  • The coherence protocol is key to such designs
• Formal verification has an important role to play
  • Leverage formal methods early in design process
  • Both explicit and implicit benefits
• Two decoupling properties
  • Decouple interconnect from protocol
  • Decouple coherence and consistency
• Snooping vs directory protocols?
  • Directory protocols have these decoupling properties
  • Token Coherence further embraces them

Starvation Avoidance
[Figure: two CMPs (P0 and P1 on CMP 0, P2 and P3 on CMP 1), each with L1 I&D caches, an on-chip interconnect, and a shared L2, plus mem 0 and mem 1 on a global interconnect. P0, P1, and P2 each issue Store B, sent as GETX transient requests.]
• Tokens move freely in the system
• Transient requests can miss in-flight tokens
  • Incorrect speculation, filters, prediction, etc.
• Solution: issue persistent requests
  • Heavyweight request guaranteed to succeed

Persistent Requests
[Figure, three animation steps on the same two-CMP system: the Store B requests from P0, P1, and P2 time out; arbiter 0 holds the entries B: P0, B: P2, B: P1; the active entry (B: P0, later B: P2) is shown in the table at each processor and shared L2.]
• Processors issue persistent requests
• Arbiter orders and broadcasts activate
• Processor sends deactivate to arbiter
• Arbiter broadcasts deactivate (and next activate)
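
The token-counting rule and the persistent-request steps in these backup slides can be sketched in a few lines of Python. This is only an illustrative model, not the protocol from the Token Coherence papers: the names `Node`, `Block`, `Arbiter`, and `run_demo` are hypothetical, it tracks a single block with `T = 4` tokens, token transfers are treated as atomic (no tokens in transit), and the arbiter is assumed to serve requests in FIFO order.

```python
from collections import deque
from dataclasses import dataclass

T = 4  # tokens per block (assumption: one per processor, P0..P3)


@dataclass
class Node:
    """A cache or memory holding some of the tokens for one block."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        return self.tokens >= 1   # at least one token to read

    def can_write(self) -> bool:
        return self.tokens == T   # all tokens to write


class Block:
    """One memory block; memory starts out holding all T tokens."""

    def __init__(self, cache_names):
        self.nodes = {"mem": Node("mem", T)}
        for name in cache_names:
            self.nodes[name] = Node(name)

    def transfer(self, src, dst, count):
        # Tokens are only moved, never created or destroyed.
        assert 0 < count <= self.nodes[src].tokens
        self.nodes[src].tokens -= count
        self.nodes[dst].tokens += count
        assert sum(n.tokens for n in self.nodes.values()) == T

    def writers(self):
        return [n.name for n in self.nodes.values()
                if n.name != "mem" and n.can_write()]


class Arbiter:
    """Orders persistent requests (FIFO order is an assumption)."""

    def __init__(self, block):
        self.block = block
        self.pending = deque()
        self.active = None

    def persistent_request(self, requester):
        self.pending.append(requester)
        if self.active is None:
            self._activate_next()

    def deactivate(self, requester):
        assert requester == self.active
        self.active = None
        self._activate_next()

    def _activate_next(self):
        if not self.pending:
            return
        self.active = self.pending.popleft()
        # "Broadcast activate": every other node forwards all its tokens.
        for node in list(self.block.nodes.values()):
            if node.name != self.active and node.tokens > 0:
                self.block.transfer(node.name, self.active, node.tokens)


def run_demo():
    block = Block(["P0", "P1", "P2", "P3"])
    arbiter = Arbiter(block)
    block.transfer("mem", "P3", 2)   # scatter tokens, as after failed
    block.transfer("mem", "P1", 1)   # transient requests
    for p in ["P0", "P2", "P1"]:     # arrival order shown at arbiter 0
        arbiter.persistent_request(p)
    order = []
    while arbiter.active is not None:
        p = arbiter.active
        assert block.writers() == [p]   # exactly one writer at a time
        order.append(p)
        arbiter.deactivate(p)           # store done, pass the turn on
    return order, block
```

Calling `run_demo()` serves the three stores one at a time in arrival order, and the token total stays at `T` throughout, which is the sense in which safety is "locally enforced with tokens" while the arbiter only supplies forward progress.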