Formal Verification and its Impact on the
Snooping versus Directory Protocol Debate
Milo M. K. Martin
University of Pennsylvania
milom@cis.upenn.edu
Acknowledgements
• Many thanks to my collaborators
• Mark Hill, David Wood, Mike Marty @ Wisconsin
• Dan Sorin @ Duke
• Alan Hu and Jesse Bingham @ UBC
• Rajeev Alur, Sebastian Burckhardt @ Penn
• Supported by
• IBM Graduate Fellowship, Sun, Intel
• NSF
Milo Martin - ICCD 2005
[2]
Overview
• Multiprocessor cache coherence protocols
• Allows a multiprocessor to look like a multi-programmed
uniprocessor to software
• Complex, concurrent, and performance critical
• No consensus on general design approach
• Multi-decade debate still raging
• Formal verification
• Used in finding bugs in cache coherence protocols
• A great success in real-world use of formal verification
• This presentation:
• Revisiting debate in the context of formal verification
• Some observations on protocol design & verification
Caveats
• I’m not a verification expert
• Primary expertise is computer architecture
• Especially multiprocessor memory systems
• Some dabbling in formal verification
• I’m only an academic
• Limited industrial experience
• But lots of conversations with designers
• Some of what I will say is controversial
• Not all of it is new, either
Outline
• Multiprocessors and coherence background
• Formal verification and coherence protocols
• Revisit the snooping vs directory protocol debate
• A new alternative: Token Coherence
• Conclusion
Multiprocessors
• Multiprocessors are becoming ubiquitous
• All servers, multi-core desktops, multi-core embedded
• After decades of research and niche deployment
• Why now?
• Today’s workloads (server and media)
• SQL and OpenGL most used “parallel languages”
• Commodity multiprocessor software (e.g., Linux)
• Power-efficient way to multiply performance
• E.g., StrongARM 1 GHz → 200 MHz, 30x less power
• Use 5 cores, 6x power reduction, same net speed
• Difficult software transition from one to two cores
• Much easier after that… exciting times
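The StrongARM figures above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a perfectly parallel workload, so that five 200 MHz cores match one 1 GHz core, and the variable names are illustrative.

```python
# Figures from the slide: a 5x lower clock cuts per-core power by 30x.
freq_ratio = 1000 / 200            # 1 GHz -> 200 MHz
power_reduction_per_core = 30
cores = 5

# Assumption: the workload parallelizes perfectly across the slow cores.
aggregate_speed = cores / freq_ratio              # relative to one fast core
total_power = cores / power_reduction_per_core    # relative to one fast core
power_savings = 1 / total_power
```

Under that assumption the five slow cores deliver the same net speed at about one sixth of the power.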
Multiprocessor Hardware
• Provide a shared-memory abstraction
• Familiar and efficient for programmers
[Figure: processors P1 to P4 attached to a single memory system]
[Figure: four nodes, each with a processor (P1 to P4), cache, memory (M1 to M4), and interface, joined by an interconnection network]
• Cache coherence protocol provides transparency
• Distributed, complicated, performance critical
Invalidation-based Cache-Coherence
• Goal: provide a “consistent” view of memory
• Permissions in each cache per block
• One reader/writer (“exclusive block”) -or-
• Many readers (“shared block”)
• Cache coherence protocols
• Distributed & complex
• Correctness critical
• Performance critical
• Races: the main source of complexity
• Requests for the same block at the same time
Two classes of multiprocessors
• Snooping multiprocessors
• Uses broadcast
• “Virtual bus” interconnect
+ Directly locate data (2 hops)
• Directory-based multiprocessors
• Directory tracks writer or readers
+ Avoids broadcast
+ Avoids “virtual bus” interconnect
• Indirection for cache-to-cache (3 hops)
[Figure: message paths among processors (P) and memory (M), with numbered hops for each class]
Method for ordering racing requests is key
Snooping Protocols
• Original designs
• Bus-based broadcast
• High-speed point-to-point links
• No (multi-drop) busses
• Build “virtual bus”
• Increasingly not globally synchronous
• Other enhancements
• Split transaction
• Multiple request and response interconnects
• Snoop response combining
• Distribute memory on each processor node
Snooping Example
[Figure, two build steps: racing requestors and a read/write copy among P0, P1, and P2, with home M0, on a totally-ordered “virtual bus” interconnect with a root]
Ordered interconnect orders requests
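A minimal sketch of why a totally-ordered interconnect resolves races: every node applies the same requests in the same order, so all nodes agree on the outcome. The `VirtualBus` and `Cache` classes are hypothetical and model only write requests for ownership, not data transfer or timing.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.owner_of = {}            # this node's view: block -> current owner

    def snoop(self, requester, block):
        # Every node sees the same requests in the same order,
        # so every node's view of ownership stays identical.
        self.owner_of[block] = requester


class VirtualBus:
    """Hypothetical totally-ordered broadcast: one global request log."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.log = []

    def broadcast(self, requester, block):
        self.log.append((requester, block))   # the single ordering point
        for node in self.nodes:
            node.snoop(requester, block)


caches = [Cache("P0"), Cache("P1"), Cache("P2")]
bus = VirtualBus(caches)
bus.broadcast("P0", "B")      # two racing write requests for block B
bus.broadcast("P1", "B")
views = [c.owner_of["B"] for c in caches]
```

Whichever request the bus orders second ends up as the owner in every node's view; no node needs to negotiate with another.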
Directory Protocols
• Send all requests to directory
• Avoids broadcast
• “Scalable”, but who cares?
• Most systems sold are modest in size
• Does not require interconnect ordering
• (Bad) alternative names:
• “CC-NUMA”
• “Distributed shared memory”
• “Scalable cache coherence”
• Why bad names? They don’t capture the fundamental differences
Directory Example
[Figure, four build steps: Request, Fwd, Data, and Done messages among requestors, the read/write copy, and home M0, across P0, P1, and P2]
No ordered interconnect, directory orders requests
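A minimal sketch of the directory as the ordering point, assuming write requests only and omitting the Done message. The `Directory` class and its method name are hypothetical. It also shows where the third hop comes from when the data sits in another cache.

```python
class Directory:
    """Hypothetical home-node directory handling write requests only."""

    def __init__(self):
        self.owner = {}               # block -> cache holding it read/write

    def handle_write_request(self, requester, block):
        """Return the message hops this request takes, in order."""
        prev = self.owner.get(block)      # None: memory holds the block
        self.owner[block] = requester     # processing order = coherence order
        hops = [(requester, "home", "Request")]
        if prev is None:
            hops.append(("home", requester, "Data"))
        else:
            hops.append(("home", prev, "Fwd"))
            hops.append((prev, requester, "Data"))
        return hops


directory = Directory()
first = directory.handle_write_request("P0", "B")    # served by memory
second = directory.handle_write_request("P1", "B")   # forwarded to P0
```

The order in which the home processes requests decides the race, so the interconnect itself needs no ordering guarantee.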
The Debate: Snooping v. Directories
Which approach is “better”?
• Debated for 20+ years
• Mostly debated in terms of
• Scalable performance
• Performance
• Let’s revisit the debate in terms of
• Design complexity
• Verification’s impact on the above
Formal Verification & Coherence Protocols
• Model the protocol at a high level
• Abstract away some implementation details
• Capture concurrent races
• Find protocol bugs (earlier the better)
• Alternative: verify implementation vs high-level model
• Multitude of formal techniques
• Model checking, theorem proving, SAT solvers, etc.
• Apply to a scaled-down system
• Few processors, two data values, two addresses,
limited traces, etc.
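A minimal sketch of explicit-state model checking on a scaled-down system: two caches, one block, and a toy MSI-style protocol whose transitions are atomic. Real protocol models also capture in-flight messages and races, which this sketch deliberately leaves out. The `buggy` flag is a hypothetical injected error, used only to show the checker finding a violation.

```python
from collections import deque

N = 2  # scaled-down system: two caches, one block


def successors(state, buggy=False):
    """Yield next states of a toy, atomic MSI-style protocol."""
    for i in range(N):
        if state[i] == "I":
            # Read miss: cache i gets a read-only copy; any writer downgrades.
            yield tuple("S" if j == i or s == "M" else s
                        for j, s in enumerate(state))
        if state[i] != "M":
            # Write miss or upgrade: all other copies are invalidated.
            # The buggy variant forgets to invalidate other sharers.
            yield tuple("M" if j == i else (s if buggy and s == "S" else "I")
                        for j, s in enumerate(state))
        if state[i] != "I":
            # Cache i evicts its copy.
            yield tuple("I" if j == i else s for j, s in enumerate(state))


def invariant(state):
    """One reader/writer, or many readers."""
    writers = state.count("M")
    return writers == 0 or (writers == 1 and state.count("S") == 0)


def check(buggy=False):
    """Breadth-first search of reachable states; return a violation or None."""
    start = ("I",) * N
    seen, frontier = {start}, deque([start])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return state
        for nxt in successors(state, buggy):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None
```

Calling `check()` explores every reachable state and returns `None`, while `check(buggy=True)` returns a state in which a writer coexists with a reader.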
Explicit Role of Formal Verification
• Post-design verification
• Used more like traditional design verification
• Can help find bugs, but many “false bugs”
• Out of date or incomplete specification
• Or previously found and fixed
• Many case studies, e.g., [Hu et al., ICCD 1997]
• During-design verification
• Model creation part of design specification process
• “Formal verifiers” part of cross-functional design team
• Find bugs early → easier, cleaner fixes
• Becoming more common, fewer anecdotes
Implicit Role of Verification
• Once formal verification is part of design…
• Has implicit impact on the actual design
• A series of bugs might change high-level design
• Forces deep, systematic thinking about the design
• Gives designers confidence
• Just making the model can find bugs (story)
• “Verifiability” becomes a design constraint
• Designers react to it (story)
• Encourages modular, cleaner, documented designs
Implicit Role of Verification (continued)
• Is a “verifiable” design a better design?
• “principles of good design”, keeps designers honest
• Avoid problems before “bugs” develop
• Easier alternative? just trick the designers
• Design systems to be formally verified?
• How might doing so affect low-level concurrent
protocols?
• What might such a coherence protocol look like?
• I’ll talk about one possibility later in talk…
Two Desirable Coherence Properties
• What properties might a coherence protocol have…
• To make it “verifiable”
• To make it simple
• To make it flexible
• Two desirable decoupling properties
• Decouple interconnect properties from protocol
• Decouple consistency from coherence
Decouple Interconnect from Protocol (1 of 2)
• Unordered interconnections
• Simple, modular interface
• Deadlock avoidance via virtual networks
• Constrains design and model the least
• Point-to-point ordered interconnects
• Disallows adaptive routing
• Reduces symmetry of model (state space)
• Not so bad, but better to avoid
• Most directory protocols fall into these categories
Decouple Interconnect from Protocol (2 of 2)
• Totally-ordered interconnects
• Requires a bus or “virtual bus”, “snoop combining”
• Sometimes timing sensitive
• Complicates interface, implementation, modeling
• What protocols require this property?
• Snooping (all)
• Is “snooping” defined by broadcast or ordering?
• Few directory protocols (e.g., GS320)
Decouple Coherence from Consistency
• Memory consistency models
• Define a “consistent” view of memory
• Coherence: for a single location
• Consistency: ordering among multiple locations
• Example:
Initial state: A = B = 0
Thread #0                          Thread #1
while (A == 0) { /* nothing */ }   Store B ← 1
Load B                             Store A ← 1
• “Load B” should return?
• Under sequential consistency, always one
• Can return zero under weaker models
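The two-thread example can be checked by brute force. The sketch below enumerates every interleaving of the four operations and records what Load B returns in runs where the loop has seen A == 1. It models a weaker memory model simply by also allowing Thread #1's two stores to swap, which is an assumption for illustration; real weak models permit other reorderings too.

```python
from itertools import permutations

T0 = (("load", "A"), ("load", "B"))      # final loop test, then Load B


def load_b_values(thread1_orders):
    """Values Load B can return in runs where the loop saw A == 1."""
    results = set()
    for t1 in thread1_orders:
        ops = [(0, op) for op in T0] + [(1, op) for op in t1]
        for sched in permutations(ops):
            # Keep only schedules that respect each thread's given order.
            if tuple(op for t, op in sched if t == 0) != T0:
                continue
            if tuple(op for t, op in sched if t == 1) != t1:
                continue
            mem, seen = {"A": 0, "B": 0}, {}
            for _, (kind, addr) in sched:
                if kind == "store":
                    mem[addr] = 1
                else:
                    seen[addr] = mem[addr]
            if seen["A"] == 1:            # the while loop exited
                results.add(seen["B"])
    return results


in_order = (("store", "B"), ("store", "A"))
swapped = (("store", "A"), ("store", "B"))
sc_values = load_b_values([in_order])             # sequential consistency
weak_values = load_b_values([in_order, swapped])  # stores may reorder
```

With stores kept in program order, Load B can only return one; once the stores may swap, zero becomes a possible result.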
Enforcing A Memory Consistency Model
• Option #1
• Coherence protocol provides “coherence invariant”
• Single-reader/writer --or-- multiple readers
• Processor internally allows or disallows reorderings
• All “sync” instructions internal to processor core
• Example: Alpha 21364
• Option #2
• Intertwine and disperse enforcement through system
• Totally order all requests
• Send “sync” instructions into memory system
• Maybe write-through L1 caches in multi-core systems
• Example: IBM Power4
Decoupling Implications
• For verification
• Easier to model each piece independently & together
• Reuse models over time
• For design
• More compartmentalized
• Easier incremental improvement over time
• Reuse of design components
Revisiting Snooping vs Directory Protocols
• Snooping Protocols
• Simple snooping is seductively simple
• “Atomic” with simple bus
• More aggressive implementations are quite complex
• Violate the two decoupling properties
• Directory Protocols
• Have the decoupling properties
• Complex, but in all the ways formal methods can help
• Better “complexity scalability” over time
Complexity Scaling
[Figure: two plots of complexity vs. time, one for snooping and one for directory, each broken down into interconnect, protocol, and controller implementation]
• Initial designs
• Simple bus-based snooping is simple, directory less so
• As design evolves
• Snooping quickly becomes complex, directory less so
• Caveat: few second-system directory systems
Why Aren’t Directory Protocols More Common?
• Complexity disconnect
• No evolutionary path to directory protocols
• Radical design departure
• Designers are good at incrementally improving
working approaches over time
• Scalability trap
• Previous idea: scalability at all costs!
• Should only be a means to an end, not an end goal
• “Scalable cache coherence” is synonymous with
directory protocols
• Often used to bridge between snooping systems
• Reputation for high latency
My Opinion on the Coherence Debate?
• I now advocate against snooping protocols
• But for different reasons than others
• i.e., not performance scalability
• Main reason: decoupling properties
• A reversal of my previous opinion!
• Previously, I explored evolving snooping protocols
• [ASPLOS 2000, HPCA 2002]
• Now, tightly-coupled directory protocols attractive
• AMD’s Opteron protocol is interesting
• “Directory-less” directory protocol
• Glueless, point-to-point interconnect, non-scalable
• Or, a new alternative…
A New Alternative: Token Coherence [ISCA 2003]
• A protocol designed to be verified formally
• Fast, simple, flexible, too.
• Decoupling correctness and performance
• Correctness substrate
• Safety via token counting
• Forward progress via persistent requests
• Separate performance policies
• Target the common case
• Separate correctness and performance
• Example of “Better Than Worst-Case Design”
Key Observation: Token Counting
• Explicitly encode permissions with tokens
• At all times, each block has T tokens
E.g., one token per processor
• Components exchange tokens & data
• Tokens: in caches, memory, or in transit
• Controls reading & writing of data
• One or more to read
• All tokens to write
Provides safety in all cases
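The token rules can be sketched for a single block as follows. The `Holder` class and `move` helper are hypothetical names, T = 4 is an assumed token count, and transfers are treated as atomic, so tokens in transit are not modeled.

```python
T = 4  # tokens per block (assumed), e.g. one per processor


class Holder:
    """A cache or memory module holding some of one block's tokens."""

    def __init__(self, tokens=0):
        self.tokens = tokens

    def can_read(self):
        return self.tokens >= 1       # one or more tokens to read

    def can_write(self):
        return self.tokens == T       # all tokens to write


def move(src, dst, n):
    """Transfer n tokens; tokens are never created or destroyed."""
    if not 0 <= n <= src.tokens:
        raise ValueError("cannot send tokens that are not held")
    src.tokens -= n
    dst.tokens += n


memory, p0, p1 = Holder(T), Holder(), Holder()
move(memory, p0, 1)                   # P0 may now read
move(memory, p1, 3)                   # P1 may read, but not yet write
p1_writes_early = p1.can_write()
move(p0, p1, 1)                       # P1 collects the last token
```

Because the token count is conserved, a holder with all T tokens knows that no other holder can read or write the block.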
Token Counting Example
[Figure: P0 to P3, each with L1 I&D and L2 caches, plus memory modules mem 0 and mem 3, joined by an interconnect; one processor issues Load B and another Store B]
• Each memory block initialized with T tokens
• At least one token to read a block
• All tokens to write a block
Guaranteeing Starvation-Freedom
• Handle pathological cases
• Infrequently invoked
• Can be slow, inefficient, and simple
• When normal requests fail to succeed (4x)
• Longer timeout and issue a persistent request
• Request persists until satisfied
• Table at each processor
• “Deactivate” upon completion
• Implementation
• Arbiter at memory orders persistent requests
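A minimal sketch of the arbiter scheme described above, assuming a single arbiter that keeps one persistent request active at a time and models each processor's table as a dictionary. The `Arbiter` class and its method names are hypothetical; timeouts and token forwarding are not modeled.

```python
from collections import deque


class Arbiter:
    """Hypothetical arbiter at memory: one active persistent request at a time."""

    def __init__(self, tables):
        self.tables = tables          # one table (dict) per processor
        self.queue = deque()
        self.active = None

    def request(self, proc, block):
        self.queue.append((proc, block))      # arrival order is the order
        self._activate_next()

    def deactivate(self, proc, block):
        assert self.active == (proc, block)
        for table in self.tables:
            table.pop(block, None)            # broadcast deactivate
        self.active = None
        self._activate_next()

    def _activate_next(self):
        if self.active is None and self.queue:
            proc, block = self.queue.popleft()
            self.active = (proc, block)
            for table in self.tables:
                table[block] = proc           # broadcast activate


tables = [{}, {}, {}]
arbiter = Arbiter(tables)
for proc in ("P0", "P2", "P1"):
    arbiter.request(proc, "B")
first_active = [t["B"] for t in tables]
arbiter.deactivate("P0", "B")
second_active = [t["B"] for t in tables]
```

While an entry for a block is in a processor's table, that processor would send any tokens it holds for the block to the named requester, which is what guarantees the request eventually succeeds.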
Performance Policies
• Opportunities
• Aggressively target the common case
• Requests are just “hints” to move data & tokens
• Robust
• Can’t cause “correctness” violations
• A null or random policy is correct
• Rely on correctness substrate
• Examples
• TokenB - broadcast policy
• TokenD - performance characteristics of directory
• TokenM - predictive multicast protocols
• TokenCMP [HPCA 2005] - multi-level coherence
• “Flat for correctness, hierarchical for performance”
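The claim that even a random policy is correct can be illustrated with a small simulation. It moves tokens between holders at random and asserts the safety conditions after every step. The sizes, the function name, and the atomic transfers are assumptions made for the sketch.

```python
import random

T, HOLDERS = 4, 5    # assumed sizes: 4 tokens; memory plus four caches


def random_policy_run(steps, seed):
    """Move tokens at random and check safety after every move."""
    rng = random.Random(seed)
    tokens = [T] + [0] * (HOLDERS - 1)        # memory starts with all tokens
    for _ in range(steps):
        src, dst = rng.randrange(HOLDERS), rng.randrange(HOLDERS)
        n = rng.randint(0, tokens[src])       # the policy may move any held tokens
        tokens[src] -= n
        tokens[dst] += n
        writers = [h for h in range(HOLDERS) if tokens[h] == T]
        readers = [h for h in range(HOLDERS) if tokens[h] >= 1]
        # Safety: tokens are conserved, and a writer excludes all other readers.
        assert sum(tokens) == T
        assert len(writers) <= 1
        assert not writers or readers == writers
    return tokens


final_tokens = random_policy_run(steps=10_000, seed=0)
```

The policy here is useless for performance, yet it can never violate safety; forward progress is left to the persistent-request mechanism.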
Ramifications of T.C. on Design Verification
• Divide and conquer complexity
• Formally verified Token Coherence [HPCA 2005]
• Difficult to quantify, but promising
• All races handled uniformly (reissuing)
• E.g. simple replacements (no handshake)
• Local invariants
• Safety is response-centric; independent of requests
• Locally enforced with tokens
• Further innovation → no correctness worries
Token Coherence vs Directory Protocols
• Similarities
• Decouple interconnect from protocol
• Decouple coherence from consistency
• Token Coherence more explicitly gives you a
“serial” coherence
• Differences
• Token Coherence can avoid directory indirection
• Token Coherence is more flexible, decoupled
• However, Token Coherence has separate persistent
requests, which add complexity
Result: an interesting alternative
Conclusions
• The age of multiprocessors and multi-core chips
• Coherence protocol is key to such designs
• Formal verification has an important role to play
• Leverage formal methods early in design process
• Both explicit and implicit benefits
• Two decoupling properties
• Decouple interconnect from protocol
• Decouple coherence and consistency
• Snooping vs directory protocols?
• Directory protocols have these decoupling properties
• Token Coherence further embraces them
Starvation Avoidance
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with L1 I&D caches, a shared L2, and memory (mem 0, mem 1), joined by interconnects; Store B at several processors triggers GETX requests]
• Tokens move freely in the system
• Transient Requests can miss in-flight tokens
• Incorrect speculation, filters, prediction, etc
[Figure: same system; P0, P1, and P2 each still have a pending Store B]
• Solution: issue Persistent Requests
• Heavyweight request guaranteed to succeed
Persistent Requests
[Figure: Store B at P0, P1, and P2 times out; arbiter 0 holds persistent requests for B from P0, P2, and P1]
• Processors issue persistent requests
[Figure: arbiter 0 still lists B: P0, B: P2, B: P1; the entry “B: P0” now appears at the L1 caches, shared L2s, and memory in both CMPs]
• Arbiter orders and broadcasts activate
[Figure, steps 1 to 3: table entries throughout the system change from “B: P0” to “B: P2” as arbiter 0 moves to the next request]
• Processor sends deactivate to arbiter
• Arbiter broadcasts deactivate (and next activate)