Dynamic Verification of Sequential Consistency
Albert Meixner and Daniel J. Sorin
Presented by Peter Gilbert
ECE259 Spring 2008

Introduction
• Problem: How to detect errors in multithreaded memory systems?
  – Transient physical faults increasingly problematic as memory systems become more complex
  – Errors possible in many components: caches, memories, cache/memory controllers, interconnect

Potential approach
• Tailored detection mechanisms for each component and each type of error
  – Example: for “single-bit-stuck-at-x” model for system bus, add a parity bit
  – Problems:
    • Lots of components…
    • Requires understanding error models and how they interact
    • Cannot detect design bugs
    • Some errors difficult to detect with localized mechanisms

Dynamic verification
• Monitor system execution
• Verify high-level invariants rather than considering individual components
• Can detect transient faults, design bugs, fabrication defects
  – Any error that affects the high-level invariants
• End-to-end correctness

DVSC
• Verifying memory consistency == verifying memory system correctness
• Goal of DVSC: verify that SC is enforced in a shared memory multiprocessor
  – SC defines end-to-end correctness

DVSC error model
• Caches and memories
  – Bit corruption
• Cache/memory controllers
  – Corruption of state or outputs
• Interconnect
  – Messages corrupted, dropped, replicated, misrouted, reordered

First idea: DVSC-Direct
• Dynamically construct total order of loads and stores
• Verify that total order satisfies SC
• Trigger system recovery when error is detected

DVSC-Direct Design
• For every load and store:
  – Processor informs block’s home memory node
• Inform message:
  – <address, load/store, data value, logical time>
• Logical time: if A causes B, A has smaller logical time
• Replay accesses in logical time order at home node
  – Uses priority queue for Informs and shadow copies of memory blocks
  – Verify that load gets value from most recent store

Cost of DVSC-Direct
• Inform bandwidth
  proportional to the number of loads and stores
  – 8-53 times more bandwidth than unprotected system
  – Uses bandwidth like a system without caches!

Alternative: DVSC-Indirect
• Verify sub-invariants proven to be equivalent to SC
  – Proof due to Plakal et al.
• Terminology
  – Coherence epoch: interval of logical time during which a processor has Shared or Exclusive access to a block
  – A memory access is bound to a coherence transaction T if permission is obtained via T

Constructing SC from sub-invariants
• Fact 1: load of block B bound to T receives either:
  – Most recent store of B bound to T
  – Value of B received in response to T
• Lemmas
  1. Exclusive epochs for block B do not overlap with other Exclusive or Shared epochs for B
  2. Every load or store occurs in some epoch and is bound to the transaction that started that epoch
  3. Each word w of B received at the start of an epoch equals the most recent store to w

DVSC-Indirect approach
• DIVA to verify that memory operations occur in program order (Fact 1)
  – Recall from Architecture I: DIVA dynamically verifies a speculative core
  – DIVA presents in-order abstraction to memory system
• ECC added to each cache and memory line to detect silent corruptions
• Hardware for verifying epoch invariants

Cache controller
• Cache controller maintains Cache Epoch Table
  – CET entry (per-block):
      S/E (1 bit) | DRB (1 bit) | Logical time at start (16 bits) | Hash of data block at start of epoch (16 bits)
      S/E: type of epoch (shared or exclusive)
      DRB: data ready bit
  – Check that every load and store is performed in appropriate epoch (Lemma 2)
  – When epoch ends, send info to home memory controller in Inform-Epoch message (CET entry plus end time and end data hash)

Memory controller
• Memory controller maintains directory-like Memory Epoch Table
  – MET entry (per-block):
      Latest end time of any S epoch (16 bits) | Latest end time of any E epoch (16 bits) | Hash of data block from latest E epoch (16 bits)
  – When Inform-Epoch is received from cache controller:
    • Sort in priority queue
      (VWB)
    • Process in logical time order, checking:
      – This epoch does not overlap with other epochs
      – Correct block data is transferred from epoch to epoch

DVSC-Indirect implementation
[Figure: three CPUs, each with DIVA and a cache with CET, connected by an interconnect to three memories, each with VWB, Verifier, and MET]

DVSC-Indirect snooping example
[Figure]

DVSC-Indirect summary
• Verifies SC through sub-invariants
• Costs:
  – DIVA
  – Storage structures
    • CET at each cache controller, MET and VWB at each memory controller, ECC on cache and memory lines
    • Not large or complicated
• Bandwidth usage
  – Proportional to coherence traffic (as opposed to number of loads and stores for DVSC-Direct)

Evaluation
• Can DVSC detect the errors from the error model?
  – Corrupted, dropped, misrouted, reordered, duplicated messages
  – Corrupted cache and memory blocks
  – Don’t consider errors in processor core
    • DIVA handles this
• How much does DVSC increase bandwidth usage?
• How does it affect error-free performance?
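
The memory-controller checks described on the "Memory controller" slide (sort Inform-Epochs by logical time, then check for overlap and for correct data hand-off between epochs) can be sketched roughly as follows. This is an illustration only, not the authors' hardware design: the names `InformEpoch`, `METEntry`, and `MemoryVerifier` are hypothetical, epochs are assumed to be half-open intervals [start, end), the priority queue standing in for the VWB is unbounded, hashes are plain integers, and wraparound of the 16-bit logical time is ignored.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class InformEpoch:
    """Hypothetical Inform-Epoch message; ordered by start time only."""
    start: int                               # logical time at epoch start
    end: int = field(compare=False)          # logical time at epoch end
    block: int = field(compare=False)        # block address
    exclusive: bool = field(compare=False)   # True = E epoch, False = S epoch
    start_hash: int = field(compare=False)   # hash of block data at epoch start
    end_hash: int = field(compare=False)     # hash of block data at epoch end


@dataclass
class METEntry:
    last_s_end: int = 0   # latest end time of any S epoch
    last_e_end: int = 0   # latest end time of any E epoch
    last_e_hash: int = 0  # hash of block data from latest E epoch


class MemoryVerifier:
    def __init__(self, initial_hashes):
        # Assumption: the MET starts out holding a hash of each block's memory contents.
        self.met = {blk: METEntry(last_e_hash=h) for blk, h in initial_hashes.items()}
        self.vwb = []  # min-heap keyed on epoch start time (stand-in for the VWB)

    def receive(self, msg):
        heapq.heappush(self.vwb, msg)

    def drain(self):
        """Process all buffered epochs in logical-time order; return detected errors."""
        errors = []
        while self.vwb:
            ep = heapq.heappop(self.vwb)
            reason = self._check(ep)
            if reason is not None:
                errors.append((ep.block, ep.start, reason))
        return errors

    def _check(self, ep):
        entry = self.met[ep.block]
        # E epochs must follow every earlier S and E epoch; S epochs may overlap other S epochs.
        if ep.exclusive:
            limit = max(entry.last_s_end, entry.last_e_end)
        else:
            limit = entry.last_e_end
        if ep.start < limit:
            return "overlap"
        # Data at epoch start must match what the latest E epoch left behind.
        if ep.start_hash != entry.last_e_hash:
            return "data"
        if ep.exclusive:
            entry.last_e_end, entry.last_e_hash = ep.end, ep.end_hash
        else:
            entry.last_s_end = max(entry.last_s_end, ep.end)
        return None


verifier = MemoryVerifier({0xA0: 111})
verifier.receive(InformEpoch(10, 20, 0xA0, True, 111, 222))   # E epoch, arrives first
verifier.receive(InformEpoch(5, 8, 0xA0, False, 111, 111))    # earlier S epoch, arrives late
verifier.receive(InformEpoch(15, 25, 0xA0, False, 222, 222))  # S epoch overlapping the E epoch
errors = verifier.drain()
print(errors)
```

In this example the late-arriving S epoch is still processed first because the queue reorders by start time, and only the third epoch is flagged. A bounded queue, as in the real VWB, would have to process its oldest entry when full, which is where the out-of-order false positives mentioned under "Error coverage" would come from.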
Methodology
• Full-system simulation with Simics
  – 8-node multiprocessor
    • Each processor implements SC, speculates for higher performance
    • Two levels of cache
    • Support for backward error recovery with SafetyNet at each node
  – DVSC-Indirect costs
    • Each CET 68 KB
    • Each MET 102 KB
    • VWB: 1024 entries
  – SPARC V9 running Solaris 8
  – Interconnect: 2.5 GB/s links
• Benchmarks
  – Four commercial workloads + barnes-hut from SPLASH-2

Error coverage
• DVSC-Direct and DVSC-Indirect detected all injected errors in simulation
• Small probability of false negatives in DVSC-Indirect
  – ECC fails to detect a bit error
  – Hash collisions
• False positives also possible in DVSC-Indirect
  – When VWB not large enough to prevent out-of-order processing of Inform-Epochs

Bottleneck link bandwidth
[Figure: Bandwidth on most-utilized link, directory]
• DVSC-Direct: uses 8-53 times more bandwidth
• DVSC-Indirect directory: 8-25% increase
• DVSC-Indirect snooping: 0-15% increase

DVSC-Indirect error-free performance
[Figure: Runtime with DVSC-Indirect compared to unprotected system, directory]
• Performance impact minimal: usually equivalent to that of SafetyNet by itself
• Similar results for snooping

Conclusions
• DVSC-Indirect is effective at detecting all memory system errors injected in the simulations
  – False negatives and false positives will occur with small probability
• DVSC-Indirect imposes small error-free performance overhead
• Bottleneck bandwidth usage with DVSC-Indirect is only 8-25% greater than unprotected case
• Is the hardware cost justified?
  – Probably depends on the application reliability requirements
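
As a back-of-the-envelope check on the storage costs listed under Methodology, the table sizes can be related to the per-entry field widths given on the Cache controller and Memory controller slides. The slides do not state entry counts; the figures below are only what the numbers imply under two assumptions: 1 KB = 1024 bytes, and each table stores nothing beyond the listed fields (no tags or valid bits). The helper `implied_entries` is a hypothetical name.

```python
# Field widths in bits, as listed on the CET and MET slides.
CET_FIELD_BITS = {"S/E": 1, "DRB": 1, "start_time": 16, "start_hash": 16}
MET_FIELD_BITS = {"last_S_end": 16, "last_E_end": 16, "last_E_hash": 16}


def implied_entries(table_kb, field_bits):
    """Return (bits per entry, whole entries that fit, leftover bits)."""
    entry_bits = sum(field_bits.values())
    total_bits = table_kb * 1024 * 8  # assumption: 1 KB = 1024 bytes
    entries, leftover_bits = divmod(total_bits, entry_bits)
    return entry_bits, entries, leftover_bits


cet = implied_entries(68, CET_FIELD_BITS)
met = implied_entries(102, MET_FIELD_BITS)
print("CET:", cet)
print("MET:", met)
```

Under these assumptions a 34-bit CET entry and a 68 KB table work out to exactly 16384 entries, and a 48-bit MET entry and a 102 KB table to exactly 17408 entries, so the quoted sizes are at least consistent with the entry layouts.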