Recording Inter-Thread Data Dependencies for Deterministic Replay
Tarun Goyal, Kevin Waugh, Arvind Gopalakrishnan

Debugging Multithreaded Programs
Debuggers: always helpful
Aim of discussion
• Deterministic replay of multiprocessor execution
• Record nondeterministic events, especially memory races
Flight Data Recorder (FDR), BugNet, Strata

FDR -- Approach
• Deterministic re-players and data race detectors exist
• FDR records operating system and I/O events

FDR -- Assumptions
• Sequential consistency (SC)
• Directory-based scheme
• Cache size is the same as memory

FDR -- Kinds of logs
Three kinds, to meet performance, space, and complexity requirements:
• To restore consistent state: logs old memory values on updates (checkpoints and logging)
• To record the outcome of races: assumes SC and records a subset (implied races are omitted)
• To record system I/O: logs interrupt timing and treats device interfaces as pseudo-processors
Has low time and space overhead, so it can be continuously enabled

Recording Races
• Necessary to log nondeterministic thread interleaving, i.e. the outcomes of races
• Question: how much to log? The answer lies in the memory model, here SC
• Record arcs (ordered pairs of dynamic instructions), but not all of them
• Timestamps of cached blocks are stored; missing timestamps are approximated

FDR Issues and Optimizations
• Log size: Regulated Transitive Reduction judiciously logs strict vector dependencies
• Hardware cost: false races; approximation on LRU in an associative set; 24 KB per core
• Simpler design: take timestamps out of the cache
• TSO model: avoids the replay deadlocks of SC; needs additional information on load values

BugNet: Net the Bug
• Architecture support for deterministic replay debugging
• Focus on replay of user code and shared libraries
• Built by improving on the ideas of FDR
• Claims to be viable for use in software (application) development
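The arc logging described on the Recording Races and FDR Issues and Optimizations slides can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not FDR's hardware design: it applies plain transitive reduction in the style of Netzer's scheme (not the regulated variant), keeps an unbounded timestamp per address rather than per cached block, and all names (`ArcRecorder`, `access`, `_order_after`) are hypothetical.

```python
class ArcRecorder:
    """Logs cross-thread dependence arcs, skipping arcs implied transitively.

    Software illustration only; the class and method names are hypothetical.
    """

    def __init__(self, n_threads):
        self.ic = [0] * n_threads  # per-thread instruction count
        # vc[j][i]: latest instruction of thread i already ordered before thread j
        self.vc = [[0] * n_threads for _ in range(n_threads)]
        self.last_write = {}  # addr -> (tid, ic, vector snapshot)
        self.last_reads = {}  # addr -> {tid: (ic, vector snapshot)}
        self.log = []         # arcs: ((src_tid, src_ic), (dst_tid, dst_ic))

    def _order_after(self, src_tid, src_ic, src_vc, dst_tid):
        if src_tid == dst_tid:
            return  # program order already covers same-thread dependencies
        if self.vc[dst_tid][src_tid] < src_ic:
            # Not implied by earlier arcs, so it must be logged.
            self.log.append(((src_tid, src_ic), (dst_tid, self.ic[dst_tid])))
        # The destination inherits everything the source had already seen.
        merged = [max(a, b) for a, b in zip(self.vc[dst_tid], src_vc)]
        merged[src_tid] = max(merged[src_tid], src_ic)
        self.vc[dst_tid] = merged

    def access(self, tid, addr, is_write):
        self.ic[tid] += 1
        if addr in self.last_write:  # RAW or WAW dependency
            w_tid, w_ic, w_vc = self.last_write[addr]
            self._order_after(w_tid, w_ic, w_vc, tid)
        if is_write:                 # WAR dependencies on earlier readers
            for r_tid, (r_ic, r_vc) in self.last_reads.get(addr, {}).items():
                self._order_after(r_tid, r_ic, r_vc, tid)
        snapshot = list(self.vc[tid])
        if is_write:
            self.last_write[addr] = (tid, self.ic[tid], snapshot)
            self.last_reads[addr] = {}
        else:
            self.last_reads.setdefault(addr, {})[tid] = (self.ic[tid], snapshot)


# One sequentially consistent interleaving of three threads.
rec = ArcRecorder(3)
rec.access(0, "A", True)   # T0 writes A
rec.access(0, "B", True)   # T0 writes B
rec.access(1, "B", False)  # T1 reads B: arc logged
rec.access(1, "A", False)  # T1 reads A: implied by the previous arc
rec.access(1, "C", True)   # T1 writes C
rec.access(2, "C", False)  # T2 reads C: arc logged
rec.access(2, "A", False)  # T2 reads A: implied transitively via T1
print(rec.log)
```

In this trace there are four cross-thread dependencies, but only two arcs end up in the log; the other two are implied by program order plus the logged arcs.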

Architecture Overview
Checkpoint-based recording
• Checkpoint interval snapshots
• CP buffer (PC + register map)
Observe the loads done by threads to trace the complete execution
• Initial register values in a CP
• The trace of the loads
Tracking loads works in spite of interrupts, DMA transfers, and other threads writing to shared memory.
Load bits in cache
• Reduce repeated loads and log size
• Updated on stores from external events
FLL (First-Load Log) and MRB (Memory Race Buffer)
Dictionary-based compression
• For log data

FDR vs BugNet
FDR
• Features include tracking I/O, interrupts, and DMA accesses
• Extra hardware and log size overhead
BugNet
• Focus on application-level software debugging; simpler scheme
• Smaller in terms of hardware and log size

Assumptions/Limitations
• Assumes a sequentially consistent memory model
• Won't help in finding bugs caused by interactions with the OS and other system code
• Usability in mainstream systems is questionable
• For debugging user-level applications, is software-based recording more viable?

Strata -- Logging Shared Memory Dependencies
Narayanasamy et al., ASPLOS'06
Record memory counts on a dependency
Hardware/cache-based scheme
• Assumes sequential consistency
• Directory and snoopy cache coherence
Drop-in replacement for Netzer's scheme
• Smaller log size
• Less computation to create the log
• More complicated replay

Strata (cont.)
Low resource overhead
• 12% bandwidth on the directory scheme
• ~0% bandwidth on the snoopy scheme
Scales linearly with the number of threads
• Each stratum holds one word per thread
• Potentially worse than Netzer's scheme

Concerns and Criticisms
All systems require hardware
• Significant resource overhead
• Software would be slower, but still useful
Consistency models are restrictive
• Exclude commodity hardware (x86)
Encourages sloppy programming
• Users != Testers
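The Strata slides ("record memory counts on a dependency", "each stratum holds one word per thread") can be made concrete with a small software model. This is a hedged sketch of the general idea only, assuming that a stratum is logged just before the second access of a cross-thread dependency whose first access falls after the previous stratum. It tracks accesses per address in dictionaries instead of in cache or coherence hardware, and the names `StrataRecorder`, `access`, and `_conflicts` are hypothetical.

```python
class StrataRecorder:
    """Logs a stratum (per-thread memory-op counts) when a dependency needs one.

    Software illustration only; the class and method names are hypothetical.
    """

    def __init__(self, n_threads):
        self.n = n_threads
        self.counts = [0] * n_threads  # memory ops per thread since last stratum
        self.writer = {}               # addr -> last writer in the current region
        self.readers = {}              # addr -> readers in the current region
        self.strata = []               # logged strata, one count per thread each

    def _conflicts(self, tid, addr, is_write):
        w = self.writer.get(addr)
        if w is not None and w != tid:
            return True                # RAW or WAW against another thread
        if is_write:                   # WAR against another thread's read
            return any(r != tid for r in self.readers.get(addr, ()))
        return False

    def access(self, tid, addr, is_write):
        if self._conflicts(tid, addr, is_write):
            # Both accesses lie in the current region: separate them by a stratum.
            self.strata.append(tuple(self.counts))
            self.counts = [0] * self.n
            self.writer.clear()        # earlier accesses now sit behind the
            self.readers.clear()       # stratum and are ordered for free
        self.counts[tid] += 1
        if is_write:
            self.writer[addr] = tid
            self.readers.pop(addr, None)
        else:
            self.readers.setdefault(addr, set()).add(tid)


# One sequentially consistent interleaving of two threads.
rec = StrataRecorder(2)
rec.access(0, "A", True)   # T0 writes A
rec.access(0, "B", True)   # T0 writes B
rec.access(1, "B", False)  # T1 reads B: stratum (2, 0) logged first
rec.access(1, "A", False)  # T1 reads A: already separated by that stratum
rec.access(0, "A", False)  # T0 reads A: read-read, no dependency
rec.access(0, "B", True)   # T0 writes B after T1's read: stratum (1, 2) logged
print(rec.strata)
```

Three cross-thread dependencies occur in this trace, but only two strata are logged. Each stratum costs one count per thread, which is the linear scaling noted on the slide and the reason the log can grow larger than Netzer's as the thread count rises.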