Getting Rid of Store-Buffers in TSO Analysis

Mohamed Faouzi Atig, Uppsala University, Sweden
Ahmed Bouajjani, LIAFA, University of Paris 7, France
Gennaro Parlato, University of Southampton, UK

Sequential consistency memory model (SC)

[Figure: threads T1 … Tn accessing a single shared memory]

Write(var,val): sh_mem[var] := val; (immediately visible to all threads)
Read(var): returns sh_mem[var];

SC =
• actions of different threads are interleaved in any order
• actions of the same thread maintain the execution order

WMM = for performance reasons, modern multi-processors reorder memory operations of the same thread

Total Store Ordering (TSO)

[Figure: each thread T1 … Tn writes into its own store-buffer, e.g. (x,4) (z,7) (y,3) for T1 and (z,4) (y,4) for Tn; memory-update processes M1 … Mn move buffered writes to the shared memory]

• Each thread has its own store-buffer (FIFO)
• Write(var,val): the pair (var,val) is sent to the buffer
• Memory update = execution of a Write taken from some buffer
• Read(var) returns val if
  - (var,val) is the last write to var still in the store-buffer, or
  - the buffer does not contain any Write to var, and sh_mem(var) = val
• fence requires that the store-buffer is empty

Correct under SC -- Wrong under TSO

Dekker's mutual exclusion protocol (initially x = 0, y = 0)

Thread 1                    Thread 2
a: y:=1                     1: x:=1
b: r1:=x                    2: r2:=y
c: if (r1==0) then          3: if (r2==0) then
d:   critical section       4:   critical section

Bad schedule for TSO: a b c d 1 2 3 4, with the writes (y,1) and (x,1) still in the store-buffers. Both reads return 0, so both threads are in the critical section!

Verification for TSO?

• For finite-state programs, reachability is non-primitive recursive [Atig, Bouajjani, Burckhardt, Musuvathi – POPL'10]
• What shall we do?
• Symbolic representation of the store-buffers? [Linden, Wolper – SPIN'10]: regular model-checking
• Our approach: reduce the analysis from TSO to SC
  • can be done only with approximations …

What is this talk about

If we restrict to executions where each thread is executed at most k times with no interruption (for a fixed k), we can translate any concurrent program P_TSO (recursion, thread creation, heap, …) into another program P_SC such that
• P_SC (under SC) simulates all possible executions of P_TSO (under TSO) where each thread is executed at most k times
• P_SC has no buffer at all! The store-buffers are simulated using 2k copies of the shared variables as locals
• P_SC has linear size in the size of P_TSO
• Advantage: use off-the-shelf SC tools for the analysis of TSO programs

Code-to-code translation from TSO to SC

k-round (for each thread) reachability

[Figure: processes P1 … Pn acting on the shared memory, where Pi = thread Ti together with its memory-update process Mi]

• A run is a sequence of rounds: (Ti1 + Mi1)+ is a round of Pi1, (Ti2 + Mi2)+ is a round of Pi2, ...
• A k-round run: ∀i, # rounds of Pi ≤ k

Compositional reasoning

The behaviour of Pi in a k-round run is [(Ti + Mi)*]^k. Each round is given its own pair of local copies: round 0 uses (Mask0, Buff0), round 1 uses (Mask1, Buff1), round 2 uses (Mask2, Buff2).

Getting rid of store-buffers

• Mask_i is a copy of the shared vars as Booleans (as locals)
• Buff_i is a copy of the shared vars (as locals), e.g. x = -, y = 6, z = -

Invariant: at each time in the simulation
• Mask_i[var] = 1 iff there is a store for var in the store-buffer that updates the shared memory at round i
• Buff_i[var] contains the last value sent for var among those stores

Example (Mask_i[var] = 1 exactly where Buff_i[var] holds a value):

         x   y   z
Buff0    3   5   -
Buff1    0   -   -
Buff2    0   1   4

Store-buffer, grouped by the round in which each write reaches the shared memory (within a round, the most recent write is listed first):
round 2: (x,0) (y,1) (z,4) (y,7)
round 1: (x,0) (x,4) (x,7)
round 0: (x,3) (x,7) (y,5)

Simulation

Before simulation:
• Masks set to False
• r_SC := 0; r_TSO := 0;

Simulation:
• All statements not involving shared vars are executed unchanged
• Write(var,val):
  Mask_{r_TSO}[var] := T;
  Queue_{r_TSO}[var] := val;
  (Queue_i denotes the same local copy as Buff_i)
• Read(var):
  Let i be the greatest index s.t. i >= r_SC & Mask_i(var) = 1
  if such an i exists (i >= 0) return Queue_i[var]
  else return var;
• End of round (update shared vars):
  For all var, if Mask_{r_SC}(var) == 1 then var := Buff_{r_SC}[var];

Skeleton of the translation

Shared sh_vars;

Thread_i()
Begin
  locals l_vars; r_TSO, r_SC, sim, Mask0, Buff0, …, Maskk, Buffk;
  Init();    // initialize Masks to False, r_SC=0, r_TSO, sim=0;
  stmt_1;
  stmt_2;
  …
  stmt_n;
end

Each statement stmt_j is replaced by:
  before(); stmt_j; after();

before() {             // start round
  if (!sim) {
    lock;
    sim=1;
    r_SC++;
    if (r_TSO < r_SC)
      r_TSO=r_SC;
  }
  while(*)
    r_TSO++;
}

after() {
  if(*) {              // end round
    Update_shared(r_SC, Mask, Queue);
    sim=0;
    unlock;
  }
}

Characteristics of the translation

• For fixed k, P_SC is linear in the size of P_TSO
• 2k copies of the shared variables as locals (no store-buffer)
• P_SC and P_TSO are in the same class
  • no restriction on the programs is imposed
• The reachable shared states are the same in P_SC and P_TSO:
  a state S is reachable in P_TSO with at most k rounds per thread iff S is reachable in P_SC

Bounding Store Ages

Observation: when r_SC = 1, (Mask0, Buff0) are not used any longer.
Reuse the Mask and Queue variables.
Translation: (Mask_j, Buff_j) are used circularly (modulo k+1).

k store-ages:
• Unbounded rounds!
• Constraint: each write pair remains in the store-buffer for at most k rounds

How can we use this code-to-code translation?

Corollaries

Decidability results for TSO reachability
Our code-to-code translation is a linear reduction TSO -> SC, so TSO reachability inherits decidability from SC.

Schedules (k fixed)    Concurrent Boolean prog.       Complexity    References
k-store-ages           no recursion                   Pspace
k context-switches     recursion                      Exptime       [Qadeer, Rehof – TACAS'05]
k round-robin          recursion; finite # threads    Exptime       [Lal, Reps – CAV'08]
                       or parameterized                             [La Torre, P., Madhusudan – CAV'09]
                                                                    [La Torre, P., Madhusudan – CAV'10]
k-rounds per thread    recursion, thread-creation     2-Expspace    [Atig, Bouajjani, Qadeer – TACAS'09]
k-delay bound          recursion, thread-creation     Exptime       [Emmi, Qadeer, Rakamaric – POPL'11]
k-compositional        recursion, thread-creation     Exptime       [Bouajjani, Emmi, P. – SAS'11]

Tools for SC -> Tools for TSO

[Figure: concurrent program -> our code-to-code translation (as a plug-in) -> tool for SC]

A convenient way to get new tools for TSO …

Experiments

Mutual exclusion protocols, checked with POIROT (by MSR). Loop unrolling: 2. D stands for delay bound.

              No fences              With fences
              (buggy for TSO)        (correct for TSO)
Protocol      D=1                    D=1        D=2
Dekker        7 s                    6 s        72 s
Lamport       26 s                   110 s      1608 s
Peterson      5 s                    6 s        47 s
Szymanski     8 s                    6 s        978 s

POIROT: SMT-based bounded model-checker for SC programs
Errors due to TSO are discovered in a few seconds!
POIROT can also be a model-checker for TSO!

Conclusions

We have proposed a code-to-code translation from TSO to SC
• allows existing and future tools designed for SC to be used to analyze programs running under TSO
• under-approximation (error finding)
• the restrictions imposed on the analyzed runs are useful for finding errors in programs

Beyond TSO? Generic approach?

Thanks!
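
Backup: an executable sketch of the buffer-free simulation

The Write, Read and end-of-round rules of the Simulation slide can be tried out on Dekker's protocol with a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the translation itself: the class BufferlessThread and its method names are hypothetical, the nondeterministic choices while(*) and if(*) are made explicitly by the caller, and lock/unlock is omitted because the rounds are run one after another.

```python
class BufferlessThread:
    """One thread of the SC program: k+1 (Mask, Buff) pairs replace its FIFO buffer.

    Hypothetical helper for illustration; names are not taken from the talk.
    """

    def __init__(self, shared, k):
        self.shared = shared                      # dict of shared variables
        self.k = k
        self.mask = [dict.fromkeys(shared, False) for _ in range(k + 1)]
        self.buff = [dict.fromkeys(shared, None) for _ in range(k + 1)]
        self.r_sc = 0                             # current round of this thread
        self.r_tso = 0                            # round in which new writes reach memory

    def delay_writes(self):
        """Caller-made version of `while(*) r_TSO++`: postpone later writes by one round."""
        assert self.r_tso < self.k, "no (Mask, Buff) pair left for a later round"
        self.r_tso += 1

    def write(self, var, val):
        """Write(var,val): record the value in the pair of round r_TSO."""
        self.mask[self.r_tso][var] = True
        self.buff[self.r_tso][var] = val

    def read(self, var):
        """Read(var): greatest i >= r_SC with Mask_i[var] set, else the shared value."""
        for i in range(self.k, self.r_sc - 1, -1):
            if self.mask[i][var]:
                return self.buff[i][var]
        return self.shared[var]

    def end_round(self):
        """End of round: flush the writes of round r_SC, then move to the next round."""
        for var, pending in self.mask[self.r_sc].items():
            if pending:
                self.shared[var] = self.buff[self.r_sc][var]
        self.r_sc += 1                            # folded here instead of in before()
        self.r_tso = max(self.r_tso, self.r_sc)


def dekker_bad_schedule():
    """Replay the bad TSO schedule of Dekker's protocol with k = 1 (rounds 0 and 1)."""
    shared = {"x": 0, "y": 0}
    t1 = BufferlessThread(shared, k=1)
    t2 = BufferlessThread(shared, k=1)

    # Thread 1, round 0: its write to y stays "in the buffer" until round 1.
    t1.delay_writes()
    t1.write("y", 1)                              # a: y := 1
    own = t1.read("y")                            # a thread sees its own pending write
    r1 = t1.read("x")                             # b: r1 := x
    t1.end_round()                                # nothing to flush for round 0

    # Thread 2, round 0: y is still 0 in the shared memory.
    t2.write("x", 1)                              # 1: x := 1
    r2 = t2.read("y")                             # 2: r2 := y
    t2.end_round()                                # flushes x := 1

    # Thread 1, round 1: only now does y := 1 reach the shared memory.
    t1.end_round()
    return r1, r2, own, shared


if __name__ == "__main__":
    r1, r2, own, shared = dekker_bad_schedule()
    print("r1 =", r1, "r2 =", r2, "own read of y =", own, "shared =", shared)
```

Both r1 and r2 are 0 in this run, so both threads pass their test and enter the critical section, although no store-buffer is modelled: the delayed write to y lives only in thread 1's local (Mask1, Buff1) pair until the end of its second round.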