Theory of Memory

W. Paul
Saarland University and DFKI
bmb+f project Verisoft-XT
joint work with Ulan Degebaev and Norbert Schirmer, Saarland University

Why might this be important?
• Unites theories of
  – store buffers
  – interlocking
  – caches
  – cache coherence
  – out of order execution
  – X64 instruction set
  – address translation
  – optimized compilation
  – structured parallel C semantics
• Explains why a hypervisor might run structured parallel C
• VCC is supposed to mirror structured parallel C semantics
• thus VCC might be(come) sound

Specifying Memory
[Figure labels: x, M(x)]

Store Buffer
[Figure labels: memory M, sbuf(y), r(j), w(i)]

Caches
[Figure labels: M, ca]

Many Caches: Snooping
[Figure labels: M, ca(1), ca(p)]

Many Caches
[Figure labels: x.la, x.off, M, ca(1), ca(p)]

Overlapping Transactions
[Figure labels: a, b, c, public (a)]

Sequentially Consistent Memory (lemma 5)
[Figure labels: a, b, c, public (a)]

Tomasulo Schedulers for OOO
[Figure labels: IF, issue, reservation stations, funct. units, CDB, ROB, WB]

Two Memory Units
[Figure labels: m, RS, MMU, RS, sbuf, funct. units, LS, CDB, ROB]

Single Processor OOO correctness (lemma 6)
[Figure labels: m, RS, MMU, RS, sbuf, funct. units, LS, CDB, ROB]

Multi Processor OOO implementation
[Figure labels: m, RS, MMU, RS, sbuf, funct. units, LS, CDB, data(i,j), ROB]

Multi Processor OOO correctness (lemma 7)
[Figure labels: m, RS, MMU, RS, sbuf, funct. units, LS, CDB, data(i,j), ROB]
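
The Store Buffer slides place a buffer sbuf between a processor and memory M. The following is a minimal sketch in C of one common reading of that picture, assuming a FIFO buffer whose newest matching entry is forwarded to the processor's own reads. The names sbuf_t, sb_write, sb_read and sb_drain_one, the capacity, and the word-addressed toy memory are illustrative assumptions and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MEM_WORDS 16  /* assumed size of the toy memory M */
#define SBUF_CAP   8  /* assumed store buffer capacity */

typedef struct { size_t addr; uint32_t data; } sb_entry;

typedef struct {
    sb_entry e[SBUF_CAP];  /* e[0] is the oldest pending write */
    size_t len;            /* number of pending writes */
} sbuf_t;

/* Processor write: enqueue the write instead of updating M at once. */
static void sb_write(sbuf_t *sb, size_t addr, uint32_t data) {
    assert(addr < MEM_WORDS && sb->len < SBUF_CAP);
    sb->e[sb->len].addr = addr;
    sb->e[sb->len].data = data;
    sb->len++;
}

/* Processor read: the newest buffered write to addr wins, else read M. */
static uint32_t sb_read(const sbuf_t *sb, const uint32_t *M, size_t addr) {
    assert(addr < MEM_WORDS);
    for (size_t i = sb->len; i > 0; i--) {
        if (sb->e[i - 1].addr == addr) {
            return sb->e[i - 1].data;
        }
    }
    return M[addr];
}

/* Memory side: commit the oldest pending write to M (FIFO order).
   Returns 1 if a write was committed, 0 if the buffer was empty. */
static int sb_drain_one(sbuf_t *sb, uint32_t *M) {
    if (sb->len == 0) {
        return 0;
    }
    M[sb->e[0].addr] = sb->e[0].data;
    for (size_t i = 1; i < sb->len; i++) {
        sb->e[i - 1] = sb->e[i];
    }
    sb->len--;
    return 1;
}
```

In this sketch the writing processor sees its own latest write through sb_read, while M (and hence any other observer of M) still holds the old value until sb_drain_one runs. That gap is what makes the store buffer visible in general.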
X64 architecture
• CPU core
  – R: user registers
  – SR: system registers
    • CR3
  – acc: access
  – segmentation
• mmu: memory management unit
  – tlb: translation look aside buffer
• memory system
  – mm: main memory
  – ca: cache
  – sbuf: store buffer
[Figure labels: mm, ca, sbuf, acc, mmu, tlb, CR3, segmentation, core, R]

Segmentation off (lemma 8)
• 1 segment
• large as entire address space
• segmentation invisible
[Figure labels: mm, ca, sbuf, acc, mmu, tlb, CR3, segmentation, core, R]

Bad news: cache state is visible
• CPU core or devices
  – acc: access
• acc.adr: address
• acc.r: rights (user, write, exe)
• acc.data
• acc.mmode: memory mode
  – WB: write back
  – WT: write through
  – ...
  – NC: no cache
[Figure labels: mm, ca, sbuf, acc, mmu, tlb, CR3, core, R]

Good news: no device, no NC mode
• acc.mmode: memory mode
  – WB: write back
  – WT: write through
  – ...
  – NC: no cache (not used)
[Figure labels: mm, ca, sbuf, acc, mmu, tlb, CR3, core, R]

Sequentially Consistent Physical Memory (lemma 9)
• acc.mmode: memory mode
  – WB: write back
  – WT: write through
  – ...
  (mix on same address)
• PM: sequentially consistent physical memory abstraction
  – Proof: MOESI invariants are maintained
[Figure labels: PM, sbuf, acc, mmu, tlb, CR3, core, R]

Initialize page tables
• 1 processor
  – sbuf invisible
• operating mode: paging disabled
  – mmu invisible
• set up page table tree in PM
[Figure labels: page tables, PM, sbuf, acc, mmu, tlb, CR3, core, R]

Translated Linear Memory
• many processors
• operating mode: paging enabled
• keep tlb consistent
[Figure labels: page tables, PM, sbuf, acc, mmu, tlb, CR3, core, R]

Translated Consistent Linear Memory + sbufs (lemma 10)
• many processors
• operating mode: paging enabled
• keep tlb consistent
[Figure labels: LM, page tables, sbuf, acc, CR3, core, R]

C0: Pascal with C syntax
configurations
• c = (pr, rd, lms, hm, gm)
  – pr: program rest
  – rd: recursion depth
  – lms: [0 : recursion depth] → {local memories}
  – hm: heap memory
  – gm: global memory
• subvariables
  – (m,i)[17].gpr[3]
• value of pointers: subvariables
[Figure labels: memory m]
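
The configuration tuple c = (pr, rd, lms, hm, gm) from the C0 slide can be pictured as a record. The sketch below is only an illustration in C under stated assumptions: the type names memory_t and c0_config, the fixed bound MAX_DEPTH, the placeholder memory layout, and the helpers c0_call, c0_return and c0_top are all hypothetical. It also assumes that a function call raises rd by one and adds a fresh local memory, which is how the index range [0 : rd] of lms is read here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8  /* assumed bound on the recursion depth */

/* Placeholder for one memory; the real layout is not given on the slide. */
typedef struct { int words[4]; } memory_t;

typedef struct {
    const char *pr;               /* program rest (kept as text in this sketch) */
    size_t rd;                    /* recursion depth */
    memory_t lms[MAX_DEPTH + 1];  /* local memories for depths 0..rd */
    memory_t hm;                  /* heap memory */
    memory_t gm;                  /* global memory */
} c0_config;

/* Assumed call step: increase rd and start with an empty local memory. */
static void c0_call(c0_config *c) {
    assert(c->rd < MAX_DEPTH);
    c->rd++;
    c->lms[c->rd] = (memory_t){{0}};
}

/* Assumed return step: drop the top local memory. */
static void c0_return(c0_config *c) {
    assert(c->rd > 0);
    c->rd--;
}

/* Local memory of the function currently executing. */
static memory_t *c0_top(c0_config *c) {
    return &c->lms[c->rd];
}
```

With this reading, hm and gm exist once per configuration, while lms grows and shrinks with rd.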
Parallel C
• c = (pr, rd, lms, hm, gm)
  – pr: program rest
  – rd: recursion depth
  – lms: [0 : recursion depth] → {local memories}
  – hm: heap memory
  – gm: global memory
• Share
  – gm
  – hm
• Interleave at the steps of the small steps semantics
• Problem:
  – Processor interleaves instructions of compiled programs code(p)
[Figure labels: memory m, va(c,(m,i)), ba(m,i), size(m,i)]

Simulation relation consis(c, alloc, d)
[Figure labels: LM, alloc(c,y), y, alloc(c,p), p]

Non optimizing compiler: step by step simulation

Optimizing compiler: simulation between IO-steps

IO-steps (1): volatile accesses

Volatiles Sequentially Consistent (lemma 11)

Structured Parallel C
• Implement locks using volatiles
• IO-steps (2): lock release
• Run processors alone on locked portions of linear memory
• Lemma 1: sbufs invisible
• Lemma 10: ordinary C code in linear memory

Summary
• Implement locks using volatiles
• IO-steps (2): lock release
• Run processors alone on locked portions of linear memory
• Lemma 1: sbufs invisible
• Lemma 10: ordinary C code in linear memory
• Outlined correctness proof for implementation of structured parallel C
  – Initialisation
  – Compilation
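
The bullets "Implement locks using volatiles" and "IO-steps (2): lock release" can be illustrated with a small spin lock. The slides build locks from volatile accesses that lemma 11 makes sequentially consistent. Plain volatile in ISO C gives neither atomicity nor ordering between threads, so this sketch swaps in C11 atomics as a portable stand-in. The names lock_t, lock_init, lock_try_acquire, lock_acquire and lock_release are hypothetical and not taken from the slides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One lock word; true means some processor holds the lock. */
typedef struct { atomic_bool held; } lock_t;

static void lock_init(lock_t *l) {
    atomic_init(&l->held, false);
}

/* Try once: the exchange returns the previous value,
   so a previous value of false means this caller took the lock. */
static bool lock_try_acquire(lock_t *l) {
    return !atomic_exchange_explicit(&l->held, true, memory_order_acquire);
}

/* Spin until the lock is taken. */
static void lock_acquire(lock_t *l) {
    while (!lock_try_acquire(l)) {
        /* busy wait */
    }
}

/* Release: writes made while holding the lock are ordered
   before the lock word becomes false again. */
static void lock_release(lock_t *l) {
    bool was_held =
        atomic_exchange_explicit(&l->held, false, memory_order_release);
    assert(was_held);  /* releasing a free lock is a usage error */
    (void)was_held;
}
```

Between lock_acquire and lock_release a processor runs alone on the data the lock protects, so only the accesses to the lock word need the stronger ordering.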