Acoherent Shared Memory
Derek R. Hower
Ph.D. Defense
July 16, 2012

Executive Summary
  Coherent View: simple abstraction
    - Complex implementation
    - Hides caches (bad?!)
    - High overhead
  Acoherent View: simple abstraction
    - Simple implementation
    - Abstracts caches
    - Low overhead
  [Figure: coherent view beside acoherent view; labels: P, GPU, L1, L2, CI, CO, "?"]

Outline
  Motivation and Goals
  ASM Model
  ASM-CMP Prototype
  Evaluation and Results
  Conclusions and Future Work

Trends
  Energy Matters
    Dark Silicon/Mobile/Datacenter
    < 50% of processor powered by 2024 [1]
  Complexity Matters
    Lower barrier to entry for accelerators
  Area Matters
    New tech nodes are not cheaper [2]
    Memory: may be difficult to turn off, e.g., S-NUCA
  Compatibility doesn’t matter
    Vertical integration is the new black
  We must change
  We can change
  [1] Esmaeilzadeh et al., ISCA 2011
  [2] ExtremeTech 2012

The Problem With Coherence
  Wrong abstraction
    Optimized for fine-grained, share-everything
      • Programs aren’t!
    Makes SW isolation hard
    Hypothesis: SW will want control over data placement
  Impedes HW specialization
    Does your multicore ASIC need a coherence controller?
    Coherent GPUs?
  Efficiency problems
    Directories take space/broadcasts take energy
      • e.g., 14% of the cache is dedicated to the directory on a 4-core die [1]
  [1] Stackhouse et al., ISSCC 2008

Rethinking Coherence: Goals
  Maintain programmer sanity
    Keep shared memory
    Minimal compatibility change
  Expose hardware capabilities
    Let SW guide memory management -> semantics
  Simple hardware
    Lower cost of entry for accelerators
  Solution: Acoherent Shared Memory

Outline
  Motivation and Goals
  ASM Model
  ASM-CMP Prototype
  Evaluation and Results
  Conclusions and Future Work

ASM Model Basics
  Replace black box with simple hierarchy
    Still flat, linear address space
  SW gets private storage
    Manage with CVS-like checkout/checkin
  [Figure: two processors (P), each with CO and CI paths to shared memory]

Checkout/Checkin
  Granularity?
  Checkout/Checkin are not synchronization primitives
    - Closer to a FENCE
  Checkout: Pull data into private storage
  Checkin: Publish local updates globally
  [Figure: two processors (P), each with CO and CI paths to shared memory]

Segments
  Compromise: Memory Segments
    - Linear partition of address space
    - CO/CI segments at a time
  Observation: Programs are already segmented
    Can re-use layout
  Typical CO/CI granularity in existing C code
  [Figure: address-space layout: Stack, Heap, Data, BSS, Code]

Segment Types
  Not all memory wants/needs acoherence
  Segment types give different “views”
    Communicate semantic information to HW
  Available Types
    Private
    Coherent-RW
    Acoherent
    Coherent-RO
  [Figure: address-space layout (Stack, Heap, Data, BSS, Code) annotated with segment types; annotations include Private, Shared, Device, and Shared, Read-Only]

Managing Finite Resources
  Model so far is strong acoherence
    Likely requires prohibitive HW resources
  Also weak acoherence and best-effort acoherence
    Still useful to software/hardware
  Weak acoherence:
    Data visible early (before checkin)
    Synchronized => not a problem
    Hybrid Runtimes => not a problem
  Best-effort acoherence:
    Spontaneous checkouts at any time
      • + SW notification
    All-or-nothing

Case Study: pthreads
  Task: Convert to ASM
  Step 1: Assign Segments
    Automatic:
      • Stack in private segment
      • Text in coherent-RO segment
      • Global, Heap in acoherent segment
    Runtime:
      • Synch. in coherent-RW segment
  Step 2: Checkout/Checkin
    Automatic: Library
      • CI/CO Global, Heap at synchronization
  Application code works as is:

    pthread_barrier_t barrier;
    char* shared_data;

    int main(int argc, char* argv[])
    {
      int i,j,k;
      pthread_t sib;
      shared_data = malloc(PROBLEM_SIZE);
      pthread_barrier_init(&barrier, NULL, 2);
      pthread_create(&sib, NULL, worker, (void*) 1);
      worker((void*) 0);
      pthread_join(sib, NULL);
      return 0;
    }

    void* worker(void* arg)
    {
      while (work remains) {
        <split work>
        <do work>
        pthread_barrier_wait(&barrier);   /* Communication Point */
      }
    }

  Library code:

    int pthread_barrier_init(…) {
      …
      _barrier = coherent_malloc(sizeof(int));
      …
    }

    int pthread_barrier_wait(…) {
      …
      checkin(heap, data);
      <barrier>
      checkout(heap, data);
      …
    }

Memory Consistency Model
  Option 1: The Details (6 slides + really ugly equations)
  Option 2: The Highlights (2 slides)

Memory Consistency Model
  Defined in style of SPARC TSO/RMO
  Memory Order: Total order of memory ops
    • Restricted by consistency model
  Processor Order: Local dependencies
  Value of load: defined via memory + processor order

Weak Acoherence
  1. Define Memory Order
    Same as TSO, etc.:
    (a) Load -> Load to same address:
        $L_i^S[a] <_p L_i^S[a] \Rightarrow L_i^S[a] <_m L_i^S[a]$
    (b) Load -> Store to same address:
        $L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$
    (c) Store -> Store to same address:
        $S_i^S[a] <_p S_i^S[a] \Rightarrow S_i^S[a] <_m S_i^S[a]$
    CI-CO pair => fence:
    (d) Paired CI-CO act as distributed fence:
        $S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S[a] <_m L_j^S$
    Total order of CO/CI:
    (e) CI/CO -> CI/CO:
        $CX <_p CX \Rightarrow CX <_m CX$
  2. Define legal value of loads
        $value(L_i^S[a]) = value(\max_{<_m} \{ S^S[a] \mid S^S[a] <_m L_i^S[a] \text{ or } S^S[a] <_p L_i^S[a] \})$

Strong Acoherence
  1. Define Memory Order
    (a) Load -> Load to same address:
        $L_i^S[a] <_p L_i^S[a] \Rightarrow L_i^S[a] <_m L_i^S[a]$
    (b) Load -> Store to same address:
        $L_i^S[a] <_p S_i^S[a] \Rightarrow L_i^S[a] <_m S_i^S[a]$
    (c) Store -> Store to same address:
        $S_i^S[a] <_p S_i^S[a] \Rightarrow S_i^S[a] <_m S_i^S[a]$
    (d) Paired CI-CO act as distributed fence:
        $S_i^S[a] <_p CI_i^S <_m CO_j^S <_p L_j^S \Rightarrow S_i^S[a] <_m L_j^S$
    (e) CI/CO -> CI/CO:
        $CX <_p CX \Rightarrow CX <_m CX$
    Stores not visible until CI (normally: $S <_p CI \Rightarrow CI <_m S$):
    (f) Store not visible until CI:
        $S_i^S <_p next_p(CI^S, S_i^S) <_p next_p(CO^S, S_i^S) \Rightarrow next_p(CI^S, S_i^S) <_m S_i^S$
    Can “lose” data:
    (g) Stores can be clobbered:
        $S_i^S <_p CO <_p next_p(CI, S_i^S) \Rightarrow S_i^S <_m \max_p(CO^S, S_i^S)$
  2. Define legal value of loads
        $value(L_i^S[a]) = value(\max_{<_p} \{ S_i^S[a] \mid \max_p(CO^S, L_i^S[a]) <_p S_i^S[a] <_p L_i^S[a] \})$
    or, if no such $S_i^S[a]$ exists,
        $value(L_i^S[a]) = value(\max_{<_m} \{ S^S[a] \mid \max_p(CO^S, S^S[a]) <_m S^S[a] <_m L_i^S[a] \})$

Other Segment Types
  Coherent
    Like weak, but:
      • Loads implicitly paired with (atomic) CO
      • Stores implicitly paired with (atomic) CI
    SC w.r.t.
each other
  Private
    Like weak

Analysis
  Subtleties: CO/CI not atomic

  (a) Lazy checkout (initially A = 0)
      Thread 0            Thread 1
      00: CHECKOUT
                          10: A = 1
                          11: CHECKIN
      01: R0 = A
      Strong: R0 = 0 or 1
      Weak:   R0 = 0 or 1

  (b) Isolation (initially A = 0)
      Thread 0            Thread 1
      02: CHECKOUT
      03: R0 = A
                          12: A = 1
                          13: CHECKIN
      04: R1 = A
      Strong: R0 = 0, R1 = 0
      Weak:   R0 = 0, R1 = 0 or 1

  (c) Leaky stores (initially A = 0)
      Thread 0            Thread 1
      05: CHECKOUT
                          14: A = 1
      06: R0 = A
      Strong: R0 = 0
      Weak:   R0 = 0 or 1

ASM = SC for DRF
  ASM = SC for lossless and properly paired
  Lossless: No clobbering checkouts, i.e.,
      $\forall S_i^S[a]$: if $\exists CO_i^S : S_i^S[a] <_p CO_i^S$, then $\exists CI_i^S : S_i^S[a] <_p CI_i^S <_p CO_i^S$
  Properly Paired: All conflicting store->load pairs separated by CI/CO, i.e.,
      $\forall L_i^S[a], S_j^S[a] : value(L_i^S[a]) = value(S_j^S[a]),\ i \neq j \Rightarrow \exists CO_i^S, CI_j^S : S_j^S[a] <_p CI_j^S <_m CO_i^S <_p L_i^S[a]$
  Proof sketch:
    Lossless + properly paired executions are defined by CO/CI order and program order only
    CO/CI order and program order are the same in ASM and SC

CO/CI Semantics
  CO/CI like fence
  Lazy checkouts
      Initially, A = 0
      Thread 0            Thread 1
      00: CHECKOUT
                          10: A = 1
                          11: CHECKIN
      01: R0 = A
      Finally: R0 = 0 or 1
  Non-atomic, non-blocking checkins
    • Updates can interleave
      Thread 0            Thread 1
      00: A = 1           10: A = 10
      01: B = 2           11: B = 20
      02: CHECKIN         12: CHECKIN
      Finally, any combo of:
        A = 1 or 10
        B = 2 or 20

Consistency Highlights
  Coherent accesses have implicit CO/CI
  CO/CI are totally ordered
    Transitivity hides non-atomicity
  Sequentially consistent for data-race-free
    Lossless & Properly Paired
  [Figure: lock handoff between Thread 0 and Thread 1. Labels: ST critical, CI critical_segment, ST lock (with implicit CI lock_segment), LL/LD lock (with implicit CO lock_segment), SC lock, CO critical_segment, LD critical]

Outline
  Motivation and Goals
  ASM Model
  ASM-CMP Prototype
  Evaluation and Results
  Conclusions and Future Work

ASM-CMP Overview
  (Skipping the details)
  Based on MIPS + special insns, e.g., checkout, checkin
  Uses segments, no paging
    • Maintains flat address space
  Coherence protocol -> Acoherence Engine
    DMA for caches
    • Selectively move data
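
The isolation and lost-update behavior from the Analysis and CO/CI Semantics slides can be illustrated with a small software model. This is only a sketch under stated assumptions, not the ASM-CMP hardware: the names `shared_seg`, `private_view`, `seg_attach`, `seg_checkout`, `seg_store`, `seg_load`, `seg_checkin`, and `isolation_litmus` are hypothetical, the segment is four bytes, threads are modeled by interleaving calls by hand, and checkout is eager (the slides describe lazy checkouts, which admit more outcomes).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SEG_BYTES 4 /* hypothetical: one tiny segment */

/* Globally visible copy of the segment (stands in for shared L2/DRAM). */
typedef struct {
    unsigned char data[SEG_BYTES];
} shared_seg;

/* One thread's private working copy plus its write set. */
typedef struct {
    shared_seg *home;
    unsigned char copy[SEG_BYTES];
    bool dirty[SEG_BYTES];
} private_view;

/* Checkout: drop the private copy, including stores that were never
 * checked in, and re-read the global state. Eager in this model. */
void seg_checkout(private_view *v) {
    memcpy(v->copy, v->home->data, SEG_BYTES);
    memset(v->dirty, 0, sizeof v->dirty);
}

/* Bind a view to a segment and take an initial checkout. */
void seg_attach(private_view *v, shared_seg *home) {
    v->home = home;
    seg_checkout(v);
}

/* Stores and loads touch only the private copy (strong acoherence). */
void seg_store(private_view *v, int a, unsigned char val) {
    v->copy[a] = val;
    v->dirty[a] = true;
}

unsigned char seg_load(const private_view *v, int a) {
    return v->copy[a];
}

/* Checkin: publish only the bytes this view wrote. */
void seg_checkin(private_view *v) {
    for (int a = 0; a < SEG_BYTES; a++) {
        if (v->dirty[a]) {
            v->home->data[a] = v->copy[a];
            v->dirty[a] = false;
        }
    }
}

/* Litmus test (b) "Isolation", using the statement labels from the slide. */
void isolation_litmus(int *r0, int *r1) {
    shared_seg mem = {{0}};        /* initially A = 0 (A is byte 0) */
    private_view t0, t1;
    seg_attach(&t0, &mem);
    seg_attach(&t1, &mem);
    seg_checkout(&t0);             /* 02: CHECKOUT */
    *r0 = seg_load(&t0, 0);        /* 03: R0 = A   */
    seg_store(&t1, 0, 1);          /* 12: A = 1    */
    seg_checkin(&t1);              /* 13: CHECKIN  */
    *r1 = seg_load(&t0, 0);        /* 04: R1 = A   */
}
```

In this model `isolation_litmus` yields R0 = 0 and R1 = 0, the strong-acoherence outcome on the slide: Thread 1's checkin does not reach Thread 0 until Thread 0 checks out again. A store followed by a checkout with no intervening checkin is discarded, which corresponds to the "stores can be clobbered" rule.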

Baseline
  [Figure: tiled CMP with four memory controllers; each tile has a core, L1I, L1D, L2, and switch]

Segment Types
  [Figure: cache organization for each segment type (Acoherent, Coherent-RW, Private); labels: P, L1, AE, CI/CO, Noninclusive L2, Exclusive L2]

Acoherence Engine
  Three main responsibilities:
    Checkout: Lazy Flash Invalidate
      • Invalidate all segment data
    Checkin: Track write set (Decoupled Metastate Cache)
      • Write back all dirty segment data
    Order: Timestamp based
      • Detect CI-CO pairs
  FSM like coherence, but few races, no directory

Decoupled Metastate Cache
  Decouple metastate from data
    Quick access to aggregate state
  Track V/D per-segment
    Checkout: XOR global/segment valid
    Checkin: Walk segment dirty state
  [Figure: metastate cache alongside all L1 caches]

Order
  Need to:
    1. Determine if a CI precedes a CO
    2. Delay load after CO if previous CI hasn’t completed
  Timestamp algorithm (per segment), two-phase CO/CI:
    1. Acquire timestamp; invalidate/flush
    2. Wait for previous CO/CI to complete
  Implemented in firmware

Multiple Writer Support
  Keep per-byte dirty bitmask in L1s
    Allows multiple writers with false sharing
    12.5% larger L1 cache
  Bitmask accompanies data to L2

Simple?
  [Figure: directory protocol message flow between two L1s and the L2/directory: REQ, FWD, RESP]
  Source of Races / Complexity

Outline
  Motivation and Goals
  ASM Model
  ASM-CMP Prototype
  Evaluation and Results
  Conclusions and Future Work

Methodology
  Simulation-based
    Enhanced-User Mode
  Workloads:
    Class-1: SPLASH
    Class-2: Task-Q
  Three memory modules
    ASM-CMP
    CC from gem5-Ruby
      • MESI (Inclusive)
      • MOESI (Non-inclusive)

Performance
  [Chart: runtime normalized to MOESI; series: moesi, mesi, asm]
  Comparable performance
  False Sharing/Migratory Sharing
  Checkout too much

Perfect Checkout
  [Chart: runtime normalized to ASM baseline; series: asm_base, asm_ideal]

Energy
  [Chart: energy normalized to MOESI; components: e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb]
  Less Energy (Same Performance)

Checkout Characteristics
  Class-1 Workloads
  [Chart: % of checkouts vs. # blocks invalidated]
  Checkouts usually small; can be large (> 25% of L1)
  [Chart: % of checkout invalidations elided; benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water]
  Most checkout invalidations affect dead blocks

Checkin Characteristics
  Class-1 Workloads
  [Chart: % of checkins vs. # blocks invalidated; benchmarks: barnes, fft, fmm, lu, mp3d, ocean, radix, water]
  Checkins usually small; can be large (> 25% of L1)
  Checkin latency is hidden

Outline
  Motivation and Goals
  ASM Model
  ASM-CMP Prototype
  Evaluation and Results
  Conclusions/Other Work

Conclusions
  Going forward:
    HW designs must find efficiency
    SW will want to see caches/control placement
  ASM: viable alternative to coherent shared memory
    Semantic cooperation between HW/SW
  ASM-CMP: build components w/o coherence engine
    Make custom integration easier
  Practically:
    Will the next x86 core use ASM? No
    Will a heterogeneous accelerator? Maybe

Related Work
  ASM Model
  ASM-CMP
  Alternatives/Detractors

Related Work – ASM Model
  Relaxed consistency models
    Release Consistency (ISCA 1990)
      • Acquire/Release ≈ CO/CI
    DRF-0 (ISCA 1990), DRF-1 (PDS 1993)
      • SC for DRF
    Weak ordering (ISCA 1986)
  Semantic Segmentation
    Cohesion (ISCA 2011)
    Entry consistency (CMU-TR 1991)

Related Work – ASM-CMP
  Rigel: IEEE Micro 2011
    Differentiates coherent/incoherent
  Treadmarks: ISCA 1992
    Twinning and diffing

Related Work – Alternatives
  Reduce directory overhead
    Cuckoo directory (HPCA 2011)
    Tagless directory (MICRO 2009, PACT 2011)
    Waypoint (PACT 2010)
    Region coherence (IEEE Micro 2006)
    SW controlled coherence (…)
  Simplify coherence design
    Denovo (PACT 2011)
  Coherence is here to stay
    CACM 2012

Future Work
  ASM Model
  ASM Implementations
  ASM Software

Future Work – ASM Model
  1. Use CO/CI for synchronization
     Return timestamp with CO/CI
     Blocking CO
  2. Only guarantee transitivity across coherent accesses
     Would eliminate need for timestamps
  3. Hierarchical ASM
     Expose multiple levels of abstracted caches
  4. Interaction with coherent shared memory
     Acoherent/coherent components in same system

Future Work – ASM Implementation
  ASM-CMP
    1. Optimize empty checkout/checkin
    2. Non-speculative support for strong acoherence
       • e.g., HW copy-on-write support on eviction
       • Use ASM as foundation for TM/Determinism/etc.
    3. Low overhead byte-diffing
       • False sharing is rare/pattern reuse is common
    4. More segment control
       • Non-contiguous
       • Remap-able
  Other
    1. Multi-socket support
    2. Use ASM to simplify traditional coherence
       • Private/shared

Future Work – ASM Software
  1. Message passing on ASM
     More efficient than coherence (think: migratory)
  2. Software speculation
     Use working memory for isolation
  3. Programming language integration
     CO/CI first-class operations
     Work already exists:
       • Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS

Previous Work
  Rerun: ISCA 2008 and CACM 2009
    Race recorder for deterministic replay
    vs.
    state of the art:
      • SAME logging performance, > 10x state reduction
  Calvin: HPCA 2011
    Coherence for deterministic execution
      • i.e., zero-log-size deterministic replay
    Selective determinism to match program requirements
  Hobbes: WoDet 2011
    Strong acoherence in SW runtime

Phew!

Backup Slides

What I would do differently
  Focus on more specific target system
  Stop building new infrastructure!
    Why did I?
      • gem5 wasn’t ready
      • Started more radical/not clear it would have helped
  Step back more often
    Easy to get sucked into details; they usually don’t matter
  Functional specification of consistency -> yuck!

Case Study: Cilk
  Work-stealing task queue
  Distributed design

ASM Segments Benefit
  [Chart: runtime normalized to MOESI using segments; series: moesi_tlb_0, moesi_tlb_32, moesi_tlb_64]
  [Chart: energy normalized to MOESI using segments; components: e_l1d, e_l1i, e_l2, e_link, e_switch, e_tlb]

History
  1980: CPU Era
    Everything is general purpose
  2000: Multicore Era
    Moore of the same
  2010: Dark Era
    ??

Navigating the Darkness
  Solution #1: Wait for CMOS replacement
    Don’t hold your breath
  Solution #2: Rethink everything
    Deep integration
    HW Specialization/Heterogeneity
    Efficiency
    Take compatibility off its pedestal
  Coherence?

Rethinking Coherence: Why Now?
  Dennard Scaling is over; Moore’s Law continues:
    Need efficient components/reduced waste
  Heterogeneity/Specialization
    Different memory access patterns
    Multicore ASICs
    Important workloads don’t use it
  Compatibility not a show stopper
    Mobile -> fast design cycles, controlled SW stacks
    Datacenter -> economy of scale in single location
  Missing opportunities

Case Study 2: Software Speculation
  Task: Convert to ASM
  SW can use memory in new ways

    begin_speculation() {
      <copy state>
      checkout(…)        /* Use private storage */
      <setup>
    }

    end_speculation() {
      if (success) {
        <free copies>
        checkin(…)       /* Checkin: commit updates */
      } else {
        abort_speculation();
      }
    }

    abort_speculation() {
      <revert to copy>
      checkout(…)        /* Multiple checkouts: “forget” updates */
      <cleanup>
    }

New Software Potential
  Evaluate ability to write speculation software
  Microbenchmark:
    Fill array with speculative data, then commit
    Vary size of array
  [Chart: normalized runtime vs. # of blocks in isolation region (16 to 128K); series: ASM, MESI]

Using Weak Acoherence

    global array;            /* weak acoherent */

    func producer(…)
      checkout(array);
      array[0] = x;          /* globally visible! */
      …
      array[1] = y;
      checkin(array);
      signal(consumer);
    end func

    func consumer(…)
      waitfor(producer);
      checkout(array);
      …
      checkin(array);
    end func

  Synch hides early checkin
  Synchronized -> Early visibility OK

Using Best-Effort Acoherence

    begin_tx
      checkout(array)
      array[0] = x
      checkout(array)
      array[1] = y
      checkin(array)
    end_tx

  Exception!
  SW handles resource limitations

Simulator Design
  Two Goals
    Functionally evaluate ASM system
      • programming model, kernel management
    Performance comparison to CMP
  Enhanced User Mode simulator
    Emulate non-timing-critical components (e.g., disks)
    Simulate the rest (e.g., virtual memory)

Qualitative Data
  Is ASM a reasonable model? YES
  Almost no changes to application software
    • Unsynchronized flags
    • Stack sharing
  Functioning Kernel, same tricks
    • Heavier use of coherent segments

Three Questions
  1. How can software select view?
  2. Which view to use?
  3. How to manage CO/CI?
  [Figure: hardware layout (P, PC, LLC, DRAM) beside the Acoherent View (with CI/CO), Private View, and Coherent View]

ASM-CMP Segments
  Uses true memory segments
    e.g., all pointers are long (segment + offset)
    BUT, address space still appears flat!
  Long Pointer Propagation
    Segment pointers propagate through datapath
    Add lp/sp + register sidecars
    Languages/SW remain segment-oblivious

ASM-CMP Segments
  Segment pointers propagate with datapath

    memcpy(dst, src, len);

          lp   $t0, 0(dst)
          lp   $t1, 0(src)
          mov  $t2, $a2        ; cnt <- len
    loop: beqz $t2, exit
          lb   $t3, 0($t1)     ; ld src
          sb   $t3, 0($t0)     ; st dst
          addi $t0, $t0, 1     ; inc. dst
          addi $t1, $t1, 1     ; inc. src
          subi $t2, $t2, 1     ; dec. cnt
          b    loop
    exit:

  Pointers are long
  Segment propagates (src -> dst)
  [Figure: register file with segment sidecars: $t0 = dst ptr (Seg + Offset), $t1 = src ptr (Seg + Offset), $t2 = len, $t3 = data; the ALU computes Offset+1 while Seg passes through; memory holds the long Seg. Ptr. values]

The Problem
  [Figure: hardware layout (P, PC, LLC, DRAM) beside the software view (P over Coherent Shared Memory)]

The Problem
  Policy - Hardware
  Software Can’t Change!
  [Figure: hardware layout (P, PC, LLC, DRAM) beside the software view (P over Coherent Shared Memory)]

All Data Are Created Equal?
  Assume: CMP MESI protocol, inclusive LLC

    Location := 1;

  [Figure: hardware layout (P, PC, LLC, DRAM) with "?" marking possible placements of Location]

Missed Opportunities
  Assume: CMP MESI protocol, inclusive LLC

    begin_tx
      cpLocation := Location;
      Location := 1;
    end_tx

  SW Makes Redundant Copy
  [Figure: hardware layout (P, PC, LLC, DRAM)]

All Data Are NOT Created Equal
  Assume: CMP MESI protocol, inclusive LLC

    func foo()
      var Location;        /* Private */
      Location := 1;

  Wasting Space
  [Figure: hardware layout (P, PC, LLC, DRAM)]

ASM-1 Hardware
  8MB L3
  256KB L2, with bitmask
  AE
  32KB L1, with per-line bitmask
  P

Baseline
  16 cores (P0 to P15), in-order, single thread
  Private L1 and L2 per core
  Eight L3 banks (L3_0 to L3_7)
  Ring interconnect

Storage Overhead
  [Chart: storage overhead (0% to 100%) vs. # cores; series: ASM-1, MESI-1 Level, MESI-2 Level, MESI-3 Level]
  More indirection -> longer latency
  No Indirection

Rethinking Coherence: Why Now?
  Dennard Scaling is over; Moore’s Law continues
    Need scalable, energy efficient components
  Accelerators are here
    How should they see memory?
  Shared-little workloads in important markets

All Data Are NOT Created Equal
  Assume: CMP MESI protocol, inclusive LLC

    func CUDAKernel(…)
      …
      Location := 1;

  Not clear accelerators want/need coherence
  [Figure: hardware layout (P, GPU, PC, LLC, DRAM)]
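
The per-line bitmask on the ASM-1 Hardware slide, and the 12.5% figure on the Multiple Writer Support slide, can be illustrated with a short sketch. It is a software model under assumptions, not the actual hardware: the 64-byte line size and the names `l1_line`, `line_store`, `line_checkin`, and `bitmask_overhead` are hypothetical. One dirty bit per data byte is 1/8 of the data storage, which is where 12.5% comes from regardless of line size.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64 /* assumed line size; the slides do not state one */

/* An L1 line with one dirty bit per byte (64 bytes -> 64-bit mask). */
typedef struct {
    uint8_t  bytes[LINE_BYTES];
    uint64_t dirty;
} l1_line;

/* A store updates the byte and marks only that byte dirty. */
void line_store(l1_line *l, unsigned off, uint8_t v) {
    l->bytes[off] = v;
    l->dirty |= (uint64_t)1 << off;
}

/* Checkin: the bitmask travels with the data, and the L2 copy takes
 * only the bytes this writer actually modified. */
void line_checkin(l1_line *l, uint8_t *l2_line) {
    for (unsigned off = 0; off < LINE_BYTES; off++) {
        if ((l->dirty >> off) & 1u) {
            l2_line[off] = l->bytes[off];
        }
    }
    l->dirty = 0;
}

/* Bitmask bits divided by data bits: 64 / 512 for the assumed line. */
double bitmask_overhead(void) {
    return (double)LINE_BYTES / (LINE_BYTES * 8.0);
}
```

Two cores that write different bytes of the same line (false sharing) can check in in either order and both updates survive in the L2 copy, because neither writer overwrites bytes it did not modify.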