Acoherent
Shared Memory
Derek R. Hower
Ph.D. Defense
July 16, 2012
Executive Summary
[Figure: the same two-processor cache hierarchy seen through two lenses: a coherent view that hides the caches behind a simple abstraction, and an acoherent view (e.g., for a GPU) that exposes checkout (CO) / checkin (CI) to private storage]
Coherent View:
- Complex implementation
- Hides caches (bad?!)
- High overhead
Acoherent View:
- Simple implementation
- Abstracts caches
- Low overhead
2
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
3
Trends
 Energy Matters
 Dark Silicon/Mobile/Datacenter
 < 50% of processor powered by 20241
 Complexity Matters
 Lower barrier to entry for accelerators
 Area Matters
 New tech nodes are not cheaper [2]
 Memory: may be difficult to turn off
 e.g., S-NUCA
 Compatibility doesn’t matter
 Vertical integration is the new black
[1] Esmaeilzadeh et al., ISCA 2011
[2] ExtremeTech, 2012
4
We must change, and we can change.
The Problem With Coherence
 Wrong abstraction
 Optimized for fine-grained, share-everything
• Programs aren’t!
 Makes SW isolation hard
 Hypothesis: SW will want control over data placement
 Impedes HW specialization
 Does your multicore ASIC need a coherence controller?
 Coherent GPUs?
 Efficiency problems
 Directories take space/broadcasts take energy
• e.g., 14% of cache is dedicated to the directory on a 4-core die [1]
[1] Stackhouse et al., ISSCC 2008
5
Rethinking Coherence: Goals
 Maintain programmer sanity
 Keep shared memory
 Minimal compatibility change
 Expose hardware capabilities
 Let SW guide memory management -> semantics
 Simple hardware
 Lower cost of entry for accelerators
 Solution: Acoherent Shared Memory
6
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
7
ASM Model Basics
 Replace black box with simple hierarchy
 Still flat, linear address space
 SW gets private storage
 Manage with CVS-like checkout/checkin
[Figure: two processors, each with private storage, connected to a shared level via checkout (CO) and checkin (CI) paths]
8
Checkout/Checkin
 Checkout: pull data into private storage
 Checkin: publish local updates globally
 Checkout/Checkin are not synchronization primitives
 - Closer to a FENCE
 Granularity?
(An illustrative C sketch of checkout/checkin around a shared segment follows.)
9
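As a concrete illustration (not from the defense itself): the checkout/checkin calls below are hypothetical C bindings for the model's operations, and ordinary synchronization, here a pthread barrier, still decides who runs first, since CO/CI are not locks.

#include <pthread.h>

extern void checkout(void *segment);     /* hypothetical CO binding */
extern void checkin(void *segment);      /* hypothetical CI binding */

int shared[1024];                        /* lives in an acoherent segment */
pthread_barrier_t sync_point;            /* initialized elsewhere with count 2 */

void *writer(void *arg) {
    checkout(shared);                    /* pull the segment into private storage */
    for (int i = 0; i < 1024; i++)
        shared[i] = i;
    checkin(shared);                     /* publish local updates globally */
    pthread_barrier_wait(&sync_point);   /* hand off to the reader */
    return 0;
}

void *reader(void *arg) {
    pthread_barrier_wait(&sync_point);   /* wait for the writer's checkin */
    checkout(shared);                    /* pull the published data in */
    long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += shared[i];
    checkin(shared);
    return (void *)sum;
}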
Segments
 Compromise: Memory Segments
– Linear partition of address space
– CO/CI one segment at a time
 Observation: Programs are already segmented
 Can re-use the existing layout (Stack, Heap, Data, BSS, Code)
[Figure: typical process memory layout; heap/data segments are the typical CO/CI granularity in existing C code]
10
Segment Types
 Not all memory wants/needs acoherence
 Segment types give different “views”
 Communicate semantic information to HW
 Available types: Private, Coherent-RW, Coherent-RO (shared, read-only), Acoherent, Acoherent Device
[Figure: segment types overlaid on a typical memory layout; Stack: Private; Heap/Data/BSS: Acoherent (shared); Code: Coherent-RO]
(An illustrative setup sketch follows.)
11
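Purely for illustration, software might label segments with a small setup call like the one below; the API, type names, and linker symbols are assumptions, not the ASM interface (and the case study later notes most of this assignment is automatic):

/* Hypothetical segment-type labels and setup call, for illustration only. */
enum seg_type { SEG_PRIVATE, SEG_COHERENT_RW, SEG_COHERENT_RO, SEG_ACOHERENT };

extern void set_segment_type(void *base, unsigned long len, enum seg_type type);

/* Assumed linker-provided segment bounds. */
extern char __stack_base[], __text_base[], __heap_base[];
extern unsigned long __stack_len, __text_len, __heap_len;

void configure_segments(void) {
    set_segment_type(__stack_base, __stack_len, SEG_PRIVATE);      /* stack: private */
    set_segment_type(__text_base,  __text_len,  SEG_COHERENT_RO);  /* code: shared, read-only */
    set_segment_type(__heap_base,  __heap_len,  SEG_ACOHERENT);    /* heap/data: acoherent */
}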
Managing Finite Resources
 Model so far is strong acoherence
 Likely requires prohibitive HW resources
 Also define weak acoherence and best-effort acoherence
 Still useful to software/hardware
 Weak acoherence:
 Data visible early (before checkin)
 (Synchronized => not a problem)
 Best-effort acoherence:
 Spontaneous checkouts at any time
 • + SW notification
 All-or-nothing
 (Hybrid runtimes => not a problem)
12
Case Study: pthreads
Task: convert existing pthreads code to ASM

Step 1: Assign segments
 Automatic: stack in a private segment; text in a coherent-RO segment; globals/heap (shared_data) in an acoherent segment
 Runtime/library: synchronization state in a coherent-RW segment

Step 2: Checkout/Checkin
 CI/CO the global/heap segments at synchronization (the communication point)
 Automatic: the library performs the checkin/checkout around the barrier

Application code works as is:

pthread_barrier_t barrier;
char* shared_data;                      /* global: acoherent segment */

void* worker(void* arg) {
  ...
  while (/* work remains */) {
    /* <split work> */
    /* <do work> */
    pthread_barrier_wait(&barrier);
  }
}

int main(int argc, char* argv[]) {
  int i, j, k;
  pthread_t sib;
  shared_data = malloc(PROBLEM_SIZE);
  pthread_barrier_init(&barrier, NULL, 2);
  pthread_create(&sib, NULL, worker, (void*) 1);
  worker((void*) 0);
  ...
  pthread_join(sib, NULL);
}

Library code handles CO/CI:

int pthread_barrier_init(...) {
  ...
  _barrier = coherent_malloc(sizeof(int));   /* synchronization in coherent-RW segment */
  return 0;
}

int pthread_barrier_wait(...) {
  ...
  checkin(heap, data);      /* publish local updates */
  /* <barrier> */
  checkout(heap, data);     /* pull in others' updates */
  ...
}
13
Memory Consistency Model
Option 1: The Details (6 slides + really ugly equations)
Option 2: The Highlights (2 slides)
14
Memory Consistency Model
 Defined in style of SPARC TSO/RMO
 Memory Order: Total order of memory ops
• Restricted by consistency model
 Processor Order: Local dependencies
 Value of load: defined via memory + processor order
15
Weak Acoherence
1. Define Memory Order

(a) $L_i^S\langle a\rangle <_p L_i^S\langle a\rangle \Rightarrow L_i^S\langle a\rangle <_m L_i^S\langle a\rangle$   # Load -> Load to same address
(b) $L_i^S\langle a\rangle <_p S_i^S\langle a\rangle \Rightarrow L_i^S\langle a\rangle <_m S_i^S\langle a\rangle$   # Load -> Store to same address
(c) $S_i^S\langle a\rangle <_p S_i^S\langle a\rangle \Rightarrow S_i^S\langle a\rangle <_m S_i^S\langle a\rangle$   # Store -> Store to same address
    (a)-(c): same as TSO, etc.
(d) $S_i^S <_p CI_i^S \wedge CI_i^S <_m CO_j^S \wedge CO_j^S <_p L_j^S \Rightarrow S_i^S <_m L_j^S$   # Paired CI-CO act as a distributed fence
(e) $CX <_p CX' \Rightarrow CX <_m CX'$, where $CX, CX' \in \{CO, CI\}$   # CI/CO -> CI/CO: total order of CO/CI

2. Define legal value of loads

$value(L_i^S\langle a\rangle) = value\big(\max_{<_m}\{\, S^S\langle a\rangle \mid S^S\langle a\rangle <_m L_i^S\langle a\rangle \ \text{or}\ S^S\langle a\rangle <_p L_i^S\langle a\rangle \,\}\big)$
16
Strong Acoherence
1. Define Memory Order

(a)-(e): same as weak acoherence (Load -> Load, Load -> Store, and Store -> Store to the same address; paired CI-CO act as a distributed fence; total order of CO/CI)

(f) # Store not visible until CI (normally $S_i^S <_p CI_i^S \Rightarrow CI_i^S <_m S_i^S$)
$S_i^S <_p \mathrm{next}_p(CI^S, S_i^S) <_p \mathrm{next}_p(CO^S, S_i^S) \ \Rightarrow\ \mathrm{next}_p(CI^S, S_i^S) <_m S_i^S$

(g) # Stores can be clobbered (can "lose" data if a checkout intervenes before the next checkin)
$S_i^S <_p CO^S <_p \mathrm{next}_p(CI^S, S_i^S) \ \Rightarrow\ S_i^S <_m \max_p(CO^S, S_i^S)$

2. Define legal value of loads

$value(L_i^S\langle a\rangle) = value\big(\max_{<_p}\{\, S_i^S\langle a\rangle \mid \max_p(CO^S, L_i^S\langle a\rangle) <_p S_i^S\langle a\rangle <_p L_i^S\langle a\rangle \,\}\big)$
or, if no such $S_i^S\langle a\rangle$ exists,
$value\big(\max_{<_m}\{\, S^S\langle a\rangle \mid \max_p(CO^S, S^S\langle a\rangle) <_m S^S\langle a\rangle <_m L_i^S\langle a\rangle \,\}\big)$
17
Other Segment Types
 Coherent
 Like weak, but:
• Loads implicitly paired with (atomic) CO
• Stores implicitly paired with (atomic) CI
 SC w.r.t. each other
 Private
 Like weak
18
Analysis
 CO/CI not atomic
 Subtleties:

(a) Lazy checkout            Initially, A = 0
  Thread 0: 00: CHECKOUT; 01: R0 = A
  Thread 1: 10: A = 1;     11: CHECKIN
  Strong: R0 = 0 or 1      Weak: R0 = 0 or 1

(b) Isolation                Initially, A = 0
  Thread 0: 02: CHECKOUT; 03: R0 = A; 04: R1 = A
  Thread 1: 12: A = 1;     13: CHECKIN
  Strong: R0 = 0, R1 = 0   Weak: R0 = 0, R1 = 0 or 1

(c) Leaky stores             Initially, A = 0
  Thread 0: 05: CHECKOUT; 06: R0 = A
  Thread 1: 14: A = 1
  Strong: R0 = 0           Weak: R0 = 0 or 1
19
ASM = SC for DRF
 ASM = SC for lossless and properly paired (LL+PP) programs
 Lossless:
 No clobbering checkouts
 i.e., $\forall S_i^S\langle a\rangle$: if $\exists\, CO_i^S$ with $S_i^S\langle a\rangle <_p CO_i^S$, then $\exists\, CI_i^S$ with $S_i^S\langle a\rangle <_p CI_i^S <_p CO_i^S$
 Properly Paired:
 All conflicting store -> load pairs separated by CI/CO
 i.e., $\forall L_i^S\langle a\rangle, S_j^S\langle a\rangle$ with $value(L_i^S\langle a\rangle) = value(S_j^S\langle a\rangle)$, $i \neq j$: $\exists\, CO_i^S, CI_j^S$ with $S_j^S\langle a\rangle <_p CI_j^S <_m CO_i^S <_p L_i^S\langle a\rangle$
 Proof sketch:
 LL+PP executions are defined by CO/CI order and program order only
 CO/CI order and program order are the same in ASM and SC
20
CO/CI Semantics
 CO/CI like fence
 Lazy checkouts
 Non-atomic, non-blocking checkins
• Updates can interleave
Example: lazy checkout          Initially, A = 0
  Thread 0: 00: CHECKOUT; 01: R0 = A
  Thread 1: 10: A = 1;     11: CHECKIN
  Finally: R0 = 0 or 1

Example: interleaved checkins   Initially, A = 0, B = 0
  Thread 0: 00: A = 1;  01: B = 2;  02: CHECKIN
  Thread 1: 10: A = 10; 11: B = 20; 12: CHECKIN
  Finally, any combination of: A = 1 or 10, B = 2 or 20
21
Consistency Highlights
 Coherent accesses have implicit CO/CI
 CO/CI are totally ordered
 Transitivity hides non-atomicity
 Sequentially consistent for data-race-free
 Lossless & Properly Paired
[Figure: lock hand-off between Thread 0 and Thread 1. Thread 0 stores the critical data (ST critical), checks in the critical segment, releases the lock (ST/STsync lock), and checks in the lock segment. Thread 1 checks out the lock segment, acquires the lock (LL/SC or LD/ST on the lock), checks in the lock segment, checks out the critical segment, and loads the critical data (LD critical). Transitivity through the totally ordered CO/CI hides the non-atomicity.]
(A C-level sketch of this hand-off follows.)
22
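A C-level sketch of the hand-off under stated assumptions: the lock lives in a coherent-RW segment (a pthread mutex stands in for it), the critical data lives in an acoherent segment, and checkout/checkin are hypothetical bindings for the model's operations.

#include <pthread.h>

extern void checkout(void *segment);   /* hypothetical CO/CI bindings */
extern void checkin(void *segment);

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* coherent-RW segment: implicit CO/CI */
int critical[64];                                   /* acoherent segment */

void producer(void) {
    pthread_mutex_lock(&lock);
    checkout(critical);
    critical[0] = 42;            /* ST critical */
    checkin(critical);           /* CI critical_segment before the release */
    pthread_mutex_unlock(&lock); /* coherent release store */
}

int consumer(void) {
    pthread_mutex_lock(&lock);   /* coherent acquire in the lock segment */
    checkout(critical);          /* CO critical_segment after the acquire */
    int v = critical[0];         /* LD critical */
    checkin(critical);
    pthread_mutex_unlock(&lock);
    return v;                    /* if the consumer runs after the producer, the
                                    total CO/CI order makes the store visible */
}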
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
23
ASM-CMP Overview
 Based on MIPS
 + special insns, e.g., checkout, checkin
 Uses segments, no paging
 • Maintains flat address space
 Coherence protocol -> Acoherence Engine
 DMA for caches
 • Selectively move data
 (Skipping the details)
24
Baseline
[Figure: baseline tile with a core, L1I/L1D, L2, and a switch; memory controllers on the chip edges]
25
Segment Types
[Figure: how segment types map onto the ASM-CMP hierarchy. Acoherent: per-core L1s with acoherence engines (AE) that checkout/checkin against an exclusive L2. Coherent-RW: shared through a non-inclusive L2. Private: held in the local L1 only.]
26
Acoherence Engine
 Three main responsibilities:
 Checkout: invalidate all segment data (lazy flash invalidate)
 Checkin: write back all dirty segment data (track the write set with a decoupled metastate cache)
 Order: detect CI-CO pairs (timestamp based)
 FSM like coherence, but few races, no directory
27
Decoupled Metastate Cache
 In all L1 caches
 Decouple metastate from data
 Quick access to aggregate state
 Track valid/dirty (V/D) per segment
 Checkout: XOR global/segment valid state
 Checkin: walk the segment's dirty state
 (A sketch of one possible encoding follows.)
28
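One plausible reading of the "XOR global/segment valid" mechanism, as a minimal C sketch; the sizes, field names, and one-bit epoch encoding are assumptions for illustration, not the ASM-CMP hardware.

#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 512          /* hypothetical L1 geometry */
#define NUM_SEGMENTS 8

/* Metastate kept apart from the data array for quick aggregate access. */
struct metastate {
    uint8_t line_epoch[NUM_LINES];   /* epoch the line was filled in */
    uint8_t seg_of_line[NUM_LINES];  /* which segment each line belongs to */
    bool    dirty[NUM_LINES];
    uint8_t seg_epoch[NUM_SEGMENTS]; /* current epoch per segment */
};

/* A line is valid only if its epoch matches its segment's current epoch. */
static bool line_valid(const struct metastate *m, int line) {
    return m->line_epoch[line] == m->seg_epoch[m->seg_of_line[line]];
}

/* Checkout: lazy flash invalidate; flipping one bit invalidates the whole
 * segment (a real design must scrub stale lines before the 1-bit epoch wraps). */
static void checkout_invalidate(struct metastate *m, int seg) {
    m->seg_epoch[seg] ^= 1;
}

/* Checkin: walk only this segment's dirty, valid lines and write them back. */
static void checkin_writeback(struct metastate *m, int seg, void (*writeback)(int line)) {
    for (int line = 0; line < NUM_LINES; line++) {
        if (m->seg_of_line[line] == seg && m->dirty[line] && line_valid(m, line)) {
            writeback(line);
            m->dirty[line] = false;
        }
    }
}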
Order
 Need to:
1. Determine if a CI precedes a CO
2. Delay load after CO if previous CI hasn’t completed
 Timestamp algorithm (per segment):
 Two-phase CO/CI:
 1. Acquire a timestamp; invalidate (CO) or flush (CI)
 2. Wait for previous CO/CI to complete
 Implemented in firmware (see the sketch below)
29
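A minimal sketch of the two-phase, per-segment timestamp ordering using C11 atomics; the structure and function names are hypothetical, and the real mechanism lives in acoherence-engine firmware.

#include <stdatomic.h>

/* Hypothetical per-segment ordering state. */
struct seg_order {
    atomic_uint next_ts;       /* timestamp handed to each CO/CI on this segment */
    atomic_uint completed_ts;  /* highest timestamp whose CO/CI has completed */
};

static void seg_order_init(struct seg_order *s) {
    atomic_init(&s->next_ts, 1);
    atomic_init(&s->completed_ts, 0);
}

/* Phase 1 of a checkin: acquire a timestamp, then flush dirty data. */
static unsigned checkin_begin(struct seg_order *s) {
    unsigned ts = atomic_fetch_add(&s->next_ts, 1);
    /* ... walk dirty state and write back ... */
    return ts;
}

/* Phase 2 of a checkin: wait for earlier CO/CI, then mark this one complete. */
static void checkin_end(struct seg_order *s, unsigned ts) {
    while (atomic_load(&s->completed_ts) != ts - 1)
        ;  /* an earlier CO/CI is still in flight */
    atomic_store(&s->completed_ts, ts);
}

/* A checkout is symmetric: acquire a timestamp, invalidate, then delay any
 * load that follows the CO until every CI with a smaller timestamp has
 * completed; this is how "does a CI precede this CO?" is resolved. */
static void checkout_ordered(struct seg_order *s) {
    unsigned ts = atomic_fetch_add(&s->next_ts, 1);
    /* ... lazy flash invalidate the segment ... */
    while (atomic_load(&s->completed_ts) != ts - 1)
        ;  /* delay subsequent loads until earlier CI/CO complete */
    atomic_store(&s->completed_ts, ts);
}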
Multiple Writer Support
 Keep per-byte dirty bitmask in L1s
 Allows multiple writers with false sharing
 12.5% larger L1 cache
 Bitmask accompanies data to L2 (merge sketch below)
30
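A sketch of what the per-byte bitmask buys at the L2, assuming 64-byte lines; the structure and merge routine are illustrative, not the ASM-CMP RTL.

#include <stdint.h>

#define LINE_BYTES 64   /* assumed cache line size (64 dirty bits = 12.5% overhead) */

/* One L1 line carries data plus a per-byte dirty bitmask. */
struct l1_line {
    uint8_t  data[LINE_BYTES];
    uint64_t dirty_mask;        /* bit i set => byte i was written locally */
};

/* On checkin, the bitmask accompanies the data to the L2, which merges only
 * the dirty bytes; two writers that falsely share a line therefore update
 * disjoint bytes without clobbering each other's stores. */
static void l2_merge(uint8_t l2_line[LINE_BYTES], const struct l1_line *wb) {
    for (int i = 0; i < LINE_BYTES; i++) {
        if (wb->dirty_mask & (1ULL << i))
            l2_line[i] = wb->data[i];
    }
}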
Simple?
[Figure: a directory protocol's REQ/FWD/RESP message exchanges between the L1s, L2, and directory; the forwarded requests are the source of races and complexity]
31
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions and Future Work
32
Methodology
 Simulation-based
 Enhanced-User Mode
 Workloads:
 Class-1: SPLASH
 Class-2: Task-Q
 Three memory modules
 ASM-CMP
 CC from gem5-Ruby
• MESI (Inclusive)
• MOESI (Non-inclusive)
33
Performance
[Chart: runtime normalized to MOESI for moesi, mesi, and asm across the workloads. Performance is comparable overall; the slowdowns come from false sharing / checking out too much and from migratory sharing.]
34
Perfect Checkout
[Chart: runtime normalized to the ASM baseline, comparing asm_base against asm_ideal (perfect checkout).]
35
Energy
[Chart: energy normalized to MOESI, broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb. ASM uses less energy at the same performance.]
36
Checkout Characteristics
[Chart: distribution of checkout size (# blocks invalidated) as a % of checkouts, and the corresponding % of checkout invalidations, for the Class-1 workloads (fft, fmm, lu, mp3d, ocean, radix, water; barnes elided). Checkouts are usually small but can be large (> 25% of the L1); most checkout invalidations affect dead blocks.]
37
Checkin Characteristics
[Chart: distribution of checkin size (# blocks written back) as a % of checkins for the Class-1 workloads (barnes, fft, fmm, lu, mp3d, ocean, radix, water). Checkins are usually small but can be large (> 25% of the L1); checkin latency is hidden.]
38
Outline
 Motivation and Goals
 ASM Model
 ASM-CMP Prototype
 Evaluation and Results
 Conclusions/Other Work
39
Conclusions
 Going forward:
 HW designs must find efficiency
 SW will want to see caches/control placement
 ASM: viable alternative to coherent shared memory
 Semantic cooperation between HW/SW
 ASM-CMP: build components w/o coherence engine
 Make custom integration easier
 Practically:
 Will the next x86 core use ASM? No
 Will a heterogeneous accelerator? Maybe
40
Related Work
 ASM Model
 ASM-CMP
 Alternatives/Detractors
Skip
41
Related Work – ASM Model
 Relaxed consistency models
 Release Consistency (ISCA 1990)
• Acquire/Release ≈ CO/CI
 DRF-0 (ISCA 1990), DRF-1 (PDS 1993)
• SC for DRF
 Weak ordering (ISCA 1986)
 Semantic Segmentation
 Cohesion (ISCA 2010)
 Entry consistency (CMU-TR 1991)
42
Related Work – ASM-CMP
 Rigel: IEEE Micro 2011
 Differentiates coherent/incoherent
 Treadmarks: ISCA 1992
 Twinning and diffing
43
Related Work - Alternatives
 Reduce directory overhead
 Cuckoo directory (HPCA 2011)
 Tagless directory (MICRO 2009, PACT 2011)
 WayPoint (PACT 2010)
 Region coherence (IEEE Micro 2006)
 SW-controlled coherence (…)
 Simplify coherence design
 Denovo (PACT 2011)
 Coherence is here to stay
 CACM 2012
44
Future Work
 ASM Model
 ASM Implementations
 ASM Software
Skip
45
Future Work – ASM Model
1. Use CO/CI for synchronization
 Return timestamp with CO/CI
 Blocking CO
2. Only guarantee transitivity across coherent
accesses
 Would eliminate need for timestamps
3. Hierarchical ASM
 Expose multiple levels of abstracted caches
4. Interaction with coherent shared memory
 Acoherent/coherent components in the same system
46
Future Work: ASM Implementation
 ASM-CMP
1. Optimize empty checkout/checkin
2. Non-speculative support for strong acoherence
• e.g., HW copy-on-write support on eviction
• Use ASM as foundation for TM/Determinism/etc
3. Low overhead byte-diffing
• False sharing is rare/pattern reuse is common
4. More segment control
• Non-contiguous
• Remap-able
 Other
1. Multi-socket support
2. Use ASM to simplify traditional coherence
• Private/shared
47
Future Work – ASM Software
1. Message passing on ASM
 More efficient than coherence (think: migratory)
2. Software speculation
 Use working memory for isolation
3. Programming language integration
 CO/CI first-class operations
 Work already exists:
• Worlds (ECOOP 2011), Revisions (OOPSLA 2010), PGAS
48
Previous Work
 Rerun: ISCA 2008 and CACM 2009
 Race recorder for deterministic replay
 vs. state of the art:
• SAME logging performance, > 10x state reduction
 Calvin: HPCA 2011
 Coherence for deterministic execution
• i.e., zero-log-size deterministic replay
 Selective determinism to match program requirements
 Hobbes: WoDet 2011
 Strong acoherence in SW runtime
49
Phew!
Backup Slides
What I would do differently
 Focus on more specific target system
 Stop building new infrastructure!
 Why did I?
• gem5 wasn’t ready
• Started more radical/not clear it would have helped
 Step back more often
 Easy to get sucked into details that usually don't matter
 Functional specification of consistency -> yuck!
52
Case Study: Cilk
 Work-stealing task queue
 Distributed design
53
ASM Segments Benefit
[Charts: runtime and energy normalized to MOESI, illustrating the benefit of ASM's segments; MOESI configurations moesi_tlb_0, moesi_tlb_32, and moesi_tlb_64 are shown, with energy broken down into e_l1d, e_l1i, e_l2, e_link, e_switch, and e_tlb.]
54
History
1980: CPU Era (everything is general purpose)
2000: Multicore Era ("Moore" of the same)
2010: Dark Era (??)
55
Navigating the Darkness
 Solution #1: Wait for CMOS replacement
 Don’t hold your breath
 Solution #2: Rethink everything
 Deep integration
 HW specialization/heterogeneity
 Efficiency
 Take compatibility off its pedestal
 (Coherence?)
56
Rethinking Coherence: Why Now?
 Dennard Scaling is over; Moore’s Law continues:
 Need efficient components/reduced waste
 Heterogeneity/Specialization
 Different memory access patterns
 Multicore ASICs
 Important workloads don’t use it
 Compatibility not a show stopper
 Mobile -> fast design cycles, controlled SW stacks
 Datacenter -> economy of scale in single location
 Missing opportunities
57
Case Study 2: Software Speculation
Task: convert to ASM; SW can use memory in new ways

begin_speculation() {
  <copy state>
  checkout(...)          /* use private storage */
  <setup>
}

end_speculation() {
  if (success) {
    <free copies>
    checkin(...)         /* checkin: commit updates */
  } else {
    abort_speculation();
  }
}

abort_speculation() {
  <revert to copy>
  checkout(...)          /* multiple checkouts: "forget" updates */
  <cleanup>
}

(A usage sketch follows.)
58
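A minimal usage sketch of the routines above; the extern declarations and the speculative body are placeholders, not part of the defense.

extern void begin_speculation(void);
extern void end_speculation(void);

/* Updates land in checked-out (private) storage; end_speculation either
 * commits them with a checkin or aborts and reverts to the saved copy. */
void speculative_fill(int *array, int n) {
    begin_speculation();
    for (int i = 0; i < n; i++)
        array[i] = i * i;
    end_speculation();
}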
New Software Potential
 Evaluate ability to write speculation software
 Microbenchmark:
 Fill array with speculative data, then commit
 Vary size of array
[Chart: normalized runtime of ASM vs. MESI as the number of blocks in the isolation region grows from 16 to 128K.]
59
Using Weak Acoherence
global array;            // weak acoherent segment

func producer(...)
  checkout(array);
  array[0] = x;          // may become globally visible early (before checkin)!
  array[1] = y;
  checkin(array);
  signal(consumer);
end func

func consumer(...)
  waitfor(producer);
  checkout(array);
  ...
  checkin(array);
end func

Synchronization hides the early visibility: synchronized -> early visibility OK
60
Using Best-Effort Acoherence
begin_tx
  checkout(array)
  array[0] = x
  checkout(array)      <- spontaneous checkout: Exception!
  array[1] = y
  checkin(array)
end_tx

SW handles the resource limitation (all-or-nothing); see the handler sketch below
61
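One way software might respond to the best-effort notification, as a hedged sketch; the handler name, its registration, and the checkout/checkin bindings are all assumptions, not an ASM-CMP API.

#include <setjmp.h>

extern void checkout(void *segment);   /* hypothetical CO/CI bindings */
extern void checkin(void *segment);

static jmp_buf tx_restart;

/* Hypothetical notification entry point: hardware performed a spontaneous
 * checkout mid-transaction, so the working copy is gone (all-or-nothing). */
void on_spontaneous_checkout(void) {
    longjmp(tx_restart, 1);            /* unwind and retry the transaction */
}

void run_tx(int *array, int x, int y) {
    setjmp(tx_restart);                /* restart point after an abort */
    checkout(array);
    array[0] = x;
    array[1] = y;
    checkin(array);                    /* commits only if no spontaneous checkout fired */
}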
Simulator Design
 Two Goals
 Functionally evaluate ASM system
• programming model, kernel management
 Performance comparison to CMP
 Enhanced User Mode simulator
 Emulate non-timing critical components (e.g., disks)
 Simulate the rest (e.g., virtual memory)
62
Qualitative Data
 Is ASM a reasonable model?
 YES
 Almost no changes to application software
• Unsynchronized flags
• Stack sharing
 Functioning Kernel, same tricks
• Heavier use of coherent segments
63
Three Questions
1. How can software select view?
2. Which view to use?
3. How to manage CO/CI?
[Figure: the hardware layout (cores, private caches, LLC, DRAM) alongside the three software views it can present: acoherent, private, and coherent]
64
ASM-CMP Segments
 Uses true memory segments
 e.g., all pointers are long (segment + offset)
 BUT, address space still appears flat!
 Long Pointer Propagation
 Segment pointers propagate through datapath
 Add lp/sp (long-pointer load/store) instructions + register sidecars (sketch below)
 Languages/SW remain segment-oblivious
65
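A rough picture of the long-pointer representation implied by the memcpy example on the next slide; the struct and field names are illustrative, not the actual register-file layout.

#include <stdint.h>

/* Illustrative only: a "long pointer" as ASM-CMP might represent it,
 * a segment pointer plus an offset. */
struct long_ptr {
    uint32_t seg;     /* segment pointer (carried in a register sidecar) */
    uint32_t off;     /* offset within the segment */
};

/* Ordinary pointer arithmetic touches only the offset; the segment part
 * simply propagates through the datapath, so languages and software remain
 * segment-oblivious while the address space still appears flat. */
static struct long_ptr lp_add(struct long_ptr p, uint32_t bytes) {
    p.off += bytes;
    return p;
}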
ASM-CMP Segments
Segment pointers propagate with the datapath; pointers are long (segment + offset)

memcpy(dst, src, len):

      lp   $t0, 0(dst)     ; load long pointer to dst
      lp   $t1, 0(src)     ; load long pointer to src
      mov  $t2, $a2        ; cnt <- len
loop:
      beqz $t2, exit
      lb   $t3, 0($t1)     ; ld src
      sb   $t3, 0($t0)     ; st dst
      addi $t0, $t0, 1     ; inc. dst
      addi $t1, $t1, 1     ; inc. src
      subi $t2, $t2, 1     ; dec. cnt
      b    loop
exit:

[Figure: register file with a segment-pointer sidecar next to each offset; the ALU computes only offset+1, and the segment propagates unchanged from src/dst through the loads, stores, and adds]
66
The Problem
[Figure: the hardware layout (cores, private caches, LLC, DRAM) versus the software view (flat coherent shared memory)]
67
The Problem
[Figure: the mapping from software view to hardware layout is hardware policy, and software can't change it!]
68
All Data Are Created Equal?
Assume: CMP MESI protocol, inclusive LLC

Location := 1;

[Figure: after the store, where does Location live? Private cache, LLC, and DRAM are all marked "?"]
69
Missed Opportunities
Assume: CMP MESI protocol, inclusive LLC

begin_tx
  cpLocation := Location;
  Location := 1;
end_tx

SW makes a redundant copy
[Figure: CMP with private caches, inclusive LLC, and DRAM]
70
All Data Are NOT Created Equal
Assume: CMP MESI protocol, inclusive LLC

func foo()
  var Location;
  Location := 1;        // private data

[Figure: the private variable still occupies the inclusive LLC and DRAM: wasting space]
71
ASM-1 Hardware
[Figure: ASM-1 tile: 32KB L1 with a per-line bitmask, acoherence engine (AE), 256KB L2 with bitmask, 8MB L3]
72
Baseline
[Figure: 16-core baseline (P0-P15); in-order, single-threaded cores, each with a private L1 and L2; 8 L3 banks (L3_0-L3_7); ring interconnect]
73
Storage Overhead
[Chart: storage overhead vs. # cores for ASM-1 and for MESI with 1-, 2-, and 3-level directories. ASM-1 needs no indirection; more directory indirection means longer latency.]
74
Rethinking Coherence: Why Now?
 Dennard Scaling is over; Moore’s Law continues
 Need scalable, energy efficient components
 Accelerators are here
 How should they see memory?
 Shared-little workloads in important markets
75
All Data Are NOT Created Equal
Assume: CMP MESI protocol, inclusive LLC

func CUDAKernel(...)
  ...
  Location := 1;

[Figure: a CMP with an attached GPU; it is not clear accelerators want or need coherence]
76