† DeNovo: Rethinking Hardware for Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung, Rob Bocchino, Sarita Adve, Vikram Adve
Other collaborators:
  Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel)
  Applications: John Hart, Victor Lu
† De Novo = from the beginning, anew

Motivation
• Goal: Power-, complexity-, performance-scalable hardware
• Today: shared memory
  – Directory-based coherence
    • Complex, unscalable
  – Address, communication, and coherence granularity is the cache line
    • Software-oblivious: inefficient in power, bandwidth, latency, and area, especially for object-oriented codes
  – Difficult programming model
    • Data races, non-determinism, no safety/composability/modularity, …
  – Can't specify "what value a read can return," a.k.a. the memory model
    • Data races defy acceptable semantics; hardware and software are mismatched
• Fundamentally broken for hardware and software
• Banish shared memory? Banish wild shared memory! Need disciplined shared memory!

What is Shared-Memory?
• Shared-memory = global address space + implicit, anywhere communication and synchronization
  – That is, today's shared-memory is wild shared-memory
• Disciplined shared-memory = global address space + explicit, structured side-effects

Disciplined Shared-Memory
• Top-down view: programming model, language, … (earlier today)
  – Use explicit effects for semantic guarantees
    • Data-race-freedom, determinism-by-default, controlled non-determinism
  – Reward: simple semantics, safety, composability, …
• Bottom-up view: hardware, runtime, …
  – Use explicit effects + the above guarantees for
    • Simple coherence and consistency
    • Software-aware address/communication/coherence granularity and data layout
  – Reward: power-, complexity-, performance-scalable hardware
• The top-down and bottom-up views are synergistic!

DeNovo Research Strategy
1. Start with deterministic (data-race-free) codes
   ― The common and best case; the basis for extension to other codes
2. Disciplined non-deterministic codes
3. Wild non-deterministic, legacy codes
• Work with languages for the best hardware-software interface
  – The current driver is DPJ
  – The end goal is a language-oblivious interface
• Work with realistic applications
  – Current work is with kd-trees

Progress Summary
1. Deterministic codes
   ― Language-level model with DPJ, translation to the hardware interface
   ― Design for simple coherence, software-aware communication and layout
   ― Baseline coherence implemented; the rest is underway
2. Disciplined non-deterministic codes
   ― Language-level model with DPJ (and Intel)
3. Wild non-deterministic, legacy codes
   – Just begun; will be influenced by the above
• Collaborations with the languages and applications groups
  – DPJ; first scalable SAH-quality kd-tree construction

Coherence and Consistency: Key Insights
• Guaranteed determinism
  – A read should return the value of the last write in sequential order
    • From the same task in this parallel phase, if it exists
    • Or from the previous parallel phase
  – No concurrent conflicting writes in this phase
• Explicit effects
  – The compiler knows all regions written in this parallel phase
  – The cache can self-invalidate before the next parallel phase
    • It invalidates data in writeable regions that it did not write itself (see the sketch below)
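To make the self-invalidation step concrete, here is a minimal C++ sketch, not the DeNovo implementation: it assumes word-granularity state, a software-assigned region tag per cached word, and a runtime call at the end of each parallel phase that passes the compiler-summarized set of regions with write effects; the names and cache organization are illustrative.

    #include <array>
    #include <cstdint>
    #include <set>

    // Per-word coherence states used by the baseline DeNovo protocol.
    enum class State { Invalid, Valid, Registered };

    struct Word {
        State         state  = State::Invalid;
        std::uint32_t region = 0;   // software-assigned region ID (illustrative)
        std::uint64_t data   = 0;
    };

    class L1Cache {
    public:
        // Called at the end of a parallel phase. 'writeEffects' is the set of
        // regions any task may have written in this phase, summarized by the
        // compiler from the declared effects.
        void selfInvalidate(const std::set<std::uint32_t>& writeEffects) {
            for (Word& w : words_) {
                // Words this core wrote are Registered and stay up to date;
                // words in regions that were read-only this phase stay Valid.
                if (w.state == State::Valid && writeEffects.count(w.region))
                    w.state = State::Invalid;   // possibly stale: another core may have written this region
            }
        }
    private:
        std::array<Word, 4096> words_;   // word-granularity state (illustrative size)
    };

Because the software guarantees data-race-freedom within a phase, dropping these words locally is enough: no invalidation or acknowledgement messages ever cross the network.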
Today's Coherence Protocols
• Snooping: broadcast, ordered networks
• Directory: avoids broadcast through indirection
  – Complexity: races in the protocol
    • Race-free software → (almost) race-free coherence protocol
    • No transient states; a much simpler protocol
  – Overhead: sharer lists
    • Explicit effects enable self-invalidations
    • No need for sharer lists
  – Performance: all cache misses go through the directory
    • The directory only tracks the one up-to-date copy, not sharers or serialization
    • Data copies can move from cache to cache without telling the directory

Baseline DeNovo Coherence
• Assume (for now): private L1s, shared L2; single-word lines
• The directory tracks the one current copy of a line, not sharers
• L2 data arrays double as the directory
  – They keep either valid data or a registered core ID: no space overhead
• Software inserts self-invalidates for regions with write effects
• L1 states = Invalid, Valid, Registered
• No transient states: the protocol ≈ the 3-state textbook pictures (a sketch follows below)
  – Formal specification and verification with Intel
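The following is a rough sketch of how the three stable states could drive loads and stores, assuming private L1s and a shared L2 whose data array holds either valid data or the registered core's ID, with single-word lines; the toy map-based structures and the registration flow are assumptions for illustration, not the formally specified protocol.

    #include <cstdint>
    #include <unordered_map>

    enum class State { Invalid, Valid, Registered };

    struct L1Word  { State state = State::Invalid; std::uint64_t data = 0; };
    // Toy shared L2 / directory entry: holds either the valid data for a word
    // or the ID of the core registered for it (the one up-to-date copy).
    struct L2Entry { bool registered = false; int owner = -1; std::uint64_t data = 0; };

    std::unordered_map<std::uint64_t, L2Entry> l2;                          // shared L2
    std::unordered_map<int, std::unordered_map<std::uint64_t, L1Word>> l1;  // per-core L1s

    std::uint64_t load(int core, std::uint64_t addr) {
        L1Word& w = l1[core][addr];
        if (w.state == State::Invalid) {            // miss
            L2Entry& e = l2[addr];
            // If another core is registered, the copy comes from its cache;
            // no sharer list is recorded anywhere.
            w.data  = e.registered ? l1[e.owner][addr].data : e.data;
            w.state = State::Valid;
        }
        return w.data;                              // Valid or Registered: hit
    }

    void store(int core, std::uint64_t addr, std::uint64_t value) {
        L1Word& w = l1[core][addr];
        if (w.state != State::Registered) {         // obtain registration
            L2Entry& e = l2[addr];
            e.registered = true;                    // the L2 data-array slot now just
            e.owner      = core;                    // records the owning core's ID
            w.state = State::Registered;
        }
        w.data = value;   // no invalidations sent: race-free software guarantees
    }                     // no concurrent conflicting accesses within a phase

Note that only the three stable states appear; it is the self-invalidation pass sketched earlier, not protocol messages, that removes stale copies between phases.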
DeNovo Region Granularity
• DPJ regions are too fine-grained
  – They ensure that accesses to individual objects/fields don't interfere
  – Too many regions for hardware
• DeNovo only needs the aggregate data written in a phase
  – E.g., a field of an entire data structure can be summarized as one region
• Can we aggregate to few enough regions without excessive invalidations?

Evaluation Methodology
• Modified Wisconsin GEMS + (Intel!) Simics simulator
• 4 apps: LU, FFT, Barnes, kd-tree
  – Converted DPJ regions into DeNovo regions by hand

    App        # Regions
    Barnes     8
    FFT        4
    LU         3
    Kd-tree    2

• Compared DeNovo vs. MESI for single-word lines
  – Goal: does simplicity hurt performance? E.g., are self-invalidations too conservative? (Efficiency enhancements are the next step.)

Results for Baseline Coherence
[Chart: execution time normalized to MESI (%) for MESI vs. DeNovo on LU, FFT, Barnes-Hut, and kd-tree (nested)]
• DeNovo is comparable to MESI
• The simple protocol is a foundation for efficiency enhancements

Improving Performance and Power
• Insight: valid data can always be copied to another cache
  – Without a demand access and without going through the directory
  – If a later demand read sees it, it must be correct
  – No false-sharing effects (no loss of "ownership")
• Simple line-based protocol (with word-based valid bits)
• Can get a line from anywhere, not just from the L2 directory
  ― Point-to-point, sender-initiated transfer
• Can transfer more than a line at a time
  ― Point-to-point bulk transfer
• Transfer granularity can be region-driven
  ― The AoS vs. SoA optimization is natural

Towards Ideal Efficiency
• Current systems:
  ― Address, transfer, and coherence granularity = a fixed cache line
• DeNovo so far:
  ― Transfer is flexible, but addressing is still line-based and coherence is still word-based
• Next step: region-centered caches
  – Use regions for memory layout
    • Region-based "pool allocation": fields of the same region at fixed strides (see the layout sketch below)
  – Cache banks devoted to regions
  – Regions accessed together give the address and transfer granularity
  – Regions with the same sharing behavior give the coherence granularity
• Applicable to main memory and pin bandwidth
• Interactions with the runtime scheduler
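As a simple illustration of why region-driven layout makes the AoS vs. SoA choice natural, consider a Barnes-Hut-like body structure; the field and region names below are assumptions for illustration, not DeNovo or DPJ code. Once each field lives in its own pool at a fixed stride, a region corresponds to a dense address range, so transfer and coherence granularity can follow regions.

    #include <cstddef>
    #include <vector>

    // Array-of-Structures: fields from different regions are interleaved, so a
    // phase that touches only positions still drags velocities and masses
    // across the cache hierarchy and the pins.
    struct BodyAoS { double pos[3]; double vel[3]; double mass; };
    std::vector<BodyAoS> bodiesAoS;

    // Structure-of-Arrays, i.e., region-based "pool allocation": each field has
    // its own contiguous pool, so each region maps to a dense address range.
    struct BodiesSoA {
        std::vector<double> posX, posY, posZ;   // region "position" (illustrative)
        std::vector<double> velX, velY, velZ;   // region "velocity" (illustrative)
        std::vector<double> mass;               // region "mass"     (illustrative)
    };

    // A phase whose declared effect is "writes position, reads velocity" only
    // needs the position and velocity pools; the mass pool is never fetched
    // or self-invalidated.
    void integrate(BodiesSoA& b, double dt) {
        for (std::size_t i = 0; i < b.posX.size(); ++i) {
            b.posX[i] += b.velX[i] * dt;
            b.posY[i] += b.velY[i] * dt;
            b.posZ[i] += b.velZ[i] * dt;
        }
    }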
Summary
• Current shared-memory models are fundamentally broken
  – Semantics, programmability, hardware
• Disciplined programming models solve these problems
• DeNovo = hardware for disciplined programming
  – Software-driven memory hierarchy
  – Coherence, consistency, communication, data layout, …
  – Simpler, faster, cooler, cheaper, …
• Sponsor interactions:
  – Nick Carter; Mani Azimi / Akhilesh Kumar / Ching-Tsun Chou
  – Non-deterministic model: Adam Welc / Tatiana Shpeisman / Yang Ni
  – SCC prototype exploration: Jim Held

Next steps
• Phase 1:
  – Full implementation, verification, and results for deterministic codes
  – Design and results for disciplined non-determinism
  – Design for wild non-deterministic, legacy codes
  – Continued work with the language and application groups
  – Explore prototyping on SCC
• Phase 2:
  – Design and simulation results for a complete DeNovo system running large applications
  – Language-oblivious hardware-software interface
  – Prototype and technology transfer