DeNovo*: Rethinking Hardware for Disciplined Parallelism

advertisement
†
DeNovo : Rethinking Hardware for
Disciplined Parallelism
Byn Choi, Rakesh Komuravelli, Hyojin Sung,
Rob Bocchino, Sarita Adve, Vikram Adve
Other collaborators:
Languages: Adam Welc, Tatiana Shpeisman, Yang Ni (Intel)
Applications: John Hart, Victor Lu
†De
Novo = From the beginning, anew
Motivation
• Goal: Power-, complexity-, performance-scalable hardware
• Today: shared-memory
– Directory-based coherence
• Complex, unscalable
– Address, communication, coherence granularity is cache line
• Software-oblivious, inefficient power, bandwidth, latency, area
– Difficult programming model
• Data races, non-determinism, no safety/composability/modularity, …
– Can’t specify “what value can read return” a.k.a. memory model
• Data races defy acceptable semantics; mismatched hardware/software
• Fundamentally broken for hardware & software
Motivation
• Goal: Power-, complexity-, performance-scalable hardware
• Today: shared-memory
– Directory-based coherence
• Complex, unscalable
Banish shared memory?
– Address, communication, coherence granularity is cache line
• Inefficient in power, bandwidth, latency, area, especially for O-O codes
– Difficult programming model
• Data races, non-determinism, no safety/composability/modularity, …
– Can’t specify “what value a read can return” a.k.a. memory model
• Data races defy acceptable semantics; mismatched hardware/software
• Fundamentally broken for hardware & software
Motivation
• Goal: Power-, complexity-, performance-scalable hardware
• Today: shared-memory
– Directory-based coherence
• Complex, unscalable
Banish wild shared memory!
– Address, communication, coherence granularity is cache line
• Inefficient in power, bandwidth, latency, area, especially for O-O codes
– Difficult programming model
• Data races, non-determinism, no safety/composability/modularity, …
– Can’t specify “what value a read can return” a.k.a. memory model
• Data races defy acceptable semantics; mismatched hardware/software
• Fundamentally broken for hardware & software
Motivation
• Goal: Power-, complexity-, performance-scalable hardware
• Today: shared-memory
– Directory-based coherence
• Complex, unscalable
Banish wild shared memory!
– Address, communication, coherence granularity is cache line
• Inefficient in power, bandwidth, latency, area, especially for O-O codes
– Difficult programming model
Need
memory!
• Datadisciplined
races, non-determinism, noshared
safety/composability/modularity,
…
– Can’t specify “what value a read can return” a.k.a. memory model
• Data races defy acceptable semantics; mismatched hardware/software
• Fundamentally broken for hardware & software
What is Shared-Memory?
Shared-Memory =
Global address space
+
Implicit, anywhere communication, synchronization
What is Shared-Memory?
Shared-Memory =
Global address space
+
Implicit, anywhere communication, synchronization
What is Shared-Memory?
Wild Shared-Memory =
Global address space
+
Implicit, anywhere communication, synchronization
What is Shared-Memory?
Disciplined Shared-Memory =
Global address space
+
Implicit, anywhere communication, synchronization
Explicit, structured side-effects
Disciplined Shared-Memory
• Top-down view: prog model, language, … (earlier today)
– Use explicit effects for semantic guarantees
• Data-race-freedom, determinism-by-default, controlled non-determinism
– Reward: simple semantics, safety, composability, …
• Bottom-up view: hardware, runtime, …
– Use explicit effects + above guarantees for
• Simple coherence and consistency
• Software-aware address/comm/coherence granularity, data layout
– Reward: power-, complexity-, performance-scalable hardware
• Top-down and bottom-up views are synergistic!
DeNovo Research Strategy
1. Start with deterministic ( data-race-free) codes
― Common & best case, basis for extension to other codes
2. Disciplined non-deterministic codes
3. Wild non-deterministic, legacy codes
• Work with languages for best h/w-s/w interface
– Current driver is DPJ
– End-goal is language-oblivious interface
• Work with realistic applications
– Current work with kd-trees
Progress Summary
1. Deterministic codes
― Language level model with DPJ, translation to h/w interface
― Design for simple coherence, s/w-aware comm. and layout
― Baseline coherence implemented, rest underway
2. Disciplined non-deterministic codes
― Language level model with DPJ (and Intel)
3. Wild non-deterministic, legacy codes
– Just begun, will be influenced by above
Collaborations with languages & applications groups
– DPJ; first scalable SAH quality kd-tree construction
Progress Summary
1. Deterministic codes
― Language level model with DPJ, translation to h/w interface
― Design for simple coherence, s/w-aware comm. and layout
― Baseline coherence implemented, rest underway
2. Disciplined non-deterministic codes
― Language level model with DPJ (and Intel)
3. Wild non-deterministic, legacy codes
– Just begun, will be influenced by above
Collaborations with languages & applications groups
– DPJ; first scalable SAH quality kd-tree construction
Coherence and Consistency
Key Insights
• Guaranteed determinism
• Explicit effects
Coherence and Consistency
Key Insights
• Guaranteed determinism 
– Read should return value of last write in sequential order
• From same task in this parallel phase, if it exists
• Or from previous parallel phase
• No concurrent conflicting writes in this phase
• Explicit effects
Coherence and Consistency
Key Insights
• Guaranteed determinism 
– Read should return value of last write in sequential order
• From same task in this parallel phase, if it exists
• Or from previous parallel phase
• No concurrent conflicting writes in this phase
• Explicit effects 
– Compiler knows all regions written in this parallel phase
– Cache can self-invalidate before next parallel phase
• Invalidates data in writeable regions not written by itself
Today's Coherence Protocols
• Snooping: broadcast, ordered networks
• Directory: avoid broadcast through indirection
– Complexity: Races in protocol
– Overhead: Sharer list
– Performance: All cache misses go through directory
Today's Coherence Protocols
• Snooping: broadcast, ordered networks
• Directory: avoid broadcast through indirection
– Complexity: Races in protocol
• Race-free software (almost) race-free coherence protocol
• No transient states, much simpler protocol
– Overhead: Sharer list
– Performance: All cache misses go through directory
Today's Coherence Protocols
• Snooping: broadcast, ordered networks
• Directory: avoid broadcast through indirection
– Complexity: Races in protocol
• Race-free software (almost) race-free coherence protocol
• No transient states, much simpler protocol
– Overhead: Sharer list
• Explicit effects enable self-invalidations
• No need for sharer lists
– Performance: All cache misses go through directory
Today's Coherence Protocols
• Snooping: broadcast, ordered networks
• Directory: avoid broadcast through indirection
– Complexity: Races in protocol
• Race-free software (almost) race-free coherence protocol
• No transient states, much simpler protocol
– Overhead: Sharer list
• Explicit effects enable self-invalidations
• No need for sharer lists
– Performance: All cache misses go through directory
• Directory only tracks one up-to-date copy, not sharers or serialization
• Data copies can move from cache to cache without telling directory
Baseline DeNovo Coherence
• Assume (for now): Private L1, shared L2; single word line
• Directory tracks one current copy of line, not sharers
• L2 data arrays double as directory
– Keep valid data or registered core id, no space overhead
• S/W inserts self-invalidates for regions w/ write effects
• L1 states = invalid, valid, registered
• No transient states: Protocol ≈ 3-state textbook pictures
– Formal specification and verification with Intel
DeNovo Region Granularity
• DPJ regions too fine-grained
– Ensure accesses to individual objects/fields don’t interfere
– Too many regions for hardware
• DeNovo only needs aggregate data written in phase
– E.g., can summarize a field of entire data structure as one region
Can we aggregate to few enough regions,
without excessive invalidations?
Evaluation Methodology
• Modified Wisconsin GEMS + (Intel!) Simics simulator
• 4 apps: LU, FFT, Barnes, kd-tree
– Converted DPJ regions into DeNovo regions by hand
App
# Regions
Barnes
8
FFT
LU
3
Kd-tree
2
4
• Compared DeNovo vs. MESI for single word lines
– Goal: Does simplicity impact performance?
E.g., are self-invalidations too conservative?
(Efficiency enhancements are next step)
Results for Baseline Coherence
Execution time
Normalized to MESI (%)
100
80
60
MESI
DeNovo
40
20
0
LU
FFT
Barnes-Hut
kd-tree
(nested)
• DeNovo comparable to MESI
• Simple protocol is foundation for efficiency enhancements
Improving Performance & Power
Insight: Can always copy valid data to another cache
– w/o demand access, w/o going through directory
– If later demand read sees it, it must be correct
– No false sharing effects (no loss of “ownership”)
 Simple line based protocol (with word based valid bits)
 Can get line from anywhere, not just from directory L2
― Point-to-point transfer, sender-initiated transfer
 Can transfer more than line at a time
― Point-to-point bulk transfer
Transfer granularity can be region-driven
― AoS vs. SoA optimization is natural
Towards Ideal Efficiency
Current systems:
― Address, transfer, coherence granularity = fixed cache line
Denovo so far:
― Transfer is flexible, but address is still line, coherence is still word
Next step: Region-centered caches
– Use regions for memory layout
• Region based “pool allocation,” fields of same region at fixed strides
– Cache banks devoted to regions
– Regions accessed together give address, transfer granularity
– Regions w/ same sharing behavior give coherence granularity
Applicable to main memory and pin bandwidth
Interactions with runtime scheduler
Summary
• Current shared-memory models fundamentally broken
– Semantics, programmability, hardware
• Disciplined programming models solve these problems
• DeNovo = hardware for disciplined programming
– Software-driven memory hierarchy
– Coherence, consistency, communication, data layout, …
– Simpler, faster, cooler, cheaper, …
• Sponsor interactions:
– Nick Carter, Mani Azimi/Akhilesh Kumar/Ching-Tsun Chou
– Non-deterministic model: Adam Welc/Shpeisman/Ni
– SCC prototype exploration: Jim Held
Next steps
• Phase 1:
–
–
–
–
–
Full implementation, verification, results for deterministic codes
Design and results for disciplined non-determinism
Design for wild non-deterministic, legacy codes
Continue work with language and application groups
Explore prototyping on SCC
• Phase 2:
– Design and simulation results for complete DeNovo system
running large applications
– Language-oblivious hardware-software interface
– Prototype and tech transfer
Download