Theory: Asleep at the Switch to Many-Core
Phillip B. Gibbons, Intel Research Pittsburgh
Workshop on Theory and Many-Core, May 29, 2009
Slides are © Phillip B. Gibbons

Two Decades After the Peak of Theory's Interest in Parallel Computing...
The Age of Many-Core is finally underway
• Fueled by Moore's Law: 2X cores per chip every 18 months
All aboard the parallelism train!
• (Almost) the only way to faster apps

All Aboard the Parallelism Train?
Switch to Many-Core... Many Challenges
• Interest waned long ago
• Yet the problems were NOT solved
Research needed in all aspects of Many-Core:
• Computer Architecture: YES!
• Programming Languages & Compilers: YES!
• Operating & Runtime Systems: YES!
• Theory: Who has answered the call?

Theory: Asleep at the Switch
"Engineer driving derailed Staten Island train may have fallen asleep at the switch." (12/26/08)
Theory needs to wake up & regain a leadership role in parallel computing

Theory's Strengths
• Conceptual Models – abstract models of computation
• New Algorithmic Paradigms – new algorithms, new protocols
• Provable Correctness – safety, liveness, security, privacy, ...
• Provable Performance Guarantees – approximation, probabilistic, new metrics
• Inherent Power/Limitations – of primitives, features, ...
...among others

Five Areas in Which Theory Can (Should) Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power

Impact Area: Parallel Thinking
[Image: Montparnasse train derailment, 1895]
Key: a good model of parallel computation
• Expresses parallelism
• Good parallel programmer's model
• Good for teaching, for teaching "how to think"
• Can be engineered to good performance

Impact Area: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers

Impact Area: Asymmetry/Heterogeneity
• Fat/thin cores
• SIMD extensions
• Multiple coherence domains
• Mixed-mode parallelism
• Virtual machines
• ...

Impact Area: Concurrency Primitives
• Parallel prefix (sketched below)
• Hash map [Herlihy08]
• Map reduce [Karloff09]
• Transactional memory
• Memory block transactions [Blelloch08]
• Graphics primitives [Ha08]
Roles for Theory:
• Make the case that Many-Core should (not) support a primitive
• Improve the algorithms
• Recommend new primitives (prescriptive)
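To make the first bullet concrete, here is a minimal sketch (my own illustration, not from the talk) of the classic work-efficient exclusive prefix sum (Blelloch scan). The array length is assumed to be a power of two; within each round the updated elements are disjoint, which is exactly what makes the primitive parallel-friendly, and the OpenMP pragmas are optional hints that are ignored if OpenMP is not enabled.

```cpp
#include <cstdio>
#include <vector>

// Work-efficient exclusive prefix sum (Blelloch scan); illustrative sketch only.
// Assumes a.size() is a power of two.
void exclusive_scan(std::vector<long long>& a) {
    const size_t n = a.size();
    // Up-sweep (reduce) phase: build partial sums in a binary-tree pattern.
    for (size_t d = 1; d < n; d *= 2) {
        #pragma omp parallel for
        for (size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    }
    a[n - 1] = 0;  // identity element for +
    // Down-sweep phase: push prefixes back down the tree.
    for (size_t d = n / 2; d >= 1; d /= 2) {
        #pragma omp parallel for
        for (size_t i = 2 * d - 1; i < n; i += 2 * d) {
            long long left = a[i - d];
            a[i - d] = a[i];
            a[i] += left;
        }
    }
}

int main() {
    std::vector<long long> a = {3, 1, 7, 0, 4, 1, 6, 3};
    exclusive_scan(a);
    for (long long x : a) std::printf("%lld ", x);  // prints: 0 3 4 11 11 15 16 22
    std::printf("\n");
}
```

Run with the rounds parallelized, this does O(n) total work in O(log n) rounds, which is the usual argument for exposing scan as a first-class primitive.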
Impact Area: Power
Many-cores provide features for reducing power:
• Voltage scaling [Albers07]
• Dynamically run on fewer cores, fewer banks
A fertile area for Theory to help

Deep Dive: Memory Hierarchy
• Deep cache/storage hierarchy
• Need conceptual model
• Need smart thread schedulers

Good Performance Requires Effective Use of the Memory Hierarchy
[Figure: single-core hierarchy: CPU, L1, L2 Cache, Main Memory, Magnetic Disks]
Performance:
• Running/response time
• Throughput
• Power
Two new trends, Pervasive Multicore & Pervasive Flash, bring new challenges and opportunities

New Trend 1: Pervasive Multicore
[Figure: multicore hierarchy: one CPU and L1 per core, Shared L2 Cache, L2 Cache, Main Memory, Magnetic Disks]
Opportunity:
• Rethink apps & systems to take advantage of more CPUs on chip
Challenges:
• Cores compete for the hierarchy
• Hard to reason about parallel performance
• Hundred cores coming soon
• Cache hierarchy design in flux
• Hierarchies differ across platforms
Makes effective use of the hierarchy much harder

New Trend 2: Pervasive Flash
[Figure: multicore hierarchy with Flash Devices added alongside Main Memory and Magnetic Disks]
Opportunity:
• Rethink apps & systems to take advantage
Challenges:
• Performance quirks of Flash
• Technology in flux, e.g., the Flash Translation Layer (FTL)
A new type of storage in the hierarchy

How the Hierarchy is Treated Today
Algorithm designers & application/system developers often tend toward one of two extremes:
• Ignorant: API view of memory + I/O; parallelism often ignored [performance iffy]
• (Pain)fully aware: hand-tuned to the platform [effort high, not portable, limited sharing scenarios]
Or they focus on one or a few aspects, but without a comprehensive view of the whole

Hierarchy-Savvy Parallel Algorithm Design (Hi-Spade) project
...seeks to enable a hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies:
• Hide what can be hidden
• Expose what must be exposed for good performance
• Robust: many platforms, many resource-sharing scenarios
"Hierarchy-savvy": the sweet spot between ignorant and (pain)fully aware

Hi-Spade Research Scope
http://www.pittsburgh.intel-research.net/projects/hi-spade/
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
The research agenda includes:
• Theory: conceptual models, algorithms, analytical guarantees
• Systems: runtime support, performance tools, architectural features
• Applications: databases, operating systems, application kernels

Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Cache Hierarchies: Sequential
External Memory (EM) Algorithms [see Vitter's ACM Computing Surveys article]
[Figure: External Memory Model: Main Memory (size M), blocks of size B transferred to/from External Memory]
+ Simple model
+ Minimize I/Os
– Only 2 levels
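As a toy illustration of how costs are counted in the two-level EM model (my own sketch; the ExternalArray wrapper and its fetch_block method are invented for this example, not part of any real library): a simple scan touches each length-B block once, so it costs about N/B block transfers regardless of the internal-memory size M, as long as M >= B.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Toy model of an array living in external memory, accessed one block at a
// time. ExternalArray and fetch_block are names invented for this sketch.
struct ExternalArray {
    std::vector<int> data;   // stands in for data on disk
    size_t B = 1;            // block size, in elements
    size_t io_count = 0;     // number of block transfers performed

    // "Fetch" the block starting at index start into internal memory.
    std::vector<int> fetch_block(size_t start) {
        ++io_count;
        size_t end = std::min(start + B, data.size());
        return std::vector<int>(data.begin() + start, data.begin() + end);
    }
};

// Scan (sum) the array block by block: ceil(N/B) I/Os in the EM model.
long long scan_sum(ExternalArray& arr) {
    long long sum = 0;
    for (size_t start = 0; start < arr.data.size(); start += arr.B) {
        for (int x : arr.fetch_block(start)) sum += x;
    }
    return sum;
}

int main() {
    ExternalArray arr;
    arr.data.assign(1000, 1);   // N = 1000 elements
    arr.B = 64;                 // block size B = 64
    long long s = scan_sum(arr);
    // Expect sum = 1000 and ceil(1000/64) = 16 block transfers.
    std::printf("sum = %lld, I/Os = %zu\n", s, arr.io_count);
}
```

Sorting is the more interesting case: the classic EM bound is Theta((N/B) log_{M/B}(N/B)) I/Os, which is where the memory size M finally enters the analysis.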
Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Cache Hierarchies: Sequential
Alternative: Cache-Oblivious Algorithms [Frigo99]
[Figure: Cache-Oblivious Model, a twist on the EM model: Main Memory (size M), block size B, External Memory, with M & B unknown to the algorithm]
+ Simple model
Key Goal: good performance for any M & B, which guarantees good cache performance at all levels of the hierarchy
– Single CPU only
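Here is a minimal sketch of what a cache-oblivious divide-and-conquer algorithm looks like, using out-of-place matrix transpose in the style of [Frigo99] (my own code, assuming row-major storage; the cutoff constant is arbitrary). Neither M nor B appears anywhere, yet once a subproblem's working set fits in a given cache level it stays there for the rest of that recursion.

```cpp
#include <cstdio>
#include <vector>

// Cache-oblivious out-of-place transpose, divide-and-conquer style:
// At[j][i] = A[i][j] for an m x n row-major matrix A (illustrative sketch).
// The recursion always splits the larger dimension, so subproblems shrink in
// both dimensions and eventually fit in every cache level, even though the
// cache parameters M and B never appear in the code.
void transpose(const double* A, double* At, size_t m, size_t n,
               size_t lda, size_t ldt) {
    const size_t CUTOFF = 16;  // small base case; exact value is not critical
    if (m <= CUTOFF && n <= CUTOFF) {
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                At[j * ldt + i] = A[i * lda + j];
    } else if (m >= n) {       // split the rows of A
        transpose(A, At, m / 2, n, lda, ldt);
        transpose(A + (m / 2) * lda, At + m / 2, m - m / 2, n, lda, ldt);
    } else {                   // split the columns of A
        transpose(A, At, m, n / 2, lda, ldt);
        transpose(A + n / 2, At + (n / 2) * ldt, m, n - n / 2, lda, ldt);
    }
}

int main() {
    const size_t m = 300, n = 200;
    std::vector<double> A(m * n), At(n * m);
    for (size_t i = 0; i < m * n; ++i) A[i] = static_cast<double>(i);
    transpose(A.data(), At.data(), m, n, /*lda=*/n, /*ldt=*/m);
    // Spot check: At[7][5] should equal A[5][7].
    std::printf("A[5][7] = %g, At[7][5] = %g\n", A[5 * n + 7], At[7 * m + 5]);
}
```

For this kind of recursion, [Frigo99] shows an O(mn/B) miss bound for any M and B (under a tall-cache assumption), which is the "good performance for any M & B" the slide refers to; the "single CPU only" caveat is exactly what the following slides take on.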
Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Cache Hierarchies: Parallel
Explicit multi-level hierarchy: the Multi-BSP Model [Valiant08]
Goal: approach the simplicity of the cache-oblivious model (the hierarchy-savvy sweet spot)

Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Challenge:
– The theory of cache-oblivious algorithms falls apart once parallelism is introduced: good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy
Key reason: caches are not fully shared
[Figure: CPU1/CPU2/CPU3 with private L1s over a Shared L2 Cache holding block B]
What's good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write block B at ≈ the same time
– Parallel cache-obliviousness is too strict a goal

Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Key new dimension: scheduling of parallel threads, which has a LARGE impact on cache performance
Recall our problem scenario: all CPUs want to write block B at ≈ the same time
Can mitigate (but not solve) the problem if the writes can be scheduled far apart in time

Existing Parallel Cache Models [slide from Rezaul Chowdhury]
• Parallel Shared-Cache Model: CPUs 1..p share one cache of size C; blocks of size B are transferred to/from main memory
• Parallel Private-Cache Model: each of CPUs 1..p has its own private cache of size C; blocks of size B are transferred to/from main memory

Competing Demands of Private and Shared Caches [slide from Rezaul Chowdhury]
[Figure: CPUs 1..p, each with a private cache, over a shared cache and main memory]
• Shared cache: cores should work on the same set of cache blocks
• Private cache: cores should work on disjoint sets of cache blocks
Experimental results have shown that on CMP architectures:
• work stealing, i.e., the state-of-the-art scheduler for the private-cache model, can suffer from excessive shared-cache misses
• parallel depth-first, i.e., the best scheduler for the shared-cache model, can incur excessive private-cache misses

Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Private vs. Shared Caches
[Figure: CPU1/CPU2/CPU3 with private L1s over a Shared L2 Cache and L2 Cache]
• Parallel all-shared hierarchy: + provably good cache performance for cache-oblivious algorithms
• 3-level multi-core model: insights on private vs. shared
  + Designed a new scheduler with provably good cache performance for a class of divide-and-conquer algorithms [Blelloch08]
  – Results require exposing the working-set size of each recursive subproblem

Hi-Spade: Hierarchy-savvy Parallel Algorithm Design
Parallel Tree of Caches
[Figure: a tree of caches, with processors at the leaves]
Approach [Blelloch09]: design a low-depth cache-oblivious algorithm
Theorem: for each level i, the parallel schedule incurs only O(M_i P D / B_i) more misses than the sequential schedule
Low depth D gives a good miss bound

Five Areas in Which Theory Can (Should) Have an Important Impact
• Parallel Thinking
• Memory Hierarchy
• Asymmetry/Heterogeneity
• Concurrency Primitives
• Power