Trumping the Multicore Memory Hierarchy with Hi-Spade
Phillip B. Gibbons, Intel Labs Pittsburgh
April 30, 2010
Keynote talk at the 10th SIAM International Conference on Data Mining (SDM’10)

Hi-Spade: Outline / Take-Aways
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy algorithms & systems
• Smart thread schedulers enable simple, hierarchy-savvy abstractions
• Flash-savvy (database) systems maximize the benefits of Flash devices
• Ongoing work w/ many open problems

For Good Performance, Must Use the Hierarchy Effectively
Performance: • Running/response time • Throughput • Power
Hierarchy (figure): CPU, L1 Cache, L2 Cache, Main Memory (memory), Magnetic Disks (storage)
Data-intensive applications stress the hierarchy

Clear Trend: Hierarchy Getting Richer
• More levels of cache
• Pervasive Multicore
• New memory / storage technologies – e.g., pervasive use of Flash
These emerging hierarchies bring both new challenges & new opportunities

New Trend: Pervasive Multicore
(Figure: multiple CPUs with private L1s, a shared L2 cache, main memory, magnetic disks)
Challenges:
• Cores compete for the hierarchy
• Hard to reason about parallel performance
• Hundred cores coming soon
• Cache hierarchy design in flux
• Hierarchies differ across platforms
Opportunity:
• Rethink apps & systems to take advantage of more CPUs on chip
Much harder to use the hierarchy effectively

New Trend: Pervasive Flash
(Figure: Flash devices join main memory and magnetic disks below the CPUs and shared L2 cache)
Opportunity:
• Rethink apps & systems to take advantage
Challenges:
• Performance quirks of Flash
• Technology in flux, e.g., the Flash Translation Layer (FTL)
New type of storage in the hierarchy

E.g., Xeon 7500 Series MP Platform
(Figure: 4 sockets; 8 cores per socket; 2 HW threads per core; 32KB L1 and 256KB L2 per core; 24MB shared L3 cache per socket; up to 1 TB main memory; attach: magnetic disks & flash devices)

How Hierarchy is Treated Today
Algorithm designers & application/system developers tend towards one of two extremes:
• Ignorant – API view: memory + I/O; parallelism often ignored; performance iffy
• (Pain)-fully aware – hand-tuned to the platform; effort high, not portable, limited sharing scenarios
Or they focus on one or a few aspects, but without a comprehensive view of the whole

From SDM’10 Call for Papers
“Extracting knowledge requires the use of sophisticated, high-performance and principled analysis techniques and algorithms, based on sound theoretical and statistical foundations. These techniques in turn require powerful visualization technologies; implementations that must be carefully tuned for performance; software systems that are usable by scientists, engineers, and physicians as well as researchers; and infrastructures that support them.”
Hierarchy-Savvy parallel algorithm design (Hi-Spade) project
…seeks to enable: A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
“Hierarchy-Savvy”:
• Ignore what can be ignored
• Focus on what must be exposed for good performance
• Robust across many platforms & resource sharing scenarios
• Sweet-spot between ignorant and (pain)fully aware
http://www.pittsburgh.intel-research.net/projects/hi-spade/

Hierarchy-Savvy Sweet Spot
(Figure: performance vs. programming effort; the Hierarchy-Savvy curve reaches good performance with modest effort and is robust, while (pain)-fully aware tuning for Platform 1 or Platform 2 takes high effort and the ignorant approach performs poorly)
Modest effort, good performance, robust

Hi-Spade Research Scope
A hierarchy-savvy approach to algorithm design & systems for emerging parallel hierarchies
Agenda: Create abstractions, tools & techniques that
• Assist programmers & algorithm designers in achieving effective use of emerging hierarchies
• Lead to systems that better leverage the new capabilities these hierarchies provide
Theory / Systems / Applications

Hi-Spade Collaborators
• Intel Labs Pittsburgh: Shimin Chen (co-PI)
• Carnegie Mellon: Guy Blelloch, Jeremy Fineman, Robert Harper, Ryan Johnson, Ippokratis Pandis, Harsha Vardhan Simhadri, Daniel Spoonhower
• Microsoft Research: Suman Nath
• EPFL: Anastasia Ailamaki, Manos Athanassoulis, Radu Stoica
• University of Pittsburgh: Panos Chrysanthis, Alexandros Labrinidis, Mohamed Sharaf

Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy algorithms & systems
• Smart thread schedulers enable simple, hierarchy-savvy abstractions
• Flash-savvy (database) systems maximize the benefits of Flash devices
• Ongoing work w/ many open problems

Abstract Hierarchy: Target Platform
Specific example: Xeon 7500. General abstraction: a tree of caches (figure).

Abstract Hierarchy: Simplified View
What yields good hierarchy performance?
• Spatial locality: use what’s brought in – popular sizes: cache lines 64B; pages 4KB
• Temporal locality: reuse it
• Constructive sharing: don’t step on others’ toes
How might one simplify the view?
• Approach 1: Design to a 2 or 3 level hierarchy (?)
• Approach 2: Design to a sequential hierarchy (?)
• Approach 3: Do both (??)

Sequential Hierarchies: Simplified View
• External Memory Model – see [J.S. Vitter, ACM Computing Surveys, 2001]
(Figure: main memory of size M, external memory, transfers in blocks of size B)
Simple model: minimize I/Os; only 2 levels; only 1 “cache”
Can be a good choice if the bottleneck is the last level

Sequential Hierarchies: Simplified View
• Ideal Cache Model [Frigo et al., FOCS’99]
Twist on the EM model: M & B unknown to the algorithm (still a simple model)
Key algorithm goal: good performance for any M & B
⇒ Guaranteed good cache performance at all levels of the hierarchy
Single CPU only (all caches shared)
Encourages hierarchical locality
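To make the Ideal Cache Model concrete, here is a minimal sequential sketch of the recursive divide-and-conquer matrix multiply that the next slide uses as an example paradigm. It is an illustration under stated assumptions, not code from the talk: all names and the base-case cutoff are arbitrary, and it uses plain row-major storage for brevity (the optimal miss bound quoted on the slide assumes the recursive Z-order layout). The recursion never consults M or B, which is exactly what makes the locality hold at every level.

#include <cstddef>
#include <iostream>
#include <vector>

// C += A * B on the n x n submatrices whose top-left corners are
// (ar,ac), (br,bc), (cr,cc) inside row-major matrices of dimension dim.
// Locality at every cache level comes from the recursion, not from M or B.
void recmm(const std::vector<double>& A, const std::vector<double>& B,
           std::vector<double>& C, std::size_t dim,
           std::size_t ar, std::size_t ac, std::size_t br, std::size_t bc,
           std::size_t cr, std::size_t cc, std::size_t n) {
  if (n <= 32) {                       // small base case; the constant is arbitrary
    for (std::size_t i = 0; i < n; ++i)
      for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
          C[(cr + i) * dim + cc + j] +=
              A[(ar + i) * dim + ac + k] * B[(br + k) * dim + bc + j];
    return;
  }
  std::size_t h = n / 2;
  // Quadrant sums: C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22, etc.
  recmm(A, B, C, dim, ar,     ac,     br,     bc,     cr,     cc,     h);
  recmm(A, B, C, dim, ar,     ac + h, br + h, bc,     cr,     cc,     h);
  recmm(A, B, C, dim, ar,     ac,     br,     bc + h, cr,     cc + h, h);
  recmm(A, B, C, dim, ar,     ac + h, br + h, bc + h, cr,     cc + h, h);
  recmm(A, B, C, dim, ar + h, ac,     br,     bc,     cr + h, cc,     h);
  recmm(A, B, C, dim, ar + h, ac + h, br + h, bc,     cr + h, cc,     h);
  recmm(A, B, C, dim, ar + h, ac,     br,     bc + h, cr + h, cc + h, h);
  recmm(A, B, C, dim, ar + h, ac + h, br + h, bc + h, cr + h, cc + h, h);
}

int main() {
  const std::size_t n = 256;           // power of two for simplicity
  std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);
  recmm(A, B, C, n, 0, 0, 0, 0, 0, 0, n);
  std::cout << C[0] << "\n";           // prints 512 (= 2 * 256)
}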
Example Paradigms Achieving Key Goal
• Scan: e.g., computing the sum of N items – N/B misses, for any B (optimal)
• Divide-and-Conquer: e.g., matrix multiply C = A*B
  – Divide: recursively compute A11*B11, …, A22*B22
  – Conquer: compute the 4 quadrant sums
    (C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22, C21 = A21*B11 + A22*B21, C22 = A21*B12 + A22*B22)
  – Uses a recursive Z-order layout
  – O(N²/B + N³/(B·√M)) misses (optimal)

Multicore Hierarchies: Possible Views
Design to the Tree-of-Caches abstraction:
• Multi-BSP Model [L.G. Valiant, ESA’08]
  – 4 parameters/level: cache size, fanout, latency/sync cost, transfer bandwidth
  – Bulk-synchronous
Our goal:
• Approach the simplicity of the Ideal Cache Model
  – Hierarchy-savvy sweet spot
  – Do not require bulk-synchrony

Multicore Hierarchies: Key Challenge
• The theory underlying the Ideal Cache Model falls apart once we introduce parallelism:
  Good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy
Key reason: caches are not fully shared
(Figure: CPU1, CPU2, CPU3 with private L1s above a shared L2 cache)
What’s good for CPU1 is often bad for CPU2 & CPU3 – e.g., all want to write B at ≈ the same time

Multicore Hierarchies: Key New Dimension: Scheduling
Key new dimension: the scheduling of parallel threads – it has a LARGE impact on cache performance
Recall our problem scenario: all CPUs want to write B at ≈ the same time
We can mitigate (but not solve) this if we can schedule the writes to be far apart in time

Key Enabler: Fine-Grained Threading
• Coarse threading, popular for decades
  – Spawn one thread per core at program initialization
  – Heavy-weight O.S. threads
  – E.g., the Splash benchmark
• Better alternative:
  – System supports user-level light-weight threads
  – Programs expose lots of parallelism
  – Dynamic parallelism: forking can be data-dependent
  – A smart runtime scheduler maps threads to cores, dynamically as the computation proceeds

Cache Uses Among Multiple Threads (slide thanks to Shimin Chen)
• Destructive: threads compete for the limited on-chip cache and “flood” the off-chip pins
• Constructive: threads share a largely overlapping working set

Smart Thread Schedulers
• Work Stealing
  – Gives priority to tasks in the local work queue
  – Good for private caches
• Parallel Depth-First (PDF) [JACM’99, SPAA’04]
  – Gives priority to the earliest ready tasks in the sequential schedule
  – Good for shared caches (carries sequential locality over to parallel locality)

Parallel Merge Sort: WS vs. PDF
(Figure: cache-miss/hit/mixed maps on 8 cores for Work Stealing (WS) vs. Parallel Depth-First (PDF); shared cache = 0.5 × (src array size + dest array size))
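The scheduler comparison above is driven by fine-grained fork-join code along the lines of the sketch below. This is a hedged illustration, not the benchmark from the slide: std::async stands in for a user-level light-weight task runtime (it actually launches O.S. threads, so treat it purely as illustration), and the fork-depth cutoff is arbitrary. The point is that the program only exposes parallelism; whether the resulting tasks are mapped to cores by work stealing or by PDF is entirely the runtime scheduler's decision.

#include <algorithm>
#include <future>
#include <iostream>
#include <random>
#include <vector>

// Fine-grained fork-join merge sort: the program exposes parallelism,
// and the task runtime decides how the forked halves map onto cores.
void pmsort(std::vector<int>& a, std::size_t lo, std::size_t hi, int depth) {
  if (hi - lo < 2) return;
  std::size_t mid = lo + (hi - lo) / 2;
  if (depth > 0) {
    auto left = std::async(std::launch::async,
                           [&] { pmsort(a, lo, mid, depth - 1); });
    pmsort(a, mid, hi, depth - 1);   // sort the right half in this task
    left.get();                      // join the forked left half
  } else {
    pmsort(a, lo, mid, 0);
    pmsort(a, mid, hi, 0);
  }
  std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
}

int main() {
  std::vector<int> a(1 << 20);
  std::mt19937 gen(42);
  for (int& x : a) x = static_cast<int>(gen() >> 1);
  pmsort(a, 0, a.size(), 4);         // 2^4 leaf tasks; the cutoff is illustrative
  std::cout << std::boolalpha << std::is_sorted(a.begin(), a.end()) << "\n";
}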
Private vs. Shared Caches
• 3-level multicore model (CPUs with private L1s, a shared L2 cache, main memory)
• Designed a new scheduler (Controlled-PDF) with provably good cache performance for a class of divide-and-conquer algorithms [SODA’08]
• The results require exposing the working-set size of each recursive subproblem

Low-Span + Ideal Cache Model
• Observation: guarantees on cache performance depend on the computation’s span S (the length of the critical path)
  – E.g., work stealing on a single level of private caches:
    Thrm: For any computation w/ fork-join parallelism, O(M P S / B) more misses on P cores than on 1 core
• Approach [SPAA’10]: design parallel algorithms with
  – low span, and
  – good performance on the Ideal Cache Model
  Thrm: For any computation w/ fork-join parallelism, for each level i there are only O(M_i P S / B_i) more misses than on 1 core, for a hierarchy of private caches
Low span S + good miss bound

Challenge of the General Case Tree-of-Caches
• Each subtree has a given amount of compute & cache resources
• To avoid cache misses from migrating tasks, we would like to assign/pin each task to a subtree
• But any given program task may not match both – e.g., it may need a large cache but few processors

Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy algorithms & systems
• Smart thread schedulers enable simple, hierarchy-savvy abstractions
• Flash-savvy (database) systems maximize the benefits of Flash devices
• Ongoing work w/ many open problems

Flash Superior to Magnetic Disk on Many Metrics
• Energy-efficient • Smaller • Lighter • More durable • Higher throughput • Less cooling cost

Flash-Savvy Systems
• Simply replacing some magnetic disks with Flash devices WILL improve performance
However:
• Much of the performance is left on the table – systems are not tuned to Flash characteristics
Flash-savvy systems:
• Maximize the benefits of the platform’s flash devices – what is best offloaded to flash?
Many papers in this area – we discuss only our results

NAND Flash Chip Properties
(Figure: a block holds 64–128 pages; a page is 512–2048 B; reads and writes are per page, erases are per block)
• A page can be written only once after its block is erased; an in-place update requires: 1. copy, 2. erase, 3. write, 4. copy, 5. erase
• Expensive operations: in-place updates, random writes
(Figure: read latency is similar for sequential and random accesses, roughly 0.4–0.6 ms, while a random write costs ≈127 ms vs. ≈0.4 ms for a sequential write)

Using “Semi-Random” Writes in place of Random Writes
(Figure: energy to maintain a random sample on a Lexar CF card; our algorithm [VLDB’08] vs. random writes)

Quirks of Flash (Mostly) Hidden by SSD Firmware
(Figure: Intel X25-M SSD; time (ms) vs. request size (512 B – 16 KB) for seq-read, seq-write, ran-read, ran-write)
Random writes & in-place updates are no longer slow
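The semi-random write pattern referenced above is small enough to sketch. The toy program below only generates a write schedule (no flash I/O): it visits blocks in random order but writes the pages within each chosen block strictly sequentially, which is the access pattern the [VLDB’08] work substitutes for fully random page writes. The block/page geometry and all names are illustrative assumptions.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <random>
#include <vector>

int main() {
  const std::uint32_t kBlocks = 8;          // illustrative geometry
  const std::uint32_t kPagesPerBlock = 64;  // cf. 64-128 pages per block

  // Pick the blocks in random order...
  std::vector<std::uint32_t> block_order(kBlocks);
  for (std::uint32_t b = 0; b < kBlocks; ++b) block_order[b] = b;
  std::shuffle(block_order.begin(), block_order.end(), std::mt19937(42));

  // ...but write the pages inside each chosen block strictly sequentially.
  std::vector<std::uint32_t> schedule;
  for (std::uint32_t b : block_order)
    for (std::uint32_t p = 0; p < kPagesPerBlock; ++p)
      schedule.push_back(b * kPagesPerBlock + p);

  std::cout << "first writes of the semi-random schedule:";
  for (std::size_t i = 0; i < 10; ++i) std::cout << " " << schedule[i];
  std::cout << "\n(" << schedule.size() << " page writes total)\n";
}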
Flash Logging (1/3) [SIGMOD’09] (slide thanks to Shimin Chen)
Transactional logging: a major bottleneck
• Today, OLTP databases can fit into main memory (e.g., in TPCC, 30M customers < 100GB)
• In contrast, the redo log must be flushed to stable media at commit time
Log access pattern: small sequential writes
• Ill-suited for magnetic disks: they incur full rotational delays
• Alternative solutions are expensive or complicated
⇒ Exploit flash devices for logging

Flash Logging (2/3) (slide thanks to Shimin Chen)
USB flash drives are a good match:
• Widely available USB ports
• Inexpensive: use multiple devices for better performance
• Hot-plug: cope with limited erase cycles
• Multiple USB flash drives achieve better performance at a lower price than a single SSD
Our solution: FlashLogging
• Unconventional array design
• Outlier detection & hiding
• Efficient recovery
(Figure: database worker threads append to an in-memory log buffer; the logging interface drains a request queue onto the array of flash drives)

Flash Logging (3/3) (slide thanks to Shimin Chen)
(Figure: new-order transactions per minute for disk, SSD, usb-A, usb-B, usb-C, and ideal, comparing the original logging with FlashLogging)
• Up to 5.7X improvement over disk-based logging
• Up to 98% of ideal performance
• Multiple USB flash drives achieve better performance than a single SSD, at a fraction of the price

PR-Join for Online Aggregation
• Data warehouse and business intelligence – a fast-growing multi-billion dollar market
• Interactive ad-hoc queries – important for detecting new trends; fast response times are hard to achieve
• One promising approach: online aggregation
  – Provides early representative results for aggregate queries (sum, avg, etc.), i.e., estimates & statistical confidence intervals
  – Problem: queries with joins are too slow
• Our goal: a faster join for online aggregation

Design Space (slide thanks to Shimin Chen)
(Figure: design space of early representative result rate (low to high) vs. total I/O cost (low to high), placing Ripple, SMS, Hash Ripple, and GRACE; PR-Join targets a high early result rate at low total I/O cost)

Background: Ripple Join
A join B: find the matching records of A and B
(Figure: each ripple grows the square of checked record pairs, covering new and spilled records from A and B)
For each ripple:
• Read new records from A and B; check them for matches
• Read spilled records; check them for matches with the new records
• Spill the new records to disk
The join checks all pairs of records from A and B
Problem: the ripple width is limited by the memory size

PR-Join: Partitioned expanding Ripple Join [Sigmod’10]
Idea: multiplicatively expanding ripples
• Higher result rate
• Representative results
To overcome ripple width > memory: hash partitioning on the join key
• Each partition < memory
• Report results per partitioned ripple

PR-Join leveraging SSD
Near-optimal total I/O cost; higher early result rate
Setting: 10GB joins 10GB, 500MB memory; inputs on HD; SSD for temp storage
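For readers unfamiliar with ripple joins, here is a minimal in-memory sketch of the basic ripple loop that the PR-Join slides above build on. It is an assumption-laden illustration, not PR-Join itself: the Row type, ripple width, and equality predicate are made up, there is no spilling to disk, and the multiplicative expansion and hash partitioning of PR-Join are omitted. It does show the key property that matches stream out after every ripple, long before either input is fully read.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct Row { int key; int payload; };

// In each ripple, pull a few new records from each input and match them
// against everything seen so far, so every A-B pair is checked exactly once.
std::size_t ripple_join(const std::vector<Row>& A, const std::vector<Row>& B,
                        std::size_t ripple_width) {
  std::vector<Row> seenA, seenB;
  std::size_t matches = 0, ia = 0, ib = 0;
  while (ia < A.size() || ib < B.size()) {
    // New records pulled in this ripple (would come from disk in a real join).
    std::vector<Row> newA(A.begin() + ia,
                          A.begin() + std::min(A.size(), ia + ripple_width));
    std::vector<Row> newB(B.begin() + ib,
                          B.begin() + std::min(B.size(), ib + ripple_width));
    ia += newA.size();
    ib += newB.size();
    // Check new-vs-seen and new-vs-new pairs.
    for (const Row& a : newA)
      for (const Row& b : seenB) matches += (a.key == b.key);
    for (const Row& b : newB)
      for (const Row& a : seenA) matches += (a.key == b.key);
    for (const Row& a : newA)
      for (const Row& b : newB) matches += (a.key == b.key);
    seenA.insert(seenA.end(), newA.begin(), newA.end());
    seenB.insert(seenB.end(), newB.begin(), newB.end());
    std::cout << "after ripple: " << matches << " matches so far\n";
  }
  return matches;
}

int main() {
  std::vector<Row> A, B;
  for (int i = 0; i < 20; ++i) { A.push_back({i % 5, i}); B.push_back({i % 4, i}); }
  std::cout << "total matches: " << ripple_join(A, B, 4) << "\n";  // prints 80
}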
Concurrent Queries & Updates in Data Warehouse
• Data warehouse queries are dominated by table scans – sequential scans on HD
• Updates are delayed to avoid interfering – e.g., mixing random updates with TPCH queries would incur a 2.9X query slowdown – thus, queries run on stale data

Concurrent Queries & Updates in Data Warehouse
• Our approach: cache updates on SSD
  – Queries take the updates into account on-the-fly
  – Updates are periodically migrated to HD in batch
  – Improves query latency by 2X and update throughput by 70X
(Figure: 1. incoming updates are written to the SSD; 2. query processing merges the related updates into the table (range) scan; 3. updates are migrated in batch to the disks holding the main data)

Hi-Spade: Outline
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy algorithms & systems
• Smart thread schedulers enable simple, hierarchy-savvy abstractions
• Flash-savvy (database) systems maximize the benefits of Flash devices
• Ongoing work w/ many open problems

Publications (1)
Cache hierarchy & schedulers:
1. PDF scheduler for shared caches [SPAA’04]
2. Scheduling for constructive sharing [SPAA’07]
3. Controlled-PDF scheduler [SODA’08]
4. Combinable MBTs [SPAA’08]
5. Semantic space profiling & visualization [ICFP’08]
6. Scheduling beyond nested parallelism [SPAA’09]
7. Low depth paradigm & algorithms [SPAA’10]
8. Parallel Ideal Cache model [under submission]

Semantic Space Profiling [ICFP’08]
(Figure: heap-use DAGs for Matrix Multiply showing peak use under two schedulers, work stealing and breadth-first, plus an allocation-point breakdown on 2 cores with the breadth-first scheduler)

Publications (2)
Flash-savvy database systems:
1. Semi-random writes [VLDB’08]
2. Flash-based transactional logging [Sigmod’09]
3. PR-Join for online aggregation [Sigmod’10]
4. I/O scheduling for transactional I/Os [under submission]
5. Concurrent warehousing queries & updates [under submission]

Many Open Problems
• Hierarchy-savvy ideal: a simplified view + a thread scheduler that will rule the world
• New tools & architectural features that will help
• Extend beyond the MP platform to cluster/cloud
• A richer class of algorithms: data mining, etc.
• Hierarchy-savvy scheduling for power savings
• PCM-savvy systems: how will Phase Change Memory change the world?

Hi-Spade: Conclusions
• Hierarchies are important but challenging
• Hi-Spade vision: Hierarchy-savvy algorithms & systems
• Smart thread schedulers enable simple, hierarchy-savvy abstractions
• Flash-savvy (database) systems maximize the benefits of Flash devices
• Ongoing work w/ many open problems