Taming Parallelism
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@cs.rice.edu
http://hpctoolkit.org
Computer Science Corporate Affiliates Meeting, October 16, 2008

Advance of Semiconductors: “Moore’s Law”
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised — circuit complexity doubles every 18 months
Image credit: http://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf

Leveraging Moore’s Law Trends
From increasing circuit density to performance
• More transistors = ↑ opportunities for exploiting parallelism
• Microprocessors with instruction-level parallelism
  — implicit parallelism
    – pipelined execution of instructions
    – multiple functional units for multiple independent pipelines
    – multiple instruction issue and out-of-order execution
  — explicit parallelism
    – short vector operations for streaming and multimedia
    – long instruction words

Three Complications
• Power and heat — “thermal wall”
• Limited instruction-level parallelism — difficult-to-predict branches
• Wire delays — impediment to performance with wider issue

The Emergence of Multicore Processors
• Today
  — AMD Barcelona: 4 cores; integrated cache
  — Intel Harpertown: 2 dual cores
  — IBM
    – Power 5: dual core; 2 SMT threads per core
    – Cell: 1 PPC core; 8 SPEs w/ SIMD parallelism
  — Sun T2: 8 cores; 8-way fine-grain multithreading per core
  — Tilera: 64 cores in an 8x8 mesh; 5MB cache on chip
• Around the corner
  — AMD Shanghai, 4 cores
  — Intel Core i7, 4 cores, 2 SMT threads per core
  — Intel Itanium “Tukwila,” 4 cores, 2 billion transistors
  — IBM Power7 in 2010
Image credit: http://img.hexus.net/v2/cpu/amd/barcelonajuly/dieshot-big.jpg

David Patterson (UC Berkeley) on Multicore, August 26, 2008
“The leap to multicore is not based on a breakthrough in programming or architecture; it’s actually a retreat from the even harder task of building power-efficient, high-clock-rate, single-core chips”

Isn’t Multicore Just More of the Same? No!
• Clock frequency increases no longer compensate for increasing software bloat
• Application performance won’t track processor enhancements unless software is highly concurrent
• Heterogeneous cores will become more commonplace
• Programming models must support parallelism or wither
• Need development tools for threaded parallel codes

From Multicore to Scalable Parallel Systems

The Need for Speed
• Computationally challenging problems
  — simulations that are intrinsically multi-scale, or
  — simulations involving interaction of multiple processes
• DOE application space
  — turbulent combustion
  — magnetic confinement fusion
  — climate
  — cosmology
  — materials science
  — computational chemistry
  — ...

Hierarchical Parallelism in Supercomputers
• Cores with pipelining and short vectors
• Multicore processors
• Shared-memory multiprocessor nodes
• Scalable parallel systems
Image credit: http://www.nersc.gov/news/reports/bluegene.gif

Historical Concurrency in Top 500 Systems
Image credit: http://www.top500.org/overtime/list/31/procclass
Scale of Today’s Largest HPC Systems
• 1 petaflop!
• > 100K cores
Image credit: http://www.top500.org/list/2008/06/100

Achieving High Performance on Parallel Systems
Computation is only part of the picture
• Memory latency and bandwidth
  — CPU rates have improved 4x as fast as memory over the last decade
  — bridge the speed gap using the memory hierarchy
  — multicore exacerbates demand
• Interprocessor communication
• Input/output
  — I/O bandwidth to disk typically grows linearly with # processors
Image Credit: Bob Colwell, ISCA 1995

The Parallel Performance Problem
• HPC platforms have become enormously complex
• Opportunities for performance losses at all levels
  — algorithmic scaling
  — serialization and load imbalance
  — communication or I/O bottlenecks
  — insufficient or inefficient parallelization
  — memory hierarchy utilization
• Modern scientific applications are growing increasingly complex
  — often multiscale and adaptive
• Performance and scalability losses seem mysterious

The Role of Performance Tools
Pinpoint and diagnose bottlenecks in parallel codes
• Are there parallel scaling bottlenecks at any level?
• Are applications making the most of node performance?
• What are the rate limiting factors for an application?
  — mismatch between application needs and system capabilities?
    – memory bandwidth
    – floating point performance
    – communication bandwidth, latency
• What is the expected benefit of fixing bottlenecks?
• What type and level of effort is necessary?
  — tune existing implementation
  — overhaul implementation to better match architecture capabilities
  — new algorithms

Outline
• Overview of HPCToolkit
• Pinpointing scalability bottlenecks
  — scalability bottlenecks on large-scale parallel systems
  — scaling on multicore processors
• Status and ongoing work

Performance Analysis Goals
• Accurate measurement of complex parallel codes
  — large, multi-lingual programs
  — fully optimized code: loop optimization, templates, inlining
  — binary-only libraries, sometimes partially stripped
  — complex execution environments
    – dynamic loading or static binaries
    – SPMD parallel codes with threaded node programs
    – batch jobs
  — production executions
• Effective performance analysis
  — pinpoint and explain problems
    – intuitive enough for scientists and engineers
    – detailed enough for compiler writers
  — yield actionable results
• Scalable to petascale systems

HPCToolkit Approach
• Binary-level measurement and analysis
  — observe fully optimized, dynamically linked executions
• Sampling-based measurement
  — minimize systematic error and avoid blind spots
  — support data collection for large-scale parallelism
• Collect and correlate multiple derived performance metrics
  — diagnosis requires more than one species of metric
  — derived metrics: “unused bandwidth” rather than “cycles”
• Associate metrics with both static and dynamic context
  — loop nests, procedures, inlined code, calling context
• Support top-down performance analysis
  — avoid getting overwhelmed with the details
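The approach above favors derived metrics such as “unused bandwidth” over raw counts. As a minimal illustration (not HPCToolkit code), the sketch below synthesizes a “wasted opportunity” metric per program scope from two measured metrics, in the spirit of the S3D example later in the talk; the dictionary-based profile representation and the peak rate of 2 FLOPs/cycle are assumptions for the example.

```python
def wasted_opportunity(profiles, peak_flops_per_cycle):
    """Synthesize a derived 'waste' metric per program scope.

    profiles: dict mapping a scope (e.g., a loop or procedure) to a dict
              with measured 'cycles' and 'flops' counts.
    Returns, for each scope, the FLOPs that could have been executed at
    peak rate but were not: peak rate * cycles - actual FLOPs.
    """
    return {scope: peak_flops_per_cycle * m["cycles"] - m["flops"]
            for scope, m in profiles.items()}

# Example with the S3D numbers cited later in the talk, assuming a peak
# of 2 FLOPs/cycle (hypothetical parameter for this sketch).
example = {"solve_loop": {"cycles": 6.73e11, "flops": 2.05e11}}
print(wasted_opportunity(example, peak_flops_per_cycle=2.0))
```

Ranking scopes by such a derived metric points directly at wasted capacity rather than merely at where time is spent.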
HPCToolkit Workflow
[Workflow figure: app. source → compile & link → optimized binary; profile execution (hpcrun) → call stack profile; binary analysis (hpcstruct) → program structure; hpcprof interprets the profile and correlates it with source to produce a database; presentation via hpcviewer]

HPCToolkit Workflow
[workflow figure, as above]
Compile and link for production execution
— full optimization

HPCToolkit Workflow
[workflow figure, as above]
Measure execution unobtrusively
— launch optimized application binaries
— collect statistical profiles of events of interest
  – call path profiles

Call Path Profiling
Measure and attribute costs in context
• Sample timer or hardware counter overflows
• Gather calling context using stack unwinding
[Figure: a call path sample — a chain of return addresses plus the current instruction pointer — is inserted into a Calling Context Tree (CCT)]
Overhead proportional to sampling frequency... not call frequency

Call Path Profiling Challenges
Unwinding optimized code
• Difficulties
  — compiler information is inadequate for unwinding
  — code may be partially stripped
• Questions
  — where is the return address of the current frame?
  — where are the contents of the frame pointer for the caller’s frame?
• Approach: use binary analysis to support unwinding
  — recover function bounds in stripped load modules
  — compute unwind recipes for code intervals within procedures
• Real-world challenges
  — dynamic loading
  — multithreading
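To make the sampling idea above concrete, here is a minimal sketch of timer-driven call path sampling. It is written in Python for brevity and only illustrates the technique; it is not hpcrun’s implementation, which samples native, fully optimized code and uses its own binary-analysis-based unwinder. This sketch walks Python frames on SIGPROF (Unix only) and represents the CCT as a flat counter keyed by the full call path.

```python
import collections
import signal

# Calling context "tree", represented simply as a counter keyed by the
# full call path of each sample (root ... leaf).
cct = collections.Counter()

def _on_sample(signum, frame):
    # Unwind the interrupted stack to recover the calling context.
    path = []
    f = frame
    while f is not None:
        path.append((f.f_code.co_filename, f.f_code.co_name, f.f_lineno))
        f = f.f_back
    cct[tuple(reversed(path))] += 1  # attribute one sample to this context

def start_sampling(period_seconds=0.005):
    # Deliver SIGPROF periodically, measured in CPU time consumed by the
    # process; overhead is proportional to the sampling frequency, not to
    # how often procedures are called.
    signal.signal(signal.SIGPROF, _on_sample)
    signal.setitimer(signal.ITIMER_PROF, period_seconds, period_seconds)

def stop_sampling():
    signal.setitimer(signal.ITIMER_PROF, 0, 0)
    signal.signal(signal.SIGPROF, signal.SIG_DFL)
```

Calling start_sampling() before a region of interest and stop_sampling() after it leaves cct holding a sample count for each distinct calling context; the cost of measurement scales with the sampling period rather than with call frequency, which is the property emphasized above.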
HPCToolkit Workflow
[workflow figure, as above]
Analyze binary to recover program structure
— extract loop nesting & identify procedure inlining
— map transformed loops and procedures to source

HPCToolkit Workflow
[workflow figure, as above]
• Correlate dynamic metrics with static source structure
• Synthesize new metrics by combining metrics
• Combine multiple profiles

HPCToolkit Workflow
[workflow figure, as above]
Presentation
— support top-down analysis with interactive viewer
— analyze results anytime, anywhere

Effective Presentation
[Screenshot of hpcviewer: source pane, view control, navigation pane, metric pane]
• top-down, bottom-up, & flat views
• inclusive and exclusive costs

Principal Views
• Calling context tree view
  — “top-down” (down the call chain)
  — associate metrics with each dynamic calling context
  — high-level, hierarchical view of distribution of costs
• Caller’s view
  — “bottom-up” (up the call chain)
  — apportion a procedure’s metrics to its dynamic calling contexts
  — understand costs of a procedure called in many places
• Flat view
  — “flatten” the calling context of each sample point
  — aggregate all metrics for a procedure, from any context
  — attribute costs to loop nests and lines within a procedure

Chroma Lattice QCD Library
[Screenshot: calling context view]
• costs for loops in the CCT
• costs for inlined procedures (not shown)

LANL’s Parallel Ocean Program (POP)
[Screenshot: caller’s view, showing attribution of procedure costs to calling contexts]

S3D Solver for Turbulent, Reacting Flows
Overall performance (15% of peak): 2.05 × 10^11 FLOPs / 6.73 × 10^11 cycles = 0.305 FLOPs/cycle
Wasted opportunity: (maximum FLOP rate × cycles) − (actual FLOPs)
[Screenshot: the highlighted loop accounts for 11.4% of total program waste]

Outline
• Overview of Rice’s HPCToolkit
• Pinpointing scalability bottlenecks
  — scalability bottlenecks on large-scale parallel systems
  — scaling on multicore processors
• Status and ongoing work

The Problem of Scaling
[Plot: parallel efficiency (0.5–1.0) vs. number of CPUs (1 to 65,536); ideal efficiency vs. actual efficiency. Note: higher is better]

Goal: Automatic Scaling Analysis
• Pinpoint scalability bottlenecks
• Quantify the magnitude of each problem
• Guide user to problems
• Diagnose the nature of the problem

Challenges for Pinpointing Scalability Bottlenecks
• Parallel applications
  — modern software uses layers of libraries
  — performance is often context dependent
• Monitoring
  — bottleneck nature: computation, data movement, synchronization?
  — size of petascale platforms demands acceptable data volume
  — low perturbation for use in production runs
[Figure: example climate code skeleton — main calls land, sea ice, ocean, and atmosphere components, each followed by a wait]

Performance Analysis with Expectations
• Users have performance expectations for parallel codes
  — strong scaling: linear speedup
  — weak scaling: constant execution time
• Putting expectations to work
  — define your expectations
  — measure performance under different conditions
    – e.g. different levels of parallelism or different inputs
  — compute the deviation from expectations for each calling context
    – for both inclusive and exclusive costs
  — correlate the metrics with the source code
  — explore the annotated call tree interactively

Weak Scaling Analysis for SPMD Codes
• Performance expectation for weak scaling
  – total work increases linearly with # processors
  – execution time is same as that on a single processor
• Execute code on p and q processors; without loss of generality, p < q
• Let T_i = total execution time on i processors
• For corresponding nodes n_q and n_p, let C(n_q) and C(n_p) be the costs of nodes n_q and n_p
• Expectation: C(n_q) = C(n_p)
• Fraction of excess work: X_w(n_q) = (C(n_q) − C(n_p)) / T_q, i.e., parallel overhead divided by total time
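As a sketch of how this metric might be computed (an illustration only, not hpcprof’s implementation), the code below differences the inclusive costs of corresponding calling-context-tree nodes from the two runs and normalizes by the total time of the larger run; the dictionary-based CCT representation is an assumption for the example.

```python
def excess_work(cct_p, cct_q, total_time_q):
    """Fraction of excess work X_w(n_q) for each calling context.

    cct_p, cct_q: dicts mapping a calling context (e.g., a tuple of frames)
                  to its inclusive cost, measured on p and q processors
                  (p < q) in a weak scaling experiment.
    total_time_q: total execution time T_q on q processors.
    """
    return {ctx: (cost_q - cct_p.get(ctx, 0.0)) / total_time_q
            for ctx, cost_q in cct_q.items()}
```

Contexts with large positive values are where parallel overhead concentrates; annotating the CCT with this metric yields scalability views like those on the following slides.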
Weak Scaling: 1K to 10K Processors
[Screenshot: excess work annotated on the calling context tree, computed as the 10K-processor profile minus the 1K-processor profile]

Scalability Analysis of LANL’s POP
[Screenshot: calling context view; strong scaling, 4 vs. 64 processors]

Scalability Analysis of MILC
[Screenshot: calling context view; weak scaling, 1 vs. 16 processors]

Scaling on Multicore Processors
• Compare performance
  — single vs. multiple processes on a multicore system
• Strategy
  — differential performance analysis
    – subtract the calling context trees as before, unit coefficient for each

Multicore Losses at the Procedure Level

Multicore Losses at the Loop Level

Analyzing Performance of Threaded Code
• Goal: identify where a program needs improvement
• Approach: attribute work and idleness
  — use statistical sampling to measure activity and inactivity
  — when a sample event interrupts an active thread t
    – charge t one sample to its current calling context as “work”
    – suppose w threads are working and i threads are idle
    – charge t i/w samples of idleness: each working thread is partially responsible for the idleness
  — when a sample event interrupts an idle thread, drop it
• Applicability
  — Pthreads
  — Cilk and OpenMP
    – use logical call stack unwinding to bridge the gap between logical and physical call stacks, e.g. with Cilk work stealing, OpenMP task scheduling

Analysis of Multithreaded Cilk Cholesky Decomposition
[Screenshots: top-down view of work; bottom-up view of idleness]

What about Time?
• Profiling compresses out the temporal dimension
  — that’s why serialization is invisible in profiles
• What can we do? Trace call path samples
  — sketch:
    – N times per second, take a call path sample of each thread
    – organize the samples for each thread along a time line
    – view how the execution evolves left to right
    – what do we view? assign each procedure a color; view execution with a depth slice
[Figure: time lines of call path samples, one per thread; horizontal axis is time]

Call Path Sample Trace for GTC
Gyrokinetic Toroidal Code (GTC)
• 32-process MPI program
• Each process has a pair of threads managed with OpenMP

Status and Future Work
• Deployment on 500TF TACC Ranger system this month
• Measurement
  — sampling-based measurement on other platforms is emerging
    – deployment on Cray XT; working with IBM on Blue Gene/P
  — polishing call stack unwinding for Blue Gene
  — measurement of dynamic multithreading
• Analysis
  — statistical analysis of CCTs for parallel codes
  — cluster analysis and anomaly detection
  — parallel analysis of performance data
• Presentation
  — strategies for presenting performance for large-scale parallelism
  — handling out-of-core performance data