Taming Parallelism
John Mellor-Crummey
Department of Computer Science
Rice University
johnmc@cs.rice.edu
http://hpctoolkit.org
Computer Science Corporate Affiliates Meeting
October 16, 2008
Advance of Semiconductors: “Moore’s Law”
Gordon Moore, Founder of Intel
• 1965: since the integrated circuit was invented, the number of transistors/inch² in these circuits roughly doubled every year; this trend would continue for the foreseeable future
• 1975: revised estimate: circuit complexity doubles every 18 months
Image credit: http://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf
Leveraging Moore’s Law Trends
From increasing circuit density to performance
• More transistors = ↑ opportunities for exploiting parallelism
• Microprocessors with instruction-level parallelism
  — implicit parallelism
    – pipelined execution of instructions
    – multiple functional units for multiple independent pipelines
    – multiple instruction issue and out-of-order execution
  — explicit parallelism
    – short vector operations for streaming and multimedia
    – long instruction words
Three Complications
• Power and heat
  — “thermal wall”
• Limited instruction-level parallelism
  — difficult-to-predict branches
• Wire delays
  — impediment to performance with wider issue
The Emergence of Multicore Processors
• Today
  — AMD Barcelona: 4 cores; integrated cache
  — Intel Harpertown: 2 dual cores
  — IBM
    – Power 5: dual core; 2 SMT threads per core
    – Cell: 1 PPC core; 8 SPEs w/ SIMD parallelism
  — Sun T2: 8 cores; 8-way fine-grain multithreading per core
  — Tilera: 64 cores in an 8x8 mesh; 5MB cache on chip
• Around the corner
  — AMD Shanghai: 4 cores
  — Intel Core i7: 4 cores, 2 SMT threads per core
  — Intel Itanium “Tukwila”: 4 cores, 2 billion transistors
  — IBM Power7 in 2010
Image credit: http://img.hexus.net/v2/cpu/amd/barcelonajuly/dieshot-big.jpg
David Patterson (UC Berkeley) on Multicore
August 26, 2008
“The leap to multicore is not based on a breakthrough in
programming or architecture; it’s actually a retreat from the
even harder task of building power-efficient, high-clock-rate,
single-core chips”
Isn’t Multicore Just More of the Same?
No!
• Clock frequency increases no longer compensate for increasing software bloat
• Application performance won’t track processor enhancements unless software is highly concurrent
• Heterogeneous cores will become more commonplace
• Programming models must support parallelism or wither
• Need development tools for threaded parallel codes
From Multicore to
Scalable Parallel Systems
The Need for Speed
• Computationally challenging problems
  — simulations that are intrinsically multi-scale, or
  — simulations involving interaction of multiple processes
• DOE application space
  — turbulent combustion
  — magnetic confinement fusion
  — climate
  — cosmology
  — materials science
  — computational chemistry
  — ...
Hierarchical Parallelism in Supercomputers
• Cores with pipelining and short vectors
• Multicore processors
• Shared-memory multiprocessor nodes
• Scalable parallel systems
Image credit: http://www.nersc.gov/news/reports/bluegene.gif
Historical Concurrency in Top 500 Systems
Image credit: http://www.top500.org/overtime/list/31/procclass
Scale of Today’s Largest HPC Systems
1 petaflop! > 100K cores
Image credit: http://www.top500.org/list/2008/06/100
Achieving High Performance on Parallel Systems
Computation is only part of the picture
• Memory latency and bandwidth
  — CPU rates have improved 4x as fast as memory over the last decade
  — bridge the speed gap using the memory hierarchy
  — multicore exacerbates demand
• Interprocessor communication
• Input/output
  — I/O bandwidth to disk typically grows linearly with # processors
Image credit: Bob Colwell, ISCA 1995
The Parallel Performance Problem
• HPC platforms have become enormously complex
• Opportunities for performance losses at all levels
  — algorithmic scaling
  — serialization and load imbalance
  — communication or I/O bottlenecks
  — insufficient or inefficient parallelization
  — memory hierarchy utilization
• Modern scientific applications are growing increasingly complex
  — often multiscale and adaptive
• Performance and scalability losses seem mysterious
The Role of Performance Tools
Pinpoint and diagnose bottlenecks in parallel codes
• Are there parallel scaling bottlenecks at any level?
• Are applications making the most of node performance?
• What are the rate-limiting factors for an application?
  — mismatch between application needs and system capabilities?
    – memory bandwidth
    – floating point performance
    – communication bandwidth, latency
• What is the expected benefit of fixing bottlenecks?
• What type and level of effort is necessary?
  — tune existing implementation
  — overhaul implementation to better match architecture capabilities
  — new algorithms
Outline
• Overview of HPCToolkit
• Pinpointing scalability bottlenecks
  — scalability bottlenecks on large-scale parallel systems
  — scaling on multicore processors
• Status and ongoing work
Performance Analysis Goals
• Accurate measurement of complex parallel codes
  — large, multi-lingual programs
  — fully optimized code: loop optimization, templates, inlining
  — binary-only libraries, sometimes partially stripped
  — complex execution environments
    – dynamic loading or static binaries
    – SPMD parallel codes with threaded node programs
    – batch jobs
  — production executions
• Effective performance analysis
  — pinpoint and explain problems
    – intuitive enough for scientists and engineers
    – detailed enough for compiler writers
  — yield actionable results
• Scalable to petascale systems
HPCToolkit Approach
• Binary-level measurement and analysis
  — observe fully optimized, dynamically linked executions
• Sampling-based measurement
  — minimize systematic error and avoid blind spots
  — support data collection for large-scale parallelism
• Collect and correlate multiple derived performance metrics
  — diagnosis requires more than one species of metric
  — derived metrics: “unused bandwidth” rather than “cycles”
• Associate metrics with both static and dynamic context
  — loop nests, procedures, inlined code, calling context
• Support top-down performance analysis
  — avoid getting overwhelmed with the details
HPCToolkit Workflow
[Workflow diagram: app. source → compile & link → optimized binary; hpcrun profiles the execution to produce call stack profiles; hpcstruct analyzes the binary to recover program structure; hpcprof interprets the profiles and correlates them with source to build a database; hpcviewer presents the results]
HPCToolkit Workflow
[Same workflow diagram as above]
Compile and link for production execution
  — full optimization
HPCToolkit Workflow
[Same workflow diagram as above]
Measure execution unobtrusively
  — launch optimized application binaries
  — collect statistical profiles of events of interest
    – call path profiles
Call Path Profiling
Measure and attribute costs in context
• Sample timer or hardware counter overflows
• Gather calling context using stack unwinding
[Figure: a call path sample (instruction pointer plus a chain of return addresses) and the Calling Context Tree (CCT) built from such samples]
Overhead is proportional to sampling frequency... not call frequency.
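To make the mechanism concrete, here is a minimal sketch in C++ of sampling-based call path collection using a profiling timer and libunwind; it is illustrative only and omits much of what hpcrun actually handles (async-signal safety, unwinding through fully optimized or stripped code, and CCT construction).

// Sketch: periodic SIGPROF samples, each unwound into a call path.
#include <csignal>
#include <sys/time.h>
#define UNW_LOCAL_ONLY
#include <libunwind.h>

static void on_sigprof(int) {
  unw_context_t ctx;
  unw_cursor_t cursor;
  unw_getcontext(&ctx);
  unw_init_local(&cursor, &ctx);
  // Walk return addresses from the interrupted frame outward: this chain of
  // PCs is one call path sample to be inserted into the calling context tree.
  while (unw_step(&cursor) > 0) {
    unw_word_t ip;
    unw_get_reg(&cursor, UNW_REG_IP, &ip);
    // record ip in a per-thread sample buffer (omitted in this sketch)
  }
}

int main() {
  struct sigaction sa = {};
  sa.sa_handler = on_sigprof;
  sigaction(SIGPROF, &sa, nullptr);
  struct itimerval it = {{0, 10000}, {0, 10000}};  // ~100 samples per second
  setitimer(ITIMER_PROF, &it, nullptr);            // overhead tracks sample rate,
                                                   // not call frequency
  // ... run the application workload here ...
}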
Call Path Profiling Challenges
Unwinding optimized code
• Difficulties
  — compiler information is inadequate for unwinding
  — code may be partially stripped
• Questions
  — where is the return address of the current frame?
  — where are the contents of the frame pointer for the caller’s frame?
• Approach: use binary analysis to support unwinding
  — recover function bounds in stripped load modules
  — compute unwind recipes for code intervals within procedures (sketched below)
• Real-world challenges
  — dynamic loading
  — multithreading
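For illustration, a sketch of the unwind-recipe idea under simplifying assumptions: the UnwindRecipe layout below is hypothetical and far simpler than what HPCToolkit’s binary analysis actually computes, but it shows how a PC can be mapped to instructions for finding the caller’s frame.

// Sketch: per-interval unwind recipes, looked up by instruction address.
#include <algorithm>
#include <cstdint>
#include <vector>

enum class RAWhere { InRegister, SPRelative, FPRelative };

struct UnwindRecipe {
  uintptr_t start, end;   // code interval [start, end) within a procedure
  RAWhere ra_location;    // where the return address lives in this interval
  int offset;             // offset from SP or FP when memory-resident
};

// Recipes must be sorted by start address. Given a frame's PC, binary-search
// for the interval containing it to learn how to recover the return address.
const UnwindRecipe* lookup(const std::vector<UnwindRecipe>& recipes, uintptr_t pc) {
  auto it = std::upper_bound(recipes.begin(), recipes.end(), pc,
      [](uintptr_t p, const UnwindRecipe& r) { return p < r.start; });
  if (it == recipes.begin()) return nullptr;
  --it;
  return (pc < it->end) ? &*it : nullptr;
}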
HPCToolkit Workflow
[Same workflow diagram as above]
Analyze binary to recover program structure
  — extract loop nesting & identify procedure inlining
  — map transformed loops and procedures to source
HPCToolkit Workflow
[Same workflow diagram as above]
• Correlate dynamic metrics with static source structure
• Synthesize new metrics by combining metrics
• Combine multiple profiles
HPCToolkit Workflow
[Same workflow diagram as above]
Presentation
  — support top-down analysis with interactive viewer
  — analyze results anytime, anywhere
Effective Presentation
• top-down, bottom-up, & flat views
• inclusive and exclusive costs
[Screenshot: hpcviewer with source pane, view control, navigation pane, and metric pane]
Principal Views
• Calling context tree view
  — “top-down” (down the call chain)
  — associate metrics with each dynamic calling context
  — high-level, hierarchical view of distribution of costs
• Caller’s view
  — “bottom-up” (up the call chain)
  — apportion a procedure’s metrics to its dynamic calling contexts
  — understand costs of a procedure called in many places
• Flat view
  — “flatten” the calling context of each sample point
  — aggregate all metrics for a procedure, from any context
  — attribute costs to loop nests and lines within a procedure
Chroma Lattice QCD Library
[Screenshot: calling context view]
• costs for loops in CCT
• costs for inlined procedures (not shown)
LANL’S Parallel Ocean Program (POP)
[Screenshot: caller’s view, showing attribution of procedure costs to calling contexts]
S3D Solver for Turbulent, Reacting Flows
Overall performance (15% of peak):
  2.05 × 10^11 FLOPs / 6.73 × 10^11 cycles = 0.305 FLOPs/cycle
Wasted opportunity = (maximum FLOP rate × cycles) − (actual FLOPs)
The highlighted loop accounts for 11.4% of total program waste.
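A minimal sketch of how such a derived waste metric could be computed from measured cycle and FLOP counts. The peak rate of 2 FLOPs/cycle and the per-loop counts below are assumptions for illustration only, not measurements from S3D.

// Sketch: derived "wasted opportunity" metric from hardware-counter samples.
#include <cstdio>

struct ScopeCounts {
  const char* name;
  double cycles;   // sampled cycle count attributed to this scope
  double flops;    // sampled floating-point operation count
};

int main() {
  const double peak_flops_per_cycle = 2.0;  // assumption: machine peak rate
  ScopeCounts program = {"<program>", 6.73e11, 2.05e11};       // whole-program totals
  ScopeCounts loop    = {"loop@solver", 5.0e10, 1.2e10};       // hypothetical loop scope

  double total_waste = peak_flops_per_cycle * program.cycles - program.flops;
  double loop_waste  = peak_flops_per_cycle * loop.cycles    - loop.flops;

  std::printf("achieved: %.3f FLOPs/cycle\n", program.flops / program.cycles);
  std::printf("loop share of total waste: %.1f%%\n", 100.0 * loop_waste / total_waste);
  return 0;
}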
Outline
• Overview of Rice’s HPCToolkit
• Pinpointing scalability bottlenecks
  — scalability bottlenecks on large-scale parallel systems
  — scaling on multicore processors
• Status and ongoing work
The Problem of Scaling
[Plot: parallel efficiency (0.500 to 1.000) vs. number of CPUs (1, 4, 16, 64, 256, 1024, 4096, 16384, 65536). Ideal efficiency stays at 1.0, while actual efficiency falls off as the CPU count grows. Note: higher is better]
Goal: Automatic Scaling Analysis
• Pinpoint scalability bottlenecks
• Quantify the magnitude of each problem
• Guide user to problems
• Diagnose the nature of the problem
Challenges for Pinpointing Scalability Bottlenecks
• Parallel applications
  — modern software uses layers of libraries
  — performance is often context dependent
• Monitoring
  — bottleneck nature: computation, data movement, synchronization?
  — size of petascale platforms demands acceptable data volume
  — low perturbation for use in production runs
[Figure: example climate code skeleton: main calls land, sea ice, ocean, and atmosphere components, each followed by a wait]
Performance Analysis with Expectations
• Users have performance expectations for parallel codes
  — strong scaling: linear speedup
  — weak scaling: constant execution time
• Putting expectations to work
  — define your expectations
  — measure performance under different conditions
    – e.g. different levels of parallelism or different inputs
  — compute the deviation from expectations for each calling context
    – for both inclusive and exclusive costs
  — correlate the metrics with the source code
  — explore the annotated call tree interactively
Weak Scaling Analysis for SPMD Codes
Performance expectation for weak scaling
  – total work increases linearly with # processors
  – execution time is the same as that on a single processor
• Execute code on p and q processors; without loss of generality, p < q
• Let T_i = total execution time on i processors
• For corresponding CCT nodes n_q and n_p, let C(n_q) and C(n_p) be their costs
• Expectation: C(n_q) = C(n_p)
• Fraction of excess work: X_w(n_q) = (C(n_q) − C(n_p)) / T_q   (parallel overhead / total time)
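A minimal sketch of this calculation over two profiles, assuming a simplified CCT keyed by calling-context paths (HPCToolkit matches nodes structurally, not by path strings); the contexts and costs below are hypothetical.

// Sketch: per-context fraction of excess work between two weak-scaling runs.
#include <cstdio>
#include <map>
#include <string>

// cost[context] = inclusive cost C(n) attributed to a calling context.
using CCT = std::map<std::string, double>;

// X_w(n_q) = (C(n_q) - C(n_p)) / T_q, where T_q is total time on q processors.
std::map<std::string, double>
excess_work(const CCT& p_run, const CCT& q_run, double T_q) {
  std::map<std::string, double> xw;
  for (const auto& [ctx, c_q] : q_run) {
    auto it = p_run.find(ctx);
    double c_p = (it != p_run.end()) ? it->second : 0.0;  // context absent on p
    xw[ctx] = (c_q - c_p) / T_q;
  }
  return xw;
}

int main() {
  // Hypothetical inclusive costs (seconds) for runs on p and q processors.
  CCT p_run = {{"main", 100.0}, {"main/solve", 80.0}, {"main/halo_exchange", 5.0}};
  CCT q_run = {{"main", 130.0}, {"main/solve", 82.0}, {"main/halo_exchange", 33.0}};
  for (const auto& [ctx, x] : excess_work(p_run, q_run, 130.0))
    std::printf("%-22s X_w = %5.3f\n", ctx.c_str(), x);
}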
Weak Scaling: 1K to 10K processors
[Figure: annotated CCTs showing the excess work computed as the 10K-processor profile minus the 1K-processor profile]
Scalability Analysis of LANL’s POP
[Screenshot: calling context view; strong scaling, 4 vs. 64 processors]
Scalability Analysis of MILC
[Screenshot: calling context view; weak scaling, 1 vs. 16 processors]
Scaling on Multicore Processors
• Compare performance
  — single vs. multiple processes on a multicore system
• Strategy
  — differential performance analysis
    – subtract the calling context trees as before, with a unit coefficient for each
Multicore Losses at the Procedure Level
Multicore Losses at the Loop Level
Analyzing Performance of Threaded Code
• Goal: identify where a program needs improvement
• Approach: attribute work and idleness
  — use statistical sampling to measure activity and inactivity
  — when a sample event interrupts an active thread t
    – charge t one sample to its current calling context as “work”
    – suppose w threads are working and i threads are idle
    – charge t i/w samples of idleness (see the sketch below)
      each working thread is partially responsible for idleness
  — when a sample event interrupts an idle thread, drop it
• Applicability
  — Pthreads
  — Cilk and OpenMP
    – use logical call stack unwinding to bridge the gap between logical and physical call stacks, e.g. with Cilk work stealing or OpenMP task scheduling
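A minimal sketch of the work/idleness charging rule; the data structures below are illustrative, not hpcrun’s, and assume the runtime maintains counts of working threads.

// Sketch: attribute work and idleness when a sample interrupts a thread.
#include <atomic>

struct ThreadMetrics {
  double work = 0.0;       // samples charged to this thread's calling contexts
  double idleness = 0.0;   // this thread's share of blame for idle peers
};

std::atomic<int> num_working{0};   // assumption: maintained as threads start/stop work
std::atomic<int> num_threads{0};

void on_sample(bool thread_is_working, ThreadMetrics& m) {
  if (!thread_is_working) return;            // samples in idle threads are dropped
  int w = num_working.load();
  int i = num_threads.load() - w;            // idle threads at this instant
  m.work += 1.0;                             // one sample of work in current context
  if (w > 0 && i > 0) m.idleness += double(i) / w;  // i/w samples of idleness
}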
Analysis of Multithreaded Cilk
Cholesky decomposition
[Screenshots: top-down work view and bottom-up idleness view]
What about Time?
• Profiling compresses out the temporal dimension
  — that’s why serialization is invisible in profiles
• What can we do? Trace call path samples
  — sketch:
    – N times per second, take a call path sample of each thread
    – organize the samples for each thread along a time line
    – view how the execution evolves left to right
    – what do we view?
      assign each procedure a color; view execution with a depth slice
[Figure: per-thread sample sequences stacked vertically, with time increasing left to right]
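A sketch of what per-thread trace records might look like under this scheme; the field layout is an assumption for illustration, not the actual trace format used by the tools.

// Sketch: time-ordered call path sample records for one thread.
#include <cstdint>
#include <vector>

struct TraceRecord {
  uint64_t time_ns;    // when the sample was taken
  uint32_t cct_node;   // id of the leaf calling-context node for this sample
};

struct ThreadTrace {
  std::vector<TraceRecord> records;   // kept in time order per thread
  void append(uint64_t t, uint32_t node) { records.push_back({t, node}); }
};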
Call Path Sample Trace for GTC
Gyrokinetic Toroidal Code (GTC)
• 32-process MPI program
• Each process has a pair of threads managed with OpenMP
[Screenshot: call path sample trace for GTC]
Status and Future Work
• Deployment on the 500 TF TACC Ranger system this month
• Measurement
  — sampling-based measurement on other platforms is emerging
    – deployment on Cray XT; working with IBM on Blue Gene/P
  — polishing call stack unwinding for Blue Gene
  — measurement of dynamic multithreading
• Analysis
  — statistical analysis of CCTs for parallel codes
  — cluster analysis and anomaly detection
  — parallel analysis of performance data
• Presentation
  — strategies for presenting performance for large-scale parallelism
  — handling out-of-core performance data
Download