On-line Automated Performance
Diagnosis on Thousands of Processors
Philip C. Roth
Future Technologies Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Paradyn Research Group
Computer Sciences Department
University of Wisconsin-Madison
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
High Performance Computing Today
• Large parallel computing resources
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - Grid
• Large, complex applications
  - ASCI Blue Mountain job sizes (2001)
    - 512 cpus: 17.8%
    - 1024 cpus: 34.9%
    - 2048 cpus: 19.9%
• Small fraction of peak performance is the rule
Achieving Good Performance
• Need to know what and where to tune
• Diagnosis and tuning tools are critical for realizing potential of large-scale systems
• On-line automated tools are especially desirable
  - Manual tuning is difficult
    - Finding interesting data in large data volume
    - Understanding application, OS, hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically
    - Dynamic control over data volume
    - Useful results from a single run
But: tools that work well in small-scale environments often don’t scale
Barriers to Large-Scale Performance Diagnosis
• Managing performance data volume
• Communicating efficiently between distributed tool components
• Presenting data and analysis results scalably
[Figure: a tool front end, tool daemons d0 … dP-1, and application processes a0 … aP-1]
Our Approach for Addressing These Scalability Barriers
• MRNet: multicast/reduction infrastructure for scalable tools
• Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications
• Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck diagnosis results for large-scale applications
Outline
• Performance Consultant
• MRNet
• Distributed Performance Consultant
• Sub-Graph Folding Algorithm
• Evaluation
• Summary
Performance Consultant
• Automated performance diagnosis
• Search for application performance problems
  - Start with global, general experiments (e.g., test CPUbound across all processes)
  - Collect performance data using dynamic instrumentation
    - Collect only the data desired
    - Remove the instrumentation when no longer needed
  - Make decisions about truth of each experiment
  - Refine search: create more specific experiments based on “true” experiments (those whose data is above user-configurable threshold)
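
The search loop on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Paradyn's implementation: the `measure` callback, the `children` refinement map, the experiment names, and the 0.2 threshold are hypothetical stand-ins for dynamic instrumentation, the refinement rules, and the user-configurable threshold.

```python
from collections import deque

def search(root, children, measure, threshold=0.2):
    """Top-down bottleneck search: test an experiment, refine only if it is true.

    root      -- most general experiment (hypothetical name, e.g. "CPUbound")
    children  -- dict mapping an experiment to its more specific refinements
    measure   -- callable returning the measured value for an experiment
                 (stands in for inserting and later removing instrumentation)
    threshold -- value above which an experiment is considered "true"
    """
    true_experiments = []
    queue = deque([root])
    while queue:
        exp = queue.popleft()
        value = measure(exp)                      # collect only the data this experiment needs
        if value > threshold:                     # decide the truth of the experiment
            true_experiments.append(exp)
            queue.extend(children.get(exp, []))   # refine into more specific experiments
    return true_experiments

# Made-up refinement rules and measurements, for illustration only.
children = {
    "CPUbound": ["CPUbound:main"],
    "CPUbound:main": ["CPUbound:Do_row", "CPUbound:Do_col"],
}
data = {"CPUbound": 0.9, "CPUbound:main": 0.8,
        "CPUbound:Do_row": 0.6, "CPUbound:Do_col": 0.05}
found = search("CPUbound", children, data.get)
```

Experiments that test false are never refined, so their more specific children are never instrumented; that is what keeps the collected data volume bounded in this sketch.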
Performance Consultant
[Figure: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, and c128.cs.wisc.edu; processes myapp367, myapp4287, and myapp27549]
Performance Consultant
[Figure: search history graph rooted at CPUbound, refined by function (main, Do_row, Do_col, Do_mult), by host (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu), and by process (myapp{367}, myapp{4287}, myapp{27549})]
Performance Consultant
[Figure: search history graph rooted at CPUbound, refined by function (main, Do_row, Do_col, Do_mult), by host (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu), and by process (myapp{367}, myapp{4287}, myapp{27549}), with host cham.cs.wisc.edu labeled]
MRNet: Multicast/Reduction Overlay Network
• Parallel tool infrastructure providing:
  - Scalable multicast
  - Scalable data synchronization and transformation
• Network of processes between tool front-end and back-ends
• Useful for parallelizing and distributing tool activities
  - Reduce latency
  - Reduce computation and communication load at tool front-end
• Joint work with Dorian Arnold (University of Wisconsin-Madison)
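
The effect of placing a reduction network between front-end and back-ends can be illustrated with a small Python sketch. It does not use MRNet's actual API; the nested-list tree encoding, the `reduce_tree` helper, and the data values are assumptions made for illustration. Each internal process applies a filter (here, `sum`) to its children's data, so no single process handles more inputs than its fan-out.

```python
def reduce_tree(node, filter_fn):
    """Apply filter_fn at each internal node of the overlay tree.

    node is either a leaf value (data from one back-end) or a list of child
    nodes. Returns the reduced value and the largest number of inputs that
    any single process had to handle.
    """
    if not isinstance(node, list):
        return node, 0
    results = [reduce_tree(child, filter_fn) for child in node]
    values = [value for value, _ in results]
    widest = max([len(node)] + [width for _, width in results])
    return filter_fn(values), widest

# 16 back-ends, each reporting a made-up CPU time of 1.0 second.
flat = [1.0] * 16                        # front-end connected to all 16 directly
tree = [[1.0] * 4 for _ in range(4)]     # one level of internal processes, fan-out 4

flat_total, flat_fanin = reduce_tree(flat, sum)
tree_total, tree_fanin = reduce_tree(tree, sum)
```

Both organizations compute the same total, but in the tree the front-end receives 4 partial results instead of 16 individual ones.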
Typical Parallel Tool Organization
[Figure: a tool front end, tool daemons d0 … dP-1, and application processes a0 … aP-1]
MRNet-based Parallel Tool Organization
[Figure: a tool front end, a multicast/reduction network of internal processes with filters, tool daemons d0 … dP-1, and application processes a0 … aP-1]
Performance Consultant: Scalability Barriers
• MRNet can alleviate scalability problem for global performance data (e.g., CPU utilization across all processes)
• But front-end still processes local performance data (e.g., utilization of process 5247 on host mcr398.llnl.gov)
Performance Consultant
[Figure: search history graph rooted at CPUbound, refined by function (main, Do_row, Do_col, Do_mult), by host (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu), and by process (myapp{367}, myapp{4287}, myapp{27549}), with host cham.cs.wisc.edu labeled]
Distributed Performance Consultant
[Figure: search history graph rooted at CPUbound, refined by function (main, Do_row, Do_col, Do_mult), by host (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu), and by process (myapp{367}, myapp{4287}, myapp{27549}), with host cham.cs.wisc.edu labeled]
Distributed Performance Consultant: Variants
• Natural steps from traditional centralized approach (CA)
• Partially Distributed Approach (PDA)
  - Distributed local searches, centralized global search
  - Requires complex instrumentation management
• Truly Distributed Approach (TDA)
  - Distributed local searches only
  - Insight into global behavior from combining local search results (e.g., using Sub-Graph Folding Algorithm)
  - Simpler tool design than PDA
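
The TDA idea of running only local searches and deriving global insight from their combined results can be sketched as below. This is a simplified illustration, not the tool's implementation: the `local_search` and `combine` helpers, the 0.2 threshold, and the per-function time fractions are hypothetical.

```python
from collections import Counter

def local_search(samples, threshold=0.2):
    """Local search run by one daemon for its own process (no global step).

    samples maps a function name to the fraction of time spent there; the
    names and the threshold are illustrative assumptions.
    """
    return frozenset(fn for fn, frac in samples.items() if frac > threshold)

def combine(local_results):
    """Derive a global picture purely from the local search results."""
    counts = Counter()
    for bottlenecks in local_results.values():
        counts.update(bottlenecks)
    return counts

# Made-up measurements for three processes.
per_process = {
    "myapp{367}":   {"Do_row": 0.5, "Do_col": 0.1, "Do_mult": 0.3},
    "myapp{4287}":  {"Do_row": 0.6, "Do_col": 0.1, "Do_mult": 0.1},
    "myapp{27549}": {"Do_row": 0.4, "Do_col": 0.3, "Do_mult": 0.1},
}
local_results = {proc: local_search(s) for proc, s in per_process.items()}
summary = combine(local_results)
```

Here `summary` reports how many processes found each function to be a bottleneck, without any experiment having been evaluated centrally.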
Distributed Performance Consultant: PDA
[Figure: search history graph rooted at CPUbound, refined by function (main, Do_row, Do_col, Do_mult), by host (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu), and by process (myapp{367}, myapp{4287}, myapp{27549}), with host cham.cs.wisc.edu labeled]
Distributed Performance Consultant: TDA
[Figure: per-process search sub-graphs (main, Do_row, Do_col, Do_mult) for processes myapp{367}, myapp{4287}, myapp{27549} on hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu, with host cham.cs.wisc.edu labeled]
Distributed Performance Consultant: TDA
[Figure: per-process search sub-graphs (main, Do_row, Do_col, Do_mult) for processes myapp{367}, myapp{4287}, myapp{27549} on hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu, with host cham.cs.wisc.edu and the Sub-Graph Folding Algorithm labeled]
Search History Graph Example
[Figure: search history graph rooted at CPUbound, with hosts c33.cs.wisc.edu and c34.cs.wisc.edu, processes myapp{1272}, myapp{1273}, myapp{7624}, and myapp{7625}, and per-process sub-graphs over main and functions A through E]
Search History Graphs
• Search History Graph is effective for presenting search-based performance diagnosis results…
• …but it does not scale to a large number of processes because it shows one sub-graph per process
Sub-Graph Folding Algorithm
• Combines host-specific sub-graphs into composite sub-graphs
• Each composite sub-graph represents a behavioral category among application processes
  - Dynamic clustering of processes by qualitative behavior
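
One way to picture the folding step is the Python sketch below. It is a simplification under stated assumptions, not the published algorithm: sub-graphs are encoded as sets of paths, host and process names are wildcarded by replacing digit runs with `*`, and nodes are merged whenever their wildcarded paths match. The `wildcard` and `fold` helpers and the sample data are hypothetical.

```python
import re

def wildcard(name):
    """Replace each run of digits with '*' (c33.cs.wisc.edu -> c*.cs.wisc.edu)."""
    return re.sub(r"\d+", "*", name)

def fold(subgraphs):
    """Fold per-process sub-graphs into a composite keyed by wildcarded paths.

    subgraphs maps (host, process) to a set of paths, each path a tuple of
    function names starting at the root of that process's sub-graph. Nodes
    with the same wildcarded path are merged; each folded node records how
    many processes contributed to it.
    """
    members = {}
    for (host, process), paths in subgraphs.items():
        prefix = (wildcard(host), wildcard(process))
        for path in paths:
            members.setdefault(prefix + path, set()).add((host, process))
    return {path: len(procs) for path, procs in members.items()}

# Made-up search results for three processes on two hosts.
subgraphs = {
    ("c33.cs.wisc.edu", "myapp{1272}"): {("main",), ("main", "A"), ("main", "A", "B")},
    ("c33.cs.wisc.edu", "myapp{1273}"): {("main",), ("main", "A"), ("main", "C")},
    ("c34.cs.wisc.edu", "myapp{7624}"): {("main",), ("main", "A"), ("main", "A", "B")},
}
folded = fold(subgraphs)
```

The folded graph has one node per distinct path regardless of how many processes share it, so its size depends on the number of distinct behaviors rather than on the number of processes.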
SGFA: Example
[Figure: the search history graph rooted at CPUbound for hosts c33.cs.wisc.edu and c34.cs.wisc.edu (processes myapp{1272}, myapp{1273}, myapp{7624}, myapp{7625}) shown with a folded composite sub-graph under c*.cs.wisc.edu and myapp{*}, over main and functions A through E]
SGFA: Implementation
• Custom MRNet filter
  - Filter in each MRNet process keeps folded graph of search results from all reachable daemons
  - Updates periodically sent upstream
  - By induction, filter in front-end holds entire folded graph
  - Optimization for unchanged graphs
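
The filter behavior listed above can be sketched as a small stateful object. This is an illustrative Python model, not MRNet's filter interface: the `FoldingFilter` class, its method names, and the path-to-count graph encoding are assumptions.

```python
class FoldingFilter:
    """State kept by one overlay process (hypothetical class, not MRNet's API)."""

    def __init__(self):
        self.from_children = {}   # child id -> latest folded graph from that child
        self.last_sent = None     # graph most recently forwarded upstream

    def receive(self, child_id, graph):
        """Record the latest folded graph (path -> process count) from a child."""
        self.from_children[child_id] = dict(graph)

    def merged(self):
        """Fold the children's graphs: identical paths merge, counts add up."""
        total = {}
        for graph in self.from_children.values():
            for path, count in graph.items():
                total[path] = total.get(path, 0) + count
        return total

    def flush(self):
        """Called periodically; returns an update for the parent, or None when
        nothing changed since the last update (the unchanged-graph case)."""
        current = self.merged()
        if current == self.last_sent:
            return None
        self.last_sent = current
        return current

# Two daemons report to one internal process, which reports to the front-end.
internal = FoldingFilter()
internal.receive("d0", {("main", "A"): 1})
internal.receive("d1", {("main", "A"): 1, ("main", "B"): 1})
first = internal.flush()

front_end = FoldingFilter()
front_end.receive("internal0", first)
top = front_end.flush()
again = internal.flush()      # nothing changed, so no update is produced
```

Because every level forwards the fold of everything below it, the front-end's filter ends up with the fold of all daemons' results, which mirrors the induction argument on the slide.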
DPC + SGFA: Evaluation
• Modified Paradyn to perform bottleneck searches using CA, PDA, or TDA approach
• Modified instrumentation cost tracking to support PDA
  - Track global, per-process instrumentation cost separately
  - Simple fixed-partition policy for scheduling global and local instrumentation
• Implemented Sub-Graph Folding Algorithm as custom MRNet filter to support TDA (used by all)
• Instrumented front-end, daemons, and MRNet internal processes to collect CPU, I/O load information
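
A fixed-partition policy for instrumentation cost might look like the sketch below. The slides do not give the policy's details, so everything here is an assumption for illustration: the `CostBudget` class, the 0.10 cost limit, the 50/50 split, and the cost units.

```python
class CostBudget:
    """Fixed partition of one process's instrumentation cost limit.

    The limit, the 50/50 split, and the cost units are illustrative
    assumptions, not values from the evaluation.
    """

    def __init__(self, limit=0.10, global_share=0.5):
        self.capacity = {"global": limit * global_share,
                         "local": limit * (1.0 - global_share)}
        self.used = {"global": 0.0, "local": 0.0}   # tracked separately

    def try_insert(self, kind, cost):
        """Admit instrumentation only if its own partition has room."""
        if self.used[kind] + cost > self.capacity[kind]:
            return False          # defer: this partition is full
        self.used[kind] += cost
        return True

budget = CostBudget()
first_global = budget.try_insert("global", 0.04)    # fits in the global partition
second_global = budget.try_insert("global", 0.02)   # would exceed the global partition
first_local = budget.try_insert("local", 0.02)      # local partition is unaffected
```

With separate partitions, a burst of global experiments cannot starve a process's local search, at the price of leaving capacity idle when one side has little to do.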
DPC + SGFA: Evaluation
• su3_rmd
  - QCD pure lattice gauge theory code
  - C, MPI
  - Weak scaling scalability study
• LLNL MCR cluster
  - 1152 nodes (1048 compute nodes)
  - Two 2.4 GHz Intel Xeons per node
  - 4 GB memory per node
  - Quadrics Elan3 interconnect (fat tree)
  - Lustre parallel file system
DPC + SGFA: Evaluation
• PDA and TDA: bottleneck searches with up to 1024 processes so far, limited by partition size
• CA: scalability limit at less than 64 processes
• Similar qualitative results from all approaches
DPC: Evaluation
SGFA: Evaluation
Summary
• Tool scalability is critical for effective use of large-scale computing resources
• On-line automated performance tools are especially important at large scale
• Our approach:
  - MRNet
  - Distributed Performance Consultant (TDA) plus Sub-Graph Folding Algorithm
References
• P.C. Roth, D.C. Arnold, and B.P. Miller, “MRNet: a Software-Based Multicast/Reduction Network for Scalable Tools,” SC 2003, Phoenix, Arizona, November 2003
• P.C. Roth and B.P. Miller, “The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes,” in submission
• Publications available from http://www.paradyn.org
• MRNet software available from http://www.paradyn.org/mrnet