Scalable On-line Automated Performance Diagnosis

The Distributed Performance Consultant:
Automated Performance Diagnosis on
1000s of Processors
Philip C. Roth
pcroth@cs.wisc.edu
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706
USA
April 14, 2004
High Performance Computing Today
• Large parallel computing resources
– Tightly coupled systems
• Earth Simulator (Japan, 5120 CPUs)
• ASCI Purple (LLNL, 12K CPUs planned)
• BlueGene/L (LLNL, 128K CPUs planned)
– Clusters
• Lightning (LANL, 2816 CPUs)
• Aspen Systems (Forecast Systems Lab, 1536 CPUs)
– Grid
• Large applications
– ASCI Q job sizes (January 1 to January 24, 2004)
• 257–512 CPUs: 19.7%
• 513–1024 CPUs: 24.6%
• 1025–2048 CPUs: 16.4%
© Philip C. Roth 2004
Distributed Performance Consultant
Barriers to Large-Scale Performance Diagnosis
• Managing performance data volume
• Communicating efficiently between
distributed tool components
• Making scalable presentation of
data and analysis results
[Figure: a tool front end, tool daemons d0 through dP-1, and application processes a0 through aP-1]
Overcoming Scalability Barriers: Our Approach
1. MRNet: multicast/reduction infrastructure
for building scalable tools
2. SStart: strategy for improving tool start-up
latency
3. Distributed Performance Consultant: strategy
for efficiently finding performance
bottlenecks in large-scale applications
4. Distributed Data Manager: for efficient,
distributed performance data management
5. Sub-Graph Folding Algorithm: algorithm for
effectively presenting bottleneck search
results for large-scale applications
The Performance Consultant
• Automated performance diagnosis
• Search for application performance problems
– Start with global, general experiments (e.g., test
CPUbound across all processes)
– Collect performance data using dynamic instrumentation
– Make decisions about truth of each experiment
– Refine search: create more specific experiments based
on “true” experiments
– Repeat
[Figure: search tree rooted at CPUbound, with nodes for functions (main, Do_row, Do_col) and hosts (c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu)]
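The search loop above can be sketched in a few lines. This is only an illustration, not Paradyn's implementation: the function name `refine_search`, the refinement table, the CPU fractions, and the 0.5 threshold are all assumptions made up for the example.

```python
from collections import deque

def refine_search(root, refinements, measure, threshold):
    """Breadth-first search that refines only experiments that test true."""
    true_experiments = []
    pending = deque([root])
    while pending:
        focus = pending.popleft()
        # An experiment is "true" when its measured value exceeds the threshold.
        if measure(focus) > threshold:
            true_experiments.append(focus)
            # Refine: queue the more specific experiments below this one.
            pending.extend(refinements.get(focus, []))
    return true_experiments

# Hypothetical refinement tree (by function and by host) for CPUbound.
REFINEMENTS = {
    "all processes": ["main", "c001.cs.wisc.edu", "c002.cs.wisc.edu"],
    "main": ["Do_row", "Do_col"],
}

# Hypothetical measured CPU-time fractions for each focus.
CPU_FRACTION = {
    "all processes": 0.9,
    "main": 0.8,
    "c001.cs.wisc.edu": 0.85,
    "c002.cs.wisc.edu": 0.2,
    "Do_row": 0.7,
    "Do_col": 0.1,
}

bottlenecks = refine_search("all processes", REFINEMENTS, CPU_FRACTION.get, 0.5)
```

Experiments that test false (here c002.cs.wisc.edu and Do_col) are never refined, which is what keeps the amount of active instrumentation small.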
The Performance Consultant
• Works well for sequential and small-scale
parallel applications…
• …but front-end is a bottleneck when looking
for performance problems in large-scale
applications
– High data processing load
– Limited network bandwidth
Our Approach
• MRNet to reduce load for processing global
performance data (e.g., average CPU
utilization across all processes)
• But front-end still processes local
performance data (e.g., CPU utilization in
process 5247 on host blue26.pacific.llnl.gov)
• Solution: examine the behavior of a specific process on
the host that runs that process
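To make the first bullet concrete, here is a minimal sketch of a tree-based reduction of the kind a multicast/reduction network performs. It does not use the MRNet API; the tree layout, node names, and sample values are hypothetical, and the reduction is written as a plain recursive function.

```python
def reduce_sum_count(node, children, local_samples):
    """Return (sum, count) of all samples in the subtree rooted at node."""
    total = sum(local_samples.get(node, []))
    count = len(local_samples.get(node, []))
    for child in children.get(node, []):
        child_total, child_count = reduce_sum_count(child, children, local_samples)
        total += child_total
        count += child_count
    return total, count

# Hypothetical reduction tree: front end, two internal nodes, four daemons.
TREE = {
    "front_end": ["i0", "i1"],
    "i0": ["d0", "d1"],
    "i1": ["d2", "d3"],
}

# Hypothetical CPU utilization sample reported by each daemon.
SAMPLES = {"d0": [0.5], "d1": [1.0], "d2": [0.25], "d3": [0.25]}

total, count = reduce_sum_count("front_end", TREE, SAMPLES)
average_cpu = total / count
```

With the reduction done inside the tree, the front end combines one partial result per child rather than one sample per process. Data for a single process cannot be reduced this way, which is why local data is the remaining front-end load.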
Our Approach: Example
[Figure: the CPUbound search distributed across hosts. Labels: cham.cs.wisc.edu; c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu; processes myapp{367}, myapp{4287}, myapp{27549}; functions main, Do_row, Do_col, Do_mult]
The Distributed Performance Consultant (DPC)
• Distributed performance problem search strategy
• Distributed performance data management
• DPC-aware Dynamic Instrumentation management
[Figure: DPC architecture showing control and data flows. Front-End: Data Manager, Performance Consultant. MRNet. Daemon: Local Instrumentation Manager, Local Data Manager, Local Performance Consultant, Data Collector]
DPC: Search Strategy
• Local Performance Consultant agents (LPCs) in each daemon, Global PC (GPC) in front-end
• GPC controls overall search, manages global experiments
• LPCs control portion of search specific to their local host
DPC: Search Strategy
• Like traditional PC, DPC begins search
with global, general experiments (e.g.,
CPUbound across all processes)
• When search is refined to specific host,
GPC delegates that portion of the
search to LPC on that host
• LPC examines local behavior of all
processes on the local host, returns
results to GPC
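The delegation step can be sketched as follows. This is an illustration only: the class names `GlobalPC` and `LocalPC`, their methods, the per-process CPU fractions, and the threshold are assumptions for the example, and direct method calls stand in for messages sent between front-end and daemons.

```python
class LocalPC:
    """Hypothetical LPC: searches the processes on its own host."""

    def __init__(self, host, process_cpu):
        self.host = host
        self.process_cpu = process_cpu  # {process name: CPU fraction}

    def host_average(self):
        values = list(self.process_cpu.values())
        return sum(values) / len(values)

    def search(self, threshold):
        # Examine every local process and report the ones that test true.
        return sorted(p for p, cpu in self.process_cpu.items() if cpu > threshold)


class GlobalPC:
    """Hypothetical GPC: runs global experiments, delegates host-level ones."""

    def __init__(self, lpcs):
        self.lpcs = {lpc.host: lpc for lpc in lpcs}

    def search(self, threshold):
        # Global experiment across all hosts (would arrive pre-aggregated).
        averages = {host: lpc.host_average() for host, lpc in self.lpcs.items()}
        if sum(averages.values()) / len(averages) <= threshold:
            return {}
        results = {}
        for host, average in averages.items():
            if average > threshold:
                # Search is now specific to one host: hand it to that LPC.
                results[host] = self.lpcs[host].search(threshold)
        return results


gpc = GlobalPC([
    LocalPC("c001.cs.wisc.edu", {"myapp{367}": 0.9}),
    LocalPC("c002.cs.wisc.edu", {"myapp{4287}": 0.2}),
    LocalPC("c128.cs.wisc.edu", {"myapp{27549}": 0.8}),
])
findings = gpc.search(threshold=0.5)
```

Only the results of each delegated search return to the GPC; the per-process data that produced them stays on the host.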
DPC: Performance Data Management
• Local Data Managers (LDMs) in each daemon, Global Data Manager in front-end
• Publish/subscribe model for data flow
• MRNet for global data aggregation
• Possible caching at LDMs (no caching in initial implementation)
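A publish/subscribe data manager without caching can be sketched as below. The class and method names are hypothetical and are not taken from the Paradyn sources; the point is only that a data collector publishes a sample once and the manager forwards it to whoever subscribed.

```python
class LocalDataManager:
    """Hypothetical LDM: forwards each published sample to its subscribers."""

    def __init__(self):
        self._subscribers = {}  # metric name -> list of callbacks

    def subscribe(self, metric, callback):
        self._subscribers.setdefault(metric, []).append(callback)

    def publish(self, metric, value):
        # No caching: a sample with no subscribers is simply dropped.
        callbacks = self._subscribers.get(metric, [])
        for callback in callbacks:
            callback(metric, value)
        return len(callbacks)


ldm = LocalDataManager()
lpc_inbox = []     # stands in for the Local Performance Consultant
global_inbox = []  # stands in for the path toward the front-end

ldm.subscribe("cpu", lambda metric, value: lpc_inbox.append((metric, value)))
ldm.subscribe("cpu", lambda metric, value: global_inbox.append((metric, value)))

delivered = ldm.publish("cpu", 0.7)
dropped = ldm.publish("io_wait", 0.1)
```

Because consumers register interest by metric, the same sample can feed a local search and a global aggregation without the collector knowing about either.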
DPC: Instrumentation Management
• Local Instrumentation Manager (LIM) services requests from GPC and LPC without starving either
• Instrumentation management tailored to distributed nature of DPC
– New model for instrumentation cost
– New policy for scheduling instrumentation
DPC: Instrumentation Management
• Mismatch between centralized control of
instrumentation cost and distributed nature
of DPC
• DPC’s instrumentation management
– New model for cost of Dynamic Instrumentation
– New policy for scheduling Dynamic
Instrumentation requests
DPC: Instrumentation Cost Model
• User-specified cost threshold, broadcast to
each Local Instrumentation Manager (LIM)
• Each LIM restricts active instrumentation
to cost threshold
• GPC throttles instrumentation requests
when it does not receive timely responses to
instrumentation requests (avoids need to
gather instrumentation cost from all LIMs
to front-end)
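The two halves of this cost model can be sketched as follows. The slides do not say how the GPC throttles, so the halve-on-late, grow-by-one-on-timely window below is an assumption chosen for illustration, as are the class names, the integer cost units, and the timeout.

```python
class LocalInstrumentationManager:
    """Hypothetical LIM: keeps active instrumentation cost under a threshold."""

    def __init__(self, cost_threshold):
        self.cost_threshold = cost_threshold  # broadcast, user-specified
        self.active_cost = 0
        self.deferred = []

    def request(self, name, cost):
        if self.active_cost + cost <= self.cost_threshold:
            self.active_cost += cost
            return True
        # Over budget: hold the request until cost is freed.
        self.deferred.append((name, cost))
        return False


class RequestThrottle:
    """Hypothetical GPC-side throttle driven only by response timeliness."""

    def __init__(self, window=8, timeout=1.0):
        self.window = window    # max outstanding requests
        self.timeout = timeout  # seconds before a response counts as late

    def record_response(self, elapsed):
        if elapsed > self.timeout:
            self.window = max(1, self.window // 2)  # assumed back-off rule
        else:
            self.window += 1                        # assumed recovery rule
        return self.window


lim = LocalInstrumentationManager(cost_threshold=10)
accepted = [lim.request("a", 5), lim.request("b", 3), lim.request("c", 5)]

throttle = RequestThrottle(window=8, timeout=1.0)
after_late = throttle.record_response(2.0)
after_timely = throttle.record_response(0.1)
```

The throttle never asks any LIM for its current cost; slow responses are taken as the signal that daemons are at their threshold.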
DPC: Instrumentation Scheduling Policy
• LIM needs to schedule instrumentation requests
from both LPC and GPC
– GPC requests must be satisfied by LIM in all daemons to
be useful
– LPC requests are restricted to a single daemon
• Prototype LIM’s instrumentation scheduling policy
chosen to ensure both local and global searches
make progress
– Distinguishes instrumentation cost by instrumentation
type (local vs. global)
– If both types are present, guarantee a percentage of
total instrumentation cost threshold for each type
(percentage is user-configurable)
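One way to read this policy is as a split of the cost threshold into two budgets whenever both request types are waiting. The sketch below takes that reading; it is a simplification (a strict partition with no lending of unused budget), and the function name, request tuples, and integer cost units are assumptions for the example.

```python
def schedule(requests, threshold, global_percent):
    """Admit requests in arrival order under per-type cost budgets.

    requests: list of (name, kind, cost), with kind "global" or "local".
    """
    kinds = {kind for _, kind, _ in requests}
    if kinds == {"global", "local"}:
        # Both types present: reserve a share of the threshold for each.
        global_budget = threshold * global_percent // 100
        budget = {"global": global_budget, "local": threshold - global_budget}
    else:
        # Only one type present: it may use the whole threshold.
        budget = {"global": 0, "local": 0}
        for kind in kinds:
            budget[kind] = threshold
    admitted = []
    for name, kind, cost in requests:
        if cost <= budget[kind]:
            budget[kind] -= cost
            admitted.append(name)
    return admitted


mixed = schedule(
    [("g1", "global", 50), ("g2", "global", 30),
     ("l1", "local", 30), ("l2", "local", 20)],
    threshold=100,
    global_percent=60,
)
local_only = schedule(
    [("l1", "local", 30), ("l2", "local", 20), ("l3", "local", 60)],
    threshold=100,
    global_percent=60,
)
```

In the mixed case neither search is starved: each type gets at least one request admitted even though the global requests alone would have consumed most of the threshold.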
DPC: Status
• Implementation well underway, but prototype
implementation not yet complete
– Daemons have been multithreaded
– Local Data Managers in place
– Local Performance Consultant agents in progress
• Planning experiments to evaluate overall
scalability with various instrumentation cost
models and scheduling policies
Summary
The DPC is a major part of our approach for
overcoming barriers that limit the scalability of
automated performance diagnosis tools.
Using a distributed search strategy, distributed
performance data manager, and instrumentation
management techniques tailored to its distributed
nature, the DPC will greatly increase the scalability
of Paradyn’s automated performance diagnosis
functionality.
http://www.paradyn.org
paradyn@cs.wisc.edu