Using XDMoD to Facilitate XSEDE Operations, Planning and Analysis

Thomas R. Furlani1, Barry I. Schneider2, Matthew D. Jones1, John Towns3, David L. Hart4,
Steven M. Gallo1, Robert L. DeLeon1, Charng-Da Lu1, Amin Ghadersohi1, Ryan J. Gentner1,
Abani K. Patra5, Gregor von Laszewski6, Fugang Wang6, Jeffrey T. Palmer1, Nikolay Simakov1
1Center for Computational Research, University at Buffalo, SUNY
2CISE - Advanced Computing Infrastructure, National Science Foundation
3NCSA - University of Illinois
4National Center for Atmospheric Research
5Mech. & Aerospace Eng. Dept., University at Buffalo, SUNY
6Pervasive Technology Institute - Indiana University
Tom Furlani, PhD
Director - Center for Computational Research
University at Buffalo, SUNY
XSEDE13 JULY 22 – 25, 2013
Outline
• Overview of Technology Audit Service (XDMoD)
• XDMoD Case Studies
– Data Driven CI Planning for XSEDE
– System Operation and Maintenance
– Interpreting XDMoD Data
• Future XDMoD Functionality
– SUPReMM (Lightning Talk – Wed, 3PM, Marina Ballroom F&G)
– PEAK (NICS) (Optimizing Utilization Across XSEDE – Thurs, 8:30AM,
Marina Ballroom G)
– Scientific Impact and Open Source Version (XDMoD TAS BOF – Wed,
6PM, Palomar)
TECHNOLOGY AUDIT SERVICE
CoAuthors
• Barry I. Schneider (NSF)
• Matthew D. Jones (UB)
• John Towns (NCSA)
• David L. Hart (NCAR)
• Steven M. Gallo (UB)
• Robert L. DeLeon (UB)
• Charng-Da Lu (UB)
• Amin Ghadersohi (UB)
• Ryan J. Gentner (UB)
• Abani K. Patra (UB)
• Gregor von Laszewski (Indiana)
• Fugang Wang (Indiana)
• Jeffrey T. Palmer (UB)
• Nikolay Simakov (UB)
Motivation
• Measuring utilization of CI provides an understanding of how resources are being utilized
• HPC systems are a complex combination of software, processors, memory, networks, and storage systems - difficult to know if optimal performance is being realized, or even if all subcomponents are functioning properly
• Example: Log File Analysis Discovers Two Malfunctioning Nodes
[Figure: Log Size As Of 9/12/2011. Log size (bytes, 0 to 40,000,000) versus node number (0 to 1000). Annotations: job scheduler error, node #126; loose cable, node #348.]
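The log file example above can be illustrated with a minimal sketch, assuming per-node log sizes are already collected into a dictionary. The function name, the threshold, and the sample data are hypothetical and chosen only for illustration; the slides do not describe the method actually used.

```python
from statistics import median

def flag_outlier_nodes(log_sizes, threshold=10.0):
    """Return node numbers whose log size is far above the cluster median.

    log_sizes: dict mapping node number -> log size in bytes (hypothetical input).
    threshold: number of MADs above the median treated as anomalous (assumed value).
    """
    sizes = list(log_sizes.values())
    med = median(sizes)
    # Median absolute deviation is robust to the very outliers being sought.
    mad = median(abs(s - med) for s in sizes) or 1.0
    return sorted(n for n, s in log_sizes.items() if (s - med) / mad > threshold)

# Made-up data: most nodes log about 1 MB, two nodes log far more.
sizes = {n: 1_000_000 + (n % 7) * 10_000 for n in range(1, 1001)}
sizes[126] = 35_000_000
sizes[348] = 30_000_000
flagged = flag_outlier_nodes(sizes)
print(flagged)
```

A median-based rule is used here because a mean and standard deviation would themselves be inflated by the oversized logs.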
XSEDE Technology Audit Service (TAS)
• Provide Auditing and Quality of Service (QoS) Metrics
• Primary components to TAS
– XDMoD: XSEDE Metrics on Demand Portal
• Analytics Framework for XSEDE
• Display results of all metrics (utilization, wait time, etc.)
• Easy to use
– Application Kernel Framework
• Measure performance of XSEDE infrastructure
• Diagnostic set of tools – early identification of system problems
• Broader Impact
– Open source framework for academic HPC centers
• Organizations
– Buffalo, Indiana (Laszewski), Michigan (Finholt), UT-NICS (You)
XDMoD Data Sources
XDMoD: XD Metrics on Demand Portal
• Display metrics, Role Based, Custom Report Builder
XDMoD Case Studies
• Data Driven CI Planning for XSEDE
• System Operation and Maintenance
• Interpreting XDMoD Data
Data Driven CI Planning for XSEDE
• Largest, average and total SU allocations on XSEDE over time. Average and
largest allocations have increased by more than a factor of 10 over the time
period
Data Driven CI Planning for XSEDE
• Total service unit usage by parent science: Molecular Bioscience usage has
grown over time and now rivals that of Physics
Data Driven CI Planning for XSEDE
• However, average core count varies widely across parent sciences – molecular
bioscience jobs tend to use a relatively small number of processors
CI System Operation and Maintenance
• Application kernels help detect user environment anomalies at CCR
• Example: Performance variation of NWChem due to a bug in a commercial parallel
file system that was subsequently fixed by the vendor
CI System Operation and Maintenance
• Sudden decrease in file system performance on TACC Lonestar4 as measured by 3
different application kernels (IOR, MPI-Tile-IO, and IMB)
CI System Operation and Maintenance
• Application kernel control process automatically detects underperforming
application kernels. Red zone indicates an application kernel
that is underperforming
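One way to picture a control process of this kind is a simple control chart: establish a baseline from early runs, then flag later runs that exceed a control limit. The sketch below assumes that approach; the function name, the baseline length, the multiplier k, and the runtimes are all hypothetical and are not taken from the XDMoD implementation.

```python
from statistics import mean, stdev

def underperforming_runs(runtimes, baseline_n=10, k=3.0):
    """Return indices of runs slower than the control limit.

    runtimes: wall-clock times (seconds) of repeated runs of one kernel.
    baseline_n, k: assumed tuning parameters, not values from XDMoD.
    """
    baseline = runtimes[:baseline_n]
    # Upper control limit: baseline mean plus k sample standard deviations.
    limit = mean(baseline) + k * stdev(baseline)
    return [i for i, t in enumerate(runtimes) if i >= baseline_n and t > limit]

# Made-up runtimes: stable near 100 s, then a degradation at runs 12 and 13.
history = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 101, 130, 135, 99]
slow = underperforming_runs(history)
print(slow)
```

Runs flagged this way would correspond to points falling in the red zone of the plot.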
Interpreting XDMoD Data
• Like any analysis system, care must be exercised in interpretation of data
from XDMoD
• Ex. Distribution of job sizes for all parent science Physics jobs in XSEDE
resources for the period 2008-2012
Interpreting XDMoD Data
• Mean core count for Physics jobs in XSEDE resources for the period 2008-2012,
including (blue line) and excluding (red line) serial runs
[Figure: Number of Serial Physics Jobs by Resource. Annotation: High Throughput Jobs Start at Purdue.]
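The effect of serial runs on the mean core count can be seen in a small worked example. The job mix below is invented purely for illustration and does not reflect XSEDE data; the function name is likewise hypothetical.

```python
def mean_core_count(core_counts, include_serial=True):
    """Mean cores per job, optionally ignoring serial (single-core) jobs."""
    jobs = [c for c in core_counts if include_serial or c > 1]
    return sum(jobs) / len(jobs) if jobs else 0.0

# Made-up job mix: many serial jobs alongside a few parallel ones.
jobs = [1] * 900 + [64] * 80 + [1024] * 20
with_serial = mean_core_count(jobs)
without_serial = mean_core_count(jobs, include_serial=False)
print(with_serial, without_serial)
```

In this invented mix, a large number of high-throughput serial jobs pulls the mean down by roughly a factor of ten even though the parallel workload is unchanged, which is why the two lines in the plot can diverge.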
Future XDMoD Functionality: SUPReMM
• SUPReMM (Lightning Talk – Wed, 3PM)
– Collaboration with TACC and U Texas at Austin
– Comprehensive job level resource use measurement for large clusters
– Will supply XDMoD with some missing job usage data – application run,
memory, local I/O, network, file-system, and CPU usage
– Sample application report for Lonestar4
Future XDMoD Functionality: PEAK
• NICS – PEAK (Thursday, 8:30AM)
– Optimizing Utilization Across XSEDE (Dr. Haihang You)
– Performance Environment Autoconfiguration FrameworK
– UT-NICS project to automatically tune key libraries and application kernels
– Ex. Performance of Amber on Kraken – Amber built with PGI much faster
Future XDMoD Functionality
Open Source XDMoD & Scientific Impact
• Open Source Version: (XDMoD BOF - Wed, 6PM)
– XDMoD functionality for non-XSEDE HPC centers
– Installation by system administrators
• Programming not required
• Guided textual installation process
• Installation support provided by TAS Team
– Pre-existing central database not required
• Aggregate data from available sources
• Resource manager log files or existing database
– Currently recruiting for beta-testing program
• Scientific Impact
– Preliminary XSEDE-based H-Index
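For readers unfamiliar with the metric, the standard h-index is the largest h such that h publications each have at least h citations. The sketch below computes that standard definition only; how XDMoD gathers publications or aggregates them for an XSEDE-based index is not described in these slides, and the citation counts are made up.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Made-up citation counts for five publications.
example = h_index([10, 8, 5, 4, 3])
print(example)
```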
Acknowledgement
• This work was sponsored by NSF under grant number
OCI 1025159 for the development of Technology Audit
Service for XSEDE.
• Contact Info
– furlani@buffalo.edu
– XDMoD https://xdmod.ccr.buffalo.edu/
– xdmod-support@ccr.buffalo.edu