An Analysis of Node Sharing on HPC Clusters using XDMoD/TACC_Stats
Joseph P. White, Ph.D.
Scientific Programmer - Center for Computational Research
University at Buffalo, SUNY
XSEDE14, July 13–18, 2014

Outline
• Motivation
• Overview of tools (XDMoD, tacc_stats)
• Background
• Results
• Conclusions
• Discussion

Co-Authors
• Robert L. DeLeon (UB)
• Thomas R. Furlani (UB)
• Steven M. Gallo (UB)
• Matthew D. Jones (UB)
• Amin Ghadersohi (UB)
• Cynthia D. Cornelius (UB)
• Abani K. Patra (UB)
• James C. Browne (UTexas)
• William L. Barth (TACC)
• John Hammond (TACC)

Motivation
• Node sharing benefits:
  – increases throughput by up to 26%
  – increases energy efficiency by up to 22% (Breslow et al.)
• Node sharing disadvantages:
  – resource contention
• The number of cores per node keeps increasing
• Ulterior motive:
  – prove out the toolset

A. D. Breslow, L. Porter, A. Tiwari, M. Laurenzano, L. Carrington, D. M. Tullsen, and A. E. Snavely. The case for colocation of HPC workloads. Concurrency and Computation: Practice and Experience, 2013. http://dx.doi.org/10.1002/cpe.3187

Tools
• XDMoD
  – NSF-funded open source tool that provides a wide range of usage and performance metrics on XSEDE systems
  – Web-based interface
  – Powerful charting features
• tacc_stats
  – low-overhead collection of system-wide performance data
  – runs on every node of a resource; collects data at job start, job end, and periodically during the job:
    • CPU usage
    • Hardware performance counters
    • Memory usage
    • I/O usage

Data flow
[Diagram: tacc_stats data flow into XDMoD]

Data flow
[Diagram: tacc_stats data flow into XDMoD, continued]

XDMoD Data Sources
[Diagram: XDMoD data sources]

Background
• CCR's HPC resource "Rush"
  – 8000+ cores
  – heterogeneous cluster: 8, 12, 16 or 32 cores per node
  – InfiniBand interconnect
  – Panasas parallel filesystem
  – SLURM resource manager
    • node sharing enabled by default
    • cgroup plugin to isolate jobs
• Academic computing center: higher percentage of small jobs than the large XSEDE resources
• All data from Jan - Feb 2014 (~370,000 jobs)

Number of jobs by job size
[Histogram: number of jobs by job size]

Results
• Exclusive jobs: no other jobs ran concurrently on the allocated node(s) (left-hand side of plots)
• Shared jobs: at least one other job was running on the allocated node(s) (right-hand side); a small classification sketch follows the results slides
• Metrics compared:
  – process memory usage
  – total OS memory usage
  – LLC read miss rates
  – job exit status
  – parallel filesystem bandwidth
  – InfiniBand interconnect bandwidth

Memory usage per core
• (MemUsed - FilePages - Slab) from /sys/devices/system/node/node0/meminfo (a parsing sketch follows the results slides)
[Plots: memory usage per core (GB), exclusive jobs vs. shared jobs]

Total memory usage per core (4 GB/core nodes)
[Plots: total memory usage per core (GB), exclusive jobs vs. shared jobs]

Last level cache (LLC) read miss rate per socket
• UNC_LLC_MISS:READ on the Intel Westmere uncore
• Gives an upper-bound estimate of DRAM bandwidth
[Plots: LLC read miss rate (10^6/s), exclusive jobs vs. shared jobs]

Job exit status reported by SLURM
[Bar chart: fraction of jobs by exit status (Successful, Killed, Failed), exclusive jobs vs. shared jobs]

Panasas parallel filesystem write rate per node
[Plots: write rate per node (B/s), exclusive jobs vs. shared jobs]

InfiniBand write rate per node
• Peaks truncated:
  – ~45,000 for exclusive jobs
  – ~80,000 for shared jobs
[Plots: write rate per node, log10(B/s), exclusive jobs vs. shared jobs]
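Sketch: classifying jobs as exclusive or shared
• A minimal illustration of the exclusive/shared definition from the Results slide, referenced there. This is a hypothetical helper written for this summary, not the actual XDMoD/tacc_stats analysis code; the job records and node names are made up.

# Hypothetical sketch: a job is "shared" if any other job ran concurrently on
# at least one of its allocated nodes, otherwise it is "exclusive".
# Quadratic in the number of jobs; a real pass over ~370,000 jobs would
# index jobs by node first.
from collections import namedtuple

Job = namedtuple("Job", ["job_id", "nodes", "start", "end"])  # times in epoch seconds

def classify_jobs(jobs):
    """Map job_id -> "shared" or "exclusive"."""
    labels = {}
    for job in jobs:
        shared = any(
            other.job_id != job.job_id
            and set(job.nodes) & set(other.nodes)                 # at least one common node
            and other.start < job.end and job.start < other.end  # overlapping in time
            for other in jobs
        )
        labels[job.job_id] = "shared" if shared else "exclusive"
    return labels

# Illustrative records: job 2 overlaps job 1 on node "n001", so both are shared.
jobs = [
    Job(1, ["n001"], 0, 3600),
    Job(2, ["n001"], 1800, 5400),
    Job(3, ["n002"], 0, 3600),
]
print(classify_jobs(jobs))  # {1: 'shared', 2: 'shared', 3: 'exclusive'}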
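Sketch: memory usage per core from per-NUMA-node meminfo
• A minimal sketch of the (MemUsed - FilePages - Slab) metric from the memory usage slide, summed over the per-NUMA-node meminfo files and divided by the core count. The file path and field names come from the slide; the parsing code is an illustration, not the tacc_stats implementation.

# Illustration only (not tacc_stats): estimate non-file-cache memory usage per
# core as (MemUsed - FilePages - Slab), summed over all NUMA nodes on the host.
import glob
import os

def read_numa_meminfo(path):
    """Parse a /sys/devices/system/node/node*/meminfo file into a dict of kB values."""
    values = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()  # e.g. ['Node', '0', 'MemUsed:', '12345678', 'kB']
            if len(parts) >= 4:
                values[parts[2].rstrip(":")] = int(parts[3])
    return values

def memory_used_per_core_gb():
    """Estimated (MemUsed - FilePages - Slab) per core, in GB."""
    total_kb = 0
    for path in glob.glob("/sys/devices/system/node/node*/meminfo"):
        info = read_numa_meminfo(path)
        total_kb += info.get("MemUsed", 0) - info.get("FilePages", 0) - info.get("Slab", 0)
    ncores = os.cpu_count() or 1  # all cores on the host
    return total_kb / 1024.0 / 1024.0 / ncores

print("Estimated memory used per core: %.2f GB" % memory_used_per_core_gb())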
Conclusions
• Little difference on average between the shared and exclusive jobs on Rush
• The majority of jobs have resource usage well below the maximum available
• We have created data collection/processing software that facilitates easy evaluation of system usage

Discussion
• Limitations of current work:
  – unable to determine the impact (if any) on job wall time
  – comparing overall average values for jobs
  – shared-node job statistics are convolved (per-node counters mix the contributions of all jobs on the node)
  – exit code is not a reliable way to determine failure

Future work
• Use Application Kernels to get a detailed analysis of interference
• Many more metrics now available:
  – FLOPS
  – CPU clock cycles per instruction (CPI)
  – CPU clock cycles per L1D cache load (CPLD)
• Add support for per-job metrics on shared nodes
• Study classes of applications

Questions
• BOF: XDMoD: A Tool for Comprehensive Resource Management of HPC Systems
  – 6:00pm - 7:00pm tomorrow, Room A602
• XDMoD: https://xdmod.ccr.buffalo.edu/
• tacc_stats: http://github.com/TACCProjects/tacc_stats
• Contact info: xdmod-help@ccr.buffalo.edu

Acknowledgments
• This work is supported by the National Science Foundation under grant numbers OCI 1203560 and OCI 1025159 for the Technology Audit Service (TAS) for XSEDE