Final Exam of Tanzima Zerin Islam
School of Electrical & Computer Engineering, Purdue University, West Lafayette, IN
Date: April 8, 2013

Distributed Computing Environments
- High-Performance Computing (HPC): projected MTBF of 3-26 minutes at exascale; failures come from hardware and software.
- Grid: cycle-sharing system; highly volatile environment; failures come from eviction of guest jobs (e.g., resources at Notre Dame, Purdue, and Indiana U. connected over the Internet).

Tanzima Islam (tislam@purdue.edu), Reliable & Scalable Checkpointing Systems

Fault Tolerance with Checkpoint-Restart
Checkpoints are execution states.
- System-level: memory state; compressible.
- Application-level: selected variables; hard to compress. Example:

struct ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

Challenges in Checkpointing Systems
- HPC: scalability of checkpointing systems.
- Grid: use of dedicated checkpoint servers.

Contributions of This Thesis
- FALCON: reliable checkpointing system in Grid, 2007-2009 [Best Student Paper Nomination, SC'09].
- Compression on multi-core, 2009-2010 [2nd Place, ACM Student Research Competition '10].
- MCRENGINE: scalable checkpointing system in HPC, 2010-2012 [Best Student Paper Nomination, SC'12].
- MCRCLUSTER: unpublished preliminary work, 2012-2013.

Agenda
- [MCRENGINE] Scalable checkpointing system for HPC
- [MCRCLUSTER] Benefit-aware clustering
- Future directions

A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Collaborators: Kathryn Mohror, Adam Moody, Bronis de Supinski

Big Picture of HPC
Compute nodes reach the parallel file system through gateway nodes, so checkpoint traffic competes at several points: network contention among compute nodes, contention for shared file-system resources at the gateways, and contention with other clusters (e.g., Atlas and Hera) for the parallel file system.
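The application-level style above, checkpointing only selected variables as in ToyGrp, can be sketched in a few lines (a sketch in Python rather than the C/HDF5 of the slides; the class and file format are illustrative, not the thesis implementation):

```python
# Sketch of application-level checkpointing: only selected variables
# are saved and restored, unlike a system-level memory dump.
# ToyGrp mirrors the struct on the slide; the JSON format is illustrative.
import json

class ToyGrp:
    def __init__(self):
        self.temperature = [0.0] * 1024                 # float Temperature[1024]
        self.pressure = [[0] * 30 for _ in range(20)]   # int Pressure[20][30]

    def checkpoint(self) -> str:
        # Serialize just the selected state, not the whole address space.
        return json.dumps({"temperature": self.temperature,
                           "pressure": self.pressure})

    @classmethod
    def restart(cls, blob: str) -> "ToyGrp":
        state = json.loads(blob)
        grp = cls()
        grp.temperature = state["temperature"]
        grp.pressure = state["pressure"]
        return grp

g = ToyGrp()
g.temperature[0] = 98.6
restored = ToyGrp.restart(g.checkpoint())
print(restored.temperature[0])   # -> 98.6
```

Because only named variables are written, the checkpoint is small and portable, which is exactly why it is hard to compress further: little redundant memory-layout structure survives.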
Checkpointing in HPC
- MPI applications take globally coordinated checkpoints asynchronously.
- Application-level checkpoints use a high-level data format for portability: HDF5, ADIOS, netCDF, etc.

Checkpoint writing strategies:
1. N->1 (funneled): not scalable.
2. N->N (direct): easiest, but causes contention on the parallel file system (PFS).
3. N->M (grouped): the best compromise, but complex.

For example, the structure

struct ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
};

appears in an HDF5 checkpoint as:

HDF5 checkpoint {
  Group "/" {
    Group "ToyGrp" {
      DATASET "Temperature" {
        DATATYPE  H5T_IEEE_F32LE
        DATASPACE SIMPLE {(1024) / (1024)}
      }
      DATASET "Pressure" {
        DATATYPE  H5T_STD_U8LE
        DATASPACE SIMPLE {(20,30) / (20,30)}
      }
    }
  }
}

Impact of Load on PFS at Large Scale
IOR, direct (N->N) transfer, 78 MB per process. Observations: (-) large average write time; (-) large average read time. These lead to less frequent checkpointing and poor application performance. [Figure: average write and read times (s) vs. number of processes N.]

What is the Problem?
Today's checkpoint-restart systems will not scale:
- increasing number of concurrent transfers;
- increasing volume of checkpoint data.

Our Contributions
- Data-aware aggregation: reduces the number of concurrent transfers and improves the compressibility of checkpoints by using semantic information.
- Data-aware compression: improves compression ratio by 115% compared to concatenation plus general-purpose compression.
- Design and development of mcrEngine: a grouped (N->M) checkpointing system that improves checkpointing frequency and application performance.

Naive Solution: Data-Agnostic Compression
- Agnostic scheme: concatenate checkpoints C1 and C2, compress with pGzip, write to the PFS.
- Agnostic-block scheme: interleave fixed-size B-byte blocks of C1 and C2 (C1[1-B], C2[1-B], C1[B+1-2B], C2[B+1-2B], ...), compress with pGzip, write to the PFS.
Observations: (+) easy; (-) low compression ratio.

Our Solution
[Step 1] Identify similar variables across processes, using per-variable metadata: 1. name, 2. data type, 3. class (array or atomic). Example: P0 declares

Group ToyGrp {
    float Temperature[1024];
    int Pressure[20][30];
};

while P1 declares

Group ToyGrp {
    float Temperature[100];
    int Pressure[10][50];
};

[Step 2] Merging schemes:
- Scheme I (Aware): concatenate similar variables, e.g., C1.T C2.T then C1.P C2.P.
- Scheme II (Aware-Block): interleave similar variables, taking the first B bytes of Temperature from each checkpoint, then the next B bytes, and so on; likewise for Pressure.

[Step 3] Data-aware aggregation and compression:
- Aware scheme: concatenate similar variables.
- Aware-block scheme: interleave similar variables.
First phase: data-type-aware compression of each merged variable (e.g., FPC for floating-point data, Lempel-Ziv for the rest). Second phase: pGzip over the output buffer, then write to the PFS.

How MCRENGINE Works
- CNC (compute-node component): identifies "similar" variables.
- ANC (aggregator-node component): applies data-aware aggregation and compression.
Processes are rank-ordered into groups for grouped (N->M) transfer. In each group, the aggregator requests metadata from the CNCs, merges the similar variables (T, P, H, D), compresses with pGzip, and writes to the PFS.

Evaluation
Applications (size per checkpoint set):
- ALE3D: 4.8 GB
- Cactus: 2.41 GB
- Cosmology: 1.1 GB
- Implosion: 13 MB
Experimental test bed: LLNL's Sierra, a 261.3 TFLOP/s Linux cluster with 23,328 cores and a 1.3-petabyte Lustre file system.
Compression algorithms: FPC [1] for double-precision floats, fpzip [2] for single-precision floats, Lempel-Ziv for all other data types, and pGzip for general-purpose compression.

Evaluation Metrics
- Effectiveness of data-aware compression: What is the benefit of multiple compression phases? How does group size affect compression ratio?
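As a reference point for these questions, the two merging schemes from Step 2 can be sketched in a few lines (a simplified sketch, not mcrEngine's implementation; the byte-string checkpoints are illustrative stand-ins for HDF5 variables):

```python
# Sketch of the two data-aware merging schemes, assuming each
# checkpoint exposes its variables as name -> bytes. Toy data;
# mcrEngine itself operates on HDF5/netCDF variables.

def merge_aware(checkpoints, names):
    """Scheme I: concatenate similar variables across checkpoints."""
    out = b""
    for name in names:                    # group variable by variable
        for ckpt in checkpoints:          # C1.T C2.T then C1.P C2.P ...
            out += ckpt[name]
    return out

def merge_aware_block(checkpoints, names, B):
    """Scheme II: interleave B-byte blocks of similar variables."""
    out = b""
    for name in names:
        offset = 0
        longest = max(len(c[name]) for c in checkpoints)
        while offset < longest:           # round-robin B bytes at a time
            for ckpt in checkpoints:
                out += ckpt[name][offset:offset + B]
            offset += B
    return out

c1 = {"Temperature": b"TTTT", "Pressure": b"PPPP"}
c2 = {"Temperature": b"tttt", "Pressure": b"pppp"}
print(merge_aware([c1, c2], ["Temperature", "Pressure"]))
# -> b'TTTTttttPPPPpppp'  (each variable stays contiguous per checkpoint)
print(merge_aware_block([c1, c2], ["Temperature", "Pressure"], 2))
# -> b'TTttTTttPPppPPpp'  (2-byte blocks alternate between C1 and C2)
```

Both schemes place like-typed data side by side, which is what lets the first, type-aware compression phase see long runs of one data type.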
Compression ratio = uncompressed size / compressed size.
- Performance of mcrEngine: overhead of the checkpointing phase; overhead of the restart phase.

Multiple Phases of Data-Aware Compression Are Beneficial
- Data-agnostic double compression is not beneficial: after one pass the data format is non-uniform and incompressible.
- Data-type-aware compression improves compressibility: the first phase changes the underlying data format.
[Figure: compression ratio after the first and second phases, data-agnostic vs. data-aware, for ALE3D, Cactus, Cosmology, and Implosion.]

Impact of Group Size on Compression Ratio
- Different merging schemes are better for different applications.
- A larger group size is beneficial for certain applications; ALE3D improves 8% from group size 2 to 32.
[Figure: compression ratio vs. group size for the Aware-Block and Aware schemes on ALE3D and Cactus.]

Data-Aware Technique Always Wins over Data-Agnostic
The data-aware technique always yields a better compression ratio than the data-agnostic technique: 98-115% improvement.
[Figure: compression ratio vs. group size for Aware-Block, Aware, Agnostic-Block, and Agnostic schemes on ALE3D and Cactus.]

Summary of Effectiveness Study
- Data-aware compression always wins; it reduces gigabytes of data for Cactus.
- Larger group sizes may improve compression ratio.
- Different merging schemes suit different applications.
- Compression ratio follows the course of the simulation.

Impact of Data-Aware Compression on Latency
IOR with grouped (N->M) transfer, groups of 32 processes; data-aware: 1.2 GB, data-agnostic: 2.4 GB. Data-aware compression improves I/O performance at large scale: 43-70% improvement during writes and 48-70% during reads. [Figure: average transfer time (s) vs. number of processes for agnostic and aware reads and writes.]

Impact of Aggregation & Compression on Latency
IOR; direct (N->N): 87 MB per process; grouped (N->M): group size 32, 1.21 GB per aggregator. [Figure: average write and read times (s) vs. number of processes for N->N and N->M transfers.]

End-to-End Checkpointing Overhead
15,408 processes; group size of 32 for N->M schemes; each process takes a checkpoint. Data-aware compression converts a network-bound operation into a CPU-bound one. [Figure: total checkpointing overhead (s), split into transfer and CPU overhead, for no compression (direct and grouped), agnostic, and aware schemes on ALE3D and Cactus; annotated reductions of 87% and 51%.]

End-to-End Restart Overhead
Reduced overall restart overhead; reduced network load and transfer time. [Figure: total recovery overhead (s), split into transfer and CPU overhead, for the same schemes on ALE3D and Cactus; annotated reductions range from 43% to 71%, with recovery overhead reduced by more than 62%.]

Summary of Scalable Checkpointing System
- Developed a data-aware checkpoint compression technique: relative improvement in compression ratio of up to 115%; investigated different merging techniques; demonstrated effectiveness using real-world applications.
- Designed and developed MCRENGINE: reduces recovery overhead by more than 62%, reduces checkpointing overhead by up to 87%, and improves the scalability of checkpoint-restart systems.

Benefit-Aware Clustering of Checkpoints from Parallel Applications
Collaborators: Todd Gamblin, Kathryn Mohror, Adam Moody, Bronis de Supinski

Our Goal & Contributions
Goal: Can suitably grouping checkpoints increase compressibility?
Contributions:
- Design a new metric for the "similarity" of checkpoints.
- Use this metric for clustering checkpoints.
- Evaluate the benefit of the clustering on checkpoint storage.

Different Clustering Schemes
[Figure: 16 processes partitioned three ways: randomly, rank-wise, and data-aware (our solution).]

Research Questions
- How should checkpoints be clustered?
- Does clustering improve compression ratio?
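One concrete way to pose the second question is to compare compressing two checkpoints separately against compressing them together (a sketch assuming zlib as a stand-in compressor; mcrCluster's actual benefit metric may be computed differently):

```python
# Sketch: estimate pairwise co-compression benefit with zlib.
import random
import zlib

def benefit(a: bytes, b: bytes) -> float:
    """Ratio of separately-compressed size to jointly-compressed size.

    Values well above 1.0 mean the two buffers share structure and
    gain from being placed in the same compression group.
    """
    separate = len(zlib.compress(a)) + len(zlib.compress(b))
    together = len(zlib.compress(a + b))
    return separate / together

random.seed(0)
noise1 = bytes(random.randrange(256) for _ in range(4096))
noise2 = bytes(random.randrange(256) for _ in range(4096))
twin = noise1   # identical content compresses away in a joint stream

print(round(benefit(noise1, twin), 2))    # well above 1.0: joint wins
print(round(benefit(noise1, noise2), 2))  # close to 1.0: no shared structure
```

Grouping the pairs with the highest benefit is exactly what the clustering below tries to do at scale.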
Benefit-Aware Clustering
- Similarity metric: improvement in reduction.
- Goal: minimize the total compressed size.
[Figure: benefit matrix beta for Cactus, a heat map of pairwise benefit values.]

Novel Dissimilarity Metric
Two factors determine the dissimilarity between two checkpoints i and j: their direct pairwise benefit beta(i, j), and how differently they relate to all N checkpoints:

    Delta(i, j) = (1 / beta(i, j)) * Sum_{k=1..N} [beta(i, k) - beta(j, k)]^2

How Benefit-Aware Clustering Works
Pipeline: sample each process's variables (e.g., double T[3000], V[10], P[5000], D[4000], R[100]) by chunking or a wavelet transform, then filter, order, and cluster by similarity (e.g., cluster 1 = {P1, P3}, cluster 2 = {P2, P4, P5}).

Structure of MCRCLUSTER
[Figure: compute nodes P1-P5 run per-process components and feed aggregators A1 and A2, which write to the PFS.]

Evaluation
- Applications: IOR (synthetic checkpoints) and Cactus.
- Experimental test bed: LLNL's Sierra, a 261.3 TFLOP/s Linux cluster with 23,328 cores and a 1.3-petabyte Lustre file system.
- Evaluation metrics: macro benchmark, effectiveness of clustering; micro benchmark, effectiveness of sampling.

Effectiveness of MCRCLUSTER
IOR, 32 checkpoints: odd-ranked processes write 0; even-ranked processes write <rank> | 1234567. Benefit-aware clustering gives 29% more compression than rank-wise grouping and 22% more than random grouping. [Figure: silhouette plot of pam(x = distance_matrix, k = num_cluster, diss = TRUE) with three clusters of sizes 16, 9, and 7 and average silhouette width 0.91, alongside the weighted-distance matrix.]
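The dissimilarity metric above can be sketched directly from a benefit matrix (a sketch; the 3x3 matrix and its values are made up for illustration):

```python
# Sketch of the weighted dissimilarity Delta(i, j) computed from a
# benefit matrix beta. The toy matrix below is illustrative, not real data.

def dissimilarity(beta, i, j):
    """Delta(i, j) = (1 / beta[i][j]) * sum_k (beta[i][k] - beta[j][k])^2"""
    n = len(beta)
    profile_gap = sum((beta[i][k] - beta[j][k]) ** 2 for k in range(n))
    return profile_gap / beta[i][j]

# Checkpoints 0 and 1 co-compress well and have similar benefit
# profiles; checkpoint 2 is unlike both.
beta = [
    [1.0, 0.9, 0.3],
    [0.9, 1.0, 0.3],
    [0.3, 0.3, 1.0],
]

print(dissimilarity(beta, 0, 1) < dissimilarity(beta, 0, 2))  # -> True
```

Note the two factors at work: a small beta(i, j) inflates the distance, and so does a large gap between the rows' benefit profiles; a distance-based method such as PAM can then cluster on Delta directly.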
Effectiveness of Sampling
[Figure: for each variable (x-axis), the range of benefit values (y-axis) under chunking vs. wavelet-transform sampling.]
Take-away: the chunking method preserves benefit relationships the closest.

Contributions of MCRCLUSTER
- Designed similarity and distance metrics.
- Demonstrated significant results on synthetic data: 22% and 29% improvement compared to random and rank-wise clustering, respectively.
- Future directions (for a first-year Ph.D. student): study the impact on real applications; design a scalable clustering technique.

Applicability of My Research
- Condor systems
- Compression for scientific data

Conclusions
This thesis addresses the reliability of checkpointing-based recovery in large-scale computing and proposes three novel systems:
- FALCON: distributed checkpointing system for Grids.
- MCRENGINE: "data-aware compression" and scalable checkpointing system for HPC.
- MCRCLUSTER: "benefit-aware clustering".
It provides a good foundation for further research in this field.

Questions?
Future Directions: Reliability
- Similarity-based process grouping for better compression: group processes based on similarity instead of rank [ongoing].
- Analytical solution to group-size selection.
- Variable streaming.
- Integrating mcrEngine with SCR.

Future Directions: Performance
Cache-usage analysis and optimization: developed a user-level tool for analyzing cache utilization [Summer '12].
- Short-term goals: apply to real applications; automate analysis.
- Long-term goals: suggest potential code optimizations; automate application tuning.

Contact Information
Tanzima Islam (tislam@purdue.edu)
Website: web.ics.purdue.edu/~tislam

Effectiveness of mcrCluster

Backup Slides

[Backup Slide] Failures in HPC
"A Large-scale Study of Failures in High-performance Computing Systems", by Bianca Schroeder and Garth Gibson.
Figure 1. The breakdown of failures into root causes (a) and the breakdown of downtime into root causes (b). Each graph shows the breakdown for systems of type D, E, F, G, and H and aggregate statistics across all systems (A-H). [Bar charts: percentage attributed to hardware, software, network, environment, human, and unknown causes.]

[Backup Slide] Failures in HPC
"Hiding Checkpoint Overhead in HPC Applications with a Semi-Blocking Algorithm", by Laxmikant Kale et al.
Disparity between network bandwidth and memory size.

[Backup Slides] Falcon

[Backup Slide] Breakdown of Overheads
Performance scales with checkpoint sizes; lower network transfer overhead. [Figure: checkpointing and recovery overheads (s) for 500 MB, 946 MB, and 1677 MB checkpoints.]

[Backup Slide] Parallel Falcon
67% improvement in CPU time. [Figure: checkpoint-storing and recovery overheads (s) for 500 MB, 946 MB, and 1677 MB checkpoints.]

[Backup Slides] mcrEngine

[Backup Slide] How to Find Similarity
Inside the source code, variables are represented as members of a group; a group can be thought of as the construct "struct" in C. Inside a checkpoint, variables are annotated with metadata, from which a hash key is generated for matching.

P0:
Group ToyGrp {
    float Temperature[1024];
    short Pressure[20][30];
    int Humidity;
};

P1:
Group ToyGrp {
    float Temperature[50];
    short Pressure[2][6];
    double Unit;
    int Humidity;
};

Generated keys:
- P0 ToyGrp/Temperature (F32LE, Array1D[1024]) -> ToyGrp/Temperature_F32LE_Array1D
- P0 ToyGrp/Pressure (S8LE, Array2D[20][30]) -> ToyGrp/Pressure_S8LE_Array2D
- P0 ToyGrp/Humidity (I32LE, Atomic) -> ToyGrp/Humidity_I32LE_Atomic
- P1 ToyGrp/Temperature (F32LE, Array1D[50]) -> ToyGrp/Temperature_F32LE_Array1D
- P1 ToyGrp/Pressure (S8LE, Array2D[2][6]) -> ToyGrp/Pressure_S8LE_Array2D
- P1 ToyGrp/Unit (F64LE, Atomic) -> ToyGrp/Unit_F64LE_Atomic (no match)
- P1 ToyGrp/Humidity (I32LE, Atomic) -> ToyGrp/Humidity_I32LE_Atomic

[Backup Slide] Compression Ratio Follows Course of Simulation
The data-aware technique always yields better compression. [Figure: compression ratio over simulation time steps for Cactus, Cosmology, and Implosion under the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes.]

[Backup Slide] Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application   Total size (GB)   Data types (%) DF / SF / Int   Aware-Block (%)   Aware (%)
ALE3D         4.8               88.8 / ~0 / 11.2               6.6 - 27.7        6.6 - 12.7
Cactus        2.41              33.94 / 0 / 66.06              10.7 - 11.9       98 - 115
Cosmology     1.1               24.3 / 67.2 / 8.5              20.1 - 25.6       20.6 - 21.1
Implosion     0.013             0 / 74.1 / 25.9                36.3 - 38.4       36.3 - 38.8

References
1. M. Burtscher and P. Ratanaworabhan, "FPC: A High-speed Compressor for Double-Precision Floating-Point Data".
2. P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data".
3. L. Reinhold, "QuickLZ".
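The matching scheme in the "How to Find Similarity" backup slide can be sketched by building such keys and intersecting them (a sketch; the key format mirrors the slide, while the helper and variable lists are illustrative):

```python
# Sketch of mcrEngine's variable matching: build a hash key from a
# variable's group path, data type, and class, then intersect the key
# sets across checkpoints. Variable lists are the toy example above.

def make_key(path, dtype, klass):
    # e.g. ("ToyGrp/Temperature", "F32LE", "Array1D")
    #      -> "ToyGrp/Temperature_F32LE_Array1D"
    return f"{path}_{dtype}_{klass}"

p0 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Humidity", "I32LE", "Atomic")]
p1 = [("ToyGrp/Temperature", "F32LE", "Array1D"),
      ("ToyGrp/Pressure", "S8LE", "Array2D"),
      ("ToyGrp/Unit", "F64LE", "Atomic"),       # only in P1: no match
      ("ToyGrp/Humidity", "I32LE", "Atomic")]

keys0 = {make_key(*v) for v in p0}
keys1 = {make_key(*v) for v in p1}
matched = keys0 & keys1      # candidates for data-aware merging
unmatched = keys1 - keys0    # handled without merging

print(sorted(unmatched))     # -> ['ToyGrp/Unit_F64LE_Atomic']
```

Note that array dimensions are not part of the key, which is why Temperature[1024] on P0 and Temperature[50] on P1 still match.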
Execution Environment: Grid

State-of-the-Art: Checkpointing in Grid with Dedicated Storage
Guest jobs at sites such as Notre Dame, Purdue, and Indiana U. ship checkpoints over the Internet to a dedicated storage server. Problems: (-) high transfer latency; (-) contention on servers; (-) stress on shared network resources.

Research Question
Can we improve the performance of applications by storing checkpoints on the grid resources themselves?

Overview of Our Solution: Checkpointing in Grid with Distributed Storage
- Q1. Which storage nodes?
- Q2. How to balance load?
- Q3. How to store and retrieve efficiently?
Constraint: all components must be user-level.

Answer to Q1: Storage Host Selection
- Build a failure model for storage resources: compute correlated temporal reliability based on historical data.
- Rank machines based on reliability, load, and network overhead.
- Output: (m + k) storage hosts.
Objective function: checkpoint storing overhead minus benefit from restart.

Checkpoint-Recovery Scheme
Storing phase: original checkpoint -> compression -> erasure encoding into (m + k) fragments -> distributed to storage hosts 1 through m + k.
Recovery phase: retrieve any m fragments -> erasure decoding -> decompression -> original checkpoint.

Evaluation Setup
- Two applications with four input sets: MCF (SPEC CPU 2006) and TIGR (BioBench); system-level checkpoints.
- Macro benchmark experiment: average job makespan.
- Micro benchmark experiments: efficiency of checkpoint and restart; efficiency in handling simultaneous clients; efficiency in handling multiple failures.

Checkpoint Storing & Recovery Overhead
Performance scales with checkpoint sizes; lower network transfer overhead. [Figure: checkpointing and recovery overheads (s), split into transfer and CPU overhead, for 500 MB, 946 MB, and 1677 MB checkpoints.]

Overall Performance Comparison
Performance improvement between 11% and 44%. [Figure: average makespan time (min) for mcf and tigr under a remote dedicated server, a local dedicated server, and Falcon with distributed storage.]

Summary of Reliable Checkpointing System
- Developed FALCON, a reliable checkpoint-recovery system: selects reliable storage hosts, preferring lightly loaded ones; compresses and encodes checkpoints; stores and retrieves them efficiently.
- Ran experiments with FALCON in DiaGrid: performance improvement between 11% and 44%.

Checkpointing in HPC
[Figure: compute nodes, gateway nodes, and the parallel file system, with network contention, contention for shared file-system resources, and contention from other clusters (Atlas, Hera).]

2-D vs N-D Compression

[Figure: benefit matrix heat map.]

Challenge in Extreme-Scale: Increase in Failure Rate
[Figure: Top500 performance growth from gigaflops toward an exaflop alongside the growing number of cores per system, 2004-2011.]

Towards Online Clustering
Reduce the dimension of beta:
- Reduce the number of variables: keep representative data types whose element counts exceed a threshold (example: 100 double-type variables cover 80% of the data).
- Reduce the amount of data by sampling: random, chunking, and wavelet.
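The chunk-sampling option above can be sketched as keeping every n-th fixed-size chunk of a variable's data, so the benefit matrix is computed on a fraction of the bytes (a sketch; the chunk size and stride are illustrative parameters, not tuned values from the thesis):

```python
# Sketch of chunk sampling: keep every `stride`-th chunk of
# `chunk_size` bytes. Parameters here are illustrative.

def chunk_sample(data: bytes, chunk_size: int, stride: int) -> bytes:
    """Return chunks 0, stride, 2*stride, ... of the input."""
    chunks = [data[i:i + chunk_size]
              for i in range(0, len(data), chunk_size)]
    return b"".join(chunks[::stride])

payload = bytes(range(256)) * 16   # 4 KB stand-in for a variable's data
sample = chunk_sample(payload, chunk_size=64, stride=4)

print(len(payload), len(sample))   # -> 4096 1024, a 4x data reduction
```

Because the sample spreads across the whole buffer rather than taking a prefix, it tends to preserve the pairwise benefit relationships, which matches the micro-benchmark finding that chunking tracks the full-data benefits most closely.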