stdchk: A Checkpoint Storage System for Desktop Grid Computing

Samer Al-Kiswany (UBC), Matei Ripeanu (UBC), Sudharshan S. Vazhkudai (ORNL), Abdullah Gharaibeh (UBC)
The University of British Columbia / Oak Ridge National Laboratory

Checkpointing Introduction
Checkpointing is used for fault tolerance, debugging, and migration. Typically, an application running for days on hundreds of nodes (e.g., a desktop grid) saves checkpoint images periodically.
[Figure: application timeline with periodic checkpoints]

Deployment Scenario
[Figure: deployment scenario]

The Challenge
Although checkpointing is necessary:
- It is pure overhead from the performance point of view.
- Most of the time is spent writing to the storage system.
- It generates a high load on the storage system.
Requirement: a high-performance, scalable, and reliable storage system optimized for checkpointing applications.
Challenge: low-cost, transparent support for checkpointing at the file-system level.

Checkpointing Workload Characteristics
- Write-intensive and bursty: e.g., a job running on hundreds of nodes periodically checkpoints hundreds of GB of data.
- Written once, rarely read during application execution.
- Potentially high similarity between consecutive checkpoints.
- Application-specific checkpoint image life span: when is it safe to delete an image?

Why a Checkpointing-Optimized Storage System?
Optimizing for the checkpointing workload can bring valuable benefits:
- High throughput, through specialization.
- Considerable storage space and network effort savings, through transparent support for incremental checkpointing.
- Simplified data management, by exploiting the particularities of checkpoint usage scenarios.
- A reduced load on the shared file system.
- Low cost: the system can be built atop scavenged resources.

stdchk
A checkpointing-optimized storage system built using scavenged resources.

Outline
- stdchk architecture
- stdchk features
- stdchk system evaluation

stdchk Architecture
- Manager: metadata management.
- Benefactors: storage nodes.
- Client: file system interface.

stdchk Features
- High throughput for write operations.
- Support for transparent incremental checkpointing.
- Simplified data management.
- High reliability through replication.
- POSIX file system API: as a result, using stdchk does not require modifications to the application.

Optimized Write Operation Alternatives
Write procedure alternatives (a sketch of the sliding-window variant follows this list):
- Complete local write: the checkpoint is written entirely to the local disk, then transferred to stdchk.
- Incremental write: data is staged on the local disk and transferred to stdchk incrementally, while the checkpoint is still being written.
- Sliding-window write: data passes through an in-memory buffer and is streamed to stdchk without touching the local disk.
[Figure: data paths between the application, the stdchk FS interface, local disk/memory, and stdchk for each alternative]
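To make the sliding-window data path concrete, here is a minimal sketch in Python. It illustrates the idea only and is not stdchk's implementation: the 4 MB window size and the send_window() callback are placeholders for the real striping and transfer logic.

```python
import queue
import threading

WINDOW_SIZE = 4 * 1024 * 1024  # illustrative window size, not stdchk's actual value

class SlidingWindowWriter:
    """Buffers application writes in memory and streams full windows to
    remote storage from a background thread, so the application blocks
    neither on the network nor on the local disk."""

    def __init__(self, send_window):
        # send_window(bytes) stands in for the transfer of one window to
        # a benefactor node; its name and signature are assumptions.
        self._send = send_window
        self._buf = bytearray()
        self._windows = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, data: bytes) -> None:
        # Fast path: append to the in-memory buffer; hand off full windows.
        self._buf.extend(data)
        while len(self._buf) >= WINDOW_SIZE:
            self._windows.put(bytes(self._buf[:WINDOW_SIZE]))
            del self._buf[:WINDOW_SIZE]

    def close(self) -> None:
        if self._buf:
            self._windows.put(bytes(self._buf))  # flush the last partial window
        self._windows.put(None)                  # sentinel: no more data
        self._worker.join()

    def _drain(self) -> None:
        # Runs in the background: overlaps transfer with application compute.
        while True:
            window = self._windows.get()
            if window is None:
                return
            self._send(window)

# Usage sketch: SlidingWindowWriter(send_window=lambda w: sock.sendall(w))
```

A real deployment would additionally need flow control (e.g., a bounded queue so a slow network cannot exhaust memory) and striping/replication across benefactor nodes; the sketch omits both.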
Write Operation Evaluation
Testbed: 28 machines, each with two 3.0 GHz Xeon processors, 1 GB RAM, and two 36.5 GB SCSI disks.

Achieved Storage Bandwidth
The sliding-window write achieves high bandwidth (110 MB/s) and saturates the 1 Gbps link.
[Figure: average achieved storage bandwidth over the 1 Gbps testbed; write throughput (MB/s) vs. stripe width (1, 2, 4, 8) for complete local write, incremental write, sliding-window write, NFS, local I/O, and iperf]

stdchk Features
- High-throughput write operation.
- Transparent incremental checkpointing.
- Checkpointing-optimized data management.
- POSIX file system interface: no modification to the application required.

Transparent Incremental Checkpointing
Incremental checkpointing may bring valuable benefits:
- Lower network effort.
- Less storage space used.
But:
- How much similarity is there between consecutive checkpoints?
- How can we detect similarity between checkpoints?
- Is the detection fast enough?

Similarity Detection Mechanism – Compare-by-Hash
The checkpoint image is divided into blocks, and each block is hashed. At time T0 the checkpoint consists of blocks X, Y, and Z, whose hashes are recorded. At time T1 the next checkpoint differs in a single block (X has been replaced by W); comparing block hashes shows that only W is new, so only W is transferred and stored.

Similarity Detection Mechanism
How should the file be divided into blocks?
- Fixed-size blocks + compare-by-hash (FsCH).
- Content-based blocks + compare-by-hash (CbCH).

FsCH Insertion Problem
An insertion in checkpoint i+1 shifts every subsequent fixed-size block boundary (blocks B1..B5 become B1..B6), so unchanged data no longer aligns with the previous blocks and hashes differently.
Result: a lower similarity detection ratio.

Content-based Compare-by-Hash (CbCH)
CbCH slides a window of m bytes over the checkpoint, hashes the window at each offset, and declares a block boundary wherever the k low-order bits of the hash value are zero. Because boundaries are determined by the content rather than by position, an insertion changes only the block it touches (B1 B2 B3 B4 becomes B1 BX B3 B4).
Result: a higher similarity detection ratio.
But: hashing at every offset is computationally intensive. (A sketch of both chunking schemes follows.)
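The two chunking schemes can be illustrated with a short Python sketch. This is a toy illustration, not stdchk's code: MD5 is an arbitrary stand-in for the hash function, the m = 20 bytes / k = 14 bits defaults mirror the parameters on the evaluation slide below, and a production CbCH would use a rolling hash rather than rehashing every window from scratch; that per-offset hashing cost is precisely what makes CbCH computationally intensive.

```python
import hashlib

def fsch_chunks(data: bytes, block_size: int = 1024 * 1024) -> list:
    """FsCH: fixed-size blocks. An insertion shifts every later boundary,
    so unchanged data may still hash differently."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def cbch_chunks(data: bytes, m: int = 20, k: int = 14) -> list:
    """CbCH: declare a boundary where the hash of the last m bytes has its
    k low-order bits equal to zero; boundaries move with the content, so
    an insertion only affects the block(s) it touches."""
    mask = (1 << k) - 1  # expected block size is roughly 2^k bytes
    chunks, start = [], 0
    for i in range(m, len(data)):
        digest = hashlib.md5(data[i - m:i]).digest()  # naive, non-rolling hash
        if int.from_bytes(digest[-4:], "big") & mask == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

def new_blocks(prev_hashes: set, chunks: list) -> list:
    """Compare-by-hash: store only the chunks whose hash did not appear in
    the previous checkpoint; the rest are referenced, not re-sent."""
    return [c for c in chunks if hashlib.md5(c).hexdigest() not in prev_hashes]

# Usage sketch, for two consecutive checkpoint images ckpt0 and ckpt1:
#   prev = {hashlib.md5(c).hexdigest() for c in cbch_chunks(ckpt0)}
#   to_store = new_blocks(prev, cbch_chunks(ckpt1))
```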
Evaluating Similarity Between Consecutive Checkpoints
Applications: BMS* and BLAST. Checkpointing intervals: 1, 5, and 15 minutes.

Checkpoint type             | Number of checkpoints | Avg. checkpoint size
Application level           | 100                   | 2.4 MB
System level (BLCR)         | 1200                  | 450 MB
Virtual machine level (Xen) | 400                   | 1 GB

* Checkpoints by Pratul Agarwal (ORNL).

Similarity Ratio and Detection Throughput
The table presents the average rate of detected similarity and, in brackets, the detection throughput in MB/s for each heuristic.

Technique                         | BMS App (1 min) | BMS BLCR (5 min) | BMS Xen (15 min) | BLAST (5 or 15 min)
FsCH (1 MB)                       | 0.0% [108]      | 23.4% [109]      | 6.3% [113]       | 0.0% [110]
CbCH (no overlap, m=20 B, k=14 b) | 0.0% [28.4]     | 82% [26.6]       | 70% [26.4]       | 0.0% [28.4]

But: using the GPU, CbCH achieves over 190 MB/s detection throughput. (StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC 2008.)

Compare-by-Hash Results
FsCH slightly degrades the achieved bandwidth, but reduces the storage space used and the network effort by 24%.
[Figure: achieved storage bandwidth; write throughput (MB/s) vs. file system interface write buffer size (64, 128, 256 MB), with FsCH and with no detection]

Outline
- stdchk architecture
- stdchk features
- stdchk overall system evaluation

stdchk Scalability
stdchk sustains high loads.
Workload: 7 clients, each writing 100 files of 100 MB each (70 GB in total), to a stdchk pool of 20 benefactor nodes.
[Figure: aggregate stdchk throughput (MB/s) over time (0-300 s) in three scenarios: steady pool, nodes joining, and nodes leaving]

Experiment with a Real Application
Application: BLAST. Execution time: over 5 days. Checkpointing interval: 30 s. Stripe width: 4 benefactors. Client machine: two 3.0 GHz Xeon processors, SCSI disks.

Metric                   | Local disk | stdchk  | Improvement
Checkpointing time (s)   | 22,733     | 16,497  | 27.0%
Data size (TB)           | 3.55       | 1.14    | 69.0%
Total execution time (s) | 462,141    | 455,894 | 1.3%

Summary
stdchk: a checkpointing-optimized storage system built using scavenged resources.
stdchk features:
- High-throughput write operation.
- Considerable disk space and network effort savings.
- Checkpointing-optimized data management.
- Easy to adopt: implements a POSIX file system interface.
- Inexpensive: built atop scavenged resources.
Consequently, stdchk:
- Offloads the checkpointing workload from the shared file system.
- Speeds up checkpointing (reduces checkpointing overhead).

Thank you
netsyslab.ece.ubc.ca