AT Tunable Tunable, bl Software S ft Software-based b based d DRAM E Error D t ti Detection and dC Correction ti Lib Library ffor HPC David Fiala (NCSU), (NCSU) Kurt Ferreira (SNL), (SNL) Frank Mueller (NCSU), (NCSU) Christian Engelmann (ORNL) Motivation Implementation Silent Silent Data Corruption (SDC) Data Corruption (SDC) undetected soft errors that result in undetected soft errors that result in corruption in storage (Processor Cache Disks RAM etc) corruption in storage (Processor, Cache, Disks, RAM, etc) SDC faults may manifest themselves as bit‐flips in memory y p y o Some bit‐flips are not correctable or even detected even with S bit fli t t bl d t t d ith hardware ECC protection hardware ECC protection o Exacerbating this situation, when SDC goes undetected, Exacerbating this situation, when SDC goes undetected, applications continue to run while reporting invalid li ti ti t hil ti i lid results lt • This is a severe problem for today This is a severe problem for today’ss large‐scale simulations large scale simulations Server class hardware supports ECC; one common form provides Server class hardware supports ECC; one common form provides single error correct, double error detect (SECDEC) i gl t, d bl d t t ((SECDEC)) Non‐server class hardware provides no protection Non server class hardware provides no protection Today there is no generic way to protect applications without ECC Today there is no generic way to protect applications without ECC Even with ECC, hardware SECDEC protection fails you when 3 or , p f y more bit flips occur more bit flips occur SDC events are expected to grow dramatically as chip density, heat SDC events are expected to grow dramatically as chip density heat generation, and core counts increase in larger HPC systems generation, and core counts increase in larger HPC systems Page Page tracking is accomplished with mprotect tracking is accomplished with mprotect (removing read/write) (removing read/write) o Each new page access triggers an access violation which allows Each new page access triggers an access violation which allows LIBSDC to monitor application activity (SEGV handler) pp y( ) o Swap out unlocked pages upon reach max‐unlocked S t l k d h l k d Permission changes break many libraries Permission changes break many libraries o Syscalls will fail if passed protected pointers Syscalls will fail if passed protected pointers • p ptrace t i is used to intercept all syscalls and unprotect pointers dt i t pt ll y ll d p t tp i t within syscall parameters within syscall parameters o MPI implementations will fail with protected pointers, too MPI implementations will fail with protected pointers too • LIBSDC’s MPI profiling layer wrappers unprotect passed ’ p fl gl y pp p p d buffers o Separate memory allocators prevent unprotected libraries from Separate memory allocators prevent unprotected libraries from sharing virtual addresses in the same page as protected data g p g p Application pp Memor Memory y with ith LIBSDC Unprotected code, BSS, data LIBSDC: A software software-based based solution Protected code, bss, data Id Idea: Provide SDC protection in software by tracking accesses to P id SDC p t ti i ft by t ki g t memory regions and ensuring their integrity before an application memory regions and ensuring their integrity before an application uses that region’ss data uses that region data For each region of memory choose one or both: F h gi f y h b h o Hashes: Detect memory corruption via hash mismatches Hashes: Detect memory corruption via hash mismatches o ECC/Hamming Codes: Correct some SDCs, even if hardware ECC ECC/Hamming Codes: Correct some SDCs even if hardware ECC f l d d failed to detect them h Application‐independent and transparent Application independent and transparent No code changes required for applications No code changes required for applications Using the MMU provides a granularity of a single page for a region Using the MMU provides a granularity of a single page for a region GLIBC M GLIBC Mem. Allocator All t U Unprotected d Lib i Libraries syscalls MPI ll MPI calls LIBSDC M LIBSDC Mem. Allocator All t P Protected d A li i Application P Protected d Lib i Libraries syscalls y ll syscalls memory allocation memory allocation memory allocation ll ti MPI calls ca s MPI calls MPI calls LIBSDC syscalls ptrace p SEGV access violations SEGV access violations Kernel HPCCG Results R lt HPCCG HPCCG – A Sandia Natl. Labs kernel conjugate gradient solver from A Sandia Natl. Labs kernel conjugate gradient solver from th M t the Mantevo Miniapps Mi i 256 processes on a 768x8x8 matrix 256 processes on a 768x8x8 matrix o AMD Opteron 6128 (Magny Core) AMD Opteron 6128 (Magny Core) – 16 cores per node 16 cores per node o 32GB RAM per node 32GB RAM p d o 40Gbit/s Infiniband 40Gbit/s Infiniband LIBSDC i t LIBSDC internal data, hashes, ECC l d t h h ECC Method Storage O Overhead h d SHA1 Hashing (4KB pages) 72/64 Hamming Codes (ECC) Locked/Protected L k d/P t t d Page P – LIBSDC mustt validate lid t b before f nextt use – read/write d/ it causes access violation i l ti Unlocked/Unprotected Page – Recently validated – may be read/written 0 49% 0.49% 12 5% 12.5% Tuning g M Max‐unlocked: l k d Adjust the maximum number of pages to be allowed Adj t th i b f t b ll d “unlocked” unlocked at a time. Ideally set at the number of pages in an at a time Ideally set at the number of pages in an application’ss working application working‐set set during runtime during runtime Hash or ECC: H h ECC Choose if you desire SDC detection and/or correction Ch if y d i SDC d t ti d/ ti Memory to protect: Choose or combine: Memory to protect: Choose or combine: o Application Application’ss heap, bss, data, and/or code heap bss data and/or code o Other linked libraries (optionally include or exclude) h l k dlb ( p lly l d l d ) LIBSDC storage (i (i.e., e hashes of pages protected) Memory Verification On page request (inital read or write): If page is locked: Perform hash of page Compare current hash with previously stored known-good hash If any inconsistency found: Notify if the presence off SDC S C and report location i T i Terminate application li i / RRollback llb k to previous i checkpoint h k i M k page as unlocked Mark l k d (mprotect) ( t t) On page lock: Calculate new hash of entire page Storage hash in separate location Mark page as locked (mprotect) Ideal Ideal max max‐unlocked unlocked around 4096 around 4096‐5120 5120 to match working to match working‐set set size size On average, about 15% of overhead spent on page hashing O g , b t 15% f h d p t p g h hi g 53% overhead vs baseline when tuned with optimal max‐unlocked 53% overhead vs baseline when tuned with optimal max unlocked Future Work Recent Recent related work has shown (Ferreira, SC11) page hashing on related work has shown (Ferreira SC11) page hashing on GPUs can greatly reduce the overhead spent hashing on the CPU GPUs can greatly reduce the overhead spent hashing on the CPU Replace LIBSDC’s FIFO policy of unlocked pages with a smarter R pl LIBSDC’ FIFO p li y f l k d p g ih frequency based algorithm frequency‐based algorithm Investigate using kernel page tables/invalid bit to reduce the Investigate using kernel page tables/invalid bit to reduce the overhead incurred from frequent use of mprotect (TLB flushes may h d df f q f p ( fl h y be responsible for much overhead) be responsible for much overhead) This work was supported in part by NSF grants CNS 1058779 CNS 0958311 DOE grant DE FG02 08ER25837 a subcontract from Sandia National Laboratory and by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL) managed by UT‐Battelle, LLC for the U. S. Department of Energy under Contract No. De‐AC05‐00OR22725. This work was supported in part by NSF grants CNS‐1058779, CNS‐0958311, DOE grant DE‐FG02‐08ER25837, a subcontract from Sandia National Laboratory. and by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL), managed by UT Battelle LLC for the U S Department of Energy under Contract No De AC05 00OR22725