A T bl S ft b d DRAM E

advertisement
AT
Tunable
Tunable,
bl Software
S ft
Software-based
b
based
d DRAM E
Error
D t ti
Detection
and
dC
Correction
ti
Lib
Library ffor HPC
David Fiala (NCSU),
(NCSU) Kurt Ferreira (SNL),
(SNL)
Frank Mueller (NCSU),
(NCSU) Christian Engelmann (ORNL)
Motivation
Implementation
 Silent
Silent Data Corruption (SDC) 
Data Corruption (SDC)  undetected soft errors that result in undetected soft errors that result in
corruption in storage (Processor Cache Disks RAM etc)
corruption in storage (Processor, Cache, Disks, RAM, etc)
 SDC faults may manifest themselves as bit‐flips in memory
y
p
y
o Some bit‐flips are not correctable or even detected even with S
bit fli
t
t bl
d t t d
ith
hardware ECC protection
hardware ECC protection
o Exacerbating this situation, when SDC goes undetected, Exacerbating this situation, when SDC goes undetected,
applications continue to run while reporting invalid
li ti
ti
t
hil
ti i lid results
lt
• This is a severe problem for today
This is a severe problem for today’ss large‐scale simulations
large scale simulations
 Server class hardware supports ECC; one common form provides Server class hardware supports ECC; one common form provides
single error correct, double error detect (SECDEC)
i gl
t, d bl
d t t ((SECDEC))
 Non‐server class hardware provides no protection
Non server class hardware provides no protection
 Today there is no generic way to protect applications without ECC
Today there is no generic way to protect applications without ECC
 Even with ECC, hardware SECDEC protection fails you when 3 or ,
p
f
y
more bit flips occur
more bit flips occur
 SDC events are expected to grow dramatically as chip density, heat SDC events are expected to grow dramatically as chip density heat
generation, and core counts increase in larger HPC systems
generation,
and core counts increase in larger HPC systems
 Page
Page tracking is accomplished with mprotect
tracking is accomplished with mprotect (removing read/write)
(removing read/write)
o Each new page access triggers an access violation which allows Each new page access triggers an access violation which allows
LIBSDC to monitor application activity (SEGV handler)
pp
y(
)
o Swap out unlocked pages upon reach max‐unlocked
S
t l k d
h
l k d
 Permission changes break many libraries
Permission changes break many libraries
o Syscalls will fail if passed protected pointers
Syscalls will fail if passed protected pointers
• p
ptrace
t
i
is used to intercept all syscalls and unprotect pointers dt i t
pt ll y ll
d p t tp i t
within syscall parameters
within syscall parameters
o MPI implementations will fail with protected pointers, too
MPI implementations will fail with protected pointers too
• LIBSDC’s MPI profiling layer wrappers unprotect passed ’
p fl gl y
pp
p
p
d
buffers
o Separate memory allocators prevent unprotected libraries from Separate memory allocators prevent unprotected libraries from
sharing virtual addresses in the same page as protected data
g
p g
p
Application
pp
Memor
Memory
y with
ith LIBSDC
Unprotected code, BSS, data
LIBSDC: A software
software-based
based solution
Protected code, bss, data
 Id
Idea: Provide SDC protection in software by tracking accesses to P id SDC p t ti i
ft
by t ki g
t
memory regions and ensuring their integrity before an application
memory regions and ensuring their integrity before an application uses that region’ss data
uses that region
data
 For each region of memory choose one or both:
F
h gi
f
y h
b h
o Hashes: Detect memory corruption via hash mismatches
Hashes: Detect memory corruption via hash mismatches
o ECC/Hamming Codes: Correct some SDCs, even if hardware ECC ECC/Hamming Codes: Correct some SDCs even if hardware ECC
f l d d
failed to detect them
h
 Application‐independent and transparent
Application independent and transparent
No code changes required for applications
No code changes required for applications
 Using the MMU provides a granularity of a single page for a region
Using the MMU provides a granularity of a single page for a region
GLIBC M
GLIBC Mem. Allocator
All t
U
Unprotected
d
Lib i
Libraries
syscalls
MPI ll
MPI calls
LIBSDC M
LIBSDC Mem. Allocator
All t
P
Protected
d
A li i
Application
P
Protected
d
Lib i
Libraries
syscalls
y ll
syscalls
memory allocation memory allocation
memory allocation
ll ti
MPI calls
ca s
MPI calls
MPI calls
LIBSDC
syscalls
ptrace
p
SEGV access violations
SEGV access violations
Kernel
HPCCG Results
R
lt
 HPCCG
HPCCG – A Sandia Natl. Labs kernel conjugate gradient solver from A Sandia Natl. Labs kernel conjugate gradient solver from
th M t
the Mantevo Miniapps
Mi i
 256 processes on a 768x8x8 matrix
256 processes on a 768x8x8 matrix
o AMD Opteron 6128 (Magny Core) AMD Opteron 6128 (Magny Core) – 16 cores per node
16 cores per node
o 32GB RAM per node
32GB RAM p
d
o 40Gbit/s Infiniband
40Gbit/s Infiniband
LIBSDC i t
LIBSDC internal data, hashes, ECC
l d t h h ECC
Method
Storage
O
Overhead
h d
SHA1
Hashing
(4KB pages)
72/64 Hamming Codes
(ECC)
Locked/Protected
L
k d/P t t d Page
P
– LIBSDC mustt validate
lid t b
before
f
nextt use – read/write
d/ it
causes access violation
i l ti
Unlocked/Unprotected Page – Recently validated – may be read/written
0 49%
0.49%
12 5%
12.5%
Tuning
g
 M
Max‐unlocked:
l k d Adjust the maximum number of pages to be allowed Adj t th
i
b
f
t b ll
d
“unlocked”
unlocked at a time. Ideally set at the number of pages in an at a time Ideally set at the number of pages in an
application’ss working
application
working‐set
set during runtime
during runtime
 Hash or ECC:
H h ECC Choose if you desire SDC detection and/or correction
Ch
if y d i SDC d t ti
d/
ti
 Memory to protect: Choose or combine:
Memory to protect: Choose or combine:
o Application
Application’ss heap, bss, data, and/or code
heap bss data and/or code
o Other linked libraries (optionally include or exclude)
h l k dlb
( p
lly l d
l d )
LIBSDC storage (i
(i.e.,
e hashes of pages protected)
Memory Verification
On page request (inital read or write):
If page is locked:
Perform hash of page
Compare current hash with previously stored known-good hash
If any inconsistency found:
Notify
if the presence off SDC
S C and report location
i
T i
Terminate
application
li i / RRollback
llb k to previous
i
checkpoint
h k i
M k page as unlocked
Mark
l k d (mprotect)
(
t t)
On page lock:
Calculate new hash of entire page
Storage hash in separate location
Mark page as locked (mprotect)
 Ideal
Ideal max
max‐unlocked
unlocked around 4096
around 4096‐5120
5120 to match working
to match working‐set
set size
size
 On average, about 15% of overhead spent on page hashing
O
g , b t 15% f
h d p t
p g h hi g
 53% overhead vs baseline when tuned with optimal max‐unlocked
53% overhead vs baseline when tuned with optimal max unlocked
Future Work
 Recent
Recent related work has shown (Ferreira, SC11) page hashing on related work has shown (Ferreira SC11) page hashing on
GPUs can greatly reduce the overhead spent hashing on the CPU
GPUs can greatly reduce the overhead spent hashing on the CPU
 Replace LIBSDC’s FIFO policy of unlocked pages with a smarter R pl
LIBSDC’ FIFO p li y f l k d p g
ih
frequency based algorithm
frequency‐based algorithm
 Investigate using kernel page tables/invalid bit to reduce the Investigate using kernel page tables/invalid bit to reduce the
overhead incurred from frequent use of mprotect (TLB flushes may h d
df
f q
f p
(
fl h
y
be responsible for much overhead)
be responsible for much overhead)
This work was supported in part by NSF grants CNS 1058779 CNS 0958311 DOE grant DE FG02 08ER25837 a subcontract from Sandia National Laboratory and by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL) managed by UT‐Battelle, LLC for the U. S. Department of Energy under Contract No. De‐AC05‐00OR22725.
This work was supported in part by NSF grants CNS‐1058779, CNS‐0958311, DOE grant DE‐FG02‐08ER25837, a subcontract from Sandia National Laboratory. and by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory (ORNL),
managed by UT Battelle LLC for the U S Department of Energy under Contract No De AC05 00OR22725
Download