Neutron Beam Testing of Triblades
Sarah E. Michalak, Andrew J. DuBois, Curtis B. Storlie, William N. Rust, David H. DuBois,
David G. Modl, Heather M. Quinn, Andrea Manuzzato, and Sean P. Blanchard
Abstract—Four IBM Triblades were tested in the Irradiation of Chips and Electronics facility at the Los Alamos Neutron Science Center. Each Triblade includes four PowerXCell 8i (Cell) processors and two dual-core AMD Opteron processors. The
Triblades were tested in their field configuration while running
different applications, with the beam irradiating all of the
hardware in the beampath of the Cell or the Opteron running the
application. Testing focused on the Cell processors, which were
tested while running five different applications and an idle
condition. While neither application nor Triblade was
statistically important in predicting the hazard rate, the hazard
rate when the beam was irradiating the hardware in the Opteron
beampath was statistically higher than when it was irradiating
the hardware in the Cell beampath. In addition, four Cell blades
(one in each Triblade) suffered voltage shorts in voltage
regulators, leading to their inoperability. The hardware tested is
the same as that in the Roadrunner supercomputer, the world’s
first Petaflop supercomputer.
Index Terms—soft error, single event effect, silent data corruption, FIT rate, neutron beam testing

This work has been authored by an employee of Los Alamos National Security, LLC (LANS), operator of the Los Alamos National Laboratory under Contract No. DE-AC52-06NA25396 with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies. Neither the Government nor LANS makes any warranty, express or implied, or assumes any liability or responsibility for the use of this information. The Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; therefore, the Laboratory as an institution does not endorse the viewpoint of a publication or guarantee its technical correctness. This paper is published under LA-UR-11-01662.

Corresponding author S. E. Michalak is with the Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, NM 87544 USA (505-667-2625; fax: 505-667-4470; e-mail: michalak@lanl.gov). A. J. DuBois, D. H. DuBois, D. G. Modl, and S. P. Blanchard are with the System Integration Group, Los Alamos National Laboratory (e-mails: ajd@lanl.gov, dhd@lanl.gov, digem@lanl.gov, seanb@lanl.gov). C. B. Storlie and W. N. Rust are with the Statistical Sciences Group, Los Alamos National Laboratory (e-mails: storlie@lanl.gov, wnr@lanl.gov). H. M. Quinn is with the Space Data Systems Group, Los Alamos National Laboratory (e-mail: hquinn@lanl.gov). A. Manuzzato was with Università degli Studi di Padova, Italy (e-mail: andrea.manuzzato@ieee.org).

I. INTRODUCTION
Cosmic rays interact with the earth's atmosphere and generate a flux of neutrons that reaches the terrestrial surface. The interaction of neutrons with electronics can induce soft errors [1, 2], and in microprocessor systems these events may cause
crashes and silent data corruption (SDC) [3, 4]. In high-performance computing (HPC) platforms used for scientific
computation, such errors are of concern since (1) system
crashes affect application runtimes and (2) SDC in scientific
applications may lead to incorrect scientific conclusions. The
ability to infer field experience from accelerated testing data
such as that resulting from neutron beam testing is important.
This study adds to the literature by beam testing hardware in
its field configuration while it is running different
applications, including some used for scientific research.
II. TEST SETUP AND EXPERIMENTAL PROTOCOL
Testing was performed at Los Alamos National Laboratory’s
(LANL) Los Alamos Neutron Science Center (LANSCE)
Irradiation of Chips and Electronics (ICE) House facility by
LANL personnel. The energy spectrum of the neutron flux provided by the ICE House is similar to the terrestrial spectrum but many times more intense [5].
A. Test Setup
Four Triblades [6] and a BladeCenter-H (BC-H Type 8852)
[7] that housed the Triblades were tested at LANSCE in
October 2009 to investigate their neutron susceptibility while
running different applications. For this testing, the Cell
processors were of greatest interest. All testing was performed
at nominal voltages and nominal temperatures with the test
fixture at normal incidence to the beam.
Each Triblade included one IBM LS21 blade that housed two
dual-core AMD Opteron 2210 HE processors, two IBM QS22
blades (QS22a and QS22b) each of which housed two
PowerXCell 8i (Cell) processors, and an expansion blade for
managing data traffic. The Triblades and the BC-H are
identical to those in LANL’s Roadrunner, the first Petaflop
supercomputer [8]. The Opteron 2210 HE is a 1.8 GHz 90 nm SOI dual-core microprocessor. Each Opteron core has an ECC-protected 64 KB L1 data cache, a parity-protected 64 KB L1 instruction cache, and an ECC-protected 1 MB L2 cache. The LS21 includes 16 GB of ECC DDR-2 DRAM. The Cell is a 65 nm SOI microprocessor with 1 power processing element (PPE), which controls 8 synergistic processing elements (SPEs). The 3.2 GHz PPE includes a power processing unit (PPU), a parity-protected 32 KB L1 data cache, a parity-protected 32 KB L1 instruction cache, and a 512 KB L2 cache with ECC on data and parity on directory tags (which is recoverable using redundant directories). Each 3.2 GHz SPE includes a synergistic processing unit (SPU) and an ECC-protected 256 KB dedicated non-caching local store. The QS22 includes 8 GB of ECC DDR-2 DRAM.
During testing, the Triblade under test was housed in the BC-H as it would be in the field. The BC-H was oriented so that for a single Triblade under test, the beam first entered the QS22b, followed by the QS22a, the expansion blade, and the LS21. Two Xilinx Virtex-II FPGAs [9] were also in the beam, one upbeam and the other downbeam of the BC-H, with their measurements used for calculating corrected neutron fluence exposures as explained in Section III.A. Fig. 1 shows the test setup.

Fig. 1. The Test Setup. The upper photo shows the BC-H that housed the Triblades and the down-beam Virtex-II, while the lower photo shows Triblade 3 in the BC-H.
A front-end node (IBM eServer X Series 336) running Fedora
9 that was external to the beam controlled the system under
test. It was used to start applications on the system under test
and to monitor its health, both of which were performed
manually by the experimental personnel. Two MacBook Pro
5,3 laptops were used for logging and backing up the data.
The test applications for the Cell included five computational codes detailed in the next paragraph (hybrid Linpack, correlator, a conjugate gradient solver, VPIC, and an integer add code) and an idle code in which the Opteron interrogates the Cell to determine whether all processing elements (PPU and SPUs) are still executing. For the Opterons, the test applications included an Opteron-only version of the correlator code, idling, and running the top command; the latter two are considered an idle condition in the analyses that follow.

Hybrid Linpack performs the Linpack benchmark calculation, optimized for the Triblade architecture, with most of the computation performed on the Cell [10, 11]. The correlator code performs a multiply and accumulate needed for certain radio-astronomy applications [12]. It utilizes both the Opteron and PPU in very limited ways, with most of the computation performed on the SPEs. The conjugate gradient method is a member of a family of iterative solvers used primarily on large sparse linear systems arising from the discretization of partial differential equations. The conjugate gradient solver used here performs a double-precision, preconditioned conjugate gradient (CG) algorithm and utilizes the Opteron primarily for generation of the sparse linear system, with the CG implementation taking place on the Cell. VPIC is a 3D electromagnetic relativistic particle-in-cell plasma physics simulation code [13]. The version used for this testing was written to run on the IBM Cell processor in a hybrid processor environment like that of a Triblade. The integer add code is a simple hybrid code that executes primarily on the SPUs, using vector integer units to perform simple adds. Vector registers on the SPUs are loaded, and vector adds are executed over these registers and verified for correctness.

Each test code was configured so that it completed its work in roughly one minute and then produced an output line, which included start and stop times, the application being run, the hardware running it, and other output such as that required to assess whether an SDC had occurred or, in the case of the idle Cell condition, whether the Cells under test were still responding.
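For concreteness, the sketch below illustrates the general shape of such a test loop in Python. This is an illustrative reconstruction, not the actual test codes; the workload, application label, and hardware label are hypothetical.

import time

APP_NAME = "integer_add"             # hypothetical application label
HARDWARE_ID = "triblade-qs22-upper"  # hypothetical hardware label

def run_work_unit():
    # Placeholder for roughly one minute of real computation on the
    # processor under test; returns a value that can be checked later.
    return sum(range(10_000_000))

def harness(n_units):
    # Emit one output line per completed unit of work: start and stop
    # times, the application, the hardware running it, and the result
    # needed to assess whether an SDC occurred.
    for _ in range(n_units):
        start = time.time()
        result = run_work_unit()
        stop = time.time()
        print(f"{start:.3f} {stop:.3f} {APP_NAME} {HARDWARE_ID} result={result}")

harness(3)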
Two different beam diameters were used for the experiments:
a two-inch beam diameter for the first 53 experiments and a
one-inch beam diameter for the remaining 59.
B. Experimental Procedure
For a given experiment, a single Cell or Opteron was
configured to run the desired application while the beam was
aimed so that it irradiated all of the hardware in that
processor’s beampath. (The exception to this is the initial test
period during which the experimental team was determining
the amount of hardware that could be run in the beam
simultaneously.) With two QS22s in a Triblade, when a Cell
in one QS22 is running an application, the corresponding Cell
in the other QS22 is in the beampath. This second Cell in the
beampath was set to run the idle condition. Since the beam
irradiated a columnar volume within the Triblade under test
and the BC-H, certain attribution of an error to the Cells or
Opteron in the beampath is not possible. In particular, other
hardware in the beampath or hardware that was affected by
scatter could be the cause of an observed error. Errors could
also be the result of causes external to the beam.
The experimental protocol was to start the appropriate test
application on the appropriate processor while the beam was
off. Once the test application was observed to be operating
properly (had produced one or more output lines), the beam
was started. The experiment continued until a state of system
inoperability (e.g., a system or application crash) was reached
or sufficient time had elapsed. The beam was then turned off,
data pertaining to neutron fluence exposure were collected,
and the system was rebooted before beginning the next test.
For the Cells, the test procedure was to cycle through the test
applications on a particular Cell, typically until it became
inoperable. Repeating each test code periodically permits investigation of any aging or dose-related effects associated with increasing exposure to the beam. The procedure for the
Opterons, which received much less testing, was to use both
the Opteron-only correlator code and possibly an idle
condition (idling or running the top command). Functionality
of the Opteron while it was idling or running the top command
was assessed by ascertaining whether it would return a new
command prompt.
III. RESULTS
A. Hardware Tested and Calculation of Neutron Fluence
Exposures
Fourteen of the 16 available Cells were operated in the beam,
with the two upper Cells on Triblade 2 not operated in the
beam. Three Opterons were operated in the beam: both
Opterons on Triblade 1 and the lower Opteron on Triblade 4.
Each Opteron beampath test was performed after Cell
beampaths in the Triblade housing the Opteron had been
tested. Thus, the behavior of hardware in the Opteron
beampaths in Triblades without previous exposure to the beam
cannot be estimated based on our testing. We chose this test
procedure since the Cells were of primary interest in our
testing.
TABLE I
HARDWARE BEAM EXPOSURE AND FINAL CONDITION

Blade              Beam Aim        Corrected Neutron Fluence (neutrons/cm2)   Condition in Post-Beam Testing
Triblade 1 LS21    Upper Opteron   1.20E8     Able to run test applications
Triblade 1 LS21    Lower Opteron   4.50E8     Able to run test applications
Triblade 1 QS22a   Upper Cell      1.90E9     Permanent failure – voltage short in voltage regulator
Triblade 1 QS22a   Lower Cell      1.55E9     Permanent failure – voltage short in voltage regulator
Triblade 1 QS22b   Upper Cell      2.32E9     Boots at LANL, but can't communicate with Opteron; passed epc testing at IBM
Triblade 1 QS22b   Lower Cell      1.88E9     Boots at LANL, but can't communicate with Opteron; passed epc testing at IBM
Triblade 2 LS21    Upper Opteron   0          Damaged ground pins; able to run Linpack at IBM
Triblade 2 LS21    Lower Opteron   0          Damaged ground pins; able to run Linpack at IBM
Triblade 2 QS22a   Upper Cell      0          Able to boot Linux and pass epc testing at IBM
Triblade 2 QS22a   Lower Cell      1.58E9     Able to boot Linux and pass epc testing at IBM
Triblade 2 QS22b   Upper Cell      0          Damaged 12V connector; permanent failure – voltage short in voltage regulator
Triblade 2 QS22b   Lower Cell      1.93E9     Damaged 12V connector; permanent failure – voltage short in voltage regulator
Triblade 3 LS21    Upper Opteron   0          Able to run test applications
Triblade 3 LS21    Lower Opteron   0          Able to run test applications
Triblade 3 QS22a   Upper Cell      1.48E10    Able to run test applications
Triblade 3 QS22a   Lower Cell      3.60E9     Able to run test applications
Triblade 3 QS22b   Upper Cell      1.80E10    Permanent failure – voltage short in voltage regulator
Triblade 3 QS22b   Lower Cell      4.39E9     Permanent failure – voltage short in voltage regulator
Triblade 4 LS21    Upper Opteron   0          Able to run test applications
Triblade 4 LS21    Lower Opteron   5.75E8     Able to run test applications
Triblade 4 QS22a   Upper Cell      5.67E9     Able to run test applications
Triblade 4 QS22a   Lower Cell      2.77E10    Able to run test applications
Triblade 4 QS22b   Upper Cell      6.90E9     Permanent failure – voltage short in voltage regulator
Triblade 4 QS22b   Lower Cell      3.37E10    Permanent failure – voltage short in voltage regulator

Table I. Corrected Beam Exposure of Hardware in Neutrons/cm2 and Condition of Hardware When Tested at LANL and IBM. Triblade 2 suffered damage to one of the 12V connectors on its QS22b and damage to the ground pins on its LS21 as a result of handling, so it could not be tested in a production setting at LANL. The tool epc refers to a tool used at IBM for testing blades that tests the PPU memory, SPU local store, and data transfers between the PPU and SPU. Testing at LANL indicated that the QS22b in Triblade 1 booted, but could not communicate with the relevant Opteron in Triblade 1. However, testing at IBM revealed that QS22b in Triblade 1 was able to communicate with the relevant Opteron, and it passed the epc test.
In all, 112 experiments were performed. The first three experiments, the only data collected for Triblade 2, were omitted from the error modeling because they included three Triblades in the beam, whereas for the remaining experiments there was a single Triblade in the beam.
Since the effect of neutrons on Virtex-II devices is well characterized [14], measurements from the Virtex-IIs were used to calculate the corrected neutron fluence to which the hardware under test was exposed. Specifically, the distance from the blade running applications to the beam source and data from the Virtex-II devices were used to calculate corrected neutron fluence exposures that reflected beam attenuation caused by distance from the beam source and effects of the hardware (BC-H and the Triblade under test) in the beam. These neutron fluence exposures account for all neutrons with energies above 10 MeV.
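The paper does not give the exact correction formula; purely as an illustration of the idea, the Python sketch below assumes log-linear attenuation between the upbeam and downbeam Virtex-II fluence estimates, with all numbers hypothetical.

import math

def corrected_fluence(upbeam, downbeam, frac_position):
    # Interpolate the fluence seen by the blade between the two Virtex-II
    # monitor readings. frac_position is the fraction of the attenuating
    # path traversed (0 = upbeam monitor, 1 = downbeam monitor).
    # Log-linear interpolation reflects roughly exponential attenuation.
    return math.exp((1.0 - frac_position) * math.log(upbeam)
                    + frac_position * math.log(downbeam))

# Hypothetical monitor readings in neutrons/cm2, not measured values:
print(corrected_fluence(2.0e9, 1.5e9, 0.25))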
B. Longevity of Hardware in the Beam, Post-Beam Testing
and Root Cause Analysis of Permanently Failed Hardware
Some hardware experienced permanent failures relatively
quickly upon exposure to the beam, while other hardware had
greater longevity in the beam. For example, QS22b on
Triblade 2 was unable to boot after exposure to a corrected
neutron fluence of 1.93E9 neutrons/cm2, while the lower Cell on QS22a on Triblade 4 remained operational after exposure to a corrected neutron fluence of 2.77E10 neutrons/cm2.
Following the beam testing, Triblades 1, 3 and 4 were tested in a production platform at LANL. Triblade 2 could not be installed for this testing because its backplane was damaged as a result of handling (as opposed to neutron-induced effects), namely from seating it in the BC-H. All applications used for the beam testing, plus an Opteron-only version of Linpack not available at the time of the beam testing, were used, with the exception that the bottom Opteron in Triblade 4 was not tested with the Opteron-only correlator code used in the beam. Each of Triblades 1, 3 and 4 had a QS22 that would not boot; in addition, the QS22 in Triblade 1 that would boot could not communicate with the relevant Opteron. Following this post-beam testing, all four Triblades were returned to IBM in Rochester, MN for root cause analysis, with the permanently failed QS22s (one in each of the four Triblades) found to have failed as the result of voltage shorts in voltage regulators.
A number of voltage regulators have been experimentally
shown to experience single-event burnout and single-event
gate rupture, both of which are destructive effects that can
cause the system to be fully or partially unpowered. Because
of these problems, many space-deployed systems are heavily
screened to avoid such issues. Recent research has shown that
similar problems can occur terrestrially with both thermal and
fast neutrons [15, 16].
Table I details the corrected neutron fluence accumulated at each beam aim during the testing and the condition of the Triblades upon testing at LANL and IBM.
C. Silent Data Corruption
Four SDCs were observed. Two SDCs occurred when Cells were running computational codes: one on the upper Cell in the QS22b in Triblade 3 during a VPIC calculation and one on the upper Cell in the QS22a in Triblade 1 during a correlator calculation. Two SDCs occurred on Opterons while the Opteron-only correlator code was running: one on the upper Opteron and one on the lower Opteron of Triblade 1. Checks
that determined when an SDC occurred were performed using
either a 160-bit SHA-1 hash (for blocks of output data for
VPIC and for the Cell correlator code) or a 32-bit CRC (for
the output block of data for the Opteron correlator code), so
the magnitude of the difference between the calculated result
and the correct result for a particular SDC cannot be
determined.
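The check itself is a straightforward comparison of check values against a known-good reference run; a minimal Python sketch follows (the output blocks shown are hypothetical).

import hashlib
import zlib

def sha1_digest(block):
    # 160-bit SHA-1 hash, as used for blocks of VPIC and Cell correlator output.
    return hashlib.sha1(block).hexdigest()

def crc32_checksum(block):
    # 32-bit CRC, as used for the Opteron correlator output block.
    return zlib.crc32(block) & 0xFFFFFFFF

def is_sdc(observed, reference, use_crc=False):
    # An SDC is flagged when the check value of the observed output block
    # differs from that of the reference. Only the fact of a mismatch is
    # detectable; the magnitude of the numerical error is not recoverable.
    if use_crc:
        return crc32_checksum(observed) != crc32_checksum(reference)
    return sha1_digest(observed) != sha1_digest(reference)

print(is_sdc(b"output-block-A", b"output-block-A"))                # False
print(is_sdc(b"output-block-A", b"output-block-B", use_crc=True))  # True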
D. Failure Data
We categorized each experiment as having one of two end
states: 1) survival, meaning that the experiment ended when
the experimenter believed the application was still running or
2) failure, indicating that the application was no longer
running at the end of the experiment, e.g. because of an
application or system crash. Since the output from the test
applications appeared roughly every minute, it is possible that
in some cases in which the system is deemed to have survived
the experiment it had actually failed, but that failure was not
detected before the experiment ended. Of the 95 tests conducted with the beam aimed at a Cell that were analyzed for the results presented here, 79 ended in failure, while all 14 tests conducted with the beam aimed at an Opteron ended in failure.
E. Modeling the Data
For our modeling, let $Y_i$ denote the corrected exposure (neutron fluence in neutrons/cm2) until an error (SDC or failure, e.g. crash) for each error $i$. The observations are interval censored, i.e., only an interval $(a_i, b_i)$ such that $Y_i \in (a_i, b_i)$ is observed. This is because the exact time that an error occurs is unknown. Instead, all that is known are the time of the last output line ($a_i$) and the time at which the operator noticed the error ($b_i$) or ended the test. We model the $Y_i$ as coming independently from a Cox model [17], but with a shifted hazard rate

$$h(y \mid z_i, x_i) = h_0(y + x_i)\,\exp(\beta' z_i),$$

where (i) $h_0$ is the baseline hazard rate, (ii) $x_i$ is the cumulative exposure of the component under test at the start time for the $i$-th error, (iii) $z_i = (z_{i,1}, \ldots, z_{i,p})$ is a vector of covariates corresponding to the $i$-th error (which application was running, the Triblade under test, the beam aim (hardware in a Cell beampath or hardware in an Opteron beampath), and the beam diameter), and (iv) $\beta$ is a vector of unknown parameters. In this formulation, $x_i$ shifts the baseline hazard rate to make it relative to the amount of exposure that the component being tested has accumulated in its lifetime, as opposed to the amount of exposure that it received during an individual experiment. The probabilistic model for $Y_i$ is then

$$S_i(y) = P(Y_i > y) = \exp\!\left(-\int_0^y h(s \mid z_i, x_i)\, ds\right).$$

With this, if $h(s \mid z_i, x_i)$ increases with $s$, the probability that $Y_i > s$ decreases more rapidly than if $h(s \mid z_i, x_i)$ remained constant, but if $h(s \mid z_i, x_i)$ decreases with $s$, there is a greater chance that $Y_i > s$. The unknown parameters in the model were estimated
using Bayesian methods [18]. Thorough diagnostics to assess
the adequacy of this model for the experimental data were
conducted and gave no indication of a violation of model
assumptions.
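To make the likelihood structure concrete: the paper estimates a flexible baseline hazard with Bayesian methods, but in the special case of a constant baseline hazard (which Section III.F suggests is a reasonable approximation) the shift $x_i$ cancels, and the interval-censored likelihood takes the simple form sketched below in Python. The data shown are hypothetical.

import numpy as np

def log_lik(h0, beta, Z, a, b, failed):
    # Interval-censored log-likelihood under a constant baseline hazard h0
    # per unit fluence. Z is the (n, p) covariate matrix; a and b are the
    # exposures (neutrons/cm2) at the last good output line and at error
    # detection; failed is False for experiments that ended in survival.
    rate = h0 * np.exp(Z @ beta)  # hazard scaled by exp(beta' z_i)
    S_a = np.exp(-rate * a)       # P(Y_i > a_i)
    S_b = np.exp(-rate * b)       # P(Y_i > b_i)
    # Failures contribute P(a_i < Y_i <= b_i); survivals contribute P(Y_i > a_i).
    contrib = np.where(failed, S_a - S_b, S_a)
    return np.sum(np.log(contrib))

# Hypothetical data: one covariate (1 = Opteron beam aim), two experiments.
Z = np.array([[0.0], [1.0]])
a = np.array([1.0e9, 2.0e8])
b = np.array([1.1e9, 3.0e8])
failed = np.array([True, False])
print(log_lik(1.0e-9, np.array([np.log(12.2)]), Z, a, b, failed))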
F. Study Results
The results presented below pertain to the conditions under
which the experiments were conducted (the applications run,
the beam aims, the angle of incidence of the beam to the
hardware under test, etc.) and the model used for the data,
with results likely to be obtained under other conditions less
clear. All results have been estimated via Markov Chain
Monte Carlo [18].
The baseline hazard rate is essentially constant, suggesting that the instantaneous failure rate does not vary much with increasing exposure to the beam for the exposures observed in our testing.
The posterior probability that beam aim (hardware in a Cell beampath or hardware in an Opteron beampath) affects the hazard rate is 1.0. Aiming the beam at the hardware in an Opteron beampath results in an average multiplier to the hazard rate of 12.2 with a 95% credible interval (CI) of (5.8, 24.0), meaning there is about 12 times more risk of an error when aiming the beam at the hardware in an Opteron beampath as opposed to the hardware in a Cell beampath. This result is not the same as saying that the Opteron has a hazard rate that is roughly 12 times the Cell hazard rate, since other hardware along the trajectory of the beam when it is aimed at an Opteron, or hardware affected by scatter, may be responsible for errors that resulted when the beam was aimed at an Opteron. Moreover, the hardware in an Opteron beampath in a given Triblade was tested after the hardware in at least one Cell beampath in that Triblade was tested.

The posterior probability that the hazard rate for Triblade 3 is different from Triblade 1 is 0.046, while the posterior probability that the hazard rate for Triblade 4 is different from Triblade 1 is 0.319. Conditional on Triblade 4 having an effect, the average multiplier to the hazard rate is 0.53, with a 95% CI of (0.32, 0.79). These results suggest that which Triblade was under test does not have a substantial effect on the hazard rate. However, with more Triblades tested and/or more time spent under test, differences among the hazard rates for different Triblades might be found.

The hazard rate was also not affected by whether an application was being run as opposed to an idle condition (the posterior probability was 0.023). Similarly, when effects for each of the different applications relative to the idle condition were included in the model, none of the applications were found to have hazard rates different from the idle condition (all posterior probabilities were less than 0.06). With additional data and/or with other applications, an application effect might be found.

For beam diameter, the posterior probability of an effect on the hazard rate is 0.12, indicating very little evidence that beam diameter is an important predictor of the hazard rate. These results suggest that the additional hardware in the beampath when the beam has a two-inch diameter versus a one-inch diameter has little effect on the hazard rate. The beam diameter also affects the intensity of the beam, with a lower number of neutrons/cm2 for the two-inch diameter. This was, however, accounted for in the modeling results.

The posterior mean probability that a given error is an SDC, as opposed to a failure, is 0.051, with a 95% CI of (0.019, 0.097). The posterior mean of the ratio of the probability of an SDC to the probability of a failure is 0.054 (consistent with 0.051/(1 − 0.051) ≈ 0.054), with a 95% CI of (0.019, 0.107).

These results are for the neutron flux in Los Alamos, NM, which is estimated to be approximately 6.4 times that at sea level [3], and reflect only neutrons with energies above 10 MeV. There is also variability in the neutron flux over time. To account for this, it is assumed, as in a previous study [3], that the ambient neutron flux in Los Alamos, NM is normal with mean 0.025 neutrons/cm2/sec and standard deviation 4.4E-04.

Based on our estimated model, a failure (application or system crash, etc.) FIT rate and an SDC FIT rate for each of the two beam aims (hardware in the Opteron beampath and hardware in the Cell beampath) are provided, along with predictive distributions, which incorporate all estimation uncertainty and inherent randomness, for the actual number of failures in 10^9 hours and the actual number of SDCs in 10^9 hours that might be observed in practice for each of the two beam aims. Figs. 2-5 present the predictive distributions, while their captions include the posterior mean FIT rate, with the uncertainty in the estimation of model parameter values (i.e., the hazard rate for each beam aim) accounted for in the 95% CI for the FIT rate also provided in the figure captions. The FIT rates and corresponding distributions presented below are calculated based on the estimated model for beam aim failure times and an assumption of independence between successive failure times.

Fig. 2. Predictive Distribution of the Number of Failures in 10^9 Hours for All Hardware in the Cell Beampath. The posterior mean FIT rate is 166.7 with a 95% CI of (83.0, 272.1).

Fig. 3. Predictive Distribution of the Number of SDCs in 10^9 Hours for All Hardware in the Cell Beampath. The posterior mean FIT rate is 8.93 with a 95% CI of (4.45, 14.58).

Fig. 4. Predictive Distribution of the Number of Failures in 10^9 Hours for All Hardware in the Opteron Beampath. The posterior mean FIT rate is 2058 with a 95% CI of (809, 5067).

Fig. 5. Predictive Distribution of the Number of SDCs in 10^9 Hours for All Hardware in the Opteron Beampath. The posterior mean FIT rate is 110 with a 95% CI of (43.4, 271).
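The conversion from a per-fluence hazard rate to the FIT rates quoted in the captions is implied by the assumed flux; a minimal Python sketch follows. The per-fluence hazard below is back-calculated from the reported Cell-beampath posterior mean purely for illustration, and the full analysis in the paper propagates parameter uncertainty via MCMC.

import numpy as np

rng = np.random.default_rng(0)

def fit_rate(lam, flux=0.025):
    # Expected errors in 10^9 device-hours for a hazard of lam per
    # (neutron/cm2) and an ambient flux in neutrons/cm2/sec (>10 MeV).
    return lam * flux * 3600.0 * 1.0e9

def predictive_errors(lam, n_draws=100_000):
    # Predictive draws of the error count in 10^9 hours: flux variability
    # (normal with mean 0.025 and sd 4.4E-4, as assumed in the paper)
    # plus Poisson counting noise.
    flux = rng.normal(0.025, 4.4e-4, size=n_draws)
    return rng.poisson(fit_rate(lam, flux))

# Per-fluence hazard chosen so the FIT rate matches the ~167 posterior
# mean reported for the Cell beampath (illustrative only):
lam = 167.0 / (0.025 * 3600.0 * 1.0e9)
draws = predictive_errors(lam)
print(fit_rate(lam), draws.mean(), np.percentile(draws, [2.5, 97.5]))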
The estimated model was also used to estimate error (failures and SDCs) FIT rates for a single Triblade and for the 180 compute Triblades in a single connected unit (CU). (The Roadrunner supercomputer is composed of 17 CUs.) These FIT rates only incorporate hardware in the Cell and the Opteron beampaths, so do not include all hardware in a Triblade. Specifically, independence of the failures occurring at the different beam aims, whether Opteron (top or bottom) or Cell (top or bottom), is assumed. Therefore, the hazard rates for these four distinct aims are added together to obtain an approximate hazard for an entire Triblade. Independence among Triblades is assumed to get an overall hazard rate for an entire CU in a similar manner. These FIT rates are again for the neutron flux in Los Alamos, NM and incorporate only neutrons with energies above 10 MeV. As above, the predictive distribution for the number of errors in 10^9 hours that might be observed in practice is presented along with the posterior mean and 95% CI for the FIT rate (in the figure captions); see Figs. 6-7. The Triblade and Roadrunner CU FIT rate information is extrapolated from the neutron beam testing and does not represent observed hardware failures in the Roadrunner supercomputer.
Fig. 6. Predictive Distribution of the Number of Errors in 10^9 Hours for One Triblade. The posterior mean FIT rate is 4703 with a 95% CI of (1935, 11216).

Fig. 7. Predictive Distribution of the Number of Errors in 10^9 Hours for One Roadrunner CU. The posterior mean FIT rate is 8.37E5 with a 95% CI of (3.51E5, 1.89E6).
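As a rough consistency check on this aggregation, summing the posterior mean FIT rates reported in Figs. 2-5 over a Triblade's four beam aims approximately reproduces the Triblade and CU figures:

# Posterior mean FIT rates from Figs. 2-5 (errors per 10^9 hours):
cell_aim = 166.7 + 8.93       # failures + SDCs, Cell beampath
opteron_aim = 2058.0 + 110.0  # failures + SDCs, Opteron beampath

# A Triblade is approximated as four independent beam aims:
# upper and lower Cell aims plus upper and lower Opteron aims.
triblade_fit = 2 * cell_aim + 2 * opteron_aim
cu_fit = 180 * triblade_fit   # 180 compute Triblades per CU

print(triblade_fit)  # ~4687, close to the reported posterior mean of 4703
print(cu_fit)        # ~8.4E5, close to the reported 8.37E5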
IV. ACKNOWLEDGEMENT
Many people contributed to the success of this project. The authors thank Joe Abeyta,
Chuck Alexander, Ben Bergen, Ann Borrett, Henry Brandt, James Campa, Randy
Cardon, Tom Fairbanks, Parks Fields, Alan Gibson, Gary Grider, Josip Loncaric, Pablo
Lujan, Alex Malin, Fred Marshall, Andrew Montoya, John Morrison, Andrew
Shewmaker, Manuel Vigil, Bob Villa, Steve Wender, and Cornell Wright and apologize
for any inadvertent omissions from this list.
V. REFERENCES
[1] J. Ziegler and W. Lanford, "The Effect of Sea-Level Cosmic Rays on Electronic Devices," J. Appl. Phys., vol. 52, no. 6, pp. 4305-4312, 1981.
[2] R. C. Baumann, "Radiation-induced soft errors in advanced semiconductor technologies," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, pp. 305-316, Sept. 2005.
[3] S. Michalak, K. Harris, N. Hengartner, B. Takala, and S. Wender, "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, Sept. 2005.
[4] T. Hong, S. Michalak, T. Graves, J. Ackaret, and S. Rao, "Neutron beam irradiation study of workload dependence of SER in a microprocessor," 2009 SELSE Proceedings.
[5] B. Takala, "The ICE House: neutron testing leads to more-reliable electronics," Los Alamos Science [Online]. Available: http://library.lanl.gov/cgi-bin/getfile?30-12.pdf, Nov. 30, 2006.
[6] K. Koch, "Roadrunner platform overview" [Online]. Available: http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Koch%20-%20Roadrunner%20Overview/RR%20Seminar%20-%20System%20Overview.pdf, 2008.
[7] "BladeCenter H (8852)" [Online]. Available: http://publib.boulder.ibm.com/infocenter/bladectr/documentation/index.jsp?topic=/com.ibm.bladecenter.8852.doc/bc_8852_product_page.html
[8] H. Meuer, "31st TOP500 List topped by first-ever petaflop/s supercomputer," Scientific Computing [Online]. Available: http://www.scientificcomputing.com/31st-TOP500-List-Topped-by-First-everPetaflops-Supercomputer.aspx
[9] Xilinx, "Virtex-II Platform FPGAs Complete Data Sheet" [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds031.pdf, 2007.
[10] M. Kistler, J. Gunnels, D. Brokenshire, and D. Benton, "Programming the Linpack benchmark for Roadrunner," IBM J. Res. Dev., vol. 53, no. 5, paper 9, 2009.
[11] A. Petitet, R. Whaley, J. Dongarra, and A. Cleary, "HPL – a portable implementation of the High-Performance Linpack benchmark for distributed memory computers" [Online]. Available: http://www.netlib.org/benchmark/hpl/.
[12] A. DuBois, C. Connor, S. Michalak, G. Taylor, and D. DuBois, "Application of the IBM Cell processor to real-time cross-correlation of a large antenna array radio telescope," Los Alamos Nat. Lab. Tech. Rep. LA-UR-09-03483, 2009.
[13] K. Bowers, B. Albright, et al., "Advances in petascale kinetic plasma simulation with VPIC and Roadrunner," J. Phys.: Conf. Ser. 180 (SciDAC 2009), 2009.
[14] A. Lesea, S. Drimer, et al., "The Rosetta experiment: atmospheric soft error rate testing in differing technology FPGAs," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, pp. 317-328, Sept. 2005.
[15] A. Hands, P. Morris, et al., "Single event effects in power MOSFETs due to atmospheric and thermal neutrons," to be presented at the IEEE Nuclear and Space Radiation Effects Conference, 2011.
[16] P. Shea and Z. Shen, "Numerical and experimental investigation of single event effects in SOI lateral power MOSFETs," to be presented at the IEEE Nuclear and Space Radiation Effects Conference, 2011.
[17] D. Cox, "Regression models and life-tables," J. of the Roy. Stat. Soc., Series B (Meth.), vol. 34, no. 2, pp. 187-220, 1972.
[18] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis. London: Chapman and Hall, 1995.