Neutron Beam Testing of Triblades

Sarah E. Michalak, Andrew J. DuBois, Curtis B. Storlie, William N. Rust, David H. DuBois, David G. Modl, Heather M. Quinn, Andrea Manuzzato, and Sean P. Blanchard

Abstract—Four IBM Triblades were tested in the Irradiation of Chips and Electronics facility at the Los Alamos Neutron Science Center. Triblades include four PowerXCell 8i (Cell) processors and two dual-core AMD Opteron processors. The Triblades were tested in their field configuration while running different applications, with the beam irradiating all of the hardware in the beampath of the Cell or the Opteron running the application. Testing focused on the Cell processors, which were tested while running five different applications and an idle condition. While neither application nor Triblade was statistically important in predicting the hazard rate, the hazard rate when the beam was irradiating the hardware in the Opteron beampath was statistically higher than when it was irradiating the hardware in the Cell beampath. In addition, four Cell blades (one in each Triblade) suffered voltage shorts in voltage regulators, leading to their inoperability. The hardware tested is the same as that in the Roadrunner supercomputer, the world's first Petaflop supercomputer.

Index Terms—soft error, single event effect, silent data corruption, FIT rate, neutron beam testing

This work has been authored by an employee of the Los Alamos National Security, LLC (LANS), operator of the Los Alamos National Laboratory under Contract No. DE-AC52-06NA25396 with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies. Neither the Government nor LANS makes any warranty, express or implied, or assumes any liability or responsibility for the use of this information. The Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; therefore, the Laboratory as an institution does not endorse the viewpoint of a publication or guarantee its technical correctness. This paper is published under LA-UR-11-01662.

Corresponding author S. E. Michalak is with the Statistical Sciences Group, Los Alamos National Laboratory, Los Alamos, NM 87544 USA (505-667-2625; fax: 505-667-4470; e-mail: michalak@lanl.gov). A. J. DuBois, D. H. DuBois, D. G. Modl, and S. P. Blanchard are with the System Integration Group, Los Alamos National Laboratory (e-mails: ajd@lanl.gov, dhd@lanl.gov, digem@lanl.gov, seanb@lanl.gov). C. B. Storlie and W. N. Rust are with the Statistical Sciences Group, Los Alamos National Laboratory (e-mails: storlie@lanl.gov, wnr@lanl.gov). H. M. Quinn is with the Space Data Systems Group, Los Alamos National Laboratory (e-mail: hquinn@lanl.gov). A. Manuzzato was with Università degli Studi di Padova, Italy (e-mail: andrea.manuzzato@ieee.org).

I. INTRODUCTION

Cosmic rays interact with the earth's atmosphere and generate a flux of neutrons that reaches the terrestrial surface. The interaction of neutrons with electronics can induce soft errors [1, 2], and in microprocessor systems these events may cause crashes and silent data corruption (SDC) [3, 4]. In high-performance computing (HPC) platforms used for scientific computation, such errors are of concern since (1) system crashes affect application runtimes and (2) SDC in scientific applications may lead to incorrect scientific conclusions. The ability to infer field experience from accelerated testing data such as that resulting from neutron beam testing is important. This study adds to the literature by beam testing hardware in its field configuration while it is running different applications, including some used for scientific research.

II.
TEST SETUP AND EXPERIMENTAL PROTOCOL

Testing was performed at Los Alamos National Laboratory's (LANL) Los Alamos Neutron Science Center (LANSCE) Irradiation of Chips and Electronics (ICE) House facility by LANL personnel. The energy spectrum of the neutron flux provided by the ICE House is similar to the terrestrial spectrum, but is many times more intense [5].

A. Test Setup

Four Triblades [6] and a BladeCenter-H (BC-H Type 8852) [7] that housed the Triblades were tested at LANSCE in October 2009 to investigate their neutron susceptibility while running different applications. For this testing, the Cell processors were of greatest interest. All testing was performed at nominal voltages and nominal temperatures with the test fixture at normal incidence to the beam.

Each Triblade included one IBM LS21 blade that housed two dual-core AMD Opteron 2210 HE processors, two IBM QS22 blades (QS22a and QS22b), each of which housed two PowerXCell 8i (Cell) processors, and an expansion blade for managing data traffic. The Triblades and the BC-H are identical to those in LANL's Roadrunner, the first Petaflop supercomputer [8].

The Opteron 2210 HE is a 1.8 GHz 90nm SOI dual-core microprocessor. Each Opteron core has an ECC-protected 64 KB L1 data cache, a parity-protected 64 KB L1 instruction cache, and an ECC-protected 1 MB L2 cache. The LS21 includes 16 GB of ECC DDR-2 DRAM.

The Cell is a 65nm SOI microprocessor with one power processing element (PPE), which controls eight synergistic processing elements (SPEs). The 3.2 GHz PPE includes a power processing unit (PPU), a parity-protected 32 KB L1 data cache, a parity-protected 32 KB L1 instruction cache, and a 512 KB L2 cache with ECC on data and parity on directory tags (which is recoverable using redundant directories). Each 3.2 GHz SPE includes a synergistic processing unit (SPU) and an ECC-protected 256 KB dedicated non-caching local store. The QS22 includes 8 GB of ECC DDR-2 DRAM.
During testing, the Triblade under test was housed in the BC-H as it would be in the field. The BC-H was oriented so that for a single Triblade under test, the beam first entered the QS22b, followed by the QS22a, the expansion blade, and the LS21. Two Xilinx Virtex-II FPGAs [9] were also in the beam, one upbeam and the other downbeam of the BC-H, with their measurements used for calculating corrected neutron fluence exposures as explained in Section IIIA. Fig. 1 shows the test setup. A front-end node (IBM eServer X Series 336) running Fedora 9 that was external to the beam controlled the system under test. It was used to start applications on the system under test and to monitor its health, both of which were performed manually by the experimental personnel. Two MacBook Pro 5,3 laptops were used for logging and backing up the data.

Fig. 1. The Test Setup. The upper photo shows the BC-H that housed the Triblades and the down-beam Virtex-II, while the lower photo shows Triblade 3 in the BC-H.

The test applications for the Cell included five computational codes detailed below (hybrid Linpack, correlator, a conjugate gradient solver, VPIC, and an integer add code) and an idle code in which the Opteron interrogates the Cell to determine whether all processing elements (PPU and SPUs) are still executing. For the Opterons, the test applications included an Opteron-only version of the correlator code, idling, and running the top command; the latter two are considered an idle condition in the analyses that follow.

Hybrid Linpack performs the Linpack benchmark calculation, optimized for the Triblade architecture with most of the computation performed on the Cell [10, 11]. The correlator code performs a multiply and accumulate needed for certain radio-astronomy applications [12]. It utilizes both the Opteron and PPU in very limited ways, with most of the computation performed on the SPEs. The conjugate gradient method is a member of a family of iterative solvers used primarily on large sparse linear systems arising from the discretization of partial differential equations. The conjugate gradient solver used here performs a double-precision, preconditioned conjugate gradient (CG) algorithm and utilizes the Opteron primarily for generation of the sparse linear system, with the CG implementation taking place on the Cell. VPIC is a 3D electromagnetic relativistic particle-in-cell plasma physics simulation code [13]. The version used for this testing was written to run on the IBM Cell processor in a hybrid processor environment like that of a Triblade. The integer add code is a simple hybrid code that executes primarily on the SPUs, using vector integer units to perform simple adds: vector registers on the SPUs are loaded, and vector adds are executed over these registers and verified for correctness.

Each test code was configured so that it completed its work in roughly one minute and then produced an output line, which included start and stop times, the application being run, the hardware running it, and other output such as that required to assess whether an SDC had occurred or, in the case of the idle Cell condition, whether the Cells under test were still responding. Two different beam diameters were used for the experiments: a two-inch beam diameter for the first 53 experiments and a one-inch beam diameter for the remaining 59.

B. Experimental Procedure

For a given experiment, a single Cell or Opteron was configured to run the desired application while the beam was aimed so that it irradiated all of the hardware in that processor's beampath. (The exception to this is the initial test period during which the experimental team was determining the amount of hardware that could be run in the beam simultaneously.) With two QS22s in a Triblade, when a Cell in one QS22 is running an application, the corresponding Cell in the other QS22 is in the beampath.
This second Cell in the beampath was set to run the idle condition. Since the beam irradiated a columnar volume within the Triblade under test and the BC-H, certain attribution of an error to the Cells or Opteron in the beampath is not possible. In particular, other hardware in the beampath or hardware that was affected by scatter could be the cause of an observed error. Errors could also be the result of causes external to the beam.

The experimental protocol was to start the appropriate test application on the appropriate processor while the beam was off. Once the test application was observed to be operating properly (had produced one or more output lines), the beam was started. The experiment continued until a state of system inoperability (e.g., a system or application crash) was reached or sufficient time had elapsed. The beam was then turned off, data pertaining to neutron fluence exposure were collected, and the system was rebooted before beginning the next test.

For the Cells, the test procedure was to cycle through the test applications on a particular Cell, typically until it became inoperable. Repeating each test code periodically permits investigation of any aging or dose-related effects related to increasing exposure to the beam. The procedure for the Opterons, which received much less testing, was to use both the Opteron-only correlator code and possibly an idle condition (idling or running the top command). Functionality of the Opteron while it was idling or running the top command was assessed by ascertaining whether it would return a new command prompt.

III. RESULTS

A. Hardware Tested and Calculation of Neutron Fluence Exposures

Fourteen of the 16 available Cells were operated in the beam, with the two upper Cells on Triblade 2 not operated in the beam. Three Opterons were operated in the beam: both Opterons on Triblade 1 and the lower Opteron on Triblade 4.
Each Opteron beampath test was performed after Cell beampaths in the Triblade housing the Opteron had been tested. Thus, the behavior of hardware in the Opteron beampaths in Triblades without previous exposure to the beam cannot be estimated based on our testing. We chose this test procedure since the Cells were of primary interest in our testing.

TABLE I
HARDWARE BEAM EXPOSURE AND FINAL CONDITION
Corrected beam exposure of hardware in neutrons/cm2 and condition of hardware when tested at LANL and IBM.

Blade | Beam Aim | Corrected Neutron Fluence (neutrons/cm2) | Condition in Post-Beam Testing
Triblade 1 LS21 | Upper Opteron | 1.20E8 | Able to run test applications
Triblade 1 LS21 | Lower Opteron | 4.50E8 | Able to run test applications
Triblade 1 QS22a | Upper Cell | 1.90E9 | Permanent failure – voltage short in voltage regulator
Triblade 1 QS22a | Lower Cell | 1.55E9 | Permanent failure – voltage short in voltage regulator
Triblade 1 QS22b | Upper Cell | 2.32E9 | Boots at LANL, but can't communicate with Opteron; passed epc testing at IBM
Triblade 1 QS22b | Lower Cell | 1.88E9 | Boots at LANL, but can't communicate with Opteron; passed epc testing at IBM
Triblade 2 LS21 | Upper Opteron | 0 | Damaged ground pins; able to run Opteron Linpack at IBM
Triblade 2 LS21 | Lower Opteron | 0 | Damaged ground pins; able to run Opteron Linpack at IBM
Triblade 2 QS22a | Upper Cell | 0 | Able to boot Linux and pass epc testing at IBM
Triblade 2 QS22a | Lower Cell | 1.58E9 | Able to boot Linux and pass epc testing at IBM
Triblade 2 QS22b | Upper Cell | 0 | Damaged 12V connector; permanent failure – voltage short in voltage regulator
Triblade 2 QS22b | Lower Cell | 1.93E9 | Damaged 12V connector; permanent failure – voltage short in voltage regulator
Triblade 3 LS21 | Upper Opteron | 0 | Able to run test applications
Triblade 3 LS21 | Lower Opteron | 0 | Able to run test applications
Triblade 3 QS22a | Upper Cell | 1.48E10 | Able to run test applications
Triblade 3 QS22a | Lower Cell | 3.60E9 | Able to run test applications
Triblade 3 QS22b | Upper Cell | 1.80E10 | Permanent failure – voltage short in voltage regulator
Triblade 3 QS22b | Lower Cell | 4.39E9 | Permanent failure – voltage short in voltage regulator
Triblade 4 LS21 | Upper Opteron | 0 | Able to run test applications
Triblade 4 LS21 | Lower Opteron | 5.75E8 | Able to run test applications
Triblade 4 QS22a | Upper Cell | 5.67E9 | Able to run test applications
Triblade 4 QS22a | Lower Cell | 2.77E10 | Able to run test applications
Triblade 4 QS22b | Upper Cell | 6.90E9 | Permanent failure – voltage short in voltage regulator
Triblade 4 QS22b | Lower Cell | 3.37E10 | Permanent failure – voltage short in voltage regulator

In all, 112 experiments were performed. The first three experiments, the only data collected for Triblade 2, were omitted from the error modeling because they included three Triblades in the beam, whereas for the remaining experiments there was a single Triblade in the beam.

Since the effect of neutrons on Virtex-II devices is well-characterized [14], measurements from the Virtex-IIs were used to calculate the corrected neutron fluence to which the hardware under test was exposed. Specifically, the distance from the blade running applications to the beam source and data from the Virtex-II devices were used to calculate corrected neutron fluence exposures that reflected beam attenuation caused by distance from the beam source and effects of the hardware (BC-H and the Triblade under test) in the beam. These neutron fluence exposures account for all neutrons with energies above 10 MeV.

B. Longevity of Hardware in the Beam, Post-Beam Testing and Root Cause Analysis of Permanently Failed Hardware

Some hardware experienced permanent failures relatively quickly upon exposure to the beam, while other hardware had greater longevity in the beam. For example, QS22b on Triblade 2 was unable to boot after exposure to a corrected neutron fluence of 1.93E9 neutrons/cm2, while the lower Cell on QS22a on Triblade 4 remained operational after exposure to a corrected neutron fluence of 2.77E10 neutrons/cm2.

Triblade 2 suffered damage to one of the 12V connectors on its QS22b and damage to the ground pins on its LS21 as a result of handling, so it could not be tested in a production setting at LANL. The tool epc refers to a tool used at IBM for testing blades that tests the PPU memory, SPU local store, and data transfers between the PPU and SPU. Testing at LANL indicated that the QS22b in Triblade 1 booted, but could not communicate with the relevant Opteron in Triblade 1. However, testing at IBM revealed that the QS22b in Triblade 1 was able to communicate with the relevant Opteron, and it passed the epc test.

Following the beam testing, Triblades 1, 3 and 4 were tested in a production platform at LANL. Triblade 2 could not be installed for this testing because its backplane was damaged due to handling (as opposed to neutron-induced effects), namely as a result of seating it in the BC-H. All applications used for the beam testing plus an Opteron-only version of Linpack not available at the time of the beam testing were used, with the exception that the bottom Opteron in Triblade 4 was not tested with the Opteron-only correlator code used in the beam.
Each of Triblades 1, 3 and 4 had a QS22 that would not boot; in addition, the QS22 in Triblade 1 that would boot could not communicate with the relevant Opteron. Following this post-beam testing, all four Triblades were returned to IBM in Rochester, MN for root cause analysis, with the permanently failed QS22s (one in each of the four Triblades) found to be the result of voltage shorts in voltage regulators. A number of voltage regulators have been experimentally shown to experience single-event burnout and single-event gate rupture, both of which are destructive effects that can cause the system to be fully or partially unpowered. Because of these problems, many space-deployed systems are heavily screened to avoid such issues. Recent research has shown that similar problems can occur terrestrially with both thermal and fast neutrons [15, 16]. Table 1 details the corrected neutron fluence accumulated at each beam aim during the testing and the condition of the Triblades upon testing at LANL and IBM.

C. Silent Data Corruption

Four SDCs were observed. Two SDCs occurred when Cells were running computational codes: one on the upper Cell on the QS22b in Triblade 3 during a VPIC calculation and one on the upper Cell in the QS22a in Triblade 1 during a correlator calculation. Two SDCs occurred on Opterons while the Opteron-only correlator code was running: one on the upper Opteron and one on the lower Opteron of Triblade 1. Checks that determined when an SDC occurred were performed using either a 160-bit SHA-1 hash (for blocks of output data for VPIC and for the Cell correlator code) or a 32-bit CRC (for the output block of data for the Opteron correlator code), so the magnitude of the difference between the calculated result and the correct result for a particular SDC cannot be determined.
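The SDC checks just described can be sketched as follows. Only the check types (a 160-bit SHA-1 hash and a 32-bit CRC) are given in the text; the golden values and output-block contents below are hypothetical.

```python
# Sketch of the SDC checks under stated assumptions: output blocks were
# validated against known-good values with either a 160-bit SHA-1 hash
# (VPIC and the Cell correlator code) or a 32-bit CRC (the Opteron
# correlator code). Only a match/mismatch is recorded, so the numerical
# magnitude of a corruption cannot be recovered.
import hashlib
import zlib

def sha1_matches(output_block: bytes, golden_digest: str) -> bool:
    """Compare the SHA-1 digest of an output block to a known-good digest."""
    return hashlib.sha1(output_block).hexdigest() == golden_digest

def crc32_matches(output_block: bytes, golden_crc: int) -> bool:
    """Compare the CRC-32 of an output block to a known-good value."""
    return zlib.crc32(output_block) == golden_crc
```

A single flipped bit anywhere in the block changes the digest or CRC, so both checks detect corruption with high probability while revealing nothing about how far the corrupted result was from the correct one.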
D. Failure Data

We categorized each experiment as having one of two end states: 1) survival, meaning that the experiment ended while the experimenter believed the application was still running, or 2) failure, indicating that the application was no longer running at the end of the experiment, e.g., because of an application or system crash. Since the output from the test applications appeared roughly every minute, it is possible that in some cases in which the system is deemed to have survived the experiment it had actually failed, but that failure was not detected before the experiment ended. Of the 95 tests conducted with the beam aimed at a Cell that were analyzed for the results presented here, 79 ended in failure, while all 14 tests conducted with the beam aimed at an Opteron ended in failure.

E. Modeling the Data

For our modeling, let Yi denote the corrected exposure (neutron fluence in neutrons/cm2) until an error (SDC or failure, e.g., crash) for each error i. The observations are interval censored, i.e., only an interval (ai, bi) such that Yi ∈ (ai, bi) is observed. This is because the exact time that an error occurs is unknown. Instead, all that is known are the time of the last output line (ai) and the time at which the operator noticed the error (bi) or ended the test. We model the Yi as coming independently from a Cox model [17], but with a shifted hazard rate

h(y | zi, xi) = h0(y + xi) exp(β′zi),

where (i) h0 is the baseline hazard rate, (ii) xi is the cumulative exposure of the component under test at the start time for the i-th error, (iii) zi = (zi,1, …, zi,p) is a vector of covariates corresponding to the i-th error (which application was running, the Triblade under test, the beam aim (hardware in a Cell beampath or hardware in an Opteron beampath), and the beam diameter), and (iv) β is a vector of unknown parameters.
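As a numerical illustration of this shifted-hazard model, the sketch below evaluates the hazard and the implied survival probability S(y) = exp(−∫0^y h(s | z, x) ds). It assumes a constant baseline hazard (consistent with the essentially constant baseline estimated later in the paper); the values of h0 and β are illustrative placeholders, not the paper's posterior estimates.

```python
# Illustrative sketch of the shifted-hazard Cox model
# h(y | z, x) = h0(y + x) * exp(beta' z), with a CONSTANT baseline h0
# (so the shift x has no numerical effect here). h0 and beta below are
# placeholder values, not estimates from the paper.
import math

def hazard(y, z, x, h0, beta):
    """Hazard per unit fluence at exposure y, for covariate vector z."""
    # With a constant baseline, h0(y + x) is simply h0 for any y and x.
    return h0 * math.exp(sum(b * zj for b, zj in zip(beta, z)))

def survival(y, z, x, h0, beta, steps=1000):
    """S(y) = exp(-integral_0^y h(s | z, x) ds), via a midpoint Riemann sum."""
    ds = y / steps
    total = sum(hazard((i + 0.5) * ds, z, x, h0, beta) * ds for i in range(steps))
    return math.exp(-total)

# Example: encode a 12.2x hazard multiplier (the paper's posterior mean
# multiplier for the Opteron beam aim) as a single indicator covariate.
h0 = 1.0e-9                  # placeholder baseline hazard per neutron/cm^2
beta = [math.log(12.2)]      # exp(beta * 1) = 12.2 when z = [1.0]
s_cell = survival(1.0e9, [0.0], 0.0, h0, beta)     # indicator off
s_opteron = survival(1.0e9, [1.0], 0.0, h0, beta)  # indicator on
```

Because the baseline is constant, the integral collapses to h0·exp(β′z)·y, so with these placeholder values s_cell equals exp(−1) and s_opteron equals exp(−12.2): the same fluence is far less survivable under the 12.2x multiplier.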
In this formulation, xi shifts the baseline hazard rate to make it relative to the amount of exposure that the component being tested has accumulated in its lifetime, as opposed to the amount of exposure that it received during an individual experiment. The probabilistic model for Yi is then

Si(y) = P(Yi > y) = exp( −∫0^y h(s | zi, xi) ds ).

With this, if h(s | zi, xi) increases with s, the probability that Yi > s decreases more rapidly than if h(s | zi, xi) remained constant, but if h(s | zi, xi) decreases with s, there is a greater chance that Yi > s. The unknown parameters in the model were estimated using Bayesian methods [18]. Thorough diagnostics to assess the adequacy of this model for the experimental data were conducted and gave no indication of a violation of model assumptions.

F. Study Results

The results presented below pertain to the conditions under which the experiments were conducted (the applications run, the beam aims, the angle of incidence of the beam to the hardware under test, etc.) and the model used for the data, with results likely to be obtained under other conditions less clear. All results have been estimated via Markov chain Monte Carlo [18].

The baseline hazard rate is essentially constant, suggesting that the instantaneous failure rate does not vary much with increasing exposure to the beam for the exposures observed in our testing.

The posterior probability that beam aim (hardware in a Cell beampath or hardware in an Opteron beampath) affects the hazard rate is 1.0. Aiming the beam at the hardware in an Opteron beampath results in an average multiplier to the hazard rate of 12.2 with a 95% credible interval (CI) of (5.8, 24.0), meaning there is about 12 times more risk of an error when aiming the beam at the hardware in an Opteron beampath as opposed to the hardware in a Cell beampath. This result is not the same as saying that the Opteron has a hazard rate that is roughly 12 times the Cell hazard rate, since other hardware along the trajectory of the beam when it is aimed at an Opteron, or hardware affected by scatter, may be responsible for errors that resulted when the beam was aimed at an Opteron. Moreover, the hardware in an Opteron beampath in a given Triblade was tested after the hardware in at least one Cell beampath in that Triblade was tested.

The posterior probability that the hazard rate for Triblade 3 is different from Triblade 1 is 0.046, while the posterior probability that the hazard rate for Triblade 4 is different from Triblade 1 is 0.319. Conditional on Triblade 4 having an effect, the average multiplier to the hazard rate is 0.53, with a 95% CI of (0.32, 0.79). These results suggest that which Triblade was under test does not have a substantial effect on the hazard rate. However, with more Triblades tested and/or more time spent under test, differences among the hazard rates for different Triblades might be found.

The hazard rate was also not affected by whether an application was being run as opposed to an idle condition (the posterior probability was 0.023). Similarly, when effects for each of the different applications relative to the idle condition were included in the model, none of the applications were found to have different hazard rates from the idle condition (all posterior probabilities were less than 0.06). With additional data and/or with other applications, application might be found to have an effect.

For beam diameter, the posterior probability of an effect on the hazard rate is 0.12, indicating very little evidence that beam diameter is an important predictor of the hazard rate. These results suggest that the additional hardware in the beampath when the beam has a two-inch diameter versus a one-inch diameter has little effect on the hazard rate. The beam diameter also affects the intensity of the beam, with a lower number of neutrons/cm2 for the two-inch diameter; this was, however, accounted for in the modeling results.

The posterior mean probability that a given error is an SDC, as opposed to a failure, is 0.051, with a 95% CI of (0.019, 0.097). The posterior mean of the ratio of the probability of an SDC to the probability of a failure is 0.054, with a 95% CI of (0.019, 0.107).

Based on our estimated model, a failure (application or system crash, etc.) FIT rate and an SDC FIT rate for each of the two beam aims (hardware in the Opteron beampath and hardware in the Cell beampath) are provided, along with predictive distributions, which incorporate all estimation uncertainty and inherent randomness, for the actual number of failures in 10^9 hours and the actual number of SDCs in 10^9 hours that might be observed in practice for each of the two beam aims. Figs. 2-5 present the predictive distributions, while their captions include the posterior mean FIT rate, with the uncertainty in the estimation of model parameter values (i.e., the hazard rate for each beam aim) accounted for in the 95% CI for the FIT rate also provided in the figure captions. The FIT rates and corresponding distributions presented below are calculated based on the estimated model for beam aim failure times and an assumption of independence between successive failure times.

These results are for the neutron flux in Los Alamos, NM, which is estimated to be approximately 6.4 times that at sea level [3], and reflect only neutrons with energies above 10 MeV. There is also variability in the neutron flux over time. To account for this, it is assumed, as in a previous study [3], that the ambient neutron flux in Los Alamos, NM is normal with mean 0.025 neutrons/cm2/sec and standard deviation 4.4E-04.

Fig. 2. Predictive Distribution of the Number of Failures in 10^9 Hours for All Hardware in the Cell Beampath. The posterior mean FIT rate is 166.7 with a 95% CI of (83.0, 272.1).

Fig. 3. Predictive Distribution of the Number of SDCs in 10^9 Hours for All Hardware in the Cell Beampath. The posterior mean FIT rate is 8.93 with a 95% CI of (4.45, 14.58).

Fig. 4. Predictive Distribution of the Number of Failures in 10^9 Hours for All Hardware in the Opteron Beampath. The posterior mean FIT rate is 2058 with a 95% CI of (809, 5067).

Fig. 5. Predictive Distribution of the Number of SDCs in 10^9 Hours for All Hardware in the Opteron Beampath. The posterior mean FIT rate is 110 with a 95% CI of (43.4, 271).

The estimated model was also used to estimate error (failures and SDCs) FIT rates for a single Triblade and for the 180 compute Triblades in a single connected unit (CU). (The Roadrunner supercomputer is composed of 17 CUs.) These FIT rates only incorporate hardware in the Cell and the Opteron beampaths, so do not include all hardware in a Triblade. Specifically, independence of the failures occurring at the different beam aims, whether Opteron (top or bottom) or Cell (top or bottom), is assumed. Therefore, the hazard rates for these four distinct aims are added together to obtain an approximate hazard rate for an entire Triblade. Independence among Triblades is assumed in a similar manner to obtain an overall hazard rate for an entire CU. These FIT rates are again for the neutron flux in Los Alamos, NM and incorporate only neutrons with energies above 10 MeV.
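The aggregation just described can be checked with back-of-the-envelope arithmetic: under independence, per-beam-aim FIT rates add. The sketch below uses the posterior mean FIT rates reported in the figure captions (failures plus SDCs for each beam aim); since posterior means do not combine exactly, the result only approximates the reported Triblade and CU values.

```python
# Approximate the Triblade and CU error FIT rates by summing the
# per-beam-aim posterior mean FIT rates (failures + SDCs) reported in
# the figure captions. Under the independence assumption, hazard (and
# hence FIT) rates for the four beam aims in a Triblade simply add.
FIT_CELL_AIM = 166.7 + 8.93       # Cell beampath: failures + SDCs
FIT_OPTERON_AIM = 2058.0 + 110.0  # Opteron beampath: failures + SDCs

# Four beam aims per Triblade: upper/lower Cell and upper/lower Opteron.
fit_triblade = 2 * FIT_CELL_AIM + 2 * FIT_OPTERON_AIM
fit_cu = 180 * fit_triblade  # 180 compute Triblades per connected unit

print(round(fit_triblade))  # close to the reported posterior mean of 4703
print(round(fit_cu))        # close to the reported 8.37E5
```

The sums come out within about 1% of the reported posterior means (4703 per Triblade, 8.37E5 per CU), which is consistent with the additive-hazard construction described above.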
As above, the predictive distribution for the number of errors in 10^9 hours that might be observed in practice is presented along with the posterior mean and 95% CI for the FIT rate (in the figure caption); see Figs. 6-7. The Triblade and Roadrunner CU FIT rate information is extrapolated from the neutron beam testing and does not represent observed hardware failures in the Roadrunner supercomputer.

Fig. 6. Predictive Distribution of the Number of Errors in 10^9 Hours for One Triblade. The posterior mean FIT rate is 4703 with a 95% CI of (1935, 11216).

Fig. 7. Predictive Distribution of the Number of Errors in 10^9 Hours for One Roadrunner CU. The posterior mean FIT rate is 8.37E5 with a 95% CI of (3.51E5, 1.89E6).

IV. ACKNOWLEDGEMENT

Many people contributed to the success of this project. The authors thank Joe Abeyta, Chuck Alexander, Ben Bergen, Ann Borrett, Henry Brandt, James Campa, Randy Cardon, Tom Fairbanks, Parks Fields, Alan Gibson, Gary Grider, Josip Loncaric, Pablo Lujan, Alex Malin, Fred Marshall, Andrew Montoya, John Morrison, Andrew Shewmaker, Manuel Vigil, Bob Villa, Steve Wender, and Cornell Wright, and apologize for any inadvertent omissions from this list.

V. REFERENCES

[1] J. Ziegler and W. Lanford, "The Effect of Sea-Level Cosmic Rays on Electronic Devices," J. Appl. Phys., vol. 52, no. 6, pp. 4305-4312, 1981.
[2] R.C. Baumann, "Radiation-induced soft errors in advanced semiconductor technologies," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, pp. 305-316, Sept. 2005.
[3] S. Michalak, K. Harris, N. Hengartner, B. Takala, and S. Wender, "Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, Sept. 2005.
[4] T. Hong, S. Michalak, T. Graves, J. Ackaret, and S. Rao, "Neutron beam irradiation study of workload dependence of SER in a microprocessor," 2009 SELSE Proceedings.
[5] B. Takala, "The ICE House: neutron testing leads to more-reliable electronics," Los Alamos Science. [Online]. Available: http://library.lanl.gov/cgi-bin/getfile?3012.pdf, Nov. 30, 2006.
[6] K. Koch, "Roadrunner platform overview." [Online]. Available: http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Koch%20%20Roadrunner%20Overview/RR%20Seminar%20%20System%20Overview.pdf, 2008.
[7] "BladeCenter H (8852)." [Online]. Available: http://publib.boulder.ibm.com/infocenter/bladectr/documentation/index.jsp?topic=/com.ibm.bladecenter.8852.doc/bc_8852_product_page.html
[8] H. Meuer, "31st TOP500 List topped by first-ever petaflop/s supercomputer," Scientific Computing. [Online]. Available: http://www.scientificcomputing.com/31st-TOP500-List-Topped-by-First-ever-Petaflops-Supercomputer.aspx
[9] Xilinx, "Virtex-II Platform FPGAs Complete Data Sheet." [Online]. Available: http://www.xilinx.com/support/documentation/data_sheets/ds031.pdf, 2007.
[10] M. Kistler, J. Gunnels, D. Brokenshire, and D. Benton, "Programming the Linpack benchmark for Roadrunner," IBM J. Res. Dev., vol. 53, no. 5, paper 9, 2009.
[11] A. Petitet, R. Whaley, J. Dongarra, and A. Cleary, "HPL – a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers." [Online]. Available: http://www.netlib.org/benchmark/hpl/.
[12] A. DuBois, C. Connor, S. Michalak, G. Taylor, and D. DuBois, "Application of the IBM Cell processor to real-time cross-correlation of a large antenna array radio telescope," Los Alamos Nat. Lab. Tech. Rep. LA-UR-09-03483, 2009.
[13] K. Bowers, B. Albright, et al., "Advances in petascale kinetic plasma simulation with VPIC and Roadrunner," J. Phys.: Conf. Ser., vol. 180 (SciDAC 2009), 2009.
[14] A. Lesea, S. Drimer, et al., "The Rosetta experiment: atmospheric soft error rate testing in differing technology FPGAs," IEEE Trans. Dev. and Mat. Rel., vol. 5, no. 3, pp. 317-328, Sept. 2005.
[15] A. Hands, P. Morris, et al., "Single event effects in power MOSFETs due to atmospheric and thermal neutrons," to be presented at the IEEE Nuclear and Space Radiation Effects Conference, 2011.
[16] P. Shea and Z. Shen, "Numerical and experimental investigation of single event effects in SOI lateral power MOSFETs," to be presented at the IEEE Nuclear and Space Radiation Effects Conference, 2011.
[17] D. Cox, "Regression models and life-tables," J. Roy. Stat. Soc. Series B (Meth.), vol. 34, no. 2, pp. 187-220, 1972.
[18] A. Gelman, J. Carlin, H. Stern, and D. Rubin, Bayesian Data Analysis, London: Chapman and Hall, 1995.