Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth

Jared C. Smolens, Brian T. Gold, Jangwoo Kim, Babak Falsafi, James C. Hoe, and Andreas G. Nowatzyk
Computer Architecture Laboratory, Carnegie Mellon University
http://www.ece.cmu.edu/~truss

ABSTRACT

Recent studies have suggested that the soft-error rate in microprocessor logic will become a reliability concern by 2010. This paper proposes an efficient error detection technique, called fingerprinting, that detects differences in execution across a dual modular redundant (DMR) processor pair. Fingerprinting summarizes a processor's execution history in a hash-based signature; differences between two mirrored processors are exposed by comparing their fingerprints. Fingerprinting tightly bounds detection latency and greatly reduces the interprocessor communication bandwidth required for checking. This paper presents a study that evaluates fingerprinting against a range of current approaches to error detection. The results of this study show that fingerprinting is the only error detection mechanism that simultaneously allows high error coverage, low error detection bandwidth, and high I/O performance.

Categories and Subject Descriptors: C.4 [Performance of Systems]: Reliability, availability, and serviceability; B.8.1 [Performance and Reliability]: Reliability, testing, and fault-tolerance

General Terms: Performance, Design, Reliability

Keywords: Soft errors, error detection, dual modular redundancy (DMR), backward error recovery (BER)

1. INTRODUCTION

Technology trends will raise soft-error rates in microprocessors to levels that require changes to the design and implementation of future computer systems [12, 17, 23]. Detecting soft errors in the processor's core logic presents a more difficult challenge than errors in storage or interconnect devices, which can be handled via error detecting and correcting codes. Today, microprocessor systems requiring reliable, available computation employ dual modular redundancy (DMR) at various levels to enable detection, ranging from replicating pipelines within the same die [25] to mirroring complete processors [15, 21]. In this paper, we offer an investigation of the general design space of error detection in DMR processor configurations.

Our study of DMR error detection is motivated by a larger effort to develop a cost-efficient reliable server architecture in the TRUSS (Total Reliability Using Scalable Servers) project. To exploit the economy of scale, modern servers are increasingly built from commodity modules. In particular, rack-mounted server "blades" are emerging as a cost-effective and scalable approach to building high-performance servers. The TRUSS architecture aims to provide reliability in a commodity blade cluster environment with minimal changes to commodity hardware and without modifications to the application software.
The TRUSS architecture is designed to survive any single component failure (e.g., memory device, processor, or system ASIC) by enforcing distributed redundancy at all levels of processing and storage. For example, under TRUSS, redundant program execution will be carried out by mirrored processors on different nodes interconnected by a system area network. This distributed redundancy approach means DMR error detection must be implemented under the constraints of limited communication bandwidth and non-negligible communication latency between the mirrored processors.

In this paper, we examine four design points for DMR error detection. The designs are differentiated by the monitoring points where the behavior of two mirrored processors is compared. The first three alternatives reflect previously proposed techniques. These three designs correspond to comparing the mirrored processors at (1) the chip-external interface, (2) the L1 cache interface, and (3) the full state of program execution. We introduce a fourth option based on the comparison of fingerprints. A fingerprint is a hash value computed on the sequence of updates to a processor's architectural state during program execution. A simple fingerprint comparison between the mirrored processors effectively verifies the correspondence of all executed instructions covered by the fingerprint. Fingerprinting drastically reduces the required data exchange bandwidth for DMR error detection.

We evaluate the four DMR error detection design points in conjunction with a backward-error recovery (BER) framework that restarts execution from a checkpoint in the presence of an error. The results of our evaluation offer the following key insights. First, as the soft-error rate increases, traditional chip-external error detection requires a minimum checkpoint interval of tens of millions of instructions to maintain a mean time to failure (MTTF) corresponding to the reference failure rate of 114 FIT [4]. Second, DMR error detection at the L1 cache interface or on the full state of two mirrored processors requires an unacceptably high interprocessor bandwidth. Third, the I/O traffic in commercial OLTP systems forces a checkpoint interval that is too short to provide adequate error coverage with chip-external and L1 cache interface error detection. This combination of results argues that fingerprinting is the only viable error detection mechanism that simultaneously allows high error coverage, low error detection bandwidth, and high I/O performance in a DMR processor system.

Paper Outline. Section 2 continues with further details of the DMR error detection design space and evaluation metrics for this study. Section 3 presents the concepts and principles of fingerprinting. Section 4 describes our experimental setup and data analysis. Sections 5-7 present and discuss the results of our study. Section 8 describes related prior work. Finally, Section 9 provides a summary and our conclusions.

2. DESIGN SPACE OVERVIEW

Recovering from soft errors in program execution can be accomplished in two general ways. With forward-error recovery (FER), enough redundancy exists in the system to determine the correct operation, should a processor fail. Triple modular redundancy (TMR) is the classic example of FER: three processors execute the same program, and when one processor fails, a majority vote determines the correct state [24]. For cost-sensitive commercial systems, the added overhead associated with a TMR system is prohibitive [18].
In this paper, we evaluate backward-error recovery (BER) mechanisms that create checkpoints of correct system state and roll back processor execution when an error is detected. Dual modular redundancy (DMR) is a BER technique where two redundant processors are used to detect errors in execution. We use DMR as the base design for evaluating soft-error detection mechanisms and their implications for system reliability and performance.

We assume a DMR processor system where two processors execute redundantly in lockstep. We assume the chosen processor microarchitecture is fully deterministic such that, in an error-free scenario, the two processors behave identically. An error in the execution of one processor manifests as a deviation in the behavior of the two processors.

In this section, we first give an overview of the checkpointing process that we assume. We then introduce the error detection design space for DMR processor systems and finally present the three key criteria we use to evaluate and compare the DMR error detection design points.

2.1 Checkpointing

A checkpoint of program state consists of a snapshot of architectural registers and memory values. Although a checkpoint logically represents a single point in time, our model consists of copying the register file at checkpoint creation and using a copy-on-write mechanism to maintain a change log for caches and memory (similar to that proposed in SafetyNet [27]). Rollback consists of restoring register and memory values from the log.

[Figure 1: The error detection and checkpoint comparison timeline for a single processor, from checkpoint n to checkpoint n+1, showing when an irreversible operation is requested and later released after error detection completes. The redundant processor (not shown) precisely replicates the above timeline.]

The time between checkpoints is referred to as the checkpoint interval. For short intervals (hundreds to tens of thousands of instructions), checkpoints are small enough to fit in on-chip structures [7, 27]. With periodic checkpoints, the storage can be guaranteed to fit on-chip. If longer intervals are required, checkpoints must be stored off-chip.

To recover from errors, the checkpoint and error detection mechanisms must be synchronized. Figure 1 illustrates the sequence of actions for the checkpoint and error detection mechanisms assumed in this paper. Immediately before an operation with irreversible effects (an operation that cannot be re-executed, such as uncached loads and stores), any latent errors should be detected and corrected by recovering to the last checkpoint. If no error is detected, the irreversible operation can then be released and a new checkpoint must be taken. If the detection mechanism does not find all latent errors before discarding the previous checkpoint, the new checkpoint may contain erroneous state and successful recovery will no longer be possible.
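To make the assumed checkpointing model concrete, the sketch below shows a register snapshot plus a copy-on-write change log over memory lines, with commit and rollback. This is our own minimal illustration, not the TRUSS implementation; the class and method names (e.g., CheckpointLog, write_line) are hypothetical.

```python
# Minimal sketch of the assumed checkpointing model: the register file is
# snapshotted at checkpoint creation, and a copy-on-write log preserves the
# old value of each memory line the first time it is written in the interval
# (in the spirit of SafetyNet [27]). Structure and names are illustrative only.
class CheckpointLog:
    def __init__(self, regfile, memory):
        self.regfile, self.memory = regfile, memory
        self.reg_snapshot = dict(regfile)      # copied at checkpoint creation
        self.mem_log = {}                      # line address -> pre-interval value

    def write_line(self, addr, value):
        # Log the old value only on the first write to this line in the interval.
        self.mem_log.setdefault(addr, self.memory.get(addr, 0))
        self.memory[addr] = value

    def rollback(self):
        """Restore the last checkpoint after an error is detected."""
        self.regfile.clear()
        self.regfile.update(self.reg_snapshot)
        for addr, old in self.mem_log.items():
            self.memory[addr] = old
        self.mem_log.clear()

    def commit(self):
        """Error detection passed: discard the log and start a new interval."""
        self.reg_snapshot = dict(self.regfile)
        self.mem_log.clear()

if __name__ == "__main__":
    mem, regs = {0x100: 7}, {"r1": 0}
    ckpt = CheckpointLog(regs, mem)
    ckpt.write_line(0x100, 42)   # interval body: the memory update is logged
    regs["r1"] = 99
    ckpt.rollback()              # e.g., error detection flagged a mismatch
    assert mem[0x100] == 7 and regs["r1"] == 0
```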
2.2 Error Detection

We consider four DMR error detection options differentiated by the monitoring points where the behavior of the two processors is compared (illustrated in Figure 2).

[Figure 2: Error detection mechanisms: (a) chip-external, (b) L1 cache interface, (c) full state, showing the monitoring point relative to the execution engine, register file, and on-chip caches of each mirrored processor. Fingerprinting (not shown) provides detection capabilities equivalent to full state comparison, (c).]

Chip-External Detection. The least intrusive approach is to monitor and compare the two processors' external behavior at the chip pins. For this paper, we conceptualize observable chip-external behavior as all address and data traffic exiting the lowest level of on-chip cache. There is an inherent error detection latency associated with this approach: the effect of an error originating in the execution core may not appear at the output pins for some time due to buffering in registers and the cache hierarchy. The exact error detection latency is a program- and architecture-dependent property. Nevertheless, detecting and containing an error at the pins is still effective in preventing the error from propagating to the rest of the system (e.g., memory, disks, network controller) with irreversible effects.

L1 Cache Interface Comparison. An alternative to chip-external detection is to monitor and compare the address and data traffic entering the interface of the L1 cache. This is similar to the approach taken in SRT [19] and CRT [16]. To support this comparison, we assume our DMR processor system has a dedicated channel connecting the two processors' internal cache interfaces. One of the issues we address in Section 6 is the data exchange bandwidth required by this approach. At the cost of higher DMR data bandwidth, this internal monitoring point significantly reduces the error detection latency because no errors can be buffered in the cache hierarchy. The detection latency remains nonzero, however, because an erroneous value can still be buffered in the register file until a store instruction eventually propagates the value into the cache.

Full-State Comparison at Checkpoint Creation. At the time a checkpoint is created, the complete set of changes to architectural state can be compared between mirrored processors. Naturally, this mechanism uses more pin bandwidth than the passive chip-external approach, but the checkpoint can be guaranteed error-free. The number of unique cache lines or memory pages changed since the previous checkpoint determines the amount of work necessary for full-state comparison. As the checkpoint interval increases, we expect spatial locality to amortize the comparison cost over the interval. However, if available bandwidth is insufficient, full-state comparison may require the processors to stall until the error detection completes.

Fingerprint Comparison. A fingerprint is a hash value (computed using a linear block code such as CRC-16) that summarizes several instructions' state updates into a single value. The mirrored processor pairs can corroborate the effects of multiple instructions through a single fingerprint comparison. When used with checkpointing, all instructions between two checkpoints will be summarized by a single 16-bit fingerprint (assuming CRC-16). Starting at the earlier checkpoint, a fingerprint is accumulated for all instruction updates until the next checkpoint is created. If the fingerprints computed by the mirrored processors agree at this point, all of the fingerprinted instructions since the last checkpoint are known to be correct. If the fingerprints of the mirrored processors disagree, the computation must be restarted from the earlier checkpoint. Fingerprinting offers two important advantages.
First, fingerprinting addresses the prohibitive bandwidth requirement of comparing all architectural state updates before they are committed. Second, fingerprinting nevertheless provides a summary of all state changes, including any possible soft errors.

2.3 Evaluation Metrics

In this paper, we evaluate and compare the different DMR error detection mechanisms under three key criteria, namely error coverage, DMR error detection bandwidth, and I/O performance.

Error Coverage. The error coverage of a particular fault-tolerant system is the fraction of errors that can be successfully detected and corrected. In a rollback-recovery system, a checkpoint taken after a soft error has occurred, but before detection, leads to an unrecoverable failure. Hence, error coverage is determined by the probability that a checkpoint contains an undetected error. As the checkpoint interval increases, an error has a greater chance of being detected before the next checkpoint, since the average instruction is now further from the checkpoint time. Conversely, with smaller checkpoint intervals there is less time to detect an error before the next checkpoint is taken. If all architectural state updates are compared before a checkpoint is taken, the error coverage is independent of the checkpoint interval because errors in the current interval will be exposed by the detection mechanism. The error detection latencies of chip-external and L1 cache interface comparisons, however, place a restrictive lower bound on the acceptable checkpoint interval. In this paper, we show that the lower bound on the checkpoint interval with chip-external detection is 10 to 25 million instructions; for L1 cache interface detection the minimum interval is 10 to 100 thousand instructions.

Error Detection Bandwidth. The detection bandwidth required to compare the behavior between a mirrored processor pair is critical to the overall system's feasibility and implementation cost. While the chip-external comparison approach places no additional data on the chip's pins, the other three techniques we consider create additional traffic exiting the chip. Our bandwidth requirement study in Section 6 will show that L1 cache interface detection and full-state comparison both require bandwidth exceeding the capability of current packaging technologies.

I/O Performance. Conventional rollback-recovery mechanisms log all input from external devices and delay all output until it is guaranteed that the system will not be rolled back across a committed I/O operation [6, 29]. In software-assisted checkpointing of I/O-intensive applications, buffering can hide the performance impact of delaying I/O to the end of a coarse-grained checkpoint (millions of instructions). However, a hardware solution must create checkpoints immediately after an I/O operation to prevent rollback that might reissue past I/O operations. When saving architectural state immediately after accessing a device, we require an error detection mechanism with high detection coverage to ensure that rollback is not necessary. In Section 7, we show that I/O intervals in commercial OLTP systems are often in the range of 100s to 1000s of instructions and fine-grained checkpointing is necessary.

Table 1: Evaluating rollback-recovery mechanisms.

Checkpoint Interval   Detection Mechanism                  Error Coverage   B/W   I/O
Large                 Chip-External                        yes              yes   -
Large                 L1 Cache Interface                   yes              -     -
Large                 Full Comparison at Checkpoint        yes              yes   -
Small                 Chip-External                        -                yes   yes
Small                 L1 Cache Interface                   -                -     yes
Small                 Full Comparison at Checkpoint        yes              -     yes
Small                 Full Comparison w/ Fingerprinting    yes              yes   yes
From the above discussion, there are conflicting demands on the checkpoint interval from the I/O performance and error coverage requirements. Table 1 summarizes the opposing factors when trying to balance I/O performance, error coverage, and comparison bandwidth in choosing a DMR detection mechanism. Only fingerprinting, presented in detail in the next section, can simultaneously satisfy all requirements of high error coverage, low detection bandwidth, and high I/O throughput.

3. FINGERPRINTING

In this section, we first discuss the properties of a fingerprint in detail. We then explore the options available in an implementation of fingerprinting.

3.1 Fingerprinting Overview

Fingerprints provide a concise view of past and present program state in order to detect differences in execution across two redundant processor dies, without the overhead normally associated with full state comparison. A fingerprint must provide high coverage of errors for all instructions in the checkpoint interval, require little comparison bandwidth, and work with a variety of checkpoint intervals.

The fingerprint contains a summary of any new register values created by each executing instruction, new memory values (for stores), and effective addresses (for both loads and stores). This set of updates is both necessary and sufficient to capture errors in architectural state; errors in other architectural and microarchitectural structures will quickly propagate to those being fingerprinted. Examples of structures outside the scope of fingerprinting include decoded instruction bits, condition code values, the current program counter, and internal microarchitectural state.

Fingerprint comparisons are forced at the end of every checkpoint interval. A matching fingerprint indicates that the computation in both processors, from the beginning of the checkpoint to the current instruction, is identical. At this point, the old checkpoint will be replaced with a new one and the process starts again.

The hash used to compute fingerprint values should be a well-constructed, linear block code. There are two key requirements for the code. First, the code must have a low probability of undetected errors. Second, the code should be small for both easy computation and low bandwidth comparison. Hamming codes are a class of well-known codes with a well-understood bound on the probability of an undetected error and a low storage overhead. For a p-bit Hamming code, the probability of an undetected error is at most 2^-p [35]. This bound is independent of the number of updates folded into the code; equivalently, the probability of detecting an error in an arbitrarily long sequence of uniformly random bits is at least 1 - 2^-p. Given this statistical bound, we choose a 16-bit cyclic redundancy check (CRC) for our evaluation that has a 0.999985 probability of detecting an error. In Section 5, we show that this CRC is sufficient to achieve acceptable system reliability when using fingerprinting as an error detection mechanism over the useful range of checkpoint intervals.
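As an illustration of the accumulation just described, the sketch below folds each committed instruction's state updates (new register value, store value, and load/store effective address) into a 16-bit CRC and compares the result across a mirrored pair at a checkpoint. This is our own minimal software model, not the hardware design; the CRC-16-CCITT polynomial 0x1021, the 64-bit update width, and the bit-serial (non-augmented) form are assumptions made for clarity.

```python
CRC16_POLY = 0x1021  # assumed CRC-16-CCITT generator polynomial

def crc16_update(crc, word, width=64):
    """Fold one update word (a register result, store value, or effective
    address) into a 16-bit CRC, bit-serially in the non-augmented form."""
    for bit in range(width - 1, -1, -1):
        msb = (crc >> 15) & 1
        crc = ((crc << 1) | ((word >> bit) & 1)) & 0xFFFF
        if msb:
            crc ^= CRC16_POLY
    return crc

def fingerprint(updates, crc=0):
    """updates: iterable of integers summarizing architectural state changes
    in commit order since the last checkpoint."""
    for u in updates:
        crc = crc16_update(crc, u)
    return crc

if __name__ == "__main__":
    # Toy example: identical update streams match; a single corrupted value,
    # e.g. a flipped bit in one "register result", yields a mismatch and
    # would trigger rollback to the previous checkpoint.
    good = [0x1234, 0xDEADBEEF, 0x7FFF0000]
    bad = [0x1234, 0xDEADBEEF, 0x7FFF0001]
    assert fingerprint(good) == fingerprint(list(good))
    assert fingerprint(good) != fingerprint(bad)
```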
3.2 Fingerprint Implementation

Hardware implementations of hash mechanisms are well known and readily found in the literature [24]. In this section, we study the implications of fingerprinting on the processor pipeline.

Microarchitecture Alternatives. Two major alternatives exist when implementing fingerprinting in a speculative, superscalar, out-of-order processor pipeline, as illustrated in Figure 3. First, we can capture all committed state by taking the values from instructions retiring from the reorder buffer (ROB) and load-store queue (LSQ). Conventionally, the ROB does not contain instruction results. Rather than trying to read the instruction results from the register file, we believe it would be more cost-effective to add the instruction results to the ROB. Alternatively, if the cost of adding instruction results to the ROB is too high, another approach would be to fingerprint microarchitectural state by hashing instruction results as they complete. Figure 3(b) shows this design. In this case, the fingerprint will contain a hash of committed state plus additional, potentially speculative state.

[Figure 3: Microarchitectural implementation options: (a) committed state, with the fingerprint (Fpnt) unit fed at retirement by a ROB augmented with result values and by the LD/ST queue; (b) committed plus speculative state, with the fingerprint unit hashing results as they leave the execution units.]

Precise Replication. Implicit in our description of fingerprinting is that redundant processors must now have precisely replicated behavior. For example, suppose a redundant pair of nodes are in a spin-lock, waiting for another processor pair to release ownership of a shared variable. The two nodes must receive ownership at exactly the same (logical) time, such that they terminate waiting after the same number of loop iterations. Most systems operate the processors in lockstep with a (small) fixed delay between them [15, 21]. Additionally, we require that all microarchitectural structures be deterministic so that, given identical input streams, replicated processors execute the same instruction sequence. This requirement exists for chip-external detection as well; otherwise, correlating replicated processor results would be impossible.

4. ANALYSIS TECHNIQUES

In this section, we present the dynamic dependency analysis algorithms used to measure the error detection latency in a processor with caches for detection mechanisms at the chip-external and L1 cache interfaces. Then, we present a method for estimating the bandwidth required to compare full architectural state across redundant processors. Finally, we summarize our simulation environment.

4.1 Dynamic Dependency Analysis

Soft errors propagate from an initial, affected instruction to a detection point through program dataflow. We measure error coverage by finding the minimum error detection latency (counted in instructions) from execution to detection for each instruction. The minimum detection latency of a particular instruction is obtained by generating dynamic dependency graphs (DDGs) and finding the distance to the first detection opportunity for the data value produced by that instruction. A DDG is a directed, acyclic graph defining a partial ordering of instructions as related through register and cache dependencies [3].

For chip-external detection, we consider three classes of detection opportunities: load/store addresses, register-indirect jumps, and cache writebacks. If an address used for a load or store is in error, we assume that it will cause a cache miss and be detected outside the chip (i.e., a direct comparison with a fault-free processor would not show this address request). An erroneous register value used for a register-indirect jump will likely cause an instruction cache miss, a TLB miss, or an exception. Any of these will result in near-immediate detection outside the chip.
Finally, cache writeback values and addresses are explicitly compared in the external detection schemes. For L1 cache interface detection, we consider three classes of detection opportunities: load/store addresses, register-indirect jumps, and store values. Each of these classes requires communication between the processor and L1 cache, which will be directly compared using this detection method.

DDG Construction. The DDGs are constructed by annotating in-order program execution traces with register-register dataflow, load/store addresses, cache miss, and writeback information to track the lifetime of values in the register file and cache. A post-simulation analysis tool then traverses the traces and builds graphs from data dependencies. The dependencies flow through the registers and cache to capture the paths that an erroneous instruction result may take, including being stored, read from the cache, and reused in another instruction.

The analysis tool traverses backwards from the end of the trace. When a detection opportunity occurs (e.g., a cache writeback or load/store), a new DDG is constructed to find all instructions leading to the present detection. Each new instruction is added to the most recently constructed DDG containing one or more dependent instructions. Each instruction belongs to one and only one DDG, indicating the earliest point of detection. Figure 4 shows a simple example of the trace analysis graph construction. The trace contains instructions, a timestamp (instruction count), and cache access information. In this simple example, two graphs are shown for the two detection points (instructions 2 and 6). As an instruction i is added to its graph, the instruction distance between the detection opportunity and i is written to a log.

Analysis Limitations. This error analysis makes several simplifying assumptions. First, we treat all transient faults as equally likely to result in an architecturally-visible error. As discussed by Mukherjee, et al. [17], this assumption ignores the fact that certain structures are more likely to result in architectural errors. An instruction queue, for example, may cause many more architecturally-visible errors than the functional units performing computation. Second, by not injecting actual faults in the system, we ignore the effects of logical masking on error detection latency. However, logical masking only extends the error detection latency if an erroneous instruction result is masked in one chain of instructions, but used in another, longer chain. Third, we assume all errors in address computation miss in the on-chip cache and that the effects of control flow errors are detected immediately. Because the size of the on-chip cache is minuscule when compared to the entire address space, we assume errors in address computation cause a miss in the cache that is immediately detected. Similarly, errors in control flow will most likely cause an immediate instruction cache miss that is visible outside the chip. Finally, the instruction traces used in the analysis are of a finite length, which artificially shortens cache block dead times (time between the last cache line modification and writeback). All of these limitations make our analysis conservatively favor existing detection mechanisms; we estimate a shorter detection latency than may actually exist.
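The following is a minimal sketch (our own simplification, not the authors' analysis tool) of the backward pass over a trace that assigns each value-producing instruction its minimum distance to a detection opportunity. It models register dataflow only; the real analysis also tracks values through the cache, and the trace encoding used here is hypothetical.

```python
import math

def min_detection_distances(trace):
    """trace: list of (dest, srcs, is_detection) tuples in program order,
    where dest is the register written (or None), srcs are registers read,
    and is_detection marks a detection opportunity (e.g., a load/store whose
    address is exposed at the monitored interface).
    Returns {trace_index: minimum distance (in instructions) to detection}."""
    reach = {}       # reg -> nearest downstream detection index for its live value
    distances = {}
    for i in range(len(trace) - 1, -1, -1):
        dest, srcs, is_detection = trace[i]
        my_reach = math.inf
        if dest is not None:
            # This definition kills older values of dest; its own reach is
            # whatever downstream detection its consumers feed into.
            my_reach = reach.pop(dest, math.inf)
            if my_reach != math.inf:
                distances[i] = my_reach - i
        if is_detection:
            my_reach = i   # source operands (e.g., the address) are checked here
        for r in srcs:
            # An error in a source operand propagates through this instruction.
            reach[r] = min(reach.get(r, math.inf), my_reach)
    return distances

if __name__ == "__main__":
    # Toy trace: r1 is produced, copied to r2, and r2's value finally forms a
    # store address (a detection opportunity) two instructions later.
    trace = [("r1", [], False), ("r2", ["r1"], False), (None, ["r2"], True)]
    print(min_detection_distances(trace))   # {1: 1, 0: 2}
```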
[Figure 4: An instruction trace and the associated Dynamic Dependency Graphs (DDGs), with detection points at instructions 2 and 6.]

4.2 Comparison Bandwidth

To measure the comparison bandwidth needed by an error detection mechanism, we use the same traces collected for the dynamic dependency analysis. For chip-external detection, the bandwidth required includes all L2 load/store miss addresses and the addresses and data for writebacks. When detecting errors at the L1 cache interface, the comparison bandwidth includes all load/store addresses and store data. Both the chip-external and L1 cache interface detection bandwidths are independent of the checkpoint interval.

Comparing the full processor state at the checkpoint requires bandwidth proportional to the amount of memory changed during the checkpoint interval and inversely proportional to the checkpoint interval itself. For example, if 1MB of memory is changed during a checkpoint interval of 10 milliseconds, then we must be able to compare at a rate of 100MB/s to avoid any performance loss, assuming comparisons can be overlapped with execution. We evaluate the bandwidth requirements for this detection class by counting the unique cache lines written during a checkpoint interval.

The bandwidth required for fingerprint comparison is dependent only on the checkpoint interval and the size of the fingerprint register, which we assume is 16 bits. Because fingerprints are compared before every checkpoint, the bandwidth required is simply 2 bytes per checkpoint interval.

4.3 Methodology

We simulate 26 SPEC2K applications using SimpleScalar [5] and two commercial workloads using Virtutech Simics [14]. In SimpleScalar, we use a functional cache simulator for the Alpha ISA and simulate CPU-intensive workloads using the host platform for system calls. Simics is a full system functional simulator that allows functional simulation of unmodified commercial applications and operating systems on the SPARC v9 architecture.

In SimpleScalar, we simulate the first input set for all 26 SPEC2K benchmarks. For each benchmark, we simulate up to eight pre-determined 100-million instruction regions from the benchmark's complete execution trace, using the prescribed procedure from SimPoint [22]. In Simics, we boot an unmodified Solaris 8 operating system to run two commercial workloads: a 40-warehouse TPC-C like workload with IBM's DB2, and SPECWeb. The TPC-C like workload [30] consists of a 40-warehouse database striped across five raw disks and one dedicated log disk, with 100 clients. The SPECWeb workload [28] services 100 connections with Apache 2.0. Both commercial workloads are warmed until the CPU utilization reaches 100% and, in the case of the DB workload, until the transaction rate reaches steady state. Once warmed, the commercial workloads execute for 500 million instructions.

In both models, the processor is assumed to run at 1GHz, with a constant IPC of 1.0. In this study, the relevant architectural parameter is the level-two cache. We simulate an inclusive 1MB 4-way set-associative cache with 64-byte lines.

5. ERROR COVERAGE

In this section, we present an evaluation of the soft-error coverage for four error detection mechanisms. We then use a simple DMR reliability model to relate the coverage to a mean time to failure (MTTF). As discussed in Section 2.3, rollback recovery fails when the error detection latency extends past the checkpoint interval.
In this section, we use the dynamic dependency analysis from Section 4.1 to compute error detection latencies for both chip-external and L1 cache interface detection points. Then, we present a model of the MTTF calculated from these detection latencies and, finally, we discuss the tradeoff between checkpoint interval length and error coverage as it applies to the four detection mechanisms.

5.1 Detection Latency

From the output of the dynamic dependency analysis, we are interested in knowing the likelihood that an error has been detected by the end of a checkpoint interval. The dynamic dependency analysis generates a listing of minimum distances to detection for each instruction in the program trace. We construct a histogram according to the instruction's minimum distance to detection. We normalize by the total instruction count to find the probability distribution of detection distances for the overall program. This distribution tells us the probability of instructions having a given distance to detection. When working with checkpoint intervals, it is more convenient to work with the corresponding cumulative distribution function (CDF). The CDF for a given detection distance gives the probability that a random instruction has been detected by a given distance (instead of at a fixed distance). We calculate the CDF for each distance by summing the probability distribution from zero to the detection distance.

In Figure 5, we show the detection latency CDFs for both the chip-external and L1 cache interface detection methods. We compute the aggregate detection distances for three classes of applications: SPEC integer, SPEC floating-point, and the commercial workloads. The full state and fingerprinting detection methods are not shown here because they have immediate detection at the end of the checkpoint interval. In the L1 cache interface CDF curves, the probability of detecting all latent errors in the register file is almost unity, but a small fraction of instructions are left undetected for millions of cycles. Detection latencies longer than the simulation period cannot be measured, so all CDFs reach 1.0 by the end of the period, even though some errors may actually hide within the cache hierarchy for longer periods of time. While most general purpose registers are constantly overwritten, a few values such as return address pointers may be left untouched for long periods of time before accesses that eventually lead to detection.

[Figure 5: CDF of error detection latency versus detection distance (in instructions) for the chip-external and L1 front-side (cache) interface detection methods, computed over the average of (a) SPEC integer, (b) SPEC floating-point, and (c) commercial workloads.]
For the chip-external detection case, a significant fraction of values hide within the cache hierarchy for millions of cycles before being written back to memory. Both the SPEC integer benchmarks and the commercial applications show a clear tendency towards keeping some values buffered in the cache for extended periods of time. Floating-point benchmarks, however, regularly displace the L2 cache contents due to streaming data access patterns. The cache hierarchy holds roughly 131,000 double words, and the probability of detecting a fault increases drastically within a few multiples of that many instructions (not every instruction is a load or store).

5.2 Mean Time to Failure Model

Next, we calculate the MTTF of a system where undetected errors may exist at the time a checkpoint is taken. We define a system's error coverage, C, as the probability of detecting all possible errors within a checkpoint interval. For a specific instruction, the CDF gives the probability of detecting an error before the next checkpoint. In our model, each instruction is equally likely to experience an error. Therefore, the coverage is defined as the mean probability that an error could be detected in each instruction before the next checkpoint. For a checkpoint interval of length t, we sum over all instructions i from the checkpoint to obtain:

    Coverage = (1/t) · Σ_{i=1}^{t} CDF(i)

Given the coverage and fault rate, the system MTTF with dual modular redundancy is [31]:

    MTTF = 1 / (2λ(1 − C)),

where λ is the raw transient fault rate and C is the error coverage. The transient fault rate, determined by process and circuit characteristics, is considered constant and independent of this work. A raw fault rate of 10^4 FIT is a predicted fault rate in logic circuits for high-performance processors early in the next decade [17, 23].

Using the above formula for MTTF, we obtain the failure data shown in Figure 6. For fingerprinting, which has an instantaneous detection latency and detection independent of the checkpoint interval, we define coverage as the probability of detecting an error from Section 3.1, C = 1 − 2^-p, where we use p = 16 for the fingerprint code (i.e., CRC-16).

[Figure 6: MTTF (in billions of hours) versus checkpoint interval (in instructions) for fingerprinting with CRC-16, L1 front-side interface, and chip-external detection, averaged over (a) SPEC integer, (b) SPEC floating-point, and (c) commercial workloads. A 1000-year MTTF reference level is marked on each plot.]

In all of the benchmarks, we see that chip-external detection requires checkpoint intervals of at least 10 to 25 million instructions to achieve the required reliability. This indicates that in order to achieve acceptable error coverage, checkpoint intervals for chip-external detection must be long: on the order of milliseconds. In Section 7, we will show that the high frequency of I/O operations in OLTP workloads makes this size of checkpoint interval infeasible without sophisticated I/O delay mechanisms.

By comparing at the L1 cache interface, the checkpoint interval necessary to achieve an acceptable reliability is reduced to 10 to 100 thousand instructions. This interval is sufficient to support OLTP I/O requirements; however, we show in Section 6 that inordinate bandwidth is required to support detection across chips. In our model, full state comparison covers all errors, leading to an infinite MTTF. Finally, fingerprinting with a 16-bit CRC, whose coverage is independent of the checkpoint interval, has an MTTF that is superior to the chip-external and L1 cache interface detection methods. The checkpoint interval can be reduced to support a high I/O rate, while imposing a minimal bandwidth requirement.
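To tie the pieces of this model together, here is a minimal sketch (our own illustration, not the authors' tooling) that computes coverage from a detection-distance histogram and the corresponding DMR MTTF. The histogram values and helper names are hypothetical; the 10^4 FIT raw fault rate follows the assumption above.

```python
FIT_HOURS = 1e9   # one FIT = one failure per 10^9 device-hours

def coverage(detection_distance_hist, interval):
    """C = (1/t) * sum_{i=1..t} CDF(i), where the CDF is built from the
    per-instruction minimum-detection-distance histogram."""
    total = sum(detection_distance_hist.values())
    cdf_sum, running = 0.0, 0.0
    for d in range(interval + 1):
        running += detection_distance_hist.get(d, 0) / total   # running CDF(d)
        if d >= 1:
            cdf_sum += running
    return cdf_sum / interval

def dmr_mttf_hours(raw_fit, C):
    """MTTF = 1 / (2 * lambda * (1 - C)) for a DMR pair [31]."""
    lam = raw_fit / FIT_HOURS            # failures per hour
    return float("inf") if C >= 1.0 else 1.0 / (2 * lam * (1 - C))

if __name__ == "__main__":
    # Hypothetical histogram: most errors detected within 1K instructions,
    # with a small tail detected only after ~1M instructions.
    hist = {100: 900_000, 1_000: 90_000, 1_000_000: 10_000}
    C = coverage(hist, interval=100_000)
    print(dmr_mttf_hours(raw_fit=1e4, C=C))            # latency-limited detection
    print(dmr_mttf_hours(raw_fit=1e4, C=1 - 2**-16))   # fingerprinting with CRC-16
```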
6. COMPARISON BANDWIDTH

In this section, we evaluate error detection mechanisms based on their requirements for state comparison bandwidth between mirrored processors.

Chip-external detection is used in existing systems by placing the mirrored processors in close proximity and comparing data values as they exit the chips [15, 21]. This approach to detection requires no extra pin bandwidth over that already required to run applications, since it passively monitors traffic going to memory or the rest of the system. In Table 2, we report the average bandwidth generated by the three application classes (calculated as the sum of address and data traffic required to complete the memory requests). The passive nature of this approach makes it an attractive option for detecting errors in redundant processors across physical chips. However, the error coverage with this technique only makes it viable at large checkpoint intervals.

Table 2: Bandwidth requirements for the passive detection mechanisms, in bytes/instruction.

             Chip-external   L1 cache interface
SPEC Int     0.0038          6.13
SPEC FP      0.0456          5.39
Commercial   0.1210          5.92

Alternatively, error detection can be implemented at the L1 cache interface. As with chip-external detection, this method involves passive comparison of addresses and data, but now the comparison includes all loads and stores. As expected, the average comparison bandwidth reported in Table 2 is orders of magnitude higher than with chip-external detection. The bandwidth reported here is well above the external pin bandwidth sustainable by a modern processor; therefore, this technique is only applicable when the redundant cores are located on the same die.

Full-state comparison at checkpoint creation compares the data changed since the last checkpoint and is therefore dependent on the checkpoint interval. We show the average comparison bandwidth required for a range of checkpoint intervals in Figure 7(a). Notice that the bandwidth for this form of comparison is greater than or comparable to the chip-external bandwidth already demanded by the applications. The average bandwidth requirement decreases as the checkpoint interval increases. Intuitively, this trend is expected because programs have spatial locality within a limited working set. As the checkpoint interval grows, the working set size generally grows at a slower rate. Because of the bandwidth overhead, full-state comparison is only viable for very large checkpoint intervals.

[Figure 7: (a) Bandwidth requirements (bytes/instruction) for full-state error detection over a range of checkpoint intervals, with the fingerprinting requirement shown for comparison; (b) bytes compared per checkpoint interval. Results are shown for SPEC Int, SPEC FP, and commercial workloads.]

Another view of the full-state comparison is to consider the amount of data that must be transferred in each checkpoint interval. We show this graphically in Figure 7(b). If this bandwidth cannot be sustained during the checkpoint interval, execution must stall until the comparison has completed. The following equation estimates the checkpoint bandwidth for a checkpoint interval, given the checkpoint size, the estimated instructions per cycle, and the processor clock frequency:

    Bandwidth = (Checkpoint Size · IPC · Frequency) / Checkpoint Interval

The bandwidth required for fingerprinting is the size of the fingerprint (two bytes) over the checkpoint interval. We also plot the fingerprinting bandwidth requirement as a function of the checkpoint interval in Figure 7(a). For checkpoint intervals larger than 1000 instructions, the bandwidth required for fingerprinting is at least an order of magnitude less than the chip-external comparison bandwidth.
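A small sketch of the two bandwidth estimates used above follows; it is our own illustration, and the parameter values are placeholders rather than measurements from this study.

```python
def full_state_bandwidth(checkpoint_bytes, ipc, freq_hz, interval_instr):
    """Bandwidth = (Checkpoint Size * IPC * Frequency) / Checkpoint Interval:
    the unique modified bytes must be compared once per interval."""
    return checkpoint_bytes * ipc * freq_hz / interval_instr

def fingerprint_bandwidth(ipc, freq_hz, interval_instr, fp_bytes=2):
    """Only a 2-byte fingerprint is exchanged once per checkpoint interval."""
    return fp_bytes * ipc * freq_hz / interval_instr

if __name__ == "__main__":
    # Placeholder parameters: a 1 GHz, IPC = 1 processor, a 32K-instruction
    # interval, and 80 dirty 64-byte cache lines per interval.
    print(full_state_bandwidth(checkpoint_bytes=80 * 64, ipc=1.0,
                               freq_hz=1e9, interval_instr=32 * 1024))
    print(fingerprint_bandwidth(ipc=1.0, freq_hz=1e9, interval_instr=32 * 1024))
```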
Finally, as a reference point, on a current state-of-the-art system with a 2.8GHz Pentium 4 Xeon, the theoretical bandwidth from the 533MHz bus is 4.2GB/s. A memory streaming test intended to maximize utilized external bus bandwidth achieves a maximum throughput of 2.5GB/s. Of this usable bandwidth, we measure a TPC-C like database on the system to generate a load of 980MB/s on the bus, which corresponds to the chip-external bandwidth. Assuming the system executes one instruction per cycle and the checkpoint interval is 32K instructions, the estimated bandwidth required for full-state detection is 440MB/s, or 50% of the existing bandwidth from the TPC-C like database application. In comparison, fingerprinting requires 64KB/s under the same assumptions.

7. I/O PERFORMANCE

The preceding results for error coverage and bandwidth requirements would suggest that, without fingerprinting, a larger checkpoint interval is necessary. In I/O-intensive workloads, however, performance suffers with the conventional approach of logging input and delaying output to avoid rollback across the I/O [6, 29]. Sophisticated software solutions can be developed, which delay writes to disk or a network interface and log the results of reads from external devices. If enough concurrency is present in the application, and buffering space is not a concern, a software approach may be possible. However, operating systems and commercial applications are too complex to support major changes to the I/O subsystem.

A hardware solution to avoid rollback across I/O would create a checkpoint immediately following every read or write at the device level (uncached loads and stores). Using our full-system TPC-C like simulation (see Section 4.3), we collect traces of reads and writes to the SCSI controller during 100 billion instructions (50,000 DB transactions). Figure 8 shows the cumulative interarrival distribution of device reads and writes. We observe a clustering of accesses in several places. There is a fine-grained clustering corresponding to the reads and writes required to initiate every disk access (up to 5000 instructions). At 50,000 instructions, we observe a clustering due to the time between physical I/Os sent to the SCSI controller. This point is corroborated by Amdahl's I/O rule of thumb [11] and Gray's updated version [9]: for random I/O access, approximately 50,000 instructions separate physical I/O operations.

[Figure 8: CDF of I/O command interarrival times (in instructions) for SCSI controller accesses while running a TPC-C like workload.]

Based on these measurements, we conclude that checkpoints must be created at least as frequently as physical I/O operations (every 50,000 instructions). Fine-grained checkpoints require a high-coverage, low-bandwidth detection method to guarantee fault-free operation with checkpoint intervals of thousands of cycles. Fingerprinting is precisely this mechanism, which, when combined with low-cost checkpoint creation, can be used to construct rollback-recovery systems with excellent performance in I/O-intensive workloads.

8. RELATED WORK

Fault Detection. Sogomonyan, et al. [26] propose a mechanism similar in spirit to fingerprinting.
Their technique relies on a modified flip-flop design that permits a scan chain to operate without interfering with normal computation. A sequence of single-bit outputs of the scan chain constitutes a signature of the internal state. The signature can be compared to detect differences in the operation of two mirrored system-on-a-chip (SoC) designs. This proposal differs from the fingerprinting approach in that all flip-flops on the chip must be changed to scannable Multi-Mode Elements (MMEs), and the state is continuously monitored through a 1-bit signature stream. Unlike fingerprinting, the signature stream cannot be tied to a specific instruction; rather, errors may take some time to propagate out of the scan chain. This error detection latency may lead to low error coverage and unacceptably poor system MTTF. Additionally, fingerprinting consumes less bandwidth by sending a single 16-bit word for error detection rather than a continuous bit stream.

Other proposed techniques instrument binaries with self-checking mechanisms or use compiler-inserted instructions to check for errors in control flow [34]. Unlike fingerprinting, these techniques cannot detect a large class of errors in dataflow and require difficult analysis to determine error coverage.

Existing commercial fault-tolerant systems use replication in conjunction with lockstepped execution. Two identical copies of a program run on the redundant hardware. The IBM G5 uses replicated, lockstepped pipelines in a single core [25]. The Tandem Himalaya [15] and Stratus [21] use replicated, lockstepped processors and compare execution using the chip-external detection mechanism.

There are several proposals for simultaneous multithreaded (SMT) processors and chip multiprocessors (CMP) with redundant execution on the same die. The DIVA architecture [2] uses a simple in-order checker to detect soft and permanent errors in a closely-coupled out-of-order core using full state comparison. Rotenberg proposed using two staggered, redundant threads in an SMT processor [20] to detect soft errors with full state comparison at commit. Vijaykumar, et al. suggested full-state comparison for detection and recovery in SMT [32] and CMP [8] architectures. Alternatively, Reinhardt and Mukherjee proposed the SRT processor [19] and later the CMP-based CRT processor [16], which compares only stores across threads (equivalent to the L1 cache interface detection mechanism).

Architectural Checkpointing. A prerequisite of all backward error recovery schemes is the checkpoint mechanism. Microarchitectural techniques work for short checkpoint intervals (thousands of instructions). The Checkpoint Processing and Recovery proposal [1] scales the out-of-order execution window by building a large, hierarchical store buffer and aggressively reclaiming physical registers. The SC++lite [7] mechanism speculatively allows values into the processor's local memory hierarchy, while maintaining a history of the previous values. On a larger scale, global checkpointing schemes take a consistent checkpoint of the architectural state across a multiprocessor, but at intervals of hundreds of thousands to millions of instructions. SafetyNet [27] provides a way to build a global checkpoint in a multiprocessor system. Processor caches are augmented with checkpoint buffers that hold old cache line values on the first write to a cache block in each interval.
ReVive [18] takes global checkpoints in main memory by flushing all caches and enforcing a copy-on-write policy for changed cache lines after the checkpoint time.

Fault Analysis. Our technique for measuring error detection latency differs substantially from the traditional fault-injection approach, which inserts single errors at the gate, register, or pin level and observes the corresponding detection latency [10, 13, 33]. The injection approach suffers from a number of limitations. Modern microprocessors are sufficiently complex that fault injection cannot cover a reasonable subset of potential transient faults in an acceptable time. Other fault injection techniques include bombarding an actual chip in heavy ion testing [36].

Austin [3] used a dynamic dependency analysis very similar to ours to analyze the dataflow parallelism in SPEC benchmarks. The goal of that work was to find available parallelism in real programs using register renaming. Austin did not incorporate cache dependencies and had no notion of detection or of values appearing at the chip pins.

9. CONCLUSIONS

Increasing soft-error rates in microprocessor logic will reach unacceptable levels in the near future. We identify three metrics for evaluating DMR error detection mechanisms: error coverage, comparison bandwidth, and I/O performance. No existing error detection mechanism satisfies all three metrics. We propose fingerprinting, a hash of updates to architectural state, as an approach that offers excellent error coverage, low bandwidth requirements, and inexpensive on-demand comparison. We evaluate existing mechanisms for error detection in DMR systems, and quantify the benefits fingerprinting provides.

10. ACKNOWLEDGEMENTS

We would like to thank Konrad Lai, T.M. Mak, Shubu Mukherjee, and the anonymous reviewers for their valuable feedback on early drafts of this paper. We thank the SimFlex team at Carnegie Mellon for providing the simulation infrastructure used in this research. Funding for this research was supported in part by NSF award ACI-0325802 and by Intel Corporation. Computer systems used in the research were provided by an equipment grant from Intel Corporation. Brian Gold is supported by graduate fellowships from NSF, Northrop Grumman, and the US DoD (NDSEG/HPCMO).

11. REFERENCES

[1] H. Akkary, R. Rajwar, and S. T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36), Dec. 2003.
[2] T. M. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 32), Nov. 1999.
[3] T. M. Austin and G. S. Sohi. Dynamic dependency analysis of ordinary programs. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 1992.
[4] D. Bossen, J. Tendler, and K. Reick. Power4 system design for high reliability. In Hot Chips 13, Aug. 2001.
[5] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin-Madison, June 1997.
[6] E. N. Elnozahy, D. B. Johnson, and Y. M. Wang. A survey of rollback-recovery protocols in message-passing systems. Technical Report CMU-CS-96-181, Department of Computer Science, Carnegie Mellon University, Sept. 1996.
[7] C. Gniady and B. Falsafi. Speculative sequential consistency with little custom storage. In Proceedings of the Tenth International Conference on Parallel Architectures and Compilation Techniques, Sept. 2002.
[8] M. Gomaa, C. Scarbrough, T. N. Vijaykumar, and I. Pomeranz. Transient-fault recovery for chip multiprocessors. In Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003.
[9] J. Gray and P. Shenoy. Rules of thumb in data engineering. In Proceedings of the IEEE International Conference on Data Engineering, Feb. 2000.
[10] M. Hall, J. Mellor-Crummey, A. Carle, and R. Rodriguez. FIAT: A framework for interprocedural analysis and transformation. In Proceedings of the Sixth Annual Workshop on Compilers for Parallel Processing, Aug. 1993.
[11] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 3rd edition, 2002.
[12] T. Juhnke and H. Klar. Calculation of the soft error rate of submicron CMOS logic circuits. IEEE Journal of Solid-State Circuits, 30(7):830-834, July 1995.
[13] G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. FERRARI: A tool for the validation of system dependability properties. In Proceedings of the 22nd International Symposium on Fault Tolerant Computing, 1992.
[14] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. IEEE Computer, 35(2):50-58, Feb. 2002.
[15] D. McEvoy. The architecture of Tandem's NonStop system. In ACM/CSC-ER, 1981.
[16] S. S. Mukherjee, M. Kontz, and S. K. Reinhardt. Detailed design and evaluation of redundant multi-threading alternatives. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[17] S. S. Mukherjee, C. T. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36), Dec. 2003.
[18] M. Prvulovic, Z. Zhang, and J. Torrellas. ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors. In Proceedings of the 29th Annual International Symposium on Computer Architecture, June 2002.
[19] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
[20] E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th International Symposium on Fault-Tolerant Computing Systems, June 1999.
[21] L. Sherman. Stratus continuous processing technology: The smarter approach to uptime. Technical report, Stratus Technologies, 2003.
[22] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2002.
[23] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the International Conference on Dependable Systems and Networks, June 2002.
[24] D. P. Siewiorek and R. S. Swarz, editors. Reliable Computer Systems: Design and Evaluation. A K Peters, 3rd edition, 1998.
[25] T. J. Slegel et al. IBM's S/390 G5 microprocessor design. IEEE Micro, 19(2):12-23, March/April 1999.
[26] E. Sogomonyan, A. Morosov, M. Gossel, A. Singh, and J. Rzeha. Early error detection in systems-on-chip for fault-tolerance and at-speed debugging. In Proceedings of the 19th VLSI Test Symposium, May 2001.
[27] D. J. Sorin, M. M. K. Martin, M. D. Hill, and D. A. Wood. SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th Annual International Symposium on Computer Architecture, June 2002.
[28] Standard Performance Evaluation Corporation. SPECweb99 benchmark. http://www.specbench.org/osg/web99/.
[29] R. Strom and S. Yemini. Optimistic recovery in distributed systems. ACM Transactions on Computer Systems, 3(3):204-226, Aug. 1985.
[30] The Transaction Processing Performance Council. TPC Benchmark C: Standard specification. http://www.tpc.org/tpcc/spec/tpcc_current.pdf, Dec. 2003.
[31] K. S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications. John Wiley and Sons, 2nd edition, 2001.
[32] T. N. Vijaykumar, I. Pomeranz, and K. Cheng. Transient fault recovery using simultaneous multithreading. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.
[33] N. Wang and S. Patel. Modeling the effect of transient errors on high performance microprocessors. In Center for Circuits, Systems, and Software (C2S2), 2nd Annual Review, March 2003.
[34] K. Wilken and J. P. Shen. Continuous signature monitoring: Low-cost concurrent detection of processor control errors. IEEE Transactions on Computer-Aided Design, 9(6):629-641, June 1990.
[35] J. K. Wolf, A. M. Michelson, and A. H. Levesque. On the probability of undetected error for linear block codes. IEEE Transactions on Communications, 30(2), Feb. 1982.
[36] J. F. Ziegler, H. P. Muhlfeld, C. J. Montrose, H. W. Curtis, T. J. O'Gorman, and J. M. Ross. Accelerated testing for cosmic soft-error rate. IBM Journal of Research and Development, 40(1), 1996.