Appears in the 37th Annual IEEE/ACM International Symposium on Microarchitecture

Efficient Resource Sharing in Concurrent Error Detecting Superscalar Microarchitectures

Jared C. Smolens, Jangwoo Kim, James C. Hoe, and Babak Falsafi
Computer Architecture Laboratory (CALCM)
Carnegie Mellon University, Pittsburgh, PA 15213
{jsmolens, jangwook, jhoe, babak}@ece.cmu.edu
http://www.ece.cmu.edu/~truss

Abstract

Previous proposals for soft-error tolerance have called for redundantly executing a program as two concurrent threads on a superscalar microarchitecture. In a balanced superscalar design, the extra workload from redundant execution induces a severe performance penalty due to increased contention for resources throughout the datapath. This paper identifies and analyzes four key factors that affect the performance of redundant execution, namely 1) issue bandwidth and functional unit contention, 2) issue queue and reorder buffer capacity contention, 3) decode and retirement bandwidth contention, and 4) coupling between the redundant threads' dynamic resource requirements. Based on this analysis, we propose the SHREC microarchitecture for asymmetric and staggered redundant execution. This microarchitecture addresses the four factors in an integrated design without requiring prohibitive additional hardware resources. In comparison to conventional single-threaded execution on a state-of-the-art superscalar microarchitecture with comparable cost, SHREC reduces the average performance penalty to within 4% on integer and 15% on floating-point SPEC2K benchmarks by sharing resources more efficiently between the redundant threads.

1. Introduction

A soft error occurs when the voltage level of a digital signal is temporarily disturbed by an unintended mechanism such as radiation, thermal noise, or electrical noise. Previously, soft errors have been a phenomenon associated with memory circuits. The diminishing capacitance and noise margins anticipated for future deep-submicron VLSI have raised concerns regarding the susceptibility of logic and flip-flop circuits. Recent studies have speculated that by the next decade, logic and flip-flop circuits could become as susceptible to soft errors as today's unprotected memory circuits [6,8,18]. In light of increasing soft-error concerns, recent research has studied the soft-error vulnerability of the microarchitecture as a whole [11]. A number of studies have investigated ways to protect the execution pipeline by means of concurrent error detection; this protection requires 1) instructions to be executed redundantly and independently and 2) the redundant outcomes to be compared for correctness. Among recent studies, a popular approach has been to leverage the inherent redundancy in today's high-performance superscalar pipelines to support concurrent error detection [10,13,14,15,16,20].

Performance Penalty. A minimally modified superscalar datapath can support concurrent error detection by executing an application as two independent, redundant threads. A drawback of this approach is that the doubled workload from redundant execution incurs a significant IPC penalty on superscalar datapaths tuned for single-thread performance. Understandably, an application that achieves high utilization and high IPC as a single thread on a given superscalar datapath will saturate the available resources and lose performance when running as two redundant threads on the same datapath.
More interestingly, however, applications with low IPC as a single thread can still perform poorly as two redundant threads, despite the apparently ample resources to support both low-IPC threads. In this paper, we explain the performance bottlenecks that contribute to this paradoxical behavior and offer a microarchitecture design that improves redundant execution performance in these cases.

Performance Factors. We study four factors in superscalar datapath design that affect the performance of redundant execution. The first three factors represent the different resources that support high-performance execution: 1) issue and functional unit bandwidth, 2) issue queue (ISQ) and reorder buffer (ROB) capacity, and 3) decode and retirement bandwidth. Our analysis indicates that the bottleneck imposed by forcing two redundant threads to share issue and functional unit bandwidth has a first-order effect on performance. However, in the cases where issue bandwidth is not clearly saturated, sharing the ISQ and ROB between two threads, which effectively halves the out-of-order window size, also imposes a significant performance bottleneck, particularly in floating-point applications. Decode and retirement bandwidth may be a performance-limiting factor for microarchitectures that fetch or decode instructions redundantly. The fourth factor in this study is a design choice of whether to allow the progress of the two redundant threads to stagger elastically. Staggered re-execution has previously been shown effective [9,15,16,20]. Our analysis further shows that the effects of allowing stagger and of relieving the ISQ and ROB contention are independent and can be combined to achieve better performance than either factor alone.

SHREC Microarchitecture. SHREC (SHared REsource Checker) is an efficient concurrent error detection microarchitecture that requires only limited modifications to a conventional superscalar microarchitecture. SHREC employs asymmetric redundant execution to relieve the sharing pressure on ISQ and ROB capacity without physically enlarging those structures. It is also naturally amenable to permitting an elastic stagger between redundant executions of the same instruction. Our evaluation shows that asymmetric redundant execution and stagger together enable SHREC to more efficiently utilize the available issue bandwidth and functional units, thus reducing the performance impact of sharing between two redundant threads. Assuming a comparable state-of-the-art baseline superscalar microarchitecture, SHREC reduces the average performance penalty of redundant execution to within 4% on integer and 15% on floating-point applications. This performance is in comparison to prior proposals that experience an IPC loss of 15% on integer and 32% on floating-point applications.

Related Work. In asymmetric redundant execution, an instruction is only checked after the second execution using input operands already available from the first execution. This idea has been proposed previously by Austin in the DIVA microarchitecture [2]. In DIVA, the initial execution and the redundant instruction checking do not share issue and functional unit bandwidth; a second dedicated set of functional units is added to the checker. In this study, we assume the resource configuration of the underlying superscalar microarchitecture is fixed by technology and market forces and cannot be easily increased. Therefore, the initial execution and redundant checking in SHREC share the same set of functional units.
An important issue addressed in our study is whether functional units can be shared efficiently. Previously, Mendelson and Suri proposed the O3RS approach for redundant execution on register update unit (RUU) based superscalar microarchitectures [10]. Redundant execution of the same instruction is initiated from the same RUU entry (from the same ISQ and ROB entries in the context of recent superscalar microarchitectures). This is another way to resolve the sharing pressure on ISQ and ROB capacity. However, the O3RS approach is not amenable to supporting stagger between redundant executions because the ISQ and ROB occupancy times would necessarily increase (entries are held until the second execution). Without stagger, O3RS cannot fully capture the performance opportunity available to SHREC. Conversely, staggered execution has been shown to be effective in improving performance in several SMT-based concurrent error detection designs, but these proposals do not address the bottleneck imposed by sharing ISQ and ROB capacity. In Section 3, we present a more detailed analysis and comparison of designs that only support staggered re-execution or only resolve the ISQ and ROB bottleneck.

Contributions. This paper offers two key contributions:
• We investigate and explain the performance behavior of redundant execution on superscalar microarchitectures.
• We propose the SHREC soft-error tolerant superscalar microarchitecture, which maximizes the performance of redundant execution with minimal changes to a conventional baseline superscalar microarchitecture.

Paper Outline. This paper is organized as follows. Section 2 provides background and defines assumptions regarding our baseline superscalar microarchitecture and redundant execution superscalar microarchitecture. Section 3 presents a factorial analysis of key design choices impacting the performance of redundant execution on superscalar microarchitectures. Section 4 describes and evaluates SHREC. We offer our conclusions in Section 5.

2. Baseline Microarchitectures

In this section, we first define our assumptions about the baseline superscalar microarchitecture. Next, we explain previous soft-error tolerance proposals that extend superscalar microarchitectures to redundantly execute a program as two concurrent threads. Finally, we establish the performance penalty associated with the prior proposals.

2.1. SS1: Baseline Superscalar Microarchitecture

The starting point of all soft-error tolerant designs in this paper is a balanced superscalar speculative out-of-order microarchitecture. In this paper, we assume a conventional wide-issue design where:
1. architectural register names are renamed onto a physical register file;
2. an issue queue (ISQ) provides an out-of-order window for dataflow-ordered instruction scheduling to a mix of functional units (FUs);
3. a reorder buffer (ROB) tracks program order and execution status for all in-flight instructions; and
4. a load/store queue (LSQ) holds speculative memory updates and manages out-of-order memory operations.

FIGURE 1. The (a) SS1 baseline superscalar and (b) SS2 redundant execution microarchitectures.

TABLE 1. Baseline SS1 parameters.
OoO Core: 128-entry ISQ, 512-entry ROB, 64-entry LSQ; 8-wide decode, issue, and retirement.
Memory System: 64K 2-way L1 I/D caches, 64-byte line size, 3-cycle hit; unified 2M 4-way L2, 12-cycle hit; 200-cycle memory access; 32 8-target MSHRs; 4 memory ports.
Functional Units (latency): 8 IALU (1), 2 IMUL/IDIV (3/19), 2 FALU (2), 2 FMUL/FDIV (4/12); all pipelined except IDIV and FDIV.
Branch Predictor: Combining predictor with 64K gshare, 16K/64K 1st/2nd-level PAs, 64K meta table, 2K 4-way BTB; 7-cycle branch misprediction recovery latency.

Figure 1(a) shows the layout of the baseline microarchitecture with the key datapath elements and their interconnections. Table 1 lists the salient parameters assumed for this paper. We refer to this reference microarchitecture configuration as SS1. SS1 nominally is an 8-way superscalar design. The settings in Table 1 are extrapolated from the Alpha EV8 [7]. (Although the EV8 is an SMT, it is emphasized that the EV8 is tuned for single-threaded superscalar execution.) Overall, SS1 should reflect an aggressive superscalar design that pushes today's state-of-the-art. Consequently, we assume the parameters in Table 1 cannot be increased further without substantial cost. For instance, the functional units in the EV8 occupy a substantial amount of die area, similar in size to 1MB of L2 cache storage [12].

We assume all memory arrays in SS1 (such as register files, caches and TLBs) are protected against soft errors by error correcting codes (ECC). The storage overhead to protect a 64-bit value with single error correction, double error detection ECC is roughly 13%. We assume the instruction fetch stages are protected by circuit-level soft-error tolerance techniques; these in-order stages can be more deeply pipelined to overcome the associated performance overhead. We assume circuit-level techniques are also responsible for protecting non-performance-critical portions of the datapath (e.g., memory management state machines). The main focus of this paper is on providing soft-error protection for the out-of-order pipeline. Neither ECC nor circuit-level techniques are applicable within the out-of-order pipeline due to the pipeline's structural complexity and extreme sensitivity to the critical path delay.
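The roughly 13% ECC figure quoted above follows from a back-of-the-envelope check-bit count. The short sketch below assumes a conventional Hamming-style code extended with one overall parity bit for double-error detection; the function and its name are illustrative, not part of the paper's design.

    def secded_check_bits(data_bits):
        """Check bits for single error correction, double error detection (SECDED).

        Assumes a Hamming code plus one overall parity bit; illustrative only.
        """
        r = 0
        while (1 << r) < data_bits + r + 1:   # Hamming bound for single-error correction
            r += 1
        return r + 1                          # +1 overall parity bit for double-error detection

    bits = secded_check_bits(64)
    print(bits, bits / 64)                    # 8 check bits -> 12.5%, i.e., roughly 13%

For a 64-bit word this gives 8 check bits, a 12.5% storage overhead, consistent with the figure above.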
2.2. SS2: Symmetric Redundant Execution

Register renaming and dataflow-driven execution in an out-of-order pipeline enable fine-grain simultaneous execution of two independent instruction streams; this is the key premise behind the simultaneous multithreading approach [19]. Figure 1(b) illustrates a simple approach to execute a program redundantly as two data-independent threads for concurrent error detection [14]. The key changes from SS1 are concentrated in the decode and retirement stages. In Figure 1(b), the instruction fetch stream is duplicated into an M-thread (main thread) and an R-thread (redundant thread) at the decode stage. Register renaming allows the M- and R-threads to execute independently between the decode and the retirement stages. This portion of the pipeline operates identically to SS1, with the exception of memory instructions. Load and store addresses are calculated redundantly. The memory access required by a load instruction is only performed by the M-thread. The returned load value is buffered in a load-value queue [15] for later consumption by the corresponding load instruction in the R-thread. This structure ensures that two executions of the same load instruction result in the same load value. In the retirement stage, the redundantly computed results of the duplicated instructions are checked against each other in program order. If an error is detected, program execution can be restarted at the faulty instruction by triggering a soft-exception. We refer to this 2-way symmetric redundant superscalar pipeline as SS2.

Although SS2 is based on a superscalar datapath, the SS2 discussion in this paper is also applicable to SMT-based designs that fully re-execute a program as two data-independent threads (e.g., AR-SMT [16], SRT [15], Slipstream [13]). A key commonality between them is the resource pressure from concurrently supporting two threads. We assume SS2 has the same level of resources as SS1 (Table 1). For every instruction executed in SS1, not only must the instruction use functional units twice in SS2, but the instruction must also consume twice the bandwidth throughout the SS2 pipeline (i.e., decode, issue, writeback and retirement). Furthermore, each instruction also occupies two separate entries in SS2's ISQ and ROB, effectively halving the out-of-order execution window size.

2.3. SS2 Performance Penalty

FIGURE 2. Performance impact of redundant execution on SS2: (a) integer and (b) floating-point benchmarks.

Figure 2 reports the IPC of SPEC2K benchmarks from executing their corresponding SimPoints [17] on cycle-accurate performance simulators of SS1 and SS2. Throughout this paper, we report performance results based on 11 integer and 14 floating-point SPEC2K benchmarks running their first reference input. (We exclude the integer benchmark mcf from our study because it is insensitive to all design parameters in this paper; its IPC is within 1% of 0.17 for all experiments.) To estimate IPC for each benchmark, we simulate an out-of-order core and measure IPC for up to eight 100-million-instruction SimPoint regions (as published in [17]) that are supposed to be representative of the benchmark's complete execution trace. We follow SimPoint's prescribed procedure to weight and combine the IPCs of individual SimPoint regions to estimate the entire application's average IPC. Average IPCs over multiple benchmarks are computed as harmonic means.

The performance simulators used in this paper are derived from the sim-outorder/Alpha performance simulator in SimpleScalar 3.0 [4]. For this study, we modify sim-outorder's stock RUU-based model to support the issue queue-based SS1 datapath shown in Figure 1, as well as MSHRs and memory bus contention. Unless otherwise noted, the configuration in Table 1 is used consistently in this paper.
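To make the weighting and averaging procedure concrete, the sketch below shows one way to combine per-region IPCs into a whole-program estimate and then summarize several benchmarks. It assumes the SimPoint weights are fractions of the dynamic instruction count, so regions are combined through their CPIs (a weighted harmonic mean of IPC); the function names and the numbers in the example are illustrative only, not values from the paper.

    def whole_program_ipc(regions):
        """Combine (weight, ipc) pairs for SimPoint regions into one IPC estimate.

        Weights are assumed to be instruction-count fractions, so the regions
        are combined as a weighted average of CPI (a harmonic mean of IPC).
        """
        cpi = sum(weight / ipc for weight, ipc in regions)
        total_weight = sum(weight for weight, _ in regions)
        return total_weight / cpi

    def harmonic_mean(ipcs):
        """Average IPC across benchmarks, as used for the summary bars."""
        return len(ipcs) / sum(1.0 / x for x in ipcs)

    # Illustrative numbers only.
    one_benchmark = whole_program_ipc([(0.25, 2.1), (0.50, 1.6), (0.25, 2.4)])
    print(round(one_benchmark, 2), round(harmonic_mean([one_benchmark, 1.2, 3.0]), 2))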
Graphs (a) and (b) in Figure 2 separately report the IPCs of the integer and floating-point benchmarks. In each graph, the benchmarks are sorted according to their SS1 IPC. The benchmarks are further divided into high-IPC versus low-IPC subsets; each graph is summarized in terms of the overall average IPC as well as separate averages of the high-IPC and low-IPC subsets. Overall, the SS2 IPC is 15% lower than SS1 for integer benchmarks and 32% lower for floating-point benchmarks. The difference in IPC is more significant for the high-IPC benchmarks. For benchmarks with high IPC on SS1, SS2 IPC suffers noticeably because the doubled workload of redundant execution saturates the already highly utilized 8-wide superscalar datapath. In this scenario, the SS2 performance penalty cannot be reduced without increasing the issue bandwidth and functional units.

It is interesting to note that significant performance loss is also experienced by SS2 for benchmarks with an SS1 IPC well below four, i.e., half the issue width. There are several reasons why this can happen. First, a low-IPC benchmark may nonetheless demand high utilization of the SS1 pipeline if a large number of mispredicted wrong-path instructions are part of the overall execution. Similarly, a low-IPC benchmark may require the full capacity of the ISQ and ROB to extract available instruction-level parallelism (ILP). Several of the low- and moderate-IPC floating-point benchmarks also experience significant floating-point functional unit contention. A benchmark with a low average IPC can include short program phases with high IPC; SS2 bottlenecks during these instantaneously high-IPC regions can further lower the average IPC. In all of these scenarios, it is possible to reduce the performance penalty of re-execution through more efficient scheduling and management of re-execution. In the next section, we present an analysis to help clarify the performance behavior of SS2 and the effects of the different resource bottlenecks.

3. Factorial Design Analysis

The performance difference between SS2 and SS1 is not caused by any single bottleneck. The contributing bottlenecks vary from benchmark to benchmark, and even between different phases of the same benchmark. The analysis in this section establishes the importance of individual bottlenecks in SS2, as well as their interactions.

3.1. SS2 Performance Factors

We reduce the factors contributing to performance bottlenecks in SS2 to three resource-related factors and one mechanism-related factor:
• X: issue width and functional unit bandwidth
• C: ISQ and ROB capacity
• B: decode and retirement bandwidth
• S: staggering redundant threads (mechanism-related)

X-factor: The X-factor represents the issue and functional unit bandwidth, a limit on the raw/peak instruction execution throughput of a superscalar pipeline. We combine issue and functional unit bandwidth into a single factor because they must be scaled together in a balanced design. A fundamental property of the X-factor is that if an application on average consumes more than one half of the issue and functional unit bandwidth in SS1, it is impossible for the same application to achieve the same level of performance in SS2. However, what happens to an application that on average consumes less than one half of the issue and functional unit bandwidth in SS1 is not immediately apparent, since whether the available issue and functional unit bandwidth can be used effectively in SS2 depends on the other factors.

C-factor: The C-factor represents the ISQ and ROB capacity. The C-factor governs a pipeline's ability to exploit instruction-level parallelism by executing instructions out-of-order.
Again, we combine the ISQ and ROB capacity into a single factor because they are scaled together in commercial designs. (A recent study has suggested, however, that there may be advantages to scaling the ROB capacity disproportionally [1].) In SS2, the out-of-order window size is effectively halved. Applications that depend on a large out-of-order window to expose ILP lose performance on SS2, independent of the X-factor. On the other hand, applications that are insensitive to the out-of-order window size perform predictably on SS2, as governed by the X-factor.

B-factor: The B-factor represents the various bandwidth limits at the interfaces to the ISQ and ROB. These include the decode bandwidth, the completion bandwidth and the retirement bandwidth. In SS2, these bandwidth limits impose a bottleneck because they must be shared between the redundant M- and R-threads. We expect the B-factor to have the least observed impact on SS2 performance, because in a balanced design the X and C factors are generally tuned to become limiting factors before saturating the other bandwidth limits.

S-factor: Besides the three resource-related factors, we introduce a fourth factor, stagger (S-factor), that improves the performance of SS2 but is not applicable to SS1. Several previous proposals for redundant execution on SMT advocate maintaining a small stagger (128~256 instructions) between the progress of the leading M-thread and the trailing R-thread [15,16]. In SMT-based designs, the stagger serves to hide cache-miss latencies from the R-thread and to allow the M-thread to provide branch prediction information to the trailing thread. Allowing some elasticity between the progress of the two threads also allows better resource scheduling. By giving static scheduling priority to the M-thread, the resulting effect is that the R-thread makes use of the slack resources left over by the M-thread rather than competing with the M-thread for resources. The reported benefit of allowing stagger is typically around a 10% increase in the IPC of redundant execution [15].

3.2. SS2 Design Space

The X, C, and B factors are potential bottlenecks that cause SS2 to lose performance relative to SS1. The bottleneck associated with a factor is removed if the resources associated with that factor are doubled from the assumed configuration in Table 1. To understand the different contributing factors to SS2's performance, we evaluate sixteen SS2 configurations, based on enumerating all possible combinations of settings for X, S, C and B. In each configuration, the factors X, C, and B are either kept the same as in Table 1 or doubled, and an elastic stagger of up to 256 instructions is either enabled or disabled. The resulting IPC improvements relative to SS2 for the sixteen possible configurations are summarized in Table 2. We order the factors in columns 1 through 4 by their overall impact on performance. For each configuration, we report the percentage improvement relative to SS2 for the integer and floating-point SPEC2K benchmarks, both overall and separately for the high-IPC and low-IPC subsets.

TABLE 2. Percentage increase in IPC relative to SS2 for all combinations of factors. The notes column identifies mappings to designs referenced in this paper.
X S C B | Integer %: All High Low | Floating-point %: All High Low | Notes
- - - - |   0   0   0  |   0   0   0  | SS2
- - - B |   3   4   2  |   0   1   0  |
- - C - |   3   1   4  |  28   9  34  |
- - C B |   5   4   6  |  28  11  35  | SS2+C+B ≈ O3RS [10]
- S - - |   9  13   6  |  11   9  12  | SS2+S ≈ SRT [15]
- S - B |  10  15   7  |  11  10  11  |
- S C - |  12  14  11  |  33  15  40  |
- S C B |  13  16  11  |  34  16  40  | SS2+S+C+B ≈ SHREC
X - - - |  15  25   9  |  12  36   6  |
X - - B |  15  26   9  |  12  38   6  |
X - C - |  19  28  14  |  46  58  43  | SS2+X+C ≈ SS1 ≈ DIVA [2]
X - C B |  20  30  14  |  48  61  44  |
X S - - |  16  27   9  |  19  46  13  |
X S - B |  17  29  10  |  21  51  15  |
X S C - |  20  30  14  |  49  62  45  |
X S C B |  22  33  15  |  52  71  47  |

Several general trends can be observed from Table 2. First, a comparison of the upper and lower halves of the table shows that the X-factor has a large performance impact on all benchmark classes. Comparing the neighboring quarters of the table shows that stagger also improves the performance of all benchmarks. Only the floating-point benchmarks are highly sensitive to the C-factor. Finally, comparing neighboring rows shows that the B-factor has a minimal effect on all benchmarks.

3.3. 2-k Factorial Analysis

To examine the SS2 design space more systematically, we apply 2-k factorial analysis to the CPI of the sixteen configurations to disentangle the performance impact of each factor and the impact from interactions between multiple factors. (Refer to [3] for a detailed explanation of 2-k factorial analysis. We analyze CPI rather than IPC because CPI is additive when normalized with respect to the instruction count, while IPC is not.) The result of the 2-k factorial analysis is a quantitative measure of the average effect of changing each of the four factors individually. In addition, factorial analysis also gives the effect due to interactions between changing any two factors, three factors, or all four factors together. Table 3 summarizes the significant effects (> 3%) for the high- and low-IPC integer and floating-point benchmarks. For rows showing an individual factor, the value under the "Effect" column signifies the average percentage CPI decrease (i.e., performance increase) between all configurations with the factor enabled versus all configurations with the factor disabled.

TABLE 3. The main factors and interactions that increase performance in redundant execution (factor: effect, in percent).
Integer: High:        X 17.1, S 7.1, X+S -5.2
Integer: Low:         X 5.2, C 4.3, S 3.1
Floating-point: High: X 33.5, C 9.9, S 6.4
Floating-point: Low:  C 27.0, S 6.2, X 4.6, S+C -3.9

For example, the significant factors to increase performance on the high-IPC floating-point benchmarks are X, C and S. The interactions between the three factors (not shown) are insignificant. Their effects sum nearly linearly when more than one factor is enabled. In another example, the two most significant factors for the high-IPC integer benchmarks are X and S. In this case, X and S have a significant and negative interaction. In other words, the effect of implementing both X and S is that the added performance improvement will be less than the simple sum of the two effects.
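For readers unfamiliar with the procedure, the sketch below computes main effects and pairwise interaction effects from the sixteen measured CPIs in the usual 2-k factorial style [3]. It is a simplified illustration: effects are reported as a percentage of the mean CPI, and the exact normalization behind Table 3 is not spelled out in the text, so the numbers it would produce should be treated as approximate.

    from itertools import product

    FACTORS = ("X", "S", "C", "B")

    def factorial_effects(cpi):
        """Main and pairwise effects from a 2^4 factorial experiment.

        `cpi` maps a frozenset of enabled factors (e.g., frozenset({"X", "S"}))
        to the measured CPI of that configuration.  Effects are returned as a
        percentage CPI decrease, so positive values mean higher performance.
        """
        configs = [frozenset(f for f, on in zip(FACTORS, bits) if on)
                   for bits in product((False, True), repeat=len(FACTORS))]
        mean_cpi = sum(cpi[c] for c in configs) / len(configs)

        def effect(*names):
            # Contrast sign is +1 when an even number of the named factors is
            # disabled in a configuration, -1 otherwise (standard 2-k coding).
            total = sum(cpi[c] * (1 if len(set(names) - c) % 2 == 0 else -1)
                        for c in configs)
            return -(total / (len(configs) / 2)) / mean_cpi * 100

        mains = {f: effect(f) for f in FACTORS}
        pairs = {(a, b): effect(a, b)
                 for i, a in enumerate(FACTORS) for b in FACTORS[i + 1:]}
        return mains, pairs

Called on the sixteen measured CPIs of one benchmark class, factorial_effects returns the X, S, C, and B main effects together with all two-factor interactions, analogous to the entries of Table 3.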
Although the X-factor has significant effects for all classes of benchmarks, we do not explore the X-factor further, since we cannot increase the issue bandwidth or functional units. We will show, however, that the S-factor improves performance by helping alleviate the issue bandwidth and functional unit pressure. One should note that from Table 2, doubling the issue bandwidth and functional units alone (SS2+X) cannot improve SS2's performance to SS1 levels. We also do not discuss the B-factor since it is not a significant factor in any benchmark class. We concentrate on the C and S factors next.

3.4. Capacity of ISQ and ROB

The 2-k factorial analysis suggests that the capacity of the ISQ and ROB (C-factor) in SS2 can have a major impact on the performance of low-IPC floating-point benchmarks. Although one cannot physically double the capacity of the ISQ and ROB due to complexity and area constraints, microarchitectural techniques exist to achieve the same effect in the context of redundant execution. Mendelson and Suri proposed the O3RS approach for redundant execution in a superscalar microarchitecture [10]. A unique feature of O3RS is that an instruction occupies just one entry of the ISQ and ROB after decode. (The original O3RS is based on an RUU; here, we describe the spirit of the proposal in the context of an ISQ and ROB.) An instruction is issued twice from the same ISQ entry before it is removed; similarly, one ROB entry keeps track of the two redundant executions of the same instruction. In other words, given the resources in Table 1, O3RS would perform similarly to SS2 with twice the ISQ and ROB capacity.

FIGURE 3. The effect of relieving ISQ and ROB contention in SS2 (C-factor): (a) integer and (b) floating-point benchmarks.

Figure 3 compares the IPC of SPEC2K benchmarks on SS1, SS2 and SS2 with twice the ISQ and ROB capacity (SS2+C). As suggested by the factorial analysis, the IPCs of SS2 and SS2+C are comparable on integer benchmarks, and SS2+C shows a 28% improvement over SS2 for floating-point benchmarks. (It is important to note that the effect given by the factorial analysis is an average over all settings of the other factors; the actual change in performance from changing one variable may be higher or lower, depending on the state of the other factors.) The previous 32% floating-point performance gap between SS2 and SS1 has been reduced to 25% by the C-factor.

A limitation of the O3RS approach is that it cannot be easily modified to support stagger. In O3RS, redundant instructions should issue from the ISQ in rapid succession to not increase the effective occupancy of the ISQ. Increasing this interval to support stagger has the effect of reducing the out-of-order window, offsetting O3RS's original benefit from the C-factor.

3.5. Staggered Re-execution

Figure 4 compares the IPC of SPEC2K benchmarks on SS1, SS2, and SS2 with stagger (SS2+S). In accordance with previous stagger studies, SS2+S results in a noticeable IPC improvement over SS2, 9% on integer and 11% on floating-point benchmarks. However, notice in Table 2 that also relieving the C-factor (i.e., SS2+S+C) can yield another 3% improvement on integer benchmarks and 22% on floating-point benchmarks beyond SS2+S.
FIGURE 4. The effect of allowing a stagger of 256 instructions in SS2 (S-factor): (a) integer and (b) floating-point benchmarks.

We conclude this section with an analysis of the impact of varying the degree of elastic stagger. The allowed degree of stagger impacts performance because it limits the granularity of IPC variation that can be effectively hidden. For example, if the maximum degree of stagger is longer than an L2 miss latency, it is likely that the R-thread can continue to make forward progress on backlogged instructions while the M-thread is stalled on an L2 miss. However, if the maximum degree of stagger is much shorter than an L2 miss latency, the R-thread will also stall due to data dependencies on the same L2 miss. We evaluate SS2+S+C over different maximum staggers of 0, 256, 1K, and 1M instructions. The trends of IPC versus degree of stagger are reported in Figure 5 for the high/low-IPC integer and floating-point benchmarks.

FIGURE 5. Applications are not sensitive to degrees of stagger over 256 instructions.

Figure 5 shows that, with the exception of the high-IPC floating-point benchmarks, a moderate stagger of up to 256 instructions is sufficient to capture nearly all the benefits. Low-IPC applications, in particular, have the potential to be latency-bound. With an IPC of 1.5, an application would need a stagger of up to 300 instructions to completely overlap a 200-cycle main memory access. Enabling 1K or 1M stagger (longer than the longest system latency) has little additional benefit. Very large stagger could be effective if a program exhibited IPC variations across large program phases, but the associated hardware cost is indefensible for the incremental performance benefit.
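The latency-hiding argument above is just a product of sustained IPC and the stall latency to be covered. The few lines below, using the 200-cycle memory latency from Table 1, reproduce the 300-instruction figure and illustrate why a 256-instruction stagger already covers most cases, with only sustained high-IPC phases needing more.

    MEMORY_LATENCY = 200   # cycles, main memory access latency from Table 1

    def stagger_needed(ipc, stall_cycles=MEMORY_LATENCY):
        """Instructions the trailing thread must lag to fully overlap one stall."""
        return ipc * stall_cycles

    for ipc in (0.5, 1.0, 1.5, 2.0):
        print(ipc, stagger_needed(ipc))   # 1.5 IPC x 200 cycles = 300 instructions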
4. SHREC

In this section, we present the details of SHREC and evaluate its performance. The SHREC design reduces the performance penalty of redundant execution through more efficient scheduling and management. First, a key feature of the SHREC design is that R-thread instructions are processed differently from M-thread instructions. The R-thread instructions are only checked using input operands produced by the M-thread. This asymmetric handling of re-execution restores the full capacity of the ISQ and ROB to the M-thread. R-thread instruction checking is efficiently handled by a separate, simple in-order scheduler. Thus, bottlenecks associated with the ISQ and ROB are avoided without physically enlarging those structures. Second, the progress of the threads can stagger elastically, up to the size of the ROB. This added scheduling flexibility allows the R-thread to fill in the slack issue bandwidth and functional units left idle by the M-thread, rather than constantly competing with the M-thread. Delaying R-thread execution relative to the corresponding M-thread instruction also hides both cache miss latency and branch misprediction latency from the corresponding R-thread instruction. Thus, the issue bandwidth and functional unit bottlenecks can be ameliorated by means of more efficient utilization, rather than by providing additional issue bandwidth or functional units.

FIGURE 6. (a) DIVA and (b) SHREC microarchitectures.

4.1. Asymmetric Re-execution on Dedicated FUs

To aid our discussion of SHREC, we first review DIVA, a prior proposal for asymmetric re-execution [2]. The previously proposed DIVA microarchitecture augments a conventional superscalar pipeline (e.g., SS1) with a highly abbreviated checker pipeline [2,5,21]. Figure 6(a) depicts the datapath of the DIVA microarchitecture. In DIVA, before the completed instructions in the main out-of-order pipeline can retire, the instructions must first proceed in program order through the checker pipeline, accompanied by their previously computed results. The checker pipeline re-computes each instruction individually; a mismatch between the original result and the re-computed result triggers an exception in the main out-of-order pipeline. To minimize the throughput mismatch, the DIVA checker pipeline is assumed to have the same issue bandwidth and functional unit mix as the main out-of-order pipeline. Since the instructions in the checker pipeline are not constrained by data dependencies, the checker issue logic is in-order and the functional units can be deeply pipelined. In other words, the checker pipeline, although greatly simplified, can match the throughput of the main out-of-order pipeline. DIVA's performance should closely follow SS1, since DIVA's main out-of-order pipeline supports only the initial execution of an instruction stream on essentially the same hardware as SS1. However, DIVA's good performance is, to a large degree, also bolstered by a second set of functional units at a significant cost over SS1, as discussed in Section 2.1.

4.2. Asymmetric Re-execution on Shared FUs

As illustrated in Figure 6(b), a SHREC pipeline is comprised of a primary out-of-order pipeline (essentially SS1) and an in-order checker pipeline (shaded in gray). The out-of-order pipeline and the checker pipeline in SHREC share access to a common pool of functional units. In other words, SHREC adopts DIVA's asymmetric execution scheme to relieve capacity and bandwidth contention at the ISQ and ROB, but shares functional units to reduce the hardware cost. This design point is important because soft-error tolerance may not be needed under all circumstances (e.g., Doom vs. TurboTax). In DIVA, soft-error tolerance comes at a fixed hardware overhead. SHREC's soft-error tolerance comes with a performance overhead but can be disabled to regain performance in non-critical applications.

The SHREC pipeline is similar to the DIVA pipeline except in the design surrounding the functional units and issue bandwidth of the checker pipeline. Based on the assumed SS1 configuration in Table 1, the derived SHREC microarchitecture can issue only up to eight instructions per cycle to the original set of functional units, whether from the out-of-order or the checker pipeline. The out-of-order pipeline has static priority for instruction issue bandwidth and functional unit selection; only unused issue bandwidth and functional units are available to the in-order issue window of the checker pipeline.
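The slot arbitration just described can be summarized with a short cycle-level sketch. This is an illustrative model, not the paper's hardware description: the out-of-order scheduler claims issue slots and functional units first, and the checker then issues, in program order, from its small window using whatever remains, stalling at the first instruction it cannot issue.

    from collections import namedtuple

    Insn = namedtuple("Insn", "tag fu_type")   # minimal stand-in for an instruction

    ISSUE_WIDTH = 8        # total issue slots per cycle (Table 1)
    CHECKER_WINDOW = 8     # in-order window of completed, unchecked instructions

    def issue_one_cycle(ooo_ready, checker_queue, fu_free):
        """One cycle of issue-slot arbitration (illustrative model).

        ooo_ready     : instructions selected by the out-of-order scheduler, by priority
        checker_queue : oldest completed-but-unchecked instructions, program order
        fu_free       : free functional units by type, e.g. {"IALU": 8, "FALU": 2}
        """
        slots = ISSUE_WIDTH
        issued_main, issued_check = [], []

        # The main out-of-order pipeline has static priority for slots and FUs.
        for insn in ooo_ready:
            if slots and fu_free.get(insn.fu_type, 0):
                fu_free[insn.fu_type] -= 1
                slots -= 1
                issued_main.append(insn)

        # The checker sees only the oldest CHECKER_WINDOW instructions, in order,
        # and stops at the first one it cannot issue (in-order restriction).
        for insn in checker_queue[:CHECKER_WINDOW]:
            if slots and fu_free.get(insn.fu_type, 0):
                fu_free[insn.fu_type] -= 1
                slots -= 1
                issued_check.append(insn)
            else:
                break

        return issued_main, issued_check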
Naturally, the out-of-order issue logic in SHREC is more complicated than in SS1. However, by 1) keeping the number of issue slots the same, 2) assigning fixed priority to the out-of-order pipeline, and 3) issuing from only a small in-order window of instructions from the checker pipeline, the increase in complexity can be kept to a minimum. SS1 originally supports a 128-entry ISQ. The equivalent SHREC checker pipeline is allowed an eight-entry in-order issue window, and the SHREC ISQ is reduced commensurately to 120 entries. Thus, the total number of entries feeding into the issue selection logic remains 128, as in SS1. No additional ports are added to the out-of-order issue queue, since the in-order issue window is a logically and physically separate structure.

4.3. Microarchitecture Operation

SHREC's operation is similar to DIVA's and can leverage many of the same optimizations proposed for DIVA [5,21]. Below, we describe the basic operations of the SHREC pipeline.

Out-of-order Pipeline: The main SHREC pipeline is a conventional superscalar out-of-order design. An instruction stream proceeds through the out-of-order pipeline normally, except completed instructions must be re-executed by the checker pipeline before they can retire from the ROB. The main pipeline needs to be augmented by a result buffer that records the result of each instruction; these recorded values are later checked against the checker's re-execution. This result buffer operates in tandem with the ROB, but has its own read and write ports, so it does not increase ROB complexity or port requirements.

Checker Pipeline: As in DIVA, we assume information about an instruction's initial execution in the out-of-order pipeline is conveyed to the checker pipeline without error via the result buffer (protected by ECC and circuit-level techniques). The checker pipeline considers at most eight consecutive completed instructions in the ROB for re-execution in program order, subject to the availability of surplus issue slots and functional units not scheduled by the main pipeline. When issued, the checker fetches operands from the same physical registers that provided operands to the initial execution. Since the total number of instructions issued per cycle is never more than in SS1, additional register file ports are not needed.

Verification and Retirement: Following re-execution, instructions are considered for program-order retirement. The checker-computed result is immediately compared against the original result computed by the out-of-order pipeline. If the values agree, the instruction retires from the ROB. Otherwise, a soft-exception restarts execution at the corrupted instruction.
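The retirement-time check described above can likewise be sketched in a few lines. The result-buffer entry and the soft-exception are represented here by plain Python objects; these are assumptions of the sketch, not structures defined by the paper.

    class SoftException(Exception):
        """Raised when the checker's result disagrees with the recorded result."""

    def retire_in_order(rob):
        """Retire checked instructions from the head of the ROB in program order.

        `rob` is a list of dicts, oldest first.  Each entry carries the original
        result recorded in the result buffer and, once the checker has re-executed
        the instruction, the checker's result (None until then).
        """
        retired = []
        while rob and rob[0].get("checker_result") is not None:
            entry = rob.pop(0)
            if entry["checker_result"] != entry["original_result"]:
                # Mismatch: a soft-exception restarts execution at this instruction.
                raise SoftException(entry["pc"])
            retired.append(entry["pc"])
        return retired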
4.4. SHREC Performance

FIGURE 7. The IPCs of SS2, SHREC, and SS1. SS2+S+C+B models an idealized SHREC: (a) integer and (b) floating-point benchmarks.

Figure 7 compares the IPC of SPEC2K benchmarks on the SS2, SHREC, SS2+S+C+B, and SS1 models. Overall, SHREC reduces the performance penalty relative to SS1 to 4% on integer and 15% on floating-point benchmarks. This penalty is in comparison to SS2's performance penalty of 15% on integer and 32% on floating-point benchmarks. The SHREC result generally agrees with the performance of the idealized SS2+S+C+B model (included in Figure 7 for comparison). The small difference between SHREC and SS2+S+C+B is caused by SHREC's in-order issue restriction on the R-thread; SS2+S+C+B still supports out-of-order issue for the R-thread.

SHREC is especially effective in reducing the performance penalty for benchmarks with low to moderate IPC on SS1. SHREC matches the SS1 IPC for low-IPC integer benchmarks and comes to within 8% for low-IPC floating-point benchmarks. In this scenario, SHREC's main out-of-order pipeline operates identically to SS1 because there is ample issue bandwidth and functional units to support equal progress by both the main pipeline and the checker pipeline. The same cannot be claimed for SS2 because, even with ample issue bandwidth and functional units, the contention for other resources still reduces IPC.

On high-IPC benchmarks, SHREC loses performance relative to SS1 because SHREC can only sustain half of the peak performance of SS1 (when the issue bandwidth and/or functional units are saturated). In these scenarios, exemplified by the high-IPC floating-point benchmarks, SHREC and SS2 perform comparably since both suffer from the same bottlenecks in issue bandwidth and functional units. In apsi, sixtrack, and art, SHREC is even slightly slower than SS2. This slowdown is attributed to the SHREC checker pipeline's in-order issue, which introduces more stalls than SS2 when there is contention for the floating-point units.

In many benchmarks, performance improvement opportunities also exist when the program exerts fluctuating pressure on issue slots and functional units. IPC fluctuation can be caused by stall events such as cache misses, pipeline refills after branch mispredictions, long-latency operations, or inherent IPC variations. During high-IPC cycles, the SHREC out-of-order pipeline momentarily keeps pace with SS1, while the checker pipeline is starved for issue slots. If the out-of-order pipeline sustains high IPC (i.e., in the high-IPC benchmarks), the out-of-order pipeline will eventually stall because the ROB backs up with instructions awaiting checking. Ideally, high-IPC program phases are interspersed with sufficiently frequent low-IPC phases such that the checker pipeline catches up. In these fluctuating IPC scenarios, SHREC will be more efficient than SS2.

Although SHREC does not increase issue bandwidth or the functional unit count, it reduces the pressure of sharing issue bandwidth and functional units through more efficient utilization. Here, we present additional data to show that SHREC's utilization efficiency is scalable when additional issue bandwidth and functional units are allowed.

FIGURE 8. Over a range of functional unit mix sizes, SHREC uses the supplied units more efficiently than SS2: (a) integer and (b) floating-point benchmarks.

Figure 8 plots IPC versus the X-factor (available issue bandwidth and functional units) for the high/low-IPC integer and floating-point benchmarks. '2X' means twice the issue bandwidth and functional units specified in Table 1. Between 0.5X and 2X, SHREC matches or outperforms SS2, except under high contention for floating-point units.
The integer benchmarks operate in a mode where they become ILP-bound with additional functional units. At that point, the benefit of SHREC's efficient scheduling diminishes; the IPCs of SS2 and SHREC converge for both high- and low-IPC integer benchmarks. The floating-point benchmarks tend to have high ILP, but experience functional unit bottlenecks. As the number of functional units increases, this bottleneck diminishes and the effect of more efficient SHREC scheduling increases. The performance of SHREC and SS2 should eventually converge, however, as even more functional units are introduced.

5. Conclusion

Recent proposals for soft-error tolerance in superscalar microarchitectures suffer a significant performance loss by requiring the instruction stream to be executed redundantly. In a balanced superscalar design, supporting a second redundant thread incurs a severe performance loss due to contention for bandwidth and capacity throughout the datapath (e.g., issue bandwidth, the number of functional units, issue queue and reorder buffer capacity, and decode/retirement bandwidth). Relaxing any single resource bottleneck cannot fully restore the lost performance. The proposed SHREC microarchitecture effectively relieves all of the above bottlenecks except issue bandwidth and the number of functional units. In SHREC, the main thread executes normally, as on a conventional superscalar datapath. The redundant thread's instructions, however, are only checked using input operands already computed by the main thread. The redundant thread does not compete with the main thread for out-of-order scheduling resources because checking can be efficiently managed by a separate, simple in-order scheduler. Furthermore, the progress of the main and redundant threads can stagger elastically up to the size of the reorder buffer. This added scheduling flexibility allows better utilization of issue bandwidth and functional units, reducing the performance pressure from sharing. The above techniques enable SHREC-style re-execution to share resources more efficiently and to achieve higher IPC than previous shared-resource concurrent re-execution proposals.

Acknowledgements

We would like to thank the anonymous reviewers and members of the Impetus research group at Carnegie Mellon for their helpful feedback on earlier drafts of this paper. This work was supported by NSF CAREER award CFF037568, NSF award ACI-0325802, and by Intel Corporation. Computers used in this research were provided by an equipment grant from Intel Corporation.

References

[1] Haitham Akkary, Ravi Rajwar, and Srikanth T. Srinivasan. Checkpoint processing and recovery: Towards scalable large instruction window processors. In Proceedings of the 36th Annual International Symposium on Microarchitecture, December 2003.
[2] T. M. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. In Proceedings of the 32nd International Symposium on Microarchitecture, November 1999.
[3] G. E. P. Box, W. G. Hunter, and J. S. Hunter. Statistics for Experimenters. John Wiley and Sons, Inc., 1978.
[4] Doug Burger and Todd M. Austin. The SimpleScalar tool set, version 2.0. Technical Report 1342, Computer Sciences Department, University of Wisconsin–Madison, June 1997.
[5] Saugata Chatterjee, Chris Weaver, and Todd Austin. Efficient checker processor design. In Proceedings of the 33rd International Symposium on Microarchitecture, pages 87–97, December 2000.
[6] Neil Cohen, T. S. Sriram, Norm Leland, David Moyer, Steve Butler, and Robert Flatley. Soft error considerations for deep-submicron CMOS circuit applications. In IEEE International Electron Devices Meeting: Technical Digest, pages 315–318, December 1999.
[7] Joel S. Emer. EV8: The post-ultimate Alpha. Keynote address, International Conference on Parallel Architectures and Compilation Techniques, September 2001.
[8] T. Juhnke and H. Klar. Calculation of the soft error rate of submicron CMOS logic circuits. IEEE Journal of Solid-State Circuits, 30(7):830–834, July 1995.
[9] Shih-Chang Lai, Shih-Lien Lu, Konrad Lai, and Jih-Kwon Peir. Ditto processor. In Proceedings of the International Conference on Dependable Systems and Networks, June 2002.
[10] Avi Mendelson and Neeraj Suri. Designing high-performance and reliable superscalar architectures: The Out of Order Reliable Superscalar (O3RS) approach. In Proceedings of the International Conference on Dependable Systems and Networks, June 2000.
[11] Shubhendu S. Mukherjee, Christopher Weaver, Joel Emer, Steven K. Reinhardt, and Todd Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual International Symposium on Microarchitecture, December 2003.
[12] Ronald P. Preston et al. Design of an 8-wide superscalar RISC microprocessor with simultaneous multithreading. In IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 2002.
[13] Zach Purser, Karthik Sundaramoorthy, and Eric Rotenberg. Slipstream processors: Improving both performance and fault tolerance. In Proceedings of the 33rd International Symposium on Microarchitecture, December 2000.
[14] Joydeep Ray, James C. Hoe, and Babak Falsafi. Dual use of superscalar datapath for transient-fault detection and recovery. In Proceedings of the 34th International Symposium on Microarchitecture, December 2001.
[15] Steven K. Reinhardt and Shubhendu S. Mukherjee. Transient fault detection via simultaneous multithreading. In Proceedings of the 27th International Symposium on Computer Architecture, June 2000.
[16] Eric Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In Proceedings of the 29th International Symposium on Fault-Tolerant Computing, June 1999.
[17] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, October 2002.
[18] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and Lorenzo Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In Proceedings of the International Conference on Dependable Systems and Networks, pages 389–398, June 2002.
[19] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.
[20] T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng. Transient-fault recovery using simultaneous multithreading. In Proceedings of the 29th International Symposium on Computer Architecture, May 2002.
[21] Chris Weaver and Todd Austin. A fault tolerant approach to microprocessor design. In IEEE International Conference on Dependable Systems and Networks, July 2001.