Execution-based Prediction Using Speculative Slices

Craig Zilles and Gurindar Sohi
Computer Sciences Dept., University of Wisconsin - Madison
1210 West Dayton Street, Madison, WI 53706-1685, USA
{zilles, sohi}@cs.wisc.edu

Abstract

A relatively small set of static instructions has significant leverage on program execution performance. These problem instructions contribute a disproportionate number of cache misses and branch mispredictions because their behavior cannot be accurately anticipated using existing prefetching or branch prediction mechanisms. The behavior of many problem instructions can be predicted by executing a small code fragment called a speculative slice. If a speculative slice is executed before the corresponding problem instructions are fetched, then the problem instructions can move smoothly through the pipeline because the slice has tolerated the latency of the memory hierarchy (for loads) or the pipeline (for branches). This technique results in speedups of up to 43 percent over an aggressive baseline machine.

To benefit from branch predictions generated by speculative slices, the predictions must be bound to specific dynamic branch instances. We present a technique that invalidates predictions when it can be determined (by monitoring the program's execution path) that they will not be used. This enables the remaining predictions to be correctly correlated.

1 Introduction

Wide-issue microprocessors are capable of remarkable execution rates, but they generally achieve only a fraction of their peak instruction throughput on real programs. This discrepancy is due to performance degrading events (PDEs), largely branch mispredictions and cache misses. This work attempts to avoid (or reduce the latency of) stalls due to branch mispredictions and cache misses.

Performance degrading events tend to be concentrated in a subset of static instructions whose behavior is not predictable with existing mechanisms [1, 8]. Current branch predictors attempt to identify patterns in branch outcomes and exploit them to accurately predict those branches whose behaviors exhibit these patterns. Similarly, caches service most requests by exploiting temporal and spatial locality, and hardware prefetching detects and anticipates simple memory access patterns. Because the "simple" instructions are handled accurately by existing mechanisms, most of the branch mispredictions and cache misses are concentrated in the remaining static instructions whose behavior is more "complex."

In Section 2, we demonstrate that a small set of frequently executed static instructions, which we refer to as problem instructions, are responsible for causing a majority of PDEs. In addition, we show that avoiding the mispredictions and cache misses caused by problem instructions gives much of the benefit of a perfect cache and branch predictor.

To avoid cache misses and branch mispredictions associated with problem instructions, we construct a code fragment that mimics the computation leading up to and including the problem instruction. We call such a piece of code a speculative slice [19]. Like a program slice [18], a speculative slice includes only the operations that are necessary to compute the outcome of the problem instruction. In contrast to a program slice, a speculative slice only needs to approximate the original program (i.e., it need not be 100 percent accurate), resulting in significant flexibility in slice construction.
Constructing efficient, accurate speculative slices is feasible for many problem instructions in the benchmarks studied. In Section 3, we describe the slices that we constructed by hand and the optimization techniques used. Typically, the slices cover multiple problem instructions and generate a prefetch or prediction every 2-4 dynamic instructions executed by the slice; branch predictions generated by the slices are greater than 99% accurate.

Speculative slices execute as helper threads on a multithreaded machine. These helper threads accelerate the program's execution microarchitecturally, by prefetching data into the cache and generating branch predictions, but they do not affect architected state; a slice has its own registers and performs no stores. Forking the slice many cycles before the problem instructions are encountered by the main thread provides the necessary latency tolerance. After copying some register values from the main thread, the slice thread executes autonomously. The necessary extensions to a multithreaded machine to support these helper threads are described in Section 4. An important component of this additional hardware binds branch predictions generated by helper threads to the correct branch instances in the main thread. When a prediction is generated by a slice, it is computed using a certain set of input values and assumptions about the control flow taken. Due to control flow in the program not included in the slice, not all of the generated predictions will be consumed.

Table 1. Simulated machine parameters.

Front End: A 64KB instruction cache, a 64Kb YAGS [6] branch predictor, a 32Kb cascading indirect branch predictor [5], and a 64-entry return address stack. The front end can fetch past taken branches. A perfect BTB is assumed for providing target addresses, which are available at decode, for direct branches. All nops are removed without consuming fetch bandwidth.

Execution Core: The 4-wide machine has a 128-entry instruction window, a full complement of simple integer units, 2 load/store ports, and a single complex integer unit, all fully pipelined. The pipeline depth (and hence the branch misprediction penalty) is 14 stages. For simulation simplicity, scheduling is performed in the same cycle in which an instruction is executed. This is equivalent to having a perfect cache hit/miss predictor for loads, allowing the scheduler to avoid scheduling operations dependent on loads that miss in the cache. Store misses are retired into a write buffer. The 8-wide machine is similar, except it has a 256-entry window and 4 load/store units.

Caches: The first-level data cache is a 2-way set-associative 64KB cache with 64-byte lines and a 3-cycle access latency, including address generation. The L2 cache is a 4-way set-associative 2MB unified cache with 128-byte lines and a 6-cycle access. All caches are write-back and write-allocate. All data request bandwidth is modeled, although writeback bandwidth is not. Minimum memory latency is 100 cycles.

Prefetch: In parallel with cache accesses, a 64-entry unified prefetch/victim buffer is checked on all accesses. A hardware stream prefetcher detects cache misses with unit stride (positive and negative) and launches prefetches. In addition, when bandwidth is available, sequential blocks are prefetched (before a stride is detected) to exploit spatial locality beyond 64 bytes.
A mechanism is required to determine which predictions should be ignored. In Section 5, we present our correlation technique, which monitors the path taken by the main thread to identify when a prediction can no longer be used by its intended branch. In Section 6, we present the performance results attained by executing speculative slices in the context of a simultaneous multithreading (SMT) processor. We show that the slices accurately anticipate the problem instructions they cover and provide speedups between 1% and 43%.

2 Problem Instructions

In this section, we present a brief characterization of problem instructions. Because such a characterization depends on the underlying microarchitecture (e.g., better predictors and caches reduce the number of problem instructions), we first present our hardware model and simulation methodology. In Section 2.2, we demonstrate that there is a set of static instructions that accounts for a disproportionate number of branch mispredictions and cache misses. Then, in Section 2.3, we show that this concentration of PDEs causes these instructions to have a significant performance impact on the execution. In Section 2.4, we present a source-level example of these instructions to provide a conceptual understanding of why these instructions are difficult to handle with conventional caches and branch predictors.

2.1 Methodology

The underlying microarchitecture is an aggressive, heavily-pipelined, out-of-order superscalar processor. It includes large branch predictors, large associative caches, and hardware prefetching for sequential blocks. Table 1 provides the details of our 4- and 8-wide simulation models, which are based loosely on the Alpha architecture version of SimpleScalar [2]. We perform our experiments on the SPEC2000 benchmarks. In general, these programs exhibit hard-to-predict branches and some non-trivial memory access patterns. We have used input parameters that roughly maintain the working set size of the reference inputs, while reducing the run lengths to 9 to 40 billion instructions.

2.2 Characterization

For each benchmark, we attribute PDEs (i.e., cache misses and branch mispredictions) to static instructions. If a static instruction accounts for a non-trivial number of PDEs and at least 10% of its executions cause a PDE, it is characterized as a problem instruction. This classification is somewhat arbitrary and serves merely as a demonstration that an uneven distribution of PDEs exists. Table 2 characterizes the problem instructions occurring in full runs of the benchmarks, showing that these instructions are responsible for a disproportionate fraction of performance degrading events.
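The classification rule above reduces to a simple filter over per-static-instruction event counters. The following C fragment is an illustrative sketch of that filter; the counter layout and the absolute threshold MIN_PDES are our own rendering of the criteria described in the text, not code from the simulator.

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-static-instruction counters collected during simulation (hypothetical layout). */
    typedef struct {
        uint64_t pc;          /* static instruction address                 */
        uint64_t executions;  /* dynamic execution count                    */
        uint64_t pdes;        /* cache misses + branch mispredictions       */
    } inst_stats_t;

    #define MIN_PDES 1000ULL  /* "non-trivial number of PDEs" (assumed threshold) */

    /* An instruction is a problem instruction if it causes a non-trivial
     * number of PDEs and at least 10% of its executions cause a PDE. */
    bool is_problem_instruction(const inst_stats_t *s) {
        return s->pdes >= MIN_PDES &&
               s->executions > 0 &&
               10 * s->pdes >= s->executions;
    }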
Table 2. Coverage of performance degrading events by problem instructions. The first group of columns shows how many static loads and stores (#SI) were marked as problem instructions, what fraction of all memory operations these instructions comprise (mem), and the fraction of L1 cache misses covered (mis). The second group of columns gives similar information for branches (br = % of dynamic branches, mis = % of mispredictions). (Per-benchmark entries are given for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr; one benchmark has too few misses to classify.)

Figure 1. Performance impact of problem instructions. The bottom bar shows the IPC of the baseline configuration, the second bar shows the performance improvement of avoiding the data cache misses and branch mispredictions associated with the problem instructions presented in Table 2, and the top bar shows the performance achievable if all data cache misses and branch mispredictions are avoided. Data shown for both 4- and 8-wide configurations for the SPEC2000 benchmarks.

2.3 Performance Impact

To observe how the PDEs caused by problem instructions affect performance, we augmented our simulator to give the appearance of a perfect branch predictor and perfect cache on a per-static-instruction basis. Figure 1 shows that removing the PDEs from problem instructions provides a substantial increase in performance. In fact, these instructions account for much of the performance discrepancy between the baseline model and the all-perfect model. Because the impact of the PDEs is larger in the 8-wide machine, the potential speedups from perfecting problem instructions are larger. Notice that even with perfect data caches and perfect branch prediction, most benchmarks fail to achieve the peak throughput of the machine due to the limited instruction window, real memory disambiguation, and operation latency.

2.4 Source-Level Example

We present a concrete example of problem instructions from the SPEC2000 benchmark vpr. This example performs a heap insertion. A heap is a data structure that supports removal of the minimum element and random insertions [4]. The structure is organized as a binary tree and maintains the invariant that a node's value is always less than that of either of its children. The heap is stored as an array of pointers, such that if a node's index is N, its children are found at indices 2N and 2N+1. Insertions are performed (in the function add_to_heap, shown in Figure 2) by adding the new element to the end of this array (at heap_tail, line 1), creating a new leaf node of the tree. At this point, we need to ensure that the heap invariant is maintained, so we check whether the new node (ifrom) is less than its parent (ito) and, if so, swap the two pointers (lines 7-9). This test is performed recursively until the heap invariant is satisfied (line 6) or the root of the tree is reached (line 5).

In vpr, the heap size is thousands of elements, so the whole structure does not fit in the L1 cache. This means that accessing the cost field of the ito element (in line 6) typically incurs a cache miss. Due to variation in the key values associated with the inserted elements, different insertions "trickle" different distances up the tree. An average loop count of 2-3 iterations causes the comparison branch (also in line 6) to be unbiased and frequently mispredicted.

This example demonstrates a number of common attributes of problem instructions. Problem loads typically involve dereferencing at least one pointer; dereferencing pointer chains is common. This can make them difficult to prefetch using existing methods. Problem branches are typically unbiased and involve a test on a recently loaded data element. Problem instructions are frequently dependent upon one another, as the problem branch in the example is data-dependent on the problem load, and the load is control-dependent on the problem branch from the previous iteration.
Lastly, problem instructions often occur in tight loops. In the next section, we discuss how to tolerate problem instructions.

Figure 2. Example problem instructions from the heap insertion routine in vpr.

    struct s_heap **heap;   // from [1..heap_size]
    int heap_size;          // # of slots in the heap
    int heap_tail;          // first unused slot in heap

    void add_to_heap (struct s_heap *hptr) {
     1.   heap[heap_tail] = hptr;
     2.   int ifrom = heap_tail;
     3.   int ito = ifrom/2;
     4.   heap_tail++;
     5.   while ((ito >= 1) &&
     6.          (heap[ifrom]->cost < heap[ito]->cost)) {   /* branch misprediction, cache miss */
     7.       struct s_heap *temp_ptr = heap[ito];
     8.       heap[ito] = heap[ifrom];
     9.       heap[ifrom] = temp_ptr;
    10.       ifrom = ito;
    11.       ito = ifrom/2;
          }
    }

3 Speculative Slices

Given that problem instructions are not easily predicted with state-of-the-art predictors, a different approach is required to predict them correctly. The key insight is that the program is capable of computing all addresses and branch outcomes in the execution; it just does so too late to tolerate the latency of the memory hierarchy (for loads) or the pipeline (for branches). To predict the behavior of problem instructions, we execute a piece of code that approximates the program. We extract the subset of the program necessary to compute the address or branch outcome associated with the problem instruction. We refer to this subset as the slice of the problem instruction.

Our previous research [19] observed that it is not profitable to build conservative slices (i.e., those that compute the address or branch outcome 100% accurately) because such slices must include a significant portion of the whole program. Instead, by using the slices to affect only microarchitectural state (through prefetching and branch prediction hints), we remove the correctness constraint on the slices. This speculative nature provides significant flexibility in slice construction, enabling smaller slices that can provide more latency tolerance with lower overhead.

We have found it feasible to construct efficient and accurate slices for many problem instructions. In this section we characterize the speculative slices we constructed, discussing the ways in which slices enable early execution before describing the slice construction process using our running example. The slices for this paper were constructed manually. They are meant to serve as a proof of concept and to suggest the set of techniques that should be used when constructing slices automatically. Automatic slice construction is discussed briefly in Section 3.3.

3.1 How Speculative Slices Enable Early Execution

The benefit of executing a speculative slice is derived from pre-executing (i.e., computing the outcome of) a problem instruction significantly in advance of when it is executed in the main program. Before discussing our speculative slices, we review how a slice enables early execution of an instruction relative to the whole program:

• Only instructions necessary to execute the problem instruction are included in the slice. Thus, the slice provides a faster way to fetch the computation leading to a problem instruction than does the whole program.

• The slice avoids misprediction stalls that impede the whole program. Control flow that is required by the slice is if-converted, either arithmetically or through predication (e.g., conditional move instructions). Unnecessary control flow is not included.

• The code from the original program can be transformed to reduce overhead and shorten the critical path to computing a problem instruction. Transformations can be performed on the slice that were not applied by the compiler for one of two reasons: (1) they were not provably safe for all paths (speculative slices neither have to be correct nor consider all paths), or (2) they were not feasible in the whole program due to limited resources (e.g., registers).

3.2 Speculative Slice Construction

Slice construction is most easily understood in the context of a concrete example. Using the code example from vpr, we demonstrate many of the characteristics of slices and the techniques we use to construct them.

Slice Structure. Because the two problem instructions in this example are inter-dependent, we construct a single slice that pre-executes both. In general, slice overhead can be minimized by aggregating all problem instructions that share common dataflow predecessors into a single slice. The problem instructions are in a loop. This is a common occurrence because one of our criteria for selecting problem instructions is execution frequency. Because the loop body is small (only 15 instructions), the necessary latency tolerance cannot be achieved by executing a separate slice for each loop iteration. Instead, the loop is encapsulated in the slice, and we attempt to fork the slice long before the loop is encountered in the main program.
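Concretely, the slice for this example behaves like the loop below. This is an illustrative C rendering of ours, not code from the paper: the actual slice is the Alpha assembly shown later in Figure 5; cost and the heap base are the live-in values copied at the fork; predict_branch() stands in for a prediction generating instruction; and MAX_ITER is the profile-derived iteration cap discussed under "Slice Termination" later in this section.

    /* Illustrative source-level view of the speculative slice for add_to_heap.
     * The loads serve as prefetches; the comparison outcome is forwarded to the
     * main thread as the prediction for the problem branch. */
    void add_to_heap_slice(float cost, struct s_heap **heap, int heap_tail) {
        int ito = heap_tail;
        for (int iter = 0; iter < MAX_ITER; iter++) {
            ito = ito / 2;
            float parent_cost = heap[ito]->cost;   /* prefetches heap[ito] and its cost field */
            predict_branch(cost < parent_cost);    /* hypothetical prediction-generating hook */
        }
    }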
Selecting a Fork Point. Fork points should be "hoisted" past unrelated code. The add_to_heap function is only called by the node_to_heap function (shown in Figure 3), so it is inlined by the compiler. Before calling add_to_heap, the node_to_heap function allocates a heap element and sets its fields. Because the slice only requires the cost field (passed in as a parameter to node_to_heap) and the heap itself (which is seldom modified before the problem instructions are executed), the slice can be forked at the beginning of the node_to_heap function. This fork point is sufficiently before (60 dynamic instructions) the first instance of the problem branch to derive benefit from the slice in our simulated machine.

Selecting a fork point often requires carefully balancing two conflicting desires. Maximizing the likelihood that the problem instruction's latency is fully tolerated requires that the fork point be as early as possible, but increasing the distance between the fork point and the problem instructions can both increase the size of the slice and reduce the likelihood that the problem instructions will be executed. Often there is a set of "sweet spots" where the latency tolerance is maximized for a given slice size and accuracy.

Figure 3. The node_to_heap function, which serves as the fork point for the slice that covers add_to_heap.

    void node_to_heap ( ..., float cost, ... ) {   /* <- fork point */
        struct s_heap *hptr;
        hptr = alloc_heap_data();
        hptr->cost = cost;
        ...
        add_to_heap(hptr);
    }

Slice Description. Figure 4 shows the original assembly code that corresponds to the source in Figure 2, and our slice that covers this region is shown in Figure 5.
The un-optimized slice is shaded in Figure 4 and the problem instructions are in bold in both the original and slice code. The optimized slice consists of only 8 static instructions; typically, a slice has fewer instructions than four times the number of problem instructions it covers.

Optimizations. Beyond removing the code that is unnecessary to execute the problem instructions, the size of our slice has been reduced by applying two optimizations. Although these particular transformations would be correct to apply to the original code, it would be difficult for the compiler, which must preserve correctness across all possible inputs, to prove that this is so. Since the slice cannot affect correctness, it merely must discern that these transformations are correct most of the time, so that performance is not detrimentally affected.

• Register Allocation: The series of values loaded by heap[ifrom]->cost is always the value cost, which is passed in as an argument to node_to_heap. This fact could be detected by memory dependence profiling. Our slice takes the value cost as a live-in value and removes all loads from heap[ifrom] and their corresponding stores to heap[ito]. This optimization removes 8 static instructions from the slice.

• Strength Reduction: The ito = ifrom/2 statements in the original code are implemented by the compiler with a right shift operation to avoid a long-latency division. To ensure the correct semantics of the division operator, a 3-instruction sequence is used that adds 1 to the value before the shift if the value is negative. Value profiling could determine that the value added is always 0 (because ifrom is never negative). By using strength reduction to remove the addition of the constant zero, the division operation is reduced to just the right shift, removing 4 static instructions from the slice.

Other profitable optimizations included removing register moves, eliminating unnecessary operand masking, exploiting invariant values, if-conversion, and reverse if-conversion. We have found removing communication through memory (like the register allocation example above) to be the most important.

Figure 4. Alpha assembly for the add_to_heap function. The instructions are annotated with the number of the line in Figure 2 to which they correspond. The problem instructions are in bold and the shaded instructions comprise the un-optimized slice.

    node_to_heap:
            ...                        /* skips ~40 instructions */
    2       lda     s1, 252(gp)        # &heap_tail
    2       ldl     t2, 0(s1)          # ifrom = heap_tail
    1       ldq     t5, -76(s1)        # &heap[0]
    3       cmplt   t2, 0, t4          # see note
    4       addl    t2, 0x1, t6        # heap_tail++
    1       s8addq  t2, t5, t3         # &heap[heap_tail]
    4       stl     t6, 0(s1)          # store heap_tail
    1       stq     s0, 0(t3)          # heap[heap_tail]
    3       addl    t2, t4, t4         # see note
    3       sra     t4, 0x1, t4        # ito = ifrom/2
    5       ble     t4, return         # (ito < 1)
    loop:
    6       s8addq  t2, t5, a0         # &heap[ifrom]
    6       s8addq  t4, t5, t7         # &heap[ito]
    11      cmplt   t4, 0, t9          # see note
    10      move    t4, t2             # ifrom = ito
    6       ldq     a2, 0(a0)          # heap[ifrom]
    6       ldq     a4, 0(t7)          # heap[ito]
    11      addl    t4, t9, t9         # see note
    11      sra     t9, 0x1, t4        # ito = ifrom/2
    6       lds     $f0, 4(a2)         # heap[ifrom]->cost
    6       lds     $f1, 4(a4)         # heap[ito]->cost  (problem load)
    6       cmptlt  $f0, $f1, $f0      # (heap[ifrom]->cost < heap[ito]->cost)
    6       fbeq    $f0, return        # (problem branch)
    8       stq     a2, 0(t7)          # heap[ito]
    9       stq     a4, 0(a0)          # heap[ifrom]
    5       bgt     t4, loop           # (ito >= 1)
    return:
            ...                        /* register restore code & return */

    note: the divide-by-2 operation is implemented by a 3-instruction sequence
    described in the Strength Reduction optimization.
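For reference, the 3-instruction sequence referred to in the note is the standard signed divide-by-two idiom; the following C rendering is ours (illustrative, not from the paper):

    /* With C semantics, ifrom/2 rounds toward zero, so negative values must be
     * biased before the shift; on Alpha this is the cmplt/addl/sra sequence. */
    static int div2_exact(int ifrom) { return (ifrom + (ifrom < 0 ? 1 : 0)) >> 1; }

    /* Value profiling shows ifrom is never negative in add_to_heap, so the
     * slice strength-reduces the whole sequence to a single right shift. */
    static int div2_slice(int ifrom) { return ifrom >> 1; }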
Slice Termination. When a slice contains a loop, as our example does, we must determine how many iterations of the loop should be executed. The obvious means (replicating the loop exit code from the program) is often not the most efficient. Many loops have multiple exit conditions, and inclusion of this computation in the slice can significantly impact slice overhead. To avoid the possibility of a "runaway" slice when exit conditions are excluded, each slice is assigned a maximum iteration count. This number is derived from a profile-based estimate of the upper bound of the number of iterations executed by the loop. When this maximum is close to the average iteration count for a loop, as is the case for our example, overhead can often be minimized by removing all loop exit computation from the slice and completely relying on the maximum iteration count to terminate the slice.

Even when accurate back-edge branches are included in the slice, the latency of evaluating branch conditions often causes extra iterations of the loop to be fetched. Once the instructions have been fetched, the cost of executing them is typically small. Exceptions also terminate slices; hence, linked-list traversals automatically terminate when they dereference a null pointer.

Figure 5. Slice constructed for the example problem instructions. Much smaller than the original code, the slice contains a loop that mimics the loop in the original code.

    slice:
    1       ldq     $6, 328(gp)        # &heap
    2       ldl     $3, 252(gp)        # ito = heap_tail
    slice_loop:
    3,11    sra     $3, 0x1, $3        # ito /= 2
    6       s8addq  $3, $6, $16        # &heap[ito]
    6       ldq     $18, 0($16)        # heap[ito]
    6       lds     $f1, 4($18)        # heap[ito]->cost
    6       cmptle  $f1, $f17, $f31    # (heap[ito]->cost < cost)  PRED
            br      slice_loop

    ## Annotations
    fork: on first instruction of node_to_heap
    live-in: $f17<cost>, gp
    max loop iterations: 4

This small static size translates to a small dynamic size relative to the program. This slice performs a prefetch or generates a prediction about every 3 instructions executed, and this is representative of the slices we have built. The slice has only 2 live-in registers, cost and the global pointer (used to access heap). In general, the natural slice fork points tend to require a small number of live-in values; rarely are more than 4 values required. Table 3 characterizes a representative sample of the slices we have constructed.

3.3 Automatic Slice Construction

For speculative slice pre-execution to be viable, an automated means for constructing slices will be necessary. Although a few of our slices benefit from high-level program structure information that may be difficult to extract automatically, most of the slices and optimizations only use profile information that is easy to collect. The most difficult issues are those that involve estimating a slice's potential benefit, which is needed to maximize that benefit as well as to decide whether a slice will be profitable. Roth and Sohi [13] automatically selected un-optimized slices from an execution trace using the approximate benefit metric of fetch-constrained dataflow height. Automated slice optimization is important future work.

4 Slice Execution Hardware

To derive the full benefit from the slices, the hardware must provide the following:

• a set of resources with which to execute the slice.
• a means to determine when to fork the slice.
• a means for communicating initial state.
• a means for marking prediction generating instructions.
• a means for correlating predictions to branches in the main thread.

In this section, we describe how our example hardware implementation handles the first four of these issues. Due to its relative complexity, the prediction correlation hardware is described in its own section (Section 5).

4.1 Execution Resources

Rather than inserting the slices into the program inline, we execute them as separate "helper" threads [3, 13, 14] on a simultaneous multithreading (SMT) [16] processor. In this way we avoid introducing stalls and squashes into the main thread due to the cache-missing loads (i.e., not just prefetches) and control flow in the slices. Using idle threads in an SMT processor avoids adding execution hardware specifically for slices. The helper threads compete with the main thread for execution resources (e.g., fetch/decode bandwidth, ALUs, cache ports). Because the helper threads enable the main thread to use resources more efficiently (by avoiding mispredictions and stalls), the application with the slices can be executed with fewer resources than the application alone can. Fetch resources are allocated to threads using an ICOUNT-like policy [16] that is biased toward the main thread.

The helper threads and the main thread share the first-level data cache, so when data is prefetched by the slice it is available to the main thread. With a sufficiently large cache, the timing of the prefetches is not critical.

Table 3. Characterization of slices. For a representative set of slices constructed, we provide the size in static instructions (static size), the number of live-in register values, the number of problem loads prefetched (pref), problem branches predicted (pred), and the number of kills used for branch correlation (see Section 5). If the slice contains a loop, the number of each category within the loop is shown in parentheses, and the iteration limit is shown in the final column.

    Prog.     static size   live-ins   pref    pred    kills   max iter
    bzip2     8 (7)         4          1 (1)   2 (2)   1 (1)   2000
    crafty    7             2          0       1       1       -
    eon       8             1          0       6       3       -
    gap       8 (5)         2          0       3 (3)   1 (1)   85
    gcc       4 (3)         1          0       2 (2)   1 (1)   31
    mcf       12 (12)       1          4 (4)   1 (1)   2 (1)   98
    twolf     8 (5)         2          0       2 (2)   1 (1)   7
    vpr       31 (15)       3          5 (3)   5 (2)   3 (1)   18
    vortex    4             1          1       0       0       -

4.2 Slice Forking

Slices are forked when a given point (the fork point) is reached by the main thread. There are two ways of marking a fork point: inserting explicit fork instructions, or designating an existing instruction as a fork point and detecting when that instruction is fetched. In order to maintain binary compatibility, we study the second approach in this work, but the hardware can be simplified by the former approach. Our simulated hardware includes a small hardware structure called the slice table (shown in Figure 6(a)), which is maintained in the front end of the machine (footnote 1). The first field, the fork PC, is a content-addressable memory (CAM) that is compared to the range of program counters (PCs) fetched each cycle. If a match occurs, an idle thread is allocated to execute the slice. If no threads are idle, the fork request is ignored. The slice table provides the starting PC for the thread (slice PC). The instructions that comprise the slice are stored as normal instructions in the instruction cache.

Footnote 1: Slice table entries cannot be demand loaded: without a loaded entry, the processor will not know that a fork point has been reached. One possible implementation would be to load the entries associated with a page of code when an instruction TLB miss occurs.
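As a concrete illustration, the slice table check can be thought of as follows. This C sketch is ours: the field layout simply mirrors the description above, and idle_thread_available() and fork_slice() are hypothetical hooks into the thread-management logic.

    #include <stdbool.h>
    #include <stdint.h>

    /* One slice table entry (cf. Figure 6(a)); field names are our own. */
    typedef struct {
        uint64_t fork_pc;        /* CAM tag matched against fetched PCs          */
        uint64_t slice_pc;       /* starting PC of the slice code in the I-cache */
        uint8_t  live_in[4];     /* architectural registers copied at the fork   */
        uint8_t  num_live_in;
        uint16_t max_iterations; /* profile-derived cap on slice loop iterations */
    } slice_entry_t;

    /* Hypothetical hooks into the SMT thread-management logic. */
    bool idle_thread_available(void);
    void fork_slice(const slice_entry_t *e);

    /* Performed in the front end each cycle: if any fork PC falls in the range
     * of PCs fetched this cycle and an idle context exists, fork the slice;
     * otherwise the fork request is simply ignored. */
    void check_fork_points(const slice_entry_t *table, int n,
                           uint64_t fetch_lo, uint64_t fetch_hi) {
        for (int i = 0; i < n; i++) {
            if (table[i].fork_pc >= fetch_lo && table[i].fork_pc < fetch_hi &&
                idle_thread_available()) {
                fork_slice(&table[i]);
            }
        }
    }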
4.3 Register Communication

In addition to the slice code, the helper thread needs a few "root" data values (as discussed in Section 3.2). To perform this inter-thread communication efficiently, we exploit SMT's centralized physical register file by copying register map table entries and synchronizing the communication using the existing out-of-order mechanism [13, 17]. The small number of copies, logically performed when the fork point is renamed, are completed in the background by stealing rename ports from instructions that do not rename a full complement of registers. Because the main thread may overwrite a copied register before the slice is done with it, we slightly modify the register reclamation process to ensure that physical registers are not prematurely freed. The processor keeps a reference count for those registers. No inter-thread synchronization is performed for memory, as it would require significant alteration of the load-store queue. Instead, slices are selected such that values loaded by the slice have not been recently stored by the main thread.

Figure 6. The slice and prediction generating instruction (PGI) tables, which reside at the front end of the pipeline. One entry in the slice table (a) is stored for each slice. The PGI table (b) has one entry for each prediction generating instruction in a slice. The fields are described in the text. These structures together require less than 512B of storage. (The slice table holds 16 entries; the PGI table holds 64 entries, each containing a PC tag, slice #, slice instruction #, and branch PC.)

Figure 7. A conceptual view of a control-based prediction correlation mechanism. When a branch is fetched by the main thread, its PC is compared to the correlator's PC tags. If a match occurs, the prediction at the head of the queue overrides the traditional branch predictor.

4.4 Prediction Generating Instructions

To improve branch prediction accuracy, a speculative slice computes branch outcomes to be used as predictions for problem branches in the main thread. The instruction in the slice that computes such an outcome is referred to as a prediction generating instruction (PGI). PGIs are identified by entries in the PGI table (Figure 6(b)), which are loaded along with slice table entries. In addition, each entry identifies the problem branch in the main thread to which it corresponds. When a PGI is fetched, it is marked with an identifier that indicates that the value it computes should be routed back to the front end of the machine and provided to the prediction correlator (described in the next section).

5 Branch Prediction Correlation

To derive any benefit from a branch prediction computed by a speculative slice, the prediction must be assigned to the intended dynamic instance of the problem branch. Since problem branches are unbiased, failure to accurately correlate predictions with branches will severely impact prediction accuracy. To maximize the benefit of the prediction, correlation is performed at the fetch stage. Conceptually, the branch correlator (Figure 7) consists of multiple queues of predictions, each tagged with the PC of a branch. When a branch is fetched and it matches a non-empty queue, the processor uses the prediction at the head of the queue in place of the one generated by the traditional branch predictor.
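A minimal sketch of this conceptual correlator is shown below (our C rendering, ignoring the kills, squash recovery, and late predictions that Section 5 addresses; the structure and function names are ours):

    #include <stdbool.h>
    #include <stdint.h>

    #define PREDS_PER_BRANCH 8

    /* One queue of predictions for one covered problem branch (cf. Figure 7). */
    typedef struct {
        uint64_t branch_pc;                 /* PC tag of the problem branch   */
        int      head, tail;                /* FIFO indices                   */
        bool     taken[PREDS_PER_BRANCH];   /* outcomes computed by the slice */
    } pred_queue_t;

    /* Slice side: a prediction generating instruction pushes its outcome. */
    void pgi_push(pred_queue_t *q, bool taken) {
        q->taken[q->tail % PREDS_PER_BRANCH] = taken;
        q->tail++;
    }

    /* Fetch side: if the fetched branch matches a non-empty queue, the head
     * prediction overrides the conventional predictor; otherwise the default
     * prediction is used. Returns true if the correlator supplied the prediction. */
    bool correlate(pred_queue_t *q, uint64_t fetch_pc, bool default_pred, bool *pred) {
        if (fetch_pc == q->branch_pc && q->head != q->tail) {
            *pred = q->taken[q->head % PREDS_PER_BRANCH];
            q->head++;
            return true;
        }
        *pred = default_pred;
        return false;
    }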
Although conceptually simple, three issues complicate the implementation of a prediction correlator:

• Conditionally-executed branches: If predictions are generated for all potential dynamic instances of a branch, and not all of them are executed due to control flow, a mechanism is required to kill the predictions that will not be used.

• Mis-speculation recovery: Because the correlator is manipulated when instructions are fetched by the main thread, it must be able to undo any action performed while the main thread was on an incorrect path.

• Late predictions: A prediction can be generated after the branch for which it is intended was fetched. If this is not detected, the late prediction can potentially be incorrectly correlated with future dynamic instances of the branch.

The following sub-sections describe each of these issues in turn, providing more details on the problems and how our mechanism solves them.

5.1 Conditionally-executed Branches

Achieving sufficient latency tolerance often necessitates "hoisting" the fork of a slice above branches that determine whether a problem branch will be executed. When the fork point and the problem branch are no longer control-equivalent (i.e., a problem branch is not executed exactly once for each fork point executed), we may generate more predictions than there are problem branches to consume them. If any unused predictions are left in the queue, the predictions become mis-aligned, severely impacting prediction accuracy.

Figure 8 shows an example of a conditionally executed problem branch. This problem branch (in block D) exists inside a loop and is only executed on the iterations in which the if statement in block C evaluates to true. Assuming that this loop is rarely executed more than three times before exiting, we create a slice that generates a prediction for each of the first three iterations. Assuming, without loss of generality, that the slice is executed in a timely fashion, the queue begins with one prediction for each of the first three loop iterations. If the problem branch in the first iteration is not executed, the problem branch in the second iteration will be matched with the prediction generated for the first iteration. Furthermore, if the loop exits without consuming all of the generated predictions, these predictions could potentially cause future mis-correlations to occur.

Figure 8. Example of a conditionally executed problem branch within a loop. Code example (a) and control flow graph (b).

    /* A */  bool P[3];
             int a = 0;              /* <- fork slice */
    /* B */  while (a <= ...) {
    /* C */      if (...) {
    /* D */          if (P[a]) {     /* <- problem branch */
    /* E */              ...
                     }
                 }
    /* F */      a++;
             }
    /* G */  ...

    (b) The control flow graph (not reproduced) contains the intra-iteration arcs
    CF, DE, and DF, a back-edge returning to the loop header, and the loop-exit
    arc BG.

This problem could be avoided by including in the slice the control flow that determines whether the problem branch will be executed (e.g., the conditions computed by blocks B and C in the example in Figure 8). We have found that including this additional code, the existence sub-slice [19], in the slice substantially increases the overhead of the helper thread and potentially impacts the latency of generating the predictions. Given that these control conditions are already going to be evaluated by the main thread, replication of this code is unnecessary. Instead, our mechanism observes the control flow as it is resolved by the main thread.

Instead of "consuming" predictions when they are used by a problem branch, we use the main thread's execution path to deallocate predictions when it can be determined that a prediction is no longer valid. For each prediction, we identify the valid region (the region of the control flow graph (CFG) in which the intended branch instance can still be reached), and when the execution leaves this region the prediction is killed. Figure 9(a) shows an unrolled CFG for the loop in our example and the valid regions for the first two predictions. Note that there are two types of exits from a region: 1) the arcs CF, DE, and DF kill the prediction for a single iteration, and 2) the arc BG (the loop exit) kills all predictions for the loop. We refer to the first kind as a loop iteration kill and the second type as a slice kill.

Although there are potentially a large number of exits from a valid region, we have found it unnecessary to monitor all such edges. In practice, a prediction does not need to be killed exactly when its valid region is exited; it merely must be killed before the next instance of the problem branch is fetched. Because of the reconvergent nature of control flow in real programs, all paths tend to be channeled back together at distinct points (e.g., all paths through a function eventually return from the function). For this reason, we merely need to select a reconvergent point that occurs between the two instances of the problem branch. More precisely, we select a point (or set of points) in the program that post-dominates all of the valid region exits and dominates the next execution of the problem instruction. In practice, we have found selecting such points to be straightforward. In our example, block F can serve as a loop iteration kill and block G as a slice kill. Figure 9(b) shows the operation of our mechanism for this example loop.

In this work, we identify existing instructions in the main thread for use as kills; the slice table provides these kill PCs to the prediction correlator when a slice is forked. Typically, a small number of kill instructions is sufficient to perform correlation correctly. We have found that often the best loop iteration kill block is the block that is the target of loop back-edges. In this case, the first instance of the block should not kill any predictions.

Figure 9. Killing predictions when they are no longer valid. Figure (a) shows the regions in which the predictions for the first and second iterations are valid on an unrolled control flow graph. Block F (the loop back-edge) can be used to kill the prediction for each iteration, and block G (the loop exit) should be used to kill all predictions. Figure (b) shows how loop iteration kills (block F) and slice kills (block G) can be used to ensure correct prediction correlation. The path taken is ABCFBCDFBG.

    Main thread fetch event    Prediction queue state
    (the slice guesses the loop will execute 3 times and generates 3 predictions)
    Initial state              P1 | P2 | P3
    Block D1 not fetched       no change
    Block F1 fetched           problem branch not executed on the first iteration;
                               first-iteration prediction killed:  P2 | P3
    Block D2 fetched           no change
    Block F2 fetched           problem branch correctly matched with the prediction for
                               the second iteration; second-iteration prediction killed:  P3
    Loop exits (Block G)       remaining predictions killed:  empty
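Extending the earlier queue sketch with this path-based deallocation gives something like the following (again our illustrative C with hypothetical names; predictions are retired by kills rather than consumed by branches, and squash recovery and late predictions are still ignored):

    #include <stdbool.h>
    #include <stdint.h>

    #define PREDS_PER_BRANCH 8

    /* Queue entry extended with kill PCs supplied by the slice table at fork time. */
    typedef struct {
        uint64_t branch_pc;     /* PC of the covered problem branch           */
        uint64_t loop_kill_pc;  /* loop iteration kill (block F in Figure 8)  */
        uint64_t slice_kill_pc; /* slice kill (block G in Figure 8)           */
        int      head, tail;
        bool     taken[PREDS_PER_BRANCH];
    } pred_queue_t;

    /* Called for every instruction fetched by the main thread. A loop iteration
     * kill retires the oldest outstanding prediction; a slice kill discards all
     * remaining predictions for this slice instance. */
    void observe_fetch(pred_queue_t *q, uint64_t pc) {
        if (pc == q->loop_kill_pc && q->head != q->tail)
            q->head++;            /* this iteration's prediction can no longer be used */
        else if (pc == q->slice_kill_pc)
            q->head = q->tail;    /* the loop has exited; drop everything */
    }

    /* The fetched problem branch uses (but does not deallocate) the head prediction. */
    bool lookup(const pred_queue_t *q, uint64_t pc, bool default_pred) {
        if (pc == q->branch_pc && q->head != q->tail)
            return q->taken[q->head % PREDS_PER_BRANCH];
        return default_pred;
    }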
5.2 Mis-speculation Recovery

Modern processors employ speculation in many respects, often squashing and re-fetching instructions to recover from mis-speculation. In order to properly correlate predictions, it is necessary to undo any actions performed on the correlator by the squashed instructions (much like branch history must be restored). The obvious solution is to restore predictions if the killing instruction gets squashed. Thus, predictions are not deallocated until the kill instruction retires, but merely marked as killed and ignored for future correlations. We store the Von Neumann number (VN#, a sequence number used for ordering instructions) of the kill instruction with the prediction. If the kill VN# is in the range of squashed VN#s, we clear the kill bit, restoring the prediction.

5.3 Late Predictions

Late predictions can cause mis-correlation if the prediction arrives after the associated kill. Two factors affect the latency of generating a prediction: (1) the pipeline depth, because the prediction generating instruction (PGI) must reach the execute stage before the prediction is computed, and (2) the execution latency of the slice (recall that problem branches often depend on problem loads). Because not all slices are forked sufficiently early, our mechanism gracefully supports late predictions. To avoid mis-correlation, we allocate a prediction in the queue when the PGI is fetched, rather than executed, because it is easy to ensure that the PGI is fetched before its kill. This entry is initialized to the Empty state and changed to the Full state when the PGI executes. If an empty entry is matched to a branch, the traditional predictor is used. Kills behave the same whether the entry is Empty or Full.

Late predictions can also be used for early "resolution," because a late prediction may still arrive significantly in advance of when the problem branch is resolved. If a prediction in the Empty state is matched to a problem branch, it transitions to the Late state, and the VN# of the branch and the prediction it used are stored in the prediction entry. Later, when the PGI is executed, if the computed branch outcome does not match the stored prediction and the branch has yet to be resolved, the prediction is reversed and fetch is redirected. Because speculative slices are not necessarily correct, extra squashes can be introduced. These rare occurrences are corrected when the branch is resolved.

5.4 Prediction Correlation Hardware

Given prediction entries that support conditionally executed branches, mis-speculation recovery, and late predictions, it is straightforward to construct an accurate prediction correlator. Figure 10 shows one possible implementation of such a correlator. More efficient implementations, as well as those that support recursion, are possible, but they cannot be covered here due to space considerations. Also, much of the CAM circuitry can be avoided by putting explicit kill annotations into the program, as was described for fork instructions in Section 4.2.

Figure 10. Prediction correlation hardware. A single entry of the branch queue contains a valid bit (valid), the PC of the branch in the main thread (branch PC), the PCs of loop kill and slice kill instructions in the main program (loop PC and kill PC), and state for 8 predictions. Each prediction has a state (one of Invalid, Empty, Full, and Late), the direction of the prediction (T/NT), the Von Neumann number (VN#) of the consuming instruction (consumer #) if the prediction is late, the VN# of the instruction that killed this prediction (killer #), and whether the prediction has been killed. The table contains about 1KB of state. (The branch queue holds 64 branches with 8 predictions per branch.)
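The per-prediction state described above (Invalid, Empty, Full, Late) can be summarized as a small state machine. The following C sketch is our reading of the transitions in Sections 5.2-5.4, simplified (squash recovery is omitted) and with hypothetical type and function names:

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { PRED_INVALID, PRED_EMPTY, PRED_FULL, PRED_LATE } pred_state_t;

    typedef struct {
        pred_state_t state;
        bool     taken;        /* T/NT, valid once the PGI has executed           */
        uint64_t consumer_vn;  /* VN# of the branch that used a late prediction   */
        uint64_t killer_vn;    /* VN# of the kill instruction                     */
        bool     killed;       /* marked (not deallocated) until the kill retires */
    } prediction_t;

    void redirect_fetch(uint64_t branch_vn, bool outcome);  /* hypothetical hook */

    /* PGI fetched: allocate the entry early so it is ordered before its kill. */
    void on_pgi_fetch(prediction_t *p) { p->state = PRED_EMPTY; p->killed = false; }

    /* PGI executed: the outcome becomes available. If a branch already consumed
     * this entry while it was Empty (Late state), a mismatch redirects fetch. */
    void on_pgi_execute(prediction_t *p, bool outcome) {
        if (p->state == PRED_LATE && p->taken != outcome)
            redirect_fetch(p->consumer_vn, outcome);
        if (p->state == PRED_EMPTY || p->state == PRED_LATE)
            p->taken = outcome;
        if (p->state == PRED_EMPTY)
            p->state = PRED_FULL;
    }

    /* Problem branch fetched and matched to this entry. */
    bool on_branch_fetch(prediction_t *p, uint64_t branch_vn, bool default_pred) {
        if (p->state == PRED_FULL)
            return p->taken;               /* slice-supplied prediction           */
        if (p->state == PRED_EMPTY) {      /* late: remember who used the default */
            p->state = PRED_LATE;
            p->consumer_vn = branch_vn;
            p->taken = default_pred;
        }
        return default_pred;
    }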
6 Performance Results

Significant performance benefits can be achieved through speculative slice pre-execution. Because the slices in this work have been constructed manually, we have only generated slices for a portion of the problem instructions in each benchmark. We have profiled runs of the SPEC2000 integer benchmarks and have selected a 100 million instruction region of the execution in the dominant phase of each simulation (footnote 2). We have identified the problem instructions in this region and constructed slices to cover some of them. We warm up the caches and branch predictors by running 100 million instructions. For these experiments we assume that only a single program is running; hence, all remaining thread contexts are idle. If multiple applications are running, higher throughput will likely be achieved by using the thread contexts to run those applications instead of slices.

Footnote 2: Because the data shown here is only for a sample of the program's execution, it cannot be compared directly with data presented in Section 2.

We present results for three experiments: 1) a base case with the microarchitecture described in Section 2.1, 2) the base case augmented with speculative slices (on a 4-thread SMT), and 3) a constrained limit study to understand how effectively the slices are achieving their goals. This constrained limit study is performed by "magically" avoiding the PDEs from the problem instructions covered by our slices.

The speedups of the slice and limit cases, relative to the base case, are shown in Figure 11 for a 4-wide machine (the 8-wide results, omitted for space, are similar). Supplemental data for the benchmarks with non-trivial speedups is provided in Table 4; the others are discussed in Section 6.2. Substantial speedups are achieved (note that many of these programs are within 30-50% of peak throughput, as shown in Figure 1) by reducing the number of mispredictions and cache misses (by up to 72% and 64%, respectively). For most benchmarks, the speedup is largely derived from avoiding mispredictions; only gap, mcf, and vpr get a majority of their benefit from prefetching. The speedups are on the order of half of the speedup of the limit case.

Figure 11. Speedup achieved by the slice-assisted execution and a limit study that perfects the same set of problem instructions. Data shown for a 4-wide machine. (The chart shows slice and limit bars for bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vortex, and vpr.)

Table 4. Characterization of program execution with and without speculative slices for benchmarks with non-trivial speedups. Each benchmark was run for 100 million instructions. (Columns are bzip2, eon, gap, gzip, mcf, perl, twolf, and vpr. Row groups report, for the base case: program instructions fetched (M), branch mispredictions (M), and load misses (M); for the base plus slices: program instructions fetched (M), slice instructions fetched (M), slice instructions "retired" (M), fork points (K), fork points squashed (K), and fork points ignored (K); for covered problem branches: predictions generated (K), mispredictions "covered" (K), mispredictions removed (K), total mispredictions removed (%), incorrect predictions (K), and late predictions (%); and for covered problem loads: "prefetches" performed (K), cache misses "covered" (K), net reduction in misses (K), net reduction in total misses (%), and the fraction of the speedup derived from loads.)
6.1 Analysis

Achieving the speedup attained by the limit study necessitates that the predictions and prefetches are both accurate and timely. Accuracy is seldom a problem; our slices and prediction correlation mechanism exceed 99% prediction accuracy when they override the traditional predictor (4 of the 8 benchmarks in Table 4 have no induced mispredictions). On the other hand, forking a slice sufficiently early to tolerate the full latency of a misprediction or cache miss can be problematic (as discussed in Section 3.2). Around a quarter of the slices constructed are consistently late, reducing their benefit. One such example is prefetching the tree for mcf; the work performed at each node is insufficient to cover the latency of the sequential memory accesses.

Furthermore, executing slices is not free; there is the opportunity cost of stealing resources (mostly instruction fetch opportunities) from the main thread that contributes to the performance discrepancy in Figure 11. This overhead depends not only on the number of dynamic instructions executed by a slice, but also on the cost of stealing an instruction opportunity from the main thread. If the main thread has insufficient ILP, is stalled on an untolerated cache miss, or is fetching down a false path, the cost of executing the slice can be small. On the other hand, if the main thread is efficiently utilizing the processor's resources (indicated by a high base IPC), executing a slice can be expensive, making many potential slices unprofitable. Overhead is aggravated because not every instance of a problem instruction will miss or be mispredicted. Although many problem loads consistently miss in the cache, branch predictors are good enough that even problem branches are predicted 70-85% correctly. Overhead is consumed whenever a slice is executed, but benefit is derived only when a PDE is avoided; this magnifies the overhead by the inverse of the miss/misprediction rate.

As can be seen in Table 4, the amount of overhead can be substantial. As much as 10-15% of the instructions fetched belong to slices, but in all cases the total number of instructions fetched is reduced. Experiments (data not shown) estimate that this execution overhead represents about half of the discrepancy between our results and the limit study. Some of the fetched slice instructions fail to "retire," generally because the associated fork point was squashed. Fetch bandwidth wasted on squashed slices is not a large concern because it only reduces the number of wrong-path instructions fetched by the main thread. Only two programs, twolf and vpr, ignore fork requests on a machine with 3 idle helper threads, but most programs benefit from having more than one idle thread. Often there is one long-running, "background" slice and a number of periodic, localized slices.

6.2 Slice Construction Failures

For the selected phase of execution, we were unable to achieve significant speedups on 3 of the benchmarks considered: gcc, parser, and vortex (footnote 3). These benchmarks provide examples of difficulties in slice construction.

Footnote 3: We also did not significantly improve crafty's performance. Many of crafty's problem instructions occur in two functions, FirstOne and LastOne, which find the first and last set bits in 64-bit integers, respectively. Because the Alpha architecture has instructions for these functions, we did not bother optimizing them.

Many problem branches in gcc are in functions that process rtx structures. Typically these functions have a switch statement based on the node type, and they recursively descend the tree-like rtx on a subset of the cases. This switch statement is frequently a problem branch. The unpredictability of the traversal, coupled with the fact that computing the traversal order is a substantial fraction of these functions, makes generating profitable slices difficult.

parser has two main problem localities: a hash table and memory deallocation. Key generation for the hash table is computationally intensive, over 50 instructions, and it occurs right before the problem instructions. These instructions would be replicated, causing substantial slice overhead. The memory allocator is organized as a stack, and much of the work of deallocation is deferred until the deallocated chunk becomes the top of the stack. Thus, when the top-of-stack element is deallocated, a long cascade of deallocations is performed. To prefetch the sequential accesses, the fork point must be hoisted up significantly, but the unpredictability of which call to xfree will cause the cascade obstructs hoisting without causing many useless slices.

The problems with vortex are more mundane. Its base IPC is high, within 13% of peak throughput for our 4-wide machine, which makes the opportunity cost of slice execution high. This is aggravated by low miss/misprediction rates for some problem instructions. In some cases, notably parser, minor source-level restructuring could likely enable these slices to be profitable.
6.3 Summary

Execution-based prediction using speculative slices appears to be very promising. By adding only a small amount of hardware (a few kilobytes of storage), we were able to significantly improve performance on an already aggressive machine for many of the benchmarks studied. In considering other possible implementations, we can qualitatively reason about some aspects of performance:

• The speculative optimizations applied to slices have a two-fold benefit: overhead is reduced by reducing slice size, and timeliness is improved by reducing the critical path through the slice.

• Programs and processors with low base IPCs (relative to peak IPC) are more likely to benefit from slices because the opportunity cost of slice execution is lower.

• Overhead can be reduced by not executing slices for problem instructions that will not miss or mispredict. Some of our slices do this statically; if the problem instructions behave differently in different calling contexts and the fork point is hoisted into the calling procedure, only the profitable contexts fork slices. Obvious future work is gating the fork using confidence [8].

• Execution overhead could be eliminated by having dedicated resources to execute the slice, at the expense of additional hardware. In addition, existing slices may be improved and additional slices may be profitable without resource constraints.

7 Related Work

Besides our previously mentioned research to understand backward slices [19], the closest related work is speculative data-driven multithreading (DDMT) [13]. In that work, Roth and Sohi describe a multithreaded processor that can fork critical computations (much like our slices) to tolerate memory latency and resolve branches early. The differences between the DDMT work and this research are largely derived from the former being implemented around a technique called register integration [12]. Integration allows the results of speculatively executed instructions to be incorporated into the main thread when a data-flow comparison determines that they exactly match instructions renamed by the main thread. In this way, mispredicted branches that were pre-executed are resolved at the rename stage (as opposed to at fetch, as with our control-based scheme), and the main thread avoids the latency of re-executing instructions that have been pre-executed. This data-flow comparison requires a one-to-one correspondence in the computations, precluding optimization of slices. Furthermore, in this work we consider pre-executing computations that contain control flow (loops and if-converted general-purpose control flow).

Three earlier works provide more restricted implementations of pre-execution, including support for linked data structures [10], virtual function call target computation [11], and conditional branches in "local loops" [7]. Farcy, et al. describe a simple prediction correlator that supports mis-speculation and late predictions, but not conditionally executed branches [7]. The prediction correlation technique proposed by Roth, et al. [11] is, in some sense, the complement of the one described here; it uses the path through the program to attempt to determine when a prediction should be used, while we use the path to invalidate predictions.

Sundaramoorthy, et al. propose a different approach to producing a reduced version of the program with the potential to execute instructions early [15]. Their slipstream architecture executes a program on two cores of a chip multiprocessor (CMP), speculatively removing "ineffectual" computations from the first execution (the A-stream) and verifying the speculation with the second execution (the R-stream). Prediction correlation between the executions is performed by the A-stream supplying a complete "trace" of predictions to the R-stream. If the R-stream detects an incorrect prediction in the trace, the trace and the A-stream execution are squashed.

The uneven distribution of performance degrading events has previously been observed. Abraham, et al. [1] quantified the concentration of cache misses by static instruction in SPEC89 benchmarks. Mowry and Luk [9] found that path and context could be used to predict which dynamic instances of instructions were likely to cache miss. Jacobsen, et al. [8] characterized confidence mechanisms to identify the static branches, as well as the particular dynamic instances of those branches, likely to be mispredicted. SSMT [3] and Assisted Execution [14] propose the concept of helper threads (i.e., using idle threads in a multithreaded machine to improve single-thread performance by executing auxiliary code), and explore helper thread implementations of local branch prediction and stride prefetching, respectively.
7 Related Work

Besides our previously mentioned research to understand backward slices [19], the closest related work is speculative data-driven multithreading (DDMT) [13]. In that work, Roth and Sohi describe a multithreaded processor that can fork critical computations (much like our slices) to tolerate memory latency and resolve branches early. The differences between the DDMT work and this research largely derive from the former being implemented around a technique called register integration [12]. Integration allows the results of speculatively executed instructions to be incorporated into the main thread when a data-flow comparison determines that they exactly match instructions renamed by the main thread. In this way, mispredicted branches that were pre-executed are resolved at the rename stage (as opposed to at fetch, as with our control-based scheme), and the main thread avoids the latency of re-executing instructions that have been pre-executed. This data-flow comparison requires a one-to-one correspondence between the computations, precluding optimization of slices. Furthermore, in this work we consider pre-executing computations that contain control flow (loops and if-converted general-purpose control flow).

Three earlier works provide more restricted implementations of pre-execution, including support for linked data structures [10], virtual function call target computation [11], and conditional branches in "local loops" [7]. Farcy, et al. describe a simple prediction correlator that supports mis-speculation and late predictions, but not conditionally executed branches [7]. The prediction correlation technique proposed by Roth, et al. [11] is, in some sense, the complement of the one described here; it uses the path through the program to attempt to determine when a prediction should be used, while we use the path to invalidate predictions.

Sundaramoorthy, et al. propose a different approach to producing a reduced version of the program with the potential to execute instructions early [15]. Their slipstream architecture executes a program on two cores of a chip multiprocessor (CMP), speculatively removing "ineffectual" computations from the first execution (the A-stream) and verifying the speculation with the second execution (the R-stream). Prediction correlation between the executions is performed by the A-stream supplying a complete "trace" of predictions to the R-stream. If the R-stream detects an incorrect prediction in the trace, the trace and the A-stream execution are squashed.

The uneven distribution of performance degrading events has previously been observed. Abraham, et al. [1] quantified the concentration of cache misses by static instruction in the SPEC89 benchmarks. Mowry and Luk [9] found that path and context could be used to predict which dynamic instances of instructions were likely to miss in the cache. Jacobsen, et al. [8] characterized confidence mechanisms to identify the static branches, as well as the particular dynamic instances of those branches, likely to be mispredicted.

SSMT [3] and Assisted Execution [14] propose the concept of helper threads (i.e., using idle threads in a multithreaded machine to improve single-thread performance by executing auxiliary code), and explore helper thread implementations of local branch prediction and stride prefetching, respectively.

8 Conclusion

We have shown that a small set of static instructions, which we call problem instructions, are responsible for a significant amount of lost performance. These instructions are not easily anticipated with existing caching, prefetching, and branch prediction mechanisms because their behavior does not exhibit any easily exploited regularity. Because the program correctly computes the outcome of all instructions, we can predict the behavior of problem instructions by building approximations of the program called speculative slices. These speculative slices are executed, in advance of when the problem instructions are encountered by the main thread, to perform memory prefetching and generate branch predictions. The effects of the slices are completely microarchitectural in nature, in no way affecting the architectural state (and hence correctness) of the program.

We have found that efficient and accurate speculative slices can be constructed for many problem instructions. With moderate hardware extensions to a multithreaded machine, these slices can provide a significant speedup (up to 43 percent). In order to benefit from branch predictions generated by speculative slices, we must correlate these predictions with problem branches as they are fetched by the program. The final contribution of this paper is a prediction correlation mechanism that uses the path taken by the program to kill predictions when they can no longer be used. This technique is accurate and can potentially be used to correlate other types of predictions (e.g., value predictions).
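To restate the correlation idea in software terms, the sketch below is a highly simplified model of path-based invalidation, assuming each slice-generated prediction is tagged with the PC of its problem branch and with a small set of "kill" PCs lying on paths that bypass that branch; these structures and names are illustrative assumptions and do not reproduce the hardware design described earlier.

/* Toy model of path-based prediction invalidation (illustrative only).
 * A slice deposits a prediction tagged with the problem branch's PC
 * and with "kill" PCs: blocks that, if fetched first, imply this
 * prediction instance will never be consumed. */
#include <stdint.h>
#include <stdbool.h>

#define MAX_KILL_PCS 4

typedef struct {
    uint64_t branch_pc;                 /* branch this prediction is for  */
    bool     taken;                     /* the predicted outcome          */
    uint64_t kill_pc[MAX_KILL_PCS];     /* fetching any of these kills it */
    int      n_kill;
    bool     valid;
} slice_prediction;

/* Called for every block fetched by the main thread. Returns true and
 * consumes the prediction if this fetch reaches the predicted branch;
 * invalidates it if the fetch path shows it can no longer be used. */
bool consume_or_kill(slice_prediction *p, uint64_t fetch_pc, bool *taken_out)
{
    if (!p->valid)
        return false;

    for (int i = 0; i < p->n_kill; i++) {
        if (fetch_pc == p->kill_pc[i]) {
            p->valid = false;           /* path diverged: kill prediction */
            return false;
        }
    }
    if (fetch_pc == p->branch_pc) {
        *taken_out = p->taken;          /* bind prediction to this branch */
        p->valid = false;               /* each prediction is used once   */
        return true;
    }
    return false;                       /* neither consumed nor killed    */
}

In this model, a prediction that would otherwise bind to the wrong dynamic branch instance is discarded as soon as the fetch path proves it stale, which is what allows the remaining predictions to be correlated correctly.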
9 Acknowledgements

The authors would like to thank Ras Bodik, Adam Butts, Charles Fischer, Milo Martin, Manoj Plakal, Dan Sorin, and Amir Roth for commenting on drafts of this paper. This work was supported in part by National Science Foundation grants CCR-9900584 and EIA-0071924, donations from Intel and Sun Microsystems, and the University of Wisconsin Graduate School. Craig Zilles was supported by a Wisconsin Distinguished Graduate Fellowship.

10 References

[1] S. Abraham, R. Sugumar, D. Windheiser, B. Rau, and R. Gupta. Predictability of Load/Store Instruction Latencies. In Proc. 26th International Symposium on Microarchitecture, pages 139-152, Dec. 1993.
[2] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report CS-TR-97-1342, University of Wisconsin-Madison, Jun. 1997.
[3] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Subordinate Microthreading (SSMT). In Proc. 26th International Symposium on Computer Architecture, May 1999.
[4] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. The MIT Press, 1990.
[5] K. Driesen and U. Hoelzle. The Cascaded Predictor: Economical and Adaptive Branch Target Prediction. In Proc. 31st International Symposium on Microarchitecture, pages 249-258, Dec. 1998.
[6] A. Eden and T. Mudge. The YAGS Branch Prediction Scheme. In Proc. 31st International Symposium on Microarchitecture, pages 69-77, Nov. 1998.
[7] A. Farcy, O. Temam, R. Espasa, and T. Juan. Dataflow Analysis of Branch Mispredictions and Its Application to Early Resolution of Branch Outcomes. In Proc. 31st International Symposium on Microarchitecture, pages 59-68, Dec. 1998.
[8] E. Jacobsen, E. Rotenberg, and J. Smith. Assigning Confidence to Conditional Branch Predictions. In Proc. 29th International Symposium on Microarchitecture, Dec. 1996.
[9] T. Mowry and C.-K. Luk. Predicting Data Cache Misses in Non-Numeric Applications Through Correlation Profiling. In Proc. 30th International Symposium on Microarchitecture, pages 314-320, Dec. 1997.
[10] A. Roth, A. Moshovos, and G. Sohi. Dependence Based Prefetching for Linked Data Structures. In Proc. 8th Conference on Architectural Support for Programming Languages and Operating Systems, pages 115-126, Oct. 1998.
[11] A. Roth, A. Moshovos, and G. Sohi. Improving Virtual Function Call Target Prediction via Dependence-Based Pre-Computation. In Proc. 1999 International Conference on Supercomputing, pages 356-364, Jun. 1999.
[12] A. Roth and G. Sohi. Register Integration: A Simple and Efficient Implementation of Squash Reuse. In Proc. 33rd International Symposium on Microarchitecture, Dec. 2000.
[13] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proc. 7th International Symposium on High Performance Computer Architecture, Jan. 2001.
[14] Y. Song and M. Dubois. Assisted Execution. Technical Report CENG 98-25, Department of EE-Systems, University of Southern California, Oct. 1998.
[15] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream Processors: Improving both Performance and Fault Tolerance. In Proc. 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Nov. 2000.
[16] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proc. 23rd International Symposium on Computer Architecture, pages 191-202, May 1996.
[17] S. Wallace, B. Calder, and D. Tullsen. Threaded Multiple Path Execution. In Proc. 25th International Symposium on Computer Architecture, pages 238-249, Jun. 1998.
[18] M. Weiser. Program Slicing. IEEE Transactions on Software Engineering, 10(4):352-357, 1984.
[19] C. Zilles and G. Sohi. Understanding the Backwards Slices of Performance Degrading Instructions. In Proc. 27th International Symposium on Computer Architecture, Jun. 2000.