Reducing the Scheduling Critical Cycle using Wakeup Prediction

Todd E. Ehrhart and Sanjay J. Patel
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
{ehrhart,sjp}@crhc.uiuc.edu

Abstract

For highest performance, a modern microprocessor must be able to determine whether an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This observation is used to build a machine in which wakeup times are predicted and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To address this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or with the global state, and is dynamically adjusted by a gradient-descent minimum-searching technique. When processor load is low, prediction may be more aggressive – increasing the chance of replays, but also increasing performance – so the aggressiveness of the predictor is dynamically adjusted using processor load as a feedback parameter.

1. Introduction

An instruction is scheduled in a conventional superscalar microprocessor by projecting the times at which the instruction's source operands will be ready. In the case of a fixed-latency source instruction, this time is known as soon as the source instructions are scheduled. For other instructions, this time is not known. Since these variable-length delays propagate through dependency chains, such instructions are a source of variability which manifests itself throughout program execution. The presence of such instructions means that a superscalar processor must have a means of scheduling instructions which does not depend on knowing operand ready times a priori.

In many current processors, instruction scheduling is performed dynamically through the use of wakeup and select. Wakeup is the process by which instructions with ready operands are discovered. Selection is performed by selecting for execution a set of n instructions among all ready instructions within the scheduling window. Instructions that are selected for execution in one cycle inform dependent instructions to wake up in subsequent cycles. If the execution latency of an instruction is one cycle, then the wakeup and select logic loop should nominally take one cycle also, thereby creating a very tight critical path. Increasing clock frequency aggravates the design constraints in the scheduler. This critical execution loop cannot be pipelined without significant impact on overall instruction-level parallelism.

This paper proposes the idea of eliminating the conventional wakeup logic altogether and replacing it with a system in which the wakeup time of each instruction is predicted and the instruction speculatively wakes itself up when the time expires. The prediction occurs in parallel with fetch, and may be pipelined with as many stages as the fetch unit itself. If an instruction is executed too early, the common instruction replay mechanism is used to correct it. This system has the advantage of eliminating the feedback mechanism for operand readiness required by typical wakeup logic.
In addition, the predicted wakeup time for an instruction is always known at least one cycle before the speculative wakeup would actually occur. This means the select logic can be built to anticipate future scheduling needs, thereby further reducing the critical cycle in the system.

A potential drawback of this system is that it can significantly increase the number of replays that occur during the course of program execution. Replays, when they occur, increase the load on the processor and can increase the time until a replayed instruction finally produces its result. This problem is most pronounced when the processor is heavily loaded and replayed instructions are interfering with the execution of instructions that are non-speculatively ready. In order to alleviate this problem, the predictive system presented here adds an allowance to the predicted wakeup time to reduce the probability of a replay. This allowance may be associated with individual static instructions, or with the overall machine state. In either case, it is dynamically adjusted using a gradient-descent minimum-searching technique that attempts to find the minimum-cost balance (in terms of performance) between executing an instruction sooner – and increasing the chances of a replay – or executing it later – and possibly losing the opportunity to execute sooner.

The system presented here uses two adjustable parameters: the rate of averaging µ, and the estimated cost of a replay r. The parameter r may be adjusted to make prediction more or less aggressive, where "more aggressive" means that the system will tend to speculatively execute instructions sooner. When processor load is low, it makes sense to use more aggressive prediction, because it only fills instruction slots that would otherwise go to waste. When processor load is high, it makes sense to use less aggressive prediction, because instructions destined to be replayed consume instruction slots that could have been filled by useful instructions. Also, the avalanche effect of replays in a heavily loaded processor gives another reason to use less aggressive prediction in this scenario. With this in mind, the system used in this paper employs a second layer of feedback to adjust r according to the measured load on the processor and the ratio of replayed to ready instructions.

2. Previous Work

Palacharla, et al. [1] studied the delays of the various parts of the rename, wakeup, select and bypass logic, and are widely credited with identifying the critical loop involving the wakeup and select logic. They studied the sensitivity of the delays to changes in technology and instruction window size. They also proposed a solution to the critical cycle: a set of FIFO structures feeds the select logic. An instruction is steered into the FIFO already occupied by an instruction upon which it depends. Each cycle, the instructions at the head of each FIFO are executed if they are ready. This reduces the number of instructions that need to be checked for wakeup.

The scheduling critical cycle was also addressed by Canal and González [2, 3]. Their first solution dispatches instructions to two different buffers, depending on the usage characteristics of their source operands. If all non-ready operands represent the first use of those values, the instruction is dispatched to a table indexed by physical register number. If any non-ready operand represents a second or later use of that value, it is dispatched into a content-addressable buffer. The net effect is to reduce the usage of content-addressable memory to instructions that are the second (or later) users of values. The second solution proposed by Canal and González involves generating a VLIW-style schedule a few cycles in advance of execution. Incoming instructions are scheduled into the earliest available instruction slot after their operands are projected to be ready. Projecting ready times for operands generated by loads is done by assuming the operand would be ready a cycle after the load was executed. Ernst, et al. [4] use a similar approach, but explicitly account for memory dependence and use a switchback countdown queue as a cheaper way of constructing the VLIW schedule.

Michaud and Seznec [5] also attempt to schedule instructions before they are ready to execute. In their model, the scheduler assumes that the processor has infinite execution resources and schedules accordingly. The scheduled instructions are placed in a FIFO so that they are delivered to the wakeup logic in dataflow order.
This modification allows a smaller wakeup buffer to be used to achieve a similar throughput.

Like the approach in this paper, Stark, et al. [6] use a speculative wakeup approach. The technique allows the wakeup logic to be implemented as a two-stage pipeline by assuming that an instruction will be ready two cycles after all of its grandparent instructions are ready. If this assumption is wrong for an instruction, the instruction is replayed. The same authors also introduce a technique they call "select-free scheduling" [7] which addresses the select portion of the critical cycle. When an instruction wakes up, it is sent to the select logic; if there are too many instructions that wake up in a cycle, the scheduling logic sends some back to be rescheduled. This allows the select logic to be pipelined with minimal performance penalty. Instructions which are incorrectly scheduled because they depend on unschedulable instructions are caught when they access the register file.

3. Minimum-Searching

[Figure 1. Wakeup time deviation. Histogram of dynamic wakeup times expressed as the difference from the per-static-instruction mode; x-axis: wakeup time (cycles) as difference from mode of static instruction, y-axis: frequency (%).]

The motivation for the technique used in this paper is found in Figure 1. The figure shows the distribution of differences between the actual wakeup times of dynamic instructions and the modes of wakeup times for all instances of the corresponding static instructions. That is, this figure may be thought of as being produced by generating a distribution of wakeup times of dynamic instances of each static instruction, subtracting the mode from each distribution, and combining the results. For example, if the most common wakeup time for dynamic instances of a static instruction is 6 cycles, and a particular dynamic instance wakes up after 7 cycles, it contributes to the frequency of the value 7-6=1 in Figure 1. As can be seen from the figure, about 66% of dynamic instructions have wakeup times identical to the most common wakeup times for their corresponding static instructions. This suggests that the wakeup time of a dynamic instruction is highly predictable if the mode of wakeup times for the static instruction can be estimated.
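The per-static-instruction mode statistic behind Figure 1 can be tabulated offline from a simulation trace. The following is a minimal sketch, assuming a hypothetical trace of (static PC, observed wakeup time) pairs; the trace format and the example values are illustrative and are not part of the simulator described in Section 6.

    from collections import Counter, defaultdict

    def wakeup_deviation_histogram(trace):
        """trace: iterable of (static_pc, wakeup_cycles) pairs (hypothetical format).
        Returns the histogram of (observed wakeup - per-PC mode), as in Figure 1."""
        per_pc = defaultdict(Counter)
        samples = []
        for pc, wakeup in trace:
            per_pc[pc][wakeup] += 1          # per-static-instruction distribution
            samples.append((pc, wakeup))
        # Mode of each static instruction's wakeup-time distribution.
        mode = {pc: counts.most_common(1)[0][0] for pc, counts in per_pc.items()}
        # Deviation of each dynamic instance from its static mode.
        hist = Counter(wakeup - mode[pc] for pc, wakeup in samples)
        total = sum(hist.values())
        return {dev: 100.0 * n / total for dev, n in sorted(hist.items())}

    # Example: three instances of PC 0x10 with mode 6 and one late outlier.
    print(wakeup_deviation_histogram([(0x10, 6), (0x10, 6), (0x10, 7), (0x20, 3)]))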
However, there is a complication. If an instruction is predicted to have too short a wakeup time, it will be replayed. If the predictions for dependent instructions are also based on observations of the source instruction executing earlier, then the dependent instructions will also be predicted to have wakeup times that are too short. In this way, the avalanche of replays that occurs in dependence-based scheduling can occur in a predicted system also. Even if the avalanche is avoided, the replayed instruction still might have consumed an instruction slot which could have been occupied by a useful instruction, or it may have lost the opportunity to execute earlier if it was recovering from a replay when it became ready. So, while avalanches are not guaranteed in a predicted system, the cost of a replay can still be high. Also, as can be seen from the tails in Figure 1, an instruction is more likely to be ready later than a mode-based prediction would anticipate than it is to be ready earlier. This exacerbates the replay cost problem, making underpredictions a concern.

[Figure 2. Costs involved with the wakeup time distribution. The probability density function g(x) of the actual wakeup time x, with the mode m and the allowance d marked; the area to the left of the predicted wakeup m+d represents lost-opportunity cost, and the area to the right represents replay cost.]

The discussion above suggests that it should be beneficial to take replay cost into account when designing a predictor. The strategy taken in this paper is to add an allowance to the predicted wakeup time to reduce the chances of a replay. As a starting point for deriving a predictor which uses this strategy, Figure 2 shows a representation of the distribution of wakeup times for a static instruction, along with regions representing the major costs. m represents the mode of the distribution and d represents the allowance; m+d is the cost-adjusted predicted wakeup time. The curve g(x) represents the probability density function of the actual wakeup time. The shaded area on the left represents the probabilistic cost of lost opportunity resulting from an overprediction of the wakeup time. The shaded area on the right represents the probabilistic cost of a replay resulting from an underprediction.

Let f(x) = g(x+m), i.e., let f(x) be the same distribution shifted to put the mode at x = 0. If the assumption is made that this discrete distribution corresponds to an underlying continuous, differentiable distribution, it is easy to write a cost function for any particular value of d:

\phi(d) = \int_{-\infty}^{d} f(x) k(d - x) dx + \int_{d}^{\infty} f(x) r(x - d) dx

Here f(x) k(d-x) is the lost-opportunity cost density and f(x) r(x-d) is the replay cost density, so the first term is the expected lost-opportunity cost and the second is the expected replay cost. The objective is to find a value of d that minimizes \phi(d). The problem is that the lost opportunity cost is difficult to know when the prediction is being made, and the cost of any particular replay is unobservable during execution. However, there are some reasonable assumptions that can be made.

For lost opportunity cost – k(d-x) – knowledge of instruction slack is required. This is unknown when the predicted wakeup time needs to be generated, because it depends on subsequent instructions. It is also difficult to measure and store for future reference. It can be conservatively assumed that slack is 0. This implies that the delay cost is one cycle for every cycle instruction wakeup is delayed. Under this assumption, k(d-x) = d-x. This assumption is false, but a dynamic feedback mechanism introduced later will be used to correct for it.
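To make the trade-off concrete, the expected cost can be evaluated numerically over a sampled deviation distribution under the slack-is-zero assumption k(d-x) = d-x. This is a minimal sketch, not part of the proposed hardware; the example distribution and the constant replay cost passed in (the simplification adopted below) are made up for illustration.

    def expected_cost(f, d, replay_cost):
        """phi(d) under the assumption k(d-x) = d-x.
        f: dict mapping wakeup deviation x (cycles from the mode) to probability.
        d: allowance added to the mode.  replay_cost(x - d): cost of an underprediction."""
        lost_opportunity = sum(p * (d - x) for x, p in f.items() if x <= d)
        replay = sum(p * replay_cost(x - d) for x, p in f.items() if x > d)
        return lost_opportunity + replay

    # Illustrative deviation distribution (mode at 0, heavier right tail as in Figure 1).
    f = {-2: 0.02, -1: 0.07, 0: 0.66, 1: 0.12, 2: 0.07, 3: 0.04, 4: 0.02}
    constant_replay = lambda excess: 10.0     # assumed flat 10-cycle replay penalty
    best_d = min(range(6), key=lambda d: expected_cost(f, d, constant_replay))
    print(best_d, expected_cost(f, best_d, constant_replay))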
Discussion of replay cost deserves some mention as to how the replay condition will be detected. In order to determine if an instruction was issued too early, the processor must detect whether the parent instructions have produced values. It seems that the simplest way of doing this is to use the "ready" bit mechanism used by speculative scheduling of instructions dependent on loads (the inverse of the "poison" bit in [8]). Instead of clearing this bit based on a cache miss, it can be cleared as soon as the physical register is assigned to an architectural register. This works because there can only be one value associated with a particular physical register in the machine at one time. So, detection will occur when the mis-issued instruction accesses the register file and fails to see the ready bit in one or more of its operands.

The replay cost – r(x-d) – is the penalty (in cycles) incurred when a replay occurs. Not only is this difficult to determine in advance, it is also difficult to define. The fundamental question is: if an instruction must be replayed, which instruction – by attempting execution too early – caused the replay? For example, blame could be placed on a replayed instruction in the past that consumed an instruction slot which caused a parent instruction to execute later than anticipated, or on other instructions in a variety of other scenarios. Alternately, blame could be placed on the replaying instruction itself, with the rationale that, regardless of the cause, it should have executed later. Using the second option simplifies the implementation and the analysis. In particular, it makes reasonable the assumption that the replay cost is independent of x-d. The placeholder R will be used to represent this cost. With these assumptions made, the cost function is now:

\phi(d) = \int_{-\infty}^{d} f(x) (d - x) dx + R \int_{d}^{\infty} f(x) dx    (1)

The real objective is to find a value of d that minimizes this function. The derivatives of the cost function follow (F denotes the cumulative distribution function corresponding to f):

\frac{d\phi(d)}{dd} = F(d) - R f(d)

\frac{d^{2}\phi(d)}{dd^{2}} = f(d) - R f'(d)

So, at the minimum, F(d) = R f(d) and f(d) > R f'(d). As long as R is positive, there will be at least one solution. Unfortunately, the solution cannot be found deterministically, because f(d) is not known. The method proposed here is to use an iterative minimum-searching technique based on the first derivative (i.e., a gradient-descent search). The formula for the first derivative above suggests the following update equation, in which d without a subscript denotes the newly observed deviation from the mode:

d_{i+1} = d_i + \mu (d - d_i + R)    (2)

By adding the mode from g(x) back in, using w = d + m, this can be written in terms of the total wakeup time:

w_{i+1} = w_i + \mu (w - w_i + R)    (3)

Here w is the newly observed wakeup time, w_i is the current estimate of the optimal wakeup time, and R is the replay cost blamed on this instruction (which will be zero if the instruction does not replay). µ is a constant that defines the rate of averaging of the iterative update. Conceptually, the update equation calculates the difference in cost-adjusted wakeup time between the old estimate and the current sample, weights it, and merges it with the running estimate. The difference consists of the new replay cost plus the error between the previous wakeup time estimate and the current sample. In this way, over the long run, the cost associated with replaying this instruction is averaged in with a weight roughly proportional to the frequency of replay. Over the short term, the estimate will drift down toward the mean at a rate specified by µ, making replays more and more likely. Eventually a replay will occur, increasing the estimate again. An appropriate value for µ should cause the estimate to move to a cost-optimal value.
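The update of Equation (3) amounts to an exponentially weighted average of observed wakeup times with the replay cost folded in whenever a replay is charged to the instruction. The following is a minimal sketch of that update in isolation; µ = 1/16 matches the configuration used in Section 6, while the replay cost and the sample sequence are illustrative only.

    def update_estimate(w_est, w_observed, replayed, mu=1.0 / 16, R=8.0):
        """One application of Equation (3): w_{i+1} = w_i + mu * (w - w_i + R).
        R is charged only when the instruction actually replayed (otherwise zero)."""
        charged = R if replayed else 0.0
        return w_est + mu * (w_observed - w_est + charged)

    # Drift-and-correct behaviour: the estimate sinks toward the typical wakeup time
    # until an underprediction causes a replay, which pushes the estimate back up.
    w = 8.0
    for w_obs, replayed in [(6, False), (6, False), (6, False), (9, True), (6, False)]:
        w = update_estimate(w, w_obs, replayed)
        print(round(w, 3))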
4. Prediction Architecture

[Figure 3. High-level architecture. Incoming instructions are sent to the register rename logic and the wakeup predictor; renamed instructions and their predictions enter the self-schedule array; selected instructions access the register file ("not-ready" tags) and proceed to the functional units and load unit, whose data and destination tags return to the register file; replay and wakeup time information is fed back to the wakeup predictor.]

To implement the predicted wakeup system, the operand-readiness feedback mechanism was removed from the typical superscalar architecture and was replaced with a prediction table and a self-schedule array as shown in Figure 3. In this system, the register file maintains a "ready" bit associated with each physical register. These bits are cleared when the register renamer assigns the register, and set when a functional unit delivers a value to that register. A new instruction is sent both to the renamer and the wakeup predictor. If the wakeup predictor cannot generate a prediction, a global default value is used. After renaming and prediction, the instruction payload is sent to the self-schedule array. The architecture of this array is shown in Figure 4.

The self-schedule array performs the wakeup and scheduling functions of the processor, and is the key to eliminating the critical cycle. The wakeup time predictions of instructions in the table are stored in a separate array. Each element in this array counts down by 1 each cycle until it reaches 1. When it does, it sends a signal to the select logic indicating that the corresponding instruction will wake up during the next cycle. The select logic chooses among the speculatively ready instructions and generates a grant vector. When the next cycle arrives, this grant vector is used to index the payload array and send the selected instructions to the register file for speculative execution.

For instructions with an initial wakeup prediction of zero cycles, there is a bypass path. When the select logic sees that an incoming instruction has a prediction of zero cycles, it may choose to select that instruction immediately. If the instruction is selected, it bypasses the payload array and goes directly to the register file. If the instruction is not selected, it goes into the array, signals its readiness and waits to be selected.

[Figure 4. Self-schedule array architecture. Incoming instructions and predictions enter a payload array and a countdown array; the countdown array drives a ready-next-cycle vector into the select logic, whose grant vector indexes the payload array; a bypass path allows zero-cycle predictions to skip the payload array, and selected instructions leave the array.]

Note that this design involves no critical feedback circuit in the scheduler, and the critical path is likely to be the time taken to write into the payload array. Even with this relatively short critical path, it still allows back-to-back execution of dependent instructions, immediate execution of urgent instructions, and can still implement any selection heuristics the designer feels are appropriate. More sophisticated selection heuristics may be employed by increasing the advance notice that the select logic gets from the counter array. This increases the complexity of the bypass path, but potentially allows for better scheduling. Also, the only one-stage cycle in the design is the loop used to update the counters. This is highly unlikely to be on the critical path, but even if it were, it does not need to be a critical cycle, because it could be replaced by a pair (even/odd) of two-stage counters.
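Before turning to further pipelining considerations, the countdown-and-select behaviour just described can be summarised in a few lines. This is a cycle-level software approximation for illustration only: the issue width, the dictionary representation, and the oldest-entry-first selection heuristic are assumptions, and the payload array contents, bypass path, and replay handling are omitted.

    ISSUE_WIDTH = 4  # hypothetical machine width for this sketch

    def schedule_cycle(counters, pending_grants):
        """One cycle of the self-schedule array (simplified).
        counters: dict of entry id -> remaining predicted wakeup time (cycles).
        pending_grants: entries granted last cycle; they issue this cycle.
        Returns (issued_this_cycle, grants_for_next_cycle)."""
        issued = pending_grants                  # grant vector indexes the payload array
        for entry in issued:
            del counters[entry]
        ready_next = []
        for entry in list(counters):
            if counters[entry] == 1:
                ready_next.append(entry)         # will wake up during the next cycle
            elif counters[entry] > 1:
                counters[entry] -= 1             # countdown toward wakeup
        # Select up to ISSUE_WIDTH speculatively ready entries (oldest id first here).
        grants = sorted(ready_next)[:ISSUE_WIDTH]
        for entry in grants:
            counters[entry] = 0                  # granted: stop signalling, issue next cycle
        return issued, grants

    # Example: three entries with wakeup predictions of 1, 2 and 3 cycles.
    counters, grants = {0: 1, 1: 2, 2: 3}, []
    for cycle in range(1, 6):
        issued, grants = schedule_cycle(counters, grants)
        print(cycle, issued, grants)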
In addition, the critical path in the payload array can be pipelined by breaking it into two arrays. This would require more sophisticated bypass logic, but it means that the situation where the self-schedule array is the critical path of the processor can be avoided.

After an instruction is selected, it proceeds to the register file and accesses its source registers. If all of an instruction's operands are ready, it proceeds to execution. If any source register is not marked as ready, then this instruction was executed too early and it is replayed. Replay involves sending the instruction back to the self-schedule array with a new wakeup time prediction. The scheme to generate new predictions is motivated by the basic observation of Figure 5.

[Figure 5. Expected remaining wakeup time. The expected remaining wakeup time (cycles) plotted as a function of elapsed wakeup time (cycles).]

The figure shows the expected remaining wakeup time as a function of elapsed wakeup time. The figure shows, for example, that if an instruction has spent two cycles waiting for wakeup, then it can expect to spend 6 more cycles before wakeup actually happens. The portion of the curve shown covers 99% of all dynamic instructions. The slope of this section is approximately 2, and motivates a strategy of doubling the previous prediction and using this as the new prediction. While the curve of Figure 5 cannot be expected to apply exactly to a predicted system, in practice, the exponential backoff method it implies proves to be effective in eliminating unnecessary replays.

[Figure 6. Wakeup predictor architecture. A tag/index lookup into a table of (valid bit, tag, wakeup time) entries with hit detection; the update unit, parameterized by µ and r, consumes the observed wakeup delay and replay count.]

When an instruction is finally sent for execution, the register file sends the instruction's actual wakeup time and the number of replays it experienced to the wakeup predictor, so that the predictor may be updated. The structure of the wakeup predictor is shown in Figure 6. Wakeup times are stored as offset-0.5 fixed-point numbers to eliminate the need for circuitry to perform rounding to the nearest integer. Only the integer portions of the wakeup prediction are sent to the self-schedule array. The update unit uses two parameters, µ and r, which represent the rate of averaging and the modeled cost of a single replay, respectively. The R in Equation (3) is implemented as r multiplied by the replay count. Note that the update unit can be as slow as needed, because it is not part of any critical cycle, and because the value it is updating is supposed to move toward a single value. Thus there is no urgency to ensure an update occurs before the next prediction is made.

The actual value of the wakeup time is determined by a further modification to the register file. Each physical register keeps a counter of the number of cycles that have passed since it became ready. When an instruction successfully accesses its source registers, the minimum of the counter values of the registers gives the number of cycles by which the actual wakeup exceeded the true wakeup time, adjusted for the amount of time spent waiting for selection. This can be used to reconstruct the true wakeup time to be used to adjust the predictor.
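Putting the pieces of this section together, the predictor's prediction and update paths, along with the exponential backoff applied on a replay, can be sketched as follows. This is a software illustration only: µ = 1/16 and r = 1 match the configuration used in Section 6, but the default prediction value, the one-cycle backoff floor, and the dict-based table (standing in for the 128-set, 4-way tagged structure) are assumptions; the offset-0.5 storage trick is modelled by adding 0.5 before truncation.

    class WakeupPredictor:
        """Software sketch of the wakeup predictor of Figure 6 (illustrative only)."""

        def __init__(self, mu=1.0 / 16, r=1.0, default=4.0):
            self.mu, self.r = mu, r
            self.default = default   # hypothetical default prediction used on a miss
            self.table = {}          # static PC -> estimated wakeup time (cycles)

        def predict(self, pc):
            est = self.table.get(pc, self.default)
            # Hardware stores the estimate in offset-0.5 fixed point so that taking
            # the integer portion rounds to the nearest integer; modelled explicitly.
            return int(est + 0.5)

        def update(self, pc, observed_wakeup, replay_count):
            # Equation (3), with R implemented as r multiplied by the replay count.
            est = self.table.get(pc, self.default)
            R = self.r * replay_count
            self.table[pc] = est + self.mu * (observed_wakeup - est + R)

    def backoff(previous_prediction):
        # Exponential backoff: on a replay, reissue with double the previous
        # prediction (a floor of one cycle is assumed here for illustration).
        return max(1, 2 * previous_prediction)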
An alternative, and somewhat cheaper, prediction scheme based on the same idea involves keeping the replay allowance as part of the global state. The rationale for this is based on the observation that when one replay happens, it is likely to be followed soon by other replays, especially in a heavily-loaded system. This suggests the strategy of using a more conventional predictor mechanism to find the mode of the wakeup times for dynamic instances of each static instruction. When a replay occurs (or a cycle passes with no replay), a global register is adjusted in the same way that instruction entries are updated in the previous prediction mechanism. When a prediction is made, the mode retrieved from the wakeup predictor is added to the integer portion of the global register to form the final cost-adjusted prediction. Figure 7 shows such a system.

[Figure 7. Global allowance wakeup predictor. A tagged table of (valid bit, tag, wakeup time, confidence) entries; the looked-up wakeup time is added to a global allowance register, which is adjusted by the (µ, r) update unit using the observed wakeup delay and replay count.]

Note also that for both predictors, there is a mechanism just like the global allowance feedback loop that generates a default prediction in the event of a miss in the main predictor.

5. Feedback-Based Adjustment

The architecture described in the last section relies on two parameters – µ and r – representing the rate of averaging and the cost of a replay, respectively. More precisely, r is the cost of a replay in excess of the cost of missed opportunity. While the best value for µ would not be expected to vary greatly from benchmark to benchmark, this is not so for r. For benchmarks with high instruction throughput (IPC), a replayed instruction is more likely to cause problems because it is consuming resources that could be consumed by a useful instruction. Conversely, for benchmarks with low throughput, a replayed instruction may be of no consequence, as it did not necessarily consume critical resources and did not become ready while recovering from a replay. Commonly used benchmark programs, as well as other applications commonly run on computer systems, vary considerably in their instruction throughput. So, one would expect that the best value for r might depend heavily on the application being run.

With this in mind, it makes sense to adjust r with instruction throughput. There is a complication, though: r affects instruction throughput. Decreasing r will, up to a point, increase the throughput of useful instructions, but decreasing r will almost always increase the number of useless (replayed) instructions in the system. In fact, if r is too low, useless instructions will start crowding out useful instructions – remember that a useless instruction cannot be differentiated from a useful instruction until an attempt at execution is made. If this happens, then the measured (useful) throughput will be artificially low, suggesting that r should be decreased further, which is clearly not good. On the other hand, if r is too high, useful instructions will be unnecessarily delayed.

The objective is to maximize useful throughput. This can be done by reducing r up to the point where crowding starts to occur. Crowding can be detected by monitoring the load (throughput divided by number of units) for each functional unit class. If any class has a total (useful plus useless) throughput near 100%, and the useless throughput is significant, then crowding is occurring. Note that a useless instruction never actually passes through any functional unit, but it may prevent another instruction from accessing the register file or being scheduled into a functional unit, and it therefore contributes to the load of the unit in which it would have executed. The solution presented here uses a set of two load thresholds to adjust the value of r to prevent crowding.
The first threshold is called the "target load," and the second, larger, threshold is called the "useful-only load." When the useful load is less than the useful-only load, the system tries to keep the total load between the two thresholds – decreasing r if the total load is too low, or increasing r if it is too high. If the useful load goes above the useful-only load, then any useless instruction detected causes r to increase. The idea behind this is to allow the useful load to go as high as it can go with minimal interference from useless instructions.

The objective of the feedback-based adjustment is not to react to transient changes in load, but to move toward a value of r that will be good for an entire phase of program execution. Thus, the parameter adjustment unit can be located far from the mainline processor, if necessary, and can be as slow as thrifty design practices suggest.

The adjustment unit has a simple implementation: it consists of a set of accumulators which track running averages of the useful and useless load factors for each class of functional unit. The values for the class with the highest total load are used to compare against the thresholds. If the useful load for that class is less than the useful-only load, and the total load is less than the target load, then the value of r is divided by 2 to increase the aggressiveness of wakeup prediction. If the useful load is less than the useful-only load, but the total load is greater than the useful-only load, then the value of r is multiplied by 2 to prevent crowding. If the useful load is greater than the useful-only load, and the useless load is non-zero, then the value of r is also multiplied by 2. The value of r is continually sent to the update unit of the wakeup predictor. To prevent wild oscillations of r, its value is restricted to being changed only once every 1000 cycles. Also, r is restricted to powers of 2 to simplify the multiplier logic in the update portion of the wakeup predictor.

The value of r can be treated as a generic parameter that can incorporate any predictable effect on performance that can be associated with replays in any way. Given this, there may be some instances where the best value of r is negative – in lightly loaded machines, for example. To allow for this, if an attempt is made to make the absolute value of r less than 0.25, then the absolute value is retained, but the sign is changed. Furthermore, when r is negative, r is reduced by multiplying by 2, and increased by dividing by 2.
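The threshold rules above can be expressed compactly. The following sketch is an illustration of the adjustment policy as described, not a hardware design; the load values are assumed to already be running averages for the busiest functional unit class, and the interpretation of "total load" as useful plus useless load follows the crowding discussion above.

    TARGET_LOAD = 0.8        # thresholds used in the experiments of Section 6
    USEFUL_ONLY_LOAD = 0.9

    def halve(r):
        # "Decrease" r (more aggressive prediction); when r is negative this means
        # multiplying by 2.  A magnitude below 0.25 flips the sign instead.
        new = r * 2 if r < 0 else r / 2
        return -r if abs(new) < 0.25 else new

    def double(r):
        # "Increase" r (less aggressive prediction); when r is negative this means
        # dividing by 2.  A magnitude below 0.25 flips the sign instead.
        new = r / 2 if r < 0 else r * 2
        return -r if abs(new) < 0.25 else new

    def adjust_r(r, useful_load, useless_load):
        """One adjustment step (applied at most once every 1000 cycles)."""
        total_load = useful_load + useless_load
        if useful_load < USEFUL_ONLY_LOAD:
            if total_load < TARGET_LOAD:
                return halve(r)      # room to spare: predict more aggressively
            if total_load > USEFUL_ONLY_LOAD:
                return double(r)     # replays are crowding the units: back off
        elif useless_load > 0:
            return double(r)         # useful load alone is high; tolerate no replays
        return r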
6. Experimental Setup

All experiments were performed on a timing model which reads instruction traces for the x86 ISA, translates them into a sequence of micro-operations, and executes them on a model of a modern superscalar processor core [9]. The traces used for these experiments represent contiguous sequences of 26 million to 100 million instructions in frequently executed portions of seven of the SPEC benchmarks: bzip2, crafty, eon, gzip, parser, twolf, and vortex. Even though traces are used, the timing effects of the instruction cache are still simulated.

The timing model was run in several different configurations to allow the predicted wakeup methods to be analyzed. The common attributes of all configurations are listed in Table 1.

Table 1. Attributes common to all configurations
Width (Decode, Rename, Issue & Retire): 8
Simple ALUs (latency): 6 (2 cycles)
Complex ALUs (latency): 2 (4 cycles)
Integer Multipliers (latency): 2 (8 cycles)
Load Units (latency): 4 (2 cycles)
Store Units (latency): 4 (2 cycles)
Simple FP Units (latency): 3 (10 cycles)
FP Move Units (latency): 3 (2 cycles)
Complex FP Units (latency): 1 (80 cycles)
Fetch Latency: 4 cycles
Decode Latency: 4 cycles
Rename Latency: 4 cycles
Register Read Latency: 4 cycles
Pipeline Restart Latency: 4 cycles
Retire Latency: 2 cycles
Instruction Cache: 32KB, 4-way, 2 cycles
Data Cache: 64KB, 4-way, 4 cycles
L2 Cache: 1MB, 2-way, 20 cycles
Memory Latency: 200 cycles

Furthermore, for the baseline systems, there are 64 reservation stations, and for predicted wakeup models, there are 64 entries in the self-schedule array. The system labeled Baseline has dependency-based scheduling. Details for the other configurations, as they differ from Baseline, are given below. All methods with predicted wakeup have a predictor with 128 sets, 4 ways, and µ=1/16. All methods with feedback adjustment use a target load of 0.8 and a useful-only load of 0.9.

BasePipeSched: wakeup latency = 2 cycles
WPLocal: prediction, local allowance, r = 1
WPLocalAdj: feedback-adjusted prediction, local allowance
WPGlobal: prediction, global allowance, r = 1
WPGlobalAdj: feedback-adjusted prediction, global allowance

The Baseline configuration is meant to represent a deeply-pipelined superscalar microprocessor of the near future. BasePipeSched represents the same processor with the wakeup and select logic pipelined. This configuration is only used as a reference point; the meaningful comparisons are made with the Baseline configuration. WPLocal represents a processor with wakeup predictions made as described in Section 4, with the replay allowance adjusted on a per-instruction basis. WPGlobal is similar to WPLocal, except that the replay allowance is adjusted globally. WPLocalAdj and WPGlobalAdj are the same as WPLocal and WPGlobal, respectively, but the value of r is adjusted using processor load as feedback.

7. Experimental Results

[Figure 8. Throughput of scheduling schemes. IPC for Baseline, BasePipeSched, WPLocal, WPLocalAdj, WPGlobal, and WPGlobalAdj on each benchmark (bzip2, crafty, eon, gzip, parser, twolf, vortex) and on average.]

Figure 8 shows an IPC comparison between the scheduling schemes. As can be seen, the predicted schemes experience a small slowdown. For the case of the global predictor, the slowdown is 9% (in terms of IPC) from Baseline, but throughput still exceeds the pipelined scheduler by 17%. In general, the global allowance method outperforms the local method by a small margin. The small slowdown from Baseline buys the ability to remove wakeup and scheduling from the critical path. In a deeply pipelined processor, this ability can potentially yield higher frequency operation.
Using feedback-based adjustment consistently gives higher performance than the fixed r. The difference is small in most cases, but the cost of implementation is also small, so it may be a viable feature to include in a production processor design. Here the global feedback-adjusted method experiences a 7% IPC slowdown compared to Baseline and a 3% speedup compared with the non-adjusted global scheme. Note that the local and global allowance methods yield nearly identical performance when both are feedback-adjusted.

Sensitivity to the value of µ was also measured. The local scheme showed the highest sensitivity, but even for it, the maximum difference in IPC across systems with values of µ ranging from 1/8 to 1/128 (all integer powers of 2 were tested) was 0.9%. Most individual benchmarks showed a difference of less than 0.5%, with the largest difference (twolf) being 2.25%. This is good, as it suggests that µ need not be dynamically adjusted.

[Figure 9. Throughput of scheduling schemes for ideal fetch mechanism. IPC for the same six configurations on each benchmark and on average, using an ideal fetch mechanism.]

To simulate the effects of the various higher-bandwidth instruction fetch mechanisms, some simulations were performed using ideal fetching. The results are shown in Figure 9. From the figure it can be seen that the global scheme has an IPC loss of 9% from Baseline and the feedback-based adjustment schemes have an IPC loss of 7% – similar to the case with an instruction cache. In other words, even though the predicted schemes can take advantage of fewer extra instruction slots, the self-adjusting nature of the predictions prevents deterioration of performance when presented with a high-bandwidth fetch.

[Figure 10. Throughput of scheduling schemes for resource-constrained processor. IPC for the same six configurations on each benchmark and on average, on the configuration with fewer functional units.]

Another issue of significant concern is how this system performs on heavily-loaded processors. Figure 10 shows a comparison between processor configurations which are just like those described in Section 6, but with fewer functional units: 3 Simple ALUs, 2 each of Load Units and Store Units, and 1 each of every other type of unit. Here the IPC loss is 14% for the global prediction scheme, but both predicted methods still outperform the pipelined scheduler. For the feedback-adjusted prediction method, the IPC loss is 9%. This is, as expected, higher than for the less-loaded configurations, but still small enough that the elimination of the critical cycle could make up for it in terms of overall performance. It also should be noted that, as suggested in Section 4, the global prediction scheme outperforms the local prediction scheme by a greater amount on this heavily loaded machine. Overall, the results so far suggest that all of the predicted wakeup methods in this paper are tolerant to changes in processor load and fetch bandwidth.

Since instruction slack plays a role in determining performance, the question of the importance of predictor accuracy arises. For the original system of Figure 8, the throughput values of WPLocalAdj and WPGlobalAdj were compared to those of schemes which used the same prediction methods, but multiplied each prediction by a factor of 2. For WPLocalAdj, this resulted in a 47% IPC loss from the non-doubled predictor. For WPGlobalAdj, the loss was only 27%. This shows that accuracy in WPGlobalAdj is important, but if some inaccuracy occurs, the performance loss may not be catastrophic. The difference between the losses for WPLocalAdj and WPGlobalAdj is likely attributable to the speed at which WPGlobalAdj reacts to the changing environment in the processor. The doubling of predictions will tend to have more pronounced effects on the load of the processor, and the global scheme can adapt to these changes faster.
Since the effective prediction is still correlated to the actual prediction, this manifests itself as a lower IPC loss.

Since there is a tendency in processor design toward deeper pipelines, the performance of the predictor schemes in deep pipelines is a concern. To test this, a series of simulations was run on processor models of different depths. Each model was created by multiplying all of the latencies in Table 1 by a constant. Table 2 shows the resulting IPC loss, relative to Baseline scheduling, of the two feedback-adjusted prediction schemes. The relative depth in Table 2 is the constant by which the pipeline depths in Table 1 were multiplied.

Table 2. Prediction IPC loss for deep pipelines
Relative Depth    WPLocalAdj IPC loss (%)    WPGlobalAdj IPC loss (%)
1                 7.6                        7.0
2                 8.1                        7.7
3                 8.3                        7.8
4                 8.4                        7.8

As can be seen, after an initial small deterioration from relative depth 1 to 2, the loss is quite stable. This means that the predictors can be expected to perform about as well on deep pipelines as they do on shorter ones. However, the results in Table 2 assume that the scheduler remains unpipelined and capable of scheduling dependent instructions back-to-back, which is somewhat unrealistic. Table 3 shows the same data for processors where the wakeup logic (predicted or otherwise) is pipelined. For relative depth 2 the wakeup logic has two stages, and this increases proportionally for the other depths. Here the IPC loss also remains largely stable, and the changes that do occur actually reduce the performance gap between dependency-based wakeup and predicted wakeup. One reason is that a pipelined predicted system can still execute many dependent instructions back-to-back, whereas the pipelined Baseline system cannot.

Table 3. Prediction IPC loss for deep pipelines with pipelined wakeup logic
Relative Depth    WPLocalAdj IPC loss (%)    WPGlobalAdj IPC loss (%)
1                 7.6                        7.0
2                 7.4                        6.7
3                 7.3                        6.4
4                 7.2                        6.2

In Section 4, the idea of exponential backoff of predicted wakeup times when replays occur was motivated by observations of the Baseline configuration. To validate it in predictor-based systems, a comparison was performed between an exponential backoff scheme and a linear backoff scheme. The results are given in Table 4. While the throughput values of the two backoff schemes are similar on this relatively wide configuration, the replay rate for the exponential scheme is only half what it is for the linear scheme.

Table 4. Exponential vs. linear backoff
                    WPLocalAdj            WPGlobalAdj
Backoff Scheme      IPC     replay%       IPC     replay%
linear              1.32    118           1.38    57
exponential         1.37    52            1.39    38

Finally, some discussion is needed on the number of replays generated by wakeup prediction. In the Baseline configuration, there were 0.03 replays, on average, for each dynamic instruction executed. This was compared to WPGlobalAdj, with 0.38 replays. Nearly all of these were single replays. This may seem significant, but even in a constrained processor, many resources are unused at any particular time. The feedback mechanism caused the prediction scheme to make more optimal use of those resources given the way the scheduling system operates. The relatively low drop in the IPC measurement, along with the observation that the replay rate is lower in benchmarks with higher IPC values, supports this. The drawback is, though, that the predicted wakeup scheme can potentially use more power than a traditional wakeup scheme.
While the predicted wakeup could conceivably use less power for wakeup than, for example, a tag-matching mechanism, it also places a higher load on the functional units and register file. This may negate, or even exceed, any front-end power benefit.

8. Future Work

Observation of the methods in this paper has shown that the feedback-adjusted methods would sometimes use negative values for r. The probable reason for this is that on a lightly-loaded machine, it may be advantageous to attempt to execute instructions before they are expected to be ready. Thus, the feedback-adjustment mechanism needs to cause the update units of the predictors to generate predictions that are less than the expected wakeup time. This can be achieved through a negative value of r, but this is a bit awkward. By incorporating processor load or other factors directly into the update units, aggressive scheduling can be achieved directly, perhaps leading to greater performance.

The idea of gradient-descent minimum searching and other feedback-based adjustments seems like a solution to the growing number of hardwired constant values present in modern processors. Nearly every new architectural feature introduces some constants that must be set to reasonable values to ensure proper operation. Currently, a designer has a choice for these constants: guess at a good set of values, or run simulations with non-linear optimization to find a good set of values. The number of simulations required increases rapidly with the number of parameters, thereby taking a considerable amount of the allocated design time. Even so, the values found may only produce good results when run on the same benchmarks used for optimization. Another possible approach is to optimize only those values for which the optimal setting is consistent across a large set of benchmarks, and, for the other values, to build dynamic feedback-based adjustment mechanisms into the hardware itself. The potential advantage is that a processor can self-configure to run any particular application in a near-optimal way.

9. Conclusion

This paper presented a method of eliminating the scheduling critical cycle. The method relies on the ability to predict the wakeup times of instructions and the ability to adjust those predictions to account for the cost of instruction replays. In order to account for differing application behavior and to remove the need for an arbitrary constant, a dynamic feedback-based adjustment technique was introduced. The system has no potential critical path or cycle that cannot be pipelined. The performance of the system shows a 7% IPC slowdown when compared to a processor without the system, but this buys the ability to remove any part of the wakeup and select logic from the critical path of a processor; in a deeply pipelined processor, this is of critical importance. It is reasonable to conclude that this system constitutes a plausible method of eliminating the critical cycle in the wakeup and scheduling logic.

10. Acknowledgements

The authors thank the other members of the Advanced Computing Systems group as well as the anonymous referees for providing feedback during various stages of this work. This material is based upon work supported by the C2S2 MARCO Center with support from AMD.

11. References

[1] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith, "Complexity-Effective Superscalar Processors", ISCA-24, 1997, pp. 206-218.
[2] Ramon Canal and Antonio González, "Reducing the Complexity of the Issue Logic", ICS 2001, pp. 312-320.
[3] Ramon Canal and Antonio González, "A Low-Complexity Issue Logic", ICS 2000, pp. 327-335.
[4] Dan Ernst, Andrew Hamel, and Todd Austin, "Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay", ISCA-30, 2003.
[5] Pierre Michaud and André Seznec, "Data-flow Prescheduling for Large Instruction Windows in Out-of-order Processors", HPCA-6, 2001.
[6] J. Stark, M. D. Brown, and Y. N. Patt, "On Pipelining Dynamic Instruction Scheduling Logic", MICRO-33, 2000.
[7] Mary D. Brown, Jared Stark, and Yale N. Patt, "Select-Free Instruction Scheduling Logic", MICRO-34, 2001.
[8] S. A. Mahlke, et al., "Sentinel Scheduling for VLIW and Superscalar Processors", ACM Transactions on Computer Systems, 11(4):376-408, 1993.
[9] B. Slechta, et al., "Dynamic Optimization of Micro-Operations", HPCA-9, 2003.