Reducing the Scheduling Critical Cycle using Wakeup Prediction
Todd E. Ehrhart and Sanjay J. Patel
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
{ehrhart,sjp}@crhc.uiuc.edu
Abstract
For highest performance, a modern microprocessor must
be able to determine if an instruction is ready in the same
cycle in which it is to be selected for execution. This
creates a cycle of logic involving wakeup and select.
However, the time a static instruction spends waiting for
wakeup shows little dynamic variance. This idea is used
to build a machine where wakeup times are predicted,
and instructions executed too early are replayed. This
form of self-scheduling reduces the critical cycle by
eliminating the wakeup logic at the expense of additional
replays. However, replays and other pipeline effects affect
the cost of misprediction. To solve this, an allowance is
added to the predicted wakeup time to decrease the
probability of a replay.
This allowance may be
associated with individual instructions or the global state,
and is dynamically adjusted by a gradient-descent
minimum-searching technique. When processor load is
low, prediction may be more aggressive – increasing the
chance of replays, but increasing performance, so the
aggressiveness of the predictor is dynamically adjusted
using processor load as a feedback parameter.
1. Introduction
An instruction is scheduled in a conventional
superscalar microprocessor by projecting the times at
which the instruction’s source operands will be ready. In
the case of a fixed-latency source instruction, this time is
known as soon as the source instructions are scheduled.
For other instructions, this time is not known. Since these
variable-length delays will propagate through dependency
chains, such instructions are a source of variability which
manifests itself throughout program execution. The
presence of such instructions means that a superscalar
processor must have a means of scheduling instructions
which does not depend on knowing operand ready times a
priori.
In many current processors, instruction scheduling is
performed dynamically through the use of wakeup and
select. Wakeup is the process by which instructions with
ready operands are discovered. Selection is performed by
selecting for execution a set of n instructions among all
ready instructions within the scheduling window.
Instructions that are selected for execution in one cycle
inform dependent instructions to wakeup in subsequent
cycles. If the execution latency of an instruction is one
cycle, then the wakeup and select logic loop should
nominally take one cycle also, thereby creating a very
tight critical path. Increasing clock frequency aggravates
the design constraints in the scheduler. This critical
execution loop cannot be pipelined without significant
impact on overall instruction-level parallelism.
This paper proposes the idea of eliminating the
conventional wakeup logic altogether, and replacing it
with a system where the wakeup time for each instruction
is predicted, and the instruction speculatively wakes itself
up when the time expires. The prediction occurs in
parallel with fetch, and may be pipelined with as many
stages as the fetch unit itself. If an instruction is executed
too early, the common instruction replay mechanism is
used to correct it. This system has the advantage of
eliminating the feedback mechanism for operand
readiness required for typical wakeup logic. In addition,
the predicted wakeup time for an instruction is always
known at least one cycle before the speculative wakeup
would actually occur. This means the select logic can be
built to anticipate future scheduling needs, thereby further
reducing the critical cycle in the system.
A potential drawback of this system is that it can
significantly increase the number of replays that occur
during the course of program execution. Replays, when
they occur, increase the load on the processor and can
increase the time until a replayed instruction finally
produces its result. This problem is most pronounced
when the processor is heavily loaded and replayed
instructions are interfering with the execution of
instructions that are non-speculatively ready. In order to
alleviate this problem, the predictive system presented
here adds an allowance to the predicted wakeup time to
reduce the probability of a replay. This allowance may be
associated with individual static instructions, or with the
overall machine state. In either case, it is dynamically
adjusted using a gradient-descent minimum-searching
technique that attempts to find the minimum-cost balance
(in terms of performance) between executing an
instruction sooner – and increasing the chances of a replay – or executing it later – and possibly losing the opportunity to execute sooner.
The system presented here uses two adjustable parameters: the rate of averaging µ, and the estimated cost of a replay r. The parameter r may be adjusted to make prediction more or less aggressive, where “more aggressive” means that the system will tend to speculatively execute instructions sooner. When processor load is low, it makes sense to use more aggressive prediction, because it only fills instruction slots that would otherwise go to waste. When processor load is high, it makes sense to use less aggressive prediction, because instructions destined to be replayed consume instruction slots that could have been filled by useful instructions. Also, the avalanche effect of replays in a heavily loaded processor gives another reason to use less aggressive prediction in this scenario. With this in mind, the system used in this paper employs a second layer of feedback to adjust r according to the measured load on the processor and the ratio of replayed to ready instructions.
2. Previous Work
Palacharla, et al. [1] studied the delays of the various
parts of the rename, wakeup, select and bypass logic, and
are widely credited with identifying the critical loop
involving the wakeup and select logic. They studied the
sensitivity of the delays to changes in technology and
instruction window size. They also proposed a solution to
the critical cycle: A set of FIFO structures feeds the select
logic. An instruction is steered into the FIFO already
occupied by an instruction upon which it depends. Each
cycle, the instructions at the head of each FIFO are
executed if they are ready. This reduces the number of
instructions that need to be checked for wakeup.
The scheduling critical cycle was also addressed by
Canal and González [2, 3]. Their first solution dispatches
instructions to two different buffers, depending on the
usage characteristics of their source operands. If all non-ready operands represent the first use of those values, the
instruction is dispatched to a table indexed by physical
register number. If any non-ready operand represents a
second or later use of that value, it is dispatched into a
content-addressable buffer. The net effect is to reduce the
usage of content-addressable memory to instructions that
are the second (or later) users of values. The second
solution proposed by Canal and González involves
generating a VLIW-style schedule a few cycles in
advance of execution.
Incoming instructions are
scheduled into the earliest available instruction slot after
their operands are projected to be ready. Projecting ready
times for operands generated by loads is done by
assuming the operand would be ready a cycle after the
load was executed. Ernst, et al. [4] use a similar
approach, but explicitly account for memory dependence
and use a switchback countdown queue as a cheaper way
of constructing the VLIW schedule.
Michaud and Seznec [5] also attempt to schedule
instructions before they are ready to execute. In their
model, the scheduler assumes that the processor has
infinite execution resources and schedules accordingly.
The scheduled instructions are placed in a FIFO so that
they are delivered to the wakeup logic in dataflow order.
This modification allows a smaller wakeup buffer to be
used to achieve a similar throughput.
Like the approach in this paper, Stark, et al. [6] use a
speculative wakeup approach. The technique allows the
wakeup logic to be implemented as a two-stage pipeline
by assuming that an instruction will be ready two cycles
after all of its grandparent instructions are ready. If this
assumption is wrong for an instruction, the instruction is
replayed. The same authors also introduce a technique
they call “select-free scheduling” [7] which addresses the
select portion of the critical cycle. When an instruction
wakes up, it is sent to the select logic; if there are too
many instructions that wake up in a cycle, the scheduling
logic sends some back to be rescheduled. This allows the
select logic to be pipelined with minimal performance
penalty. Instructions which are incorrectly scheduled
because they depend on unschedulable instructions are
caught when they access the register file.
3. Minimum-Searching
[Figure 1 is a histogram: frequency (%), from 0 to 100, versus wakeup time (cycles) as difference from the mode of the static instruction, over the range −25 to +25.]
Figure 1. Wakeup time deviation
The motivation for the technique used in this paper is
found in Figure 1. The figure shows the distribution of
differences between the actual wakeup times of dynamic
instructions and the modes of wakeup times for all
instances of the corresponding static instructions. (i.e.,
this figure may be thought of as being produced by
generating a distribution of wakeup times of dynamic
instances of each static instruction, subtracting the mode
from each distribution, and combining the results; e.g. if
the most common wakeup time for dynamic instances of a
static instruction is 6 cycles, and a particular dynamic
instance wakes up after 7 cycles, it contributes to the
frequency of the value 7-6=1 in Figure 1.)
As can be seen from the figure, about 66% of dynamic
instructions have wakeup times identical to the most
common wakeup times for their corresponding static
instructions. This suggests that the wakeup time of a
dynamic instruction is highly predictable if the mode of
wakeup times for the static instruction can be estimated.
However, there is a complication. If an instruction is
predicted to have too short a wakeup time, it will be
replayed. If the predictions for dependent instructions are
also based on observations of the source instruction
executing earlier, then the dependent instructions will also
be predicted to have wakeup times that are too short. In
this way, the avalanche of replays that occurs in
dependence-based scheduling can occur in a predicted
system also. Even if the avalanche is avoided, the
replayed instruction still might have consumed an
instruction slot which could have been occupied by a
useful instruction, or it may have lost the opportunity to
execute earlier if it was recovering from a replay when it
became ready. So, while avalanches are not guaranteed in
a predicted system, the cost of a replay can still be high.
Also, as can be seen from the tails in Figure 1, an instruction is more likely to be ready later than a mode-based prediction would anticipate than it is to be ready earlier. This exacerbates the replay cost problem, making underpredictions a concern.
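As an illustration of how Figure 1 can be constructed, the following sketch (hypothetical code, not from the paper; the trace format is assumed) computes the per-static-instruction modes and the combined deviation histogram:

```python
# Sketch of the Figure 1 analysis. Assumes a trace of
# (static_pc, wakeup_cycles) observations; names are illustrative.
from collections import Counter, defaultdict

def wakeup_deviation_histogram(observations):
    per_static = defaultdict(list)
    for pc, cycles in observations:
        per_static[pc].append(cycles)
    deviations = Counter()
    for times in per_static.values():
        mode = Counter(times).most_common(1)[0][0]  # most frequent wakeup time
        for t in times:
            deviations[t - mode] += 1  # e.g. mode 6, observed 7 -> bin +1
    return deviations
```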
[Figure 2 sketches the probability density function g(x) of wakeup time x; the predicted wakeup point m+d divides the curve into a lost-opportunity region to the left and a replay-cost region to the right, where m is the mode and d the allowance.]
Figure 2. Costs involved with the wakeup time
distribution
The discussion above suggests that it should be
beneficial to take replay cost into account when designing
a predictor. The strategy taken in this paper is to add an
allowance to the predicted wakeup time to reduce the
chances of a replay. As a starting point for deriving a
predictor which uses this strategy, Figure 2 shows a
representation of the distribution of wakeup times for a
static instruction, along with regions representing the
major costs. m represents the mode of the distribution
and d represents the allowance. m+d is the cost-adjusted
predicted wakeup time.
The curve g(x) represents the probability density
function of the actual wakeup time. The shaded area on
the left represents the probabilistic cost of lost
opportunity resulting from an overprediction of the
wakeup time. The shaded area on the right represents the
probabilistic cost of a replay resulting from an
underprediction. Let f(x) = g(x+m) – i.e., let f(x) be the
same distribution shifted to put the mode at x=0. If the
assumption is made that this discrete distribution
corresponds to an underlying continuous, differentiable
distribution, it is easy to write a cost function for any
particular value of d:
$$\varphi(d) \;=\; \int_{-\infty}^{d} f(x)\,k(d-x)\,dx \;+\; \int_{d}^{\infty} f(x)\,r(x-d)\,dx$$

where the first integral is the expected lost opportunity cost (with density f(x)·k(d−x)) and the second is the expected replay cost (with density f(x)·r(x−d)).
The objective is to find a value of d that minimizes
φ(d). The problem is that the lost opportunity cost is
difficult to know when the prediction is being made, and
the cost of any particular replay is unobservable during
execution.
However, there are some reasonable
assumptions that can be made. For lost opportunity cost –
k(d-x) – knowledge of instruction slack is required. This
is unknown when the predicted wakeup time needs to be
generated, because it depends on subsequent instructions.
It is also difficult to measure and store for future
reference. It can be conservatively assumed that slack is
0. This implies that the delay cost is one cycle for every cycle by which instruction wakeup is delayed.
Under this
assumption, k(d-x) = d-x. This assumption is false, but a
dynamic feedback mechanism introduced later will be
used to correct for it.
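To make the cost model concrete, the following sketch (illustrative only, not part of the proposed hardware) evaluates the discrete analogue of the cost function under these assumptions – zero slack and a constant replay cost R:

```python
# Discrete version of the cost function, assuming zero slack
# (k(d-x) = d-x) and a replay cost R independent of x-d.
def expected_cost(f, d, R):
    """f: dict mapping deviation x (cycles from the mode) -> probability."""
    lost_opportunity = sum(p * (d - x) for x, p in f.items() if x <= d)
    replay = sum(p * R for x, p in f.items() if x > d)
    return lost_opportunity + replay

# Brute-force search for the minimizing allowance d over a small range:
def best_allowance(f, R, candidates=range(-4, 16)):
    return min(candidates, key=lambda d: expected_cost(f, d, R))
```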
Before discussing replay cost, it is worth describing how the replay condition will be detected. In order to
determine if an instruction was issued too early, the
processor must detect whether the parent instructions
have produced values. It seems that the simplest way of
doing this is to use the “ready” bit mechanism used by
speculative scheduling of instructions dependent on loads
(the inverse of the “poison” bit in [8]). Instead of clearing
this bit based on a cache miss, it can be cleared as soon as
the physical register is assigned to an architectural
register. This works because there can only be one value
associated with a particular physical register in the
machine at one time. So, detection will occur when the
mis-issued instruction accesses the register file and fails
to see the ready bit in one or more of its operands.
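A minimal sketch of this check, assuming a ready-bit array indexed by physical register number:

```python
# Replay detection at register read: an instruction that reaches the
# register file is replayed if any source register's ready bit is clear.
def needs_replay(ready_bits, source_physical_regs):
    return any(not ready_bits[reg] for reg in source_physical_regs)
```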
The replay cost – r(x-d) – is the penalty (in cycles)
incurred when a replay occurs. Not only is this difficult
to determine in advance, it is also difficult to define. The
fundamental question is: If an instruction must be
replayed, which instruction – by attempting execution too
early – caused the replay? For example, blame could be
placed on a replayed instruction in the past that consumed
an instruction slot which caused a parent instruction to
execute later than anticipated, or on other instructions in a
variety of other scenarios. Alternately, blame could be
placed on the replaying instruction itself, with the
rationale that, regardless of the cause, it should have
executed later. Using the second option simplifies the
implementation and the analysis. In particular, it makes
reasonable the assumption that the replay cost is
independent of x-d. The placeholder R will be used to
represent this cost. With these assumptions made, the
cost function is now:
$$\varphi(d) \;=\; \int_{-\infty}^{d} f(x)\,(d-x)\,dx \;+\; R\int_{d}^{\infty} f(x)\,dx \qquad (1)$$

The real objective is to find a value of d that minimizes this function. The derivatives of the cost function follow, where F denotes the cumulative distribution corresponding to f:

$$\frac{d\varphi(d)}{dd} = F(d) - Rf(d), \qquad \frac{d^2\varphi(d)}{dd^2} = f(d) - Rf'(d)$$

So, at the minimum, F(d) = Rf(d) and f(d) > Rf′(d). As long as R is positive, there will be at least one solution. Unfortunately, the solution cannot be found deterministically, because f(d) is not known. The method proposed here is to use an iterative minimum-searching technique based on the first derivative (i.e. a gradient-descent search). The formula for the first derivative above suggests the following update equation:

$$d_{i+1} = d_i + \mu\,(d - d_i + R) \qquad (2)$$

By adding the mode from g(x) back in, using w = d + m, this can be written in terms of the total wakeup time:

$$w_{i+1} = w_i + \mu\,(w - w_i + R) \qquad (3)$$

Here w is the newly observed wakeup time, w_i is the current estimate of the optimal wakeup time, and R is the replay cost blamed on this instruction (which will be zero if the instruction does not replay). µ is a constant that defines the rate of averaging of the iterative update. Conceptually, the update equation calculates the difference in cost-adjusted wakeup time between the old estimate and the current sample, weights it, and merges it with the running estimate. The difference consists of the new replay cost plus the error between the previous wakeup time estimate and the current sample. In this way, over the long run, the cost associated with replaying this instruction is averaged in with a weight roughly proportional to the frequency of replay. Over the short term, the estimate will drift down to the mean at a rate specified by µ, making replays more and more likely. Eventually a replay will occur, increasing the estimate again. An appropriate value for µ should cause the estimate to move to a cost-optimal value.
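One step of this update is simple enough to state directly; the sketch below assumes a per-instruction predictor entry and uses the µ = 1/16 value adopted in Section 6:

```python
# One step of the gradient-descent update of Equation (3). w_i is the
# current estimate, w_observed the newly observed wakeup time, and R
# the replay cost charged to this instruction (zero without a replay).
def update_wakeup_estimate(w_i, w_observed, R, mu=1.0 / 16):
    # Without replays the estimate drifts toward the observed wakeup
    # times; each replay adds mu * R, pushing the estimate back up.
    return w_i + mu * (w_observed - w_i + R)
```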
4. Prediction Architecture
[Figure 3 shows incoming instructions entering register rename and the wakeup predictor in parallel; renamed instructions and their predictions feed the self-schedule array, which issues to the register file and on to the functional and load units. "Not-ready" tags and replay and wakeup time information flow back from the register file, while data and destination tags return from the units.]
Figure 3. High-level architecture
To implement the predicted wakeup system, the
operand-readiness feedback mechanism was removed
from the typical superscalar architecture and was replaced
with a prediction table and a self-schedule array as shown
in Figure 3. In this system, the register file maintains a
“ready” bit associated with each physical register. These
bits are cleared when the register renamer assigns the
register, and set when a functional unit delivers a value to
that register. A new instruction is sent both to the
renamer and the wakeup predictor. If the wakeup
predictor cannot generate a prediction, a global default
value is used. After renaming and prediction, the
instruction payload is sent to the self-schedule array. The
architecture of this array is shown in Figure 4.
The self-schedule array performs the wakeup and
scheduling functions of the processor, and is the key to
eliminating the critical cycle.
The wakeup time
predictions of instructions in the table are stored in a
separate array. Each element in this array counts down by
1 each cycle until it reaches 1. When it does, it sends a
signal to the select logic indicating that the corresponding instruction will wake up during the next cycle. The select logic chooses among the speculatively ready instructions and generates a grant vector. When the next cycle arrives, this grant vector is used to index the payload array and send the selected instructions to the register file for speculative execution. For instructions with an initial wakeup prediction of zero cycles, there is a bypass path. When the select logic sees that an incoming instruction has a prediction of zero cycles, it may choose to select that instruction immediately. If the instruction is selected, it bypasses the payload array and goes directly to the register file. If the instruction is not selected, it goes into the array, signals its readiness and waits to be selected.
[Figure 4 shows the self-schedule array: an incoming instruction enters the payload array while its incoming prediction enters the countdown array; the countdown array drives a ready-next-cycle vector into the select logic, which returns a grant vector to the payload array; a payload bypass path lets zero-cycle instructions skip the array on their way to the selected-instruction outputs.]
Figure 4. Self-schedule array architecture
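The countdown behavior can be sketched behaviorally as follows; the container types and the oldest-first select policy are assumptions, not details from the paper:

```python
# One cycle of the countdown array: counters tick down toward 1, and a
# counter at 1 means the instruction wakes up during the next cycle.
def countdown_tick(counters):
    """counters: dict mapping array slot -> remaining predicted cycles."""
    for slot in counters:
        if counters[slot] > 1:
            counters[slot] -= 1
    return [slot for slot, c in counters.items() if c == 1]

def select(ready_next_cycle, issue_width):
    # Placeholder oldest-first policy; any heuristic could be used here.
    return sorted(ready_next_cycle)[:issue_width]  # grant vector for next cycle
```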
Note that this design involves no critical feedback circuit in the scheduler, and the critical path is likely to be the time taken to write into the payload array. Even with this relatively short critical path, it still allows back-to-back execution of dependent instructions, immediate execution of urgent instructions, and can still implement any selection heuristics the designer feels are appropriate. More sophisticated selection heuristics may be employed by increasing the advance notice that the select logic gets from the counter array. This increases the complexity of the bypass path, but potentially allows for better scheduling. Also, the only one-stage cycle in the design is the loop used to update the counters. This is highly unlikely to be on the critical path, but even if it were, it does not need to be a critical cycle, because it could be replaced by a pair (even/odd) of two-stage counters. In addition, the critical path in the payload array can be pipelined by breaking it into two arrays. This would require more sophisticated bypass logic, but it means that the situation where the self-schedule array is the critical path of the processor can be avoided.
After an instruction is selected, it proceeds to the register file and accesses its source registers. If all of an instruction’s operands are ready, it proceeds to execution. If any source register is not marked as ready, then this instruction was executed too early and it is replayed. Replay involves sending the instruction back to the self-schedule table with a new wakeup time prediction. The scheme to generate new predictions is motivated by the basic observation of Figure 5.
[Figure 5 plots expected remaining wakeup time (cycles) against elapsed wakeup time (cycles), rising with a slope of roughly 2 over the range 0 to 12 cycles.]
Figure 5. Expected remaining wakeup time
The figure shows the expected remaining wakeup time
as a function of elapsed wakeup time. The figure shows,
for example, that if an instruction has spent two cycles
waiting for wakeup, then it can expect to spend 6 more
cycles before wakeup actually happens. The portion of
the curve shown covers 99% of all dynamic instructions.
The slope of this section is approximately 2, and
motivates a strategy of doubling the previous prediction
and using this as the new prediction. While the curve of
Figure 5 cannot be expected to apply exactly to a
predicted system, in practice, the exponential backoff
method it implies proves to be effective in eliminating
unnecessary replays.
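The resulting re-prediction rule is tiny. A sketch, with the linear alternative evaluated later in Table 4 shown for contrast (its step size is an assumption):

```python
# Exponential backoff: double the previous prediction on each replay,
# following the ~2x slope of Figure 5.
def exponential_backoff(previous_prediction):
    return max(1, 2 * previous_prediction)

# Linear alternative, compared against in Section 7 (step size assumed).
def linear_backoff(previous_prediction, step=1):
    return previous_prediction + step
```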
[Figure 6 shows a tagged, set-associative table: the instruction address is split into tag and index; each entry holds a valid bit, tag, and wakeup time; on a hit the stored wakeup time is read out, and an update unit parameterized by (µ, r) adjusts the entry using the observed wakeup delay and replay count.]
Figure 6. Wakeup predictor architecture
When an instruction is finally sent for execution, the
register file sends the instruction’s actual wakeup time
and the number of replays it experienced to the wakeup
predictor, so that the predictor may be updated. The
structure of the wakeup predictor is shown in Figure 6.
Wakeup times are stored as offset-0.5 fixed-point
numbers to eliminate the need for circuitry to perform
rounding to the nearest integer. Only the integer portions
of the wakeup prediction are sent to the self-schedule
array. The update unit uses two parameters, µ and r,
which represent the rate of averaging and the modeled
cost of a single replay, respectively. The R in Equation
(3) is implemented as r multiplied by the replay count.
Note that the update unit can be as slow as needed,
because it is not part of any critical cycle, and because the
value it is updating is supposed to move toward a single
value. Thus there is no urgency to ensure an update
occurs before the next prediction is made.
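A sketch of the offset-0.5 representation mentioned above; the fractional width is an assumed parameter:

```python
FRAC_BITS = 6  # assumed fractional precision of the stored estimate

def store(w):
    # Keep w + 0.5, so truncating to the integer part rounds w
    # to the nearest integer with no extra rounding circuitry.
    return int((w + 0.5) * (1 << FRAC_BITS))

def integer_prediction(stored):
    # Integer portion sent to the self-schedule array; equals round(w).
    return stored >> FRAC_BITS
```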
The actual value of the wakeup time is determined by
a further modification to the register file. Each physical
register keeps a counter of the number of cycles that have
passed since it became ready. When an instruction
successfully accesses its source registers, the minimum of
the counter values of the registers gives the number of
cycles by which the actual wakeup exceeded the true
wakeup time, adjusted for the amount of time spent
waiting for selection. This can be used to reconstruct the
true wakeup time to be used to adjust the predictor.
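One plausible arrangement of this reconstruction is sketched below; the variable names and the exact bookkeeping of the selection wait time are assumptions:

```python
# Reconstructing the true wakeup time from the per-register ready
# counters when an instruction successfully reads its operands.
def true_wakeup_time(cycles_since_dispatch, source_ready_counters,
                     cycles_waiting_for_select):
    late_by = min(source_ready_counters)  # cycles past true readiness
    return cycles_since_dispatch - late_by - cycles_waiting_for_select
```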
[Figure 7 shows the same tagged table as Figure 6, with each entry also holding a confidence field; the wakeup time read on a hit is added to the integer portion of a global allowance register, which has its own (µ, r) update unit driven by the observed wakeup delay and replay count.]
Figure 7. Global allowance wakeup predictor
An alternative, and somewhat cheaper, prediction
scheme based on the same idea involves keeping the
replay allowance as part of the global state. The rationale
for this is based on the observation that when one replay
happens, it is likely to be followed soon by other replays,
especially in a heavily-loaded system. This suggests the
strategy of using a more conventional predictor
mechanism to find the mode of the wakeup times for
dynamic instances of each static instruction. When a
replay occurs (or a cycle passes with no replay), a global
register is adjusted in the same way that instruction
entries are updated in the previous prediction mechanism.
When a prediction is made, the mode retrieved from the
wakeup predictor is added to the integer portion of the
global register to form the final cost-adjusted prediction.
Figure 7 shows such a system. Note also that for both
predictors, there is a mechanism just like the global
allowance feedback loop that generates a default
prediction in the event of a miss in the main predictor.
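A sketch of the global-allowance variant follows; the exact sample value used on replay-free cycles is an assumption based on the per-instruction analogy:

```python
# Global-allowance predictor: a conventional table tracks per-instruction
# modes, and a single register absorbs the replay allowance globally.
class GlobalAllowancePredictor:
    def __init__(self, mu=1.0 / 16, r=1.0):
        self.allowance = 0.0
        self.mu, self.r = mu, r

    def on_cycle(self, replay_count):
        # Same shape as Equation (3), applied to the shared allowance:
        # replays push the allowance up; replay-free cycles let it decay.
        R = self.r * replay_count
        self.allowance += self.mu * (0.0 - self.allowance + R)

    def predict(self, mode_from_table):
        # Integer portion of the allowance added to the predicted mode.
        return mode_from_table + int(self.allowance)
```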
5. Feedback-Based Adjustment
The architecture described in the last section relies on
two parameters – µ and r – representing the rate of
averaging and the cost of a replay, respectively. More
precisely, r is the cost of a replay in excess of the cost of
missed opportunity. While the best value for µ would not
be expected to vary greatly from benchmark to
benchmark, this is not so for r. For benchmarks with high
instruction throughput (IPC), a replayed instruction is
more likely to cause problems because it is consuming
resources that could be consumed by a useful instruction.
Conversely, for benchmarks with low throughput, a
replayed instruction may be of no consequence, as it did
not necessarily consume critical resources and did not
become ready while recovering from a replay.
Commonly used benchmark programs, as well as other
applications commonly run on computer systems, vary
considerably in their instruction throughput. So, one
would expect that the best value for r might depend
heavily on the application being run.
With this in mind, it makes sense to adjust r with
instruction throughput. There is a complication, though: r
affects instruction throughput. Decreasing r will, up to a
point, increase the throughput of useful instructions, but
decreasing r will almost always increase the number of
useless (replayed) instructions in the system. In fact, if r
is too low, useless instructions will start crowding out
useful instructions – remember that a useless instruction
cannot be differentiated from a useful instruction until an
attempt at execution is made. If this happens, then the
measured (useful) throughput will be artificially low,
suggesting that r should be decreased further, which is
clearly not good. On the other hand, if r is too high,
useful instructions will be unnecessarily delayed.
The objective is to maximize useful throughput. This
can be done by reducing r up to the point where crowding
starts to occur. Crowding can be detected by monitoring
the load (throughput divided by number of units) for each
functional unit class. If any class has a total (useful plus
useless) throughput near 100%, and the useless
throughput is significant, then crowding is occurring.
Note that a useless instruction never actually passes
through any functional unit, but it may prevent another
instruction from accessing the register file or being
scheduled into a functional unit, and it therefore
contributes to the load of the unit in which it would have
executed. The solution presented here uses a set of two
load thresholds to adjust the value of r to prevent
crowding. The first threshold is called the “target load,”
and the second, larger, threshold is called the “useful-only
load.” When the useful load is less than the useful-only
load, the system tries to keep the total load between the
two thresholds – decreasing r if the total load is too low,
or increasing r if it is too high. If the useful load goes
above the useful-only load, then any useless instruction
detected causes r to increase. The idea behind this is to
allow the useful load to go as high as it can go with
minimal interference from useless instructions.
The objective of the feedback-based adjustment is not
to react to transient changes in load, but to move toward a
value of r that will be good for an entire phase of program
execution. Thus, the parameter adjustment unit can be
located far from the mainline processor, if necessary, and
can be as slow as thrifty design practices suggest. The
adjustment unit has a simple implementation: It consists
of a set of accumulators which track running averages of
the useful and useless load factors for each class of
functional unit. The values for the class with the highest
total load are used to compare against the thresholds. If
the useful load for that class is less than the useful-only
load, and the total load is less than the target load, then
the value of r is divided by 2 to increase the
aggressiveness of wakeup prediction. If the useful load is
less than the useful-only load, but the total load is greater
than this, then the value of r is multiplied by 2 to prevent
crowding. If the useful load is greater than the useful-only load, and the useless load is non-zero, then the value
of r is also multiplied by 2. The value of r is continually
sent to the update predictor. To prevent wild oscillations
of r, its value is restricted to being changed only once
every 1000 cycles. Also, r is restricted to even powers of
2 to simplify the multiplier logic in the update portion of
the wakeup predictor.
The value of r can be treated as a generic parameter
that can incorporate any predictable effect on
performance that can be associated with replays in any
way. Given this, there may be some instances where the
best value of r is negative – in lightly loaded machines,
for example. To allow for this, if an attempt is made to
make the absolute value of r less than 0.25, then the
absolute value is retained, but the sign is changed.
Furthermore, when r is negative, r is reduced by
multiplying by 2, and increased by dividing by 2.
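Putting these rules together, the adjustment logic can be sketched as follows (threshold values are those used in Section 6; the load accumulators are omitted):

```python
TARGET_LOAD = 0.8       # "target load" threshold
USEFUL_ONLY_LOAD = 0.9  # "useful-only load" threshold

def adjust_r(r, useful_load, useless_load):
    """Run at most once every 1000 cycles on the averaged loads of the
    most heavily loaded functional-unit class."""
    total_load = useful_load + useless_load
    if useful_load < USEFUL_ONLY_LOAD:
        if total_load < TARGET_LOAD:
            return decrease_r(r)   # room to spare: predict more aggressively
        if total_load > USEFUL_ONLY_LOAD:
            return increase_r(r)   # crowding: back off
        return r
    return increase_r(r) if useless_load > 0 else r

def decrease_r(r):
    new = r / 2 if r > 0 else r * 2   # "reducing" a negative r doubles it
    return -0.25 if abs(new) < 0.25 else new  # magnitude kept, sign flipped

def increase_r(r):
    new = r * 2 if r > 0 else r / 2
    return 0.25 if abs(new) < 0.25 else new
```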
6. Experimental Setup
All experiments were performed on a timing model
which reads instruction traces for the x86 ISA, translates
them into a sequence of micro-operations, and executes
them on a model of a modern superscalar processor core
[9]. The traces used for these experiments represent
contiguous sequences of 26 million to 100 million
instructions in frequently executed portions of seven of
the SPEC benchmarks: bzip2, crafty, eon, gzip, parser,
twolf, and vortex. Even though traces are used, the
timing effects of the instruction cache are still simulated.
The timing model was run in several different
configurations to allow the predicted wakeup methods to
be analyzed. The common attributes of all configurations
are listed in Table 1.
Table 1. Attributes common to all configurations
Width (Decode, Rename, Issue & Retire)    8
Simple ALUs (latency)                     6 (2 cycles)
Complex ALUs (latency)                    2 (4 cycles)
Integer Multipliers (latency)             2 (8 cycles)
Load Units (latency)                      4 (2 cycles)
Store Units (latency)                     4 (2 cycles)
Simple FP Units (latency)                 3 (10 cycles)
FP Move Units (latency)                   3 (2 cycles)
Complex FP Units (latency)                1 (80 cycles)
Fetch Latency                             4 cycles
Decode Latency                            4 cycles
Rename Latency                            4 cycles
Register Read Latency                     4 cycles
Pipeline Restart Latency                  4 cycles
Retire Latency                            2 cycles
Instruction Cache                         32KB, 4-way, 2 cycles
Data Cache                                64KB, 4-way, 4 cycles
L2 Cache                                  1MB, 2-way, 20 cycles
Memory Latency                            200 cycles
Furthermore, for the baseline systems, there are 64
reservation stations, and for predicted wakeup models,
there are 64 entries in the self-schedule array. The system
labeled Baseline has dependency-based scheduling.
Details for the other configurations, as they differ from
Baseline, are given below. All methods with predicted
wakeup have a predictor with 128 sets, 4 ways, and
µ=1/16. All methods with feedback adjustment use a
target load of 0.8 and a useful-only load of 0.9.
BasePipeSched: wakeup latency = 2 cycles
WPLocal: prediction, local allowance, r = 1
WPLocalAdj: feedback-adj. pred., local allowance
WPGlobal: prediction, global allowance, r = 1
WPGlobalAdj: feedback-adj. pred., global allowance
The Baseline configuration is meant to represent a
deeply-pipelined superscalar microprocessor of the near
future. BasePipeSched represents the same processor
with the wakeup and select logic pipelined. This
configuration is only used as a reference point; the
meaningful comparisons are made with the Baseline
configuration. WPLocal represents a processor with
wakeup predictions made as described in Section 4, with
replay allowance adjusted on a per-instruction basis.
WPGlobal is similar to WPLocal, except that the replay
allowance is adjusted globally.
WPLocalAdj and
WPGlobalAdj are the same as WPLocal and WPGlobal,
respectively, but the value of r is adjusted using processor
load as feedback.
7. Experimental Results
[Figure 8 compares IPC (0 to 3) of Baseline, BasePipeSched, WPLocal, WPLocalAdj, WPGlobal, and WPGlobalAdj across the benchmarks bzip2, crafty, eon, gzip, parser, twolf, vortex, and their average.]
Figure 8. Throughput of scheduling schemes
Figure 8 shows an IPC comparison between the scheduling schemes. As can be seen, the predicted schemes experience a small slowdown. For the case of the global predictor, the slowdown is 9% (in terms of IPC) from Baseline, but throughput still exceeds the pipelined scheduler by 17%. In general, the global allowance method outperforms the local method by a small margin. The small slowdown from Baseline buys the ability to remove wakeup and scheduling from the critical path. In a deeply pipelined processor, this ability can potentially yield higher frequency operation.
Using feedback-based adjustment consistently gives higher performance than the fixed r. The difference is small in most cases, but the cost of implementation is also small, so it may be a viable feature to include in a production processor design. Here the global feedback-adjusted method experiences a 7% IPC slowdown compared to Baseline and a 3% speedup compared with the non-adjusted global scheme. Note that the local and global allowance methods yield nearly identical performance when both are feedback-adjusted.
Sensitivity to the value of µ was also measured. The local scheme showed the highest sensitivity, but even for it, the maximum difference in IPC between systems with values of µ ranging from 1/8 to 1/128 – all integer powers of 2 were tested – was 0.9% between the maximum and minimum values. Most individual benchmarks showed a difference of less than 0.5%, with the largest difference (twolf) being 2.25%. This is good, as it suggests that µ need not be dynamically adjusted.
To simulate the effects of the various higher-bandwidth instruction fetch mechanisms, some simulations were performed using ideal fetching. The results are shown in Figure 9. From the figure it can be seen that the global scheme has an IPC loss of 9% from Baseline and the feedback-based adjustment schemes have an IPC loss of 7% – similar to the case with an instruction cache. In other words, even though the predicted schemes can take advantage of fewer extra instruction slots, the self-adjusting nature of the predictions prevents deterioration of performance when presented with a high-bandwidth fetch.
[Figure 9 shows the same comparison with an ideal fetch mechanism; the IPC axis runs from 0 to 4.5.]
Figure 9. Throughput of scheduling schemes for ideal fetch mechanism
Another issue of significant concern is how this system
performs on heavily-loaded processors. Figure 10 shows
a comparison between processor configurations which are
just like those described in Section 6, but with fewer
functional units: 3 Simple ALUs, 2 each of Load Units
and Store Units; and 1 each of every other type of unit.
Here the IPC loss is 14% for the global prediction
scheme, but both predicted methods still outperform the
pipelined scheduler. For the feedback-adjusted prediction
method, the IPC loss is 9%. This is, as expected, higher
than for the less-loaded configurations, but still small
enough such that the elimination of the critical cycle
could make up for it in terms of overall performance. It
also should be noted that, as suggested in Section 4, the
global prediction scheme outperforms the local prediction
scheme by a greater amount on this heavily loaded
machine. Overall, the results so far suggest that all of
the predicted wakeup methods in this paper are tolerant to
changes in processor load and fetch bandwidth.
[Figure 10 compares IPC (0 to 2.5) of the six configurations on the resource-constrained processor across the same benchmarks and their average.]
Figure 10. Throughput of scheduling schemes for resource-constrained processor
Since instruction slack plays a role in determining
performance, the question of the importance of predictor
accuracy arises. For the original system of Figure 8, the
throughput values of WPLocalAdj and WPGlobalAdj
were compared to that of schemes which used the same
prediction methods, but multiplied each prediction by a
factor of 2. For WPLocalAdj, this resulted in a 47% IPC
loss from the non-doubled predictor. For WPGlobalAdj,
the loss was only 27%. This shows that accuracy in
WPGlobalAdj is important, but if some inaccuracy
occurs, the performance loss may not be catastrophic.
The difference between the losses for WPLocalAdj and
WPGlobalAdj are likely attributable to the speed at which
WPGlobalAdj reacts to the changing environment in the
processor. The doubling of predictions will tend to have
more pronounced effects on the load of the processor, and
the global scheme can adapt to these changes faster.
Since the effective prediction is still correlated to the
actual prediction, this manifests itself as a lower IPC loss.
Since there is a tendency in processor design toward
deeper pipelines, the performance of the predictor
schemes in deep pipelines is a concern. To test this, a
series of simulations was run on processor models of
different depths. Each model was created by multiplying
all of the latencies in Table 1 by a constant. Table 2
shows the resulting IPC loss from Baseline scheduling to
the two feedback-adjusted prediction schemes. The
relative depth in Table 2 is the constant by which the
pipeline depths in Table 1 were multiplied.
Table 2. Prediction IPC loss for deep pipelines

Relative Depth               1     2     3     4
WPLocalAdj IPC loss (%)      7.6   8.1   8.3   8.4
WPGlobalAdj IPC loss (%)     7.0   7.7   7.8   7.8
As can be seen, after an initial small deterioration from
relative depth 1 to 2, the loss is quite stable. This means
that the predictors can be expected to perform about as
well on deep pipelines as they do on shorter ones.
However, the results in Table 2 assume that the scheduler
remains unpipelined and capable of scheduling dependent
instructions back-to-back, which is somewhat unrealistic.
Table 3 shows the same data for processors where the
wakeup logic (predicted or otherwise) is pipelined. For
relative depth 2 the wakeup logic has two stages, and this
increases proportionally for the other depths. Here the
IPC loss also remains largely stable and the changes that
do occur actually reduce the performance gap between
dependency-based wakeup and predicted wakeup. One
reason is that a pipelined predicted system can still
execute many dependent instructions back-to-back,
whereas the pipelined Baseline system cannot.
Table 3. Prediction IPC loss for deep pipelines
with pipelined wakeup logic
Relative Depth               1     2     3     4
WPLocalAdj IPC loss (%)      7.6   7.4   7.3   7.2
WPGlobalAdj IPC loss (%)     7.0   6.7   6.4   6.2
In Section 4, the idea of exponential backoff of
predicted wakeup times when replays occur was
motivated by observations of the Baseline configuration.
To validate it in predictor-based systems, a comparison
was performed between an exponential backoff scheme
and a linear backoff scheme. The results are given in
Table 4. While the throughput values of the two backoff
schemes are similar on this relatively wide configuration,
the replay rate for the exponential scheme is only half
what it is for the linear scheme.
Table 4. Exponential vs. linear backoff
Backoff Scheme    WPLocalAdj IPC   WPLocalAdj replay%   WPGlobalAdj IPC   WPGlobalAdj replay%
linear            1.32             118                  1.38              57
exponential       1.37             52                   1.39              38
Finally, some discussion is needed on the number of
replays generated by wakeup prediction. In the Baseline
configuration, there were 0.03 replays, on average, for
each dynamic instruction executed. This was compared
to WPGlobalAdj, with 0.38 replays. Nearly all of these
were single replays. This may seem significant, but even
in a constrained processor, many resources are unused at
any particular time. The feedback mechanism caused the prediction scheme to make better use of those resources given the way the scheduling system operates.
The relatively low drop in the IPC measurement, along
with the observation that the replay rate is lower in
benchmarks with higher IPC values, supports this.
The drawback is, though, that the predicted wakeup
scheme can potentially use more power than a traditional
wakeup scheme. While the predicted wakeup could
conceivably use less power for wakeup than, for example,
a tag-matching mechanism, it also places a higher load on
the functional units and register file. This may negate, or
even exceed, any front-end power benefit.
8. Future Work
Observation of the methods in this paper has shown
that the feedback-adjusted methods would sometimes use
negative values for r. The probable reason for this is that
on a lightly-loaded machine, it may be advantageous to
attempt to execute instructions before they are expected to
be ready. Thus, the feedback-adjustment mechanism
needs to cause the update units of the predictors to
generate predictions that are less than the expected
wakeup time. This can be achieved through a negative
value of r, but this is a bit awkward. By incorporating
processor load or other factors directly into the update
units, aggressive scheduling can be achieved directly,
perhaps leading to greater performance.
The idea of gradient-descent minimum searching and
other feedback-based adjustments seems like a solution to
the growing number of hardwired constant values present
in modern processors. Nearly every new architectural
feature introduces some constants that must be set to
reasonable values to ensure proper operation. Currently,
a designer has a choice for these constants: guess at a
good set of values, or run simulations with non-linear
optimization to find a good set of values. The number of
simulations required increases rapidly with the number of
parameters, thereby taking a considerable amount of the
allocated design time. Even so, the values found may
only produce good results when run on the same
benchmarks used for optimization. Another possible
approach is to optimize values for which the optimal
value is consistent across a large set of benchmarks, and,
for the other values, dynamic feedback-based adjustment
mechanisms can be built into the hardware itself. The
potential advantage is that a processor can self-configure
to run any particular application in a near-optimal way.
9. Conclusion
This paper presented a method of eliminating the
scheduling critical cycle. The method relies on the ability
to predict the wakeup times of instructions and the ability
to adjust those predictions to account for the cost of
instruction replays. In order to account for differing
application behavior and to remove the need for an
arbitrary constant, a dynamic feedback-based adjustment
technique was introduced. The system has no potential
critical path or cycle that cannot be pipelined. The
performance of the system shows a 7% IPC slowdown
when compared to a processor without the system, but
this buys the ability to remove any part of the wakeup and
select logic from the critical path of a processor; in a
deeply pipelined processor, this is of critical importance.
It is reasonable to conclude that this system constitutes a
plausible method of eliminating the critical cycle in the
wakeup and scheduling logic.
10. Acknowledgements
The authors thank the other members of the Advanced
Computing Systems group as well as the anonymous
referees for providing feedback during various stages of
this work. This material is based upon work supported by
the C2S2 MARCO Center with support from AMD.
11. References
[1] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith, “Complexity-Effective Superscalar Processors”, ISCA-24, 1997, pp. 206-218.
[2] Ramon Canal and Antonio González, “Reducing the Complexity of the Issue Logic”, ICS 2001, pp. 312-320.
[3] Ramon Canal and Antonio González, “A Low-Complexity Issue Logic”, ICS 2000, pp. 327-335.
[4] Dan Ernst, Andrew Hamel, and Todd Austin, “Cyclone: A Broadcast-Free Dynamic Instruction Scheduler with Selective Replay”, ISCA-30, 2003.
[5] Pierre Michaud and André Seznec, “Data-flow Prescheduling for Large Instruction Windows in Out-of-order Processors”, HPCA-7, 2001.
[6] J. Stark, M. D. Brown, and Y. N. Patt, “On Pipelining Dynamic Instruction Scheduling Logic”, MICRO-33, 2000.
[7] Mary D. Brown, Jared Stark, and Yale N. Patt, “Select-Free Instruction Scheduling Logic”, MICRO-34, 2001.
[8] S. A. Mahlke, et al., “Sentinel Scheduling for VLIW and Superscalar Processors”, ACM Transactions on Computer Systems, 11(4):376-408, 1993.
[9] B. Slechta, et al., “Dynamic Optimization of Micro-Operations”, HPCA-9, 2003.