A Buffer-Oriented Methodology for Microarchitecture Validation

JOURNAL OF ELECTRONIC TESTING: Theory and Applications 16, 49–65 (2000)
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.
NOPPANUNT UTAMAPHETHAI, R.D. (SHAWN) BLANTON AND JOHN PAUL SHEN
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
nau@ece.cmu.edu
blanton@ece.cmu.edu
shen@ece.cmu.edu
Received December 21, 1998; Revised July 7, 1999
Editor: M.S. Abadir
Abstract. We propose a methodology for validating microarchitecture specifications. We view microarchitecture
features as specific operations on entries of various buffers in the processor. Our validation approach is to determine
the functionality of a buffer type, model its operations at the microarchitecture level using abstract finite state
machine (FSM) models, and rigorously generate instruction sequences that systematically exercise the model of
each instance of that buffer type. A high-level test sequence is derived based on the abstract FSM model using
FSM testing techniques, and then translated to a test program that exercises the functionality of each buffer entry.
This methodology is applied to the microarchitecture specifications of the PowerPC 604. The effectiveness of the
sequences generated using our methodology is compared with that of some real and randomly-generated programs.
Simulation results show that all targeted FSM transitions are covered by our sequences with at least 1000× and 3×
fewer instructions than real and randomly-generated programs, respectively.
Keywords: processor validation, design validation, superscalar microarchitecture
1. Introduction
Microprocessor performance has been doubling every
eighteen months due primarily to the increase in clock
speed and instruction-level parallelism. Higher clocking rates are achieved via process improvements and the
use of deeper pipelines. Instruction-level parallelism is
measured in terms of the average or sustained number
of instructions processed per machine cycle (IPC). To
increase IPC, current high-end microprocessors employ very aggressive and complex microarchitecture
mechanisms. The dominant mechanisms include: dynamic branch prediction to ease the constraints imposed by control flow dependencies; register renaming
to remove the unnecessary serialization imposed by
false data dependencies; reservation stations to buffer
instructions awaiting their operands without having to
stall the fetch and decode stages; out-of-order issuing to avoid the unnecessary stalling of subsequent
ready instructions; and reorder buffers to support in-order completion and precise exceptions even when instructions are executed out of order. These aggressive
mechanisms introduce significant complexity to modern superscalar microarchitectures, which makes their
validation an extremely difficult task.
1.1. Motivation
The current practice in industry for validating these aggressive microarchitecture mechanisms is through simulation. Simulation models of the microarchitecture are
developed early in the design cycle and then maintained
and used during the entire design cycle. Traditionally
two types of simulation models are used, functional
and timing. Functional models are used to validate the
correct behavior of the mechanisms, while the timing
models are used for characterizing the performance (in
terms of machine cycles) of these mechanisms.
Design validation is performed by exercising the
simulation models and examining the simulation outcome. In order to exercise these models for validation,
instruction sequences or test sequences are used as input stimuli to the simulation models. Generally three
types of test sequences are used [1]. First, real application programs can be used as test sequences. While
these programs may represent the actual user workload, they may not fully exercise the machine. Second,
hand-generated test programs are written by the designers to probe specific areas of the machine and to
test the “corner conditions” of the machine behavior.
Third, randomly generated programs are used to supplement the previous two types of test sequences by
exercising the machine models to a greater extent.
The above approaches to microarchitecture validation have a number of weaknesses. Using real application programs and randomly generated programs as test
sequences can be very inefficient. On the other hand,
explicitly generated test sequences are produced in an
ad-hoc fashion based on the intuitive knowledge of the
designer and not on a rigorous representation of the microarchitecture mechanisms. There is a real need for a
systematic method to generate highly-efficient test sequences for design validation that is based on rigorous
and efficient representations of the microarchitecture
mechanisms that can yield quantitative coverage figures. The goal of this work is to address this need.
1.2. Our Approach
This work presents a systematic method for generating efficient test sequences for rigorously validating contemporary superscalar microarchitectures that
employ deep pipelines, aggressive speculation, and
out-of-order execution. This method operates at the
microarchitecture level of abstraction, not at the RTL
implementation, with the objective to validate the behaviors of the key microarchitecture mechanisms. To
deal with the complexity problem, we partition the
entire machine into a set of critical buffers and then
generate test sequences for validating the operation of
each of these critical buffers. We model the behavior of
each buffer using abstract finite state machines (FSMs).
Based on these abstract FSM models, we generate test
sequences for each buffer. The approach resembles that
of automatic test pattern generation (ATPG) for logic
circuits and borrows some ideas from functional testing
of iterative array structures [2].
We first summarize related work in Section 2 and
then describe the key concepts of our buffer-oriented
approach in Section 3. Sections 4 and 5 demonstrate
and apply this approach to the PowerPC 604 superscalar microprocessor. The experimental results are
presented in Section 6. Summary and conclusion are
documented in Section 7.
2. Related Work
An alternative to the current practice of design validation via simulation is the use of formal methods for design verification [3–7]. Design verification techniques
have proven to be quite successful in verifying logic
circuits. More recently these techniques have been extended to verifying microprocessors. However, only a
portion of a complex microprocessor or relatively simple microprocessors can be verified using these formal
methods [8, 9].
A key problem with the formal verification methods is the issue of complexity explosion. While the
efficiency of formal verification algorithms has been
significantly improving, the complexity of contemporary superscalar designs appears to be increasing at
an even faster rate. This complexity problem is further exacerbated by two dominant trends in contemporary superscalar design that appear to be antagonistic
with respect to the current strengths of formal methods.
Formal verification techniques have been successfully
applied to and shown to be quite effective with the verification of the datapath portion of a microprocessor
[9, 10]. However, the added complexity in modern superscalar machines that makes the task of validation
difficult is mostly in the control logic portion for managing instructions that are simultaneously in flight in
many pipe stages.
Recent research efforts on formal methods have focused on verifying pipelined designs [8, 11]. These
techniques are most effective when the pipeline is fairly
shallow and straightforward, with a single sequential
path for instruction flow and minimal interaction between different pipe stages. Modern superscalar design
trends are contrary to these assumptions in that they
are becoming deeper and deeper in terms of pipe depth
with increasing interactions between the different pipe
stages. There are many examples of such interactions.
For example, branch predictors are accessed during the
fetch stage but must be updated by the branch execution unit in the execute stage. Register renaming is performed in the decode stage, but rename registers are
retired by the write back stage. In addition to advancing instructions to the reservation stations in the next
pipe stage, the dispatch stage must also allocate entries
in the reorder buffer, which is usually viewed as part of
the completion stage. The validation of these complex
interactions between pipe stages is crucial and poses
the most difficult validation problem.
In addressing the complexity problem, a number of
recent papers have been published that attempt to leverage specific algorithms from formal verification but
adopt a more simulation-based and/or testing-oriented
methodology. One emerging paradigm combines formal verification and functional-level simulation. Most
work [12–15] uses abstract FSMs provided by the designer or part of the specification as models for test generation. Formal verification techniques, such as property checking during state enumeration and transition
traversal, are used to analyze the FSMs and generate
tests. In general, a set of tests is generated to maximize
some coverage metric (e.g., transition coverage).
The abstract FSM used in previous work is typically
derived from the structural implementation of the design. The resulting model is then likely to be limited
to the functionality of a single pipe stage. As pointed
out earlier, microarchitecture mechanisms generally involve interactions between many pipe stages. Consequently, representing a microarchitecture mechanism
requires the composition of many FSMs which then
leads to the aforementioned complexity explosion.
3. Buffer-Oriented Methodology
The microarchitecture validation method proposed in
this work differs from current ad-hoc practices in being
a systematic, general method that is based on a rigorous model and is amenable to algorithmic implementation. As compared to formal methods, it addresses
the complexity explosion problem via the use of appropriate abstraction and machine partitioning. It also
leverages elegant ideas from traditional test generation
techniques for iterative digital logic circuits.
3.1. Abstraction of Microarchitecture Mechanisms
The proposed validation method operates at the microarchitecture level of abstraction and targets the
validation of critical microarchitecture mechanisms.
This level of abstraction is the one used by microarchitects during the early definition of a new microarchitecture. While pipeline stages and major functional
units are explicitly represented at this level of abstraction, our focus is more on validating the behavior of
the major microarchitecture mechanisms and not necessarily the actual register-transfer level implementation of these mechanisms. The goal is to ensure that
the microarchitecture mechanisms, such as branch prediction, register renaming, and out-of-order execution,
function as the microarchitect intends. Notice that this
in effect validates the performance of the machine since
machine timing is determined by the movement of
instructions through the pipeline states, which is in
turn determined by the operation of the buffers being
validated. Once a microarchitecture design is implemented at the register transfer or logic gate level, other
Boolean-based tools can be employed to validate the
implementation based on the structural representation
of the design.
The proposed method also targets the control aspect
of the microarchitecture and not the operation of the
datapath functional units. Datapath functional units can
be effectively and independently verified using formal
verification methods. Even though some of these functional units may be pipelined, such execution pipelines
are straightforward in that internal feedback, forwarding paths and interacting instructions do not exist between pipeline stages. Our validation method focuses
on the more difficult control aspect of the microarchitecture that involves interactions between instructions
in different pipeline stages of the machine.
3.2. Buffer-Oriented Machine Partitioning
It is impractical to validate the entire machine as a single unit. To alleviate the complexity problem, the machine is always partitioned into
multiple modules that can be validated separately. However, unlike many recently published
papers that partitioned the machine based on pipeline
stages, we take a very different approach to machine
partitioning. A careful examination of most contemporary superscalar designs reveals that all the major
microarchitecture mechanisms involve the operation
of certain critical buffers. For example, branch prediction involves the reading of the branch target buffer
or the branch history table in the fetch stage. During
branch resolution, the branch execution unit updates
these buffers.
Fig. 1. A microarchitecture can be viewed as a set of critical buffers,
each of which has multiple, identical entries. Each buffer entry
stores “DATA” (instructions and/or operands) and status information that describes the state of the current “DATA”.
These critical buffers provide the basis for all the
control aspects of the machine and thus determine performance. They usually have multiple entries and are
frequently multi-ported. Their operations are usually
dependent on control signals originating from different
pipeline stages. The “DATA” stored in these buffer entries can be operands or instructions. Other than these
“DATA” bits, a buffer entry can also store status bits
that indicate the status or state of the “DATA” being
stored there; see Fig. 1. Usually these stored status bits
are used as control signals to trigger subsequent events
and carry out the control function of the machine. The
critical buffers in a contemporary superscalar microarchitecture include the following: branch target buffer,
branch history table, register rename buffers, reservation stations, and reorder buffer. All of these buffers are
considered critical because their entries all carry certain status or state bits. The control aspect of the entire
superscalar microarchitecture can be viewed as consisting of the management of the operations performed
on entries of these critical buffers.
3.3. FSM Model of Buffer Entries
Based on the above stated machine partitioning, this
work proposes a buffer-oriented validation method. A
microarchitecture is partitioned into its critical buffers.
All the critical buffers of a machine are separately and
sequentially validated. All of these critical buffers have
multiple entries. The number of entries can vary from
a few to several dozen or potentially several hundred
in future designs. All the entries of the same buffer are
symmetrical; hence the behavior of the entire buffer
can be characterized by the behavior of an entry. The
operation of an entry can involve the reading or writing
of the “DATA” in that entry.
Fig. 2. (a) Read and write operations of the status bits of a particular buffer entry. (b) Modeling status bits as an FSM.
In addition to the “DATA”, each entry can also store
status bits representing the state of the “DATA” stored
there. Other than reading or writing of “DATA” in an
entry, a buffer entry operation can also involve the updating of the status bits of that entry. The status bits of
an entry can in turn be viewed as the “state bits” of that
entry. Hence, each entry can be viewed as being in one
of several states. When the status bits are updated, this
can be viewed as a state transition. Consequently each
entry of the buffer can be modeled as a small finite state
machine (FSM) capable of being in one of many states
with state transitions triggered by external control signals and the next state logic of the FSM. The FSM
model of the entry fully characterizes the behavior of
each entry of a critical buffer; see Fig. 2.
In traditional logic testing, an iterative array structured circuit can be elegantly tested by partitioning the
circuit into its symmetrical modules, each of which is
separately and identically tested. We leverage this concept in our buffer-oriented validation method. Since the
entries of a buffer are symmetrical, a buffer is validated
by separately and identically validating each of its entries. Since each entry is characterized by an FSM, it
can be validated by exercising its FSM model. A number of specific techniques can be employed for exercising the FSM model. Three obvious choices are: (1)
check that all the states can be entered; (2) check that
all the state transitions can be performed; and (3) check
the FSM behavior exhaustively.
In our work [16–18], we have used the latter two
techniques but here we focus on the second technique
of exercising all the state transitions by generating a
test sequence of instructions, the simulation of which
will force a buffer entry to traverse all of its state transitions. The state transitions are verified by examining
the simulation outcome. We repeat this process for each
of the entries of a buffer and then repeat the entire process for each of the critical buffers. We then define
coverage of a test sequence as the percentage of all
possible state transitions checked by that test sequence
of instructions. To summarize, our buffer-based validation method involves:
1. partitioning a microarchitecture into its critical
buffers;
2. generating the FSM models for each entry of all the
critical buffers;
3. constructing a transition tour for each FSM
model;
4. synthesizing a test sequence of instructions to carry
out each of the transition tours; and
5. simulating the resulting test program to verify
coverage of each transition tour.
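Steps 3 and 5 can be illustrated with a short Python sketch. It is not the authors' tool: the greedy nearest-untraversed-transition strategy, the function names, and the toy two-state FSM (whose states and input labels are hypothetical, loosely patterned on a predicted-not-taken/predicted-taken entry) are all assumptions made here for illustration. The coverage function follows the definition given above, i.e., the fraction of all state transitions exercised by a sequence.

```python
from collections import deque


def transition_tour(fsm, start):
    """Greedy transition tour (an assumed strategy, not the paper's algorithm).

    fsm maps state -> {input: next_state}. From the current state, a BFS finds
    the shortest input path that ends in an untraversed transition; the path is
    taken and the search repeats until every transition has been traversed.
    """
    remaining = {(s, i) for s, edges in fsm.items() for i in edges}
    tour, state = [], start
    while remaining:
        queue, seen, path = deque([(state, [])]), {state}, None
        while queue and path is None:
            s, prefix = queue.popleft()
            for i, nxt in sorted(fsm[s].items()):
                if (s, i) in remaining:
                    path = prefix + [(s, i)]
                    break
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, prefix + [(s, i)]))
        if path is None:
            raise ValueError("some transitions are unreachable from the current state")
        for s, i in path:
            remaining.discard((s, i))
            tour.append(i)
            state = fsm[s][i]
    return tour


def transition_coverage(fsm, start, inputs):
    """Fraction of all FSM transitions traversed by the given input sequence."""
    total = sum(len(edges) for edges in fsm.values())
    hit, state = set(), start
    for i in inputs:
        hit.add((state, i))
        state = fsm[state][i]
    return len(hit) / total


# Hypothetical toy FSM; state and input names are illustrative only.
TOY_FSM = {
    "PNT": {"execute": "PT", "remove": "PNT"},
    "PT": {"hit": "PT", "remove": "PNT"},
}

tour = transition_tour(TOY_FSM, "PNT")
coverage = transition_coverage(TOY_FSM, "PNT", tour)
```

In the methodology itself, each input symbol of such a tour would then be mapped to instructions that trigger the corresponding transition (step 4), and the coverage would be measured on the simulation outcome rather than on the abstract model.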
In the next two sections, we apply this buffer-based
methodology to the PowerPC 604 microarchitecture to
demonstrate the practical feasibility of our validation
method and the efficiency of the test sequences so generated. In this example application to the 604, further
details of the validation method are also presented.
4. Application to the PowerPC 604
The PowerPC 604 (Fig. 3) is used as a research vehicle
to explore our validation methodology. It serves our
purpose since it implements all the characteristics of a
speculative and out-of-order processor through the microarchitecture mechanisms of branch prediction, register renaming, reorder buffering and operand buffering
via reservation stations. In the following subsections,
we describe each of these components in detail and
present the FSM models of their buffer entries.
4.1.
Branch Prediction
Processors attempt to minimize the performance degradation due to altered program flow caused by branch
instructions through branch prediction. Branch prediction attempts to determine the flow of the instruction
stream so that instructions and operands are prefetched
sufficiently far in advance [19]. Branch prediction is
conceptually composed of determining the direction
of a branch (taken or not-taken) and its target address.
The branch prediction mechanism implemented in the
PowerPC 604 utilizes a branch target address cache
(BTAC) and a branch history table (BHT) [20]. The
BTAC is a 64-entry, fully associative cache with a
round-robin replacement policy. It stores the target addresses of both unconditional and predicted-taken conditional branches.
Fig. 3. Pipeline organization and the critical buffers (shaded blocks) of the PowerPC 604.
Fig. 4. FSM model for the unconditional branch for an
entry of the branch target address cache (BTAC).
The BHT is a 512-entry, direct-mapped cache that
stores the execution history of conditional branches.
The two status bits of a BHT entry encode one of four histories:
strong not taken (SNT), weak not taken (WNT), weak
taken (WT), and strong taken (ST). The history state
is used to make predictions for conditional branches.
If the state is either ST or WT, the conditional branch
is predicted to be taken. Otherwise it is predicted not
taken. The history table is updated when the actual
branch is resolved to be taken (RT) or not taken (RNT).
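The prediction rule above can be sketched as a small Python class. The four states, the taken/not-taken prediction rule, and the cold-start state SNT (Section 4.1.2) come from the text; the update rule is an assumption, since the arcs of the BHT FSM are given only in a figure. The sketch assumes the conventional 2-bit saturating counter, in which a resolved-taken (RT) outcome moves one step toward ST and a resolved-not-taken (RNT) outcome moves one step toward SNT. The class and method names are hypothetical.

```python
# History states ordered from strongly not taken to strongly taken.
STATES = ["SNT", "WNT", "WT", "ST"]


class BHTEntry:
    """One branch history table entry, modeled as a 2-bit saturating counter.

    The saturating update is an assumption; the paper defines the exact
    transitions in its BHT FSM figure.
    """

    def __init__(self):
        self.level = 0  # cold start: SNT

    @property
    def state(self):
        return STATES[self.level]

    def predict_taken(self):
        # Taken is predicted only in the WT and ST states.
        return self.state in ("WT", "ST")

    def resolve(self, taken):
        # RT moves one step toward ST, RNT one step toward SNT (assumed).
        if taken:
            self.level = min(self.level + 1, len(STATES) - 1)
        else:
            self.level = max(self.level - 1, 0)


entry = BHTEntry()
history = []
for outcome in (True, True, True, True, False):
    history.append((entry.state, entry.predict_taken()))
    entry.resolve(outcome)
```

Each tuple in history records the state and prediction seen by a branch before its outcome is applied, which is the information a transition-coverage check on a BHT entry would need.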
4.1.1. Branch Target Address Cache. The BTAC is
a critical buffer which stores prediction information
for both unconditional and conditional branches using 1 bit of status information. We use two separate
FSMs to model the functionality of a BTAC entry for
the unconditional and conditional branches. Figure 4
shows the FSM model for a BTAC entry that corresponds to an unconditional branch, hence this model
describes the unconditional branch prediction function
for each of the 64 entries of the BTAC. When an unconditional branch instruction is first encountered, the
remove transition is traversed into the predicted-not-taken state (PNT), where the prediction of not taken
(T = 0) is made. This is the initialization that allocates
an entry of BTAC for a particular unconditional branch.
The fetch address of the branch and its target address
are loaded into the BTAC at the position of the round-robin pointer. After the branch is executed, the FSM
transitions to the predicted-taken (PT) state. Now if the
same unconditional branch is encountered, it is predicted to be taken (T = 1) and the FSM responds by
self-looping on the PT state. The transition from PT to
PNT occurs when the branch is removed (the remove
transition) from the BTAC by the entrance of another
branch instruction when the round-robin pointer is at
this entry.
The model for a BTAC entry containing a conditional branch is more complex. Figure 5 shows a 6-state
FSM that makes predictions based on branch histories.
When a new conditional branch enters the BTAC, one
of the four remove transitions (which is a BTAC miss)
is traversed to a state where a prediction of not taken
(T = 0) is made. The specific remove transition performed depends on the branch history stored in the corresponding direct-mapped entry of the BHT. When the
conditional branch is resolved in the execution stage, a
transition is made based on the outcome. (Thus, upon
the first encounter of a branch, two transitions are made
in this model.) A branch resolved taken (RT) causes a
state transition along the RT arc. A resolved not taken
(RNT) outcome causes an RNT transition. The not-taken states (WNT and SNT) make not-taken predictions (T = 0) while the taken states (WT and ST) predict branches to be taken (T = 1). Future encounters of
the branch instruction use the output value indicated in
the current state (T = 0 or T = 1) for the prediction.
Fig. 5. FSM model for the conditional branch for an entry of the branch target address cache (BTAC).
4.1.2. Branch History Table. Figure 6 shows the
FSM model for each of the 512 BHT entries. A cold
start initializes all entries in the BHT to the start state
SNT. Any conditional branch whose address directly
maps to the same BHT entry will cause transitions in
that entry’s FSM when the branch is resolved in the
execution stage. The current state of the FSM is used
to make predictions (possibly overriding BTAC predictions) for instructions in the decode and dispatch stages.
States WNT and SNT make a branch prediction of not
taken (T = 0) while states WT and ST predict taken
(T = 1).
Fig. 6. FSM model for an entry of the branch history table (BHT).
4.2. Register Renaming
Register renaming [21] is a technique that improves
parallelism by eliminating stalled cycles due to anti- and output dependencies. It avoids contention for a
given register file location in the course of out-of-order
execution by storing instruction results in temporary
(rename) buffers. The PowerPC 604 uses three sets
of rename buffers for three register files—a 12-entry
general purpose rename buffer (GRB) for general purpose registers, an 8-entry floating-point rename buffer
(FRB) for floating-point registers and an 8-entry condition code rename buffer (CRB) for condition code
registers.
Figure 7 shows the FSM model for the operation
of each rename buffer entry. Note, the model is valid
whether the rename buffer entry is from the GRB, FRB
or CRB. An entry is Free until the dispatch unit allocates an entry for an instruction in the dispatch stage.
This occurs if an instruction modifies any register. The
entry remains allocated until the instruction completes
and the result is written back to the register file. There
are two states for an allocated entry. At the time of
renaming, each newly allocated rename entry will always hold the most recent (MR) value for the renamed
register denoted by the MR Alloc state of Fig. 7. If a
rename entry is allocated to a register which is then
later renamed by another instruction, the previously allocated entry will no longer hold the most recent value
and will therefore transition from the MR Alloc state
to the NonMR Alloc state.
Fig. 7. FSM model for an entry of any of the three rename buffers.
Once the instruction finishes, the content of the rename entry becomes valid which causes a transition
from MR Alloc (NonMR Alloc) to MR Valid (NonMR
Valid). The FSM stays in the valid state until the result is written to the register file (WB transition) or a
prior instruction causes an exception that requires all
subsequent instructions to be discarded (discard transition).
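The rename buffer entry model described above can be written down as an explicit transition table. The Python sketch below encodes only the transitions stated in the text; the FSM figure may contain further arcs (for example, a discard from an allocated state or a change from MR Valid to NonMR Valid) that are deliberately omitted here. The state names follow the text, WB and discard are the paper's transition names, and the remaining event labels (allocate, rename_again, finish) are hypothetical.

```python
# (current state, event) -> next state, restricted to transitions named in the text.
RENAME_FSM = {
    ("Free", "allocate"): "MR Alloc",             # dispatch allocates the entry
    ("MR Alloc", "rename_again"): "NonMR Alloc",  # a later instruction renames the same register
    ("MR Alloc", "finish"): "MR Valid",           # instruction finishes, content becomes valid
    ("NonMR Alloc", "finish"): "NonMR Valid",
    ("MR Valid", "WB"): "Free",                   # result written to the register file
    ("NonMR Valid", "WB"): "Free",
    ("MR Valid", "discard"): "Free",              # a prior instruction caused an exception
    ("NonMR Valid", "discard"): "Free",
}


def step(state, event):
    """Return the next state, or raise if the model has no such transition."""
    try:
        return RENAME_FSM[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for event {event!r} in state {state!r}")


def run(events, state="Free"):
    """Apply a sequence of events starting from the Free state."""
    for event in events:
        state = step(state, event)
    return state
```

A table of this form is also a convenient input for tour generation, since the set of keys is exactly the set of transitions that a test sequence must cover.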
4.3. Reservation Stations
When instructions are dispatched to an appropriate
functional unit, they are placed in reservation stations
(RSs) before they are executed [22]. Instructions are
buffered in reservation stations even if their operands
are ready. Reservation stations are useful because they
allow instructions that do not have ready operands to
progress deeper in the pipeline; this allows the frontend
of the pipeline to process new instructions instead of
stalling.
The PowerPC 604 has distributed reservation stations, that is, each of the functional units has a dedicated
two-entry buffer for two instructions of the proper type.
If a reservation station is available and an instruction
can be dispatched, an entry in the reservation station
will be allocated for the instruction. The following conditions prevent an instruction from being dispatched to
a reservation station: (i) The required reservation station or its write port is full; (ii) The reorder buffer
(described next) or its write port is full; (iii) An instruction break occurs and changes the program flow;
(iv) No rename register is available.
When an entry in a reservation station is allocated,
the value of each instruction operand is written into the
reservation station entry. If the value is not yet available,
the tag (i.e., the status bits that identify the result) of
the pending operand will be used instead of the actual
value. Once all of the operands (“DATA”) are available,
the instruction is removed from the reservation station
and execution begins.
Fig. 8. FSM model for a reservation station entry containing an instruction with (a) two source operands, (b) one source operand.
Figure 8(a) shows the FSM model for each RS entry of a two-operand PowerPC instruction (an add instruction for example). Every RS entry starts as free
and available for allocation. Hence, the starting state
for an RS entry is the Free state shown in the center
of the diagram of Fig. 8(a). When an instruction is in
the dispatch stage and all dispatch conditions are met,
an entry in the RS is allocated. Since the instruction
requires two source operands, there are four possible
status states for an RS entry:
• Alloc 00: No source operands are available
• Alloc 01: Only the right source is available
• Alloc 10: Only the left source is available
• Alloc 11: Both source operands are available
The FSM can transition from the Alloc 00, Alloc
01 and Alloc 10 states to another Alloc state when
other operand(s) become available. An RS entry is deallocated if all operands are ready and the instruction is
issued (issue transition); or if the entry is discarded
due to an exception created by a prior instruction (discard transition).
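A minimal Python sketch of the two-operand reservation station entry follows. The states (Free, Alloc 00, Alloc 01, Alloc 10, Alloc 11) and the issue and discard transitions are taken from the text; the class name, method names, and the choice to encode the state as a string are assumptions made for illustration, and the guard conditions reflect one reading of the description rather than the paper's figure.

```python
class RSEntry:
    """Reservation station entry for a two-source-operand instruction."""

    def __init__(self):
        self.state = "Free"

    def dispatch(self, left_ready, right_ready):
        # Allocation: the first digit tracks the left source, the second the right.
        if self.state != "Free":
            raise RuntimeError("entry is already allocated")
        self.state = f"Alloc {int(left_ready)}{int(right_ready)}"

    def operands_ready(self, left=False, right=False):
        # One or both pending operands become available.
        if self.state == "Free":
            raise RuntimeError("entry is not allocated")
        has_left = self.state[-2] == "1" or left
        has_right = self.state[-1] == "1" or right
        self.state = f"Alloc {int(has_left)}{int(has_right)}"

    def issue(self):
        # Issue is only possible once both operands are available.
        if self.state != "Alloc 11":
            raise RuntimeError("operands are not all available")
        self.state = "Free"

    def discard(self):
        # A prior instruction raised an exception; the entry is deallocated.
        if self.state == "Free":
            raise RuntimeError("entry is not allocated")
        self.state = "Free"


entry = RSEntry()
trace = [entry.state]
entry.dispatch(left_ready=False, right_ready=True)
trace.append(entry.state)
entry.operands_ready(left=True)
trace.append(entry.state)
entry.issue()
trace.append(entry.state)
```

The one-operand model of Fig. 8(b) would follow the same pattern with a single availability bit.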
4.4. Reorder Buffer
Out-of-order execution allows independent instructions to be executed in an order that is different from the
original program order. However, out-of-order instructions must complete in program order to ensure precise
exception handling. Pipelines of superscalar processors
can be typically divided into an in-order frontend, an
out-of-order execution core and an in-order backend.
During the last stage of the in-order frontend, an entry
is allocated for an instruction in a reorder buffer [23].
The execution of the instruction is then performed in
the out-of-order core. When the instruction finally completes or transfers its speculative state into permanent
machine state in the in-order backend, the associated
reorder buffer entry is deallocated in program order.
In essence, a reorder buffer entry is a place holder for
results that preserves program order.
The PowerPC 604 uses a 16-entry reorder buffer to
implement in-order completion at the backend. A reorder buffer entry is allocated during instruction dispatch. When an instruction finishes execution, its status (completed with or without exception) is recorded
in the status bits of the corresponding buffer entry. The
completion unit retires up to four finished instructions
per cycle from the reorder buffer and updates register
files in the complete stage. The completion unit recognizes exceptions and, if necessary, discards any operations performed by subsequent instructions in program
order.
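The in-order completion behavior described above can be sketched in Python as follows. The buffer size of 16 and the retirement of up to four finished instructions per cycle come from the text; everything else is an assumption for illustration, including the class and method names, the use of integer tags to identify instructions, and the simplified exception handling (the excepting instruction is removed without being retired and all younger entries are discarded in the same cycle).

```python
from collections import deque


class ReorderBuffer:
    """Simplified reorder buffer: allocation at dispatch, in-order retirement."""

    def __init__(self, size=16, retire_width=4):
        self.size = size
        self.retire_width = retire_width
        self.entries = deque()  # program order, oldest entry on the left

    def dispatch(self, tag):
        # Dispatch stalls (returns False) when the buffer is full.
        if len(self.entries) >= self.size:
            return False
        self.entries.append({"tag": tag, "finished": False, "exception": False})
        return True

    def finish(self, tag, exception=False):
        # Execution may finish out of order; only the status is recorded here.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["finished"] = True
                entry["exception"] = exception
                return
        raise KeyError(tag)

    def complete_cycle(self):
        # Retire finished instructions from the head, oldest first.
        retired = []
        while self.entries and len(retired) < self.retire_width:
            head = self.entries[0]
            if not head["finished"]:
                break
            self.entries.popleft()
            if head["exception"]:
                self.entries.clear()  # discard all subsequent instructions
                break
            retired.append(head["tag"])
        return retired


rob = ReorderBuffer()
for tag in range(6):
    rob.dispatch(tag)
rob.finish(1)
rob.finish(2)
blocked = rob.complete_cycle()      # instruction 0 has not finished yet
rob.finish(0)
in_order = rob.complete_cycle()     # 0, 1 and 2 retire together, in program order
rob.finish(3)
rob.finish(4, exception=True)
rob.finish(5)
after_exception = rob.complete_cycle()
```

The example shows that instructions 1 and 2, although finished early, cannot retire until instruction 0 finishes, and that instruction 5 is discarded because it follows the excepting instruction 4.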
Figure 9 shows the FSM model for the operation
of each reorder buffer entry. A reorder buffer entry
is available for allocation if its FSM is in the Free
state. The FSM transitions from the Free state to the
Allocate state when the instruction is dispatched to
a reservation station. The FSM will transition from the
Allocate state to the Execute state and finally to the
Finish state when the instruction executes and finishes, respectively. The reorder buffer entries can be
discarded if the corresponding instructions follow an
instruction that causes an exception (discard transition).

Fig. 9. FSM model for an entry in the reorder buffer.

5. Test Generation

Based on the FSM models of the critical buffers, a test
sequence can be derived using FSM testing techniques.
The test sequence consists of a sequence of FSM transitions that exercise the FSM in some prescribed way
(such as a complete or partial state tour, a complete or partial
transition tour, or a checking experiment [24]). An
obvious tradeoff exists between sequence generation
complexity (effort required and sequence length) and
the level of validation achieved, depending on the FSM
testing technique selected. Partial tours cover a given
subset of states or transitions of an FSM while complete
tours visit every state or transition in the FSM. A complete
transition tour is widely used because it is an effective
means for exercising a significant amount of the FSM’s
functionality while requiring little effort to generate.
Although a transition tour cannot guarantee total correctness, it is a proven technique for uncovering design
errors [25].
A checking sequence [24] is the most powerful form
of FSM validation. It consists of three phases. The first
phase synchronizes the machine into a particular state
given the general assumption that the initial state is unknown. The second phase verifies the existence of all
the specified states while the third ensures the correctness of each state transition. The checking sequence
for an FSM guarantees that the FSM of a buffer entry
is indeed the machine originally described and not one
of the many other possible state machines with the same
number of states or fewer. Thus, any buffer entry not
completely satisfying the microarchitecture specification according to the FSM model will be discovered by
the simulation of a checking sequence.
We choose to use a complete transition tour for testing our FSM models since it provides a good tradeoff
between sequence generation effort and the level of validation achieved. Small sequences of PowerPC assembly instructions are used to translate transition tours
of the FSMs into a simulatable instruction sequence,
i.e. an assembly program. These small instruction sequences are called atomic sequences. Execution of
the atomic sequence instructions causes the associated
FSM transition to be traversed. The atomic sequence
structure is partitioned into two parts: an initialization
subsequence and a single triggering instruction. The
initialization subsequence places the processor into a
machine state that makes it ready to traverse the transition associated with the atomic sequence. Execution
of the triggering instruction then causes the traversal of
the transition. Some atomic sequences do not require
the initialization subsequence (those for the reorder buffer, for
example). Figure 10 shows an example of an atomic
sequence that triggers the RNT transition from state
(WT, T = 0) to (WNT, T = 0) of the conditional branch
FSM for BTAC in Fig. 5. The instruction cmpi is a part
of the initializing subsequence. The triggering instruction is the conditional branch instruction bc.

BC_ADDR0:
        cmpi  0, 30, 1        # Set CR0
        bc    12, 2, BC0_0    # Make transition
BC0_0:                        # Target instruction
        ori   0, 0, 0         # NOP

Fig. 10. Traversal of the RNT arc of Fig. 5 using an atomic sequence of assembly instructions.

It is important to note that there exist many atomic
sequences that can trigger a particular transition in an
FSM. Two atomic sequences are different if their triggering instructions have different parameters such as
instruction opcode, instruction operands, etc. A more
robust and systematic means to specify an atomic sequence is to use two templates. A sequence template
is used to specify the structure (order of instructions)
of an atomic sequence. The sequence template for the
atomic sequence of Fig. 10 is shown in Fig. 11(a) where
the conditional branch instruction bc is replaced by
<conditional branch> which is an instruction template. An instruction template specifies the set of instructions that can serve as a triggering instruction in
the sequence template. Given an ISA specification, all
instruction templates can be listed systematically in
the form of a table as shown in Fig. 11(b). Every instruction in the ISA is listed along with the possible
range of values for each instruction parameter. For example, the instruction template of the bc instruction
in Fig. 11(b) requires three operands: BO (a five-bit
immediate value that specifies the condition for which
the branch is taken), BI (a five-bit immediate specifying the bit in the condition register to be used as the
condition of the branch) and a 14-bit immediate for
the target address. Thus, a total of 2^24 instances of the
bc instruction can be used in the sequence template of
Fig. 11(a). There are some other conditional branches
like bca, bcl and bcla that can also be used for the
sequence template of Fig. 11(a) instead of bc.

(a)
        cmpi  0, 30, 1          # Initializing subsequence
        <conditional branch>    # Triggering instruction
BC0_0:                          # Target instruction
        ori   0, 0, 0           # NOP

(b)
Opcode  Operand 0      Operand 1    Operand 2      Operand 3  Total # template instances
add     GPR0-GPR31     GPR0-GPR31   GPR0-GPR31     -          2^15
bc      BO0-BO31       BI0-BI31     imm (14 bits)  -          2^24
b       imm (24 bits)  -            -              -          2^24
...     ...            ...          ...            ...        ...

Fig. 11. Example of (a) a sequence template of the atomic sequence in Fig. 10 and (b) PowerPC instruction set table for instruction templates.
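The instance counts quoted for Fig. 11(b) follow directly from the operand field widths. The short Python sketch below reproduces that arithmetic; the dictionary name and layout are illustrative assumptions, not part of the paper's tooling, and only the three rows mentioned in the text are included.

```python
from math import prod

# Operand field widths in bits for the rows of Fig. 11(b) quoted in the text.
# Register operands (GPR0-GPR31) are 5-bit fields; BO and BI are 5-bit immediates.
# The name TEMPLATE_FIELD_BITS is a hypothetical helper for this sketch.
TEMPLATE_FIELD_BITS = {
    "add": (5, 5, 5),   # three general purpose register operands
    "bc": (5, 5, 14),   # BO, BI, 14-bit target immediate
    "b": (24,),         # 24-bit target immediate
}


def template_instances(field_bits):
    """Number of distinct operand combinations for one instruction template."""
    return prod(2 ** bits for bits in field_bits)


for opcode, fields in TEMPLATE_FIELD_BITS.items():
    count = template_instances(fields)
    print(f"{opcode}: 2^{sum(fields)} = {count} instances")
```

Even for a single opcode the count runs into the millions, which is why only a subset of the instruction parameters is enumerated in practice.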
It can be clearly seen that enumerating through all
possible values for every instruction parameter in an
instruction template is overkill. Therefore, there
must be some criteria for selecting a subset of the possible template instances. For example, given that we
would like to exercise the functionality of an entry in
the BHT where the instruction parameter of interest is
the instruction opcode, there is a total of four different
triggering instructions: bc, bca, bcl and bcla. Consequently, there are four different atomic sequences or
sequence templates that can cause a transition in the
BHT FSM using the opcode as the only selection criterion for the instruction template. We leave it to the user
to specify which instruction parameters are important
and relevant to the microarchitecture being validated.
At present, sequence templates and instruction parameters deemed important for validation are manually
identified and derived from the microarchitecture and
the ISA specifications. Atomic sequences are generated by filling the sequence templates with instructions
that are enumerated through the selected instruction
parameters. Figure 12 shows the selected instruction
parameters that are enumerated for different microarchitecture features. Once the atomic sequences are obtained, assembly-level test programs are created using
Perl scripts to concatenate various atomic sequences in
the order of the FSM transitions specified in the transition tour.
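The flow from FSM model to test program can be sketched in a few lines. The example below is written in Python rather than the Perl used by the authors, and it is only one plausible way to build a transition tour (a greedy walk that repeatedly takes the shortest path to an uncovered transition). The FSM is a guess at the reorder buffer entry model: the `complete` arc from Finish back to Free and the event names are assumptions, and the atomic sequences are placeholders rather than real assembly.

```python
from collections import deque

# Hypothetical reorder-buffer-entry FSM: (state, event) -> next state.
# The "complete" arc and all event names are assumptions for illustration.
FSM = {
    ("Free", "dispatch"): "Allocate",
    ("Allocate", "execute"): "Execute",
    ("Execute", "finish"): "Finish",
    ("Finish", "complete"): "Free",
    ("Allocate", "discard"): "Free",
    ("Execute", "discard"): "Free",
    ("Finish", "discard"): "Free",
}


def path_to_uncovered(fsm, start, covered):
    """Breadth-first search for the shortest path ending in an uncovered transition."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        for (src, event), dst in fsm.items():
            if src != state:
                continue
            step = (src, event, dst)
            if (src, event) not in covered:
                return path + [step]
            if dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [step]))
    return None


def transition_tour(fsm, start):
    """Greedy complete transition tour: every transition is traversed at least once."""
    covered, tour, state = set(), [], start
    while len(covered) < len(fsm):
        path = path_to_uncovered(fsm, state, covered)
        if path is None:
            raise ValueError(f"uncovered transitions are unreachable from {state}")
        for src, event, dst in path:
            covered.add((src, event))
            tour.append((src, event, dst))
        state = path[-1][2]
    return tour


def assemble(tour, atomic_sequences):
    """Concatenate atomic sequences in tour order into one program listing."""
    lines = []
    for src, event, dst in tour:
        lines.append(f"# {src} --{event}--> {dst}")
        lines.extend(atomic_sequences[(src, event)])
    return "\n".join(lines)


# Placeholder atomic sequences; real ones would be PowerPC assembly.
PLACEHOLDERS = {key: [f"<atomic sequence: {key[1]} from {key[0]}>"] for key in FSM}

tour = transition_tour(FSM, "Free")
program = assemble(tour, PLACEHOLDERS)
print(program)
```

The greedy walk does not guarantee a minimum-length tour, but it keeps the generation effort low, which matches the tradeoff argued for above.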
6. Experimental Results
The MW [26, 27] performance simulator is used for our
simulation experiments for the PowerPC 604. Checking code has been added to the simulator to track and
measure the coverage of FSM transitions by the program under simulation. In our earlier work [18] two
metrics, transition coverage and mutant detection, were
used to show the effectiveness of our sequences. Transition coverage measures the percentage of targeted
transitions that are multiply traversed (explained below) during the simulation. For mutant detection, the
software model describing the behavior of the PowerPC
604 is purposely modified and used during the simulation. If the transition coverage of the modified version
is different from that of the original version, the instruction sequence used in simulation is deemed to detect
the mutant.

Microarchitecture feature        Instruction parameter 1                      Instruction parameter 2
BTAC (unconditional)             Instruction opcode                           -
BTAC (conditional)               Instruction opcode                           -
BHT                              Instruction opcode                           -
Rename buffer                    Destination register                         -
Reservation station (1 operand)  Source operand (register only)               -
Reservation station (2 operand)  Source operand I (register only)             Source operand II (register only)
Reorder buffer                   Instruction type (based on functional unit)  -

Fig. 12. Instruction parameters that are enumerated to generate different atomic sequences for different microarchitecture features.
We only use the transition coverage metric in this
paper to show the effectiveness of our method. In this
case, coverage is simply the ratio of the number of traversed FSM transitions to the total number of targeted
FSM transitions. As pointed out in the previous section, a transition in an FSM can be traversed in many
possible ways and generating test stimuli for exhaustive simulation is not feasible. Therefore, the coverage
calculation is limited to the number of targeted transitions defined by the selected instruction parameters in
the instruction template. For example, if the instruction
parameter selected in the instruction template for the
BHT FSM is the instruction opcode, the targeted transitions include each arc traversed by each of the branch
instructions bc, bcl, bca and bcla.
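As a concrete reading of this metric, the following Python sketch computes coverage over targets formed by pairing each FSM arc with each selected opcode. The arc names and the sample set of traversed pairs are made up for illustration; only the four opcodes come from the text.

```python
from itertools import product


def transition_coverage(targeted, traversed):
    """Fraction of targeted (arc, parameter) pairs that were traversed."""
    targeted = set(targeted)
    if not targeted:
        raise ValueError("no targeted transitions")
    return len(targeted & set(traversed)) / len(targeted)


# Hypothetical arc labels for a BHT entry FSM; the opcode list is from the text.
ARCS = [f"arc{i}" for i in range(8)]
OPCODES = ["bc", "bcl", "bca", "bcla"]

# Each arc must be traversed once per opcode, so there are 8 * 4 = 32 targets.
targeted = set(product(ARCS, OPCODES))

# Example simulation log: every arc hit by bc, plus one arc hit by bcl.
traversed = {(arc, "bc") for arc in ARCS} | {("arc0", "bcl")}

coverage = transition_coverage(targeted, traversed)
print(f"{coverage:.1%} of {len(targeted)} targeted transitions covered")
```

A program that exercises every arc but with only one opcode therefore reports low coverage, which is the effect seen for the benchmark programs.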
Since the current version of MW is strictly trace-driven, it is not capable of simulating the functionality
associated with mis-speculated paths. As a result, some
transitions in the FSM models are untrackable. The
untrackable transitions are shown as dashed arrows in
Figs. 7–9. These transitions can only be traversed by executing instructions down speculative paths. Although
our test programs include instructions that cause the
traversal of those speculative transitions, MW does not
allow us to verify the coverage.
Real and pseudorandomly-generated programs are
typically used in industry for validating hardware designs [28–32]. Pseudorandom testing requires relatively
little effort for test generation compared to other approaches. However, random-based test generators are
intended to discover design errors by generating a huge
number (millions to billions) of instructions. The test
generation time is typically small but the simulation
time is quite large.
We evaluate the effectiveness of our method by comparing the transition coverage obtained from our test
programs with the coverage of randomly-generated instructions and real workloads (SPEC95 benchmarks).
We have created a C program for randomly choosing
PowerPC instructions and their operands. Since some
PowerPC instructions are not handled by MW, these
instructions are noted and not considered for the random program. Instructions affecting the control flow
of a program are carefully handled. The generated random program has both forward (i.e. jumps to subroutine) and backward (i.e. loops) branches. However, no
nested loops are permitted. Characteristics of the random program including the total number of instructions to be generated, the frequency of branch instructions, the maximum number of iterations per loop, the
size of loops and subroutines can be specified by the
user. Figure 13 shows the number of simulated instructions in the SPEC95 benchmarks (12 white bars), the
randomly-generated program (the light grey bar) and
our generated test program (the dark grey bar). Note
that the y axis of the graph is log scale.
Fig. 13. Program sizes (i.e. the number of instructions) for the SPEC95 benchmarks (white bars), the
random program (light grey bar), and our generated test programs (dark grey bars) used for validation
through simulation.

Fig. 14. FSM transition coverage comparison between the SPEC95 benchmarks (first 12 sets of bars),
the random program (the next-to-last set of bars) and our generated test program (the last set of bars).

Fig. 15. Rename buffer coverage comparison between the SPEC95 benchmarks (first 12 sets of bars),
the random program (the next-to-last set of bars) and our generated test programs (the last set of bars).

Figure 14 shows the percentage of edges in the FSM
models of the BTAC and the BHT covered by our generated sequences and the real and randomly-generated
programs. Based on the instruction parameters in the
table of Fig. 12, each FSM transition in the BTAC FSMs
can be traversed by any type of branch. Hence, there
are four different ways (b, ba, bl, bla) to traverse
a transition by unconditional branches and two ways
(bc, bcl, excluding bca, bcla due to the limitations
of the PowerPC assembler and loader) for conditional
branches. Each transition in the BTAC unconditional
branch FSM must be traversed four times using branch
instructions b, ba, bl and bla. Similarly, branch
instructions bc and bcl must be used to traverse each
transition twice in the BTAC conditional branch FSM.
For the BHT FSM, each transition can be traversed by
any conditional branch (again excluding bca, bcla).
Therefore, a transition in a BHT entry must be traversed
twice using bc and bcl to achieve 100% coverage. As
shown in Fig. 14, our test programs achieve 100% coverage of all three of the branch prediction FSMs. The
SPEC benchmarks have an edge coverage of less than
50% for all the FSMs. The SPEC benchmarks do not
have branch types ba, bla and bcl and a large portion of the dynamic branch instructions map to the same
BHT entry, which results in the low coverage achieved.
It can be clearly seen that the randomly-generated program is able to achieve 100% coverage for some of
the branch prediction FSMs. Our simulation results report that the 100% coverage for the BTAC conditional
branch FSM and the BHT FSM is achieved after executing 889K and 1.6M instructions, respectively. Even
though the random program has a 100% coverage on
these two FSMs, the number of instructions required is
about 3× and 19× more than our sequences for the
BTAC conditional branch FSM and the BHT FSM,
respectively.
Figure 15 shows overall coverage of the GRB, FRB
and CRB rename buffers. Our test sequence is exhaustive in that each FSM model edge is traversed in every possible way. For an entry in the GRB, this means
that each edge can be traversed using any one of the
32 different general purpose registers. Similarly, there
are 32 possible ways to traverse an edge of the FRB model
since there are 32 different floating point registers. But
there are only eight alternatives for a condition code
rename entry since there are only eight condition registers. Figure 15 shows that none of the SPEC95 benchmarks achieve 100% coverage of the targeted functionality of the rename buffer. The benchmark programs
from compress to vortex have low FRB coverage
because these benchmarks are integer applications and
therefore contain very few floating point instructions.
The random program has 100% edge coverage for
the FRB after executing 567K instructions (81× more
than ours) and 316K instructions (45× more) for the
CRB. However, our test programs easily achieve 100%
coverage using significantly fewer instructions.
Figure 16 shows overall coverage for the reorder
buffer. The targeted functionality for the reorder buffer
is quite simple. Here, we ensure that each buffer
entry is exercised by each of the six functional units:
two simple integer units (SFX0 and SFX1), one complex integer unit (CFX), one floating point unit (FPU),
one load/store unit (LSU), and one branch unit (BRU).
All but three of the benchmark programs (go, li, and
vortex) achieve 100% coverage. The random program
requires 2200 instructions to achieve 100% coverage
for the reorder buffer while our sequence requires only
300 instructions.

Fig. 16. Reorder buffer coverage comparison between the SPEC95 benchmarks (first 12 bars), the
random program (the next-to-last bar), and our generated test program (the last bar).
Figures 17 and 18 compare coverages for the 1- and
2-operand reservation station FSMs, respectively, for
the four functional units SFX0, SFX1, CFX and FPU.
The targeted functionality for each reservation station includes traversing each transition with all register
operand possibilities. For the two-operand reservation
stations, this means there are 1024 different alternatives for exercising an edge since there are 32 register
choices for each of the two sources. The one-operand
reservation stations have only 32 possible alternatives
since there is only one source. The benchmark coverage of the one-operand reservation station is quite low,
with the highest coverage achieved being about 40%
for the program fpppp. The benchmark coverage is
even lower for the two-operand reservation stations. No
benchmark program achieves more than 15% coverage.
For the randomly-generated program, the coverage for
one-operand reservation station is almost 100% for the
floating point reservation stations. However, the random program has poor coverage for the one-operand
reservation station in the complex integer unit (only
3%). The highest edge coverage achieved by the random program is for the simple integer two-operand
reservation station FSM, which is a mere 33%.
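The operand counts above (32 alternatives for one source, 1024 for two) come from enumerating every register choice per source operand. A minimal Python sketch of that enumeration follows; the template strings, the fixed destination register 3 and the `<op1>` mnemonic are placeholders chosen for illustration and are not taken from the authors' generator.

```python
from itertools import product

REGISTERS = range(32)  # 32 architected source registers per operand


def fill_template(template, operand_count):
    """Instantiate a triggering-instruction template for every source register choice."""
    return [
        template.format(*regs)
        for regs in product(REGISTERS, repeat=operand_count)
    ]


# Placeholder templates: "<op1>" stands for any one-source instruction.
one_operand = fill_template("<op1> 3, {0}", 1)
two_operand = fill_template("add 3, {0}, {1}", 2)

print(len(one_operand), "one-operand instances")
print(len(two_operand), "two-operand instances")
print(two_operand[:3])
```

Full coverage of a two-operand reservation station edge needs all 1024 instances, which helps explain why programs that reuse a handful of registers score so low in Fig. 18.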
Fig. 17. One-operand reservation station coverage comparison between the SPEC95 benchmarks
(first 12 sets of bars), the random program (the next-to-last set of bars), and our generated test program
(the last set of bars).

Fig. 18. Two-operand reservation station coverage comparison between the SPEC95 benchmarks
(first 12 sets of bars), the random program (the next-to-last set of bars) and our generated test program
(the last set of bars).

7. Conclusions and Future Work

We have shown that the microarchitecture features
(branch prediction, register renaming, reorder buffers,
etc.) of superscalar processors can be effectively modeled as a set of critical buffers. Each buffer consists of
multiple, identical entries, each of which can be modeled by a simple FSM. Based on this model, traditional
iterative array and FSM testing techniques were leveraged to create test programs that validate these complex microarchitecture mechanisms. Our test programs
achieved 100% coverage with 1000× fewer instructions
than benchmark programs and 3× to 81× fewer instructions than randomly-generated programs. Moreover,
the benchmark and randomly-generated programs do
not achieve 100% coverage in all cases.
Our future work will focus on validating the interaction that exists between buffer FSMs. This work
adopts a “single fault assumption” which means we assume that each buffer in the microarchitecture operates
correctly except for the one under consideration. Our
future work relaxes this assumption by considering the
communication among groups of FSMs. We are also
increasing the level of automation of this validation
method. We envision an ATPG-like algorithm which
accepts FSM models of the microarchitecture, information about the instruction set architecture and coverage
goals that are set by the program user. The output of this
ATPG algorithm would be an assembly-level test program that “exercises” the functionality targeted by the
FSMs. The type of exercise can range from tours and
checking experiments applied to individual FSMs to
the intercommunication of these FSMs. The PowerPC
604 will continue to serve as our case study but the
algorithm would be general and therefore applicable to
any superscalar processor that uses similar microarchitecture mechanisms.
References
1. B. Black and J.P. Shen, “Calibration of Microprocessor Performance Models,” IEEE Computer, Vol. 31, No. 5, pp. 59–65, May 1998.
2. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, IEEE Press, Piscataway, NJ, 1991.
3. R.E. Bryant, “Graph Based Algorithms for Boolean Function Manipulation,” IEEE Transactions on Computers, Vol. C-35, No. 8, pp. 677–691, Aug. 1986.
4. R.E. Bryant, D.L. Beatty, and C.J.H. Seger, “Formal Hardware Verification by Symbolic Ternary Trajectory Evaluation,” Proc. of Design Automation Conference (DAC), June 1991, pp. 397–402.
5. M.C. McFarland, “Formal Verification of Sequential Hardware: A Tutorial,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 12, No. 5, pp. 633–654, May 1993.
6. K.L. McMillan, Symbolic Model Checking, Kluwer Academic Publishers, 1993.
7. M. Yoeli, Formal Verification of Hardware Design, IEEE Computer Society Press, Los Alamitos, CA, 1990.
8. J. Burch and D. Dill, “Automatic Verification of Pipelined Microprocessor Control,” International Conference on Computer Aided Verification, June 1994, pp. 68–80.
9. K.L. Nelson, A. Jain, and R.E. Bryant, “Formal Verification of a Superscalar Execution Unit,” Proc. of Design Automation Conference (DAC), June 1997, pp. 161–166.
10. C.L. Berman and L.H. Trevillyan, “Functional Comparison of Logic Designs for VLSI Circuits,” Proc. of International Conference on Computer Design (ICCD), Nov. 1989, pp. 456–459.
11. D.L. Beatty and R.E. Bryant, “Formally Verifying a Microprocessor Using a Simulation Methodology,” Proc. of Design Automation Conference, June 1994, pp. 596–602.
12. D. Geist, M. Farkas, A. Landver, Y. Lichtenstein, S. Ur, and Y. Wolfsthal, “Coverage-Directed Test Generation Using Symbolic Techniques,” Proc. of the International Conference on Formal Methods in Computer-Aided Design, Nov. 1996, pp. 143–158.
13. R.C. Ho, C.H. Yang, M.A. Horowitz, and D.L. Dill, “Architecture Validation for Processors,” Proc. of the International Symposium on Computer Architecture, June 1995, pp. 404–413.
14. H. Iwashita, S. Kowatari, T. Nakata, and F. Hirose, “Automatic Test Program Generation for Pipelined Processors,” Proc. of International Conference on Computer-Aided Design, Nov. 1994, pp. 580–583.
15. D. Moundanos, J.A. Abraham, and Y.V. Hoskote, “Abstraction Techniques for Validation Coverage Analysis and Test Generation,” IEEE Transactions on Computers, Vol. 47, No. 1, pp. 2–14, Jan. 1998.
16. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Superscalar Processor Validation at the Microarchitecture Level,” Digest of Papers of the International High Level Design Validation and Test Workshop (HLDVT), Nov. 1997, pp. 202–209.
17. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Validation of Speculative and Out-of-Order Execution Microarchitecture,” First International Workshop on Microprocessor Test and Verification, Oct. 1998.
18. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Superscalar Processor Validation at the Microarchitecture Level,” Proc. of International Conference on VLSI Design, Jan. 1999, pp. 300–305.
19. A.G. Liles, Jr. and B.E. Willner, “Branch Prediction Mechanism,” IBM Technical Disclosure Bulletin, Vol. 22, No. 7, pp. 3013–3016, Dec. 1979.
20. J.K.F. Lee and A.J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design,” IEEE Computer, pp. 6–22, Jan. 1984.
21. R.M. Keller, “Look-Ahead Processors,” Computing Surveys, Vol. 7, No. 4, pp. 177–195, Dec. 1975.
22. R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal, Vol. 11, No. 1, pp. 25–33, Jan. 1967.
23. J.E. Smith and A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. of the International Symposium on Computer Architecture, June 1985, pp. 36–44.
24. Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill, New York, 1978.
25. S. Naito and M. Tsunoyama, “Fault Detection for Sequential Machines by Transition-Tours,” Proc. of the International Symposium on Fault-Tolerant Computing, June 1981, pp. 238–243.
26. T.A. Diep, “VMW: A Visualization-based Microarchitecture Workbench,” Ph.D. Thesis, Carnegie Mellon University, Aug. 1995.
27. A.S. Huang and T.A. Diep, “MW Developer’s Guide,” Technical Report, CMuART-95-1, Carnegie Mellon University, Aug. 1995.
28. A. Aharon, A. Bar-David, B. Dorfman, E. Gofman, M. Leibowitz, and V. Schwartzburd, “Verification of the IBM RISC System/6000 by a Dynamic Biased Pseudo-Random Test Program Generator,” IBM Systems Journal, Vol. 30, No. 4, pp. 527–538, 1991.
29. P. Bose, “Architectural Timing Verification and Test for Superscalar Processors,” Proc. of the International Symposium on Fault-Tolerant Computing, June 1994, pp. 256–265.
30. P. Bose, “Performance Test Case Generation for Microprocessors,” Proc. of VLSI Test Symposium, Apr. 1998, pp. 54–59.
31. N. Dohm et al., “Zen and the Art of Alpha Verification,” Proc. of International Conference on Computer Design, Oct. 1998, pp. 111–117.
32. S.T. Mangelsdorf et al., “Functional Verification of the HP PA 8000 Processor,” Hewlett-Packard Journal, pp. 22–31, Aug. 1997.
Noppanunt Utamaphethai is a PhD candidate in the ECE Department at Carnegie Mellon University. Currently, he is working on
the ATPG-based validation of microprocessors and has applied his
methodology to designs at both IBM and Intel. He received his BS
from Brown University and his MS from CMU.
Shawn Blanton is an assistant professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University where
he is a member of the Center for Electronic Design Automation. He
received the Bachelor’s degree in engineering from Calvin College
in 1987, a Master’s degree in Electrical Engineering in 1989 from the
University of Arizona, and a Ph.D. degree in Computer Science and
Engineering from the University of Michigan, Ann Arbor in 1995.
His research interests include the computer-aided design of VLSI
circuits and systems; verification and testing; and computer architecture. Dr. Blanton is the recipient of a National Science Foundation
CAREER Award and is a member of IEEE and ACM.
John Paul Shen is a professor in CMU’s ECE Department and
heads up the Carnegie Mellon Microarchitecture Research Team
(CMuART). He received a BS from the University of Michigan and
an MS and PhD from the University of Southern California all in
Electrical Engineering. He spent several years at Hughes and TRW.
His current research interests are in high performance microprocessor
design and validation, speculative and dynamic microarchitectures,
and software thread integration for embedded computing systems.
He is an IEEE fellow.