JOURNAL OF ELECTRONIC TESTING: Theory and Applications 16, 49–65 (2000)
© 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Buffer-Oriented Methodology for Microarchitecture Validation

NOPPANUNT UTAMAPHETHAI, R.D. (SHAWN) BLANTON AND JOHN PAUL SHEN
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213
nau@ece.cmu.edu, blanton@ece.cmu.edu, shen@ece.cmu.edu

Received December 21, 1998; Revised July 7, 1999
Editor: M.S. Abadir

Abstract. We propose a methodology for validating microarchitecture specifications. We view microarchitecture features as specific operations on entries of various buffers in the processor. Our validation approach is to determine the functionality of a buffer type, model its operations at the microarchitecture level using abstract finite state machine (FSM) models, and rigorously generate instruction sequences that systematically exercise the model of each instance of that buffer type. A high-level test sequence is derived based on the abstract FSM model using FSM testing techniques, and then translated to a test program that exercises the functionality of each buffer entry. This methodology is applied to the microarchitecture specifications of the PowerPC 604. The effectiveness of the sequences generated using our methodology is compared with that of some real and randomly-generated programs. Simulation results show that all targeted FSM transitions are covered by our sequences with at least 1000× and 3× fewer instructions than real and randomly-generated programs, respectively.

Keywords: processor validation, design validation, superscalar microarchitecture

1. Introduction

Microprocessor performance has been doubling every eighteen months due primarily to the increase in clock speed and instruction-level parallelism. Higher clocking rates are achieved via process improvements and the use of deeper pipelines.
Instruction-level parallelism is measured in terms of the average or sustained number of instructions processed per machine cycle (IPC). To increase IPC, current high-end microprocessors employ very aggressive and complex microarchitecture mechanisms. The dominant mechanisms include: dynamic branch prediction to ease the constraints imposed by control flow dependencies; register renaming to remove the unnecessary serialization imposed by false data dependencies; reservation stations to buffer instructions awaiting their operands without having to stall the fetch and decode stages; out-of-order issuing to avoid the unnecessary stalling of subsequent ready instructions; and reorder buffers to support in-order completion and precise exceptions even when instructions are executed out of order. These aggressive mechanisms introduce significant complexity to modern superscalar microarchitectures, which makes their validation an extremely difficult task.

1.1. Motivation

The current practice in industry for validating these aggressive microarchitecture mechanisms is through simulation. Simulation models of the microarchitecture are developed early in the design cycle and then maintained and used during the entire design cycle. Traditionally, two types of simulation models are used: functional and timing. Functional models are used to validate the correct behavior of the mechanisms, while the timing models are used for characterizing the performance (in terms of machine cycles) of these mechanisms.
Design validation is performed by exercising the simulation models and examining the simulation outcome. In order to exercise these models for validation, instruction sequences or test sequences are used as input stimuli to the simulation models. Generally three types of test sequences are used [1]. First, real application programs can be used as test sequences. While these programs may represent the actual user workload, they may not fully exercise the machine. Second, hand-generated test programs are written by the designers to probe specific areas of the machine and to test the “corner conditions” of the machine behavior. Third, randomly generated programs are used to supplement the previous two types of test sequences by exercising the machine models to a greater extent.

The above approaches to microarchitecture validation have a number of weaknesses. Using real application programs and randomly generated programs as test sequences can be very inefficient. On the other hand, explicitly generated test sequences are produced in an ad-hoc fashion based on the intuitive knowledge of the designer and not on a rigorous representation of the microarchitecture mechanisms. There is a real need for a systematic method to generate highly-efficient test sequences for design validation that is based on rigorous and efficient representations of the microarchitecture mechanisms that can yield quantitative coverage figures. The goal of this work is to address this need.

1.2. Our Approach

This work presents a systematic method for generating efficient test sequences for rigorously validating contemporary superscalar microarchitectures that employ deep pipelines, aggressive speculation, and out-of-order execution. This method operates at the microarchitecture level of abstraction, not at the RTL implementation, with the objective to validate the behaviors of the key microarchitecture mechanisms.
To deal with the complexity problem, we partition the entire machine into a set of critical buffers and then generate test sequences for validating the operation of each of these critical buffers. We model the behavior of each buffer using abstract finite state machines (FSMs). Based on these abstract FSM models, we generate test sequences for each buffer. The approach resembles that of automatic test pattern generation (ATPG) for logic circuits and borrows some ideas from functional testing of iterative array structures [2]. We first summarize related work in Section 2 and then describe the key concepts of our buffer-oriented approach in Section 3. Sections 4 and 5 demonstrate and apply this approach to the PowerPC 604 superscalar microprocessor. The experimental results are presented in Section 6. Summary and conclusion are documented in Section 7.

2. Related Work

An alternative to the current practice of design validation via simulation is the use of formal methods for design verification [3–7]. Design verification techniques have proven to be quite successful in verifying logic circuits. More recently these techniques have been extended to verifying microprocessors. However, only a portion of a complex microprocessor or relatively simple microprocessors can be verified using these formal methods [8, 9]. A key problem with the formal verification methods is the issue of complexity explosion. While the efficiency of formal verification algorithms has been significantly improving, the complexity of contemporary superscalar designs appears to be increasing at an even faster rate. This complexity problem is further exacerbated by two dominant trends in contemporary superscalar design that appear to be antagonistic with respect to the current strengths of formal methods. Formal verification techniques have been successfully applied to and shown to be quite effective with the verification of the datapath portion of a microprocessor [9, 10].
However, the added complexity in modern superscalar machines that makes the task of validation difficult is mostly in the control logic portion for managing instructions that are simultaneously in flight in many pipe stages. Recent research efforts on formal methods have focused on verifying pipelined designs [8, 11]. These techniques are most effective when the pipeline is fairly shallow and straightforward, with a single sequential path for instruction flow and minimal interaction between different pipe stages. Modern superscalar design trends are contrary to these assumptions in that they are becoming deeper and deeper in terms of pipe depth with increasing interactions between the different pipe stages. There are many examples of such interactions. For example, branch predictors are accessed during the fetch stage but must be updated by the branch execution unit in the execute stage. Register renaming is performed in the decode stage, but rename registers are retired by the write back stage. In addition to advancing instructions to the reservation stations in the next pipe stage, the dispatch stage must also allocate entries in the reorder buffer, which is usually viewed as part of the completion stage. The validation of these complex interactions between pipe stages is crucial and poses the most difficult validation problem. In addressing the complexity problem, a number of recent papers have been published that attempt to leverage specific algorithms from formal verification but adopt a more simulation-based and/or testing-oriented methodology. One emerging paradigm combines formal verification and functional-level simulation. Most work [12–15] uses abstract FSMs provided by the designer or part of the specification as models for test generation.
Formal verification techniques, such as property checking during state enumeration and transition traversal, are used to analyze the FSMs and generate tests. In general, a set of tests is generated to maximize some coverage metric (e.g., transition coverage). The abstract FSM used in previous work is typically derived from the structural implementation of the design. The resulting model is then likely to be limited to the functionality of a single pipe stage. As pointed out earlier, microarchitecture mechanisms generally involve interactions between many pipe stages. Consequently, representing a microarchitecture mechanism requires the composition of many FSMs, which then leads to the aforementioned complexity explosion.

3. Buffer-Oriented Methodology

The microarchitecture validation method proposed in this work differs from current ad-hoc practices in being a systematic, general method that is based on a rigorous model and is amenable to algorithmic implementation. As compared to formal methods, it addresses the complexity explosion problem via the use of appropriate abstraction and machine partitioning. It also leverages elegant ideas from traditional test generation techniques for iterative digital logic circuits.

3.1. Abstraction of Microarchitecture Mechanisms

The proposed validation method operates at the microarchitecture level of abstraction and targets the validation of critical microarchitecture mechanisms. This level of abstraction is the one used by microarchitects during the early definition of a new microarchitecture. While pipeline stages and major functional units are explicitly represented at this level of abstraction, our focus is more on validating the behavior of the major microarchitecture mechanisms and not necessarily the actual register-transfer level implementation of these mechanisms. The goal is to ensure that the microarchitecture mechanisms, such as branch prediction, register renaming, and out-of-order execution, function as the microarchitect intends. Notice that this in effect validates the performance of the machine since machine timing is determined by the movement of instructions through the pipeline states, which is in turn determined by the operation of the buffers being validated. Once a microarchitecture design is implemented at the register transfer or logic gate level, other Boolean-based tools can be employed to validate the implementation based on the structural representation of the design.

The proposed method also targets the control aspect of the microarchitecture and not the operation of the datapath functional units. Datapath functional units can be effectively and independently verified using formal verification methods. Even though some of these functional units may be pipelined, such execution pipelines are straightforward in that internal feedback, forwarding paths and interacting instructions do not exist between pipeline stages. Our validation method focuses on the more difficult control aspect of the microarchitecture that involves interactions between instructions in different pipeline stages of the machine.

3.2. Buffer-Oriented Machine Partitioning

It is impractical to perform validation of the entire machine as a single unit. To alleviate the complexity problem, the machine is always partitioned into multiple modules that can be validated separately. However, unlike many recently published papers that partitioned the machine based on pipeline stages, we take a very different approach to machine partitioning. A careful examination of most contemporary superscalar designs reveals that all the major microarchitecture mechanisms involve the operation of certain critical buffers. For example, branch prediction involves the reading of the branch target buffer or the branch history table in the fetch stage.
During branch resolution, the branch execution unit updates these buffers. These critical buffers provide the basis for all control aspects of the machine and thus determine performance. They usually have multiple entries and are frequently multi-ported. Their operations are usually dependent on control signals originating from different pipeline stages. The “DATA” stored in these buffer entries can be operands or instructions. Other than these “DATA” bits, a buffer entry can also store status bits that indicate the status or state of the “DATA” being stored there; see Fig. 1. Usually these stored status bits are used as control signals to trigger subsequent events and carry out the control function of the machine. The critical buffers in a contemporary superscalar microarchitecture include the following: branch target buffer, branch history table, register rename buffers, reservation stations, and reorder buffer. All of these buffers are considered critical because their entries all carry certain status or state bits. The control aspect of the entire superscalar microarchitecture can be viewed as consisting of the management of the operations performed on entries of these critical buffers.

Fig. 1. A microarchitecture can be viewed as a set of critical buffers, each of which has multiple, identical entries. Each buffer entry stores “DATA” (instructions and/or operands) and status information that describes the state of the current “DATA”.

3.3. FSM Model of Buffer Entries

Based on the above stated machine partitioning, this work proposes a buffer-oriented validation method. A microarchitecture is partitioned into its critical buffers. All the critical buffers of a machine are separately and sequentially validated. All of these critical buffers have multiple entries. The number of entries can vary from a few to several dozen or potentially several hundred in future designs.
All the entries of the same buffer are symmetrical; hence the behavior of the entire buffer can be characterized by the behavior of an entry. The operation of an entry can involve the reading or writing of the “DATA” in that entry. In addition to the “DATA”, each entry can also store status bits representing the state of the “DATA” stored there. Other than reading or writing of “DATA” in an entry, a buffer entry operation can also involve the updating of the status bits of that entry. The status bits of an entry can in turn be viewed as the “state bits” of that entry. Hence, each entry can be viewed as being in one of several states. When the status bits are updated, this can be viewed as a state transition. Consequently each entry of the buffer can be modeled as a small finite state machine (FSM) capable of being in one of many states with state transitions triggered by external control signals and the next state logic of the FSM. The FSM model of the entry fully characterizes the behavior of each entry of a critical buffer; see Fig. 2.

Fig. 2. (a) Read and write operations of the status bits of a particular buffer entry. (b) Modeling status bits as an FSM.

In traditional logic testing, an iterative array structured circuit can be elegantly tested by partitioning the circuit into its symmetrical modules, each of which is separately and identically tested. We leverage this concept in our buffer-oriented validation method. Since the entries of a buffer are symmetrical, a buffer is validated by separately and identically validating each of its entries. Since each entry is characterized by an FSM, it can be validated by exercising its FSM model. A number of specific techniques can be employed for exercising the FSM model. Three obvious choices are: (1) check that all the states can be entered; (2) check that all the state transitions can be performed; and (3) check the FSM behavior exhaustively.
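As an illustration of choice (2), the following minimal Python sketch builds a sequence of transitions that exercises every arc of a small FSM and reports the resulting transition coverage. It is not the authors' tool: the greedy nearest-uncovered-arc strategy, the function names, and the three-arc `EXAMPLE_FSM` are assumptions introduced only for this example.

```python
from collections import deque

# Hypothetical three-arc entry model, used only to exercise the functions below.
EXAMPLE_FSM = {
    ("Free", "allocate"): "Busy",
    ("Busy", "hold"): "Busy",
    ("Busy", "release"): "Free",
}


def transition_tour(fsm, start):
    """Greedy tour: repeatedly walk to the nearest arc not yet covered.

    fsm maps (state, event) -> next_state.
    Returns a list of (state, event, next_state) triples beginning at `start`.
    """
    uncovered = set(fsm)
    tour, state = [], start
    while uncovered:
        # Breadth-first search for a shortest path ending in an uncovered arc.
        queue = deque([(state, [])])
        seen = {state}
        path = None
        while queue and path is None:
            cur, edges = queue.popleft()
            for (src, event), dst in sorted(fsm.items()):
                if src != cur:
                    continue
                step = edges + [(src, event, dst)]
                if (src, event) in uncovered:
                    path = step
                    break
                if dst not in seen:
                    seen.add(dst)
                    queue.append((dst, step))
        if path is None:
            raise ValueError("some arcs are unreachable from the current state")
        for src, event, dst in path:
            uncovered.discard((src, event))
            tour.append((src, event, dst))
        state = path[-1][2]
    return tour


def coverage(fsm, tour):
    """Percentage of the FSM's arcs that appear in the tour."""
    hit = {(src, event) for src, event, _ in tour}
    return 100.0 * len(hit & set(fsm)) / len(fsm)


tour = transition_tour(EXAMPLE_FSM, "Free")
print(tour, coverage(EXAMPLE_FSM, tour))
```

The greedy strategy does not promise a shortest possible tour; it is used here only because it is short to state and always terminates when every arc is reachable.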
In our work [16–18], we have used the latter two techniques, but here we focus on the second technique: exercising all the state transitions by generating a test sequence of instructions whose simulation forces a buffer entry to traverse all of its state transitions. The state transitions are verified by examining the simulation outcome. We repeat this process for each of the entries of a buffer and then repeat the entire process for each of the critical buffers. We then define coverage of a test sequence as the percentage of all possible state transitions checked by that test sequence of instructions. To summarize, our buffer-based validation method involves:

1. partitioning a microarchitecture into its critical buffers;
2. generating the FSM models for each entry of all the critical buffers;
3. constructing a transition tour for each FSM model;
4. synthesizing a test sequence of instructions to carry out each of the transition tours; and
5. simulating the resulting test program to verify coverage of each transition tour.

In the next two sections, we apply this buffer-based methodology to the PowerPC 604 microarchitecture to demonstrate the practical feasibility of our validation method and the efficiency of the test sequences so generated. In this example application to the 604, further details of the validation method are also presented.

4. Application to the PowerPC 604

The PowerPC 604 (Fig. 3) is used as a research vehicle to explore our validation methodology. It serves our purpose since it implements all the characteristics of a speculative and out-of-order processor through the microarchitecture mechanisms of branch prediction, register renaming, reorder buffering and operand buffering via reservation stations. In the following subsections, we describe each of these components in detail and present the FSM models of their buffer entries.

Fig. 3. Pipeline organization and the critical buffers (shaded blocks) of the PowerPC 604.

4.1. Branch Prediction

Processors attempt to minimize the performance degradation due to altered program flow caused by branch instructions through branch prediction. Branch prediction attempts to determine the flow of the instruction stream so that instructions and operands are prefetched sufficiently far in advance [19]. Branch prediction is conceptually composed of determining the direction of a branch (taken or not-taken) and its target address. The branch prediction mechanism implemented in the PowerPC 604 utilizes a branch target address cache (BTAC) and a branch history table (BHT) [20]. The BTAC is a 64-entry, fully associative cache with a round-robin replacement policy. It stores the target addresses of both unconditional and predicted-taken conditional branches. The BHT is a 512-entry, direct-mapped cache that stores the execution history of conditional branches. The two status bits of a BHT entry encode the histories: strong not taken (SNT), weak not taken (WNT), weak taken (WT), and strong taken (ST). The history state is used to make predictions for conditional branches. If the state is either ST or WT, the conditional branch is predicted to be taken. Otherwise it is predicted not taken. The history table is updated when the actual branch is resolved to be taken (RT) or not taken (RNT).

Fig. 4. FSM model for the unconditional branch for an entry of the branch target address cache (BTAC).

4.1.1. Branch Target Address Cache. The BTAC is a critical buffer which stores prediction information for both unconditional and conditional branches using one bit of status information. We use two separate FSMs to model the functionality of a BTAC entry for the unconditional and conditional branches.
Figure 4 shows the FSM model for a BTAC entry that corresponds to an unconditional branch; this model describes the unconditional branch prediction function for each of the 64 entries of the BTAC. When an unconditional branch instruction is first encountered, the remove transition is traversed into the predicted-not-taken (PNT) state, where the prediction of not taken (T = 0) is made. This is the initialization that allocates an entry of the BTAC for a particular unconditional branch. The fetch address of the branch and its target address are loaded into the BTAC at the position of the round-robin pointer. After the branch is executed, the FSM transitions to the predicted-taken (PT) state. Now if the same unconditional branch is encountered, it is predicted to be taken (T = 1) and the FSM responds by self-looping on the PT state. The transition from PT to PNT occurs when the branch is removed (the remove transition) from the BTAC by the entrance of another branch instruction when the round-robin pointer is at this entry.

Fig. 5. FSM model for the conditional branch for an entry of the branch target address cache (BTAC).

The model for a BTAC entry containing a conditional branch is more complex. Figure 5 shows a 6-state FSM that makes predictions based on branch histories. When a new conditional branch enters the BTAC, one of the four remove transitions (which is a BTAC miss) is traversed to a state where a prediction of not taken (T = 0) is made. The specific remove transition performed depends on the branch history stored in the corresponding direct-mapped entry of the BHT. When the conditional branch is resolved in the execution stage, a transition is made based on the outcome. (Thus, upon the first encounter of a branch, two transitions are made in this model.) A branch resolved taken (RT) causes a state transition along the RT arc. A resolved not taken (RNT) outcome causes an RNT transition.
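Returning briefly to the simpler unconditional-branch model of Fig. 4, its behavior as described in the prose can be written down as a small transition table. This is a sketch transcribed from the text, not from the figure itself: the event names `execute`, `hit`, and `remove` are illustrative labels, and the `("PNT", "remove")` self-loop is an assumption about what happens when an entry already in PNT is replaced.

```python
# Unconditional-branch BTAC entry model, transcribed from the prose description.
BTAC_UNCOND_FSM = {
    ("PNT", "execute"): "PT",   # allocated branch has been executed
    ("PT", "hit"): "PT",        # same branch encountered again, predicted taken
    ("PT", "remove"): "PNT",    # entry replaced at the round-robin pointer
    ("PNT", "remove"): "PNT",   # assumption: replacement while still in PNT
}

# Prediction output T associated with each state.
PREDICTION = {"PNT": 0, "PT": 1}


def run(fsm, state, events):
    """Apply the events in order and return the list of visited states."""
    trace = [state]
    for event in events:
        state = fsm[(state, event)]
        trace.append(state)
    return trace


trace = run(BTAC_UNCOND_FSM, "PNT", ["execute", "hit", "remove"])
predictions = [PREDICTION[s] for s in trace]
print(trace, predictions)
```

The conditional-branch model of Fig. 5 is not sketched here because the prose does not enumerate all of its arcs.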
The not-taken states (WNT and SNT) make not-taken predictions (T = 0) while the taken states (WT and ST) predict branches to be taken (T = 1). Future encounters of the branch instruction use the output value indicated in the current state (T = 0 or T = 1) for the prediction.

4.1.2. Branch History Table. Figure 6 shows the FSM model for each of the 512 BHT entries. A cold start initializes all entries in the BHT to the start state SNT. Any conditional branch whose address directly maps to the same BHT entry will cause transitions in that entry’s FSM when the branch is resolved in the execution stage. The current state of the FSM is used to make predictions (possibly overriding BTAC predictions) for instructions in the decode and dispatch stages. States WNT and SNT make a branch prediction of not taken (T = 0) while states WT and ST predict taken (T = 1).

Fig. 6. FSM model for an entry of the branch history table (BHT).

4.2. Register Renaming

Register renaming [21] is a technique that improves parallelism by eliminating stalled cycles due to anti- and output dependencies. It avoids contention for a given register file location in the course of out-of-order execution by storing instruction results in temporary (rename) buffers. The PowerPC 604 uses three sets of rename buffers for three register files: a 12-entry general purpose rename buffer (GRB) for general purpose registers, an 8-entry floating-point rename buffer (FRB) for floating-point registers and an 8-entry condition code rename buffer (CRB) for condition code registers. Figure 7 shows the FSM model for the operation of each rename buffer entry. Note that the model is valid whether the rename buffer entry is from the GRB, FRB or CRB. An entry is Free until the dispatch unit allocates an entry for an instruction in the dispatch stage. This occurs if an instruction modifies any register. The entry remains allocated until the instruction completes and the result is written back to the register file. There are two states for an allocated entry. At the time of renaming, each newly allocated rename entry will always hold the most recent (MR) value for the renamed register, denoted by the MR Alloc state of Fig. 7. If a rename entry is allocated to a register which is then later renamed by another instruction, the previously allocated entry will no longer hold the most recent value and will therefore transition from the MR Alloc state to the NonMR Alloc state. Once the instruction finishes, the content of the rename entry becomes valid, which causes a transition from MR Alloc (NonMR Alloc) to MR Valid (NonMR Valid). The FSM stays in the valid state until the result is written to the register file (WB transition) or a prior instruction causes an exception that requires all subsequent instructions to be discarded (discard transition).

Fig. 7. FSM model for an entry of any of the three rename buffers.

4.3. Reservation Stations

When instructions are dispatched to an appropriate functional unit, they are placed in reservation stations (RSs) before they are executed [22]. Instructions are buffered in reservation stations even if their operands are ready. Reservation stations are useful because they allow instructions that do not have ready operands to progress deeper in the pipeline; this allows the frontend of the pipeline to process new instructions instead of stalling. The PowerPC 604 has distributed reservation stations, that is, each of the functional units has a dedicated two-entry buffer for two instructions of the proper type. If a reservation station is available and an instruction can be dispatched, an entry in the reservation station will be allocated for the instruction.
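The rename buffer entry model of Section 4.2 lends itself to the same table-driven treatment. The sketch below encodes only the arcs stated in the prose, replays a hand-written event sequence, and reports which arcs were covered. The event names, the underscore spelling of the state names, and the assumption that WB and discard both return the entry to Free are illustrative choices; Fig. 7 may contain further arcs (for example, a discard from an Alloc state) that the text does not spell out.

```python
# Rename buffer entry model: only the arcs explicitly described in the text.
RENAME_ENTRY_FSM = {
    ("Free", "allocate"): "MR_Alloc",
    ("MR_Alloc", "renamed"): "NonMR_Alloc",   # a later instruction renames the register
    ("MR_Alloc", "finish"): "MR_Valid",
    ("NonMR_Alloc", "finish"): "NonMR_Valid",
    ("MR_Valid", "WB"): "Free",               # assumed to free the entry
    ("NonMR_Valid", "WB"): "Free",
    ("MR_Valid", "discard"): "Free",          # assumed to free the entry
    ("NonMR_Valid", "discard"): "Free",
}


def replay(fsm, start, events):
    """Apply events in order; return the final state and the set of covered arcs."""
    state, covered = start, set()
    for event in events:
        key = (state, event)
        if key not in fsm:
            raise ValueError(f"no arc for {event!r} in state {state!r}")
        covered.add(key)
        state = fsm[key]
    return state, covered


events = ["allocate", "finish", "WB", "allocate", "renamed", "finish", "discard"]
final_state, covered = replay(RENAME_ENTRY_FSM, "Free", events)
coverage_pct = 100.0 * len(covered) / len(RENAME_ENTRY_FSM)
print(final_state, coverage_pct)
```

In this example the two uncovered arcs are the write-back of a non-most-recent entry and the discard of a most-recent one, which is the kind of gap a coverage figure is meant to expose.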
The following conditions prevent an instruction from being dispatched to a reservation station: (i) the required reservation station or its write port is full; (ii) the reorder buffer (described next) or its write port is full; (iii) an instruction break occurs and changes the program flow; (iv) no rename register is available. When an entry in a reservation station is allocated, the value of each instruction operand is written into the reservation station entry. If the value is not yet available, the tag (i.e., the status bits that identify the result) of the pending operand will be used instead of the actual value. Once all of the operands (“DATA”) are available, the instruction is removed from the reservation station and execution begins.

Fig. 8. FSM model for a reservation station entry containing an instruction with (a) two source operands, (b) one source operand.

Figure 8(a) shows the FSM model for each RS entry of a two-operand PowerPC instruction (an add instruction for example). Every RS entry starts as free and available for allocation. Hence, the starting state for an RS entry is the Free state shown in the center of the diagram of Fig. 8(a). When an instruction is in the dispatch stage and all dispatch conditions are met, an entry in the RS is allocated. Since the instruction requires two source operands, there are four possible status states for an RS entry:

• Alloc 00: No source operands are available
• Alloc 01: Only the right source is available
• Alloc 10: Only the left source is available
• Alloc 11: Both source operands are available

The FSM can transition from the Alloc 00, Alloc 01 and Alloc 10 states to another Alloc state when other operand(s) become available. An RS entry is deallocated if all operands are ready and the instruction is issued (issue transition); or if the entry is discarded due to an exception created by a prior instruction (discard transition).

4.4.
(b) Reorder Buffer Out-of-order execution allows independent instructions to be executed in an order that is different from the original program order. However, out-of-order instructions must complete in program order to ensure precise exception handling. Pipelines of superscalar processors can be typically divided into an in-order frontend, an out-of-order execution core and an in-order backend. During the last stage of the in-order frontend, an entry is allocated for an instruction in a reorder buffer [23]. The execution of the instruction is then performed in the out-of-order core. When the instruction finally completes or transfers its speculative state into permanent machine state in the in-order backend, the associated reorder buffer entry is deallocated in program order. In essence, a reorder buffer entry is a place holder for results that preserves program order. The PowerPC 604 uses a 16-entry reorder buffer to implement in-order completion at the backend. A reorder buffer entry is allocated during instruction dispatch. When an instruction finishes execution, its status (completed with or without exception) is recorded in the status bits of the corresponding buffer entry. The completion unit retires up to four finished instructions per cycle from the reorder buffer and updates register files in the complete stage. The completion unit recognizes exceptions and, if necessary, discards any operations performed by subsequent instructions in program order. Figure 9 shows the FSM model for the operation of each reorder buffer entry. A reorder buffer entry is available for allocation if its FSM is in the Free state. The FSM transitions from the Free state to the Allocate state when the instruction is dispatched to a reservation station. The FSM will transition from the Allocate state to the Execute state and finally to the Finish state when the instruction executes and finishes, respectively. 
The reorder buffer entries can be discarded if the corresponding instructions follow an instruction that causes an exception (discard transition).

Fig. 9. FSM model for an entry in the reorder buffer.

5. Test Generation

Based on the FSM models of the critical buffers, a test sequence can be derived using FSM testing techniques. The test sequence consists of a sequence of FSM transitions that exercise the FSM in some prescribed way (such as a complete or partial state tour, a complete or partial transition tour, or a checking experiment [24]). An obvious tradeoff exists between sequence generation complexity (effort required and sequence length) and the level of validation achieved, depending on the FSM testing technique selected. Partial tours cover a given subset of the states or transitions of an FSM, while complete tours visit every state or every transition in the FSM. A complete transition tour is widely used because it is an effective means for exercising a significant amount of the FSM’s functionality while requiring little effort to generate. Although a transition tour cannot guarantee total correctness, it is a proven technique for uncovering design errors [25].

A checking sequence [24] is the most powerful form of FSM validation. It consists of three phases. The first phase synchronizes the machine into a particular state given the general assumption that the initial state is unknown. The second phase verifies the existence of all the specified states, while the third ensures the correctness of each state transition. The checking sequence for an FSM guarantees that the FSM of a buffer entry is indeed the machine originally described and not one of the many other possible state machines with the same number of states or fewer. Thus, any buffer entry not completely satisfying the microarchitecture specification according to the FSM model will be discovered by the simulation of a checking sequence. We choose to use a complete transition tour for testing our FSM models since it provides a good tradeoff between sequence generation effort and the level of validation achieved.

Small sequences of PowerPC assembly instructions are used to translate transition tours of the FSMs into a simulatable instruction sequence, i.e., an assembly program. These small instruction sequences are called atomic sequences. Execution of the instructions in an atomic sequence causes the associated FSM transition to be traversed. An atomic sequence is partitioned into two parts: an initialization subsequence and a single triggering instruction. The initialization subsequence places the processor into a machine state that makes it ready to traverse the transition associated with the atomic sequence. Execution of the triggering instruction then causes the traversal of the transition. Some atomic sequences do not require the initialization subsequence (those for the reorder buffer, for example). Figure 10 shows an example of an atomic sequence that triggers the RNT transition from state (WT, T = 0) to (WNT, T = 0) of the conditional branch FSM for the BTAC in Fig. 5. The instruction cmpi is a part of the initializing subsequence. The triggering instruction is the conditional branch instruction bc.

    BC_ADDR0: cmpi 0, 30, 1       # Set CR0
              bc 12, 2, BC_0      # Make transition
    BC_0:                         # Target instruction
              ori 0, 0, 0         # NOP

Fig. 10. Traversal of the RNT arc of Fig. 5 using an atomic sequence of assembly instructions.

It is important to note that there exist many atomic sequences that can trigger a particular transition in an FSM. Two atomic sequences are different if their triggering instructions have different parameters such as instruction opcode, instruction operands, etc. A more robust and systematic means to specify an atomic sequence is to use two templates. A sequence template is used to specify the structure (order of instructions)
of an atomic sequence. The sequence template for the atomic sequence of Fig. 10 is shown in Fig. 11(a), where the conditional branch instruction bc is replaced by <conditional branch>, which is an instruction template. An instruction template specifies the set of instructions that can serve as a triggering instruction in the sequence template.

    (a)
              cmpi 0, 30, 1           # Initializing subsequence
              <conditional branch>    # Triggering instruction
    BC_0:                             # Target instruction
              ori 0, 0, 0             # NOP

    (b)
              Instruction parameters
    Opcode    Operand 0       Operand 1     Operand 2       Operand 3    Total # template instances
    add       GPR0-GPR31      GPR0-GPR31    GPR0-GPR31      -            2^15
    bc        BO0-BO31        BI0-BI31      imm (14 bits)   -            2^24
    b         imm (24 bits)   -             -               -            2^24
    ...       ...

Fig. 11. Example of (a) a sequence template of the atomic sequence in Fig. 10 and (b) PowerPC instruction set table for instruction templates.

Given an ISA specification, all instruction templates can be listed systematically in the form of a table as shown in Fig. 11(b). Every instruction in the ISA is listed along with the possible range of values for each instruction parameter. For example, the instruction template of the bc instruction in Fig. 11(b) requires three operands: BO (a five-bit immediate value that specifies the condition under which the branch is taken), BI (a five-bit immediate specifying the bit in the condition register to be used as the condition of the branch) and a 14-bit immediate for the target address. Thus, a total of 2^24 instances of the bc instruction can be used in the sequence template of Fig. 11(a). Other conditional branches, such as bca, bcl and bcla, can also be used in the sequence template of Fig. 11(a) instead of bc. Clearly, enumerating all possible values for every instruction parameter in an instruction template is overkill. Therefore, some criteria are needed for selecting a subset of the possible template instances.
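To make the flow from transition tour to atomic sequences more concrete, the following sketch derives a complete transition tour for a small FSM and then pairs each tour step with each selected triggering opcode. It is only an illustration under stated assumptions: the four-state machine `FSM` is a hypothetical stand-in rather than one of the models in the figures, `transition_tour` uses a simple greedy heuristic (walk to the nearest untraversed arc) that does not claim to produce the shortest tour, and `expand` emits symbolic (state, input, opcode) triples instead of real assembly.

```python
from collections import deque

# Hypothetical four-state FSM used only for illustration:
# state -> {input: next_state}
FSM = {
    "SNT": {"T": "WNT", "NT": "SNT"},
    "WNT": {"T": "WT", "NT": "SNT"},
    "WT": {"T": "ST", "NT": "WNT"},
    "ST": {"T": "ST", "NT": "WT"},
}


def transition_tour(fsm, start):
    """Return a list of (state, input) steps that traverses every arc at least once."""
    remaining = {(s, i) for s, arcs in fsm.items() for i in arcs}
    tour, state = [], start
    while remaining:
        # Breadth-first search for the shortest input path that ends
        # with an arc that has not been traversed yet.
        queue = deque([(state, [])])
        seen = {state}
        path = None
        while queue:
            s, inputs = queue.popleft()
            pending = sorted(i for i in fsm[s] if (s, i) in remaining)
            if pending:
                path = inputs + [pending[0]]
                break
            for i, nxt in sorted(fsm[s].items()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, inputs + [i]))
        if path is None:
            raise ValueError("some transitions are unreachable from the current state")
        # Walk the path, recording each step and marking its arc as traversed.
        for i in path:
            remaining.discard((state, i))
            tour.append((state, i))
            state = fsm[state][i]
    return tour


def expand(tour, opcodes):
    """Pair every tour step with every selected opcode (one atomic sequence each)."""
    return [(state, inp, op) for op in opcodes for (state, inp) in tour]


if __name__ == "__main__":
    tour = transition_tour(FSM, "SNT")
    program = expand(tour, ["bc", "bca", "bcl", "bcla"])
    print(len(tour), "tour steps,", len(program), "atomic sequences")
    for state, inp, op in program[:3]:
        print(f"# {state} --{inp}--> {FSM[state][inp]}: <atomic sequence using {op}>")
```

Selecting only the opcode as the enumerated parameter keeps the program size at the tour length times the number of opcodes, rather than the 2^24 instances per opcode that full enumeration would require.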
For example, given that we would like to exercise the functionality of an entry in the BHT where the instruction parameter of interest is the instruction opcode, there is a total of four different triggering instructions: bc, bca, bcl and bcla. Consequently, there are four different atomic sequences that can cause a transition in the BHT FSM when the opcode is the only selection criterion for the instruction template. We leave it to the user to specify which instruction parameters are important and relevant to the microarchitecture being validated. At present, sequence templates and instruction parameters deemed important for validation are manually identified and derived from the microarchitecture and the ISA specifications.

Atomic sequences are generated by filling the sequence templates with instructions that are enumerated through the selected instruction parameters. Figure 12 shows the selected instruction parameters that are enumerated for different microarchitecture features. Once the atomic sequences are obtained, assembly-level test programs are created using Perl scripts that concatenate the atomic sequences in the order of the FSM transitions specified in the transition tour.

6. Experimental Results

The MW [26, 27] performance simulator is used for our simulation experiments for the PowerPC 604. Checking code has been added to the simulator to track and measure the coverage of FSM transitions by the program under simulation.
In our earlier work [18], two metrics, transition coverage and mutant detection, were used to show the effectiveness of our sequences. Transition coverage measures the percentage of targeted transitions that are multiply traversed (explained below) during the simulation. For mutant detection, the software model describing the behavior of the PowerPC 604 is purposely modified and used during the simulation. If the transition coverage of the modified version is different from that of the original version, the instruction sequence used in simulation is deemed to detect the mutant.

    Microarchitecture feature          Instruction parameter 1                        Instruction parameter 2
    BTAC (unconditional)               Instruction opcode                             -
    BTAC (conditional)                 Instruction opcode                             -
    BHT                                Instruction opcode                             -
    Rename buffer                      Destination register                           -
    Reservation station (1 operand)    Source operand (register only)                 -
    Reservation station (2 operand)    Source operand I (register only)               Source operand II (register only)
    Reorder buffer                     Instruction type (based on functional unit)    -

Fig. 12. Instruction parameters that are enumerated to generate different atomic sequences for different microarchitecture features.

We only use the transition coverage metric in this paper to show the effectiveness of our method. In this case, coverage is simply the ratio of the number of traversed FSM transitions to the total number of targeted FSM transitions. As pointed out in the previous section, a transition in an FSM can be traversed in many possible ways and generating test stimuli for exhaustive simulation is not feasible. Therefore, the coverage calculation is limited to the number of targeted transitions defined by the selected instruction parameters in the instruction template.
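The coverage ratio defined above can be sketched in a few lines. This is a minimal illustration, not the checking code added to MW: the function names, the representation of a targeted transition as an (arc, parameter value) pair, and the two-arc example data are all assumptions introduced here.

```python
def targeted_transitions(arcs, parameter_values):
    """Every FSM arc paired with every selected instruction-parameter value."""
    return {(arc, value) for arc in arcs for value in parameter_values}


def transition_coverage(targeted, observed):
    """Fraction of targeted transitions that appear at least once in the observed trace.

    Observed traversals that are not in the targeted set are ignored, and
    repeated traversals of the same targeted transition count only once.
    """
    traversed = targeted & set(observed)
    return len(traversed) / len(targeted)


if __name__ == "__main__":
    # Hypothetical example: two arcs, with the opcode as the selected parameter.
    arcs = [("S0", "T"), ("S0", "NT")]
    targeted = targeted_transitions(arcs, ["bc", "bcl"])
    observed = [
        (("S0", "T"), "bc"),
        (("S0", "T"), "bc"),    # repeat, counted once
        (("S0", "NT"), "bc"),
        (("S0", "T"), "bca"),   # not a targeted opcode, ignored
    ]
    print(transition_coverage(targeted, observed))  # 0.5
```

With this definition, a program reaches 100% only when every arc has been traversed by every selected parameter value, which is why a long program that reuses a few opcodes can still score poorly.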
For example, if the instruction parameter selected in the instruction template for the BHT FSM is the instruction opcode, the targeted transitions include each arc traversed by each of the branch instructions bc, bcl, bca and bcla.

Since the current version of MW is strictly trace-driven, it is not capable of simulating the functionality associated with mis-speculated paths. As a result, some transitions in the FSM models are untrackable. The untrackable transitions are shown as dashed arrows in Figs. 7–9. These transitions can only be traversed by executing instructions down speculative paths. Although our test programs include instructions that cause the traversal of those speculative transitions, MW does not allow us to verify the coverage.

Real and pseudorandomly-generated programs are typically used in industry for validating hardware designs [28–32]. Pseudorandom testing requires relatively little effort for test generation compared to other approaches. However, random-based test generators are intended to discover design errors by generating a huge number (millions to billions) of instructions. The test generation time is typically small but the simulation time is quite large.

We evaluate the effectiveness of our method by comparing the transition coverage obtained from our test programs with the coverage obtained from randomly-generated instructions and real workloads (SPEC95 benchmarks). We have created a C program for randomly choosing PowerPC instructions and their operands. Since some PowerPC instructions are not handled by MW, these instructions are noted and not considered for the random program. Instructions affecting the control flow of a program are carefully handled. The generated random program has both forward (i.e. jumps to subroutine) and backward (i.e. loops) branches. However, no nested loops are permitted.
Characteristics of the random program, including the total number of instructions to be generated, the frequency of branch instructions, the maximum number of iterations per loop, and the size of loops and subroutines, can be specified by the user.

Figure 13 shows the number of simulated instructions in the SPEC95 benchmarks (12 white bars), the randomly-generated program (the light grey bar) and our generated test program (the dark grey bar). Note that the y axis of the graph is log scale.

Fig. 13. Program sizes (i.e. the number of instructions) for the SPEC95 benchmarks (white bars), the random program (light grey bar), and our generated test programs (dark grey bars) used for validation through simulation.

Figure 14 shows the percentage of edges in the FSM models of the BTAC and the BHT covered by our generated sequences and the real and randomly-generated programs. Based on the instruction parameters in the table of Fig. 12, each FSM transition in the BTAC FSMs can be traversed by any type of branch. Hence, there are four different ways (b, ba, bl, bla) to traverse a transition by unconditional branches and two ways (bc, bcl, excluding bca, bcla due to the limitations of the PowerPC assembler and loader) for conditional branches. Each transition in the BTAC unconditional branch FSM must be traversed four times using branch instructions b, ba, bl and bla. Similarly, branch instructions bc and bcl must be used to traverse each transition twice in the BTAC conditional branch FSM. For the BHT FSM, each transition can be traversed by any conditional branch (again excluding bca, bcla). Therefore, a transition in a BHT entry must be traversed twice using bc and bcl to achieve 100% coverage.

As shown in Fig. 14, our test programs achieve 100% coverage of all three of the branch prediction FSMs. The SPEC benchmarks have an edge coverage of less than 50% for all the FSMs. This low coverage has two causes: the SPEC benchmarks do not contain the branch types ba, bla and bcl, and a large portion of their dynamic branch instructions map to the same BHT entry. The randomly-generated program is able to achieve 100% coverage for some of the branch prediction FSMs. Our simulation results report that the 100% coverage for the BTAC conditional branch FSM and the BHT FSM is achieved after executing 889K and 1.6M instructions, respectively. Even though the random program has 100% coverage on these two FSMs, the number of instructions required is about 3× and 19× more than our sequences for the BTAC conditional branch FSM and the BHT FSM, respectively.

Fig. 14. FSM transition coverage comparison between the SPEC95 benchmarks (first 12 sets of bars), the random program (the next-to-last set of bars) and our generated test program (the last set of bars).

Fig. 15. Rename buffer coverage comparison between the SPEC95 benchmarks (first 12 sets of bars), the random program (the next-to-last set of bars) and our generated test programs (the last set of bars).

Figure 15 shows overall coverage of the GRB, FRB and CRB rename buffers. Our test sequence is exhaustive in that each FSM model edge is traversed in every possible way. For an entry in the GRB, this means that each edge can be traversed using any one of the 32 different general purpose registers. Similarly, there are 32 possible ways to traverse each edge of the FRB model since there are 32 different floating point registers. But there are only eight alternatives for a condition code rename entry since there are only eight condition registers. Figure 15 shows that none of the SPEC95 benchmarks achieve 100% coverage of the targeted functionality of the rename buffer.
The benchmark programs from compress to vortex have low FRB coverage because these benchmarks are integer applications and therefore contain very few floating point instructions. The random program reaches 100% edge coverage for the FRB after executing 567K instructions (81× more than ours) and for the CRB after executing 316K instructions (45× more). Our test programs, however, achieve 100% coverage using significantly fewer instructions.

Figure 16 shows overall coverage for the reorder buffer. The targeted functionality for the reorder buffer is quite simple. Here, we ensure that each buffer entry is exercised by each of the six functional units: two simple integer units (SFX0 and SFX1), one complex integer unit (CFX), one floating point unit (FPU), one load/store unit (LSU), and one branch unit (BRU). All but three of the benchmark programs (go, li, and vortex) achieve 100% coverage. The random program requires 2200 instructions to achieve 100% coverage for the reorder buffer while our sequence requires only 300 instructions.

Fig. 16. Reorder buffer coverage comparison between the SPEC95 benchmarks (first 12 bars), the random program (the next-to-last bar), and our generated test program (the last bar).

Figures 17 and 18 compare coverages for the 1- and 2-operand reservation station FSMs, respectively, for the four functional units SFX0, SFX1, CFX and FPU. The targeted functionality for each reservation station includes traversing each transition with all register operand possibilities. For the two-operand reservation stations, this means there are 1024 different alternatives for exercising an edge since there are 32 register choices for each of the two sources. The one-operand reservation stations have only 32 possible alternatives since there is only one source. The benchmark coverage of the one-operand reservation station is quite low, with the highest coverage achieved being about 40% for the program fpppp. The benchmark coverage is even lower for the two-operand reservation stations. No benchmark program achieves more than 15% coverage. For the randomly-generated program, the coverage for the one-operand reservation station is almost 100% for the floating point reservation stations. However, the random program has poor coverage for the one-operand reservation station in the complex integer unit (only 3%). The highest two-operand edge coverage achieved by the random program is for the simple integer reservation station FSM, which is a mere 33%.

Fig. 17. One-operand reservation station coverage comparison between the SPEC95 benchmarks (first 12 sets of bars), the random program (the next-to-last set of bars), and our generated test program (the last set of bars).

Fig. 18. Two-operand reservation station coverage comparison between the SPEC95 benchmarks (first 12 sets of bars), the random program (the next-to-last set of bars) and our generated test program (the last set of bars).

7. Conclusions and Future Work

We have shown that the microarchitecture features (branch prediction, register renaming, reorder buffers, etc.) of superscalar processors can be effectively modeled as a set of critical buffers. Each buffer consists of multiple, identical entries, each of which can be modeled by a simple FSM. Based on this model, traditional iterative array and FSM testing techniques were leveraged to create test programs that validate these complex microarchitecture mechanisms. Our test programs achieved 100% coverage with 1000× fewer instructions than benchmark programs and 3× to 81× fewer instructions than randomly-generated programs. Moreover, the benchmark and randomly-generated programs do not achieve 100% coverage for every FSM.

Our future work will focus on validating the interaction that exists between buffer FSMs. This work adopts a “single fault assumption” which means we assume that each buffer in the microarchitecture operates
correctly except for the one under consideration. Our future work relaxes this assumption by considering the communication among groups of FSMs.

We are also increasing the level of automation of this validation method. We envision an ATPG-like algorithm which accepts FSM models of the microarchitecture, information about the instruction set architecture and coverage goals that are set by the program user. The output of this ATPG algorithm would be an assembly-level test program that “exercises” the functionality targeted by the FSMs. The type of exercise can range from tours and checking experiments applied to individual FSMs to the intercommunication of these FSMs. The PowerPC 604 will continue to serve as our case study but the algorithm would be general and therefore applicable to any superscalar processor that uses similar microarchitecture mechanisms.

References

1. B. Black and J.P. Shen, “Calibration of Microprocessor Performance Models,” IEEE Computer, Vol. 31, No. 5, pp. 59–65, May 1998.
2. M. Abramovici, M.A. Breuer, and A.D. Friedman, Digital Systems Testing and Testable Design, IEEE Press, Piscataway, NJ, 1991.
3. R.E. Bryant, “Graph Based Algorithms for Boolean Function Manipulation,” IEEE Transactions on Computers, Vol. C-35, No. 8, pp. 677–691, Aug. 1986.
4. R.E. Bryant, D.L. Beatty, and C.J.H. Seger, “Formal Hardware Verification by Symbolic Ternary Trajectory Evaluation,” Proc. of Design Automation Conference (DAC), June 1991, pp. 397–402.
5. M.C. McFarland, “Formal Verification of Sequential Hardware: A Tutorial,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 12, No. 5, pp. 633–654, May 1993.
6. K.L. McMillan, Symbolic Model Checking, Kluwer Academic Publishers, 1993.
7. M. Yoeli, Formal Verification of Hardware Design, IEEE Computer Society Press, Los Alamitos, CA, 1990.
8. J. Burch and D. Dill, “Automatic Verification of Pipelined Microprocessor Control,” International Conference on Computer Aided Verification, June 1994, pp. 68–80.
9. K.L. Nelson, A. Jain, and R.E. Bryant, “Formal Verification of a Superscalar Execution Unit,” Proc. of Design Automation Conference (DAC), June 1997, pp. 161–166.
10. C.L. Berman and L.H. Trevillyan, “Functional Comparison of Logic Designs for VLSI Circuits,” Proc. of International Conference on Computer Design (ICCD), Nov. 1989, pp. 456–459.
11. D.L. Beatty and R.E. Bryant, “Formally Verifying a Microprocessor Using a Simulation Methodology,” Proc. of Design Automation Conference, June 1994, pp. 596–602.
12. D. Geist, M. Farkas, A. Landver, Y. Lichtenstein, S. Ur, and Y. Wolfsthal, “Coverage-Directed Test Generation Using Symbolic Techniques,” Proc. of the International Conference in Formal Methods in Computer-Aided Design, Nov. 1996, pp. 143–158.
13. R.C. Ho, C.H. Yang, M.A. Horowitz, and D.L. Dill, “Architecture Validation for Processors,” Proc. of the International Symposium on Computer Architecture, June 1995, pp. 404–413.
14. H. Iwashita, S. Kowatari, T. Nakata, and F. Hirose, “Automatic Test Program Generation for Pipelined Processors,” Proc. of International Conference on Computer-Aided Design, Nov. 1994, pp. 580–583.
15. D. Moundanos, J.A. Abraham, and Y.V. Hoskote, “Abstraction Techniques for Validation Coverage Analysis and Test Generation,” IEEE Transactions on Computers, Vol. 47, No. 1, pp. 2–14, Jan. 1998.
16. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Superscalar Processor Validation at the Microarchitecture Level,” Digest of Papers of the International High Level Design Validation and Test Workshop (HLDVT), Nov. 1997, pp. 202–209.
17. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Validation of Speculative and Out-of-Order Execution Microarchitecture,” First International Workshop on Microprocessor Test and Verification, Oct. 1998.
18. N. Utamaphethai, R.D. Blanton, and J.P. Shen, “Superscalar Processor Validation at the Microarchitecture Level,” Proc. of International Conference on VLSI Design, Jan. 1999, pp. 300–305.
19. A.G. Liles, Jr. and B.E. Willner, “Branch Prediction Mechanism,” IBM Technical Disclosure Bulletin, Vol. 22, No. 7, pp. 3013–3016, Dec. 1979.
20. J.K.F. Lee and A.J. Smith, “Branch Prediction Strategies and Branch Target Buffer Design,” IEEE Computer, pp. 6–22, Jan. 1984.
21. R.M. Keller, “Look-Ahead Processors,” Computing Surveys, Vol. 7, No. 4, pp. 177–195, Dec. 1975.
22. R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal, Vol. 11, No. 1, pp. 25–33, Jan. 1967.
23. J.E. Smith and A.R. Pleszkun, “Implementation of Precise Interrupts in Pipelined Processors,” Proc. of the International Symposium on Computer Architecture, June 1985, pp. 36–44.
24. Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill, New York, 1978.
25. S. Naito and M. Tsunoyama, “Fault Detection for Sequential Machines by Transition-Tours,” Proc. of the International Symposium on Fault-Tolerant Computing, June 1981, pp. 238–243.
26. T.A. Diep, “VMW: A Visualization-based Microarchitecture Workbench,” Ph.D. Thesis, Carnegie Mellon University, Aug. 1995.
27. A.S. Huang and T.A. Diep, “MW Developer’s Guide,” Technical Report, CMuART-95-1, Carnegie Mellon University, Aug. 1995.
28. A. Aharon, A. Bar-David, B. Dorfman, E. Gofman, M. Leibowitz, and V. Schwartzburd, “Verification of the IBM RISC System/6000 by a Dynamic Biased Pseudo-Random Test Program Generator,” IBM Systems Journal, Vol. 30, No. 4, pp. 527–538, 1991.
29. P. Bose, “Architectural Timing Verification and Test for Superscalar Processors,” Proc. of the International Symposium on Fault-Tolerant Computing, June 1994, pp. 256–265.
30. P. Bose, “Performance Test Case Generation for Microprocessors,” Proc. of VLSI Test Symposium, Apr. 1998, pp. 54–59.
31. N. Dohm et al., “Zen and Art of Alpha Verification Microarchitecture Level,” Proc. of International Conference on Computer Design, Oct. 1998, pp. 111–117.
32. S.T. Mangelsdorf et al., “Functional Verification of the HP PA 8000 Processor,” Hewlett-Packard Journal, pp. 22–31, Aug. 1997.

Noppanunt Utamaphethai is a PhD candidate in the ECE Department at Carnegie Mellon University. Currently, he is working on the ATPG-based validation of microprocessors and has applied his methodology to designs at both IBM and Intel. He received his BS from Brown University and his MS from CMU.

Shawn Blanton is an assistant professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University where he is a member of the Center for Electronic Design Automation. He received the Bachelor’s degree in engineering from Calvin College in 1987, a Master’s degree in Electrical Engineering in 1989 from the University of Arizona, and a Ph.D. degree in Computer Science and Engineering from the University of Michigan, Ann Arbor in 1995. His research interests include the computer-aided design of VLSI circuits and systems; verification and testing; and computer architecture. Dr. Blanton is the recipient of a National Science Foundation Career Award and is a member of IEEE and ACM.

John Paul Shen is a professor in CMU’s ECE Department and heads up the Carnegie Mellon Microarchitecture Research Team (CMuART). He received a BS from the University of Michigan and an MS and PhD from the University of Southern California, all in Electrical Engineering. He spent several years at Hughes and TRW. His current research interests are in high performance microprocessor design and validation, speculative and dynamic microarchitectures, and software thread integration for embedded computing systems. He is an IEEE fellow.