AN INTEGRATED FUNCTIONAL
PERFORMANCE SIMULATOR
THE FMW POWERPC-BASED SIMULATION TOOL WILL HELP DESIGNERS
ACCURATELY EVALUATE THE EFFECTIVENESS AND VALIDATE THE
CORRECTNESS OF NEW MICROPROCESSOR MECHANISMS.
Candice Bechem, Jonathan Combs, Noppanunt Utamaphethai, Bryan Black, R.D. Shawn Blanton, and John Paul Shen
Carnegie Mellon University
Microprocessor designers use multiple simulation tools with varying degrees of
modeling details ranging from the instruction
set of the microprocessor to the circuit implementation. Here, we focus on tool design for
the development of microarchitectures, which
implement the instruction set. Microarchitecture design involves both functional and
performance simulators.1-5 A functional simulator models a machine’s architecture, or
instruction set, with functional correctness. A
performance simulator models the machine
organization, or microarchitecture, and is concerned with machine performance. Sometimes these performance simulators are also
referred to as cycle-accurate simulators to
reflect their concern with timing issues.
Background
To reduce simulation time, designers have
traditionally implemented performance simulators as trace-driven tools; that is, their
inputs are traces of dynamic instructions,
without full-function simulation capability.
This type of simulator processes an execution
trace of a benchmark to produce measurements of the dynamic use of machine
resources, the throughput at various pipeline
stages, and ultimately the performance of the
machine, measured in IPC (average instructions per cycle). Figure 1 illustrates one such
tool called MW (microarchitecture workbench),6,7 which was developed at Carnegie
Mellon and which we have used extensively
in our microarchitecture research. An earlier
work validated the MW PowerPC 604 performance simulator used in this study against
an actual PowerPC 604 system.1
Since the 1980s, trace-driven performance
simulators have become popular for assessing
microprocessor performance. By avoiding full-function simulation, these performance simulators can process extremely long traces in a
reasonable amount of time. However, in
recent years four weaknesses of trace-driven
performance simulators have emerged.
First, the complexity of microarchitecture
has increased dramatically, causing trace-driven performance simulation to become quite
time consuming, thus reducing the simulation-time benefit of the trace-driven approach.
Since trace-driven simulation has become
much more time consuming than functional
simulation, it is possible to include functional capabilities in the traditional trace-driven
simulator without significantly impacting
simulation time.
Second, there are inherent limitations to the
capabilities of a trace-driven simulator. Typically, such a simulator processes only the trace
of instructions executed (I trace) and the trace
of memory addresses referenced (M trace).
0272-1732/99/$10.00  1999 IEEE
© 1999 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media,
including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution
to servers or lists, or reuse of any copyrighted component of this work in other works.
Most contemporary microprocessors employ
some form of dynamic branch prediction.
During program execution, it is possible for
the branch predictor to mispredict and send
the machine temporarily on a mispredicted
path. If a misprediction is detected during
branch resolution, the machine recovers by
flushing the mispredicted path instructions.
Since both the I and M traces contain only
nonspeculative instructions, it is impossible
to simulate the processing and the dynamic
effects of mispredicted path instructions using
only these traces. To correct this problem,
some trace-driven simulators insert a fixed
number of branch stall cycles or inject artificial instructions to mimic the mispredicted
path instructions. Both of these approaches
only approximate the actual machine behavior. The better solution is to simulate the
branch predictor, the associated speculative
processing of instructions, and the recovery
mechanism when misprediction is detected.
This involves the direct simulation of the mispredicted path instructions, including their
fetching, decoding, dispatching, execution,
and flushing, which requires a full-function
performance simulator.
The third weakness of a trace-driven performance simulator is the lack of instruction execution results. Both data-dependent instruction
execution and more recent value prediction
techniques8,9 require instruction results to be
accurately simulated. Unfortunately, trace-driven simulators cannot accurately provide
instruction results using data traces.
Finally, researchers have proposed systematic methods for generating instruction sequences
that thoroughly test the microarchitecture.10,11
These methods rely on an accurate performance model of the microarchitecture to confirm the effectiveness of the test sequences.
Since trace-driven simulators do not model the
execution of mispredicted instructions, it is
impossible to validate any speculative device or
recovery mechanism.
Figure 1. The MW (microarchitecture workbench) is a typical trace-driven performance simulator. [Diagram labels: trace generation; PowerPC architecture specification; microarchitecture workbench; outputs IPC, peak usage, and utilization.]
Motivations
There is a serious need for a new performance simulation tool that can address these shortcomings. The key requirements for the new tool include full-function simulation and cycle-accurate microarchitecture modeling. The addition of full-function capability should not significantly increase the total simulation time. Direct simulation of mispredicted path instructions and value prediction mechanisms could then be supported. Such a simulation tool can also be used to validate speculation and recovery mechanisms.
Figure 2. PSIM (functional simulator) and MW (performance simulator) are integrated and tightly coupled to provide the fMW framework. [Diagram labels: functional simulator (PSIM); directs execution; instruction results; microarchitecture workbench; PowerPC architecture specification; outputs IPC, peak usage, and utilization.]
Figure 3. Checkpointing of PSIM as implemented in the fMW framework. [Diagram labels: basic blocks A, B, and C; save state; mispredicted path; branch resolution; revert context; correct path.]
MAY–JUNE 1999
The functional MW (fMW)
This article presents the design and implementation of a new performance simulator with full-function capability called fMW (functional microarchitecture workbench). This tool replaces our original MW tool.6,7 Both MW and fMW are based on the PowerPC architecture and faithfully model all the PowerPC instructions. The fMW builds on MW by incorporating a customized version of the PSIM12 functional simulator and by extending the capabilities of the original MW. (See Figure 2.) This coupling of PSIM and MW guarantees accurate execution of instructions (including runtime data values) and cycle-accurate timing simulation. Furthermore, the development of this tool enables
future research involving multiple instruction
streams, such as simultaneous multithreading
and dual-path execution, as well as research
on value prediction8,9 that requires runtime
register and memory values.
To demonstrate the effectiveness of the
fMW tool, we present two recent research
studies. The first study investigates the effects
of mispredicted path instructions on the cache
hierarchy,13 while the second concerns the validation of speculation and recovery mechanisms.10,11 The capabilities of fMW are quite
similar to those of SimpleScalar;4 however,
fMW is based on the PowerPC architecture
and executes binaries compiled for NetBSD.
PSIM’s ability to translate PowerPC system
calls to the local machine12 makes fMW highly portable; it does not require native execution on PowerPC platforms.
Implementation details
MW models the pipeline resources and
cache hierarchy, and simulates at the level of
machine cycles. PSIM, on the other hand,
operates at the instruction level and interprets
or “executes” one instruction at a time. PSIM
maintains the architectural state by tracking
all register and memory updates. It has no
knowledge of the cache hierarchy or any other
features of the microarchitecture.
As PSIM executes each instruction, it bundles the instruction with its execution results
and passes it on to MW. When the branch
predictor mispredicts a branch instruction,
MW instructs PSIM to traverse the mispredicted path. PSIM checkpoints its current
state for later recovery and begins execution
on the mispredicted path.
IEEE MICRO
In Figure 3, the branch at the end of basic
block A is mispredicted. MW instructs PSIM
to checkpoint the architectural state and begin
execution on the mispredicted path (basic block
B). Later, the MW executes the branch instruction, detects the misprediction, and corrects it
by flushing the mispredicted path instructions.
As the MW simulation recovers, PSIM reverts
to the saved state and begins execution on the
correct path (basic block C). In this manner,
all mispredicted path instructions are accurately
accounted for and directly simulated in MW,
while the machine state and data values stored
in PSIM are correctly maintained.
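The checkpoint-and-revert sequence of Figure 3 can be illustrated with a minimal Python sketch. The `FunctionalSim` class, its fields, and the `(dest, value, next_pc)` instruction tuples are hypothetical stand-ins and are not PSIM's actual interfaces; a real simulator would likely save state far more cheaply than a deep copy.

```python
import copy


class FunctionalSim:
    """Toy stand-in for a functional simulator's architectural state (hypothetical)."""

    def __init__(self):
        self.regs = {}
        self.mem = {}
        self.pc = 0
        self._checkpoint = None

    def checkpoint(self):
        # Save a deep copy of registers, memory, and PC before leaving the correct path.
        self._checkpoint = (copy.deepcopy(self.regs), copy.deepcopy(self.mem), self.pc)

    def revert(self):
        # Restore the saved state; all wrong-path updates are discarded.
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint to revert to")
        self.regs, self.mem, self.pc = self._checkpoint
        self._checkpoint = None

    def execute(self, dest, value, next_pc):
        # "Execute" one instruction: write a register and advance the PC.
        self.regs[dest] = value
        self.pc = next_pc


def simulate_misprediction(sim, wrong_path, correct_path):
    """Run wrong-path instructions after a checkpoint, then revert and run the correct path."""
    sim.checkpoint()              # end of block A: the branch is mispredicted
    for insn in wrong_path:       # block B: speculative work that will be flushed
        sim.execute(*insn)
    sim.revert()                  # branch resolved: restore the saved state
    for insn in correct_path:     # block C: the correct path
        sim.execute(*insn)
    return sim


sim = FunctionalSim()
sim.execute("r1", 5, 4)
simulate_misprediction(sim, [("r1", 99, 8), ("r2", 7, 12)], [("r3", 1, 20)])
print(sim.regs, sim.pc)
```

In this sketch the wrong-path writes to `r1` and `r2` leave no trace in the final state, while a timing model could still have observed every wrong-path instruction as it executed.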
Implementation difficulties
Simulating instructions on a mispredicted
path creates several interesting problems. System calls, interrupts, exceptions, and unnatural data values encountered on a mispredicted
path can cause irreparable state changes. For
example, if an exit() call is encountered, PSIM
will execute the call and terminate execution,
ending the simulation. Other problems include
accessing unmapped memory due to incorrect
address values, and exceptions caused by unnatural incorrect data values. To alleviate these
problems, PSIM suspends all system calls,
interrupts, and exceptions when executing mispredicted paths.
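One way to picture this suppression is a guard flag that is set while the functional simulator runs on a mispredicted path. The sketch below is only an illustration under assumed names (`GuardedSimulator`, `on_mispredicted_path`); returning 0 for a suppressed load is also an assumption, not documented PSIM behavior.

```python
class GuardedSimulator:
    """Toy functional simulator that ignores side effects on a mispredicted path (hypothetical)."""

    def __init__(self):
        self.on_mispredicted_path = False
        self.terminated = False
        self.suppressed = []  # events skipped while speculating

    def system_call(self, name):
        if self.on_mispredicted_path:
            # Record and skip: a wrong-path exit() must not end the simulation.
            self.suppressed.append(("syscall", name))
            return False
        if name == "exit":
            self.terminated = True
        return True

    def load(self, memory, address):
        if address not in memory:
            if self.on_mispredicted_path:
                # Wrong-path access to unmapped memory: suppress the exception.
                self.suppressed.append(("unmapped", address))
                return 0
            raise KeyError(f"unmapped address {address:#x}")
        return memory[address]


sim = GuardedSimulator()
sim.on_mispredicted_path = True
sim.system_call("exit")                    # ignored while speculating
value = sim.load({0x1000: 42}, 0xDEAD)     # no exception while speculating
sim.on_mispredicted_path = False
print(sim.terminated, value, sim.suppressed)
```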
fMW performance
The interaction overhead between PSIM
and MW, and the execution of mispredicted
paths, slightly reduce the simulation speed.
The original trace-driven MW tool simulated
approximately 20,000 to 25,000 instructions
per second (20 to 25 KIPS). The fMW can simulate
approximately 15 to 20 KIPS. Both measurements were performed on 200-MHz Pentium
Pro machines running Linux. The fMW’s
speed is encumbered by years of software evolution. Extensive code optimization is currently underway, which will significantly
improve the speed of the fMW tool.
fMW applications
Two studies demonstrate the usefulness of
fMW. The first examines the effects that mispredicted path instructions have on the cache
hierarchy.13 The second study quantifies the
coverage achieved by microarchitecture validation test sequences.10,11 The fMW tool
enabled both of these studies, which were not
possible with the earlier MW tool.
Cache effects of mispredicted paths
To achieve high instruction fetch bandwidth, modern microprocessors employ
branch predictors to speculatively fetch
instructions beyond conditional branches. If
these speculative instructions are determined
to be on a mispredicted path, they must be
invalidated and removed from the machine.
Mispredicted path instructions can affect
many parts of the machine, particularly the
functional units, branch predictors, and
caches. This study examines the effect of mispredicted path execution on performance and
the cache hierarchy.
Previous work. A handful of studies14-16 have
examined the effects of mispredicted paths;
however, each of these efforts is hampered by
inadequate modeling techniques. One simulator15,16 is trace driven, leading to several inaccuracies. Since trace-driven simulators cannot
execute mispredicted path instructions, Pierce
and Mudge injected a fixed number of instructions to emulate the mispredicted path.15 However, the number of cycles a given machine
spends on each mispredicted path depends on
the aggressiveness of the branch predictor and
the branch resolution latency. Fixing the branch
resolution latency at a constant number of
instructions introduces significant error. Nevertheless, using this method, Pierce and Mudge
found that the mispredicted path instructions
tend to prefetch the data cache.
A continuation work16 using the same tool
focused on the instruction cache and found
that the prefetching effects of mispredicted
path instructions far outweigh the pollution
caused by them. Lee et al.14 studied instruction cache fetch policies for speculative execution using a cache simulator. They found
that mispredicted path instructions did not
cause degradation in performance over fetching only the correct path.
Previous studies generally show mispredicted path execution to be a beneficial prefetching mechanism for the instruction cache.14-16 Lee et al. also suggested similar benefit for the data cache.14 However, the methods that measured these effects had serious limitations and are inherently inaccurate. The fMW removes such inaccuracies by directly simulating the mispredicted path instructions in the machine model. The following section summarizes our experimental results; an earlier work provides more detailed results.13
Table 1. SPECint95 benchmarks.
Name      Input set                                          Instruction count
compress  10,000 e 2231                                             39,719,131
gcc       -f<all optimizations> -O regclass.I -s regclass.s        257,670,349
go        59                                                        79,544,303
ijpeg     tinyrose.ppm                                              92,054,217
li        queen6.lsp                                                56,572,774
m88ksim   dhry.big.100iter, cache off                              106,900,787
perl      trainscrabble.in                                          50,039,056
vortex    tiny.in                                                  153,084,257
Experimental results. We used the SPECint95
benchmark suite for our experiments. Table
1 summarizes the input sets and run lengths
of each benchmark. To focus the current study
on the effects of speculative execution and to
emphasize the effect of mispredicted instructions in the pipeline, we extended the PowerPC 604 microarchitecture to remove resource
constraints and widened it to allow a greater
number of in-flight instructions. The instruction window is limited to 512 instructions
with an unlimited number of functional units
and an unlimited number of rename registers.
Instruction fetch and dispatch widths are
increased to 16 instructions per cycle. A 64-entry, fully associative branch target address
cache (BTAC) and a 512-entry branch history table (BHT) handle branch prediction.
The memory hierarchy includes a perfect
(100% hit rate) main memory; a 32-Kbyte,
four-way set-associative level-1 instruction
cache (IL1); a 32-Kbyte, eight-way set-associative level-1 data cache (DL1); and a 512-Kbyte, eight-way set-associative, unified
level-2 cache (UL2). All caches use a write-back, write-allocate scheme. Access latencies
are 1, 3, and 100 cycles for the L1, L2, and
main memory respectively.
Due to space constraints, we discuss only instruction cache results here. For more detailed and extensive results, please refer to Combs, Bechem, and Shen.13
To determine when the instruction cache is polluted or prefetched, we used two copies of the memory system during simulation. One maintains the memory state for both correct path and mispredicted path instructions, while the other maintains the memory state for only correct path instructions. Any latency difference between the two memory systems is due to mispredicted path instructions. If the access latency of the correct path-only memory is greater than the latency of the memory updated by the mispredicted path, the mispredicted path has prefetched into the instruction cache. If the opposite occurs, the mispredicted path has polluted the instruction cache. Prefetching improves performance and is counted as a gain, measured in cycles; pollution is counted as a loss.
Table 2 shows the instruction cache access discrepancies observed when mispredicted path instructions are simulated in fMW. The table lists the number of prefetching and polluting accesses along with the average number of cycles gained (for each prefetch access) or lost (for each polluting access). The “Net Change” column records the overall cache latency cycle changes caused by the mispredicted path accesses [(prefetching accesses × cycles gained per access) − (polluting accesses × cycles lost per access)]. A positive net change indicates a reduction in cycle count (performance gain), whereas a negative change indicates an increase in cycle count (performance loss).
Table 2. Instruction cache access discrepancies caused by mispredicted path instructions.
Benchmark   Pollution   Avg. cycle   Prefetch   Avg. cycle    Net change
             accesses  loss/access   accesses  gain/access      (cycles)
compress          207         1.00        110        24.44         2,481
gcc         2,356,036        14.92    543,954        34.35   −16,479,381
go            349,958         3.95    188,388        50.51     8,133,593
ijpeg         108,252         2.88     42,341        38.56     1,320,485
li                659         7.96        203        34.36         1,730
m88ksim           857         6.28        361        34.25         6,983
perl           96,656        29.60     18,885        45.63    −1,999,238
vortex        278,049        10.16     99,737        34.84       650,993
The results in Table 2 show that most benchmarks have a positive net change due to mispredicted path cache accesses; however, the extent varies greatly. Only perl and gcc show negative net changes.
Examination of the average number of cycles lost/gained shows that most of the prefetching accesses are prefetched from main memory (100-cycle latency), since the average number of cycles gained per prefetching access is in the range of 25 to 50 cycles. On the other hand, most of the polluting accesses are causing L1 (1-cycle latency) misses and resulting in L2 (3-cycle latency) hits. For most benchmarks the average number of cycles lost per polluting access is less than 10 cycles. Perl and gcc
are the exceptions. Both exhibit a significant
number of penalty cycles per polluting access,
indicating a significant number of misses to
main memory. In other words, the polluting
accesses caused by the mispredicted paths have
a tendency to remove valid data not only from
the L1 cache but from the L2 cache as well.
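The two-copy comparison used to classify prefetching and polluting accesses can be sketched in a few lines of Python. This is an illustrative toy, not the fMW memory model: `ToyCache` is a hypothetical four-line direct-mapped cache with only a hit latency and a miss latency, and the fetch stream is invented for the example.

```python
class ToyCache:
    """Tiny direct-mapped cache model; sizes and latencies are illustrative only."""

    def __init__(self, num_lines=4, hit_latency=1, miss_latency=100):
        self.lines = [None] * num_lines
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency

    def access(self, block_addr):
        # Return the access latency and install the block on a miss.
        index = block_addr % len(self.lines)
        if self.lines[index] == block_addr:
            return self.hit_latency
        self.lines[index] = block_addr
        return self.miss_latency


def classify_fetches(fetches):
    """fetches: list of (block_addr, on_correct_path) in fetch order.

    Returns (prefetch_gain, pollution_loss) in cycles."""
    both = ToyCache()      # updated by correct path and mispredicted path fetches
    correct = ToyCache()   # updated by correct path fetches only
    gain = loss = 0
    for addr, on_correct_path in fetches:
        lat_both = both.access(addr)
        if not on_correct_path:
            continue       # a wrong-path fetch touches only the first copy
        lat_correct = correct.access(addr)
        if lat_correct > lat_both:
            gain += lat_correct - lat_both   # the wrong path prefetched this block
        elif lat_both > lat_correct:
            loss += lat_both - lat_correct   # the wrong path evicted this block
    return gain, loss


# Block 1 is prefetched by the wrong path; block 4 (wrong path) evicts block 0.
stream = [(0, True), (1, False), (1, True), (4, False), (0, True)]
print(classify_fetches(stream))
```

Only correct-path accesses are compared, because they are the only accesses the correct-path-only copy ever sees.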
The number of cycles shown in the “Net
Change” column of Table 2 does not translate directly into IPC change. The dynamic
execution of the benchmark determines the
effect of each instruction fetch. When the
machine is stalled, performance is not affected by cache pollution because the next
instruction is not currently needed. Cache
prefetching also has a diminished impact
when the next instruction is not currently
needed.
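As a quick check of the net-change formula, the following sketch recomputes three rows of Table 2 from the per-access averages. The function name `net_change` is ours; because the published averages are rounded to two decimals, the recomputed values match the table only approximately.

```python
def net_change(prefetch_accesses, gain_per_access, pollution_accesses, loss_per_access):
    """Net cache latency change in cycles; positive means fewer cycles overall."""
    return prefetch_accesses * gain_per_access - pollution_accesses * loss_per_access


# Rows taken from Table 2: (prefetch accesses, avg. gain, pollution accesses, avg. loss).
rows = {
    "compress": (110, 24.44, 207, 1.00),
    "li": (203, 34.36, 659, 7.96),
    "perl": (18885, 45.63, 96656, 29.60),
}
results = {name: net_change(*row) for name, row in rows.items()}
for name, value in results.items():
    print(f"{name}: {value:,.0f} cycles")
```

For compress this gives about 2,481 cycles, as in the table; for perl the result is negative, which is consistent with perl being one of the two benchmarks with a net loss.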
Figure 4 shows the actual impact of mispredicted path execution on the IPC. The percent of change ranges from the greatest
increase in go of 12.0% to the perl decrease
of −7.93%. The average across all benchmarks
is approximately 1.0%. This IPC increase is
due to the positive effects of prefetching, while
the reduction of IPC is due to cache pollution
effects induced by the mispredicted path
instructions.
Although the gains are positive for most benchmarks, the IPC changes vary significantly from benchmark to benchmark. Pierce and Mudge15,16 observed that all benchmarks had more prefetching than pollution and thus concluded that the cache effects of mispredicted path instructions would always be beneficial. This observation conflicts with the data of Figure 4.
To accurately assess the IPC performance impact, direct simulation of the mispredicted path instructions is essential. Using fMW, we found that the magnitude of the effect on IPC varies widely from benchmark to benchmark, ranging from −8% to +12%. These results clearly differ from those in previous studies that did not perform direct cycle-accurate simulation of the mispredicted path instructions. These results demonstrate the usefulness and effectiveness of the new fMW tool at yielding more accurate and complete simulation data.
Figure 4. Percent IPC change. [Bar chart: IPC change (%), on an axis from −10 to 15, for compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, and the average.]
Validation of speculation and recovery
Currently, the microprocessor industry relies heavily on simulation for validating microarchitecture mechanisms. Validation involves exercising the simulation models and examining the outcome. To exercise these models for validation, the industry uses instruction sequences or test sequences as input stimuli to the simulation models. Generally, researchers used three types of test sequences.1,3 First, real application programs can be used as test sequences. While these programs may represent the actual user workload, they may not fully exercise the machine. Second, designers generate test programs to probe specific areas of the machine and to test the “corner conditions” of machine behavior. Third, randomly generated programs supplement the previous two types of test sequences.
Using real application programs and randomly generated programs as test sequences can be very inefficient. Explicitly generated test sequences are generated in a very ad hoc fashion based on the intuitive knowledge of the designer. Regardless of the test sequences used, there is no rigorous way to quantitatively assess their coverage at the microarchitecture level. There is a real need for a systematic method to generate highly efficient test sequences for microarchitecture validation that is based on rigorous models of microarchitecture mechanisms and can yield quantitative coverage figures.
Validation method. Recently, we presented a systematic method for generating efficient test sequences that would rigorously validate contemporary superscalar microarchitectures.10,11 These microarchitectures employ deep pipelines, aggressive speculation, and out-of-order execution. This method operates at the microarchitecture level and is intended to validate the behaviors of the key microarchitecture mechanisms:
• dynamic branch prediction,
• register renaming, and
• out-of-order instruction issuing from reservation stations and maintaining precise exception via the reorder buffer.
To handle the complexity of a modern microarchitecture, we partitioned the machine into a set of critical buffers, including the branch target address cache, the branch history table, register rename buffers, reservation stations, and the reorder buffer. Figure 5 illustrates a typical superscalar pipeline with these critical buffers. We view these buffers as critical because the bulk of the machine control logic manages the reading and writing of these buffers.
Each of these critical buffers has multiple symmetrical entries. In our validation method the behavior (reading and writing) of each buffer entry is modeled with a simple finite-state machine (FSM). This FSM model is used to automatically generate an efficient test
sequence that fully exercises the buffer behavior. This approach resembles automatic test
pattern generation (ATPG) for logic testing
and borrows some ideas from functional testing of iterative structures.
Traditional logic testing tests an iterative
array-structured circuit by partitioning the
array into its symmetrical modules. Then each
module is separately and identically tested.
We borrowed this concept for our validation
method.11 Since the buffer entries are symmetrical, a buffer is validated by separately and
identically validating each of its entries. Each
buffer is validated by exercising all the FSM
state transitions for each buffer entry. A test
sequence of instructions is generated that will
force a buffer entry to traverse all of its state
transitions. The state transitions are verified
by monitoring the simulation process and
examining the simulation outcome. This
process is repeated for each of the buffer
entries, then for each of the buffers. The coverage of a test sequence is the percentage of all
possible FSM state transitions exercised by
that test sequence of instructions and verified
by the simulation tool.
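The coverage metric defined above can be written as a small Python sketch. The transition list is our own reading of the rename buffer entry FSM that the article goes on to describe (11 transitions, four of them discard arcs); the state and event names are assumptions made for illustration and are not taken from the authors' tools.

```python
# Hypothetical encoding of the rename buffer entry FSM as (state, event, next_state).
RENAME_FSM = {
    ("Free", "dispatch", "MR_alloc"),
    ("MR_alloc", "stale", "NonMR_alloc"),
    ("MR_alloc", "finish", "MR_valid"),
    ("NonMR_alloc", "finish", "NonMR_valid"),
    ("MR_valid", "stale", "NonMR_valid"),
    ("MR_valid", "wb", "Free"),
    ("NonMR_valid", "wb", "Free"),
    ("MR_alloc", "discard", "Free"),
    ("NonMR_alloc", "discard", "Free"),
    ("MR_valid", "discard", "Free"),
    ("NonMR_valid", "discard", "Free"),
}


def coverage(all_transitions, verified):
    """Percentage of FSM transitions that were exercised and verified."""
    hit = set(verified) & set(all_transitions)
    return 100.0 * len(hit) / len(all_transitions)


# A simulator that cannot model discards can verify at most the other transitions.
no_discard = {t for t in RENAME_FSM if t[1] != "discard"}
print(round(coverage(RENAME_FSM, no_discard)))  # 7 of 11 transitions
```

Under this encoding, a tool that cannot observe the four discard arcs tops out at 7 of 11 transitions, which matches the 64% ceiling the article reports for the original MW on the rename buffer.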
In summary, our ATPG-based validation
method involves
1) partitioning a microarchitecture into its
critical buffers,
2) generating the FSM models for each
entry of all the key buffers,
3) constructing a transition tour for each
FSM model, and
4) synthesizing a test sequence of instructions to carry out each transition tour.
All the test sequences are then used to exercise the simulation model of the microarchitecture to verify the coverage achieved.
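Step 3 can be illustrated with a simple greedy construction of a transition tour: starting from the reset state, repeatedly follow the shortest path to any transition not yet covered. This is a generic heuristic shown for illustration only; it is not necessarily the authors' algorithm and does not guarantee a minimal tour. The reorder buffer FSM below is our reading of the article's description (seven transitions, three of them discard arcs), with assumed state and event names.

```python
from collections import deque

# Hypothetical encoding of the reorder buffer entry FSM as (state, event, next_state).
ROB_FSM = [
    ("Free", "allocate", "Allocation"),
    ("Allocation", "execute", "Execute"),
    ("Execute", "finish", "Finish"),
    ("Finish", "complete", "Free"),
    ("Allocation", "discard", "Free"),
    ("Execute", "discard", "Free"),
    ("Finish", "discard", "Free"),
]


def transition_tour(transitions, start):
    """Greedy tour: repeatedly walk the shortest path to an uncovered transition."""
    out = {}
    for t in transitions:
        out.setdefault(t[0], []).append(t)
    uncovered = set(transitions)
    tour, state = [], start
    while uncovered:
        # Breadth-first search for the shortest path ending in an uncovered transition.
        queue = deque([(state, [])])
        seen = {state}
        path = None
        while queue and path is None:
            s, prefix = queue.popleft()
            for t in out.get(s, []):
                if t in uncovered:
                    path = prefix + [t]
                    break
                if t[2] not in seen:
                    seen.add(t[2])
                    queue.append((t[2], prefix + [t]))
        if path is None:
            raise ValueError("some transitions are unreachable from the current state")
        for t in path:
            uncovered.discard(t)
            tour.append(t)
        state = path[-1][2]
    return tour


tour = transition_tour(ROB_FSM, "Free")
for src, event, dst in tour:
    print(f"{src} --{event}--> {dst}")
```

Each step of the resulting tour would then have to be realized by instructions that force the corresponding buffer event (step 4), which this sketch does not attempt.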
FSM models. This study applies the FSM method to the register rename buffer and the reorder buffer of the PowerPC 604 microarchitecture. Figure 6a illustrates the FSM diagram that models the behavior of each entry of the register rename buffer. An entry is free until the dispatch unit allocates it for an instruction in the dispatch stage. The entry remains allocated until the instruction finishes.
There are two states for an allocated entry. At the time of renaming, each newly allocated rename entry will always hold the most recent (MR) value for the renamed register, denoted by the MR Allocation state. If a rename entry is allocated to a register that is later renamed by another instruction, the previously allocated entry will no longer hold the most recent value and will therefore transition from the MR Allocation state to the NonMR Allocation state. Once the instruction finishes, the content of the rename entry becomes valid, which causes a transition from MR Allocation (NonMR Allocation) to MR Valid (NonMR Valid). The FSM stays in the valid state until the result is written to the register file (WB transition) or a prior instruction causes an exception that requires all subsequent instructions to be discarded (discard transition).
Figure 6b shows the FSM diagram that models each entry of the reorder buffer. A reorder buffer entry is available for allocation if its FSM is in the Free state. When an instruction is dispatched to a reservation station, a reorder buffer entry is allocated, and the entry transitions from Free to Allocation. The FSM will transition from the Allocation state to the Execute state and finally to the Finish state as the instruction executes and finishes. Mispredicted path instructions are removed from the reorder buffer after branch execution. The discard transition is traversed when an instruction is flushed from the reorder buffer.
Figure 5. A microarchitecture viewed as a set of critical buffers. [Diagram labels: instruction cache; branch prediction; fetch buffer; decode; decode buffer; dispatch; rename buffers; reservation stations; execution units bru, sfx0, sfx1, cfx, fpu, and ld/st; reorder buffer; critical buffer entry with read and write.]
Figure 6. The FSM models of a register rename buffer entry (a) and a reorder buffer entry (b). [Diagram labels include Free, MR allocation, NonMR allocation, MR valid, NonMR valid, Allocation, Execute, Finish, Dispatch, Allocate, Stale, Complete, WB, and Discard.]
Experimental results. In earlier works,10,11 we used the original version of MW to simulate the microarchitecture. Given the limitation of the original MW, the simulation of certain FSM state transitions was not possible. The “discard” arcs in the FSM diagrams of Figure 6a,b indicate these transitions. Consequently, the coverage of these transitions by the test sequences cannot be verified, resulting in relatively low coverage of the total number of state transitions. With the availability of the fMW tool, we can now simulate and verify all of these state transitions. Here, we highlight the results and benefits of using the fMW tool for determining the coverage of the test sequences in performing validation of the PowerPC 604 microarchitecture.
Using the ATPG-based validation method, we generated a test sequence totaling 97,000 instructions so we could validate the rename buffer and the reorder buffer. For comparison we also used the SPECint benchmarks as a second test sequence. Figure 7 shows the coverage results for the ATPG sequence and the SPECint benchmarks for both the register rename buffer and the reorder buffer. The figure also shows verifiable coverages using both the original MW tool and the new fMW tool. We can make two key observations. First, the SPECint benchmarks, although almost four orders of magnitude longer (totaling 685,000,000 instructions), achieve much lower coverages than the ATPG test sequence. Second, using fMW, we can verify much higher percentages of the state transitions in the FSM models. Verifiable coverage increases for both the ATPG and SPECint sequences using fMW.
Figure 7. Rename buffer and reorder buffer (ROB) coverage comparison using the original MW and the new fMW. [Bar chart: FSM coverage (%), 0 to 100, for the SPEC and ATPG sequences; series: rename original MW, rename fMW, ROB original MW, ROB fMW.]
Using our ATPG sequence, the original
MW can only achieve a verifiable coverage of
64% of the transitions for the rename buffer,
while the fMW tool achieves 100% coverage.
The register rename buffer includes 12 entries
for renaming general-purpose registers and
eight entries for renaming condition code registers. Each entry is validated for renaming
each possible architectural register. With the
original version of MW, only seven out of 11
transitions of the rename buffer FSM are
trackable during simulation. Therefore, the
maximum coverage that can be verified for
any sequence using the original MW is at best
64%. The four unverified transitions require
the rename entry to be first allocated and
updated but later discarded due to the misprediction on a preceding branch instruction.
The fMW tool can easily simulate the four
discard transitions, and reports 100% verifiable coverage for the ATPG sequence. For the
SPECint benchmarks, the verifiable coverage
of the rename buffer using the original version of MW is 38%. With the fMW tool the
verifiable coverage increases to 55%.
The reorder buffer has 16 entries. The
ATPG sequence using the original version of
MW achieves 57% verifiable coverage, while
using fMW achieves 100% verifiable coverage. For the SPECint benchmarks, the verifiable coverage of the reorder buffer using MW
is 54%. With the fMW tool, the verifiable coverage increases to 88%. The verifiable coverage
of the rename buffer and the reorder buffer
increases for both the ATPG sequence and the
SPECint benchmarks, when using fMW
instead of the original MW. Using fMW, many
more FSM transitions can be verified via simulation, which was not possible with the original MW tool. The results clearly demonstrate
the usefulness and effectiveness of the new
fMW tool for supporting simulation-based
microarchitecture validation.
Our new fMW tool will be the workhorse
for our future research on advanced
microarchitecture techniques. We plan to use
it to accurately evaluate the effectiveness and
validate the correctness of new microarchitecture mechanisms. The fMW will be an
effective tool for supporting value prediction
techniques, trace prediction techniques, multiple-path instruction execution, simultaneous multithreading, and simulation-based
microarchitecture validation. We believe that
a tool like fMW is absolutely essential for
future microarchitecture research.
Acknowledgments
Our research was supported in part by
ONR (N00014-95-1-1112 and N00014-96-1-0347) and by Intel. This work has benefited from the generous donation of a large
number of Pentium II systems from Intel.
References
1. B. Black and J.P. Shen, “Calibration of Microprocessor Performance Models,” Computer, May 1998, pp. 59-65.
2. P. Bose, “Performance Test Case Generation for Microprocessors,” Proc. 16th VLSI
Test Symp., IEEE Computer Soc., Los Alamitos, Calif., Apr. 1998, pp. 54-59.
3. P. Bose and S. Surya, “Architectural Timing
Verification of CMOS RISC Processors,”
IBM J. Research and Development,
Jan./Mar. 1995, pp. 113-129.
4. D. Burger and T. Austin, “The SimpleScalar
Tool Set, Version 2.0,” Tech. Report 1342,
Univ. of Wisconsin-Madison, 1997.
5. M. Reilly and J. Edmondson, “Performance
Simulation of an Alpha Microprocessor,”
Computer, May 1998, pp. 50-58.
6. T.A. Diep and J.P. Shen, “VMW: A Visualization Based Microarchitecture Workbench,” Computer, Dec. 1995, pp. 57-64.
7. A.S. Huang and T.A. Diep, “MW Developer’s Guide,” CMuART Tech. Report 95-1,
ECE Dept., Carnegie Mellon Univ., Pittsburgh, Aug. 1995.
8. M.H. Lipasti, C.B. Wilkerson, and J.P. Shen,
“Value Locality and Load Value Prediction,”
Proc. Seventh Int’l Conf. Architectural Support for Programming Languages and Operating Systems, Oct. 1996, pp. 138-147.
9. M.H. Lipasti and J.P. Shen, “Exceeding the
Dataflow Limit via Value Prediction,” Proc.
MICRO-29, Dec. 1996, pp. 226-237.
10. N. Utamaphethai, R.D. Blanton, and J.P.
Shen, “Validation of Speculative and Out-of-Order Execution Microarchitectures,” Proc.
Microprocessor Test and Verification Workshop (MTV98), Oct. 1998.
11. N. Utamaphethai, R.D. Blanton, and J.P.
Shen, “A Buffer-Oriented Methodology for
Microarchitecture Validation,” J. Electronic
Testing: Theory and Application, Special
Issues on Microprocessor Test and Verification, to appear Fall 1999.
12. A. Cagney, “PSIM User’s Guide,” ftp://cambridge.cygnus.com/pub/psim/index.html,
Aug. 1996.
13. J. Combs, C. Bechem, and J.P. Shen, “Mispredicted Path Cache Effects,” CMuART
Tech. Report, ECE Dept., Carnegie Mellon
Univ., Jan. 1999.
14. D. Lee et al., “Instruction Cache Fetch Policies for Speculative Execution,” Proc. Int’l
Symp. Computer Arch., IEEE CS Press,
1995, pp. 357-367.
15. J. Pierce and T. Mudge, “The Effect of Speculative Execution on Cache Performance,”
Proc. Int’l Parallel Processing Symp., IEEE
CS Press, 1994, pp. 172-179.
16. J. Pierce and T. Mudge, “Wrong Path
Instruction Prefetching,” Tech. Report, Electrical Engineering and Computer Sci. Dept.,
Univ. of Michigan, Ann Arbor, 1994.
Candice Bechem is currently a member of
Motorola’s Engineering Rotation Program
and earlier worked on our project while she
was at Carnegie Mellon University. She
received her MS from the ECE Department
of Carnegie Mellon University. Bechem was
IEEE student branch president at the University of Illinois, Urbana-Champaign, where
she received her BS in computer engineering.
Jonathan Combs is currently a component
design engineer at Intel’s Texas Development
Center in Austin, Texas, working on a future-generation IA-32 processor implementation.
He also worked on this project while at
Carnegie Mellon. Combs received his MS
from the ECE Department of Carnegie Mellon and his BS in computer engineering from
the University of Illinois, Urbana-Champaign.
Noppanunt Utamaphethai is a PhD candidate in the ECE Department at Carnegie Mellon. Currently, he is working on the
ATPG-based validation of microprocessors.
He received his BS from Brown University
and his MS from Carnegie Mellon.
Bryan Black is a PhD candidate in the ECE
Department of Carnegie Mellon. His research
interests span computing systems and tropical
fruits. He spent four years at Motorola as a
design engineer and was a member of the
PowerPC 604 design team before returning
to Carnegie Mellon for his PhD degree. Black
received his BS and MS from CMU.
R.D. Shawn Blanton is an assistant professor
in the ECE Department of Carnegie Mellon
and a member of the Center for Electronic
Design Automation. He has worked on the
design and test of complex digital systems
with General Motors Research Labs, AT&T
Bell Labs, and Intel. Blanton received a BS
from Calvin College, an MS from the University of Arizona, and a PhD in computer
science and engineering from the University
of Michigan, Ann Arbor. He is the recipient
of an NSF Career Award.
John Paul Shen, a professor in Carnegie Mellon’s ECE Department, heads the university’s Microarchitecture Research Team
(CMuART). He spent several years at Hughes and TRW. His current research interest is
high-performance microarchitectures. Shen
received a BS degree from the University of
Michigan and MS and PhD degrees from the
University of Southern California, all in electrical engineering. He is an IEEE fellow and
a member of the IEEE Computer Society.
Direct questions about this article to John
Paul Shen, Electrical and Computer Engineering Department, Carnegie Mellon Univ.,
Schenley Park, Pittsburgh, PA 15213; shen@ece.cmu.edu.