..........................................................................................................................................................................................................................
WORKLOAD REDUCTION AND
GENERATION TECHNIQUES
..........................................................................................................................................................................................................................
BENCHMARKING IS A FUNDAMENTAL ASPECT OF COMPUTER SYSTEM DESIGN.
RECENTLY PROPOSED WORKLOAD REDUCTION AND GENERATION TECHNIQUES INCLUDE
INPUT REDUCTION, SAMPLING, CODE MUTATION, AND BENCHMARK SYNTHESIS. THE
AUTHORS DISCUSS AND COMPARE THESE TECHNIQUES ALONG SEVERAL CRITERIA:
WHETHER THEY YIELD REPRESENTATIVE AND SHORT-RUNNING BENCHMARKS, WHETHER
THEY CAN BE USED FOR BOTH ARCHITECTURE AND COMPILER EXPLORATIONS, AND
WHETHER THEY HIDE PROPRIETARY INFORMATION.
Benchmarking is an integral part
of contemporary research and development
in computer system design. Computer architects and designers use benchmarks to drive
the design of next-generation processors.
Compiler writers and system software developers evaluate their optimizations through
extensive benchmarking. Researchers in architecture, compilers, and system software
use sets of benchmarks to evaluate novel
research ideas.
Because benchmarking is so fundamental,
it must be rigorous. Rigorous benchmarking
must include both experimental design and
data analysis. Experimental design involves
benchmark selection, simulator or hardware
platform selection, selection of a baseline design
point, and so on. Data analysis involves processing performance data after the experiment
is run, and includes computing confidence
intervals and average performance scores.
This article deals with benchmark selection to drive simulation experiments in systems research. We identify four requirements.
First, the benchmarks should be representative of their target domain. A benchmark
that ill-represents a target domain might
lead to a design that yields suboptimal performance when brought to market. Ideally, given
that the (high-performance) processor design
cycle is five to seven years, architects would
anticipate future workload characteristics.
Second, the benchmarks should be short-running so that researchers can obtain performance projections through simulation in
a reasonable amount of time. Processor simulators are extremely slow given the complexity of the contemporary processors they
model. Hence, simulating a large dynamic
instruction count quickly becomes prohibitive because of the large number of processor
design points that must be explored.
The benchmarks must also enable both
microarchitecture and compiler research
and development. Although existing benchmarks satisfy this requirement, this is typically not the case for workload reduction
techniques that reduce the dynamic instruction count to address the simulation challenge. Some workload reduction techniques
preclude the reduced workloads from being
used for compiler research.
Luk Van Ertvelde
Lieven Eeckhout
Ghent University

0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society. IEEE Micro, November/December 2010.
Finally, the benchmarks should not reveal
proprietary information. Industry clearly has
the workloads that users care about; however,
companies are reluctant to release their
codes. For example, a cell phone company
might not be willing to share its next-generation phone software with a processor
vendor for driving the processor architecture
design process. Similarly, a software vendor
might be reluctant to share the code base
with a compiler or virtual machine builder.
For this reason, researchers and developers
typically must rely on open source benchmarks, which might not be truly representative of real-life workloads.
Fulfilling all four criteria is nontrivial, and
to the best of our knowledge, no existing
workload reduction and generation technique addresses them all.
This article describes and compares
several recently proposed workload reduction
and generation techniques: input reduction,
sampling, code mutation, and benchmark
synthesis. We put special emphasis on code
mutation and benchmark synthesis because
these techniques are less well-known, they
fulfill most of the above requirements (unlike
input reduction and sampling), and we have
recently been working on them.
Input reduction

Input reduction aims to reduce a reference input or devise a different input that leads to a shorter-running benchmark compared to a reference input while exhibiting similar program behavior. Although the idea of input reduction is simple, implementing the technique in a faithful way is far from trivial.

Most benchmark suites come with several inputs. SPEC CPU, for example, comes with three inputs. The test input is used to verify whether the benchmark runs properly, and should not be used for performance analysis. The train input is used to guide profile-based optimizations—that is, it is used during profiling, after which the system is optimized. The reference input is used for performance measurements. Researchers and developers might also use train inputs to report performance numbers if simulating a benchmark run with a reference input takes too long. In particular, simulating a benchmark execution with a reference input can take several weeks to run to completion on today's fastest simulators on today's fastest machines. A train input brings the total simulation time down to a couple of hours.

The pitfall with using train inputs, and smaller inputs in general, is that they might not be representative of the reference inputs. For example, a reduced input's working set is typically smaller, hence its cache and memory behavior might stress the memory hierarchy less than the reference input would.

KleinOsowski and Lilja propose MinneSPEC, which collects reduced input sets for some CPU2000 benchmarks.1 These reduced input sets are derived from the reference inputs using several techniques, such as modifying inputs (for example, reducing the number of iterations) and truncating inputs. They propose three reduced inputs: smred for short simulations, mdred for medium-length simulations, and lgred for full-length, reportable simulations. They compare the representativeness of these reduced inputs against the reference inputs using function-level execution profiles, which appear to be accurate for most benchmarks, but not all.2

Sampling

Sampling is a well-known workload reduction technique. Instead of simulating an entire benchmark execution (with a reference input), sampling simulates only a small fraction (called sampling units) and then extrapolates the performance numbers to the entire benchmark execution. Different approaches to sampling exist: you can pick sampling units randomly across the entire program execution,3 periodically,4 or through program analysis.5 The current state of the art in sampled simulation achieves simulation speedups of several orders of magnitude at very high accuracy. For example, SimPoint6 and TurboSmarts7 can simulate the SPEC CPU benchmarks in the order of minutes on average with an error of only a few percent.

Although sampled simulation effectively reduces the dynamic instruction count while retaining representativeness and accuracy, the simulator must be modified to quickly navigate between sampling units and to establish architecture state (register and memory state) and microarchitecture state (cache content, translation look-aside buffers [TLBs], predictors, and so on) at the beginning of the sampling units.
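As a toy illustration of the extrapolation step, the sketch below picks periodic sampling units and extrapolates CPI to the whole run. This is not the authors' infrastructure; the per-instruction cycle costs stand in for what a detailed simulator would produce.

```python
# Sketch of periodic sampling: simulate only short sampling units at a
# fixed period, then extrapolate CPI to the entire execution. The trace
# of per-instruction cycle counts is a stand-in for a detailed simulator.
import statistics

def true_cpi(trace):
    """CPI over the full trace (what full simulation would report)."""
    return sum(trace) / len(trace)

def sampled_cpi(trace, period=1000, unit=100):
    """Average CPI over sampling units of `unit` instructions taken
    every `period` instructions."""
    units = [trace[i:i + unit] for i in range(0, len(trace), period)]
    per_unit = [sum(u) / len(u) for u in units if u]
    return statistics.mean(per_unit)

# Toy "trace": per-instruction cycle counts with a phase change.
trace = [1] * 50_000 + [3] * 50_000
print(true_cpi(trace))     # 2.0
print(sampled_cpi(trace))  # 2.0, from simulating only 10 percent of it
```

With a 10 percent sampling ratio the estimate matches the full-trace CPI here; in practice the sampling units must also be warmed up, which is exactly the architecture- and microarchitecture-state problem discussed above.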
Ringenberg et al. present intrinsic checkpointing, which does not require modifying
the simulator.8 Instead, intrinsic checkpointing rewrites the benchmark’s binary and
stores the checkpoint (architecture state) in
the binary itself. Intrinsic checkpointing provides fix-up checkpointing code, consisting
of store instructions to put the correct data
values in memory and other instructions to
put the correct data values in registers.
The SimPoint group extended their approach to use sampled simulation for instruction-set architecture (ISA) and compiler research and development (in addition to microarchitecture explorations). The original SimPoint
approach focused on finding representative
sampling units based on the basic blocks
being executed.5 Follow-on work considered
alternative program characteristics, such as
loops and method calls, which let the group
identify cross-binary sampling units that
architects and compiler builders can use
when studying ISA extensions and evaluating
compiler and software optimizations.9
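The idea of finding representative sampling units by clustering per-interval profiles can be sketched as follows. The basic-block vectors, the tiny k-means, and all names are illustrative; they are not SimPoint's actual implementation.

```python
# Sketch of SimPoint-style interval selection: summarize each execution
# interval by a basic-block vector (BBV), cluster the vectors, and keep
# one representative interval per cluster.
import math
import random

def kmeans(vectors, k, iters=20, seed=0):
    """A minimal k-means on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[i].append(v)
        centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def pick_simulation_points(bbvs, k=2):
    """Return one interval index per cluster; in sampled simulation the
    cluster sizes would become the extrapolation weights."""
    centroids = kmeans(bbvs, k)
    points = [min(range(len(bbvs)), key=lambda i: math.dist(bbvs[i], c))
              for c in centroids]
    return sorted(set(points))

# Four intervals: two dominated by loop A's blocks, two by loop B's.
bbvs = [(9, 1), (8, 2), (1, 9), (2, 8)]
print(pick_simulation_points(bbvs, k=2))  # one interval from each phase
```

Simulating only the selected intervals and weighting their results by cluster size is what lets this style of sampling cover a program's phase behavior with very few instructions.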
[Figure 1 (flow): a proprietary application and its proprietary input are profiled through binary instrumentation; the resulting execution profile drives analysis and binary rewriting, producing a mutant; the mutant is then distributed to academia and industry vendors for hardware and simulation studies across different microarchitectures.]
Code mutation
Although input reduction and sampling
reduce a workload’s dynamic instruction
count, neither technique hides proprietary
information. Hence, these techniques cannot
be used to share proprietary workloads
among third parties. Code mutation aims to hide the proprietary information in an application so that it can be distributed as a benchmark.10
Code mutation first profiles the execution
of a proprietary application to collect various
workload execution properties in an execution profile, which is then used for binary
rewriting the proprietary application into a
benchmark mutant, as Figure 1 illustrates.
The mutant has two key properties:

- The functional semantics of the proprietary application cannot be revealed, or, at least, are hard to reveal through reverse engineering of the mutant.
- The mutant's performance characteristics resemble those of the proprietary application well, so that the mutant can serve as a proxy for the proprietary application during benchmarking experiments.

Figure 1. Code mutation. A proprietary application is profiled and rewritten into a mutant that can be distributed to third parties.
Code mutation aims to hide a proprietary
program’s functional meaning while preserving its behavioral execution characteristics in
the mutant. We started from the observation
that performance on contemporary superscalar processors is primarily determined by
miss events—that is, branch mispredictions
and cache and TLB misses—and to a lesser
extent by interoperation dependencies and
instruction types11 (interoperation dependencies and instruction execution latencies
are typically hidden by out-of-order instruction scheduling). This observation suggests
that the mutant, to exhibit behavioral characteristics similar to the proprietary workload,
should mimic the branch and memory access
behavior without worrying too much about
interoperation dependencies and instruction
types. To allow the mutant to do so, we determine all operations that affect the program’s branch and/or memory access
behavior. We do this through dynamic
program slicing. We retain the operations
appearing in these slices unchanged in the
mutant, and can overwrite (mutate) all
other operations in the program to hide the
proprietary application’s functional meaning.
Code mutation consists of three major
steps.
Execution profiling collects an interoperation dependency profile that captures the
data dependencies between instructions, a
constant value profile that tracks which
instructions generate or consume constant
values, and a branch profile that records
whether a control flow operation exhibits
constant branching behavior. Execution
profiling is done through dynamic binary
instrumentation using Pin.12
The program analysis step involves program slicing to track down which instructions affect a memory access or a branch.13
To perform program slicing, we use the
interoperation dependency profile, and use
the constant value profile to trim the slices.
We compute program slices for all memory
accesses and/or control flow operations. All
the instructions that are not part of a slice
are marked (marked code is either never executed or produces unused data—that is, it
does not affect a program’s memory access
or branch behavior).
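A minimal sketch of this marking step, assuming the dependence profile is available as a map from each instruction to its producers (the representation and instruction names are invented for illustration):

```python
# Sketch of the slicing step: compute the backward slice of every branch
# and memory access over the interoperation dependency profile, then mark
# all remaining instructions as candidates for mutation.

def backward_slice(roots, deps):
    """deps maps an instruction to the instructions producing its inputs."""
    in_slice, worklist = set(), list(roots)
    while worklist:
        i = worklist.pop()
        if i not in in_slice:
            in_slice.add(i)
            worklist.extend(deps.get(i, ()))
    return in_slice

# i0: x = input;  i1: y = x + 1;  i2: branch on y;  i3: z = x * 7 (unused)
deps = {"i1": ["i0"], "i2": ["i1"], "i3": ["i0"]}
roots = ["i2"]                        # the control flow (and memory) ops
protected = backward_slice(roots, deps)
mutable = {"i0", "i1", "i2", "i3"} - protected
print(sorted(protected))  # ['i0', 'i1', 'i2']
print(sorted(mutable))    # ['i3']
```

Only the instructions outside the slices (here the unused multiply) are safe to overwrite, which is why the fraction of mutable code depends so strongly on how much of a program feeds its branches and memory accesses.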
The third step, binary rewriting, mutates
the marked code. This involves overwriting
the marked instructions with randomly generated code sequences. We also introduce
opaque variables as branch condition flags
(an opaque variable has some property that
is known a priori to the code mutator, but
is difficult for a malicious person to deduce).
For example, conditional branches that jump
based on an opaque condition flag do not
alter the control flow, but complicate the
understanding of the mutant binary.
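A tiny illustration of an opaque condition (a sketch, not the paper's mutator): the predicate below always holds, which the code mutator knows a priori but which is not obvious from the branch itself.

```python
# Sketch of an opaque branch condition: n*n + n = n(n+1) is a product of
# consecutive integers and therefore always even. A branch on it never
# changes control flow, yet it clutters the rewritten binary.
def opaque_is_even(n):
    return (n * n + n) % 2 == 0    # always True, by construction

def mutated_region(n):
    if opaque_is_even(n):          # always taken; control flow unchanged
        return n + 1               # the real computation
    return 0xDEAD                  # dead decoy code, free to be garbage

print(all(mutated_region(n) == n + 1 for n in range(-100, 100)))  # True
```

The never-taken arm can then be filled with randomly generated code, complicating reverse engineering without perturbing the branch behavior the mutant must preserve.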
Our experimental results reveal that code mutation's efficacy is benchmark specific. In addition, our current code mutation framework can mutate 36 percent of the code that is executed at least once on average, and it can break 29 percent of the interoperation data dependencies on average. Further, the mutated binary's performance on real hardware is within 1.4 percent of the original workload's on average (and at most 6 percent). We made an interesting observation
when comparing the results for code mutation based on slices computed for both memory accesses and control flow operations
versus slices for control flow operations
only. We found the difference to be relatively
small, suggesting significant overlap between
the slices of memory accesses and the slices of
control flow operations. Omitting the memory access slices does not make many additional instructions eligible for code mutation. Put another way, by striving
to preserve a program’s control flow behavior, we also preserve most of the memory
access behavior.
Benchmark synthesis
Although code mutation is a promising
technique, it might not hide proprietary information to a satisfactory level. In some
cases, the mutated binaries might still reveal
some critical proprietary information, and
therefore a company or an institution
might be reluctant to distribute mutated
binaries. For this reason, we recently started
working on benchmark synthesis, which generates a synthetic benchmark in a high-level
programming language (HLL) from desired
program characteristics.14 Rather than
mutating an existing benchmark to hide proprietary information as much as possible,
benchmark synthesis generates a synthetic
benchmark starting from several program
characteristics (see the ‘‘History of Benchmark Synthesis’’ sidebar for some background on this technique). Because the
workload is synthetically generated, it hides
proprietary information much better than
code mutation—you could say by construction. In addition, because generating a new
benchmark provides more flexibility than
mutating an existing benchmark, synthetic
benchmark generation also allows for reducing the dynamic instruction count more
easily. The trade-off between code mutation
and benchmark synthesis is thus that synthetic benchmarks might be less accurate and representative with respect to real workloads than mutated binaries; however, the technique hides proprietary information more adequately and yields shorter-running benchmarks.

...............................................................

History of Benchmark Synthesis

Whetstone1 and Dhrystone2 are well-known synthetic benchmarks that were crafted manually in 1972 and 1984, respectively. Manually building benchmarks is both tedious and time-consuming, and because benchmarks are quickly outdated, it is not a scalable approach.

Statistical simulation collects program characteristics from a program execution and generates a synthetic trace, which is then simulated on a statistical processor simulator.3-5 The important advantage of statistical simulation is that the dynamic instruction count of a synthetic trace is very short, typically a few million instructions at most. A synthetic trace hides proprietary information very well; however, it cannot be run on real hardware or an execution-driven simulator (which is current practice, as opposed to trace-driven simulation). Hence, statistical simulation is primarily useful for guiding early-stage design space explorations.

More recent work focuses on automated synthetic benchmark generation, which builds on the statistical simulation approach but generates a synthetic benchmark rather than a synthetic trace.6-8 Our benchmark synthesis approach shares some commonalities with this prior work, but there are important differences as well. For one, our work aims at generating synthetic benchmarks in high-level programming languages (HLLs) such as C so that compiler and architecture developers as well as researchers can use them. Prior work in automated benchmark synthesis generates binaries, limiting their usage to architects—that is, the synthetic benchmarks cannot be used for compiler research and development. In addition, there are some technical differences. For example, whereas prior benchmark synthesis approaches model control flow behavior in a coarse-grained manner, our current work models fine-grained control flow behavior, including (nested) loops and if-then-else structures. In addition, we use pattern recognition rather than statistics and distributions for generating synthetic code sequences.

References

1. H.J. Curnow and B.A. Wichmann, "A Synthetic Benchmark," Computer J., vol. 19, no. 1, 1976, pp. 43-49.
2. R.P. Weicker, "Dhrystone: A Synthetic Systems Programming Benchmark," Comm. ACM, vol. 27, no. 10, Oct. 1984, pp. 1013-1030.
3. L. Eeckhout et al., "Control Flow Modeling in Statistical Simulation for Accurate and Efficient Processor Design Studies," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 04), ACM Press, 2004, pp. 350-361.
4. S. Nussbaum and J.E. Smith, "Modeling Superscalar Processors via Statistical Simulation," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 01), IEEE CS Press, 2001, pp. 15-24.
5. M. Oskin, F.T. Chong, and M. Farrens, "HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Design," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 71-82.
6. R. Bell Jr. and L.K. John, "Improved Automatic Testcase Synthesis for Performance Model Validation," Proc. ACM Int'l Conf. Supercomputing (ICS 05), ACM Press, 2005, pp. 111-120.
7. C. Hughes and T. Li, "Accelerating Multicore Processor Design Space Evaluation Using Automatic Multi-threaded Workload Synthesis," Proc. Int'l Symp. Workload Characterization (IISWC 08), IEEE Press, 2008, pp. 163-172.
8. A.M. Joshi et al., "Distilling the Essence of Proprietary Workloads into Miniature Benchmarks," ACM Trans. Architecture and Code Optimization, vol. 5, no. 2, Aug. 2008, pp. 1-33.

...............................................................

Figure 2 gives a high-level view of our benchmark synthesis framework. We start from a real proprietary application. We compile this workload at a low optimization level (for example, -O0 in the GNU Compiler Collection) to facilitate the pattern recognition and translation step from assembly code to HLL code, as we discuss later. We then run the resulting binary with its proprietary input and profile its execution—that is,
we count how often each function is called,
how many times a loop is iterated, how
often a branch is taken, how often a basic
block is executed, and so on. In addition,
we record memory access patterns for loads
and stores, and we record branch taken and
transition rates. Finally, we use a pattern recognizer that scans the executed code to identify C code statements corresponding to
sequences of instructions observed at the
binary level. This pattern recognizer translates the binary code to C code. We perform
the translation in a semirandom fashion to
obfuscate proprietary information.
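The counting part of such a profiling pass can be sketched as follows. The trace format and field choices are invented for illustration; the real framework instruments binaries with Pin.

```python
# Sketch of the profiling step: walk an execution trace and record, per
# basic block, its execution count and its branch-taken rate.
from collections import Counter, defaultdict

def build_profile(trace):
    """trace is a sequence of (basic block id, was its branch taken?)."""
    exec_count = Counter()
    taken = defaultdict(lambda: [0, 0])          # block -> [taken, total]
    for block, branch_taken in trace:
        exec_count[block] += 1
        taken[block][1] += 1
        if branch_taken:
            taken[block][0] += 1
    return {b: (exec_count[b], taken[b][0] / taken[b][1])
            for b in exec_count}

trace = [("B0", True)] * 9 + [("B0", False)] + [("B1", True)] * 5
profile = build_profile(trace)
print(profile["B0"])  # (10, 0.9): executed 10 times, branch taken 90%
print(profile["B1"])  # (5, 1.0)
```

The real workload profile also records loop trip counts, function call counts, and memory access patterns, but the principle is the same: per-structure execution statistics that the generator can later replay.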
[Figure 2 (flow): the source code of the proprietary workload is compiled at a low optimization level; the resulting binary is profiled with the proprietary input to produce a workload profile; benchmark synthesis then generates a synthetic benchmark in an HLL (for example, C), which can be distributed to academia and industry vendors for hardware and simulation studies across different ISAs, microarchitectures, compilers, and optimizations.]

Figure 2. Benchmark synthesis. Program execution characteristics are extracted for a proprietary application from which a synthetic benchmark is generated.

All of the characteristics that we collect are combined in a workload profile, which captures the original workload's execution behavior and input. We then generate a synthetic benchmark from this workload profile using an HLL, in our case C. We generate sequences of C code statements (basic blocks), as well as if-then-else statements, loops, and function calls, and we add interstatement dependencies and data memory access patterns. The C code structures are generated pro rata their occurrence in the original workload execution. However, we force the synthetic benchmark to execute
fewer instructions than the original workload, by construction. We do this by reducing the execution frequencies of basic blocks,
loops, and function calls by a given reduction
factor. The end result is a synthetic benchmark that executes fewer instructions than the original workload while remaining representative of it.
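A minimal sketch of this frequency-scaled generation, with an invented profile format and reduction factor (the emitted C is a skeleton, not what the actual generator produces):

```python
# Sketch of the synthesis step: emit C code whose loops iterate pro rata
# the profiled trip counts, scaled down by a reduction factor so that the
# synthetic benchmark executes fewer instructions.
def synthesize(profile, reduction=10):
    lines = ["#include <stdio.h>", "", "volatile long sink;", "",
             "int main(void) {"]
    for name, trips in profile.items():
        n = max(1, trips // reduction)           # scaled execution count
        lines += [f"    for (long i = 0; i < {n}; i++) {{  /* {name} */",
                  "        sink += i;",
                  "    }"]
    lines += ["    printf(\"%ld\\n\", sink);", "    return 0;", "}"]
    return "\n".join(lines)

# Invented profile: two loops and their observed trip counts.
profile = {"loop_A": 1_000_000, "loop_B": 40_000}
print(synthesize(profile))
```

Because the output is ordinary C source, the same scaled skeleton can be recompiled with different compilers, optimization levels, and ISAs, which is precisely what binary-level synthesis cannot offer.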
The synthetic benchmark does not expose
proprietary information (because of the
semirandom binary-to-source code translator
and the workload reduction). We verified
that this is the case using two software plagiarism detection tools. We can thus distribute the synthetic benchmarks among third parties. Because the synthetic benchmarks are
generated in an HLL, they let us explore
the architecture and compiler space, and
compare systems with different compilers
and optimization levels, as well as different
ISAs, microarchitectures, and implementations. The synthetic benchmarks can run
on execution-driven simulators as well as
on real hardware. We report an average performance difference of 7.4 percent between
the synthetic clone and the original workload
across a set of compiler optimization levels
and hardware platforms.
Comparison
Table 1 compares the workload reduction
and generation techniques in terms of several
dimensions. It is immediately apparent from
this table that there is no clear winner. The
different techniques represent different
trade-offs, which makes discussing the differences in more detail interesting and naturally
leads to different use cases for each technique.
Table 1. Comparison of workload reduction and generation techniques.
(Columns: input reduction / sampling / code mutation / benchmark synthesis at binary level / benchmark synthesis at HLL level.)

  Reduces simulation time:                  Yes / Yes / No / Yes / Yes
  Requires simulator modifications:         No / Yes / No / No / No
  Can be used for microarchitecture
    exploration:                            Yes / Yes / Yes / Yes / Yes
  Can be used for compiler and ISA
    exploration:                            Yes / Partially / No / No / Yes
  Hides proprietary information:            No / No / Partially / Yes / Yes
  Can model emerging workloads:             No / No / No / Yes / Yes
  Accuracy with regard to reference
    workload:                               Medium to poor / High / Medium to high / Medium / Medium

Simulation time reduction

All techniques except for code mutation aim at reducing the dynamic instruction count to reduce simulation time. As mentioned earlier, simulation time reductions of several orders of magnitude have been reported for sampled simulation and benchmark synthesis. This reduction is important not only for architecture research and development, but also in the compiler space. For example, iterative compilation evaluates numerous compiler optimizations to find the optimum compiler optimizations for a given program.15,16 A reduced workload
that executes faster will also reduce the overall compiler space exploration time.
Architecture versus compiler exploration
All techniques can be used to drive microarchitecture research and development, but
only a few can be used for compiler and
ISA exploration. The reason is that techniques such as sampling, code mutation,
and benchmark synthesis at the binary level
operate on binaries and not on source
code, eliminating their utility for compiler
and ISA exploration. On the other hand,
sampling that identifies representative loops
and function calls can be used to drive compiler research, as can benchmark synthesis at
the HLL level.
Hide proprietary information
Only benchmark synthesis can hide proprietary information, although code mutation partially succeeds in hiding this
information. An important application for
these techniques is to generate synthetic
clones for real-life proprietary workloads.
Such an application would allow companies
to share code. It would also let them share
their workloads with their academic research
partners without revealing proprietary
information.
Model emerging workloads
Benchmark synthesis can also be used to
generate emerging and future workloads. In
particular, researchers and developers can generate a workload profile with anticipated future
performance characteristics. For example, they
could generate a synthetic workload with large
working sets, random memory access patterns,
or complex control flow behavior. They can
then use the synthetic benchmarks generated
from these profiles to explore design alternatives for future computer systems.
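For instance, a synthetic kernel with a tunable working set can be sketched as below. This is a hand-written illustration of the idea, not output of our framework, and the parameters are illustrative.

```python
# Sketch of steering a synthetic workload toward anticipated behavior:
# a random pointer chase whose working-set size is a knob, so a larger
# value stresses deeper levels of the memory hierarchy.
import random

def make_pointer_chase(working_set_entries, seed=42):
    """Return a single-cycle random permutation to be chased entry by entry."""
    rng = random.Random(seed)
    order = list(range(working_set_entries))
    rng.shuffle(order)
    chase = [0] * working_set_entries
    for a, b in zip(order, order[1:] + order[:1]):
        chase[a] = b                 # each entry points to the next one
    return chase

def run_chase(chase, steps):
    p = 0
    for _ in range(steps):
        p = chase[p]                 # dependent loads defeat prefetching
    return p

chase = make_pointer_chase(1 << 16)  # ~64K entries: larger than an L1 cache
run_chase(chase, 100_000)
```

Dialing the working-set size up or down sweeps the synthetic workload across cache levels, which is the kind of knob a profile for an anticipated future workload would expose.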
Accuracy
Last but not least, whether the reduced
workload is representative of the original reference workload is obviously of primary importance. Although it is hard to compare the
various workload reduction techniques without doing an apples-to-apples comparison
(which would require a rigorous comparison
using the same set of benchmarks and simulation infrastructure), we can make a qualitative statement based on published results and
our experience in this area. Sampling is likely
the most accurate approach, followed by
code mutation. Benchmark synthesis has
shown medium accuracy. Reduced inputs
have shown good accuracy for some benchmarks but poor accuracy for others.
We believe there is ample room for future work in workload reduction and generation, especially in terms of extending the existing techniques from single-core targets toward multicore processors. Contemporary computer systems
feature multicore processors, which obviously has repercussions on benchmarking for both hardware and software. Recent work in workload reduction and generation has focused almost exclusively on single-threaded workloads, except for a few studies in sampling17 and benchmark synthesis (see the sidebar). However, given the trend toward multicore processors, we urgently need to develop workload reduction and generation techniques for multithreaded workloads. We hope this article will stimulate future work in this area.
Acknowledgments

We thank the anonymous reviewers for their thoughtful comments and suggestions. This work is supported in part by the Research Foundation—Flanders (FWO) projects G.0232.06, G.0255.08, and G.0179.10, and the UGent-BOF projects 01J14407 and 01Z04109.

References

1. A.J. KleinOsowski and D.J. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation-based Computer Architecture Research," Computer Architecture Letters, vol. 1, no. 2, June 2002, pp. 10-13.
2. L. Eeckhout, H. Vandierendonck, and K. De Bosschere, "Designing Workloads for Computer Architecture Research," Computer, vol. 36, no. 2, Feb. 2003, pp. 65-71.
3. T.M. Conte, M.A. Hirsch, and K.N. Menezes, "Reducing State Loss for Effective Trace Sampling of Superscalar Processors," Proc. Int'l Conf. Computer Design (ICCD 96), IEEE CS Press, 1996, pp. 468-477.
4. R.E. Wunderlich et al., "Smarts: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 03), ACM Press, 2003, pp. 84-95.
5. T. Sherwood et al., "Automatically Characterizing Large Scale Program Behavior," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 45-57.
6. M. Van Biesbrouck, B. Calder, and L. Eeckhout, "Efficient Sampling Startup for SimPoint," IEEE Micro, vol. 26, no. 4, July 2006, pp. 32-42.
7. T.F. Wenisch et al., "Simulation Sampling with Live-points," Proc. Ann. Int'l Symp. Performance Analysis of Systems and Software (ISPASS 06), IEEE Press, 2006, pp. 2-12.
8. J. Ringenberg et al., "Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time through Binary Modification," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS 05), IEEE Press, 2005, pp. 78-88.
9. E. Perelman et al., "Cross Binary Simulation Points," Proc. Ann. Int'l Symp. Performance Analysis of Systems and Software (ISPASS 07), IEEE Press, 2007, pp. 179-189.
10. L. Van Ertvelde and L. Eeckhout, "Dispersing Proprietary Applications as Benchmarks through Code Mutation," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 08), ACM Press, 2008, pp. 201-210.
11. T. Karkhanis and J.E. Smith, "A First-order Superscalar Processor Model," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 04), ACM Press, 2004, pp. 338-349.
12. C.-K. Luk et al., "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," Proc. ACM SIGPLAN Conf. Programming Languages Design and Implementation (PLDI 05), ACM Press, 2005, pp. 190-200.
13. M. Weiser, "Program Slicing," IEEE Trans. Software Eng., vol. 10, no. 4, July 1984, pp. 352-357.
14. L. Van Ertvelde and L. Eeckhout, "Benchmark Synthesis for Architecture and Compiler Exploration," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 10), IEEE Press, 2010, to appear.
15. K.D. Cooper, P.J. Schielke, and D. Subramanian, "Optimizing for Reduced Code Space Using Genetic Algorithms," Proc. SIGPLAN/SIGBED Conf. Languages, Compilers, and Tools for Embedded Systems (LCTES 99), ACM Press, 1999, pp. 1-9.
16. P. Kulkarni et al., "Fast Searches for Effective Optimization Phase Sequences," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 04), ACM Press, 2004, pp. 171-182.
17. T.F. Wenisch et al., "SimFlex: Statistical Sampling of Computer System Simulation," IEEE Micro, vol. 26, no. 4, July 2006, pp. 18-31.
Luk Van Ertvelde is a PhD student in the
Electronics and Information Systems Department at Ghent University, Belgium. His
research interests include computer architecture in general, and workload characterization in particular. Van Ertvelde has an MS in
computer science from Ghent University.
Lieven Eeckhout is an associate professor in
the Electronics and Information Systems
Department at Ghent University, Belgium.
His research interests include computer
architecture and the hardware/software interface in general, with a focus on performance
analysis, evaluation and modeling, and
workload characterization. Eeckhout has a
PhD in computer science and engineering
from Ghent University. He is a member of
IEEE and the ACM.
Direct questions and comments about
this article to Lieven Eeckhout, ELIS—
Ghent University, Sint-Pietersnieuwstraat
41, B-9000 Gent, Belgium; leeckhou@elis.ugent.be.