Workload Reduction and Generation Techniques

Benchmarking is a fundamental aspect of computer system design. Recently proposed workload reduction and generation techniques include input reduction, sampling, code mutation, and benchmark synthesis. The authors discuss and compare these techniques along several criteria: whether they yield representative and short-running benchmarks, whether they can be used for both architecture and compiler explorations, and whether they hide proprietary information.

Benchmarking is an integral part of contemporary research and development in computer system design. Computer architects and designers use benchmarks to drive the design of next-generation processors. Compiler writers and system software developers evaluate their optimizations through extensive benchmarking. Researchers in architecture, compilers, and system software use sets of benchmarks to evaluate novel research ideas.

Because benchmarking is so fundamental, it must be rigorous. Rigorous benchmarking must include both experimental design and data analysis. Experimental design involves benchmark selection, simulator or hardware platform selection, selection of a baseline design point, and so on. Data analysis involves processing performance data after the experiment is run, and includes computing confidence intervals and average performance scores.

This article deals with benchmark selection to drive simulation experiments in systems research. We identify four requirements.
First, the benchmarks should be representative of their target domain. A benchmark that ill-represents a target domain might lead to a design that yields suboptimal performance when brought to market. Ideally, given that the (high-performance) processor design cycle is five to seven years, architects would anticipate future workload characteristics.

Second, the benchmarks should be short-running, so that researchers can obtain performance projections through simulation in a reasonable amount of time. Processor simulators are extremely slow given the complexity of the contemporary processors they model. Hence, simulating a large dynamic instruction count quickly becomes prohibitive because of the large number of processor design points that must be explored.

Third, the benchmarks must enable both microarchitecture and compiler research and development. Although existing benchmarks satisfy this requirement, this is typically not the case for workload reduction techniques that reduce the dynamic instruction count to address the simulation challenge. Some workload reduction techniques preclude the reduced workloads from being used for compiler research.

Luk Van Ertvelde
Lieven Eeckhout
Ghent University

0272-1732/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.

Finally, the benchmarks should not reveal proprietary information. Industry clearly has the workloads that users care about; however, companies are reluctant to release their codes. For example, a cell phone company might not be willing to share its next-generation phone software with a processor vendor for driving the processor architecture design process.
Similarly, a software vendor might be reluctant to share its code base with a compiler or virtual machine builder. For this reason, researchers and developers typically must rely on open source benchmarks, which might not be truly representative of real-life workloads.

Fulfilling all four criteria is nontrivial, and to the best of our knowledge, no existing workload reduction and generation technique addresses them all. This article describes and compares several recently proposed workload reduction and generation techniques: input reduction, sampling, code mutation, and benchmark synthesis. We put special emphasis on code mutation and benchmark synthesis because these techniques are less well-known, they fulfill most of the above requirements (unlike input reduction and sampling), and we have recently been working on them.

Input reduction

Input reduction aims to reduce a reference input, or devise a different input, that leads to a shorter-running benchmark than the reference input while exhibiting similar program behavior. Although the idea of input reduction is simple, implementing the technique faithfully is far from trivial.

Most benchmark suites come with several inputs. SPEC CPU, for example, comes with three inputs. The test input is used to verify whether the benchmark runs properly, and should not be used for performance analysis. The train input is used to guide profile-based optimizations—that is, it is used during profiling, after which the system is optimized. The reference input is used for performance measurements.

Researchers and developers might also use train inputs to report performance numbers if simulating a benchmark run with a reference input takes too long. In particular, simulating a benchmark execution with a reference input can take several weeks to run to completion on today's fastest simulators on today's fastest machines. A train input brings the total simulation time down to a couple of hours. The pitfall with using train inputs, and smaller inputs in general, is that they might not be representative of the reference inputs. For example, a reduced input's working set is typically smaller; hence, its cache and memory behavior might stress the memory hierarchy less than the reference input would.

KleinOsowski and Lilja propose MinneSPEC, which collects reduced input sets for some CPU2000 benchmarks.1 These reduced input sets are derived from the reference inputs using several techniques, such as modifying inputs (for example, reducing the number of iterations) and truncating inputs. They propose three reduced inputs: smred for short simulations, mdred for medium-length simulations, and lgred for full-length, reportable simulations. They compare the representativeness of these reduced inputs against the reference inputs using function-level execution profiles, which appear to be accurate for most benchmarks, but not all.2

Sampling

Sampling is a well-known workload reduction technique. Instead of simulating an entire benchmark execution (with a reference input), sampling simulates only a small fraction (called sampling units) and then extrapolates the performance numbers to the entire benchmark execution. Different approaches to sampling exist: you can pick sampling units randomly across the entire program execution,3 periodically,4 or through program analysis.5 The current state of the art in sampled simulation achieves simulation speedups of several orders of magnitude at very high accuracy. For example, SimPoint6 and TurboSMARTS7 can simulate the SPEC CPU benchmarks in the order of minutes on average with an error of only a few percent.

Although sampled simulation effectively reduces the dynamic instruction count while retaining representativeness and accuracy, the simulator must be modified to quickly navigate between sampling units and to establish architecture state (register and memory state) and microarchitecture state (cache content, translation look-aside buffers [TLBs], predictors, and so on) at the beginning of the sampling units. Ringenberg et al. present intrinsic checkpointing, which does not require modifying the simulator.8 Instead, intrinsic checkpointing rewrites the benchmark's binary and stores the checkpoint (architecture state) in the binary itself. Intrinsic checkpointing provides fix-up checkpointing code, consisting of store instructions to put the correct data values in memory and other instructions to put the correct data values in registers.

The SimPoint group extended their approach to use sampled simulation for instruction-set architecture (ISA) and compiler research and development (next to microarchitecture explorations). The original SimPoint approach focused on finding representative sampling units based on the basic blocks being executed.5 Follow-on work considered alternative program characteristics, such as loops and method calls, which let the group identify cross-binary sampling units that architects and compiler builders can use when studying ISA extensions and evaluating compiler and software optimizations.9

Code mutation

Although input reduction and sampling reduce a workload's dynamic instruction count, neither technique hides proprietary information. Hence, these techniques cannot be used to share proprietary workloads among third parties.
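Before turning to code mutation, the extrapolation step at the heart of sampled simulation can be made concrete with a short sketch. This is a hypothetical illustration, not part of any of the tools above; the function name, CPI values, and phase weights are invented:

```python
# Hypothetical sketch of sampled-simulation extrapolation: whole-program CPI
# is estimated as a weighted average over a few detailed-simulated sampling
# units, each weighted by the fraction of the dynamic instruction stream
# that its program phase represents.

def estimate_cpi(unit_cpis, unit_weights):
    """Extrapolate whole-program CPI from a handful of sampling units."""
    assert abs(sum(unit_weights) - 1.0) < 1e-9, "weights must cover the full run"
    return sum(cpi * w for cpi, w in zip(unit_cpis, unit_weights))

# Three representative units stand in for billions of instructions:
# 0.5*0.9 + 0.3*1.4 + 0.2*2.1 = 1.29
print(estimate_cpi([0.9, 1.4, 2.1], [0.5, 0.3, 0.2]))
```

SimPoint-style sampling works this way in spirit: each unit that is simulated in detail stands in for the program phase it represents, weighted by that phase's share of the dynamic instruction count.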
Code mutation aims at hiding proprietary information from benchmarks to facilitate workload sharing.10 Code mutation first profiles the execution of a proprietary application to collect various workload execution properties in an execution profile, which is then used to rewrite the proprietary application's binary into a benchmark mutant, as Figure 1 illustrates. The mutant has two key properties:

- The functional semantics of the proprietary application cannot be revealed, or at least are hard to reveal, through reverse engineering of the mutant.
- The mutant's performance characteristics resemble those of the proprietary application well, so that the mutant can serve as a proxy for the proprietary application during benchmarking experiments.

Figure 1. Code mutation. A proprietary application is profiled and rewritten into a mutant that can be distributed to third parties.

Code mutation thus aims to hide a proprietary program's functional meaning while preserving its behavioral execution characteristics in the mutant. We started from the observation that performance on contemporary superscalar processors is primarily determined by miss events—that is, branch mispredictions and cache and TLB misses—and to a lesser extent by interoperation dependencies and instruction types11 (interoperation dependencies and instruction execution latencies are typically hidden by out-of-order instruction scheduling). This observation suggests
that the mutant, to exhibit behavioral characteristics similar to the proprietary workload, should mimic the branch and memory access behavior without worrying too much about interoperation dependencies and instruction types. To allow the mutant to do so, we determine all operations that affect the program's branch and/or memory access behavior. We do this through dynamic program slicing. We retain the operations appearing in these slices unchanged in the mutant, and can overwrite (mutate) all other operations in the program to hide the proprietary application's functional meaning.

Code mutation consists of three major steps. First, execution profiling collects an interoperation dependency profile that captures the data dependencies between instructions, a constant value profile that tracks which instructions generate or consume constant values, and a branch profile that records whether a control flow operation exhibits constant branching behavior. Execution profiling is done through dynamic binary instrumentation using Pin.12

Second, the program analysis step involves program slicing to track down which instructions affect a memory access or a branch.13 To perform program slicing, we use the interoperation dependency profile, and we use the constant value profile to trim the slices. We compute program slices for all memory accesses and/or control flow operations. All the instructions that are not part of a slice are marked (marked code is either never executed or produces unused data—that is, it does not affect the program's memory access or branch behavior).

The third step, binary rewriting, mutates the marked code. This involves overwriting the marked instructions with randomly generated code sequences. We also introduce opaque variables as branch condition flags (an opaque variable has some property that is known a priori to the code mutator, but is difficult for a malicious person to deduce).
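As a sketch of what such an opaque construct can look like, consider the following toy example (ours, not taken from the code mutation framework; it uses the number-theoretic fact that a square is never congruent to 2 modulo 4):

```python
# Toy opaque predicate (our example, not the framework's actual output).
# For any integer x, x*x mod 4 is 0 (x even) or 1 (x odd), never 2, so the
# branch below is always false -- but that is not obvious from local
# inspection of the mutant.

def mutated_kernel(data, x):
    if (x * x) % 4 == 2:                    # opaquely false: never taken
        data = [d ^ 0xDEAD for d in data]   # decoy code, never executed
    return sum(data)                        # real computation preserved

print(mutated_kernel([1, 2, 3], x=7))  # 6, for any choice of x
```

A conditional guarded by such a predicate never fires at run time, so the decoy code it protects can be arbitrary, yet a reader of the mutant cannot dismiss that code without proving the predicate.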
For example, conditional branches that jump based on an opaque condition flag do not alter the control flow, but they complicate the understanding of the mutant binary.

Our experimental results reveal that code mutation's efficacy is benchmark specific. In addition, our current code mutation framework can mutate 36 percent of the code that is executed at least once on average, and it can break 29 percent of the interoperation data dependencies on average. Further, the mutated binary's performance on real hardware is within 1.4 percent of the original workload's on average (and at most 6 percent off).

We made an interesting observation when comparing the results for code mutation based on slices computed for both memory accesses and control flow operations versus slices for control flow operations only. We found the difference to be relatively small, suggesting significant overlap between the slices of memory accesses and the slices of control flow operations. Not computing memory access slices does not reveal many additional instructions eligible for code mutation. Put another way, by striving to preserve a program's control flow behavior, we also preserve most of the memory access behavior.

Benchmark synthesis

Although code mutation is a promising technique, it might not hide proprietary information to a satisfactory level. In some cases, the mutated binaries might still reveal some critical proprietary information, and therefore a company or an institution might be reluctant to distribute mutated binaries.
For this reason, we recently started working on benchmark synthesis, which generates a synthetic benchmark in a high-level programming language (HLL) from desired program characteristics.14 Rather than mutating an existing benchmark to hide proprietary information as much as possible, benchmark synthesis generates a synthetic benchmark starting from several program characteristics (see the "History of Benchmark Synthesis" sidebar for some background on this technique). Because the workload is synthetically generated, it hides proprietary information much better than code mutation—you could say by construction. In addition, because generating a new benchmark provides more flexibility than mutating an existing benchmark, synthetic benchmark generation also allows for reducing the dynamic instruction count more easily. The trade-off between code mutation and benchmark synthesis is thus that synthetic benchmarks might be less accurate and representative with respect to real workloads than mutated binaries; however, the technique hides proprietary information more adequately and yields shorter-running benchmarks.

History of Benchmark Synthesis

Whetstone1 and Dhrystone2 are well-known synthetic benchmarks that were crafted manually in 1972 and 1984, respectively. Manually building benchmarks is both tedious and time-consuming, and because benchmarks are quickly outdated, it is not a scalable approach.

Statistical simulation collects program characteristics from a program execution and generates a synthetic trace, which is then simulated on a statistical processor simulator.3-5 The important advantage of statistical simulation is that the dynamic instruction count of a synthetic trace is very short, typically a few million instructions at most. A synthetic trace hides proprietary information very well; however, it cannot be run on real hardware or an execution-driven simulator (which is current practice, as opposed to trace-driven simulation). Hence, statistical simulation is primarily useful for guiding early-stage design space explorations.

More recent work focuses on automated synthetic benchmark generation, which builds on the statistical simulation approach but generates a synthetic benchmark rather than a synthetic trace.6-8 Our benchmark synthesis approach shares some commonalities with this prior work, but there are important differences as well. For one, our work aims at generating synthetic benchmarks in high-level programming languages (HLLs) such as C so that compiler and architecture developers as well as researchers can use them. Prior work in automated benchmark synthesis generates binaries, limiting their usage to architects—that is, the synthetic benchmarks cannot be used for compiler research and development. In addition, there are some technical differences. For example, whereas prior benchmark synthesis approaches model control flow behavior in a coarse-grained manner, our current work models fine-grained control flow behavior, including (nested) loops and if-then-else structures. In addition, we use pattern recognition rather than statistics and distributions for generating synthetic code sequences.

References
1. H.J. Curnow and B.A. Wichmann, "A Synthetic Benchmark," Computer J., vol. 19, no. 1, 1976, pp. 43-49.
2. R.P. Weicker, "Dhrystone: A Synthetic Systems Programming Benchmark," Comm. ACM, vol. 27, no. 10, Oct. 1984, pp. 1013-1030.
3. L. Eeckhout et al., "Control Flow Modeling in Statistical Simulation for Accurate and Efficient Processor Design Studies," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 04), ACM Press, 2004, pp. 350-361.
4. S. Nussbaum and J.E. Smith, "Modeling Superscalar Processors via Statistical Simulation," Proc. Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 01), IEEE CS Press, 2001, pp. 15-24.
5. M. Oskin, F.T. Chong, and M. Farrens, "HLS: Combining Statistical and Symbolic Simulation to Guide Microprocessor Design," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 00), ACM Press, 2000, pp. 71-82.
6. R. Bell Jr. and L.K. John, "Improved Automatic Testcase Synthesis for Performance Model Validation," Proc. ACM Int'l Conf. Supercomputing (ICS 05), ACM Press, 2005, pp. 111-120.
7. C. Hughes and T. Li, "Accelerating Multicore Processor Design Space Evaluation Using Automatic Multithreaded Workload Synthesis," Proc. Int'l Symp. Workload Characterization (IISWC 08), IEEE Press, 2008, pp. 163-172.
8. A.M. Joshi et al., "Distilling the Essence of Proprietary Workloads into Miniature Benchmarks," ACM Trans. Architecture and Code Optimization, vol. 5, no. 2, Aug. 2008, pp. 1-33.

Figure 2 gives a high-level view of our benchmark synthesis framework. We start from a real proprietary application. We compile this workload at a low optimization level (for example, O0 in the GNU Compiler Collection) to facilitate the pattern recognition and translation step from assembly code to HLL code, as we discuss later. We then run the resulting binary with its proprietary input and profile its execution—that is, we count how often each function is called, how many times a loop is iterated, how often a branch is taken, how often a basic block is executed, and so on. In addition, we record memory access patterns for loads and stores, and we record branch taken and transition rates. Finally, we use a pattern recognizer that scans the executed code to identify C code statements corresponding to sequences of instructions observed at the binary level. This pattern recognizer translates the binary code to C code. We perform the translation in a semirandom fashion to obfuscate proprietary information.
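The following toy sketch suggests how such a pattern recognizer might work; the pattern table, the instruction idioms, and the semirandom template choice are all our own simplified assumptions, not the authors' actual implementation:

```python
import random

# Toy pattern recognizer (our simplified assumption, not the authors' tool):
# short instruction idioms observed in the -O0 binary are matched against a
# table of C statement templates, and one behavior-equivalent template is
# chosen semirandomly to obfuscate the original source.

PATTERNS = {
    # idiom (opcode sequence): behavior-equivalent C templates
    ("load", "add", "store"): ["{dst} = {src} + {imm};", "{dst} = {imm} + {src};"],
    ("load", "cmp", "jlt"):   ["if ({src} < {imm}) goto {label};"],
}

def translate(opcodes, env, rng=random.Random(42)):  # seeded for reproducibility
    templates = PATTERNS.get(tuple(opcodes))
    if templates is None:
        return None  # unmatched idiom; a real tool would fall back per instruction
    return rng.choice(templates).format(**env)

print(translate(["load", "add", "store"],
                {"dst": "a[i]", "src": "b[i]", "imm": "3"}))
# e.g. "a[i] = b[i] + 3;" or "a[i] = 3 + b[i];"
```

Because several templates compute the same result, the emitted C source need not resemble the proprietary source even where the performance behavior matches.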
All of the characteristics that we collect are combined in a workload profile, which captures the original workload's execution behavior and input.

Figure 2. Benchmark synthesis. Program execution characteristics are extracted from a proprietary application, from which a synthetic benchmark is generated. The synthetic benchmark, in an HLL such as C, can be distributed to academia and industry vendors and used across different ISAs, microarchitectures, and compilers and optimizations, on hardware and in simulation.

We then generate a synthetic benchmark from this workload profile using an HLL, in our case C. We generate sequences of C code statements (basic blocks), as well as if-then-else statements, loops, and function calls, and we add interstatement dependencies and data memory access patterns. The C code structures are generated pro rata their occurrence in the original workload execution. However, we force the synthetic benchmark to execute fewer instructions than the original workload, by construction. We do this by reducing the execution frequencies of basic blocks, loops, and function calls by a given reduction factor. The end result is a synthetic benchmark that executes fewer instructions than the original workload while remaining representative of the original workload.
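The reduction-factor idea can be illustrated with a small sketch (hypothetical; the emitter function and its parameters are invented for illustration):

```python
# Hypothetical sketch of the workload-reduction step in benchmark synthesis:
# when the generator emits a loop into the synthetic C benchmark, it scales
# the loop's profiled iteration count down by a reduction factor, shrinking
# the dynamic instruction count while keeping the mix of code structures.

def emit_loop(body_stmts, profiled_iters, reduction_factor):
    iters = max(1, profiled_iters // reduction_factor)  # keep at least one trip
    lines = ["for (i = 0; i < %d; i++) {" % iters]
    lines += ["    " + stmt for stmt in body_stmts]
    lines.append("}")
    return "\n".join(lines)

print(emit_loop(["a[i] = a[i] + b[i];"], profiled_iters=1000000,
                reduction_factor=100))
```

Applying the same factor to basic blocks, loops, and function calls preserves their relative execution frequencies, which is what keeps the shortened benchmark representative.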
The synthetic benchmark does not expose proprietary information (because of the semirandom binary-to-source code translator and the workload reduction). We verified that this is the case using two software plagiarism detection tools. We can thus distribute the synthetic benchmarks among third parties. Because the synthetic benchmarks are generated in an HLL, they let us explore the architecture and compiler space, and compare systems with different compilers and optimization levels, as well as different ISAs, microarchitectures, and implementations. The synthetic benchmarks can run on execution-driven simulators as well as on real hardware. We report an average performance difference of 7.4 percent between the synthetic clone and the original workload across a set of compiler optimization levels and hardware platforms.

Comparison

Table 1 compares the workload reduction and generation techniques in terms of several dimensions. It is immediately apparent from this table that there is no clear winner. The different techniques represent different trade-offs, which makes discussing the differences in more detail interesting and naturally leads to different use cases for each technique.

Table 1. Comparison of workload reduction and generation techniques.

Feature                                         Input reduction  Sampling   Code mutation   Benchmark synthesis  Benchmark synthesis
                                                                                            at binary level      at HLL level
Reduces simulation time                         Yes              Yes        No              Yes                  Yes
Requires simulator modifications                No               Yes        No              No                   No
Can be used for microarchitecture exploration   Yes              Yes        Yes             Yes                  Yes
Can be used for compiler and ISA exploration    Yes              Partially  No              No                   Yes
Hides proprietary information                   No               No         Partially       Yes                  Yes
Can model emerging workloads                    No               No         No              Yes                  Yes
Accuracy with regard to reference workload      Medium to poor   High       Medium to high  Medium               Medium

Simulation time reduction

All techniques except for code mutation aim at reducing the dynamic instruction count to reduce simulation time. As mentioned earlier, simulation time reductions of several orders of magnitude have been reported for sampled simulation and benchmark synthesis. This reduction is important not only for architecture research and development, but also in the compiler space. For example, iterative compilation evaluates numerous compiler optimizations to find the optimum compiler optimizations for a given program.15,16 A reduced workload that executes faster will also reduce the overall compiler space exploration time.

Architecture versus compiler exploration

All techniques can be used to drive microarchitecture research and development, but only a few can be used for compiler and ISA exploration. The reason is that techniques such as sampling, code mutation, and benchmark synthesis at the binary level operate on binaries and not on source code, eliminating their utility for compiler and ISA exploration. On the other hand, sampling that identifies representative loops and function calls can be used to drive compiler research, as can benchmark synthesis at the HLL level.

Hide proprietary information

Only benchmark synthesis can hide proprietary information, although code mutation partially succeeds in hiding this information. An important application for these techniques is to generate synthetic clones for real-life proprietary workloads. Such an application would allow companies to share code. It would also let them share their workloads with their academic research partners without revealing proprietary information.

Model emerging workloads

Benchmark synthesis can also be used to generate emerging and future workloads. In particular, researchers and developers can generate a workload profile with anticipated future performance characteristics.
For example, they could generate a synthetic workload with large working sets, random memory access patterns, or complex control flow behavior. They can then use the synthetic benchmarks generated from these profiles to explore design alternatives for future computer systems.

Accuracy

Last but not least, whether the reduced workload is representative of the original reference workload is obviously of primary importance. Although it is hard to compare the various workload reduction techniques without an apples-to-apples comparison (which would require a rigorous evaluation using the same set of benchmarks and simulation infrastructure), we can make a qualitative statement based on published results and our experience in this area. Sampling is likely the most accurate approach, followed by code mutation. Benchmark synthesis has shown medium accuracy. Reduced inputs have shown good accuracy for some benchmarks but poor accuracy for others.

We believe there is ample room for future work in workload reduction and generation, especially in terms of extending the existing techniques from single-core targets toward multicore processors. Contemporary computer systems feature multicore processors, which obviously has repercussions on benchmarking for both hardware and software. Recent work in workload reduction and generation has focused almost exclusively on single-threaded workloads, except for a few studies in sampling17 and benchmark synthesis (see the sidebar). However, given the trend toward multicore processors, we urgently need to develop workload reduction and generation techniques for multithreaded workloads.
We hope this article will stimulate future work in this area.

Acknowledgments
We thank the anonymous reviewers for their thoughtful comments and suggestions. This work is supported in part by the Research Foundation—Flanders (FWO) projects G.0232.06, G.0255.08, and G.0179.10, and the UGent-BOF projects 01J14407 and 01Z04109.

References
1. A.J. KleinOsowski and D.J. Lilja, "MinneSPEC: A New SPEC Benchmark Workload for Simulation-based Computer Architecture Research," Computer Architecture Letters, vol. 1, no. 2, June 2002, pp. 10-13.
2. L. Eeckhout, H. Vandierendonck, and K. De Bosschere, "Designing Workloads for Computer Architecture Research," Computer, vol. 36, no. 2, Feb. 2003, pp. 65-71.
3. T.M. Conte, M.A. Hirsch, and K.N. Menezes, "Reducing State Loss for Effective Trace Sampling of Superscalar Processors," Proc. Int'l Conf. Computer Design (ICCD 96), IEEE CS Press, 1996, pp. 468-477.
4. R.E. Wunderlich et al., "SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 03), ACM Press, 2003, pp. 84-95.
5. T. Sherwood et al., "Automatically Characterizing Large Scale Program Behavior," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 02), ACM Press, 2002, pp. 45-57.
6. M. Van Biesbrouck, B. Calder, and L. Eeckhout, "Efficient Sampling Startup for SimPoint," IEEE Micro, vol. 26, no. 4, July 2006, pp. 32-42.
7. T.F. Wenisch et al., "Simulation Sampling with Live-points," Proc. Ann. Int'l Symp. Performance Analysis of Systems and Software (ISPASS 06), IEEE Press, 2006, pp. 2-12.
8. J. Ringenberg et al., "Intrinsic Checkpointing: A Methodology for Decreasing Simulation Time through Binary Modification," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software (ISPASS 05), IEEE Press, 2005, pp. 78-88.
9. E. Perelman et al., "Cross Binary Simulation Points," Proc. Ann. Int'l Symp. Performance Analysis of Systems and Software (ISPASS 07), IEEE Press, 2007, pp. 179-189.
10. L. Van Ertvelde and L. Eeckhout, "Dispersing Proprietary Applications as Benchmarks through Code Mutation," Proc. Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 08), ACM Press, 2008, pp. 201-210.
11. T. Karkhanis and J.E. Smith, "A First-order Superscalar Processor Model," Proc. Ann. Int'l Symp. Computer Architecture (ISCA 04), ACM Press, 2004, pp. 338-349.
12. C.-K. Luk et al., "Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 05), ACM Press, 2005, pp. 190-200.
13. M. Weiser, "Program Slicing," IEEE Trans. Software Eng., vol. 10, no. 4, July 1984, pp. 352-357.
14. L. Van Ertvelde and L. Eeckhout, "Benchmark Synthesis for Architecture and Compiler Exploration," Proc. IEEE Int'l Symp. Workload Characterization (IISWC 10), IEEE Press, 2010, to appear.
15. K.D. Cooper, P.J. Schielke, and D. Subramanian, "Optimizing for Reduced Code Space Using Genetic Algorithms," Proc. SIGPLAN/SIGBED Conf. Languages, Compilers, and Tools for Embedded Systems (LCTES 99), ACM Press, 1999, pp. 1-9.
16. P. Kulkarni et al., "Fast Searches for Effective Optimization Phase Sequences," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI 04), ACM Press, 2004, pp. 171-182.
17. T.F. Wenisch et al., "SimFlex: Statistical Sampling of Computer System Simulation," IEEE Micro, vol. 26, no. 4, July 2006, pp. 18-31.

Luk Van Ertvelde is a PhD student in the Electronics and Information Systems Department at Ghent University, Belgium.
His research interests include computer architecture in general, and workload characterization in particular. Van Ertvelde has an MS in computer science from Ghent University.

Lieven Eeckhout is an associate professor in the Electronics and Information Systems Department at Ghent University, Belgium. His research interests include computer architecture and the hardware/software interface in general, with a focus on performance analysis, evaluation and modeling, and workload characterization. Eeckhout has a PhD in computer science and engineering from Ghent University. He is a member of IEEE and the ACM.

Direct questions and comments about this article to Lieven Eeckhout, ELIS—Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium; leeckhou@elis.ugent.be.

IEEE Micro, November/December 2010