Fast Cycle-Approximate Instruction Set Simulation

Björn Franke
Institute for Computing Systems Architecture, School of Informatics, University of Edinburgh
James Clerk Maxwell Building, Mayfield Road, Edinburgh, EH9 3JZ, United Kingdom
bfranke@inf.ed.ac.uk

11th International Workshop on Software & Compilers for Embedded Systems (SCOPES) 2008. © 2008 by EDAA.

Abstract

Instruction set simulators are indispensable tools in both ASIP design space exploration and the software development and optimisation process for existing platforms. Despite recent progress in improving the speed of functional instruction set simulators, cycle-accurate simulation is still prohibitively slow for all but the simplest programs. This severely limits the applicability of cycle-accurate simulators in the performance evaluation of complex embedded applications. In this paper we present a novel approach, namely the prediction of cycle counts based on information gathered during fast functional simulation and prior training. We have evaluated our approach against a cycle-accurate ARM v5 architecture simulator and a large set of benchmarks. We demonstrate its capability of providing highly accurate performance predictions with an average error of less than 5.8% at a fraction of the time required for cycle-accurate simulation.

1. Introduction

Instruction set simulators (ISS) are popular tools among both embedded hardware and software developers and have a variety of uses. They enable the early exploration of the architecture design space, provide a reference model for hardware verification, and equip software developers with an easy-to-use development platform. Given these widely different applications of ISS it comes as no surprise that specialised ISS schemes have emerged, differing in speed, accuracy and the level of micro-architectural detail modelled in each simulator. The spectrum ranges from very slow RTL simulation with precise timing to very fast functional simulation with no timing information. This latter method operates on the instruction set level and abstracts away all hardware implementation details.

Increasing complexity in both new ASIP designs and emerging applications is challenging current simulation technology, as designers require fast and timing-accurate simulators at the earliest possible stage in the design cycle. Additionally, recent developments in compiler technology such as feedback-driven "compiler-in-the-loop" code optimisation [16] rely on profiling data to steer the optimisation process and can require many program executions. Our work is motivated by the huge differences in simulation speed and accuracy between cycle-accurate and functional instruction set simulators. For example, a performance gap of 1-2 orders of magnitude between Verilog RTL and interpretive simulation, and another factor of 10 improvement over this for compiled-code simulators, is reported in [30]. RTL-to-C translation [28, 2] improves the performance of cycle-accurate simulation, but it still does not match that of functional simulation. Cycle-accurate simulators typically operate at 20K-100K instructions per second, compared with JIT-translating functional simulators, which typically operate at 200-400 million instructions per second [29]. At this higher speed, however, ISS are not able to provide cycle counts; they only collect high-level statistical information such as instruction counts and, possibly, the number of cache accesses.

In this paper we present a novel approach to instruction set simulation, combining the benefits of cycle-accurate and functional simulators. We propose regression modelling to build a performance model during an initial training stage; in the deployment stage this model is then used to accurately predict the performance of a new program based on the information available from functional simulation. We have evaluated our methodology against an ARM v5 implementation and 183 embedded benchmark applications. We show that the achievable prediction error is less than 5.8% on average.

Figure 1. Overview of the training and deployment stages in the cycle-approximate ISS. (Training stage: benchmark programs are run on both the cycle-accurate and the functional simulator, and the resulting training data is passed to a regression solver. Deployment stage: a new program is run on the functional simulator only, and the fitted predictor delivers its cycle count.)

This paper is structured as follows. In section 2 we present background material on regression analysis. Details of the regression-based cycle-approximate ISS are covered in section 3, before we present and analyse our empirical results in section 4. This is followed by a discussion of related work in section 5, before we summarise our work and conclude in section 6.

2. Background

In this section we introduce the statistical regression methods we have used in our work.

2.1. Regression Analysis

Regression analysis is a statistical method to examine the relationship between a dependent variable y and independent variables x = (x_1, ..., x_N). This relationship is modelled as a function f with y = f(x). The regression function f may have any form, and the particular choice of an appropriate functional form (or regression model) f* for the purpose of approximating f is a design decision that depends on knowledge of the mechanism that generated the data. For example, if a linear relationship between y and x is expected, we would choose a linear regression model:

y = \beta_0 + \sum_{i=1}^{N} \beta_i x_i    (1)

Other popular regression models include polynomial models and a wide range of non-linear functions. Frequently, the independent variables x are not the only factors affecting y, so the observed value y_observed will not always equal the calculated value y_calculated generated by f* for a given x. This error \epsilon = y_observed - y_calculated is due to lack of fit and is called the residual. The regression model is modified accordingly:

y = \beta_0 + \sum_{i=1}^{N} \beta_i x_i + \epsilon    (2)

For m observations (y_1, x_{1,1}, ..., x_{1,N}) to (y_m, x_{m,1}, ..., x_{m,N}) we can then build the following equation system:

y_1 = \beta_0 + \beta_1 x_{1,1} + \beta_2 x_{1,2} + \dots + \beta_N x_{1,N} + \epsilon_1
...
y_m = \beta_0 + \beta_1 x_{m,1} + \beta_2 x_{m,2} + \dots + \beta_N x_{m,N} + \epsilon_m

These equations can be rewritten in matrix form as:

y = X\beta + \epsilon    (3)

where \beta is the vector of regression coefficients and X is the model matrix. Once the regression model has been chosen (e.g. a linear function), the parameters of the model (\beta) have to be calculated in a way that makes f* as close as possible to the true regression function f. The least-squares method determines the \beta that minimises the sum of the squares of the prediction errors of the model on the observed data, i.e. it minimises

S(\beta) = \sum_{i=1}^{m} \Big( y_i - \beta_0 - \sum_{j=1}^{N} \beta_j x_{i,j} \Big)^2    (4)

The computed values represent estimates of the regression coefficients \beta. Note that linear regression does not test whether the data are linear. It assumes that the data are linear and finds the slope and intercept that make a straight line best fit the data. Hence, it is necessary to check this linearity assumption before proceeding with regression-based prediction.
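To make the least-squares step concrete, here is a minimal sketch of fitting the model of equations 1-4 with numpy. It is illustrative only and not part of the paper's tool chain; the synthetic data, the dimensions m and N and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for training data: m programs, N event counters each.
m, N = 120, 5
counters = rng.uniform(1e4, 1e7, size=(m, N))       # x_{i,1} ... x_{i,N}
true_beta = np.array([1.2, 0.8, 15.0, 3.5, 60.0])   # assumed per-event cycle costs
cycles = 5000 + counters @ true_beta + rng.normal(0, 1e4, size=m)

# Model matrix X: a leading column of ones absorbs the intercept beta_0.
X = np.hstack([np.ones((m, 1)), counters])

# Least squares minimises S(beta) = sum_i (y_i - beta_0 - sum_j beta_j x_ij)^2.
beta, _, _, _ = np.linalg.lstsq(X, cycles, rcond=None)
print("estimated intercept beta_0:", beta[0])
print("estimated coefficients beta_1..beta_N:", beta[1:])
```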
3. Cycle-Approximate ISS

In this section we present our novel cycle-approximate ISS methodology and explain how to exploit regression modelling to predict cycle counts of previously unseen programs. We assume that both a cycle-accurate and a functional simulator enhanced with various counters (e.g. counting different types of instructions and possibly also cache and memory accesses) are initially available. Cycle-approximate ISS is then based on a two-staged approach to constructing a performance predictor: a Training Stage, in which a set of training/benchmarking programs is profiled, and a later Deployment Stage, where the performance of previously unseen programs is predicted. An illustration of these two stages is given in figure 1.

During the training stage a set of benchmark applications is executed and profiled on both the cycle-accurate and the functional ISS. For each program the exact cycle count and the values of the various counters maintained in the functional ISS are collected, and together they form a single data point (y, x). As soon as this data is available for all programs in the training set, the Regression Solver calculates the regression coefficients according to a predefined regression model and stores them for later use.

On entering the deployment stage the cycle-accurate simulator is not used any more; all simulations are performed by the functional simulator. As before, this simulator is used to generate a characteristic profile (the vector x) of the program under examination. The regression model with the previously calculated coefficients is then used as a predictor and evaluated at the point x, resulting in the predicted cycle count y for the new program.

3.1. Performance Prediction

Functional simulators frequently maintain run-time statistics which may include the total number of different types of instructions, cache reads and misses, and possibly other event counts (e.g. from branch predictors). Other micro-architectural details, relating e.g. to the state of the processor pipeline, are usually not available. We use the available high-level event counters to construct a regression model relating these parameters to the observed cycle counts resulting from cycle-accurate simulation in the training stage.

Our hypothesis is that each of the events recorded by the functional ISS contributes towards the total execution time of the program. Furthermore, we expect the average cost of events of the same type to be (more or less) constant both within a single program run and across programs. This may not necessarily be the case for small programs, but for larger programs we expect "stable" average cycle counts for each type of instruction, each cache access etc. This motivates us to expect a linear relationship between the event counters x and the cycle count y. Once a regression line has been fitted to the set of training data, the fitted slope and intercept values can be used to predict the response for a specific value of x that was not included in the original set of observations. This is essentially an application of equation 1.

At present we use the same compiler to compile the executables in both the training and deployment stages. The investigation of the impact of the toolchain choice on the prediction accuracy is the subject of future work.
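The deployment-stage prediction of section 3.1 then reduces to evaluating equation 1 at the counter vector x reported by the functional simulator. A minimal sketch follows; the coefficient values and the counter categories named in the comments are hypothetical, not SimIt-ARM's actual output:

```python
import numpy as np

def predict_cycles(beta, profile):
    """Evaluate the fitted linear model at the profile vector x of a
    new program: y = beta_0 + sum_i beta_i * x_i (equation 1)."""
    x = np.asarray(profile, dtype=float)
    return beta[0] + beta[1:] @ x

# Hypothetical coefficients from the training stage (intercept first).
beta = np.array([5000.0, 1.2, 0.8, 15.0, 3.5, 60.0])

# Hypothetical counter profile of a previously unseen program, e.g.
# ALU ops, loads/stores, I-cache misses, D-cache misses, TLB misses.
profile = [2.3e6, 8.1e5, 4.2e4, 3.9e4, 1.1e3]

print("predicted cycle count:", predict_cycles(beta, profile))
```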
4. Empirical Evaluation

In this section we discuss our evaluation approach and present empirical results for an ARM v5 instruction set simulator and a comprehensive set of embedded applications.

4.1. Tools Infrastructure

We have evaluated our performance prediction methodology against an ARM v5 instruction set. For the simulation we have used the SimIt-ARM v2.1 functional and cycle-accurate instruction set simulators [23, 24], the latter of which is calibrated against an Intel StrongARM SA-1100 implementation [13]. (Strictly speaking, the SimIt-ARM simulator is only cycle-approximate rather than cycle-accurate, as frequently stated. At the time of writing, the more recent SimIt-ARM v3.0 release only contains a functional ISS, hence the use of the slightly older version 2.1.) This 32-bit core features a 5-stage pipeline, 16KB instruction and 8KB data caches, and MMU support. For integer benchmarks SimIt-ARM yields a timing accuracy of about 3%. Floating-point operations are not supported in hardware, but are implemented in a software library. The benchmarks were compiled using a GCC 3.3.2 cross-compiler with optimisation level -O3, targeting the ARM v5 ISA. Each benchmark was simulated in a stand-alone manner, without an underlying operating system. All experiments have been conducted on a quad-core 3GHz Intel Xeon based host platform running Linux.

Details of the parameters extracted from the SimIt-ARM simulators are shown in table 1. In general, the parameters relate to instruction counts (in total and broken down across different instruction types), the memory system (number of memory pages, I-cache and D-cache accesses, I-TLB and D-TLB accesses, BIU accesses), and the internal operation state machines (OSMs). Finally, we record the total cycle count and use this for training and accuracy evaluation.

Counters       Description
x1 ... x30     Instruction counters for mov, mvn, add, adc, sub, sbc, rsb, rsc, and, eor, orr, bic, cmp, cmn, tst, teq, mla, mul, smull, umull, ldr imm, ldr reg, str imm, str reg, ldm, stm, syscall, br, bl, fpe
x31, x32       Total instructions, nullified instructions
x33            Total 4K memory pages allocated
x34, x35       Total I-cache reads, read misses
x36, x37       Total I-TLB reads, read misses
x38 ... x41    Total D-cache writes, write misses, reads, read misses
x42, x43       Total D-TLB reads, read misses
x44            Total BIU accesses
x45, x46       Total allocated OSMs, retired OSMs
y              Total cycles

Table 1. Overview of counters maintained in the functional and cycle-accurate SimIt-ARM instruction set simulators.

4.2. Benchmarks

Table 2 summarises the 183 programs taken from six benchmark suites that we have used in our experiments. This large number of programs is necessary to provide sufficient training data, but also to demonstrate the wide applicability of our cycle approximation methodology. Most of the benchmarks represent embedded DSP and multimedia codes; the only notable exceptions are the pointer-intensive general-purpose codes from [3, 4], which we have included to evaluate the impact of cross-domain training.

Benchmark Suite                            No. of Programs   Lines of Code (per Program)   Description
DSPstone [31]                              30                ≈ 50-150                      Small DSP kernels
UTDSP [20]                                 88                ≈ 50-3500                     Small DSP kernels and applications
SWEET WCET [10]                            37                ≈ 50-4300                     Worst case execution time benchmarks
MediaBench [18, 19]                        16                ≈ 0.3-50K                     Multimedia applications
Pointer-Intensive Benchmark Suite [3, 4]   5                 ≈ 0.6-7.3K                    Non-trivial pointer-intensive applications
Other [1]                                  7                 ≈ 0.2-1.5K                    Cryptography, software radio, audio processing

Table 2. Overview of the benchmark suites used in our experiments.
4.3. Results

Model Selection. The selection of an appropriate regression model (linear, polynomial, various non-linear functions etc.) is a design parameter and has a critical impact on how well the regression function can describe the observed data. Our hypothesis, outlined in section 3, has been that the cycle count is a linear function of the various counters maintained by the ISS. In order to test this hypothesis we compare the calculated cycle counts y_calculated with the observed cycle counts y_observed after performing regression over the entire data set (all programs and all counters except the observed cycle count). Figure 2(a) summarises this relationship for the complete set of benchmarks. Note that no prediction of yet unseen programs takes place at this point; we only analyse how well the constructed linear function is capable of describing all known data points.

The graph in figure 2(a) clearly shows the close match between the observed and calculated cycle counts. The data points are concentrated near the ideal straight line with only a very few exceptional outliers, especially at the lower end of the scale. We have calculated the residuals and have found an average error of 4.5%. A more detailed breakdown of the distribution of errors is shown in figure 2(b), where the relative error frequency is plotted in a histogram against the percentage error interval. This diagram shows that the vast majority of programs can be described with a very small error. In fact, for about 50% of all programs the error is less than 1%. For 75% of all programs the error is less than 8%, and only three programs have an error of 15% or more. The maximum error, however, is relatively large with a value of 26.1%.

Figure 2. Evaluation of the linear regression model: (a) calculated cycle counts vs. observed cycle counts; (b) relative frequency histogram of the percentage error.

In summary, we have shown that a linear regression model is appropriate to describe the relationship between the observed counters and cycle counts. The introduction of a small error between the observed and calculated results is inevitable; however, we have shown that this error is typically very small (on average less than 5%). This gives us enough confidence in the chosen linear regression model for predicting cycle counts from the observed counters.
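The fit-quality check above is essentially a residual computation: evaluate the fitted model on the very observations it was built from and report the percentage errors. A small self-contained sketch with made-up numbers, not the paper's data:

```python
import numpy as np

def relative_errors(y_observed, y_calculated):
    """Percentage residuals |y_observed - y_calculated| / y_observed."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_calculated = np.asarray(y_calculated, dtype=float)
    return 100.0 * np.abs(y_observed - y_calculated) / y_observed

# Hypothetical observed vs. model-calculated cycle counts for six programs.
y_obs = np.array([1.0e6, 4.2e6, 9.8e5, 2.5e7, 3.1e6, 7.7e5])
y_calc = np.array([1.0e6, 4.3e6, 9.6e5, 2.5e7, 3.2e6, 6.9e5])

err = relative_errors(y_obs, y_calc)
print("per-program error (%):", np.round(err, 2))
print("average error (%):", round(float(err.mean()), 2))
print("maximum error (%):", round(float(err.max()), 2))
```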
Accuracy and Micro-Architectural Detail. Next, we evaluate the performance of our model in computing cycle counts of new, i.e. previously unseen, programs. For this, we use a standard statistical Leave-One-Out Cross Validation approach, where a single observation is eliminated from the data set and used as the validation data, and the remaining observations are used as the training data. This process is repeated for all programs in the data set such that each program is used once as the validation data. The results of this cross validation process for the entire set of programs are summarised in figures 3 and 4 and in table 3. Figures 3 and 4 show, similar to the diagram in figure 2(a), the predicted cycle count versus the observed cycle count. In general, better approximations of the cycle count result in data points closer to a straight line through the origin with unit slope. In four experiments we have performed leave-one-out cross validation based on different sets of parameters: instruction counters only; instruction and cache counters; instruction, cache and TLB counters; and all counters.

Figure 3. Leave-One-Out Cross Validation: predicted cycle counts vs. observed cycle counts for (a) instruction counters only and (b) instruction & cache counters.

In figure 3(a) results are shown for the case in which only instruction counts are used as parameters in the regression model. The data points are clustered around the ideal straight line; however, some significant prediction errors are noticeable, especially towards the lower end of the scale. The average error and standard deviation are quite high, with values of 38.9% and 57.7, respectively. The maximum error is much higher still and reaches 518% for one of the smallest applications. The prediction accuracy recovers for longer running applications, though. This is not surprising, as minor "disturbances" in the behaviour have a larger relative impact on the small applications, but average out over long program runs.

Combined instruction and cache access counters produce the results shown in figure 3(b). The prediction accuracy has significantly improved over the instructions-only scheme, and the mean absolute and maximum errors have been reduced to 10.5% (with a standard deviation of 13.0) and 66.4%, respectively. The inclusion of additional memory-related parameters (TLB counters) further improves the prediction quality, as can be seen in figure 4(a). The average error has been reduced to 5.72%, and the maximum prediction error, again observed for one of the smallest applications, is significantly smaller (26.3%). In a control experiment we have included additional micro-architectural data relating to the operation state machines modelling the processor pipeline behaviour. This information is typically not available in functional simulators. The corresponding results are shown in figure 4(b).
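Leave-one-out cross validation is easy to express with the same machinery: hold out one program, refit on the rest, predict the held-out program, and repeat. A sketch under the same synthetic-data assumptions as the earlier fitting example:

```python
import numpy as np

def loocv_errors(X, y):
    """Leave-one-out cross validation: refit the regression without
    observation i, predict observation i, return all percentage errors."""
    m = len(y)
    errors = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i                     # hold out program i
        beta, _, _, _ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        y_pred = X[i] @ beta
        errors[i] = 100.0 * abs(y[i] - y_pred) / y[i]
    return errors

# Synthetic stand-in data: 40 programs, intercept column plus 4 counters.
rng = np.random.default_rng(1)
counters = rng.uniform(1e4, 1e7, size=(40, 4))
X = np.hstack([np.ones((40, 1)), counters])
y = X @ np.array([4000.0, 1.1, 0.9, 20.0, 2.5]) + rng.normal(0, 1e4, 40)

err = loocv_errors(X, y)
print("mean abs. error (%):", round(float(err.mean()), 2))
print("max error (%):", round(float(err.max()), 2))
```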
Interestingly, this additional information has only a very small impact on the mean absolute error (a 0.28% improvement over instruction, cache & TLB counters). On the other hand, both the standard deviation and, more importantly, the maximum error increase. This suggests that including a larger number of parameters in the regression model might not always be beneficial. Care must be taken in choosing parameters and in balancing the number of training data points against the number of coefficients to be determined.

Figure 4. Leave-One-Out Cross Validation: predicted cycle counts vs. observed cycle counts for (a) instruction, cache & TLB counters and (b) all counters.

Counters/Parameters         Mn. Abs. Error   Std. Dev.   Max. Error
Instructions                38.9%            57.7        518%
Instructions & Cache        10.5%            13.0        66.4%
Instructions, Cache & TLB   5.72%            7.12        26.31%
All (incl. OSMs)            5.44%            7.37        44.66%

Table 3. Mean absolute error, standard deviation and maximum error for different parameter sets.

Scalability and Training Minimisation. The acceptance of a cycle-approximate simulation methodology that requires offline training will largely depend on the amount of training needed before the deployment stage can be entered. Ideally, we would like to be able to train our model on a small number of small benchmarks and then apply it to larger applications whilst retaining its full accuracy. In this section we hence evaluate how much training is required and how well the performance of larger applications can be estimated if only small benchmarks are used for training.

To answer the question as to how much training is required to accurately predict the performance of larger applications, we have chosen the N = 90, ..., 170 smallest applications to estimate the coefficients of our regression model. Based on this training data we have then predicted the execution time of the ten largest applications. Figure 5 summarises our findings and shows the median of the prediction error for the 10 largest programs for the various values of N. (The median is used as an average function because individual extreme outliers would have an over-proportional effect on the arithmetic mean.) For the different numbers of training examples the error is relatively stable, peaking at 4.1% for the largest training set size. This shows that with limited training data good prediction accuracy is still achievable. For values of N < 90, however, we have observed the error to increase very quickly, making any meaningful prediction impossible. This suggests that there exists a critical threshold in the size of the training set below which regression-based prediction should not be used.

Figure 5. Scalability: median of the prediction error for the 10 largest applications after training on the N smallest applications.
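The scalability experiment is a small variation on the same theme: order the programs by observed cycle count, fit on the N smallest, and report the median error over the largest ones. A hedged sketch with assumed synthetic data standing in for the 183-program set:

```python
import numpy as np

def median_error_large(X, y, n_train, n_test=10):
    """Fit on the n_train programs with the smallest cycle counts and
    return the median percentage error over the n_test largest ones."""
    order = np.argsort(y)                        # programs sorted by size
    train, test = order[:n_train], order[-n_test:]
    beta, _, _, _ = np.linalg.lstsq(X[train], y[train], rcond=None)
    err = 100.0 * np.abs(y[test] - X[test] @ beta) / y[test]
    return float(np.median(err))                 # median is outlier-robust

# Synthetic stand-in for the 183-program data set (intercept + 5 counters).
rng = np.random.default_rng(7)
counters = rng.uniform(1e3, 1e8, size=(183, 5))
X = np.hstack([np.ones((183, 1)), counters])
y = X @ np.array([3000.0, 1.3, 0.7, 12.0, 4.0, 55.0]) + rng.normal(0, 1e5, 183)

for n in (90, 110, 130, 150, 170):
    print(f"N={n:3d}: median error over 10 largest = {median_error_large(X, y, n):.2f}%")
```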
The results demonstrate that the amount of training can be safely reduced to a sufficiently small set of small programs. Still, the constructed predictor is capable of predicting the performance of programs several orders of magnitude larger than the largest program in the training data set. These results are encouraging, and we are investigating ways of identifying a minimal, but still representative, set of training programs with the goal of further minimising the training cost.

Domain Specialisation. We are interested in how the choice of training examples from one or more application domains affects the prediction accuracy on another domain. In order to test this "domain specialisation" a benchmark suite [3, 4] comprising pointer-intensive general-purpose applications has been included in our study. Training is based on all programs except these pointer-intensive codes, and predictions are performed as before. In figure 6 the results for the five programs in this experiment are plotted. The prediction errors range between 0.17% for the yacr2 program (symbolic channel router) and 7.1% for ft (minspan calculation). yacr2 is dominated by pointer and array dereferencing and arithmetic, but only performs limited dynamic storage allocation. ft exhibits a different behaviour, with a great deal of dynamic storage allocation and deallocation throughout the life of the program, but very little pointer arithmetic. The dynamic memory allocation of ft is likely to introduce "irregularities" that make it more difficult to predict than the other, less dynamic codes. On average, the prediction error on the pointer-intensive codes is 2.73% and thus slightly smaller than the error on the codes from the same domain as the training data. This result suggests there might be little impact of the training domain on the prediction accuracy. However, we do not have sufficient evidence at this point to confirm this for other domains.

Figure 6. Domain Specialisation: prediction accuracy for general-purpose pointer-intensive codes after training on embedded applications.

Speed. The presented cycle-approximate ISS methodology shifts much of the work into the training stage, where training programs are executed on both a functional and a cycle-accurate simulator before the regression coefficients can be determined. Clearly, the cost of training is dominated by the cycle-accurate simulation, which can take several hours to run depending on the complexity of the simulated processor core and the specific choice of training programs. The calculation of the regression coefficients, however, is fast and typically adds only a few seconds to the overall training time. More important than the training time is the time required for approximating the cycle count of a new program in the deployment stage. As prediction only involves the evaluation of a linear function at a single point (after the parameters of this point have been determined by functional simulation of the program), this operation is very fast. (In fact, it is so fast that we have not been able to measure it using the standard system timers.)

5. Related Work

Instruction set simulation is an active research area, and researchers have focused, in particular, on fast retargetable [21, 25] and compiled [26] or JIT-compiled [29] instruction set simulators.
More recently, hybrid compiled simulation combining interpretive and compiled simulation has been presented in [27, 29].

Most relevant to our work is [22], where an artificial neural network has been trained to estimate cycle counts of a new program. In contrast, our work is based on statistical regression and as such is more open to mathematical analysis. Apart from this technical difference, our methodology yields a significantly higher prediction accuracy in terms of both the average (5.72% vs 11.4%) and maximum error (26.3% vs 103%).

Statistical simulation [9] is based on the simulation of small, synthetic program traces derived from statistical program profiles rather than the actual program itself. It can yield accurate performance estimates several orders of magnitude faster than full simulation. Our work shares some commonality with statistical simulation in that it also relies on statistical profiles (e.g. various counters maintained by a functional simulator). However, unlike statistical simulation we do not construct a synthetic trace that is then fed into a cycle-accurate simulator, but compute the cycle count directly based on training data and a linear regression model.

Program similarity is exploited in [11], where the goal is to transform a set of micro-architecture independent runtime characteristics of a program into relative performance differences based on the stored characteristics of a previously profiled benchmark suite. This approach shares the "training" concept with our work, but fails to provide the same prediction accuracy. This may be due to the chosen set of characteristics, the specific model, too few training examples, or noise in the measurements. (That work does not use cycle-accurate simulators for training, but real hardware, and performance measurements may be affected by noise due to I/O and OS activity.)

The construction and use of linear regression models for the performance analysis of a super-scalar processor is the subject of [14]. In this work an estimator of the expected CPI performance is constructed based on 26 micro-architectural parameters. Similarly, regression models are used in [17] to construct micro-architectural performance and power predictors.

[5] uses compilation onto a virtual instruction set, which is subsequently translated into C with timing annotations and then executed. This early work is extended and formalised in [6]; however, both publications lack convincing empirical results. Micro-profiling [15] extends these ideas; it is based on fine-grained instrumentation of a generic, low-level and executable IR and is designed to support the early stages of an SoC design cycle. Whilst shown to work well for two small applications and a simple MIPS core with a flat memory hierarchy, it is questionable whether this approach will scale, e.g. when a more complex memory hierarchy is introduced.

Compiler-based approaches to performance prediction are covered in e.g. [7, 8, 12], where the primary goal is to support the compiler in selecting "good" transformations and parameters. In this scenario the preservation of the performance trend ("better" or "worse") is of higher importance than absolute accuracy.

6. Summary, Future Work and Conclusions

Summary. In this paper we have developed a cycle-approximate instruction set simulation methodology combining the benefits of functional and cycle-accurate simulation. Using prior training and regression-based performance prediction, our approach is able to exploit the statistical information typically provided by cache-model enhanced functional simulators to estimate cycle counts with an average error of less than 5.8%. During the prediction stage our technique does not rely on a detailed model of the processor pipeline, but only utilises instruction and memory access counters. Thus, by reducing the level of micro-architectural detail in the simulation we are able to generate performance results at the speed of functional simulation whilst retaining much of the accuracy of cycle-accurate simulation. We have evaluated our technique against an extensive set of benchmarks and an ARM v5 implementation and demonstrated its effectiveness in predicting the execution time of larger programs even with limited training. No difference was found between predictions made with training data from the same versus different application domains, indicating that the specific selection of training examples is not critical for our technique to work.

Future Work. Future work will extend the presented work in two directions: (a) reducing the maximum prediction error through advanced statistical methods that may, in addition to the cycle count, provide the user with a confidence value characterising the "uncertainty" of the generated result, and (b) broader evaluation. For the latter we plan to apply our methodology across a set of different embedded platforms, including more complex commercial high-performance RISC, digital signal and multimedia processors. In addition, we are working on the integration of our technique into a publicly available simulator and its evaluation against an industrial benchmark suite (e.g. EEMBC).

Conclusions. Our work contributes towards faster turnaround cycles in ASIP design space exploration and towards making automatic "compiler-in-the-loop" code optimisation a viable alternative to manual code tuning. In many of these situations cycle-accurate results are not strictly necessary and "good" approximations are sufficient. In addition, our approach is particularly useful to support fast simulation of complex, long-running embedded applications such as state-of-the-art multimedia codecs, where the slow speed of cycle-accurate simulation is often prohibitive.
References

[1] S. Amarasinghe. StreamIt - Benchmarks. http://cag.csail.mit.edu/streamit/shtml/benchmarks.shtml, 2007.
[2] ARC. ARC VTOC Tool. http://www.arc.com/software/simulation/vtoc.html, 2007.
[3] T. M. Austin. Pointer-intensive benchmark suite. http://www.cs.wisc.edu/~austin/ptr-dist.html, 2007.
[4] T. M. Austin, S. E. Breach, and G. S. Sohi. Efficient detection of all pointer and array access errors. Technical report, University of Wisconsin, 1993.
[5] J. R. Bammi, E. Harcourt, W. Kruijtzer, L. Lavagno, and M. T. Lazarescu. Software performance estimation strategies in a system-level design tool. In Proceedings of CODES'00, 2000.
[6] G. Bontempi and W. Kruijtzer. A data analysis method for software performance prediction. In Proceedings of DATE'02, 2002.
[7] P. C. Diniz. A compiler approach to performance prediction using empirical-based modeling. Lecture Notes in Computer Science, 2659, 2003.
[8] C. Dubach, J. Cavazos, B. Franke, M. O'Boyle, G. Fursin, and O. Temam. Fast compiler optimisation evaluation using code-feature based performance prediction. In Proceedings of Computing Frontiers'07, 2007.
[9] L. Eeckhout, S. Nussbaum, J. E. Smith, and K. D. Bosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, 2003.
[10] J. Gustafsson. The WCET tool challenge 2006. In Proceedings of the 2nd International Symposium on Leveraging Applications of Formal Methods (ISOLA'06), 2007.
[11] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. D. Bosschere. Performance prediction based on inherent program similarity. In Proceedings of PACT'06, 2006.
[12] C.-H. Hsu and U. Kremer. IPERF: A framework for automatic construction of performance prediction models. In Proceedings of the Workshop on Profile and Feedback-Directed Compilation (PFDC'98), 1998.
[13] Intel. Intel StrongARM SA-1100 Microprocessor - Developer's Manual. http://www.intel.com, 1999.
[14] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA), 2006.
[15] T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers, and H. Meyr. A SW performance estimation framework for early system-level-design using fine-grained instrumentation. In Proceedings of DATE'06, 2006.
[16] T. Kisuki, P. Knijnenburg, and M. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of PACT'00, 2000.
[17] B. C. Lee and D. M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proceedings of ASPLOS'06, 2006.
[18] C. Lee. MediaBench. http://euler.slu.edu/~fritts/mediabench/mb1/, 2007.
[19] C. Lee, M. Potkonjak, and W. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, 1997.
[20] C. G. Lee. UTDSP benchmark suite. http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html, 1998.
[21] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A universal technique for fast and flexible instruction-set architecture simulation. In DAC '02: Proceedings of the 39th Conference on Design Automation, pages 22-27, New York, NY, USA, 2002. ACM.
[22] M. S. Oyamada, F. Zschornack, and F. R. Wagner. Accurate software performance estimation using domain classification and neural networks. In Proceedings of SBCCI'04, 2004.
[23] W. Qin. SimIt-ARM. http://simit-arm.sourceforge.net, 2007.
[24] W. Qin and S. Malik. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of Design Automation & Test in Europe (DATE), 2003.
[25] M. Reshadi, N. Bansal, P. Mishra, and N. Dutt. An efficient retargetable framework for instruction-set simulation. In Proceedings of CODES+ISSS'03, 2003.
[26] M. Reshadi, P. Mishra, and N. Dutt. Instruction set compiled simulation: A technique for fast and flexible instruction set simulation. In Proceedings of the 2003 Design Automation Conference (DAC), 2003.
[27] M. Reshadi, P. Mishra, and N. Dutt. Hybrid compiled simulation: An efficient technique for instruction-set architecture simulation. ACM Transactions on Embedded Computing Systems (TECS), 2007. To appear.
[28] W. Snyder, P. Wasson, and D. Galbi. Verilator. http://www.veripool.com/verilator.html, 2007.
[29] N. Topham and D. Jones. High speed CPU simulation using JIT binary translation. In Proceedings of MoBS - Workshop on Modeling, Benchmarking and Simulation, 2007.
[30] S. J. Weber, M. W. Moskewicz, M. Gries, C. Sauer, and K. Keutzer. Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures. In Proceedings of CODES+ISSS'04, 2004.
[31] V. Zivojnovic, J. Martinez, C. Schläger, and H. Meyr. DSPstone: A DSP-oriented benchmarking methodology. In Proceedings of ICSPAT'94, 1994.