Fast Cycle-Approximate Instruction Set Simulation

11th International Workshop on Software & Compilers for Embedded Systems (SCOPES) 2008
Björn Franke
Institute for Computing Systems Architecture
School of Informatics
University of Edinburgh
James Clerk Maxwell Building
Mayfield Road, Edinburgh, EH9 3JZ, United Kingdom
bfranke@inf.ed.ac.uk
Abstract
Instruction set simulators are indispensable tools in both ASIP design space exploration and the software development and optimisation process for existing platforms. Despite recent progress in improving the speed of functional instruction set simulators, cycle-accurate simulation is still prohibitively slow for all but the simplest programs. This severely limits the applicability of cycle-accurate simulators in the performance evaluation of complex embedded applications. In this paper we present a novel approach, namely the prediction of cycle counts based on information gathered during fast functional simulation and prior training. We have evaluated our approach against a cycle-accurate ARM v5 architecture simulator and a large set of benchmarks. We demonstrate its capability of providing highly accurate performance predictions with an average error of less than 5.8% at a fraction of the time required for cycle-accurate simulation.
1. Introduction
Instruction set simulators (ISS) are popular tools among
both embedded hardware and software developers and have
a variety of uses. They enable the early exploration of the
architecture design space, provide a reference model for
hardware verification, and equip software developers with
an easy-to-use development platform. Given these widely different applications of ISS it comes as no surprise that specialised ISS schemes have emerged, differing in speed, accuracy and the level of micro-architectural detail modelled in each simulator. The spectrum ranges from very slow RTL simulation with precise timing to very fast functional simulation with no timing information. This latter method operates on the instruction set level and abstracts away all hardware implementation details.
Increasing complexity in both new ASIP designs and emerging applications is challenging current simulation technology, as designers require fast and timing-accurate simulators at the earliest possible stage in the design cycle. Additionally, recent developments in compiler technology such as feedback-driven "compiler-in-the-loop" code optimisation [16] rely on profiling data to steer the optimisation process and can require many program executions.
Our work is motivated by the huge differences in simulation speed and accuracy between cycle-accurate and functional instruction set simulators. For example, a performance gap of 1-2 orders of magnitude between Verilog RTL and interpretive simulation, and another factor of 10 improvement over this for compiled-code simulators, is reported in [30]. RTL-to-C translation [28, 2] improves the performance of cycle-accurate simulation, but it still does not match that of functional simulation. Cycle-accurate simulators typically operate at 20K-100K instructions per second, compared with JIT-translating functional simulators, which typically operate at 200-400 million instructions per second [29]. At this higher speed, however, ISS cannot provide cycle counts, but only collect high-level statistical information such as instruction counts and, possibly, the number of cache accesses.
In this paper we present a novel approach to instruction set simulation, combining the benefits of cycle-accurate and functional simulators. We propose regression modelling to build a performance model during an initial training stage, which is then used, in the deployment stage, to accurately predict the performance of a new program based on the information available from functional simulation. We have evaluated our methodology against an ARM v5 implementation and 183 embedded benchmark applications. We show that the achievable prediction error is less than 5.8% on average.
[Figure 1. Overview of the training and deployment stages in the cycle-approximate ISS. In the Training Stage, benchmark programs are run on both the cycle-accurate and the functional simulator; the resulting training data is fed into a regression solver that produces the predictor. In the Deployment Stage, a new program is run on the functional simulator only and the predictor estimates its cycle count.]
This paper is structured as follows. In section 2 we present background material on regression analysis. Details of the regression-based cycle-approximate ISS are covered in section 3, before we present and analyse our empirical results in section 4. This is followed by a discussion of related work in section 5, before we summarise our work and conclude in section 6.
2. Background
In this section we introduce the statistical regression methods we have used in our work.
2.1. Regression Analysis
Regression analysis is a statistical method to examine the relationship between a dependent variable y and independent variables x = (x1, ..., xN). This relationship is modelled as a function f with y = f(x). The regression function f may have any form, and the particular choice of an appropriate functional form (or regression model) f* for the purpose of approximating f is a design decision that depends on the knowledge of the mechanism that generated the data. For example, if a linear relationship between y and x is expected, we would choose a linear regression model like this:

    y = \beta_0 + \sum_{i=1}^{N} \beta_i x_i    (1)

Other popular regression models include polynomial and a wide range of non-linear functions. Frequently, the independent variables x are not the only factors affecting y, so the observed value y_observed will not always equal the calculated value y_calculated generated by f* for a given x. This error ε = y_observed − y_calculated is due to lack of fit and is called the residual. The regression model is modified accordingly:

    y = \beta_0 + \sum_{i=1}^{N} \beta_i x_i + \epsilon    (2)

For m observations (y_1, x_{1,1}, ..., x_{1,N}) to (y_m, x_{m,1}, ..., x_{m,N}) we can then build the following system of equations:

    y_1 = \beta_0 + \beta_1 x_{1,1} + \beta_2 x_{1,2} + \cdots + \beta_N x_{1,N} + \epsilon_1
    \vdots
    y_m = \beta_0 + \beta_1 x_{m,1} + \beta_2 x_{m,2} + \cdots + \beta_N x_{m,N} + \epsilon_m

These equations can be rewritten in matrix form as:

    y = X\beta + \epsilon    (3)

where β is the vector of regression coefficients and X is the model matrix.
Once the regression model has been chosen (e.g. a linear function), the parameters of the model (β) have to be calculated in a way that makes f* as close as possible to the true regression function f. The least-squares method minimises the sum of the squares of the prediction errors of the model on the observed data, i.e. it minimises

    S(\beta) = \sum_{i=1}^{m} \left( y_i - \beta_0 - \sum_{j=1}^{N} \beta_j x_{i,j} \right)^2    (4)

The values of β that minimise S(β) are the least-squares estimates of the regression coefficients.
Note that linear regression does not test whether the data are linear. It assumes that the data are linear, and finds the slope and intercept that make a straight line best fit the data. Hence, it is necessary to check this linearity assumption before proceeding with regression-based prediction.
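To make the least-squares step concrete, the following sketch fits the coefficients of equation (3) with NumPy. It is only an illustration: the counter matrix, the coefficient values and the noise term below are invented placeholder data, not taken from the paper.

```python
import numpy as np

# Placeholder training data: m observed programs, N event counters each
# (invented numbers, purely for illustration).
m, N = 8, 3
rng = np.random.default_rng(0)
counters = rng.integers(1_000, 1_000_000, size=(m, N)).astype(float)
cycles = counters @ np.array([1.2, 4.0, 9.5]) + 50_000 + rng.normal(0.0, 1e4, m)

# Model matrix X of equation (3): a leading column of ones accounts for beta_0.
X = np.hstack([np.ones((m, 1)), counters])

# Least-squares estimate of beta, i.e. the minimiser of S(beta) in equation (4).
beta, *_ = np.linalg.lstsq(X, cycles, rcond=None)

# Fitted values and residuals (the epsilon terms of equation (2)).
fitted = X @ beta
residuals = cycles - fitted
print("beta:", beta)
print("largest absolute residual:", np.abs(residuals).max())
```

Inspecting the residuals (e.g. plotting them against the fitted values) is one simple way of checking the linearity assumption mentioned above.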
3. Cycle-Approximate ISS
In this section we present our novel cycle-approximate
ISS methodology and explain how to exploit regression
modelling to predict cycle counts of previously unseen programs. We assume that both a cycle-accurate and a functional simulator enhanced with various counters (e.g. counting different types of instructions and possibly also cache
and memory accesses) are initially available.
Cycle-approximate ISS is then based on a two-stage approach to constructing a performance predictor: a Training Stage, in which a set of training/benchmarking programs is profiled, and a later Deployment Stage, in which the performance of previously unseen programs is predicted. An illustration of these two stages is given in figure 1. During
the training stage a set of benchmark applications is executed and profiled on both the cycle-accurate and the functional ISS. For each program the exact cycle count and the
values of the various counters maintained in the functional
ISS are collected and together they form a single data point
(y, x). As soon as this data is available for all programs in
the training set the Regression Solver calculates the regression coefficients according to a predefined regression model
and stores them for later use. On entering the deployment
stage the cycle-accurate simulator is not used any more, but
all simulations are performed by the functional simulator.
As before, this simulator is used to generate a characteristic
profile – the vector x – of the program under examination.
The regression model with the previously calculated coefficients is then used as a predictor and evaluated at the point
x, resulting in the predicted cycle count y for the new program.
At present we use the same compiler to compile the executables in both the training and deployment stages. Investigating the impact of the toolchain choice on prediction accuracy is the subject of future work.
3.1. Performance Prediction
Functional simulators frequently maintain run-time statistics, which may include the total number of different types of instructions, cache reads and misses, and possibly other statistics (e.g. from branch predictors). Other micro-architectural details, relating e.g. to the state of the processor pipeline, are usually not available. We use the available high-level event counters to construct a regression model relating these parameters to the cycle counts observed during cycle-accurate simulation in the training stage.
Our hypothesis is that each of the events recorded by the functional ISS contributes towards the total execution time of the program. Furthermore, we expect the average cost of events of the same type to be (more or less) constant, both within a single program run and across programs. This may not necessarily be the case for small programs, but for larger programs we expect "stable" average cycle counts for each type of instruction, each cache access etc. This motivates us to expect a linear relationship between the event counters x and the cycle count y. Once a regression line has been fitted to the set of training data, the fitted slope and intercept values can be used to predict the response for a specific value of x that was not included in the original set of observations. This is essentially an application of equation (1).
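The two-stage workflow of figure 1 can be summarised in a short sketch. The counter names and the dictionary-based simulator interface below are hypothetical stand-ins, not the actual SimIt-ARM API; they merely show how training profiles and cycle counts feed the regression solver, and how the resulting predictor is evaluated in the deployment stage.

```python
import numpy as np

# Hypothetical event-counter names reported by the functional ISS.
COUNTERS = ["ldr", "str", "mul", "total_insns", "icache_misses", "dcache_misses"]

def counter_vector(profile: dict) -> np.ndarray:
    """Order the event counters of one simulated run into the vector x."""
    return np.array([float(profile[c]) for c in COUNTERS])

def train(profiles: list, cycle_counts: list) -> np.ndarray:
    """Training stage: each benchmark contributes one data point (y, x);
    the regression solver returns the coefficient vector beta."""
    X = np.array([np.concatenate(([1.0], counter_vector(p))) for p in profiles])
    y = np.array(cycle_counts, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict(beta: np.ndarray, profile: dict) -> float:
    """Deployment stage: estimate the cycle count of a new program from the
    counters gathered by functional simulation alone."""
    return float(beta[0] + beta[1:] @ counter_vector(profile))
```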
4. Empirical Evaluation
In this section we discuss our evaluation approach and
present empirical results for an ARM v5 instruction set simulator and a comprehensive set of embedded applications.
4.1. Tools Infrastructure
We have evaluated our performance prediction methodology against an ARM v5 instruction set. For the simulation we have used the SimIt-ARM v2.1 functional and cycle-accurate instruction set simulators [23, 24], the latter of which is calibrated against an Intel StrongARM SA-1100 implementation [13]. (Strictly speaking, the SimIt-ARM simulator is cycle-approximate rather than cycle-accurate, as frequently stated; at the time of writing the more recent SimIt-ARM v3.0 release only contains a functional ISS, hence the use of the slightly older version 2.1.) This 32-bit core features a 5-stage pipeline, 16KB instruction and 8KB data caches, and MMU support. For integer benchmarks SimIt-ARM yields a timing accuracy of about 3%. Floating-point operations are not supported in hardware, but implemented in a software library. The benchmarks were compiled using a GCC 3.3.2 cross-compiler with optimisation level -O3, targeting the ARM v5 ISA. Each benchmark was simulated in a stand-alone manner, without an underlying operating system. All experiments have been conducted on a quad-core 3GHz Intel Xeon based host platform running Linux.
Details of the parameters extracted from the SimIt-ARM simulators are shown in table 1. In general, the parameters relate to instruction counts (in total and broken down across different instruction types), the memory system (number of memory pages, I-cache and D-cache accesses, I-TLB and D-TLB accesses, BIU accesses), and the internal operation state machine (OSM). Finally, we record the total cycle count and use this for training and accuracy evaluation.
71
11th International Workshop on Software & Compilers for Embedded Systems (SCOPES) 2008
| Counters | Description |
| --- | --- |
| x1 ... x30 | Instruction counters for mov, mvn, add, adc, sub, sbc, rsb, rsc, and, eor, orr, bic, cmp, cmn, tst, teq, mla, mul, smull, umull, ldr imm, ldr reg, str imm, str reg, ldm, stm, syscall, br, bl, fpe |
| x31, x32 | Total instructions, nullified instructions |
| x33 | Total 4K memory pages allocated |
| x34, x35 | Total I-Cache reads, read misses |
| x36, x37 | Total I-TLB reads, read misses |
| x38 ... x41 | Total D-Cache writes, write misses, reads, read misses |
| x42, x43 | Total D-TLB reads, read misses |
| x44 | Total BIU accesses |
| x45, x47 | Total allocated OSMs, retired OSMs |
| y | Total cycles |

Table 1. Overview of counters maintained in the functional and cycle-accurate SimIt-ARM instruction set simulators.
4.2. Benchmarks
Table 2 summarises the 183 programs taken from six benchmark suites that we have used in our experiments. This large number of programs is necessary to provide sufficient training data, but also to demonstrate the wide applicability of our cycle-approximation methodology. Most of the benchmarks represent embedded DSP and multimedia codes; the only notable exceptions are the pointer-intensive general-purpose codes from [3, 4], which we have included to evaluate the impact of cross-domain training.
4.3. Results
Model Selection. The selection of an appropriate regression model (linear, polynomial, various non-linear functions etc.) is a design parameter and has a critical impact on how well the regression function can describe the observed data. Our hypothesis outlined in section 3 has been that the cycle count is a linear function of the various counters maintained by the ISS. In order to test this hypothesis we compare the calculated cycle counts y_calculated with the observed cycle counts y_observed after performing regression over the entire data set (all programs and all counters except the observed cycle count). Figure 2(a) summarises this relationship for the complete set of benchmarks. Note that no prediction of yet unseen programs takes place at this point; we only analyse how well the constructed linear function is capable of describing all known data points.
The graph in figure 2(a) clearly shows the close match between the observed and calculated cycle counts. The data points are concentrated near the ideal straight line, with only a very few exceptional outliers, especially at the lower end of the scale. We have calculated the residuals and found an average error of 4.5%. A more detailed breakdown of the distribution of errors is shown in figure 2(b), where the relative error frequency is plotted in a histogram against the percentage error interval. This diagram shows that the vast majority of programs can be described with a very small error. In fact, for about 50% of all programs the error is less than 1%. For 75% of all programs the error is less than 8%, and only three programs have an error of ≥ 15%. The maximum error, however, is relatively large, with a value of 26.1%.
In summary, we have shown that a linear regression model is appropriate to describe the relationship between the observed counters and cycle counts. The introduction of a small error between the observed and calculated results is inevitable; however, we have shown that this error is typically very small (on average less than 5%). This gives us enough confidence in the chosen linear regression model for predicting cycle counts from the observed counters.
| Benchmark Suite | No. of Programs | Lines of Code (per Program) | Description |
| --- | --- | --- | --- |
| DSPstone [31] | 30 | ≈ 50-150 | Small DSP kernels |
| UTDSP [20] | 88 | ≈ 50-3500 | Small DSP kernels and applications |
| SWEET WCET [10] | 37 | ≈ 50-4300 | Worst case execution time benchmarks |
| MediaBench [18, 19] | 16 | ≈ 0.3-50K | Multimedia applications |
| Pointer-Intensive Benchmark Suite [3, 4] | 5 | ≈ 0.6-7.3K | Non-trivial pointer-intensive applications |
| Other [1] | 7 | ≈ 0.2-1.5K | Cryptography, Software Radio, Audio Processing |

Table 2. Overview of the benchmark suites used in our experiments.
[Figure 2. Evaluation of the linear regression model: (a) calculated cycle counts vs. observed cycle counts; (b) relative frequency histogram of the percentage error.]
Accuracy and Micro-Architectural Detail. Next, we evaluate the ability of our model to predict the cycle counts of new, i.e. previously unseen, programs. For this, we use a standard statistical Leave-One-Out Cross Validation approach, where a single observation is eliminated from the data set and used as the validation data, and the remaining observations are used as the training data. This process is repeated for all programs in the data set such that each program is used once as the validation data. The results of this cross validation process for the entire set of programs are summarised in figures 3, 4 and table 3.
Figures 3 and 4 show, similar to the diagram in figure 2(a), the predicted cycle count versus the observed cycle count. In general, better approximations of the cycle count result in data points closer to a straight line through the origin with unit slope. In four experiments we have performed leave-one-out cross validation based on different sets of parameters: instruction counters only; instruction and cache counters; instruction, cache and TLB counters; and all counters.
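A minimal sketch of the leave-one-out procedure (again assuming the counters and observed cycle counts are available as arrays; the function name and layout are illustrative only) is shown below. It returns one percentage error per program, from which the mean, standard deviation and maximum reported in table 3 can be derived.

```python
import numpy as np

def loocv_errors(counters: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Leave-one-out cross validation: each program is predicted once by a
    model trained on all remaining programs."""
    m = len(y)
    errors = np.empty(m)
    for i in range(m):
        keep = np.arange(m) != i
        X_train = np.hstack([np.ones((m - 1, 1)), counters[keep]])
        beta, *_ = np.linalg.lstsq(X_train, y[keep], rcond=None)
        y_pred = beta[0] + beta[1:] @ counters[i]
        errors[i] = 100.0 * abs(y[i] - y_pred) / y[i]
    return errors
```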
[Figure 3. Leave-One-Out Cross Validation: predicted cycle counts vs. observed cycle counts. (a) Instruction counters only; (b) instruction & cache counters.]
In figure 3(a) results are shown for the case in which only instruction counts are used as parameters in the regression model. The data points are clustered around the ideal straight line; however, some significant prediction errors are noticeable, especially towards the lower end of the scale. The average error and standard deviation are quite high, with values of 38.9% and 57.7, respectively. The maximum error is much larger still and reaches 518% for one of the smallest applications. The prediction accuracy recovers for longer running applications, though. This is not surprising, as minor "disturbances" in the behaviour have a larger relative impact on small applications, but average out over long program runs.
Combined instruction and cache access counters produce the results shown in figure 3(b). The prediction accuracy has significantly improved over the instructions-only scheme, and the mean absolute and maximum errors have been reduced to 10.5% (with a standard deviation of 13.0) and 66.4%, respectively. The inclusion of additional memory-related parameters (the TLB counters) further improves the prediction quality, as can be seen in figure 4(a). The average error has been reduced to 5.72%, and the maximum prediction error, observed for one of the smallest applications, is also significantly smaller (26.3%).
In a control experiment we have included additional micro-architectural data relating to the operation state machines modelling the processor pipeline behaviour. This information is typically not available in functional simulators. The corresponding results are shown in figure 4(b). Interestingly, this additional information has only a very small impact on the mean absolute error (a 0.28% improvement over instructions, cache & TLB). On the other hand, both the standard deviation and, more importantly, the maximum error increase. This suggests that including a larger number of parameters in the regression model might not always be beneficial. Care must be taken in choosing parameters and in balancing the number of training data points against the number of coefficients to be determined.
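Building on the loocv_errors sketch above, the four parameter sets can be compared as follows. The column groupings below are hypothetical; the real indices depend on how the counters of table 1 are laid out in the feature matrix.

```python
import numpy as np

# Hypothetical column index groups into the counter matrix.
SUBSETS = {
    "instructions":           list(range(0, 32)),
    "instructions+cache":     list(range(0, 32)) + [33, 34, 37, 38, 39, 40],
    "instructions+cache+tlb": list(range(0, 32)) + list(range(33, 43)),
    "all":                    list(range(0, 46)),
}

def compare_subsets(counters: np.ndarray, y: np.ndarray) -> None:
    """Run leave-one-out cross validation for each parameter set and print a
    table-3-style summary (mean, standard deviation and maximum error)."""
    for name, cols in SUBSETS.items():
        e = loocv_errors(counters[:, cols], y)
        print(f"{name:>24}: mean {e.mean():6.2f}%  std {e.std():6.2f}  max {e.max():7.2f}%")
```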
[Figure 4. Leave-One-Out Cross Validation: predicted cycle counts vs. observed cycle counts. (a) Instruction, cache & TLB counters; (b) all counters.]
| Counters/Parameters | Mean Abs. Error | Std. Dev. | Max. Error |
| --- | --- | --- | --- |
| Instructions | 38.9% | 57.7 | 518% |
| Instructions & Cache | 10.5% | 13.0 | 66.4% |
| Instructions, Cache & TLB | 5.72% | 7.12 | 26.31% |
| All (incl. OSMs) | 5.44% | 7.37 | 44.66% |

Table 3. Mean absolute error, standard deviation and maximum error for different parameter sets.
Scalability and Training Minimisation. The acceptance of a cycle-approximate simulation methodology that requires offline training will largely depend on the amount of training needed before the deployment stage can be entered. Ideally, we would like to be able to train our model on a small number of small benchmarks and then apply it to larger applications whilst retaining its full accuracy. In this section we therefore evaluate how much training is required and how well the performance of larger applications can be estimated if only small benchmarks are used for training.
To answer the question of how much training is required to accurately predict the performance of larger applications, we have chosen the N = 90, ..., 170 smallest applications to estimate the coefficients in our regression model. Based on this training data we have then predicted the execution time of the ten largest applications. Figure 5 summarises our findings and shows the median of the prediction error for the 10 largest programs for the various values of N. (The median is used as the average here because individual extreme outliers would have an over-proportional effect on the arithmetic mean.) For the different numbers of training examples the error is relatively stable, peaking at 4.1% for the largest training set size. This shows that with limited training data good prediction accuracy is still achievable. For values of N < 90, however, we have observed the error to increase very quickly, making any meaningful prediction impossible. This suggests that there exists a critical threshold in the size of the training set below which regression-based prediction should not be used.
The results demonstrate that the amount of training can be safely reduced to a sufficiently small set of small programs. Still, the constructed predictor is capable of predicting the performance of programs several orders of magnitude larger than the largest program in the training data set. These results are encouraging and we are investigating ways of identifying a minimal, but still representative, set of training programs with the goal of further minimising the training cost.
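The scalability experiment can be sketched as below, under the assumption that a per-program size measure (e.g. dynamic instruction count) is available to order the benchmarks; the function name and array layout are illustrative only.

```python
import numpy as np

def scalability_error(counters: np.ndarray, y: np.ndarray,
                      sizes: np.ndarray, n_train: int) -> float:
    """Train on the n_train smallest programs and return the median percentage
    error over the 10 largest programs."""
    order = np.argsort(sizes)
    train, test = order[:n_train], order[-10:]
    X_train = np.hstack([np.ones((n_train, 1)), counters[train]])
    beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
    y_pred = beta[0] + counters[test] @ beta[1:]
    return float(np.median(100.0 * np.abs(y[test] - y_pred) / y[test]))
```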
[Figure 5. Scalability: median of the prediction error for the 10 largest applications after training on the N smallest applications.]
[Figure 6. Domain Specialisation: prediction accuracy for general-purpose pointer-intensive codes after training on embedded applications.]
Domain Specialisation. We are interested in how the choice of training examples from one or more application domains affects the prediction accuracy on another domain. In order to test this "domain specialisation", a benchmark suite [3, 4] comprising pointer-intensive general-purpose applications has been included in our study. Training is based on all programs except these pointer-intensive codes, and predictions are performed as before. In figure 6 the results for the five programs in this experiment are plotted. The prediction errors range between 0.17% for the yacr2 program (a symbolic channel router) and 7.1% for ft (min-span calculation). yacr2 is dominated by pointer and array dereferencing and arithmetic, but performs only limited dynamic storage allocation. ft exhibits a different behaviour, with a large amount of dynamic storage allocation and deallocation throughout the life of the program, but very little pointer arithmetic. The dynamic memory allocation of ft is likely to introduce "irregularities" that make it more difficult to predict than the other, less dynamic codes. On average, the prediction error on the pointer-intensive codes is 2.73% and, thus, slightly smaller than the error on the codes from the same domain as the training data. This result suggests there might be little impact of the training domain on the prediction accuracy. However, we do not have sufficient evidence at this point to confirm this for other domains.
Speed. The presented cycle-approximate ISS methodology shifts much of the work into the training stage, where training programs are executed on both a functional and a cycle-accurate simulator before the regression coefficients can be determined. Clearly, the cost of training is dominated by the cycle-accurate simulation, which can take several hours to run depending on the complexity of the simulated processor core and the specific choice of training programs. The calculation of the regression coefficients, however, is fast and typically adds only a few seconds to the overall training time.
More important than the training time is the time required for approximating the cycle count of a new program in the deployment stage. As prediction only involves the evaluation of a linear function at a single point (after the parameters of this point have been determined by functional simulation of the program), this operation is very fast; in fact, it is so short that we have not been able to measure it using the standard system timers.
5. Related Work
Instruction set simulation is an active research area and researchers have focused, in particular, on fast retargetable [21, 25] and compiled [26] or JIT-compiled [29] instruction set simulators. More recently, hybrid compiled simulation, combining interpretive and compiled simulation, has been presented in [27, 29].
Most relevant to our work is [22], where an artificial neural network has been trained to estimate cycle counts of a
new program. In contrast, our work is based on statistical
regression and as such it is more open to mathematical analysis. Apart from this technical difference our methodology
yields a significantly higher prediction accuracy in terms
of both the average (5.72% vs 11.4%) and maximum error
(26.3% vs 103%).
Statistical simulation [9] is based on the simulation of
small, synthetic program traces derived from statistical program profiles rather than the actual program itself. It can
yield accurate performance estimates several orders of magnitude faster than full simulation. Our work shares some
commonality with statistical simulation in that it also relies on statistical profiles (e.g. various counters maintained
by a functional simulator). However, unlike statistical simulation, we do not construct a synthetic trace that is then
fed into a cycle-accurate simulator, but compute the cycle
count directly based on training data and a linear regression
model.
Program similarity is exploited in [11], where the goal
is to transform a set of micro-architecture independent runtime characteristics of a program into relative performance
differences based on the stored characteristics of a previously profiled benchmark suite. This approach shares the
“training” concept with our work, but fails to provide the
same prediction accuracy. This may be due to the chosen
set of characteristics, the specific model, too few training examples or noise in the measurements. (This work does not use cycle-accurate simulators for training, but real hardware, and performance measurements may be affected by noise due to I/O and OS activity.)
The construction and use of linear regression models
for the performance analysis of a super-scalar processor is
the subject of [14]. In this work an estimator of the expected CPI performance is constructed based on 26 microarchitectural parameters. Similarly, regression models are
used in [17] to construct micro-architectural performance
and power predictors.
[5] uses compilation onto a virtual instruction set, which
is subsequently translated into C with timing annotations
and then executed. This early work is extended and formalised in [6]; however, both publications lack convincing empirical results. Micro-profiling [15] extends these
ideas and is based on fine-grained instrumentation of a
generic low-level and executable IR and is designed to support the early stages of an SoC design cycle. Whilst shown
to work well for two small applications and a simple MIPS
core with a flat memory hierarchy, it is questionable whether this
approach will scale, e.g. when a more complex memory
hierarchy is introduced.
Compiler-based approaches to performance prediction are covered in e.g. [7, 8, 12], where the primary goal is to support the compiler in selecting "good" transformations and
parameters. In this scenario the preservation of the performance trend (“better” or “worse”) is of higher importance
than absolute accuracy.
6. Summary, Future Work and Conclusions
Summary In this paper we have developed a cycle-approximate instruction set simulation methodology combining the benefits of functional and cycle-accurate simulation. Using prior training and regression-based performance prediction, our approach is able to exploit the statistical information typically provided by cache-model-enhanced functional simulators to estimate cycle counts with
an average error of less than 5.8%. During the prediction
stage our technique does not rely on a detailed model of the
processor pipeline, but only utilises instruction and memory access counters. Thus, by reducing the level of microarchitectural detail in the simulation we are able to generate
performance results at the speed of functional simulation
whilst retaining much of the accuracy of cycle-accurate simulation.
We have evaluated our technique against an extensive
set of benchmarks and an ARM v5 implementation and
demonstrated its effectiveness in predicting the execution
time of larger programs even with limited training. No difference was found between predictions made with training
data from the same versus different application domains,
indicating that the specific selection of training examples is
not critical for our technique to work.
Future Work We will extend the presented
work in two directions: (a) Reducing the maximum prediction error through advanced statistical methods that may,
in addition to the cycle count, provide the user with a confidence value characterising the “uncertainty” of the generated result, and (b) broader evaluation. For the latter
we plan to apply our methodology across a set of different
embedded platforms, including more complex commercial
high performance RISC, digital signal and multimedia processors. In addition, we are working on the integration of
our technique into a publicly available simulator and its evaluation against an industrial benchmark suite (e.g. EEMBC).
Conclusions Our work contributes towards faster turnaround cycles in ASIP design space exploration and in making automatic “compiler-in-the-loop” code optimisation a
viable alternative to manual code tuning. In many of these
situations cycle-accurate results are not strictly necessary
and "good" approximations are sufficient. In addition, our approach is particularly useful to support fast simulation of complex, long-running embedded applications such as state-of-the-art multimedia codecs, where the slow speed of cycle-accurate simulation is often prohibitive.
References
[1] S. Amarasinghe. StreamIt - Benchmarks. http://cag.csail.mit.edu/streamit/shtml/benchmarks.shtml, 2007.
[2] ARC. ARC VTOC Tool. http://www.arc.com/software/simulation/vtoc.html, 2007.
[3] T. M. Austin. Pointer-intensive benchmark suite. http://www.cs.wisc.edu/~austin/ptr-dist.html, 2007.
[4] T. M. Austin, S. E. Breach, and G. S. Sohi. Efficient detection of all pointer and array access errors. Technical report, University of Wisconsin, 1993.
[5] J. R. Bammi, E. Harcourt, W. Kruijtzer, L. Lavagno, and M. T. Lazarescu. Software performance estimation strategies in a system-level design tool. In Proceedings of CODES'00, 2000.
[6] G. Bontempi and W. Kruijtzer. A data analysis method for software performance prediction. In Proceedings of DATE'02, 2002.
[7] P. C. Diniz. A compiler approach to performance prediction using empirical-based modeling. Lecture Notes in Computer Science, 2659, 2003.
[8] C. Dubach, J. Cavazos, B. Franke, M. O'Boyle, G. Fursin, and O. Temam. Fast compiler optimisation evaluation using code-feature based performance prediction. In Proceedings of Computing Frontiers'07, 2007.
[9] L. Eeckhout, S. Nussbaum, J. E. Smith, and K. D. Bosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, 2003.
[10] J. Gustafsson. The WCET tool challenge 2006. In Proceedings of the 2nd International Symposium on Leveraging Applications of Formal Methods (ISOLA'06), 2007.
[11] K. Hoste, A. Phansalkar, L. Eeckhout, A. Georges, L. K. John, and K. D. Bosschere. Performance prediction based on inherent program similarity. In Proceedings of PACT'06, 2006.
[12] C.-H. Hsu and U. Kremer. IPERF: A framework for automatic construction of performance prediction models. In Proceedings of the Workshop on Profile and Feedback-Directed Compilation (PFDC'98), 1998.
[13] Intel. Intel StrongARM SA-1100 Microprocessor - Developer's Manual. http://www.intel.com, 1999.
[14] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil. Construction and use of linear regression models for processor performance analysis. In Proceedings of the 12th International Symposium on High-Performance Computer Architecture (HPCA), 2006.
[15] T. Kempf, K. Karuri, S. Wallentowitz, G. Ascheid, R. Leupers, and H. Meyr. A SW performance estimation framework for early system-level design using fine-grained instrumentation. In Proceedings of DATE'06, 2006.
[16] T. Kisuki, P. Knijnenburg, and M. O'Boyle. Combined selection of tile sizes and unroll factors using iterative compilation. In Proceedings of PACT'00, 2000.
[17] B. C. Lee and D. M. Brooks. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In Proceedings of ASPLOS'06, 2006.
[18] C. Lee. MediaBench. http://euler.slu.edu/~fritts/mediabench/mb1/, 2007.
[19] C. Lee, M. Potkonjak, and W. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture, 1997.
[20] C. G. Lee. UTDSP benchmark suite. http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html, 1998.
[21] A. Nohl, G. Braun, O. Schliebusch, R. Leupers, H. Meyr, and A. Hoffmann. A universal technique for fast and flexible instruction-set architecture simulation. In DAC '02: Proceedings of the 39th Conference on Design Automation, pages 22-27, New York, NY, USA, 2002. ACM.
[22] M. S. Oyamada, F. Zschornack, and F. R. Wagner. Accurate software performance estimation using domain classification and neural networks. In Proceedings of SBCCI'04, 2004.
[23] W. Qin. SimIt-ARM. http://simit-arm.sourceforge.net, 2007.
[24] W. Qin and S. Malik. Flexible and formal modeling of microprocessors with application to retargetable simulation. In Proceedings of Design Automation & Test in Europe (DATE), 2003.
[25] M. Reshadi, N. Bansal, P. Mishra, and N. Dutt. An efficient retargetable framework for instruction-set simulation. In Proceedings of CODES+ISSS'03, 2003.
[26] M. Reshadi, P. Mishra, and N. Dutt. Instruction set compiled simulation: A technique for fast and flexible instruction set simulation. In Proceedings of the 2003 Design Automation Conference (DAC), 2003.
[27] M. Reshadi, P. Mishra, and N. Dutt. Hybrid compiled simulation: An efficient technique for instruction-set architecture simulation. ACM Transactions on Embedded Computing Systems (TECS), 2007. To appear.
[28] W. Snyder, P. Wasson, and D. Galbi. Verilator. http://www.veripool.com/verilator.html, 2007.
[29] N. Topham and D. Jones. High speed CPU simulation using JIT binary translation. In Proceedings of MoBS - Workshop on Modeling, Benchmarking and Simulation, 2007.
[30] S. J. Weber, M. W. Moskewicz, M. Gries, C. Sauer, and K. Keutzer. Fast cycle-accurate simulation and instruction set generation for constraint-based descriptions of programmable architectures. In Proceedings of CODES+ISSS'04, 2004.
[31] V. Zivojnovic, J. Martinez, C. Schläger, and H. Meyr. DSPstone: A DSP-oriented benchmarking methodology. In Proceedings of ICSPAT'94, 1994.