Reliable Estimation of Execution Time of Embedded SW: A Statistical Approach
Grant Martin, Paolo Giusto and Ed Harcourt
Front End Products: System Level Design and Verification
Cadence Design Systems, Inc.
26 October 2000

Agenda
- Motivation for SW estimation in HW-SW codesign
- Context – various estimation techniques
- The statistical approach
- Preliminary results
- Ideas for the future

Motivation
[Figure: function-architecture co-design flow, as in Cadence VCC – (1) capture the system function and the system architecture and verify the function by functional simulation; (2) map the software onto processors; (3) performance simulation produces an estimate of execution time; (4) communication refinement, then flow to implementation]
- SW estimation is needed to make HW-SW tradeoffs and to estimate the 'fit' of the SW to the processor(s)
- Goals: improve the quality of the estimators and provide quality/speed options

Context: SW Estimation Techniques
- Virtual Processor Model: dynamic; accuracy +/- 20% (?); works on 'virtual' or 'fuzzy' instructions; modelling effort: data sheet only; speed: very fast
- Compiled Code ISS: dynamic; accuracy +/- 5%; works on the actual instruction stream; modelling effort: extensive, many months (may optimise for speed by cutting some corners); speed: fast
- Interpretive ISS: dynamic; accuracy +/- 2%; works on the actual instruction stream; modelling effort: less than a compiled-code ISS; speed: slow
- Mathematical kernel function: static; accuracy can be as low as a few %; no instruction abstraction (must have no data dependencies); modelling effort: a few hours to weeks; speed: fastest
- Back-annotation: static; accuracy a few % if data dependency is low; no instruction abstraction (measure code on an ISS or HW); modelling effort: a few hours to weeks, late in the process; speed: very fast

Virtual Processor Model Estimation
- ANSI C input (specifies behaviour and I/O):
    v__st_tmp = v__st;
    startup(proc);
    if (events[proc][0] & 1) goto L16;
- Analyse basic blocks and compute delays in terms of virtual machine instructions (ld ld op ld li op ts-br)
- Generate new C with delay annotations, compile the generated C, and run it natively on the host
- Performance estimation with delay annotations:
    v__st_tmp = v__st;
    __DELAY(LI+LI+LI+LI+LI+LI+OPc);
    startup(proc);
    if (events[proc][0] & 1) {
      __DELAY(OPi+LD+LI+OPc+LD+OPi+OPi+IF);
      goto L16;
    }
- Per-instruction delays come from an architecture characterisation (a minimal sketch of how such a delay macro could be made executable appears at the end of this VPM discussion)
- An evolution of the approaches used in POLIS

Virtual Processor Model Example
- XXX Virtual Instruction Set Model (basis file), derived from the processor databook:
    LD,3   LI,1   ST,3
    OP.c,3  OP.s,3  OP.i,4  OP.l,4  OP.f,4  OP.d,6
    MUL.c,9  MUL.s,10  MUL.i,18  MUL.l,22  MUL.f,45  MUL.d,55
    DIV.c,19  DIV.s,110  DIV.i,118  DIV.l,122  DIV.f,145  DIV.d,155
    IF,5   GOTO,2   SUB,19   RET,21
- Virtual instruction classes: Load (LD), Load Immediate (LI), Store (ST), simple and complex ALU operations (OP.n, MUL.n, DIV.n), Test and Branch (IF), Unconditional Branch (GOTO), Branch to Subroutine (SUB), Return from Subroutine (RET)
- Register allocation technique – None, LocalScalarsOnly, LocalScalarsAndParameter, ParameterOnly

Problems with VPM approach
- Cannot (accurately) incorporate target-dependent compiler optimisations
  – special HW, register set
- Does not model processor state
  – pipelines, register occupancy
- Very unclear how to handle superscalar, VLIW, etc.
- Improvements to model the target HW and compilers slow it down
  – can add 'statistical' fudge factors, but these are hard to characterise
- 'Field of use' vs. 'field of abuse' hard to characterise
  – most suitable for control-dominated code, not computation-dominated code (compiler optimisations and special HW effects)
- The databook approach is VERY conservative – worst-case assumptions
- What is the confidence level for the estimation?
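To make the delay-annotation mechanism above concrete, here is a minimal C sketch, assuming the generated C is run natively with a global cycle accumulator. The __DELAY name and the basis-file constants are taken from the slides; the est_cycles counter and the surrounding scaffolding are illustrative assumptions, not the actual VCC/POLIS implementation.

    /* Minimal sketch: each basis-file entry becomes a cycle constant, and
     * __DELAY adds a basic block's estimated cost to a global counter while
     * the functional C code runs natively on the host.
     * The est_cycles counter and this exact scaffolding are illustrative
     * assumptions; only the constants and macro name come from the slides. */
    #include <stdio.h>

    static unsigned long est_cycles = 0;          /* accumulated estimate  */

    /* Cycle costs from the example basis file (databook-derived) */
    #define LD   3   /* load                 */
    #define LI   1   /* load immediate       */
    #define ST   3   /* store                */
    #define OPc  3   /* simple ALU op, char  */
    #define OPi  4   /* simple ALU op, int   */
    #define IF   5   /* test and branch      */
    #define GOTO 2   /* unconditional branch */

    #define __DELAY(cost) (est_cycles += (cost))  /* per-basic-block annotation */

    int main(void)
    {
        int v__st = 1, v__st_tmp;

        /* annotated basic block, as in the generated code shown above
         * (the startup()/events[] calls are omitted to keep this self-contained) */
        v__st_tmp = v__st;
        __DELAY(LI + LI + LI + LI + LI + LI + OPc);

        printf("estimated cycles so far: %lu\n", est_cycles);
        return 0;
    }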
Compiled Code ISS
[Figure: compiled-code ISS tool flow – Target.c is compiled by the target CC (with target options) to Target.s; the CC-ISS translates Target.s and Other.s into Host.c and Host.o; the host CC and linker combine these with libraries into Host.exe]
- One approach to a compiled-code ISS translates target assembler code into portable C code that, when executed on the host, will reasonably accurately emulate the timing, functionality and memory-transaction behaviour of the target architecture for the specific program (Target.c)

Problems with Compiled Code-ISS Modeling!
- Generation of compiled-code ISS models and tooling is often very time-consuming
  – many person-months to person-years
- Slower than the VPM approach
  – modeling of internal processor state, HW resources, etc. comes at a cost
  – may be 2-5X slower (various approaches available)
- Some modelling compromises made to achieve speed
  – the interpretive ISS approach can be more accurate
  – BUT different levels of accuracy can be achieved at slower speed
- Areas not proven well: modeling parallel resources and HW scheduling

Interpretive ISS
- Executes the code stream in interpretive fashion on a model of the processor
  – usually a C/C++ based model
  – models processor state and all internal resources
  – can interact with memory and bus models to achieve high accuracy
- Models very often produced by IP providers or 3rd parties
  – usually available either before the IP or with IP introduction

Problems with Interpretive ISS
- Very slow – at least one order of magnitude slower than a compiled-code ISS and two orders of magnitude (100X) slower than the VPM approach
  – simulates instruction fetch, decode and dispatch dynamically at runtime
  – models internal processor resources and state
- Can be used to validate other approaches for accuracy and to understand their 'field of use'
  – but unlikely to be useful for significant dynamic execution of embedded SW

Mathematical Kernel Functions (e.g. Dataflow)
[Figure: SPW/Cossap/Ptolemy dataflow model with blocks A (FIR), B (multiplier), C (TONE) and D (FFT), token rates of 1 and 16, and the schedule (16 A C B) D]
- Kernel latencies are pre-characterised per processor (e.g. AT&T DSP16xx, Intel Pentium MMX, Motorola 56k); the numbers shown are for the Motorola 56k:
  – FIR(t) = 14 + 4(t - 1)
  – TONE() = 52
  – MULT() = 4
  – FFT(p) = 122 + 32p
  – VECTSUM(n) = 5 + 2*n
- Cost = 16*FIR(128) + 16*TONE() + 16*MULT() + FFT(256)
  (a small worked sketch of this cost computation follows the back-annotation discussion below)

Problems with Mathematical Kernels
- Not at all suitable for data-dependent control code
- Best for fixed processing of fixed sample sizes
  – has been used for characterising DSPs – e.g. BDTI (Berkeley Design Technology)
  – an FFT of p points on a Motorola 56k has a latency of 122 + 32p cycles
  – an FIR filter on a Motorola 56k has a latency of 14 + 4(t - 1) cycles
- Effective modelling depends on large families of pre-characterised kernels
  – finding the right balance of accuracy and granularity is a problem
- Needs in-depth application-specific knowledge to apply
- Cannot be easily automated

Manual Back-annotation approach
- Measure generated code on the target processor with the target compiler
  – compilation optimisations handled
  – target-dependent HW handled
  – can use an accurate cycle-counting instruction set simulator or real HW
  – could also use emulation systems
- Back-annotate the source code for use in co-design estimation
  – delay models applicable to the whole task (black-box approach)
  – annotate basic blocks of the task with measured numbers
- Possible to combine the measured and kernel-function approaches

Problems with Manual Back-annotation
- Extensive characterisation effort required
- Works best on computation-intensive code
  – data-dependent control code is hard to characterise, or it is tedious to annotate every block
- Requires either real HW (very late in the process) or an accurate ISS
  – an accurate ISS may depend on a detailed RTL-level design
  – not suitable for early speculation on processor features
- Manual!
  – not easy to automate
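Returning to the mathematical-kernel technique above, here is a minimal C sketch of how the pre-characterised kernel latencies combine into a task cost under the schedule (16 A C B) D. Only the Motorola 56k formulas and the schedule come from the slides; the function names and program structure are illustrative assumptions.

    /* Minimal sketch of the mathematical-kernel cost computation from the
     * dataflow example above, using the Motorola 56k kernel latencies.
     * Function and variable names are illustrative assumptions. */
    #include <stdio.h>

    /* Pre-characterised kernel latencies (cycles) for the Motorola 56k */
    static long fir_cycles(long taps)   { return 14 + 4 * (taps - 1); }
    static long tone_cycles(void)       { return 52; }
    static long mult_cycles(void)       { return 4; }
    static long fft_cycles(long points) { return 122 + 32 * points; }

    int main(void)
    {
        /* Schedule (16 A C B) D: 16 firings of the FIR, TONE and multiplier
         * blocks, followed by one 256-point FFT.                           */
        long cost = 16 * fir_cycles(128)
                  + 16 * tone_cycles()
                  + 16 * mult_cycles()
                  + fft_cycles(256);

        printf("estimated cost: %ld cycles\n", cost);
        return 0;
    }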
Can we improve VPM?
- The databook approach to VPM estimators is very conservative
  – e.g. for processor XXX, LD = 3 cycles, but compilers can optimise this down to a little over 1 cycle on average
  – SUB and RET with 15 user registers: conservative estimates of 19 and 21 cycles (4 + active registers, 6 + active registers)
  – actual usage is more like 5-10 active registers (statistically)
- The databook approach, or any kind of artificial 'calibration suite', is applied to all kinds of embedded SW
  – but embedded SW is not all the same – control vs. computation
  – application-specific features, e.g. wireless, wired comms, automotive, multimedia image processing

Abandon Causality, Embrace Correlations
- The VPM approach based on databook cycle counts assumes a causal relationship between the presence of a virtual instruction in the 'fuzzy' compilation and some (fixed) number of cycles in the task cycle count, if the virtual instruction lies in an executed basic block
  – conservative assumptions on cycle counts
  – no application-domain-specific characteristics
  – cannot account for target compiler optimisations
- A statistical approach abandons the idea of causal relationships
  – rather, it seeks to find correlations between characteristics of the source code and the overall cycle count, and uses them to build better predictor equations

Source Code characteristics
- Need dynamic characteristics, not static ones
- Obvious source: the virtual instructions from the VPM model!
- But rather than assume that each instruction 'attracts' a fixed number of cycles in the task cycle count, seek correlations between virtual instruction frequencies and the overall cycle count
- Statistical technique used:
  – stepwise multiple linear regression
  – virtual instruction counts are not really all independent variables; use the correlations between them to reduce the number of variables used in the predictor equation
- Note that this can subsume compiler effects to some extent

Statistical Benchmarking approach
- Identify a benchmark set of tasks: the sample set
  – application-domain specific
- Run the tasks through the VPM instruction-frequency counting
- Run the tasks on a cycle-accurate ISS or on HW
- Reduce the VPM instructions to a set of independent variables
  – using correlation analysis
- Run a stepwise and a normal multiple linear regression on the sample
- Produce a predictor equation with quality metrics
- Apply it to new code
  – drawn from the same application domain (should do reasonably)
  – other domains (may do poorly)
- Determine a discriminating function that can be used to test code to see if it is from the same domain as the sample set

Stepwise Multiple Linear Regression
- Regression: Y = B + Σi Ci*Xi, where
  – Y = the dependent variable (to be predicted)
  – B = the intercept
  – i ∈ I, where I = the set of orthogonal (independent) virtual instructions (the independent variables)
  – Ci = the coefficient for virtual instruction i
  – Xi = the count of virtual instruction i
- Stepwise: independent variables are added one at a time
  – each must pass a test of statistical significance
  – when no more variables can be added, the regression stops
- The sample size should be larger than the number of independent variables; otherwise there are more unknowns than samples and the fit is not unique

Regression
[Figure: scatter plot of Y against X, each point one sample, with intercept B and the fitted regression line]
- Simple linear regression: Y = B + mX
- Fit by minimising the sum of squared errors: SSE = Σi (Ysample,i - Ypredicted,i)²
- Measure the 'quality' of the predictor equation via R², the square of the correlation coefficient
- R² is defined as "the proportion of the variation of Y that is explained (accounted for) by the variation in the X's"
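To illustrate the least-squares machinery just described, here is a minimal C sketch that fits Y = B + mX to a handful of (instruction count, cycle count) pairs and reports R². The sample data and function names are illustrative assumptions, not the benchmark data used in the study.

    /* Minimal sketch of ordinary least squares for Y = B + m*X and the R^2
     * quality metric described above.  The data below is made up for
     * illustration; it is not the benchmark data from the study. */
    #include <stdio.h>

    static void fit_line(const double *x, const double *y, int n,
                         double *b, double *m, double *r2)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];  sy += y[i];
            sxx += x[i] * x[i];  sxy += x[i] * y[i];
        }
        *m = (n * sxy - sx * sy) / (n * sxx - sx * sx);  /* slope     */
        *b = (sy - *m * sx) / n;                         /* intercept */

        /* R^2 = 1 - SSE/SST: proportion of the variation in Y explained by X */
        double sse = 0, sst = 0, ymean = sy / n;
        for (int i = 0; i < n; i++) {
            double pred = *b + *m * x[i];
            sse += (y[i] - pred) * (y[i] - pred);
            sst += (y[i] - ymean) * (y[i] - ymean);
        }
        *r2 = 1.0 - sse / sst;
    }

    int main(void)
    {
        /* x: LD counts per task, y: measured cycle counts (illustrative) */
        double x[] = { 20, 20, 22, 25, 18, 21 };
        double y[] = { 215, 208, 226, 240, 198, 220 };
        double b, m, r2;

        fit_line(x, y, 6, &b, &m, &r2);
        printf("Cycles = %.1f + %.2f*LD   (R^2 = %.2f)\n", b, m, r2);
        return 0;
    }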
Note on R²
- R² does not imply causality
- Rather, it can be used to gauge the 'quality' of the resulting predictor equation
- An R² of 1.00 means a 'perfect' predictor equation is possible
  – we want R² as high as possible, but respectable results can be obtained with a low R²
- Correlation between a set of independent variables and a dependent variable is a study in variability, not causality
- Our hypothesis is that the resulting predictor can be used on new samples (code) drawn from the same population (application domain)

First Study: Set of 35 'Automotive' benchmarks
- Control dominated; not much computation
- Divided randomly into a sample set of 18 and a control set of 17
- Partial excerpt of the virtual instruction counts (15 of the tasks, one column per task):
    LD    20 20 20 20 20 20 20 20 20 20 20 20 22 20 20
    LI     2  2  5  4  2  5  4  2  2  2  2  2  4  4  2
    ST     2  2  2  4  2  2  4  2  2  2  2  2  3  4  2
    OPi   64 64 56 61 64 57 62 59 59 59 59 59 64 62 63
    MULi   2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
    DIVi   1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
    IF    25 25 21 25 26 22 25 23 23 23 23 23 26 25 25
    GOTO   0  0  0  1  2  1  1  1  1  1  1  1  2  2  1
    SUB    1  1  3  1  1  3  1  1  1  1  1  1  2  1  1

First step: correlation of independent variables
- Pearson correlations section (row-wise deletion), values rounded to three decimals:
            LD      LI      ST     OPi    MULi   DIVi    IF     GOTO    SUB     RET   Cycles
    LD     1.000   0.013  -0.679   0.938  0.000  0.000   0.900   0.409  -0.213  -0.213   0.707
    LI     0.013   1.000   0.418  -0.016  0.000  0.000   0.005   0.075   0.689   0.689  -0.116
    ST    -0.679   0.418   1.000  -0.534  0.000  0.000  -0.440  -0.022   0.049   0.049  -0.503
    OPi    0.938  -0.016  -0.534   1.000  0.000  0.000   0.992   0.392  -0.362  -0.362   0.688
    MULi   0.000   0.000   0.000   0.000  0.000  0.000   0.000   0.000   0.000   0.000   0.000
    DIVi   0.000   0.000   0.000   0.000  0.000  0.000   0.000   0.000   0.000   0.000   0.000
    IF     0.900   0.005  -0.440   0.992  0.000  0.000   1.000   0.445  -0.400  -0.400   0.662
    GOTO   0.409   0.075  -0.022   0.392  0.000  0.000   0.445   1.000  -0.224  -0.224   0.155
    SUB   -0.213   0.689   0.049  -0.362  0.000  0.000  -0.400  -0.224   1.000   1.000  -0.313
    RET   -0.213   0.689   0.049  -0.362  0.000  0.000  -0.400  -0.224   1.000   1.000  -0.313
    Cycles 0.707  -0.116  -0.503   0.688  0.000  0.000   0.662   0.155  -0.313  -0.313   1.000
- Cronbach's alpha = 0.599; standardized Cronbach's alpha = 0.383
- (MULi and DIVi are constant across the samples, so their correlations are reported as zero)

Independent variable selection
- In this sample set, MULi and DIVi were constant over the samples
  – no variability implies nothing to contribute to cycle variability, so drop them
- SUB and RET were perfectly correlated (no surprise!)
- LD, OPi and IF were highly correlated (over 90%)
  – only need one of them
- Therefore, the regression was run with:
  – LD, LI, ST, GOTO and SUB
  – 5 independent variables, with a sample size of 18

Regression Results and Interpretation
- R² = 0.57
- Cycles = 219 + 1.3*LD + 10.9*LI - 10.2*ST - 5.2*GOTO - 21.3*SUB
- Note: this is a pure "predictor equation"
  – the intercept does not have to make 'sense'
  – the coefficients have NO operative meaning
  – positive and negative correlations can't be 'interpreted'
  – task cycle counts ranged from 151 to 246; many of them are less than the intercept of 219!
- When the predictor was applied to the control group of 17, the error ranged from -16% to +3%
- When the predictor was back-applied to the sample set, the error ranged from -8% to +13%
- The data-sheet model applied to all 35 gave errors of -10% to +45% (conservative)
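As an illustration of how such a predictor equation would be used, here is a minimal C sketch, assuming per-task virtual-instruction counts are already available from the VPM front end. The coefficients are the ones reported above; the struct layout, the example counts and the "measured" cycle count are illustrative assumptions.

    /* Minimal sketch of applying the fitted predictor equation
     * Cycles = 219 + 1.3*LD + 10.9*LI - 10.2*ST - 5.2*GOTO - 21.3*SUB
     * to one task's virtual-instruction counts.  The struct layout, the
     * example counts and the measured value are illustrative assumptions;
     * only the coefficients come from the regression reported above. */
    #include <stdio.h>

    struct vpm_counts {                /* dynamic virtual-instruction counts */
        int ld, li, st, goto_, sub;
    };

    static double predict_cycles(const struct vpm_counts *c)
    {
        return 219.0 + 1.3 * c->ld + 10.9 * c->li
                     - 10.2 * c->st - 5.2 * c->goto_ - 21.3 * c->sub;
    }

    int main(void)
    {
        /* illustrative counts, of the same order as the excerpt above */
        struct vpm_counts task = { .ld = 20, .li = 2, .st = 2, .goto_ = 0, .sub = 1 };
        double measured = 230.0;       /* illustrative ISS cycle count */

        double est = predict_cycles(&task);
        printf("predicted %.0f cycles, measured %.0f (error %+.1f%%)\n",
               est, measured, 100.0 * (est - measured) / measured);
        return 0;
    }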
Stepwise results
- Running the stepwise regression adds only LD to the equation, with an R² of 0.500 (in other words, LD variability explains far more of the cycle-count variability than any of the other variables)
- Predictor equation:
  – Cycles = 146 + 3.6*LD
- When applied to the control set of 17, we get errors of -16% to +8% (a 24% range)
  – when back-applied to the sample set of 18, we get errors of -8% to +14% (a 22% range)
- The equation is very robust: since LDs are present in all programs, unusual instructions (e.g. MUL.d's) do not skew the results
- However, it would do very poorly on mathematical code

Residual Errors
[Figure: histogram of the residuals of Cycles, ranging from roughly -30 to +30 cycles]

Residual Errors
[Figure: normal probability plot of the residuals of Cycles against expected normal scores]

Larger Sample Set
- We took the 35 'automotive' control samples, added 6 'Esterel' examples (believed control dominated) and 4 other 'control'-oriented pieces of code, for a sample of 45 benchmarks
- Multiple regression was applied, and then we manipulated the set of independent variables (through 'art') to get the predictor:
  – Cycles = 75 + (OPi + OPc) + 3.4*IF + 20*SUB
- On back-substitution, the error range was -30% to +20%
- There is some doubt that the sample set was really drawn from similar enough domains to be meaningful, BUT…

An operative interpretation of the equation
- If Cycles = 75 + (OPi + OPc) + 3.4*IF + 20*SUB, then one might interpret (note: this is statistically invalid) that each task takes:
  – 75 cycles to start and finish (prologue and epilogue)
  – 1 cycle for each basic integer (OPi) and character (OPc) operator
  – 3.4 cycles on average for each branch (IF)
  – 20 cycles for each subroutine call and return
  – this last implies an average of 5 user registers in use
- Although invalid, this predictor 'makes some kind of sense' (æsthetically and psychologically)

Discriminating Code Application Domains
- This study has only just started
- Basic idea:
  – find a discriminating test to determine whether a predictor equation derived from one sample set can be applied to another, new set of code samples
  – i.e. measure the legitimacy of using a predictor on a new set
- One suggestion is to look at 'control'- vs. 'computation'-dominated code
- The idea is to classify virtual instructions as 'control' or 'computation' and compare their ratios

Ratio of Control to Computation Virtual Instructions
- Control_to_Computation_Ratio = (IF + GOTO + OP.n) / (IF + GOTO + OP.n + MUL.n + DIV.n)
  – where OP.n represents OP.c, OP.s, OP.i, OP.l, OP.d, OP.f
  – MUL.n represents MUL.c, MUL.s, MUL.i, MUL.l, MUL.d, MUL.f
  – DIV.n represents DIV.c, DIV.s, DIV.i, DIV.l, DIV.d, DIV.f
- If the control-to-computation ratio is greater than 2/3, then the VPM should provide reasonably accurate estimates
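A minimal C sketch of this discriminator follows, assuming the virtual-instruction counts have already been summed per class. Only the ratio formula and the 2/3 threshold come from the slide above; the function name and the example counts are illustrative assumptions.

    /* Minimal sketch of the control-to-computation discriminator:
     * ratio = (IF + GOTO + OP.n) / (IF + GOTO + OP.n + MUL.n + DIV.n).
     * Function names and example counts are illustrative assumptions. */
    #include <stdio.h>

    static double control_to_computation_ratio(long if_cnt, long goto_cnt,
                                                long op_n, long mul_n, long div_n)
    {
        double control = (double)(if_cnt + goto_cnt + op_n);
        double total   = control + (double)(mul_n + div_n);
        return total > 0.0 ? control / total : 0.0;
    }

    int main(void)
    {
        /* illustrative counts for one task */
        double r = control_to_computation_ratio(25, 1, 64, 2, 1);

        printf("control/computation ratio = %.2f -> %s\n", r,
               r > 2.0 / 3.0 ? "VPM estimate should be reasonable"
                             : "VPM estimate suspect (computation-dominated)");
        return 0;
    }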
A simpler approach
- The control-to-computation ratio seemed too variable when applied to our benchmark sets
- We looked at a simpler ratio: the number of virtual IF's to the total cycle count
  – on the 35 'automotive control' examples, this ratio ranged from 9 to 13%
  – on the 6 'Esterel' examples, this ratio ranged from 1 to 5%
  – this suggests that perhaps these two sample sets are NOT drawn from the same population, and therefore a predictor from one set will NOT work on the other
  – true in practice: using a predictor from the first set on the second gave error ranges of 23 to 60%!
- Is there a statistical test that can help?

2-sample t-test (and Aspin-Welch unequal-variance test)
- Used to test the hypothesis that two sample sets are drawn from the same underlying population
  – assumes that they are normally distributed
  – the t-test assumes that they have equal variances and tests whether the means are equal
  – the Aspin-Welch test assumes that they have unequal variances and tests whether the means are equal (a minimal code sketch of this test appears at the end of this section)
- We applied these tests to our 18- and 17-sample 'automotive' SW sets
- The discriminator was the ratio of IF's to the total cycle count
- The tests passed – i.e. it is possible that the two samples were drawn from the same underlying population
- The normality of sample 1 is under question

Results of analysis
[Figure: histograms of C1 and C2, the IF-to-cycle-count ratios of the two 'automotive' sub-samples]

Results of analysis
[Figure: normal probability plots of C1 and C2 against expected normal scores]

Results of analysis
[Figure: box plots of the C1 and C2 variables]

T-test applied to 'automotive' vs. 'Esterel'
[Figure: histograms and box plots of the C1 ('automotive') and C2 ('Esterel') IF-ratio distributions]
- Hypotheses soundly rejected
- Normality of the 6-sample Esterel set is very doubtful! (sample size too small)

Conclusions
- The statistical analysis method seems to provide:
  – improved and more robust estimators than data sheets
  – estimators that are better centred and less likely to produce wild outliers
  – some ability to discriminate among domain sets of SW tasks
- Hampered by a lack of benchmark examples
  – very hard to obtain, especially for multiple domains
  – since the method is statistical, it needs lots of samples drawn from a wide variety of implementations
- Needs to be rerun on other processors (just one used so far)
  – variant HW features may mean it is applicable only to very simple embedded RISCs
- Users find it very hard to interpret the predictor equations
  – they expect the coefficients to be all positive and to make 'sense'

Future improvements under study
- Improved compiler front end
  – target-independent optimisations incorporated
  – may yield different virtual instruction counts and thus better predictors of cycle count
- Other processors
  – need cycle-accurate ISS's from vendors
  – looking at DSP's to study the limits of the technique
- Better benchmark sets
  – need non-proprietary benchmarks which work on a variety of embedded processors
  – EEMBC Benchmark Suite (www.eembc.org) – EDN Embedded Microprocessor Benchmark Consortium
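As promised in the discriminator discussion above, here is a minimal C sketch of the Aspin-Welch unequal-variance t statistic for comparing the IF-to-cycle-count ratios of two sample sets. The data is made up for illustration, and comparing the statistic against a critical value (with the Welch-Satterthwaite degrees of freedom) is left out for brevity.

    /* Minimal sketch of the Aspin-Welch (unequal-variance) t statistic
     * referenced above.  The data below is illustrative, not the study's. */
    #include <stdio.h>
    #include <math.h>

    static void mean_var(const double *x, int n, double *mean, double *var)
    {
        double s = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++) s += x[i];
        *mean = s / n;
        for (int i = 0; i < n; i++) ss += (x[i] - *mean) * (x[i] - *mean);
        *var = ss / (n - 1);                    /* unbiased sample variance */
    }

    int main(void)
    {
        /* illustrative IF/cycle ratios for two sample sets */
        double c1[] = { 0.09, 0.11, 0.10, 0.13, 0.12, 0.10 };  /* 'automotive' */
        double c2[] = { 0.01, 0.03, 0.05, 0.02, 0.04 };        /* 'Esterel'    */
        int n1 = 6, n2 = 5;

        double m1, v1, m2, v2;
        mean_var(c1, n1, &m1, &v1);
        mean_var(c2, n2, &m2, &v2);

        /* Welch's t: difference of means over its estimated standard error */
        double t = (m1 - m2) / sqrt(v1 / n1 + v2 / n2);
        printf("Welch t statistic = %.2f\n", t);
        return 0;
    }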
EEMBC benchmarks
- Cadence joined EEMBC as a 3rd-party tool provider to get access to benchmark code
- Automotive/Industrial – e.g. table lookup, angle-to-time conversion, pulse width modulation, IIR filters, CAN remote data request, …
- Consumer – e.g. JPEG compression and decompression
- Networking – e.g. path routing
- Office Automation – e.g. printer control
- Telecommunications – e.g. autocorrelation, Viterbi, convolutional encoding, …