Methods for Evaluation of Embedded Systems
Simon Künzli, Alex Maxiaguine
Institute TIK, ETH Zurich

System-Level Analysis
[Figure: applications (Web browsing, IP telephony, multimedia streaming, secure FTP) running on an architecture with LookUp, RISC, Cipher and DSP components]
Questions to answer: Bus load? Memory? Packet delays? Resource utilization? Clock rate?

Problems for Performance Estimation
• Distributed processing of applications on different resources
• Interaction of different applications on different resources
• Heterogeneity, HW-SW
[Figure: SDRAM, RISC, Arbiter, DSP]

A "nice-to-have" performance model
• measuring what we want
• high accuracy
• high speed
• full coverage
• based on a unified formal specification model
• composability & parameterization
• reusable across different abstraction levels, or at least easy to refine

Overview of Existing Approaches
[Figure: existing approaches placed on a speed vs. accuracy plane: Ernst, Thiele, Givargis, Lahiri, SPADE, Jerraya, Benini, RTL]

Discrete-event Simulation
Event Scheduler
• Event queue
System Model
• Architecture and Behavior
• Components/Actors/Processes
• Communication channels/Signals
Exchanged between scheduler and model: actions to be executed; future events (e.g. signal changes)
Accuracy vs. speed: how many events are simulated?
[Figure © The MathWorks]

Discrete-event Simulation: "The design space"
• Time resolution
• Modeling communication
• Modeling timing of data-dependent execution
• …

Time Resolution
• Continuous time, e.g. gate-level simulation
• Discrete time or "cycle-accurate", e.g. Register Transfer Level (RTL) simulation
• System-level performance analysis
[Figure: the same signal x(t), taking values a and c at times t1 to t7, shown in continuous time and in discrete time; an arrow labeled "accuracy" orders the three levels]

Modeling communication
• Pin-level model
  - all signals are modeled explicitly
  - often combined with RTL
  [Figure: components C1 and C2 connected by signals d0, d1, d2]
• Transaction-level model (TLM)
  - protocol details are abstracted, e.g. burst mode transfers
  [Figure: C1 issues a <write> transaction to C2, which answers ready true/false]
• A TLM simulator of the AMBA bus is x100 faster than the pin-level model
Caldari et al., Transaction-Level Models for AMBA Bus Architecture Using SystemC 2.0, DATE 2003

Modeling timing of data-dependent execution
Problem:
• How to model the timing of data-dependent functionality inside a component?
Possible solution:
• Estimate and annotate delays in the functional/behavioral model:

    a = read(in);
    if (a > b) { task1(); delay(d1); }   /* annotated delay of task1 */
    else       { task2(); delay(d2); }   /* annotated delay of task2 */
    write(out, c);

[Figure: flow graph in -> a=read(in) -> test a>b -> task1() with delay d1 or task2() with delay d2 -> write(out,c) -> out]
• This approach works well for HW but may be too coarse for modeling SW

HW/SW Co-simulation Options
Application SW...
• … is delay-annotated and executes natively on the workstation as a part of the HW simulator
• … is compiled for the target processor, and its code is used as stimuli for a processor model that is a part of the HW simulator
• … is not a part of the HW simulator: a complete separation of Application and Architecture models

Processor Models: Simulation Environment
[Figure: application SW (C/C++) is compiled into program code (.exe) that runs on a processor model (RTL, microarchitecture simulator or ISS); a wrapper connects the processor model to the HW simulator (rest of the system)]

Processor Models
• RTL model
  - cycle-accurate or continuous time
  - all the details are modeled (e.g. synthesizable)
• Microarchitecture Simulator
  - cycle-accurate model
  - models pipeline effects, etc.
  - can be generated automatically (e.g. Liberty, LISA…)
• Instruction Set Simulator (ISS)
  - provides instruction count
  - functional models of instructions
  - e.g. SimpleScalar

Multiprocessor System Simulator
[Figure: cycle-accurate ISS, SystemC wrapper, SystemC model]
L. Benini, U Bologna

Comparison of HW/SW Co-simulation techniques

    simulator                                 speed (instructions/sec)
    continuous time (nanosecond accurate)     1 - 100
    cycle-accurate                            50 - 1000
    instruction level                         2000 - 20,000

J. Rowson, Hardware/Software Co-Simulation, Proceedings of the 31st DAC, USA, 1994

Independent Application and Architecture Models ("Separation of Concerns")
[Figure: Application (WORKLOAD) is connected by a Mapping to the Architecture (RESOURCES: DSP, RISC, SRAM)]

Co-simulation of Application and Architecture Models
Basic principle:
• The application (or functional) simulator drives the architecture (or hardware) simulator
• The models interact via traces of actions
• The traces are produced on-line or off-line
Advantages:
• system-level view
• flexible choice of abstraction level
• the models and the mapping can be easily altered

Trace-driven Simulation
SPADE: System level Performance Analysis and Design space Exploration
[Figure: Application model and Architecture model]
P. Lieverse et al., U Delft & Philips

Trace-driven Simulation (SPADE)
Lieverse et al., U Delft & Philips

Going away from discrete-event simulation…

Analysis for Communication Systems
Lahiri et al., UC San Diego
A two-step approach:
1. simulation without communication (e.g. using ISS)
2. analysis for different communication architectures
K. Lahiri, UCSD

Overview
K. Lahiri, UCSD

Analytical Methods for Power Estimation
• Givargis et al., UC Riverside
• Analytical models for power consumption of:
  - Caches
  - Buses
• Two-step approach for fast power evaluation
  - collect intermediate data using simulation
  - use equations to rapidly predict power
  - couple with a fast bus estimation approach

Approach Overview
Givargis, UC Riverside
• Bus equation:
  - m items/second (denotes the traffic N on the bus)
  - n bits/item
  - k bit wide bus
  - bus-invert encoding
  - random data assumption
[Equation: bus power P as a function of C_bus, m, n and k; the formula is not recoverable from the extracted text]

Experiment Setup
Givargis, UC Riverside
[Figure: a C program feeds a trace generator; ISS, cache simulator and bus simulator produce performance, CPU power, memory power and I/D cache power, which are summed]
• Cache simulator: Dinero [Edler, Hill]
• CPU power [Tiwari96]

Analytical Method
[Figure: a workload of event streams e1 to e4 passes through CPU1 (scheduling discipline 1) and CPU2 (scheduling discipline 2); question marks label the unknown event streams between them]

Event Model Interface Classification
Ernst, TU Braunschweig
[Figure: event models (periodic with period T; periodic with jitter J; periodic with burst b; sporadic with parameter t) connected by event model interfaces (EMIFs). Legend: lossless EMIF; EMIF to less expressive model. Edge annotations such as b = 1, J = 0, t = T and t = T-J are only partly recoverable from the extracted text]

Example: EMIFs & EAFs
• Use standard scheduling analysis for single components
• Between components: event model interface (EMIF) needed, or event adaptation function (EAF) needed
[Figure: event streams e1 to e4 between CPU1 (scheduling discipline 1) and CPU2 (scheduling discipline 2), with the connections labeled "EMIF ?" and "EAF ?"]

General Framework
[Figure: a Functional Task Model (tasks T1, T2, T3; load scenarios, functional units, event streams) is abstracted into an Abstract Task Model (abstract load scenarios, abstract functional units, abstract event streams); an Architecture Model (ARM9, DSP; resource units) is abstracted into an Abstract Architecture (abstract resource units); mapping relations connect both to the Abstract Components (Run-Time Environment)]

Event & Resource Models
• use arrival curves to capture event streams
• use service curves to capture processing capacity
[Figure: number of packets over time t and the resulting arrival curves α^u and α^l over the window length Δ. For windows of length Δ, 2Δ, 3Δ the maximum is 1, 2, 3 packets; the minimum values shown are 0, 1, 1 packets]

Analysis for a Single Component
[Figure: a component with incoming arrival curves α^l, α^u and service curves β^l, β^u, and the corresponding outgoing curves; symbols partly lost in extraction]

Analysis – Bounds on Delay & Memory
[Figure: arrival curve α^u and service curve β^l, with the delay d and the backlog b marked between the two curves]

Comparison between Different Approaches
Analytical Methods
• possibilities to answer questions limited by method
• restricted by underlying models
• good coverage (worst case)
• fast
• coarse
Simulation-Based
• can answer virtually any question about performance
• can model arbitrarily complex systems
• average case (single instance)
• time-consuming
• accurate

Example: IBM Network Processor

Comparison RTC vs. Simulation
[Figure: utilization [%] (0 to 90) versus line speed (100 Mbps to 400 Mbps) for OPB, PLB read and PLB write; simulation compared with the analytical method]

Experiment Results
Givargis, UC Riverside
• Diesel application's performance
• Blue is obtained using full simulation
• Red is obtained using our equations
• 4% error, 320x faster
[Figure: execution time (sec), 0 to 0.3, for configurations Conf 0 to Conf 9]

Concluding Remarks

Backup

Metropolis Framework
Cadence Berkeley Lab & UC Berkeley
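
As a backup illustration of the slide "Analysis – Bounds on Delay & Memory": with an upper arrival curve α^u and a lower service curve β^l, the delay is bounded by the maximum horizontal distance between the two curves and the backlog (memory) by the maximum vertical distance. The following is a minimal sketch of that computation, not code from the presented frameworks. It assumes an integer time grid and a finite horizon, and all names and numbers (a burst-plus-rate arrival curve, a latency-plus-rate service curve, `backlog_bound`, `delay_bound`) are hypothetical choices made for illustration.

```python
# Sketch: delay and backlog bounds from an upper arrival curve and a
# lower service curve, evaluated on an integer grid of window lengths.
# Curve shapes and parameter values are assumptions for illustration.

BURST = 4         # events that may arrive at once (assumed)
RATE = 1          # long-term arrival rate, events per time unit (assumed)
SERVICE_RATE = 2  # guaranteed processing rate, events per time unit (assumed)
LATENCY = 4       # time before any service is guaranteed (assumed)


def alpha_u(delta):
    """Upper arrival curve: at most BURST + RATE * delta events in any window."""
    return 0 if delta == 0 else BURST + RATE * delta


def beta_l(delta):
    """Lower service curve: no service for LATENCY, then SERVICE_RATE per unit."""
    return max(0, SERVICE_RATE * (delta - LATENCY))


def backlog_bound(arrival, service, horizon):
    """Maximum vertical distance between the curves on 0..horizon."""
    return max(max(arrival(d) - service(d), 0) for d in range(horizon + 1))


def delay_bound(arrival, service, horizon, max_tau=10_000):
    """Maximum horizontal distance between the curves on 0..horizon.

    For each window length d, find the smallest tau such that the service
    guaranteed in d + tau covers the arrivals possible in d.
    """
    worst = 0
    for d in range(horizon + 1):
        tau = 0
        while service(d + tau) < arrival(d):
            tau += 1
            if tau > max_tau:
                raise ValueError("service curve never catches up")
        worst = max(worst, tau)
    return worst


backlog = backlog_bound(alpha_u, beta_l, horizon=50)
delay = delay_bound(alpha_u, beta_l, horizon=50)
print("backlog bound:", backlog)  # 8 events
print("delay bound:", delay)      # 6 time units
```

The finite horizon is sufficient here only because the assumed service rate exceeds the assumed arrival rate, so both distances shrink for large windows; for other curves the horizon would have to be chosen accordingly.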