TIME-PREDICTABLE EXECUTION OF EMBEDDED SOFTWARE ON MULTI-CORE PLATFORMS
Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury

EMBEDDED SYSTEMS

REAL-TIME CONSTRAINTS
Embedded system
Hard real-time
Soft real-time

TIMING ANALYSIS
Hard real-time systems require absolute timing guarantees
System-level analysis
Single-task analysis
Worst-case execution time (WCET) analysis
An upper bound on execution time for all possible inputs
A sound over-approximation is obtained by static analysis

WCET ANALYSIS
[Figure: WCET analysis flow with the components Program, Control flow graph, Micro-architectural modeling, WCET of basic blocks, Infeasible path constraints, Loop bound constraints, Path analysis, WCET bound]

ARCHITECTURE
[Figure: Core 1 to Core n, each with an L1 cache, connected through a shared bus to a shared L2 cache and memory]
Resource sharing

OVERVIEW
Dissertation work (Time-predictable execution in multi-core)
A multi-core WCET tool (shared cache + shared bus)
Unified cache: conflicts with different instruction and data memory blocks
Shared scratchpad allocation
Coherence miss modeling
Cache related preemption delay analysis
[Figure: thumbnails of the target architectures: cores with L1 instruction and data caches and a unified L2 cache; cores with L1 caches, a shared bus, a shared L2 cache and main memory; processor, bus and memory]

MICRO-ARCHITECTURAL MODELING
Single core: branch predictor, cache, pipeline
Multi core: shared cache, shared bus

COMPARISON
Work | Micro-arch. level technique | Program level technique | Precision | Scalability
Classical abstract interpretation (AI) | AI | AI | × | √
Classical model checking (MC) | MC | MC | √ | ×
RTS’00 (aiT, Chronos) | AI | Integer linear programming | Can be improved | √
RTSS’10 | AI | MC | Can be improved | _
Our approach | AI+MC | Integer linear programming | > RTS’00 | = RTS’00
Our approach | AI+MC | MC | > RTSS’10 | = RTSS’10

IMPRECISION IN ABSTRACT INTERPRETATION
[Figure: abstract cache sets (ordered from youngest) along paths p1 and p2, with memory blocks a and b; cache state = C1]
Path p1 or path p2?
[Figure: abstract cache set containing x, cache state = C2; joined abstract cache set containing b, joined cache state = C3]
Joined cache state loses information about path p1 and p2

MODEL CHECKING ALONE?
A path-sensitive search
Path-sensitive search is expensive: path explosion
Worse, combined with possible cache states
p1: cache state = C1
p2: cache state = C2

MODEL CHECKING ALONE?
[Figure: abstract LRU cache sets (ordered from youngest) along paths p1 and p2, with memory blocks a, b and x]
State explosion

CACHE ANALYSIS
[Figure: analysis flow with the components Program, Micro-architectural modeling (cache analysis by abstract interpretation, pipeline analysis, branch predictor modeling), Analysis outcome, Refine by model checker (until all checked or timeout), WCET of basic blocks, Infeasible path constraints, Loop bound constraints, Micro-architectural constraints, Path analysis (IPET)]
Refinement by model checker can be terminated at any point
Model checker refinement steps are inherently parallel
Each model checker refinement step checks a light assertion property

REFINEMENT (INTER-CORE)
[Figure: a task accessing memory block m between start and exit; a conflicting task accessing m1 under x < y and m2 under x == y, a combination marked Infeasible; cache contents ordered from youngest; outcomes labelled Cache hit and Cache miss, the latter marked Spurious]

REFINEMENT (INTER-CORE)
[Figure: the same task and conflicting task, instrumented with C_m++ (increment conflict) at the accesses to m1 and m2 and with assert (C_m <= 1) at exit; the assertion is Verified, so the access to m is a cache hit]

REFINEMENT (WHY IT WORKS?)
[Figure: an access to m’ is a conflict to m and increments the conflict counter (C_m++); an access to m does not affect the value of C_m; branches x < y and x == y; property: assert (C_m <= 0); outcome: cache miss]

EXPERIMENTAL SETUP (CHRONOS TOOLKIT)
[Figure: tool flow with the components C source, GCC (simplescalar), Binary code, CFG, Micro-architectural modeling (cache, pipeline, branch prediction), Flow constraints, Micro-architectural constraints, ILP, WCET, CBMC]
CBMC: C bounded model checking

EXPERIMENTAL RESULT
[Figure: result charts; cache configurations labelled on the slide: direct-mapped, 256 bytes; 4-way associative, 8 KB; L1 cache; shared L2 cache; WCET]
Tasks: cnt, jfdctint, edn, fir, fdct, ndes
Average time = 70 secs

EXTENSION USING SYMBOLIC EXECUTION
[Figure: conflicting task with C_m++ (increment conflict) at m1 under x < y and at m2 under x == y; symbolic execution tree branching on x < y versus x ≥ y, then on x = y; the constraint solver is asked whether x < y ∧ x = y is satisfied and answers NO; assert (C_m <= 1); abort]

EXTENSION USING KLEE
[Figure: the Chronos tool flow with the components C source, GCC (simplescalar), Binary code, CFG, Micro-architectural modeling (cache, pipeline, branch prediction), Flow constraints, Micro-architectural constraints, ILP, WCET, CBMC/KLEE]

A GENERIC FRAMEWORK
Three different architectural/application settings
Intra task (WCET in single core): cache conflict
Inter task (Cache Related Preemption Delay analysis): cache conflict between a high priority task and a low priority task
Inter core (WCET in multi-core): cache conflict in the shared L2 cache between a task in Core 1 and a task in Core 2, each with its own L1 cache

MICRO-ARCHITECTURAL MODELING
Single core: branch predictor, cache, pipeline
Multi core: shared cache, shared bus

TASK-LEVEL INTERFERENCE
[Figure: tasks T1, T2, T3 on Core 1 to Core n, each core with an L1 cache, connected through a shared bus to a shared L2 cache; timeline of T1, T2, T3; task interference graph over T1, T2, T3]

SHARED CACHE + TDMA SHARED BUS
Time Division Multiple Access (TDMA)
[Figure: task graphs over T1, T2, T3, T4; Core 1 and Core 2 with L1 caches, a shared bus and a shared L2 cache; TDMA schedule alternating Core 1 slot and Core 2 slot; bus accesses; L2 miss due to T2; disjoint lifetime]

OVERVIEW OF THE FRAMEWORK
[Figure: iterative flow with the components L1 cache analysis, Filter, L2 cache analysis (one chain per core), Initial interference, L2 conflict analysis, Bus aware analysis, WCRT computation, decision "Interference changes?" (Yes: repeat; No: Estimated WCRT)]
Task interference monotonically decreases

EVALUATION (2-CORE)
One core runs statemate; the other core runs the program under evaluation

EVALUATION (4-CORE)
The 4 cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdctint, statemate)

MICRO-ARCHITECTURAL MODELING
Single core: branch predictor, cache, pipeline
Multi core: shared cache, shared bus
Interactions

TIMING ANOMALY (SHARED CACHE)
[Figure: alternative sequences of cache hits and misses along execution paths]
May not be the worst case path

BASELINE ABSTRACTION – TIMING INTERVAL
Representing each pipeline stage as a timing interval
End = Start + latency
[Figure: start interval [1,3], latency interval [3,7] (cache miss), finish interval [4,10]; pipeline stages IF, ID, EX, WB, CM for the instructions R1 := R2 + 5, R5 := R1 * R7 and R3 := R5 * 5, with edges labelled Structural dependency and Contention]
A fixed-point analysis derives the timing of each stage as an interval

TDMA SHARED BUS ANALYSIS
Time Division Multiple Access (TDMA)
Offset abstraction
[Figure: TDMA rounds T and T’ alternating Core 0 and Core 1 slots; a request at an offset outside its core's slot incurs a delay; a request at an offset inside its core's slot has delay = 0]

LOOP CONSTRUCT
[Figure: pipeline stages IF, ID, EX, WB, CM of the previous iteration and the current iteration, connected by cross-iteration edges]
How do we define bus context?
Property: If the bus offsets of the cross-iteration edges do not change, the WCET of the loop iteration cannot change

LOOP CONSTRUCT
Ci = bus context of the loop body at the i-th iteration
[Figure: bus context flow graph over the contexts C1 to C5]
Property: If Ci = Cj, then Ci+k = Cj+k for any k > 0

LOOP CONSTRUCT
[Figure: WCET analysis flow (Program, Control flow graph, Micro-architectural modeling, WCET of basic blocks, Infeasible path constraints, Loop bound constraints, Path analysis, ILP solver) extended with a bus context flow graph over C1 to C4 and the loop bound]
Compute WCET for each bus context
E(C1) = number of times context C1 is executed
Generate linear constraints:
E(C1) + E(C2) + E(C3) + E(C4) ≤ loop bound
E(C1) ≥ E(C2)
ILP = Integer Linear Programming

BRANCH PREDICTION + CACHE
[Figure: branch location followed by the maximum number of speculated instructions; accesses to m and m’; cache conflict; unclear cache access; JOIN of the cache contents]

EXPERIMENTAL SETUP (CHRONOS TOOLKIT)
[Figure: tool flow with the components C source, GCC (simplescalar), Binary code, CFG, Micro-architectural modeling (private cache, shared cache, pipeline, branch prediction, shared bus), Flow constraints, Micro-architectural constraints, ILP, WCET]

EVALUATION (CACHE + PIPELINE)
Imprecision of shared cache analysis
Vertically partition jfdctint (Core 1, Core 2)
Horizontally partition statemate (Core 1, Core 2)

EVALUATION (CACHE + PIPELINE + SPECULATION)
Imprecision of modeling speculation

EVALUATION (BUS + PIPELINE)
Imprecision of path analysis
Imprecision of shared bus analysis

RECAP
Dissertation work (Time-predictable execution in multi-core)
A multi-core WCET tool (shared cache + shared bus)
Unified cache
Shared scratchpad allocation
Coherence miss modeling
Cache related preemption delay analysis
[Figure: thumbnails of the settings covered: cache conflict between a high priority task and a low priority task; processing elements PE-0 to PE-N with scratchpads SPM-0 to SPM-N, fast on-chip communication media, external memory interface and off-chip memory; cores with L1 data caches, coherence miss traffic and stale data items; Core 1 to Core n with L1 caches, a shared L2 cache, a shared off-chip data bus and memory]

PERSPECTIVE
Time-predictable execution in single-core
Time-predictable execution in multi-core
Resource sharing (cache and bus)
Data sharing (cache coherence)
Testing
Static analysis
Customized hardware
Shared cache: ARM Cortex A9 MPCore, Samsung Exynos, Nvidia Tegra II (smart phones)
Shared bus: Time Division Multiple Access
Cache coherence
Shared scratchpad: Sony PSP, IBM Cell
Aethereal Network-on-chip

PERSPECTIVE
Functionality verification: concrete domain, abstraction, property, verifier; the result may be spurious; a spurious counter example drives refinement until verified (abstraction refinement)
Quantitative verification: concrete domain, abstract domain in abstract interpretation (AI); AI generates a quantitative property, which is checked by path-sensitive verification

FUTURE WORK
Static performance analysis + testing
Symbolic execution
Performance testing
Mobile devices
Energy analysis of software
Battery life
Energy-aware software testing
[Figure: symbolic execution tree over the branches x < y, x ≥ y and x = y, with accesses m1 and m2 and the path condition x < y ∧ x ≠ y producing an input; assert (C_m <= 1) (quantitative property, e.g. cache conflict); abort]

THANK YOU
My sincere thanks to all the Examiners, and especially the anonymous Examiner 1 for his comment on symbolic execution
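
As a concrete illustration of the conflict-counter refinement that recurs in the slides (C_m++ at each conflicting access, then assert (C_m <= 1)), a minimal C sketch follows. The function names, the small input range and the exhaustive loop that stands in for a model checker such as CBMC or KLEE are assumptions made for illustration; they are not the Chronos toolkit's actual instrumentation.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical instrumented version of the conflicting task from the slides.
 * m1 and m2 stand for two memory blocks that conflict with block m in the
 * shared cache. The return value plays the role of the counter C_m. */
int conflicts_to_m(int x, int y)
{
    int C_m = 0;

    if (x < y) {
        /* access to m1: one conflict to m */
        C_m++;
    }
    if (x == y) {
        /* access to m2: one conflict to m */
        C_m++;
    }
    return C_m;
}

/* Stand-in for the model checker (assumption): enumerate every input pair in
 * [lo, hi] x [lo, hi] and return the largest C_m observed on any path. */
int max_conflicts(int lo, int hi)
{
    int worst = 0;

    for (int x = lo; x <= hi; x++) {
        for (int y = lo; y <= hi; y++) {
            int c = conflicts_to_m(x, y);
            if (c > worst) {
                worst = c;
            }
        }
    }
    return worst;
}

#ifdef CONFLICT_DEMO
/* Optional demo driver; build with -DCONFLICT_DEMO to run it. */
int main(void)
{
    int worst = max_conflicts(-8, 8);

    /* The quantitative property from the slides. */
    assert(worst <= 1);
    printf("max C_m = %d\n", worst);
    return 0;
}
#endif
```

Because x < y and x == y cannot both hold, no feasible path increments C_m twice, so the property C_m <= 1 holds on every enumerated input. A path-insensitive analysis that counts both accesses would see two conflicts, which is the spurious cache miss the refinement step removes.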