Timing Analysis and Timing Predictability

Reinhard Wilhelm
Saarland University
http://www.predator-project.eu/
Deriving Run-Time Guarantees for
Hard Real-Time Systems
Given:
1. software that produces a reaction,
2. a hardware platform on which to execute the software,
3. a required reaction time.
Derive: a guarantee for timeliness.
Timing Analysis
• Sound methods that determine upper bounds for all execution times
• can be seen as the search for a longest path,
  – through different types of graphs,
  – through a huge space of paths.
I will show
1. how this huge state space originates,
2. how and how far we can cope with this huge
state space,
3. what synchronous languages contribute.
Timing Analysis – the Search Space
• all control-flow paths (through the
binary executable) – depending on the
possible inputs.
• Feasible as a search for a longest path if
  – iteration and recursion are bounded,
  – execution times of instructions are (positive) constants.
• Elegant method: Timing Schemata (Shaw 89) – inductive calculation of upper bounds.
[Diagram: the search space is spanned by the inputs and the software, on an architecture with constant execution times]
ub (if b then S1 else S2) := ub (b) + max (ub (S1), ub (S2))
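A minimal sketch in Python of such an inductive bound computation (node kinds and constants are invented for illustration; this is not from the slides):

# Timing-schemata-style upper bounds, computed bottom-up over the program
# structure, assuming constant instruction times and bounded loops.
def ub(node):
    kind = node[0]
    if kind == "basic":                    # ("basic", constant_time)
        return node[1]
    if kind == "seq":                      # ("seq", s1, s2)
        return ub(node[1]) + ub(node[2])
    if kind == "if":                       # ("if", cond, s_then, s_else)
        return ub(node[1]) + max(ub(node[2]), ub(node[3]))
    if kind == "loop":                     # ("loop", bound, cond, body)
        return node[1] * (ub(node[2]) + ub(node[3])) + ub(node[2])
    raise ValueError(kind)

prog = ("seq", ("basic", 2),
        ("if", ("basic", 1), ("basic", 5), ("basic", 3)))
print(ub(prog))   # 2 + 1 + max(5, 3) = 8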
High-Performance Microprocessors
• increase (average-case) performance by using:
Caches, Pipelines, Branch Prediction, Speculation
• These features make timing analysis difficult:
Execution times of instructions vary widely
– Best case - everything goes smoothly: no cache miss,
operands ready, resources free, branch correctly
predicted
– Worst case - everything goes wrong: all loads miss the
cache, resources are occupied, operands not ready
– Span may be several hundred cycles
Variability of Execution Times
x = a + b;   compiled for the PPC 755:
    LOAD r2, _a
    LOAD r1, _b
    ADD  r3, r2, r1
[Chart: measured execution times in clock cycles on the PPC 755, from the best case up to a worst case of several hundred cycles (scale 0 to 350)]
In most cases, execution will be fast.
So, assuming the worst case is safe, but very pessimistic!
AbsInt‘s WCET Analyzer aiT
IST Project DAEDALUS final review report:
"The AbsInt tool is probably the best of its kind in the world and it is justified to consider this result as a breakthrough."
Several time-critical subsystems of the Airbus A380 have been certified using aiT; aiT is the only validated tool for these applications.
Tremendous Progress during the past 14 Years
[Chart: cache-miss penalty vs. over-estimation over the years; penalties of roughly 4, 25, 60, and 200 cycles; reported over-estimations of 30-50%, 20-30%, 25%, 15%, and 10%; data points labeled Lim et al. (1995), Thesing et al. (2002), and Souyris et al. (2005)]
The explosion of penalties has been compensated by the improvement of the analyses!
State-dependent Execution Times
• Execution times depend on the execution state.
• The execution state results from the execution history.
[Diagram: a state consists of the semantics state (the values of variables) and the execution state (the occupancy of resources)]
Timing Analysis – the Search Space
with State-dependent Execution Times
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
[Diagram: the set of execution states for paths reaching this program point; for mul rD, rA, rB, the execution time depends on whether the instruction is in the I-cache, whether the bus is occupied, and whether the operands are small or large, ranging from 1 cycle up to ≥ 40 cycles. The search space is spanned by the inputs, the software, and the initial state of the architecture.]
Timing Analysis – the Search Space
with out-of-order execution
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
• including different schedules for
instruction sequences
Timing Analysis – the Search Space
with multi-threading
• all control-flow paths – depending on
the possible inputs
• all paths through the architecture for
potential initial states
• including different schedules for
instruction sequences
• including different interleavings of
accesses to shared resources
Why Exhaustive Exploration?
• Naive attempt: follow local worst-case transitions only.
• This is unsound in the presence of timing anomalies: a path starting with a local worst case may have a lower overall execution time.
  Example: a cache miss preventing a branch misprediction.
• Timing anomalies are caused by interference between processor components.
  Example: a cache hit/miss influences branch prediction; branch prediction causes prefetching; prefetching pollutes the I-cache.
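A toy sketch in Python (invented transition system and numbers, not a real processor model) of why following only the local worst-case transition is unsound in the presence of timing anomalies:

# At the first step, a cache access either hits (1 cycle) or misses (10 cycles).
# In this toy model, the miss delays the fetch long enough that the branch is
# resolved in time (no misprediction), while the hit lets speculation run ahead
# and mispredict (15-cycle penalty): a timing anomaly.
TRANSITIONS = {
    "start":         [("hit", 1), ("miss", 10)],   # local worst case: miss
    "hit":           [("mispredict", 15)],         # the hit enables a misprediction
    "miss":          [("no_mispredict", 2)],       # the miss prevents it
    "mispredict":    [],
    "no_mispredict": [],
}

def exhaustive_wcet(state, elapsed=0):
    """Explore all successors: the sound way to obtain the worst case."""
    succs = TRANSITIONS[state]
    if not succs:
        return elapsed
    return max(exhaustive_wcet(s, elapsed + c) for s, c in succs)

def local_worst_only(state, elapsed=0):
    """Unsound heuristic: always take the locally most expensive transition."""
    succs = TRANSITIONS[state]
    if not succs:
        return elapsed
    s, c = max(succs, key=lambda sc: sc[1])
    return local_worst_only(s, elapsed + c)

print(exhaustive_wcet("start"))    # 16: hit (1) + misprediction (15)
print(local_worst_only("start"))   # 12: miss (10) + 2, underestimates the WCET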
State Space Explosion in Timing Analysis
[Diagram: growth of the state space over the years and the corresponding methods; constant execution times (up to ~1995, method: timing schemata), state-dependent execution times and out-of-order execution (~2000, method: static analysis), preemptive scheduling and concurrency + shared resources (2010, methods: ???)]
Coping with the Huge Search Space
• No abstraction:
  – exhaustive exploration of the full program;
  – state space for caches: O(2^cache capacity);
  – intolerable effort.
• Abstraction:
  + compact representation, + efficient update;
  – loss in precision, over-estimation;
  – state space for caches: O(cache capacity).
• Splitting into local analysis on basic blocks and global bounds analysis:
  – exhaustive exploration of basic blocks only (needed because of timing anomalies);
  + tolerable effort.
Abstraction-Induced Imprecision
Notions
Determinism: allows the prediction of future behavior, given the current state and knowing the future inputs.
The state is split into
• the semantics state, i.e. a point in the program plus a mapping of variables to values, and
• the execution state, i.e. the allocation of variables to resources and the occupancy of resources.
Timing repeatability:
• the same execution time for all inputs to and all initial execution states of a program;
• allows the prediction of future timing behavior given the PC and the control-flow context, without knowing the future inputs.
Predictability of a derived property of program behavior:
• expresses how well this property, over all behaviors of a program and independent of inputs and initial execution state, can be bounded;
• bounds are typically derived from invariants over the set of all execution traces;
• predictability also expresses how well this property of all future behaviors can be bounded, given an invariant about a set of potential states at a program point, without knowing future inputs.
Some Observations/Thoughts
Repeatability
• doesn't make sense for behaviors,
• may make sense for derived properties, e.g. time, space, and energy consumption.
Predictability
• concerns properties of program behaviors;
• the set of all behaviors (the collecting semantics) is typically not computable;
• abstraction is applied
  – to soundly approximate program behavior and
  – to make its determination (efficiently) computable.
Approaches
• Excluding arch.-state-dependent variability by design → PRET architecture without PRET programming.
• Excluding state- and input-dependent variability by design → repeatability: PRET architecture with PRET programming.
• Bounding the variability, but ignoring invariants about the set of possible states, i.e. assuming arch.-state-independent worst cases for each "transition": MERASA.
• Using invariants about sets of possible arch. states to bound future behaviors: Predator.
• Designing the architecture so as to support analyzability,
  – independent of the applications: Predator, or
  – for a given set of applications: Predator with PROMPT.
Time Predictability
• There are analysis-independent notions:
  – predictability as an inherent system property, i.e., predictable by an optimal analysis method;
  – e.g. Reineke et al. for caches, Grund et al. 2009.
• There are analysis-method-dependent notions:
  – predictable by a certain analysis method;
  – in the case of static analysis by abstract interpretation, designing a timing analysis essentially means designing abstract domains.
• Achieving predictability is not difficult; not losing much performance at the same time is!
Tool Architecture
[Diagram of the tool chain:]
• Value analysis (abstract interpretation): determines loop bounds, enclosing intervals for the values in registers and local variables, and infeasible paths.
• Combined cache and pipeline analysis (abstract interpretation): derives invariants about architectural execution states and computes bounds on the execution times of basic blocks.
• Path analysis (integer linear programming): determines a worst-case path and an upper bound.
Timing Accidents and Penalties
Timing accident – a cause for an increase of the execution time of an instruction.
Timing penalty – the associated increase.
• Types of timing accidents:
  – cache misses,
  – pipeline stalls,
  – branch mispredictions,
  – bus collisions,
  – memory refresh of DRAM,
  – TLB misses.
Our Approach
• Static analysis of programs for their behavior on the execution platform:
• it computes invariants about the set of all potential execution states at all program points;
• the execution states result from the execution history;
• static analysis explores all execution histories.
Deriving Run-Time Guarantees
• Our method and tool derive safety properties from these invariants: certain timing accidents will never happen.
  Example: at program point p, instruction fetch will never cause a cache miss.
• The more accidents excluded, the lower the upper bound.
[Diagram: the variance of execution times between the fastest and the slowest run; "Murphy's invariant" assumes that every timing accident happens]
Architectural Complexity implies
Analysis Complexity
Every hardware component whose state has an
influence on the timing behavior
• must be conservatively modeled,
• may contribute a multiplicative factor to the
size of the search space
• Exception: Caches
– some have good abstractions providing for highly
precise analyses (LRU), cf. Diss. of J. Reineke
– some have abstractions with compact
representations, but not so precise analyses
Abstraction and Decomposition
Components with domains of states C1, C2, …, Ck.
The analysis has to track the domain C1 × C2 × … × Ck.
Start with the powerset domain 2^(C1 × C2 × … × Ck).
• Find an abstract domain C1# and transform the domain into C1# × 2^(C2 × … × Ck).
  This has worked for caches and cache-like devices.
• Find abstractions C11# and C12#, factor out C11#, and transform the rest into 2^(C12# × … × Ck).
  This has worked for the arithmetic of the pipeline.
[Diagram: program → value analysis (domain C11#) → program with annotations → microarchitectural analysis (domain 2^(C12# × … × Ck))]
Complexity Issues for Predictability by
Abstract Interpretation
Independent-attribute analysis
• feasible for domains with no dependences or a tolerable loss in precision,
• examples: value analysis, cache analysis,
• efficient!
Relational analysis
• necessary for mutually dependent domains,
• example: pipeline analysis,
• highly complex.
Other parameters: the structure of the underlying domain, e.g. the height of the lattice, determines the speed of convergence of the fixed-point iteration.
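A tiny sketch in Python (toy domain, invented for illustration, not from the talk) of why an independent-attribute analysis is cheap but loses relational information:

# Concrete states are pairs (x, y).  After an assignment "y = x" every
# concrete state satisfies x == y, but tracking the values of x and y
# in separate sets (independent attributes) cannot express that relation.
states = {(1, 1), (2, 2), (3, 3)}                 # concrete states after "y = x"

relational = states                               # relational view: the set of pairs

xs = {x for x, _ in states}                       # independent-attribute view:
ys = {y for _, y in states}                       # one set per component
independent = {(x, y) for x in xs for y in ys}    # its concretization

print(relational)                    # {(1, 1), (2, 2), (3, 3)}
print(independent - relational)      # spurious states such as (1, 2), (3, 1), ...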
My Intuition
• Current high-performance processors have cyclic dependences between components.
• Statically analyzing components in isolation (the independent-attribute method) loses too much precision.
• Goals:
  – cutting the cycle without losing too much performance,
  – designing architectural components with compact abstract domains,
  – avoiding interference on shared resources in multi-core architectures (as far as possible).
• R. Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009
Caches: Small & Fast Memory on Chip
• Bridge speed gap between CPU and RAM
• Caches work well in the average case:
– Programs access data locally (many hits)
– Programs reuse items (instructions, data)
– Access patterns are distributed evenly across the cache
• Cache performance has a strong influence on
system performance!
• The precision of cache analysis has a strong
influence on the degree of over-estimation!
Caches: How they work
CPU: read/write at memory address a
– sends a request for a to the bus.
Cases:
• Hit: the block m containing a is in the cache:
  the request is served in the next cycle.
• Miss: the block m is not in the cache:
  m is transferred from main memory to the cache,
  m may replace some block in the cache,
  the request for a is served asap while the transfer still continues.
• Replacement strategies (LRU, PLRU, FIFO, ...) determine which block to replace in a full cache (set).
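A minimal sketch in Python (for illustration only, not the analyzer's code) of a concrete fully-associative cache set with LRU replacement, as assumed on the following slides:

def lru_access(cache, block, capacity=4):
    """Return the updated set after accessing 'block'.

    'cache' is a list ordered from youngest (index 0) to oldest; on a miss
    in a full set the oldest block falls off the end (is evicted)."""
    new = [block] + [b for b in cache if b != block]   # accessed block is youngest
    return new[:capacity]

cache = ["x", "a", "b", "y"]       # youngest ... oldest
print(lru_access(cache, "a"))      # hit:  ['a', 'x', 'b', 'y']
print(lru_access(cache, "z"))      # miss: ['z', 'x', 'a', 'b']  (y is evicted)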
Cache Analysis
How to statically precompute cache contents:
• Must analysis: for each program point (and context), find out which blocks are always in the cache ⇒ prediction of cache hits.
• May analysis: for each program point (and context), find out which blocks may be in the cache. The complement says what is definitely not in the cache ⇒ prediction of cache misses.
• In the following, we consider must analysis unless stated otherwise.
(Must) Cache Analysis
• Consider one instruction in the program, e.g. load a.
• There may be many paths leading to this instruction.
• How can we compute whether a will always be in the cache, independently of which path execution takes?
Question: is the access to a always a cache hit?
Determine LRU Cache Information (abstract cache states) at each Program Point
[Abstract cache state with ages from 0 (youngest) to 3 (oldest), containing the sets {x} and {a, b}]
Interpretation of this cache information: it describes the set of all concrete cache states in which x, a, and b occur,
• x with an age not older than 1,
• a and b with an age not older than 2.
Cache information contains
1. only memory blocks guaranteed to be in the cache,
2. associated with their maximal age.
Cache Information
Cache analysis determines safe information about cache hits: each predicted cache hit reduces the upper bound by the cache-miss penalty.
[Example: with the computed cache information {x}, {a, b} before load a, the access to a is a cache hit; assume 1 cycle access time.]
Cache Analysis – how does it work?
• How to compute for each program point an
abstract cache state representing a set of
memory blocks guaranteed to be in cache each
time execution reaches this program point?
• Can we expect to compute the largest set?
• Trade-off between precision and efficiency –
quite typical for abstract interpretation
(Must) Cache analysis of a memory access with LRU
replacement
[Concrete transfer function (cache): concrete cache contents [x, a, b, y] (youngest to oldest); after the access to a: [a, x, b, y].]
[Abstract transfer function (analysis): abstract cache {x}, {a, b}; after the access to a: {a}, {b, x}.]
After the access to a, a is the youngest memory block in the cache, and we must assume that x has aged.
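The abstract transfer function can be sketched as follows (a minimal Python illustration, assuming a list-of-sets representation indexed by maximal age; not the tool's implementation):

def must_update(abstract, block, ways=4):
    """Must-cache update for LRU: the accessed block becomes youngest;
    only blocks that were younger than it age by one."""
    # maximal age of the accessed block, or 'ways' if it is not guaranteed cached
    h = next((age for age, line in enumerate(abstract) if block in line), ways)
    new = [set() for _ in range(ways)]
    new[0] = {block}
    for age, line in enumerate(abstract):
        for b in line - {block}:
            new_age = age + 1 if age < h else age
            if new_age < ways:                  # blocks ageing out are dropped
                new[new_age].add(b)
    return new

before = [{"x"}, {"a", "b"}, set(), set()]
print(must_update(before, "a"))                 # [{'a'}, {'x', 'b'}, set(), set()]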
Combining Cache Information
• Consider two control-flow paths to a program point, with sets S1 and S2 of memory blocks in the cache.
  – Cache analysis should not predict more than S1 ∩ S2 after the merge of the paths,
  – with the elements of the intersection taking their maximal age from S1 and S2.
• This suggests the following method: compute cache information along all paths to a program point and calculate their intersection – but there are too many paths!
• More efficient method:
  – combine cache information on the fly,
  – iterate until the least fixpoint is reached.
• There is a risk of losing precision, but not in the case of distributive transfer functions.
What happens when control-flow paths merge?
[Example: on one path we can guarantee the abstract cache {c}, {e}, {a}, {d}; on the other path {a}, {}, {c, f}, {d}. Which content can we guarantee after the merge? Intersection with maximal ages: {}, {}, {a, c}, {d}.]
"Intersection + maximal age": combine cache information at each control-flow merge point.
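The join ("intersection + maximal age") can be sketched in the same list-of-sets representation; again an illustration, not the tool's code:

def must_join(a, b):
    """Must-cache join: keep only blocks guaranteed on both paths,
    each with its maximal age from the two abstract states."""
    age_a = {blk: age for age, line in enumerate(a) for blk in line}
    age_b = {blk: age for age, line in enumerate(b) for blk in line}
    new = [set() for _ in range(len(a))]
    for blk in age_a.keys() & age_b.keys():          # blocks present on both paths
        new[max(age_a[blk], age_b[blk])].add(blk)    # keep the maximal age
    return new

path1 = [{"c"}, {"e"}, {"a"}, {"d"}]
path2 = [{"a"}, set(), {"c", "f"}, {"d"}]
print(must_join(path1, path2))       # [set(), set(), {'a', 'c'}, {'d'}]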
Predictability of Caches – Speed of Recovery from Uncertainty
[Example access sequence: write z; read y; read x; mul x, y]
1. Initial cache contents?
2. Need to combine information.
3. Cannot resolve the address of x...
4. Imprecise analysis domain / update functions.
Need to recover information: predictability = speed of recovery.
J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems, Springer, 2007
Metrics of Predictability: evict and fill
[Diagram: after an access sequence a b c d e f g h, different initial cache states lead to possible contents such as [f,e,d], [f,e,c], [h,g,f], [f,d,c]; evict = number of accesses after which all initial contents have been evicted, fill = number of accesses after which the cache contents are completely known. Two variants: M = misses only, HM = hits and misses.]
Results: tight bounds; generic examples prove tightness.
The Influence of the Replacement Strategy
[Diagram: information gained through an access to m.
LRU: m is now the youngest element of the cache, and in addition a prefix of unknown length of the cache contents ages.
FIFO: only m ∈ cache can be concluded; at least the k-1 youngest blocks remain in the cache.]
Pipelines
Ideal case: 1 instruction per cycle

          cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
Inst 1    Fetch     Decode    Execute   WB
Inst 2              Fetch     Decode    Execute   WB
Inst 3                        Fetch     Decode    Execute   WB
Inst 4                                  Fetch     Decode    Execute   WB
CPU as a (Concrete) State Machine
• Processor (pipeline, cache, memory, inputs)
viewed as a big state machine,
performing transitions every clock cycle
• Starting in an initial state for an
instruction,
transitions are performed,
until a final state is reached:
– End state: instruction has left the pipeline
– # transitions: execution time of instruction
Pipeline Analysis
• simulates the concrete pipeline on abstract states,
• counts the number of steps until an instruction retires,
• non-determinism resulting from abstraction and timing anomalies requires exhaustive exploration of paths.
Integrated Analysis: Overall Picture
[Diagram: fixed-point iteration over basic blocks (in context); an abstract state, e.g. {s1, s2, s3}, enters a basic block; for each instruction, e.g. move.1 (A0,D0),D1, the processor model evolves each state cycle-wise, e.g. s1 → s10 → s11 → s12 → s13, and likewise for s2 and s3.]
Implementation
• The abstract model is implemented as a DFA.
• Instructions are the nodes in the CFG.
• The domain is the powerset of the set of abstract states.
• Transfer functions at the edges of the CFG iterate cycle-wise, updating each state in the current abstract value.
• max{# iterations over all states} gives the bound.
• From this, we can obtain bounds for basic blocks.
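A toy sketch in Python (invented states and cycle counts, purely illustrative) of the cycle-wise powerset exploration described above:

# The abstract value at a program point is a set of abstract processor states.
# Each state is advanced cycle by cycle and may split on unresolved information
# (e.g. an unknown cache hit/miss) until the instruction has retired.
def cycle(state):
    """One clock cycle for one abstract state; returns a set of successors."""
    phase, pending = state
    if phase == "fetch":
        return {("execute", 1), ("execute", 10)}   # unknown: hit (1) or miss (10)
    if phase == "execute" and pending > 1:
        return {("execute", pending - 1)}
    return {("retired", 0)}

def bound_for_instruction(initial_states, max_cycles=1000):
    """Count cycles until every state in the set has retired: an upper bound."""
    states = set(initial_states)
    for n in range(max_cycles):
        if all(phase == "retired" for phase, _ in states):
            return n
        states = {succ for s in states for succ in cycle(s)}
    raise RuntimeError("no fixed point within max_cycles")

print(bound_for_instruction({("fetch", 0)}))   # 11: 1 fetch cycle + 10 execute cycles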
Classification of Pipelined Architectures
• Fully timing compositional architectures:
– no timing anomalies.
– analysis can safely follow local worst-case paths only,
– example: ARM7.
• Compositional architectures with constant-bounded
effects:
– exhibit timing anomalies, but no domino effects,
– example: Infineon TriCore
• Non-compositional architectures:
– exhibit domino effects and timing anomalies.
– timing analysis always has to follow all paths,
– example: PowerPC 755
Extended the Predictability Notion
• The cache-predictability concept applies to all cache-like architecture components:
  TLBs, BTBs, and other history mechanisms.
• It does not cover the whole architectural domain.
The Predictability Notion
Unpredictability
• is an inherent system property
• limits the obtainable precision of static predictions about
dynamic system behavior
Digital hardware behaves deterministically (ignoring
defects, thermal effects etc.)
• Transition is fully determined by current state and input
• We model hardware as a (hierarchically structured,
sequentially and concurrently composed) finite state
machine
• Software and inputs induce possible (hardware)
component inputs
Uncertainties About State and Input
• If the initial system state and the input were known, only one execution (and one execution time) would be possible.
• To be safe, static analysis must take into account all possible initial states and inputs.
• Uncertainty about the state implies a set of starting states and different transition paths through the architecture.
• Uncertainty about the program input implies possibly different program control flow.
• Overall result: possibly different execution times.
Source and Manifestation of
Unpredictability
• "Outer view" of the problem: unpredictability manifests itself in the variance of execution times.
• The shortest and longest paths through the automaton are the BCET and the WCET.
• "Inner view" of the problem: where does the variance come from?
• For this, one has to look into the structure of the finite automata.
Connection Between Automata and Uncertainty
• Uncertainty about state and uncertainty about input are qualitatively different:
• State uncertainty shows up at the "beginning": the number of possible initial states the automaton may be in.
  States of the automaton with high in-degree lose this initial uncertainty.
• Input uncertainty shows up while "running" the automaton.
  Nodes of the automaton with high out-degree introduce uncertainty.
State Predictability – the Outer View
Let T(i, s) be the execution time with component input i, starting in hardware component state s.
The range of the predictability measure is [0, 1]; 1 means perfectly timing-predictable.
The smaller the set of states, the smaller the variance and the larger the predictability.
The smaller the set of component inputs to consider, the larger the predictability.
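The defining formula did not survive extraction. A plausible reconstruction, following the predictability template of Grund, Reineke, and Wilhelm (PPES 2011, listed in the references), with Q the set of possible initial hardware component states and I the set of component inputs:

Pr(Q) = min_{i ∈ I} min_{s, s' ∈ Q} T(i, s) / T(i, s')

i.e., the worst-case ratio between execution times from two different starting states under the same input; a value of 1 means the timing does not depend on the initial state at all.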
The Main Culprit:
Interference on Shared Resources
• Interferences come in many flavors:
  – instructions interfere on the caches,
  – bus masters interfere on the bus,
  – several threads interfere on shared caches and shared memory.
• Some directly cause variability of execution times, e.g. different bus access times in case of a collision.
• Some allow for different interleavings of control or architectural flow, resulting in different execution states and subsequently different timings.
Dealing with Shared Resources
Alternatives:
• Avoiding them,
• Bounding their effects on timing
variability
Embedded Systems go Multicore
• Multicore systems are expected to solve all problems: performance, energy consumption, etc.
• Current designs share resources.
• Recent experiment at Thales:
  – running the same application on 2 cores, code and data duplicated: 50% loss compared to execution on one core,
  – running two different applications on 2 cores: 30% loss,
  – reason: interference on shared resources,
  – the worst-case performance loss will be larger due to uncertainty about the interferences!
Principles for the PROMPT
Architecture and Design Process
• No shared resources where not needed
for performance,
• Harmonious integration of applications:
not introducing interferences on shared
resources not existing in the applications.
Steps of the Design Process
1. Hierarchical privatization:
  – decomposition of the set of applications according to the sharing relation on the global state,
  – allocation of private resources for non-shared code and state,
  – allocation of the shared global state to non-cached memory, e.g. scratchpad,
  – sound (and precise) determination of delays for accesses to the shared global state.
2. Sharing of lonely resources – seldom-accessed resources, e.g. I/O devices.
3. Controlled socialization:
  • introduction of sharing to reduce costs,
  • controlling the loss of predictability.
Sharing of Lonely Resources
• Costly lonely resources will be shared.
• Their access rate is low compared to CPU and memory bandwidth.
• The access delay contributes little to the overall execution time because accesses happen infrequently.
PROMPT Design Principles
for Predictable Systems
• reduce interference on shared resources in
architecture design
• avoid introduction of interferences in mapping
application to target architecture
Applied to Predictable Multi-Core Systems
• Private resources for non-shared components of
applications
• Deterministic regime for the access to shared
resources
Comments on the Project Proposal
Precision-Timed Synchronous Reactive Processing
"Traditional" WCET analysis of MBD systems
• analyzes binaries, which are generated from C code, which is generated from models,
• good precision due to disciplined code,
• precision is increased by synergetic integration of the MBD code generator, compiler, and WCET tool.
Separation of WCET analysis and WCRT analysis
• multiplies the bound on the number of synchronous cycles with the bound on the tick duration;
• these two upper bounds do not necessarily occur together ⇒ danger of losing precision.
Comments on the Project Proposal
Precision-Timed Synchronous Reactive Processing
• Taking the compiler on board is a good idea:
  increased precision and efficiency of WCET analysis through WCET-aware code generation.
• Operating modes should be supported to gain precision.
Conclusion
• Timing analysis for single tasks running on single-processor systems is solved.
• Determination of context-switch costs for preemptive execution is almost solved.
• Timing analysis for concurrent threads on multi-core platforms with shared resources is not solved.
• PROMPT is a rather radical approach;
• it requires a new design and fabrication process.
• Reconciling predictability with performance is still an interesting research problem.
Some Relevant Publications from my Group
• C. Ferdinand et al.: Cache Behavior Prediction by Abstract Interpretation. Science of Computer Programming 35(2): 163-189 (1999)
• C. Ferdinand et al.: Reliable and Precise WCET Determination of a Real-Life Processor, EMSOFT 2001
• R. Heckmann et al.: The Influence of Processor Architecture on the Design and the Results of WCET Tools, IEEE Proc. on Real-Time Systems, July 2003
• St. Thesing et al.: An Abstract Interpretation-based Timing Validation of Hard Real-Time Avionics Software, IPDS 2003
• L. Thiele, R. Wilhelm: Design for Timing Predictability, Real-Time Systems, Dec. 2004
• R. Wilhelm: Determination of Execution Time Bounds, Embedded Systems Handbook, CRC Press, 2005
• St. Thesing: Modeling a System Controller for Timing Analysis, EMSOFT 2006
• J. Reineke et al.: Predictability of Cache Replacement Policies, Real-Time Systems, Springer, 2007
• R. Wilhelm et al.: The Determination of Worst-Case Execution Times – Overview of the Methods and Survey of Tools. ACM Transactions on Embedded Computing Systems (TECS) 7(3), 2008
• R. Wilhelm et al.: Memory Hierarchies, Pipelines, and Buses for Future Architectures in Time-critical Embedded Systems, IEEE TCAD, July 2009
• R. Wilhelm et al.: Designing Predictable Multicore Architectures for Avionics and Automotive Systems, RePP Workshop, Grenoble, Oct. 2009
• D. Grund, J. Reineke, and R. Wilhelm: A Template for Predictability Definitions with Supporting Evidence, PPES Workshop, Grenoble 2011
Some other Publications dealing with
Predictability
• R. Pellizzoni, M. Caccamo: Toward the Predictable Integration of Real-Time COTS Based Systems. RTSS 2007: 73-82
• J. Rosen, A. Andrei, P. Eles, and Z. Peng: Bus Access Optimization for Predictable Implementation of Real-Time Applications on Multiprocessor Systems-on-Chip, RTSS 2007
• B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards, and E. A. Lee: Predictable Programming on a Precision Timed Architecture, CASES 2008
• M. Schoeberl: A Java Processor Architecture for Embedded Real-Time Systems, Journal of Systems Architecture, 54/1-2:265-286, 2008
• M. Paolieri et al.: Hardware Support for WCET Analysis of Hard Real-Time Multicore Systems, ISCA 2009
• A. Hansson et al.: CompSoC: A Template for Composable and Predictable Multi-Processor System on Chips, ACM Trans. Des. Autom. Electr. Systems, 2009