Thomas Lundqvist, Per Stenström (RTSS '99)
Presented by: Kaustubh S. Patil
• Execution time of an instruction is not fixed
- Due to pipeline stalls or cache misses
- Latency can depend on the input data, e.g., mulhw, mulhwu, mullw in the PowerPC architecture
• In such cases, current WCET methods assume the longest latency for every instruction
- e.g., if the outcome of a cache access is unknown, a cache miss is assumed.
- This worst-case assumption is intuition-based
Claim: Making such assumptions for dynamically scheduled processors is wrong!
• Dynamically scheduled processors
- execute instructions out of program order
• For such processors, counter-intuitive increases or decreases in execution time are possible
- e.g., a cache miss can actually reduce the overall execution time.
- Such effects are termed timing anomalies.
• Description of architectural features that may cause anomalies
• Examples of timing anomalies
• Handling of such anomalies in previous methods
• Proposed methods to eliminate such anomalies
• Case study of a previous method in the context of proposed solutions
Terms and definitions
• Formal definition of timing anomaly
- Here, instruction latency is used interchangeably with instruction execution time
- case 1: the latency of the first instruction in a sequence is increased by i cycles
- case 2: it is decreased by d cycles
- Let C be the resulting change in the execution time of the remaining sequence
Definition:
A timing anomaly is a situation where, in the first case, C > i or C < 0, or, in the second case, C < -d or C > 0.
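The same definition written as a math block (using the i, d, and C introduced above):

  \text{case 1 (latency increased by } i\text{):} \quad C > i \;\text{or}\; C < 0 \;\;\Longleftrightarrow\;\; C \notin [0, i]
  \text{case 2 (latency decreased by } d\text{):} \quad C < -d \;\text{or}\; C > 0 \;\;\Longleftrightarrow\;\; C \notin [-d, 0]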
• In-order and out-of-order resources
• If a processor only contains in-order resources, no timing anomalies can occur
Architecture used for illustration
• In this example, the cache-hit case gives the longer (worst-case) execution time
• B is dependent on A
• In the cache-hit case, B gets priority over C
• In the cache-miss case, D & E execute 1 cycle earlier
• The reason for this anomaly
- The IU is an out-of-order resource
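A minimal sketch of this kind of anomaly, in Python. The units, latencies, and the extra instruction Z are hypothetical choices of mine, not the paper's exact figure; the point is only that with a greedy oldest-ready-first policy on an out-of-order IU, a longer load latency can produce a shorter overall schedule.

def schedule(instrs):
    """instrs: list of (name, unit, latency, deps) in program order.
    Greedy policy: each cycle, every free unit starts the oldest ready instruction."""
    finish = {}                                   # name -> cycle its result becomes available
    unit_free = {"LSU": 0, "IU": 0, "MCIU": 0}    # cycle at which each unit is free again
    pending = list(instrs)
    time = 0
    while pending:
        for instr in list(pending):               # program order == age order
            name, unit, lat, deps = instr
            ready = all(d in finish and finish[d] <= time for d in deps)
            if ready and unit_free[unit] <= time:
                unit_free[unit] = time + lat       # unit busy until time + lat
                finish[name] = time + lat          # result available at time + lat
                pending.remove(instr)
        time += 1
    return max(finish.values())

def total_time(load_latency):
    # Hypothetical five-instruction sequence in the spirit of the slide's example.
    return schedule([
        ("Z", "IU",   2,            []),      # keeps the IU busy for the first 2 cycles
        ("A", "LSU",  load_latency, []),      # the load: 2 cycles on a hit, 3 on a miss
        ("B", "IU",   1,            ["A"]),   # depends on the load (B in the slide)
        ("C", "IU",   1,            []),      # independent, competes with B for the IU
        ("D", "MCIU", 4,            ["C"]),   # long-latency chain behind C (D/E in the slide)
    ])

print("cache hit :", total_time(2))   # 8 cycles: B (older) wins the IU, delaying C and D
print("cache miss:", total_time(3))   # 7 cycles: B is not ready, so C and D start a cycle earlier

As the slide notes, the anomaly comes from the IU being an out-of-order resource: changing the load's latency changes which instruction wins the IU first, which in turn shifts the long MCIU chain.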
• Overall miss penalty can be higher than a single cache miss penalty
• A, B, and C have dependencies among them
• C always results in a cache miss
• C finishes 11 cycles later, instead of incurring only the single miss penalty of 8 cycles
• MCIU allows B and D to execute out-of-order
• Unbounded impact on WCET
• A and B form a loop body
• Fast case
- 'A' starts executing as soon as it is dispatched
• Slow case
- 'A' is delayed by one cycle
- The old B gets priority over the new A
- 'A' is then delayed by one cycle in each iteration
- Total penalty of k cycles after k iterations
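In the notation of the definition above, this loop example shows why the effect is unbounded: the initial perturbation is only i = 1 cycle, but it is re-incurred in every iteration, so

  C = k \cdot 1 = k \text{ cycles after } k \text{ iterations}, \qquad C / i = k \to \infty \text{ as } k \text{ grows.}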
• Previous methods make locally safe decisions, at the basic-block or instruction level.
• Timing anomalies caused by variable-latency instructions and differing pipeline states do not allow this.
• Consider an instruction sequence with n variable-latency instructions.
• Each such instruction can have k different latencies.
• We would need to examine k^n possibly different schedules.
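A concrete instance of this blow-up (k = 2 and n = 30 are chosen here purely for illustration):

  \#\text{schedules} = k^{n}, \qquad k = 2,\ n = 30 \ \Rightarrow\ 2^{30} \approx 1.07 \times 10^{9}.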
• The pessimistic serial-execution method
- All instructions are executed in-order.
- All memory references are considered misses.
- Which instruction sequence is considered?
- Very pessimistic approach
• The program modification method
- All unknown events and variable-latency instructions must result in a predictable pipeline state
- When one path from a set of paths is selected as the WCET path, the end cache & pipeline state must be the same for all paths in the set.
Making pipeline-state predictable
• Forcing in-order resource use is one solution
- but there is little processor support for it
• Use of the sync instruction in the PowerPC architecture
- to handle variable-latency instructions
- and also cases where cache hit/miss outcomes are unknown
• sync works for both of the above conditions
Making cache state predictable
• After each path, invalidate all cache blocks
- poor performance
• Invalidate only differing cache blocks
- still poor performance
• Preload cache blocks
- needs special instruction support, e.g., icbt, dcbt in PowerPC
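A minimal sketch, in Python, of the "invalidate only differing cache blocks" option above: at a path-merge point, the analysis keeps a block only if both merged states agree on it; every block where they differ is one the modified program must invalidate. The dictionary representation and the function name are mine, not the paper's.

def merge_cache_states(state_a, state_b):
    """state_*: mapping cache line -> tag held on that path (None = invalid)."""
    merged = {}
    for line in sorted(set(state_a) | set(state_b)):
        tag_a, tag_b = state_a.get(line), state_b.get(line)
        merged[line] = tag_a if tag_a == tag_b else None   # differing block -> invalidate
    return merged

# Example: the two paths disagree only on cache line 1, so only that line
# needs to be invalidated after the merge point.
print(merge_cache_states({0: "x", 1: "y"}, {0: "x", 1: "z"}))   # {0: 'x', 1: None}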
• Instruction level simulation
• Extended instruction semantics to handle 'unknown' operands, e.g., add A,B,C (see the sketch after this list):
- A ← B + C, if both B and C are known
- A ← unknown, if either B or C is unknown
• Elimination of infeasible paths
• Merging of paths to avoid an exponential number of paths
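A minimal Python sketch of the 'unknown' operand semantics from the add A,B,C example above (the Unknown class and the add helper are my own illustration; the paper extends every instruction of the simulated instruction set this way).

class Unknown:
    """Sentinel for a value that is not statically known."""
    def __repr__(self):
        return "unknown"

UNKNOWN = Unknown()

def add(b, c):
    """Extended semantics of add A,B,C:
    A <- B + C   if both operands are known,
    A <- unknown if either operand is unknown."""
    if b is UNKNOWN or c is UNKNOWN:
        return UNKNOWN
    return b + c

print(add(3, 4))         # 7
print(add(3, UNKNOWN))   # unknown

When an unknown value reaches a branch condition or a memory address, both outcomes have to be considered, which is why the infeasible-path elimination and path merging listed above are needed to keep the number of explored paths manageable.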
• First pass identifies all places where local decisions need to be made
- e.g., path merge points and variable-latency instructions
• sync and preload instructions are then added at such sites
• T_serial = sum of all instruction latencies and cache-miss penalties
• T = T_serial / 2 in the ideal case, i.e., the serial estimate can be twice the actual execution time
• PSIM, an existing instruction-level simulator, was extended to support symbolic execution and the program-modification approach
• The benchmarks used were:
- matmult : Multiplies two 50*50 matrices
- bsort : Bubble sort of 100 integers
- isort : Insertion sort of 10 integers
- fib : Calculates the nth element of the Fibonacci sequence, for n < 30
- DES : Encrypts 64-bit data
- jfdctint : Discrete cosine transform of an 8*8 pixel image
- compress : Compresses 50 bytes of data
Program    Actual WCET   Unsafe WCET   Ratio   Serial WCET   Ratio   Modified WCET   Ratio   Modified slowdown
matmult    5283287       5283287       1       10566574      2       6323287         1.20    1.20
bsort      230490        230490        1       460981        2       256854          1.11    1.11
isort      2085          2085          1       4170          2       2325            1.12    1.12
fib        797           797           1       1594          2       797             1       1
DES        186166        186358        1.001   372716        2.002   186358          1.001   1
jfdctint   9409          9409          1       18819         2       9921            1.05    1.05
compress   16846         54583         3.31    109167        6.62    69291           4.20    1.27
(Each Ratio column is the estimate divided by the actual WCET.)
• Timing anomalies in dynamically scheduled processors may cause previous methods to produce incorrect (unsafe) WCET estimates.
• With architectural support to control the state of the cache and pipeline, such anomalies can be eliminated, and the previous methods can then be applied to the modified programs.