Circuit-Level Timing Speculation: The Razor Latch
Developed by Trevor Mudge’s group at the University of Michigan, 2003

We’ve Already Encountered Speculation in ECE 568
• Branch prediction
  – When a branch is encountered, guess whether it is taken or not
    • If the guess is correct, we have gained time
    • If the guess is incorrect, we must undo any incorrectly executed instructions and move on
• Multi-word cache lines
  – When a cache miss is encountered, we bring in the entire cache line, not just the word we’re looking for
    • If the access pattern shows spatial locality, we are prefetching other words that the program will soon ask for, thereby saving time
    • If the speculation is too aggressive (i.e., the cache lines are too long), we’ll fetch many words uselessly

Speculation (contd.)
• Value Prediction (not covered in this course)
  – Idea is to predict what the value of a variable will be and use the predicted value
    • If the predicted value was right, we gain some time; if it was wrong, we did some useless execution. If this execution changed processor state, these changes will have to be undone.
    • Not used in practice (to my knowledge): mainly an academic exercise so far

Speculating on Time
• The pipeline clock cycle is the time by which each stage is guaranteed to complete its assigned operation
• This time is a function of
  – Actual hardware parameters: gate and wire delays vary within the same die, from one die to another, and from one wafer to another
  – Data involved in the computation
    • Example: ripple-carry adder. Worst-case execution time is the time it takes to ripple the carry through from the carry-in of the least significant stage to the carry-out of the most significant stage. Actual execution times may vary considerably.
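To make the ripple-carry example concrete, here is a minimal Python sketch that measures how far a carry actually ripples for given operands. It assumes a simplified unit-delay model in which delay is proportional to the longest carry chain; the function names `longest_carry_chain` and `average_chain` are hypothetical and not taken from the slides.

```python
import random


def longest_carry_chain(a: int, b: int, width: int, carry_in: int = 0) -> int:
    """Longest run of consecutive adder stages that a single carry travels through.

    Unit-delay assumption: each stage a carry passes through costs one time unit.
    """
    longest = 0
    run = 0
    carry = carry_in
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        generate = x & y      # this stage creates a carry on its own
        propagate = x ^ y     # this stage passes an incoming carry along
        if propagate and carry:
            run += 1          # the incoming carry ripples through this stage
        elif generate:
            run = 1           # a new carry chain starts at this stage
        else:
            run = 0           # no carry leaves this stage
        carry = generate | (propagate & carry)
        longest = max(longest, run)
    return longest


def average_chain(width: int, trials: int, seed: int = 0) -> float:
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials


if __name__ == "__main__":
    width = 32
    # Worst case: a carry generated in bit 0 ripples through every stage.
    print("worst case:", longest_carry_chain((1 << width) - 1, 1, width))
    print("random average:", average_chain(width, 10_000))
```

Under this model the worst case equals the adder width, while random operands typically produce a much shorter chain, which is the gap that timing speculation tries to exploit.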
• Requiring the worst-case delays to be accounted for often forces designers to be overly conservative in setting the clock rates

Timing Speculation: Basic Idea
• Suppose F is the frequency at which the pipeline is guaranteed to function correctly
• Run the pipeline at a somewhat higher rate, f
  – Much of the time, this clock period, t_p = 1/f, will be sufficient for all pipeline stages, and we’ll gain in execution speed
  – Some of the time, we may need more time:
    • Need to discover when this is the case
    • When this is the case, provide additional time by allowing the pipeline stages additional cycles to complete their operation

Implementation
• Recall that pipeline stages are separated by latches
• Duplicate each pipeline latch by introducing a shadow latch
• Consider any stage of the pipeline. Suppose it starts some activity at time 0.
  – At time t_p = 1/f, latch the output of that stage into the regular pipeline latch
  – At time T_p = 1/F, latch the output of the stage into the shadow latch
  – Compare the results of the regular and shadow latches
  – If they agree,
    • Do nothing: running at a higher speed has paid off
  – If they don’t agree,
    • Use the result of the shadow latch as the correct one
    • Squelch the computation that the following stage began on the basis of the incorrect regular latch results
    • Restart the computation in the following stage using the correct results, as stored in the shadow latch

Unless otherwise stated, all figures are from Ernst, et al., MICRO-36, 2003.

Issues to Consider
• How aggressive should we be?
  – If f is too high, a large fraction of the results will require correction with the shadow latch and we’ll actually lose time
  – If f is too low, the clock will be unnecessarily slow and we won’t gain much

Issues to Consider (contd.)
• What about F?
  – The worst-case path (for the worst-case inputs) sets an upper bound on F: T_p = 1/F must be at least the worst-case delay
  – What happens if F is too small?
[This is one of the few instances in design when being too conservative at one level affects correctness of functioning!]
• F may be so small that the results of the next computation propagate through the stage and arrive at the shadow latch
  – We’d then be comparing the results of two different operations!

Metastability
• If the input data is not stable when the clock transition happens, the output of the latch may float at a voltage that is in neither the 0 nor the 1 logic range
  – Duration of the metastable state is not bounded
  – Different gates may interpret such indeterminate voltages differently (in terms of logic values)
• Cannot reduce the probability of metastability to zero: all we can do is keep it sufficiently low for all practical purposes

Recovery Technique 1: Global Clock Gating
• If any stage detects a timing problem
  – Stall the entire pipeline for one clock cycle
  – Use this additional clock cycle to recompute using the correct shadow-latch values

Recovery Technique 2: Counterflow Pipelining
• When a mismatch (between regular and shadow latch contents) is detected:
  – Assert a bubble signal, to specify that the erring pipeline slot is now to be considered a bubble
  – In the subsequent cycle, inject the shadow latch value into the next stage, allowing the errant operation to continue with the correct values
  – Trigger a flush train, traveling backwards from the errant stage, flushing operations at each stage it visits (Question: Is this flush operation necessary? Can we do something else to avoid it?)

Power Consumption

Using a Processor to Fry an Egg
From: www.phys.ncku.edu.tw/~htsu/humor/fry_egg.html

Power Density
From: Hsu and Feng, “A Power-Aware Real-Time System…”, 2005

Power Implications: Dynamic Power
From: Krishna & Lee, IEEE Trans. Computers, 2003.
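As a rough illustration of why supply voltage matters so much for dynamic power, the sketch below uses the standard first-order CMOS switching model, P = a · C · Vdd² · f. The model, the function names, and every parameter value here are assumptions chosen for illustration; none of them are taken from the Krishna & Lee figure.

```python
def dynamic_power(activity: float, capacitance_f: float, vdd: float, freq_hz: float) -> float:
    """First-order CMOS switching power in watts: P = a * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd ** 2 * freq_hz


def relative_saving(vdd_old: float, vdd_new: float) -> float:
    """Fraction of dynamic power saved by lowering Vdd at a fixed clock frequency.

    Activity, capacitance and frequency cancel, leaving 1 - (vdd_new / vdd_old)^2.
    """
    return 1.0 - (vdd_new / vdd_old) ** 2


if __name__ == "__main__":
    # Hypothetical operating point: 10% activity, 1 nF switched capacitance,
    # 1.8 V supply, 200 MHz clock.
    print("baseline power (W):", dynamic_power(0.1, 1e-9, 1.8, 200e6))
    # Lowering the supply by 10% at the same frequency.
    print("fraction saved:", relative_saving(1.8, 1.62))
```

Because power depends on the square of the supply voltage, a 10% reduction in Vdd at the same clock frequency removes about 19% of the dynamic power in this model.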
Static Power
• Even when there is no switching, transistors leak current
• Leakage power is a strongly increasing function of temperature and supply voltage; it falls steeply as the threshold voltage rises (roughly exponentially, in the case of subthreshold leakage)

Subthreshold Leakage vs. Temperature
From: Do, et al., Tech Report 2007-06, Dept. of CSE, Chalmers Inst. of Tech.

Leakage Current vs. Vdd
From: Do, et al., op. cit.

Voltage Control for Razor Latch System