6, Renaming and Scheduling

Out-of-Order Execution Structures
A. Moshovos ©
ECE1773 - Fall ‘07 ECE Toronto
MIPS R10000-Like Design
• Based on:
– Complexity-Effective Superscalar Processors
– S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97
Fetch Phase
• Fetch:
– Read instructions from I-Cache
– Predict Branches
– Pass on to Decode phase
Decode Phase
• Decode:
– Parse instruction
– Shuffle opcode parts to appropriate ports for rename
Renaming Phase
• Rename:
– Map Architectural registers to Physical
– Eliminate False Dependences
– Passes renamed instructions to scheduler
• Called Dispatch
Scheduling Phase
• Wakeup:
– Instructions check whether they become ready
– From Writeback: physical register names
• Select:
– Amongst the ready select those to execute
– Structural hazards
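The wakeup and select steps above can be sketched as a small Python model. This is only an illustration under stated assumptions: the class `WindowEntry`, the function `schedule_cycle` and the register names are hypothetical, and select here simply grants ready instructions in window order, up to the number of free functional units.

```python
from dataclasses import dataclass


@dataclass
class WindowEntry:
    name: str
    sources: set          # physical registers this instruction still waits on
    issued: bool = False

    def wakeup(self, broadcast_tags):
        # Wakeup: drop every pending source whose tag was broadcast this cycle.
        self.sources -= set(broadcast_tags)

    @property
    def ready(self):
        return not self.sources and not self.issued


def schedule_cycle(window, broadcast_tags, num_fus):
    """One wakeup + select step; returns the names of the issued instructions."""
    for entry in window:
        entry.wakeup(broadcast_tags)
    # Select: position-based priority, limited by the available functional units
    # (the structural hazard).
    granted = [e for e in window if e.ready][:num_fus]
    for e in granted:
        e.issued = True
    return [e.name for e in granted]


window = [
    WindowEntry("i0", {"p5"}),
    WindowEntry("i1", {"p5", "p7"}),
    WindowEntry("i2", set()),
    WindowEntry("i3", {"p7"}),
]
first = schedule_cycle(window, ["p5"], num_fus=1)
second = schedule_cycle(window, ["p7"], num_fus=2)
print(first, second)
```

In the first cycle both i0 and i2 are ready but only one FU is free, so i2 has to wait; that is the structural hazard that select resolves.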
Register File Read Phase
• Read source operands
Bypass and Execute Phase
Data Cache Access Phase
Writeback Phase
• Write result to register file
• Broadcast tag in order to wake up waiting instructions
– Note that the tag broadcast should happen TWO cycles in advance of result production
Reservation Station Model
• Used by Pentium Pro, PowerPC 604
• Re-order buffer holds values
• Renaming points to re-order buffer entries
– Tomasulo-like
Physical Register File vs. Reservation Station
• Physical Register File
– Values reside in the register file
– At writeback, instructions broadcast the register name
• Reservation Stations:
– Values reside:
– In the register file upon commit
• Non-speculative
– In reservation stations prior to commit
• Speculative
Quantifying Complexity
• Critical Path Delay as a function of architectural
parameters
– Instruction Window size (WinSize)
– Issue Width (IW)
• Full-custom Implementations
– Study the critical path
– Delay model
– Extrapolate how it will scale with “future”
technologies
Renaming
• Inputs:
– IW instructions
– Up to 2 x Input register names
– Up to 1 x Output register name
• Outputs:
– 2 x input physical registers
– 1 x new output physical register
– 1 x previous physical register name for
checkpointing
– Updated rename table
• Superscalar Issue complicates things a bit
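As a rough sketch of the inputs and outputs listed above, the Python model below renames one instruction at a time. The class name `RenameTable`, the identity initial mapping and the simple `undo` recovery are assumptions made for illustration, not details taken from the slides.

```python
from collections import deque


class RenameTable:
    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts mapped to physical register i.
        self.rat = list(range(num_arch))
        self.free_list = deque(range(num_arch, num_phys))

    def rename(self, s1, s2, d):
        """Rename 'd <- s1 op s2'; returns (ps1, ps2, new_pd, old_pd)."""
        ps1 = self.rat[s1]                  # input physical register 1
        ps2 = self.rat[s2]                  # input physical register 2
        old_pd = self.rat[d]                # previous mapping, kept for checkpointing
        new_pd = self.free_list.popleft()   # raises IndexError if no register is free
        self.rat[d] = new_pd                # updated rename table
        return ps1, ps2, new_pd, old_pd

    def undo(self, d, new_pd, old_pd):
        """Undo the youngest rename, e.g. after a misspeculation."""
        self.rat[d] = old_pd
        self.free_list.appendleft(new_pd)


rt = RenameTable(num_arch=4, num_phys=8)
first = rt.rename(1, 2, 3)    # r3 <- r1 op r2
second = rt.rename(3, 3, 1)   # r1 <- r3 op r3, reads the new mapping of r3
print(first, second)
```

Because the second instruction is renamed after the first has updated the table, it picks up the new mapping of r3 automatically. Renaming several instructions in the same cycle loses that property, which is what complicates superscalar issue.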
Renaming One Instruction
[Figure: RAT access for one instruction. Read ports look up s1 and s2; a further read port returns the old mapping of d, kept for misspeculation recovery; a write port installs the new register for d taken from the free list. Labels: s1, s2, d, old d, p0 ... p31.]
Renaming Two Instructions
[Figure: two instructions, each with inputs s1, s2, d and new d, access the RAT in parallel with the cross-bundle dependency check logic; the outputs per instruction are ps1, ps2, old d and new d.]
Renaming More Instructions
• Dependency Checking logic for instruction i must
match against all preceding destinations
• If there are multiple matches it must enforce
priority:
– Pick the one closest to this instruction
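The priority rule above can be illustrated with a short Python sketch of group renaming. The function `rename_group` and its list-based RAT and free list are hypothetical. Hardware performs the comparisons in parallel, whereas this model walks the earlier destinations in order and lets the last match override earlier ones, which gives the same result as picking the closest preceding writer.

```python
def rename_group(rat, free_list, group):
    """Rename a bundle of (s1, s2, d) tuples as if in one cycle.

    rat: list mapping architectural -> physical register (updated in place).
    free_list: list of free physical registers (consumed from the front).
    """
    # All RAT reads see the table as it was before the bundle.
    looked_up = [(rat[s1], rat[s2], rat[d]) for s1, s2, d in group]
    new_regs = [free_list.pop(0) for _ in group]

    renamed = []
    for i, (s1, s2, d) in enumerate(group):
        ps1, ps2, old_pd = looked_up[i]
        # Dependency check: scan earlier instructions oldest to youngest, so the
        # closest preceding match is applied last and therefore wins.
        for j in range(i):
            dest_j = group[j][2]
            if dest_j == s1:
                ps1 = new_regs[j]
            if dest_j == s2:
                ps2 = new_regs[j]
            if dest_j == d:
                old_pd = new_regs[j]
        renamed.append((ps1, ps2, new_regs[i], old_pd))

    # RAT update: later writers of the same register overwrite earlier ones.
    for (_, _, d), pd in zip(group, new_regs):
        rat[d] = pd
    return renamed


rat = [0, 1, 2, 3]
free_list = [4, 5, 6, 7]
bundle = [(1, 2, 3), (3, 1, 3), (3, 3, 0)]   # r3 is written twice, then read
result = rename_group(rat, free_list, bundle)
print(result, rat)
```

The third instruction reads r3, which both earlier instructions write; it must take the register allocated to the second one, not the first.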
RAT: SRAM Implementation
[Figure: SRAM-based RAT. A decoder indexed by the architectural register selects one of #ARCH REGS rows; each row is lg(#PHYS REGS) SRAM cells wide, and sense amps on the bitlines deliver the physical register number.]
SRAM RAT cell
RAT: CAM Implementation
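[Figure: CAM-based RAT. #PHYS REGS rows, each holding an architectural register tag of lg(#ARCH REGS) bits in CAM cells plus an active bit; the architectural register is matched against all rows and an encoder produces the physical register number.]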
• One CAM per physical register
• Active bit indicates the current map
• New version by setting active bit
CAM Cell
[Figure: CAM cell circuit with wordline, Bitline / Bitline_B pair and match line.]
SRAM vs. CAM
• SRAM:
– Arch reg rows
– Lg(phy reg) cols
– SRAM read/write
• CAM:
– Phy reg rows
– Lg(arch reg) cols
– CAM match
– Update:
• Reset previous active (valid) bit
• Set current active (valid) bit
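The two organizations can be contrasted with a small behavioural model in Python. The classes `SramRat` and `CamRat` and the identity initial mapping are assumptions for illustration, and the model ignores ports and timing entirely.

```python
class SramRat:
    """One row per architectural register; each row stores a physical register number."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))   # assumption: identity initial mapping

    def lookup(self, arch):
        return self.rows[arch]              # decode the arch index, read the row

    def update(self, arch, phys):
        self.rows[arch] = phys              # plain SRAM write


class CamRat:
    """One entry per physical register: an architectural register tag and an active bit."""

    def __init__(self, num_arch, num_phys):
        self.tag = [None] * num_phys
        self.active = [False] * num_phys
        for r in range(num_arch):           # assumption: identity initial mapping
            self.tag[r], self.active[r] = r, True

    def lookup(self, arch):
        # Associative match across all entries; exactly one active entry should hit.
        hits = [p for p, t in enumerate(self.tag) if t == arch and self.active[p]]
        assert len(hits) == 1
        return hits[0]

    def update(self, arch, phys):
        self.active[self.lookup(arch)] = False           # reset the previous bit
        self.tag[phys], self.active[phys] = arch, True   # set the current bit


sram, cam = SramRat(4), CamRat(4, 8)
for table in (sram, cam):
    table.update(2, 5)
    table.update(2, 6)
print(sram.lookup(2), cam.lookup(2))
```

In the CAM model the older mappings of an architectural register stay in the table with their bit cleared, so only the bits change on an update.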
Scheduler: Part #1 - Wakeup
Scheduler: Part #2 - Select
• For a single FU: tree of arbiters
• Location-based select policy
• REQ signals propagate up the tree, GRANT signals propagate down
• An arbiter raises anyreq if any of its req inputs is active, and issues a grant only if it is enabled
• The root arbiter is enabled if the FU is available
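A minimal Python sketch of select for a single FU follows. It assumes a binary arbiter tree over the request lines; the fan-in of 2 and the function name `select_one` are choices made for this example. The position-based policy favours the lower-index half whenever both halves request.

```python
def select_one(requests, fu_available=True):
    """Return the window index granted by a binary arbiter tree, or None.

    requests: list of booleans, one REQ line per window entry.
    """
    def anyreq(lo, hi):
        # Upward pass: a subtree raises anyreq if any request below it is active.
        return any(requests[lo:hi])

    def grant(lo, hi, enable):
        # Downward pass: only an enabled arbiter with a pending request grants.
        if not enable or not anyreq(lo, hi):
            return None
        if hi - lo == 1:
            return lo
        mid = (lo + hi) // 2
        # Location-based priority: the lower-index half wins when both request.
        if anyreq(lo, mid):
            return grant(lo, mid, True)
        return grant(mid, hi, True)

    # The root is enabled only if the functional unit is available.
    return grant(0, len(requests), fu_available)


print(select_one([False, True, False, True]))
print(select_one([True, True], fu_available=False))
```

The depth of the recursion grows with the logarithm of the number of request lines, which is why select delay is usually described as logarithmic in the window size.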
Select for More Than One FU
• Handling multiple FUs of the same type:
– Stack select logic blocks in series (a hierarchy)
– Mask the request granted to the previous unit
• NOT feasible for more than 2 FUs
• Alternative:
– Statically partition the issue window among FUs (MIPS R10000, HP PA 8000)
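The two options can be compared with a short Python sketch. All function names are hypothetical, and each select block is modelled as a plain priority encoder rather than an arbiter tree.

```python
def select_first(requests):
    """Position-based select for one FU: grant the lowest-index active request."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_in_series(requests, num_fus):
    """Chain one select block per FU, masking requests granted earlier in the chain."""
    remaining = list(requests)
    grants = []
    for _ in range(num_fus):
        winner = select_first(remaining)
        grants.append(winner)
        if winner is not None:
            remaining[winner] = False   # mask so the next block cannot grant it again
    return grants


def select_partitioned(requests, num_fus):
    """Static partitioning: each FU selects only from its own slice of the window."""
    size = len(requests) // num_fus     # assumes the window divides evenly
    grants = []
    for fu in range(num_fus):
        winner = select_first(requests[fu * size:(fu + 1) * size])
        grants.append(None if winner is None else fu * size + winner)
    return grants


reqs = [True, True, False, False]
print(select_in_series(reqs, 2))
print(select_partitioned(reqs, 2))
```

In the series version the second block must wait for the first block's grant before it can mask and select, so the delay adds up with each extra FU. The partitioned version runs its blocks independently, but it can leave an FU idle when all ready instructions sit in another partition.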
Datapath and Bypass
• Commonly used layout, shown for 1 bit-slice
• Example: turn on TriState A to pass the result of FU1 to the left operand of FU0
Complexity Analysis
• Critical path delay as a function of:
– Issue Width
– Window Size
• Register Renaming Table
• Wakeup and Select
• Bypass paths
Methodology
• A representative CMOS design is selected from published alternatives
• The circuits are implemented in 3 technologies:
– 0.8um, 0.35um and 0.18um
• Optimized for speed
• Wire parasitics are included in the delay model
– Rmetal, Cmetal
Methodology
• Feature size scaling: 1 / S
• Voltage scaling: 1 / U
• Logic Delay = (CL x V) / I
• Capacitive load: CL scales as 1 / S
• Supply voltage: V scales as 1 / U
• Average charge/discharge current: I scales as 1 / U
• So, Logic Delay scales as (1 / S x 1 / U) / (1 / U) = 1 / S
Wire Delay
• L: wire length
• Intrinsic RC delay = 0.5 x Rmetal x Cmetal x L^2
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: 1st-order approximation of a distributed RC model with uniformly distributed R and C
Wire Delay Scaling
• Metal thickness doesn’t scale much
– Width ~ 1 / S
– Rmetal ~ S
• Fringe capacitance (from wire edges to parallel wires and the substrate) dominates at smaller feature sizes
• Parallel-plate capacitance scales with 1 / S
– Cmetal ~ S
• Length scales with 1 / S
• Overall scale factor: S x S x (1 / S)^2 = 1
• Wire delay remains constant
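The scaling argument can be checked numerically with a few lines of Python. This is only a sketch: it assumes the usual distributed-RC form 0.5 x Rmetal x Cmetal x L^2 for wire delay, uses the scaling rules stated on the slides, and the function names and the value chosen for U are arbitrary.

```python
def logic_delay_scale(S, U):
    """Scale factor of logic delay (CL x V) / I with CL ~ 1/S, V ~ 1/U, I ~ 1/U."""
    cl, v, i = 1 / S, 1 / U, 1 / U
    return (cl * v) / i


def wire_delay_scale(S):
    """Scale factor of 0.5 x Rmetal x Cmetal x L^2 with Rmetal ~ S, Cmetal ~ S, L ~ 1/S."""
    r_metal, c_metal, length = S, S, 1 / S
    return r_metal * c_metal * length ** 2   # the constant 0.5 cancels in the ratio


S = 0.8 / 0.18   # shrinking from 0.8um to 0.18um
U = 2.0          # arbitrary voltage scaling factor; it cancels out of the logic delay
print("logic delay scale:", logic_delay_scale(S, U))
print("wire delay scale:", wire_delay_scale(S))
```

Logic delay shrinks by 1 / S while wire delay stays put, so wire delay grows by a factor of S relative to logic delay with each shrink.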
Register Renaming Table
Dependency Checking Logic
• Accessed in parallel with the map table
• Every logical source register is compared against the logical destination registers of earlier instructions in the current rename group
• For IW = 2, 4, 8, its delay is less than the map table delay
Renaming Delay
• SRAM scheme
• Delay Components:
– Time to decode the arch reg index
– Time to drive wordline
– Time to pull down bit line
– Time for SenseAmp to detect pull-down
– MUX time is ignored because the control from the dependency check logic arrives in advance
Renaming Circuit
Decoder Delay
Decoder Delay
• Predecoding for speed
• Length of predecode lines = (Cellheight + 3 x IW x Wordline spacing) x NVREG
– Cellheight: height of a single cell excluding wordlines
– Wordline spacing
• NVREG: number of virtual registers
• Factor of 3: instructions have 3 operands
Decoder Delay
• Tnand: fall delay of NAND
• Tnor: rise delay of NOR
• Req: NAND pull-down channel resistance (Rnandpd) + predecode line metal resistance
• Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance
Decoder Delay
• Substituting the predecode line length, Req and Ceq, we get: Tdecode = c0 + c1 x IW + c2 x IW^2
• c2: intrinsic RC delay of predecode line
• c2 very small
• Decoder delay ~linearly dependent on IW
Rename Delay
• Wordline: Twordline = c0 + c1 x IW + c2 x IW^2
• c2: intrinsic RC delay of wordline
• c2 very small
• Wordline delay ~linearly dependent on IW
Rename Delay
• Bitline: Tbitline = c0 + c1 x IW + c2 x IW^2
• c2 very small
• Bitline delay ~linearly dependent on IW
• SenseAmp delay ~linearly dependent on IW
Rename Logic Delay Scaling
• Total delay increases linearly with IW
• Each component shows a linear increase with IW
• Bitline delay > wordline delay
• Bitline length ~ number of logical registers
• Wordline length ~ width of physical register designator
• IW impact on delay worsens with decreasing feature size (the increase in bitline and wordline delay with increasing IW gets larger)
– 0.8um: IW 2 → 8: bitline delay +37%
– 0.18um: IW 2 → 8: bitline delay +53%
Wakeup Delay
• Critical path: mismatch → pull ready signal low
• Delay components:
– Tag drive: tag drivers drive the tag lines (vertical)
– Tag match: the pull-down stack of a mismatched bit pulls the matchline low (horizontal)
– Match OR: final OR gate ORs all the matchlines of an operand tag
• Ttagdrive ~ driver pull-up R, tagline length and tagline load C
• Quadratic component significant for IW > 2 at 0.18um
Wakeup Delay
• Tag match and match OR: quadratic component small for both
• Both delays ~linearly dependent on IW
Wakeup Delay: IW and Window Size
• 0.18um process
• Quadratic dependence
• Issue width has the greater effect: it increases all 3 delay components
• When IW and WinSize increase together, the delay follows the trend highlighted on the slide's plot
Wakeup Delay: Window Size
• 8-way, 0.18um process
• Tag drive delay increases rapidly as WinSize increases
• Match OR delay constant
Wakeup Delay: Feature Size
• 8-way, 64-entry window
• Tag drive and tag match delays do not scale as well as match OR delay
• Match OR: logic delay
• Others: also have wire delays
Selection Logic and Bypass Delay
• Selection
– Logarithmically dependent on WinSize
• Bypass: delay dependent on (IW)^2