lect05

Lecture 5: Interrupts, Superscalar
Professor Alvin R. Lebeck
Computer Science 220 / ECE 252
Fall 2008
Admin
• Homework #1 Due Today
• Homework #2 Assigned
• Reading
– H&P Chapter 2 & 3 (suggested)
– Research papers (not yet ready to read, but will be soon!):
» Hinton et al.: “The Microarchitecture of the Pentium 4 Processor”
» Palacharla, Jouppi, and Smith: “Complexity-Effective Superscalar Processors”
» Akkary, Rajwar, and Srinivasan: “Checkpoint Processing and Recovery”
© 2008 Lebeck, Sorin, Roth, Hill, Wood,
Sohi, Smith, Vijaykumar, Lipasti, Katz
Computer Science 220
Review: Hazards
Data Hazards
• RAW
– only one that can occur in simple 5-stage pipeline
• WAR, WAW
• Data Forwarding (Register Bypassing)
– send data from one stage to another bypassing the register file
• Still have load use delay
Structural Hazards
• Replicate Hardware, scheduling
Control Hazards
• Compute condition and target early (delayed branch)
Review: Dynamic Branch Prediction
• Solution: 2-bit counter where prediction changes only if mispredict twice
• Increment for taken, decrement for not-taken
– 00, 01, 10, 11
• Helps when target is known before condition
[Figure: 2-bit predictor state diagram with two “Predict Taken” states and two “Predict Not Taken” states; T (taken) and NT (not taken) edges move between them.]
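The counter described above can be sketched in a few lines of Python. This is a minimal illustration, not the slide's exact state diagram: it assumes a saturating counter in which states 10 and 11 predict taken and 00 and 01 predict not taken. The names TwoBitCounter and count_mispredicts are made up for this example.

```python
class TwoBitCounter:
    """2-bit counter: states 0..3 (00, 01, 10, 11); 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # Increment for taken, decrement for not-taken (assumed to saturate).
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredicts(outcomes, state=3):
    """Run a sequence of branch outcomes through one counter."""
    counter = TwoBitCounter(state)
    wrong = 0
    for taken in outcomes:
        if counter.predict() != taken:
            wrong += 1
        counter.update(taken)
    return wrong


# A loop branch: taken 9 times, not taken at loop exit, executed twice.
loop = ([True] * 9 + [False]) * 2
mispredicts = count_mispredicts(loop)
print(mispredicts)
```

With this pattern the only mispredictions are the loop exits: a single not-taken outcome moves the counter from 11 to 10, so the prediction is still "taken" when the loop is entered again.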
Review: Correlating Branches
• Idea: taken/not-taken outcome of recently executed branches is related to behavior of next branch (as well as the history of that branch's behavior)
• Tournament: choose between alternative predictors
– How do you choose?
[Figure: correlating predictor. The branch address and a 2-bit global branch history select one of the 2-bit per-branch predictors, which supplies the prediction.]
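One way to picture the correlating scheme is a table of 2-bit counters selected by both the branch address and a 2-bit global history. The sketch below is an assumption-laden illustration: the table size, the use of `pc >> 2` as the index, the initial counter value, and the names CorrelatingPredictor and run are all hypothetical.

```python
class CorrelatingPredictor:
    def __init__(self, index_bits=4, history_bits=2):
        self.index_mask = (1 << index_bits) - 1
        self.history_mask = (1 << history_bits) - 1
        self.history = 0  # global history, most recent outcome in bit 0
        # One 2-bit counter per (branch index, history pattern), starting at 01.
        self.counters = [[1] * (1 << history_bits)
                         for _ in range(1 << index_bits)]

    def _row(self, pc):
        return self.counters[(pc >> 2) & self.index_mask]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the new outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, pc, outcomes):
    """Return the number of mispredictions over a sequence of outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


p = CorrelatingPredictor()
warmup = run(p, 0x40, [True, False] * 2)    # alternating branch, training
steady = run(p, 0x40, [True, False] * 10)   # same pattern after training
print(warmup, steady)
```

A single 2-bit counter handles a strictly alternating branch poorly, whereas here each history pattern trains its own counter, so the alternating pattern is predicted correctly once the counters warm up.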
Review: Need Address @ Same Time as Prediction
• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
– Note: must check for branch match now, since can’t use wrong branch address
• Procedure return addresses predicted with a stack
[Figure: BTB lookup. The PC of the instruction to fetch indexes entries 0 … n-1, each holding a branch PC, a predicted PC, and a taken/not-taken prediction. If the stored branch PC equals (=) the fetch PC: yes, use predicted PC. Otherwise: no, not a branch.]
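The lookup-with-match-check can be sketched as follows. This is a simplified model under stated assumptions: a direct-mapped table, fixed 4-byte instructions, and a one-bit taken/not-taken prediction per entry. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        # Each entry is None or (branch_pc, predicted_pc, predict_taken).
        self.table = [None] * entries

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc, inst_size=4):
        """Return the next fetch PC for the instruction at `pc`."""
        entry = self.table[self._index(pc)]
        # Match check: the stored branch PC must equal the fetch PC,
        # otherwise the predicted target belongs to some other branch.
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + inst_size  # not a known taken branch: fall through

    def update(self, pc, target, taken):
        self.table[self._index(pc)] = (pc, target, taken)


btb = BranchTargetBuffer()
btb.update(0x100, 0x200, True)
print(hex(btb.lookup(0x100)))  # known taken branch
print(hex(btb.lookup(0x140)))  # maps to the same entry, but the PC does not match
```

Without the match check, the fetch at 0x140 would be redirected to 0x200 even though it is not the branch that trained the entry.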
Review: Multicycle Ops in Pipeline
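[Figure: pipeline with multicycle units. After IF and ID/RF, an instruction uses one of: the single-cycle EX stage, a pipelined unit with stages M1 to M7, a pipelined unit with stages A1 to A4, or the FP/INT divide unit (not pipelined, 25 clocks). All paths then continue to MEM and WB.]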
Interrupts and Exceptions
• Unnatural change in control flow
• warning: varying terminology
– “exception” sometimes refers to all cases
– “Trap” software trap, hardware trap
• Exception is potential problem with program
– condition occurs within the processor
– segmentation fault
– bus error
– divide by 0
– Don’t want my bug to crash the entire machine
– page fault (virtual memory…)
Interrupts and Exceptions
• Interrupt is external event
– devices: disk, network, keyboard, etc.
– clock for timeslicing
– These are useful events, must do something when they occur.
• Trap is user-requested exception
– operating system call (syscall)
Handling an Exception/Interrupt
[Figure: a user program (ld, add, st, div, beq, ld, sub, bne) is interrupted; control transfers to the interrupt handler, which returns with RETT.]
• Invoke specific kernel routine based on type of interrupt
– interrupt/exception handler
• Must determine what caused interrupt
– could use software to examine each device
– PC = interrupt_handler
• Vectored Interrupts
– PC = interrupt_table[i]
• Kernel initializes table at boot time
• Clear the interrupt
• May return from interrupt (RETT) to different process (e.g., context switch)
• Similar mechanism is used to handle interrupts, exceptions, traps
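Vectored dispatch amounts to an indexed jump through a table the kernel fills in at boot. The sketch below models that in Python. The vector numbers, handler names, and the take_interrupt function are hypothetical, and it always resumes the interrupted program rather than modeling a context switch.

```python
log = []

def disk_handler():
    log.append("disk")

def timer_handler():
    log.append("timer")

# Hypothetical vector numbers; a real machine defines its own.
DISK, TIMER = 1, 2

# Kernel initializes the table at boot time.
interrupt_table = {DISK: disk_handler, TIMER: timer_handler}

def take_interrupt(vector, interrupted_pc):
    """Dispatch interrupt `vector`, then return the PC to resume at."""
    handler = interrupt_table[vector]  # PC = interrupt_table[i]
    handler()                          # handler services and clears the interrupt
    return interrupted_pc              # RETT back to the interrupted program

resume_pc = take_interrupt(TIMER, 0x400)
print(log, hex(resume_pc))
```

The non-vectored alternative would use one shared entry point (PC = interrupt_handler) that polls each device in software to find the cause.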
Execution Mode
• What if interrupt occurs while in interrupt handler?
– Problem: could lose information for one interrupt, e.g., clearing interrupt #1 clears both #1 and #2
– Solution: disable interrupts
• Disabling interrupts is a protected operation
– Only the kernel can execute it
– user vs. kernel mode
– mode bit in CPU status register
• Other protected operations
– installing interrupt handlers
– manipulating CPU state (saving/restoring status registers)
• Changing modes
– interrupts
– system calls (syscall instruction)
A System Call (syscall)
[Figure: a user program (ld, add, st, TA 6, beq, ld, sub, bne) executes the trap instruction TA 6; control transfers to the kernel’s trap handler, which invokes the service routines and returns with RETT.]
• Special instruction to change modes and invoke service
– read/write I/O device
– create new process
• Invokes specific kernel routine based on argument
• Kernel-defined interface
• May return from trap to different process (e.g., context switch)
• RETT, instruction to return to user process
Interrupts/exceptions
• classifying interrupts
– terminal (fatal) vs. restartable (control returned to program)
– synchronous (internal) vs. asynchronous (external)
– user-requested vs. coerced
– maskable (ignorable) vs. non-maskable
– between instructions vs. within instruction
Precise Exceptions
“unobserved system can exist in any intermediate state, upon observation system collapses to well-defined state”
– 2nd postulate of quantum mechanics
• system → processor, observation → interrupt
• what is the “well-defined” state?
– von Neumann: “sequential, instruction atomic execution”
– precise state at interrupt
» all instructions older than interrupt are complete
» all instructions younger than interrupt haven’t started
• implies interrupts are taken in program order
• necessary for VM (why?), “highly recommended” by
IEEE
Pipelining Complications
• Interrupts (Exceptions)
– 5 instructions executing in 5 stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the interrupt?
Stage | Problem interrupts occurring
IF    | Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID    | Undefined or illegal opcode
EX    | Arithmetic interrupt
MEM   | Page fault on data fetch; misaligned memory access; memory-protection violation
Pipelining Complications
• Simultaneous exceptions in > 1 pipeline stage
– Load with data page fault in MEM stage
– Add with instruction page fault in IF stage
• Solution #1
– Interrupt status vector per instruction
– Defer check until last stage, kill state update if exception
• Solution #2
– Interrupt ASAP
– Restart everything that is incomplete
• Another advantage for state update late in
pipeline!
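The load/add example above shows why the order in which faults are detected differs from program order. The sketch below makes the timing explicit, assuming an ideal 5-stage pipeline with no stalls in which instruction i enters IF in cycle i. The function names are hypothetical, and first_deferred models Solution #1 only.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def detection_cycle(index, stage):
    # Instruction `index` enters IF in cycle `index` (no stalls assumed).
    return index + STAGES.index(stage)

def first_detected(program):
    """Name of the instruction whose fault is detected earliest in time."""
    faults = [(detection_cycle(i, stage), name)
              for i, (name, stage) in enumerate(program) if stage is not None]
    return min(faults)[1] if faults else None

def first_deferred(program):
    """Solution #1: each instruction carries its status vector to the last
    stage. An in-order pipeline reaches that stage in program order, so the
    oldest faulting instruction is the one that raises the exception."""
    for name, stage in program:
        if stage is not None:
            return name
    return None

# (name, stage in which it faults) in program order.
program = [("ld", "MEM"), ("add", "IF")]
print(first_detected(program), first_deferred(program))
```

The add's instruction page fault is detected in cycle 1, two cycles before the older load's data page fault in cycle 3. Deferring the check to the last stage restores program order.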
Interrupts/Exceptions are Nasty
• odd bits of state must be precise (e.g., condition
codes)
• delayed branches
– what if instruction in delay slot takes an interrupt?
• Out of order Writes (e.g., autoinc, multicycle ops)
– must undo write (e.g., future-file, history-file)
• some machines had precise interrupts only in integer
pipe
– sufficient for implementing VM (e.g., VAX/Alpha)
• Lucky for us, there’s a nice, clean way to handle
precise state
– We’ll see how this is done in a couple of lectures ...
Pipelining x86
• The x86 ISA has some really nasty instructions - how
did Intel ever figure out how to build a pipelined x86
microprocessor?
• Solution: at runtime, “crack” x86 instructions (macro-ops) into RISC-like micro-ops
– First used in P6 (Pentium Pro)
– Used in all subsequent x86 processors, including those from AMD
• What are the potential challenges for implementing
this solution?
Where are We
• principles of pipelining
– pipeline depth: clock rate vs. number of stalls (CPI)
• hazards
– structural
– data (RAW, WAR, WAW)
– control
• Branch prediction
• multi-cycle operations
– structural hazards, WAW hazards
• interrupts
– precise state
• Next up: CPI < 1
Getting CPI < 1: Issuing Multiple Instructions/Cycle
• “Flynn bottleneck”
– single issue performance limit is CPI = IPC = 1
– hazards + overhead → CPI >= 1 (IPC <= 1)
• diminishing returns from deep pipelines
• solution: issue multiple instructions per cycle
• Superscalar: varying number of instructions/cycle (1 to 8), scheduled by compiler (statically scheduled) or by HW (Tomasulo; dynamically scheduled)
– First superscalar IBM America → RS6000 → Power1
– Pentium4, IBM PowerPC, Sun SuperSparc, DEC Alpha, HP PA8000
Base Implementation
• statically scheduled (in-order) superscalar
– executes unmodified sequential programs
– figures out on its own what can be done in parallel
– e.g., Sun UltraSPARC, Alpha 21164
– we’ll start with this one
– What has to change from single issue to multiple issue?
CPI < 1: Issuing Multiple Instructions/Cycle
• Ex 2-way superscalar: 1 FP & 1 anything else
– Fetch 64-bits/clock cycle; Int on left, FP on right
– Can only issue 2nd instruction if 1st instruction issues
– More ports for FP registers to do FP load & FP op in a pair
Type              Pipe stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
• 1 cycle load delay expands to 3 instructions in SS
– instruction in right half can’t use it, nor instructions in next slot
Implications of Superscalar
[Figure: single-issue pipeline datapath with PC, BP, I$, regfile, and D$; stages F, D, X, M, W separated by latches F/D, D/X, X/M, M/W.]
• what is involved in
– fetching two instructions per cycle?
– decoding two instructions per cycle?
– executing two ALU operations per cycle?
– accessing the data cache twice per cycle?
– writing back two results per cycle?
• what about 4 or 8 instructions per cycle?
Wide Fetch
• Fetch N instructions per cycle
• if instructions are sequential...
– and on same cache line → nothing really
– and on different cache lines → banked I$ + combining network
• if instructions are not sequential...
– more difficult
– two serial I$ accesses (access1 → predict target → access2)? no
• note: embedded branches OK as long as predicted NT
– serial access + prediction in parallel
– if prediction is T, discard serial part after branch
• Trace Cache…
Wide Decode
• Decode N instructions per cycle
• actually decoding instructions?
– easy if fixed length instructions (multiple decoders)
– harder (but possible) if variable length
• reading input register values?
– 2N register read ports (register file read latency ~2N)
– actually less than 2N, since most values come from bypasses
• what about the stall logic to enforce RAW
dependences?
N² Dependence Check Logic
• remember stall logic for single issue pipeline
– rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)
– same for rs2(D)
– full-bypassing reduces to rs1(D) == rd(D/X) && op(D/X) == LOAD
• doubling issue width (N) quadruples stall logic!
– not only 2 instructions in D, but two instructions in every stage
– (rs1(D1) == rd(D/X1) && op(D/X1) == LOAD)
– (rs1(D1) == rd(D/X2) && op(D/X2) == LOAD)
– repeat for rs1(D2), rs2(D1), rs2(D2)
– also check dependence of 2nd instruction on 1st: rs1(D2) == rd(D1)
• “N² dependence cross-check”
– for N-wide pipeline, stall (and bypass) circuits grow as N²
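To see the growth concretely, the stall conditions above can be written as nested loops and the register comparisons counted. This is an illustrative sketch assuming full bypassing, so only load-use hazards against D/X and dependences within the decode group cause stalls. The Inst tuple, must_stall, and comparisons_for_width are hypothetical names, and real hardware evaluates all comparisons in parallel rather than in a loop.

```python
from collections import namedtuple

Inst = namedtuple("Inst", "op rd rs1 rs2")

def must_stall(decode_group, dx_group):
    """Return (stall, comparisons) for one cycle, assuming full bypassing."""
    comparisons = 0
    stall = False
    for i, inst in enumerate(decode_group):
        for src in (inst.rs1, inst.rs2):
            # Load-use check against every instruction in D/X.
            for older in dx_group:
                comparisons += 1
                if src is not None and older.op == "LOAD" and src == older.rd:
                    stall = True
            # Cross-check against older instructions in the same decode group.
            for older in decode_group[:i]:
                comparisons += 1
                if src is not None and src == older.rd:
                    stall = True
    return stall, comparisons

def comparisons_for_width(n):
    """Count comparisons for n independent instructions in D and in D/X."""
    decode = [Inst("ADD", 10 + i, 20 + i, 30 + i) for i in range(n)]
    dx = [Inst("ADD", 40 + i, 50 + i, 60 + i) for i in range(n)]
    return must_stall(decode, dx)[1]

# SUB reads r3, which ADD in the same decode group writes.
pair = [Inst("ADD", 3, 1, 2), Inst("SUB", 5, 3, 4)]
dx = [Inst("LOAD", 6, 7, None), Inst("ADD", 8, 9, 10)]
print(must_stall(pair, dx))
print([comparisons_for_width(n) for n in (1, 2, 4)])
```

In this model the count is 2N² comparisons against D/X plus N(N-1) within the group, so each doubling of N multiplies the comparator count by roughly four.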
Superscalar Stalls
• invariant: stalls propagate upstream to younger
instructions
• what if older instruction in issue “pair” (inst0) stalls?
– younger instruction (inst1) stalls too, cannot pass it
• what if younger instruction (inst1) stalls?
– can older instruction from next group (inst2) move up?
• Rigid pipeline: No
• Fluid pipeline: Yes
Wide Execute
• What does it take to execute N instructions per cycle?
• multiple execution units...N of every kind?
– N ALUs? OK, ALUs are small
– N FP dividers? no, FP dividers are huge (and fdiv is uncommon)
• typically have some mix (proportional to instruction mix)
– RS/6000: 1 ALU/memory/branch + 1 FP
– Pentium: 1 any + 1 ALU
– Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch
– Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store
N² Bypass
• N² bypass logic... OK
– only 5-bit quantities
– compare to generate 1-bit outcomes
– similar to stall logic
• N² bypass buses... not even close to OK
– 32-bit or 64-bit quantities
– broadcast, route, and multiplex (mux)
– difficult to lay out and route all the wires
– wide (SLOW) muxes
• big design problem today
One Solution to N² Bypass: Clustering
• group functional units into clusters
– full bypass within cluster
– no bypass between clusters
– ~(N/k) inputs at each mux
– ~(N/k)² routed buses in each cluster
• steer instructions to different clusters
– dependent instructions to same cluster
– exploit intra-cluster bypass
– static or dynamic steering is possible
• e.g., Alpha 21264
– 4-wide, 300MHz
– full bypass didn’t fit into 1 clock cycle
– 2 clusters with full intra-cluster bypass
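The approximations above lend themselves to a quick back-of-the-envelope calculation. The sketch below assumes N functional units split evenly into k clusters and takes the slide's ~(N/k) and ~(N/k)² estimates at face value. The function name bypass_cost and the derived total (k clusters times (N/k)² buses) are illustrative additions, not figures from any real design.

```python
def bypass_cost(n, k=1):
    """Rough bypass counts for n functional units in k equal clusters."""
    per_cluster = n // k
    return {
        "mux_inputs": per_cluster,               # ~(N/k) inputs at each mux
        "buses_per_cluster": per_cluster ** 2,   # ~(N/k)^2 buses per cluster
        "total_buses": k * per_cluster ** 2,     # k clusters, i.e. ~N^2/k
    }

full = bypass_cost(4)          # 4-wide, one cluster (full bypass)
clustered = bypass_cost(4, 2)  # 4-wide, two clusters
print(full)
print(clustered)
```

For a 4-wide machine split into 2 clusters, each mux sees about half as many inputs and the total number of routed buses is halved, at the cost of no bypass between clusters.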
Wide Memory Access
• what is involved in accessing memory for multiple
instructions per cycle?
• multi-banked D$
– requires bank assignment and conflict-detection logic
• (rough) instruction mix: 20% loads, 15% stores
– for width N, we need about 0.2*N load ports, 0.15*N store ports
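Bank assignment and conflict detection for a multi-banked D$ can be sketched as below. The interleaving scheme is an assumption: consecutive cache lines map to consecutive banks, with 4 banks and 32-byte lines chosen arbitrarily. The function names are hypothetical.

```python
def bank_of(address, num_banks=4, line_bytes=32):
    """Bank assignment: consecutive lines go to consecutive banks (assumed)."""
    return (address // line_bytes) % num_banks

def find_conflicts(addresses, num_banks=4, line_bytes=32):
    """Group same-cycle accesses by bank; a bank with >1 access conflicts."""
    by_bank = {}
    for addr in addresses:
        bank = bank_of(addr, num_banks, line_bytes)
        by_bank.setdefault(bank, []).append(addr)
    return {bank: addrs for bank, addrs in by_bank.items() if len(addrs) > 1}

no_conflict = find_conflicts([0x00, 0x20])  # adjacent lines, banks 0 and 1
conflict = find_conflicts([0x00, 0x80])     # both addresses map to bank 0
print(no_conflict, conflict)
```

When a conflict is found, one of the accesses has to wait a cycle, which is why the conflict-detection logic feeds back into the stall logic.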
Wide Writeback
• what is involved in writing back multiple instructions
per cycle?
• nothing too special, just another port on the register
file
– everything else is taken care of earlier in pipeline
• adding ports isn’t free, though
– increases area
– increases access latency
Multiple Issue Summary
• superscalar problem spots
– fetch, branch prediction → trace cache?
– decode (N² dependence cross-check)
– execute (N² bypass) → clustering?
Can we do better?
• Problem: Stall in ID stage if any data hazard.
• Your task: Teams of two, propose a design to
eliminate these stalls.
MULD F2, F3, F4    Long latency…
ADDD F1, F2, F3
ADDD F3, F4, F5
ADDD F1, F4, F5
Next Time
• Dynamic Scheduling
• Read papers
• HW #2 Assigned