Superscalar Processors
Kasi L.K. Anbumony
Department of Electrical and Computer Engineering
Auburn University
Auburn, AL 36849
10/24/05 ELEC6200
Outline
• Pipelining: Motivation
• Pipeline Hazards
• Advanced Pipelining
– Instruction Level Parallelism (ILP)
– Multiple Issue (MIPS Superscalar)
    Static Multiple Issue (SW centric)
    Dynamic Multiple Issue (HW centric)
• Superscalar Processor
• Conclusion
Pipelining: Motivation
• Multiple instructions are overlapped in execution to exploit
instruction-level parallelism (ILP)
• One of the techniques used to make processors fast
• Some terms:
– Stages
– Task order
– Throughput
• In a pipeline, the stages operate concurrently (in parallel)
• This is possible as long as each stage has its own resources
Sequential Laundry: Non-pipelined
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, run back to back, each taking steps of 30, 40, and 20 minutes]
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, overlapped; segment lengths 30, 40, 40, 40, 40, 20 minutes]
• Pipelined laundry takes 3.5 hours for 4 loads
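The two laundry totals can be checked with a short simulation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, the stage times (30, 40, 20 minutes) are read off the figure, and it assumes one machine per stage and that a finished load can wait between stages without blocking the machine it just left.

```python
# Stage times in minutes, read off the laundry figure (assumption: 3 stages).
STAGE_MINUTES = [30, 40, 20]
LOADS = 4


def sequential_minutes(stages, loads):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """One machine per stage; a load enters a stage as soon as it has left
    the previous stage and the previous load has left this stage."""
    stage_free_at = [0] * len(stages)  # time at which each stage is next free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            ready = start + duration
            stage_free_at[s] = ready
        finish = ready
    return finish


seq = sequential_minutes(STAGE_MINUTES, LOADS)
pipe = pipelined_minutes(STAGE_MINUTES, LOADS)
print(seq / 60, pipe / 60)  # hours for the sequential and pipelined cases
```

The pipelined total is set by the 40-minute stage: one 30-minute fill, four 40-minute slots, and one 20-minute drain.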
Pipelining: Lessons
• Pipelining improves the throughput of the entire workload without
reducing the time to complete a single load
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D in task order; segment lengths 30, 40, 40, 40, 40, 20 minutes]
Comparison: Example
Consider a non-pipelined machine with 5 execution steps of
lengths 200 ps, 100 ps, 200 ps, 200 ps, and 100 ps. Due to
clock skew and setup, pipelining adds 5 ps of overhead to
each instruction stage. Ignoring latency impact, how much
speedup in the instruction execution rate will we gain from a
pipeline?
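One way to work this example is sketched below. It assumes the non-pipelined machine takes the sum of the five step times per instruction, and that the pipelined clock cycle is the longest step plus the 5 ps overhead, with one instruction completing per cycle and no stalls. The variable names are illustrative only.

```python
# Step lengths and overhead from the example, in picoseconds.
STEP_PS = [200, 100, 200, 200, 100]
OVERHEAD_PS = 5

# Non-pipelined: every instruction takes the sum of all steps.
unpipelined_time_ps = sum(STEP_PS)

# Pipelined: the clock must fit the slowest stage plus the added overhead.
pipelined_cycle_ps = max(STEP_PS) + OVERHEAD_PS

# Ratio of average instruction times (steady state, no stalls assumed).
speedup = unpipelined_time_ps / pipelined_cycle_ps
print(unpipelined_time_ps, pipelined_cycle_ps, round(speedup, 2))
```

Under these assumptions the speedup is about 3.9 rather than 5, because the stages are unbalanced and each stage carries overhead.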
Sequential vs. Pipelined Execution
[Figure: Sequential execution runs three instructions back to back, 800 ps each, with steps of 200, 100, 200, 200, and 100 ps. Pipelined execution overlaps the same instructions, each starting one stage after the previous one.]
Speed Up Equation for Pipelining
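\[
\text{Speedup from pipelining}
= \frac{\text{Avg. Instr. Time}_{\text{unpipelined}}}{\text{Avg. Instr. Time}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}}
= \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Ideal CPI of the pipelined machine:
\[
\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]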
Speed Up Equation for Pipelining
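\[
\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr.}
\]
\[
\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]
Taking Ideal CPI = 1:
\[
\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}
\]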
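The speedup formula can be evaluated directly. The sketch below is illustrative only: the function name and parameter names are hypothetical, and the default arguments assume an ideal CPI of 1 and equal clock cycles for the two machines.

```python
def pipeline_speedup(depth, stall_cpi, ideal_cpi=1.0, clock_ratio=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x clock ratio.

    clock_ratio is Clock Cycle unpipelined / Clock Cycle pipelined.
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


no_stalls = pipeline_speedup(depth=5, stall_cpi=0.0)
with_stalls = pipeline_speedup(depth=5, stall_cpi=0.25)
print(no_stalls, with_stalls)
```

With no stalls, the speedup equals the pipeline depth. A quarter of a stall cycle per instruction is enough to bring a 5-stage pipeline down to a speedup of 4.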
It’s Not That Easy for Computers: Limitation
• Limits to pipelining: Hazards prevent the next instruction from
executing during its designated clock cycle
– Structural hazards: The hardware cannot support the combination
of instructions that must execute in the same clock cycle
(combined washer+dryer)
– Data hazards: An instruction depends on the result of a prior
instruction still in the pipeline (one sock missing)
– Control hazards: Pipelining of branches and other instructions
that change the flow of control
• Common solution: stall the pipeline until the hazard is resolved,
inserting “bubbles” into the pipeline
Instruction Level Parallelism
• One way to increase ILP: a longer (deeper) pipeline
• Laundry analogy: Divide the washer into three machines that
perform the wash, rinse, and spin steps of a traditional
machine
• To get the full speedup, we need to rebalance the remaining
steps so that they are all the same length
• The amount of parallelism exploited is higher, since more
operations are being overlapped
Advanced Pipelining: Techniques
• Motivation:
To further exploit instruction-level parallelism (ILP)
• Multiple issue:
Replicate the internal components of the computer so that
it can launch multiple instructions in every pipeline stage
• Dynamic pipeline scheduling (also called dynamic pipelining):
Hardware support for dynamic multiple issue, reordering
instruction execution to avoid pipeline hazards
Multiple Issue: Superscalar
• Launch multiple instructions in parallel
• A superscalar laundry would replace our household washer
and dryer with, say, three washers and three dryers, plus
three assistants to fold and put away three times as
much laundry in the same amount of time
• Downside: extra work is needed to keep all the machines busy
and to transfer each load to the next pipeline stage
• Superscalar is defined as executing more than one instruction
per clock cycle
Performance Metrics: CPI & IPC
• With multiple issue, the instruction execution rate can exceed
the clock rate, so CPI can be less than 1
• Example: a 6 GHz, 4-way multiple-issue microprocessor can
execute at a peak rate of 24 billion instructions per second
and has a best-case CPI of 0.25
• Instructions per clock cycle (IPC) is the inverse of CPI (for the
above case: 4)
• Assuming a 5-stage pipeline, such a processor would have up to
20 instructions in execution at any given time
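The arithmetic behind these figures is sketched below. The variable names are illustrative, and the numbers are peak values that assume every issue slot is filled on every cycle.

```python
clock_hz = 6_000_000_000   # 6 GHz clock rate
issue_width = 4            # 4-way multiple issue
pipeline_stages = 5        # assumed 5-stage pipeline

peak_instructions_per_second = clock_hz * issue_width
peak_ipc = issue_width                 # instructions per clock cycle
best_case_cpi = 1 / peak_ipc           # clock cycles per instruction
max_in_flight = issue_width * pipeline_stages

print(peak_instructions_per_second, best_case_cpi, peak_ipc, max_in_flight)
```

Real programs fall short of these peaks whenever hazards leave issue slots empty.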
Multiple issue processor: Decision Strategy
• Static Multiple Issue
– Decisions are made at compile time, before execution
– Software based
– Compiler scheduling
– VLIW (Very Long Instruction Word)
• Dynamic Multiple Issue
– Decisions are made at run (execution) time by the processor
– Dynamic scheduling
– Hardware based
Static Multiple Issue Processor
• Issue packet: the set of instructions issued together in one
clock cycle; it can be viewed as one large instruction with
multiple operations (VLIW)
• Relies on the compiler to take responsibility for handling
data and control hazards
• The compiler's responsibilities may include static branch
prediction and code scheduling
Getting CPI < 1: Static 2-Issue Pipeline
• Superscalar MIPS: issues 2 instructions per clock cycle, 1 ALU
instruction and 1 load instruction
– Fetch 64 bits per clock cycle; ALU instruction on the left, load on the right
– Can only issue the 2nd instruction if the 1st instruction issues
Type               Pipe Stages
ALU instruction    IF   ID   EX   MEM  WB
Load instruction   IF   ID   EX   MEM  WB
ALU instruction         IF   ID   EX   MEM  WB
Load instruction        IF   ID   EX   MEM  WB
ALU instruction              IF   ID   EX   MEM  WB
Load instruction             IF   ID   EX   MEM  WB
Static Multiple Issue: Datapath
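[Figure: two-issue datapath. The instruction memory (IM) supplies an ALU/branch instruction and a lw/sw instruction each cycle; both read the register file, and each path has its own ALU.]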
Example: Multiple Issue code scheduling
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
• After reordering the instructions based on their dependencies, the
5 instructions issue in 4 clock cycles: CPI = 0.8, or IPC = 1.25
        ALU/branch               lw/sw               Clock cycle
Loop:                            lw  $t0, 0($s1)     1
        addi $s1, $s1, -4                            2
        addu $t0, $t0, $s2                           3
        bne  $s1, $zero, Loop    sw  $t0, 4($s1)     4
The sw offset is 4 because addi has already decremented $s1 by the time the store executes.
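The CPI and IPC of a two-issue schedule follow from counting filled slots. The sketch below encodes the four-cycle schedule as a list of (ALU/branch, lw/sw) pairs. The representation and the function name are assumptions made for illustration, and the store offset is written as 4($s1) on the assumption that the addi has already decremented $s1 before the store executes.

```python
NOP = None  # an empty issue slot

# One tuple per clock cycle: (ALU/branch slot, lw/sw slot).
schedule = [
    (NOP, "lw $t0, 0($s1)"),
    ("addi $s1, $s1, -4", NOP),
    ("addu $t0, $t0, $s2", NOP),
    ("bne $s1, $zero, Loop", "sw $t0, 4($s1)"),
]


def cpi_and_ipc(schedule):
    """Count issued instructions (non-empty slots) against clock cycles."""
    cycles = len(schedule)
    instructions = sum(slot is not NOP for packet in schedule for slot in packet)
    return cycles / instructions, instructions / cycles


cpi, ipc = cpi_and_ipc(schedule)
print(cpi, ipc)
```

Only 5 of the 8 available issue slots are used, which is why the result is well short of the ideal CPI of 0.5.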
Loop Unrolling: 4 Iterations
• Multiple copies of the loop body are made, exposing more ILP by
overlapping instructions from different iterations
• 14 instructions issue in 8 clock cycles: CPI = 8/14 ≈ 0.57
        ALU/branch               lw/sw               Clock cycle
Loop:   addi $s1, $s1, -16       lw  $t0, 0($s1)     1
                                 lw  $t1, 12($s1)    2
        addu $t0, $t0, $s2       lw  $t2, 8($s1)     3
        addu $t1, $t1, $s2       lw  $t3, 4($s1)     4
        addu $t2, $t2, $s2       sw  $t0, 16($s1)    5
        addu $t3, $t3, $s2       sw  $t1, 12($s1)    6
                                 sw  $t2, 8($s1)     7
        bne  $s1, $zero, Loop    sw  $t3, 4($s1)     8
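To see that unrolling preserves the loop's behavior, the sketch below models both versions in Python and compares the resulting memory. It is an illustration under stated assumptions: memory is a dictionary keyed by byte address, the element count is a multiple of 4, each copy of the loop body uses its own temporary ($t0 to $t3), and the first load reads $s1 before the addi in the same issue packet takes effect. The function names are hypothetical.

```python
def original_loop(mem, s1, s2):
    """One element per iteration, as in the five-instruction loop."""
    while True:
        mem[s1] = mem[s1] + s2      # lw, addu, sw
        s1 -= 4                     # addi
        if s1 == 0:                 # bne falls through when $s1 reaches 0
            return mem


def unrolled_loop(mem, s1, s2):
    """Four elements per iteration; element count must be a multiple of 4."""
    while True:
        t0 = mem[s1]                # lw $t0, 0($s1) sees the old $s1
        s1 -= 16                    # addi $s1, $s1, -16
        t1 = mem[s1 + 12]
        t2 = mem[s1 + 8]
        t3 = mem[s1 + 4]
        mem[s1 + 16] = t0 + s2      # each store pairs with its own temporary
        mem[s1 + 12] = t1 + s2
        mem[s1 + 8] = t2 + s2
        mem[s1 + 4] = t3 + s2
        if s1 == 0:
            return mem


def make_memory(n_words):
    """Words at byte addresses 4, 8, ..., 4*n_words holding 1..n_words."""
    return {4 * i: i for i in range(1, n_words + 1)}


result_original = original_loop(make_memory(8), s1=32, s2=10)
result_unrolled = unrolled_loop(make_memory(8), s1=32, s2=10)
print(result_original == result_unrolled)
```

Using a separate temporary for each copy of the body is what lets the four loads, adds, and stores overlap without overwriting one another.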
Dynamic Multiple-Issue Processors
• Instructions are issued in order, and the processor decides
whether zero, one, or more instructions can issue in a given
clock cycle
• Achieving good performance still requires the compiler to
schedule instructions so that dependent instructions are moved
apart, thereby improving the instruction issue rate
Dynamic Scheduling: Definition
• Dynamic pipeline scheduling goes past stalls to find later
instructions to execute while waiting for the stall to be
resolved
• Chooses which instruction to execute next by reordering the
instructions to avoid stalls (dynamic issue decisions)
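• Example: sub does not depend on lw or addu, so it can execute
while addu waits for the loaded value in $t0
      lw   $t0, 20($s2)
      addu $t1, $t0, $s2
      sub  $s4, $s4, $t3
      slti $t5, $s4, 20
      bne  $s1, $zero, Loop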
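The dependences a dynamic scheduler must respect in this sequence can be found mechanically. The sketch below is a simplification made for illustration: it tracks only read-after-write dependences through registers, ignores memory and other hazard types, and uses a hypothetical (name, destination, sources) encoding of each instruction.

```python
# (name, destination register or None, source registers)
instructions = [
    ("lw",   "$t0", ["$s2"]),
    ("addu", "$t1", ["$t0", "$s2"]),
    ("sub",  "$s4", ["$s4", "$t3"]),
    ("slti", "$t5", ["$s4"]),
    ("bne",  None,  ["$s1", "$zero"]),
]


def raw_dependences(instructions):
    """For each instruction, the indices of earlier instructions whose
    result it reads (read-after-write through registers only)."""
    last_writer = {}   # register -> index of the latest instruction writing it
    deps = []
    for index, (_, dest, sources) in enumerate(instructions):
        deps.append({last_writer[r] for r in sources if r in last_writer})
        if dest is not None:
            last_writer[dest] = index
    return deps


deps = raw_dependences(instructions)

# Instructions after the load that do not read its result directly.
not_waiting_on_load = [
    name for (name, _, _), d in zip(instructions, deps)
    if name != "lw" and 0 not in d
]
print(deps, not_waiting_on_load)
```

Only addu reads the load's result, so a scheduler that looks past the stalled addu can start sub, followed by slti once sub has produced $s4.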
HW Schemes: Why?
• Why in HW at run time?
– Works when real dependences cannot be known at compile time
– The compiler can be simpler
– Code compiled for one machine runs well on another
Dynamic Pipeline Scheduling: Model
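[Figure: An instruction fetch and decode unit (in-order issue) feeds a set of reservation stations. Each reservation station feeds a functional unit (integer, ..., FP, lw/sw), which executes out of order. Results go to the reorder buffer in the commit unit (in-order commit).]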
HW Units: Working
• The instruction fetch/decode unit fetches instructions, decodes
them, and sends each one to a corresponding functional unit of
the execute stage
• The execute stage has 5 to 10 functional units, each with buffers
called reservation stations that hold the operands and the operation
• As soon as the buffer contains all its operands and the functional
unit is ready to execute, the result is calculated
• The commit unit decides when it is safe to write the result
to the register file or, for a store, to memory
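The role of a reservation station can be sketched as a small buffer that holds an operation and its operands and reports when execution may begin. This is a toy model for illustration, not a description of any real processor: the class name, the two supported operations, and the use of None for a missing operand are all assumptions.

```python
import operator
from dataclasses import dataclass

# Operations this toy functional unit understands (assumption).
OPERATIONS = {"add": operator.add, "sub": operator.sub}


@dataclass
class ReservationStation:
    op: str
    operands: list  # operand values; None marks a value not yet produced

    def ready(self):
        """True once every operand has arrived."""
        return all(value is not None for value in self.operands)

    def capture(self, position, value):
        """Record an operand when the unit producing it finishes."""
        self.operands[position] = value

    def execute(self):
        """Compute the result; only legal when all operands are present."""
        if not self.ready():
            raise RuntimeError("operands still pending")
        return OPERATIONS[self.op](*self.operands)


station = ReservationStation("add", [3, None])
ready_before = station.ready()   # second operand has not arrived yet
station.capture(1, 4)            # the missing operand is produced
ready_after = station.ready()
result = station.execute()
print(ready_before, ready_after, result)
```

In this model the result is only computed; deciding when it may update the register file or memory is left to a separate commit step.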
Dynamic scheduling: in-order completion
• To make programs behave as if they were running on a non-pipelined
computer, the instruction fetch and decode unit is required to
issue instructions in order, and the commit unit is required to
write results to registers and memory in program execution
order (in-order completion)
• Hence, if an exception occurs, the computer can point to the last
instruction executed, and the only registers updated will be
those written by instructions before the one that caused the exception
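In-order completion can be illustrated with a toy reorder buffer: instructions may finish executing in any order, but results leave the buffer strictly in program order. The class and method names below are hypothetical, and the model omits exceptions and speculation.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left

    def issue(self, tag):
        """Allocate an entry in program order."""
        self.entries.append({"tag": tag, "done": False, "value": None})

    def complete(self, tag, value):
        """Record a result; execution may finish in any order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head only, so results
        reach registers and memory in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["tag"], entry["value"]))
        return committed


rob = ReorderBuffer()
for tag in ("i0", "i1", "i2"):
    rob.issue(tag)

rob.complete("i2", 30)     # the youngest instruction finishes first
first = rob.commit()       # nothing retires: i0 is still pending
rob.complete("i0", 10)
second = rob.commit()      # i0 retires; i1 still blocks i2
rob.complete("i1", 20)
third = rob.commit()       # i1 and i2 retire together, in order
print(first, second, third)
```

Because a finished instruction waits behind any older unfinished one, the committed state always corresponds to a point in the sequential program.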
Dynamic scheduling: Speculation
• Speculative execution: Dynamic scheduling can be
combined with branch prediction, so that after a mispredicted
branch, the commit unit can discard the speculative results
held in the execution units
• Dynamic scheduling can also be combined with superscalar
execution, so each unit may be issuing or committing 4 to 6
instructions per cycle
Superscalar Processor
Conclusion: Several Steps of ILP Exploitation