CS-447 – Computer Architecture
M,W 10-11:20am
Lecture 16
Superscalar Processors
October 29th, 2007
Majd F. Sakr
msakr@qatar.cmu.edu
www.qatar.cmu.edu/~msakr/15447-f07/
15-447 Computer Architecture
Fall 2007 ©
During This Lecture
° Introduction to Superscalar Pipelined
Processors
• Limitations of Scalar Pipelines
• From Scalar to Superscalar Pipelines
• Superscalar Pipeline Overview
General Performance Metrics
° Performance is determined by (1) Instruction Count, (2) Clock Rate, & (3) CPI – Clocks Per Instruction.
° Exe Time = Instruction Count * CPI * Cycle Time
° The Instruction Count is governed by
Compiler’s Techniques & Architectural
decisions
° The CPI is primarily governed by Architectural
decisions.
° The Cycle Time is governed by technology
improvements & Architectural decisions.
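A small worked example of the execution-time equation above; the instruction count, CPI, and clock rate below are hypothetical and chosen only to illustrate the formula.

    # Worked example of Exe Time = Instruction Count * CPI * Cycle Time.
    # All numbers are hypothetical.
    instruction_count = 1_000_000        # dynamic instructions executed
    cpi = 1.5                            # average clocks per instruction
    clock_rate_hz = 2e9                  # 2 GHz clock
    cycle_time_s = 1 / clock_rate_hz     # 0.5 ns per cycle

    exe_time_s = instruction_count * cpi * cycle_time_s
    print(exe_time_s)                    # 0.00075 s, i.e. 0.75 ms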
Scalar Unpipelined Processor
° Only ONE instruction can be resident in the processor at any given time.
° The whole processor is considered as ONE stage, K = 1.
CPI = 1 / IPC
(Figure: one instruction resident in the processor; the number of stages K = 1.)
Pipelined Processor
(Figure: five instructions, the 1st through the 5th, overlapped in a pipeline with stages IF, ID, EX, Mem, WB. The number of stages K = 5; ideally, CPI = IPC = 1, and the parallelism = K = 5.)
° K – the number of pipe stages; K instructions can be resident in the processor at any given time.
° In our example, K = 5 stages, so the degree of parallelism (concurrent instructions in the processor) is also equal to 5.
° One instruction is completed each clock cycle, so CPI = IPC = 1.
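A minimal sketch of the ideal case, assuming a K-stage pipeline with no stalls: once the pipeline is full, one instruction completes every cycle, so n instructions take K + n - 1 cycles and the CPI approaches 1.

    # Ideal K-stage pipeline with no stalls (illustrative only).
    def ideal_pipeline_cycles(k, n):
        # k cycles to fill the pipeline, then one completion per cycle
        return k + n - 1

    k, n = 5, 1000
    print(ideal_pipeline_cycles(k, n) / n)   # ~1.004 cycles per instruction, so CPI -> 1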
Example of a Six-Stage Pipelined Processor
Pipelining & Performance
Speedup = [Ideal CPI x Pipeline Depth / (Ideal CPI + Pipeline Stall CPI)] x [Cycle Time unpipelined / Cycle Time pipelined]
° The pipeline depth is the number of stages implemented in the processor. It is an architectural decision and is also directly related to the technology. In the previous example, K = 5.
° The pipeline stall CPI is directly related to the program's instructions and to the density of dependences and branches in the code.
° Ideally, the CPI is ONE; a worked example follows below.
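The worked example below assumes an ideal CPI of one, a hypothetical stall CPI, and equal unpipelined and pipelined cycle times (cycle-time ratio of 1).

    # Worked example of the speedup equation above; the stall CPI is hypothetical.
    ideal_cpi = 1.0
    pipeline_depth = 5
    stall_cpi = 0.5            # extra cycles per instruction from hazards & branches
    cycle_time_ratio = 1.0     # Cycle Time unpipelined / Cycle Time pipelined (assumed equal)

    speedup = (ideal_cpi * pipeline_depth) / (ideal_cpi + stall_cpi) * cycle_time_ratio
    print(speedup)             # ~3.33 instead of the ideal speedup of 5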
Super-Pipelining Processor
° IF (Instruction Fetch): requires a cache access in order to get the next instruction to be delivered to the pipeline.
° ID (Instruction Decode): requires that the control unit decode the instruction to know what it does.
° Obviously, the fetch stage's cache access requires much more time than the decode stage's combinational logic.
° Why not subdivide each pipe stage into m sub-stages, each requiring a smaller amount of time: the minor cycle time.
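A small worked illustration of the minor cycle time; the stage delay below is hypothetical.

    # Subdividing each pipe stage into m sub-stages (hypothetical delays).
    major_cycle_ns = 3.0                  # delay of the slowest original stage
    m = 3                                 # sub-stages per original stage
    minor_cycle_ns = major_cycle_ns / m   # 1.0 ns

    print(minor_cycle_ns)                 # a new instruction can issue every 1.0 ns
    print(1 / (minor_cycle_ns * 1e-9))    # ~1e9 issues/s instead of ~3.3e8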
Super-Pipelining Processor
° The machine can issue a new instruction every minor cycle.
° Parallelism = K x m.
° For this figure, parallelism = K x m = 4 x 3 = 12.
(Figure: super-pipelined machine of degree m = 3.)
Stage Quantization
° (a) four-stage
instruction pipeline
° (b) eleven-stage
instruction pipeline
Superscalar machine of Degree n=3
° The superscalar
degree is determined
by the issue
parallelism n, the
maximum number of
instructions that can
be issued in every
machine cycle.
° Parallelism = K x n.
° For this figure, parallelism = K x n = 4 x 3 = 12.
° Is there any reason why the superscalar machine cannot also be superpipelined?
Superpipelined Superscalar Machine
° The overall parallelism is determined by the issue parallelism n, the sub-stages m, & the number of stages K.
° Parallelism = K x m x n.
° For this figure, parallelism = K x m x n = 5 x 3 x 3 = 45.
(Figure: superpipelined superscalar machine; multiple instructions per cycle flow through sub-divided IF, ID, ALU, Mem, and WB stages.)
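A tiny sketch that reproduces the parallelism figures quoted on the last three slides (K stages, m sub-stages per stage, issue width n):

    # Machine parallelism = K x m x n (m = 1 and n = 1 when the feature is absent).
    def parallelism(k, m=1, n=1):
        return k * m * n

    print(parallelism(k=4, m=3))         # super-pipelined machine of degree m = 3 -> 12
    print(parallelism(k=4, n=3))         # superscalar machine of degree n = 3 -> 12
    print(parallelism(k=5, m=3, n=3))    # superpipelined superscalar machine -> 45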
Characteristics of Superscalar Machines
° Simultaneously advance multiple instructions through
the pipeline stages.
° Multiple functional units => higher instruction execution
throughput.
° Able to execute instructions in an order different from
that specified by the original program.
° Execution out of program order allows more parallel processing of instructions.
Limitations of Scalar Pipelines
° Instructions, regardless of type,
traverse the same set of pipeline
stages.
° Only one instruction can be resident in
each pipeline stage at any time.
° Instructions advance through the
pipeline stages in a lockstep fashion.
(Figure: scalar pipeline with stages IF, ID, EXE, MEM, WB.)
° Upper bound on pipeline throughput.
Deeper Pipeline, a Solution?
° Performance is determined by (1) Instruction Count, (2) Clock Rate, & (3) IPC – Instructions Per Clock.
° A deeper pipeline has fewer logic gate levels in each stage; this leads to a shorter cycle time & a higher clock rate.
° But a deeper pipeline can potentially incur higher penalties for dealing with inter-instruction dependences; a hypothetical comparison follows below.
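The comparison below assumes perfect stage balancing and a dependence penalty that grows with depth; all numbers are made up purely to illustrate the trade-off.

    # Deeper pipeline: shorter cycle, but larger penalties for dependences (hypothetical model).
    def time_per_instruction_ns(depth, unpipelined_delay_ns=5.0, hazard_rate=0.05):
        cycle_ns = unpipelined_delay_ns / depth   # perfect stage balancing assumed
        penalty_cycles = 0.6 * depth              # assume the penalty scales with depth
        cpi = 1 + hazard_rate * penalty_cycles
        return cpi * cycle_ns

    print(time_per_instruction_ns(depth=5))    # ~1.15 ns per instruction
    print(time_per_instruction_ns(depth=20))   # ~0.40 ns: still faster, but far from 4x better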
Bounded Pipelines Performance
° A scalar pipeline can initiate at most one instruction every cycle; hence IPC is fundamentally bounded by ONE.
° To get more instruction throughput, we must initiate more than one instruction every machine cycle.
° Hence, having more than one instruction resident in each pipeline stage is necessary: a parallel pipeline.
Inefficient Unification into a Single Pipeline
° Different instruction types
require different sets of
computations.
° In IF & ID there is significant uniformity across different instruction types.
(Figure: the IF and ID stages perform their job regardless of the instruction they are working on.)
Inefficient Unification into a Single Pipeline
° But in execution stages ALU
& MEM there is substantial
diversity.
° Instructions that require long
& possibly variable latencies
(F.P. Multiply & Divide) are
difficult to unify with simple
instructions that require only
a single cycle latency.
(Figure: example execution latencies – F.P. Divide: thirty clock cycles; Multiply: ten clock cycles; Add: one clock cycle.)
Diversified Pipelines
° Specialized execution units, customized for specific instruction types, address the need for greater diversity in the execution stages.
° For parallel pipelines, there is a strong motivation to implement multiple different execution units – subpipelines – in the execution portion of the parallel pipeline. We call such pipelines diversified pipelines.
Performance Lost Due to Rigid Pipelines
° Instructions advance through the pipeline stages in a lockstep fashion: in order & synchronously.
° If a dependent instruction is stalled in pipeline stage i, then all earlier stages, regardless of whether their instructions are dependent or not, are also stalled.
(Figure: DivFP R31, R0 stalls for 30 cycles (an F.P. divide); the trailing Mult R21, R8, Sub R2, R15, and Add R1, R22 have no dependence on it at all, yet they are stalled as well.)
Rigid Pipeline Penalty
° Only after the inter-instruction dependence is satisfied can all i stalled instructions again advance synchronously down the pipeline.
° If an independent instruction is allowed to bypass the stalled instruction & continue down the pipeline stages, an idling cycle of the pipeline can be eliminated.
Out-Of-Order Execution
(Figure: while Div R31, R0 is stalled, independent instructions (Add R2, R4; Sub R24, R16; Add R12, R23; Mult R21, R8; Sub R2, R15; Add R1, R22) bypass it and proceed down the pipeline.)
° Out-of-order execution is the act of allowing trailing instructions to bypass a stalled leading instruction. Parallel pipelines that support out-of-order execution are called dynamic pipelines. A minimal cycle-count sketch follows below.
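The sketch uses the illustrative latencies from the earlier figure (a 30-cycle F.P. divide, single-cycle integer operations) and a deliberately simplified model: in the rigid pipeline every trailing instruction waits behind the divide, while in the dynamic pipeline the independent trailing instructions issue one per cycle underneath it.

    # Simplified cycle counts: rigid (lockstep) vs. dynamic (bypassing) pipeline.
    instrs = [("DivFP R31,R0", 30), ("Mult R21,R8", 1), ("Sub R2,R15", 1), ("Add R1,R22", 1)]

    def lockstep_cycles(instrs):
        # Rigid pipeline: every trailing instruction stalls behind the divide.
        return sum(latency for _, latency in instrs)

    def bypass_cycles(instrs):
        # Dynamic pipeline: the independent trailing instructions issue one per
        # cycle while the divide is still executing, so its latency hides them.
        divide_latency = instrs[0][1]
        return max(divide_latency, len(instrs))

    print(lockstep_cycles(instrs))   # 33 cycles
    print(bypass_cycles(instrs))     # 30 cycles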
From Scalar to Superscalar Pipelines
° Superscalar pipelines are parallel pipelines: able to
initiate the processing of multiple instructions in every
machine cycle.
° Superscalar pipelines are diversified; they employ multiple & heterogeneous functional units in their execution stages.
° They are implemented as dynamic pipelines in order to
achieve the best possible performance without requiring
reordering of instructions by the compiler.
Parallel Pipelines
° A k-stage pipeline can have k
instructions concurrently resident in
the machine & can potentially achieve
a factor of k speedup over
nonpipelined machines.
(Figure: a 5-stage pipeline with stages IF, ID, EXE, MEM, WB versus K = 5 parallel copies (P1) of the nonpipelined machine.)
° Alternatively, the same speedup can be achieved by employing k copies of the nonpipelined machine to process k instructions in parallel.
Temporal & Spatial Machine Parallelism
° Spatial parallelism requires replication of the entire processing unit and hence needs more hardware than temporal parallelism.
° Parallel pipelines employ both temporal & spatial machine parallelism, which produces higher instruction processing throughput.
Parallel Pipelines
° For parallel pipelines, the speedup is primarily
determined by the width of the parallel pipeline.
° A parallel pipeline of width s can concurrently process up to s instructions in each of its pipeline stages, producing a potential speedup of s.
Parallel Pipeline’s Interconnections
° To connect all s instruction buffers of one stage to all s instruction buffers of the next stage, the circuitry for inter-stage interconnection can increase by a factor of s².
° In order to support concurrent register file access by s instructions, the number of read & write ports of the register file must be increased by a factor of s. Similarly, additional I-cache & D-cache access ports must be provided.
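A tiny sketch of how these hardware costs scale with the pipeline width s; the baseline unit costs are hypothetical.

    # Cost scaling with pipeline width s (unit costs are hypothetical).
    def widen_cost(s, base_interconnect=1.0, base_rf_ports=3, base_cache_ports=1):
        return {
            "inter-stage interconnect": base_interconnect * s ** 2,   # grows as s^2
            "register file ports": base_rf_ports * s,                 # grows as s
            "cache access ports": base_cache_ports * s,               # grows as s
        }

    print(widen_cost(1))   # scalar baseline
    print(widen_cost(4))   # 4-wide pipeline: 16x interconnect, 4x ports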
The Sequel of the i486: Pentium Microprocessor
° Superscalar
machine
implementing a
parallel pipeline of
width s=2.
° Essentially implements two i486 pipelines; each is a 5-stage scalar pipeline.
(Figure: the Pentium parallel pipeline: IF, D1 (main decoding stage), D2 (secondary decoding stage), EXE, and WB, duplicated as the U pipe and the V pipe.)
Pentium Microprocessor
° Multiple instructions can be fetched & decoded by the
first 2 stages of the parallel pipeline in every machine
cycle.
° In each cycle, potentially two instructions can be
issued into the two execution pipelines.
° The goal is to achieve a peak execution rate of two
instructions per machine cycle.
° The execute stage can perform an ALU operation or access the D-cache; hence additional ports to the RF must be provided to support 2 simultaneous ALU operations.
Diversified Pipelines
° In a unified pipeline, though each instruction type
only requires a subset of the execution stages, it
must traverse all the execution stages –even if
idling.
° The execution latency for all instruction types is equal to the total number of execution stages, resulting in unnecessary stalling of trailing instructions.
Diversified Pipelines (3)
° Instead of implementing s identical pipes in an s-wide parallel pipeline, diversified execution pipes can be implemented using multiple functional units.
(Figure: a 3-wide pipeline with IF, ID, and RD stages feeding diversified execute stages (ALU, MEM1–MEM2, FP1–FP2–FP3, BR), followed by WB.)
Advantages of Diversified Pipelines
° Efficient hardware design due to customized pipes for each particular instruction type.
° Each instruction type incurs only the necessary latency & makes use of all the stages it traverses.
° Possible distributed & independent control of each execution pipe if all inter-instruction dependences are resolved prior to dispatching.
Diversified Pipelines Design
° The number of functional units should match the
available I.L.P. of the program. The mix of F.U.s
should match the dynamic mix of instruction
types of the program.
° Most 1st generation superscalars simply
integrated a second execution pipe for
processing F.P. instructions with the existing
scalar pipeline.
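A hedged sizing sketch: given a dynamic instruction-type mix (the percentages below are made up for illustration), choose a functional-unit count per type for a given issue width.

    # Illustrative F.U. sizing from a dynamic instruction mix (numbers are made up).
    import math

    dynamic_mix = {"integer": 0.40, "load/store": 0.30, "fp": 0.15, "branch": 0.15}
    issue_width = 4

    fu_mix = {itype: max(1, math.ceil(fraction * issue_width))
              for itype, fraction in dynamic_mix.items()}
    print(fu_mix)   # e.g. {'integer': 2, 'load/store': 2, 'fp': 1, 'branch': 1}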
Diversified Pipelines Design (2)
° Later, 4-issue machines implemented 4 F.U.s for executing integer, F.P., load/store, & branch instructions.
° More recent designs incorporate multiple integer units, some dedicated to long-latency integer operations (multiply & divide) and to operations for image, graphics, & signal-processing applications.
Dynamic Pipelines
° Superscalar pipelines differ from (rigid) scalar pipelines in one key aspect: the use of complex multi-entry buffers.
° In order to minimize unnecessary stalling of instructions in a parallel pipeline, trailing instructions must be allowed to bypass a stalled leading instruction.
° Such bypassing can change the order of execution of instructions from the original sequential order of the static code.
Dynamic Pipelines (2)
° With out-of-order execution, instructions are executed as soon as their operands are available; this approaches the data-flow limit of execution.
(Figure: dynamic pipeline: an in-order front end (IF, ID, RD) feeds a dispatch buffer (in-order entry, out-of-order exit), diversified execute stages (ALU, MEM1–MEM2, FP1–FP2–FP3, BR), a re-order buffer (out-of-order entry, in-order exit), and in-order WB.)
Dynamic Pipelines (3)
° A dynamic pipeline achieves out-of-order execution via the use of complex multi-entry buffers that allow instructions to enter & leave the buffers in different orders. A minimal sketch of such a buffer follows below.
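The sketch below models a dispatch buffer: entries are filled in program order but issue out of order as soon as their source operands are ready. The class and method names are hypothetical; this illustrates the idea, not any particular machine's design.

    # Minimal multi-entry dispatch buffer: in-order entry, out-of-order issue.
    from collections import OrderedDict

    class DispatchBuffer:
        def __init__(self):
            self.entries = OrderedDict()         # insertion order = program order

        def enter(self, tag, srcs):
            self.entries[tag] = set(srcs)        # source operands not yet produced

        def wakeup(self, produced_reg):
            for srcs in self.entries.values():
                srcs.discard(produced_reg)       # a completing instruction frees this operand

        def issue(self):
            # Issue any ready entry, not necessarily the oldest: out-of-order issue.
            ready = next((tag for tag, srcs in self.entries.items() if not srcs), None)
            if ready is not None:
                del self.entries[ready]
            return ready

    buf = DispatchBuffer()
    buf.enter("DivFP R31,R0", srcs=[])           # ready immediately
    buf.enter("Add R1,R31", srcs=["R31"])        # waits for the divide's result
    buf.enter("Sub R2,R15", srcs=[])             # independent of both
    print(buf.issue())    # DivFP R31,R0
    print(buf.issue())    # Sub R2,R15 bypasses the waiting Add
    buf.wakeup("R31")     # the divide completes
    print(buf.issue())    # Add R1,R31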
Pentium 4
° 80486 – CISC
° Pentium – some superscalar components
• Two separate integer execution units
° Pentium Pro – full-blown superscalar
° Subsequent models refine & enhance the superscalar design
Pentium 4 Block Diagram
Pentium 4 Operation
° Fetch instructions from memory in static program order
° Translate each instruction into one or more fixed-length RISC instructions (micro-operations); see the sketch below
° Execute micro-ops on a superscalar pipeline
• Micro-ops may be executed out of order
° Commit results of micro-ops to the register set in original program-flow order
° Outer CISC shell with an inner RISC core
° Inner RISC core pipeline of at least 20 stages
• Some micro-ops require multiple execution stages
- Longer pipeline
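A purely illustrative sketch of the micro-operation idea: one memory-operand CISC instruction is broken into simpler, fixed-format micro-ops. The micro-op names are invented for illustration and do not reflect Intel's actual internal encoding.

    # Illustrative CISC-to-micro-op decomposition (names are invented).
    def decode_to_uops(instruction):
        if instruction == "ADD [mem], EAX":
            return ["LOAD  tmp <- [mem]",
                    "ADD   tmp <- tmp + EAX",
                    "STORE [mem] <- tmp"]
        return [instruction]   # simple register-register instructions map to one micro-op

    for uop in decode_to_uops("ADD [mem], EAX"):
        print(uop)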
Pentium 4 Pipeline
Pentium 4 Pipeline Operation (1)
Pentium 4 Pipeline Operation (2)
Pentium 4 Pipeline Operation (3)
Pentium 4 Pipeline Operation (4)
Pentium 4 Pipeline Operation (5)
Pentium 4 Pipeline Operation (6)
Introduction to PowerPC 620
° The first 64-bit superscalar processor to employ:
• Aggressive branch prediction,
• Real out-of-order execution,
• Six pipelined execution units,
• Dynamic renaming for all register files,
• Distributed multi-entry reservation stations,
• A completion buffer to ensure program correctness.
° Most of these features had not been previously implemented in a single-chip microprocessor.