Chapter 6

ECE200 – Computer Organization
Chapter 6 – Enhancing
Performance with
Pipelining
Homework 6

6.2, 6.3, 6.5, 6.9, 6.11, 6.19, 6.27, 6.30
Outline for Chapter 6 lectures

Pipeline motivation: increasing instruction
throughput

MIPS 5-stage pipeline

Hazards

Handling exceptions

Superscalar execution

Dynamic scheduling (out-of-order execution)

Real pipeline designs
Pipeline motivation

Need both low CPI and high frequency for best
performance
- Want a multicycle implementation for high frequency, but need a better CPI

Idea behind pipelining is to have a multicycle
implementation that operates like a factory
assembly line

Each “worker” in the pipeline performs a
particular task, hands off to the next “worker”,
while getting new work
Pipeline motivation

Tasks should take about the same time – if one
“worker” is much slower than the rest, then
other “workers” will stand idle

Once the assembly line is full, a new “product”
(instruction) comes out of the back-end of the
line each time period

In a computer assembly line (pipeline), each
task is called a stage and the time period is one
clock cycle
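The assembly-line idea can be quantified with a little arithmetic. Below is a minimal sketch (function names are mine, not from the slides) comparing an ideal k-stage pipeline with a multicycle machine that spends k cycles per instruction, assuming no hazards:

```python
# Sketch: cycle counts for an ideal k-stage pipeline vs. a multicycle
# machine that takes k cycles per instruction (no hazards assumed).

def multicycle_cycles(n_instructions, k_stages):
    # Each instruction occupies the machine for k full cycles.
    return n_instructions * k_stages

def pipelined_cycles(n_instructions, k_stages):
    # k cycles to fill the pipeline, then one instruction completes
    # every cycle thereafter.
    return k_stages + (n_instructions - 1)

# With many instructions the speedup approaches k (here, 5)
n, k = 1_000_000, 5
speedup = multicycle_cycles(n, k) / pipelined_cycles(n, k)
```

With a million instructions the speedup is essentially 5 — the number of stages — which is why pipelining targets instruction throughput rather than the latency of any single instruction.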
MIPS 5-stage pipeline

Like single cycle datapath but with registers
separating each stage
MIPS 5-stage pipeline

5 stages for each instruction
- IF: instruction fetch
- ID: instruction decode and register file read
- EX: instruction execution or effective address calculation
- MEM: memory access for load and store
- WB: write back results to register file

Delays of all 5 stages are roughly the same

Staging registers are used to hold data and
control as instructions pass between stages

All instructions pass through all 5 stages

As an instruction leaves a stage in a particular
clock period, the next instruction enters it
Pipeline operation for lw

Stage 1: Instruction fetch
Pipeline operation for lw

Stage 2: Instruction decode and register file
read

What happens to the instruction info in IF/ID?
Pipeline operation for lw

Stage 3: Effective address calculation
Pipeline operation for lw

Stage 4: Memory access
Pipeline operation for lw

Stage 5: Write back

Instruction info in IF/ID is gone – won’t work
Modified pipeline with write back fix

Write register bits from the instruction must be
carried through the pipeline with the instruction
Pipeline operation for lw

Pipeline usage in each stage for lw
Pipeline operation for sw

Stage 3: Effective address calculation
Pipeline operation for sw

Stage 4: Memory access
Pipeline operation for sw

Stage 5: Write back (nothing)
Pipeline operation for lw, sub sequence
Graphical pipeline representation

Represent overlap of pipelined instructions as
multiple pipelines skewed by a cycle
Another useful shorthand form
Pipeline control

Basic pipeline control is similar to the single
cycle implementation
Pipeline control

Control for an instruction is generated in ID and
travels with the instruction and data through the
pipeline

When an instruction enters a stage, its control
signals set the operation of that stage
Pipeline control
Multiple instruction example

For the following code fragment
  lw   $10, 20($1)
  sub  $11, $2, $3
  and  $12, $4, $5
  or   $13, $6, $7
  add  $14, $8, $9
show the datapath and control usage as the
instruction sequence travels down the pipeline
Multiple instruction example
How the MIPS ISA simplifies pipelining

Fixed length instructions simplify
- Fetch – just get the next 32 bits
- Decode – single step; don't have to decode opcode before
  figuring out where to get the rest of the fields

Source register fields always in same location
- Can read source registers during decode

Load/store architecture
- ALU can be used for both arithmetic and EA calculation
- Memory instructions require about the same amount of work as
  arithmetic ones, easing pipelining of the two together

Memory data must be aligned
- Read or write accesses can be done in one cycle
Pipeline hazards

A hazard is a conflict regarding data, control, or
hardware resources

Data hazards are conflicts for register values

Control hazards occur due to the delay to
execute branch and jump instructions

Structural hazards are conflicts for hardware
resources, such as
- A single memory for instructions and data
- A multi-cycle, non-pipelined functional unit (such as a
  divider)
Data dependences


A read after write (RAW) dependence occurs
when the register written by an instruction is a
source register of a subsequent instruction
  lw   $10, 20($1)
  sub  $11, $10, $3
  and  $12, $4, $11
  or   $13, $11, $4
  add  $14, $13, $9

Also have write after read (WAR) and write after
write (WAW) data dependences (later)
Pipelining and RAW dependences

RAW dependences that are close by may cause
data hazards in the pipeline

Consider the following code sequence:
  sub  $2, $1, $3
  and  $12, $2, $6
  or   $13, $6, $2
  add  $14, $2, $2
  sw   $15, 100($2)

What are the RAW dependences?
Pipelining and RAW dependences

Data hazards with first three instructions

[figure: the first two dependent uses are hazards; the later two are ok]
Forwarding

Most RAW hazards can be eliminated by
forwarding results between pipe stages
[figure annotation: at this point, the result of sub is available]
Forwarding datapaths

Bypass paths feed data from MEM and WB back
to MUXes at the EX ALU inputs

Do we still have to write the register file in WB?
Detecting forwarding

Rd of the instruction in MEM or WB must match
Rs and/or Rt of the instruction in EX

The instruction in MEM or WB must have
RegWrite=1 (why?)

Rd must not be $0 (why?)
Detecting forwarding from MEM to EX

To the upper ALU input (ALUupper)
- EX/MEM.RegWrite = 1
- EX/MEM.RegisterRd not equal 0
- EX/MEM.RegisterRd = ID/EX.RegisterRs

To the lower ALU input (ALUlower)
- EX/MEM.RegWrite = 1
- EX/MEM.RegisterRd not equal 0
- EX/MEM.RegisterRd = ID/EX.RegisterRt
Detecting forwarding from WB to EX

To the upper ALU input
- MEM/WB.RegWrite = 1
- MEM/WB.RegisterRd not equal 0
- MEM/WB.RegisterRd = ID/EX.RegisterRs
- The value is not being forwarded from MEM (why?)

To the lower ALU input
- MEM/WB.RegWrite = 1
- MEM/WB.RegisterRd not equal 0
- MEM/WB.RegisterRd = ID/EX.RegisterRt
- The value is not being forwarded from MEM
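The conditions for one ALU input can be collected into a single function. This is an illustrative sketch, not the actual forwarding-unit hardware; the dict fields mirror the pipeline-register names used in the slides, and checking MEM before WB implements the "not being forwarded from MEM" priority:

```python
# Sketch of the forwarding-unit conditions. Pipeline registers are
# modeled as dicts; MEM is checked first because it holds the most
# recent value for the register.

def forward_source(ex_mem, mem_wb, idex_src_reg):
    """Return which stage ('MEM', 'WB', or None) supplies one ALU input."""
    if (ex_mem['RegWrite'] and ex_mem['RegisterRd'] != 0
            and ex_mem['RegisterRd'] == idex_src_reg):
        return 'MEM'
    if (mem_wb['RegWrite'] and mem_wb['RegisterRd'] != 0
            and mem_wb['RegisterRd'] == idex_src_reg):
        return 'WB'
    return None
```

For example, with sub $2,... in MEM and an instruction in EX whose Rs is $2, the function reports a MEM-to-EX forward; $0 never forwards because it is hard-wired to zero.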
Forwarding control

Control is handled by the forwarding unit
Forwarding example

Show forwarding for the code sequence:
  sub  $2, $1, $3
  and  $4, $2, $5
  or   $4, $4, $2
  add  $9, $4, $2
Forwarding example

sub produces result in EX
Forwarding example

sub forwards result from MEM to ALUupper
Forwarding example

sub forwards result from WB to ALUlower

and forwards result from MEM to ALUupper
Forwarding example

or forwards result from MEM to ALUupper
RAW hazards involving loads


Loads produce results in MEM – can’t forward to
an immediately following R-type instruction
Called a load-use hazard
RAW hazards involving loads

Solution: stall the stages behind the load for one
cycle, after which the result can be forwarded
Detecting load-use hazards

Instruction in EX is a load
- ID/EX.MemRead = 1

Instruction in ID has a source register that
matches the load destination register
- ID/EX.RegisterRt = IF/ID.RegisterRs OR
  ID/EX.RegisterRt = IF/ID.RegisterRt
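The detection test above can be sketched as one boolean function (an illustration, with pipeline registers again modeled as dicts, not the actual hazard-detection hardware):

```python
# Sketch of load-use hazard detection: stall when the instruction in
# EX is a load (MemRead = 1) whose destination (rt) matches a source
# register of the instruction in ID.

def load_use_hazard(id_ex, if_id):
    return (id_ex['MemRead'] == 1
            and id_ex['RegisterRt'] in (if_id['RegisterRs'],
                                        if_id['RegisterRt']))
```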
Stalling the stages behind the load

Force a nop ("no operation") instruction into the EX
stage on the next clock cycle
- Force ID/EX.MemWrite input to zero
- Force ID/EX.RegWrite input to zero

Hold instructions in ID and IF stages for one
clock cycle
- Hold the contents of PC
- Hold the contents of IF/ID
Control for load-use hazards

Control is handled by the hazard detection unit
Load-use stall example

Code sequence:
  lw   $2, 20($1)
  and  $4, $2, $5
  or   $4, $4, $2
  add  $9, $4, $2
Load-use stall example

lw enters ID
Load-use stall example

Load-use hazard detected
Load-use stall example

Force nop into EX and hold ID and IF stages
Load-use stall example

lw result in WB forwarded to and in EX

or reads operand $2 from register file
Load-use stall example

Pipeline advances normally
Control hazards

Taken branches and jumps change the PC to the
target address from which the next instruction is
to be fetched

In our pipeline, the PC is changed when the
taken beq instruction is in the MEM stage

This creates a control hazard in which
sequential instructions in earlier stages must be
discarded
beq instruction that is taken
[figure: beq $2,$3,7 in the pipeline, followed in program order by
instri+1, instri+2, instri+3]

instri+1, instri+2, instri+3 must be discarded
beq instruction that is taken

In this example, the branch delay is three

Why is the branch immediate field a 7?
Reducing the branch delay

Reducing the branch delay reduces the number
of instructions that have to be discarded on a
taken branch

We can reduce the branch delay to one for beq
by moving both the equality test and the branch
target address calculation into ID

We need to insert a nop between the beq and
the correctly fetched instruction
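To see why shrinking the delay matters, here is some illustrative arithmetic using the branch statistics quoted elsewhere in these slides (~20% of instructions are branches, roughly 2/3 of branches taken); the function name is my own:

```python
# Effective CPI with a taken-branch penalty, assuming a base CPI of 1
# and that only taken branches pay the flush penalty.

def effective_cpi(branch_frac, taken_frac, penalty_cycles):
    return 1.0 + branch_frac * taken_frac * penalty_cycles

# Cutting the branch delay from 3 cycles to 1:
cpi_delay3 = effective_cpi(0.20, 2 / 3, 3)   # about 1.40
cpi_delay1 = effective_cpi(0.20, 2 / 3, 1)   # about 1.13
```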
Reducing the branch delay
beq with one branch delay

Register equality test done in ID by exclusive-ORing
the register values and NORing the result

Instruction in ID forced to nop by zeroing the
IF/ID register

Next fetched instruction will be from PC+4 or
branch target depending on the beq outcome
beq with one branch delay

beq in ID; next sequential instruction (and) in IF
beq with one branch delay

bubble in ID; lw (from taken address) in IF
Forwarding and stalling changes

Results in MEM and WB must be forwarded to ID
for use as possible beq source operand values

beq may have to stall in ID to wait for source
operand values to be produced

Examples
  addi $2, $2, -1       # stall beq one cycle; forward $2 from
  beq  $2, $0, 20       # MEM to the upper equality input in ID

  lw   $8, 20($1)       # stall beq two cycles; forward $8 from
  beq  $4, $8, 6        # WB to the lower equality input in ID
Forwarding from MEM to ID
[figure: addi $2,$2,-1 followed by a bubble, then beq $2,$0,20]
- How could we eliminate the bubble?
Forwarding from WB to ID
[figure: lw $8,20($1) followed by two bubbles, then beq $4,$8,6]
Further reducing the branch delay

Insert a bubble only if the branch is taken
- Allow the next sequential instruction to proceed if the branch
  is not taken
- AND the IF.Flush signal with the result of the equality test
- Still have a bubble for taken branches (~2/3 of all branches)

Delayed branching
Delayed branching
Delayed branching

The ISA states that the instruction following the
branch is always executed regardless of the
branch outcome
- Hardware must adhere to this rule!

The compiler finds an appropriate instruction to
place after the branch (in the branch delay slot)
  beq  $4, $8, 6
  sub  $1, $2, $3    # branch delay slot (always executed after the branch)
Delayed branching

Three places compiler may find a delay slot
instruction
Prior example without delayed branch

beq in ID; next sequential instruction (and) in IF
- What do you notice about the sub instruction?
Prior example without delayed branch

bubble in ID; lw (from taken address) in IF
Prior example with delayed branch

beq in ID; delay slot instruction (sub $10, $4, $8) in IF
Prior example with delayed branch

sub $10, $4, $8 in ID; lw (from taken address) in IF
- What would happen if the branch was not taken?
Limitations of delayed branching

50% of the time the compiler can't fill the delay slot
with useful instructions while maintaining
correctness (has to insert nops instead)

High performance pipelines may have >10 delay
slots
- Many cycles for instruction fetch and decode
- Multiple instructions in each pipeline stage
- Example
  - Pipeline: IF1-IF2-ID1-ID2
  - Branch calculation performed in ID2
  - Four instructions in each stage
  - 12 delay slots

Solution: branch prediction (later)
Precise exceptions

Exceptions require a change of control to a
special exception handler routine

The PC of the user program is saved in EPC and
restored after the handler completes so that the
user program can resume at that instruction

For the user program to work correctly after
resuming,
- All instructions before the excepting one must have written
  their results
- All subsequent instructions must not have written their results

Exceptions handled this way are called precise
Pipelining and precise exceptions

There may be instructions from before the
excepting one and from after it in the pipeline
when the exception occurs

Exceptions may be detected out of program order

[figure: two instructions in the pipeline each raise an exception]
- Which should be handled first?
Supporting precise exceptions

Each instruction in the pipeline has an exception
field that travels with it

When an exception is detected, the type of
exception is encoded in the exception field

The RegWrite and MemWrite control signals for
the instruction are set to 0

At the end of MEM, the exception field is checked
to see if an exception occurred

If so, the instructions in IF, ID, and EX are made
into nops, and the address of the exception
handler is loaded into the PC
Supporting precise exceptions
[figure: pipeline diagram with exceptions detected in two stages]
Superscalar pipelines

In a superscalar pipeline, each pipeline stage
holds multiple instructions
- 4-6 instructions in modern high performance microprocessors

Performance is increased because every clock
period more than one instruction completes
(increased parallelism)

Superscalar pipelines have a CPI less than 1
Simple 2-way superscalar MIPS
Simple 2-way superscalar MIPS

Two instructions fetched and decoded each cycle

Conditions for executing a pair of instructions
- First instruction an integer or branch, second a load or store
- No RAW dependence from first to second

Otherwise, the second instruction is executed the
cycle after the first
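The pairing rule can be sketched as a predicate. The dict layout ('kind', 'dest', 'srcs') is my own modeling choice, not anything from the datapath:

```python
# Sketch of the dual-issue test for the simple 2-way superscalar MIPS:
# first slot takes an integer/branch instruction, second a load/store,
# and the pair must have no RAW dependence from first to second.

def can_dual_issue(first, second):
    classes_ok = (first['kind'] in ('int', 'branch')
                  and second['kind'] in ('load', 'store'))
    # RAW: first's destination register feeds one of second's sources.
    raw = first['dest'] is not None and first['dest'] in second['srcs']
    return classes_ok and not raw
```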
Compiler code scheduling

The compiler can improve performance by
changing the order of the instructions in the
program (code scheduling)

Examples
- Fill branch delay slots
- Move instructions between two dependent instructions to
  eliminate the stall cycles
- Reorder instructions to increase the number executed in
  parallel
Scheduling example – before
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

- Load-use stall
- Stall after addi
- First three instructions must execute serially due to
  dependences
- Last two must also execute serially for same reason
- Have branch delay slot to fill
Scheduling example – after
Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4    # moved into load delay slot
      addu $t0, $t0, $s2
      bne  $s1, $zero, Loop
      sw   $t0, 4($s1)     # moved into branch delay slot

- All stall cycles are eliminated
- Last two instructions can now execute in parallel on the 2-way
  superscalar MIPS
- First two can also, but we would introduce a stall cycle before
  the addu (loop is too short – not enough instructions to
  schedule)
Loop unrolling

Idea is to take multiple iterations of a loop
(“unroll” it) and combine them into one bigger
loop

Gives the compiler many instructions to move
between dependent instructions and to increase
parallel execution

Reduces the overhead of branching
Loop unrolling

Example of prior loop unrolled 4 times:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Original code:

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
Loop unrolling

Problem: reuse of $t0 constrains instruction order

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t0, -4($s1)
      addu $t0, $t0, $s2
      sw   $t0, -4($s1)
      lw   $t0, -8($s1)
      addu $t0, $t0, $s2
      sw   $t0, -8($s1)
      lw   $t0, -12($s1)
      addu $t0, $t0, $s2
      sw   $t0, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop

Write after read (WAR) and write after write
(WAW) hazards
Loop unrolling

Solution: different registers for each computation

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      lw   $t1, -4($s1)
      addu $t1, $t1, $s2
      sw   $t1, -4($s1)
      lw   $t2, -8($s1)
      addu $t2, $t2, $s2
      sw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t3, $t3, $s2
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $zero, Loop
Loop unrolling

Unrolled loop after scheduling:

Loop: lw   $t0, 0($s1)
      lw   $t1, -4($s1)
      lw   $t2, -8($s1)
      lw   $t3, -12($s1)
      addu $t0, $t0, $s2
      addu $t1, $t1, $s2
      addu $t2, $t2, $s2
      addu $t3, $t3, $s2
      addi $s1, $s1, -16
      sw   $t0, 16($s1)
      sw   $t1, 12($s1)
      sw   $t2, 8($s1)
      bne  $s1, $zero, Loop
      sw   $t3, 4($s1)     # in the branch delay slot

New sw offsets due to moving the addi
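The branching-overhead claim is easy to check with arithmetic: each element still costs one lw, one addu, and one sw, but the addi/bne overhead is now shared across four elements. A small sketch (names and parameters are mine):

```python
# Instructions executed per array element: 3 "real work" instructions
# (lw/addu/sw) per element, plus 2 overhead instructions (addi/bne)
# per pass around the loop.

def instrs_per_element(unroll_factor, work_per_element=3, overhead=2):
    return (unroll_factor * work_per_element + overhead) / unroll_factor

original = instrs_per_element(1)   # 5.0 instructions per element
unrolled = instrs_per_element(4)   # 3.5 after unrolling 4 times
```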
Modern superscalar processors

Today’s superscalar processors attempt to issue
(initiate the execution of) 4-6 instructions each
clock cycle

Such processors have multiple integer ALUs,
integer multipliers, and floating point units that
operate in parallel on different instructions

Because most of these units are pipelined, there
is the potential to have 10’s of instructions
simultaneously executing

We must remove several barriers to achieve this
Modern processor challenges

Handling branches in a way that prevents
instruction fetch from becoming a bottleneck

Preventing long latency operations, especially
loads in which the data is in main memory, from
holding up instruction execution

Removing register hazards due to the reuse of
registers so that instruction can execute in
parallel
Instruction fetch challenges

Branches comprise about 20% of the executed
instructions in SPEC integer programs

The branch delay may be >10 instructions in a
highly pipelined, superscalar processor

Delayed branches are useless with so many delay
slots

Solution: dynamic branch prediction with
speculative execution
Dynamic branch prediction

When fetching the branch, predict what the
branch outcome and target will be

Fetch instructions from the predicted direction

After executing the branch, verify whether the
prediction was correct

If so, continue without any performance penalty

If not, undo and fetch from the other direction
Bimodal branch predictor

Predicts the branch outcome

Works under the assumption that most branches
are either taken most of the time or not taken
most of the time

Prediction accuracy is ~85-95% with 2048
entries
Bimodal branch predictor

Consists of a small memory and a state machine

Each memory location has 2 bits

The address of the memory is the low-order
log2n PC bits of a fetched branch instruction

[figure: the PC of a fetched branch instruction addresses a branch
predictor memory of n entries, 2 bits per entry]
Bimodal branch predictor

When a branch is fetched, the 2-bit memory
entry is retrieved

The prediction is based on the high-order bit
- 1 = predict taken
- 0 = predict not taken
Bimodal branch predictor

Once the branch is executed, the state bits are
updated and written back into the memory

[figure: 2-bit saturating-counter state machine; the actual branch
outcome moves the state among 11, 10, 01, 00]

In the 00 or 11 state, have to be wrong twice
in a row to change the prediction
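The predictor described above can be sketched in a few lines. This is an illustrative model, not any particular hardware implementation; starting the counters in the weakly-not-taken state is my own assumption:

```python
# Minimal 2-bit saturating-counter (bimodal) predictor: states 11/10
# predict taken, 01/00 predict not taken, indexed by low-order PC bits.

class BimodalPredictor:
    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries   # start weakly not-taken (01)

    def _index(self, pc):
        return pc % self.entries        # low-order log2(n) PC bits

    def predict(self, pc):
        # High-order bit of the 2-bit counter gives the prediction.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        # Saturating increment/decrement based on the actual outcome.
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Note the key property from the state diagram: from the 00 or 11 state the predictor must be wrong twice in a row before its prediction flips.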
Branch target buffer

Predicts the branch target address
- Is this as critical as predicting the branch outcome?

Small memory (typically 256-512 entries)
addressed by the low-order branch PC bits

Each entry holds the last target address of the
branch

When a branch is fetched, the BTB is accessed
and the target address is used if the bimodal
predictor predicts “taken”
Speculative execution

The execution of the branch, and verification of
the prediction, may take many cycles due to
RAW dependences with long-latency instructions
  lw   $2, 100($1)    # can take >100 cycles
  beq  $2, $0, Label

We cannot write the register file or data memory
until we know the prediction is correct

Execution will eventually stall
Speculative execution

In speculative execution, results are first written
to temporary buffers (NOT the register file or
data memory)

The results are copied from the buffers to the
register file or data memory if the branch
prediction has been verified and is correct

If the prediction is incorrect, we discard the
results
Speculative execution

Writeback now consists of two stages:
instruction completion and instruction commit

Completion: execution is complete, write results
to buffers

Commit: branch prediction is verified and
correct, copy results from buffers to register file
or data memory

[figure: execute -> completion buffers -> (once the branch prediction
is verified as correct) commit -> register file]

Modern processors can speculate through 4-8
branches
Modern processor challenges

Handling branches in a way that prevents
instruction fetch from becoming a bottleneck

Preventing long latency operations, especially
loads in which the data is in main memory, from
holding up instruction execution

Removing register hazards due to the reuse of
registers so that instruction can execute in
parallel
Long latency operations

Long latency operations, especially loads that
have to access main memory, may stall
subsequent instructions

  or   $5, $6, $7     # completed
  and  $8, $6, $7     # completed
  lw   $2, 100($1)    # data not found in on-chip memory,
                      # have to get from main memory
  add  $9, $2, $2     # waiting for lw
  sub  $10, $5, $8    # can't execute even though its
                      # operands are available!

Solution: allow instructions to issue (start
executing) out of their original program order
but update registers/memory in program order
Out-of-order issue

Fetched and decoded instructions are placed in a
special hardware queue called the issue queue

[figure: IF -> ID -> issue queue -> EX -> completion buffers ->
commit -> reg file]

An instruction waits in the IQ until
- Its source operands are available
- A suitable functional unit is available

The instruction can then issue
Out-of-order issue

Every cycle, the destination register numbers (rd
or rt) of issuing instructions are broadcast to all
instructions in the IQ

[figure: issued instructions broadcast their destination register
numbers back into the issue queue]

A match with a source register number (rs or rt)
of an instruction in the IQ indicates the operand
will be available

  or   $5, $6, $7     # issued
  and  $8, $6, $7     # issued
  lw   $2, 100($1)    # can take >100 cycles!
  add  $9, $2, $2
  sub  $10, $5, $8    # both operands become available
Out-of-order issue

Instructions with available source operands can
issue ahead of earlier instructions (out of
original program order)

[figure: in the issue queue, add $9,$2,$2 waits for lw while
sub $10,$5,$8 issues because or and and were just issued]
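The wakeup-and-select behavior just described can be sketched as a toy model (the data layout is my invention; real issue queues do this with CAM matches in a single cycle):

```python
# Toy issue-queue model: each entry records the source registers it is
# still waiting for; entries are kept in program order (oldest first).

def wakeup(issue_queue, broadcast_dests):
    # Broadcast destination register numbers of issuing instructions;
    # matching sources are no longer waited on.
    for entry in issue_queue:
        entry['waiting'] -= set(broadcast_dests)

def select(issue_queue):
    # Issue the oldest entry with no outstanding sources; a younger
    # ready instruction may thus bypass an older stalled one.
    for i, entry in enumerate(issue_queue):
        if not entry['waiting']:
            return issue_queue.pop(i)
    return None
```

In the slides' example, once or and and broadcast $5 and $8, sub issues even though the older add is still waiting on the lw result.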
Out-of-order issue, in-order commit

Once instructions complete, they write results
into the buffers used for speculative execution

However, instructions are written to the register
file and data memory in original program order

[figure: execute -> completion (may be out of order) -> buffers ->
commit (must be in order) -> register file; in the example, sub
completes first but or commits first]

  or   $5, $6, $7
  and  $8, $6, $7
  lw   $2, 100($1)
  add  $9, $2, $2
  sub  $10, $5, $8

Why do we need to do this?
Modern processor challenges

Handling branches in a way that prevents
instruction fetch from becoming a bottleneck

Preventing long latency operations, especially
loads in which the data is in main memory, from
holding up instruction execution

Removing register hazards due to the reuse of
registers so that instruction can execute in
parallel
Register hazards

The reuse of registers creates WAW and WAR
hazards that limit out-of-order issue and parallel
execution

Example
  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        addi $s1, $s1, -4
        bne  $s1, $zero, Loop

- Potential for multiple iterations to be executed in parallel
- The branch could be predicted as taken with high accuracy
- Problem: WAW and WAR hazards involving $t0 and $s1

Solution: register renaming
Register renaming

Idea is for the hardware to reassign registers
like the compiler does in loop unrolling

  Loop: lw   $t0, 0($s1)
        addu $t0, $t0, $s2
        sw   $t0, 0($s1)
        lw   $t1, -4($s1)
        addu $t1, $t1, $s2
        sw   $t1, -4($s1)

Requires implementing more registers than
specified in the ISA (e.g., 128 integer registers
rather than 32)

Allows every instruction in the pipeline to be
given a unique destination register number to
eliminate all WAR and WAW register conflicts
Register renaming



A register renaming stage is added between
decode and the register file access

The original architectural destination register
number is replaced by a unique physical register
number that is not used by any other instruction

A lookup is done for each source register to find
the corresponding physical register number

[figure: decode -> rename -> reg file; architectural register numbers
are used up to the rename stage, physical register numbers after it]
Register renaming

Example: two iterations of the loop with branch
predicted taken

BEFORE:                          AFTER:
  lw   $t0, 0($s1)                 lw   $p1, 0($p3)
  addu $t0, $t0, $s2               addu $p2, $p1, $p10
  sw   $t0, 0($s1)                 sw   $p2, 0($p3)
  addi $s1, $s1, -4                addi $p4, $p3, -4
  <bne predicted taken>            <bne predicted taken>
  lw   $t0, 0($s1)                 lw   $p7, 0($p4)
  addu $t0, $t0, $s2               addu $p23, $p7, $p10
  sw   $t0, 0($s1)                 sw   $p23, 0($p4)
  addi $s1, $s1, -4                addi $p11, $p4, -4
  <bne predicted taken>            <bne predicted taken>

- WAR hazard involving $s1 is removed, allowing the addi to
  complete before the first iteration is completed
- The WAW and WAR hazards involving $t0 are removed
- Removing both of these restrictions allows the second
  iteration to proceed in parallel with the first
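The lookup-and-replace performed by a rename stage can be sketched as follows, assuming a map table (architectural to physical) and a free list of unused physical registers; this is illustrative, not any particular processor's mechanism:

```python
# Sketch of register renaming: sources look up their current physical
# mapping; each destination is assigned a fresh physical register,
# which removes WAR and WAW conflicts on the architectural name.

def rename(instr, map_table, free_list):
    """instr: dict with 'srcs' (list of arch regs) and 'dest' (or None)."""
    phys_srcs = [map_table[s] for s in instr['srcs']]
    phys_dest = None
    if instr['dest'] is not None:
        phys_dest = free_list.pop(0)        # fresh physical register
        map_table[instr['dest']] = phys_dest  # later readers see this one
    return phys_srcs, phys_dest
```

Renaming the same architectural destination twice hands out two different physical registers, which is exactly why the two loop iterations above can proceed in parallel.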
The MIPS R12000 microprocessor

4-way superscalar

Five execution units
- 2 integer
- 2 floating point
- 1 load/store for effective address calculation and data memory
  access

Dynamic branch prediction and speculative
execution

ooo issue, in-order commit

Register renaming
R12000 pipeline (ALU operations)

Fetch stages 1 and 2
- Fetch 4 instructions each cycle
- Predict branches
- Split into two stages to enable higher clock rates (R10K had 1)

Decode stage
- Decode and rename 4 instructions each cycle
- Put into issue queues

Issue stage
- Check source operand availability
- Read source operands from register file (or bypass paths) for
  issued instructions

Execute stage
- Execute and complete

Write stage
- Write results to physical registers
R12000 branch prediction

2048-entry bimodal predictor

32 entry branch target address cache

Speculation through four branches
R12000 ooo completion, in-order commit

Separate 16-entry issue queues for integer,
floating point, and memory (load and store)
instructions

Hardware tracks the program order and status
(completed, caused exception, etc) of up to 48
instructions
R12000 register renaming

64 integer and 64 floating point physical
registers

Hardware lookup table to correlate architectural
registers with physical registers

Hardware maintains list of currently unused
registers that can be assigned as destination
registers
R10000 die photo
R12000 summary

R10000 was one of the 1st microprocessors to
implement the “issue queue” approach to ooo
superscalar execution
- PowerPC processors use the “reservation station” approach
  discussed in the book

Clock rate was slow
- R12000 provided a slight improvement with some redesign

Pentium and Alpha processors are ooo but with
much faster clock rates

Very hard to get significant improvement beyond
4-6 way issue
- Branch prediction accuracy needs to be extremely high
- Finding parallel operations in many programs is difficult
- Long latency of loads creates an operand supply problem
- Keeping the clock rate high is tough
Questions?