Structure of Computer Systems

Course 5
The Central Processing Unit CPU
Solutions for hazard cases
• Scoreboard method
• Tomasulo’s method
• Branch prediction
Scoreboard method

General considerations (wiki):
• first used in the CDC 6600 computer (1964)
• used for dynamically scheduling a pipeline so that instructions can
execute out of order when there are no conflicts and the hardware is
available (no structural hazard is present)
• the data dependencies of every instruction are logged
• instructions are released only when the scoreboard determines that
there are no conflicts with previously issued and incomplete instructions
• if an instruction is stalled because it is unsafe to continue, the
scoreboard monitors the flow of executing instructions until all
dependencies have been resolved before the stalled instruction is
issued
Scoreboard method

Implementation of the scoreboard method:
Every instruction goes through 4 stages:
Issue (ID1)
• decode the instruction
• check for structural and WAW hazards
• stall until structural and WAW hazards are resolved

Read operands (ID2)
• wait until no RAW hazards
• then read operands

Execution (EX)
• operate on operands
• may be multiple cycles - notify scoreboard when done

Write result (WB)
• finish execution
• stall if WAR hazard
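
The stall conditions of the stages above can be sketched as three checks. This is a minimal illustration and not the CDC 6600 logic itself: the class `UnitStatus`, the function names and the unit names are assumptions made for the example, and the `rj`/`rk` flags are taken to mean "operand ready but not yet read".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitStatus:
    """One functional-unit entry (hypothetical layout for this sketch)."""
    busy: bool = False
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # True while Fj is ready but not yet read
    rk: bool = False           # True while Fk is ready but not yet read

def can_issue(units, unit, dest):
    """Issue (ID1): the unit must be free (no structural hazard) and no
    active instruction may have the same destination (no WAW hazard)."""
    if units[unit].busy:
        return False
    return all(not (u.busy and u.fi == dest) for u in units.values())

def can_read_operands(units, unit):
    """Read operands (ID2): both sources must be ready (no RAW hazard)."""
    u = units[unit]
    return u.rj and u.rk

def can_write_result(units, unit):
    """Write result (WB): stall while another active unit still has to
    read the old value of our destination register (WAR hazard)."""
    dest = units[unit].fi
    for name, other in units.items():
        if name == unit or not other.busy:
            continue
        if (other.fj == dest and other.rj) or (other.fk == dest and other.rk):
            return False
    return True

# Example state: MULT writes F0 and reads F2, F4; ADD writes F2.
units = {
    "mult": UnitStatus(busy=True, fi="F0", fj="F2", fk="F4", rj=True, rk=True),
    "add": UnitStatus(busy=True, fi="F2", fj="F6", fk="F8", rj=True, rk=True),
    "div": UnitStatus(),
}
waw_blocked = not can_issue(units, "div", "F0")    # F0 is already a pending destination
war_blocked = not can_write_result(units, "add")   # MULT has not read F2 yet
```

In this state a new instruction writing F0 cannot issue, and ADD cannot write F2 until MULT has read its operands.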
Scoreboard method

Scoreboard structure:

Instruction status
• Indicates which of 4 steps the instruction is in: ID1, ID2, EX, or WB.

Functional unit status: indicates the state of the functional unit (FU)
• Busy: indicates whether the unit is busy or not
• Op: operation to perform in the unit (e.g., + or –)
• Fi: destination register
• Fj, Fk: source-register numbers
• Qj, Qk: functional units producing source registers Fj, Fk
• Rj, Rk: flags indicating when Fj, Fk are ready

Register result status
• Indicates which functional unit will write each register, if one exists.
Blank when no pending instruction will write that register
Scoreboard method

Speedup from scoreboard
• 1.7 for FORTRAN programs
• 2.5 for hand-coded assembly language programs

Hardware
• Scoreboard hardware is approximately the same size as one FPU
• Main cost: buses (4 times the normal amount)
• Could be more severe for modern processors
Scoreboard and Tomasulo’s algorithm

Issues with Scoreboard method:
• it does not solve structural hazards
• no forwarding logic
• introduces stall phases when a required functional unit is busy; the stall
affects the following instructions too
Tomasulo’s algorithm
• avoids structural hazards and also resolves WAR and WAW
dependencies with register renaming and a common data bus (CDB)
• first used in the IBM System/360 Model 91 computer (1967)
• Register renaming – keeps multiple copies of the same architectural register;
this avoids data dependencies that are caused by the limited
number of registers and not by a real data dependency
• Common data bus – a result is put on a common bus as soon as it is
available, avoiding unnecessary stalls until the data is written to the
destination register
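
Register renaming can be illustrated with a small sketch. It assumes a hypothetical instruction format `(dest, src1, src2)` and an unlimited supply of physical names `P0, P1, ...`; real hardware renames onto reservation-station tags or a finite physical register file.

```python
from itertools import count

def rename(program):
    """Give every write a fresh physical register.

    `program` is a list of (dest, src1, src2) architectural register names.
    Returns the same instructions using physical register names.
    """
    fresh = (f"P{i}" for i in count())
    mapping = {}                      # architectural name -> current physical name

    def lookup(reg):
        if reg not in mapping:        # first use of a register that was never written
            mapping[reg] = next(fresh)
        return mapping[reg]

    renamed = []
    for dest, src1, src2 in program:
        s1, s2 = lookup(src1), lookup(src2)   # sources use the latest copy (RAW kept)
        mapping[dest] = next(fresh)           # a new copy for every write
        renamed.append((mapping[dest], s1, s2))
    return renamed

program = [
    ("F0", "F2", "F4"),    # 1
    ("F6", "F0", "F8"),    # 2: RAW on F0 with 1
    ("F8", "F10", "F14"),  # 3: WAR on F8 with 2
    ("F6", "F10", "F8"),   # 4: WAW on F6 with 2, RAW on F8 with 3
]
renamed = rename(program)
```

After renaming, instruction 3 writes a different physical register than the one instruction 2 reads (WAR removed), instructions 2 and 4 write different registers (WAW removed), and the true RAW dependencies are unchanged.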
Tomasulo’s algorithm

Instruction stages:

Issue – an instruction is issued, in program order, if a reservation
station of the required functional unit is free; otherwise issue stalls.
If an operand value is not yet available, a virtual value (the tag of the
reservation station that will produce it) is recorded until the real
value becomes available
• Registers are renamed to avoid WAR and WAW hazards

Execute – the instruction is carried out as soon as the necessary
operands are available, either from the registers or captured from the CDB;
special care must be given to Load and Store instructions, which require
access to the memory

Write result – the result of the executed instruction is written back into
the destination register, and Store operations write to the memory
(see the commit stage later)
Tomasulo’s algorithm

Reservation stations
• buffers that fetch and store instruction operands as soon as they are
available
• a reservation station holds the operands and the result of an
instruction
• it points to registers (if the data is available) or to other reservation
stations that will contain the necessary data as soon as it
becomes available (before it is written back to the register)
• the reservation station stores the result of an instruction’s
execution and releases the functional unit as soon as the
instruction is executed; the result becomes available to other
reservation stations; in this way WAR and WAW stalls are avoided
and waits on RAW dependencies are shortened
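
A minimal sketch of how a waiting reservation station captures a result from the common data bus. The class `Station`, the function `broadcast` and the station names are hypothetical; the `vj/vk` (values) and `qj/qk` (producer tags) fields follow the usual textbook naming.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Station:
    name: str
    op: str
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # names of the stations that will produce them
    qk: Optional[str] = None

    def ready(self):
        """An instruction may execute once no operand is still pending."""
        return self.qj is None and self.qk is None

def broadcast(stations, producer, value):
    """Common data bus: every waiting station captures the result by tag."""
    for s in stations:
        if s.qj == producer:
            s.vj, s.qj = value, None
        if s.qk == producer:
            s.vk, s.qk = value, None

mul = Station("Mult1", "*", vj=2.0, vk=3.0)
add = Station("Add1", "+", vj=1.0, qk="Mult1")   # second operand comes from Mult1
stations = [mul, add]

ready_before = add.ready()
broadcast(stations, "Mult1", mul.vj * mul.vk)    # Mult1 finishes and puts 6.0 on the CDB
ready_after = add.ready()
result = add.vj + add.vk
```

Add1 receives the product directly from the bus, without waiting for it to be written to the register file first.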
Tomasulo’s algorithm




• To avoid structural hazards, redundant functional units are
used, such as multiple integer ALUs, floating-point ALUs
or address-computing ALUs
• Example: the P6 architecture (Pentium II and III)
contains 7 ALUs: 2 IEU, 1 FEU, 1 MMX, 3 AGU
• In front of every functional unit a buffer or a list may store
the request(s) (instructions) destined for that unit; e.g. the
NetBurst architecture (Pentium 4) has a list of requests
for every reservation station
• In this way every functional unit is scheduled in advance
and can work almost without stalling
Tomasulo’s algorithm

Commit – an extra stage in the instruction execution
sequence, besides issue, execute and write result
• Used to further improve Tomasulo’s solution
• In the Write result stage the result is written to the re-order buffer
(ROB) and not directly to the destination register or memory; all
data in the ROB may be used by other instructions; in this way some
stall periods may be avoided

Re-order buffer (ROB) – used to commit instructions
executed out of order
• Contains data about the instructions in their original order; some entries
may be filled in ahead of time as a result of out-of-order execution
• The instructions are committed in their original order
• The ROB is useful for roll-back procedures in case of a branch
misprediction or an exception
• In the commit stage, data from the re-order buffer is copied into
the real registers or into the memory in the order specified
by the program and not in the order of execution
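
The in-order commit can be sketched with a queue. This is an illustrative model only: the class `ReorderBuffer`, its method names and the dictionary-based entries are assumptions, and memory stores are left out.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.registers = {}      # architectural ("real") register file

    def issue(self, dest):
        """Allocate an entry in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        """Write result stage: the value goes to the ROB, not to the register."""
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["done"]:
            e = self.entries.popleft()
            self.registers[e["dest"]] = e["value"]
            committed.append(e["dest"])
        return committed

    def flush(self):
        """Roll back: drop every uncommitted entry (misprediction or exception)."""
        self.entries.clear()

rob = ReorderBuffer()
a, b, c = rob.issue("F0"), rob.issue("F2"), rob.issue("F4")
rob.write_result(b, 5.0)     # the second instruction finishes first
first = rob.commit()         # nothing retires: the oldest entry is not done
rob.write_result(a, 1.0)
second = rob.commit()        # now F0 and F2 retire, in program order
```

Because nothing reaches `registers` before commit, a flush leaves the architectural state exactly as it was at the last committed instruction.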
Branch prediction



• A method for solving control hazards
• Problem: a branch in the program disturbs pipeline
execution; if the branch “is taken”, the pipeline must be
flushed and refilled with instructions from the target
address
• Principle: try to guess the direction of a branch
instruction (mainly a conditional branch) and load the
pipeline with instructions from the predicted path
Methods:
• Static prediction – based on the nature of the branch
instruction
• Dynamic prediction – takes into consideration the
history of the branch instructions (whether they were taken
or not in the past may predict their future behavior)
Branch prediction

Static prediction – based on the nature of the
branch instruction

Cases:
• Procedure calls – are taken
• Unconditional jumps – are taken
• Backward branches – are taken (considered as loops in the program)
• Forward branches – are not taken (considered exceptions from normal
execution)

Advantage:
• it is simple and fast
• works well for programs having many loops

Drawback:
• does not work well if there are a lot of conditional jumps
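
The static rules above fit in a few lines. The function name and the `kind` labels (`"call"`, `"jump"`, `"cond"`) are hypothetical, chosen only for this sketch.

```python
def static_predict(kind, pc, target):
    """Return True if the branch is predicted taken.

    kind: "call", "jump" or "cond" (labels assumed for this example)
    pc, target: address of the branch and of its destination
    """
    if kind in ("call", "jump"):
        return True          # procedure calls and unconditional jumps are taken
    return target < pc       # backward branch: taken; forward branch: not taken

# A loop-closing branch at 0x40 that jumps back to 0x20:
# taken 9 times, then not taken once when the loop exits.
outcomes = [True] * 9 + [False]
correct = sum(static_predict("cond", 0x40, 0x20) == o for o in outcomes)
```

For this loop the prediction is right on 9 of the 10 executions; only the loop exit is mispredicted.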
Branch prediction

Dynamic prediction – takes into consideration the history of the
branch instructions

Principle: use previous executions of a conditional jump in order to
better predict the next executions
Methods:
• Next line predictor – stores the pointer to the next instruction (or group of
instructions if multiple instructions are fetched at the same time); the method
stores the decision as well as the target (pointer) of the branch
• Saturating counters – store in one or two bits (saturating counters) the
decisions made before; in the case of a 2-bit counter there are 4 states:
  - Strongly not taken (00) – “not taken” is predicted
  - Weakly not taken (01) – “not taken” is predicted
  - Weakly taken (10) – “taken” is predicted
  - Strongly taken (11) – “taken” is predicted
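Every occurrence of the branch updates the state of the counter.
[Figure: state diagram of the 2-bit saturating counter with states 00, 01, 10 and 11; a Taken outcome moves the counter toward 11, a Not taken outcome moves it toward 00]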
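
A 2-bit saturating counter can be sketched as follows. The class name, the helper `mispredictions` and the initial state (weakly not taken, 01) are assumptions made for the example.

```python
class TwoBitCounter:
    """States 0..3 stand for 00, 01, 10, 11."""

    def __init__(self, state=1):          # start at 01 (weakly not taken): an assumption
        self.state = state

    def predict(self):
        return self.state >= 2            # 10 and 11 predict "taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)   # saturate at 11
        else:
            self.state = max(0, self.state - 1)   # saturate at 00

def mispredictions(outcomes, counter):
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses

# A loop branch taken 9 times and then not taken once; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
misses = mispredictions(outcomes, TwoBitCounter())
```

The counter misses once while warming up and once at each loop exit. Because a single not-taken outcome only moves it from 11 to 10, the first iteration of the second run is still predicted correctly.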
Branch prediction

Dynamic prediction – methods (cont.)

• store the decision and the target address of every executed conditional
jump in a BHT (Branch History Table) and a BTB (Branch Target Buffer);
this information helps predict the next executions of the same instructions
with approx. 90% accuracy
• the BHT and BTB are indexed with the least significant bits of the instruction
address (PC); the number of bits used determines the size of the tables
Two-level adaptive predictor
• necessary for alternating and nested conditional jumps
• idea: memorize jump sequence patterns; the prediction is based on a pattern of
taken (1) and not taken (0) branches
• a two-level adaptive predictor with an n-bit history can predict any repetitive
sequence with any period if all n-bit sub-sequences are different
[Figure: an n-bit history register (e.g. 0100) selects one of the 2-bit counters in the pattern history table; the selected counter gives the prediction]
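
A minimal sketch of a two-level adaptive predictor for a single branch, assuming a 2-bit history and counters that start at 01; the class and variable names are illustrative.

```python
class TwoLevelPredictor:
    def __init__(self, n):
        self.n = n
        self.history = 0                  # last n outcomes, newest in bit 0
        self.pht = [1] * (1 << n)         # pattern history table of 2-bit counters

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        c = self.pht[self.history]
        self.pht[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.n) - 1
        self.history = ((self.history << 1) | int(taken)) & mask

# Repeating pattern taken, taken, not taken (period 3).
# Its 2-bit sub-sequences 11, 10 and 01 are all different.
pattern = [True, True, False]
p = TwoLevelPredictor(2)
results = []
for taken in pattern * 20:
    results.append(p.predict() == taken)
    p.update(taken)
total_misses = results.count(False)
late_misses = results[30:].count(False)
```

After a short warm-up every history value leads to its own counter, so the repeating pattern is predicted without further misses. A single 2-bit counter would keep missing the not-taken outcome in every period.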
Branch prediction

Dynamic prediction – methods (cont.)

Local branch prediction
• a separate history buffer for each conditional jump instruction
• it may use a 2 level branch predictor with common or individual
pattern history table
• Pentium II and III have local branch predictors with a local 4-bit
history and a local pattern history table with 16 entries for each
conditional jump

Global branch predictor
• keeps a shared (global) history of all conditional jumps
• any correlation between two branches is used for prediction
• poor results if branches are not correlated
• usually not as good as local predictors
• variants:
  - “gshare” predictor
  - “gselect” predictor
Branch prediction

Dynamic prediction – methods (cont.)


Global branch predictor – possible implementation: two-level adaptive predictor
with globally shared history buffer and pattern history table
• “gshare” predictor – the index into the pattern history table is the XOR of
the global history buffer and the jump address
• “gselect” predictor – the index is obtained by concatenating the history buffer and
the jump’s address
• Pentium M, Core 2 and AMD processors use global branch prediction
Combinations of local and global predictors:
• Alloyed branch prediction – concatenates the local and global branch history
buffers, sometimes also with the address of the jump
• Agree predictor – computes an XOR between the local and global predictors
(used in Pentium 4)
• Hybrid predictor – a combination of predictors; the result is selected through
voting or taken from the predictor with the best hit rate
• Loop predictor – detects whether a conditional jump behaves as a loop: it is taken N-1 times
and not taken once; it may use a counter for the loop; it may be part of a
hybrid predictor
• Prediction of indirect jumps – when a jump has multiple possible
targets – store the previous targets and use more bits in the
history buffer for such a jump
• Prediction of function returns – stores a copy of the stack that contains the
return addresses of the executed functions
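
The two index computations for the global predictor can be written down directly. The function names and bit widths are assumptions for illustration; real designs usually drop the low address bits that are always zero because of instruction alignment.

```python
def gshare_index(pc, history, bits):
    """XOR of the jump address and the global history, truncated to `bits` bits."""
    mask = (1 << bits) - 1
    return (pc ^ history) & mask

def gselect_index(pc, history, pc_bits, hist_bits):
    """Concatenation: low pc_bits of the address, then low hist_bits of history."""
    pc_part = pc & ((1 << pc_bits) - 1)
    hist_part = history & ((1 << hist_bits) - 1)
    return (pc_part << hist_bits) | hist_part

pc = 0b1011_0110
history = 0b0101_1100
share = gshare_index(pc, history, 8)       # 0b1110_1010
select = gselect_index(pc, history, 4, 4)  # 0b0110_1100
```

For the same table size, gshare uses all 8 address bits and all 8 history bits, while gselect has to split the index between the two.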
Branch prediction

Correlated prediction
• an example of a combination of local and global prediction
• how it works:
  - every entry in the history table has 4 predictors (e.g. 2-bit counters)
  - the 2-bit global history buffer selects between the 4 predictors
  - the state of the selected predictor is updated according to the decision made
  - the global branch history gives the context and the local predictors store the behavior of different jump instructions
• (2,2) predictor – 2-bit history buffer and 2-bit counters
[Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history (e.g. 01 = not taken, then taken) selects one predictor in that row, which gives the prediction]
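
A sketch of the (2,2) scheme, assuming 16 rows selected by the branch address and counters that start at 00. The class name and the two example branches (B is taken exactly when the preceding branch A was taken) are hypothetical.

```python
class CorrelatingPredictor:
    """(m, n) predictor: each row holds 2**m n-bit counters; the m-bit
    global history chooses which counter of the row is used."""

    def __init__(self, entries=16, m=2, n=2):
        self.entries, self.m = entries, m
        self.max = (1 << n) - 1
        self.table = [[0] * (1 << m) for _ in range(entries)]
        self.ghist = 0                    # last m outcomes of all branches

    def predict(self, pc):
        return self.table[pc % self.entries][self.ghist] > self.max // 2

    def update(self, pc, taken):
        row = self.table[pc % self.entries]
        c = row[self.ghist]
        row[self.ghist] = min(self.max, c + 1) if taken else max(0, c - 1)
        self.ghist = ((self.ghist << 1) | int(taken)) & ((1 << self.m) - 1)

PC_A, PC_B = 4, 8
a_outcomes = [True, False, False, True, True, False, True, False] * 10
pred = CorrelatingPredictor()
b_correct = []
for a_taken in a_outcomes:
    pred.update(PC_A, a_taken)            # branch A resolves first
    b_taken = a_taken                     # branch B repeats A's outcome
    b_correct.append(pred.predict(PC_B) == b_taken)
    pred.update(PC_B, b_taken)
b_misses = b_correct.count(False)
```

Branch B looks irregular on its own, but the global history carries A's outcome, so after the counters are trained B is predicted without misses.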
Misprediction statistics for SPEC tests
[Figure: bar chart of the frequency of mispredictions (0% to 20%) per benchmark (among them eqntott and gcc) for the three predictors listed below]
1. 4096 Entries 2-bit BHT
2. Unlimited Entries 2-bit BHT
3. 1024 Entries - local and global prediction (2,2) BHT
- 1 and 3 require the same amount of memory – 8 Kbits
Branch prediction
Tournament predictor
• 2-bit local predictors fail on important branches; by adding global
information, performance may be improved
• Tournament predictors: use two predictors, one based on global
information and one based on local information, and combine them with
a selector
• The hope is to select the right predictor for the right branch (or the right
context of a branch)
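
A tournament predictor can be sketched with three small tables. The sizes, the initial counter values and the rule of training the selector only when the two predictors disagree are assumptions of this example, not a description of a particular processor.

```python
def bump(counter, up):
    """Move a 2-bit saturating counter up or down."""
    return min(3, counter + 1) if up else max(0, counter - 1)

class Tournament:
    def __init__(self, entries=16, hist_bits=2):
        self.entries = entries
        self.mask = (1 << hist_bits) - 1
        self.local = [1] * entries            # 2-bit counters indexed by address
        self.glob = [1] * (1 << hist_bits)    # 2-bit counters indexed by global history
        self.choice = [1] * entries           # selector: >= 2 means "use global"
        self.ghist = 0

    def predict(self, pc):
        i = pc % self.entries
        use_global = self.choice[i] >= 2
        return (self.glob[self.ghist] if use_global else self.local[i]) >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        local_ok = (self.local[i] >= 2) == taken
        global_ok = (self.glob[self.ghist] >= 2) == taken
        if local_ok != global_ok:             # train the selector only on disagreement
            self.choice[i] = bump(self.choice[i], global_ok)
        self.local[i] = bump(self.local[i], taken)
        self.glob[self.ghist] = bump(self.glob[self.ghist], taken)
        self.ghist = ((self.ghist << 1) | int(taken)) & self.mask

# A branch that alternates taken / not taken defeats a single 2-bit counter.
t = Tournament()
hits = []
for taken in [True, False] * 20:
    hits.append(t.predict(0) == taken)
    t.update(0, taken)
misses = hits.count(False)
```

The local counter oscillates between 01 and 10 and is wrong every time, so the selector quickly moves to the global predictor, which learns the alternation from the history.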
Misprediction statistics
[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three predictors: local 2-bit counters, correlating (2,2) scheme, and tournament]
Branch prediction

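Branch Target Buffer (BTB): contains target of taken branches
• an associative access memory
• contains:
  - jump instruction address
  - target address
  - prediction state
[Figure: BTB organization; the PC is compared with the stored jump addresses (Jmp addr), and a matching entry supplies the target and prediction state (Target pred) that give the new address]
[Figure: flowchart of branch handling with a BTB across the IF, ID and EX stages, summarized below]
IF: send the PC to memory and to the branch-target buffer; is an entry found in the branch-target buffer?
• No (ID): is the instruction a taken branch?
  - No: normal instruction execution
  - Yes (EX): enter the branch instruction address and next PC into the branch-target buffer
• Yes: send out the predicted PC; is the branch taken? (EX)
  - No: mispredicted branch; kill the fetched instruction, restart fetch at the other target, and delete the entry from the target buffer
  - Yes: branch correctly predicted; continue execution with no stalls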
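
The lookup and update rules of a BTB that stores only taken branches can be sketched as below. A Python dictionary stands in for the associative memory; the class name, the method names and the fixed 4-byte instruction size are assumptions.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}   # branch address -> predicted target (taken branches only)

    def fetch(self, pc, step=4):
        """IF stage: return (next_pc, hit)."""
        if pc in self.entries:
            return self.entries[pc], True    # entry found: send out the predicted PC
        return pc + step, False              # no entry: fetch the next sequential instruction

    def resolve(self, pc, taken, target, step=4):
        """Once the outcome is known: update the BTB, return True on a misprediction."""
        predicted, hit = self.fetch(pc, step)
        actual = target if taken else pc + step
        if hit and not taken:
            del self.entries[pc]             # predicted taken but was not: delete the entry
        elif taken:
            self.entries[pc] = target        # enter (or refresh) the taken branch
        return predicted != actual

btb = BranchTargetBuffer()
m1 = btb.resolve(0x100, True, 0x80)    # first time: no entry, branch taken
m2 = btb.resolve(0x100, True, 0x80)    # entry found and branch taken
m3 = btb.resolve(0x100, False, 0x80)   # entry found but branch not taken
```

The first and third executions are mispredictions (the fetched instruction must be killed), while the second continues with no stall.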