Chapter 3: ILP and Its Dynamic Exploitation – Branch and Beyond

EEF011 Computer Architecture
吳俊興
Department of Computer Science and Information Engineering, National University of Kaohsiung
October 2004
Chapter Overview
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & the Algorithm
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of the Limitations of ILP
3.9 Limitations on ILP for Realizable Processors
3.10 The P6 Microarchitecture
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
Significance of branches
1. When issuing n instructions per clock cycle, branches arrive up to n times faster in an n-issue processor
2. Amdahl's Law => with the lower potential CPI of an n-issue processor, the relative impact of control stalls is larger
Review: basic schemes – static or software prediction
• Predict taken
• Predict not taken
• Delayed branch
Dynamic Hardware Prediction
Dynamic Branch Prediction
Dynamic branch prediction is the ability of the hardware to make an educated guess about which way a branch will go – will the branch be taken or not?
The hardware can look for clues based on the instructions, or it can use past history – we will discuss both of these directions.
Key concept: a Branch History Table contains information about what a branch did the last time it was executed.
Performance = f(accuracy, cost of misprediction)
Branch Prediction Buffers (Branch History Table)
1-bit branch-prediction buffer: a small memory
– indexed by the lower portion of the address of the branch instruction
– a bit indicating whether the branch was recently taken or not.
– Fetching begins in the predicted direction. If mis-predicted, the bit is inverted
(Figure: a 1024-entry Branch History Table; bits 13–2 of the branch address select which of the 1024 entries holds that branch's prediction.)
Problem: in a loop, a 1-bit BHT causes two mispredictions per loop execution:
• at the end of the loop, when it exits instead of looping as before
• on the first iteration the next time through the loop, when it predicts exit instead of looping
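Below is a minimal C sketch of this 1-bit scheme; the table size matches the 1024-entry figure above, but all identifiers are illustrative, not from the text:

#include <stdint.h>
#include <stdbool.h>

#define BHT_ENTRIES 1024

static bool bht[BHT_ENTRIES];   /* one bit per entry: 1 = taken, 0 = not taken */

/* Bits 13-2 of the branch address select one of the 1024 entries. */
static unsigned bht_index(uint32_t pc) {
    return (pc >> 2) & (BHT_ENTRIES - 1);
}

bool bht_predict(uint32_t pc) {
    return bht[bht_index(pc)];       /* fetching begins in this direction */
}

void bht_update(uint32_t pc, bool taken) {
    bht[bht_index(pc)] = taken;      /* a misprediction inverts the bit */
}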
Dynamic Hardware Prediction
Basic Branch Prediction: Branch Prediction Buffers
2-bit Dynamic Branch Prediction
2-bit scheme: change the prediction only after mispredicting twice (Figure 3.7, p. 198)
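A hedged C sketch of the 2-bit saturating counter in Figure 3.7 (state encoding assumed: 0-1 predict not taken, 2-3 predict taken):

#include <stdbool.h>

typedef unsigned char counter2;       /* holds 0..3 */

bool predict_taken(counter2 c) {
    return c >= 2;
}

counter2 train(counter2 c, bool taken) {
    if (taken)
        return c < 3 ? c + 1 : 3;     /* saturate at "strongly taken" */
    else
        return c > 0 ? c - 1 : 0;     /* saturate at "strongly not taken" */
}

Two consecutive mispredictions are needed to move from state 3 down through state 2 (or from 0 up through 1) before the predicted direction flips.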
Branch History Table Accuracy
Mispredictions arise from two sources:
• a wrong guess for that branch
• getting the branch history of the wrong branch when indexing the table
(Figure 3.9)
• With a 4096-entry table, misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• A 4096-entry table is about as good as an infinite table, but 4096 entries is a lot of hardware
Dynamic Hardware Prediction
Basic Branch Prediction: Branch Prediction Buffers
Correlating Branches
What if we have the code:
if (aa == 2) aa = 0;
if (bb == 2) bb = 0;
if (aa != bb) { ...
Generated MIPS code (aa in R1, bb in R2):
    DSUBUI R3, R1, #2
    BNEZ   R3, L1      ; branch b1 (aa != 2)
    DADD   R1, R0, R0  ; aa = 0
L1: DSUBUI R3, R2, #2
    BNEZ   R3, L2      ; branch b2 (bb != 2)
    DADD   R2, R0, R0  ; bb = 0
L2: DSUBU  R3, R1, R2  ; R3 = aa - bb
    BEQZ   R3, L3      ; branch b3 (aa == bb)
The third branch is based on the outcome of the previous two branches, so the third "if" can be somewhat predicted based on the first two "ifs".
Dynamic Hardware Prediction
Basic Branch Prediction: Branch Prediction Buffers
Correlating Branches (2-level Predictors)
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history)
– The behavior of recent branches then selects between, say, 4 predictions for the next branch, updating just that prediction
• (2,2) predictor: 2 bits of global history, 2-bit local predictors
• (m,n) predictor: uses the behavior of the last m branches to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch
# of bits = 2^m × n × # of prediction entries
e.g., a (2,2) predictor with 1K entries needs 2^2 × 2 × 1K = 8K bits
(Figure: 4 bits of branch address select a row of four 2-bit local predictors; the 2-bit global branch history – e.g., 01 = not taken then taken – selects which of the four supplies the prediction.)
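A C sketch of a (2,2) predictor matching the figure (16 rows of four 2-bit counters; all sizes and names are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define ROWS 16                     /* 4 bits of branch address */

static unsigned char pht[ROWS][4];  /* four 2-bit counters per row */
static unsigned ghist;              /* 2-bit global history of the last 2 branches */

bool predict_corr(uint32_t pc) {
    return pht[(pc >> 2) & (ROWS - 1)][ghist & 3] >= 2;
}

void update_corr(uint32_t pc, bool taken) {
    unsigned char *c = &pht[(pc >> 2) & (ROWS - 1)][ghist & 3];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = ((ghist << 1) | (taken ? 1 : 0)) & 3;  /* shift in the outcome */
}

Per the formula above, this toy table costs 2^2 × 2 × 16 = 128 bits.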
Example: Multiple Consecutive Branches
if (d == 0)     /* b1: not taken when d == 0, taken otherwise */
    d = 1;
if (d == 1)     /* b2: not taken when d == 1, taken otherwise */
    ...
If b1 is not taken, then d becomes 1 and b2 will also be not taken.
1-bit predictor: consider d alternating between 2 and 0. When d = 2 both branches are taken; when d = 0 both are not taken. A 1-bit predictor trained by the previous iteration therefore always guesses the opposite outcome: all branches are mispredicted.
Same code with one bit of correlation:
if (d == 0)     /* b1 */
    d = 1;
if (d == 1)     /* b2 */
    ...
Two prediction bits per branch: one prediction used if the last branch was not taken, and one if it was taken.
(1,1) predictor – a 1-bit predictor with 1 bit of correlation: the last branch executed (either taken or not taken) decides which of the two prediction bits is consulted and updated. With d alternating between 2 and 0, only the first iteration mispredicts.
Tournament Predictors: Adaptively Combining Local and Global Predictors
Use several levels of branch-prediction tables together with an algorithm for choosing among the multiple predictors
Advantage: the ability to dynamically select the right predictor for each individual branch
Figure 3.16 The state transition diagram for a tournament predictor has four states corresponding to which predictor to use; transitions are labeled with the correctness of the two predictors (P1/P2 = 0 incorrect, 1 correct; labels 1:0, 0:1, 2+:0, 0:2+).
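A hedged C sketch in the spirit of Figure 3.16: a 2-bit saturating selector per entry that moves toward whichever predictor was correct when the other was wrong (the figure's exact transition conditions may differ):

/* states 0-1: use predictor P1; states 2-3: use predictor P2 */
unsigned char select_update(unsigned char s, int p1_correct, int p2_correct) {
    if (p1_correct && !p2_correct && s > 0) s--;   /* 1:0 -> move toward P1 */
    if (!p1_correct && p2_correct && s < 3) s++;   /* 0:1 -> move toward P2 */
    return s;                                      /* 1:1 and 0:0 leave s unchanged */
}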
Figure 3.17 The fraction of predictions coming from the local predictor for a tournament predictor (a local 2-bit predictor combined with a global predictor and a selector) on the SPEC89 benchmarks.
Figure 3.18 The misprediction rate for three different predictors on SPEC89 as the total number of bits is increased.
Dynamic Hardware Prediction
Basic Branch Prediction: Branch Prediction Buffers
Accuracy of Different Schemes (Figure 3.15, p. 206)
A (2,2) predictor with 1K entries often outperforms a 2-bit predictor with an unlimited number of entries.
3.5 High Performance Instruction Delivery
Predicting well is not enough for a multiple-issue pipeline
• We expect to deliver a high-bandwidth instruction stream
• Ideal no-penalty branch: we need to know the next fetch address by the end of the IF stage
• In the classic five-stage pipeline, the branch-prediction buffer is accessed during the ID cycle – one cycle too late
The goal here is to be able to fetch an instruction from the destination of a branch
• We need the next address at the same time as we make the prediction
• This is hard if we don't know where that instruction is until the branch has been resolved
• The solution is a table that remembers the destination addresses of previous branches
Three concepts: branch-target buffers, integrated instruction fetch units, and handling indirect branches by predicting return addresses
Dynamic Hardware Prediction
Basic Branch Prediction: Branch Target Buffers
• Branch Target Buffer (BTB): a branch-prediction cache that stores the predicted address of the next instruction after a branch
– use the address of the branch as the index to get the prediction AND the branch-target address (if taken)
– must check that the fetch PC matches a stored branch PC; only predicted-taken branches are stored
• Branch-target address
– branch-prediction buffer: accessed during the ID cycle
– branch-target buffer: accessed during the IF stage
(Figure 3.19, p. 210: the PC of the instruction being fetched is compared (=?) against the stored branch PCs; entries hold the predicted PC and possibly extra prediction state bits. Yes: the instruction is a branch – use the predicted PC as the next PC. No: the branch is not predicted – proceed normally with NextPC = PC+4.)
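A minimal C sketch of the lookup in Figure 3.19, modeled as a direct-mapped cache of predicted-taken branches (entry layout and size are assumptions):

#include <stdint.h>

#define BTB_ENTRIES 512

struct btb_entry {
    uint32_t branch_pc;     /* full PC of a predicted-taken branch (the tag) */
    uint32_t predicted_pc;  /* address to fetch next if this branch is taken */
    int      valid;
};

static struct btb_entry btb[BTB_ENTRIES];

/* Returns the next fetch PC during IF: predicted target on a hit, PC+4 otherwise. */
uint32_t next_fetch_pc(uint32_t pc) {
    struct btb_entry *e = &btb[(pc >> 2) & (BTB_ENTRIES - 1)];
    if (e->valid && e->branch_pc == pc)   /* must match: is this PC a known branch? */
        return e->predicted_pc;
    return pc + 4;                        /* branch not predicted, proceed normally */
}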
Figure 3.20 Steps for handling an instruction with the BTB
Incorrect prediction:
• 1 clock cycle wasted fetching the wrong instruction
• the fetch restarts 1 clock cycle later
Total penalty: 2 clock cycles
Dynamic Hardware Prediction Example
Basic Branch Prediction: Branch Target Buffers

Case | Instruction in buffer | Prediction | Actual branch | Penalty cycles
  1  |        Yes            |   Taken    |    Taken      |      0
  2  |        Yes            |   Taken    |  Not taken    |      2
  3  |        No             |     –      |    Taken      |      2
  4  |        No             |     –      |  Not taken    |      0

Example on page 211 (Figure 3.21).
Determine the total branch penalty for a BTB using the above penalties. Assume also the following:
• prediction accuracy of 90%
• hit rate in the buffer of 90%
• 60% of the branches are taken
Branch penalty = (percent buffer hit rate × percent incorrect predictions × 2)   [case 2]
               + ((1 − percent buffer hit rate) × taken branches × 2)            [case 3]
Branch penalty = (90% × 10% × 2) + (10% × 60% × 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles
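The same calculation as a small hedged C helper (names illustrative):

/* Expected branch penalty in clock cycles, per the two 2-cycle cases above. */
double btb_branch_penalty(double hit_rate, double mispredict_rate,
                          double taken_fraction) {
    double case2 = hit_rate * mispredict_rate * 2.0;         /* hit, wrong prediction */
    double case3 = (1.0 - hit_rate) * taken_fraction * 2.0;  /* miss, branch taken */
    return case2 + case3;
}
/* btb_branch_penalty(0.90, 0.10, 0.60) -> 0.30 clock cycles */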
Dynamic Hardware Prediction
Integrated Instruction Fetch Units
A separate autonomous unit that feeds instructions to the rest of the pipeline
– integrated branch prediction – the branch predictor constantly predicts branches as part of fetch
– instruction prefetch – the unit autonomously manages the prefetching of instructions, integrating it with branch prediction
– instruction memory access and buffering – using prefetch to hide the cost of crossing cache blocks
Prefetching and trace caches are discussed in Chapter 5
Dynamic Hardware Prediction
Return Address Predictors
Predicting indirect jumps – the destination address varies at run time
– indirect procedure calls, procedure returns, select or case statements
Approaches (see the sketch below)
– predict with a branch-target buffer
– keep a stack of return addresses: push the return address on the stack at a call and pop one off at a return
– fetch down multiple paths to reduce the misprediction penalty
• cache addresses or instructions from multiple paths in the target buffer
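A C sketch of the return-address-stack approach (depth and overflow behavior are illustrative assumptions):

#include <stdint.h>

#define RAS_DEPTH 16

static uint32_t ras[RAS_DEPTH];   /* small circular stack of return addresses */
static int ras_top;

void ras_push(uint32_t return_addr) {      /* at a call */
    ras[ras_top] = return_addr;
    ras_top = (ras_top + 1) % RAS_DEPTH;   /* oldest entry overwritten on overflow */
}

uint32_t ras_pop(void) {                   /* at a return: the predicted target */
    ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
    return ras[ras_top];
}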
3.6 More ILP with Multiple Issue
Previous techniques achieve an ideal CPI of one
• eliminate data and control stalls
Multiple-issue processors achieve CPI < 1!!
• start (issue) more than one instruction in a given clock cycle
Related approaches:
• vector processing: explicit coding of independent loops as operations on large vectors of numbers
• multimedia instructions being added to many processors
Two basic flavors:
• superscalar processors and
• VLIW (very long instruction word) processors
Issuing Multiple Instructions/Cycle
Flavor I: Superscalar processors issue a varying number of instructions per clock, and can be either
– statically scheduled (by the compiler, executing in order) or
– dynamically scheduled (by the hardware, executing out of order)
A superscalar issues a varying number of instructions/cycle (1 to 8), scheduled by the compiler or by hardware (Tomasulo)
Example: a 4-issue static superscalar processor
– issue packet: the group of instructions received from the fetch unit that could potentially issue in one clock cycle
– the IF unit examines each instruction in the issue packet in order
– an instruction is not issued if it would cause a structural hazard or a data hazard (hardware hazard detection)
Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000
Flavor II: VLIW (Very Long Instruction Word)
VLIW processors issue a fixed number of instructions, formatted either
– as one large instruction or
– as a fixed instruction packet with the parallelism among instructions explicitly indicated by the instruction
also known as EPIC (Explicitly Parallel Instruction Computing)
Inherently statically scheduled by compilers (see Chapter 4)
– a fixed number of instructions (4-16) scheduled by the compiler; operations are placed into wide templates
– example: Intel Architecture-64 (IA-64), a 64-bit address architecture
Summary - Issuing Multiple Instructions/Cycle
Statically Scheduled Superscalar MIPS Processor
– Fetch 64 bits/clock cycle; integer instruction on the left, FP instruction on the right
• fetch/prefetch multiple instructions, but issue/deliver 0–n instructions per cycle
• hardware hazard detection
In our MIPS example, we can handle 2 instructions/cycle:
• one floating-point instruction
• one of anything else
– can only issue the 2nd instruction if the 1st instruction issues
– need more ports on the FP register file to do an FP load and an FP op as a pair
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
• A 1-cycle load delay now delays 3 instructions in the superscalar pipeline
– the FP instruction paired with the load in the right half can't use the result, nor can either instruction in the next issue slot
Dynamic Scheduling with Tomasulo's Algorithm in a Superscalar
• How do we issue two instructions per clock and keep in-order instruction issue for Tomasulo?
– assume 1 integer + 1 floating-point instruction per cycle
– two approaches to removing the constraint of issuing one integer and one FP instruction in a clock:
1) issue in half a clock cycle (i.e., run issue logic at 2X the clock rate)
2) one Tomasulo control for integer, one for floating point
• Only loads/stores might cause a dependence between integer and FP issue:
– replace the load reservation stations with a load queue; operands must be read in the order they are fetched
– a load checks addresses in the store queue to avoid a RAW violation
– a store checks addresses in the load queue to avoid WAR and WAW violations
Example: integer ALU 1 cycle, load/store 2 cycles, FP add 3 cycles;
assume 2 CDBs, 1 integer ALU, 1 FP unit, and perfect branch prediction
Integrated ALU
One integer functional unit for both ALU operations and effective-address calculations (Figures 3.25 and 3.26)
Execute stage handles:
• L.D and S.D – effective-address calculation
• branches – when the branch condition can be evaluated
Results:
• a new loop iteration is fetched and issued every 3 clock cycles
• issue rate: 5/3 = 1.67 instructions per clock
• the loop executes in 16 clock cycles
• one CDB is enough
Separate ALU
Separate functional units for effective-address calculations and ALU operations (Figures 3.27 and 3.28)
• the loop executes in 5 fewer clock cycles (11 versus 16)
• two CDBs are needed
Three factors limit the performance of the example pipeline:
1. Imbalance between the functional-unit structure of the pipeline and the example loop
– it is impossible to fully use the FP units; we would need fewer dependent integer operations per loop
2. The amount of overhead per loop iteration is very high
– DADDIU and BNE: 2 out of 5 instructions
3. The control hazard causes a 1-cycle penalty on every loop iteration
– we assume any instruction following a branch cannot start execution until the branch condition has been evaluated
– accurate branch prediction alone is not sufficient
3.7 Hardware-based Speculation
• Motivation
– prediction alone is not sufficient to expose a high amount of ILP
– overcome control dependence by speculating on the outcome of branches
⇒ execute the program as if our guesses were correct
• Dynamic scheduling vs. speculation
– dynamic scheduling: only fetch and issue instructions as if our branch predictions were always correct
– speculation: fetch, issue, and execute such instructions
• Incorrect speculation ⇒ undo
Examples: Intel Pentium II/III/4, AMD K5/K6/Athlon, PowerPC 603/604/G3/G4, MIPS R10000/R12000, Alpha 21264
Key Ideas
• Design
– dynamic branch prediction to choose which instructions to execute
– speculation to allow instructions to execute before the control dependences are resolved
• with the ability to undo the effects of an incorrectly speculated sequence
– dynamic scheduling to deal with the scheduling of different combinations of basic blocks
• Implementation
– allow instructions to execute out of order,
– but force them to commit in order, and prevent any irrevocable action until an instruction commits
Reorder Buffer (ROB)
• Reorder buffer – an additional set of hardware buffers that hold the results of instructions that have finished execution but have not committed
– like the reservation stations, a source of operands for other instructions
– in the interval between completion of instruction execution and instruction commit
• Similar to the store buffer in Tomasulo's algorithm
– with speculation – the register file is not updated until the instruction commits
– in Tomasulo – once an instruction writes its result, any subsequently issued instructions find the result in the register file
The ROB completely replaces the store buffers of Tomasulo's algorithm
ROB Components
• Instruction type field
– indicates whether the instruction is
• a branch (has no destination result),
• a store (has a memory-address destination), or
• a register operation (ALU operation or load, which has a register destination)
• Destination field
– supplies the register number (for loads and ALU operations) or the memory address (for stores) where the instruction result should be written
• Value field
– holds the value of the instruction result until the instruction commits
• Ready field
– indicates that the instruction has completed execution and the value is ready
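The four fields map naturally onto a record; a hedged C sketch (field types are assumptions):

#include <stdint.h>
#include <stdbool.h>

enum inst_type { BRANCH, STORE, REG_OP };  /* register op = ALU operation or load */

struct rob_entry {
    enum inst_type type;    /* instruction type field */
    uint32_t destination;   /* register # for loads/ALU ops, memory address for stores */
    uint64_t value;         /* result, held here until the instruction commits */
    bool     ready;         /* execution finished; value is valid */
    bool     busy;          /* entry is in use */
};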
Speculative Tomasulo's Algorithm
1. Issue (= dispatch)
• get an instruction from the instruction queue
• issue it if there is an empty reservation station and an empty slot in the ROB; mark both as in use, and send the ROB # for the result to the RS
2. Execute
• wait until all operands are available, then execute
3. Write Result
• write the result on the CDB, tagged with the ROB # (for a store, write to the value field of its ROB entry)
• mark the RS as available
4. Commit – three cases (see the sketch below)
• normal commit
– occurs when an instruction reaches the head of the ROB and its result is present in the buffer
– update the register with the result and free the ROB entry
• store
– like normal commit, but memory is updated instead of a register
• mispredicted branch
– the ROB is flushed and execution restarts at the correct successor of the branch
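A hedged C sketch of the commit step's three cases, operating only on the head of the ROB so commit stays in program order (the helpers are hypothetical stand-ins for the register file, memory, and flush machinery):

#include <stdint.h>
#include <stdbool.h>

#define ROB_SIZE 16

enum inst_type { BRANCH, STORE, REG_OP };

struct rob_entry {              /* as sketched above */
    enum inst_type type;
    uint32_t destination;
    uint64_t value;
    bool ready, busy;
};

void regfile_write(uint32_t reg, uint64_t v);   /* hypothetical helpers */
void memory_write(uint32_t addr, uint64_t v);
void flush_rob_and_restart(void);

void commit_step(struct rob_entry rob[], int *head, bool branch_mispredicted) {
    struct rob_entry *e = &rob[*head];
    if (!e->busy || !e->ready)
        return;                                   /* head not finished: wait */
    if (e->type == BRANCH && branch_mispredicted) {
        flush_rob_and_restart();                  /* squash the wrong path */
        return;
    }
    if (e->type == REG_OP)
        regfile_write(e->destination, e->value);  /* normal commit */
    else if (e->type == STORE)
        memory_write(e->destination, e->value);   /* update memory instead */
    e->busy = false;                              /* free the ROB entry */
    *head = (*head + 1) % ROB_SIZE;
}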
Example of Speculative Tomasulo's Algorithm
Instruction status when MUL.D is ready to commit (latencies: add 2 cycles, multiply 10 cycles, divide 40 cycles):

Instruction        | Issue | Exec Comp | Write Result
L.D   F6, 34(R2)   |   1   |     3     |      4
L.D   F2, 45(R3)   |   2   |     4     |      5
MUL.D F0, F2, F4   |   3   |    15     |
SUB.D F8, F6, F2   |   4   |     7     |      8
DIV.D F10, F0, F6  |   5   |           |
ADD.D F6, F8, F2   |   6   |    10     |     11

This is cycle 15 of the original (non-speculative) algorithm. With speculation, SUB.D and ADD.D will not commit until MUL.D commits, although their results are available and can be used as operands.
Hazards Through Memory – Load/Store RAW Hazard
Question: given a load that follows a store in program order, are the two related?
– (Alternatively: is there a RAW hazard between the store and the load?)
E.g.:
    st 0(R2), R5
    ld R6, 0(R3)
Can we go ahead and start the load early?
– the store address could be delayed for a long time by some calculation that leads to R2 (a divide?)
– we might want to issue/begin execution of both operations in the same cycle
Answer: we are not allowed to start the load until we know that address 0(R2) ≠ 0(R3)
– do not allow a load to initiate its second step (the memory access) if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load
– maintain program order for the computation of the effective address of a load with respect to all earlier stores
How about WAR/WAW hazards through memory?
– stores commit in order, so there is no worry
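A C sketch of the disambiguation check above: before a load performs its memory access, scan the active ROB entries for a store whose Destination (address) field matches the load's A field (structures are illustrative):

#include <stdint.h>
#include <stdbool.h>

#define ROB_SIZE 16

struct rob_slot {
    bool     busy;
    bool     is_store;
    uint32_t address;   /* the store's Destination field (effective address) */
};

/* Effective addresses are computed in program order, so earlier stores
   have valid addresses by the time the load checks. */
bool load_may_proceed(const struct rob_slot rob[], uint32_t load_addr) {
    for (int i = 0; i < ROB_SIZE; i++)
        if (rob[i].busy && rob[i].is_store && rob[i].address == load_addr)
            return false;   /* possible RAW through memory: make the load wait */
    return true;
}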
Multiple Issue
Separate functional units for address calculation, ALU operations, and branch-condition evaluation
Without speculation: the L.D after the BNE must wait until the branch outcome is determined
With speculation: the L.D following the BNE can start execution early
Extended Physical Registers
Speculative Tomasulo's algorithm with a ROB
– architecturally visible registers (R0, …, R31 and F0, …, F31)
– values reside in the visible register set and the reservation stations, and temporarily in the ROB
Alternative to the ROB: a larger physical set of registers plus register renaming
– an extended set of physical registers holds both the architecturally visible registers and temporary values
• the extended registers replace the functions of the ROB and the RS
• a physical register does not become an architectural register until the instruction commits
– during instruction issue, the renaming process maps the names of architectural registers to physical registers using a renaming table, allocating a new unused register for the destination (see the sketch below)
• WAW and WAR hazards are avoided by renaming the destination register
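A minimal C sketch of rename-at-issue (table sizes, the free list, and initialization are illustrative assumptions; real hardware stalls issue when no physical register is free):

#define NUM_ARCH 32
#define NUM_PHYS 80            /* architectural registers + extra rename registers */

static int map[NUM_ARCH];      /* architectural -> physical mapping */
static int free_list[NUM_PHYS];
static int free_count;         /* assume initialized with the unmapped registers */

/* Rename one instruction "dst <- src1 op src2" at issue time. */
void rename(int dst, int src1, int src2, int *pd, int *ps1, int *ps2) {
    *ps1 = map[src1];               /* sources read the current mapping */
    *ps2 = map[src2];
    *pd  = free_list[--free_count]; /* fresh physical register for the destination */
    map[dst] = *pd;                 /* later readers of dst see the new copy,
                                       so WAW and WAR hazards disappear */
}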
Register Renaming versus Reorder Buffers
• Advantage: renaming simplifies instruction commit – only two simple actions are required:
1. record that the mapping between an architectural register # and a physical register # is no longer speculative
2. free up any physical register still holding the older value of that architectural register
• Advantage: simplifies instruction issue
All results are in the extended registers, so issue logic need not examine both the ROB and the register file
• Disadvantage: deallocating registers is complex
Before freeing up a physical register, we must know that
– it no longer corresponds to an architectural register, and
• rewriting an architectural register causes the renaming table to point elsewhere
– no further uses of the physical register are outstanding (it is not needed as a source)
• found by examining the source register specifiers of all instructions in the functional-unit queues
In practice, 20-80 extra registers: Alpha, PowerPC, MIPS, Pentium, …
– their number limits the number of instructions in execution (each is used until commit)
3.8 Studies of the Limitations of ILP
• Conflicting studies of the amount of improvement available
– benchmarks (vectorized FP Fortran vs. integer C programs)
– hardware sophistication
– compiler sophistication
• Studies of ILP limitations ask:
– how much ILP is available using existing mechanisms with increasing HW budgets?
– do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
Studies of ILP: Ideal Hardware Model
Initial HW model here; MIPS compilers.
Assumptions for an ideal/perfect machine to start:
1. Register renaming – infinite virtual/physical registers
– all WAW & WAR hazards are avoided
– an unbounded number of instructions can begin execution simultaneously
2. Branch prediction – perfect; no mispredictions
3. Jump prediction – all jumps perfectly predicted
– a machine with perfect speculation
– an unbounded buffer of instructions available
4. Memory-address alias analysis – all addresses are known, and a store can be moved before a load provided the addresses are not equal
5. Perfect caches – all loads and stores always complete in one cycle
Assumptions 2+3 eliminate all control dependences; assumptions 1+4 eliminate all but true data dependences (RAW).
1-cycle latency for all instructions; an unlimited number of instructions can issue per clock cycle.
Upper Limit to ILP: Ideal Machine (Figure 3.35, page 242)
This is the amount of parallelism available when there are no branch mispredictions and we are limited only by data dependences.
Measured issue rates: integer programs 18-63 instructions per cycle, FP programs 75-150.
Limitations on Window Size and Maximum Issue Count
• Operand dependence comparisons required to determine whether n issuing instructions have any register dependences among them:
2 × (1 + 2 + … + (n−1)) = 2 × (n−1)n/2 = n² − n
e.g., issuing 6 instructions requires 30 comparisons; 64 instructions require 4032.
• Window: the set of instructions kept in the processor and examined for simultaneous execution
# of result comparisons per cycle = maximum completion rate × window size × # of operands per instruction
• Assume a 2K window and a maximum of 64 issues per clock in the later studies
Figure 3.36 Effects of reducing the window size: FP (59-61) > integer (15-41)
Figure 3.37 Effect of window size and average issue rate
Effects of Realistic Branch and Jump Prediction
What parallelism do we get when we don't allow perfect branch prediction, but assume some realistic model? Possibilities include:
1. Perfect – all branches are perfectly predicted (previous slide)
2. Tournament-based branch predictor – a correlating 2-bit predictor and a non-correlating 2-bit predictor together with a selector; 8K entries for branches and 2K entries for jumps
3. Standard 2-bit predictor with 512 2-bit entries
4. Static – a static predictor uses the profile history of the program and predicts that the branch is always taken or always not taken
5. None – parallelism is limited to within a basic block
Assume a 2K window, 64 issues, and the tournament-based predictor in the later studies
Effects of Branch-Prediction Schemes
Figure 3.38 Effect of branch-prediction schemes: FP (15-48) > integer (9-12)
Figure 3.39 The same data sorted by application
Figure 3.40 Branch-prediction accuracy
Effects of Finite Registers for Renaming
FP (16-45) > integer (10-15)
• Window size = 2K, max issue = 64 instructions, tournament-based branch predictor
• The impact on the integer programs is small, primarily because the limitations in window size and branch prediction have already limited the ILP substantially
• Assume 256 integer and 256 FP registers available for renaming in the later studies
Effects of Imperfect Memory Alias Analysis
Different models for memory alias analysis (memory disambiguation):
1. Perfect – all memory addresses are known exactly
2. Global/stack perfect – perfect analysis for global and stack references (the best compiler-based analysis schemes can approach this); all heap references are assumed to conflict
3. Inspection – examine the accesses to see if they can be determined not to interfere at compile time
– e.g., Mem[R10 + 20] and Mem[R10 + 100] never conflict (same base register with different offsets)
4. None – all memory references are assumed to conflict
Effects of Imperfect Memory Alias Analysis (continued)
• Since there are no heap references in Fortran, there is no difference between perfect and global/stack-perfect analysis for the Fortran programs
• 2K window, 64 issues
Figure 3.43 Effect of alias analysis
Figure 3.44 The same data sorted by application
3.10 The P6 Microarchitecture
– The basis for the Pentium Pro, Pentium II, and Pentium III
– A dynamically scheduled processor that translates each IA-32 instruction into a series of micro-operations (uops) executed by the pipeline
• uops are similar to typical RISC instructions
• up to 3 IA-32 instructions are fetched, decoded, and translated into uops every clock cycle (max 6 uops per cycle)
• if an IA-32 instruction requires more than 4 uops, it is implemented by a microcoded sequence that generates the necessary uops over multiple clock cycles
The three processors differ in clock rate, cache architecture, and memory interface. The Pentium II added MMX (multimedia extension); the Pentium III added SSE (Streaming SIMD Extensions).
P6 Microarchitecture Pipeline
• uops are executed by an out-of-order, speculative pipeline using register renaming and a ROB (similar to Section 3.7)
– up to 3 uops per clock are renamed and dispatched to reservation stations, and up to 3 committed
• 14 superpipelined stages
– 8 stages: in-order instruction fetch, decode, and dispatch
• 512-entry, two-level (correlating) branch predictor
• the decode and issue stages include 40 extended registers for register renaming; uops dispatch to one of 20 reservation stations and to one of 40 entries in the ROB
– 3 stages: out-of-order execution in one of 5 separate functional units (ALU, FP, branch, memory address, memory access; execution takes 1-32 cycles)
– 3 stages: instruction commit
A repeat rate of 2 means that an operation can start every other cycle.
Stalls in the Decode Cycle
Figure 3.50 Number of instructions decoded per clock (average = 0.87 instructions per cycle)
Figure 3.51 Stall cycles per instruction at decode time (I-cache misses + lack of RS/ROB entries)
Figure 3.52 Number of micro-operations per IA-32 instruction
• most instructions take only one uop
• on average, 1.37 uops per IA-32 instruction
• other than fpppp, the integer programs typically require more uops
Figure 3.53 Number of misses per thousand instructions for the L1 and L2 caches
• L1 = 8KB I + 8KB D (misses largely hidden by speculative execution)
• L2 = 256KB, with roughly 5 times the L1 access cost (L2 behavior dominates performance)
Figure 3.54 BTB miss frequency (the dominant effect) vs. mispredict frequency
If the BTB misses, a static prediction is used:
• backward branches are predicted taken (1-cycle penalty if correctly predicted)
• forward branches are predicted not taken (no penalty if correctly predicted)
Branch mispredictions:
• direct penalty: 10-15 cycles
• indirect penalty (incorrectly speculated instructions): hard to measure
On average, about 20% of branches use the simple static predictor rule
Instruction Commit
Figure 3.55 The fraction of issued instructions that do not commit. On average, each mispredicted branch causes about 20 issued uops to be canceled.
Figure 3.56 Breakdown of how often 0-3 uops commit in a cycle (average: 55%, 13%, 8%, 23%)
Figure 3.57 Actual CPI and its individual components
The uop cycles assume that 3 uops complete every cycle, and include the number of uops per instruction.
The average CPI is 1.15 for the SPECint programs and 2.0 for the SPECfp programs.
AMD Athlon
• Similar to the P6 microarchitecture (Pentium III), but with more resources
• Transistors: PIII 24M vs. Athlon 37M
• Die size: 106 mm² vs. 117 mm²
• Power: 30W vs. 76W
• Caches: 16K/16K/256K vs. 64K/64K/256K
• Window size: 40 vs. 72 uops
• Rename registers: 40 vs. 36 int + 36 FP
• BTB: 512 × 2 vs. 4096 × 2
• Pipeline: 10-12 stages vs. 9-11 stages
• Clock rate: 1.0 GHz vs. 1.2 GHz
• Memory bandwidth: 1.06 GB/s vs. 2.12 GB/s
Pentium 4 – NetBurst Microarchitecture
• Still translates from 80x86 instructions to micro-ops
• A much deeper pipeline: 24 stages (vs. 14)
• Uses register renaming (potentially up to 128 registers) rather than a ROB (vs. 40 entries)
– window: 40 vs. 126
• 7 execution units (vs. 5; one more ALU and an address computation unit)
• P4 has a better branch predictor and more FUs
• Aggressive ALU and data cache (operating on half a clock cycle)
• 8 times larger BTB (4096 vs. 512 entries)
• New SSE2 instructions allow 2 floating-point operations per instruction
• Instruction cache holds micro-operations instead of 80x86 instructions
– no 80x86 decode stages on a cache hit
– called the "trace cache" (TC)
• Faster memory bus: 400 MHz vs. 133 MHz
• Caches
– Pentium III: L1I 16KB, L1D 16KB, L2 256KB
– Pentium 4: L1I 12K uops, L1D 8KB, L2 256KB
– block size: PIII 32B vs. P4 128B; 128 vs. 256 bits/clock
• Clock rates: Pentium III 1 GHz vs. Pentium 4 1.5 GHz
The Pentium 4
Pentium, Pentium Pro, and Pentium 4 Pipelines
• Pentium (P5) = 5 stages
• Pentium Pro, II, III (P6) = 10 stages (1-cycle execute)
• Pentium 4 (NetBurst) = 20 stages (not counting decode, since instructions come from the trace cache)
The Pentium 4
Block Diagram of the Pentium 4 Microarchitecture
• BTB = Branch Target Buffer (branch predictor)
• I-TLB = Instruction TLB; Trace Cache = instruction cache
• RF = Register File; AGU = Address Generation Unit
• "Double pumped ALU" means the ALU runs at 2X the clock rate => equivalent to 2X the ALU functional units
The Pentium 4
Pentium 4 Die Photo
• 42M transistors (PIII: 26M)
• 217 mm² (PIII: 106 mm²)
• L1 execution (trace) cache – buffers 12,000 micro-ops
• 8KB data cache
• 256KB L2 cache
The Pentium 4
Benchmarks: Pentium 4 vs. PIII vs. Athlon
• SPECbase2000
– Int: P4@1.5 GHz: 524, PIII@1GHz: 454, AMD Athlon@1.2GHz: ?
– FP: P4@1.5 GHz: 549, PIII@1GHz: 329, AMD Athlon@1.2GHz: 304
• WorldBench 2000 benchmark (business applications; PC World magazine, Nov. 20, 2000; bigger is better)
– P4: 164, PIII: 167, AMD Athlon: 180
• Quake 3 Arena: P4 172, Athlon 151
• SYSmark 2000 composite: P4 209, Athlon 221
• Office productivity: P4 197, Athlon 209
• S.F. Chronicle 11/20/00: "… the challenge for AMD now will be to argue that frequency is not the most important thing – precisely the position Intel has argued while its Pentium III lagged behind the Athlon in clock speed."
The Pentium 4
Why is the Pentium 4 Slower than the Pentium III?
• The instruction count is the same for x86 programs
• Clock rates: P4 > Athlon > PIII
• How can the P4 be slower?
• Time = Instruction count × CPI × 1/Clock rate
• The average clocks per instruction (CPI) of the P4 must be worse than that of the Athlon and PIII
• Will CPI ever get below 1.0 for real programs?
The Pentium 4
Another Approach: Multithreaded Execution for Servers
• Thread: a process with its own instructions and data
– a thread may be one process of a parallel program consisting of multiple processes, or it may be an independent program
– each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• Multithreading: multiple threads share the functional units of one processor via overlapping
– the processor must duplicate the independent state of each thread, e.g., a separate copy of the register file and a separate PC
– memory is shared through the virtual memory mechanisms
• Threads execute overlapped, often interleaved
– when a thread is stalled, perhaps for a cache miss, another thread can execute, improving throughput
Summary
3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & the Algorithm
3.4 Reducing Branch Costs with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of the Limitations of ILP
3.9 Limitations on ILP for Realizable Processors
3.10 The P6 Microarchitecture