ELEC 669
Low Power Design Techniques
Lecture 2
Amirali Baniasadi
amirali@ece.uvic.ca
How to write a review?
 Think Critically.
 What if?
 Next Step?
 Any other applications?
Branches
 Instructions which can alter the flow of instruction execution in a program
 Direct
  Conditional: if-then-else, for loops (bez, bnez, etc.)
  Unconditional: procedure calls (jal), goto (j)
Motivation
 A branch is fetched, but takes N cycles to execute.
 Pipelined execution: a new instruction enters the pipeline every cycle (F D A M W), but each still takes several cycles to execute.
 Control flow changes: two possible paths after a branch is fetched.
 This introduces pipeline "bubbles" (branch delay slots).
 Prediction offers a chance to avoid these bubbles.
[Figure: overlapped F D A M W pipeline stages, with a pipeline bubble following a fetched branch]
Techniques for handling branches
 Stalling
 Branch delay slots
  Relies on programmer/compiler to fill
  Depends on being able to find suitable instructions
  Ties resolution delay to a particular pipeline
Why aren’t these techniques acceptable?
 Branches are frequent - 15-25%
 Today’s pipelines are deeper and wider
 Higher performance penalty for stalling
 Misprediction Penalty = issue width * resolution delay cycles
 A lot of cycles can be wasted!!!
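The penalty formula above can be made concrete with a quick sketch. The 4-wide issue and 10-cycle resolution delay below are illustrative assumptions, not figures from the slides:

```python
# Illustrative sketch: issue slots wasted per branch misprediction.
def misprediction_penalty(issue_width, resolution_delay_cycles):
    """Issue slots lost each time a branch is mispredicted."""
    return issue_width * resolution_delay_cycles

print(misprediction_penalty(issue_width=4, resolution_delay_cycles=10))   # 40
print(misprediction_penalty(issue_width=8, resolution_delay_cycles=20))   # 160
```

Note how deeper (longer resolution delay) and wider pipelines multiply the loss, which is the point the slide makes.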
Branch Prediction
 Predicting the outcome of a branch
 Direction:
Taken / Not Taken
Direction predictors
 Target Address
PC+offset (Taken)/ PC+4 (Not Taken)
Target address predictors
• Branch Target Buffer (BTB)
Why do we need branch prediction?
Branch prediction
Increases the number of instructions available for the scheduler
to issue. Increases instruction level parallelism (ILP)
Allows useful work to be completed while waiting for the branch
to resolve
Branch Prediction Strategies
 Static
 Decided before runtime
 Examples:
Always-Not Taken
Always-Taken
Backwards Taken, Forward Not Taken (BTFNT)
Profile-driven prediction
 Dynamic
 Prediction decisions may change during the execution of the program
What happens when a branch is predicted?
On misprediction:
No speculative state may commit
Squash instructions in the pipeline
Must not allow stores in the pipeline to occur
• Cannot allow stores which would not have happened to
commit
• Even for good branch predictors more
than half of the fetched instructions are
squashed
Instruction traffic due to misprediction
[Figure: percentage of instructions reaching fetch, decode, issue, and complete across amm, bzp, cmp, equ, gcc, mcf, mes, prs, avg]
Half of fetched instructions wasted.
More waste in the front-end.
Energy Loss due to Mispredictions
[Figure: energy loss per benchmark for amm, bzp, cmp, equ, gcc, mcf, mes, prs, AVG; lower is better]
21% average energy loss.
More energy waste in integer benchmarks.
Simple Static Predictors
 Simple heuristics
 Always taken
 Always not taken
 Backwards taken / Forward not taken
Relies on the compiler to arrange the code following this assertion
 Certain opcodes taken
 Programmer provided hints
 Profiling
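The BTFNT heuristic above reduces to a one-line rule on the branch direction. A minimal sketch, assuming signed target comparison stands in for decoding the displacement:

```python
# Static BTFNT heuristic: backward branches (typically loop back-edges)
# are predicted taken; forward branches are predicted not taken.
def btfnt_predict(branch_pc, target_pc):
    """Return True (predict taken) if the branch jumps backward."""
    return target_pc < branch_pc

print(btfnt_predict(0x400, 0x3F0))  # True: backward loop branch, predict taken
print(btfnt_predict(0x400, 0x420))  # False: forward skip, predict not taken
```

This is why the heuristic relies on the compiler: it only helps if loop branches really are laid out as backward branches.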
Simple Static Predictors
[Figure: prediction accuracy (%) of always-taken, always-not-taken, and BTFNT across postgres, vortex, perl, ijpeg, li, gcc, m88ksim]
Dynamic Hardware Predictors
• Dynamic Branch Prediction is the ability of the hardware
to make an educated guess about which way a branch
will go - will the branch be taken or not.
• The hardware can look for clues based on the
instructions, or it can use past history - we will discuss
both of these directions.
A Generic Branch Predictor
[Figure: in execution order, Fetch produces the predicted stream using f(PC, x) = T or NT; Resolve produces the actual stream]
- What's f(PC, x)?
- x can be any relevant info; thus far x was empty.
Bimodal Branch Predictors
Dynamically store information about the branch
behaviour
Branches tend to behave in a fixed way
Branches tend to behave in the same way across program
execution
Index a Pattern History Table using the
branch address
1 bit: branch behaves as it did last time
Saturating 2 bit counter: branch behaves as it usually does
Saturating-Counter Predictors
 Consider strongly biased branch with infrequent
outcome
 TTTTTTTTNTTTTTTTTNTTTT
 Last-outcome will mispredict twice per infrequent outcome encounter:
 TTTTTTTTNTTTTTTTTNTTTT
 Idea: Remember most frequent case
 Saturating-Counter: Hysteresis
 often called bi-modal predictor
Bimodal Prediction
 Table of 2-bit saturating counters
[Figure: PC indexes a PHT of 2-bit saturating counters; each counter is a four-state machine producing T/NT]
 States 00 and 01 predict not taken; 10 and 11 predict taken.
 A taken outcome moves the counter toward 11 (saturating); a not-taken outcome moves it toward 00.
 Advantages: simple, cheap, “good” accuracy
 Bimodal will mispredict once per infrequent outcome
encounter:
TTTTTTTTNTTTTTTTTNTTTT
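A minimal sketch of one 2-bit saturating counter, run on the biased stream from the slide (table indexing by PC is omitted for brevity):

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self, state=3):          # start strongly taken
        self.state = state

    def predict(self):
        return self.state >= 2            # high bit set => predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# Strongly biased branch with an infrequent not-taken outcome.
stream = "TTTTTTTTNTTTTTTTTNTTTT"
ctr, mispredicts = TwoBitCounter(), 0
for outcome in stream:
    taken = (outcome == "T")
    if ctr.predict() != taken:
        mispredicts += 1
    ctr.update(taken)
print(mispredicts)  # 2: exactly one miss per infrequent 'N'
```

A last-outcome (1-bit) predictor would miss four times on this stream; the hysteresis is what halves the mispredictions.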
Bimodal Branch Predictors
[Figure: prediction accuracy (%) of bimodal vs. BTFNT across postgres, vortex, perl, ijpeg, li, gcc, m88ksim]
Correlating Predictors
 From program perspective:
 Different Branches may be correlated
 if (aa == 2) aa = 0;
 if (bb == 2) bb = 0;
 if (aa != bb) then …
 Can be viewed as a pattern detector
 Instead of keeping aggregate history information
I.e., most frequent outcome
 Keep exact history information
Pattern of n most recent outcomes
 Example:
 BHR: n most recent branch outcomes
 Use PC and BHR (xor?) to access prediction table
Pattern-based Prediction
 Nested loops:
for i = 0 to N
for j = 0 to 3
…
 Branch Outcome Stream for j-for branch
• 11101110111011101110
 Patterns:
• 111 -> 0
• 110 -> 1
• 101 -> 1
• 011 -> 1
 100% accuracy
 Learning time 4 instances
 Table Index (PC, 3-bit history)
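The pattern table above can be sketched directly: a 3-bit history of the branch's own outcomes indexes a small table (entries simplified to the last-seen next outcome, which suffices for a perfectly periodic stream):

```python
# Local-history pattern predictor for the inner-loop stream 1110 1110 ...
history = 0                    # 3 most recent outcomes of this branch, as bits
table = {}                     # history pattern -> predicted next outcome
mispredicts = 0
stream = [1, 1, 1, 0] * 5      # 20 outcomes of the j-for branch

for outcome in stream:
    prediction = table.get(history, 1)   # default: predict taken
    if prediction != outcome:
        mispredicts += 1
    table[history] = outcome             # learn the pattern
    history = ((history << 1) | outcome) & 0b111

print(mispredicts)  # 1: a single miss while learning, then 100% accuracy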
Two-level Branch Predictors
 A branch outcome depends on the outcomes of previous
branches
 First level: Branch History Registers (BHR)
 Global history / Branch correlation: past executions of all branches
 Self history / Private history: past executions of the same branch
 Second level: Pattern History Table (PHT)
 Use first level information to index a table
Possibly XOR with the branch address
 PHT: Usually saturating 2 bit counters
 Also private, shared or global
Gshare Predictor (McFarling)
[Figure: the global BHR and the PC are combined by a function f to index the Branch History Table, producing a prediction]
 PC and BHR can be
 concatenated
 completely overlapped
 partially overlapped
 xored, etc.
 How deep BHR should be?
 Really depends on program
 But, deeper increases learning time
 May increase quality of information
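A sketch of gshare under one set of choices (XOR combination, 8-bit global history, 2-bit counters; the sizes are illustrative assumptions):

```python
# Gshare sketch: global history XOR PC indexes a table of 2-bit counters.
HIST_BITS = 8
MASK = (1 << HIST_BITS) - 1
table = [2] * (1 << HIST_BITS)   # 2-bit counters, initialised weakly taken
ghr = 0                          # global history register

def predict(pc):
    return table[(pc ^ ghr) & MASK] >= 2

def update(pc, taken):
    global ghr
    i = (pc ^ ghr) & MASK
    table[i] = min(3, table[i] + 1) if taken else max(0, table[i] - 1)
    ghr = ((ghr << 1) | taken) & MASK

# Example: a branch at PC 0x40 that strictly alternates T, NT, T, NT ...
# Bimodal would hover near 50% here; gshare separates the two history
# contexts into different counters.
hits = 0
for n in range(100):
    taken = (n % 2 == 0)
    hits += predict(0x40) == taken
    update(0x40, int(taken))
print(hits)  # 96: only a few warm-up cases miss
```

The deeper the BHR, the more contexts get separated, but also the longer the warm-up, which is the trade-off the slide raises.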
Two-level Branch Predictors (II)
[Figure: prediction accuracy (%) of Gshare vs. bimodal across postgres, vortex, perl, ijpeg, li, gcc, m88ksim]
Hybrid Prediction
 Combining branch predictors
 Use two different branch predictors
Access both in parallel
 A third table (the selector) determines which prediction to use
 Two or more predictor components combined
[Figure: the PC indexes GSHARE, Bimodal, and the Selector in parallel; the selector picks which component's T/NT prediction to use]
 Different branches benefit from different types of history
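The selector idea can be sketched with a 2-bit chooser counter that drifts toward whichever component has been right. The two components below are toy stand-ins (always-taken and last-outcome), not real bimodal/gshare implementations:

```python
# Hybrid predictor sketch: a 2-bit chooser picks between two components.
class Hybrid:
    def __init__(self, pred_a, pred_b):
        self.a, self.b = pred_a, pred_b
        self.chooser = 2          # 0-1 favour A, 2-3 favour B

    def predict(self):
        return self.b.predict() if self.chooser >= 2 else self.a.predict()

    def update(self, taken):
        ok_a = self.a.predict() == taken
        ok_b = self.b.predict() == taken
        if ok_b and not ok_a:                      # only B was right
            self.chooser = min(3, self.chooser + 1)
        elif ok_a and not ok_b:                    # only A was right
            self.chooser = max(0, self.chooser - 1)
        self.a.update(taken)
        self.b.update(taken)

class AlwaysTaken:
    def predict(self): return True
    def update(self, taken): pass

class LastOutcome:
    def __init__(self): self.last = True
    def predict(self): return self.last
    def update(self, taken): self.last = taken

h = Hybrid(AlwaysTaken(), LastOutcome())
for outcome in [False, False, False, True, False, False]:
    h.predict()
    h.update(outcome)
print(h.chooser)   # 3: the chooser has drifted toward the last-outcome component
```

Both components are accessed in parallel, as on the slide; the cost is the extra table accesses, which is exactly the power problem the BPP slides address later.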
Hybrid Branch Predictors (II)
[Figure: prediction accuracy (%) of Gshare+Bimod (12K), Gshare (16K), and Gshare (4K) across postgres, vortex, perl, ijpeg, li, gcc, m88ksim]
Issues Affecting Accurate Branch Prediction
 Aliasing
 More than one branch may use the same BHT/PHT entry
Constructive
• Prediction that would have been incorrect, predicted
correctly
Destructive
• Prediction that would have been correct, predicted
incorrectly
Neutral
• No change in the accuracy
More Issues
 Training time
 Need to see enough branches to uncover pattern
 Need enough time to reach steady state
 “Wrong” history
 Incorrect type of history for the branch
 Stale state
 Predictor is updated after information is needed
 Operating system context switches
 More aliasing caused by branches in different programs
Performance Metrics
 Misprediction rate
 Mispredicted branches per executed branch
Unfortunately the most commonly reported
 Instructions per mispredicted branch
 Gives a better idea of the program behaviour
Branches are not evenly spaced
Impact of Realistic Branch Prediction
Limiting the type of branch prediction.
[Figure: instruction issues per cycle (IPC) for gcc, espresso, li, fpppp, doducd, tomcatv under five schemes: Perfect, Selective predictor, Standard 2-bit, Static, and None; FP programs reach 15-45 IPC, integer programs 6-12]
BPP:Power-Aware Branch Predictor
Combined Predictors
Branch Instruction Behavior
BPP (Branch Predictor Prediction)
Results
Combined Predictors
 Different behaviors, different sub-predictors
 Selector picks sub-predictor
 Improved performance over processors using only one sub-predictor
 Consequence: extra power (~50%)
[Figure: Bimodal and Gshare sub-predictors accessed in parallel with a Selector]
Branch Predictors & Power
 Direct Effect
Up to 10%.
 Indirect Effect: Wrong-Path Instructions:
 Smaller/Less Complex Predictors, More Wasted Energy.
 Power-Aware Predictors MUST be Highly Accurate.
Branch Instruction Behavior
 Branches use the same sub-predictor:
[Figure: percentage of branches that use the same sub-predictor across amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG]
Branch Predictor Prediction
 A BPP buffer, indexed by branch PC, stores hints on the next two branches.
 HOW? Hint encoding:
 11: Mispredicted branch
 00: Branch used Bimod last time
 01: Branch used Gshare last time
BPP : example
[Figure: first appearance of code sequence A-F; for each branch PC the BPP buffer records a 2-bit hint: BMD, GSH, non-branch, or mispredicted]
BPP : example
[Figure: second appearance of the code sequence; the hint read from the BPP buffer decides the next cycle's action: gate the Selector and Bimod, gate the Selector and Gshare, or do nothing]
Results
 Power (Total & Branch Predictor’s) and Performance.
 Compared to three base cases:
 A) Non-Gated Combined (CMB)
 B) Bimodal (BMD)
 C) Gshare (GSH)
 Reported for 32k entry Banked Predictors.
Performance
[Figure: performance of CMB, BMD, and GSH relative to BPP (60%-130%) across amm, bzp, cmp, equ, gcc, mcf, mes, prs, vor, vpr, wlf, AVG]
Within 0.4% of CMB, better than BMD (7%) and GSH (3%)
Branch Predictor’s Energy
[Figure: branch predictor energy of CMB, BMD, and GSH relative to BPP (50%-190%) across the same benchmarks]
13% less than CMB, more than BMD (35%) and GSH (22%)
Total Energy
[Figure: total energy of CMB, BMD, and GSH relative to BPP (80%-105%) across the same benchmarks]
0.3%, 4.5% and 1.8% less than CMB, BMD and GSH
ILP, benefits and costs?
 How can we extract more ILP?
 What are the costs?
Upper Limit to ILP: Ideal Machine
Amount of parallelism when there are no branch mispredictions and we're limited only by data dependencies.
[Figure: instructions that could theoretically be issued per cycle: gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1; FP: 75-150, Integer: 18-60]
Complexity-Effective Designs
 History: “Brainiacs” and “Speed demons”
 Brainiacs – maximizing the # of instructions issued per clock cycle
 Speed demons – simpler implementation with a very fast clock
 Complexity-Effective
 Complexity-Effective architecture means that the architecture takes both of
the benefits of complex issue schemes and the benefits of simpler
implementation with a fast clock cycle
 Complexity measurement : delay of the critical path
 Proposed Architecture
 High performance(high IPC) with a very high clock frequency
Extracting More Parallelism
 Today: 4-wide issue, 128-entry window
 Future? 8-wide issue, 256-entry window
 Higher IPC
 Clock, Power?
Want:
High IPC + Fast Clock + Low Power
Generic pipeline description
 Baseline superscalar model
 Criteria for sources of complexity(delay)
o structures whose delay is a function of issue window size and issue width
o structures which tend to rely on broadcast operations over long wires
Sources of complexity
 Register renaming logic
o translates logical register designators to physical register designators
 Wakeup logic
o Responsible for waking up instructions waiting for their source operands to
become available
 Selection logic
o Responsible for selecting instructions for execution from the pool of ready
instructions
 Bypass logic
o Bypassing the operand values from instructions that have completed
execution
 Other structures not to be considered here
 Access time of the register file varies with the # of registers and the # of ports.
 Access time of a cache is a function of the size of the cache and the associativity of
the cache
Register rename logic complexity
Delay analysis for rename logic
 Delay analysis for RAM scheme
 RAM scheme operates like a standard RAM
 Issue width affects delay through its impact on wire lengths
- Increasing issue width increases the # of bit/word lines
- Delay of rename logic is a linear function of the issue width.
 Spice result
o Total delay & each component delay
increase linearly with IW
o Bit line & word line delay worsens
as the feature size is reduced.
(Logic delay is reduced linearly as
the feature size is reduced. But wire
delay fall at a slow rate.)
Wakeup logic
 Wakeup logic
 Responsible for updating source dependences for instructions in the issue window
waiting for their source operands to become available.
 Basic Structure
o 2 OR gates and 2*IW comparators per one entry of issue window
 Delay analysis
Delay is an almost linear function at feature sizes >= 0.35um, but becomes quadratic below 0.35um.
Delay analysis for wakeup logic
 SPICE result (figure 5: 0.18um)
 Issue width has a greater impact on the delay than window size:
 window size affects Tdrive, while issue width affects Tdrive, Ttagmatch, and TmatchOR.
 (figure 6: 8-way, 64-entry window)
 The tag drive and tag match delays are less scalable than the match OR delay:
 Tdrive + Ttagmatch is 52% of the delay at 0.8um, 62% at 0.18um.
Selection Logic
 Selection Logic
 Responsible for choosing instructions for execution from the pool of ready
instructions in the issue window
 Basic structure
 REQ(input) & GRANT(output) signals
 Operation : 2 phases
• REQ propagates up to the root.
• GRANT with high priority on the
arbiter cell propagates down to
the leaf arbiter.
 Selection policy (oldest first) implementation:
• left-most entries have the highest priority.
• the issue window is compacted to the left every time instructions are issued, and new instructions are inserted at the right end.
Delay analysis for selection logic
 Delay analysis
 The optimal number of arbiter inputs is taken to be four here.
 SPICE result
 Assuming a single functional unit
 Various components of the total
delay scale well as the feature size
is reduced.
 All the delays are logic delay.
(don’t consider the wire)
 It is possible to minimize the effect of wire delays if the ready
signals are stored in a smaller, more compact array.
Data bypass logic
 Bypass logic
 Responsible for forwarding result values from completing instructions to dependent
instructions, bypassing the register file
 Basic structure
 In fully bypass design,
Bypass paths=2*(IW)2*S
where S= # of pipeline stages after first
output-producing stage
 Current trend: deeper pipelining and wider issue
 make bypass logic critically important
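The bypass-path count above can be checked numerically. The issue widths and stage count below are chosen for illustration:

```python
# Fully bypassed design: paths = 2 * IW^2 * S, where S is the number of
# pipeline stages after the first result-producing stage.
def bypass_paths(issue_width, stages_after_first_result):
    return 2 * issue_width ** 2 * stages_after_first_result

print(bypass_paths(4, 2))   # 64 paths for a 4-wide machine
print(bypass_paths(8, 2))   # 256 paths: quadratic growth with issue width
```

Doubling issue width quadruples the path count, which is why deeper, wider pipelines make bypassing a critical structure.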
Delay analysis for data bypass logic
 Delay analysis
 The delay is a function of the length of the result wires
 Increasing issue width increases the length of the result wires
 SPICE result
 Based on the basic structure(layout)
 The delays are the same for the three
technologies(feature sizes)
Summary of Delays and Pipeline Issues
 Pipeline delay results
 For the 4-way machine,
the window logic (WL) determines the critical path delay
 For the 8-way machine,
the bypass logic (BL) determines the critical path delay
 Future machine(ILP)
 WL & BL will pose the largest problems.
 Both make it difficult to divide these operations
into more pipeline segments (they are atomic operations).
 In WL(wake-up/select)
 In BL(bypass logic)
• In order for dependent operations to execute in consecutive cycles, the bypass value
must be made available to the dependent instruction within a cycle.
• Solution : stall(trade-off between the cycle time and bottle-neck from bypass in wider
issues)
A Complexity-Effective Micro-Architecture
 Dependence-based microarchitecture
 Replaces the issue window with a simpler structure that facilitates a faster clock while
exploiting similar levels of parallelism.
 Naturally lends itself to clustering and helps the bypass problem to a large extent.
 Simple description
 “Dependent instructions can’t execute
in parallel but consecutively.”
 The issue window is replaced by
a small # of FIFO buffers
 The FIFO buffers are constrained to
issue in-order, and dependent instr.s
are steered to the same FIFO.
 The register availability only needs to be fanned out to the heads of the FIFO buffers.
(In typical issue window, result tags have to be broadcast to all the entries.)
 The instruction at the FIFO heads monitor reservation bits to check for operand availability.
(one per physical register)
 SRC_FIFO table for steering instructions to appropriate buffers
• Indexed using logical register designators.
• SRC_FIFO(Ra) = the identity of the FIFO buffer
Instruction Steering Heuristics
 Applied heuristics
 Case 1: All operands of I are available -> I goes into a new (free) FIFO
 Case 2: A single outstanding operand of I, produced by Isource in FIFO fa:
  if no instructions are behind Isource in FIFO fa -> I goes into FIFO fa
  else -> I goes into a new FIFO
 Case 3: Two outstanding operands of I -> apply Case 2 to one of the two operands
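The heuristics above can be sketched for the simplest cases. Register renaming, instruction issue, and Case 3 are omitted; the `producer` map plays the role of the SRC_FIFO table described earlier, and no instruction is ever removed from a FIFO in this sketch:

```python
# Steering sketch: a dependent instruction goes behind its producer only
# when the producer sits at the FIFO tail; otherwise it takes a free FIFO.
NUM_FIFOS = 4
fifos = [[] for _ in range(NUM_FIFOS)]
producer = {}          # register -> (instr, fifo index) of its last writer

def free_fifo():
    # Raises StopIteration if no FIFO is free; fine for a sketch.
    return next(i for i, f in enumerate(fifos) if not f)

def steer(instr, dest, srcs):
    """Place instr in a FIFO so dependent instructions chain in order."""
    waiting = [producer[r] for r in srcs if r in producer]
    if len(waiting) == 1:
        src_instr, fa = waiting[0]
        # Case 2: reuse the producer's FIFO only if it is at the tail
        target = fa if fifos[fa] and fifos[fa][-1] == src_instr else free_fifo()
    else:
        # Case 1 (no outstanding operands): take a new free FIFO
        target = free_fifo()
    fifos[target].append(instr)
    if dest is not None:
        producer[dest] = (instr, target)

steer("i1", "r1", [])          # independent -> new FIFO
steer("i2", "r2", ["r1"])      # depends on i1 -> same FIFO, behind i1
steer("i3", "r3", [])          # independent -> another free FIFO
print(fifos)                   # [['i1', 'i2'], ['i3'], [], []]
```

Chaining dependents into one in-order FIFO is what lets wakeup touch only the FIFO heads instead of broadcasting to every window entry.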
Performance results
 Performance results
 Proposed arch. : 8 FIFOs, 8 entries in 1 FIFO,
baseline arch. : 64-entry issue window
 The dependence-based microarchitecture is nearly as effective (extracts similar parallelism) as
the typical window-based microarchitecture.
Max. 8%
Complexity analysis
 Reservation table
 If the instruction Ia at the head of FIFO Fa is dependent on an instruction Ib
waiting in FIFO, Ia cannot issue until Ib completes.
 The delay of the wakeup logic is determined by the delay of accessing the
reservation table.
 The selection logic is simple
because only the instructions
at the FIFO heads need to be
considered for selection.
 Effect
 The suggested arch. can improve clock period(faster clock)
 as much as 39% in 0.18 um technology
Clustering
 Clustering the dependence-based microarchitecture
 Advantage
 Wakeup and selection logic are simplified.
 Because dependent instructions are assigned to the same FIFO, local bypasses are
more frequent than inter-cluster bypasses (overall delay is reduced).
 Multiple copies of the register file reduce the # of ports (faster RF access)
Performance of Clustering
 Performance comparison
 Comparison between 2*4-way dependence-based and conventional 8-way
64-entry window-based architecture
 Assuming 1-cycle Local bypass delay and 2-cycle inter-cluster bypass delay
 Overall performance
considering clock speed:
 average 16% improvement
(at most 12% raw IPC loss from clustering)
Conclusion
 Some important results
 The logic associated with the issue window and the data bypass logic are going to
become increasingly critical as future designs employ wider issue widths,bigger
windows, and smaller feature size.
 Wire delays will increasingly dominate total delay in future technology.
(window logic and bypass logic are atomic operations.)
 Complexity-effective architecture
 Architecture that facilitates a fast clock while exploiting similar levels of ILP
 Dependence-based architecture as a complexity-effective architecture
 simplifies window logic
 naturally lends itself to clustering by grouping dependent instructions
The Motivation for Caches
Memory System
Processor
Cache
DRAM
Motivation:
Large memories (DRAM) are slow
Small memories (SRAM) are fast
Make the average access time small by:
Servicing most accesses from a small, fast memory.
Reduce the bandwidth required of the large
memory
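The "make the average access time small" argument is the usual average-memory-access-time calculation. The latencies and hit rate below are assumed for illustration, not taken from the slides:

```python
# Average Memory Access Time = hit_time + miss_rate * miss_penalty
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    return hit_time_ns + miss_rate * miss_penalty_ns

# Fast SRAM cache (2 ns) servicing 95% of accesses, slow DRAM (100 ns) behind it.
print(amat(hit_time_ns=2, miss_rate=0.05, miss_penalty_ns=100))  # 7.0 ns
```

Servicing most accesses from the small fast memory brings the average close to SRAM speed even though most of the capacity is DRAM.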
Levels of the Memory Hierarchy

Level (upper to lower) | Capacity   | Access Time | Cost
CPU Registers          | 100s Bytes | <10s ns     |
Cache                  | K Bytes    | 10-100 ns   | $.01-.001/bit
Main Memory            | M Bytes    | 100 ns-1 us | $.01-.001
Disk                   | G Bytes    | ms          | 10^-3 - 10^-4 cents
Tape                   | infinite   | sec-min     | 10^-6 cents

Staging transfer unit (faster, smaller -> larger, slower):
 Registers <-> Cache: Instr. operands, managed by prog./compiler, 1-8 bytes
 Cache <-> Memory: Blocks, managed by cache cntl, 8-128 bytes
 Memory <-> Disk: Pages, managed by OS, 512-4K bytes
 Disk <-> Tape: Files, managed by user/operator, Mbytes
The Principle of Locality
[Figure: probability of reference vs. address space (0 to 2^n)]
The Principle of Locality:
Programs access a relatively small portion of the address space at any
instant of time.
Example: 90% of time in 10% of the code
Two Different Types of Locality:
Temporal Locality (Locality in Time): If an item is referenced, it will tend
to be referenced again soon.
Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon.
Memory Hierarchy: Principles of Operation
At any given time, data is copied between only 2
adjacent levels:
Upper Level (Cache) : the one closer to the processor
 Smaller, faster, and uses more expensive technology
Lower Level (Memory): the one further away from the processor
 Bigger, slower, and uses less expensive technology
Block:
The minimum unit of information that can either be present or not
present in the two level hierarchy
[Figure: blocks Blk X and Blk Y moving between the Upper Level (Cache) and Lower Level (Memory), to and from the processor]