CS 7810

Advertisement
• Paper critiques and class participation: 25%
• Final exam: 25%
• Project: SimpleScalar (?) modeling, simulation, and analysis: 50%
• Read and think about the papers before class!
Superscalar Pipeline
[Figure: superscalar pipeline -- PC and branch predictor (BPred, BTB) feed the I-Cache and instruction fetch queue (IFQ); instructions pass through the rename table (with checkpoints) into the issue queue (op, in1, in2, out); the register file, functional units (FUs), LSQ, and D-Cache execute them; the ROB holds them until commit]
Rename
Original code (logical registers):
  A: lr1 ← lr2 + lr3
  B: lr2 ← lr4 + lr5
  C: lr6 ← lr1 + lr3
  D: lr6 ← lr1 + lr2

Dependences: RAR on lr3, RAW on lr1, WAR on lr2, WAW on lr6
Best schedule: A ; B,C ; D

After renaming (physical registers):
  A: pr7 ← pr2 + pr3
  B: pr8 ← pr4 + pr5
  C: pr9 ← pr7 + pr3
  D: pr10 ← pr7 + pr8

Dependences: RAR on pr3, RAW on pr7, no WAR, no WAW
Best schedule: A,B ; C,D
Resolving Branches

The rename table is checkpointed at each predicted branch. If a branch after B is mispredicted down the path C, D, restoring the checkpoint (lr1 → pr6, lr2 → pr7) lets the correct-path instruction E rename properly:

  A: lr1 ← lr2 + lr3        A: pr6 ← pr2 + pr3
  B: lr2 ← lr1 + lr4        B: pr7 ← pr6 + pr4
  C: lr1 ← lr4 + lr5        C: pr8 ← pr4 + pr5   (wrong path)
  D: lr2 ← lr1 + lr5        D: pr9 ← pr8 + pr5   (wrong path)
  E: lr1 ← lr2 + lr3        E: pr10 ← pr7 + pr3  (correct path)
Resolving Exceptions

  A: lr1 ← lr2 + lr3        A: pr6 ← pr2 + pr3
  B: lr2 ← lr1 + lr4        B: pr7 ← pr6 + pr4
      br                        br
  C: lr1 ← lr2 + lr3        C: pr8 ← pr7 + pr3
  D: lr2 ← lr1 + lr5        D: pr9 ← pr8 + pr5

Each ROB entry records the physical register an instruction allocated and the mapping it displaced; if the branch raises an exception, walking the ROB backwards from the youngest entry restores the rename table:

ROB:
  Instr | New mapping | Displaced mapping
  A     | pr6         | pr1
  B     | pr7         | pr2
  br    | --          | --
  C     | pr8         | pr6
  D     | pr9         | pr7
LSQ

Ld/St | Address    | Data | Completed
------|------------|------|----------
Store | unknown    | 1000 | --
Load  | x40000000  | --   | --
Store | x50000000  | --   | --
Load  | x50000000  | --   | --
Load  | x30000000  | --   | --
LSQ

The second store's address and data arrive, so it completes:

Ld/St | Address    | Data | Completed
------|------------|------|----------
Store | unknown    | 1000 | --
Load  | x40000000  | --   | --
Store | x50000000  | 100  | Yes
Load  | x50000000  | --   | --
Load  | x30000000  | --   | --
LSQ

The load to x50000000 matches the completed store ahead of it and forwards the data (100) without going to the D-Cache:

Ld/St | Address    | Data | Completed
------|------------|------|----------
Store | unknown    | 1000 | --
Load  | x40000000  | --   | --
Store | x50000000  | 100  | Yes
Load  | x50000000  | 100  | Yes
Load  | x30000000  | --   | --
LSQ

The first store's address resolves to x45000000 and it completes; as the oldest completed entry, it can now commit:

Ld/St | Address    | Data | Completed
------|------------|------|----------
Store | x45000000  | 1000 | Yes      ← can commit
Load  | x40000000  | --   | --
Store | x50000000  | 100  | Yes
Load  | x50000000  | 100  | Yes
Load  | x30000000  | --   | --
LSQ

With the first store's address known not to conflict, the remaining loads access the D-Cache and complete:

Ld/St | Address    | Data | Completed
------|------------|------|----------
Store | x45000000  | 1000 | Yes
Load  | x40000000  | 10   | Yes
Store | x50000000  | 100  | Yes
Load  | x50000000  | 100  | Yes
Load  | x30000000  | 1    | Yes
Instruction Wake-Up

When a result tag (here, p2) is broadcast, every waiting instruction compares it against its source tags and marks matching operands ready:

  add  p1, p2 → p6
  add  p2, p6 → p7
  sub  p1, p2 → p8
  mul  p7, p8 → p9
  add  p1, p7 → p10
Paper I
Limits of Instruction-Level Parallelism
David W. Wall
WRL Research Report 93/6
Also appears in ASPLOS’91
Goals of the Study
• Under optimistic assumptions, you can find a very
high degree of parallelism (1000!)
• What about parallelism under realistic assumptions?
• What are the bottlenecks? What contributes to parallelism?
Dependencies
For registers and memory:
• True data dependency (RAW)
• Anti dependency (WAR)
• Output dependency (WAW)
• Control dependency
• Structural dependency
Perfect Scheduling
For a long loop that computes c[i] = a[i] + b[i]:
  Read a[i] and b[i] from memory into registers
  Add the register values
  Store the result to memory at c[i]

If every iteration could be scheduled in parallel, the whole
program would finish in 3 cycles!! What gets in the way:
• Anti and output dependences: the assembly code keeps reusing lr1
• Control dependences: decision-making after each iteration
• Structural dependences: how many registers and cache ports do I have?
Impediments to Perfect Scheduling
• Register renaming
• Alias analysis
• Branch prediction
• Branch fanout
• Indirect jump prediction
• Window size and cycle width
• Latency
Register Renaming
  lr1 ← …          pr22 ← …
  …  ← lr1         …  ← pr22
  lr1 ← …          pr24 ← …

• If the compiler had infinite registers, you would
not have WAR and WAW dependences
• The hardware can renumber every instruction and
extract more parallelism
• Implemented models:
  - None
  - Finite registers
  - Perfect (infinite registers, so only RAW dependences remain)
Alias Analysis
• You have to respect RAW dependences for memory as well:
      store value to addrA
      load from addrA
• The problem: you do not know the address at
compile-time, or even during instruction dispatch
Alias Analysis
• Policies:
  - Perfect: you magically know all addresses and
    only delay loads that conflict with earlier stores
  - None: until a store address is known, you stall
    every subsequent load
  - Analysis by compiler: (addr) does not conflict
    with (addr+4); global and stack data are allocated
    by the compiler, so conflicts between them can be
    detected, but accesses to the heap can conflict
    with each other
Global, Stack, and Heap

int a, b;                      /* global data */

main()
{
    func();
}

func()
{
    int c, d;                  /* stack data          */
    int *e, *f;                /* e, f are stack data */

    e = (int *)malloc(8);      /* e, f point to heap data */
    f = (int *)malloc(8);
    …
    *e = c;                    /* store c into the addr stored in e  */
    d = *f;                    /* read the value at the addr stored in f */
}

The store *e = c and the load d = *f conflict if you had
previously done e = e + 8.
Branch Prediction
• If you go the wrong way, you are not extracting
useful parallelism
• You can predict the branch direction statically or
dynamically
• You can execute along both directions and throw
away some of the work (need more resources)
Dynamic Branch Prediction
• Tables of 2-bit counters that get biased towards
being taken or not-taken
• Can use history (for each branch or global)
• Can have multiple predictors and dynamically pick
the more promising one
• Much more in a few weeks…
Static Branch Prediction
• Profile the application and provide hints to the
hardware
• Dynamic predictors are much better
Branch Fanout
• Execute both directions of the branch – an
exponential growth in resource requirements
• Hence, do this only until you encounter four branches,
after which you employ dynamic branch prediction
• Better still, execute both directions only if the
prediction confidence is low
Not commonly used in today’s processors.
Indirect Jumps
• Indirect jumps do not encode the target in the
instruction – the target has to be computed
• The address can be predicted by
 using a table to store the last target
 using a stack to keep track of subroutine call
and returns (the most common indirect jump)
• The combination achieves 95% prediction rates
Latency
• In their study, every instruction has unit latency
-- a highly questionable assumption today!
• They also model other “realistic” latencies
• Parallelism is defined as
(cycles for sequential execution) / (cycles for superscalar execution),
not as instructions / cycle
• Hence, increasing instruction latency can increase
parallelism -- not true for IPC
Window Size & Cycle Width
[Figure: a window of 2048 instructions scheduled into cycles, with 8 available slots in each cycle]
Window Size & Cycle Width
• Discrete windows: grab 2048 instructions, schedule
them, execute all the scheduled cycles, then grab the next window
• Continuous windows: grab 2048 instructions,
schedule them, retire the oldest cycle, grab a few
more instructions
• Window size and register renaming are not related
Simulated Models
• Seven models: control, register, and memory
dependences
• Today’s processors: ?
• However, note optimistic scheduling, 2048 instr
window, cycle width of 64, and 1-cycle latencies
• SPEC’92 benchmarks, utility programs (grep, sed,
yacc), CAD tools
Aggressive Models
• Parallelism steadily increases as we move to
aggressive models (Fig 12, Pg. 16)
• Branch fanout does not buy much
• IPC of the Great model: 10 (reality: 1.5)
• Numeric programs can do much better
Cycle Width and Window Size
• Unlimited cycle width buys very little (much less
than 10%) (Figure 15)
• Decreasing the window size seems to have little
effect as well (you need only 256?! – are registers
the bottleneck?) (Figure 16)
• Unlimited window size and cycle widths don’t help
(Figure 18)
Would these results hold true today?
Memory Latencies
• The ability to prefetch has a huge impact on IPC –
to hide a 300 cycle latency, you have to spot the
instruction very early
• Hence, registers and window size are extremely
important today!
Branch Prediction
• Obviously, better prediction helps (Fig. 22)
• Fanout does not help much (Fig. 24-b) – not
selecting the right branches?
• Luckily, small tables are good enough for good
indirect jump prediction
• Mispredict penalty has a major impact on ILP
(Fig. 30)
Alias Analysis
• Has a big impact on performance – compiler
analysis results in a two-fold speed-up
• Later, we’ll read a paper that attempts this in
hardware (Chrysos ’98)
Instruction Latency
• Parallelism almost unaffected by increased
latency (increases marginally in some cases!)
• Note: “unconventional” definition of parallelism
• Today, latency strongly influences IPC
Conclusions of Wall’s Study
• Branch prediction, alias analysis, mispredict
penalty are huge bottlenecks
• Instr latency, registers, window size, cycle width
are not huge bottlenecks
• Today, they are all huge bottlenecks because they
all influence effective memory latency…which is
the biggest bottleneck
Questions
• Weaknesses: caches, register model, value pred
• Will most of the available IPC (IPC = 10 for superb
model) go away with realistic latencies?
• What stops us from building the following: 400Kb
bpred, cache hierarchies, 512-entry window/regfile,
16 ALUs, memory dependence predictor?
• The following may need a re-evaluation: effect of
window size, branch fan-out
Next Week’s Paper
• “Complexity-Effective Superscalar Processors”,
Palacharla, Jouppi, Smith, ISCA ’97
• The impact of increased issue width and window
size on clock speed