
CS 7810
Lecture 3
Clock Rate vs. IPC: The End of the Road for
Conventional Microarchitectures
V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger
UT-Austin
ISCA ’00
Previous Papers
• Limits of ILP – it is probably worth doing out-of-order superscalar
• Complexity-Effective – wire delays make the
implementations harder and increase latencies
• Today’s paper – these latencies severely impact
IPCs and slow the growth in processor performance
1995-2000
• Clock speed has improved by 50% every year
 – Reduction in logic delays
 – Deeper pipelines → this will soon end
• IPC has gone up dramatically (the increased complexity was worth it) → will this end too?
Wire Scaling
• Multiple wire layers – the SIA roadmap predicts their dimensions (somewhat aggressively)
• As transistor widths shrink, wires become thinner, and their resistance per unit length goes up (quadratically – Table 1)
• Parallel-plate capacitance reduces, but coupling
capacitance increases (slight overall increase)
• The equations are different, but the end result is
similar to Palacharla’s (without repeaters)
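
A back-of-the-envelope sketch of the effect (my own illustration with made-up scale factors, not the paper's equations): resistance per mm grows as 1/(width × thickness), capacitance per mm stays roughly flat, so an unrepeatered wire of fixed length slows down roughly quadratically as dimensions shrink.

```python
# Toy first-order model of an unrepeatered wire (illustrative, not the
# paper's exact equations): distributed RC delay ~ 0.5 * R * C * L^2,
# where R per mm grows as 1/(width * thickness) and C per mm is held
# roughly constant (the parallel-plate/coupling trade-off above).

def wire_delay(length_mm, scale):
    """Relative delay of an unrepeatered wire.

    scale = feature size relative to a 250nm baseline (1.0 at 250nm);
    width and thickness both shrink with scale.
    """
    r_per_mm = 1.0 / (scale * scale)  # resistance grows quadratically
    c_per_mm = 1.0                    # capacitance roughly flat per mm
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2

baseline = wire_delay(5, scale=1.0)     # 5mm wire at 250nm
shrunk = wire_delay(5, scale=35 / 250)  # same wire at 35nm
print(f"relative slowdown: {shrunk / baseline:.0f}x")  # ~51x, unrepeatered
```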
Wire Scaling
• With repeaters, delay of a fixed-length wire does
not go up quadratically as we shrink gate-width
• In going from 250nm → 35nm,
 – 5mm wire delay: 170ps → 390ps
 – delay to cross a fixed span of X gates: 170ps → 55ps
 – SIA clock speed: 0.75GHz → 13.5GHz
 – the same X-gate delay in cycles: 0.13 → 0.75
• We could increase wire width, but that compromises
bandwidth
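
The cycle figures above are just delay × frequency; a quick sanity check using only the slide's numbers:

```python
# Sanity check of the cycle counts above: cycles = delay * frequency.
for label, delay_ps, freq_ghz in [
    ("250nm, X-gate span", 170, 0.75),
    ("35nm,  X-gate span", 55, 13.5),
]:
    cycles = delay_ps * 1e-12 * freq_ghz * 1e9
    print(f"{label}: {cycles:.2f} cycles")
# 250nm: 0.13 cycles; 35nm: 0.74 cycles (the slide rounds to 0.75).
# The same span of logic eats ~6x more of the cycle even though its
# absolute delay fell by 3x.
```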
Clock Scaling
• Logic delay (the FO4 delay) scales linearly with
gate length
• Likewise, the work per pipeline stage has been shrinking
• The SIA predicts that today’s 16 FO4/stage delay
will shrink to 5.6 FO4/stage
• A 64-bit add takes 5.5 FO4 – hence, they examine
SIA (super-aggressive), 8-FO4 (aggressive), and
16-FO4 (conservative) scaling strategies
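
A sketch of how an FO4 budget translates to clock rate, assuming the common rule of thumb that one FO4 is roughly 360 ps per micron of drawn gate length (my assumption; the paper derives its own FO4 model):

```python
# FO4 rule of thumb: one FO4 delay ~ 360 ps per micron of drawn gate
# length. Clock rate is then the reciprocal of (FO4/stage * FO4 delay).
def clock_ghz(gate_length_nm, fo4_per_stage):
    fo4_ps = 360 * gate_length_nm / 1000.0  # FO4 delay in picoseconds
    return 1000.0 / (fo4_per_stage * fo4_ps)

for fo4 in (16, 8, 5.6):  # conservative, aggressive, SIA-like budgets
    print(f"{fo4:>4} FO4/stage at 35nm: {clock_ghz(35, fo4):.1f} GHz")
# 16 FO4 -> ~5.0 GHz, 8 FO4 -> ~9.9 GHz, 5.6 FO4 -> ~14.2 GHz
# (in the ballpark of the 13.5 GHz SIA projection quoted earlier)
```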
Clock Scaling
• The 15-20% annual clock improvement from technology scaling will continue, but the 15-20% improvement from deeper pipelining will cease
On-Chip Wire Delays
• The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
 – Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
• Chip area is steadily increasing
 – Less than 1% of the chip reachable in a cycle, and 30 cycles to go across the chip!
Processors are becoming communication-bound
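
A toy model of why the reachable fraction collapses (the wire ps/mm values come from the 5mm figures earlier; the die sizes are my own assumptions):

```python
import math

# Toy model: the distance a signal covers in one cycle is the clock
# period divided by the repeatered-wire delay per mm; the reachable
# area shrinks with its square while the die grows. Wire delays per mm
# follow from the 5mm figures earlier; die sizes are assumptions.
def reachable_fraction(period_ps, wire_ps_per_mm, die_side_mm):
    radius_mm = period_ps / wire_ps_per_mm  # distance covered in 1 cycle
    reach = math.pi * radius_mm ** 2
    return min(reach / die_side_mm ** 2, 1.0)

print(f"{reachable_fraction(1333, 34, 17):.1%}")  # 250nm: whole chip
print(f"{reachable_fraction(74, 78, 25):.1%}")    # 35nm: well under 1%
```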
Processor Structure Delays
• To model the microarchitecture, they estimate
the delays of all wire-limited structures
Access latency in cycles under each clock-scaling strategy:

Structure                   fSIA   f8   f16
64KB 2-port L1                 7    5     3
64-entry 10-port regfile       3    2     1
20-entry 8-port issueq         3    2     1
64-entry 8-port ROB            3    2     1
• Weakness: bypass delays are not considered
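
The table's cycle counts follow from dividing each structure's access time by the clock period and rounding up; a minimal sketch of that conversion (the 500ps access time is a placeholder I chose to reproduce the L1 row, not the paper's CACTI-derived value):

```python
import math

# cycles = ceil(access_time / clock_period); the same physical
# structure costs more cycles under a faster clock. Clock rates are
# taken from the FO4 sketch above; the access time is hypothetical.
def pipeline_cycles(access_ps, clock_ghz):
    period_ps = 1000.0 / clock_ghz
    return math.ceil(access_ps / period_ps)

access_ps = 500  # hypothetical 64KB 2-port L1 access time at 35nm
for name, ghz in [("f16", 5.0), ("f8", 9.9), ("fSIA", 13.5)]:
    print(f"{name}: {pipeline_cycles(access_ps, ghz)} cycles")
# f16 -> 3, f8 -> 5, fSIA -> 7, matching the L1 row above
```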
Microarchitecture Scaling
• Capacity Scaling: keep access latencies constant in cycles (simpler designs); scale capacities down until each structure fits in a cycle
• Pipeline Scaling: keep capacities constant; latencies go up, hence deeper pipelines
• Any other approaches?
• Replicated Capacity Scaling: a fast core with few resources, but many copies of it – high IPC if you can localize communication
IPC Comparisons
[Figure: IPC comparison of the three scaled designs – pipeline scaling keeps the 20-entry issue queue, 40 registers, and all FUs but pays 2-cycle wakeup, 2-cycle regread, and 2-cycle bypass; capacity scaling shrinks to a 15-entry issue queue, 30 registers, and fewer FUs; replicated capacity scaling duplicates the small 15-IQ/30-reg cluster, each copy with its own FUs]
Methodology
Results
• Every instruction experiences longer latencies
• IPCs are much lower for aggressive clocks
• Overall performance is still comparable for all
approaches
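
That comparability is just the product of the two opposing trends; a toy illustration with invented IPC values (not the paper's measured results):

```python
# Performance = IPC * clock. Aggressive clocks lose IPC to the extra
# pipeline latencies, so the product lands in the same ballpark for
# every strategy. IPC values here are invented for illustration.
designs = {
    "f16 (conservative)": (5.0, 1.40),   # (GHz, IPC)
    "f8 (aggressive)": (9.9, 0.75),
    "fSIA (super-aggressive)": (13.5, 0.52),
}
for name, (ghz, ipc) in designs.items():
    print(f"{name}: {ghz * ipc:.1f} billion instructions/sec")
# all three come out near ~7 BIPS
```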
Results
• In 17 years, we see only a 7-fold speedup (historically, it would have been ~1720-fold) – an annual increase of just 12.5%
• Growth is slow because the gains from pipeline depth and IPC will stagnate
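
Checking that arithmetic (assuming the historical trend of ~55%/year):

```python
# Compound growth over 17 years: the projected 12.5%/year versus the
# historical ~55%/year trend.
years = 17
print(f"projected:  {1.125 ** years:.1f}x")  # ~7.4x
print(f"historical: {1.55 ** years:.0f}x")   # ~1720x
```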
Questionable Assumptions
• Additional transistors are not being used to
improve IPC
• All instructions pay wire-delay penalties
Conclusions
• Large monolithic cores will perform poorly –
microarchitectures will have to be partitioned
• On-chip caches will be the biggest bottlenecks –
3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
• Future proposals should be wire-delay-sensitive
Next Class’ Paper
• “Dynamic Code Partitioning for Clustered
Architectures”, UPC-Barcelona, 2001
• Instruction steering heuristics to balance load
and minimize communication