Why Intel is designing multi-core processors

Geoff Lowney
Intel Fellow, Microprocessor Architecture and Planning
July, 2006
Moore's Law at Intel 1970-2005

[Four charts, 1970-2020: feature size (µm) shrinking roughly 0.7x per generation; transistor count doubling (~2.0x) per generation; clock frequency (MHz) and CPU power (W) climbing roughly 1.4x-2.0x per generation.]

Power trend not sustainable.
386 to Pentium® 4: 17 years

386 Processor (May 1986): 16 MHz core, 275,000 transistors at 1.5µ, ~1.2 SPECint2000
Pentium® 4 Processor (August 27, 2003): 3.2 GHz core, 55 million transistors at 0.13µ, 1249 SPECint2000

Frequency: 200x. Transistors: 200x (feature size ~11x smaller). Performance: ~1000x.
[Charts: clock frequency (MHz) and SPECint2000, log scale, vs. introduction date, Jan-85 to Jan-05. Frequency grew ~200x; performance grew ~1000x.]
Normalized SPECint2000/MHz

[Chart: normalized SPECint2000 per MHz vs. introduction date, Jan-85 to Jan-05, for the 386, 486, 486DX2, P5, P55-MMX, PPro/PII, PIII, and P4. Per-clock performance rises from roughly 1 (386) to 2 (486), 5 (PPro/PII), and 6 (PIII), but falls back to roughly 3 for the P4.]
[Charts: relative increase in area, performance, and power, and change in Mips/W, for successive microarchitecture techniques: pipelining, superscalar, out-of-order speculation, deep pipelining.]

Power efficiency has dropped.
Performance scales with area**.5.
Pushing Frequency

Frequency was maximized by
• Deeper pipelines
• Pushing the process to the limit

[Chart: Pipeline & Performance. Performance vs. relative frequency from pipelining; returns diminish as pipelines deepen.]
[Chart: Process Technology. Power vs. relative frequency; dynamic power increases with frequency, leakage power increases as Vt is reduced, and sub-threshold leakage increases exponentially.]

Diminishing return on performance; increase in power.
Moore's Law at Intel 1970-2005 (repeated)

[Same four charts as before.]

Power trend not sustainable.
Reducing power with voltage scaling

Power = Capacitance * Voltage**2 * Frequency
Frequency ~ Voltage in the region of interest
Therefore Power ~ Voltage**3

A 10% reduction of voltage yields
• 10% reduction in frequency
• ~30% reduction in power
• Less than 10% reduction in performance

Rule of thumb: a 1% reduction in voltage gives a 1% reduction in frequency, a 3% reduction in power, and a 0.66% reduction in performance.
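As a quick check on these relationships, here is a minimal Python sketch (an illustration added here, not from the slides) that applies Power = Capacitance * Voltage**2 * Frequency with Frequency ~ Voltage:

```python
def voltage_scaling(voltage_factor, capacitance=1.0):
    """Relative frequency and power after scaling supply voltage by
    voltage_factor (e.g. 0.90 for a 10% reduction)."""
    frequency = voltage_factor                           # f ~ V in the region of interest
    power = capacitance * voltage_factor**2 * frequency  # P = C * V^2 * f, so P ~ V^3
    return frequency, power

f, p = voltage_scaling(0.90)        # 10% voltage reduction
print(f"frequency: {f:.2f}x, power: {p:.2f}x")
# frequency: 0.90x, power: 0.73x -- roughly the "30% reduction in power" above
```

Performance falls by less than the 10% frequency loss because part of execution time is not frequency-bound, which is where the slide's 0.66%-per-1% rule of thumb comes from.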
Dual core with voltage scaling

Rule of thumb: a 15% reduction in voltage yields a 15% reduction in frequency, a 45% reduction in power, and a 10% reduction in performance.

SINGLE CORE: Area = 1, Voltage = 1,    Frequency = 1,    Power = 1, Performance = 1
DUAL CORE:   Area = 2, Voltage = 0.85, Frequency = 0.85, Power = 1, Performance = ~1.8
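A minimal sketch of the dual-core arithmetic above, assuming the 15% rule of thumb and a workload that splits perfectly across two cores (an illustration, not Intel's sizing model):

```python
# Rule-of-thumb effects of a 15% supply-voltage reduction (from the slide)
FREQ_FACTOR  = 0.85   # 15% frequency reduction
POWER_FACTOR = 0.55   # 45% power reduction per core
PERF_FACTOR  = 0.90   # 10% single-thread performance reduction

cores = 2
area  = cores * 1.0                 # twice the die area of the single core
power = cores * POWER_FACTOR        # 2 * 0.55 = 1.1, which the slide rounds to ~1
perf  = cores * PERF_FACTOR         # 2 * 0.90 = 1.8, assuming a fully parallel workload
print(f"area = {area:.0f}, power ~ {power:.1f}, performance ~ {perf:.1f}")
```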
Multiple cores deliver more performance per watt

Power ~ area
Single-thread performance ~ area**.5

[Diagram: one big core (power = 4, performance = 2) next to four small cores C1-C4 sharing a cache. Each small core has 1/4 the power and 1/2 the performance of the big core.]

Many core is more power efficient.
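The efficiency argument can be written down directly from the slide's two scaling rules. A small sketch, again assuming a perfectly parallel workload:

```python
def core(area):
    """Slide's scaling rules: power ~ area, single-thread perf ~ sqrt(area)."""
    return {"power": area, "perf": area ** 0.5}

big   = core(4.0)      # one large core
small = core(1.0)      # one small core; use four of them for the same total area
n_small = 4

big_perf_per_watt   = big["perf"] / big["power"]                              # 2 / 4 = 0.5
small_perf_per_watt = (n_small * small["perf"]) / (n_small * small["power"])  # 4 / 4 = 1.0
print(big_perf_per_watt, small_perf_per_watt)
```

The catch is the "perfectly parallel" assumption: a single thread runs only half as fast on the small core.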
Memory Gap

[Chart: peak instructions per DRAM access, 1992-2002, for the Pentium (66 MHz), Pentium Pro (200 MHz), Pentium III (1100 MHz), and Pentium 4 (2 GHz). The gap between logic speed and memory speed keeps growing.]

Reduce DRAM accesses with large caches
• Extra benefit: power savings. Cache is lower power than logic.

Tolerate memory latency with multiple threads
• Multiple cores
• Hyper-threading
Multi-threading tolerates memory latency

[Diagram: serial execution runs Ai, idle (memory wait), Ai+1, then Bi, idle, Bi+1. Multi-threaded execution overlaps them.]

Execute thread B while thread A waits for memory.
Multi-core has a similar effect.
Multi-core tolerates memory latency

[Diagram: with two cores, Ai/Ai+1 and Bi/Bi+1 run simultaneously; each core still idles during its own memory waits, but the two threads no longer serialize.]

Execute threads A and B simultaneously.
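The effect in the two diagrams can be mimicked with a tiny Python experiment (purely illustrative: time.sleep stands in for a long memory stall, and the 10/50 ms figures are arbitrary). In CPython only the stalls overlap with the other thread's work, which mirrors the single-core hyper-threading picture: thread B executes while thread A waits.

```python
import threading, time

def worker():
    for _ in range(3):
        end = time.perf_counter() + 0.01   # ~10 ms of "compute"
        while time.perf_counter() < end:
            pass
        time.sleep(0.05)                   # ~50 ms "memory stall"

start = time.perf_counter()
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"overlapped: {time.perf_counter() - start:.2f}s "
      f"(serial would be about {2 * 3 * 0.06:.2f}s)")
```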
Core Area (with L1 caches) Trend

Die area per core is shrinking after peaking with the PPro.

[Chart: core area (mm^2) vs. introduction date, 1970-2005, for the 4004, 386, 486, Pentium, PPro (~300 mm^2, the peak), Pentium II, Pentium 4, McKinley, Prescott, Banias, Yonah, and Merom.]

Why shrinking? Diminishing returns on performance. Interconnect. Caches. Complexity.
Reliability in the long term

[Chart: soft error FIT per chip (logic and memory) grows ~8% per bit per generation across the 90, 65, 45, 32, 22, and 16 nm nodes.]
[Chart: extreme device variations: the Vt distribution (mV) gets wider with each generation.]
[Chart: time-dependent device degradation: relative Ion falls over time.]

In future process generations, soft and hard errors will be more common.
Redundant multi-threading: an architecture for fault detection and recovery

[Diagram: a sphere of replication containing a leading thread and a trailing thread, with input replication on the way in and output comparison on the way out to the memory system.]

Two copies of each architecturally visible thread.
Compare results: signal a fault if they differ.
Multi-core enables many possible designs for redundant threading.
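As a toy illustration of the idea (not an implementation of the architecture: input replication and the sphere of replication around the memory system are not modeled), one can run two copies of a computation and compare their outputs:

```python
import concurrent.futures

def run_redundant(fn, *args):
    """Run a leading and a trailing copy of fn and compare the results;
    signal a fault if the two copies disagree."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        leading  = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        a, b = leading.result(), trailing.result()
    if a != b:
        raise RuntimeError("fault detected: redundant copies disagree")
    return a

print(run_redundant(sum, range(1_000)))   # 499500; both copies agreed
```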
Moore's Law at Intel 1970-2005 (repeated)

[Same four charts as before.]

Multi-core addresses power, performance, memory, complexity, and reliability.
Moore's Law will provide transistors

Intel process technology capabilities:

High Volume Manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Feature Size                90nm   65nm   45nm   32nm   22nm   16nm   11nm   8nm
Integration Capacity
(Billions of Transistors)   2      4      8      16     32     64     128    256

[Image: a 50nm transistor from the 90nm process next to an influenza virus. Sources: Intel, CDC.]

Use transistors for multiple cores, caches, and new features.
Multi-Core Processors

PENTIUM® M: Single Core
WOODCREST: Dual Core
CLOVERTOWN: Quad Core
Future Architecture: More Cores

Open issues:

Cores
• How many?
• What size?
• Homogeneous or heterogeneous?

On-die interconnect
• Topology
• Bandwidth

Cache hierarchy
• Number of levels
• Sharing
• Inclusion

Also: power delivery and management, high-bandwidth memory, reconfigurable caches, a scalable fabric, fixed-function units, and overall scalability.

[Diagram: a many-core die with a grid of cores around a shared cache.]
The Importance of Threading

Do nothing: benefits still visible
• Operating systems are ready for multi-processing
• Background tasks benefit from more compute resources
• Virtual machines

Parallelize: unlock the potential (see the sketch below)
• Native threads
• Threaded libraries
• Compiler-generated threads
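As a small example of the library route, the sketch below uses Python's multiprocessing pool to spread a data-parallel loop across four workers (the function and the 4-way split are arbitrary illustrations, not from the talk):

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker handles one slice of the data independently.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # 4-way split, one chunk per core
    with Pool(processes=4) as pool:           # one worker process per core
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```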
Performance Scaling

Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%)/N)

[Chart: speedup vs. number of cores (0-30) for two serial fractions.]
• Serial% = 6.7%: 16 cores give a speedup of about 8 (N = 16, N1/2 = 8)
• Serial% = 20%: 6 cores give a speedup of about 3 (N = 6, N1/2 = 3)

Parallel software is key to multi-core success.
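Amdahl's Law is easy to evaluate directly; this small sketch reproduces the two cases on the slide and shows how the serial fraction caps the speedup no matter how many cores are added:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Parallel speedup = 1 / (serial + (1 - serial) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(amdahl_speedup(0.067, 16))    # ~8.0  (6.7% serial, 16 cores)
print(amdahl_speedup(0.20, 6))      # ~3.0  (20% serial, 6 cores)
print(amdahl_speedup(0.067, 1024))  # ~14.7: the serial fraction caps the speedup
```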
How does Multicore Change Parallel Programming?

SMP: processors P1-P4, each with its own cache, connected to a shared memory.
CMP: cores C1-C4 on one die, each with a cache, sharing the memory interface.

No change in the fundamental programming model.

Synchronization and communication costs are greatly reduced
• Makes it practical to parallelize more programs

Resources are now shared
• Caches
• Memory interface
• Optimization choices may be different
Threading for Multi-Core

Architectural analysis: Intel® VTune™ Analyzers
Introducing threads: Intel® C++ Compiler
Debugging: Intel® Thread Checker
Performance tuning: Intel® Thread Profiler

Intel has a full set of tools for parallel programming.
The Era of Tera

Terabytes of data. Teraflops of performance. When personal computing finally becomes personal.

Example applications: improved medical care, text mining, scientific simulation (courtesy Tsinghua University HPC Center), machine vision, immersive 3D entertainment, interactive learning (source: Steven K. Feiner, Columbia University), telepresent meetings (courtesy of the Electronic Visualization Laboratory, Univ. of Illinois at Chicago), virtual realities.

Computationally intensive applications of the future will be highly parallel.
21st Century Computer Architecture (slide from Patterson 2006)

Old CW: since we cannot know future programs, find a set of old programs to evaluate designs of the computers of the future
• E.g., SPEC2006

What about parallel codes?
• Few available; tied to old models, languages, architectures, ...

New approach: design the computers of the future for the numerical methods that will be important in the future.

Claim: the key methods for the next decade are the 7 dwarfs (+ a few), so design for them!
• Representative codes may vary over time, but these numerical methods will be important for > 10 years

Patterson, UC Berkeley, also predicts the importance of parallel applications.
Phillip Colella's "Seven Dwarfs" (slide from Patterson 2006)

High-end simulation in the physical sciences = 7 numerical methods:
1. Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo

Adding 4 for embedded covers all 41 EEMBC benchmarks:
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same.

Well-defined targets from an algorithmic, software, and architecture standpoint. Note that scientific computing, media processing, machine learning, and statistical computing share many algorithms.

Slide from "Defining Software Requirements for Scientific Computing", Phillip Colella, 2004.
Workload Acceleration

RMS Workload Speedup (Simulated)

[Chart: simulated speedup vs. number of cores (up to 64) for RMS (recognition, mining, synthesis) workloads: ray tracing, body tracker, modified levelset, FB_Est, Gauss-Seidel, sparse matrix, dense matrix, Kmeans, SVD, SVM classification, Cholesky (watson, mod2), and forward/backward solvers. The best-scaling workloads reach roughly 48x-57x at 64 cores; the worst still reach roughly 19x-33x.]

Group I scales well with increasing core count. Examples: ray tracing, body tracking, physical simulation.
Group II contains the worst-case scaling examples, yet they still scale. Examples: forward and backward solvers.

Source: Intel Corporation
Tera-Leap to Parallelism: Energy-Efficient Performance

[Timeline: single-core chips, instruction-level parallelism, Hyper-Threading, dual core, quad core, tera-scale computing, with energy-efficient performance rising over time.]

More performance using less energy. Science fiction becomes a reality.
Summary

Technology is driving Intel to build multi-core processors:
• Power
• Performance
• Memory latency
• Complexity
• Reliability

Parallel programming is a central issue.
Parallel applications will become mainstream.