Future Computer Advances are
Between a Rock (Slow Memory)
and a Hard Place (Multithreading)
Mark D. Hill
Computer Sciences Dept.
and Electrical & Computer Engineering Dept.
University of Wisconsin—Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
October 2004
Full Disclosure: Consult for Sun & US NSF
© 2004 Mark D. Hill
Wisconsin Multifacet Project
Executive Summary: Problem
• Expect computer performance doubling every 2 years
• Derives from Technology & Architecture
• Technology will advance for ten or more years
• But Architecture faces a Rock: Slow Memory
– a.k.a. the Memory Wall [Wulf & McKee 1995]
• Prediction: Popular Moore’s Law (doubling
performance) will end soon, regardless of
the real Moore’s Law (doubling transistors)
Executive Summary: Recommendation
• Chip Multiprocessing (CMP) Can Help
– Implement multiple processors per chip
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?
• Go to the Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but it is a long-run effort
– Requires moving multithreading from CS fringe to center
(algorithms, programming languages, …, hardware)
• Necessary For Restoring Popular Moore’s Law
Outline
• Executive Summary
• Background
– Moore’s Law
– Architecture
– Instruction Level Parallelism
– Caches
• Going Forward, Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
Society Expects A Popular Moore’s Law
Computing critical: commerce, education, engineering,
entertainment, government, medicine, science, …
– Servers (> PCs)
– Clients (= PCs)
– Embedded (< PCs)
• We have come to expect a misnamed “Moore’s Law”
– Computer performance doubles every two years (same cost)
– ⇒ Progress in next two years = All past progress
• Important Corollary
– Computer cost halves every two years (same performance)
– ⇒ In ten years, same performance for 3% (the “sales tax” – Jim Gray); see the arithmetic below
• Derives from Technology & Architecture
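Spelling out the corollary’s arithmetic (my worked example, not on the original slide): halving cost every two years for ten years is five successive halvings,

    (1/2)^(10/2) = (1/2)^5 = 1/32 ≈ 3%

so in a decade, today’s performance should sell for roughly the sales tax on today’s price.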
(Technologist’s) Moore’s Law Provides Transistors
Number of transistors
per chip doubles every
two years (18 months)
Merely a “Law” of
Business Psychology
Performance from Technology & Architecture
[Figure: growth in processor performance over time, showing the combined contributions of technology and architecture]
Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Architects Use Transistors To Compute Faster
• Bit Level Parallelism (BLP) within Instructions
[Diagram: instruction stream over time]
• Instruction Level Parallelism (ILP) among Instructions (sketch below)
[Diagram: independent instructions overlapped in time]
• Scores of speculative instructions look sequential!
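A minimal C sketch (mine, not the talk’s) of what ILP buys: the first loop carries four independent multiply chains that an out-of-order core can keep in flight simultaneously; the second forces every multiply to wait on the one before it, so it runs several times slower despite doing the same number of multiplies.

    /* ilp.c -- compile with plain -O2 (no -ffast-math, so the
     * compiler cannot reassociate the floating-point chains). */
    #include <stdio.h>

    int main(void) {
        const double x = 1.0000001;
        double a = 1.0, b = 1.0, c = 1.0, d = 1.0, s = 1.0;
        for (long i = 0; i < 100000000; i++) {
            a *= x; b *= x; c *= x; d *= x;   /* independent: ~4-way ILP */
        }
        for (long i = 0; i < 100000000; i++) {
            s *= x; s *= x; s *= x; s *= x;   /* dependent chain: no ILP */
        }
        printf("%g %g\n", a * b * c * d, s);  /* keep results live */
        return 0;
    }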
Architects Use Transistors To Tolerate Slow Memory
• Cache
– Small, Fast Memory
– Holds information (expected)
to be used soon
– Mostly Successful
• Apply Recursively
– Level-one cache(s)
– Level-two cache
• Most of microprocessor
die area is cache!
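A small C experiment (my illustration, not from the talk) that makes the cache visible: both loops read every byte of a 64 MB array, but the second walks it in 64 line-sized strides, defeating the reuse of each 64-byte cache line and turning most reads into misses.

    /* cache.c -- compile with -O2; timings are illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE (64L * 1024 * 1024)   /* far larger than any cache */

    int main(void) {
        char *a = malloc(SIZE);
        if (!a) return 1;
        for (long i = 0; i < SIZE; i++) a[i] = 1;  /* touch every page */
        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < SIZE; i++) sum += a[i];      /* sequential */
        clock_t t1 = clock();
        for (long j = 0; j < 64; j++)                     /* strided */
            for (long i = j; i < SIZE; i += 64) sum += a[i];
        clock_t t2 = clock();
        printf("sequential %.2fs, strided %.2fs (sum=%ld)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
        free(a);
        return 0;
    }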
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits the Rock
– Technology Continues
– Slow Memory
– Implications
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
Future Technology Implications
• For (at least) ten years, Moore’s Law continues
– More repeated doublings of number of transistors per chip
– Faster transistors
• But hard for processor architects to use
– More transistors are hard to use due to global wire delays
– Faster transistors are hard to use due to too much dynamic power
• Moreover, hitting a Rock: Slow Memory
– Memory access = 100s of floating-point multiplies! (worked example below)
– a.k.a. the Memory Wall [Wulf & McKee 1995]
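To ground that claim with round numbers of my own choosing (not from the slide): a ~100 ns DRAM access on a ~3 GHz processor that can start one pipelined floating-point multiply per cycle costs

    100 ns × 3 cycles/ns = 300 cycles ≈ 300 forgone multiplies per miss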
Rock: Memory Gets (Relatively) Slower
[Figure: processor speed vs. DRAM speed over time; the gap widens every year]
Reprinted from Hennessy and Patterson, “Computer Architecture: A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.
Impact of Slow Memory (Rock)
• Off-Chip Misses are now hundreds of cycles
[Diagram: the good case! long compute phases separated by short memory phases along the instruction/time line]
• More Realistic Case
[Diagram: with an instruction window of 4 (or 64), only instructions I1–I4 can overlap, so memory phases dominate the timeline]
Implications of Slow Memory (Rock)
• Increasing memory latency hides (dwarfs) the compute phases
• Near Term Implications
– Reduce memory latency
– Fewer memory accesses
– More Memory Level Parallelism (MLP; sketch below)
• Longer Term Implications
– What can single-threaded software do while waiting 100
instruction opportunities, 200, 400, … 1000?
– What can amazing speculative hardware do?
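A hedged C sketch of the MLP bullet above (my example): a single linked-list walk exposes one miss at a time, because each load needs the previous node’s pointer; walking two independent lists in one loop lets the hardware overlap two misses, roughly halving stall time when both lists miss.

    /* mlp.c -- in a real experiment the nodes would be scattered in
     * memory so each ->next dereference misses the caches. */
    #include <stdio.h>

    struct node { struct node *next; long val; };

    long walk1(struct node *p) {                 /* one chain: MLP = 1 */
        long sum = 0;
        while (p) { sum += p->val; p = p->next; }
        return sum;
    }

    long walk2(struct node *p, struct node *q) { /* two chains: MLP = 2 */
        long sum = 0;
        while (p && q) {
            sum += p->val + q->val;              /* independent loads overlap */
            p = p->next; q = q->next;
        }
        while (p) { sum += p->val; p = p->next; }
        while (q) { sum += q->val; q = q->next; }
        return sum;
    }

    int main(void) {
        enum { N = 1000 };
        static struct node nodes[2 * N];
        for (int i = 0; i < 2 * N; i++) {
            nodes[i].val = i;
            nodes[i].next = (i % N == N - 1) ? NULL : &nodes[i + 1];
        }
        printf("%ld %ld\n", walk1(&nodes[0]) + walk1(&nodes[N]),
                            walk2(&nodes[0], &nodes[N]));
        return 0;
    }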
Assessment So Far
• Appears
– Popular Moore’s Law (doubling performance)
will end soon, regardless of the
real Moore’s Law (doubling transistors)
• Processor performance hitting Rock (Slow Memory)
• No known way to overcome this, unless we redefine performance in the Popular Moore’s Law
– From Processor Performance
– To Chip Performance
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
– Small & Large CMPs
– CMP Systems
– CMP Workload
• Go to the Hard Place of Mainstream Multithreading
Performance for Chip, not Processor or Thread
• Chip Multiprocessing (CMP)
• Replicate Processor
• Private L1 Caches
– Low latency
– High bandwidth
• Shared L2 Cache
– Larger than if private
Piranha Processing Node
(this node description and the performance results below are from Luiz Barroso’s ISCA 2000 presentation of “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing”; the original slides built the diagram up one component at a time)

[Diagram: eight Alpha CPUs, each with L1 I$ and D$, connected through an intra-chip switch (ICS) to eight L2 banks, memory controllers (MEM-CTL), protocol engines (HE/RE), and a router]

• Alpha core: 1-issue, in-order, 500 MHz
• L1 caches: I & D, 64 KB, 2-way
• Intra-chip switch (ICS): 32 GB/sec, 1-cycle delay
• L2 cache: shared, 1 MB, 8-way
• Memory Controller (MC): RDRAM, 12.8 GB/sec (8 banks @ 1.6 GB/sec)
• Protocol Engines (HE & RE): programmable, 1K instructions, even/odd interleaving
• System Interconnect: 4-port crossbar router, topology independent, 32 GB/sec total bandwidth (4 links @ 8 GB/sec)
Single-Chip Piranha Performance

[Bar chart: normalized execution time (lower is better), each bar split into CPU, L2Hit, and L2Miss time]

Config         Clock     Issue      OLTP   DSS
P1             500 MHz   1-issue    350    233
INO            1 GHz     1-issue    191    145
OOO            1 GHz     4-issue    100    100
P8 (Piranha)   500 MHz   1-issue     34     44
• Piranha’s performance margin is 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses ⇒ better utilizes the memory system
Simultaneous Multithreading (SMT)
• Multiplex S logical processors on each processor
– Replicate registers, share caches, & manage other parts
– Implementation factors keep S small, e.g., 2-4
• Cost-effective gain if threads available
– E.g., S=2 ⇒ 1.4x performance
• Modest cost
– Limits waste if additional logical processor(s) not used
• Worthwhile CMP enhancement
Small CMP Systems
• Use One CMP (with C cores of S-way SMT)
– C=[2,16] & S=[2,4] ⇒ C*S = [4,64] threads
– Size of a small PC!
• Directly Connect CMP (C) to
Memory Controller (M) or DRAM
[Diagram: two CMP chips (C) attached directly to a memory controller (M)]
Medium CMP Systems
• Use 2-16 CMPs (with C cores of S-way SMT)
– Smaller: 2*4*4 = 32 threads
– Larger: 16*16*4 = 1024 threads
– In a single cabinet
• Connecting CMPs & memory controllers/DRAM raises many issues
[Diagrams: “Dance Hall” (all CMPs on one side of the interconnect, all memory on the other) vs. “Processor-Centric” (memory controllers/DRAM attached to each CMP)]
Inflection Points
• An inflection point occurs when a smooth input change leads to a disruptive output change
• Enough transistors for …
– 1970s: simple microprocessor
– 1980s: pipelined RISC
– 1990s: speculative out-of-order
– 2000s: …
• CMP will be the Server Inflection Point
– Expect >10x performance for less cost
– Implying >>10x cost-performance
– Early CMPs like old SMPs but expect dramatic advances!
So What’s Wrong with the CMP Picture?
• Chip Multiprocessors
– Allow profitable use of more transistors
– Support modest to vast multithreading
– Will be inflection point for commercial servers
• But
– Many workloads have only a single thread (available to run)
– Even if single thread solves a problem formerly done by
many people in parallel (e.g., clerks in payroll processing)
• Go to a Hard Place
– Make most workloads flourish with CMPs
Outline
• Executive Summary
• Background
• Going Forward, Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
– Parallel from Fringe to Center
– For All of Computer Science!
Thread Parallelism from Fringe to Center
• History
– Automatic Computer (vs. Human) → Computer
– Digital Computer (vs. Analog) → Computer
• Must Change
– Parallel Computer (vs. Sequential) → Computer
– Parallel Algorithm (vs. Sequential) → Algorithm
– Parallel Programming (vs. Sequential) → Programming
– Parallel Library (vs. Sequential) → Library
– Parallel X (vs. Sequential) → X
• Otherwise, repeated performance doublings unlikely
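What the “Parallel Algorithm → Algorithm” shift looks like at its very simplest, as a hedged pthreads sketch (my example, not the speaker’s): even summing an array must be recast as per-thread partial sums plus a combine step.

    /* psum.c -- compile with -pthread. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define T 4                        /* e.g., one thread per CMP core */

    static long a[N];
    static long partial[T];

    static void *sum_range(void *arg) {
        long t = (long)arg;
        long lo = t * (N / T), hi = (t == T - 1) ? N : lo + N / T;
        long s = 0;
        for (long i = lo; i < hi; i++) s += a[i];
        partial[t] = s;                /* each thread writes its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t tid[T];
        for (long i = 0; i < N; i++) a[i] = i;
        for (long t = 0; t < T; t++)
            pthread_create(&tid[t], NULL, sum_range, (void *)t);
        long total = 0;
        for (long t = 0; t < T; t++) {
            pthread_join(tid[t], NULL);
            total += partial[t];       /* sequential combine */
        }
        printf("sum = %ld\n", total);
        return 0;
    }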
Computer Architects Can Contribute
• Chip Multiprocessor Design
– Transcend pre-CMP multiprocessor design
– Intra-CMP has lower latency & much higher bandwidth
• Hide Multithreading (Helper Threads)
• Assist Multithreading (Thread-Level Speculation)
• Ease Multithreaded Programming (Transactions; sketch below)
• Provide a “Gentle Ramp to Parallelism” (Hennessy)
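To make the “Transactions” bullet concrete, a hedged C sketch (mine; transactional hardware and APIs were still research proposals in 2004): today the programmer must acquire locks in a global order to avoid deadlock, whereas a transactional system would let the whole transfer simply be declared atomic.

    #include <pthread.h>

    struct account { pthread_mutex_t lock; long balance; };

    /* Lock-based version: correctness depends on a lock-ordering
     * discipline the compiler cannot check. */
    void transfer(struct account *from, struct account *to, long amt) {
        struct account *first  = from < to ? from : to;   /* global order */
        struct account *second = from < to ? to : from;
        pthread_mutex_lock(&first->lock);
        pthread_mutex_lock(&second->lock);
        from->balance -= amt;
        to->balance   += amt;
        pthread_mutex_unlock(&second->lock);
        pthread_mutex_unlock(&first->lock);
    }

    /* Hypothetical transactional version (not standard C):
     *     atomic { from->balance -= amt; to->balance += amt; }
     */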
But All of Computer Science is Needed
• Hide Multithreading (Libraries & Compilers; sketch below)
• Assist Multithreading (Development Environments)
• Ease Multithreaded Programming (Languages)
• Divide & Conquer Multithreaded Complexity
(Theory & Abstractions)
• Must Enable
– 99% of programmers think sequentially while
– 99% of instructions execute in parallel
• Enable a “Parallelism Superhighway”
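One existing hint of that superhighway, as a hedged sketch (my example): with a directive-based system like OpenMP (already available in 2004), the programmer writes an ordinary sequential-looking loop and the compiler and runtime fan it out across a CMP’s cores.

    /* omp.c -- compile with an OpenMP-capable compiler, e.g. -fopenmp. */
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double x[N];
        #pragma omp parallel for       /* the only parallel annotation */
        for (int i = 0; i < N; i++)
            x[i] = i * 0.5;
        printf("%f\n", x[N - 1]);
        return 0;
    }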
Summary
• (Single-Threaded) Computing faces a Rock: Slow Memory
• Popular Moore’s Law (doubling performance) will end soon
• Chip Multiprocessing Can Help
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?
• Go to the Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but it is a long-run effort
– Requires moving multithreading from CS fringe to center
• Necessary For Restoring Popular Moore’s Law