VLSI Architecture
Past, Present, and Future
William J. Dally
Computer Systems Laboratory
Stanford University
March 23, 1999
Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years
Technology Evolution
[Figure: gate length (µm), gate delay (ns), and wire pitch (µm) plotted on a log scale against year, 1960-2010, all decreasing steadily.]
Technology Evolution (2)
Parameter     1979        1999        2019         Units
Gate Length   5           0.2         0.008        µm
Gate Delay    3000        150         7.5          ps
Clock Cycle   200         2.5         0.08         ns
Gates/Clock   67          17          10
Wire Pitch    15          1           0.07         µm
Chip Edge     6           15          38           mm
Grids/Chip    1.6 x 10^5  2.3 x 10^8  3.0 x 10^11
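The two derived rows follow from the raw parameters above. A minimal Python sketch of that arithmetic (my own check, not from the slides), assuming a grid is a square one wire pitch on a side and Gates/Clock is simply clock cycle divided by gate delay:

    params = {
        1979: {"wire_pitch_um": 15,   "chip_edge_mm": 6,  "gate_delay_ps": 3000, "clock_ns": 200},
        1999: {"wire_pitch_um": 1,    "chip_edge_mm": 15, "gate_delay_ps": 150,  "clock_ns": 2.5},
        2019: {"wire_pitch_um": 0.07, "chip_edge_mm": 38, "gate_delay_ps": 7.5,  "clock_ns": 0.08},
    }

    for year, p in params.items():
        grids = (p["chip_edge_mm"] * 1000 / p["wire_pitch_um"]) ** 2   # grids per chip
        gates_per_clock = p["clock_ns"] * 1000 / p["gate_delay_ps"]    # gate delays per clock cycle
        print(f"{year}: grids/chip = {grids:.1e}, gates/clock = {gates_per_clock:.0f}")

This reproduces the table to within rounding.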
Architecture Evolution
Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (x 4?)
2019   X: 10,000 MIPS, 10,000 MFLOPS         MP with 1000 Xs
Incremental Returns
[Figure: performance versus processor cost (die area), showing diminishing gains from a pipelined RISC to a dual-issue in-order to a quad-issue out-of-order design.]
Efficiency and Granularity
[Figure: peak performance versus system cost (die area) for configurations P+M, 2P+M, and 2P+2M.]
VLSI in 1979
VLSI Architecture in 1979
• 5µm NMOS technology
• 6mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
  – 0.5 MIPS
1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processors (10 CPI --> 1 CPI)
  – shorter clock cycles (67 gates/clock --> 30 gates/clock)
• In 1989 a 32-bit processor with floating point and caches fits on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (mostly memory)
1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (1.5 x 1.15 = ~72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
  – Intel Pentium II (3-way o-o-o)
  – Compaq 21264 (4-way o-o-o)
1979-1999: Why Implicit Parallelism?
• Opportunity
– large gap between micros and fastest processors
• Compatibility
– software pool ready to run on implicitly parallel
machines
• Technology
– not available for fine-grain explicitly parallel
machines
1999-2019: Explicit Parallelism Takes Over
• Opportunity
– no more processor gap
• Technology
– interconnection, interaction, and shared memory
technologies have been proven
Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine. (BLAGG)
  – Bandwidth - large fraction (0.1) of local memory BW
  – LAtency - small multiple (3) of local memory latency
  – Global mechanisms - sync, fetch-and-op
  – Granularity - of tasks (100 inst) and memory (8MB)
Technology for Parallel Machines
Three Components
• Networks
– 2 clocks/hop latency
– 8GB/s global bandwidth
• Interaction mechanisms
– single-cycle communication and synchronization
• Software
k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select the radix to trade off diameter, D, against B.
[Figure: latency T (0-70 cycles) versus dimension (0-12) for 4K nodes, L = 256, Bs = 16K; annotations relate T to the hop count, message length L, and channel bandwidth B, all set by the choice of k and n.]
Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
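A rough Python sketch of the trade-off plotted above, in the spirit of the cited analysis: hop count falls with dimension while serialization time rises, because a fixed wire bisection forces narrower channels at higher dimension. The constants below (average hop count, channel-width scaling) are assumptions for illustration, not the slide's exact formula:

    N, L, Bs = 4096, 256, 16 * 1024        # nodes, message length (bits), wire bisection (bits)

    for n in range(1, 11):                  # network dimension
        k = round(N ** (1.0 / n))           # radix giving roughly N = k^n nodes
        D = n * (k - 1) / 2                 # average hop count, assuming unidirectional rings
        W = max(1, Bs * k // (4 * N))       # channel width under the fixed-bisection budget
        T = D + L / W                       # latency in cycles: hops plus serialization
        print(f"n={n:2d}  k={k:4d}  T={T:8.1f}")

With these assumptions the minimum latency falls at a low dimension, consistent with choosing the radix to balance D against B.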
Delay of Express Channels
The Torus Routing Chip
• k-ary n-cube topology
  – 2D torus network
  – 8-bit x 20MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture
Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
The Reliable Router
• Fault-tolerant
– Adaptive routing (adaptation of
Duato’s algorithm)
– Link-level retry
– Unique token protocol
• 32bit x 200MHz channels
– Simultaneous bidirectional
signalling
– Low latency plesiochronous
synchronizers
• Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, “Architecture and Implementation of the Reliable Router”, Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, “Low-Latency Plesiochronous Data Retiming,” ARVLSI 1995
Dennison, Lee, and Dally, “High Performance Bidirectional Signalling in VLSI Systems,” SIS 1993
Equalized 4Gb/s Signaling
End-to-End Latency
• Software sees ~10µs latency with a 500ns network
• Heavy compute load associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation
[Figure: message path from the Tx node (Regs, Send) across the Net to the Rx node (Buffer, Dispatch).]
Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24
clocks for a 512-node 3-cube)
– networks faster than main memory access of modern machines
– need end-to-end hardware support to see this, no ‘libraries’
• With high-speed signaling, bandwidth of 4GB/s or more
per channel (512GB/s bisection) is easy to achieve
– nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection
constraints to the packaging technology
– it's hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)
The Importance of Mechanisms
[Figure, built up over three slides: tasks A and B run serially; run in parallel with high overhead (OVH + COM + Sync, roughly 0.5 of the work), which erases most of the gain; and run in parallel with low overhead (roughly 0.062), which approaches the ideal speedup.]
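A small Python sketch of what these figures illustrate, using the overhead fractions quoted on the slides (0.5 and 0.062) in a simple additive model of my own:

    def two_way_speedup(overhead_fraction):
        serial = 2.0                          # run A and then B on one processor
        parallel = 1.0 + overhead_fraction    # A and B concurrently, plus OVH + COM + Sync
        return serial / parallel

    print(two_way_speedup(0.5))      # ~1.33x: high overhead erases most of the benefit
    print(two_way_speedup(0.062))    # ~1.88x: fast mechanisms approach the ideal 2x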
Granularity and Cost Effectiveness
• Parallel computers are built for
  – Capability - run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity - get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation (see the sketch below)
[Figure: a single workstation node (P, M, $) alongside a parallel machine of four smaller nodes, each with its own P, M, and $.]
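A toy Python model of this perf/cost argument; the relative processor and memory costs and the parallel efficiency are invented for illustration:

    def perf_per_cost(nodes, mem_per_node_mb, efficiency):
        c_proc, c_mb = 100.0, 1.0                   # assumed relative costs of a processor and 1MB of DRAM
        cost = nodes * (c_proc + c_mb * mem_per_node_mb)
        perf = nodes * efficiency                   # sublinear speedup: efficiency < 1 in parallel
        return perf / cost

    print(perf_per_cost(1, 512, 1.0))    # workstation baseline
    print(perf_per_cost(64, 512, 0.7))   # workstation-size nodes: lower perf/cost than one workstation
    print(perf_per_cost(64, 8, 0.7))     # small-memory nodes: better perf/cost than one workstation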
MIT J-Machine (1991)
Exploiting fine-grain threads
• Where will the parallelism come
from to keep all of these processors
busy?
– ILP - limited to about 5
– Outer-loop parallelism
• e.g., domain decomposition
• requires big problems to get lots of
parallelism
• Fine threads
– make communication and
synchronization very fast (1 cycle)
– break the problem into smaller
pieces
– more parallelism
Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms
enable fine-grain task decomposition
– simplifies programming
– exposes parallelism
– facilitates load balance
• Have demonstrated
– 1-cycle communication and synchronization locally
– 10-cycle communication, synchronization, and task dispatch
across a network
• Physically fine-grain machines have better
performance/cost than sequential machines
A 2009 Multicomputer
• Tile: processor (P) + 8MB memory
• Chip: 64 tiles
• System: 16 chips
[Figure: tile floorplan showing the processor and its 8MB of memory.]
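The totals implied by this configuration (my arithmetic; only the per-tile, per-chip, and per-system counts appear on the slide):

    tiles_per_chip, chips_per_system, mem_per_tile_mb = 64, 16, 8

    processors = tiles_per_chip * chips_per_system     # 1024 processor tiles
    memory_gb = processors * mem_per_tile_mb / 1024    # 8 GB of distributed DRAM
    print(processors, memory_gb)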
Challenges for the Explicitly Parallel Era
• Compatibility
• Managing locality
• Parallel software
Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
– with good mechanisms
• Parallelizing sequential software is hard
– needs to be designed from the ground up
• An incremental migration path
– run sequential codes with acceptable performance
– parallelize selected applications for considerable
speedup
Performance Depends on Locality
• Applications have data/time-dependent graph structure
– Sparse-matrix solution
• non-zero and fill-in structure
– Logic simulation
• circuit topology and activity
– PIC codes
• structure changes as particles move
– ‘Sort-middle’ polygon rendering
• structure changes as viewpoint
moves
Fine-Grain Data Migration
Drift and Diffusion
• Run-time relocation based on pointer use
– move data at both ends of pointer
– move control and data
• Each ‘relocation cycle’ (sketched below)
  – compute a drift vector based on pointer use
  – compute a diffusion vector based on density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
– not just update vs. invalidate
– need to duplicate computation to avoid
communication
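A hedged Python sketch of one relocation cycle in this drift-and-diffusion style; the data layout, weights, and damping factor are illustrative assumptions, not the algorithm behind the slide:

    def relocation_cycle(objects, tile_load, w_drift=0.5, w_diff=0.25, damping=0.5):
        """objects: {id: {'tile': (x, y), 'uses': [(x, y), ...]}}, where 'uses' lists the
        tiles that dereferenced pointers to the object this cycle; tile_load: {(x, y): load}."""
        moves = {}
        for oid, obj in objects.items():
            x, y = obj['tile']
            # Drift: pull toward the tiles that actually used the object through pointers.
            if obj['uses']:
                dx_drift = sum(u[0] for u in obj['uses']) / len(obj['uses']) - x
                dy_drift = sum(u[1] for u in obj['uses']) / len(obj['uses']) - y
            else:
                dx_drift = dy_drift = 0.0
            # Diffusion: push down the local density (load) gradient toward emptier tiles.
            dx_diff = tile_load.get((x - 1, y), 0) - tile_load.get((x + 1, y), 0)
            dy_diff = tile_load.get((x, y - 1), 0) - tile_load.get((x, y + 1), 0)
            # Damping keeps steps small so objects do not oscillate between tiles.
            mx = damping * (w_drift * dx_drift + w_diff * dx_diff)
            my = damping * (w_drift * dy_drift + w_diff * dy_diff)
            moves[oid] = (x + round(mx), y + round(my))
        return moves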
Migration and Locality
[Figure: average distance (in tiles, 0-6) versus migration period (1-37) for four policies - NoMigration, OneStep, Hierarchy, and Mixed.]
Parallel Software:
Focus on the Real Problems
• Almost all demanding problems have ample
parallelism
• Need to focus on fundamental problems
– extracting parallelism
– load balance
– locality
• load balance and locality can be covered by excess parallelism
• Avoid incidental issues
– aggregating tasks to avoid overhead
– manually managing data movement and replication
– oversynchronization
Parallel Software:
Design Strategy
• A program must be designed for parallelism
from the ground up
– no bottlenecks in the data structures
• e.g., arrays instead of linked lists
• Data parallelism (see the sketch below)
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)
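A minimal Python sketch of the 'for over data becomes forall' idea; the element update and the process-pool mapping are my own illustration, not the slides' notation:

    from concurrent.futures import ProcessPoolExecutor

    def update(cell):
        return 0.5 * cell + 1.0                # purely local work: no loop-carried dependence

    def step_sequential(cells):
        return [update(c) for c in cells]      # ordinary for loop over the data

    def step_forall(cells):
        with ProcessPoolExecutor() as pool:    # the same loop expressed as a forall
            return list(pool.map(update, cells))

    if __name__ == "__main__":
        cells = [float(i) for i in range(1000)]
        assert step_sequential(cells) == step_forall(cells)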
Conclusion: We are on the threshold of the
explicitly parallel era
• As in 1979, we expect a 1000-fold increase in ‘grids’
per chip in the next 20 years
• Unlike 1979, these ‘grids’ are best applied to explicitly
parallel machines
– Diminishing returns from sequential processors (ILP) - no
alternative to explicit parallelism
– Enabling technologies have been proven
• interconnection networks, mechanisms, cache coherence
– Fine-grain machines are more efficient than sequential
machines
• Fine-grain machines will be constructed from multiprocessor/DRAM chips
• Incremental migration to parallel software