VLSI Architecture: Past, Present, and Future
William J. Dally
Computer Systems Laboratory, Stanford University
March 23, 1999

Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years

Technology Evolution
[Figure: log-scale plots of wire pitch (µm), gate length (µm), and gate delay (ns) versus year, 1960-2010]

Technology Evolution (2)

Parameter     1979      1999      2019      Units
Gate Length   5         0.2       0.008     µm
Gate Delay    3000      150       7.5       ps
Clock Cycle   200       2.5       0.08      ns
Gates/Clock   67        17        10
Wire Pitch    15        1         0.07      µm
Chip Edge     6         15        38        mm
Grids/Chip    1.6×10⁵   2.3×10⁸   3.0×10¹¹

Architecture Evolution

Year   Microprocessor                             High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS              Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS (×4?)   MP with 1000 Xs: 10000 MIPS, 10000 MFLOPS
2019   ?                                          ?

Incremental Returns
[Figure: performance vs. processor cost (die area); a pipelined RISC, a dual-issue in-order, and a quad-issue out-of-order processor show progressively smaller gains per unit of die area]

Efficiency and Granularity
[Figure: peak performance vs. system cost (die area) for P+M, 2P+M, and 2P+2M configurations]

VLSI in 1979
[figure]

VLSI Architecture in 1979
• 5 µm NMOS technology
• 6 mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor – 0.5 MIPS

1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processor (10 CPI → 1 CPI)
  – shorter clock cycle (67 gates/clock → 30 gates/clock)
• In 1989 a 32-bit processor with floating point and caches fits on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (many of them memory)

1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI → 0.5 CPI, 30 gates/clock → 20 gates/clock
• Converting data parallelism to ILP
• Examples
  – Intel Pentium II (3-way out-of-order)
  – Compaq 21264 (4-way out-of-order)

1979-1999: Why Implicit Parallelism?
• Opportunity
  – large gap between micros and the fastest processors
• Compatibility
  – a pool of software ready to run on implicitly parallel machines
• Technology
  – not yet available for fine-grain explicitly parallel machines

1999-2019: Explicit Parallelism Takes Over
• Opportunity
  – no more processor gap
• Technology
  – interconnection, interaction, and shared-memory technologies have been proven

Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine (BLAGG):
  – Bandwidth – a large fraction (0.1) of local memory bandwidth
  – LAtency – a small multiple (3) of local memory latency
  – Global mechanisms – sync, fetch-and-op
  – Granularity – of tasks (100 instructions) and memory (8 MB)

Technology for Parallel Machines: Three Components
• Networks
  – 2 clocks/hop latency
  – 8 GB/s global bandwidth
• Interaction mechanisms
  – single-cycle communication and synchronization
• Software

k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select the radix to trade off diameter, D, against bandwidth, B:

    T = D + L/B = nk/4 + L/(Ck)

where n is the dimension, L the message length, and C a constant set by the bisection constraint (B = Ck).

[Figure: latency T vs. dimension n for 4K nodes, L = 256, Bs = 16K; latency falls from ~70 at n = 12 to a minimum near n = 3]

Dally, "Performance Analysis of k-ary n-cube Interconnection Networks", IEEE TC, 1990
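To make the radix trade-off concrete, the curve on this slide can be reproduced with a few lines of arithmetic. The sketch below (Python, written for this note; the constant C = Bs/(2N) and the torus average distance nk/4 follow the formula above) evaluates T for the 4K-node, L = 256, Bs = 16K example:

```python
# Back-of-envelope model of k-ary n-cube latency under a fixed wire
# bisection: T = D + L/B with D = nk/4 and B = C*k, C = Bs/(2N).
# Parameters are the slide's example: 4K nodes, 256-bit messages,
# 16K-wire bisection.

N, L, Bs = 4096, 256, 16384

for n in (2, 3, 4, 6, 12):            # dimensions giving an integer radix
    k = round(N ** (1.0 / n))         # radix: k**n = N
    D = n * k / 4                     # average distance in a torus (hops)
    B = (Bs / (2 * N)) * k            # channel width from the bisection budget
    T = D + L / B                     # hop latency + serialization latency
    print(f"n={n:2d}  k={k:3d}  T={T:5.1f}")
```

This reproduces the slide's curve: T ≈ 34, 20, 24, 38, and 70 cycles for n = 2, 3, 4, 6, and 12, with the minimum at n = 3. Low dimensions pay in hop count; high dimensions starve the narrow channels with serialization delay.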
Delay of Express Channels
[figure]

The Torus Routing Chip
• k-ary n-cube topology
  – 2-D torus network
  – 8-bit × 20 MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture

Dally and Seitz, "The Torus Routing Chip", Distributed Computing, 1986

The Reliable Router
• Fault-tolerant
  – adaptive routing (an adaptation of Duato's algorithm)
  – link-level retry
  – unique token protocol
• 32-bit × 200 MHz channels
  – simultaneous bidirectional signalling
  – low-latency plesiochronous synchronizers
• Optimistic routing

Dally, Dennison, Harris, Kan, and Xanthopoulos, "Architecture and Implementation of the Reliable Router", Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, "Low-Latency Plesiochronous Data Retiming", ARVLSI 1995
Dennison, Lee, and Dally, "High Performance Bidirectional Signalling in VLSI Systems", SIS 1993

Equalized 4 Gb/s Signaling
[figure]

End-to-End Latency
• Software sees ~10 µs latency over a 500 ns network
• A heavy compute load is associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation

[Figure: message path from registers on the sending node through send and Tx, across the network to Rx, buffering, and dispatch on the receiving node]

Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24 clocks across a 512-node 3-cube)
  – faster than the main-memory access of modern machines
  – end-to-end hardware support is needed to see this; no 'libraries'
• With high-speed signaling, bandwidth of 4 GB/s or more per channel (512 GB/s bisection) is easy to achieve
  – nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection constraints to the packaging technology
  – it's hard to beat a 3-D mesh or torus
• This gives us the B and LA of BLAGG

The Importance of Mechanisms
[Figure, built up over three slides: a timeline of tasks A and B run serially; the same tasks run in parallel with overhead, communication, and synchronization filling 0.5 of the time; and run in parallel with lightweight mechanisms, where the overhead fraction falls to 0.062]
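Those two overhead fractions are worth a quick sanity check. Here is a minimal sketch (the cycle counts are assumptions chosen for illustration, not numbers from the talk) of how a fixed per-message overhead sets the overhead fraction when a task is split two ways, as in the A/B figure above:

```python
# Sketch: split a task across 2 processors, each paying a fixed
# overhead (send + receive + sync) on its half of the work.
# The cycle counts below are illustrative assumptions; they happen to
# reproduce the slide's overhead fractions of 0.5 and 0.062.

def split_two_ways(work_cycles, overhead_cycles):
    """Return (overhead fraction, efficiency) for a 2-way split."""
    per_proc = work_cycles / 2 + overhead_cycles
    overhead_fraction = overhead_cycles / per_proc
    efficiency = (work_cycles / 2) / per_proc
    return overhead_fraction, efficiency

work = 1000  # total useful cycles in tasks A and B combined (assumed)

for label, ovh in [("software messaging (system call)", 500),
                   ("hardware mechanisms", 33)]:
    frac, eff = split_two_ways(work, ovh)
    print(f"{label:32s}  overhead={frac:.3f}  efficiency={eff:.3f}")
```

With OS-weight messaging, half of each processor's time goes to bookkeeping; with fast hardware mechanisms the same decomposition wastes about 6% of it, which is why fast mechanisms make fine-grain tasks viable.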
Granularity and Cost Effectiveness
• Parallel computers are built for
  – Capability – run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity – get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation

[Figure: a single workstation node (P, M, $) beside a parallel machine of four such nodes]

MIT J-Machine (1991)
[figure]

Exploiting Fine-Grain Threads
• Where will the parallelism come from to keep all of these processors busy?
  – ILP – limited to about 5
  – Outer-loop parallelism
    • e.g., domain decomposition
    • requires big problems to get lots of parallelism
• Fine threads
  – make communication and synchronization very fast (1 cycle)
  – break the problem into smaller pieces
  – more parallelism

Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms enable fine-grain task decomposition
  – simplifies programming
  – exposes parallelism
  – facilitates load balance
• We have demonstrated
  – 1-cycle communication and synchronization locally
  – 10-cycle communication, synchronization, and task dispatch across a network
• Physically fine-grain machines have better performance/cost than sequential machines

A 2009 Multicomputer
[Figure: system of 16 chips; each chip carries 64 tiles; each tile is a processor plus 8 MB of memory]

Challenges for the Explicitly Parallel Era
• Compatibility
• Managing locality
• Parallel software

Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
  – with good mechanisms
• Parallelizing sequential software is hard
  – software needs to be designed for parallelism from the ground up
• An incremental migration path
  – run sequential codes with acceptable performance
  – parallelize selected applications for considerable speedup

Performance Depends on Locality
• Applications have data- and time-dependent graph structure
  – Sparse-matrix solution
    • non-zero and fill-in structure
  – Logic simulation
    • circuit topology and activity
  – PIC codes
    • structure changes as particles move
  – 'Sort-middle' polygon rendering
    • structure changes as the viewpoint moves

Fine-Grain Data Migration: Drift and Diffusion
• Run-time relocation based on pointer use
  – move data at both ends of a pointer
  – move control as well as data
• Each 'relocation cycle'
  – compute a drift vector based on pointer use
  – compute a diffusion vector based on a density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
  – not just update vs. invalidate
  – need to duplicate computation to avoid communication
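As a thought experiment, one relocation cycle might look like the sketch below. Everything beyond the drift/diffusion/damping structure named on the slide is invented for illustration: the 2-D tile grid, the weights ALPHA and BETA, and the momentum-style damping term are assumptions, not details of the actual design.

```python
# Hypothetical relocation cycle on a 2-D tile grid.  An object drifts
# toward the tiles that recently dereferenced pointers to it, diffuses
# down the local density gradient, and carries a damped fraction of its
# previous move to avoid oscillation.  All constants are assumed.

ALPHA, BETA, DAMP = 0.6, 0.4, 0.3   # drift, diffusion, damping weights (assumed)

def relocation_cycle(tile, pointer_uses, density, prev_move):
    """tile: (x, y) current location of the object
    pointer_uses: (x, y) tiles that used a pointer to it this cycle
    density: dict (x, y) -> tile occupancy
    prev_move: the object's displacement last cycle"""
    x, y = tile
    # Drift vector: toward the mean position of recent pointer uses.
    if pointer_uses:
        mx = sum(ux for ux, _ in pointer_uses) / len(pointer_uses)
        my = sum(uy for _, uy in pointer_uses) / len(pointer_uses)
        drift = (mx - x, my - y)
    else:
        drift = (0.0, 0.0)
    # Diffusion: local density gradient; we move against it, away from
    # full tiles.
    gx = density.get((x + 1, y), 0) - density.get((x - 1, y), 0)
    gy = density.get((x, y + 1), 0) - density.get((x, y - 1), 0)
    # Blend the two vectors, keeping some of last cycle's move so the
    # object does not ping-pong between two attractive tiles.
    dx = ALPHA * drift[0] - BETA * gx + DAMP * prev_move[0]
    dy = ALPHA * drift[1] - BETA * gy + DAMP * prev_move[1]
    return round(dx), round(dy)      # displacement in whole tiles

# Example: an object at (3, 3) pulled toward uses at (6, 3) and (5, 4),
# pushed back slightly by a crowded tile at (4, 3).  Prints (1, 0).
print(relocation_cycle((3, 3), [(6, 3), (5, 4)], {(4, 3): 2}, (0, 0)))
```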
Migration and Locality
[Figure: distance in tiles (0-6) vs. migration period (1-37) for NoMigration, OneStep, Hierarchy, and Mixed migration policies]

Parallel Software: Focus on the Real Problems
• Almost all demanding problems have ample parallelism
• Need to focus on the fundamental problems
  – extracting parallelism
  – load balance
  – locality
    • load balance and locality can be covered by excess parallelism
• Avoid incidental issues
  – aggregating tasks to avoid overhead
  – manually managing data movement and replication
  – oversynchronization

Parallel Software: Design Strategy
• A program must be designed for parallelism from the ground up
  – no bottlenecks in the data structures
    • e.g., arrays instead of linked lists
• Data parallelism
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)

Conclusion: We Are on the Threshold of the Explicitly Parallel Era
• As in 1979, we expect a 1000-fold increase in 'grids' per chip over the next 20 years
• Unlike 1979, these 'grids' are best applied to explicitly parallel machines
  – diminishing returns from sequential processors (ILP) leave no alternative to explicit parallelism
  – the enabling technologies have been proven
    • interconnection networks, mechanisms, cache coherence
  – fine-grain machines are more efficient than sequential machines
• Fine-grain machines will be constructed from multiprocessor/DRAM chips
• Incremental migration to parallel software