VLSI Architecture
Past, Present, and Future
William J. Dally
Computer Systems Laboratory
Stanford University
March 23, 1999
Past, Present, and Future
• The last 20 years have seen a 1000-fold increase in grids per chip and a 20-fold reduction in gate delay
• We expect this trend to continue for the next 20 years
• For the past 20 years, these devices have been applied to implicit parallelism
• We will see a shift toward explicit parallelism over the next 20 years
Technology Evolution
[Figure: gate length (µm), gate delay (ns), and wire pitch (µm) plotted on a log scale against year, 1960-2010, all decreasing steadily.]
Technology Evolution (2)
Parameter     1979        1999        2019         Units
Gate Length   5           0.2         0.008        µm
Gate Delay    3000        150         7.5          ps
Clock Cycle   200         2.5         0.08         ns
Gates/Clock   67          17          10
Wire Pitch    15          1           0.07         µm
Chip Edge     6           15          38           mm
Grids/Chip    1.6 x 10^5  2.3 x 10^8  3.0 x 10^11
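The two derived rows follow from the raw parameters above. A minimal Python sketch of that arithmetic (my own check, not from the slides), assuming a grid is a square one wire pitch on a side and Gates/Clock is simply clock cycle divided by gate delay:

    params = {
        1979: {"wire_pitch_um": 15,   "chip_edge_mm": 6,  "gate_delay_ps": 3000, "clock_ns": 200},
        1999: {"wire_pitch_um": 1,    "chip_edge_mm": 15, "gate_delay_ps": 150,  "clock_ns": 2.5},
        2019: {"wire_pitch_um": 0.07, "chip_edge_mm": 38, "gate_delay_ps": 7.5,  "clock_ns": 0.08},
    }

    for year, p in params.items():
        grids = (p["chip_edge_mm"] * 1000 / p["wire_pitch_um"]) ** 2   # grids per chip
        gates_per_clock = p["clock_ns"] * 1000 / p["gate_delay_ps"]    # gate delays per clock cycle
        print(f"{year}: grids/chip = {grids:.1e}, gates/clock = {gates_per_clock:.0f}")

This reproduces the table to within rounding.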
Architecture Evolution
Year   Microprocessor                        High-end Processor
1979   i8086: 0.5 MIPS, 0.001 MFLOPS         Cray 1: 70 MIPS, 250 MFLOPS
1999   Compaq 21264: 500 MIPS, 500 MFLOPS    (x 4?)
2019   X: 10,000 MIPS, 10,000 MFLOPS         MP with 1000 Xs
Incremental Returns
[Figure: performance versus processor cost (die area), showing diminishing gains from a pipelined RISC to a dual-issue in-order to a quad-issue out-of-order design.]
Efficiency and Granularity
[Figure: peak performance versus system cost (die area) for configurations P+M, 2P+M, and 2P+2M.]
VLSI in 1979
VLSI Architecture in 1979
• 5µm NMOS technology
• 6mm die size
• 100,000 grids per chip, 10,000 transistors
• 8086 microprocessor
  – 0.5 MIPS
1979-1989: Attack of the Killer Micros
• 50% per year improvement in performance
• Transistors applied to implicit parallelism
  – pipelined processors (10 CPI --> 1 CPI)
  – shorter clock cycles (67 gates/clock --> 30 gates/clock)
• In 1989 a 32-bit processor with floating point and caches fits on one chip
  – e.g., i860: 40 MIPS, 40 MFLOPS
  – 5,000,000 grids, 1M transistors (mostly memory)
1989-1999: The Era of Diminishing Returns
• 50% per year increase in performance through 1996, but
  – projects delayed, performance below expectations
  – 50% increase in grids, 15% increase in frequency (1.5 x 1.15 = ~72% total)
• Squeezing out the last implicit parallelism
  – 2-way to 6-way issue, out-of-order issue, branch prediction
  – 1 CPI --> 0.5 CPI, 30 gates/clock --> 20 gates/clock
• Convert data parallelism to ILP
• Examples
  – Intel Pentium II (3-way o-o-o)
  – Compaq 21264 (4-way o-o-o)
1979-1999: Why Implicit Parallelism?
• Opportunity
– large gap between micros and fastest processors
• Compatibility
– software pool ready to run on implicitly parallel
machines
• Technology
– not available for fine-grain explicitly parallel
machines
1999-2019: Explicit Parallelism Takes Over
• Opportunity
– no more processor gap
• Technology
– interconnection, interaction, and shared memory
technologies have been proven
Technology for Fine-Grain Parallel Machines
• A collection of workstations does not make a good parallel machine. (BLAGG)
  – Bandwidth - large fraction (0.1) of local memory BW
  – LAtency - small multiple (3) of local memory latency
  – Global mechanisms - sync, fetch-and-op
  – Granularity - of tasks (100 inst) and memory (8MB)
Technology for Parallel Machines
Three Components
• Networks
– 2 clocks/hop latency
– 8GB/s global bandwidth
• Interaction mechanisms
– single-cycle communication and synchronization
• Software
k-ary n-cubes
• Link bandwidth, B, depends on radix, k, for both wire- and pin-limited networks.
• Select the radix to trade off diameter, D, against B.
[Figure: latency T (0-70 cycles) versus dimension (0-12) for 4K nodes, L = 256, Bs = 16K; annotations relate T to the hop count, message length L, and channel bandwidth B, all set by the choice of k and n.]
Dally, “Performance Analysis of k-ary n-cube Interconnection Networks”, IEEE TC, 1990
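A rough Python sketch of the trade-off plotted above, in the spirit of the cited analysis: hop count falls with dimension while serialization time rises, because a fixed wire bisection forces narrower channels at higher dimension. The constants below (average hop count, channel-width scaling) are assumptions for illustration, not the slide's exact formula:

    N, L, Bs = 4096, 256, 16 * 1024        # nodes, message length (bits), wire bisection (bits)

    for n in range(1, 11):                  # network dimension
        k = round(N ** (1.0 / n))           # radix giving roughly N = k^n nodes
        D = n * (k - 1) / 2                 # average hop count, assuming unidirectional rings
        W = max(1, Bs * k // (4 * N))       # channel width under the fixed-bisection budget
        T = D + L / W                       # latency in cycles: hops plus serialization
        print(f"n={n:2d}  k={k:4d}  T={T:8.1f}")

With these assumptions the minimum latency falls at a low dimension, consistent with choosing the radix to balance D against B.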
Delay of Express Channels
The Torus Routing Chip
• k-ary n-cube topology
  – 2D torus network
  – 8-bit x 20MHz channels
• Hardware routing
• Wormhole routing
• Virtual channels
• Fully self-timed design
• Internal crossbar architecture
Dally and Seitz, “The Torus Routing Chip”, Distributed Computing, 1986
The Reliable Router
• Fault-tolerant
– Adaptive routing (adaptation of
Duato’s algorithm)
– Link-level retry
– Unique token protocol
• 32bit x 200MHz channels
– Simultaneous bidirectional
signalling
– Low latency plesiochronous
synchronizers
• Optimistic routing
Dally, Dennison, Harris, Kan, and Xanthopoulos, “Architecture and Implementation of the Reliable Router”, Hot Interconnects II, 1994
Dally, Dennison, and Xanthopoulos, “Low-Latency Plesiochronous Data Retiming,” ARVLSI 1995
Dennison, Lee, and Dally, “High Performance Bidirectional Signalling in VLSI Systems,” SIS 1993
Equalized 4Gb/s Signaling
End-to-End Latency
• Software sees ~10µs latency with a 500ns network
• Heavy compute load associated with sending a message
  – system call
  – buffer allocation
  – synchronization
• Solution: treat the network like memory, not like an I/O device
  – hardware formatting, addressing, and buffer allocation
[Figure: message path from the Tx node (Regs, Send) across the Net to the Rx node (Buffer, Dispatch).]
Network Summary
• We can build networks with 2-4 clocks/hop latency (12-24
clocks for a 512-node 3-cube)
– networks faster than main memory access of modern machines
– need end-to-end hardware support to see this, no ‘libraries’
• With high-speed signaling, bandwidth of 4GB/s or more
per channel (512GB/s bisection) is easy to achieve
– nearly flat memory bandwidth
• Topology is a matter of matching pin and bisection
constraints to the packaging technology
– it's hard to beat a 3-D mesh or torus
• This gives us B and LA (of BLAGG)
The Importance of Mechanisms
[Figure, built up over three slides: tasks A and B run serially; run in parallel with high overhead (OVH + COM + Sync, roughly 0.5 of the work), which erases most of the gain; and run in parallel with low overhead (roughly 0.062), which approaches the ideal speedup.]
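A small Python sketch of what these figures illustrate, using the overhead fractions quoted on the slides (0.5 and 0.062) in a simple additive model of my own:

    def two_way_speedup(overhead_fraction):
        serial = 2.0                          # run A and then B on one processor
        parallel = 1.0 + overhead_fraction    # A and B concurrently, plus OVH + COM + Sync
        return serial / parallel

    print(two_way_speedup(0.5))      # ~1.33x: high overhead erases most of the benefit
    print(two_way_speedup(0.062))    # ~1.88x: fast mechanisms approach the ideal 2x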
Granularity and Cost Effectiveness
• Parallel computers are built for
  – Capability - run problems that are too big or take too long to solve any other way
    • absolute performance at any cost
  – Capacity - get throughput on lots of small problems
    • performance/cost
• A parallel computer built from workstation-size nodes will always have lower perf/cost than a workstation
  – sublinear speedup
  – economies of scale
• A parallel computer with less memory per node can have better perf/cost than a workstation (see the sketch below)
[Figure: a single workstation node (P, M, $) alongside a parallel machine of four smaller nodes, each with its own P, M, and $.]
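A toy Python model of this perf/cost argument; the relative processor and memory costs and the parallel efficiency are invented for illustration:

    def perf_per_cost(nodes, mem_per_node_mb, efficiency):
        c_proc, c_mb = 100.0, 1.0                   # assumed relative costs of a processor and 1MB of DRAM
        cost = nodes * (c_proc + c_mb * mem_per_node_mb)
        perf = nodes * efficiency                   # sublinear speedup: efficiency < 1 in parallel
        return perf / cost

    print(perf_per_cost(1, 512, 1.0))    # workstation baseline
    print(perf_per_cost(64, 512, 0.7))   # workstation-size nodes: lower perf/cost than one workstation
    print(perf_per_cost(64, 8, 0.7))     # small-memory nodes: better perf/cost than one workstation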
MIT J-Machine (1991)
Exploiting fine-grain threads
• Where will the parallelism come
from to keep all of these processors
busy?
– ILP - limited to about 5
– Outer-loop parallelism
• e.g., domain decomposition
• requires big problems to get lots of
parallelism
• Fine threads
– make communication and
synchronization very fast (1 cycle)
– break the problem into smaller
pieces
– more parallelism
Mechanism and Granularity Summary
• Fast communication and synchronization mechanisms
enable fine-grain task decomposition
– simplifies programming
– exposes parallelism
– facilitates load balance
• Have demonstrated
– 1-cycle communication and synchronization locally
– 10-cycle communication, synchronization, and task dispatch
across a network
• Physically fine-grain machines have better
performance/cost than sequential machines
A 2009 Multicomputer
• Tile: processor (P) + 8MB memory
• Chip: 64 tiles
• System: 16 chips
[Figure: tile floorplan showing the processor and its 8MB of memory.]
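The totals implied by this configuration (my arithmetic; only the per-tile, per-chip, and per-system counts appear on the slide):

    tiles_per_chip, chips_per_system, mem_per_tile_mb = 64, 16, 8

    processors = tiles_per_chip * chips_per_system     # 1024 processor tiles
    memory_gb = processors * mem_per_tile_mb / 1024    # 8 GB of distributed DRAM
    print(processors, memory_gb)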
Challenges for the Explicitly Parallel Era
• Compatibility
• Managing locality
• Parallel software
Compatibility
• Almost no fine-grain parallel software exists
• Writing parallel software is easy
– with good mechanisms
• Parallelizing sequential software is hard
– needs to be designed from the ground up
• An incremental migration path
– run sequential codes with acceptable performance
– parallelize selected applications for considerable
speedup
Performance Depends on Locality
• Applications have data/time-dependent graph structure
– Sparse-matrix solution
• non-zero and fill-in structure
– Logic simulation
• circuit topology and activity
– PIC codes
• structure changes as particles move
– ‘Sort-middle’ polygon rendering
• structure changes as viewpoint
moves
Fine-Grain Data Migration
Drift and Diffusion
• Run-time relocation based on pointer use
– move data at both ends of pointer
– move control and data
• Each ‘relocation cycle’ (sketched below)
  – compute a drift vector based on pointer use
  – compute a diffusion vector based on density potential (Taylor)
  – need to avoid oscillations
• Should data be replicated?
– not just update vs. invalidate
– need to duplicate computation to avoid
communication
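A hedged Python sketch of one relocation cycle in this drift-and-diffusion style; the data layout, weights, and damping factor are illustrative assumptions, not the algorithm behind the slide:

    def relocation_cycle(objects, tile_load, w_drift=0.5, w_diff=0.25, damping=0.5):
        """objects: {id: {'tile': (x, y), 'uses': [(x, y), ...]}}, where 'uses' lists the
        tiles that dereferenced pointers to the object this cycle; tile_load: {(x, y): load}."""
        moves = {}
        for oid, obj in objects.items():
            x, y = obj['tile']
            # Drift: pull toward the tiles that actually used the object through pointers.
            if obj['uses']:
                dx_drift = sum(u[0] for u in obj['uses']) / len(obj['uses']) - x
                dy_drift = sum(u[1] for u in obj['uses']) / len(obj['uses']) - y
            else:
                dx_drift = dy_drift = 0.0
            # Diffusion: push down the local density (load) gradient toward emptier tiles.
            dx_diff = tile_load.get((x - 1, y), 0) - tile_load.get((x + 1, y), 0)
            dy_diff = tile_load.get((x, y - 1), 0) - tile_load.get((x, y + 1), 0)
            # Damping keeps steps small so objects do not oscillate between tiles.
            mx = damping * (w_drift * dx_drift + w_diff * dx_diff)
            my = damping * (w_drift * dy_drift + w_diff * dy_diff)
            moves[oid] = (x + round(mx), y + round(my))
        return moves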
Migration and Locality
[Figure: average distance (in tiles, 0-6) versus migration period (1-37) for four policies - NoMigration, OneStep, Hierarchy, and Mixed.]
Parallel Software:
Focus on the Real Problems
• Almost all demanding problems have ample
parallelism
• Need to focus on fundamental problems
– extracting parallelism
– load balance
– locality
• load balance and locality can be covered by excess parallelism
• Avoid incidental issues
– aggregating tasks to avoid overhead
– manually managing data movement and replication
– oversynchronization
Parallel Software:
Design Strategy
• A program must be designed for parallelism
from the ground up
– no bottlenecks in the data structures
• e.g., arrays instead of linked lists
• Data parallelism (see the sketch below)
  – many for loops (over data, not time) can be forall
  – break dependencies out of the loop
  – synchronize on natural units (no barriers)
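A minimal Python sketch of the 'for over data becomes forall' idea; the element update and the process-pool mapping are my own illustration, not the slides' notation:

    from concurrent.futures import ProcessPoolExecutor

    def update(cell):
        return 0.5 * cell + 1.0                # purely local work: no loop-carried dependence

    def step_sequential(cells):
        return [update(c) for c in cells]      # ordinary for loop over the data

    def step_forall(cells):
        with ProcessPoolExecutor() as pool:    # the same loop expressed as a forall
            return list(pool.map(update, cells))

    if __name__ == "__main__":
        cells = [float(i) for i in range(1000)]
        assert step_sequential(cells) == step_forall(cells)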
Conclusion: We are on the threshold of the
explicitly parallel era
• As in 1979, we expect a 1000-fold increase in ‘grids’
per chip in the next 20 years
• Unlike 1979, these ‘grids’ are best applied to explicitly
parallel machines
– Diminishing returns from sequential processors (ILP) - no
alternative to explicit parallelism
– Enabling technologies have been proven
• interconnection networks, mechanisms, cache coherence
– Fine-grain machines are more efficient than sequential
machines
• Fine-grain machines will be constructed from multiprocessor/DRAM chips
• Incremental migration to parallel software