Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)

Mark D. Hill
Computer Sciences Dept. and Electrical & Computer Engineering Dept.
University of Wisconsin—Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
October 2004
Full Disclosure: Consult for Sun & US NSF
© 2004 Mark D. Hill, Wisconsin Multifacet Project

Executive Summary: Problem
• Expect computer performance to double every 2 years
• Derives from Technology & Architecture
• Technology will advance for ten or more years
• But Architecture faces a Rock: Slow Memory
  – a.k.a. the Memory Wall [Wulf & McKee 1995]
• Prediction: the Popular Moore's Law (doubling performance) will end soon, regardless of the real Moore's Law (doubling transistors)

Executive Summary: Recommendation
• Chip Multiprocessing (CMP) Can Help
  – Implement multiple processors per chip
  – >>10x cost-performance for multithreaded workloads
  – But what about software with one apparent thread?
• Go to the Hard Place: Mainstream Multithreading
  – Make most workloads flourish with chip multiprocessing
  – Computer architects can help, but only in the long run
  – Requires moving multithreading from the fringe of computer science to its center (algorithms, programming languages, …, hardware)
• Necessary for restoring the Popular Moore's Law

Outline
• Executive Summary
• Background
  – Moore's Law
  – Architecture
  – Instruction Level Parallelism
  – Caches
• Going Forward: Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
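The compounding behind the summary's "performance doubles every 2 years" is worth making explicit; a minimal sketch (the function name and the 10-year horizon are illustrative, not from the slides):

```python
# Sketch: compounding of the "Popular Moore's Law" claim.
# Performance doubling every 2 years gives 2**(10/2) = 32x over a decade,
# so the corollary cost for fixed performance falls to 1/32, about 3%.

def popular_moores_law(years, doubling_period=2):
    """Return the performance multiple after `years`."""
    return 2 ** (years / doubling_period)

perf_10yr = popular_moores_law(10)   # 32.0
cost_fraction = 1 / perf_10yr        # 0.03125, i.e. about 3%
```

This is the arithmetic behind Jim Gray's "sales tax" remark quoted on a later slide.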
Society Expects a Popular Moore's Law
• Computing is critical: commerce, education, engineering, entertainment, government, medicine, science, …
  – Servers (> PCs)
  – Clients (= PCs)
  – Embedded (< PCs)
• We have come to expect a misnamed "Moore's Law"
  – Computer performance doubles every two years (same cost)
  – Progress in the next two years = all past progress
• Important Corollary
  – Computer cost halves every two years (same performance)
  – In ten years, same performance for 3% (sales tax – Jim Gray)
• Derives from Technology & Architecture

(Technologist's) Moore's Law Provides Transistors
• Number of transistors per chip doubles every two years (sometimes quoted as 18 months)
• Merely a "Law" of business psychology

Performance from Technology & Architecture
[Figure reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers.]

Architects Use Transistors To Compute Faster
• Bit Level Parallelism (BLP) within instructions
• Instruction Level Parallelism (ILP) among instructions
• Scores of speculative instructions look sequential!
[Figure: instructions vs. time, with and without parallelism]

Architects Use Transistors To Tolerate Slow Memory
• Cache
  – Small, fast memory
  – Holds information (expected) to be used soon
  – Mostly successful
• Apply recursively
  – Level-one cache(s)
  – Level-two cache
• Most of a microprocessor's die area is cache!

Outline
• Executive Summary
• Background
• Going Forward: Processor Architecture Hits the Rock
  – Technology Continues
  – Slow Memory
  – Implications
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
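The "apply recursively" idea on the cache slide is usually quantified as average memory access time (AMAT), where the L1 miss penalty is itself the AMAT of the L2. A minimal sketch; all latencies and miss rates below are illustrative assumptions, since the slides give no numbers:

```python
# Sketch: AMAT for the two-level cache hierarchy described on the slide.
# AMAT = hit time + miss rate * miss penalty, applied per cache level.

def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for one cache level, in cycles."""
    return hit_time + miss_rate * miss_penalty

memory_latency = 300  # cycles: the "slow memory" Rock (assumed value)

# Apply recursively: L1's miss penalty is L2's AMAT.
l2 = amat(hit_time=12, miss_rate=0.10, miss_penalty=memory_latency)  # ~42 cycles
l1 = amat(hit_time=1,  miss_rate=0.05, miss_penalty=l2)              # ~3.1 cycles
```

Even with a 300-cycle memory, the recursive hierarchy keeps the average access near the L1 hit time, which is why "most of the die area is cache" pays off.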
Future Technology Implications
• For (at least) ten years, Moore's Law continues
  – More repeated doublings of the number of transistors per chip
  – Faster transistors
• But hard for processor architects to use
  – More transistors: hard to exploit due to global wire delays
  – Faster transistors: hard to exploit due to too much dynamic power
• Moreover, hitting a Rock: Slow Memory
  – One memory access = hundreds of floating-point multiplies!
  – a.k.a. the Memory Wall [Wulf & McKee 1995]

Rock: Memory Gets (Relatively) Slower
[Figure reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers.]

Impact of Slow Memory (Rock)
• Off-chip misses now cost hundreds of cycles
• Good case: execution alternates between compute phases and memory phases
• More realistic case: independent instructions (I1, I3, I4) proceed while a miss (I2) stalls, limited by the instruction window
[Figure: instructions vs. time for both cases; window = 4 (64)]

Implications of Slow Memory (Rock)
• Increasing memory latency hides the compute phase
• Near-term implications
  – Reduce memory latency
  – Fewer memory accesses
  – More Memory Level Parallelism (MLP)
• Longer-term implications
  – What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
  – What can amazing speculative hardware do?

Assessment So Far
• It appears the Popular Moore's Law (doubling performance) will end soon, regardless of the real Moore's Law (doubling transistors)
• Processor performance is hitting the Rock (Slow Memory)
• No known way to overcome this, unless we
• Redefine performance in the Popular Moore's Law
  – From processor performance
  – To chip performance

Outline
• Executive Summary
• Background
• Going Forward: Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
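The "Impact of Slow Memory" slides can be approximated by a simple model: total time = compute cycles + serialized miss cycles, where Memory Level Parallelism (MLP) divides the miss term by the number of overlapped misses. A sketch with assumed numbers (the slides only say misses cost "hundreds of cycles"):

```python
# Sketch: why MLP matters. Instruction count, CPI, miss count, and
# latency are illustrative assumptions, not figures from the talk.

def exec_cycles(instructions, cpi_compute, misses, miss_latency, mlp=1):
    """Total cycles = compute cycles + stall cycles.
    Overlapping `mlp` misses at a time divides the stall time."""
    compute = instructions * cpi_compute
    memory = (misses / mlp) * miss_latency
    return compute + memory

serial = exec_cycles(1000, 1, misses=10, miss_latency=300)         # 1000 + 3000 = 4000
mlp4   = exec_cycles(1000, 1, misses=10, miss_latency=300, mlp=4)  # 1000 + 750 = 1750
```

With serialized misses, 75% of the time here is memory stall, which is the sense in which "increasing memory latency hides the compute phase"; raising MLP is one of the few near-term levers the slide lists.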
  – Small & Large CMPs
  – CMP Systems
  – CMP Workload
• Go to the Hard Place of Mainstream Multithreading

Performance for the Chip, not the Processor or Thread
• Chip Multiprocessing (CMP)
• Replicate the processor
• Private L1 caches
  – Low latency
  – High bandwidth
• Shared L2 cache
  – Larger than if private

Piranha Processing Node
(The next few slides are from Luiz Barroso's ISCA 2000 presentation of "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing"; each slide adds a component to the node diagram.)
• Alpha cores: eight 1-issue, in-order, 500MHz CPUs
• L1 caches: I & D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controllers (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec)
Piranha Processing Node (continued)
• Protocol Engines (HE & RE): programmable, 1K instructions, even/odd interleaving
• Router & System Interconnect: 4 links @ 8GB/s; 4-port crossbar router; topology independent; 32GB/sec total bandwidth
Single-Chip Piranha Performance
[Chart: normalized execution time (CPU / L2Hit / L2Miss components) for P1 (500MHz, 1-issue), INO (1GHz, 1-issue), OOO (1GHz, 4-issue), and P8 (500MHz, 1-issue, 8 cores) on OLTP and DSS; bar values include 350, 233, 191, 145, 100, 44, and 34]
• Piranha's performance margin: 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes the memory system

Simultaneous Multithreading (SMT)
• Multiplex S logical processors on each physical processor
  – Replicate registers; share caches; manage other parts
  – Implementation factors keep S small, e.g., 2-4
• Cost-effective gain if threads are available
  – E.g., S=2 → ~1.4x performance
• Modest cost
  – Limits waste if additional logical processor(s) go unused
• A worthwhile CMP enhancement

Small CMP Systems
• Use one CMP (with C cores of S-way SMT)
  – C = [2,16] & S = [2,4] → C*S = [4,64] hardware threads
  – The size of a small PC!
• Directly connect the CMP (C) to the Memory Controller (M) or DRAM
[Figure: one CMP chip C attached to memory M]

Medium CMP Systems
• Use 2-16 CMPs (with C cores of S-way SMT)
  – Smaller: 2*4*4 = 32 hardware threads
  – Larger: 16*16*4 = 1024 hardware threads
  – In a single cabinet
• Connecting CMPs & memory controllers/DRAM raises many issues
[Figure: "Dance Hall" vs. "Processor-Centric" arrangements of CMPs (C) and memory (M)]
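The system-sizing slides above multiply chips × cores × SMT ways, and the SMT slide's "S=2 → ~1.4x" implies each logical processor contributes less than a full core. A sketch of that arithmetic (the helper name is mine):

```python
# Sketch: hardware thread counts for the small/medium CMP systems on
# the slides, plus SMT's sub-linear throughput per logical processor.

def hw_threads(chips, cores_per_chip, smt_ways):
    """Total hardware threads = chips * cores/chip * SMT ways/core."""
    return chips * cores_per_chip * smt_ways

small_max = hw_threads(1, 16, 4)    # 64: upper end of a single-CMP system
medium_lo = hw_threads(2, 4, 4)     # 32: smaller medium system (2*4*4)
medium_hi = hw_threads(16, 16, 4)   # 1024: larger medium system (16*16*4)

# The slide's S=2 -> ~1.4x total throughput means each logical
# processor runs at roughly 0.7x of a dedicated core.
smt_speedup = 1.4
per_logical = smt_speedup / 2       # 0.7
```

The sub-linear per-thread figure is why the slide calls SMT a "modest cost" enhancement rather than a substitute for more cores.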
Inflection Points
• An inflection point occurs when a smooth input change leads to a disruptive output change
• Enough transistors for …
  – 1970s: simple microprocessor
  – 1980s: pipelined RISC
  – 1990s: speculative out-of-order
  – 2000s: …
• CMP will be the server inflection point
  – Expect >10x performance for less cost
  – Implying >>10x cost-performance
  – Early CMPs look like old SMPs, but expect dramatic advances!

So What's Wrong with the CMP Picture?
• Chip Multiprocessors
  – Allow profitable use of more transistors
  – Support modest to vast multithreading
  – Will be an inflection point for commercial servers
• But
  – Many workloads have a single thread (available to run)
  – Even if that single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)
• Go to a Hard Place
  – Make most workloads flourish with CMPs

Outline
• Executive Summary
• Background
• Going Forward: Processor Architecture Hits the Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
  – Parallel from Fringe to Center
  – For All of Computer Science!

Thread Parallelism from Fringe to Center
• History
  – Automatic Computer (vs. Human Computer) → just "Computer"
  – Digital Computer (vs. Analog Computer) → just "Computer"
• Must change
  – Parallel Computer (vs. Sequential) → just "Computer"
  – Parallel Algorithm (vs. Sequential) → just "Algorithm"
  – Parallel Programming (vs. Sequential) → just "Programming"
  – Parallel Library (vs. Sequential) → just "Library"
  – Parallel X (vs. Sequential) → just "X"
• Otherwise, repeated performance doublings are unlikely
Computer Architects Can Contribute
• Chip multiprocessor design
  – Transcend pre-CMP multiprocessor design
  – Intra-CMP has lower latency & much higher bandwidth
• Hide multithreading (helper threads)
• Assist multithreading (thread-level speculation)
• Ease multithreaded programming (transactions)
• Provide a "gentle ramp to parallelism" (Hennessy)

But All of Computer Science is Needed
• Hide multithreading (libraries & compilers)
• Assist multithreading (development environments)
• Ease multithreaded programming (languages)
• Divide & conquer multithreaded complexity (theory & abstractions)
• Must enable
  – 99% of programmers to think sequentially, while
  – 99% of instructions execute in parallel
• Enable a "Parallelism Superhighway"

Summary
• (Single-Threaded) Computing faces a Rock: Slow Memory
• The Popular Moore's Law (doubling performance) will end soon
• Chip Multiprocessing Can Help
  – >>10x cost-performance for multithreaded workloads
  – But what about software with one apparent thread?
• Go to the Hard Place: Mainstream Multithreading
  – Make most workloads flourish with chip multiprocessing
  – Computer architects can help, but only in the long run
  – Requires moving multithreading from the fringe of computer science to its center
• Necessary for restoring the Popular Moore's Law
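One modern illustration of the "programmers think sequentially while instructions execute in parallel" goal is a parallel map: the programmer writes an ordinary function and an ordinary iteration, and a runtime library farms the calls out to threads. A minimal sketch (this API postdates the talk and is only one possible "gentle ramp"):

```python
# Sketch: sequential-looking code, parallel execution. The programmer
# reasons about `work` one call at a time; the pool runs calls
# concurrently and still returns results in input order.
from concurrent.futures import ThreadPoolExecutor

def work(n):
    # Stand-in for a task whose stalls (memory, I/O) a sibling
    # thread on another core or SMT context could hide.
    return n * n

items = range(8)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(work, items))  # order preserved

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The point of the closing slides is that this "sequential surface, parallel substrate" property must extend well beyond libraries, into languages, tools, and theory, before most workloads flourish on CMPs.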