Processor Design
CS 3220, Fall 2014
Hadi Esmaeilzadeh
hadi@cc.gatech.edu
Georgia Institute of Technology
ACT Lab: Alternative Computing Technologies

Hadi Esmaeilzadeh
From Khoy, Iran
PhD in CSE, University of Washington (advisors: Doug Burger and Luis Ceze)
2013 William Chan Memorial Dissertation Award
MSc in CS, The University of Texas at Austin
MSc and BSc in ECE, University of Tehran

Research: the ACT Lab (Alternative Computing Technologies)
– General-purpose approximate computing
– Bridging neuromorphic and von Neumann models of computing
– Analog computing
– System design for online machine learning
– System design for perpetual devices

Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment test

Objective
Learn principles of processor design
Learn hardware design and synthesis
– Verilog Hardware Description Language (HDL)
Learn how to benchmark and evaluate hardware
– Hardware/software interface (Instruction Set Architecture)
– Machine language and assembler
Build and operate your own processor
– Realize a pipelined processor on a real Field-Programmable Gate Array (FPGA)

Format
Project-based course that follows CS 2200
Lectures are the main source for exams and homework
There is no perfect textbook for this course!
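One objective above is realizing a pipelined processor. As a back-of-envelope aside on why pipelining is worth the trouble, here is a minimal Python sketch; the five-stage count and the no-hazard assumption are simplifications of mine, not course material.

```python
def unpipelined_cycles(n_instructions, n_stages=5):
    """Without pipelining, each instruction occupies the whole datapath
    before the next one may start."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=5):
    """With pipelining, stages overlap: once the pipeline fills, one
    instruction completes per cycle (ignoring hazards and stalls)."""
    return n_stages + n_instructions - 1

# For long instruction streams, speedup approaches the stage count.
speedup = unpipelined_cycles(1000) / pipelined_cycles(1000)
```

For 1000 instructions the ideal speedup is 5000/1004, just under 5x; real pipelines fall short of this ideal because of hazards and stalls.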
– Recommended reading: Digital Design and Computer Architecture, First Edition, by David Harris and Sarah Harris, 2007
Attendance is mandatory: three surprise quizzes

Prerequisites
CS 2110 and its prerequisites
– With a minimum grade of C
Basic digital design
– Build an adder with NAND gates
Basic processor design
– Design single-cycle, multi-cycle, and pipelined processors

Grading rubric
Component               Fraction
Class Participation     15%
Midterm Exam            10%
Final Exam              20%
Project Assignments     60%

Project Assignments
Groups of two
Each individual should be an expert in all aspects of the work
– Each individual submits a version of the project
– Demos are also done individually
Please DO NOT CHEAT! It is just not cool!
– Follow the Georgia Tech Academic Honor Code
– Ask me if you are not sure

Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment test

What has made computing pervasive?
What is the backbone of the computing industry?
Programmability
Networking

What makes computers programmable?

Von Neumann architecture
General-purpose processors
Components
– Memory (RAM)
– Central processing unit (CPU)
  • Control unit
  • Arithmetic logic unit (ALU)
– Input/output system
Memory stores program and data
Program instructions execute sequentially
– Program Counter (PC)

Programmability versus Efficiency
[Figure: a processor pipeline, with stages Fetch, Decode, Reg Read, Execute, Memory, Write Back, and structures including a branch predictor, I-cache, ITLB, decoder, register files, INT and FP functional units, D-cache, and DTLB]

Programmability versus Efficiency
The spectrum, from most programmable to most efficient:
General-Purpose Processors, SIMD Units, GPUs, FPGAs, ASICs

What is the difference between the computing industry and the tissue paper industry?
[Figure: industry of replacement versus industry of new capabilities, 1971 to 2014 to ?]

Can we continue being an industry of new capabilities?
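The von Neumann slides above boil down to one loop: memory holds both program and data, and the program counter (PC) selects the next instruction. A toy fetch-decode-execute sketch of that loop; the opcodes and memory layout are my own illustration, not the course ISA.

```python
def run(memory):
    """Fetch-decode-execute loop over a single memory that holds both
    instructions (tuples) and data (plain integers)."""
    acc, pc = 0, 0                     # accumulator and program counter
    while True:
        op, arg = memory[pc]           # fetch
        pc += 1                        # default: sequential execution
        if op == "LOAD":               # decode and execute
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "JUMP":             # the PC is just another register
            pc = arg
        elif op == "HALT":
            return memory

# Program and data share one memory: compute memory[4] + memory[5]
# and store the sum into memory[6].
mem = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0]
result = run(mem)
```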
Personalized healthcare
Virtual reality
Real-time translators

Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment test

Transistors/switches
Building blocks of computing

Moore's Law
Or, how we became an industry of new possibilities
Every two years, double the number of transistors
Build higher-performance general-purpose processors
– Make the transistors available to the masses
– Increase performance (1.8×↑)
– Lower the cost of computing (1.8×↓)

What is the catch?
Powering the transistors without melting the chip
[Chart: chip transistor count follows Moore's Law from 2,300 transistors in 1971 to 2,200,000,000 around 2010, while chip power grows from 0.5 W to 130 W over the same period]

Dennard scaling: double the transistors; scale their power down
Transistor: 2D voltage-controlled switch
Scale dimensions, voltage, and doping concentrations by ×0.7 each generation:
– Area: 0.5×↓
– Capacitance: 0.7×↓
– Frequency: 1.4×↑
– Power = Capacitance × Frequency × Voltage²: 0.5×↓

Dennard scaling broke: we still double the transistors, but we can no longer scale their power down
– Supply voltage has stopped scaling, so per-transistor power no longer halves each generation

Dark silicon
If you cannot power them, why bother making them?
Area still scales 0.5×↓, but power no longer does
Dark silicon: the fraction of transistors that must be powered off at all times due to power constraints

Looking back: evolution of processors
Single-core era: 740 KHz in 1971 to 3.4 GHz in 2003
Dennard scaling broke around 2004
Multicore era: 2004 onward, reaching 3.5 GHz by 2013

Are multicores a long-term solution or just a stopgap?

Agenda
1. Who Hadi is
2. Course organization
3. Why alternative computing technologies
   1. How we became an industry of new possibilities
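The Dennard-scaling arithmetic above can be checked directly from the power equation. A quick sketch; the 0.7× factor is the slide's, the variable names are mine.

```python
k = 0.7                          # dimensions and voltage shrink ~0.7x/gen

area        = k * k              # 0.49x, the slide's ~0.5x down
capacitance = k                  # 0.7x down
frequency   = 1 / k              # ~1.4x up
voltage     = k                  # 0.7x down

# Power = Capacitance x Frequency x Voltage^2
power_per_transistor = capacitance * frequency * voltage ** 2   # ~0.49x

# Twice the transistors in the same chip area, at the same total power:
transistors = 1 / area                            # ~2x
chip_power  = transistors * power_per_transistor  # ~1.0: budget holds
```

Setting `voltage = 1` instead, which is roughly the post-Dennard reality, leaves `power_per_transistor` at about 1.0, so doubling the transistor count doubles chip power. That is the dark-silicon problem in one line of arithmetic.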
   2. Why we might become an industry of replacement
4. Possible alternative computing technologies
5. Pre-assessment test

Modeling future multicores: quantify the severity of the problem
Predict the performance of best-case multicores
– From 45 nm to 8 nm
– Parallel benchmarks
– Fixed power and area budget
Transistor scaling model, then single-core scaling model, then multicore scaling model
Esmaeilzadeh, Blem, St. Amant, Sankaralingam, Burger, "Dark Silicon and the End of Multicore Scaling," ISCA 2011

Transistor scaling model, from 45 nm to 8 nm
[Dennard, 1974] [ITRS, 2010] [VLSI-DAT, 2010]
        Historical scaling   Optimistic model   Conservative model
Area    32×↓                 32×↓               32×↓
Power   32×↓                 8.3×↓              4.5×↓
Speed   5.7×↑                3.9×↑              1.3×↑

Multicore scaling model, from 45 nm to 8 nm
– Single-core search space (scaled area and power Pareto frontiers)
– Constraints (area and power budget)
– Application characteristics (% parallel, % memory accesses)
– Multicore organization: CPU-like, GPU-like (number of HW threads, cache sizes)
– Multicore topology (symmetric, asymmetric, dynamic, composable)
– Microarchitectural features (cache and memory latencies, CPI, memory bandwidth)
Exhaustive search of the multicore design space (800 design points examined per technology node)

[Chart: projected performance improvement over the 45 nm baseline, across 45/32/22/16/11/8 nm. The historical trend reaches 18× over 10 years; the optimistic transistor-scaling projection reaches 7.9×, the conservative projection 3.7×]
[Chart: projected dark silicon fraction across the same nodes, growing from about 1% to 51% at 8 nm]

Industry of replacement?
Multicores are likely to be a stopgap
– Not likely to continue the historical trends
– Do not overcome the transistor scaling trends
– The performance gap is significantly large
Radical departures from conventional approaches are necessary
– Extract more performance and efficiency from silicon while preserving programmability
– Explore other models of computing

Agenda
1. Who is Hadi
2. Course organization
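A crude bound on dark silicon follows from the transistor scaling table above: under a fixed power budget, the fraction of the chip that can be lit is at most the power improvement divided by the area improvement. The sketch below is my simplification; the paper's full model, which also tunes frequency and voltage per design point, projects a lower 51% dark at 8 nm.

```python
def dark_fraction(area_scaling, power_scaling):
    """Fraction of transistors that must stay off under a fixed power
    budget, assuming every lit transistor runs at full tilt."""
    lit = min(1.0, power_scaling / area_scaling)
    return 1.0 - lit

# From 45 nm to 8 nm, using the transistor scaling table:
dennard      = dark_fraction(32.0, 32.0)  # historical: power kept pace
optimistic   = dark_fraction(32.0, 8.3)   # post-Dennard, optimistic
conservative = dark_fraction(32.0, 4.5)   # post-Dennard, conservative
```

Under Dennard scaling nothing goes dark; once power stops keeping pace with area, most of the chip does.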
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment test

Possible paths forward
– Do nothing (easy for me!)
– Specialization and co-design (my teaching!)
– Approximate computing (my research!)
– Software bloat reduction
– Technology breakthrough
– Biological computing, quantum computing (way long term!)

Approximate computing: embracing error
Relax the abstraction of "near-perfect" accuracy in general-purpose computing
Allow errors to happen in the computation
– Run faster
– Run more efficiently

New landscape of computing
Personalized and targeted computing

Classes of approximate applications
Programs with analog inputs
– Sensors, scene reconstruction
Programs with analog outputs
– Multimedia
Programs with multiple possible answers
– Web search, machine learning
Convergent programs
– Gradient descent, big data analytics

Adding a third dimension: embracing error
[Figure: the processor Pareto frontier in the energy-performance plane, with error added as a third axis]

Adding the dimension of error: finding the Pareto surface
[Figure: the Pareto surface over energy, performance, and error; projects include Truffle [ASPLOS '12], NPUs [MICRO '12] [ISCA '14], and R2 and RFVP [ASPLOS '15]; best point: speed 3.7×↑, energy 6.3×↓, quality loss 10%]

Parrot algorithmic transformation
[Figure: an approximable region of code is replaced by a learned model, which runs on an accelerator alongside the core]

Neural networks for code approximation
Powerful prediction tools
Highly parallel
Efficiently implementable in hardware
– Both digital and analog
Fault tolerant

NPU design alternatives (CPU paired with a Neural Processing Unit)
FPGA NPU:         Speed 1.8×↑, Energy 1.7×↓, Quality 10%↓
Digital ASIC NPU: Speed 2.3×↑, Energy 3.0×↓, Quality 10%↓
Analog ASIC NPU:  Speed 3.7×↑, Energy 6.3×↓, Quality 10%↓
(GPU and FPAA designs were also considered.)

Approximate computing versus conventional computing
Possible paths forward, revisited: specialization and co-design (my teaching!) versus approximate computing (my research!)
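The Parrot transformation above trains a neural network to mimic an approximable region of code. The same idea in miniature, with a coarse interpolation table standing in for the learned model; the target function, sample count, and 1% error check are all my choices for illustration, and the real work runs a small neural network on an NPU instead.

```python
import math

def hot_function(x):
    """The exact 'hot' region of code that would be targeted."""
    return math.sin(x) * math.exp(-x)

# "Training": sample the region's input-output behavior on [0, 1].
N = 16
YS = [hot_function(i / (N - 1)) for i in range(N)]

def learned_model(x):
    """Cheap approximate stand-in: linear interpolation over the
    stored samples."""
    t = min(max(x, 0.0), 1.0) * (N - 1)
    i = min(int(t), N - 2)
    frac = t - i
    return YS[i] * (1 - frac) + YS[i + 1] * frac

# Quality check, in the spirit of the 10% quality-loss budget above:
max_err = max(abs(learned_model(i / 100) - hot_function(i / 100))
              for i in range(101))
```

The approximate version trades a small, measured quality loss for a model that is far cheaper to evaluate than the original code region.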
Programmability versus Efficiency
The spectrum, from most programmable to most efficient:
General-Purpose Processors, SIMD Units, GPUs, FPGAs, ASICs

Large-Scale Reconfigurable Computing in a Microsoft Datacenter
Microsoft cloud services: capabilities and costs
[Figure: options for accelerating a service on a Xeon CPU + NIC server. Per-application accelerators such as a math accelerator, or a hardened search accelerator ASIC, mean wasted power, one more thing that can break, and hold back software; a search accelerator on an FPGA can be reconfigured in place to a new version (Search Acc. v2)]

Integrating FPGAs into the datacenter
Microsoft Open Compute Server
– Two 8-core Xeon 2.1 GHz CPUs
– 64 GB DRAM
– 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
– 10 Gb Ethernet
– No cable attachments to server
– 68 °C

Catapult FPGA accelerator card
– Altera Stratix V GS D5
  • 172k ALMs, 2,014 M20Ks, 1,590 DSPs
– 8 GB DDR3-1333
– 32 MB configuration flash
– PCIe Gen 3 x8
– 8 lanes to Mini-SAS SFF-8088 connectors
– 4× 20 Gbps torus network links
– Powered by the PCIe slot

Board details
– 16 layers, FR408
– 9.5 cm × 8.8 cm × 115.8 mil
– 35 mm × 35 mm FPGA package
– 14.2 mm high heatsink
– 1U mezzanine connector

Scalable reconfigurable fabric
– 1 FPGA board per server
– 48 servers per half rack
– 6×8 torus network among the FPGAs
– 20 Gb over SAS SFF-8088 cables
– Datacenter server (1U, half width)

Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment test