Processor Design
CS 3220
Fall 2014
Hadi Esmaeilzadeh
hadi@cc.gatech.edu
Georgia Institute of Technology
Alternative Computing Technologies (ACT)
Hadi Esmaeilzadeh
From Khoy, Iran
2
PhD in CSE, University of Washington
Doug Burger and Luis Ceze
2013 William Chan Memorial Dissertation Award
MSc in CS, The University of Texas at Austin
MSc and BSc in ECE, University of Tehran
3
Research: ACT Lab
Alternative Computing Technologies
 General-purpose approximate computing
 Bridging neuromorphic and von Neumann
models of computing
 Analog computing
 System design for online machine learning
 System design for perpetual devices
4
Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA Design
4. Pre-assessment Test
5
Objective
 Learn principles of processor design
 Learn hardware design and synthesis
– Verilog Hardware Description Language (HDL)
 Learn how to benchmark and evaluate hardware
– Hardware/software interface (Instruction Set Architecture)
– Machine language and assembler
 Build and operate your own processor
– Realize a pipelined processor on a real Field-Programmable Gate Array (FPGA)
6
Format
 Project-based course that follows CS 2200
 Lectures are the main source for exams and
homework
 There is no perfect textbook for this course!
– Recommended reading:
First Edition of Digital Design and Computer
Architecture by David Harris and Sarah Harris, 2007
 Attendance is mandatory: three surprise quizzes
7
Prerequisites
 CS 2110 and its prerequisites
– With minimum grade of C
 Basic digital design
– Build an adder with NAND gates
 Basic processor design
– Design single-cycle, multi-cycle, pipelined
processors
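As a refresher for the "adder with NAND gates" prerequisite, here is a minimal Python sketch (a software model, not Verilog and not course material). It builds a full adder from nine two-input NAND gates and chains it into a ripple-carry adder. The names `nand`, `full_adder`, and `ripple_add` and the 8-bit default width are illustrative assumptions.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on single bits (0 or 1)."""
    return 1 - (a & b)


def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Full adder built from nine NAND gates; returns (sum, carry_out)."""
    t1 = nand(a, b)
    t2 = nand(a, t1)
    t3 = nand(b, t1)
    s1 = nand(t2, t3)      # a XOR b (four-NAND XOR)
    t4 = nand(s1, cin)
    t5 = nand(s1, t4)
    t6 = nand(cin, t4)
    total = nand(t5, t6)   # (a XOR b) XOR cin
    cout = nand(t1, t4)    # (a AND b) OR ((a XOR b) AND cin)
    return total, cout


def ripple_add(x: int, y: int, width: int = 8) -> tuple[int, int]:
    """Add two unsigned integers bit by bit; returns (result, final carry)."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


if __name__ == "__main__":
    print(ripple_add(100, 55))
```

The carry chain is what makes this a ripple-carry design: bit i cannot produce its sum until the carry from bit i-1 has settled.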
8
Grading rubric
Component             Fraction
Class Participation   15%
Midterm Exam          10%
Final Exam            20%
Project Assignments   60%
9
Project Assignments
 Groups of two
 Each individual should be expert in all aspects
of the work
– Each individual submits a version of the project
– Demos are also done individually
 Please DO NOT CHEAT! It is just not cool!
– Follow the Georgia Tech Academic Honor Code
– Ask me if you are not sure
10
Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA design
4. Pre-assessment Test
11
What has made computing
pervasive? What is the backbone
of the computing industry?
12
Programmability
Networking
13
What makes computers
programmable?
14
Von Neumann architecture
General-purpose processors
 Components
– Memory (RAM)
– Central processing unit (CPU)
• Control unit
• Arithmetic logic unit (ALU)
– Input/output system
 Memory stores program and data
 Program instructions execute sequentially
– The program counter (PC) holds the address of the next instruction
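To make the stored-program idea concrete, here is a minimal Python sketch of a von Neumann style machine. One memory holds both instructions and data, and a program counter steps through instructions sequentially. The four-instruction accumulator ISA (`LOAD`, `ADD`, `STORE`, `HALT`) and the function name `run` are hypothetical, chosen only for illustration; this is not the ISA used in the course.

```python
def run(memory: list) -> list:
    """Execute the program stored in `memory` until HALT.

    Instructions are (opcode, address) tuples; data words are plain ints.
    Both live in the same list, as in a von Neumann machine.
    """
    pc = 0    # program counter: address of the next instruction
    acc = 0   # single accumulator register
    while True:
        opcode, address = memory[pc]   # fetch
        pc += 1                        # default: next instruction in sequence
        if opcode == "LOAD":           # decode and execute
            acc = memory[address]
        elif opcode == "ADD":
            acc += memory[address]
        elif opcode == "STORE":
            memory[address] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")


if __name__ == "__main__":
    program = [
        ("LOAD", 4),    # address 0
        ("ADD", 5),     # address 1
        ("STORE", 6),   # address 2
        ("HALT", 0),    # address 3
        2,              # address 4: data
        3,              # address 5: data
        0,              # address 6: result goes here
    ]
    print(run(program)[6])
```

Because instructions and data share one memory, changing the program is just a matter of writing different words into it, which is the source of programmability.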
15
Programmability versus Efficiency
[Figure: general-purpose processor pipeline. Stages: Fetch, Decode, Reg Read, Execute, Memory, Write Back. Structures: Branch Predictor, I Cache, ITLB, Decoder, Register File, INT FU, FP FU, D Cache, DTLB.]
16
Programmability versus Efficiency
[Figure: spectrum from most programmable to most efficient: General-Purpose Processors, SIMD Units, GPUs, FPGAs, ASICs]
17
What is the difference between the
computing industry and the tissue
paper industry?
18
Industry of replacement
1971
2014
?
Industry of new capabilities
19
Can we continue being an
industry of new capabilities?
Personalized healthcare
Virtual reality
Real-time translators
20
Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA Design
4. Pre-assessment Test
21
Transistors/switches
Building blocks of computing
22
Moore’s Law
Or, how we became an industry of new possibilities
Every 2 Years
 Double the number of transistors
 Build higher performance
general-purpose processors
– Make the transistors available to the masses
– Increase performance (1.8×↑)
– Lower the cost of computing (1.8×↓)
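The compounding behind these bullets can be sketched in a few lines of Python. The helper `compound_growth` is a hypothetical name. The starting point of 2,300 transistors in 1971 is the Intel 4004, and the two-year period and the 2× and 1.8× factors come from the slide. Real chips do not follow the curve exactly, so treat the result as an idealized projection.

```python
def compound_growth(start: float, factor: float, years: float,
                    period_years: float = 2.0) -> float:
    """Value after `years` if it is multiplied by `factor` every period."""
    return start * factor ** (years / period_years)


if __name__ == "__main__":
    # Idealized transistor count, doubling every 2 years from 1971.
    for year in (1971, 1991, 2013):
        count = compound_growth(2300, 2.0, year - 1971)
        print(year, f"{count:,.0f} transistors")

    # Idealized performance gain at 1.8x per 2-year generation.
    print(f"{compound_growth(1.0, 1.8, 10):.1f}x after 10 years")
```

Small per-generation factors compound quickly: five generations at 1.8× is already more than an order of magnitude.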
23
What is the catch?
Powering the transistors without melting the chip
[Figure: chip transistor count and chip power versus year, 1970 to 2015, on a log scale. Transistor count follows the Moore's Law trend from 2,300 to 2,200,000,000; chip power rises from 0.5 W to 130 W.]
24
Dennard scaling:
Doubling the transistors; scale their power down
Transistor: 2D Voltage-Controlled Switch
Scaled each generation: Dimensions (×0.7), Voltage (×0.7), Doping Concentrations
Area: 0.5×↓
Capacitance: 0.7×↓
Frequency: 1.4×↑
Power: 0.5×↓
Power = Capacitance × Frequency × Voltage²
25
Dennard scaling broke:
Double the transistors; still scale their power down
Transistor: 2D Voltage-Controlled Switch
Scaled each generation: Dimensions (×0.7), Voltage (×0.7), Doping Concentrations
Area: 0.5×↓
Capacitance: 0.7×↓
Frequency: 1.4×↑
Power: 0.5×↓
Power = Capacitance × Frequency × Voltage²
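The arithmetic behind the Dennard scaling slides can be checked with a short Python sketch. It applies Power = Capacitance × Frequency × Voltage² for one generation. The function name `scale_one_generation` and its parameters are hypothetical. The second call, where voltage stops scaling, is an assumption about one way the scaling can break; the slide text itself does not give the cause.

```python
def scale_one_generation(dim_scale: float = 0.7,
                         voltage_scale: float = 0.7) -> dict:
    """Per-transistor scaling factors for one technology generation."""
    area = dim_scale ** 2              # both dimensions shrink
    capacitance = dim_scale            # scales with dimensions
    frequency = 1.0 / dim_scale        # shorter channel, faster switching
    power = capacitance * frequency * voltage_scale ** 2
    return {
        "area": area,
        "capacitance": capacitance,
        "frequency": frequency,
        "power": power,
        "power_density": power / area,  # power per unit chip area
    }


if __name__ == "__main__":
    ideal = scale_one_generation()
    stalled = scale_one_generation(voltage_scale=1.0)  # assumed: voltage stuck
    for name, result in (("ideal", ideal), ("voltage stuck", stalled)):
        print(name, {k: round(v, 2) for k, v in result.items()})
```

In the ideal case, power per transistor falls as fast as area, so twice as many transistors fit in the same power budget. When voltage stops scaling, power density rises instead.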
26
Dark silicon
If you cannot power them, why bother making them?
Area
Power
0.5×↓
0.5×↓
Dark Silicon: the fraction of transistors that need to be powered off at all times due to power constraints
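A back-of-envelope Python sketch shows how this fraction grows under a fixed chip power budget. It is a deliberately simple model, not the multicore model from the ISCA 2011 paper cited later in the deck. The function name `dark_fraction` is hypothetical, and the 0.7 per-generation power factor in the example is an illustrative assumption rather than a figure from the slides.

```python
def dark_fraction(generations: int,
                  power_per_transistor_scale: float,
                  transistor_growth: float = 2.0) -> float:
    """Fraction of transistors that must stay off under a fixed power budget.

    Assumes the chip exactly fills its power budget at generation 0 and
    that every active transistor draws the same power.
    """
    transistors = transistor_growth ** generations
    power_each = power_per_transistor_scale ** generations
    demand = transistors * power_each   # relative to a budget of 1.0
    if demand <= 1.0:
        return 0.0                      # everything can be powered
    return 1.0 - 1.0 / demand


if __name__ == "__main__":
    for gens in range(6):
        ideal = dark_fraction(gens, 0.5)    # Dennard: power halves
        slowed = dark_fraction(gens, 0.7)   # assumed slower power scaling
        print(gens, f"ideal={ideal:.0%}", f"slowed={slowed:.0%}")
```

With ideal scaling the demand stays at the budget and nothing goes dark. Once per-transistor power falls more slowly than transistor count grows, the dark fraction compounds every generation.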
27
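Looking back
Evolution of processors
[Figure: processor timeline. Labels: 740 kHz (1971), 3.4 GHz, 3.5 GHz (2013), with 2003 and 2004 marked near the point where Dennard scaling broke and the Single-core Era gave way to the Multicore Era.]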
28
Are multicores a long-term
solution or just a stopgap?
29
Agenda
1. Who Hadi is
2. Course organization
3. Why alternative computing technologies
   1. How we became an industry of new possibilities
   2. Why we might become an industry of replacement
4. Possible alternative computing technologies
5. Pre-assessment Test
30
Modeling future multicores
Quantify the severity of the problem
Predict the performance of best-case multicores
– From 45 nm to 8 nm
– Parallel benchmarks
– Fixed power and area budget
Models: Transistor Scaling Model, Single-Core Scaling Model, Multicore Scaling Model
Esmaeilzadeh, Blem, St. Amant, Sankaralingam, Burger, “Dark Silicon and the End of Multicore Scaling,” ISCA 2011
31
Transistor scaling model
From 45 nm to 8 nm
         Historical Scaling    Optimistic Scaling Model    Conservative Scaling Model
         [Dennard, 1974]       [ITRS, 2010]                [VLSI-DAT, 2010]
Area     32× ↓                 32× ↓                       32× ↓
Power    32× ↓                 8.3× ↓                      4.5× ↓
Speed    5.7× ↑                3.9× ↑                      1.3× ↑
32
Multicore scaling model
From 45 nm to 8 nm
– Single Core Search Space (Scaled Area and Power Pareto Frontiers)
– Constraints (Area and Power Budget)
– Application Characteristics (% Parallel, % Memory Accesses)
– Multicore Organization: CPU-Like, GPU-Like (# of HW Threads, Cache Sizes)
– Multicore Topology (Symmetric, Asymmetric, Dynamic, Composable)
– Microarchitectural Features (Cache and Memory Latencies, CPI, Memory Bandwidth)
Exhaustive search of multicore design space
(Examine 800 design points for every technology node)
33
[Figure: performance improvement relative to 45 nm across the 45, 32, 22, 16, 11, and 8 nm nodes (about 10 years, with 2014 marked). Historical Trend: 18×. Optimistic Transistor Scaling (Projection): 7.9×. Conservative Transistor Scaling (Projection): 3.7×.]
[Figure: Dark Silicon across the same nodes. Labeled values: 1%, 17%, 36%, 40%, 51%.]
34
Industry of replacement?
 Multicores are likely to be a stopgap
– Not likely to continue the historical trends
– Do not overcome the transistor scaling trends
– The gap to the historical performance trend is large
 Radical departures from conventional approaches
are necessary
– Extract more performance and efficiency from silicon
while preserving programmability
– Explore other models of computing
35
Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA Design
4. Pre-assessment Test
36
Possible paths forward
My teaching!
Do Nothing
Specialization and Co-design
Biological Computing
Technology Breakthrough
Quantum Computing
Software Bloat Reduction
Approximate Computing
Easy for me!
My research!
Way long term!
37
Approximate computing
Embracing error
 Relax the abstraction of “near-perfect”
accuracy in general-purpose computing
 Allow errors to happen in the computation
– Run faster
– Run more efficiently
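One simple way to see the trade-off is loop perforation, which skips a fraction of loop iterations and accepts a less exact answer. This technique is not named on the slides; it is used here only as an illustration. The function name `mean_perforated` and the stride values are hypothetical.

```python
def mean_perforated(values: list, stride: int = 1) -> tuple[float, int]:
    """Estimate the mean by reading only every `stride`-th element.

    Returns (estimate, number of elements actually read).
    stride=1 reads everything and gives the exact mean.
    """
    sampled = values[::stride]
    return sum(sampled) / len(sampled), len(sampled)


if __name__ == "__main__":
    data = list(range(100))
    exact, _ = mean_perforated(data, stride=1)
    for stride in (1, 2, 4, 10):
        estimate, reads = mean_perforated(data, stride)
        error = abs(estimate - exact) / exact
        print(f"stride={stride} reads={reads} "
              f"estimate={estimate} error={error:.1%}")
```

Reading fewer elements means less work and less energy, at the cost of a bounded error. Whether that error is acceptable depends on the application, which is why the classes of approximate applications matter.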
38
39
New landscape of computing
Personalized and targeted computing
40
Classes of
approximate applications
 Programs with analog inputs
– Sensors, scene reconstruction
 Programs with analog outputs
– Multimedia
 Programs with multiple possible answers
– Web search, machine learning
 Convergent programs
– Gradient descent, big data analytics
41
Adding a third dimension
Embracing Error
Energy
Processor Pareto Frontier
Performance
42
Adding the Dimension of Error
Finding the Pareto surface
Truffle
[ASPLOS ‘12]
Energy
Processor Pareto Frontier
R2, RFVP
[ASPLOS ‘15]
NPUs
[MICRO ‘12]
[ISCA ‘14]
Performance
(3.7×↑, 6.3×↓, 10%)
43
Parrot algorithmic transformation
Learned
Model
Learned
Model
Core
Accelerator
44
Neural networks for code approximation
Neural Processing Units
 Powerful prediction tools
 Highly parallel
 Efficiently implementable with hardware
– Both digital and analog
 Fault tolerant
45
NPU design alternatives
CPU
CPU
GPU
FPGA
NPU
Digital ASIC
(Speed: 1.8×↑, Energy: 1.7×↓, Quality: 10%↓)
(Speed: 2.3×↑, Energy: 3.0×↓, Quality: 10%↓)
FPAA
Analog ASIC
(Speed: 3.7×↑, Energy: 6.3×↓, Quality: 10%↓)
46
Approximate computing versus
conventional computing
Possible paths forward
My teaching!
Do Nothing
Specialization and Co-design
Biological Computing
Technology Breakthrough
Quantum Computing
Software Bloat Reduction
Approximate Computing
Easy for me!
My research!
Way long term!
48
Programmability versus Efficiency
[Figure: spectrum from most programmable to most efficient: General-Purpose Processors, SIMD Units, GPUs, FPGAs, ASICs]
49
Large-Scale Reconfigurable Computing in a Microsoft Datacenter
Microsoft Cloud Services
Capabilities, Costs
[Figure: server configurations that pair a Xeon CPU and NIC with one application's accelerator: Search Acc. (FPGA), Search Acc. v2 (FPGA), Search Acc. (ASIC), Math Accelerator. Callouts: "Wasted Power, Holds back SW" and "Wasted Power, One more thing that can break".]
Integrating FPGAs into the Datacenter
Microsoft Open Compute Server
Two 8-core Xeon 2.1 GHz CPUs
64 GB DRAM
4 HDDs @ 2 TB, 2 SSDs @ 512 GB
10 Gb Ethernet
No cable attachments to server
68 °C
Catapult FPGA Accelerator Card
–Altera Stratix V GS D5
• 172k ALMs, 2,014 M20Ks, 1,590 DSPs
–8GB DDR3-1333
–32 MB Configuration Flash
–PCIe Gen 3 x8
–8 lanes to Mini-SAS SFF-8088 connectors
–Powered by PCIe slot
[Figure: card block diagram. Labels: Stratix V, Config Flash, PCIe Gen3 x8, 4x 20 Gbps Torus Network, 8GB DDR3.]
Board Details
16 Layer, FR408
9.5cm x 8.8cm x 115.8 mil
35mm x 35mm FPGA
14.2mm high heatsink
[Figure: board in the server. Labels: FPGA, 1U, Mezz Conn.]
Scalable Reconfigurable Fabric
1 FPGA board per Server
48 Servers per ½ Rack
6x8 Torus Network among FPGAs
20 Gb over SAS SFF-8088 cables
Data Center Server (1U, ½ width)
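The wraparound structure of a 6x8 torus can be sketched in Python. Each of the 48 FPGAs has four neighbors, and the edges of the grid connect back to the opposite side. The (row, column) coordinates, the choice of 6 rows by 8 columns, and the function names `neighbors` and `hops` are illustrative assumptions; the slides do not describe how nodes are numbered or routed.

```python
ROWS, COLS = 6, 8   # assumed orientation of the 6x8 torus


def neighbors(row: int, col: int,
              rows: int = ROWS, cols: int = COLS) -> list:
    """Four torus neighbors of a node, with wraparound at the edges."""
    return [
        ((row - 1) % rows, col),   # north
        ((row + 1) % rows, col),   # south
        (row, (col - 1) % cols),   # west
        (row, (col + 1) % cols),   # east
    ]


def hops(a: tuple, b: tuple, rows: int = ROWS, cols: int = COLS) -> int:
    """Minimum number of links between two nodes on the torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


if __name__ == "__main__":
    print(neighbors(0, 0))
    farthest = max(hops((0, 0), (r, c))
                   for r in range(ROWS) for c in range(COLS))
    print("worst-case hops from (0, 0):", farthest)
```

Wraparound links roughly halve the worst-case distance compared with a plain 6x8 mesh, at the cost of longer cables between the edge nodes.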
60
Agenda
1. Who is Hadi
2. Course organization
3. Why CS 3220 Processor Design
   1. How we became an industry of new capabilities
   2. Why we might become an industry of replacement
   3. Specialization and FPGA Design
4. Pre-assessment Test
61
62