Accelerating Deep Convolutional Neural
Networks Using Specialized Hardware in
the Datacenter
Top Row: Eric Peterson, Scott Hauck,
Aaron Smith, Jan Gray, Adrian M.
Caulfield, Phillip Yi Xiao, Michael
Haselman, Doug Burger
Bottom Row: Joo-Young Kim, Stephen
Heil, Derek Chiou, Sitaram Lanka,
Andrew Putnam, Eric S. Chung
Not Pictured: Kypros Constantinides,
John Demme, Hadi Esmaeilzadeh,
Jeremy Fowers, Gopi Prashanth Gopal,
Amir Hormati, James Larus, Simon
Pope, Jason Thong
Huge thanks to our partners at
[Chart: efficiency comparison of FPGAs and ASICs. Source: Bob Broderson, Berkeley Wireless group]
• Software services change monthly
• Machines last 3 years, purchased on a rolling basis
• Machines repurposed ~½ way into lifecycle
• Little/no HW maintenance, no accessibility
• Homogeneity is highly desirable
http://www.wired.com/2014/06/microsoft-fpga/
[Diagram: a reconfigurable fabric of FPGAs hosting multiple datacenter services: Deep Neural Networks, the Web Search Pipeline, a Physics Engine, and a Computer Vision Service]
• Altera Stratix V D5
• 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen 3 x8
• 8 GB DDR3-1333
• Powered by PCIe slot
• Torus network

[Board photo, labeled: Stratix V, 8 GB DDR3, PCIe Gen3 x8]
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
• 10 Gb Ethernet
• No cable attachments to server

[Server diagram annotation: 68 °C]
Data Center Server (1U, ½ width)
[Diagram: eight ranking servers, each paired with one FPGA (FPGA 0 to FPGA 7), linked into an 8-stage pipeline. A document scoring request is routed to the head FPGA; the pipeline runs FE (Feature Extraction), FFE (Free-Form Expressions), and Score Compute stages, then returns the score to the requesting ranking server]
[Diagram: a Krizhevsky-style deep convolutional network. INPUT image passes through 3-D convolution and max pooling layers, then dense layers, producing the OUTPUT label "Dog"]
[Animation, repeated across slides: a fully-connected layer combining Input and Model Weights to produce Output, one output element per frame]
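The Input / Model Weights / Output frames step through a fully-connected (dense) layer: each output element is the dot product of the input vector with one row of the weight matrix. A minimal NumPy sketch (the function name and toy dimensions are illustrative, not from the talk):

```python
import numpy as np

def dense_layer(x, W, b):
    """Fully-connected layer: each output element is the dot product of
    the input vector x with one row of the weight matrix W, plus a bias."""
    return W @ x + b

# Toy dimensions: 4 inputs -> 3 outputs.
x = np.array([1.0, 2.0, 3.0, 4.0])          # input
W = np.arange(12, dtype=float).reshape(3, 4) # model weights
b = np.zeros(3)
y = dense_layer(x, W, b)                     # output: [20., 60., 100.]
```

This is the computation the animation walks through one element at a time; the hardware pipelines the same multiply-accumulate pattern.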
Convolution layer parameters (N, k, H, and p may vary across layers):

• N = input height and width
• k = kernel height and width
• D = input depth
• H = # feature maps
• S = kernel stride
• p = max-pooling window size

Convolution: a k x k x D kernel is applied to each region of the N x N x D Input Feature Map, producing one of H Convolution Output feature maps. Max Pooled Output (optional): the max value over each p x p region of the convolution output.
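The parameters above map directly onto a naive reference implementation of the layer. A sketch, assuming "valid" convolution with stride S and non-overlapping p x p pooling (the loop structure is mine, not the FPGA datapath):

```python
import numpy as np

def conv_layer(inp, kernels, S=1):
    """Naive 3-D convolution: inp is N x N x D, kernels is H x k x k x D.
    Each of the H kernels spans the full input depth D and yields one
    output feature map."""
    N = inp.shape[0]
    H, k = kernels.shape[0], kernels.shape[1]
    M = (N - k) // S + 1                      # output height/width
    out = np.zeros((M, M, H))
    for h in range(H):
        for i in range(M):
            for j in range(M):
                region = inp[i*S:i*S+k, j*S:j*S+k, :]
                out[i, j, h] = np.sum(region * kernels[h])
    return out

def max_pool(fmap, p):
    """Optional non-overlapping p x p max pooling, per feature map."""
    M, _, H = fmap.shape
    Q = M // p
    out = np.zeros((Q, Q, H))
    for i in range(Q):
        for j in range(Q):
            out[i, j] = fmap[i*p:(i+1)*p, j*p:(j+1)*p].max(axis=(0, 1))
    return out
```

For example, a 4 x 4 x 2 input convolved with H = 3 kernels of size k = 2 at stride S = 1 yields a 3 x 3 x 3 output, and p = 3 pooling reduces it to 1 x 1 x 3.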
[Diagram: CNN accelerator architecture. A Top-level Layer Controller (software configurable) drives an array of PE Arrays. A multi-input banked buffer holds the Input Layer, a Kernel Weight Buffer holds the weights, and a Network-on-Chip redistributes data between layers. DRAM channels handle image load and Output Layer writeback]
[Diagram: CNN Engines mapped onto some FPGAs, Fully-Connected Engines onto others]
FPGA or GPU                  CIFAR-10        ImageNet 1K       ImageNet 22K      Power
Server + Stratix V D5        2318 images/s   134 images/s      91 images/s       25 W
Server + Arria 10 GX1150     -               ~233 images/s*    ~158 images/s*    25 W
Best prior CNN on FPGA
  [FPGA'15]                  -               46 images/s       -                 18 W
Caffe+cuDNN on Tesla K20     -               376 images/s      -                 225 W
Caffe+cuDNN on Tesla K40     -               824 images/s      -                 225 W
(* projected)
See whitepaper @ http://research.microsoft.com/apps/pubs/?id=240715
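Dividing ImageNet 1K throughput by the quoted power gives the energy-efficiency comparison the table implies (all numbers are taken from the rows above; the dictionary layout is mine):

```python
# ImageNet 1K throughput (images/s) and power (W) from the table above.
results = {
    "Stratix V D5":         (134, 25),
    "Arria 10 (projected)": (233, 25),
    "Prior FPGA [FPGA'15]": (46, 18),
    "Tesla K20":            (376, 225),
    "Tesla K40":            (824, 225),
}

for name, (ips, watts) in results.items():
    print(f"{name:>20}: {ips / watts:.2f} images/s per watt")
```

By this measure the Stratix V D5 delivers about 5.4 images/s per watt versus roughly 1.7 for the K20 and 3.7 for the K40, which is the efficiency argument the comparison is making.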