Accelerating Deep Convolutional Neural Networks Using Specialized Hardware in the Datacenter

Top Row: Eric Peterson, Scott Hauck, Aaron Smith, Jan Gray, Adrian M. Caulfield, Phillip Yi Xiao, Michael Haselman, Doug Burger
Bottom Row: Joo-Young Kim, Stephen Heil, Derek Chiou, Sitaram Lanka, Andrew Putnam, Eric S. Chung
Not Pictured: Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Amir Hormati, James Larus, Simon Pope, Jason Thong

Huge thanks to our partners at

FPGAs vs. ASICs
[Chart. Source: Bob Broderson, Berkeley Wireless group]

The datacenter environment:
• Software services change monthly
• Machines last 3 years, purchased on a rolling basis
• Machines repurposed ~½ way into lifecycle
• Little/no HW maintenance, no accessibility
• Homogeneity is highly desirable

http://www.wired.com/2014/06/microsoft-fpga/

One reconfigurable FPGA fabric, many services: deep neural networks, the web search pipeline, a physics engine, a computer vision service.

Catapult FPGA board:
• Altera Stratix V D5: 172,600 ALMs, 2,014 M20Ks, 1,590 DSPs
• PCIe Gen3 x8
• 8 GB DDR3-1333
• Powered by the PCIe slot
• Torus network connecting FPGAs

Data center server (1U, ½ width):
• Two 8-core Xeon 2.1 GHz CPUs
• 64 GB DRAM
• 4 HDDs @ 2 TB, 2 SSDs @ 512 GB
• 10 Gb Ethernet
• No cable attachments to the server
• 68 °C

Web search document scoring: an 8-stage pipeline across FPGAs 0-7
• Stage roles: route to head, Feature Extraction (FE), Free-Form Expressions (FFE), and Score Compute
• Ranking servers send document scoring requests into the pipeline and receive the computed score back

CNN overview (Krizhevsky-style network):
INPUT → 3-D convolution and max pooling → dense layers → OUTPUT ("Dog")

Each layer transforms an Input through Model Weights into an Output; the deck animates this Input → Model Weights → Output step layer by layer.
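The per-layer Input → Model Weights → Output view boils down to repeated weighted sums; a minimal pure-Python sketch of the dense layers at the end of such a network (the function names, the ReLU choice, and the toy weights are illustrative assumptions, not from the deck):

```python
def dense_layer(inputs, weights, biases):
    """One fully-connected layer: out[j] = ReLU(sum_i inputs[i] * weights[i][j] + biases[j])."""
    outputs = []
    for j in range(len(biases)):
        acc = biases[j]
        for i, x in enumerate(inputs):
            acc += x * weights[i][j]
        outputs.append(max(0.0, acc))  # ReLU activation (assumed)
    return outputs

def classify(features, layers, labels):
    """Run features through a stack of dense layers; return the highest-scoring label."""
    activations = features
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return labels[max(range(len(activations)), key=activations.__getitem__)]

# Toy 2-input, 2-class network; identity weights are made up for illustration.
layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
print(classify([0.2, 0.9], layers, ["Cat", "Dog"]))  # larger feature wins -> Dog
```

On the FPGA these inner products are what the PE arrays evaluate in parallel; the sketch only shows the arithmetic, not the scheduling.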
Convolution and max pooling (N, k, H, and p may vary across layers):
• Convolution between a k × k × D kernel and a region of the input feature map produces the convolution output
• The max value over each p × p region produces the (optional) max-pooled output
• N = input height and width, k = kernel height and width, D = input depth, H = # feature maps, S = kernel stride

Accelerator top level:
• Top-level layer controller (software configurable)
• PE arrays fed by a multi-banked input buffer and a kernel weight buffer
• Network-on-chip for data re-distribution between the input layer, weights, and output layer
• DRAM channels for image load & writeback

Scaling out: CNN engines and fully-connected engines are each mapped across multiple FPGAs.

Results (images/s):

Configuration                      CIFAR-10  ImageNet 1K        ImageNet 22K       Power
Server + Stratix V D5              2,318     134                91                 25 W
Server + Arria 10 GX1150           -         ~233 (projected)   ~158 (projected)   25 W
Best prior CNN on FPGA [FPGA'15]   -         46                 -                  18 W
Caffe+cuDNN on Tesla K20           -         376                -                  225 W
Caffe+cuDNN on Tesla K40           -         824                -                  225 W

See whitepaper @ http://research.microsoft.com/apps/pubs/?id=240715
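The layer parameters defined above (N, k, D, S, p) fully determine the arithmetic: the convolution output is (N − k)/S + 1 on a side, and pooling then shrinks it by p. A minimal pure-Python sketch under those definitions (function name and data layout are our assumptions, not from the deck):

```python
def conv_maxpool(inp, kernel, S, p):
    """Convolve a k x k x D kernel over an N x N x D input feature map
    (indexed inp[y][x][d]) with stride S, then max-pool the result over
    non-overlapping p x p regions; p <= 1 skips the optional pooling."""
    N, D, k = len(inp), len(inp[0][0]), len(kernel)
    conv_n = (N - k) // S + 1  # convolution output is conv_n x conv_n
    conv = [[sum(inp[y * S + dy][x * S + dx][d] * kernel[dy][dx][d]
                 for dy in range(k) for dx in range(k) for d in range(D))
             for x in range(conv_n)] for y in range(conv_n)]
    if p <= 1:
        return conv
    pool_n = conv_n // p       # pooled output is pool_n x pool_n
    return [[max(conv[y * p + dy][x * p + dx]
                 for dy in range(p) for dx in range(p))
             for x in range(pool_n)] for y in range(pool_n)]

# Example: 4 x 4 x 1 input of ones, 2 x 2 x 1 kernel of ones, S = 1, p = 3.
inp = [[[1.0] for _ in range(4)] for _ in range(4)]
ker = [[[1.0] for _ in range(2)] for _ in range(2)]
print(conv_maxpool(inp, ker, 1, 3))  # -> [[4.0]]
```

This computes one output feature map; a real layer repeats it for H kernels, which is the parallelism the PE arrays exploit.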