2022 IEEE Hot Chips 34 Symposium (HCS) | 978-1-6654-6028-6/22/$31.00 ©2022 IEEE | DOI: 10.1109/HCS55958.2022.9895604
Authorized licensed use limited to: Southeast University. Downloaded on January 22, 2024 at 07:17:50 UTC from IEEE Xplore. Restrictions apply.

壁仞™ BR100 GPGPU: Accelerating Datacenter Scale AI Computing
Mike Hong, Lingjie Xu & Team
Hot Chips 34, Aug 2022

1 Notice and Disclaimer
• Copyright 2020-2022 Biren Technology. All rights reserved.
• Confidentiality. This document contains confidential information of Biren Technology, which shall not be disclosed to any third party by any means unless explicitly permitted by Biren Technology.
• Trademark. All trade names, trademarks, graphical marks and domain names in this document are the property of Biren Technology. Without prior written consent, they may not be copied, reproduced, modified, published, uploaded, posted, transmitted, or distributed in any way.
• Forward-Looking Statements. Information in this document, other than statements or descriptions of historical fact, may contain forward-looking statements. These statements are made based on certain assumptions, projections and calculations made by us with regard to the industry and management's expertise. They are subject to significant risks and uncertainties, and our actual results may differ materially. Your business decisions shall not be made solely on the basis of this information.
• Disclaimer. This material is provided "as is" without any express or implied warranty of any kind, including warranties of merchantability, title, non-infringement of intellectual property, or fitness for any particular purpose.
2 BIREN BR100
• Area: 1074 mm² @ 7nm
• Transistor Count: 77 billion
• Host Interface: PCIe Gen 5 x16 w/ CXL
• Peak Performance: 2048 TOPS @ INT8; 1024 TFLOPS @ BF16; 512 TFLOPS @ TF32+; 256 TFLOPS @ FP32; also supports FP16, INT32, INT16 and other formats
• Memory: 64GB HBM2E
• Interconnections: 8 BLink™, 2.3TB/s external I/O bandwidth
• Form Factor: OAM with 550W max TDP

3 Accelerating Deep Learning Workloads
[Bar chart: benchmark performance measured in throughput, BIREN BR100 vs. NVIDIA A100 (baseline 1.0x), on ResNet-50, BERT-Large, Tacotron2, YOLOv5 and Transformer; per-benchmark speedups fall roughly between 2.4x and 2.8x, for an average speedup of ~2.6x.]

4 Purpose-built for Datacenter-Scale Computing
• Compute density
• Connectivity
• Cost of ownership
• Co-design
• Compatibility with hardware and software datacenter infrastructure

5 One Tapeout, Multiple Products
To break the reticle size limit and integrate more transistors on chip, BR100 pairs two compute tiles, each flanked by two HBM2E stacks:
• One tapeout empowers multiple SKUs
• Smaller die for better yield, hence lower cost
• 896GB/s high-speed die-to-die interconnect
• 30% more performance and 20% better yield compared with a monolithic design

6 BR100 Product Line
• BR100: 壁砺™ 100 OAM
• BR104: 壁砺™ 104 PCIe
• Hearten Server
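One pattern worth noting in the spec sheet above: the peak rates halve at each step up in precision (INT8 → BF16 → TF32+ → FP32). A quick sanity check of the slide's figures:

```python
# Peak throughput figures from the BR100 spec slide
# (TOPS for INT8, TFLOPS for the floating-point formats).
peak = {"INT8": 2048, "BF16": 1024, "TF32+": 512, "FP32": 256}

# Each step up in precision halves the peak rate.
order = ["INT8", "BF16", "TF32+", "FP32"]
ratios = [peak[a] / peak[b] for a, b in zip(order, order[1:])]
print(ratios)  # -> [2.0, 2.0, 2.0]
```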
7 OAM Server Interconnect Topology
[Diagram: 8-OAM server topology; two CPUs, each behind PCIe switches with attached NICs, connect to the eight BR100 OAM modules, which are linked to one another over BLink.]

9 BR100 Architecture Diagram
[Diagram: 32 SPCs (Streaming Processing Clusters) split across two compute tiles joined by die-to-die (D2D) links. Each SPC pairs with an 8MB L2 cache/scratchpad and contains EUs (execution units) backed by 64KB L1 caches (LSC). Surrounding the SPC array: four HBM2E subsystems, BLink and PCIe interfaces, and video encoders and decoders.]
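The summary slide describes the eight OAMs as an all-to-all interconnection, and the spec slide lists 8 BLink ports per BR100. As a hypothetical check (the deck does not give the actual port-to-peer mapping), a direct all-to-all among 8 cards needs 7 peer links per card, which fits within the 8 available ports:

```python
# Hypothetical check: a direct all-to-all topology among n accelerators
# needs n-1 peer links per card and n*(n-1)/2 links in total.
def all_to_all_links(n_cards: int) -> tuple[int, int]:
    per_card = n_cards - 1
    total = n_cards * per_card // 2
    return per_card, total

per_card, total = all_to_all_links(8)
print(per_card, total)  # -> 7 28
assert per_card <= 8    # fits in BR100's 8 BLink ports
```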
10 BR100 SPC Architecture
Building blocks of a BR100 SPC:
• 16 x EU (execution unit), each EU has:
  • 16 x streaming processing cores (V-core), 1 x tensor engine (T-core)
  • 40KB TLR (Thread Local Register file)
  • 4 x SFU
  • TDA (Tensor Data Accelerator for load/store)
  • L0 instruction cache and a warp scheduler/dispatch unit (32 threads/clk)
• 4 x 64KB L1 cache/LSC (Load & Store Cache)
• Up to 8MB distributed L2 cache/scratchpad:
  • Holds shared data for all SPCs
  • Can be configured into scratchpad
  • Built-in reduction engine

11 Scalable CU Architecture
• Multiple EUs form a CU (compute unit)
• Thread groups in a CU are synchronized
• Each CU can contain 4, 8 or 16 EUs
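Multiplying out the per-SPC figures above gives the chip-level totals; a minimal sketch, assuming all 32 SPCs (the count given on the V-core slide) are identical replicas:

```python
# Per-SPC building blocks from the slide, scaled to the full chip.
# Assumes the 32 SPCs (per the "128K threads run on 32 SPCs" slide)
# simply replicate the per-SPC resources.
SPCS = 32
EUS_PER_SPC = 16
VCORES_PER_EU = 16
L2_MB_PER_SPC = 8

eus = SPCS * EUS_PER_SPC              # 512 EUs
vcores = eus * VCORES_PER_EU          # 8192 V-cores
tcores = eus                          # 1 T-core per EU -> 512
l2_mb = SPCS * L2_MB_PER_SPC          # 256 MB distributed L2
print(eus, vcores, tcores, l2_mb)     # -> 512 8192 512 256
```

The 256MB of distributed L2 alone accounts for most of the "over 300MB on-chip SRAM" cited in the summary; the L1/LSC, TLR and instruction caches make up the rest.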
12 V-core: A General-Purpose SIMT Processor
• Full ISA for general-purpose computing
• 16 cores per EU, supporting FP32, FP16, INT32, INT16
• Handles load/store and data preprocessing
• Manages the T-core through multiple sync channels
• Handles DL ops like batch norm, ReLU, etc.
Enhanced SIMT model:
• 128K threads run on 32 SPCs
• Cooperative warps
• Super-scalar issue (static and dynamic)

13 Flexible V-Core Warp Control
Standard control mode:
• Warps run the same code and partition the register file
C-Warp/kernel coroutine control mode:
• Warps run different code, collaborate with each other (producer-consumer) and exchange data in the register file
• Enhanced parallelism
• Highly efficient for op fusion
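The deck gives no programming example for the C-Warp coroutine mode, but as a loose software analogy (Python generators standing in for warps, the generator pipeline standing in for register-file exchange), the producer-consumer collaboration used for op fusion looks like this:

```python
# Software analogy for C-Warp coroutine control: a producer "warp"
# streams tiles to a consumer "warp" through shared state, so a fused
# producer->consumer pipeline never spills intermediates to memory.
def producer(tiles):
    for t in tiles:                 # e.g. tiles of a GEMM output
        yield [x * 2 for x in t]    # stand-in for the compute stage

def consumer(stream):
    for t in stream:                # fused elementwise stage (ReLU-like)
        yield [max(x, 0) for x in t]

tiles = [[-1, 2], [3, -4]]
fused = list(consumer(producer(tiles)))
print(fused)  # -> [[0, 4], [6, 0]]
```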
14 T-core: High-Level Overview
Accelerates common operations in AI:
• MMA (Matrix Multiplication Addition)
• Convolution
• …
Design goals:
• Speed up MMA and convolution, which account for the majority of deep learning workloads
• Best reuse among different dimensions (batch, sample, channel, …)
• Model-level and data-level parallelism
• Closely coupled with V-cores and the L2 scratchpad to maximize utilization

15 SPC-Scale 2.5D GEMM Architecture
Consists of:
• 16 T-cores in a 2D systolic array
• 2 groups of 8 x 8 dot-product (dp) operations per T-core (8 x 8 x dp8 3D MMA for BF16)
• Equivalent to a 64 x 64 matrix multiplication
• Supports FP32, TF32+, BF16, INT16, INT8, INT4 tensor formats
Benefits:
• Higher data reuse rate with a big 2.5D GEMM: less memory bandwidth and cache occupation, better throughput and energy efficiency
• Lower latency vs. a big 2D GEMM
• More scalable and easier routing vs. a 3D GEMM
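The deck does not state a clock frequency, but the GEMM geometry above lets us back out what clock the 1024 TFLOPS BF16 peak implies. A sketch, assuming 2 FLOPs per multiply-accumulate and all 32 SPCs active (both assumptions, not slide facts):

```python
# Back out the implied clock from the 2.5D GEMM geometry:
# 16 T-cores per SPC, 2 groups of 8x8 dp8 per T-core, 32 SPCs.
macs_per_tcore = 2 * 8 * 8 * 8           # 1024 MACs/cycle (BF16)
macs_per_spc = 16 * macs_per_tcore       # 16384 MACs/cycle per SPC
flops_per_cycle = 32 * macs_per_spc * 2  # 2 FLOPs per MAC, 32 SPCs

peak_bf16 = 1024e12                      # 1024 TFLOPS from the spec slide
implied_ghz = peak_bf16 / flops_per_cycle / 1e9
print(round(implied_ghz, 2))  # -> 0.98
```

That the implied clock lands near 1 GHz suggests the spec numbers follow directly from this geometry, though the actual shipping frequency is not disclosed.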
16 TF32+ Tensor Data Type
Why TF32+?
• E8M15: sign bit, 8 exponent bits, 15 mantissa bits, 24 bits in total (vs. FP32 = E8M23 and TF32 = E8M10)
• 32x more precise than TF32 in AI training
• Reuses the BF16 multiplier (with 1+7 mantissa bits) and simplifies T-core design
• Kicks in automatically when data declared as FP32 is used with the tensor acceleration libraries

17 Tensor Data Accelerator (TDA)
• TDAs in the T-core and V-core are dedicated to accelerating address calculation and out-of-bound (OOB) handling for tensor (1D-5D) load/store requests, using tensor descriptors
• Supports asynchronous memory copy to the memory system
• Improves tensor data fetch efficiency by offloading addressing overhead and supporting different tensor layouts

18 Memory Scheme: NUMA and UMA
NUMA memory scheme:
• Stores an SPC's private data, e.g. activations
• Local access to HBM/L2 with high bandwidth
• Data broadcast to the relevant SPCs in model parallelism
UMA memory scheme:
• Stores shared data used by all SPCs, e.g. weights
• Accelerated by NoC multicasting to HBM/L2
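The TF32+ layout above is FP32 with the mantissa cut from 23 bits to 15 (TF32 cuts it to 10). A minimal sketch of that conversion and of the "32x more precise" claim, using Python's struct to reinterpret the bits; the hardware's rounding mode is not stated in the deck, so this sketch simply truncates:

```python
import struct

def truncate_fp32(x: float, mantissa_bits: int) -> float:
    """Keep the top `mantissa_bits` of an FP32 mantissa, zeroing the rest."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    mask = 0xFFFFFFFF << (23 - mantissa_bits)  # sign + exponent + kept mantissa
    (out,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return out

x = 1.2345678
tf32p = truncate_fp32(x, 15)  # TF32+: E8M15
tf32 = truncate_fp32(x, 10)   # TF32:  E8M10

# TF32+ keeps 5 extra mantissa bits -> 2**5 = 32x finer mantissa steps.
assert abs(x - tf32p) <= abs(x - tf32)
print(2 ** (15 - 10))  # -> 32
```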
19 Near Memory Computing
Embedding table processing accounts for a major share of the computation in recommendation systems. The near-memory engine offloads the SPCs and accelerates this computation:
• (1) Reduction operations in L2 by SPC instructions, reducing the embedding table in L2 (shared by all SPCs) before the data reaches an SPC
• (2) Reduction by DMA
[Diagram: embedding table reduction in L2, with the L2 reduction engine offloading the SPCs.]

20 BIRENSUPA™ Software Platform
[Stack diagram, top to bottom:]
• Applications and workflows (AutoStream, …)
• Frameworks and libraries (suCTR, suInfer)
• BIRENSUPA programming platform: programming model, C++ extension & runtime API, compiler, tools
• Driver/HAL: KMD, UMD, virtualization, firmware
• Hardware

21 Summary
We introduced BR100, a GPGPU designed for accelerating datacenter-scale AI computing:
• PFLOPS-level compute density, high connection bandwidth (on-die/off-die)
• 7nm chiplet design with CoWoS packaging
• 550W OAM form factor with an 8-card all-to-all interconnection topology
• Over 300MB on-chip SRAM for data cache and reuse
• A general-purpose processor with 2.5D GEMM acceleration
• A new TF32+ tensor data type
We optimized data streaming by introducing the following features in BR100:
• Special C-Warp control mode
• TDA
• NUMA/UMA memory scheme with NoC multicast
• Near memory computing
The BIRENSUPA software platform allows developers to program BR100 with ease.
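As a closing illustration of the near-memory computing feature listed in the summary: the L2 reduction engine's job on the embedding slide is, in effect, to sum selected embedding-table rows before they ever reach an SPC. A minimal software sketch (the table contents and indices below are made up for illustration):

```python
# Software sketch of the embedding reduction the L2 engine offloads:
# gather rows of an embedding table by index and reduce (sum) them,
# so only the reduced vector travels to the compute cluster.
def embedding_sum(table: list[list[float]], indices: list[int]) -> list[float]:
    dim = len(table[0])
    out = [0.0] * dim
    for i in indices:            # gather rows by index
        for d in range(dim):     # reduce in place, near the data
            out[d] += table[i][d]
    return out

table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 rows, dim 2
print(embedding_sum(table, [0, 2]))  # -> [6.0, 8.0]
```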