Networking in the Era of Big Compute
宋庆春, Senior Director, NVIDIA Networking APAC | 2024

The transformation of the era of big compute: the data center has become the computer, and the network defines the data center. As the number of GPUs in a cluster grows from roughly 10 to 100, 1k, 10k, 100k, and 1M+:
• Cloud: multi-tenant, with a variety of small-scale workloads; a traditional Ethernet network can suffice.
• AI cloud (generative AI cloud): multi-tenant, with a variety of workloads including larger-scale generative AI; traditional Ethernet carries North-South traffic, while NVIDIA Spectrum-X AI Ethernet or InfiniBand serves as the East-West AI fabric.
• AI factory: a single user or few users running extremely large AI models; NVIDIA NVLink plus InfiniBand is the gold standard for the AI fabric.

A new yardstick for generative AI training performance – MLPerf Training, where application and compute-platform performance converge:
• Stable Diffusion – text-to-image
• GPT-3 175B – large language model
• DLRMv2 – recommendation
• BERT-Large – natural language processing
• RetinaNet – object detection, lightweight
• Mask R-CNN – object detection, heavyweight
• 3D U-Net – biomedical image segmentation
• RNN-T – speech recognition
• ResNet-50 v1.5 – image classification

NVIDIA focuses on raising compute performance through end-to-end GPU-plus-network reference architectures: reference designs, complete solutions, and continuous optimization. MLPerf submission scale over time: 2019 – 1,536 Volta GPUs with 100G InfiniBand (ConnectX-5); 2021 – 4,320 Ampere GPUs with 200G InfiniBand (ConnectX-6); 2023 – 10,752 Hopper GPUs with 400G InfiniBand (ConnectX-7).

Six new compute records for the era of big compute – never the fastest, only faster:
• GPT-3 175B (1B tokens): 3.9 minutes, 2.8X faster
• Stable Diffusion: 2.5 minutes, new workload
• DLRM-dcnv2: 1 minute, 1.6X faster
• BERT-Large: 7.2 seconds, 1.1X faster
• RetinaNet: 55.2 seconds, 1.8X faster
• 3D U-Net: 46 seconds, 1.07X faster
MLPerf™ Training v3.1 results retrieved from www.mlperf.org on November 8, 2023. Format: chip count, MLPerf ID | GPT-3: 3,584x 3.0-2003, 10,752x 3.1-2007 | Stable Diffusion: 1,024x 3.1-2050 | DLRMv2: 128x 3.0-2065, 128x 3.1-2051 | BERT-Large: 3,072x 3.0-2001, 3,472x 3.1-2053 | RetinaNet: 768x 3.0-2077, 2,048x 3.1-2052 | 3D U-Net: 432x 3.0-2067, 768x 3.1-2064. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. See www.mlcommons.org for more information.

Near-linear scaling of large-model training on 10,000+ NVIDIA GPUs: 3X more GPUs delivers 2.8X higher training throughput, a scaling efficiency of 93.3%. On the MLPerf Training GPT-3 175B benchmark (time to train 1B tokens), 3,584 GPUs finished in 10.9 minutes (MLPerf v3.0), while 10,752 GPUs finished in 3.9 minutes (MLPerf v3.1); new software increases per-GPU performance and enables record scale. MLPerf™ Training v3.0 and v3.1 results retrieved from www.mlperf.org on November 8, 2023, from entries 3.0-2003, 3.1-2005, 3.1-2007, 3.1-2008, 3.1-2009.
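To make the 93.3% figure concrete: scaling efficiency is simply the observed speedup divided by the increase in GPU count. A minimal Python sketch using the two MLPerf data points quoted above (the script and its variable names are illustrative, not part of the deck):

```python
# Scaling-efficiency check for the MLPerf GPT-3 175B numbers quoted above.
# Inputs are the two data points from the slide (minutes to train 1B tokens).
baseline_gpus, baseline_minutes = 3_584, 10.9   # MLPerf v3.0 entry
scaled_gpus, scaled_minutes = 10_752, 3.9       # MLPerf v3.1 entry

speedup = baseline_minutes / scaled_minutes     # ~2.8x faster
gpu_ratio = scaled_gpus / baseline_gpus         # 3x more GPUs
efficiency = speedup / gpu_ratio                # fraction of ideal linear scaling

print(f"speedup {speedup:.2f}x with {gpu_ratio:.1f}x GPUs "
      f"-> scaling efficiency {efficiency:.1%}")
# prints a scaling efficiency of roughly 93%, matching the slide's figure
```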
The NVIDIA networking platform – InfiniBand plus Ethernet:
• Quantum-2 InfiniBand switch
• ConnectX-7 SmartNIC
• Spectrum-4 Ethernet switch
• BlueField-3 SuperNIC
• Management

Networking for the AI cloud:
• North-South traffic carries user-to-cloud communications over the control / user-access network (a traditional network).
• East-West traffic carries distributed, disaggregated processing over the AI fabric.

North-South (traditional network):
• Loosely coupled applications
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Heterogeneous traffic, statistical multi-pathing

East-West (AI fabric):
• Distributed, tightly coupled processing
• RoCE (high-bandwidth flows and utilization)
• Low jitter tolerance (the long tail kills performance)
• Bursty network capacity, predictable performance required

Building AI compute centers with the NVIDIA digital twin: the network pays for itself, and performance and compute come first – win the network, win the compute. Relative AI-fabric performance (NCCL AllReduce): traditional Ethernet as the baseline, NVIDIA Spectrum-X Ethernet 1.6X, NVIDIA InfiniBand >2X.

Efficient NVIDIA Networking Solutions for Large-Model Applications
冯高锋, Senior Director of Networking Technical Marketing, NVIDIA | 2024

A POD at any scale – growing with scalable units (SUs).

NVIDIA Eos and Microsoft Azure AI supercomputers: record-setting performance with over 10,000 GPUs and NVIDIA Quantum-2 InfiniBand networking. NVIDIA Eos is the MLPerf-proven #1 AI supercomputer in the world; Microsoft Azure Eagle is the largest cloud submission for MLPerf and the Top500.

InfiniBand roadmap: SDR (Single Data Rate), DDR (Double Data Rate), QDR (Quad Data Rate), FDR (Fourteen Data Rate), EDR (Enhanced Data Rate), HDR (High Data Rate), NDR (Next Data Rate). https://www.infinibandta.org/infiniband-roadmap/

NVIDIA Quantum-2 InfiniBand platform – unprecedented performance, scalability, and security for scientific computing:
• The most advanced networking, end to end
• Bare-metal secured multi-tenant infrastructure
• Performance isolation with congestion control
• Advanced adaptive routing
• In-network computing
• 400Gb/s InfiniBand: high throughput, extremely low latency, high message rate
• RDMA, GPUDirect RDMA, GPUDirect Storage, adaptive routing, congestion control, smart topologies
• 1.2X higher application performance (physics/chemistry, weather, FFT) with the BlueField DPU and Quantum InfiniBand in-network computing
• Adapter/DPU: ConnectX-7 and BlueField-3 DPU with MPI tag matching, a programmable datapath accelerator, data-processing units (Arm cores), a self-healing network, and data security / tenant isolation
• Switch: all-to-all acceleration and data reductions (SHARP)

NVIDIA GPUDirect RDMA – 10X higher performance:
• Without GPUDirect, the network is handled by the CPU and CPU memory: 2 full copy operations and 2 PCIe transactions per transfer, with higher GPU-utilization overhead, CPU usage, and latency.
• With GPUDirect RDMA, the network goes directly to GPU memory: 0 copy operations and 1 PCIe transaction.
• RDMA bypasses the kernel and user-space copies and executes in hardware; InfiniBand provides a natively lossless network for RDMA.

In-network-computing accelerated supercomputing: a software-defined, hardware-accelerated InfiniBand network, architected to scale, with the same end-to-end feature set (RDMA, GPUDirect RDMA, GPUDirect Storage, adaptive routing, congestion control, smart topologies), standard collective operations, in-network computing in both adapters/DPUs and switches, and centralized management.

SHARP accelerates AI performance: in a parameter-server design, the CPU of the parameter server becomes the bottleneck. SHARP performs the gradient averaging inside the network and replaces all physical parameter servers. NVIDIA SHARP: the Scalable Hierarchical Aggregation and Reduction Protocol.
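The gradient averaging that SHARP moves into the switches is the same allreduce a data-parallel training job issues after every backward pass. Below is a minimal PyTorch sketch of that pattern over the NCCL backend; it illustrates the collective traffic the AI fabric must carry, not NVIDIA's SHARP implementation, and the toy model, tensor sizes, and launch assumption (torchrun) are illustrative.

```python
import os
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """All-reduce each gradient across ranks, then divide by the world size.

    This is the collective that in-network aggregation (SHARP) can execute
    inside the switches instead of on the hosts or a parameter server.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    # Assumes launch via `torchrun`, which sets RANK / WORLD_SIZE / LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    model = torch.nn.Linear(1024, 1024).cuda()            # toy model for illustration
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    average_gradients(model)                               # allreduce over the AI fabric
    dist.destroy_process_group()
```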
SHARP is an in-network, tree-based aggregation mechanism that supports multiple simultaneous outstanding operations. Aggregation nodes (ANs) reside in the switches; hosts are the data sources and destinations. SHARP handles small-message and large-message reductions; Barrier, Reduce, All-Reduce, Broadcast, and more; Sum, Min, Max, Min-loc, Max-loc, OR, XOR, and AND; on integer and floating-point operands of 16/32/64 bits.

SHARPv3 multi-tenant network reductions – cloud-native supercomputing with Quantum-2 at 400Gb/s:
• SHARPv3 is now multi-tenant: multiple high-bandwidth reductions (Tenant A, B, C) run in parallel over disjoint trees
• Enhanced precision (19 bits) for lower-precision operands
• 32X more data-reduction engines
• Double the bandwidth, to 400Gb/s
For comparison, NVIDIA Quantum at 200Gb/s with SHARPv2 delivered 15% faster deep-learning recommendation, 17% faster natural language processing, and 15% faster computational fluid dynamics simulations.
NCCL allreduce performance with SHARP on an AI cluster.

NVIDIA Quantum InfiniBand infrastructure – an in-network-computing accelerated network for supercomputing: Skyway gateways, MetroX long-haul systems, ConnectX adapters, BlueField DPUs, Quantum switches, UFM Cyber-AI, and LinkX interconnects.

Thank You

NVIDIA Spectrum-X: Building AI Clusters with Extreme Performance on an Ethernet Foundation
陈龙, Director of Network Market Development, NVIDIA | 2024

Two types of AI data centers – networking for AI data centers:
• AI factories: a single user or few users, extremely large AI models, an NVLink and InfiniBand AI fabric
• AI clouds: multi-tenant, a variety of workloads, an Ethernet network

The data center is the computer, and the network defines the data center. As cluster size grows from tens of GPUs to 1M+:
• Cloud: multi-tenant, a variety of small-scale workloads; a traditional Ethernet network can suffice
• Generative AI cloud: multi-tenant, a variety of workloads including larger-scale generative AI; traditional Ethernet for North-South traffic, NVIDIA Spectrum-X AI Ethernet or InfiniBand for the East-West AI fabric
• AI factories: a single user or few users, extremely large AI models; NVIDIA NVLink and InfiniBand are the gold standard for the AI fabric

AI workloads require an AI fabric, optimized across the full stack. Control / user-access network (North-South) versus the AI fabric (East-West):
• Loosely coupled applications with no isolation required vs. tightly coupled processes requiring tenant isolation
• TCP (low-bandwidth flows and utilization) vs. RDMA (high-bandwidth flows and utilization)
• High jitter tolerance vs. low jitter tolerance
• Heterogeneous traffic with statistical multi-pathing vs. bursty network capacity requiring predictable performance

Running AI workloads on traditional Ethernet is sub-optimal for the needs of AI clouds: significant congestion, increased latency, and bandwidth unfairness.

Introducing the Spectrum-X networking platform – the world's first purpose-built Ethernet fabric for AI: 95% effective bandwidth, the highest available, and 1.6X higher AI network performance.

NVIDIA Spectrum-X: the world's first Ethernet platform for AI, combining a specialized high-performance architecture with standard Ethernet connectivity:
• Nearly perfect effective bandwidth at scale
• NVIDIA RoCE extensions for scalable AI communications
• Extremely low latency
• Deterministic performance and performance isolation
• Full-stack, end-to-end optimization
• Spectrum-4 Ethernet switch with open NOSes (SONiC, Cumulus) and standard Ethernet connectivity
• BlueField-3 SuperNIC
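The "effective bandwidth" and "percentage of peak" figures that recur in these slides are typically derived from allreduce timings. Below is a small sketch of that arithmetic, following the accounting convention popularized by NVIDIA's nccl-tests (bus bandwidth = algorithmic bandwidth × 2(n-1)/n, so it can be compared with a port's line rate); the message size, time, and port speed in the example are made-up illustrative values, not measured results.

```python
def allreduce_bandwidths(bytes_per_rank: float, seconds: float, n_ranks: int):
    """Return (algorithmic, bus) bandwidth in GB/s for one allreduce.

    algbw is simply data size / time; busbw rescales it by 2(n-1)/n so the
    result can be compared against the per-port line rate of the fabric
    (the convention used by nccl-tests for allreduce).
    """
    algbw = bytes_per_rank / seconds / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# Illustrative example: a 1 GiB allreduce across 256 GPUs finishing in 48 ms.
algbw, busbw = allreduce_bandwidths(1 << 30, 0.048, 256)
line_rate_gbs = 400 / 8          # a 400 Gb/s port expressed in GB/s
print(f"algbw {algbw:.1f} GB/s, busbw {busbw:.1f} GB/s, "
      f"{busbw / line_rate_gbs:.0%} of a 400 Gb/s port")
```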
NVIDIA Spectrum-4 – the first Ethernet switch family purpose-built for AI:
• Fast: 4X increase in bandwidth capacity
• Efficient: 4X reduction in solution footprint
• Secure: in-flight and at-rest encryption
• Green: 50% reduction in solution power
• Spectrum-4 switch ASIC and SN5000 Ethernet switches: 51.2 Tb/s aggregate bandwidth, 400/800GbE ports, 100G SerDes technology, and 100 billion transistors built on the NVIDIA 4N process

NVIDIA BlueField-3 SuperNIC – the network accelerator powering generative AI clouds:
• 400Gb/s with a 1:1 network-bandwidth-to-GPU ratio (one SuperNIC per GPU)
• 16 Arm cores of programmable compute within a <75W power envelope
• Peak AI workload efficiency, secure cloud multi-tenancy, extensible infrastructure
• The NVIDIA B3140H SuperNIC is optimized for East-West traffic in GPU-accelerated systems, with a power-efficient, low-profile design

End-to-end adaptive RDMA routing over a lossless network increases effective data throughput by 1.6X:
1. BlueField-3 sends data into the switch network.
2. Spectrum-4 adaptive routing spreads the data packets across all available routes.
3. BlueField-3 handles the out-of-order data delivery and places the data correctly in memory, ensuring in-order delivery to the application.
4. Effective bandwidth rises from the typical 60% to 95% – the Spectrum-X platform delivers 1.6X higher effective network bandwidth than traditional Ethernet on 400 Gb/s ports.

Congestion on traditional Ethernet results in victim flows: diverse workloads can impact each other's performance. Spectrum-X provides noise isolation with programmable congestion control:
• Spectrum-4 telemetry probes detect congestion spots in real time
• Programmable congestion control on BlueField-3 meters the data flows
• The result is performance isolation across workloads

Performance comparison – LLM NCCL AllReduce on traditional Ethernet vs. Spectrum-X Ethernet (percentage of peak bandwidth under optimal, average, and worst-case placement):
• Spectrum-X performance is consistent, while traditional Ethernet shows jitter and run-to-run bandwidth variability
• The result is 1.4X higher LLM performance at 2K GPUs

Spectrum-X rail-optimized leaf and spine: a Spectrum-4 + BlueField-3 GPU-to-GPU fabric for an 8K H100 GPU cloud.
• Spine: Spectrum SN5600 (400G) with 128 downlinks over 400G DAC
• End of row (leaf): Spectrum SN5600 (400G) with 64 uplinks and 64 downlinks, rail-optimized GPU-to-GPU connectivity over 400G LinkX optics
• One scalable unit (SU): 32 nodes (256 GPUs), 8x 400G per node, 4 leaf switches, 400G BlueField-3 DPUs

Rail-optimized scale-out architecture: up to 512K H100 GPUs with non-blocking connectivity, built from 64-node SUs.
• 64 SuperSpine planes (SuperSpine 1–64 in each plane) above Spine 1–64, above Leaf 1–8 in every SU
• SU1 through SU1000, 512 GPUs each
• A maximum of 5 ASIC hops between any two GPUs, versus the 11 required for a modular chassis-based cluster

Israel-1, a Spectrum-X generative AI cloud – the most powerful supercomputer in Israel, with a peak performance of 8 exaflops: 256 HGX Hopper servers, 2,048 Hopper GPUs, 80 Spectrum-4 SN5600 switches, and 2,560 BlueField-3 SuperNICs.
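The scale-out numbers above are easy to sanity-check: with eight GPUs per HGX node and a 1:1 GPU-to-SuperNIC ratio, the cluster size is just nodes-per-SU × 8 × number-of-SUs. A tiny Python sketch of that arithmetic (the function and constant names are illustrative; only figures quoted on the slides are used as inputs):

```python
# Back-of-the-envelope sizing for the rail-optimized fabrics described above.
GPUS_PER_NODE = 8          # HGX H100 node, one 400G SuperNIC per GPU (1:1 ratio)

def gpus_in_fabric(nodes_per_su: int, num_sus: int) -> int:
    """Total GPUs for a fabric built from identical scalable units (SUs)."""
    return nodes_per_su * GPUS_PER_NODE * num_sus

# 8K-GPU cloud: 32-node SUs (256 GPUs each), 32 of them.
print(gpus_in_fabric(nodes_per_su=32, num_sus=32))       # -> 8192

# Maximum scale-out quoted on the slide: 64-node SUs (512 GPUs each), 1000 SUs.
print(gpus_in_fabric(nodes_per_su=64, num_sus=1000))     # -> 512000
```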
Thank You

ucloud.cn
王晓慧, R&D Director, UCloud Compute Product Center

Agenda:
• NVIDIA BlueField DPUs and DPU-accelerated bare metal
• Compute-center network architecture design
• Large-model training and the "孔明" (Kongming) platform
• AIGC applications

NVIDIA BlueField:
• In October 2020, NVIDIA introduced the NVIDIA® BlueField®-2 family of DPUs.
• In April 2021, NVIDIA announced the next-generation data processing unit, the NVIDIA® BlueField®-3 DPU, bringing powerful software-defined networking, storage, and security acceleration to the data center.
BlueField DPU hardware includes:
• A powerful SmartNIC that supports either high-speed Ethernet or InfiniBand;
• A set of Arm cores (BlueField-2: 8, BlueField-3: 16) with DRAM (BlueField-2: DDR4, BlueField-3: DDR5);
• A series of hardware accelerators for security, storage, networking, and remote management.

NVIDIA DOCA: the DPU sits inside the data center's server nodes, and DOCA is the software framework for developing applications that run on BlueField DPUs. With DOCA, infrastructure workloads can be offloaded from the host CPU onto the BlueField DPU and accelerated there. Developers build DPU-resident services on DOCA, turning the DPU into a secure service domain isolated from the business workload.
Core DOCA components:
• Industry-standard APIs: DPDK, SPDK, P4, Linux Netlink
• Networking, security, storage, RDMA, and management acceleration SDKs, etc.
• User-space and kernel-space interfaces

Key DPU accelerations for bare metal (the bare-metal performance concerns are compute, storage, and networking):
• Arm cores: the SmartNIC runs its own OS and services
• NVMe SNAP storage: NVMe controller emulation and NVMe-over-Fabrics hardware offload
• ASAP² hardware offload: kernel OVS offload and GRE tunnel offload

Bare-metal architecture 1.0:
• Uses the ASAP² feature of the NVIDIA DPU to offload the VPC network onto the DPU, including GRE tunnel offload, UNet, and Open vSwitch
• Bandwidth raised from the original 10Gb to 25Gb
• Replaces the original VPC gateway architecture, saving gateway hardware cost and removing the gateway bottleneck

Bare-metal architecture 2.0:
• Built on the NVMe SNAP feature of the NVIDIA DPU
• Storage bandwidth raised to 50Gb
• Bare-metal instances carry no local disks; system and data disks are SSD cloud disks served over an RDMA network

Bare-metal architecture 3.0:
• Near-bare-metal performance
• Adds a thin hypervisor
• Static resource partitioning, friendly to HPC scenarios

Bare-metal architecture 3.1:
• UDisk moves onto the DPU, further freeing the host CPU

Bare-metal generations compared:
Feature | Traditional physical cloud | Bare metal 1.0 | Bare metal 2.0 | Bare metal 3.x
Internal network bandwidth | 10G | 25G | 50G | 50G
Storage IOPS | 167K | 167K | 480K | 960K
VPC gateway | Yes | None | None | None
Private-network passthrough to 快杰 instances | N | Y | Y | Y
No OS installation required | N | N | Y | Y
Delivery time | 30 minutes | 30 minutes | 5 minutes | 1-2 minutes
Disk media | Local disk | Local disk | RSSD cloud disk | RSSD / high-efficiency cloud disk
Disk expansion | N | N | Y | Y
Cold migration on failure | N | N | Y | Y
Data backup | N | N | Y | Y
Virtualization layer | N | N | N | Y
Live migration | N | N | N | Y

Compute-center network architecture design:
• A 400G InfiniBand spine layer (IB-Spine01–08, IB-Spine09–16), plus storage-network spines (Spine01–04) and management-network spines, connected through 400G aggregation groups
• 200G links from IB-Leaf01–08 and Ethernet leaves down to the servers
• Training zones Group1 through GroupN, each with Server01–Server32 and GPU01–GPU08 per server, attached rail by rail to the leaves
• Storage zone: UPFS parallel file system, US3 object storage, and UDisk block storage behind their own leaves
• A800 large-model training cluster: 4x 200G RoCE network architecture
• H800 large-model training cluster: 8x 400G InfiniBand network architecture

Pain points in large-model training: multi-node batch scheduling, NCCL communication failures, XID errors, and loss anomalies.

Capabilities of the "孔明" (Kongming) AI computing platform: the platform supports general PyTorch job scheduling and integrates the DeepSpeed and Megatron frameworks for large-language-model scenarios, so users can onboard easily without worrying about resource and scheduling details. Combined with the platform's fault-detection, checkpoint-resume, and standby-node mechanisms, it keeps the downtime caused by failures during training to a minimum.
Product strengths:
1. Unified scheduling and management of homogeneous and heterogeneous GPUs;
2. Support for TCP/IP, InfiniBand, and RoCE networking;
3. Distributed training with checkpoint resume (a minimal sketch follows at the end of this section);
4. Self-developed high-performance UPFS storage (with GPU Direct Storage support) that significantly improves storage throughput, making checkpointing nearly 10X faster than with traditional storage.

AIGC applications:
• 识问 – an assistant for internal enterprise knowledge Q&A
• PICPIK – designed for professional users
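As a concrete illustration of the checkpoint-resume capability described above, the sketch below shows the minimal PyTorch pattern a training job can use to save its state periodically and pick up where it left off after a failure; the file path, toy model, and optimizer are placeholders, and this is not UCloud's Kongming implementation.

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # placeholder path; a real job would write to shared storage

def save_checkpoint(step, model, optimizer):
    # Persist everything needed to resume: weights, optimizer state, progress.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint(model, optimizer):
    # Resume from the last checkpoint if one exists; otherwise start at step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1

model = torch.nn.Linear(1024, 1024)                        # toy model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
start = load_checkpoint(model, optimizer)

for step in range(start, 1000):
    loss = model(torch.randn(32, 1024)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:                                    # periodic checkpoint
        save_checkpoint(step, model, optimizer)
```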