
Networking in the Era of Big Compute: NVIDIA Networking Solutions

Networking in the Era of Big Compute
宋庆春, Senior Director, NVIDIA Networking Asia-Pacific | 2024
The transformation of the big-compute era:
The data center is the computer
The network defines the data center
AI Factory – NVLink + InfiniBand:
• Single or few users
• Extremely large AI models
• NVIDIA NVLink and InfiniBand gold standard for AI fabric

AI Cloud – InfiniBand or NVIDIA Spectrum-X AI Ethernet Fabric:
• Generative AI Cloud, multi-tenant
• Variety of workloads including larger scale Generative AI
• Traditional Ethernet network for North-South traffic
• NVIDIA Spectrum-X Ethernet for AI fabric (East-West)

Cloud – Traditional Ethernet:
• Multi-tenant
• Variety of small-scale workloads
• Traditional Ethernet network can suffice

[Chart: fabric choice vs. number of GPUs in the cluster, from 10 to 1M+]
A new yardstick for Generative AI training performance
MLPerf Training – where application and compute-platform performance meet
• Stable Diffusion – Text-to-Image
• GPT-3 175B – Large Language Model
• DLRMv2 – Recommendation
• BERT-Large – NLP
• RetinaNet – Object Detection, Lightweight
• Mask R-CNN – Object Detection, Heavyweight
• 3D U-Net – Biomedical Image Segmentation
• RNN-T – Speech Recognition
• ResNet-50 v1.5 – Image Classification
NVIDIA focuses on raising compute performance – GPU + network end-to-end reference architecture
Reference architecture, complete solutions, continuous optimization
[Chart: NVIDIA MLPerf submission scale by GPU generation]
• 2019: 1,536 Volta GPUs + ConnectX-5, 100G InfiniBand
• 2021: 4,320 Ampere GPUs + ConnectX-6, 200G InfiniBand
• 2023: 10,752 Hopper GPUs + ConnectX-7, 400G InfiniBand
Six new compute records for the big-compute era
Never the fastest, only faster
• GPT-3 175B (1B tokens): 3.9 minutes, 2.8X faster
• Stable Diffusion: 2.5 minutes, new workload
• DLRM-dcnv2: 1 minute, 1.6X faster
• BERT-Large: 7.2 seconds, 1.1X faster
• RetinaNet: 55.2 seconds, 1.8X faster
• 3D U-Net: 46 seconds, 1.07X faster
MLPerf™ Training v3.1. Results retrieved from www.mlperf.org on November 8, 2023. Format: Chip Count, MLPerf ID | GPT-3: 3584x 3.0-2003,
10752x 3.1-2007 | Stable Diffusion: 1024x 3.1-2050 | DLRMv2: 128x 3.0-2065, 128x 3.1-2051 | BERT-Large: 3072x 3.0-2001, 3472x 3.1-2053 |
RetinaNet: 768x 3.0-2077, 2048x 3.1-2052 | 3D U-Net: 432x 3.0-2067, 768x 3.1-2064. The MLPerf™ name and logo are trademarks of
MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See
www.mlcommons.org for more information.
Near-linear scaling of large-model AI training on 10,000+ NVIDIA GPUs
3X the GPUs delivers a 2.8X speedup in large-model training
[Chart: MLPerf Training GPT-3 175B, benchmark time to train 1B tokens vs. number of GPUs]
• MLPerf v3.0: 3,584 GPUs, 10.9 minutes to train
• MLPerf v3.1: 10,752 GPUs, 3.9 minutes to train (intermediate v3.1 points at 8.6, 6.0, and 4.9 minutes)
• New software increases GPU performance and enables record scale
• Scaling efficiency: 93.3%
MLPerf™ Training v3.0 and v3.1. Results retrieved from www.mlperf.org on November 8, 2023, from entries 3.0-2003, 3.1-2005, 3.1-2007, 3.1-2008, 3.1-2009. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
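The 93.3% scaling-efficiency figure follows directly from the two submissions above. A minimal sketch of the arithmetic (pure Python; numbers taken from the chart):

# Scaling efficiency for MLPerf Training GPT-3 175B (1B tokens),
# using the two published data points from the chart above.
baseline_gpus, baseline_minutes = 3_584, 10.9   # MLPerf v3.0 submission
scaled_gpus, scaled_minutes = 10_752, 3.9       # MLPerf v3.1 submission

speedup = baseline_minutes / scaled_minutes     # ~2.8x faster
gpu_ratio = scaled_gpus / baseline_gpus         # 3.0x more GPUs
scaling_efficiency = speedup / gpu_ratio        # ~0.93 -> ~93%

print(f"speedup: {speedup:.2f}x, GPUs: {gpu_ratio:.1f}x, "
      f"efficiency: {scaling_efficiency:.1%}")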
NVIDIA networking platform – InfiniBand and Ethernet
QUANTUM-2
INFINIBAND SWITCH
CONNECTX-7
SMARTNIC
SPECTRUM-4
ETHERNET SWITCH
BLUEFIELD-3
SuperNIC
MANAGEMENT
Networking for the AI Cloud
• North-South for user-to-cloud communications
• East-West for distributed and disaggregated processing

Control / User Access Network (North-South) – Traditional Network:
• Loosely coupled applications
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Heterogeneous traffic, average multi-pathing

AI Fabric (East-West) – Data Center:
• Distributed, tightly coupled processing
• RoCE (high-bandwidth flows and utilization)
• Low jitter tolerance (long tail kills performance)
• Bursty network capacity, predictable performance
Building AI compute centers with NVIDIA Digital Twin
The network pays for itself: performance first, compute first
Win the network, win the compute
[Chart: relative AI fabric performance (NCCL AllReduce) – Traditional Ethernet (baseline), NVIDIA Spectrum-X Ethernet (1.6X), NVIDIA InfiniBand (>2X)]
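Comparisons like the chart above are usually measured with an NCCL all-reduce bandwidth test. A minimal sketch of such a probe using torch.distributed (an illustrative stand-in, not NVIDIA's nccl-tests benchmark):

# Minimal NCCL all-reduce bandwidth probe (sketch only).
# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_bench.py
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

nbytes = 1 << 30                                  # 1 GiB payload
x = torch.ones(nbytes // 4, dtype=torch.float32, device="cuda")

for _ in range(5):                                # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = (time.perf_counter() - t0) / iters

# Ring all-reduce moves 2*(N-1)/N of the buffer per rank ("bus bandwidth").
busbw = (2 * (world - 1) / world) * nbytes / elapsed / 1e9
if rank == 0:
    print(f"allreduce 1 GiB: {elapsed * 1e3:.1f} ms, bus bandwidth {busbw:.1f} GB/s")
dist.destroy_process_group()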
NVIDIA 高效网络方案助力大模型应用
冯高锋 NVIDIA 网络技术市场高级总监 | 2024
A POD at Any Scale
Growing with Scalable Units (SU)
NVIDIA Eos and Microsoft Azure AI Supercomputers
Record-setting performance with over 10,000 GPUs and NVIDIA Quantum-2 InfiniBand Networking
NVIDIA EOS
MLPerf Proven #1 AI Supercomputer in the World
Microsoft Azure Eagle
Largest Cloud Submission for MLPerf and Top500
InfiniBand Roadmap
SDR - Single Data Rate
DDR - Double Data Rate
QDR - Quad Data Rate
FDR - Fourteen Data Rate
EDR - Enhanced Data Rate
HDR - High Data Rate
NDR - Next Data Rate
https://www.infinibandta.org/infiniband-roadmap/
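For reference, the nominal 4x-link data rates behind these generation names, as published on the InfiniBand Trade Association roadmap (a small lookup sketch; consult the roadmap URL above for current figures):

# Nominal 4x-link data rates for the InfiniBand generations listed above.
IB_4X_LINK_GBPS = {
    "SDR": 10,   # Single Data Rate
    "DDR": 20,   # Double Data Rate
    "QDR": 40,   # Quad Data Rate
    "FDR": 56,   # Fourteen Data Rate
    "EDR": 100,  # Enhanced Data Rate
    "HDR": 200,  # High Data Rate
    "NDR": 400,  # Next Data Rate (Quantum-2 generation)
}

for gen, gbps in IB_4X_LINK_GBPS.items():
    print(f"{gen}: {gbps} Gb/s per 4x link")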
NVIDIA Quantum-2 InfiniBand Platform
Unprecedented Performance, Scalability, and Security for Scientific Computing
Most Advanced Networking
End-to-End
Bare-Metal Secured Multi-Tenant Infrastructure
Performance Isolation with Congestion Control
Advanced Adaptive Routing
In-Network Computing
400Gb/s InfiniBand: high throughput, extremely low latency, high message rate
RDMA, GPUDirect RDMA, GPUDirect Storage, adaptive routing, congestion control, smart topologies
1.2x Higher Application Performance with BlueField DPU and Quantum InfiniBand In-Network Computing
End-to-end in-network computing across the adapter/DPU (ConnectX-7, BlueField-3 DPU) and the NVIDIA Quantum-2 switch:
• MPI Tag Matching
• Programmable datapath accelerator
• Data processing units (Arm cores)
• Self-healing network
• Data security / tenant isolation
• All-to-All
• Data reductions (SHARP)
Example workloads: physics/chemistry, weather, FFT
NVIDIA GPUDirect RDMA
10X Higher Performance
• Without GPUDirect: the network is handled by the CPU and CPU memory – 2 full copy operations, 2 PCIe transactions, higher CPU usage and latency, lower GPU utilization
• With GPUDirect: the network goes directly to GPU memory – 0 copy operations, 1 PCIe transaction, lower CPU usage and latency, higher GPU utilization
• InfiniBand provides a native lossless network for RDMA, spanning user space, kernel, and hardware
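From the application side, GPUDirect RDMA is exercised simply by handing GPU-resident buffers to NCCL. A sketch of the standard NCCL knobs used to observe or widen the GPUDirect RDMA path (the environment variables are standard NCCL settings; defaults usually already pick GPUDirect RDMA when the NIC and GPU share a PCIe switch):

# Sketch: observe/allow the GPUDirect RDMA path in an NCCL job.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")    # allow GDR across the whole system topology
os.environ.setdefault("NCCL_DEBUG", "INFO")           # NCCL reports whether GPU Direct RDMA is enabled per NIC
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

# GPU-resident buffer: with GPUDirect RDMA the HCA reads it directly over PCIe;
# without it, NCCL stages the transfer through host memory.
buf = torch.ones(64 << 20, device="cuda")             # 256 MB of float32
dist.all_reduce(buf)
torch.cuda.synchronize()
dist.destroy_process_group()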
In-Network Computing Accelerated Supercomputing
Software-Defined, Hardware-Accelerated InfiniBand Network
Architected to Scale | Most Advanced Networking | Centralized Management
In-network computing capabilities spanning the adapter/DPU and the switch, end-to-end:
• RDMA, GPUDirect RDMA, GPUDirect Storage
• Adaptive routing, congestion control, smart topologies
• High throughput, extremely low latency, high message rate
• All-to-All, MPI Tag Matching
• Programmable datapath accelerator, data processing units (Arm cores)
• Self-healing network
• Data reductions (SHARP)
• Data security / tenant isolation
SHARP Accelerates AI Performance
• Standard collective operations: the CPU in a parameter server becomes the bottleneck
• NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol): performs the gradient averaging in the network, replacing all physical parameter servers and accelerating AI performance
• In-network, tree-based aggregation mechanism
• Multiple simultaneous outstanding operations
[Diagram: SHARP aggregation tree – aggregation nodes (AN) embedded in the switches form a reduction tree over the hosts]
Small Message and Large Message Reduction
Barrier, Reduce, All-Reduce, Broadcast and More
Sum, Min, Max, Min-loc, max-loc, OR, XOR, AND
Integer and Floating-Point, 16/32/64 bits
AN = SHARP aggregation node (switch-resident); hosts are the data sources and destinations
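From the training framework's point of view the operation SHARP offloads is still an ordinary all-reduce used for gradient averaging. A minimal sketch of that pattern; the environment variable assumes a cluster with the NCCL SHARP/CollNet plugin installed (an assumption about site setup, not something the framework requires):

# Sketch: data-parallel gradient averaging expressed as an all-reduce.
# On a Quantum InfiniBand fabric with the NCCL SHARP (CollNet) plugin,
# the reduction can execute inside the switches.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")   # opt in to CollNet/SHARP (assumes plugin installed)

dist.init_process_group(backend="nccl")
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

model = torch.nn.Linear(4096, 4096).cuda()
loss = model(torch.randn(32, 4096, device="cuda")).square().mean()
loss.backward()

# Average gradients across all ranks -- the reduction SHARP aggregates in-network.
for p in model.parameters():
    dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
    p.grad /= world

dist.destroy_process_group()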
SHARPv3 Multi-Tenant Network Reductions
Cloud-Native Supercomputing with Quantum-2 400Gb/s
• SHARP v3 – now multi-tenant: multiple high-bandwidth reductions running in parallel over disjoint trees
• Enhanced precision (19 bits) for lower-precision operands
• 32X more data reduction engines
• Double the bandwidth to 400Gb/s
NVIDIA Quantum 200Gb/s SHARPv2 results: 15% faster deep learning recommendations, 17% faster natural language processing, 15% faster computational fluid dynamics simulations
[Diagram: NVIDIA Quantum-2 SHARPv3 running disjoint reduction trees for Tenant A, Tenant B, and Tenant C]
NCCL allreduce Performance with SHARP on AI Cluster
NVIDIA Quantum InfiniBand Infrastructure
In-Network Computing Accelerated Network for Supercomputing
Skyway Gateway
MetroX Long-haul
ConnectX Adapter
BlueField DPU
Quantum Switch
UFM Cyber-AI
LinkX Interconnect
Thank You
NVIDIA Spectrum-X: building extreme-performance AI clusters on an Ethernet foundation
陈龙, Director of Network Market Development, NVIDIA | 2024
Two Types of AI Data Centers
Networking for AI Data Centers
AI Factories
Single or few users | Extremely large AI models | NVLink and InfiniBand AI fabric
AI Cloud
Multi-tenant | Variety of workloads | Ethernet network
The Data Center is The Computer
The network defines the data center
AI Factory – NVLink + InfiniBand:
• Single or few users
• Extremely large AI models
• NVIDIA NVLink and InfiniBand gold standard for AI fabric

AI Cloud – InfiniBand or NVIDIA Spectrum-X AI Ethernet Fabric:
• Generative AI Cloud, multi-tenant
• Variety of workloads including larger scale Generative AI
• Traditional Ethernet network for North-South traffic
• NVIDIA Spectrum-X Ethernet for AI fabric (East-West)

Cloud – Traditional Ethernet:
• Multi-tenant
• Variety of small-scale workloads
• Traditional Ethernet network can suffice

[Chart: fabric choice vs. number of GPUs in the cluster, from 10 to 1M+]
AI Workloads Require an AI Fabric
Full Stack Optimized
Control / User Access Network (N-S):
• Loosely-coupled applications, no isolation required
• TCP (low-bandwidth flows and utilization)
• High jitter tolerance
• Heterogeneous traffic, statistical multi-pathing
AI Fabric (E-W):
• Tightly-coupled processes, tenant isolation required
• RDMA (high-bandwidth flows and utilization)
• Low jitter tolerance
• Bursty network capacity, predictable performance
Running AI Workloads on Traditional Ethernet
Sub-optimal for addressing the needs of AI clouds:
• Significant congestion
• Increased latency
• Bandwidth unfairness
Introducing the Spectrum-X Networking Platform
World's first purpose-built Ethernet fabric for AI:
• 95% effective bandwidth (highest)
• 1.6X increased AI network performance
NVIDIA Spectrum-X: World's First Ethernet Platform for AI
Combining Specialized High-Performance Architecture with Standard Ethernet Connectivity
NVIDIA Spectrum-X Networking Platform:
• Nearly perfect effective bandwidth at scale
• NVIDIA RoCE extensions for scalable AI communications
• Extremely low latency
• Deterministic performance and performance isolation
• Full-stack and end-to-end optimization
• Standard Ethernet connectivity with open NOSes (SONiC, Cumulus)
• Built on the Spectrum-4 Ethernet switch and BlueField-3 SuperNIC
NVIDIA Spectrum-4
The first Ethernet switch family purpose-built for AI:
• Fast: 4X bandwidth capacity increase
• Efficient: 4X reduction in solution footprint
• Secure: in-flight and at-rest encryption
• Green: 50% reduction in solution power
Spectrum-4 Switch ASIC and SN5000 Ethernet Switches:
• 51.2 Tb/s aggregate bandwidth, 400/800GbE ports
• 100G SerDes technology
• 100B transistors, NVIDIA 4N design process
NVIDIA BlueField-3 – network accelerator for powering Generative AI clouds:
• Peak AI workload efficiency, secure cloud multi-tenancy, extensible infrastructure
• 400Gb/s network bandwidth, 1:1 GPU/SuperNIC ratio
• 16-core programmable compute, <75W power envelope
NVIDIA B3140H SuperNIC: optimized for East-West traffic in GPU-accelerated systems; power-efficient, low-profile design
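The 51.2 Tb/s aggregate figure maps directly onto port counts. A back-of-the-envelope sketch of the arithmetic (port math only, not a product configuration list):

# Back-of-the-envelope port math for a 51.2 Tb/s switch ASIC such as Spectrum-4.
ASIC_TBPS = 51.2
for port_gbps in (800, 400, 200, 100):
    ports = int(ASIC_TBPS * 1000 // port_gbps)
    print(f"{port_gbps} GbE: up to {ports} ports")
# -> 64x 800GbE, 128x 400GbE, 256x 200GbE, 512x 100GbE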
End-to-End Adaptive RDMA Routing with a Lossless Network
Increases effective data throughput by 1.6X
• BlueField-3 sends data into the switch network
• Spectrum-4 adaptive routing spreads the data packets across all available routes
• BlueField-3 out-of-order data delivery ensures the data lands in order in memory
• Increases effective bandwidth from a typical 60% to 95%
[Chart: effective network bandwidth per switch port, with and without adaptive routing – the Spectrum-X platform is 1.6X higher than traditional Ethernet, approaching 400 Gb/s]
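The gap between roughly 60% and 95% effective bandwidth comes from static per-flow ECMP hashing: whole flows collide onto the same uplink while other links sit idle, whereas per-packet spraying with out-of-order delivery keeps every link busy. A toy Monte Carlo sketch of that effect (an illustrative model, not a measurement of any product):

# Toy model: bandwidth lost to per-flow ECMP hash collisions vs. packet spraying.
import random

def ecmp_efficiency(flows: int, uplinks: int, trials: int = 10_000) -> float:
    """Each flow is hashed to one uplink; delivered bandwidth is limited to
    the links that actually carry traffic (idle links contribute nothing)."""
    total = 0.0
    for _ in range(trials):
        load = [0] * uplinks
        for _ in range(flows):
            load[random.randrange(uplinks)] += 1
        total += sum(1 for l in load if l > 0) / uplinks
    return total / trials

print(f"per-flow ECMP, 8 flows over 8 uplinks: "
      f"~{ecmp_efficiency(8, 8):.0%} of fabric bandwidth")
print("per-packet spraying (adaptive routing): ~100% minus reorder overhead")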
Congestion Occurring on Traditional Ethernet Results in Victim Flows
• Diverse workloads can impact each other's performance, creating victim flows
Noise Isolation with Programmable Congestion Control
• Spectrum-X detects congestion spots in real time (Spectrum-4 telemetry probes and congestion detection)
• Programmable congestion control meters the data flow (BlueField-3 flow metering)
• Results in performance isolation across workloads
Performance Comparison
[Charts: LLM NCCL AllReduce as a percentage of peak bandwidth on traditional Ethernet vs. Spectrum-X Ethernet, across optimal, average, and worst-case GPU placement; the traditional Ethernet results show visible jitter]
• Spectrum-X performance is consistent; traditional Ethernet shows run-to-run bandwidth variability
• Results in 1.4X higher LLM performance (2K GPUs)
Spectrum-X Rail-Optimized Leaf and Spine
Spectrum-4 + BlueField-3 GPU-to-GPU Fabric for an 8K H100 GPU Cloud
• Spine: Spectrum SN5600 (400G), 128x downlinks, 400G DAC
• End of row (leaf): Spectrum SN5600 (400G), 64x uplinks and 64x downlinks, rail-optimized GPU-to-GPU connectivity, 400G LinkX optics
• 1 SU: 32 nodes (256 GPUs), 8x 400G per node, 4 leaf switches, 400G BlueField-3 DPUs
Rail-Optimized Scale-Out Architecture
Up to 512K H100 GPUs with Non-Blocking Connectivity (64-Node SU)
[Diagram: 64 fabric planes, each with SuperSpine 1–64 and Spine 1–64, connecting Leaf 1–8 of every scalable unit; SU1 through SU1000 at 512 GPUs per SU]
Max of 5 ASIC hops between GPUs
(11 required for a modular chassis-based cluster)
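A sizing sketch of the scale-out arithmetic above, assuming SN5600-class leaves used as 128x 400G and one leaf per rail per SU (the leaf port split mirrors the non-blocking 1:1 design on the previous slide):

# Sizing sketch for the rail-optimized scale-out fabric described above.
GPUS_PER_NODE = 8            # 8 rails per HGX node
NODES_PER_SU  = 64           # 64-node scalable unit
SUS           = 1000         # SU1 ... SU1000 in the diagram

gpus_per_su = GPUS_PER_NODE * NODES_PER_SU    # 512 GPUs
total_gpus  = gpus_per_su * SUS               # 512,000 GPUs

leaves_per_su  = GPUS_PER_NODE                # one leaf per rail -> 8
leaf_downlinks = NODES_PER_SU                 # 64x 400G down to the SU's nodes
leaf_uplinks   = NODES_PER_SU                 # 64x 400G up, for non-blocking 1:1

print(f"{gpus_per_su} GPUs/SU, {total_gpus:,} GPUs total, "
      f"{leaves_per_su * SUS:,} leaf switches "
      f"({leaf_downlinks}+{leaf_uplinks} ports of 400G each)")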
Israel-1 Spectrum-X Generative AI Cloud
Most Powerful Supercomputer in Israel with Peak Performance of 8 Exaflops
256 x HGX Hopper Servers
2048 x Hopper GPUs
80 x Spectrum-4
SN5600 Switches
2560 x BlueField-3
SuperNICs
Thank You
UCLOUD.CN
UCloud: Bare-Metal Cloud and AI Compute Practice with NVIDIA DPU and Networking
王晓慧, R&D Director, UCloud Compute Product Center
Agenda
• NVIDIA BlueField DPU and bare-metal acceleration
• AI compute center network architecture
• Large-model training platform
• AIGC applications
NVIDIA BlueField DPU and Bare-Metal Acceleration
NVIDIA BlueField
• In October 2020, NVIDIA launched the new NVIDIA® BlueField®-2 series DPU.
• In April 2021, NVIDIA released the next-generation data processing unit, the NVIDIA® BlueField®-3 DPU, bringing powerful software-defined networking, storage, and security acceleration to the data center.
NVIDIA BlueField-2 DPU
The BlueField DPU hardware includes:
• A powerful SmartNIC supporting either high-speed Ethernet or InfiniBand;
• A set of Arm cores (BF2: 8, BF3: 16) plus DRAM (BF2: DDR4, BF3: DDR5);
• A series of hardware accelerators, mainly for security, storage, networking, and remote management.
NVIDIA BlueField-3 DPU
NVIDIA DOCA
The DPU sits inside the data center's server nodes; DOCA is the software framework for developing applications on the BlueField DPU. With DOCA, infrastructure workloads can be offloaded from the host CPU onto the BlueField DPU and accelerated there. Developers use DOCA to build services that run on the DPU, turning it into a secure service domain isolated from the business workloads.
Core components of DOCA:
• Industry-standard APIs: DPDK, SPDK, P4, Linux Netlink
• Networking acceleration SDK, security acceleration SDK, storage acceleration SDK, RDMA acceleration SDK, management SDK, etc.
• User-space and kernel-space interfaces
Key DPU Accelerations for Bare Metal
Bare-metal performance concerns: compute, storage, and networking.
• Arm cores: the SmartNIC runs its own OS and services
• NVMe SNAP storage: NVMe controller emulation, NVMe-over-Fabrics hardware offload
• ASAP2 hardware offload: kernel OVS offload, GRE tunnel offload
Bare Metal Architecture 1.0
• Based on the NVIDIA DPU ASAP2 feature, the VPC network is offloaded onto the DPU, including GRE tunnel offload, UNet, and Open vSwitch (see the offload sketch after this list)
• Bandwidth raised from the original 10Gb to 25Gb
• Replaces the original VPC gateway architecture, saving gateway equipment cost and removing the gateway bottleneck
Bare Metal Architecture 2.0
• Based on the NVIDIA DPU NVMe SNAP feature
• Storage bandwidth raised to 50Gb
• Bare-metal instances have no local disks; both system and data disks are SSD cloud disks over an RDMA network
Bare Metal Architecture 3.0
• Near bare-metal performance
• Adds a thin hypervisor
• Static resource partitioning, friendly to HPC scenarios
Bare Metal Architecture 3.1
• UDisk moved onto the DPU, further freeing host CPU
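The ASAP2 datapath offload in Architecture 1.0 hinges on OVS hardware offload being switched on for the DPU-hosted Open vSwitch instance. A minimal sketch of that configuration, driven from Python using the standard Open vSwitch knobs (illustrative only, not UCloud's actual provisioning code; the service name varies by distribution):

# Sketch: enable OVS hardware offload (the ASAP2 datapath) on a
# BlueField-hosted OVS instance. Standard Open vSwitch knobs only.
import subprocess

def sh(cmd: str) -> str:
    return subprocess.run(cmd, shell=True, check=True,
                          capture_output=True, text=True).stdout.strip()

# Turn on hardware offload and restart the switch daemon to apply it
# ("openvswitch-switch" on Debian/Ubuntu; "openvswitch" on RHEL-family).
sh("ovs-vsctl set Open_vSwitch . other_config:hw-offload=true")
sh("systemctl restart openvswitch-switch")

# Confirm the flag took effect.
print(sh("ovs-vsctl get Open_vSwitch . other_config:hw-offload"))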
Comparison Across Bare Metal Versions

Feature                                   | Traditional physical cloud | Bare Metal 1.0 | Bare Metal 2.0 | Bare Metal 3.x
Internal network bandwidth                | 10G                        | 25G            | 50G            | 50G
Storage IOPS                              | 167K                       | 167K           | 480K           | 960K
VPC gateway                               | Yes                        | No             | No             | No
Private-network passthrough to Kuaijie    | N                          | Y              | Y              | Y
No OS installation required               | N                          | N              | Y              | Y
Delivery time                             | 30 min                     | 30 min         | 5 min          | 1-2 min
Disk media                                | Local disk                 | Local disk     | RSSD cloud disk | RSSD cloud disk / high-efficiency cloud disk
Disk expansion                            | N                          | N              | Y              | Y
Cold migration on failure                 | N                          | N              | Y              | Y
Data backup                               | N                          | N              | Y              | Y
Virtualization layer                      | N                          | N              | N              | Y
Live migration                            | N                          | N              | N              | Y
AI Compute Center Network Architecture
Compute Center Network Architecture Design
[Diagram: AI compute center fabric]
• Compute fabric: 400G InfiniBand spine layer (IB-Spine01 through IB-Spine16) and leaf layer (IB-Leaf01 through IB-Leaf08 per training zone), with 400G aggregation groups and 200G links
• Training zones Group1 through GroupN: 32 servers each (Server01–Server32), 8 GPUs per server (GPU01–GPU08)
• Dedicated storage-network spines (Spine01–04) and management-network spines
• Storage zone: UPFS parallel file system, US3 object storage, UDisk block storage
Large-model training clusters:
• A800 compute zone: 4x 200G RoCE network architecture
• H800 compute zone: 8x 400G InfiniBand network architecture
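The two compute zones differ mainly in per-node injection bandwidth; a quick arithmetic sketch from the figures above:

# Per-node network injection bandwidth for the two compute zones above.
a800_nics, a800_gbps = 4, 200     # A800 zone: 4x 200G RoCE per node
h800_nics, h800_gbps = 8, 400     # H800 zone: 8x 400G InfiniBand per node

print(f"A800 node: {a800_nics * a800_gbps} Gb/s ({a800_nics * a800_gbps / 8:.0f} GB/s)")
print(f"H800 node: {h800_nics * h800_gbps} Gb/s ({h800_nics * h800_gbps / 8:.0f} GB/s)")
# -> 800 Gb/s (100 GB/s) vs. 3200 Gb/s (400 GB/s)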
Large-Model Training Platform
Pain Points in Large-Model Training
• Multi-node batch scheduling
• NCCL communication failures
• XID errors
• Loss anomalies

Capabilities of the "Kongming" AI Compute Platform
The "Kongming" platform supports generic PyTorch job scheduling and integrates the DeepSpeed and Megatron frameworks for large-language-model scenarios, so users can onboard easily without worrying about resource and scheduling details. Combined with the platform's fault detection, resume-from-checkpoint, and standby-node backup mechanisms, it minimizes the interruption time caused by failures during training.

Product advantages:
1. Unified scheduling and management of homogeneous and heterogeneous accelerators
2. Support for TCP/IP, InfiniBand, and RoCE networking
3. Distributed training and resume-from-checkpoint (a generic sketch of this pattern follows below)
4. Self-developed high-performance storage UPFS (with GPU Direct Storage support), which significantly increases storage throughput; checkpointing is nearly 10x faster than with traditional storage
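The resume-from-checkpoint capability above boils down to periodic checkpointing plus restart-from-latest. A generic PyTorch sketch of that pattern (illustrative only, not the Kongming platform's implementation; the path and interval are hypothetical):

# Generic checkpoint/resume pattern behind "resume from breakpoint" training.
import os, glob
import torch

CKPT_DIR = "/data/ckpts"          # illustrative path, e.g. on a UPFS mount
SAVE_EVERY = 500                  # steps between checkpoints

def save_ckpt(step, model, optim):
    os.makedirs(CKPT_DIR, exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optim.state_dict()},
               os.path.join(CKPT_DIR, f"step_{step:08d}.pt"))

def load_latest(model, optim):
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    if not ckpts:
        return 0                   # fresh start
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optim.load_state_dict(state["optim"])
    return state["step"] + 1       # resume after the last saved step

model = torch.nn.Linear(1024, 1024)
optim = torch.optim.AdamW(model.parameters())
start = load_latest(model, optim)

for step in range(start, 10_000):
    optim.zero_grad()
    loss = model(torch.randn(8, 1024)).square().mean()
    loss.backward()
    optim.step()
    if step % SAVE_EVERY == 0:
        save_ckpt(step, model, optim)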
AIGC Applications
• 识问 (Shiwen): an enterprise internal knowledge Q&A assistant
• PICPIK: designed for professional users
Thank You