
CNS20439

RDMA ACCELERATING
SPARK DATA PROCESSING
December 2020
DATA SCIENCE IN PRACTICE
More than 70% of Compute is Spent on Data Processing
• More time is spent processing data than any other stage
• Compute required is greater
• Benefit to developers is high
• NVIDIA opportunity is sizable
[Figure: relative compute spent on Data Processing vs. Training vs. Inference]
Accelerating the Machine Learning Data Pipeline
[Figure: stages of the machine learning data pipeline, each arrow labeled "Accelerated"]
RDMA – Remote DMA
InfiniBand and RoCE (RDMA over Converged Ethernet)
• Message send/receive, remote DMA, and remote atomics
• Hardware transport
• Kernel and hypervisor bypass

Advantages
• Lower latency – 10 µs → 700 ns
• Higher message rate – 215M messages/s
• Lower CPU utilization – 0%

[Figure: TCP path (Application → Kernel TCP/IP Stack → Hypervisor → NIC) vs. RDMA path (Application → RDMA NIC)]
RDMA Write Operation Flow

1. The receiver allocates a receive buffer and sends its address and memory key to the remote side.
2. The sender allocates a send buffer, registers it with the HCA, posts a send WQE containing the remote virtual address, and "rings the doorbell".
3. The HCA consumes the WQE, reads the buffer, sends it to the remote side, and generates a send completion.
4. When the packet arrives at the remote HCA, the HCA checks the address and memory key and writes the data directly into memory.

[Figure: Host A memory → send queue / completion queue → HCA → HCA → receive queue / completion queue → Host B memory]
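The RDMA write flow above can be sketched as a toy model in plain Python (no RDMA hardware involved; `ToyHCA`, `register_mr`, and `handle_write` are illustrative names, not the verbs API): the receiver registers a region and hands out its address and key, and the remote "HCA" validates both before writing directly into memory.

```python
# Toy model of an RDMA write: the remote side accepts the write only
# if the virtual address and memory key match a registered region.

class ToyHCA:
    def __init__(self):
        self.regions = {}   # addr -> (rkey, buffer)

    def register_mr(self, addr, rkey, size):
        """Receiver: allocate a buffer and register it (address + key)."""
        self.regions[addr] = (rkey, bytearray(size))
        return addr, rkey

    def handle_write(self, addr, rkey, data):
        """Incoming packet: check address and memory key, then write to memory."""
        if addr not in self.regions or self.regions[addr][0] != rkey:
            raise PermissionError("memory key check failed")
        self.regions[addr][1][: len(data)] = data   # direct write, no CPU copy

remote = ToyHCA()
addr, rkey = remote.register_mr(addr=0x1000, rkey=42, size=64)
# The sender's WQE carries the remote virtual address and key ("ring the doorbell")
remote.handle_write(addr, rkey, b"hello")
print(bytes(remote.regions[addr][1][:5]))  # b'hello'
```

Note that the CPU on the receiving host never touches the data path; the key check is what makes exposing memory to remote writes safe.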
Enabling Kernel Bypass
Separation of control and data paths:
• Control path (through the kernel RDMA stack and device drivers): connection establishment, resource setup, memory management
• Data path (from the application/middleware through the user-space RDMA library, via verbs, directly to the NIC): send, receive, and RDMA operations, plus completion and request events

[Figure: applications/middleware issue verbs data-path calls straight to the NIC hardware, while control operations traverse the kernel RDMA stack and device drivers]
RDMA over Converged Ethernet (RoCE)
RoCEv2 packet:     Eth L2 | IP | UDP | BTH+ | Payload | iCRC | FCS
InfiniBand packet: LRH | GRH | BTH+ | Payload | iCRC | vCRC

o InfiniBand is a centrally managed, lossless, high-performance network architecture
o Provides the specification for the RDMA transport
o RoCE adapts the efficient RDMA transport to run over Ethernet networks
o Standard Ethernet management

Protocol stacks:
  Layer        InfiniBand               RoCEv2
  Transport    RDMA Transport           RDMA Transport
  Network      IB Network Layer         UDP / IP
  Link         InfiniBand Link layer    Ethernet Link layer
  Management   InfiniBand Management    Ethernet/IP Management
Verbs programming
1. Get the device list
2. Open the requested device
3. Query the device capabilities
4. Allocate a Protection Domain to contain
your resources
5. Register a memory region
6. Create a Completion Queue (CQ)
7. Create a Queue Pair (QP)
8. Bring up a QP
9. Post work requests and poll for completion
…
10. Cleanup

[Figure: HCA resource hierarchy – PD, CQ, QP, AV, MR, MW]
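Step 8, "bring up a QP", means driving the queue pair through its state machine (RESET → INIT → RTR → RTS) with successive modify-QP calls; work requests can only be posted for sending once the QP reaches RTS. The toy model below (plain Python, no verbs dependency; `ToyQP` and its methods are illustrative names, not the libibverbs API) sketches that ordering.

```python
# Toy model of the verbs queue-pair bring-up state machine (illustrative only;
# real code uses ibv_create_qp() / ibv_modify_qp() from libibverbs).

QP_STATES = ["RESET", "INIT", "RTR", "RTS"]  # legal bring-up order

class ToyQP:
    def __init__(self):
        self.state = "RESET"          # a freshly created QP starts in RESET

    def modify(self, new_state):
        """Mimic modify-QP: only the next state in bring-up order is legal."""
        cur = QP_STATES.index(self.state)
        if QP_STATES.index(new_state) != cur + 1:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def post_send(self):
        """Send work requests may only be posted once the QP can send (RTS)."""
        if self.state != "RTS":
            raise RuntimeError("QP not in RTS; cannot post send WR")
        return "WR posted"

qp = ToyQP()
for s in ["INIT", "RTR", "RTS"]:   # step 8: bring up the QP
    qp.modify(s)
print(qp.post_send())              # WR posted
```

Jumping straight from RESET to RTS, like posting a send too early, fails; this is why step 8 sits between resource creation and posting work requests in the list above.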
RDMA & InfiniBand software Stack
Seamless application integration with Upper Layer Protocols (ULPs)
[Figure: RDMA & InfiniBand software stack]
• Applications (user space): RDMA apps, MPI, SHMEM, Spark, TensorFlow, NCCL, UCX, DPDK apps (DPDK PMD), socket apps (VMA), video apps (Rivermax), IB tools, SM, SHARP
• RDMA infrastructure (user space): RDMA Core Services library (verbs, CM) and the device-driver library expose the RDMA transport interface (PD, MR, QP, CQ) and the packet interface
• Operating system (kernel): RDMA Core Services (verbs, CM) and the device driver serve the control path; the data path bypasses the kernel
• Hardware: the RDMA NIC
UCX Library
Unified Communication X (http://openucx.org)
Mission: collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications
UCX LAYERS
• Applications: HPC (MPI, SHMEM, …); storage, RPC, AI; Web 2.0 (Spark, Hadoop); UCX-Py
• UCP – high-level API (protocols): transport selection, multi-rail, fragmentation; HPC API (tag matching, active messages); I/O API (stream, RPC, remote memory access, atomics); connection establishment (client/server, external)
• UCT – low-level API (transports): RDMA (RC, DCT, UD via the OFA verbs driver; iWARP), GPUs/accelerators (CUDA, AMD/ROCm), shared memory, TCP, OmniPath, Cray, others
• Hardware
JUCX – JAVA BINDINGS FOR UCX
Transport abstraction, implemented on top of the UCP layer
Can run over different transport types (shared memory, InfiniBand/RoCE, CUDA, …)
Easy-to-use API wrapper over the high-level UCP layer
Supported operations: non-blocking send/recv/put/get
SPARK SHUFFLE BASICS
[Figure: Input → Map tasks → map-output files; each Reduce task fetches its blocks from every map-output file; map tasks publish their map status to the Spark Master]
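The shuffle pictured above can be sketched as a toy hash shuffle in plain Python (function names and the two-reducer setup are illustrative, not Spark's implementation): each map task partitions its output by key hash into one bucket per reducer, and each reduce task then fetches its bucket from every map output.

```python
# Toy Spark-style shuffle: map side partitions records by hash(key),
# reduce side fetches its partition from every map output and aggregates.

NUM_REDUCERS = 2

def map_task(records):
    """Write one 'file' (bucket) per reducer, partitioned by key hash."""
    buckets = [[] for _ in range(NUM_REDUCERS)]
    for key, value in records:
        buckets[hash(key) % NUM_REDUCERS].append((key, value))
    return buckets

def reduce_task(reducer_id, map_outputs):
    """Fetch this reducer's bucket from each map output, then sum by key."""
    totals = {}
    for buckets in map_outputs:
        for key, value in buckets[reducer_id]:
            totals[key] = totals.get(key, 0) + value
    return totals

map_outputs = [map_task([("a", 1), ("b", 2)]), map_task([("a", 3), ("c", 4)])]
merged = {}
for r in range(NUM_REDUCERS):
    merged.update(reduce_task(r, map_outputs))
print(sorted(merged.items()))  # [('a', 4), ('b', 2), ('c', 4)]
```

The all-to-all fetch in `reduce_task` is the part that becomes disk and network I/O in a real cluster, which is why the next slide calls shuffling expensive.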
THE COST OF SHUFFLING
§ Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O
§ Spark users try to avoid shuffles as much as they can
§ Speedy shuffles can relieve developers of such concerns and simplify applications
ShuffleManager Plugin
Spark allows for external implementations of ShuffleManagers to be plugged in
Configurable per-job using: “spark.shuffle.manager”
Interface allows proprietary implementations of Shuffle Writers and Readers, and essentially
defers the entire Shuffle process to the new component
SparkUCX utilizes this interface to introduce RDMA in the Shuffle process
[Figure: SortShuffleManager (default) vs. UcxShuffleManager]
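Plugging in an external shuffle manager is a small configuration change. A sketch for SparkUCX (the class name follows the SparkUCX project; the jar path is a placeholder, so verify both against the release you deploy):

```
spark.shuffle.manager=org.apache.spark.shuffle.UcxShuffleManager
spark.driver.extraClassPath=/path/to/spark-ucx.jar
spark.executor.extraClassPath=/path/to/spark-ucx.jar
```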
SparkUCX Performance
[Figure: SparkUCX speedups of 27%, 30%, and 60%]
SPARK 3.0 IS GPU ACCELERATED
Meeting the requirements of data processing today and tomorrow
• Spark 3.0 scale-out clusters are accelerated using RAPIDS software and NVIDIA GPUs
• GPU year-over-year performance increases meet and exceed data-processing requirements

"These contributions lead to faster data pipelines, model training and scoring for more breakthroughs and insights with Apache Spark 3.0 and Databricks."
— Matei Zaharia, original creator of Apache Spark and chief technologist at Databricks

[Figure: data to analyze vs. CPU and GPU processing capability, 2000 (Hadoop era) → 2010 (Spark era) → 2020 (Spark + GPU era) → 2030]
Bringing GPU Acceleration To Apache Spark
Seamless Integration with Spark 3.0
Features
• Use existing (unmodified) customer code
• Spark features that are not GPU-enabled run transparently on the CPU

Initial release – GPU acceleration of:
• Spark DataFrames
• Spark SQL
• ML/DL training frameworks
10X Higher Performance with GPUDirect™ RDMA
§ Accelerates HPC and Deep Learning performance
§ Lowest communication latency for GPUs
[Figure: GPUDirect™ RDMA performance — courtesy of Dhabaleswar K. (DK) Panda, Ohio State University]
GPUDirect Storage
Accelerated Shuffle Results
Inventory Pricing Query speedups (query duration in seconds):
• CPU: 228
• GPU: 45
• GPU + UCX: 8.4
SPARK 3.0 SOLUTION ADVANTAGES
A100 GPU, Mellanox Networking, and Vertically-Integrated Solution
NVIDIA A100 GPUs
• Fastest at processing Spark SQL, DataFrame, and ML/DL operations (largest GPU memory, superior memory bandwidth, most operations/s)
• TCO advantages; reduces cost per job while saving on data-center footprint
• Ability to partition/share GPUs to optimize for effective utilization
• Paired with best-of-breed CPUs, enables converged data-processing and ML/DL infrastructure

Mellanox ConnectX NICs and Switches
• Best-of-breed performance (latency and bandwidth); minimal CPU overhead
• Advanced storage capabilities, including NVMe-over-Fabrics offloads

Software Stack
• Zero code changes to run Spark workloads on A100 (Spark API calls that can run on GPUs are automatically intercepted and accelerated)
• Can be delivered and configured out of the box by HPE

Vertical Integration
• Optimal overall performance and TCO, especially when compared to cloud-based solutions