RDMA ACCELERATING SPARK DATA PROCESSING
December 2020

DATA SCIENCE IN PRACTICE
More than 70% of compute is spent on data processing
• More time is spent processing data than on any other stage of the pipeline (training or inference)
• The compute required is correspondingly greater
• The benefit to developers is high
• The NVIDIA opportunity is sizable

Accelerating the Machine Learning Data Pipeline
[Figure: every stage of the pipeline is accelerated]

RDMA – Remote DMA
InfiniBand and RoCE (RDMA over Converged Ethernet)
• Messages: send/receive, remote DMA, and remote atomics
• Hardware transport
• Kernel and hypervisor bypass
[Figure: a TCP application passes through the kernel TCP/IP stack, the hypervisor, and the NIC; an RDMA application talks to the RDMA NIC directly]
Advantages:
• Lower latency: 10 µs → 700 ns
• Higher message rate: 215M messages/s
• Lower CPU utilization: 0%

RDMA Write Operation Flow
1. Host B allocates a receive buffer, registers it, and sends its address and memory key to the remote side.
2. Host A allocates a send buffer, registers it with the HCA, posts a send WQE containing the remote virtual address, and "rings the doorbell".
3. The HCA consumes the WQE, reads the buffer, transmits the data to the remote side, and generates a send completion.
4. When the packets arrive at Host B's HCA, it checks the address and memory key and writes the data directly into host memory.
Each host communicates with its HCA through send, receive, and completion queues held in host memory.

Enabling Kernel Bypass
Separation of the control and data paths:
• Control path (application → RDMA library → device driver → kernel RDMA stack): resource setup, memory management, connection establishment, and completion/request events
• Data path (application → RDMA library → NIC hardware): send, receive, and RDMA operations, with no kernel involvement

RDMA over Converged Ethernet (RoCE)
o InfiniBand is a centrally managed, lossless, high-performance network architecture
o InfiniBand provides the specification for the RDMA transport
o RoCE adapts the efficient RDMA transport to run over Ethernet networks
o Management is standard Ethernet/IP
Packet formats:
• InfiniBand: LRH | GRH | BTH+ | Payload | iCRC | VCRC
• RoCEv2: Eth L2 | IP | UDP | BTH+ | Payload | iCRC | FCS
Both carry the same RDMA transport: InfiniBand runs it over the IB network and link layers under InfiniBand management, while RoCEv2 runs it over UDP/IP and the Ethernet link layer under standard Ethernet/IP management.

Verbs programming
1. Get the device list
2. Open the requested device
3. Query the device capabilities
4. Allocate a Protection Domain (PD) to contain your resources
5. Register a memory region (MR)
6. Create a Completion Queue (CQ)
7. Create a Queue Pair (QP)
8. Bring up the QP
9. Post work requests and poll for completions
10. Clean up
Verbs objects: HCA, PD, CQ, QP, AV (address vector), MR, MW (memory window)

RDMA & InfiniBand software stack
Seamless application integration with Upper Layer Protocols (ULPs)
• Applications: MPI, SHMEM, Spark, TensorFlow, NCCL, native RDMA apps, DPDK apps, socket apps, video apps
• User-space middleware: UCX, DPDK PMD, VMA (socket apps), Rivermax (video apps)
• RDMA core services library (verbs, CM) and device-driver library in user space; RDMA core services and the device driver in the kernel
• Interfaces: RDMA transport interface (PD, MR, QP, CQ) and packet interface
• The data path bypasses the kernel; only the control path goes through it
• Fabric services and tools: SM (subnet manager), IB tools, SHARP

UCX Library
Unified Communication X (http://openucx.org)
Mission: collaboration between industry, laboratories, and academia to create production-grade communication frameworks and open standards for data-centric and high-performance applications

UCX LAYERS
• Applications: HPC (MPI, SHMEM, …), storage, RPC, AI, Web 2.0 (Spark, Hadoop); UCX-Py for Python
• UCP – high-level API (protocols): transport selection, multi-rail, fragmentation; HPC API (tag matching, active messages); I/O API (stream, RPC, remote memory access, atomics); connection establishment (client/server, external)
• UCT – low-level API (transports): RDMA RC/DCT/UD via OFA verbs drivers, iWARP, GPUs and accelerators (CUDA, AMD/ROCm), shared memory, TCP, OmniPath, Cray, and others
• Hardware

JUCX – JAVA BINDINGS FOR UCX
• Transport abstraction implemented on top of the UCP layer
• Runs over different transport types (shared memory, InfiniBand/RoCE, CUDA, …)
• Ease of use: an API wrapper over the high-level UCP layer
• Supported operations: non-blocking send/recv/put/get (see the sketch below)
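To make the JUCX usage pattern concrete, here is a minimal sketch of a non-blocking tagged send. It is a sketch, not SparkUCX's actual code: the class and method names (UcpContext, UcpWorker, sendTaggedNonBlocking, …) follow recent JUCX releases, and the out-of-band exchange of the remote worker address (obtainRemoteWorkerAddress) is a hypothetical helper.

```java
import java.nio.ByteBuffer;
import org.openucx.jucx.UcxCallback;
import org.openucx.jucx.ucp.*;

public class JucxSendSketch {
    public static void main(String[] args) {
        // Ask UCP to initialize only the tag-matching machinery.
        UcpContext context = new UcpContext(new UcpParams().requestTagFeature());
        UcpWorker worker = context.newWorker(new UcpWorkerParams());

        // Assumption: the remote worker's address was exchanged out of band
        // (SparkUCX, for example, bootstraps addresses through the driver).
        ByteBuffer remoteAddress = obtainRemoteWorkerAddress();
        UcpEndpoint endpoint = worker.newEndpoint(
            new UcpEndpointParams().setUcpAddress(remoteAddress));

        // Non-blocking tagged send: returns immediately, completes via callback.
        ByteBuffer payload = ByteBuffer.allocateDirect(4096);
        UcpRequest request = endpoint.sendTaggedNonBlocking(payload, 42L,
            new UcxCallback() {
                @Override
                public void onSuccess(UcpRequest req) {
                    System.out.println("send completed");
                }
            });

        // UCX has no progress thread of its own; the caller drives completion.
        while (!request.isCompleted()) {
            worker.progress();
        }

        endpoint.close();
        worker.close();
        context.close();
    }

    // Hypothetical stand-in for a TCP- or driver-based address exchange.
    private static ByteBuffer obtainRemoteWorkerAddress() {
        throw new UnsupportedOperationException("out-of-band address exchange");
    }
}
```

The same endpoint also supports the put/get operations listed above; only the send/recv path is shown here.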
SPARK SHUFFLE BASICS
• Map stage: each map task reads its input split, writes its output to a local map file, and publishes its map status to the Spark master.
• Reduce stage: each reduce task learns the block locations from the master and fetches its blocks from every map output file.

THE COST OF SHUFFLING
§ Shuffling is very expensive in terms of CPU, RAM, disk, and network I/O
§ Spark users therefore try to avoid shuffles as much as they can
§ Fast shuffles relieve developers of that concern and simplify applications

ShuffleManager Plugin
Spark allows external ShuffleManager implementations to be plugged in, configurable per job via "spark.shuffle.manager". The interface allows proprietary implementations of shuffle writers and readers, and essentially defers the entire shuffle process to the new component. SparkUCX utilizes this interface to introduce RDMA into the shuffle process, replacing the default SortShuffleManager with UcxShuffleManager (see the configuration sketch below).
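A minimal sketch of how a job might opt in to the SparkUCX shuffle manager from Java. The shuffle-manager class name and the jar path are assumptions taken from the SparkUCX project layout; check its README for the exact values for your Spark version.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class UcxShuffleJobSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("sparkucx-demo")
            // Swap the default SortShuffleManager for the SparkUCX implementation.
            // Assumption: class name as published by the SparkUCX project.
            .set("spark.shuffle.manager", "org.apache.spark.shuffle.UcxShuffleManager")
            // Assumption: the SparkUCX jar (with JUCX) on every node's classpath.
            .set("spark.driver.extraClassPath", "/path/to/spark-ucx.jar")
            .set("spark.executor.extraClassPath", "/path/to/spark-ucx.jar");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Any wide transformation now fetches its shuffle blocks over RDMA.
        spark.range(0, 1_000_000).toDF("id")
             .repartition(200)   // forces a full shuffle
             .count();

        spark.stop();
    }
}
```

Because the entire shuffle is deferred to the plugged-in component, the application code itself is unchanged; only the configuration differs.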
SparkUCX Performance
[Chart: shuffle speedups of 27%, 30%, and 60% on the benchmarked workloads]

SPARK 3.0 IS GPU ACCELERATED
Meeting the requirements of data processing today and tomorrow
• Spark 3.0 scale-out clusters are accelerated using RAPIDS software and NVIDIA GPUs
• GPU year-over-year performance increases meet and exceed growing data processing requirements
[Chart: data to analyze vs. data processing capability, CPU vs. GPU, across the Hadoop era (2000–2010), the Spark era (2010–2020), and the Spark + GPU era (2020–2030)]
"These contributions lead to faster data pipelines, model training and scoring for more breakthroughs and insights with Apache Spark 3.0 and Databricks." — Matei Zaharia, original creator of Apache Spark and chief technologist at Databricks

Bringing GPU Acceleration To Apache Spark
Seamless integration with Spark 3.0 features:
• Use existing (unmodified) customer code
• Spark features that are not GPU-enabled run transparently on the CPU
Initial release – GPU acceleration of:
• Spark DataFrames
• Spark SQL
• ML/DL training frameworks

10X Higher Performance with GPUDirect™ RDMA
§ Accelerates HPC and deep learning performance
§ Lowest communication latency for GPUs
(Results courtesy of Dhabaleswar K. (DK) Panda, Ohio State University)

GPUDirect Storage

Accelerated Shuffle Results
Inventory pricing query speedups (query duration in seconds):
• CPU: 228 s
• GPU: 45 s
• GPU + UCX: 8.4 s

SPARK 3.0 SOLUTION ADVANTAGES
A100 GPUs, Mellanox networking, and a vertically integrated solution
NVIDIA A100 GPUs
• Fastest at processing Spark SQL, DataFrame, and ML/DL operations (largest GPU memory, superior memory bandwidth, most operations per second)
• TCO advantages: reduces cost per job while saving on data center footprint
• Can be partitioned or shared to optimize for effective utilization
• Paired with best-of-breed CPUs, enables a converged data processing and ML/DL infrastructure
Mellanox ConnectX NICs and switches
• Best-of-breed latency and bandwidth with minimal CPU overhead
• Advanced storage capabilities, including NVMe-over-Fabrics offloads
Software stack
• Zero code changes to run Spark workloads on A100: Spark API calls that can run on GPUs are automatically intercepted and accelerated (see the sketch below)
• Can be delivered and configured out of the box by HPE
Vertical integration
• Optimal overall performance and TCO, especially when compared with cloud-based solutions
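As an illustration of the zero-code-changes claim, enabling the RAPIDS Accelerator is a configuration-only change. This is a minimal sketch: spark.plugins=com.nvidia.spark.SQLPlugin is the accelerator's documented entry point, but the jar paths and the Spark-version-specific RapidsShuffleManager class name are assumptions to verify against the RAPIDS Accelerator documentation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class RapidsJobSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("rapids-demo")
            // RAPIDS Accelerator entry point: SQL/DataFrame operators with GPU
            // implementations are intercepted; everything else stays on the CPU.
            .set("spark.plugins", "com.nvidia.spark.SQLPlugin")
            // Assumption: the accelerator jar is distributed to every node.
            .set("spark.driver.extraClassPath", "/path/to/rapids-4-spark.jar")
            .set("spark.executor.extraClassPath", "/path/to/rapids-4-spark.jar")
            // Optional GPU+UCX shuffle (the GPU+UCX result shown above).
            // Assumption: this class name is Spark-version-specific.
            .set("spark.shuffle.manager",
                 "com.nvidia.spark.rapids.spark301.RapidsShuffleManager");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();

        // Unmodified application code follows; the shuffle-inducing groupBy is
        // planned onto the GPU when the plugin supports it.
        spark.range(0, 1_000_000).toDF("id")
             .selectExpr("id % 100 AS key", "id AS value")
             .groupBy("key").count()
             .show();

        spark.stop();
    }
}
```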