High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu
Consulting Applications Engineer
Chelsio Communications
tomreu@chelsio.com

Chelsio Corporate Snapshot
Leader in high-speed converged Ethernet adapters
• Leading 10/40GbE adapter solution provider for servers and storage systems
• High performance protocol engine: ~5M+ IOPs, 80M PPS, 1.5 μsec latency
• Feature-rich solution: media streaming hardware/software, WAN optimization, security, etc.
• Market coverage: storage, HPC, finance, oil and gas, manufacturing, media, security, service/cloud
• Company facts: founded in 2000, 150-strong staff, ~800K ports shipped
• R&D offices: USA (Sunnyvale), India (Bangalore), China (Shanghai)

RDMA Overview
• Direct memory-to-memory transfer
• All protocol processing handled by the NIC
  - Must be in hardware
• Protection handled by the NIC
  - User-space access requires both local and remote enforcement
• Asynchronous communication model
  - Reduced host involvement
• Performance
  - Latency: polling
  - Throughput
  - Efficiency: zero copy, kernel bypass (user-space I/O), CPU bypass
• Performance and efficiency in return for a new communication paradigm

iWARP
What is it?
• Provides the ability to do Remote Direct Memory Access over Ethernet using TCP/IP
• Uses well-known IB verbs (a minimal verbs sketch follows the iWARP vs RoCE comparison below)
• Inboxed in OFED since 2008
• Runs on top of TCP/IP
  - Chelsio implements the iWARP/TCP/IP stack in silicon
  - Cut-through send
  - Cut-through receive
Benefits
• Engineered to use "typical" Ethernet; no need for technologies like DCB or QCN
• Natively routable
• Multi-path support at Layer 3 (and Layer 2)
• It runs on TCP/IP: mature and proven, and it goes where TCP/IP goes (everywhere)

iWARP
• iWARP updates and enhancements are done by the IETF STORM (Storage Maintenance) working group
• RFCs:
  - RFC 5040: A Remote Direct Memory Access Protocol Specification
  - RFC 5041: Direct Data Placement over Reliable Transports
  - RFC 5044: Marker PDU Aligned Framing for TCP Specification
  - RFC 6580: IANA Registries for the RDDP Protocols
  - RFC 6581: Enhanced RDMA Connection Establishment
  - RFC 7306: Remote Direct Memory Access (RDMA) Protocol Extensions
• Support from several vendors: Chelsio, Intel, QLogic

iWARP
Increasing interest in iWARP as of late. Some use cases:
• High Performance Computing
• SMB Direct
• GPUDirect RDMA
• NFS over RDMA
• FreeBSD iWARP
• Hadoop RDMA
• Lustre RDMA
• NVMe over RDMA fabrics

iWARP
Advantages over other RDMA transports
• It's Ethernet: well understood and well administered
• Uses TCP/IP: mature and proven; supports rack, cluster, data center, LAN/MAN/WAN and wireless; compatible with SSL/TLS
• Does not need bolt-on technologies like DCB or QCN
• Does not require a totally new network infrastructure, which reduces TCO and OpEx

iWARP vs RoCE
| iWARP | RoCE |
| Native TCP/IP over Ethernet, no different from NFS or HTTP | Difficult to install and configure ("needs a team of experts"): plug-and-debug |
| Works with ANY Ethernet switches | Requires DCB: expensive equipment upgrade |
| Works with ALL Ethernet equipment | Poor interoperability: may not work with switches from different vendors |
| No need for special QoS or configuration: TRUE plug-and-play | Fixed QoS configuration: DCB must be set up identically across all switches |
| No need for special configuration, preserves network robustness | Easy to break: switch configuration can cause performance collapse |
| TCP/IP allows reach to cloud scale | Does not scale: requires PFC, limited to a single subnet |
| No distance limitations; ideal for remote communication and HA | Short distance: PFC range is limited to a few hundred meters maximum |
| WAN routable, uses any IP infrastructure | RoCEv1 not routable; RoCEv2 requires lossless IP infrastructure and restricts router configuration |
| Standard for the whole stack has been stable for a decade | RoCEv2 incompatible with v1; more fixes to missing reliability and scalability layers required and expected |
| Transparent and open IETF standards process | Incomplete specification and opaque process |
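Because iWARP is exposed through the same OFED verbs and rdma_cm interfaces mentioned above, existing RDMA code needs no transport-specific changes to run over it. The following is a minimal, illustrative client sketch, not code from this presentation: the server name "server.example.com", the port "7471" and the message contents are placeholder assumptions, and error handling is abbreviated.

    /*
     * Minimal iWARP client sketch using the librdmacm "endpoint" API
     * (illustrative only, not from this presentation). Server name, port
     * and message are placeholders. Link with -lrdmacm -libverbs.
     */
    #include <string.h>
    #include <rdma/rdma_cma.h>
    #include <rdma/rdma_verbs.h>

    int main(void)
    {
        struct rdma_addrinfo hints, *res = NULL;
        struct ibv_qp_init_attr attr;
        struct rdma_cm_id *id = NULL;
        struct ibv_mr *mr;
        struct ibv_wc wc;
        char buf[64] = "hello over iWARP";

        memset(&hints, 0, sizeof(hints));
        hints.ai_port_space = RDMA_PS_TCP;      /* TCP port space, RC service */
        if (rdma_getaddrinfo("server.example.com", "7471", &hints, &res))
            return 1;

        memset(&attr, 0, sizeof(attr));
        attr.cap.max_send_wr = attr.cap.max_recv_wr = 4;
        attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
        attr.sq_sig_all = 1;                    /* signal every send */

        /* Create the CM id + QP in one call, then connect to the peer. */
        if (rdma_create_ep(&id, res, NULL, &attr) || rdma_connect(id, NULL))
            return 1;

        /* Register the send buffer and post a send; poll for the completion.
         * (The peer is assumed to have posted a matching receive.) */
        mr = rdma_reg_msgs(id, buf, sizeof(buf));
        rdma_post_send(id, NULL, buf, sizeof(buf), mr, 0);
        while (rdma_get_send_comp(id, &wc) == 0)
            ;

        rdma_dereg_mr(mr);
        rdma_disconnect(id);
        rdma_destroy_ep(id);
        rdma_freeaddrinfo(res);
        return 0;
    }

The same code runs unchanged over InfiniBand, RoCE or iWARP; which transport is used follows from the address that is resolved and the provider behind it (here, iw_cxgb4).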
Chelsio's T5
Single ASIC does it all
• High performance, purpose-built protocol processor
• Runs multiple protocols:
  - TCP with stateless offload and full offload
  - UDP with stateless offload
  - iWARP
  - FCoE with offload
  - iSCSI with offload
• All of these protocols run on T5 with a SINGLE FIRMWARE IMAGE
  - No need to reinitialize the card for different uses
  - Future proof (e.g. support for NVMf) yet preserves today's investment in iSCSI

T5 ASIC Architecture
[Block diagram: PCIe x8 Gen3 host interface with DMA engine; cut-through TX and RX memory; purpose-built data-flow protocol engine with traffic manager; application co-processors (TX and RX); embedded Layer 2 Ethernet switch; 1G/10G/40G and 100M/1G/10G MACs; lookup, filtering and firewall; general purpose processor; memory controller with on-chip DRAM and optional external DDR3 memory.]
• Single-processor, data-flow pipelined architecture
• Up to 1M connections
• Concurrent multi-protocol operation
• Single connection at 40Gb, low latency

Leading Unified Wire™ Architecture
Converged network architecture with an all-in-one adapter and software
• Storage: NVMe/Fabrics, SMB Direct, iSCSI and FCoE with T10-DIX, iSER and NFS over RDMA, pNFS (NFS 4.1) and Lustre, NAS offload, diskless boot, replication and failover
• Virtualization & Cloud: hypervisor offload, SR-IOV with embedded VEB, VEPA, VN-TAGs, VXLAN/NVGRE, NFV and SDN, OpenStack storage, Hadoop RDMA
• HFT: WireDirect technology, ultra low latency, highest messages/sec, wire-rate classification
• HPC: iWARP RDMA over Ethernet, GPUDirect RDMA, Lustre RDMA, pNFS (NFS 4.1), Open MPI, MVAPICH
• Networking: 4x10GbE / 2x40GbE NIC, full protocol offload, Data Center Bridging, hardware firewall, wire analytics, DPDK/netmap
• Media Streaming: traffic management, video segmentation offload, large stream capacity
Single qualification, single SKU. Concurrent multi-protocol operation.

GPUDirect RDMA
• Introduced by NVIDIA with the Kepler-class GPUs; available today on Tesla and Quadro GPUs as well
• Enables multiple GPUs, 3rd party network adapters, SSDs and other devices to read and write CUDA host and device memory
• Avoids unnecessary system memory copies and associated CPU overhead by copying data directly to and from pinned GPU memory (see the registration sketch below)
• One hardware limitation: the GPU and the network device MUST share the same upstream PCIe root complex
• Available with InfiniBand, RoCE, and now iWARP
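To make the "direct to GPU memory" path concrete, the fragment below sketches how device memory is typically exposed to an RDMA adapter: memory allocated with cudaMalloc is registered through the standard ibv_reg_mr call, and the peer memory module (nv_peer_mem, listed on a later slide) lets the provider pin and map the GPU pages for DMA. This is an illustrative sketch, not code from this presentation; the protection domain pd, the buffer size and the access flags are assumptions.

    /*
     * Illustrative sketch of GPUDirect RDMA memory registration (not from
     * this presentation). Assumes the nv_peer_mem module is loaded and the
     * verbs provider supports peer-to-peer registration; 'pd' comes from an
     * already-opened RDMA device, and BUF_SIZE is an arbitrary example size.
     * Link with -lcudart -libverbs.
     */
    #include <cuda_runtime_api.h>
    #include <infiniband/verbs.h>

    #define BUF_SIZE (1 << 20)           /* 1 MiB example buffer */

    struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, void **gpu_buf)
    {
        /* Allocate device memory on the GPU. */
        if (cudaMalloc(gpu_buf, BUF_SIZE) != cudaSuccess)
            return NULL;

        /* Register the GPU pointer exactly as host memory would be
         * registered; the peer-memory hook lets the adapter DMA payload
         * straight to and from GPU memory, bypassing host memory. */
        return ibv_reg_mr(pd, *gpu_buf, BUF_SIZE,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_REMOTE_READ);
    }

A work request that references the returned lkey (or a remote peer using the rkey) then moves payload directly between the wire and GPU memory, with the host CPU and host memory out of the data path.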
GPUDirect RDMA
T5 iWARP RDMA over Ethernet is certified with NVIDIA GPUDirect
• Read/write GPU memory directly from the network adapter
• Peer-to-peer PCIe communication
• Bypass host CPU
• Bypass host memory
• Zero copy
• Ultra low latency
• Very high performance
• Scalable GPU pooling
[Diagram: on each host, payload flows directly between the GPU and the RNIC over PCIe while only notifications touch the host CPU and memory; packets travel between hosts over any Ethernet network (LAN, data center or WAN).]

Modules required for GPUDirect RDMA with iWARP
• Chelsio modules:
  - cxgb4 - Chelsio adapter driver
  - iw_cxgb4 - Chelsio iWARP driver
  - rdma_ucm - RDMA user-space connection manager
• NVIDIA modules:
  - nvidia - NVIDIA driver
  - nvidia_uvm - NVIDIA Unified Memory
  - nv_peer_mem - NVIDIA peer memory
A quick verbs-level sanity check follows below.
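As a hedged illustration (not part of the original slides), once these modules are loaded the Chelsio adapter should be visible to user space through the standard verbs device list, typically under a name such as cxgb4_0, which is also the interface name referenced by the Open MPI options later in this deck:

    /*
     * Illustrative check (not from this presentation): list the verbs
     * devices visible to user space. With the modules above loaded, the
     * Chelsio iWARP device should appear, e.g. as "cxgb4_0".
     * Link with -libverbs.
     */
    #include <stdio.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int i, num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (!list) {
            perror("ibv_get_device_list");
            return 1;
        }
        for (i = 0; i < num; i++)
            printf("RDMA device %d: %s\n", i, ibv_get_device_name(list[i]));
        if (num == 0)
            fprintf(stderr, "no RDMA devices found - are the modules loaded?\n");

        ibv_free_device_list(list);
        return num ? 0 : 1;
    }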
Case Studies

HOOMD-blue
• General-purpose particle simulation toolkit
• Stands for: Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
• Runs on GPUDirect RDMA WITH NO CHANGES TO THE CODE AT ALL
• More info: www.codeblue.umich.edu/hoomd-blue

HOOMD-blue
Test configuration
• 4 nodes
• Intel E5-1660 v2 @ 3.7 GHz
• 64 GB RAM
• Chelsio T580-CR 40Gb adapter
• NVIDIA Tesla K80 (2 GPUs per card)
• RHEL 6.5
• OpenMPI 1.10.0
• OFED 3.18
• CUDA Toolkit 6.5
• HOOMD-blue v1.3.1-9
• Chelsio-GDR-1.0.0.0
• Command line:
    $MPI_HOME/bin/mpirun --allow-run-as-root -mca btl_openib_want_cuda_gdr 1 -np X -hostfile /root/hosts -mca btl openib,sm,self -mca btl_openib_if_include cxgb4_0:1 --mca btl_openib_cuda_rdma_limit 65538 -mca btl_openib_receive_queues P,131072,64 -x CUDA_VISIBLE_DEVICES=0,1 /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu

HOOMD-blue
Lennard-Jones Liquid 64K Particles benchmark
• Classic benchmark for general-purpose MD simulations
• Representative of the performance HOOMD-blue achieves for straight pair-potential simulations

HOOMD-blue
Lennard-Jones Liquid 64K Particles benchmark results
Average timesteps per second (higher is better):
| Test | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| Test 1 | 2 CPU cores: 26 | 2 GPUs: 488 | 2 GPUs: 503 |
| Test 2 | 8 CPU cores: 88 | 4 GPUs: 1,089 | 4 GPUs: 1,230 |
| Test 3 | 40 CPU cores: 214 | 8 GPUs: 1,403 | 8 GPUs: 1,771 |

Hours to complete 10e6 steps (lower is better):
| Test | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| Test 1 | 2 CPU cores: 108 | 2 GPUs: 6 | 2 GPUs: 5.5 |
| Test 2 | 8 CPU cores: 32 | 4 GPUs: 2.5 | 4 GPUs: 2.2 |
| Test 3 | 40 CPU cores: 13 | 8 GPUs: 1.7 | 8 GPUs: 1.5 |

HOOMD-blue
Quasicrystal benchmark
• Runs a system of particles with an oscillatory pair potential that forms an icosahedral quasicrystal
• This model is used in the research article: Engel M, et al. (2015) "Computational self-assembly of a one-component icosahedral quasicrystal", Nature Materials 14 (January), p. 109-116.

HOOMD-blue
Quasicrystal benchmark results
Average timesteps per second (higher is better):
| Test | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| Test 1 | 2 CPU cores: 11 | 2 GPUs: 308 | 2 GPUs: 407 |
| Test 2 | 8 CPU cores: 43 | 4 GPUs: 656 | 4 GPUs: 728 |
| Test 3 | 40 CPU cores: 31 | 8 GPUs: 915 | 8 GPUs: 1,158 |

Hours to complete 10e6 steps (lower is better):
| Test | CPU only | GPU w/o GPUDirect RDMA | GPU w/ GPUDirect RDMA |
| Test 1 | 2 CPU cores: 264 | 2 GPUs: 9 | 2 GPUs: 7 |
| Test 2 | 8 CPU cores: 63 | 4 GPUs: 4 | 4 GPUs: 3.5 |
| Test 3 | 40 CPU cores: 86 | 8 GPUs: 3 | 8 GPUs: 2.4 |

Caffe
Deep learning framework
• Open-source deep learning software from the Berkeley Vision and Learning Center
• Updated to include CUDA support to utilize GPUs
• The standard version does NOT include MPI support
• MPI implementations:
  - mpi-caffe: used to train a large network across a cluster of machines; model-parallel distributed approach
  - caffe-parallel: faster framework for deep learning; data-parallel via MPI, splits the training data across nodes

Summary
GPUDirect RDMA over 40GbE iWARP
• iWARP provides RDMA capabilities to an Ethernet network
• iWARP uses tried-and-true TCP/IP as its underlying transport mechanism
• Using iWARP does not require a whole new network infrastructure or the management requirements that come along with it
• iWARP can be used with existing software running on GPUDirect RDMA, with NO CHANGES required to the code
• Applications that use GPUDirect RDMA will see huge performance improvements
• Chelsio provides 10/40Gb iWARP TODAY, with 25/50/100Gb on the horizon

More information
GPUDirect RDMA over 40GbE iWARP
• Visit our website, www.chelsio.com, for more white papers, benchmarks, etc.
• GPUDirect RDMA white paper: http://www.chelsio.com/wp-content/uploads/resources/T5-40Gb-LinuxGPUDirect.pdf
• Webinar: https://www.brighttalk.com/webcast/13671/189427
• Beta code for GPUDirect RDMA is available TODAY from our download site at service.chelsio.com
• Sales questions: sales@chelsio.com
• Support questions: support@chelsio.com

Questions?

Thank You