High-Performance GPU Clustering: GPUDirect RDMA over 40GbE iWARP
Tom Reu
Consulting Applications Engineer
Chelsio Communications
tomreu@chelsio.com
Chelsio Corporate Snapshot
Leader in High Speed Converged Ethernet Adapters
• Leading 10/40GbE adapter solution provider for servers and storage systems
• High performance protocol engine
  • ~5M+ IOPs
  • 80M PPS
  • 1.5 μsec
• Feature rich solution
Market Coverage
• Storage, HPC, Finance, Oil and Gas, Manufacturing, Media, Security, Service/Cloud
• Media streaming hardware/software
• WAN Optimization, Security, etc.
Company Facts
• Founded in 2000
• 150 strong staff
• ~800K ports shipped
R&D Offices
• USA – Sunnyvale
• India – Bangalore
• China – Shanghai
RDMA Overview
Chelsio T5 RNIC
• Direct memory-to-memory transfer
• All protocol processing handled by the NIC
  • Must be in hardware
• Protection handled by the NIC
  • User space access requires both local and remote enforcement
• Asynchronous communication model (see the verbs sketch after this list)
  • Reduced host involvement
• Performance
  • Latency - polling
  • Throughput
• Efficiency
  • Zero copy
  • Kernel bypass (user space I/O)
  • CPU bypass
Performance and efficiency in return for a new communication paradigm
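This asynchronous model maps directly onto the RDMA verbs API: work requests are posted to a queue pair and completions are harvested by polling a completion queue, with no kernel involvement on the data path. Below is a minimal C sketch of a zero-copy RDMA WRITE followed by a latency-oriented poll; it is illustrative only, assumes the queue pair, completion queue, registered memory region and remote address/rkey exchange were set up beforehand, and trims error handling.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a zero-copy RDMA WRITE from a registered local buffer into a remote
 * buffer whose address and rkey were exchanged out of band, then busy-poll
 * the completion queue instead of sleeping (latency via polling). */
static int rdma_write_and_poll(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, void *laddr, uint32_t len,
                               uint64_t raddr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)laddr, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id = 1, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE, .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = raddr;
    wr.wr.rdma.rkey = rkey;

    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad))       /* asynchronous: returns at once */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)    /* CPU polls; no interrupt, no syscall */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}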
iWARP
What is it?
• Provides the ability to do Remote Direct Memory Access over Ethernet using TCP/IP
• Uses well-known IB verbs (see the connection sketch after this list)
• Inboxed in OFED since 2008
• Runs on top of TCP/IP
  • Chelsio implements the iWARP/TCP/IP stack in silicon
  • Cut-through send
  • Cut-through receive
Benefits
• Engineered to use “typical” Ethernet
  • No need for technologies like DCB or QCN
• Natively routable
  • Multi-path support at Layer 3 (and Layer 2)
• It runs on TCP/IP
  • Mature and proven
  • Goes where TCP/IP goes (everywhere)
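Because iWARP endpoints are addressed with ordinary IP addresses and ports, applications typically establish connections through the RDMA connection manager and then issue the same verbs calls they would use on any other RDMA transport. The sketch below is a minimal, illustrative client-side connect in C; the function name and queue sizes are arbitrary choices and error handling is trimmed.

#include <rdma/rdma_cma.h>

/* Resolve a plain IP address/port and connect a reliable queue pair over
 * whatever RDMA transport the interface provides - an iWARP RNIC here. */
static struct rdma_cm_id *iwarp_connect(const char *ip, const char *port)
{
    struct rdma_addrinfo hints = { .ai_port_space = RDMA_PS_TCP }, *res;
    struct ibv_qp_init_attr attr = {
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct rdma_cm_id *id = NULL;

    if (rdma_getaddrinfo(ip, port, &hints, &res))   /* ordinary IP addressing */
        return NULL;
    if (rdma_create_ep(&id, res, NULL, &attr)) {    /* CM id plus QP in one call */
        rdma_freeaddrinfo(res);
        return NULL;
    }
    if (rdma_connect(id, NULL)) {                   /* TCP-style connect */
        rdma_destroy_ep(id);
        id = NULL;
    }
    rdma_freeaddrinfo(res);
    return id;
}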
iWARP
• iWARP updates and enhancements are done by the IETF STORM (Storage Maintenance) working group
• RFCs
  • RFC 5040: A Remote Direct Memory Access Protocol Specification
  • RFC 5041: Direct Data Placement over Reliable Transports
  • RFC 5044: Marker PDU Aligned Framing for TCP Specification
  • RFC 6580: IANA Registries for the RDDP Protocols
  • RFC 6581: Enhanced RDMA Connection Establishment
  • RFC 7306: Remote Direct Memory Access (RDMA) Protocol Extensions
• Support from several vendors: Chelsio, Intel, QLogic
iWARP
Increasing interest in iWARP as of late
• Some use cases
  • High Performance Computing
  • SMB Direct
  • GPUDirect RDMA
  • NFS over RDMA
  • FreeBSD iWARP
  • Hadoop RDMA
  • Lustre RDMA
  • NVMe over RDMA fabrics
iWARP
Advantages over Other RDMA Transports
• It’s Ethernet
  • Well understood and administered
• Uses TCP/IP
  • Mature and proven
  • Supports rack, cluster, datacenter, LAN/MAN/WAN and wireless
  • Compatible with SSL/TLS
• Does not need to use any bolt-on technologies like
  • DCB
  • QCN
• Does not require a totally new network infrastructure
  • Reduces TCO and OpEx
iWARP vs RoCE

iWARP: Native TCP/IP over Ethernet, no different from NFS or HTTP
RoCE: Difficult to install and configure - “needs a team of experts” - Plug-and-Debug

iWARP: Works with ANY Ethernet switches
RoCE: Requires DCB - expensive equipment upgrade

iWARP: Works with ALL Ethernet equipment
RoCE: Poor interoperability - may not work with switches from different vendors

iWARP: No need for special QoS or configuration - TRUE Plug-and-Play
RoCE: Fixed QoS configuration - DCB must be set up identically across all switches

iWARP: No need for special configuration, preserves network robustness
RoCE: Easy to break - switch configuration can cause performance collapse

iWARP: TCP/IP allows reach to Cloud scale
RoCE: Does not scale - requires PFC, limited to a single subnet

iWARP: No distance limitations. Ideal for remote communication and HA
RoCE: Short distance - PFC range is limited to a few hundred meters maximum

iWARP: WAN routable, uses any IP infrastructure
RoCE: RoCEv1 not routable. RoCEv2 requires lossless IP infrastructure and restricts router configuration

iWARP: Standard for whole stack has been stable for a decade
RoCE: RoCEv2 incompatible with v1. More fixes to missing reliability and scalability layers required and expected

iWARP: Transparent and open IETF standards process
RoCE: Incomplete specification and opaque process
Chelsio’s T5
Single ASIC does it all
• High performance purpose-built protocol processor
• Runs multiple protocols
  • TCP with stateless offload and full offload
  • UDP with stateless offload
  • iWARP
  • FCoE with offload
  • iSCSI with offload
• All of these protocols run on T5 with a SINGLE FIRMWARE IMAGE
  • No need to reinitialize the card for different uses
  • Future proof, e.g. support for NVMf, yet preserves today’s investment in iSCSI
T5 ASIC Architecture
High Performance Purpose Built Protocol Processor
[Block diagram: PCIe x8 Gen3 DMA engine; cut-through TX and RX memories; traffic manager; data-flow protocol engine; application co-processors (TX and RX); embedded Layer 2 Ethernet switch; lookup, filtering and firewall; general purpose processor; memory controller with on-chip DRAM and optional external DDR3 memory; 100M/1G/10G and 1G/10G/40G MACs.]
▪ Single-processor data-flow pipelined architecture
▪ Up to 1M connections
▪ Concurrent multi-protocol operation
Single connection at 40Gb. Low latency.
Leading Unified Wire™ Architecture
Converged Network Architecture with all-in-one Adapter and Software
Storage
▪ NVMe/Fabrics
▪ SMB Direct
▪ iSCSI and FCoE with T10-DIX
▪ iSER and NFS over RDMA
▪ pNFS (NFS 4.1) and Lustre
▪ NAS Offload
▪ Diskless boot
▪ Replication and failover
Virtualization & Cloud
▪ Hypervisor offload
▪ SR-IOV with embedded VEB
▪ VEPA, VN-TAGs
▪ VXLAN/NVGRE
▪ NFV and SDN
▪ OpenStack storage
▪ Hadoop RDMA
HFT / HPC
▪ WireDirect technology
▪ Ultra low latency
▪ Highest messages/sec
▪ Wire rate classification
▪ iWARP RDMA over Ethernet
▪ GPUDirect RDMA
▪ Lustre RDMA
▪ pNFS (NFS 4.1)
▪ OpenMPI
▪ MVAPICH
Networking
▪ 4x10GbE / 2x40GbE NIC
▪ Full Protocol Offload
▪ Data Center Bridging
▪ Hardware firewall
▪ Wire Analytics
▪ DPDK/netmap
Media Streaming
▪ Traffic Management
▪ Video segmentation Offload
▪ Large stream capacity
Single Qualification – Single SKU
Concurrent Multi-Protocol Operation
GPUDirect RDMA
• Introduced by NVIDIA with the Kepler-class GPUs. Available today on Tesla and Quadro GPUs as well.
• Enables multiple GPUs, 3rd party network adapters, SSDs and other devices to read and write CUDA host and device memory
• Avoids unnecessary system memory copies and associated CPU overhead by copying data directly to and from pinned GPU memory (see the sketch below)
• One hardware limitation
  • The GPU and the network device MUST share the same upstream PCIe root complex
• Available with InfiniBand, RoCE, and now iWARP
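In verbs terms this means GPU device memory can be registered with the RNIC just like host memory, after which RDMA operations move payload over PCIe directly between the adapter and the GPU. The following C sketch shows the idea using the CUDA runtime API; it assumes the nv_peer_mem kernel module (listed on a later slide) is loaded and that a protection domain pd already exists, and it is not production code.

#include <cuda_runtime.h>
#include <infiniband/verbs.h>

/* Allocate a CUDA device buffer and register it with the RNIC so that RDMA
 * reads/writes target GPU memory directly, bypassing host memory. */
static struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len,
                                          void **gpu_buf)
{
    if (cudaMalloc(gpu_buf, len) != cudaSuccess)      /* GPU device memory */
        return NULL;

    /* With GPUDirect RDMA available, ibv_reg_mr accepts the device pointer;
     * without it, data would have to be staged through a pinned host buffer. */
    struct ibv_mr *mr = ibv_reg_mr(pd, *gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        cudaFree(*gpu_buf);
    return mr;
}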
GPUDirect RDMA
T5 iWARP RDMA over Ethernet certified with NVIDIA GPUDirect
• Read/write GPU memory directly from the network adapter
• Peer-to-peer PCIe communication
• Bypass host CPU
• Bypass host memory
• Zero copy
• Ultra low latency
• Very high performance
• Scalable GPU pooling
[Diagram: on each host, payload moves directly between GPU memory and the RNIC over peer-to-peer PCIe; only notifications involve the host CPU and host memory; packets cross any Ethernet network (LAN/datacenter/WAN).]
Modules required for GPUDirect RDMA with iWARP
• Chelsio modules
  • cxgb4 - Chelsio adapter driver
  • iw_cxgb4 - Chelsio iWARP driver
  • rdma_ucm - RDMA user space connection manager
• NVIDIA modules
  • nvidia - NVIDIA driver
  • nvidia_uvm - NVIDIA Unified Memory
  • nv_peer_mem - NVIDIA peer memory
Case Studies
HOOMD-blue
• General purpose particle simulation toolkit
• Stands for: Highly Optimized Object-oriented Many-particle Dynamics - Blue Edition
• Running on GPUDirect RDMA - WITH NO CHANGES TO THE CODE - AT ALL!
• More info: www.codeblue.umich.edu/hoomd-blue
HOOMD-blue
Test Configuration
• 4 nodes
• Intel E5-1660 v2 @ 3.7 GHz
• 64 GB RAM
• Chelsio T580-CR 40Gb adapter
• NVIDIA Tesla K80 (2 GPUs per card)
• RHEL 6.5
• OpenMPI 1.10.0
• OFED 3.18
• CUDA Toolkit 6.5
• HOOMD-blue v1.3.1-9
• Chelsio-GDR-1.0.0.0
Command line:
$MPI_HOME/bin/mpirun --allow-run-as-root -mca btl_openib_want_cuda_gdr 1 \
  -np X -hostfile /root/hosts -mca btl openib,sm,self \
  -mca btl_openib_if_include cxgb4_0:1 --mca btl_openib_cuda_rdma_limit 65538 \
  -mca btl_openib_receive_queues P,131072,64 -x CUDA_VISIBLE_DEVICES=0,1 \
  /root/hoomd-install/bin/hoomd ./bmark.py --mode=gpu|cpu
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark
• Classic benchmark for general purpose MD simulations
• Representative of the performance HOOMD-blue achieves for straight pair potential simulations
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark Results
[Chart: average timesteps per second for Test 1 (2 CPU cores / 2 GPUs), Test 2 (8 CPU cores / 4 GPUs) and Test 3 (40 CPU cores / 8 GPUs), comparing CPU only, GPU without GPUDirect RDMA and GPU with GPUDirect RDMA. Rates range from 26 timesteps/sec on 2 CPU cores up to 1,771 timesteps/sec on 8 GPUs with GPUDirect RDMA. Longer is better.]
HOOMD-blue
Lennard-Jones Liquid 64K Particles Benchmark Results
[Chart: hours to complete 10e6 steps for the same three tests. The CPU-only runs take 108, 32 and 13 hours on 2, 8 and 40 cores respectively, while the GPU runs with GPUDirect RDMA finish in roughly 1.5 to 2.2 hours. Shorter is better.]
HOOMD-blue
Quasicrystal Benchmark
• Runs a system of particles with an oscillatory pair potential that forms an icosahedral quasicrystal
• This model is used in the research article: Engel M, et al. (2015) Computational self-assembly of a one-component icosahedral quasicrystal, Nature Materials 14 (January), p. 109-116.
HOOMD-blue
Quasicrystal Results
[Chart: average timesteps per second for the same three tests. CPU-only rates are 11, 43 and 31 timesteps/sec on 2, 8 and 40 cores; GPU rates scale up to 1,158 timesteps/sec on 8 GPUs with GPUDirect RDMA. Longer is better.]
HOOMD-blue
Quasicrystal Results
[Chart: hours to complete 10e6 steps for the same three tests. The CPU-only runs take 264, 63 and 86 hours, while the 8-GPU run with GPUDirect RDMA completes in about 2.4 hours. Shorter is better.]
Caffe
Deep Learning Framework
• Open source deep learning software from the Berkeley Vision and Learning Center
• Updated to include CUDA support to utilize GPUs
• Standard version does NOT include MPI support
• MPI implementations
  • mpi-caffe
    • Used to train a large network across a cluster of machines
    • Model-parallel distributed approach
  • caffe-parallel
    • Faster framework for deep learning
    • Data-parallel via MPI, splits the training data across nodes
Summary
GPUDirect RDMA over 40GbE iWARP
• iWARP provides RDMA capabilities to an Ethernet network
• iWARP uses tried-and-true TCP/IP as its underlying transport mechanism
• Using iWARP does not require a whole new network infrastructure and the management requirements that come along with it
• iWARP can be used with existing software running on GPUDirect RDMA with NO CHANGES required to the code
• Applications that use GPUDirect RDMA will see huge performance improvements
• Chelsio provides 10/40Gb iWARP TODAY with 25/50/100Gb on the horizon
More Information
GPUDirect RDMA over 40GbE iWARP
• Visit our website, www.chelsio.com, for more White Papers, Benchmarks, etc.
• GPUDirect RDMA White Paper: http://www.chelsio.com/wp-content/uploads/resources/T5-40Gb-LinuxGPUDirect.pdf
• Webinar: https://www.brighttalk.com/webcast/13671/189427
• Beta code for GPUDirect RDMA is available TODAY from our download site at service.chelsio.com
• Sales questions - sales@chelsio.com
• Support questions - support@chelsio.com
Questions?
Thank You