Virtualizing Modern High-Speed
Interconnection Networks with
Performance and Scalability
Bo Li, Zhigang Huo, Panyong Zhang, Dan Meng
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Presenter: Xiang Zhang zhangxiang@ncic.ac.cn
Institute of Computing Technology, Chinese
Academy of Sciences, Beijing, China
Introduction
• Virtualization is now one of the enabling
technologies of Cloud Computing
• Many HPC providers now use their systems as platforms for cloud/utility computing; these HPC-on-Demand offerings include:
– Penguin's POD
– IBM's Computing On Demand service
– R Systems' dedicated hosting service
– Amazon’s EC2
Introduction:
Virtualizing HPC clouds?
• Pros:
– good manageability
– proactive fault tolerance
– performance isolation
– online system maintenance
• Cons:
– Performance gap
• Lack of low-latency interconnects, which are important to tightly-coupled MPI applications
• VMM-bypass has been proposed to address this concern
Introduction:
VMM-bypass I/O Virtualization
• The Xen split device driver model is used only to set up the necessary user access points
• Data communication on the critical path bypasses both the guest OS and the VMM
[Figure: VMM-Bypass I/O (courtesy [7]). Applications in the VM and the IDD run on guest OS / IDD OS modules (guest module, backend module, privileged module); privileged access goes through the split driver to the IDD, while VMM-bypass access lets the application reach the OS-bypass I/O device directly.]
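The split between the setup path and the data path can be made concrete with a small libibverbs fragment. This is an illustrative sketch, not code from the paper; it assumes the QP, CQ and registered memory region were already created through the slow-path split driver.

/* Fast-path send on an already-established connection. */
#include <stdint.h>
#include <infiniband/verbs.h>

int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    /* Posting is a write to user-mapped device pages (doorbell/blueframe):
     * no guest-kernel or hypervisor involvement on this path. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Completions are polled from a CQ that lives in user memory. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}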
Introduction:
InfiniBand Overview
• InfiniBand is a popular
high-speed interconnect
– OS-bypass/RDMA
– Latency: ~1us
– BW: 3300MB/s
• ~41.4% of Top500 now
uses InfiniBand as the
primary interconnect
[Chart: Top500 systems by interconnect family, June 2010. Source: http://www.top500.org]
Introduction:
InfiniBand Scalability Problem
• Reliable Connection (RC)
– Queue Pair (QP), Each QP consists of SQ and RQ
– QPs require memory
• Shared Receive Queue (SRQ)
• eXtensible Reliable Connection (XRC)
– XRC domain & SRQ-based addressing
– Connections per process: (N-1)×C with RC vs. (N-1) with XRC (N: node count, C: cores per node)
[Figure: RC vs. XRC in InfiniBand. With RC, each process on node1 (P1~P4) holds a QP to every process on node2 (P5~P8); with XRC, it holds one connection per remote node and reaches P5~P8 through their SRQs (SRQ5~SRQ8) inside node2's XRC domain.]
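Restating the counts in the figure above as formulas (the per-node totals on the next slide are simply these per-process counts multiplied by the C processes on a node):

$$\mathrm{QPs\ per\ process}_{RC} = (N-1)\,C, \qquad \mathrm{QPs\ per\ process}_{XRC} = N-1$$

With XRC, a single QP to a remote node plus SRQ-based addressing inside that node's XRC domain reaches all C processes there, which is why the factor C disappears.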
Problem Statement
• Does a scalability gap exist between native and virtualized environments?
– CV: cores per VM
QPs required for a fully connected job:

            Transport   QPs per Process    QPs per Node
Native      RC          (N-1)×C            (N-1)×C²
Native      XRC         (N-1)              (N-1)×C
VM          RC          (N-1)×C            (N-1)×C²
VM          XRC         (N-1)×(C/CV)       (N-1)×(C²/CV)

[Figure: XRC in VMs with CV=1 and CV=2. Each VM holds its own XRC domain (XRCD), so a process needs one connection per remote VM rather than one per remote node.]

Scalability gap exists!
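A numeric instantiation of the gap, using the table's formulas and the 64K-process, 16-cores-per-node scenario that appears later in the evaluation (the concrete numbers here are mine):

$$\frac{\mathrm{QPs/process}_{\,XRC\ in\ VMs}}{\mathrm{QPs/process}_{\,native\ XRC}} = \frac{(N-1)\,C/C_V}{N-1} = \frac{C}{C_V}$$

For N = 4096, C = 16, CV = 1: (N-1)×(C/CV) = 65,520 QPs per process in VMs versus N-1 = 4,095 natively, a 16x gap.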
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Proposed Design:
VM-proof XRC design
• Design goal is to eliminate the scalability gap
– Conns/Process: (N-1)×(C/CV) → (N-1)
[Figure: VM-proof XRC design. On each physical node, all co-located VMs share a single XRC domain (one shared domain for the node hosting P1, another for the node hosting P5~P8).]
Proposed Design:
Design Challenges
• VM-proof sharing of XRC domain
– A single XRC domain must be shared among different VMs within a physical node
• VM-proof connection management
– With a single XRC connection, P1 is able to send data to all the processes in another physical node (P5~P8), no matter which VMs those processes reside in
[Figure: Software architecture. In the guest domain, the MPI application runs on the internal MPI architecture (ADI, channel interface, device manager and control software, MPI library with VM-proof CM) over the core InfiniBand modules; the guest front-end driver and the IDD back-end driver cooperate on resource management and VM-proof XRCD sharing over the Xen device and event channels, while the communication device APIs use InfiniBand OS-bypass I/O directly to the native HCA driver and the high-speed interconnection network.]
[Figure: Target: on each physical node, all co-located VMs (hosting P1 and P5~P8 on the two nodes shown) share a single XRC domain.]
Proposed Design:
Implementation
• VM-proof sharing of XRCD
– XRCD is shared by opening the same XRCD file (see the sketch after this slide)
– guest domains and the IDD have dedicated, non-shared filesystems
– a pseudo XRCD file in the guest corresponds to the real XRCD file in the IDD
• VM-proof CM
– Traditionally an IP address/hostname was used to identify a node
– the LID of the HCA is used instead
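A minimal sketch of both ideas, assuming the legacy XRC extension to libibverbs shipped with OFED 1.4.x (ibv_open_xrc_domain / ibv_close_xrc_domain; verify against the installed verbs headers). The file path is hypothetical, and this is not the authors' code: in their design the guest opens a pseudo XRCD file that the back-end driver maps onto the real file in the IDD.

#include <fcntl.h>
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0)
        return 1;
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx)
        return 1;

    /* VM-proof CM: identify the physical node by the HCA port LID,
     * not by the per-VM IP address/hostname. */
    struct ibv_port_attr pattr;
    if (ibv_query_port(ctx, 1, &pattr) == 0)
        printf("node identity: LID 0x%x\n", pattr.lid);

    /* VM-proof XRCD sharing: every process on the same physical node,
     * whichever VM it runs in, opens the same backing file, so the driver
     * maps them all onto one XRC domain. Hypothetical path. */
    int fd = open("/xrcd_files/job42", O_CREAT | O_RDWR, 0600);
    struct ibv_xrc_domain *xrcd = ibv_open_xrc_domain(ctx, fd, O_CREAT);
    if (!xrcd)
        fprintf(stderr, "failed to open shared XRC domain\n");

    /* ... create XRC SRQs/QPs on this domain, exchange (LID, QPN, SRQN)
     * with peers, communicate, then tear down ... */

    if (xrcd)
        ibv_close_xrc_domain(xrcd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}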
Proposed Design:
Discussions
• safe XRCD sharing
– unauthorized applications from other VMs could otherwise share the XRCD
• the isolation of XRCD sharing could be guaranteed by the IDD
– isolation between VMs running different MPI jobs
• By using different XRCD files, different jobs (or VMs) could
share different XRCDs and run without interfering with each
other
• XRC migration
– main challenge: XRC connection is a process-to-node
communication channel.
• Future work
Presentation Outline
• Introduction
• Problem Statement
• Proposed Design
• Evaluation
• Conclusions and Future Work
Evaluation:
Platform
• Cluster Configuration:
– 128-core InfiniBand Cluster
– Quad Socket, Quad-Core Barcelona 1.9GHz
– Mellanox DDR ConnectX HCA, 24-port MT47396
Infiniscale-III switch
• Implementation
– Xen 3.4 with Linux 2.6.18.8
– OpenFabrics Enterprise Edition (OFED) 1.4.2
– MVAPICH-1.1.0
Evaluation:
Microbenchmark
• IB verbs latency (blueframe and doorbell) and MPI latency (blueframe), Native vs. VM, for 2B–1KB messages
• The bandwidth results are nearly the same
• Virtualized IB performs ~0.1us worse when using the blueframe mechanism
– memory copy of the sending data to the HCA's blueframe page
– Explanation: memory copy operations in the virtualized case include interactions between the guest domain and the IDD
[Figures: latency (us) vs. message size (bytes) for IB verbs using blueframe, IB verbs using doorbell, and MPI using blueframe; Native vs. VM curves.]
Evaluation:
VM-proof XRC Evaluation
• Configurations
– Native-XRC: Native environment running XRC-based MVAPICH.
– VM-XRC (CV=n): VM-based environment running unmodified XRC-based MVAPICH. The parameter CV denotes the number of cores per VM.
– VM-proof XRC: VM-based environment running
MVAPICH with our VM-proof XRC design.
Evaluation:
Memory Usage
• 16 cores/node cluster, fully connected
– the X-axis denotes the process count
– ~12KB of memory for each QP
• 64K processes would consume ~13GB/node with the VM-XRC (CV=1) configuration
• The VM-proof XRC design reduces the memory usage to only ~800MB/node
• 16x less memory usage
[Figure: Memory usage (GB/node) vs. number of processes (128–64K) for VM-XRC (Cv=1, 2, 4, 8) and Native/VM-proof XRC; lower is better. VM-XRC (Cv=1) reaches 13GB at 64K processes, Native/VM-proof XRC only 800MB.]
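A back-of-the-envelope check of the two endpoints of the figure, using the slide's own assumptions (64K processes at 16 cores/node, so N = 4096 nodes, and ~12KB per QP); the rounding is mine:

$$\mathrm{VM\text{-}XRC}(C_V{=}1):\ (N-1)\,C^2 \approx 1.05\times 10^6\ \mathrm{QPs/node} \times 12\,\mathrm{KB} \approx 12.6\,\mathrm{GB}\ (\approx 13\,\mathrm{GB})$$
$$\mathrm{VM\text{-}proof\ XRC}:\ (N-1)\,C = 65{,}520\ \mathrm{QPs/node} \times 12\,\mathrm{KB} \approx 0.8\,\mathrm{GB}$$

The ratio is C/CV = 16, i.e. the "16x less memory usage" above.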
Evaluation:
MPI Alltoall Evaluation
• a total of 32 processes
• 10%~25% improvement for messages < 256B
[Figure: MPI_Alltoall latency (us) vs. message size (1B–256B) for VM-XRC (Cv=1), Native-XRC, and VM-proof XRC; lower is better.]
Evaluation:
Application Benchmarks
• VM-proof XRC performs nearly the same as Native-XRC
– except BT and EP
• Both are better than VM-XRC
• Little variation across different CV values
– Cv=8 is an exception: memory allocation is not guaranteed to be NUMA-aware
[Figure: Normalized time of the NAS benchmarks (BT, CG, EP, FT, IS, LU, MG, SP) for Native-XRC, VM-XRC (Cv=1), and VM-proof XRC; lower is better.]
[Figure: Normalized time for VM-XRC with Cv=1, 2, 4, 8 across the NAS benchmarks; lower is better.]
Evaluation:
Application Benchmarks (Cont'd)

Benchmark  Configuration    Comm. Peers  Avg. QPs/Process  Max QPs/Process  Avg. QPs/Node
FT         VM-XRC (Cv=1)    127          127               127              2032
FT         VM-XRC (Cv=2)    127          63.4              65               1014
FT         VM-XRC (Cv=4)    127          31.1              32               498
FT         VM-XRC (Cv=8)    127          15.1              16               242
FT         VM-proof XRC     127          8                 8                128   (~15.9x less conns than VM-XRC (Cv=1))
FT         Native-XRC       127          7                 7                112
IS         VM-XRC (Cv=1)    127          127               127              2032
IS         VM-XRC (Cv=2)    127          63.7              65               1019
IS         VM-XRC (Cv=4)    127          31.7              33               507
IS         VM-XRC (Cv=8)    127          15.8              18               253
IS         VM-proof XRC     127          8.6               12               138   (~14.7x less conns than VM-XRC (Cv=1))
IS         Native-XRC       127          7.6               11               122
Conclusion and Future Work
• The VM-proof XRC design brings together two technologies
– VMM-bypass I/O virtualization
– eXtensible Reliable Connection (XRC) in modern high-speed interconnection networks (InfiniBand)
• The same raw performance and scalability as in the native, non-virtualized environment with our VM-proof XRC design
– ~16x scalability improvement is seen in 16-core/node clusters
• Future work
– evaluations on different platforms with increased scale
– add VM migration support to our VM-proof XRC design
– extend our work to the new SR-IOV-enabled ConnectX-2 HCAs
Questions?
{leo, zghuo, zhangpanyong, md}@ncic.ac.cn
Backup Slides
OS-bypass of InfiniBand
OpenIB Gen2 stack