GridMPI
Yutaka Ishikawa
University of Tokyo and Grid Technology Research Center at AIST
2003/10/03
Background
• SCore Cluster System Software
– Real World Computing Partnership (1992 – 2001)
• Funded by the Ministry of Economy, Trade and Industry (METI)
[Figure: SCore software stack. User level: applications (MPC++, MPICH-SCore, PVM-SCore, SCASH, OMNI/SCASH, PBS) running on the SCore-D Global Operating System. Kernel level: the PMv2 communication layer (PM/Shmem, PM/Myrinet, PM/Ethernet, PM/UDP) with their drivers, alongside Socket/UDP/IP and the Ethernet driver, on Linux. NIC level: PM firmware on the Myrinet NIC, plus the Ethernet NIC]
• High Performance Communication Libraries: PMv2
– PMv2: 11.0 usec round-trip time, 240 MB/s bandwidth
– MPICH-SCore MPI library: 24.4 usec round-trip time, 228 MB/s bandwidth
– PM/Ethernet network trunking: utilizing more than one NIC
• Global Operating System: SCore-D
– Single/multi-user environment
– Gang scheduling
– Checkpoint and restart
• Parallel Programming Language
– MPC++ Multi-Thread Template Library
• Shared Memory Programming Support
– Omni OpenMP on SCASH
RWC SCore III
• Host
– NEC Express Servers
• Dual Pentium III 933 MHz
• 512 Mbytes of Main Memory
• # of Hosts
– 512 Hosts (1,024 Processors)
• Networks
– Myrinet-2000 (2 Gbps + 2 Gbps)
– 2 Ethernet Links
• Linpack Result
– 618.3 Gflops
This was the world's fastest PC cluster as of August 2001.
TOP 500 as of December 2002
• HELICS, rank 64th (825.0 Gflops),
– 512 Athlons 1.4GHz, Myrinet-2000
– Heidelberg University – IWR,
http://helics.iwr.uni-heidelberg.de/
• Presto III, rank 68th (760.2 Gflops),
– 512 Athlons 1.6GHz, Myrinet-2000
– GSIC, Tokyo Institute of Technology,
http://www.gsic.titech.ac.jp/
• Magi Cluster, rank 86th (654.0 Gflops),
– 1040 Pentium III 933MHz, Myrinet-2000
– CBRC-TACC/AIST, http://www.cbrc.jp/magi/
• RWC SCore III, rank 90th (618.3 Gflops),
– 1024 Pentium III 933MHz, Myrinet-2000
– RWCP, http://www.rwcp.or.jp
SCore Users
• Japan
– Universities
• University of Tokyo, Tokyo Institute of Technology, University of
Tsukuba, …
– Industries
• Japanese car manufacturers use SCore in their production lines
• UK
– Oxford University
– Warwick University
– Streamline Computing Ltd (SCore integration business)
• Germany
– University of Bonn
– University of Heidelberg
– University of Tuebingen
PC Cluster Consortium
http://www.pccluster.org
• Purpose
– Contribution to the PC cluster market through the development,
maintenance, and promotion of cluster system software based on the
SCore cluster system software and the Omni OpenMP compiler, both
developed at RWCP.
• Members
– Japanese companies
• NEC, Fujitsu, Hitachi, HP Japan, IBM Japan, Intel Japan, AMD Japan,
…
– Research Institutes
• Tokyo Institute of Technology GSIC, Riken
– Individuals
Lesson Learned
• A new MPI implementation is needed
– It is tough to change or modify existing MPI implementations
– The new implementation must be
• Open in its implementation, in addition to being open source
• Customizable, so that new protocols can be implemented
• A new transport implementation is needed
– The PM library is not layered on the IP protocol, which is not
acceptable in a Grid environment
– The current TCP/IP implementations (BSD and Linux) do not perform
well in large-latency environments (see the sketch below)
– There is a mismatch between the socket API and the MPI
communication model
– The TCP/IP protocol itself is not the issue; its implementation is
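To see why a stock TCP/IP stack struggles at large latencies, consider the bandwidth-delay product: a 1 Gbps path with a 4 ms round-trip time must keep roughly 500 KB in flight, far more than the default socket buffers of that era, so the sender stalls waiting for acknowledgements unless the window is enlarged. A minimal C sketch (the function and the buffer-sizing policy are illustrative assumptions, not GridMPI code):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Size socket buffers to the bandwidth-delay product (BDP) so TCP
     * can keep a long-latency, high-bandwidth path full.              */
    static int size_for_path(int fd, double gbps, double rtt_ms)
    {
        /* BDP in bytes = bandwidth (bits/s) * RTT (s) / 8 */
        int bdp = (int)(gbps * 1e9 * (rtt_ms / 1000.0) / 8.0);
        printf("BDP for %.1f Gbps, %.1f ms RTT: %d bytes\n",
               gbps, rtt_ms, bdp);

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
            return -1;
        return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        /* 1 Gbps with a 4 ms round trip: about 500 KB must be in flight. */
        size_for_path(fd, 1.0, 4.0);
        return 0;
    }

Even with properly sized buffers, implementation behavior such as slow start and loss recovery degrades at wide-area round-trip times, which is why the argument above is that the implementation, not the protocol, is the issue.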
Lesson Learned
• Mismatch between the socket API and the MPI communication model
[Figure: four nodes (Node 0 to Node 3) exchanging messages. On the receiving node, incoming data sits in socket queues in the kernel, is copied into message queues inside the MPI library runtime, and is copied again into the application's data area, because the library must match arriving messages against posted receives such as MPI_Irecv(buf, 1, 2, ...), MPI_Irecv(buf, 1, MPI_ANY_TAG, ...), and MPI_Irecv(buf, MPI_ANY_SOURCE, MPI_ANY_TAG, ...) rather than delivering socket data directly into user buffers]
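The receive pattern from the figure can be written out with standard MPI calls; the counts, ranks, and tags below are arbitrary illustration values, and matching MPI_Send calls on the peer ranks are assumed:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int buf[3][1024];
        MPI_Request req[3];

        MPI_Init(&argc, &argv);

        /* A receive bound to one sender and one tag could, in principle,
         * be associated with a single connection...                     */
        MPI_Irecv(buf[0], 1024, MPI_INT, 1, 2, MPI_COMM_WORLD, &req[0]);

        /* ...but wildcard receives match a message from any sender
         * and/or any tag, so the library must stage data in its own
         * message queues and copy to the user buffer on a match.        */
        MPI_Irecv(buf[1], 1024, MPI_INT, 1, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(buf[2], 1024, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req[2]);

        MPI_Waitall(3, req, MPI_STATUSES_IGNORE);
        MPI_Finalize();
        return 0;
    }

A socket, by contrast, delivers bytes from exactly one connection in arrival order, so wildcard matching forces the library to drain sockets into its own queues and copy the data once more when a receive finally matches.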
GridMPI
• Latency-aware MPI implementation
• Development of applications in a small cluster located at a lab
• Production runs in the Grid environment
[Figure: an application is developed on a lab cluster, then executed across multiple clusters and data resources connected through the Internet]
Is It Feasible ?
• Is it feasible to run non-EP (Embarrassingly
Parallel) applications on Grid-connected clusters?
– NO for long-distance networks
– YES for metropolitan- or campus-area networks
• Example: Greater Tokyo area
– Diameter: 100-300 km (60-200 miles)
– Latency: 1-2 ms one-way (see the estimate below)
– Bandwidth: 1-10 Gbps or more
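As a rough sanity check on the latency figure, assume signals propagate through optical fiber at about two thirds of the speed of light (an assumption; routing and switching add on top of this):

    t_{\mathrm{prop}} \approx \frac{d}{(2/3)c}
      = \frac{100 \text{ to } 300\ \mathrm{km}}{2 \times 10^{5}\ \mathrm{km/s}}
      \approx 0.5 \text{ to } 1.5\ \mathrm{ms\ (one\ way)}

so the quoted 1-2 ms one-way latency is largely propagation delay set by distance, and the question is whether MPI can tolerate it.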
Experimental Environment
[Figure: two 16-node clusters, each behind a Gigabit Ethernet switch, connected through a router PC running NIST Net; the router inserts delays of 0.5 ms, 1.0 ms, 1.5 ms, 2.0 ms, or 10 ms on the 1 Gbps Ethernet path between the clusters]
              Cluster Nodes                   Router PC
Processor     Pentium III 933 MHz (dual)      Xeon 2.4 GHz (dual)
Memory        1 GByte                         1 GByte
I/O Bus       66 MHz PCI-64                   66 MHz PCI-64
Network       3Com Gigabit Ethernet NIC       Two 3Com Gigabit Ethernet NICs
OS            Linux 2.4.18                    Linux 2.4.20
NAS Parallel Benchmark Results
[Figure: scalability of the NAS Parallel Benchmarks CG, MG, and LU (Class B) versus inserted latency (0.5 to 20 ms) for MPICH-SCore, MPICH-G2/SCore, and MPICH-P4; scalability is relative to the 16-node MPICH-SCore run with no delay]
• Speedup: 1.2x to 2x
• Memory usage: 2x
Approach
• Latency-aware Communication Facility
– New TCP/IP Implementation
– New socket API
• Additional feature for MPI
– New communication protocols at the MPI implementation level
• Message routing
• Dynamic collective communication path
GridMPI Software Architecture
[Figure: GridMPI software architecture. The MPI core sits on the Grid ADI; beneath the Grid ADI are RPIM (over SSH, RSH, GRAM, vendor mechanisms, and others), IMPI, the Latency-aware Communication Topology, vendor MPI libraries, and point-to-point communication over TCP/IP, PMv2, and other transports]
• MPI Core & Grid ADI
– Provides MPI features: communicators, groups, and topologies
– Provides MPI communication facilities implemented on top of the Grid ADI
• RPIM (Remote Process Invocation Mechanism)
– Abstraction of remote process invocation mechanisms
• IMPI
– Interoperable MPI specification
• Grid ADI
– Abstraction of communication facilities (see the sketch below)
• LACT (Latency-Aware Communication Topology)
– Transparency of latency and network topology
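To make "abstraction of communication facilities" concrete, here is a hypothetical sketch of a Grid-ADI-style transport table in C; the struct, function names, and stubs are invented for illustration and are not the actual GridMPI interface:

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical transport vtable: each underlying facility (TCP/IP,
     * PMv2, a vendor MPI, ...) fills in the same operations, and the
     * MPI core calls through the table without knowing which transport
     * reaches a given peer.                                            */
    typedef struct {
        const char *name;
        int (*send)(int dest, const void *buf, size_t len);
        int (*recv)(int src, void *buf, size_t len);
    } grid_adi_ops;

    /* Stubs standing in for a real TCP/IP transport. */
    static int tcp_send(int dest, const void *buf, size_t len)
    { (void)buf; printf("tcp send %zu bytes to %d\n", len, dest); return 0; }
    static int tcp_recv(int src, void *buf, size_t len)
    { (void)buf; printf("tcp recv %zu bytes from %d\n", len, src); return 0; }

    static const grid_adi_ops tcp_ops = { "tcp", tcp_send, tcp_recv };

    int main(void)
    {
        /* The core would pick PMv2 inside a cluster and TCP/IP (or
         * IMPI) between clusters; only the TCP stub is shown here.     */
        const grid_adi_ops *ops = &tcp_ops;
        char msg[64] = "hello";
        ops->send(1, msg, sizeof msg);
        ops->recv(1, msg, sizeof msg);
        return 0;
    }

The point of such a table is that communicators, groups, and collectives in the MPI core stay unchanged while transports (TCP/IP, PMv2, vendor MPI, IMPI) are swapped underneath.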
LACT (Latency-Aware Communication Topology)
• Takes network bandwidth and latency into account
– Message routing using point-to-point communication
• Independent of IP routing
– Collecting data in collective communication
• Communication patterns matched to the network topology
[Figure: four clusters (A, B, C, D) connected by links of differing quality: 10 Gbps with 1 ms latency, two 1 Gbps links with 0.5 ms latency, and 100 Mbps with 2 ms latency]
LACT (Latency-Aware Communication Topology)
An Example: Reduction
• Takes network bandwidth and latency into account
– Message routing using point-to-point communication
– Independent of IP routing
– Collecting data in collective communication, as sketched below
• Communication patterns matched to the network topology
[Figure: the same four clusters; reductions are performed inside individual clusters first (reduction labels appear at Clusters B, C, and D) and only the per-cluster partial results cross the wide-area links]
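The idea behind a latency-aware reduction can be illustrated with standard MPI calls: reduce inside each cluster over the fast local network first, then combine only one partial result per cluster across the slow wide-area links. This is a sketch of the technique, not GridMPI's actual LACT implementation; the assumption that 16 consecutive ranks share a cluster stands in for the topology information LACT would maintain.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, cluster_id, intra_rank;
        double local = 1.0, cluster_sum = 0.0, global_sum = 0.0;
        MPI_Comm intra, leaders;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Illustrative assumption: 16 consecutive ranks per cluster. */
        cluster_id = rank / 16;

        /* Step 1: reduce over the fast intra-cluster network. */
        MPI_Comm_split(MPI_COMM_WORLD, cluster_id, rank, &intra);
        MPI_Reduce(&local, &cluster_sum, 1, MPI_DOUBLE, MPI_SUM, 0, intra);

        /* Step 2: only one process per cluster crosses the wide-area
         * links, carrying a single partial result.                     */
        MPI_Comm_rank(intra, &intra_rank);
        MPI_Comm_split(MPI_COMM_WORLD,
                       intra_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
        if (leaders != MPI_COMM_NULL) {
            MPI_Reduce(&cluster_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                       0, leaders);
            MPI_Comm_free(&leaders);
        }

        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Comm_free(&intra);
        MPI_Finalize();
        return 0;
    }

With links like those in the figure, this keeps the 100 Mbps / 2 ms path down to one message per reduction instead of one per process.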
Schedule
• Current
– The first GridMPI
implementation
• A part of MPI-1 and IMPI
• NAS parallel benchmarks
run
• FY 2003
– GridMPI version 0.1
• MPI-1 and IMPI
• Prototype of new TCP/IP
implementation
• Prototype of a LACT
implementation
• FY 2004
– GridMPI version 0.5
• MPI-2
• New TCP/IP
implementation
• LACT implementation
• OGSA interface
• Vendor MPI
• FY 2005
– GridMPI version 1.0
Download