GridMPI
Yutaka Ishikawa
University of Tokyo and Grid Technology Research Center at AIST
2003/10/03

Background
• SCore Cluster System Software
  – Developed by the Real World Computing Partnership (RWCP, 1992–2001)
  – Funded by the Ministry of Economy, Trade and Industry (METI)
• [Figure: SCore software stack. User level: applications, MPC++, MPICH-SCore, PVM-SCore, OMNI/SCASH, SCASH, PBS, and the SCore-D global operating system. Linux kernel level: the PMv2 communication layer (PM/Shmem, PM/Myrinet, PM/Ethernet, PM/UDP) with the corresponding drivers and Socket/UDP/IP. NIC level: Myrinet NIC with PM firmware and Ethernet NIC]
• High-performance communication libraries (PMv2)
  – 11.0 usec round-trip time, 240 MB/s bandwidth
  – MPICH-SCore MPI library: 24.4 usec round-trip time, 228 MB/s bandwidth
  – PM/Ethernet network trunking: utilizes more than one NIC
• Global operating system (SCore-D)
  – Single/multi-user environment, gang scheduling, checkpoint and restart
• Parallel programming language (MPC++)
  – Multi-Thread Template Library
• Shared-memory programming support
  – Omni OpenMP on SCASH

RWC SCore III
• Hosts
  – NEC Express servers
  – Dual Pentium III 933 MHz
  – 512 MBytes of main memory
• Number of hosts
  – 512 hosts (1,024 processors)
• Networks
  – Myrinet-2000 (2 Gbps + 2 Gbps)
  – 2 Ethernet links
• Linpack result
  – 618.3 Gflops
  – The world's fastest PC cluster as of August 2001

TOP 500 as of December 2002
• HELICS, rank 64th (825.0 Gflops)
  – 512 Athlon 1.4 GHz processors, Myrinet-2000
  – Heidelberg University – IWR, http://helics.iwr.uni-heidelberg.de/
• Presto III, rank 68th (760.2 Gflops)
  – 512 Athlon 1.6 GHz processors, Myrinet-2000
  – GSIC, Tokyo Institute of Technology, http://www.gsic.titech.ac.jp/
• Magi Cluster, rank 86th (654.0 Gflops)
  – 1,040 Pentium III 933 MHz processors, Myrinet-2000
  – CBRC-TACC/AIST, http://www.cbrc.jp/magi/
• RWC SCore III, rank 90th (618.3 Gflops)
  – 1,024 Pentium III 933 MHz processors, Myrinet-2000
  – RWCP, http://www.rwcp.or.jp

SCore Users
• Japan
  – Universities: University of Tokyo, Tokyo Institute of Technology, University of Tsukuba, …
  – Industry: Japanese car manufacturers use it on their production lines
• UK
  – Oxford University, Warwick University
  – Streamline Computing Ltd: SCore integration business
• Germany
  – University of Bonn
  – University of Heidelberg
  – University of Tuebingen

PC Cluster Consortium (http://www.pccluster.org)
• Purpose
  – To contribute to the PC cluster market through the development, maintenance, and promotion of cluster system software based on the SCore cluster system software and the Omni OpenMP compiler, both developed at RWCP.
• Members
  – Japanese companies: NEC, Fujitsu, Hitachi, HP Japan, IBM Japan, Intel Japan, AMD Japan, …
  – Research institutes: Tokyo Institute of Technology GSIC, RIKEN
  – Individuals

Lessons Learned
• A new MPI implementation is needed
  – It is tough to change or modify existing MPI implementations
  – The new implementation should be open in design, in addition to being open source
  – It should be customizable, so that new protocols can be implemented
• A new transport implementation is needed
  – The PM library is not built on top of the IP protocol, which is not acceptable in a Grid environment
  – The current TCP/IP implementations (BSD and Linux) do not perform well over large-latency links
  – There is a mismatch between the socket API and the MPI communication model
  – The TCP/IP protocol itself is not the issue; its implementation is

Lessons Learned: Mismatch between the Socket API and the MPI Communication Model
• [Figure: Node 0 posts MPI_Irecv(buf, MPI_ANY_SOURCE, MPI_ANY_TAG, …), MPI_Irecv(buf, 1, 2, …), and MPI_Irecv(buf, 1, MPI_ANY_TAG, …) for messages from nodes 1–3; incoming data is copied from the kernel socket queues into the MPI runtime message queues, and copied again into the application data area]
• MPI matches messages by source and tag, while sockets deliver bytes per connection, so a socket-based MPI must stage and copy messages; a code sketch of this receive pattern follows the Approach slide below.

GridMPI
• A latency-aware MPI implementation
  – Applications are developed on a small cluster located at a lab
  – Production runs take place in the Grid environment
• [Figure: application development on a local cluster; application execution on several clusters and a data resource connected over the Internet]

Is It Feasible?
• Is it feasible to run non-EP (embarrassingly parallel) applications on Grid-connected clusters?
  – NO for long-distance networks
  – YES for metropolitan- or campus-area networks
• Example: Greater Tokyo area
  – Diameter: 100–300 km (60–200 miles)
  – Latency: 1–2 ms one-way (propagation in fiber is roughly 5 usec/km, so the distance alone accounts for about 0.5–1.5 ms)
  – Bandwidth: 1–10 Gbps, or more

Experimental Environment
• Two 16-node clusters, each behind a Gigabit Ethernet switch, connected by 1 Gbps Ethernet through a router PC running NIST Net
  – Added delay: 0.5 ms, 1.0 ms, 1.5 ms, 2.0 ms, 10 ms
• Cluster nodes: dual Pentium III 933 MHz, 1 GByte memory, 66 MHz PCI-64 I/O bus, 3Com Gigabit Ethernet NIC, Linux 2.4.18
• Router PC: dual Xeon 2.4 GHz, 1 GByte memory, 66 MHz PCI-64 I/O bus, two 3Com Gigabit Ethernet NICs, Linux 2.4.20

NAS Parallel Benchmark Results
• [Charts: scalability of CG, MG, and LU (Class B) versus added latency (0–20 ms) for MPICH-SCore, MPICH-G2/SCore, and MPICH-P4]
• Scalability is relative to the 16-node MPICH-SCore run with no added delay
• Speed-up: 1.2x to 2x
• Memory usage: 2x

Approach
• Latency-aware communication facility
  – New TCP/IP implementation
  – New socket API
• Additional features for MPI
  – New communication protocols at the MPI implementation level
    • Message routing
    • Dynamic collective communication paths
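To make the socket/MPI mismatch concrete, here is a minimal, runnable sketch of the receive pattern in the mismatch figure. The ranks, tags, and message sizes are illustrative assumptions, not taken from the slides. With wildcard receives such as MPI_ANY_SOURCE, a socket-based MPI library cannot know in advance on which connection the matching message will arrive, so it must read envelopes from every socket, queue unexpected messages, and copy payloads into the user buffer afterwards. That staging is the extra kernel-queue to runtime-queue to data-area copy path shown in the figure.

```c
/*
 * Minimal, runnable sketch of the receive pattern in the mismatch figure.
 * Ranks, tags, and message sizes are illustrative assumptions.
 * Run with at least two processes, e.g. "mpiexec -n 2 ./a.out".
 */
#include <mpi.h>

enum { N = 1024 };

int main(int argc, char **argv)
{
    static double buf[3][N];     /* zero-initialized payload buffers */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Request req[3];
        MPI_Status  st[3];

        /* Wildcard source and tag: the library can decide which receive
         * matches only after it has read a message envelope from *some*
         * socket, so the payload is staged in a runtime queue and copied
         * into buf[0] later. */
        MPI_Irecv(buf[0], N, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req[0]);

        /* Fixed source and tag: the data could in principle go straight
         * into buf[1], but over TCP it may still queue up behind an
         * unrelated message on the same connection. */
        MPI_Irecv(buf[1], N, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, &req[1]);

        /* Fixed source, wildcard tag. */
        MPI_Irecv(buf[2], N, MPI_DOUBLE, 1, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &req[2]);

        MPI_Waitall(3, req, st);
    } else if (rank == 1) {
        /* Three sends that match the receives above, in posting order. */
        MPI_Send(buf[0], N, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD);
        MPI_Send(buf[1], N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
        MPI_Send(buf[2], N, MPI_DOUBLE, 0, 3, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```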
GridMPI Software Architecture
• [Figure: layered architecture. The MPI core sits on top of the Grid ADI; below the Grid ADI are RPIM (over SSH, RSH, GRAM, or vendor mechanisms), IMPI, the Latency-Aware Communication Topology, and point-to-point communication (over TCP/IP, PMv2, a vendor MPI library, or others)]
• MPI core & Grid ADI
  – Provide the MPI features: communicators, groups, and topologies
  – Provide the MPI communication facilities, implemented on top of the Grid ADI
• RPIM (Remote Process Invocation Mechanism)
  – Abstraction of remote process invocation mechanisms
• IMPI
  – The Interoperable MPI specification
• Grid ADI
  – Abstraction of communication facilities
• LACT (Latency-Aware Communication Topology)
  – Makes latency and network topology transparent to the application

LACT (Latency-Aware Communication Topology)
• Takes network bandwidth and latency into account
  – Message routing using point-to-point communication, independent of IP routing
  – Data is combined locally in collective communication
  – Communication patterns are chosen to match the network topology
• [Figure: four clusters A–D connected by links with different characteristics: 10 Gbps / 1 ms, 1 Gbps / 0.5 ms, 1 Gbps / 0.5 ms, and 100 Mbps / 2 ms]

LACT Example: Reduction
• With the same four-cluster topology, a reduction is first performed locally within each cluster; only the per-cluster results cross the slower inter-cluster links (see the sketch after the Schedule slide below)

Schedule
• Current
  – The first GridMPI implementation
    • A part of MPI-1 and IMPI
    • The NAS parallel benchmarks run
• FY 2003
  – GridMPI version 0.1
    • MPI-1 and IMPI
    • Prototype of the new TCP/IP implementation
    • Prototype of a LACT implementation
• FY 2004
  – GridMPI version 0.5
    • MPI-2
    • New TCP/IP implementation
    • LACT implementation
    • OGSA interface
    • Vendor MPI
• FY 2005
  – GridMPI version 1.0
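As an illustration of the two-level reduction in the LACT example, the following is a minimal plain-MPI sketch that reduces within each cluster first and then across the cluster leaders. It is only a sketch of the idea: the GRIDMPI_CLUSTER environment variable and the explicit communicator splitting are assumptions made here for illustration; LACT is intended to apply this kind of pattern transparently inside an ordinary MPI_Reduce call.

```c
/*
 * Sketch of a two-level ("cluster-first") reduction in plain MPI.
 * The GRIDMPI_CLUSTER environment variable and the explicit communicator
 * splitting are illustrative assumptions, not part of GridMPI's API.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int world_rank, local_rank, cluster_id;
    double local = 1.0, cluster_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Assumption: the launcher tells each process which cluster it is in,
     * e.g. GRIDMPI_CLUSTER=0 for cluster A, 1 for B, and so on. */
    const char *env = getenv("GRIDMPI_CLUSTER");
    cluster_id = env ? atoi(env) : 0;

    /* Step 1: reduce within each cluster over the fast local network. */
    MPI_Comm cluster_comm;
    MPI_Comm_split(MPI_COMM_WORLD, cluster_id, world_rank, &cluster_comm);
    MPI_Comm_rank(cluster_comm, &local_rank);
    MPI_Reduce(&local, &cluster_sum, 1, MPI_DOUBLE, MPI_SUM, 0, cluster_comm);

    /* Step 2: only the per-cluster leaders (local rank 0) communicate
     * across the slow wide-area links; other processes opt out with
     * MPI_UNDEFINED and receive MPI_COMM_NULL. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD,
                   local_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Reduce(&cluster_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
                   leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    if (world_rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Comm_free(&cluster_comm);
    MPI_Finalize();
    return 0;
}
```

With the example topology above, only one small per-cluster result, rather than every process's contribution, has to cross the 100 Mbps / 2 ms link.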