Redbooks Paper

Ruzhu Chen

PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+

© Copyright IBM Corp. 2004. All rights reserved. ibm.com/redbooks

Problem

The benchmark performance of the POWER4+™ platforms p690+ and p655+ was evaluated using the PMB-2.2.1 benchmark written by Pallas.

Proposed solution

Pallas has written a comprehensive set of MPI benchmarks, known as PMB, which has the following objectives:

- Providing a concise set of benchmarks for measuring the MPI functions of point-to-point message-passing, global data movement and computation routines, one-sided communications, and file I/O
- Establishing precise benchmark procedures, including run rules, a set of required results, repetition factors, and message lengths
- Avoiding interpretation of the measured results: execution time, throughput, global operations performance

For a complete explanation and interpretation of PMB benchmark results, refer to PMB-MPI1.pdf and PMB-MPI2.pdf. For detailed results and output logs, refer to the output and log files in the directory PMB2.2.1-mpi, available at:

http://www.pallas.com/e/products/pmb/index.htm

System configuration

The PMB-2.2.1 benchmark was tested on the IBM® POWER4+ platforms p690+ and p655+. Table 1 lists the details of the configurations of these platforms as used in this benchmark.

Table 1 System and hardware configurations

Configurations       P690+                      P655+
Processor            1.7 GHz POWER4+            1.5 GHz POWER4+
Processors/node      32
Memory/node          128 GB (8-card)            16 GB (2-card)
Mem(GB)/processor
Caches
  L1                 64/32 KB (1-way/2-way)     64/32 KB (1-way/2-way)
  L2                 1.5 MB/card (4-way)        1.5 MB/card (4-way)
  L3                 128 MB                     128 MB
OS                   AIX® 5.1.0.0               AIX 5.1.0.0
AIX Kernel           64-bit                     64-bit
File system(s)       Local or gpfs              Local or gpfs
FORTRAN compiler     XLF 8.1                    XLF 8.1
C/C++ compiler       VAC 6.0                    VAC 6.0

Measurement and results

Our testing gave the following results.
Example 1 Compilation

   MPI_HOME    = /usr/lpp/ppe.poe/
   MPI_INCLUDE = $(MPI_HOME)/include
   LIBS        = -bmaxdata:0x70000000 -bmaxstack:0x10000000 -lm
   CC          = mpcc_r
   CLINKER     = mpcc_r
   OPTFLAGS    =
   CPPFLAGS    = -DnoCHECK

Example 2 Run script

   export MP_EUILIB=us
   export MP_EUIDEVICE=csss
   export MP_INFOLEVEL=0
   export MP_SHARED_MEMORY=yes
   export MP_STDINMODE=none
   export MP_EAGER_LIMIT=65536     #(try this to see if performance can be )
   export MP_BUFFER_MEM=67108864   #(set this when MP_EAGER_LIMIT is set)
   export MP_WAIT_MODE=poll        #(need to set this when MP_EUILIB=ip )
   export MP_HOSTFILE=host.list
   export MP_PROCS=$1              # number of MPI processes, passed as the first argument
   PMB-MPI1                        # or PMB-IO, PMB-EXT

Point-to-point performance

Point-to-point performance is measured between two processes within the same node (memory performance) or between two nodes (network performance). Throughput is reported in MBytes/s per process (send+recv), and transfer latency is reported in microseconds. The following series of graphs illustrates the performance of PingPong, Multi-PingPong, Multi-Sendrecv, Sendrecv, Multi-Exchange, and Exchange on p690+ and on p655+.

[Graph: PingPong on p690+. Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: PingPong on p655+. Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-PingPong on p690+. Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-PingPong on p655+. Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
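The two axes of the point-to-point graphs are linked by simple arithmetic, which the following shell sketch illustrates. It rests on assumptions rather than on anything stated in this paper: that 1 MByte is taken as 2^20 bytes, and that the bytes counted per timed sample are X for PingPong, 2X for Sendrecv, and 4X for Exchange, where X is the message size. The function name throughput_mbs and its arguments are hypothetical and are not part of PMB; PMB-MPI1.pdf remains the authoritative source for the definitions.

```shell
#!/bin/sh
# Hypothetical helper, not part of PMB: convert a message size and a
# measured time into throughput in MBytes/s.
# Assumptions: 1 MByte = 2^20 bytes; FACTOR is the number of messages of
# size BYTES counted per timed sample (1 PingPong, 2 Sendrecv, 4 Exchange).
# usage: throughput_mbs BYTES TIME_USEC FACTOR
throughput_mbs() {
    awk -v x="$1" -v t="$2" -v f="$3" 'BEGIN {
        if (t <= 0) exit 1            # a zero or negative time is invalid
        # bytes per microsecond scaled to MBytes per second
        printf "%.2f\n", f * x * 1000000 / (1048576 * t)
    }'
}

# Example: a 2 MB PingPong-style message timed at 1000 microseconds
throughput_mbs 2097152 1000 1         # prints 2000.00
```

Under the same assumptions, passing 2 or 4 as the last argument gives the Sendrecv-style or Exchange-style figure for the same message size and time.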
[Graph: Multi-Sendrecv on p690+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Sendrecv on p655+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Sendrecv on p690+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Sendrecv on p655+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Exchange on p690+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Exchange on p655+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Exchange on p690+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]
[Graph: Exchange on p655+ (8 processors). Throughput (MBytes/s) and transfer latency (microseconds) versus message size (bytes).]

Collective benchmarks

Collective or system-wide interconnect performance is measured between all or a subset of the nodes in the system. All collective benchmarks report transfer latency in microseconds. The following series of graphs illustrates the performance of Multi-Allreduce, Allreduce, Multi-Reduce, Reduce, Multi-Reduce_scatter, Reduce_scatter, Multi-Allgather, Allgather, Multi-Allgatherv, Allgatherv, Multi-Alltoall, Alltoall, Multi-Bcast, and Bcast on p690+ and on p655+.

[Graph: Multi-Allreduce on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Allreduce on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allreduce on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allreduce on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Reduce on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Reduce on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Reduce on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Reduce on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Reduce_scatter on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Reduce_scatter on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Reduce_scatter on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Reduce_scatter on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Allgather on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Allgather on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allgather on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allgather on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Allgatherv on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Allgatherv on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allgatherv on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Allgatherv on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Alltoall on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Alltoall on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Alltoall on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Alltoall on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Bcast on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Multi-Bcast on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Bcast on p690+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]
[Graph: Bcast on p655+ (8 processors). Transfer latency (microseconds) versus message size (bytes).]

MPI_Barrier

Table 2 MPI_Barrier() function benchmark

Barrier (microseconds)
Test                                   16        32
P690+     3.64     8.48    14.97    22.35     29.96
P655+     3.27     6.23    10.69    78.57    125.43

Summary

The PMB-2.2.1 benchmark was completed on the POWER4+ platforms p690+ and p655+. The MPI-1 output results, without modification, are shown in graphical format in this report.

Author

Ruzhu Chen (ruzhuchen@us.ibm.com)
pSeries® & HPC Benchmark Center, IBM Poughkeepsie, NY

Reference

PMB2.2.1-mpi: http://www.pallas.com/e/products/pmb/index.htm

Notices

This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead.
However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. 
IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.

Send us your comments in one of the following ways:

- Use the online Contact us review redbook form found at: ibm.com/redbooks
- Send your comments in an Internet note to: redbook@us.ibm.com
- Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. HYJ Mail Station P099, 2455 South Road, Poughkeepsie, NY 12601-5400 U.S.A.
Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AIX®
IBM®
POWER4+™
pSeries®
Redbooks™
Redbooks (logo)™

Other company, product, and service names may be trademarks or service marks of others.