Red books PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+

advertisement
Redbooks Paper
Ruzhu Chen
PMB-2.2.1 Benchmarking on POWER4+
Platforms p655+ and p690+
Problem
The benchmark performance on the POWER4+™ platforms p690+ and p655+ platforms was
evaluated using the PMB-2.2.1 benchmark written by Pallas.
Proposed solution
Pallas has written a comprehensive set of MPI benchmarks, known as PMB, which has the
following objectives:
򐂰 Providing a concise set of benchmarks for measuring the MPI functions of point-to-point
message-passing, global data movement and computation routines, one-sided
communications, file I/O
򐂰 Establishing precise benchmark procedures, including run rules, a set of required results,
repetition factors and message lengths
򐂰 Avoiding interpretation of the measured results: execution time, throughput, global
operations performance
For a complete explanation and interpretation of PMB benchmark results, refer to the
PMB-MPI1.pdf and to the PMB-MPI2.pdf. For detailed results and output logs, refer to the
output and log files in the directory PMB2.2.1-mpi, available at:
http://www.pallas.com/e/products/pmb/index.htm
System configuration
The PMB2.2.1 benchmark was tested on the IBM® POWER4+ platforms p690+ and p655+.
© Copyright IBM Corp. 2004. All rights reserved.
ibm.com/redbooks
1
Table 1 lists the details of the configurations of these platforms as used in this benchmark.
Table 1 System and hardware configurations
Configurations
P690+
P655+
Processor
1.7 GHz Power4+
1.5GHz POWER4+
Processors/node
32
Memory/node
128 GB (8-card)
16 GB (2-card)
L1
64/32 KB (1-way/2-way)
64 / 32 KB (1-way/2-way)
L2
1.5 MB/card (4-way)
1.5 MB/card (4-way)
L3
128 MB
128 MB
OS
AIX® 5.1.0.0
AIX 5.1.0.0
AIX Kernel
64-bit
64-bit
File system(s)
Local or gpfs
Local or gpfs
FORTRAN compiler
XLF 8.1
XLF 8.1
C/C++ compiler
VAC 6.0
VAC 6.0
Mem(GB)/processor
Caches
Measurement and results
Our testing gave the following results.
Example 1 Compilation
MPI_HOME
MPI_INCLUDE
LIBS
CC
CLINKER
OPTFLAGS
CPPFLAGS
=
=
=
=
=
=
=
/usr/lpp/ppe.poe/
$(MPI_HOME)/include
-bmaxdata:0x70000000 -bmaxstack:0x10000000 -lm
mpcc_r
mpcc_r
-DnoCHECK
Example 2 Run script
export MP_EUILIB=us
export MP_EUIDEVICE=csss
export MP_INFOLEVEL=0
export MP_SHARED_MEMORY=yes
export MP_STDINMODE=none
export MP_EAGER_LIMIT=65536 #(try this to see if performance can be )
export MP_BUFFER_MEM=67108864 #(set this when MP_EAGER_LIMIT is set)
export MP_WAIT_MODE=poll #(need to set this when MP_EULIB=ip )
export MP_HOSTFILE=host.list
export MP_PROCS=$1
PMB-MPI1 (or PMB-IO, PMB-EXT)
2
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
Point-to-point performance
Point-to-point performance is measured between two processes within the same node
(memory performance), or between two nodes (network performance). The performance is
measured in MBytes/s per process (send+recv) in units of microseconds.
The following series of graphs illustrate the performance of PingPong, Multi-PingPong,
Multi-Sendrecv, Sendrecv, Multi-Exchange and Exchange on p690+ and on p655+.
PingPong on p690+
Throughput
Transfer Latency
2500
2500
2000
2000
1500
1500
1000
1000
500
500
0
Trans. Latency
(microseconds)
Throughput (MBytes/s)
3000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PingPong on p655+
Throughput
Transfer Latency
2500
3000
2000
2500
2000
1500
1500
1000
1000
500
500
0
Trans. Latency
(microseconds)
Throughput (MBytes/s)
3500
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
3
Multi-PingPong on p690+
Throughput (MBytes/s)
Transfer Latency
2500
2000
2000
1500
1500
1000
1000
500
500
0
Trans. Latency
(microseconds)
Throughput
2500
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Multi-PingPong on p655+
Throughput
Transfer Latency
3000
2500
2000
2000
1500
1500
1000
1000
500
Trans. Latency
(microseconds)
Throughput (MBytes/s)
2500
500
0
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Multi-Sendrecv on p690+ (8 processors)
Throughput (MBytes/s)
Transfer Latency
4500
4000
3000
3500
3000
2500
2500
2000
2000
1500
1000
1500
1000
500
500
0
0
0
2
8
32
128
512
2k
8k
Message Size (Bytes)
4
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
32k 128k 512k 2m
Trans. Latency
(microseconds)
Throughput
3500
Multi-Sendrecv on p655+ (8 processors)
Throughput (MBytes/s)
Transfer Latency
7000
3000
6000
2500
5000
2000
4000
1500
3000
1000
2000
500
1000
0
Trans. Latency
(microseconds)
Throughput
3500
0
0
2
8
32
128
512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Sendrecv on p690+ (8 processors)
Throughput (MBytes/s)
Transfer Latency
4500
4000
2500
3500
3000
2000
2500
2000
1500
1000
1500
1000
500
Trans. Latency
(microseconds)
Throughput
3000
500
0
0
0
2
8
32
128
512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Sendrecv on p655+ (8 processors)
Throughput
Transfer Latency
7000
6000
2500
5000
2000
4000
1500
3000
1000
2000
500
Trans. Latency
(microseconds)
Throughput (MBytes/s)
3000
1000
0
0
0
2
8
32
128
512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
5
Multi-Exchange on p690+ (8 processors)
Throughput
Transfer Latency
9000
8000
2500
7000
2000
6000
5000
1500
4000
1000
3000
Trans. Latency
(microseconds)
Throughput (MBytes/s)
3000
2000
500
1000
0
0
0
2
8
32
128
512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Multi-Exchange on p655+ (8 processors)
Throughput
Transfer Latency
14000
12000
2500
10000
2000
8000
1500
6000
1000
4000
500
Trans. Latency
(microseconds)
Throughput (MBytes/s)
3000
2000
0
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Exchange on p690+ (8 processors)
Throughput (MBytes/s)
Transfer Latency
10000
2000
8000
1500
6000
1000
4000
500
2000
0
0
0
2
8
32
128 512
2k
8k
Message Size (Bytes)
6
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
32k 128k 512k 2m
Trans. Latency
(microseconds)
Throughput
2500
Exchange on p655+ (8 processors)
Throughput
Transfer Latency
14000
12000
2000
10000
1500
8000
1000
6000
4000
500
Trans. Latency
(microseconds)
Throughput (MBytes/s)
2500
2000
0
0
0
2
8
32
128
512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Collective benchmarks
Collective or system-wide interconnect performance is measured between all or a subset of
the nodes in the system. All collective benchmarks are measured in Microseconds transfer
latency.
The following series of graphs illustrate the performance of Multi-All reduce, Allreduce,
Multi-Reduce, Reduce, Multi-Reduce_scatter, Reduce_scatter, Multi-Allgather, Allgather,
Multi-Allgatherv, Allgatherv, Multi-Alltoall, Alltoall, Multi-Bcast, and Bcast on p690+ and on
p655+.
Trans. Latency (microseconds)
Multi-Allreduce on p690+ (8 processors)
16000
14000
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
7
Trans. Latency (microseconds)
Multi-Allreduce on p655+ (8 processors)
16000
14000
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Allreduce on p690+ (8 processors)
40000
35000
30000
25000
20000
15000
10000
5000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Allreduce on p655+ (8 processors)
50000
45000
40000
35000
30000
25000
20000
15000
10000
5000
0
0
8
32
128
512
2k
8k
32k
Message Size (Bytes)
8
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
128k 512k
2m
Trans. Latency (microseconds)
Multi-Reduce on p690+ (8 processors)
14000
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Reduce on p655+ (8 processors)
14000
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Reduce on p690+ (8 processors)
30000
25000
20000
15000
10000
5000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
9
Trans. Latency (microseconds)
Reduce on p655+ (8 processors)
35000
30000
25000
20000
15000
10000
5000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Reduce_scatter on p690+ (8 processors)
8000
7000
6000
5000
4000
3000
2000
1000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Reduce_scatter on p655+ (8 processors)
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
Message Size (Bytes)
10
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
128k 512k
2m
Trans. Latency (microseconds)
Reduce_scatter on p690+ (8 processors)
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Reduce_scatter on p655+ (8 processors)
16000
14000
12000
10000
8000
6000
4000
2000
0
0
8
32
128
512
2k
8k
32k
128k 512k
2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Allgather on p690+ (8 processors)
12000
10000
8000
6000
t
4000
2000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
11
Trans. Latency (microseconds)
Multi-Allgather on p655+ (8 processors)
14000
12000
10000
8000
6000
4000
2000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Allgather on p690+ (8 processors)
70000
60000
50000
40000
30000
20000
10000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Allgather on p655+ (8 processors)
200000
180000
160000
140000
120000
100000
80000
60000
40000
20000
0
0
2
8
32
128
512
2k
8k
Message Size (Bytes)
12
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
32k 128k 512k 2m
Trans. Latency (microseconds)
Multi-Allgatherv on p690+ (8 processors)
7000
6000
5000
4000
3000
2000
1000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Allgatherv on p655+ (8 processors)
14000
12000
10000
8000
6000
4000
2000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Allgatherv on p690+ (8 processors)
35000
30000
25000
20000
15000
10000
5000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
13
Trans. Latency (microseconds)
Allgatherv on p655+ (8 processors)
90000
80000
70000
60000
50000
40000
30000
20000
10000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Alltoall on p690+ (8 processors)
7000
6000
5000
4000
3000
2000
1000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Alltoall on p655+ (8 processors)
16000
14000
12000
10000
8000
6000
4000
2000
0
0
2
8
32
128 512
2k
8k
Message Size (Bytes)
14
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
32k 128k 512k 2m
Trans. Latency (microseconds)
Alltoall on p690+ (8 processors)
45000
40000
35000
30000
25000
20000
15000
10000
5000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Alltoall on p655+ (8 processors)
100000
90000
80000
70000
60000
50000
40000
30000
20000
10000
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Multi-Bcast on p690+ (8 processors)
2500
2000
1500
1000
500
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
15
Trans. Latency (microseconds)
Multi-Bcast on p655+ (8 processors)
3500
3000
2500
2000
1500
1000
500
0
0
2
8
32
128 512
2k
8k
32k 128k 512k 2m
Message Size (Bytes)
Trans. Latency (microseconds)
Bcast on p690+ (8 processors)
7000
6000
5000
4000
3000
2000
1000
0
0
2
8
32
128 512
2k
8k
Message Size (Bytes)
16
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
32k 128k 512k 2m
Bca st on p655+ (8 proces sors )
8000
Trans. Latency (microseconds)
7000
6000
5000
4000
3000
2000
1000
0
0
2
8
32
128
512
2k
8k
32k
128k
512k
2m
Mess age Size (By tes)
MPI_Barrier
Table 2 MPI_Barrier () function benchmark
Barrier (microseconds)
Test
16
32
P690+
3.64
8.48
14.97
22.35
29.96
P655+
3.27
6.23
10.69
78.57
125.43
Summary
The PMB-2.2.1 benchmark was completed on POWER4+ platforms p690+ and p655+. The
MPI-I output results, without modification, are shown in graphical format in this report.
Author
Ruzhu Chen (ruzhuchen@us.ibm.com)
pSeries® & HPC Benchmark Center, IBM Poughkeepsie, NY
Reference
򐂰 PMB2.2.1-mpi:
http://www.pallas.com/e/products/pmb/index.htm
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
17
18
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrates programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and
distribute these sample programs in any form without payment to IBM for the purposes of developing, using,
marketing, or distributing application programs conforming to IBM's application programming interfaces.
© Copyright IBM Corp. 2004. All rights reserved.
19
®
Send us your comments in one of the following ways:
򐂰 Use the online Contact us review redbook form found at:
ibm.com/redbooks
򐂰 Send your comments in an Internet note to:
redbook@us.ibm.com
򐂰 Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYJ Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400 U.S.A.
Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX®
IBM®
POWER4+™
pSeries®
Redbooks™
Redbooks (logo)™
Redbooks (logo)
Other company, product, and service names may be trademarks or service marks of others.
20
PMB-2.2.1 Benchmarking on POWER4+ Platforms p655+ and p690+
™
Download