Redpaper

IBM System Blue Gene Solution: Compute Node Linux

Brant Knudson
Jeff Chauvin
Jeffrey Lien
Mark Megerian
Andrew Tauferner
Overview
This IBM® Redpaper publication describes the use of compute node Linux® on the IBM
System Blue Gene® Solution. Readers of this paper need to be familiar with general Blue
Gene/P™ system administration and application development concepts.
The normal boot process for a Blue Gene/P partition loads Linux on the I/O nodes and loads
the Blue Gene/P Compute Node Kernel (CNK) on the compute nodes. This standard
configuration provides the best performance and reliability for running applications on Blue
Gene/P. The lightweight CNK provides a subset of system calls and has tight control over the
threads and processes that run on the node. Thus, the CNK introduces very little
interference with the applications that run on the compute nodes.
Blue Gene/P release V1R3 provides compute node Linux, a new feature that allows users to
run Linux on the compute nodes. This feature provides a new means for research and
experimentation by allowing all of the compute nodes in a Blue Gene/P partition to operate
independently with a full Linux kernel. While this environment is not optimal for
high-performance applications or highly parallel applications that communicate using
the Message Passing Interface (MPI), you might want Blue Gene/P to act like a large cluster
of nodes that each run Linux. The term
High-Throughput Computing (HTC) is used to describe applications that are running in this
environment. HTC was introduced as a software feature in V1R2 of Blue Gene/P. Using HTC
allows all nodes to act independently, each capable of running a different job.
The new compute node Linux feature provides benefits that can make it easier to develop or
port applications to Blue Gene/P. The biggest benefit that application developers might notice
when running Linux on the compute nodes is that there is no restriction on the number of
threads that are running on a node. There is also no restriction on the system calls, because
the compute nodes are running Linux rather than the CNK. This feature opens Blue Gene/P
to potential new applications that can take advantage of the full Linux kernel.
Because this feature is enabled for research and is not considered a core feature of the Blue
Gene/P software stack, it has not been tested under an exhaustive set of circumstances. IBM
has performed extensive testing of the compute node Linux functionality that is within the core
Blue Gene/P software stack. However, because having a completely functional Linux kernel
on the compute nodes allows many new types of applications to run on the compute nodes,
we cannot claim to have tested every scenario. To be specific, IBM has not formally tested this
environment with external software such as IBM General Parallel File System (GPFS™), XL
compilers, Engineering Scientific Subroutine Library (ESSL), or the HPC Toolkit.
Because of the experimental nature of compute node Linux, it is disabled by default. You
cannot boot a partition to run Linux on the compute nodes until you receive an activation key
from IBM. To request an activation key, contact your assigned Blue Gene/P Technical
Advocate. The Technical Advocate can help determine whether you qualify for the key and,
upon qualification, can assist with the additional agreements that you need in place. This
function is not restricted, and there is no fee, so each customer can contact IBM individually
for a key.
This paper discusses the following topics:
- How compute node Linux works
- System administration
- Using compute node Linux
- Application development
- Job scheduler interfaces
- Performance results
How compute node Linux works
The same Linux image that is used on the I/O nodes is loaded onto the compute nodes when
a block is booted in compute node Linux mode. This kernel is a 32-bit PPC SMP kernel at
version 2.6.16.46 (later releases of Blue Gene/P might include a newer kernel version).
Other than a runtime test that determines the node type in the network device
driver that uses the collective hardware, the kernel behaves identically on I/O nodes and
compute nodes. The same init scripts execute on both the compute nodes and I/O nodes
when the node is booted or shut down.
To indicate to the Midplane Management Control System (MMCS) that the same kernel is
running on the compute nodes as is running on the I/O nodes, the partition’s compute node
images must be set to the Linux images. The Linux boot images are located in the cns, linux,
and ramdisk files in the directory path /bgsys/drivers/ppcfloor/boot/.
When booting the partition, you also need to tell MMCS to handle the boot differently than it
does with CNK images. You can indicate to MMCS to use compute node Linux by setting the
partition’s options field to l (lower-case L). You can make both of these changes from either
the MMCS console or the job scheduler interface as we describe in “Job scheduler interfaces”
on page 14.
When you boot a partition to use compute node Linux, the compute nodes and I/O node in
each pset have an IP interface on the collective network through which they can
communicate. Because the compute nodes are each assigned a unique IP address, each can
be addressed individually by hosts outside of the Blue Gene/P system. The number of IP
addresses can be very large, given that each rack of Blue Gene/P hardware contains
between 1032 and 1088 nodes. To accommodate a high number of addresses, a private
class A network is recommended, although a smaller network
might suffice depending on the size of your Blue Gene/P system. When setting up the IP
addresses, the compute nodes and I/O nodes must be on the same subnet.
Unlike I/O nodes, the compute nodes have physical access only to the collective network. To
provide compute nodes access to the functional network, the I/O nodes use proxy ARP to
intercept IP packets that are destined for the compute nodes. Every I/O node acts as a proxy
for the compute nodes in its pset, replying to ARP requests that are destined for those
compute nodes. This automatically establishes proper routing to every compute node in the
pset. With proxy ARP, both IP interfaces on the I/O node have the same IP address. The I/O
node establishes a route for every compute node in its pset at boot time.
Compute node Linux uses the Blue Gene/P HTC mode when jobs run on a partition. Using
HTC allows all nodes to act independently, where each node can potentially run a different
executable. HTC is often contrasted with the long-accepted term High Performance
Computing (HPC). HPC refers to booting partitions that run a single program on all of the
compute nodes, primarily using MPI to communicate to do work in parallel.
Because the compute nodes are accessible from the functional network after the block is
booted and because they run the services that typically run on a Linux node, a user can
ssh to a compute node and run programs from the command prompt. Users can also run
jobs on the compute nodes using Blue Gene/P’s HTC infrastructure through the submit
command.
System administration
This section describes the installation and configuration tasks for compute node Linux that
system administrators must perform and the interfaces that are provided for these tasks.
Installation
To install compute node Linux, you (as the system administrator) must enter the activation key
into the Blue Gene/P database properties file. The instructions that explain how to enter the
key are provided with the activation key, which is available by contacting your Blue Gene
Technical Advocate. If the database properties file does not contain a valid activation key,
then blocks will fail to boot in Linux mode with the following error message:

boot_block: linux on compute nodes not enabled
You need to set the IP addresses of the compute nodes using the Blue Gene/P database
populate script, commonly referred to as DB populate. If the IP addresses are not set, then
blocks will fail to boot in Linux mode.
When setting the compute node IP addresses, you need to tell DB populate the IP address at
which to start. You can calculate this value from the last IP address for the I/O node by adding
1 to both the second and fourth octets in the IP address (where the first octet is on the left). To
get the IP address for the last I/O node, run the following commands on the service node:
$ . ~bgpsysb/sqllib/db2profile
$ db2 connect to bgdb0 user bgpsysdb
(type in the password for the bgpsysdb user)
$ db2 "select ipaddress from bgpnode where location = (select max(location)
from bgpnode where isionode = 'T')"
The output of the query looks similar to the following example:
IPADDRESS
-------------
172.16.100.64
So, if you add 1 to the second and fourth octets of this IP address, you have an IP address for
DB populate of 172.17.100.65.
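If you prefer to compute the value rather than work it out by hand, a small one-liner such as the following performs the same octet arithmetic (a convenience sketch only, assuming the query returned 172.16.100.64 as shown above):

$ echo 172.16.100.64 | awk -F. '{ printf "%d.%d.%d.%d\n", $1, $2+1, $3, $4+1 }'
172.17.100.65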
After you have the IP address, invoke the DB populate script using the following commands:
$ . ~bgpsysb/sqllib/db2profile
$ cd /bgsys/drivers/ppcfloor/schema
$ ./dbPopulate.pl --input=BGP_config.txt --size=<size> --dbproperties
/bgsys/local/etc/db.properties --cniponly --proceed --ipaddress <ipaddress>
In this command, replace <size> with the dimensions of your system in racks using <rows> x
<columns>, and replace <ipaddress> with the IP address that you just calculated. For
example,
$ ./dbPopulate.pl --input=BGP_config.txt --size=1x1 --dbproperties
/bgsys/local/etc/db.properties --cniponly --proceed --ipaddress 172.17.100.65
This command can run for several minutes on a large system. The output of DB populate
indicates whether the command succeeds or fails. If it fails, you need to correct the issue and
re-run the program. You can run DB populate later to change the IP addresses of the compute
nodes.
You also need to add a route to the compute nodes’ functional network on the service node,
front end nodes, and any other system on the functional network, so that those systems can
communicate with the compute nodes. Edit the /etc/sysconfig/network/routes file on each
system, and then restart the network service or run route. For example, using the previous
configuration where the service node is on the 172.16.0.0/16 network and the compute
nodes are on 172.17.0.0/16, add the following line to the /etc/sysconfig/network/routes file:

172.0.0.0   0.0.0.0   255.0.0.0   eth0
This change takes effect on each boot. To make the change take effect immediately, you can
run /etc/init.d/network restart or route add -net 172.0.0.0 netmask 255.0.0.0 eth0.
In addition to adding the route to the service node, you might need to update the /etc/exports
NFS configuration file on the service node to include the compute nodes. If you do not, the
compute nodes cannot mount the file system and will fail to boot.
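For example, assuming the service node exports /bgsys and uses the networks from the earlier example, an /etc/exports entry similar to the following covers both the I/O nodes and the compute nodes (the exported path and the mount options here are site-specific assumptions, not fixed requirements):

/bgsys  172.16.0.0/255.255.0.0(rw,no_root_squash,sync) 172.17.0.0/255.255.0.0(rw,no_root_squash,sync)

After editing the file, run exportfs -ra so that the NFS server picks up the change.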
Scaling considerations
Because of the large number of Linux instances that boot whenever a block is booted in
compute node Linux mode, network services such as file servers and time servers can
experience more requests than they can handle without some system configuration changes.
This section discusses some common techniques that system administrators can use to
alleviate scaling problems.
Limiting concurrent service startups on the compute nodes
A network service might not be able to keep up with every compute node attempting to
access it simultaneously when a partition boots in compute node Linux mode. For example, the
service node’s rpc.mountd service might crash when all the compute nodes attempt to mount
its NFS file system. Thus, you might need to introduce some limits to parallelization among
the nodes to alleviate this class of problem. A new feature included with compute node Linux
support makes this configuration easy to do.
The administrator can indicate that only a certain number of the init scripts run in parallel
when the compute nodes are booting. Simply add a number at the end of the init script’s
name in /etc/init.d that indicates the number of parallel instances of the init script that can run
simultaneously when the block boots or is shut down.
For example, to limit the number of parallel mounts of the site file systems to 64, change the
name of /bgsys/iofs/etc/init.d/S10sitefs to S10sitefs.64. The number can be between 1 and
999. The service startup pacing is handled by the new /etc/init.d/atomic script when the block
is booted in compute node Linux mode. The parallelization limit is scoped to the block, so booting
several blocks at a time can still overwhelm the service.
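For example, to apply the limit of 64 that is described above to the site file system init script, rename the script as follows:

$ mv /bgsys/iofs/etc/init.d/S10sitefs /bgsys/iofs/etc/init.d/S10sitefs.64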
Increasing the number of threads for rpc.mountd
Administrators might also find it necessary to increase the number of threads that rpc.mountd
has available in order to prevent it from crashing. As of SLES 10 SP2, you can now pass
rpc.mountd the -t option to specify the number of rpc.mountd threads to spawn. To increase
performance and to prevent rpc.mountd from crashing due to load, we suggest updating to
the SP2 version of nfs-utils (nfs-utils-1.0.7-36.29 or later). To take advantage of this option on
boot, you need to edit /etc/init.d/nfsserver.
Example 1 contains an excerpt from the nfsserver init script that shows how to start
rpc.mountd with 16 threads.
Example 1 The nfsserver init script
if [ -n "$MOUNTD_PORT" ] ; then
    startproc /usr/sbin/rpc.mountd -t 16 -p $MOUNTD_PORT
else
    startproc /usr/sbin/rpc.mountd -t 16
fi
Increasing the size of the ARP cache
The default ARP cache settings in SLES 10 SP1 or SP2 are insufficient to deal with a large
number of compute nodes booting in Linux mode. Without any modifications, neighbor table
overflow errors occur, and related messages appear in /var/log/messages on any system on
the functional network that a large number of compute nodes are trying to access. This
problem can affect the service node
or any other file servers that the compute nodes are trying to contact. To prevent neighbor
table overflows for 4096 compute nodes, use the ARP settings in /etc/sysctl.conf that are
shown in Example 2.
Example 2 Updated ARP settings
# No garbage collection will take place if the ARP cache has fewer than gc_thresh1 entries
net.ipv4.neigh.default.gc_thresh1 = 1024
# The kernel will start to try to remove ARP entries if the soft maximum of entries to keep (gc_thresh2) is exceeded
net.ipv4.neigh.default.gc_thresh2 = 4096
# Set the maximum ARP table size. The kernel will not allow more than gc_thresh3 entries
net.ipv4.neigh.default.gc_thresh3 = 16384
After updating /etc/sysctl.conf, you can run sysctl -p /etc/sysctl.conf to cause the
changes to take effect immediately. To ensure that it is set on each boot, configure sysctl to
run on startup by running the chkconfig boot.sysctl 35 command. You need to make these
changes to all systems that have resources that the compute nodes access in large numbers,
including the service node, other file servers, and potentially front end nodes.
MMCS commands
You can allocate blocks for Linux by using the MMCS allocate command from the
mmcs_db_console as follows:
mmcs$ allocate <blockname> htc=linux
In the compute node Linux case, MMCS still uses the images that are associated with the
block definition to determine what to load onto each node. To designate the Linux images for
compute nodes, you can change the block’s boot information using the setblockinfo MMCS
command, as shown in Example 3.
Example 3 The setblockinfo command
mmcs$ setblockinfo <blockname> /bgsys/drivers/ppcfloor/boot/uloader
  /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
  /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
In this command, note that the Linux images are provided twice to indicate that the same
images need to be used for both compute nodes and I/O nodes. Using setblockinfo along
with the allocate statement boots a block with Linux running on the compute nodes.
Because the setblockinfo change is persistent, you might want to have two block definitions
for the same hardware—one for normal boots with the CNK and another for using compute
node Linux.
If you prefer not to modify the block definition, you can also override the images at boot time
by using console commands, as shown in Example 4.
Example 4 Override boot images
mmcs$ allocate_block <blockname> htc=linux
mmcs$ boot_block update
  cnload=/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk
When the boot completes, you can use list_blocks to see the block’s mode, as shown in
Example 5 for block R00-M0-N14.
Example 5 The list_blocks output
mmcs$ list_blocks
OK
R00-M0-N14    I bguser1(1)    connected    htc=linux
There are some differences in the behavior of a block that is booted with compute node Linux.
For example, when booting a block with compute node Linux, MMCS waits for the full Linux
startup to complete on all of the compute nodes before making the block available for use and
updating its state to initialized. For normal CNK blocks, this wait applies only to the I/O
nodes, but for compute node Linux, MMCS must wait for Linux startup on all compute nodes as
well as I/O nodes. Similarly, when freeing a block that was booted with compute node Linux,
MMCS must wait for the complete Linux shutdown process on all nodes. For these reasons,
you might see differences in the time that it takes to allocate and free blocks when using
compute node Linux.
You can also use redirect on the blocks in the same way that you do for blocks that are booted
with CNK. Because Linux is running on all nodes, you see much more Linux output when
allocating and freeing these blocks. Again, this increase in output is because Linux is running
on every node, not just the I/O nodes. MMCS allows the sysrq command to run against
compute nodes when the block is booted with compute node Linux. For blocks booted with
CNK, the sysrq command is allowed only on I/O nodes.
After a block is booted in compute node Linux mode, users can run jobs on the compute
nodes.
Blue Gene/P Navigator
Blue Gene/P Navigator provides a Web-based administrative console that you can use to
browse and monitor a Blue Gene/P system. There are several enhancements to the
Navigator to support compute node Linux.
The Navigator’s Block Summary page shows the HTC Status for a block as Linux if it is
booted in compute node Linux mode. If you select a block that is using the default Linux
images, the block details page shows the CN Load Image as Linux default. Normally, the CN
Load Image field does not display because the block definition uses the default images. If the
block is booted in Linux mode, then the options field also shows l (lowercase L).
The Midplane Activity page shows the Job Status as HTC - Linux if the block is booted in
Linux mode. This page also shows the number of Linux jobs submitted to the block on the
midplane through the submit command.
After the installation steps are complete, the compute nodes have IP addresses assigned to
them. You can use the Navigator’s Hardware Browser to browse the compute node IP
addresses and to search for compute nodes by IP address.
Figure 1 shows the Hardware Browser page on a system with IP addresses assigned to
compute nodes. To search for a compute node with a given IP address, enter the IP address
in the “Jump to” input field and click Go.
Figure 1 Hardware browser in Blue Gene/P Navigator
Using compute node Linux
This section describes how to run jobs on a partition that is booted in compute node Linux
mode.
Submitting jobs
You must boot a partition in compute node Linux mode before you can submit Linux programs
to it. You can boot a partition in compute node Linux mode from the front end node using the
htcpartition command with the mode parameter set to linux_smp.
Note: Compute node images for the partition must be set to the Linux images before
calling htcpartition. The system administrator can set the images using the MMCS
console or through the job scheduler APIs.
You can also boot a partition in compute node Linux mode directly through the MMCS
console as described in “MMCS commands” on page 6. This method requires access to the
service node.
You can run programs on the compute nodes of a partition that is booted for compute node
Linux in several ways. One method is to use the submit command with the mode
parameter set to linux_smp. When using linux_smp mode, you can submit only one program
to each location in the HTC pool. If no location is supplied, MMCS picks an unused location if
one is available. To run multiple programs on a compute node using submit, you can submit a
shell script that runs the programs in the background.
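For example, a minimal wrapper script along the following lines (the program paths are placeholders) starts two programs in the background and waits for both to finish before the job exits:

#!/bin/sh
# Placeholder programs; substitute your own executables
/bgusr/bguser1/prog_a &
/bgusr/bguser1/prog_b &
wait

You then pass the script itself to submit in place of a single executable.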
Example 6 provides an example of using htcpartition and submit to boot a partition for
compute node Linux, run a couple of jobs, and then free the partition.
Example 6 Submitting a compute node Linux program using submit
$ htcpartition --boot --partition MY_CNL_PARTITION --mode linux_smp
$ submit -mode linux_smp -pool MY_CNL_PARTITION hello_world.cnl
hello world from pid 946
$ submit -mode linux_smp -pool MY_CNL_PARTITION hello_world.cnl
hello world from pid 947
$ htcpartition --free --partition MY_CNL_PARTITION
Another method to run a compute node Linux job is to ssh to the compute node. To use this
method, you need to know the IP address of a compute node in the booted partition.
Example 7 provides an example of logging in to a compute node using ssh and
then running a couple of programs.
Example 7 Submitting a compute node Linux program using ssh
$ ssh 172.17.105.218
Password:
Last login: Fri Oct 31 10:25:42 2008 from 172.16.3.1
BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.
/bgusr/usr1/bguser1 $ ./hello_world.cnl
hello world from pid 1033
/bgusr/usr1/bguser1 $ /bgsys/drivers/ppcfloor/gnu-linux/bin/python py_hello.py
Hello, world!
/bgusr/usr1/bguser1 $ exit
Connection to 172.17.105.218 closed.
Advantages of using the submit command include that jobs start faster than with ssh and that
the job information is available in the Blue Gene/P Navigator.
For more information about the htcpartition and submit commands, see IBM System Blue
Gene Solution: Blue Gene/P Application Development, SG24-7287.
IBM Scheduler for High-Throughput Computing on Blue Gene/P
The IBM Scheduler for HTC on Blue Gene/P is a simple scheduler that you can use to submit
a large number of jobs to an HTC partition. The 2.0 version of the HTC scheduler is enhanced
to support compute node Linux. It submits jobs to a partition, keeps the maximum allowed
number of jobs running, monitors each job’s exit status, and restarts any jobs that fail.
You can download the IBM Scheduler for HTC on Blue Gene/P from the IBM alphaWorks®
Web site at:
http://www.alphaworks.ibm.com/tech/bgscheduler
Example 8 contains output that shows the use of the HTC scheduler to run some jobs on a
compute node Linux partition. Note the use of the L in the pool size, indicating that the nodes
are using Linux.
Example 8 Submitting jobs to a compute node Linux partition using the HTC Scheduler
$ cat cmds.txt
/bgusr/bguser1/hello.cnl
/bgusr/bguser1/hello.cnl
/bgusr/bguser1/hello.cnl
$ /bgsys/opt/simple_sched/bin/run_simple_sched_jobs -pool-name MY_CNL_PARTITION
-pool-size 32L cmds.txt
Using temporary configuration file 'my_simple_sched.cfg.JVwW4K'.
2 -| 2 is COMPLETED exit status 0 [location='R00-M0-N14-J28-C00' jobId=2312609
partition='MY_CNL_PARTITION']
1 -| 1 is COMPLETED exit status 0 [location='R00-M0-N14-J25-C00' jobId=2312611
partition='MY_CNL_PARTITION']
3 -| 3 is COMPLETED exit status 0 [location='R00-M0-N14-J12-C00' jobId=2312610
partition='MY_CNL_PARTITION']
Submitted 3 jobs, 3 completed, 0 had non-zero exit status, 0 requests failed.
$ cat submit-1.out
hello world from pid 1015
LoadLeveler® can act as a meta-scheduler for the HTC scheduler in a compute node Linux
environment. LoadLeveler can create partitions dynamically and manage the booting and
freeing of those partitions. The HTC scheduler runs as a job under LoadLeveler in this setup
and submits the real HTC job workload. You can find further details about using the HTC
scheduler in a LoadLeveler installation in the HTC scheduler’s documentation.
Application development
Compute node Linux expands the types of programs that can run on the Blue Gene/P
compute nodes. As the name suggests, you can now compile normal Linux applications to
run on Blue Gene/P. This section contains information about writing and compiling a program
for compute node Linux, along with a sample program.
Writing programs
You can write programs to run on compute node Linux in the languages available for CNK,
such as C, C++, and Fortran. You can also write compute node Linux programs in other
languages, such as shell scripts, that are not available for CNK.
You compile the Linux applications written in C, C++, or Fortran for the compute nodes using
the same GNU toolchain that is used to compile CNK applications. The compilers are in
/bgsys/drivers/ppcfloor/gnu-linux/bin. The sample program shown in Example 10 includes a
Makefile that uses the C compiler.
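For example, a single-file program can also be compiled directly with the cross-compiler (a minimal sketch; the source and output file names here are placeholders):

$ /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc -o hello_world.cnl hello_world.c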
Use of the Blue Gene/P System Programming Interfaces (SPIs) is not recommended. Most of
these SPIs rely on system calls or memory mappings that exist only when CNK is the
operating system. There are usually alternate techniques to accomplish the same tasks in
Linux. For example, the SPI Kernel_GetPersonality() is used to obtain the node personality
when running on CNK. This SPI does not work on Linux, but the same action can be
accomplished by reading the /proc/personality file.
Sample program
Example 9 shows the source code for a simple program that creates threads that print the
location of the compute node on which they are running. The resulting executable can run on a
partition booted in any of the following modes:
- Normal HPC partition
- HTC partition with CNK on the compute nodes
- Compute node Linux
The program is submitted using mpirun for the first mode and submit for the other two modes.
Example 9 Sample code: threadhello.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/utsname.h>
#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>
#define MAX_THREAD 1000

typedef struct {
    int id;
    int sleepsec;
    char locationStr[16];
} parm;

static void *hello(void *arg) {
    parm *p = (parm *)arg;
    printf("Hello from thread %d - %s \n", p->id, p->locationStr);
    sleep(p->sleepsec);
    return (NULL);
}
int
main(int argc, char* argv[]) {
    int s = 0;
    pthread_t *threads;
    parm *p;
    struct utsname utsname;
    _BGP_Personality_t pers;
    char locationStr[15] = "";
    int n, i;
    int rc = 0;

    if (argc < 2)
    {
        printf("Usage: %s n [s] \n    where n is the number of threads and s is \
the number of seconds for each thread to sleep\n", argv[0]);
        return 0;
    }
    n = atoi(argv[1]);
    if ((n < 1) || (n > MAX_THREAD))
    {
        printf("The number of threads should be between 1 and %d.\n", MAX_THREAD);
        return 0;
    }
    if (argc > 2)
        s = atoi(argv[2]);

    threads = (pthread_t *)malloc(n * sizeof(*threads));
    p = (parm *)malloc(n * sizeof(*p));

    uname(&utsname);
    if (strcmp(utsname.sysname, "Linux") == 0)
    {
        FILE *stream = fopen("/proc/personality", "r");
        if (stream == NULL)
        {
            printf("couldn't open /proc/personality \n");
            return 0;
        }
        fread(&pers, sizeof(_BGP_Personality_t), 1, stream);
        fclose(stream);
    }
    else
    {
        Kernel_GetPersonality(&pers, sizeof(_BGP_Personality_t));
    }
    BGP_Personality_getLocationString(&pers, locationStr);

    rc = 0;
    for (i = 0; (i < n) && (rc == 0); i++)
    {
        p[i].id = i;
        p[i].sleepsec = s;
        strcpy(p[i].locationStr, locationStr);
        rc = pthread_create(&threads[i], NULL, hello, (void *)(p + i));
    }
    n = i; /* n is now the number of threads created. */

    /* Synchronize the completion of each thread. */
    for (i = 0; i < n; i++)
    {
        pthread_join(threads[i], NULL);
    }
    free(threads);
    free(p);
    return 0;
}
Example 10 contains the Makefile used to build the code in Example 9.
Example 10 Sample program Makefile
.PHONY: clean

CC = /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc

CFLAGS += -isystem /bgsys/drivers/ppcfloor/arch/include
CFLAGS += -pthread
CFLAGS += -g

LDFLAGS += -L /bgsys/drivers/ppcfloor/runtime/SPI -lSPI.cna
LDFLAGS += -lrt
LDFLAGS += -pthread

ALL: threadhello

threadhello: threadhello.o
	$(CC) -o $@ $^ $(LDFLAGS)

clean:
	$(RM) threadhello threadhello.o
There are two key aspects of the sample program. First, this example shows how threads are
managed in the different modes. When running on a normal HPC block, you decide at job
submission time how to use the four cores on each Blue Gene/P node. The number of
additional threads that are allowed in each process is determined by the mode that you
choose (SMP, Dual, or Virtual Node).
For HTC, the mode is determined at the time that you boot the partition. For compute node
Linux, because Linux runs on every node and manages thread creation, any number of
threads can be created. Using the sample code, you can run the program in many different
modes and see the following behavior:
- Normal HPC partition
  - SMP: 1 process per node, each with up to 3 additional threads
  - Dual: 2 processes per node, each with up to 1 additional thread
  - VNM: 4 processes per node, no additional threads
- HTC partition with CNK on compute nodes
  - htc=smp: 1 job per node, each with up to 3 additional threads
  - htc=dual: 2 jobs per node, each with up to 1 additional thread
  - htc=vnm: 4 jobs per node, no additional threads
- Compute node Linux
  - No limit on the number of threads; limited only by the available memory on the node
The other key aspect of the example is that it shows how to get the compute node’s
personality in either CNK or compute node Linux. When uname() indicates that the program
is running on Linux rather than CNK, it reads the personality from /proc/personality into a
_BGP_Personality_t structure. You can use this same technique to read other values from
the node’s personality.
Job scheduler interfaces
Job schedulers use the Blue Gene/P job scheduler interfaces to communicate with
MMCS. This section discusses how a scheduler can use the interfaces to boot blocks that can
then be used for compute node Linux. For more information about the job scheduler
interfaces, see IBM System Blue Gene Solution: Blue Gene/P Application Development,
SG24-7287.
To boot a partition that can be used for compute node Linux, the job scheduler must ensure
that the partition is configured to use Linux when the partition is booted. A partition is
configured to use Linux if its compute node load images are set to the Linux images and its
options field indicates to use Linux.
Setting the compute node load images and options field can be done using the bridge APIs.
The compute node images for Linux are:
- /bgsys/drivers/ppcfloor/boot/cns
- /bgsys/drivers/ppcfloor/boot/linux
- /bgsys/drivers/ppcfloor/boot/ramdisk
The partition’s options field must be set to l (lower-case L).
A partition’s options field is cleared by MMCS when the partition is freed. As such, if the
scheduler boots a partition for Linux use after freeing a partition that was just used for Linux, it
must set the options field again. After the partition’s compute node load images are set, they
will not be modified by MMCS.
If a scheduler uses the dynamic partition allocator APIs, it can call rm_allocate_htc_pool()
with the mode_extender parameter set to RM_LINUX to create partitions with the compute
node load images and the options field set to use Linux.
Performance results
In our testing, we ran several preliminary performance tests on our compute node Linux
implementation. The following sections summarize the results.
CPU performance results
This section contains the results of performance tests that measure the CPU performance.
Stream
Stream is an industry standard benchmark that measures memory bandwidth. The Stream
suite consists of four benchmarks:
- Copy
- Add
- Scale
- Triad
All of these benchmarks measure the bandwidth to L1 cache, L3 cache, and main memory.
Figure 2 and Figure 3 summarize the results from the stream_omp code found at the following
Web site:
http://www.cs.virginia.edu/stream/FTP/Code/Versions/
The results show the difference in main memory bandwidth between the two modes. We made
five different runs: a non-threaded run and threaded runs with 1, 2, 3, and 4 threads.
Figure 2 Stream benchmark on compute node Linux (Add, Copy, Scale, and Triad bandwidth in MB/sec; DD2 850 MHz, main memory N=4000000; non-threaded and 1- to 4-thread runs)
Figure 3 shows the results for the same tests running under CNK.
Figure 3 Stream benchmark on CNK (Add, Copy, Scale, and Triad bandwidth in MB/sec; DD2 850 MHz, main memory N=4000000; non-threaded and 1- to 4-thread runs)
Dgemm
The Dgemm benchmark is part of the HPC Challenge Benchmark. It is used to measure the
floating-point execution rate of double precision real matrix-matrix multiply. The results of this
test, summarized in Table 1, show the difference in floating point performance between the
two modes. We used Dgemm version 1.0.0 from the University of Victoria to obtain the
results. It is available for download at:
http://www.westgrid.ca/npf/benchmarks/uvic
Table 1 Dgemm performance results

Mode                  SingleDGEMM_Gflops N=3744 (CNK Max)    SingleDGEMM_Gflops N=5100 (compute node Linux Max)
compute node Linux    2.48025 GFLOPS                         2.5618 GFLOPS
CNK                   2.77518 GFLOPS                         2.7642 GFLOPS
Psnap
Psnap is a system benchmark used to quantify operating system interference or noise. It
returns the percentage slowdown due to OS noise. The results of this test show the difference
in operating system noise between the CNK and compute node Linux modes. We used
Psnap version 1.0 to obtain these results:
- For compute node Linux, the average slowdown is 0.0739%, the minimum slowdown is 0.0611%, and the max slowdown is 0.0987%.
- For CNK, the average slowdown is 0.0617%, the minimum slowdown is 0.0616%, and the max slowdown is 0.0618%.
You can download Psnap at:
http://www.ccs3.lanl.gov/pal/software/psnap/
I/O performance results
This section contains the results of performance tests that measure the I/O performance.
IOR
IOR measures the I/O system bandwidth from the compute node to the disks in the file
system. It is an end-to-end I/O performance benchmark. The results, summarized in Figure 4,
compare the IOR performance for compute node Linux versus CNK with 1, 8, 16, and 32
compute nodes. We used IOR version 2.10.1 to obtain these results. You can download it at:
http://sourceforge.net/projects/ior-sio/
Figure 4 IOR performance results (CNL versus CNK write and read bandwidth in MB/s for 1, 8, 16, and 32 compute nodes)
IPERF
IPERF is used to measure I/O performance between two nodes. In this case, I/O performance
was measured between two compute nodes, between a compute node and I/O node, and
between a compute node and front end node. These measurements cannot be run in a CNK
environment, so there is no way to compare them with CNK. They do, however, show the I/O
bandwidth between compute nodes and between a compute node and I/O node. The results
are summarized in Table 2. We used IPERF version 2.0.3 to obtain these results. You can
download it from:
http://sourceforge.net/projects/iperf/
Table 2 IPERF performance results

Description                  Compute node to    Compute node to    Compute node to
                             compute node       I/O node           front end node
-P 2 -w 500k -l 4m -t 60     238 MBps
-P 4 -w 500k -l 4m -t 60     339 MBps
-P 8 -w 500k -l 4m -t 60     339 MBps           346 MBps           236 MBps
-P 12 -w 500k -l 4m -t 60    348 MBps           346 MBps           236 MBps
-P 14 -w 500k -l 4m -t 60    348 MBps
-P 16 -w 500k -l 4m -t 60    347 MBps           346 MBps           235 MBps
-P 16 -w 200k -l 4m -t 60    347 MBps           347 MBps           232 MBps
-P 16 -w 800k -l 4m -t 60    347 MBps
-P 16 -w 1m -l 4m -t 60      348 MBps           346 MBps           229 MBps
The team that wrote this paper
This paper was produced by a team of specialists working at the International Technical
Support Organization (ITSO), Rochester Center.
Brant Knudson is a Staff Software Engineer in the Advanced Systems SW Development
group of IBM in Rochester, Minnesota, where he has been a programmer on the Blue Gene
Control System team since 2003. Prior to working on Blue Gene, he worked on the IBM LDAP
directory server.
Jeff Chauvin is a Senior IT Specialist with the IBM Server System’s Operations group in
Rochester, Minnesota. He has worked on Blue Gene since 2004, helping with all IT related
aspects of the project.
Jeffrey Lien is an Advisory Software Engineer for IBM Rochester and is a member of the
Blue Gene performance team. He specializes in measuring and tuning I/O performance for
Blue Gene customers as well as running I/O benchmarks needed for customer bids. Prior to
joining the Blue Gene performance team, Jeff developed and tested I/O device drivers used
on Ethernet and modem adapter cards. Jeff has been with IBM since 1990.
Mark Megerian is a Senior Software Engineer in the IBM Advanced Systems Software
Development group in Rochester, Minnesota. He has worked on Blue Gene since 2002 in the
area of control system development.
Andrew Tauferner is an Advisory Software Engineer in the IBM Advanced Systems Software
Development group in Rochester, Minnesota, where he has been a programmer on
the I/O node kernel since 2005. Prior to working on Blue Gene, he worked on the IBM
integrated xSeries® server products, developing Windows® and Linux I/O infrastructure.
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not give you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any
manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the
materials for this IBM product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring
any obligation to you.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs.
© Copyright International Business Machines Corporation 2009. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by
GSA ADP Schedule Contract with IBM Corp.
This document REDP-4453-00 was created or updated on January 19, 2009.
Send us your comments in one of the following ways:
- Use the online Contact us review Redbooks form found at:
  ibm.com/redbooks
- Send your comments in an e-mail to:
  redbooks@us.ibm.com
- Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400 U.S.A.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation in the United States, other countries, or both. These and other IBM trademarked terms are
marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US
registered or common law trademarks owned by IBM at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries. A current list of IBM
trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
alphaWorks®
Blue Gene/P™
Blue Gene®
GPFS™
IBM®
LoadLeveler®
Redbooks (logo)®
xSeries®
The following terms are trademarks of other companies:
Windows and the Windows logo are trademarks of Microsoft Corporation in the United States, other
countries, or both.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.