Redpaper

IBM System Blue Gene Solution: Compute Node Linux

Brant Knudson
Jeff Chauvin
Jeffrey Lien
Mark Megerian
Andrew Tauferner

Overview

This IBM® Redpaper publication describes the use of compute node Linux® on the IBM System Blue Gene® Solution. Readers of this paper need to be familiar with general Blue Gene/P™ system administration and application development concepts.

The normal boot process for a Blue Gene/P partition loads Linux on the I/O nodes and loads the Blue Gene/P Compute Node Kernel (CNK) on the compute nodes. This standard configuration provides the best performance and reliability for running applications on Blue Gene/P. The lightweight CNK provides a subset of system calls and has tight control over the threads and processes that run on the node. Thus, the CNK causes very little interference with the applications that run on the compute nodes.

Blue Gene/P release V1R3 provides compute node Linux, a new feature that allows users to run Linux on the compute nodes. This feature provides a new means for research and experimentation by allowing all of the compute nodes in a Blue Gene/P partition to operate independently with a full Linux kernel. While this environment is not optimal for running high-performance applications or highly parallel applications that communicate using the Message Passing Interface (MPI), you might want Blue Gene/P to act like a large cluster of nodes that are each running Linux. The term High-Throughput Computing (HTC) describes applications that run in this environment. HTC was introduced as a software feature in V1R2 of Blue Gene/P. Using HTC allows all nodes to act independently, each capable of running a different job.

The new compute node Linux feature provides benefits that can make it easier to develop or port applications to Blue Gene/P. The biggest benefit that application developers might notice when running Linux on the compute nodes is that there is no restriction on the number of threads that can run on a node. There is also no restriction on system calls, because the compute nodes are running Linux rather than the CNK. This feature opens Blue Gene/P to potential new applications that can take advantage of the full Linux kernel.

Because this feature is enabled for research and is not considered a core feature of the Blue Gene/P software stack, it has not been tested under an exhaustive set of circumstances. IBM has performed extensive testing of the compute node Linux functionality that is within the core Blue Gene/P software stack. However, because having a completely functional Linux kernel on the compute nodes allows many new types of applications to run on the compute nodes, we cannot claim to have tested every scenario. To be specific, IBM has not formally tested this environment with external software such as IBM General Parallel File System (GPFS™), XL compilers, Engineering Scientific Subroutine Library (ESSL), or the HPC Toolkit.

Because of the experimental nature of compute node Linux, it is disabled by default. You cannot boot a partition to run Linux on the compute nodes until you receive an activation key from IBM. To request an activation key, contact your assigned Blue Gene/P Technical Advocate. The Technical Advocate can help determine whether you qualify for the key and, upon qualification, can assist with the additional agreements that you need in place.
This function is not restricted, and there is no fee, so each customer can contact IBM individually for a key.

This paper discusses the following topics:
- How compute node Linux works
- System administration
- Using compute node Linux
- Application development
- Job scheduler interfaces
- Performance results

How compute node Linux works

The same Linux image that is used on the I/O nodes is loaded onto the compute nodes when a block is booted in compute node Linux mode. This kernel is a 32-bit PowerPC SMP kernel at version 2.6.16.46 (later releases of Blue Gene/P might have a newer version of the kernel). Other than a runtime test in the collective-network device driver that determines the node type, the kernel behaves identically on I/O nodes and compute nodes. The same init scripts execute on both the compute nodes and I/O nodes when the node is booted or shut down.

To indicate to the Midplane Management Control System (MMCS) that the same kernel is running on the compute nodes as on the I/O nodes, the partition's compute node images must be set to the Linux images. The Linux boot images are the cns, linux, and ramdisk files in the directory /bgsys/drivers/ppcfloor/boot/. When booting the partition, you also need to tell MMCS to handle the boot differently than it does with CNK images. You indicate to MMCS to use compute node Linux by setting the partition's options field to l (lower-case L). You can make both of these changes from either the MMCS console or the job scheduler interface, as we describe in "Job scheduler interfaces" on page 14.

When you boot a partition to use compute node Linux, the compute nodes and the I/O node in each pset have an IP interface on the collective network through which they can communicate. Because the compute nodes are each assigned a unique IP address, each can be addressed individually by hosts outside of the Blue Gene/P system. The number of IP addresses can be very large, given that each rack of Blue Gene/P hardware contains between 1032 and 1088 nodes. To accommodate a high number of addresses, a private class A network is recommended, although a smaller network might suffice depending on the size of your Blue Gene/P system. When setting up the IP addresses, the compute nodes and I/O nodes must be on the same subnet.

Unlike I/O nodes, the compute nodes have physical access only to the collective network. To provide compute nodes access to the functional network, the I/O nodes use proxy ARP to intercept IP packets that are destined for the compute nodes. Every I/O node acts as a proxy for the compute nodes in its pset, replying to ARP requests that are destined for compute nodes. These replies automatically establish proper routing into the pset for every compute node. With proxy ARP, both IP interfaces on the I/O nodes have the same IP address. The I/O node establishes a route for every compute node in its pset at boot time.

Compute node Linux uses the Blue Gene/P HTC mode when jobs run on a partition. Using HTC allows all nodes to act independently, where each node can potentially run a different executable. HTC is often contrasted with the long-accepted term High Performance Computing (HPC). HPC refers to booting partitions that run a single program on all of the compute nodes, primarily using MPI to communicate and do work in parallel.
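Because the I/O nodes answer ARP requests on behalf of their compute nodes, you can confirm from any host on the functional network that this routing is in place after a block is booted. The following is a minimal sketch run from a front end node; the compute node address 172.17.105.218, the interface name eth0, and the output shown are illustrative values that depend on your site configuration.

$ ping -c 1 172.17.105.218
$ arp -n | grep 172.17.105.218
172.17.105.218   ether   00:xx:xx:xx:xx:xx   C   eth0

The MAC address in the ARP entry belongs to the I/O node that serves the compute node's pset, because that I/O node replied to the ARP request on behalf of the compute node.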
Because the compute nodes are accessible from the functional network after the block is booted, and because they run the services that typically run on a Linux node, users can ssh to a compute node and run programs from the command prompt. Users can also run jobs on the compute nodes using Blue Gene/P's HTC infrastructure through the submit command.

System administration

This section describes the installation and configuration tasks for compute node Linux that system administrators must perform and the interfaces that are provided for these tasks.

Installation

To install compute node Linux, you (as the system administrator) must enter the activation key into the Blue Gene/P database properties file. The instructions that explain how to enter the key are provided with the activation key, which is available by contacting your Blue Gene Technical Advocate. If the database properties file does not contain a valid activation key, then blocks fail to boot in Linux mode with the following error message:

boot_block: linux on compute nodes not enabled

You need to set the IP addresses of the compute nodes using the Blue Gene/P database populate script, commonly referred to as DB populate. If the IP addresses are not set, then blocks fail to boot in Linux mode. When setting the compute node IP addresses, you need to tell DB populate the IP address at which to start. You can calculate this value from the IP address of the last I/O node by adding 1 to both the second and fourth octets of that IP address (where the first octet is on the left). To get the IP address of the last I/O node, run the following commands on the service node:

$ . ~bgpsysb/sqllib/db2profile
$ db2 connect to bgdb0 user bgpsysdb
(type in the password for the bgpsysdb user)
$ db2 "select ipaddress from bgpnode where location = (select max(location) from bgpnode where isionode = 'T')"

The output of the query looks similar to the following example:

IPADDRESS
--------------
172.16.100.64

So, if you add 1 to the second and fourth octets of this IP address, you have a starting IP address for DB populate of 172.17.100.65. After you have the IP address, invoke the DB populate script using the following commands:

$ . ~bgpsysb/sqllib/db2profile
$ cd /bgsys/drivers/ppcfloor/schema
$ ./dbPopulate.pl --input=BGP_config.txt --size=<size> --dbproperties /bgsys/local/etc/db.properties --cniponly --proceed --ipaddress <ipaddress>

In this command, replace <size> with the dimensions of your system in racks as <rows>x<columns>, and replace <ipaddress> with the IP address that you just calculated. For example:

$ ./dbPopulate.pl --input=BGP_config.txt --size=1x1 --dbproperties /bgsys/local/etc/db.properties --cniponly --proceed --ipaddress 172.17.100.65

This command can run for several minutes on a large system. The output of DB populate indicates whether the command succeeds or fails. If it fails, you need to correct the issue and re-run the program. You can run DB populate later to change the IP addresses of the compute nodes.

You also need to add a route to the compute nodes' network on the service node, front end nodes, and any other system on the functional network, so that those systems can communicate with the compute nodes. You need to edit the /etc/sysconfig/network/routes file on each system and then restart the network service or run route.
For example, using the previous configuration where the service node is on the 172.16.0.0/16 network and the compute nodes are on 172.17.0.0/16, you add the following line to the /etc/sysconfig/network/routes file:

172.0.0.0 0.0.0.0 255.0.0.0 eth0

This change takes effect on each boot. To make the change take effect immediately, you can run /etc/init.d/network restart or route add -net 172.0.0.0 netmask 255.0.0.0 eth0.

In addition to adding the route to the service node, you might need to update the /etc/exports NFS configuration file on the service node to include the compute nodes. If you do not, the compute nodes cannot mount the file system and will fail to boot.

Scaling considerations

Because of the large number of Linux instances that boot whenever a block is booted in compute node Linux mode, network services such as file servers and time servers can experience more requests than they can handle without some system configuration changes. This section discusses some common techniques that are available to system administrators to alleviate scaling problems.

Limiting concurrent service startups on the compute nodes

A network service might not be able to keep up with every compute node that attempts to access it simultaneously when a partition boots in compute node Linux mode. For example, the service node's rpc.mountd service might crash when all the compute nodes attempt to mount its NFS file system. Thus, you might need to limit parallelization among the nodes to alleviate this class of problem. A new feature included with compute node Linux support makes this configuration easy to do. The administrator can indicate that only a certain number of the init scripts run in parallel when the compute nodes are booting. Simply add a number at the end of the init script's name in /etc/init.d that indicates the number of parallel instances of the init script that can run simultaneously when the block boots or is shut down. For example, to limit the number of parallel mounts of the site file systems to 64, change the name of /bgsys/iofs/etc/init.d/S10sitefs to S10sitefs.64. The number can be between 1 and 999. The service startup pacing is handled by the new /etc/init.d/atomic script when the block is booted in compute node Linux mode. The parallelization is scoped to the block, so booting several blocks at a time can still overwhelm a service.

Increasing the number of threads for rpc.mountd

Administrators might also find it necessary to increase the number of threads that rpc.mountd has available in order to prevent it from crashing. As of SLES 10 SP2, you can pass rpc.mountd the -t option to specify the number of rpc.mountd threads to spawn. To increase performance and to prevent rpc.mountd from crashing under load, we suggest updating to the SP2 version of nfs-utils (nfs-utils-1.0.7-36.29 or later). To take advantage of this option on boot, you need to edit /etc/init.d/nfsserver. Example 1 contains an excerpt from the nfsserver init script that illustrates how to start rpc.mountd with 16 threads.

Example 1 The nfsserver init script
if [ -n "$MOUNTD_PORT" ] ; then
    startproc /usr/sbin/rpc.mountd -t 16 -p $MOUNTD_PORT
else
    startproc /usr/sbin/rpc.mountd -t 16
fi

Increasing the size of the ARP cache

The default ARP cache settings in SLES 10 SP1 or SP2 are insufficient to deal with a large number of compute nodes that boot in Linux mode.
Without any modifications, neighbor table overflow errors occur, and related messages appear in /var/log/messages on any system on the functional network whose resources a large number of compute nodes are trying to access. This problem can affect the service node or any other file servers that the compute nodes are trying to contact. To prevent neighbor table overflows for 4096 compute nodes, use the ARP settings in /etc/sysctl.conf that are shown in Example 2.

Example 2 Updated ARP settings
# No garbage collection will take place if the ARP cache has fewer than gc_thresh1 entries
net.ipv4.neigh.default.gc_thresh1 = 1024
# The kernel will start to try to remove ARP entries if the soft maximum of entries to keep (gc_thresh2) is exceeded
net.ipv4.neigh.default.gc_thresh2 = 4096
# Set the maximum ARP table size. The kernel will not allow more than gc_thresh3 entries
net.ipv4.neigh.default.gc_thresh3 = 16384

After updating /etc/sysctl.conf, you can run sysctl -p /etc/sysctl.conf to cause the changes to take effect immediately. To ensure that the settings are applied on each boot, configure sysctl to run on startup by running the chkconfig boot.sysctl 35 command. You need to make these changes on all systems whose resources the compute nodes access in large numbers, including the service node, other file servers, and potentially front end nodes.

MMCS commands

You can allocate blocks for Linux use from the mmcs_db_console with the MMCS allocate command as follows:

mmcs$ allocate <blockname> htc=linux

In the compute node Linux case, MMCS still uses the images that are associated with the block definition to determine what to load onto each node. To designate the Linux images for compute nodes, you can change the block's boot information using the setblockinfo MMCS command, as shown in Example 3.

Example 3 The setblockinfo command
mmcs$ setblockinfo <blockname> /bgsys/drivers/ppcfloor/boot/uloader /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk /bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk

In this command, note that the Linux images are provided twice to indicate that the same images are used for both compute nodes and I/O nodes. Using setblockinfo along with the allocate command boots a block with Linux running on the compute nodes. Because the setblockinfo change is persistent, you might want to have two block definitions for the same hardware: one for normal boots with the CNK and another for compute node Linux. If you prefer not to modify the block definition, you can also override the images at boot time using console commands, as shown in Example 4.

Example 4 Override boot images
mmcs$ allocate_block <blockname> htc=linux
mmcs$ boot_block update cnload=/bgsys/drivers/ppcfloor/boot/cns,/bgsys/drivers/ppcfloor/boot/linux,/bgsys/drivers/ppcfloor/boot/ramdisk

When the boot completes, you can use list_blocks to see the block's mode, as shown in Example 5 for block R00-M0-N14.

Example 5 The list_blocks output
mmcs$ list_blocks
OK
R00-M0-N14 I bguser1(1) connected htc=linux

There are some differences in the behavior of a block that is booted with compute node Linux.
For example, when booting a block with compute node Linux, MMCS waits for the full Linux startup to complete on all of the compute nodes before making the block available for use and updating its state to initialized. For normal CNK blocks, this wait applies only to the I/O nodes, but for compute node Linux, MMCS must wait for Linux startup on all compute nodes as well as the I/O nodes. Similarly, when freeing a block that was booted with compute node Linux, MMCS must wait for the complete Linux shutdown process on all nodes. For these reasons, you might see differences in the time that it takes to allocate and free blocks when using compute node Linux.

You can also use redirect on these blocks in the same way that you do for blocks that are booted with CNK. Because Linux is running on every node, not just the I/O nodes, you see much more Linux output when allocating and freeing these blocks.

MMCS allows the sysrq command to run against compute nodes when the block is booted with compute node Linux. For blocks booted with CNK, the sysrq command is allowed only on I/O nodes.

After a block is booted in compute node Linux mode, users can run jobs on the compute nodes.

Blue Gene/P Navigator

Blue Gene/P Navigator provides a Web-based administrative console that you can use to browse and monitor a Blue Gene/P system. There are several enhancements to the Navigator to support compute node Linux.

The Navigator's Block Summary page shows the HTC Status for a block as Linux if it is booted in compute node Linux mode. If you select a block that is using the default Linux images, the block details page shows the CN Load Image as Linux default. Normally, the CN Load Image field does not display because the block definition uses the default images. If the block is booted in Linux mode, then the options field also shows l (lowercase L).

The Midplane Activity page shows the Job Status as HTC - Linux if the block is booted in Linux mode. This page also shows the number of Linux jobs submitted to the block on the midplane through the submit command.

After the installation steps are complete, the compute nodes have IP addresses assigned to them. You can use the Navigator's Hardware Browser to browse the compute node IP addresses and to search for compute nodes by IP address. Figure 1 shows the Hardware Browser page on a system with IP addresses assigned to compute nodes. To search for a compute node with a given IP address, enter the IP address in the "Jump to" input field and click Go.

Figure 1 Hardware browser in Blue Gene/P Navigator

Using compute node Linux

This section describes how to run jobs on a partition that is booted in compute node Linux mode.

Submitting jobs

You must boot a partition in compute node Linux mode before you can submit Linux programs to it. You can boot a partition in compute node Linux mode from the front end node using the htcpartition command with the mode parameter set to linux_smp.

Note: Compute node images for the partition must be set to the Linux images before calling htcpartition. The system administrator can set the images using the MMCS console or through the job scheduler APIs.

You can also boot a partition in compute node Linux mode directly through the MMCS console as described in "MMCS commands" on page 6. This method requires access to the service node.
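To log in to a compute node directly, as described in the next section, you need its IP address. Besides browsing the Navigator's Hardware Browser, you can query the same bgpnode table that is used during installation. The following is a minimal sketch run on the service node; it assumes that compute nodes are marked with isionode = 'F', and the location prefix R00-M0-N14 is an illustrative value for the booted partition's hardware.

$ . ~bgpsysb/sqllib/db2profile
$ db2 connect to bgdb0 user bgpsysdb
$ db2 "select location, ipaddress from bgpnode where isionode = 'F' and location like 'R00-M0-N14-%'"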
You can run programs on the compute nodes in a partition that is booted to use compute node Linux using several methods. One method is to use the submit command with the mode parameter set to linux_smp. When using linux_smp mode, you can submit only one program to each location in the HTC pool. If no location is supplied, MMCS picks an unused location if one is available. To run multiple programs on a compute node using submit, you can submit a shell script that runs the programs in the background.

Example 6 provides an example of using htcpartition and submit to boot a partition for compute node Linux, run a couple of jobs, and then free the partition.

Example 6 Submitting a compute node Linux program using submit
$ htcpartition --boot --partition MY_CNL_PARTITION --mode linux_smp
$ submit -mode linux_smp -pool MY_CNL_PARTITION hello_world.cnl
hello world from pid 946
$ submit -mode linux_smp -pool MY_CNL_PARTITION hello_world.cnl
hello world from pid 947
$ htcpartition --free --partition MY_CNL_PARTITION

Another method to run a compute node Linux job is to ssh to the compute node. To use this method, you need to know the IP address of a compute node in the booted partition. Example 7 provides an example of logging in to a compute node using ssh and then running a couple of programs.

Example 7 Submitting a compute node Linux program using ssh
$ ssh 172.17.105.218
Password:
Last login: Fri Oct 31 10:25:42 2008 from 172.16.3.1

BusyBox v1.4.2 (2008-10-04 00:02:35 UTC) Built-in shell (ash)
Enter 'help' for a list of built-in commands.

/bgusr/usr1/bguser1 $ ./hello_world.cnl
hello world from pid 1033
/bgusr/usr1/bguser1 $ /bgsys/drivers/ppcfloor/gnu-linux/bin/python py_hello.py
Hello, world!
/bgusr/usr1/bguser1 $ exit
Connection to 172.17.105.218 closed.

Advantages of using the submit command are that it starts jobs faster than ssh and that the job information is available in the Blue Gene/P Navigator. For more information about the htcpartition and submit commands, see IBM System Blue Gene Solution: Blue Gene/P Application Development, SG24-7287.

IBM Scheduler for High-Throughput Computing on Blue Gene/P

The IBM Scheduler for HTC on Blue Gene/P is a simple scheduler that you can use to submit a large number of jobs to an HTC partition. The 2.0 version of the HTC scheduler is enhanced to support compute node Linux. It submits jobs to a partition, keeps the maximum allowed number of jobs running, monitors the job exit status to see whether any jobs fail, and restarts jobs that fail. You can download the IBM Scheduler for HTC on Blue Gene/P from the IBM alphaWorks® Web site at:

http://www.alphaworks.ibm.com/tech/bgscheduler

Example 8 contains output that shows the use of the HTC scheduler to run some jobs on a compute node Linux partition. Note the use of the L in the pool size, indicating that the nodes are using Linux.

Example 8 Submitting jobs to a compute node Linux partition using the HTC Scheduler
$ cat cmds.txt
/bgusr/bguser1/hello.cnl
/bgusr/bguser1/hello.cnl
/bgusr/bguser1/hello.cnl
$ /bgsys/opt/simple_sched/bin/run_simple_sched_jobs -pool-name MY_CNL_PARTITION -pool-size 32L cmds.txt
Using temporary configuration file 'my_simple_sched.cfg.JVwW4K'.
2 -| 2 is COMPLETED exit status 0 [location='R00-M0-N14-J28-C00' jobId=2312609 partition='MY_CNL_PARTITION']
1 -| 1 is COMPLETED exit status 0 [location='R00-M0-N14-J25-C00' jobId=2312611 partition='MY_CNL_PARTITION']
3 -| 3 is COMPLETED exit status 0 [location='R00-M0-N14-J12-C00' jobId=2312610 partition='MY_CNL_PARTITION']
Submitted 3 jobs, 3 completed, 0 had non-zero exit status, 0 requests failed.
$ cat submit-1.out
hello world from pid 1015

LoadLeveler® can act as a meta-scheduler for the HTC scheduler in a compute node Linux environment. LoadLeveler can create partitions dynamically and manage the booting and freeing of those partitions. In this setup, the HTC scheduler runs as a job under LoadLeveler and submits the real HTC job workload. You can find further details about using the HTC scheduler in a LoadLeveler installation in the HTC scheduler's documentation.

Application development

Compute node Linux expands the types of programs that can run on the Blue Gene/P compute nodes. As the name suggests, you can now compile normal Linux applications to run on Blue Gene/P. This section contains information about writing and compiling a program for compute node Linux, along with a sample program.

Writing programs

You can write programs to run on compute node Linux in the languages available for CNK, such as C, C++, and Fortran. You can also write compute node Linux programs in other languages, such as shell scripts, that are not available for CNK.

You compile the Linux applications written in C, C++, or Fortran for the compute nodes using the same GNU toolchain that is used to compile CNK applications. The compilers are in /bgsys/drivers/ppcfloor/gnu-linux/bin. The sample program shown in Example 9 includes a Makefile (Example 10) that uses the C compiler.

Use of the Blue Gene/P System Programming Interfaces (SPIs) is not recommended. Most of these SPIs rely on system calls or memory mappings that exist only when CNK is the operating system. There are usually alternate techniques to accomplish the same tasks in Linux. For example, the SPI Kernel_GetPersonality() is used to obtain the node personality when running on CNK. This SPI does not work on Linux, but the same action can be accomplished by reading the /proc/personality file.

Sample program

Example 9 shows the source code for a simple program that creates threads that print the location of the compute node on which the program is running. The resulting executable can run on a partition booted in any of the following modes:
1. Normal HPC partition
2. HTC partition with CNK on compute nodes
3. Compute node Linux

The program is submitted using mpirun for case 1 and submit for cases 2 and 3.
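The commands below sketch what those submissions might look like for the sample program. They are illustrative only: the partition name MY_PARTITION, the process count, and the program arguments (3 threads, 10 seconds of sleep) are assumptions, and only commonly used mpirun and submit options are shown.

$ # Case 1: HPC partition, SMP mode (one process per node)
$ mpirun -partition MY_PARTITION -mode SMP -np 32 -exe ./threadhello -args "3 10"
$ # Case 2: HTC partition booted with CNK (htc=smp)
$ submit -mode smp -pool MY_PARTITION ./threadhello 3 10
$ # Case 3: partition booted with compute node Linux
$ submit -mode linux_smp -pool MY_PARTITION ./threadhello 3 10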
Example 9 Sample code: threadhello.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/utsname.h>

#include <common/bgp_personality.h>
#include <common/bgp_personality_inlines.h>
#include <spi/kernel_interface.h>

#define MAX_THREAD 1000

typedef struct {
    int id;
    int sleepsec;
    char locationStr[16];
} parm;

static void *hello(void *arg)
{
    parm *p = (parm *)arg;
    printf("Hello from thread %d - %s \n", p->id, p->locationStr);
    sleep(p->sleepsec);
    return (NULL);
}

int main(int argc, char *argv[])
{
    int s = 0;
    pthread_t *threads;
    parm *p;
    struct utsname utsname;
    _BGP_Personality_t pers;
    char locationStr[15] = "";
    int n, i;
    int rc = 0;

    if (argc < 2) {
        printf("Usage: %s n [s]\n where n is the number of threads and s is "
               "the number of seconds for each thread to sleep\n", argv[0]);
        return 0;
    }
    n = atoi(argv[1]);
    if ((n < 1) || (n > MAX_THREAD)) {
        printf("The number of threads should be between 1 and %d.\n", MAX_THREAD);
        return 0;
    }
    if (argc > 2)
        s = atoi(argv[2]);

    threads = (pthread_t *)malloc(n * sizeof(*threads));
    p = (parm *)malloc(n * sizeof(*p));

    /* Read the node personality: from /proc on Linux, through the SPI on CNK. */
    uname(&utsname);
    if (strcmp(utsname.sysname, "Linux") == 0) {
        FILE *stream = fopen("/proc/personality", "r");
        if (stream == NULL) {
            printf("couldn't open /proc/personality \n");
            return 0;
        }
        fread(&pers, sizeof(_BGP_Personality_t), 1, stream);
        fclose(stream);
    } else {
        Kernel_GetPersonality(&pers, sizeof(_BGP_Personality_t));
    }
    BGP_Personality_getLocationString(&pers, locationStr);

    rc = 0;
    for (i = 0; (i < n) && (rc == 0); i++) {
        p[i].id = i;
        p[i].sleepsec = s;
        strcpy(p[i].locationStr, locationStr);
        rc = pthread_create(&threads[i], NULL, hello, (void *)(p + i));
    }
    n = i; /* n is now the number of threads created. */

    /* Synchronize the completion of each thread. */
    for (i = 0; i < n; i++) {
        pthread_join(threads[i], NULL);
    }
    free(threads);
    free(p);
    return 0;
}

Example 10 contains the Makefile used to build the code in Example 9.

Example 10 Sample program Makefile
.PHONY: clean

CC = /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc-bgp-linux-gcc

CFLAGS += -isystem /bgsys/drivers/ppcfloor/arch/include
CFLAGS += -pthread
CFLAGS += -g

LDFLAGS += -L /bgsys/drivers/ppcfloor/runtime/SPI -lSPI.cna
LDFLAGS += -lrt
LDFLAGS += -pthread

ALL: threadhello

threadhello: threadhello.o
	$(CC) -o $@ $^ $(LDFLAGS)

clean:
	$(RM) threadhello threadhello.o

There are two key aspects of the sample program. First, this example shows how threads are managed in the different modes. When running on a normal HPC block, you decide at job submission time how to use the four cores on each Blue Gene/P node. The number of additional threads that are allowed in each process is determined by the mode that you choose (SMP, Dual, or Virtual Node). For HTC, the mode is determined at the time that you boot the partition. For compute node Linux, because Linux runs on every node and manages thread creation, any number of threads can be created.
Using the sample code, you can run the program in many different modes and see the following behavior:

Normal HPC partition
– SMP: 1 process per node, each with up to 3 additional threads
– Dual: 2 processes per node, each with up to 1 additional thread
– VNM: 4 processes per node, no additional threads

HTC partition with CNK on compute nodes
– htc=smp: 1 job per node, each with up to 3 additional threads
– htc=dual: 2 jobs per node, each with up to 1 additional thread
– htc=vnm: 4 jobs per node, no additional threads

Compute node Linux
– No limit on the number of threads; limited only by the available memory on the node

The other key aspect of the example is that it shows how to get the compute node's personality on either CNK or compute node Linux. When uname() indicates that the program is running on Linux rather than CNK, it uses the compute node Linux method of reading the personality from /proc/personality into a _BGP_Personality_t structure. This same technique can be used to read other values from the node's personality.

Job scheduler interfaces

Job schedulers use the Blue Gene/P job scheduler interfaces to interact with MMCS. This section discusses how a scheduler can use the interfaces to boot blocks that can then be used for compute node Linux. For more information about the job scheduler interfaces, see IBM System Blue Gene Solution: Blue Gene/P Application Development, SG24-7287.

To boot a partition that can be used for compute node Linux, the job scheduler must ensure that the partition is configured to use Linux when the partition is booted. A partition is configured to use Linux if its compute node load images are set to the Linux images and its options field indicates Linux. Setting the compute node load images and the options field can be done using the bridge APIs. The compute node images for Linux are:

/bgsys/drivers/ppcfloor/boot/cns
/bgsys/drivers/ppcfloor/boot/linux
/bgsys/drivers/ppcfloor/boot/ramdisk

The partition's options field must be set to l (lower-case L). A partition's options field is cleared by MMCS when the partition is freed. As such, if the scheduler boots a partition for Linux use after freeing a partition that was just used for Linux, it must set the options field again. After the partition's compute node load images are set, they are not modified by MMCS.

If a scheduler uses the dynamic partition allocator APIs, it can call rm_allocate_htc_pool() with the mode_extender parameter set to RM_LINUX to create partitions with the compute node load images and the options field set to use Linux.

Performance results

In our testing, we ran several preliminary performance tests on our compute node Linux implementation. The following sections summarize the results.

CPU performance results

This section contains the results of performance tests that measure CPU performance.

Stream

Stream is an industry standard benchmark that measures memory bandwidth. The Stream suite consists of four benchmarks: Copy, Add, Scale, and Triad. All of these benchmarks measure the bandwidth to L1 cache, L3 cache, and main memory. Figure 2 and Figure 3 summarize the results from the stream_omp code found at the following Web site:

http://www.cs.virginia.edu/stream/FTP/Code/Versions/

The results show the difference in main memory bandwidth between the two modes.
We made five different runs, which included a non-threaded run along with threaded runs using 1, 2, 3, and 4 threads.

Figure 2 Stream benchmark on compute node Linux (Add, Copy, Scale, and Triad bandwidth in MB/sec; BGP DD2 850 MHz, main memory, N=4000000; non-threaded and 1- to 4-thread runs)

Figure 3 shows the results for the same tests running under CNK.

Figure 3 Stream benchmark on CNK (Add, Copy, Scale, and Triad bandwidth in MB/sec; BGP DD2 850 MHz, main memory, N=4000000; non-threaded and 1- to 4-thread runs)

Dgemm

The Dgemm benchmark is part of the HPC Challenge Benchmark. It measures the floating-point execution rate of a double precision real matrix-matrix multiply. The results of this test, summarized in Table 1, show the difference in floating-point performance between the two modes. We used Dgemm version 1.0.0 from the University of Victoria to obtain the results. It is available for download at:

http://www.westgrid.ca/npf/benchmarks/uvic

Table 1 Dgemm performance results

Mode                 SingleDGEMM_Gflops, N=3744 (CNK max)   SingleDGEMM_Gflops, N=5100 (compute node Linux max)
Compute node Linux   2.48025 GFLOPS                         2.5618 GFLOPS
CNK                  2.77518 GFLOPS                         2.7642 GFLOPS

Psnap

Psnap is a system benchmark used to quantify operating system interference or noise. It returns the percentage slowdown due to OS noise. The results of this test show the difference in operating system noise between the CNK and compute node Linux modes. We used Psnap version 1.0 to obtain these results:

- For compute node Linux, the average slowdown is 0.0739%, the minimum slowdown is 0.0611%, and the maximum slowdown is 0.0987%.
- For CNK, the average slowdown is 0.0617%, the minimum slowdown is 0.0616%, and the maximum slowdown is 0.0618%.

You can download Psnap at:

http://www.ccs3.lanl.gov/pal/software/psnap/

I/O performance results

This section contains the results of performance tests that measure I/O performance.

IOR

IOR measures the I/O system bandwidth from the compute node to the disks in the file system. It is an end-to-end I/O performance benchmark. The results, summarized in Figure 4, compare the IOR performance for compute node Linux versus CNK with 1, 8, 16, and 32 compute nodes. We used IOR version 2.10.1 to obtain these results. You can download it at:

http://sourceforge.net/projects/ior-sio/

Figure 4 IOR performance results (CNL versus CNK write and read bandwidth in MB/s for 1, 8, 16, and 32 nodes)

IPERF

IPERF is used to measure I/O performance between two nodes. In this case, I/O performance was measured between two compute nodes, between a compute node and an I/O node, and between a compute node and a front end node. These measurements cannot be run in a CNK environment, so there is no way to compare them with CNK. They do, however, show the I/O bandwidth between compute nodes and between a compute node and an I/O node. The results are summarized in Table 2. We used IPERF version 2.0.3 to obtain these results.
You can download it from:

http://sourceforge.net/projects/iperf/

Table 2 IPERF performance results

Description                Compute node to compute node   Compute node to I/O node   Compute node to front end node
-P 2 -w 500k -l 4m -t 60   238 MBps
-P 4 -w 500k -l 4m -t 60   339 MBps
-P 8 -w 500k -l 4m -t 60   339 MBps                       346 MBps                   236 MBps
-P 12 -w 500k -l 4m -t 60  348 MBps                       346 MBps                   236 MBps
-P 14 -w 500k -l 4m -t 60  348 MBps
-P 16 -w 500k -l 4m -t 60  347 MBps                       346 MBps                   235 MBps
-P 16 -w 200k -l 4m -t 60  347 MBps                       347 MBps                   232 MBps
-P 16 -w 800k -l 4m -t 60  347 MBps
-P 16 -w 1m -l 4m -t 60    348 MBps                       346 MBps                   229 MBps

The team that wrote this paper

This paper was produced by a team of specialists working at the International Technical Support Organization (ITSO), Rochester Center.

Brant Knudson is a Staff Software Engineer in the Advanced Systems SW Development group of IBM in Rochester, Minnesota, where he has been a programmer on the Blue Gene Control System team since 2003. Prior to working on Blue Gene, he worked on the IBM LDAP directory server.

Jeff Chauvin is a Senior IT Specialist with the IBM Server System's Operations group in Rochester, Minnesota. He has worked on Blue Gene since 2004, helping with all IT-related aspects of the project.

Jeffrey Lien is an Advisory Software Engineer for IBM Rochester and is a member of the Blue Gene performance team. He specializes in measuring and tuning I/O performance for Blue Gene customers as well as running I/O benchmarks needed for customer bids. Prior to joining the Blue Gene performance team, Jeff developed and tested I/O device drivers used on Ethernet and modem adapter cards. Jeff has been with IBM since 1990.

Mark Megerian is a Senior Software Engineer in the IBM Advanced Systems Software Development group in Rochester, Minnesota. He has worked on Blue Gene since 2002 in the area of control system development.

Andrew Tauferner is an Advisory Software Engineer in the IBM Advanced Systems Software Development group of IBM in Rochester, Minnesota, where he has been a programmer on the I/O node kernel since 2005. Prior to working on Blue Gene, he worked on the IBM integrated xSeries® server products, developing Windows® and Linux I/O infrastructure.

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. © Copyright International Business Machines Corporation 2009. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. 19 This document REDP-4453-00 was created or updated on January 19, 2009. ® Send us your comments in one of the following ways: Use the online Contact us review Redbooks form found at: ibm.com/redbooks Send your comments in an e-mail to: redbooks@us.ibm.com Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400 U.S.A. Redpaper ™ Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. 
These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: alphaWorks® Blue Gene/P™ Blue Gene® GPFS™ IBM® LoadLeveler® Redbooks (logo) xSeries® ® The following terms are trademarks of other companies: Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others. 20 IBM System Blue Gene Solution: Compute Node Linux