Scyld ClusterWare System Administration
Confidential – Internal Use Only

Orientation Agenda – Part 1
Scyld ClusterWare foundations
  » Booting process
    • Startup scripts
    • File systems
    • Name services
  » Cluster configuration
Cluster components
  » Networking infrastructure
  » NFS file servers
  » IPMI configuration
Break

Orientation Agenda – Part 2
Parallel jobs
  » MPI configuration
  » Infiniband interconnect
Queuing
  » Initial setup
  » Tuning
  » Policy case studies
Other software and tools
Troubleshooting
Questions and answers

Cluster Virtualization Architecture Realized
(Diagram: Internet or internal network → master node → interconnection network → compute nodes with optional disks)
Minimal in-memory OS with a single daemon, rapidly deployed with no disk required
  » Boots in less than 20 seconds
Virtual, unified process space enables intuitive single sign-on and job submission
  » Effortless job migration to nodes
Monitor and manage efficiently from the master
  » Single system install
  » Single process space
  » Shared cache of the cluster state
  » Single point of provisioning
  » Better performance due to lightweight nodes
  » No version skew, which is inherently more reliable
Manage and use a cluster like a single SMP machine

Elements of Cluster Systems
(Diagram: Internet or internal network → master node → interconnection network → compute nodes with optional disks)
Some important elements of a cluster system
  » Booting and provisioning
  » Process creation, monitoring and control
  » Update and consistency model
  » Name services
  » File systems
  » Physical management
  » Workload virtualization

Booting and Provisioning
Integrated, automatic network boot
Basic hardware reporting and diagnostics in the PreOS stage
  » Only CPU, memory and NIC needed
Kernel and minimal environment come from the master
  » Just enough to ask "what do I do now?"
Remaining configuration is driven by the master
Logs are stored in:
  » /var/log/messages
  » /var/log/beowulf/node.*

DHCP and TFTP services
Started from /etc/rc.d/init.d/beowulf, which:
  » Locates vmlinuz in /boot
  » Configures syslog and other parameters on the head node
  » Loads kernel modules
  » Sets up libraries
  » Creates the ramdisk image for compute nodes
  » Starts the DHCP/TFTP server (beoserv)
  » Configures NAT for IP forwarding if needed
  » Starts the kickback name service daemon (4.2.0+)
  » Tunes the network stack

Compute Node Boot Process
Starts with /etc/beowulf/node_up
  » Calls /usr/lib/beoboot/bin/node_up
    • Usage: node_up <nodenumber>
    • Sets up:
      – System date
      – Basic network configuration
      – Kernel modules (device drivers)
      – Network routing
      – setup_fs
      – Name services
      – chroot
      – Prestaged files (4.2.0+)
      – Other init scripts in /etc/beowulf/init.d
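As a quick reference, the checks below show one way to confirm that a node came up and to inspect its boot log from the head node. This is a minimal sketch; bpstat is assumed to be available as the bproc node-status tool alongside the bpsh/bpctl commands used later in this deck.

  # Show node states as seen by the master (up, boot, down, error)
  bpstat
  # Follow the boot log for node 0 while it comes up
  tail -f /var/log/beowulf/node.0
  # Once the node reports "up", run a trivial command on it
  bpsh 0 uptime
  # After editing /etc/beowulf/config, apply changes without a full restart
  service beowulf reload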
Subnet configuration
Default used to be a class C network
  » netmask 255.255.255.0
  » Limited to 155 compute nodes (100 + $NODE < 255)
  » Last octet denotes special devices
    • x.x.x.10 switches
    • x.x.x.30 storage
  » Infiniband is a separate network
    • x.x.1.$((100 + $NODE))
  » Needed eth0:1 to reach the IPMI network
    • x.x.2.$((100 + $NODE))
    • /etc/sysconfig/network-scripts/ifcfg-eth0:1
    • ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0

Subnet configuration
New standard is a class B network
  » netmask 255.255.0.0
  » Limited to 100 * 256 compute nodes
    • 10.54.50.x – 10.54.149.x
  » Third octet denotes special devices
    • x.x.10.x switches
    • x.x.30.x storage
  » Infiniband is a separate network
    • x.$((x+1)).x.x
  » IPMI is on the same network (eth0:1 not needed)
    • x.x.150.$NODE

Compute Node Boot Process (recap)
Next step in /usr/lib/beoboot/bin/node_up: setup_fs

Setup_fs
Script is in /usr/lib/beoboot/bin/setup_fs
Configuration file: /etc/beowulf/fstab

  # Select which FSTAB to use.
  if [ -r /etc/beowulf/fstab.$NODE ] ; then
      FSTAB=/etc/beowulf/fstab.$NODE
  else
      FSTAB=/etc/beowulf/fstab
  fi
  echo "setup_fs: Configuring node filesystems using $FSTAB..."

$MASTER is determined and populated
The "nonfatal" option allows compute nodes to finish the boot process and log errors in /var/log/beowulf/node.*
NFS mounts of external servers need to be done via IP address because name services have not been configured yet

beofdisk
beofdisk configures partition tables on compute nodes
  » To configure the first drive:
    • bpsh 0 fdisk /dev/sda
      – Typical interactive usage
  » Query the partition table:
    • beofdisk -q --node 0
  » Write partition tables to other nodes:
    • for i in $(seq 1 10); do beofdisk -w --node $i ; done
  » Create device nodes initially
    • Use the head node's /dev/sd* as reference:

      [root@scyld beowulf]# ls -l /dev/sda*
      brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
      brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
      brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
      brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
      [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1

Create local filesystems
After partitions have been created, run mkfs:
  » bpsh -an mkswap /dev/sda1
  » bpsh -an mkfs.ext2 /dev/sda2
    • ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system
    • If corruption occurs, simply mkfs again
Copy the int18 bootblock if needed:
  » bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda
/etc/beowulf/config options for file system creation:

  # The compute node file system creation and consistency checking policies.
  fsck full
  mkfs never
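To tie setup_fs and the local filesystem steps together, here is an illustrative /etc/beowulf/fstab sketch; the device names, mount points, export paths and the 10.54.30.1 NFS server address are assumptions to adapt, not values from a real cluster.

  # /etc/beowulf/fstab (illustrative sketch -- adjust devices and servers)
  /dev/sda1               swap      swap   defaults            0 0
  /dev/sda2               /scratch  ext2   defaults,nonfatal   0 0
  $MASTER:/home           /home     nfs    defaults,nonfatal   0 0
  # External NFS servers must be mounted by IP address -- name services
  # are not configured yet when setup_fs runs (10.54.30.1 is hypothetical)
  10.54.30.1:/export/data /data     nfs    defaults,nonfatal   0 0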
Compute Node Boot Process (recap)
Next step in /usr/lib/beoboot/bin/node_up: name services

Name services
/usr/lib/beoboot/bin/node_up populates /etc/hosts and /etc/nsswitch.conf on compute nodes
The beo name service determines values from the /etc/beowulf/config file
The bproc name service determines values from the current environment
getent can be used to query entries:
  » getent netgroup cluster
  » getent hosts 10.54.0.1
  » getent hosts n3
If system-config-authentication is run, ensure that the proper entries still exist in /etc/nsswitch.conf on the head node

BeoNSS Hostnames
(Diagram: master node and compute nodes n0–n5 on the interconnection network)
Opportunity: we control IP address assignment
  » Assign node IP addresses in node order
  » Turns name lookup into simple address arithmetic
  » Master: 10.54.0.1, GigE switch: 10.54.10.0, IB switch: 10.54.11.0, NFS/storage: 10.54.30.0, nodes: 10.54.50.$NODE
Name format
  » Cluster hostnames have the base form n<N>
  » Options exist for admin-defined names and networks
Special names for "self" and "master"
  » The current machine is ".-2" or "self"
  » The master is known as ".-1", "master", or "master0"

Changes
Prior to 4.2.0
  » Hostnames default to the .<NODE> form
  » /etc/hosts had to be populated with alternative names and IP addresses
  » May break the @cluster netgroup and hence NFS exports
  » /etc/passwd and /etc/group were needed on compute nodes for Torque
4.2.0+
  » Hostnames default to the n<NODE> form
  » Configuration is driven by /etc/beowulf/config and BeoNSS
  » Usernames and groups can be provided by the kickback daemon for Torque

Compute Node Boot Process (recap)
Next step in /usr/lib/beoboot/bin/node_up: prestaging files (4.2.0+)

ClusterWare Filecache functionality
Provided by the filecache kernel module
Configured by the libraries directives in /etc/beowulf/config
Dynamically controlled by bplib
Capabilities exist in all ClusterWare 4 versions
  » 4.2.0 adds the prestage keyword in /etc/beowulf/config
  » Prior versions needed additional scripts in /etc/beowulf/init.d
For libraries listed in /etc/beowulf/config, files can be prestaged by running md5sum on the file:

  # Prestage selected libraries.  The keyword is generic, but the current
  # implementation only knows how to "prestage" a file that is open'able on
  # the compute node: through the libcache, across NFS, or already exists
  # locally (which isn't really a "prestaging", since it's already there).
  prestage_libs=`beoconfig prestage`
  for libname in $prestage_libs ; do
      # failure isn't always fatal, so don't use run_cmd
      echo "node_up: Prestage file:" $libname
      bpsh $NODE md5sum $libname > /dev/null
  done
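For reference, a hedged sketch of how the corresponding /etc/beowulf/config entries might look; the library paths are illustrative examples, and the exact argument form of the libraries and prestage keywords should be checked against 'man beowulf-config'.

  # /etc/beowulf/config (excerpt, illustrative)
  # Make these directories/files available to compute nodes via the filecache
  libraries /lib64 /usr/lib64
  # 4.2.0+: push these files out at node boot instead of faulting them in on
  # first use (node_up runs "bpsh $NODE md5sum <file>" for each entry)
  prestage /usr/lib64/libmpi.so.0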
Compute Node Boot Process (recap)
Final step in /usr/lib/beoboot/bin/node_up: other init scripts in /etc/beowulf/init.d

Compute node init.d scripts
Located in /etc/beowulf/init.d
Scripts run on the head node and need explicit bpsh and beomodprobe calls to operate on compute nodes
$NODE has been prepopulated by /usr/lib/beoboot/bin/node_up
Order is based on file name
  » Numbered files can be used to control order
beochkconfig is used to set the execute (+x) bit on the files

Cluster Configuration
/etc/beowulf/config is the central location for cluster configuration
Features are documented in 'man beowulf-config'
Compute node order is determined by the 'node' parameters
Changes can be activated with 'service beowulf reload'

Orientation Agenda – Part 1 (resumed)
Cluster components
  » Networking infrastructure
  » NFS file servers
  » IPMI configuration
Break

Elements of Cluster Systems (recap)
Booting and provisioning; process creation, monitoring and control; update and consistency model; name services; file systems; physical management; workload virtualization

Compute Node Boot Process (recap)
setup_fs mounts local and remote file systems; the next slides cover the remote (NFS) side

Remote Filesystems
(Diagram: master node, interconnection network, compute nodes with optional disks)
Remote – share a single disk among all nodes
  » Every node sees the same filesystem
  » Synchronization mechanisms manage changes
  » Locking has either high overhead or causes serial blocking
  » The "traditional" UNIX approach
  » Relatively low performance
  » Doesn't scale well; the server becomes a bottleneck in large systems
  » Simplest solution for small clusters reading and writing small files

NFS Server Configuration
Head node NFS services
  » Configuration in /etc/exports
  » Provides system files (/bin, /usr/bin)
  » Increase the number of NFS daemons:
    • echo "RPCNFSDCOUNT=64" >> /etc/sysconfig/nfs ; service nfs restart
Dedicated NFS server
  » SLES10 was recommended; RHEL5 now includes some xfs support
    • xfs has better performance
    • The OS has better IO performance than RHEL4
  » Network trunking can be used to increase bandwidth (with caveats)
  » Hardware RAID
    • Adaptec RAID card
      – CTRL-A at boot
      – arcconf utility from http://www.adaptec.com/en-US/support/raid/
  » External storage (Xyratex or nStor)
    • SAS-attached
    • Fibre Channel attached
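As a concrete starting point, a hedged sketch of head node /etc/exports entries; the exported paths and options are assumptions to adapt, and the @cluster netgroup is the one provided by BeoNSS as discussed earlier.

  # /etc/exports on the head node (illustrative sketch)
  /home   @cluster(rw,async,no_root_squash)
  /opt    @cluster(ro,async)
  # After editing, re-export and bump the NFS daemon count:
  #   exportfs -ra
  #   echo "RPCNFSDCOUNT=64" >> /etc/sysconfig/nfs && service nfs restart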
Network trunking
Use multiple physical links as a single pipe for data
  » Configuration must be done on both the host and the switch
SLES 10 configuration
  » Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface:

    BOOTPROTO=static
    DEVICE=bond0
    IPADDR=10.54.30.0
    NETMASK=255.255.0.0
    STARTMODE=onboot
    MTU=''
    BONDING_MASTER=yes
    BONDING_SLAVE_0=eth0
    BONDING_SLAVE_1=eth1
    BONDING_MODULE_OPTS='mode=0 miimon=500'

Network trunking
HP switch configuration
  » Create the trunk group via the serial or telnet interface
Netgear (admin:password)
  » Create the trunk group via the HTTP interface
Cisco
  » Create an EtherChannel configuration

External Storage
Xyratex arrays have a configuration interface
  » Text based, via serial port
  » Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404) have embedded StorView
    • http://storage0:9292 – admin:password
  » RAID arrays and logical drives are configured and monitored there
    • LUNs are numbered and presented on each port; the highest LUN is the controller itself
    • Multipath or failover needs to be configured

Need for QLogic Failover
Collapse LUN presentation in the OS to a single instance per LUN
Minimize the potential for user error while maintaining failover and static load balancing

Physical Management
ipmitool
  » The Intelligent Platform Management Interface (IPMI) is implemented by the baseboard management controller (BMC)
  » Serial-over-LAN (SOL) can be implemented
  » Allows access to hardware such as sensor data or power states
  » E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}
bpctl
  » Controls the operational state and ownership of compute nodes
  » Examples might be to reboot or power off a node
    • Reboot: bpctl -S all -R
    • Power off: bpctl -S all -P
  » Limits user and group access to run on a particular node or set of nodes
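To make the division of labor concrete, a short hedged sketch of power-cycling one node from the head node; the n$NODE-ipmi naming and admin/admin credentials follow the convention on the previous slide and are site-specific assumptions.

  NODE=3
  # Mark the node unavailable so no new work lands on it (bproc side)
  bpctl -S $NODE -s unavailable
  # Check power state and sensors out-of-band through the BMC
  ipmitool -H n$NODE-ipmi -U admin -P admin power status
  ipmitool -H n$NODE-ipmi -U admin -P admin sdr list
  # Power-cycle the node and watch it rejoin the cluster
  ipmitool -H n$NODE-ipmi -U admin -P admin power cycle
  tail -f /var/log/beowulf/node.$NODE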
IPMI Configuration
The full spec is available here:
  » http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf
Penguin-specific configuration
  » Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0)
    • Altus 1300, 600, 650 – in-band, LAN channel 6
    • Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 – out-of-band, LAN channel 2
    • Relion 1670 – in-band, LAN channel 1
    • Altus x700/x800, Relion x700 – out-of-band OR in-band, LAN channel 1
Some ipmitool versions have a bug and need the following command to commit a write
  » bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0

Orientation Agenda – Part 2
Parallel jobs
  » MPI configuration
  » Infiniband interconnect
Queuing
  » Initial setup
  » Tuning
  » Policy case studies
Other software and tools
Questions and answers

Explicitly Parallel Programs
Different paradigms exist for parallelizing programs
  » Shared memory
  » OpenMP
  » Sockets
  » PVM
  » Linda
  » MPI
Most distributed parallel programs are now written using MPI
  » Different options for MPI stacks: MPICH, OpenMPI, HP-MPI, and Intel MPI
  » ClusterWare comes integrated with customized versions of MPICH and OpenMPI

Compiling MPICH programs
mpicc, mpiCC, mpif77 and mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH
  » GNU, PGI, and Intel compilers are supported
They effectively set the libraries and includes for compiling and linking:

  prefix="/usr"
  part1="-I${prefix}/include"
  part2=""
  part3="-lmpi -lbproc"
  ...
  part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
  ...
  $cc $part1 $part2 $part3

Running MPICH programs
mpirun is used to launch MPICH programs
If Infiniband is installed, the interconnect fabric can be chosen using the machine flag:
  » -machine p4
  » -machine vapi
  » Done by changing LD_LIBRARY_PATH at runtime
    • export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH}
  » Hooks for using mpiexec with the queue system:

    elif [ -n "${PBS_JOBID}" ]; then
        for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP ; do
            unset $var
        done
        for hostname in `cat $PBS_NODEFILE` ; do
            NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
            BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
        done
        # Clean a leading : from the map
        export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
        # The -n 1 argument is important here
        exec mpiexec -n 1 ${progname} "$@"

Environment Variable Options
Additional environment variable control:
  » NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.
  » ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
  » ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running one job per node instead of one per CPU.
  » ALL_LOCAL — Run every process on the master node; used for debugging purposes.
  » NO_LOCAL — Don't run any processes on the master node.
  » EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.
  » BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed runs the first process (MPI rank 0), and so on.
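Putting the variables above together, a hedged example of launching an MPICH job; the program name a.out and the node numbers are placeholders.

  # Eight processes, none of them on the master node
  NP=8 NO_LOCAL=1 ./a.out
  # Pin four processes to nodes 0 and 1 explicitly (rank 0 lands on node 0)
  BEOWULF_JOB_MAP=0:0:1:1 NP=4 ./a.out
  # One process per node, however many nodes are currently available
  ALL_NODES=1 ./a.out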
Running MPICH programs
Prior to ClusterWare 4.1.4, MPICH jobs were spawned outside of the queue system
  » BEOWULF_JOB_MAP had to be set based on the machines listed in $PBS_NODEFILE:

    number_of_nodes=`cat $PBS_NODEFILE | wc -l`
    hostlist=`cat $PBS_NODEFILE | head -n 1`
    for i in $(seq 2 $number_of_nodes) ; do
        hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
    done
    BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
    export BEOWULF_JOB_MAP

Starting with ClusterWare 4.1.4, mpiexec was included with the distribution. mpiexec is an alternative spawning mechanism that starts processes as part of the queue system
Other MPI implementations have alternatives. HP-MPI and Intel MPI use rsh and run outside of the queue system. OpenMPI uses libtm to properly start processes

MPI Primer
Only a brief introduction to MPI is provided here. Many in-depth tutorials are available on the web and in published sources.
  » http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
  » http://www.llnl.gov/computing/tutorials/mpi/
Paradigms for writing parallel programs depend upon the application
  » SIMD (single-instruction multiple-data)
  » MIMD (multiple-instruction multiple-data)
  » MISD (multiple-instruction single-data)
SIMD is presented here as it is a commonly used template
  » A single application source is compiled to perform operations on different sets of data
  » The data is read by the different processes or passed between them via messages (hence MPI = Message Passing Interface)
    • Contrast this with shared memory or OpenMP, where data is shared locally via memory
    • Optimizations in the MPI implementation can take shortcuts on the local host; however, the program is still written using a message-passing construct
The MPI specification has many functions; however, most MPI programs can be written with only a small subset

Infiniband Primer
Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing I/O overhead for tightly coupled parallel applications
Infiniband requires hardware, kernel drivers, O/S support, user-space drivers, and application support
Prior to 4.2.0, the software stack was provided by SilverStorm
Starting with 4.2.0, ClusterWare migrated to the OpenFabrics (OFED, OpenIB) stack

Infiniband Subnet Manager
Every Infiniband network requires a subnet manager to discover and manage the topology
  » Our clusters typically ship with a managed QLogic Infiniband switch with an embedded subnet manager (10.54.0.20; admin:adminpass)
  » The subnet manager is configured to start at switch boot
  » Alternatively, a software subnet manager (e.g. OpenSM) can be run on a host connected to the Infiniband fabric
  » Typically the embedded subnet manager is more robust and provides a better experience
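Once the subnet manager is up, a few hedged OFED-level checks (assuming the standard OpenFabrics diagnostics ship with the 4.2.0+ stack) confirm that a node's HCA is active and the fabric is visible:

  # Local HCA state and link rate (the port should show Active / LinkUp)
  bpsh 0 ibstat
  bpsh 0 ibv_devinfo
  # Ask the fabric which subnet manager is currently master
  bpsh 0 sminfo
  # Quick bandwidth sanity check between the head node and node 0
  # (ib_write_bw is part of the OFED perftest package, if installed)
  ib_write_bw &          # server side, on the head node
  bpsh 0 ib_write_bw master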
Communication Layers
The Verbs API (VAPI) provides a hardware-specific interface to the transport media
  » Any program compiled against VAPI can only run on the same hardware profile and drivers
  » Makes portability difficult
The Direct Access Programming Library (DAPL) provides a more consistent interface
  » DAPL layers can communicate with IB, Myrinet, and 10GigE hardware
  » Better portability for MPI libraries
TCP/IP interface
  » Another upper-layer protocol provides IP-over-IB (IPoIB), where the IB interface is assigned an IP address and most standard TCP/IP applications work

MPI Implementation Comparison
MPICH is provided by Argonne National Laboratory
  » Runs only over Ethernet
Ohio State University has ported MPICH to use the Verbs API => MVAPICH
  » Similar to MPICH but uses Infiniband
LAM-MPI was another implementation which provided a more modular format
OpenMPI is the successor to LAM-MPI and has many options
  » Can use different physical interfaces and spawning mechanisms
  » http://www.openmpi.org
HP-MPI, Intel MPI
  » Licensed MPICH2 code and added functionality
  » Can use a variety of physical interconnects

OpenMPI Configuration

  ./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr \
      --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid \
      --without-slurm --without-gridengine --without-portals --without-gm \
      --without-loadleveler --without-xgrid --without-mx \
      --enable-mpirun-prefix-by-default --enable-static
  make all
  make install

Create scripts in /etc/profile.d to set the default environment variables for all users

  mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1
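A hedged sketch of such a profile script; the /opt/openmpi prefix matches the configure line above, and the file name is arbitrary.

  # /etc/profile.d/openmpi.sh (illustrative)
  export PATH=/opt/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
  # Optional: a matching openmpi.csh would be needed for csh/tcsh users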
Queuing
How are resources allocated among multiple users and/or groups?
  » Statically, by using bpctl user and group permissions
  » ClusterWare supports a variety of queuing packages
    • TaskMaster (the advanced Moab policy-based scheduler integrated with ClusterWare)
    • Torque
    • SGE

Interacting with TaskMaster
Because TaskMaster uses the Moab scheduler with the Torque pbs_server and pbs_mom components, all of the Torque commands are still valid
  » qsub submits a job to Torque; Moab then polls pbs_server to detect new jobs
  » msub submits a job to Moab, which then pushes the job to pbs_server
Other TaskMaster commands
  » qstat -> showq
  » qdel, qhold, qrls -> mjobctl
  » pbsnodes -> showstate
  » qmgr -> mschedctl, mdiag
  » Configuration in /opt/moab/moab.cfg

Torque Initial Setup
'/usr/bin/torque.setup root' can be used to start with a clean slate
  » This will delete any current configuration that you have

  qmgr -c 'set server keep_completed=300'
  qmgr -c 'set server query_other_jobs=true'
  qmgr -c 'set server operators += root@localhost.localdomain'
  qmgr -c 'set server managers += root@localhost.localdomain'

/var/spool/torque/server_priv/nodes stores node information
  » n0 np=8 prop1 prop2
  » To change it: qterm -t quick; edit /var/spool/torque/server_priv/nodes; service pbs_server start
/var/spool/torque/sched_priv/sched_config configures the default FIFO scheduler
/var/spool/torque/mom_priv/config configures the pbs_moms
  » Copied out during /etc/beowulf/init.d/torque

TaskMaster Initial Setup
Edit the configuration in /opt/moab/moab.cfg
  » SCHEDCFG[Scyld] MODE=NORMAL SERVER=scyld.localdomain:42559
    • Ensure the hostname is consistent with 'hostname'
  » ADMINCFG[1] USERS=root
    • Add additional users who can be queue managers
  » RMCFG[base] TYPE=PBS
    • TYPE=PBS integrates with a traditional Torque configuration

Tuning
A default walltime can be set in Torque using:
  » qmgr -c 'set queue batch resources_default.walltime=16:00:00'
If many small jobs need to be submitted, uncomment the following in /opt/moab/moab.cfg
  » JOBAGGREGATIONTIME 10
To exactly match node and processor requests, add the following to /opt/moab/moab.cfg
  » JOBNODEMATCHPOLICY EXACTNODE
Changes in /opt/moab/moab.cfg can be activated with 'service moab restart'

Case Studies
Case study #1
  » Multiple queues for interactive, high-priority, and standard jobs
Case study #2
  » Different types of hardware configuration
  » Setup with FairShare
    • http://www.clusterresources.com/products/mwm/moabdocs/5.1.1priorityoverview.shtml
    • http://www.clusterresources.com/products/mwm/moabdocs/5.1.2priorityfactors.shtml

Troubleshooting
Log files and tools
  » /var/log/messages
  » /var/log/beowulf/node*
  » /var/spool/torque/server_logs
  » /var/spool/torque/mom_logs
  » qstat -f
  » tracejob
  » /opt/moab/log
  » mdiag
  » strace -p
  » gdb

Hardware Maintenance
pbsnodes -o n0: mark the node offline and allow jobs to drain
bpctl -S 0 -s unavailable: prevent user interactive commands from running on the node
Wait until the node is idle
bpctl -S 0 -P: power off the node
Perform maintenance
Power on the node
pbsnodes -c n0

Questions??
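As a closing reference, the steps on the Hardware Maintenance slide can be collected into a small script. This is a hedged sketch, not a tested procedure: the script name, the interactive pause, and the IPMI power-on line (using the n$N-ipmi convention from earlier) are assumptions.

  #!/bin/bash
  # drain_node.sh <N> -- sketch of the maintenance sequence above
  N=$1
  pbsnodes -o n$N                 # mark offline in Torque so jobs drain
  bpctl -S $N -s unavailable      # block new interactive work via bproc
  # Wait until the node is idle (check with: pbsnodes n$N and bpstat)
  read -p "Node n$N idle? Press Enter to power it off..." _
  bpctl -S $N -P                  # power off for maintenance
  # ...perform maintenance, then power the node back on, e.g. via IPMI:
  # ipmitool -H n$N-ipmi -U admin -P admin power on
  pbsnodes -c n$N                 # clear the offline flag once it is back up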