Scyld ClusterWare System Administration
Orientation Agenda – Part 1
 Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
 Cluster Components
» Networking infrastructure
» NFS File servers
» IPMI Configuration
 Break
Orientation Agenda – Part 2
 Parallel jobs
» MPI configuration
» Infiniband interconnect
 Queuing
» Initial setup
» Tuning
» Policy case studies
 Other software and tools
 Troubleshooting
 Questions and Answers
Orientation Agenda – Part 1
 Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
 Cluster Components
» Networking infrastructure
» NFS File servers
 Break
Cluster Virtualization Architecture Realized
[Diagram: Master Node on the Internet or Internal Network, compute nodes with Optional Disks on the Interconnection Network]
 Minimal in-memory OS with single daemon rapidly deployed in seconds; no disk required
» Less than 20 seconds
 Virtual, unified process space enables intuitive single sign-on, job submission
» Effortless job migration to nodes
 Monitor & manage efficiently from the Master
» Single System Install
» Single Process Space
» Shared cache of the cluster state
» Single point of provisioning
» Better performance due to lightweight nodes
» No version skew, which is inherently more reliable
Manage & use a cluster like a single SMP machine
Elements of Cluster Systems
[Diagram: Master Node on the Internet or Internal Network, compute nodes with Optional Disks on the Interconnection Network]
 Some important elements of a cluster system
» Booting and Provisioning
» Process creation, monitoring and control
» Update and consistency model
» Name services
» File Systems
» Physical management
» Workload virtualization
Booting and Provisioning
 Integrated, automatic network boot
 Basic hardware reporting and diagnostics in the PreOS stage
 Only CPU, memory and NIC needed
 Kernel and minimal environment from master
 Just enough to say “what do I do now?”
 Remaining configuration driven by master
 Logs are stored in:
» /var/log/messages
» /var/log/beowulf/node.*
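For example, to watch a node's boot from the master (node 0 is an arbitrary choice; beoserv activity appears in syslog on typical installs):
  tail -f /var/log/beowulf/node.0      # per-node boot log
  grep beoserv /var/log/messages       # master-side boot/DHCP activity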
DHCP and TFTP services
 Started from /etc/rc.d/init.d/beowulf
» Locates vmlinuz in /boot
» Configures syslog and other parameters on the head node
» Loads kernel modules
» Sets up libraries
» Creates the ramdisk image for compute nodes
» Starts the DHCP/TFTP server (beoserv)
» Configures NAT for IP forwarding if needed
» Starts the kickback name service daemon (4.2.0+)
» Tunes the network stack
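These steps run whenever the beowulf service is (re)started; a minimal sketch of driving it from the head node (reload re-reads /etc/beowulf/config without rebooting nodes, as noted on the Cluster Configuration slide):
  service beowulf start
  service beowulf reload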
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Subnet configuration
 The default used to be a class C network
» netmask 255.255.255.0
» Limited to 155 compute nodes ( 100 + $NODE < 255 )
» Last octet denotes special devices
• x.x.x.10 switches
• x.x.x.30 storage
» Infiniband is a separate network
• x.x.1.$(( 100 + $NODE ))
» Needed eth0:1 to reach IPMI network
• x.x.2.$(( 100 + $NODE ))
• /etc/sysconfig/network-scripts/ifcfg-eth0:1
• ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0
Subnet configuration
 The new standard is a class B network
» netmask 255.255.0.0
» Limited to 100 * 256 compute nodes
• 10.54.50.x – 10.54.149.x
» Third octet denotes special devices
• x.x.10.x switches
• x.x.30.x storage
» Infiniband is a separate network
• x.$(( x+1)).x.x
» IPMI is on the same network (eth0:1 not needed)
• x.x.150.$NODE
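As an illustrative check, a node's addresses follow directly from its number under this scheme (one plausible mapping consistent with the ranges above; confirm against /etc/beowulf/config):
  NODE=3
  echo "GigE: 10.54.$(( 50 + NODE / 256 )).$(( NODE % 256 ))"
  echo "IPMI: 10.54.150.$NODE"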
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Setup_fs
 Script is in /usr/lib/beoboot/bin/setup_fs
 Configuration file: /etc/beowulf/fstab
» # Select which FSTAB to use.
  if [ -r /etc/beowulf/fstab.$NODE ] ; then
      FSTAB=/etc/beowulf/fstab.$NODE
  else
      FSTAB=/etc/beowulf/fstab
  fi
  echo "setup_fs: Configuring node filesystems using $FSTAB..."
 $MASTER is determined and populated
 The "nonfatal" option allows compute nodes to finish the boot process and log errors in /var/log/beowulf/node.*
 NFS mounts of external servers need to be done via IP address because name services have not been configured yet (see the sketch below)
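A hedged example of what /etc/beowulf/fstab entries might look like (device names, server IP, and mount points are illustrative only):
  /dev/sda1          swap       swap   defaults            0 0
  /dev/sda2          /scratch   ext2   defaults,nonfatal   0 0
  # external NFS server referenced by IP, since names resolve only later
  10.54.30.1:/home   /home      nfs    defaults,nonfatal   0 0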
beofdisk
 Beofdisk configures partition tables on compute nodes
» To configure first drive:
• bpsh 0 fdisk /dev/sda
– Typical interactive usage
» Query partition table:
• beofdisk -q --node 0
» Write partition tables to other nodes:
• for i in $(seq 1 10); do beofdisk -w --node $i ; done
» Create devices initially
• Use the head node's /dev/sd* as reference:
– [root@scyld beowulf]# ls -l /dev/sda*
  brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
  brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
  brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
  brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
  [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1
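A sketch of replicating the remaining device nodes across several nodes, using the major/minor numbers from the listing above (node range is illustrative; verify the numbers first):
  for n in $(seq 0 10); do
      bpsh $n mknod /dev/sda2 b 8 2
      bpsh $n mknod /dev/sda3 b 8 3
  done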
Create local filesystems
 After partitions have been created, mkfs
» bpsh -an mkswap /dev/sda1
» bpsh -an mkfs.ext2 /dev/sda2
• ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system
• If corruption occurs, simply mkfs again
 Copy int18 bootblock if needed:
» bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda
 /etc/beowulf/config options for file system creation
» # The compute node file system creation and consistency checking policies.
fsck full
mkfs never
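A quick illustrative check from the master once the filesystems exist (or let setup_fs mount them via /etc/beowulf/fstab):
  bpsh -an swapon /dev/sda1
  bpsh 0 mount /dev/sda2 /mnt && bpsh 0 df -h /mnt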
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Name services
 /usr/lib/beoboot/bin/node_up populates /etc/hosts and
/etc/nsswitch.conf on compute nodes
 beo name service determines values from
/etc/beowulf/config file
 bproc name service determines values from current
environment
 ‘getent’ can be used to query entries
» getent netgroup cluster
» getent hosts 10.54.0.1
» getent hosts n3
 If system-config-authentication is run, ensure
that proper entries still exist in /etc/nsswitch.conf (head
node)
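A hypothetical illustration of the nsswitch.conf entries involved (the beo and bproc sources referenced above; node_up writes the authoritative version, so treat this only as orientation):
  hosts:    files bproc beo dns
  netgroup: beo files
  passwd:   files bproc
  group:    files bproc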
BeoNSS Hostnames
[Diagram: master node (".-1"/"master") and compute nodes n0–n5 with optional disks on the interconnection network]
 Opportunity: We control IP address assignment
» Assign node IP addresses in node order
» Changes name lookup to addition
» Master: 10.54.0.1
  GigE Switch: 10.54.10.0
  IB Switch: 10.54.11.0
  NFS/Storage: 10.54.30.0
  Nodes: 10.54.50.$node
 Name format
» Cluster hostnames have the base form n<N>
» Options for admin-defined names and networks
 Special names for "self" and "master"
» Current machine is ".-2" or "self"
» Master is known as ".-1", "master", "master0"
Changes
 Prior to 4.2.0
» Hostnames default to .<NODE> form
» /etc/hosts had to be populated with alternative names and IP
addresses
» May break @cluster netgroup and hence NFS exports
» /etc/passwd and /etc/group needed on compute nodes for
Torque
 4.2.0+
» Hostnames default to n<NODE> form
» Configuration is driven by /etc/beowulf/config and beoNSS
» Username and groups can be provided by kickback daemon
for Torque
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
ClusterWare Filecache functionality
 Provided by filecache kernel module
 Configured by /etc/beowulf/config libraries directives
 Dynamically controlled by ‘bplib’
 Capabilities exist in all ClusterWare 4 versions
» 4.2.0 adds the prestage keyword in /etc/beowulf/config
» Prior versions needed additional scripts in /etc/beowulf/init.d
 For libraries listed in /etc/beowulf/config, files can be prestaged by running 'md5sum' on the file
» # Prestage selected libraries. The keyword is generic, but the current
# implementation only knows how to "prestage" a file that is open'able on
# the compute node: through the libcache, across NFS, or already exists
# locally (which isn't really a "prestaging", since it's already there).
prestage_libs=`beoconfig prestage`
for libname in $prestage_libs ; do
    # failure isn't always fatal, so don't use run_cmd
    echo "node_up: Prestage file:" $libname
    bpsh $NODE md5sum $libname > /dev/null
done
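A hedged /etc/beowulf/config fragment showing the two directives named above (paths are examples only):
  libraries /usr/lib64/MPICH /opt/intel/fce/lib
  prestage /usr/lib64/libmpich.so /home/apps/libexample.so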
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Compute nodes init.d scripts
 Located in /etc/beowulf/init.d
 Scripts start on the head node and need explicit bpsh
and beomodprobe to operate on compute nodes
 $NODE has been prepopulated by
/usr/lib/beoboot/bin/node_up
 Order is based on file name
» Numbered files can be used to control order
 beochkconfig is used to set the +x bit on the files (see the sketch below)
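A minimal sketch of such a script (path, number prefix, and the sysctl setting are illustrative):
  #!/bin/sh
  # /etc/beowulf/init.d/20_sysctl -- runs on the master for each booting node; $NODE is preset
  bpsh $NODE sysctl -w vm.overcommit_memory=1
Enable it with chmod +x, or with beochkconfig, e.g. beochkconfig 20_sysctl on (assumed syntax; it simply toggles the execute bit).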
Cluster Configuration
 /etc/beowulf/config is the central location for cluster
configuration
 Features are documented in ‘man beowulf-config’
 Compute node order is determined by ‘node’
parameters
 Changes can be activated by doing a ‘service beowulf
reload’
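An illustrative fragment and reload (MAC addresses are placeholders; node numbering follows the order of the node lines):
  # /etc/beowulf/config
  node 00:30:48:aa:bb:01
  node 00:30:48:aa:bb:02
  service beowulf reload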
Orientation Agenda – Part 1
 Scyld ClusterWare foundations
» Booting process
• Startup scripts
• File systems
• Name services
» Cluster Configuration
 Cluster Components
» Networking infrastructure
» NFS File servers
» IPMI configuration
 Break
Elements of Cluster Systems
[Diagram: Master Node on the Internet or Internal Network, compute nodes with Optional Disks on the Interconnection Network]
 Some important elements of a cluster system
» Booting and Provisioning
» Process creation, monitoring and control
» Update and consistency model
» Name services
» File Systems
» Physical management
» Workload virtualization
Compute Node Boot Process
 Starts with /etc/beowulf/node_up
» Calls /usr/lib/beoboot/bin/node_up
• Usage: node_up <nodenumber>
• Sets up:
– System date
– Basic network configuration
– Kernel modules (device drivers)
– Network routing
– setup_fs
– Name services
– chroot
– Prestages files (4.2.0+)
– Other init scripts in /etc/beowulf/init.d
Remote Filesystems
[Diagram: master node serving a shared disk to compute nodes over the interconnection network]
 Remote - Share a single disk among all nodes
» Every node sees same filesystem
» Synchronization mechanisms manage changes
» Locking has either high overhead or causes serial blocking
» "Traditional" UNIX approach
» Relatively low performance
» Doesn't scale well; server becomes bottleneck in large systems
» Simplest solution for small clusters, reading/writing small files
NFS Server Configuration
 Head node NFS services
» Configuration in /etc/exports
» Provides system files (/bin, /usr/bin)
» Increase number of NFS daemons
• echo "RPCNFSDCOUNT=64" > /etc/sysconfig/nfs ; service nfs restart
 Dedicated NFS server
» SLES10 was recommended; RHEL5 now includes some xfs support
• xfs has better performance
• OS has better IO performance than RHEL4
» Network trunking can be used to increase bandwidth (with caveats)
» Hardware RAID
• Adaptec RAID card
– CTRL-A at boot
– arcconf utility from http://www.adaptec.com/en-US/support/raid/
» External storage (Xyratex or nStor)
• SAS-attached
• Fibre channel attached
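A hedged /etc/exports sketch for the head node (paths and options are illustrative; @cluster is the BeoNSS netgroup mentioned earlier), followed by exportfs -ra to apply it:
  /home     @cluster(rw,async,no_root_squash)
  /bin      @cluster(ro,async)
  /usr/bin  @cluster(ro,async)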
Network trunking
 Use multiple physical links as a single pipe for data
» Configuration must be done on host and switch
 SLES 10 configuration
» Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface
» BOOTPROTO=static
  DEVICE=bond0
  IPADDR=10.54.30.0
  NETMASK=255.255.0.0
  STARTMODE=onboot
  MTU=''
  BONDING_MASTER=yes
  BONDING_SLAVE_0=eth0
  BONDING_SLAVE_1=eth1
  BONDING_MODULE_OPTS='mode=0 miimon=500'
Network trunking
 HP switch configuration
» Create trunk group via serial or telnet interface
 Netgear (admin:password)
» Create trunk group via http interface
 Cisco
» Create etherchannel configuration
External Storage
 Xyratex arrays have a configuration interface
» Text based via serial port
» Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404)
have embedded StorView
• http://storage0:9292
– admin:password
» RAID arrays, logical drives are configured and monitored
• LUNs are numbered and presented on each port. Highest LUN is
the controller itself
• Multipath or failover needs to be configured
Need for QLogic Failover
 Collapse LUN presentation in OS to a single instance
per LUN
 Minimize potential for user error while maintaining failover and static load balancing
Physical Management
 ipmitool
» Intelligent Platform Management Interface (IPMI) is integrated into the baseboard management controller (BMC)
» Serial-over-LAN (SOL) can be implemented
» Allows access to hardware such as sensor data or
power states
» E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}
 bpctl
» Controls the operational state and ownership of
compute nodes
» Examples might be to reboot or power off a node
• Reboot: bpctl -S all -R
• Power off: bpctl -S all -P
» Limit user and group access to run on a particular node
or set of nodes
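A hedged sketch of per-node access control with bpctl (the -u/-g flags are assumptions here; confirm against the bpctl man page):
  bpctl -S 4 -u alice        # assumption: allow only user alice on node 4
  bpctl -S 4 -g hpcusers     # assumption: allow only group hpcusers
  bpctl -S 4 -s unavailable  # take the node out of general use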
IPMI Configuration
 Full spec is available here:
» http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0
_E3_markup.pdf
 Penguin Specific configuration
» Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0)
• Altus 1300, 600, 650 – In-band, lan channel 6
• Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 –
Out-of-band, lan channel 2
• Relion 1670 – In-band, lan channel 1
• Altus x700/x800, Relion x700 – Out-of-band OR in-band, lan channel 1
 Some ipmitool versions have a bug and need the following command to commit a write
» bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0
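A hedged example of setting a BMC's LAN parameters with standard ipmitool subcommands (channel 2 and the addresses are illustrative; choose the channel from the table above):
  bpsh $NODE ipmitool lan set 2 ipsrc static
  bpsh $NODE ipmitool lan set 2 ipaddr 10.54.150.$NODE
  bpsh $NODE ipmitool lan set 2 netmask 255.255.0.0
  bpsh $NODE ipmitool lan print 2
  bpsh $NODE ipmitool raw 12 1 2 0 0    # commit workaround from above, if needed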
Orientation Agenda – Part 2
 Parallel jobs
» MPI configuration
» Infiniband interconnect
 Queueing
» Initial setup
» Tuning
» Policy case studies
 Other software and tools
 Questions and Answers
Explicitly Parallel Programs
 Different paradigms exist for parallelizing programs
» Shared memory
» OpenMP
» Sockets
» PVM
» Linda
» MPI
 Most distributed parallel programs are now written
using MPI
» Different options for MPI stacks: MPICH, OpenMPI, HP, and
Intel
» ClusterWare comes integrated with customized versions of
MPICH and OpenMPI
Compiling MPICH programs
 mpicc, mpiCC, mpif77, mpif90 are used to
automatically compile code and link in the correct MPI
libraries from /usr/lib64/MPICH
» GNU, PGI, and Intel compilers are supported
 Effectively set libraries and includes for compile and
linking
» prefix="/usr"
  part1="-I${prefix}/include"
  part2=""
  part3="-lmpi -lbproc"
  …
  part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
  …
  $cc $part1 $part2 $part3
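Typical usage, for illustration (source file names are placeholders):
  mpicc  -o hello  hello.c
  mpif90 -o solver solver.f90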
Running MPICH programs
 mpirun is used to launch MPICH programs
 If Infiniband is installed, the interconnect fabric can be chosen using the
machine flag:
» -machine p4
» -machine vapi
» Done by changing LD_LIBRARY_PATH at runtime
• export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH}
» Hooks for using mpiexec for Queue system
• elif [ -n "${PBS_JOBID}" ]; then
      for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP
      do
          unset $var
      done
      for hostname in `cat $PBS_NODEFILE`
      do
          NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
          BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
      done
      # Clean a leading : from the map
      export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
      # The -n 1 argument is important here
      exec mpiexec -n 1 ${progname} "$@"
Environment Variable Options
 Additional environment variable control:
» NP — The number of processes requested, but not the number of
processors. As in the example earlier in this section, NP=4 ./a.out will run
the MPI program a.out with 4 processes.
» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.
» ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.
» ALL_LOCAL — Run every process on the master node; used for debugging
purposes.
» NO_LOCAL — Don’t run any processes on the master node.
» EXCLUDE — A colon-delimited list of nodes to be avoided during node
assignment.
» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node
listed will be the first process (MPI Rank 0) and so on.
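Putting these together, an illustrative run (a.out is a placeholder MPI binary):
  NP=16 NO_LOCAL=1 EXCLUDE=3:7 ./a.out
  BEOWULF_JOB_MAP=0:0:1:1 NP=4 ./a.out   # ranks 0-3 placed on nodes 0,0,1,1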
Running MPICH programs
 Prior to ClusterWare 4.1.4, MPICH jobs were spawned outside of the queue system
» BEOWULF_JOB_MAP had to be set based on machines listed in $PBS_NODEFILE
• number_of_nodes=`cat $PBS_NODEFILE | wc -l`
  hostlist=`cat $PBS_NODEFILE | head -n 1`
  for i in $(seq 2 $number_of_nodes) ; do
      hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
  done
  BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
  export BEOWULF_JOB_MAP
 Starting with ClusterWare 4.1.4, mpiexec was included with the
distribution. mpiexec is an alternative spawning mechanism that
starts processes as part of the queue system
 Other MPI implementations have alternatives. HP-MPI and Intel
MPI use rsh and run outside of the queue system. OpenMPI uses
libtm to properly start processes
MPI Primer
 Only a brief introduction is provided here for MPI. Many other in-depth tutorials
are available on the web and in published sources.
» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
» http://www.llnl.gov/computing/tutorials/mpi/
 Paradigms for writing parallel programs depend upon the application
» SIMD (single-instruction multiple-data)
» MIMD (multiple-instruction multiple-data)
» MISD (multiple-instruction single-data)
 SIMD will be presented here as it is a commonly used template
» A single application source is compiled to perform operations on different sets of data
» The data is read by the different threads or passed between threads via messages (hence
MPI = message passing interface)
• Contrast this with shared memory or OpenMP where data is shared locally via memory
• Optimizations in the MPI implementation can perform localhost optimization; however, the
program is still written using a message passing construct
 MPI specification has many functions; however most MPI programs can be written
with only a small subset
Infiniband Primer
 Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing IO overhead for tightly coupled parallel applications
 Infiniband requires hardware,
kernel drivers, O/S support,
user land drivers, and
application support
 Prior to 4.2.0, software stack
was provided by SilverStorm
 Starting with 4.2.0, ClusterWare
migrated to using the
OpenFabrics (ofed, openIB)
stack
Infiniband Subnet Manager
 Every Infiniband network requires a Subnet Manager to
discover and manage the topology
» Our clusters typically ship with a Managed QLogic Infiniband
switch with an embedded subnet manager (10.54.0.20;
admin:adminpass)
» Subnet Manager is configured to start at switch boot
» Alternatively, a software Subnet Manager (e.g. openSM) can
be run on a host connected to the Infiniband fabric.
» Typically the embedded subnet manager is more robust and
provides a better experience
Communication Layers
 Verbs API (VAPI) provides a hardware-specific interface to the transport media
» Any program compiled with VAPI can only run on the same
hardware profile and drivers
» Makes portability difficult
 Direct Access Programming Language (DAPL) provides
a more consistent interface
» DAPL layers can communicate with IB, Myrinet, and 10GigE
hardware
» Better portability for MPI libraries
 TCP/IP interface
» Another upper layer protocol provides IP-over-IB (IPoIB)
where the IB interface is assigned an IP address and most
standard TCP/IP applications work
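For the IPoIB case, a hedged RHEL-style interface file for a host with a standard network init (e.g. the master or a storage server; address and mask are examples):
  # /etc/sysconfig/network-scripts/ifcfg-ib0
  DEVICE=ib0
  BOOTPROTO=static
  IPADDR=10.55.0.1
  NETMASK=255.255.0.0
  ONBOOT=yes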
MPI Implementation Comparison
 MPICH is provided by Argonne National Labs
» Runs only over Ethernet
 Ohio State University has ported MPICH to use the Verbs API =>
MVAPICH
» Similar to MPICH but uses Infiniband
 LAM-MPI was another implementation which provided a more
modular format
 OpenMPI is the successor to LAM-MPI and has many options
» Can use different physical interfaces and spawning mechanisms
» http://www.openmpi.org
 HP-MPI, Intel-MPI
» Licensed MPICH2 code and added functionality
» Can use a variety of physical interconnects
OpenMPI Configuration
 ./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid --without-slurm --without-gridengine --without-portals --without-gm --without-loadleveler --without-xgrid --without-mx --enable-mpirun-prefix-by-default --enable-static
 make all
 make install
 Create scripts in /etc/profile.d to set default environment variables for all users
 mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1
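A hedged /etc/profile.d sketch to expose this build to all users (file name and paths are illustrative):
  # /etc/profile.d/openmpi.sh
  export PATH=/opt/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH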
Queuing
 How are resources allocated among multiple users
and/or groups?
» Statically by using bpctl user and group permissions
» ClusterWare supports a variety of queuing packages
• TaskMaster (advanced MOAB policy-based scheduler integrated with ClusterWare)
• Torque
• SGE
Interacting with TaskMaster
 Because TaskMaster uses the MOAB scheduler with Torque
pbs_server and pbs_mom components, all of the Torque
commands are still valid
» qsub will submit a job to Torque, MOAB then polls pbs_server to
detect new jobs
» msub will submit a job to Moab which then pushes the job to
pbs_server
 Other TaskMaster commands
» qstat -> showq
» qdel, qhold, qrls -> mjobctl
» pbsnodes -> showstate
» qmgr -> mschedctl, mdiag
» Configuration in /opt/moab/moab.cfg
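Illustrative day-to-day equivalents (job script name and job id are placeholders):
  msub job.sh          # submit through Moab
  showq                # queue overview
  showstate            # node overview
  mjobctl -c <jobid>   # cancel a job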
Torque Initial Setup
 ‘/usr/bin/torque.setup root’ can be used to start with a clean slate
» This will delete any current configuration that you have
» qmgr -c 'set server keep_completed=300'
  qmgr -c 'set server query_other_jobs=true'
  qmgr -c 'set server operators += root@localhost.localdomain'
  qmgr -c 'set server managers += root@localhost.localdomain'
 /var/spool/torque/server_priv/nodes stores node information
» n0 np=8 prop1 prop2
» qterm -t quick
  edit /var/spool/torque/server_priv/nodes
  service pbs_server start
 /var/spool/torque/sched_priv/sched_config configures the default FIFO scheduler
 /var/spool/torque/mom_priv/config configures the pbs_mom daemons
» Copied out during /etc/beowulf/init.d/torque
TaskMaster Initial Setup
 Edit configuration in /opt/moab/moab.cfg
» SCHEDCFG[Scyld] MODE=NORMAL SERVER=scyld.localdomain:42559
• Ensure hostname is consistent with 'hostname'
» ADMINCFG[1] USERS=root
• Add additional users who can be queue managers
» RMCFG[base] TYPE=PBS
• TYPE=PBS integrates with a traditional Torque configuration
Tuning
 Default walltime can be set in Torque using:
» qmgr -c 'set queue batch resources_default.walltime=16:00:00'
 If many small jobs need to be submitted, uncomment
the following in /opt/moab/moab.cfg
» JOBAGGREGATIONTIME 10
 To exactly match node and processor requests, add the
following to /opt/moab/moab.cfg
» JOBNODEMATCHPOLICY EXACTNODE
 Changes in /opt/moab/moab.cfg can be activated by
doing a ‘service moab restart’
Case Studies
 Case Study #1
» Multiple queues for interactive, high priority, and standard jobs
 Case Study #2
» Different types of hardware configuration
» Setup with FairShare
• http://www.clusterresources.com/products/mwm/moabdocs/5.1.1priorityoverview.shtml
• http://www.clusterresources.com/products/mwm/moabdocs/5.1.2priorityfactors.shtml
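For Case Study #1, a hedged Torque-side sketch of the extra queues (limits and priorities are illustrative; Moab policies in moab.cfg would refine scheduling further):
  qmgr -c 'create queue interactive queue_type=execution'
  qmgr -c 'set queue interactive resources_max.walltime=02:00:00'
  qmgr -c 'set queue interactive enabled=true'
  qmgr -c 'set queue interactive started=true'
  qmgr -c 'create queue high queue_type=execution'
  qmgr -c 'set queue high priority=100'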
Troubleshooting
 Log files
» /var/log/messages
» /var/log/beowulf/node*
» /var/spool/torque/server_logs
» /var/spool/torque/mom_logs
» qstat -f
» tracejob
» /opt/moab/log
» mdiag
» strace -p
» gdb
Hardware Maintenance
 pbsnodes -o n0: mark node offline and allow jobs to drain
 bpctl -S 0 -s unavailable: prevent user interactive commands from running on the node
 Wait until node is idle
 bpctl -S 0 -P: power off node
 Perform maintenance
 Power on node
 pbsnodes -c n0: return node to service
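The same procedure as a small illustrative script for one node (the commented power-on line assumes the IPMI naming from the Physical Management slide):
  #!/bin/sh
  N=$1
  pbsnodes -o n$N                # offline, let jobs drain
  bpctl -S $N -s unavailable     # block interactive use
  # ...wait until the node is idle, then power off and perform maintenance:
  bpctl -S $N -P
  # ipmitool -H n$N-ipmi -U admin -P admin power on   # when maintenance is done
  pbsnodes -c n$N                # return to service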
Questions??