Introduction to Linux Clusters
Clare Din, SAS Computing, University of Pennsylvania
March 15, 2004

Cluster Components
Hardware:
- Nodes: compute nodes, admin node, I/O node, login node
- Disk array: RAID5, SCSI 320, 10k+ RPM drives, TB+ capacity, NFS-mounted from the I/O node
- Networking gear: Myrinet, gigE, or 10/100; switches, cables, and networking cards
- Backup device: AIT3, DLT, or LTO; N-slot cartridge drive; SAN
- Admin front end: console (keyboard, monitor, mouse), KVM switches, KVM cables
- UPS: e.g., APC SmartUPS 3000, 3 per 42U rack
- Rack units: 42U, standard or deep
Software:
- Operating system: Red Hat 9+ Linux, Debian Linux, SUSE Linux, Mandrake Linux, FreeBSD, and others
- MPI: MPICH, LAM/MPI, MPI-GM, MPI Pro
- Compilers: GNU, Portland Group, Intel
- Scheduler: OpenPBS, PBS Pro, Maui

Filesystem Requirements
- Use a journaled filesystem: reboots happen much more quickly after a crash, at a slight performance cost
- ext3 is a popular choice (the old ext2 was not journaled)

Space and Power Requirements
Space:
- A standard 42U rack is about 24"W x 80"H x 40"D
- Blade units give you more than one node per 1U of space, in a deeper rack
- Plan cable management inside the rack; consider overhead or raised-floor cabling for the external cables
Power:
- A 67-node Xeon cluster consumes 19,872 W and needs 5.65 tons of A/C to keep it cool
- Ideally, each UPS plug should connect to its own circuit
- Clusters (especially blades) run very hot; make sure there is adequate A/C and ventilation

Network Requirements
External network:
- One 10 Mb/s network line is adequate, since all computation and message passing stays inside the cluster
Internal network:
- gigE, Myrinet, or some combination of the two
- Base your network gear selection on whether most of your jobs are CPU-bound or I/O-bound

Network Choices Compared
- Fast Ethernet (100BT): 0.1 Gb/s (100 Mb/s) bandwidth; essentially free
- gigE: 0.4 to 0.64 Gb/s bandwidth; about $400 per node
- Myrinet: 1.2 to 2.0 Gb/s bandwidth; about $1000 per node; scales to thousands of nodes; buy fiber instead of copper cables

[Chart: networking gear speeds, comparing Fast Ethernet, gigE, and Myrinet]

I/O Node
- Hosts the globally accessible filesystem (the RAID5 disk array)
- NFS-share it, and put user home directories, apps, and scratch space directories on it so all compute nodes can access them (see the sketch below)
- Enforce quotas on home directories
- Hosts the backup device: make sure the device and its software are compatible with your operating system
- Plan a good backup strategy, and test how long it takes to restore a single file or a whole filesystem from backups
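As a concrete illustration of the NFS setup above, here is a minimal sketch; the hostname ionode, the /data path, and the 192.168.1.0/24 private subnet are placeholder assumptions, not prescriptions:

    # On the I/O node, export /data to the cluster's private subnet.
    # Add this line to /etc/exports (the subnet is an assumption):
    /data 192.168.1.0/255.255.255.0(rw,sync)

    # Then reload the export table:
    exportfs -ra

    # On each compute node, mount the share at boot time.
    # Add this line to /etc/fstab (ionode is a placeholder hostname):
    ionode:/data  /data  nfs  rw,hard,intr  0 0

Quotas are then enforced on the I/O node itself, since that is where the filesystem actually lives; the compute nodes just see the mounted result.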
Admin Node
- Only sysadmins log into this node; it is accessible only from within the cluster
- Runs the cluster management software
- User and quota management
- Node management: rebuild dead nodes, monitor CPU utilization and network traffic

Compute Nodes
- Buy the fastest CPUs and bus speed you can afford
  - Don't forget that some software companies license their software per node, so factor in software costs
  - Stick with a proven technology over future promise
- Memory size of each node depends on the application mix: 2 GB or more for large calculations, under 2 GB for financial databases
- Lots of hard disk space is not much of a priority, since the nodes will primarily use shared space on the I/O node; disks are cheap nowadays, and 40 GB EIDE is standard per node
- Choose a CPU architecture you're comfortable with: Intel (P4, Xeon, Itanium), AMD (Opteron, Athlon), or other (G4/G5)
- Consider that some algorithms require 2^n nodes
- 32-bit Linux is free or close to free; 64-bit Red Hat Linux costs $1600 per node

Login Node
- Users log in here, via ssh or ssh -X; it is the only way to get into the cluster
- Cluster designers recommend 1 login node per 64 compute nodes
- Update /etc/profile.d so all users get the same environment when they log in (see the sketch after this section)
- Give it a static IP address (vs. the DHCP addresses on all other cluster nodes) and turn on the built-in firewall software
- Users compile code here: compiler licenses should be purchased for this node only, so don't pay for more than you need; 2 licenses might be sufficient for a department's code compilation
- Users control jobs from here (using a scheduler): choose among queues to access subsets of resources; submit, delete, and terminate jobs; check on job status
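The /etc/profile.d idea above can look like the following minimal sketch, assuming (hypothetically) that MPICH and PBS are installed under /usr/local; Red Hat sources every *.sh file in /etc/profile.d for Bourne-style shells, and a matching *.csh file is needed for tcsh users:

    # /etc/profile.d/cluster.sh -- gives every user the same environment.
    # Keep the master copy on the login node and rsync it to the compute
    # nodes along with the other config files mentioned later in this talk.

    MPI_HOME=/usr/local/mpich    # assumed install location
    PBS_HOME=/usr/local/pbs      # assumed install location

    PATH=$PATH:$MPI_HOME/bin:$PBS_HOME/bin
    MANPATH=$MANPATH:$MPI_HOME/man
    export PATH MANPATH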
Spare Nodes
- Offline nodes that are put into service when an existing node dies
- Use them for spare parts or as a testing environment

Cluster Install Software
- Designed to make cluster installation easier (the "cluster in a box" concept)
- Shortens the install process with automated steps and decreases the chance of user error
- Choices: OSCAR, Felix, IBM xCAT, IBM CSM

Cluster Management Software
- Run parallel commands via a GUI, or write Perl scripts for command-line control
- Install new nodes; rebuild corrupted nodes
- Check on the status of hardware (nodes, network connections) with Ganglia, xpbsmon, or the Myrinet tests (gm_board_info)
- xpbsmon shows the running jobs that were submitted via the scheduler

Cluster Consistency
- rsync or rdist the /etc/passwd, shadow, gshadow, and group files from the login node to the compute nodes
- Also consider rsync'ing (automatically or manually) the /etc/profile.d files, the PBS config files, /etc/fstab, etc.

Local and Remote Management
Local management:
- GUI desktop from the console monitor
- KVM switches to access each node
Remote management:
- Console switch: ssh in and see what's on the console monitor screen from your remote desktop
- Web-based tools: Ganglia (ganglia.sourceforge.net), Netsaint (www.netsaint.org), Big Brother (www.bb4.com)

Ganglia
- Tool for monitoring clusters of up to 2000 nodes
- Used on over 500 clusters worldwide
- Available for multiple OSes and CPU architectures
- To view it here:

    # ssh -X coffee.chem.upenn.edu
    # ssh coffeeadmin
    # mozilla &

  then open http://coffeeadmin/ganglia
- The web page auto-refreshes periodically

[Screenshot slides: the Ganglia web interface]

Scheduling Software (PBS)
- Set up queues for different groups of users based on resource needs (i.e., not everyone needs Myrinet; some users only need 1 node); see the qmgr sketch below
- The world does not end if one node goes down; the scheduler will run the job on another node
- Make sure pbs_server and pbs_sched are running on the login node
- Make sure pbs_mom is running on all compute nodes, but not on the login, admin, or I/O nodes

Scheduling Software
- OpenPBS: limits users by number of jobs; good support via message boards; free
- PBS Pro: the "pro" version of OpenPBS; limits by nodes, not just jobs per user; you must pay for support ($25 per CPU, or $3200 for a 128-CPU cluster)
- Others: Load Sharing Facility (LSF), Codine, Maui
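As a sketch of the queue setup described on the PBS slide above, the qmgr commands below create a general-purpose queue along the lines of the coffeeq queue shown later in this talk; the exact limits are illustrative, and both OpenPBS and PBS Pro accept this qmgr syntax:

    # Run these on the node where pbs_server runs (usually the login node).
    qmgr -c "create queue coffeeq queue_type = execution"
    qmgr -c "set queue coffeeq resources_max.walltime = 12:00:00"
    qmgr -c "set queue coffeeq resources_max.ncpus = 16"
    qmgr -c "set queue coffeeq enabled = true"
    qmgr -c "set queue coffeeq started = true"

    # Make it the default, so jobs submitted without -q land here:
    qmgr -c "set server default_queue = coffeeq"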
MPI Software
- MPICH (Argonne National Laboratory)
- LAM/MPI (OSC/University of Notre Dame)
- MPI-GM (Myricom)
- MPI Pro (MSTI Software): written by one of the original developers of MPICH; claims to be 20% faster than MPICH; costs $1200 plus support per year

Compilers and Libraries
Compilers:
- gcc/g77 (www.gnu.org/software)
- Portland Group (www.pgroup.com)
- Intel (www.developer.intel.com)
Libraries:
- BLAS; ATLAS, a portable BLAS (www.math-atlas.sourceforge.net)
- LAPACK; ScaLAPACK, an MPI-based LAPACK
- FFTW, a fast Fourier transform library (www.fftw.org)
- many, many more

Cluster Security
- Securing and patching your Linux cluster is much like securing and patching your Linux desktop
- Keep an eye out for the latest patches; install a patch only if necessary, and try it on a test machine first
- Make sure there's a way to back out of a patch before installing it
- Get rid of unneeded software; limit who installs and what gets installed
- Close unused ports and services
- Limit login service to ssh between the login node and the outside world; use ssh to tunnel X connections safely
- Limit access using hosts.allow and hosts.deny (see the sketch after this list)
- Use scp and sftp for secure file transfer
- Carefully configure NFS
- Upgrade to the latest, safest Samba version, if Samba is used
- Disable Apache if it is not needed
- Turn on the built-in Linux firewall software
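A minimal sketch of the hosts.allow/hosts.deny approach, assuming (hypothetically) that outside logins should come only from campus hosts and that the cluster's private subnet is 192.168.1.0/24:

    # /etc/hosts.deny on the login node: refuse everything by default.
    ALL: ALL

    # /etc/hosts.allow: then permit ssh only from campus machines and
    # from inside the cluster (the domain and subnet are assumptions).
    sshd: .upenn.edu 192.168.1.

On the compute nodes, the allow rule can be narrowed to just the login node's address, which enforces the "only way into the cluster" rule at the network layer as well.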
Troubleshooting
- Make sure the core cluster services are running: the scheduler, MPI, NFS, and the cluster managers
- Make sure software licenses are up-to-date
- Scan logs for break-in attempts
- Keep a written journal of all patch installs and upgrades
- Sometimes a reboot will fix the problem
  - If you reboot the login node where the scheduler is running, be sure the scheduler is started after the reboot; any jobs in the queues will be flushed
  - Hard-rebooting hardware, such as tape drives, usually fixes the problem
- Reboot order: I/O node, login node, admin node, then compute nodes (i.e., master nodes first, then slave nodes)
- Rebuilding a node takes 30 minutes with the cluster manager; reconfiguring it may take an hour more

Vendor Choices
- Dell, IBM, Western Scientific, Aspen Systems, RackSaver, eRacks, Penguin Computing, and many, many others
- Go with a proven vendor
- Get every vendor to spec out the same hardware and software before you compare prices
- Compare service agreements
- Ask how fast they can deliver a working cluster

Buying Commercial Software
- Is it worth the money? Is it proven software?
- Are all the bells and whistles really necessary?
- Paid software does not necessarily have the best support

Cluster Tips
- Keep all sysadmin scripts in an easily accessible place, such as /4sysadmin or /usr/local/4sysadmin
- Force everyone to use the scheduler to run their jobs (even uniprocessor jobs); police it, and don't let users get away with things
- Wrapping some applications into a scheduler script can be tricky

Cluster Upgrades
- Nodes become obsolete in 2 to 3 years; upgrade banks of nodes at a time
- If upgrading to a new CPU, check for compatibility problems and new A/C requirements
- Upgrading memory and disk space is easy but tedious
- Upgrading the OS can be a major task; even installing patches can be a major task

Common Sense Cluster Administration
- Plan a little before you do anything
- Keep a journal of everything you do
- Create procedures that are easy to follow in times of stress
- Document everything!
- Test software before announcing it
- Educate your support team and "radiate" your cluster knowledge to them

coffee.chem
- Funded by 6 P.I.'s in Chemistry; located in FBA121 next to A/C3
- 69 dual-CPU node cluster: 64 compute nodes, 1 login node, 1 admin node, 1 I/O node, 1 backup node, 1 firewall node
- Myrinet on 32 compute nodes, gigE on the other 32
- 2 TB RAID5 array (1.7 TB formatted)
- 12-slot, 4.8 TB capacity LTO tape drive
- 2U fold-out console with LCD monitor, keyboard, and trackpad
- 5 daisy-chained KVM switches
- 9 APC 3000 UPS units, each connected to its own circuit
- 3 42U racks
- Red Hat 9; Felix cluster install and management software; PBS Pro
- MPICH, LAM/MPI, MPICH-GM; GNU and Portland Group compilers
- BLAS, ScaLAPACK, and ATLAS libraries; Gaussian98 (Gaussian03 + Linda soon)
- /data on the I/O node (coffeecompute00) holds common apps and user home directories; every node in the cluster can access /data via NFS
- The admin node (coffeeadmin) runs the Felix cluster manager; the compute nodes are coffeecompute01 through coffeecompute64
- You can ssh into the compute nodes, admin node, and I/O node only via the login node
- The backup node (javabean) temporarily has our backup device attached (we use tar right now)

Logging Into coffee.chem
- Everyone in this room will have a user account on coffee.chem and a home directory in /data/staff
- Our existence on the system is for Chemistry's benefit
- Support scripts are found in /4sysadmin
- If a reboot is necessary, make sure that PBS is started (/etc/init.d/pbs start)

Compiling Code
- pgCC -Mmpi -o test hello.cpp
- MPICH also includes mpicc and mpif77 to compile and link MPI programs; they are scripts that pass the MPI library arguments to cc and f77

Running Code
- mpirun -np XXX -machinefile YYY -nolocal test
  - -np = number of processors
  - -machinefile = a file listing the processors you want to run the job on
  - -nolocal = don't run the job on the local node
- Simple example: mpirun -np 8 test

Submitting a Job
- 3 queues to choose from:
  - coffeeq: the general-purpose queue; 12 hours max run time; 16 processors max
  - espressoq: higher priority than coffeeq; 3 weeks max run time
  - piq: some may still use it, but it will go away soon
- Prepare a scheduler script, then qsub it:

    #!/bin/tcsh
    #PBS -l arch=linux        # define the architecture
    #PBS -l cput=1:00:00      # define the CPU time needed
    #PBS -l mem=400mb         # define the memory space needed
    #PBS -l nodes=64:ppn=1    # define the number of nodes needed
    #PBS -m e                 # mail me when the job ends
    #PBS -c c                 # minimal checkpointing
    #PBS -k oe                # keep the output and errors
    #PBS -q coffeeq           # run the job on coffeeq
    mpirun -np 8 -machinefile machines_gige_32.LINUX /data/staff/din/newhello

More PBS Commands
- Check on the status of all submitted jobs: qstat
- Submit a job: qsub
- Delete or terminate a job: qdel
- Shut down the PBS server itself (sysadmin use only, not for jobs): qterm
- See all your available compute node resources: pbsnodes -a
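Putting the last two slides together, a typical session might look like the following; the script name and job number are made up for illustration:

    # Submit the scheduler script and note the job id that qsub prints:
    qsub myjob.pbs
    # -> 1234.coffee.chem.upenn.edu

    # Watch the job move from queued (Q) to running (R):
    qstat 1234

    # Changed your mind? Remove the job from the queue:
    qdel 1234

    # Before sizing a big run, list every node and its state:
    pbsnodes -a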
Node Terms
- Login node = service node = head node: the node users log into
- Master scheduler node: the node where the scheduler runs, usually the login node
- Admin node: the node the sysadmin logs into to gain access to the cluster management apps
- Compute node: one of the nodes that perform pieces of a larger computation
- Storage node: the node that has the RAID array or SAN attached to it
- Backup node: the node that has the backup solution attached to it
- I/O node: can combine the features of the storage and backup nodes
- Visualization node: a node that contains a graphics card and graphics console; multiple visualization nodes can be combined in a matrix to form a video wall
- Spare node: a node that is not in service, but can be rebuilt to take the place of a compute node or, in some cases, an admin or login node

References
- Bookman, Charles. Linux Clustering: Building and Maintaining Linux Clusters. New Riders, Indianapolis, Indiana, 2003.
- Howse, Martin. "Dropping the Bomb: AMD Opteron." Linux User & Developer, Issue 33, pp. 33-36.
- Robertson, Alan. "Highly-Affordable High Availability." Linux Magazine, November 2003, pp. 16-21.
- The Seventh LCI Workshop Systems Track Notes. Linux Clusters Institute, March 24-28, 2003.
- Sterling, Thomas, et al. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters. The MIT Press, Cambridge, Massachusetts, 1999.
- Vrenios, Alex. Linux Cluster Architecture. Sams Publishing, Indianapolis, Indiana, 2002.

coffee.chem Contact List
- Dell hardware problems: 800-234-1490
- Myrinet problems: help@myri.com
- "Very limited" software support: dellsup@mpi-softtech.com
- PGI compiler issues: help@pgi.com