Clusters in Molecular Sciences Applications
Serguei Patchkovskii@#, Rochus Schmid@, Tom Ziegler@, Siu Pang Chan#, Andrew McCormack#, Roger Rousseau#, Ian Skanes#
@ Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4, Canada
# Theory and Computation Group, SIMS, NRC, 100 Sussex Dr., Ottawa, Ontario, K1A 0R6
2nd Annual iHPC Cluster Workshop, Ottawa, January 11, 2002

Overview
• Beowulf-style clusters have entered the mainstream
• Are clusters a lasting, efficient investment?
• Odysseus: an internal cluster at the SIMS theory group
• Clusters in molecular science applications: software availability and performance
• Three war stories, and a cautionary message
• Summary and conclusions

Shared, Academic Clusters in Canada
Location | CPUs | URL or other info
Carleton U. | 8 x PII-400 | www.scs.carleton.ca/~gis/
UBC | 256 x PIII-1000 | www.gdcfd.ubc.ca/Monster
U of Calgary | 179 x Alpha | www.maci-cluster.ucalgary.ca
U of Western Ontario | 144 x Alpha | GreatWhite.sharcnet.ca
U of Western Ontario | 48 x Alpha | DeepPurple.sharcnet.ca
McMaster U | 106 x Alpha | Idra.physics.mcmaster.ca
U of Guelph | 120 x Alpha | Hammerhead.uoguelph.ca
U of Windsor | 8 x Alpha |
Wilfrid Laurier U | 8 x Alpha |

Canadian top-500 facilities
[Chart: Canadian TOP500 facilities, with cluster entries highlighted.]

Internal, "workhorse" clusters
Location | CPUs | URL or other
U of Alberta | 98 x PIII-450 | www.phys.ualberta.ca/THOR
U of Calgary | 94 x 21164-500 | www.cobalt.chem.ucalgary.ca
U of Calgary | 120 x PIII-1000 | www.ucalgary.ca/~tieleman/elk.html
U of Calgary | 32 x PIII |
Memorial U | 32 x PII-300 | weland.esd.mun.ca
MDS Proteomics | 400 x PIII-1000 | www.mdsproteomics.com
ICPET, NRC | 80 x PIII-800 |
DRAO, NRC | 16 x PII-450 |
SIMS, NRC | 32 x PIII-933 |
Samuel Lunenfeld Research Institute | 224 x PIII-450 | Bioinfo.mshri.on.ca/yac/
Sherbrooke U | 64 x PII-400 |
U of Saskatchewan | 12 x Athlon-800 | Sasquatch.usask.ca
Simon Fraser U | 16 x PIII-500 | www.sfu.ca/acs/cluster/
U of Victoria | 39 x PIII-450 | Pingu.phys.uvic.ca/muse/ (?)
McMaster U | 32 x PIII-700 | www.cim.mcgill.ca/~cvr/beowulf/
CERCA, Montreal | 16 x Athlon-1200 | www.cerca.umontreal.ca/~fourmano/
U of Western Ontario | various | www.baldric.uwo.ca

Clusters are everywhere
Lemma 1: A computationally intensive research group in Canada can be in one of three states:
a) it owns a cluster, or
b) it is building a cluster, or
c) it plans to build a cluster RSN.
Clusters have become a mainstream research tool – useful, but no longer automatically worthy of a separate mention.

Cobalt: Hardware
[Topology diagram: "computers on benches, all linked together" – 93 compute nodes (Node 1 … Node 93), each on a half-duplex 100BaseTx link into the switch (93 x 100BaseTx); connection to the outside world; labels: 2 x 100BaseTx, 128 MB memory, 18 GB RAID-1 (4 spindles).]

Cobalt: Nodes and Network
Nodes: Digital/Compaq Personal Workstation 500au
CPU: Alpha 21164A, 500 MHz
Cache: 96 KB on-chip (L1 and L2)
Peak flops: 10^9 flop/s
SpecInt 95: 15.7 (estimate)
SpecFP 95: 19.5 (estimate)
Network: 4 x 3COM SuperStack II 3300
Peak aggregate bandwidth: 500.0 MB/s
Peak internode bandwidth (TCP): 11.2 MB/s
NFS read/write: 3.4 / 4.1 MB/s
Round-trip latency (TCP): 360 μs
Round-trip latency (UDP): 354 μs
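A quick sanity check on these numbers, as a minimal Python sketch: the 93-node count is taken from the topology slide, and the two-floating-point-results-per-cycle figure is our assumption (consistent with the quoted 10^9 flop/s at 500 MHz), not something stated on the slides.

```python
# Back-of-envelope checks for the Cobalt numbers quoted above.
# Assumptions (not in the slides): 93 compute nodes, 2 FP results/cycle
# on the 21164A, and decimal megabytes for the network figures.

NODES = 93                      # compute nodes shown in the topology slide
CPU_MHZ = 500                   # Alpha 21164A clock
FLOPS_PER_CYCLE = 2             # assumed: one FP add + one FP multiply per cycle

peak_node_gflops = CPU_MHZ * 1e6 * FLOPS_PER_CYCLE / 1e9   # ~1 GFlop/s, as quoted
peak_cluster_gflops = NODES * peak_node_gflops             # ~93 GFlop/s

wire_rate_mb_s = 100e6 / 8 / 1e6    # 100BaseTx: 12.5 MB/s theoretical
measured_tcp_mb_s = 11.2            # from the slide
tcp_efficiency = measured_tcp_mb_s / wire_rate_mb_s        # ~0.90

print(f"Per-node peak:  {peak_node_gflops:.1f} GFlop/s")
print(f"Cluster peak:   {peak_cluster_gflops:.0f} GFlop/s")
print(f"TCP efficiency: {tcp_efficiency:.0%} of the 100BaseTx wire rate")
```

In other words, the measured internode TCP bandwidth reaches roughly 90% of the 100BaseTx wire rate, so the cheap Ethernet fabric is not leaving much on the table for bandwidth-bound jobs.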
Cobalt: Software
OS, communications, and cluster management:
Base OS: Tru64, using DMS, NIS, and NFS
Compilers: Digital/Compaq C, C++, Fortran
Communications: PVM, MPICH
Batch queuing: DQS
Application software:
ADF: Amsterdam Density Functional (PVM)
PAW: Projector-Augmented Wave (MPI)

Cobalt: Return on the Investment
Investment (dollars):
Total cost: 390,800, including:
  Initial purchase: 346,000
  Operating ('98–'01): power (6¢/kWh): 15,800; admin (20% PDF): 24,000; spare parts: 5,000
Payback (research articles):
Total publications: 92, including:
  Organometallics: 21
  J. Am. Chem. Soc.: 12
  J. Phys. Chem.: 11
  J. Chem. Phys.: 10
  Inorg. Chem.: 6
ROI: 1 publication / $4,250 ($390,800 / 92 articles ≈ $4,250 per article)

Odysseus: Low-tech solution for high-tech problems (1)
[Figure – no recoverable text content.]

Odysseus: Low-tech solution for high-tech problems (2)
Nodes (16+1):
ABIT VP6 motherboard
2 x PIII-933, 133 MHz FSB
4 x 256 MB RAM
3COM 3C905C
36 GB 7200 rpm IDE
… plus, on the front end:
Intel PRO/1000
Adaptec AHA-2940UW
60 GB 7200 rpm IDE

Odysseus: Low-tech solution for high-tech problems (3)
Network: SCI + 100 Mbit
Dolphin D339 (2D SCI), connected in horizontal and vertical rings
HP ProCurve 2524 + 1 Gig

Odysseus: Low-tech solution for high-tech problems (4)
Backup unit:
VXA tape (www.ecrix.com), 35 GB/cartridge (physical)
TreeFrog autoloader (www.spectralogic.com), 16-cartridge capacity
UPS unit: Powerware 5119, 2880 VA

Odysseus: Low-tech solution for high-tech problems (5)
"Four little wheels" [photo caption]
Odysseus at a glance:
Processors: 32 (+2)
Memory: 16 GB
Disk: 636 GB
Peak flops: 29.9 GFlop/s

Odysseus: cost overview
Expense | Dollars
Nodes | 40,640
SCI network (cards & cables) | 26,771
Backup unit (tape + robot) | 5,860
Spare parts in stock | 5,024
Ethernet (switch, cables, and head-node link) | 4,190
Compiler (PGI) | 3,780
UPS | 2,265
Backup tapes (16+1) | 1,911
Total | 90,441

Clusters in molecular science – software availability
• Gaussian
• Turbomole
• GAMESS
• NWChem
• GROMOS
• ADF
• PAW
• CPMD
• AMBER
• VASP
• PWSCF
• ABINIT
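The package slides that follow quote speedups relative to a serial (single-node) run. As a reference for how those figures are derived, here is a minimal sketch; the parallel timings in it are placeholders, not benchmark data from Cobalt or Odysseus.

```python
# How the speedup curves on the following slides are obtained:
#   speedup(N) = T(1) / T(N),  efficiency(N) = speedup(N) / N.
# The parallel timings below are illustrative placeholders only.

def speedup(serial_time: float, parallel_time: float) -> float:
    """Speedup of a parallel run relative to the serial run."""
    return serial_time / parallel_time

def efficiency(serial_time: float, parallel_time: float, n_cpus: int) -> float:
    """Fraction of ideal (linear) scaling actually achieved."""
    return speedup(serial_time, parallel_time) / n_cpus

if __name__ == "__main__":
    t_serial = 683.0                            # minutes, e.g. the ADF Cr(N)Porph example
    timings = {2: 360.0, 4: 190.0, 8: 105.0}    # hypothetical parallel timings
    for n, t in timings.items():
        print(f"{n:2d} nodes: speedup {speedup(t_serial, t):4.1f}, "
              f"efficiency {efficiency(t_serial, t, n):5.1%}")
```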
Software: ADF
ADF – Amsterdam Density Functional (www.scm.com)
[Plot: speedup vs. number of Cobalt nodes.]
Example: Cr(N)Porph, full geometry optimization
38 atoms, 580 basis functions, C4v symmetry
45 MB of memory
Serial time: 683 minutes

Software: PAW
PAW – Projector-Augmented Wave (www.pt.tu-clausthal.de/~ptpb/PAW/pawmain.html)
[Plot: speedup vs. number of Cobalt nodes.]
Example: SN2 reaction CH3I + [Rh(CO)2I2]-
11 Å unit cell
Serial time per step: 83 seconds
Memory: 231 MB

Software: CPMD
CPMD – Car-Parrinello Molecular Dynamics (www.mpi-stuttgart.mpg.de/parinello/)
[Plot: performance on odysseus.]
Example: H in Si64
65 atoms, periodic, 40 Ryd cut-off
Geometry optimization (2 steps) + free MD (70 steps)

Software: AMBER
AMBER – Assisted Model Building with Energy Refinement (www.amber.ucsf.edu/amber/)
[Plot: time (hours) vs. Ncpu.]
Example: 22-residue polypeptide + 4 K+ + 2500 H2O, 1 ns MD

Software: VASP
VASP – Vienna Ab-initio Simulation Package (cms.mpi.univie.ac.at/vasp/)
[Plot: performance on odysseus.]
Example: Li198 at 1000 GPa
300 eV cut-off, 9 k-points
10 wavefunction optimization steps + stress tensor

Software: PWSCF
PWSCF and PHONON – plane-wave pseudopotential codes, optimized for phonon spectra calculations (www.pwscf.org/)
[Plot: performance on odysseus.]
Example: solid MgB2, geometry optimization
40 Ryd cut-off, 60 k-points

Software: ABINIT
ABINIT (www.mapr.ucl.ac.be/ABINIT/)
Example: SiO2 (stishovite)
70 Ryd cut-off, 6 k-points, 12 SCF iterations

War Story #1
Odysseus hardware maintenance log, Oct 19, 2001: "Overnight, node 6 had a kernel OOPS … it responds to network pings and keyboard, but no new processes can be started …"
Reason: the heat sink on CPU #1 came loose, resulting in overheating under heavy load.
Resolution: reinstall the heat sink.
Detected by: elevated temperature readings for CPU #1 (lm_sensors).
Downtime: 20 minutes (the affected node).

War Story #2
Odysseus hardware maintenance log, Nov 12, 2001: A large, 16-CPU VASP job fails with "LAPACK: Routine ZPOTRF failed", or with random total energies.
Reason: a DIMM in bank #0 on node 17 developed a single-bit failure at address 0xfd9f0c.
Resolution: replace the memory module in bank #0.
Detected by: rerunning the failing job on different sets of nodes, followed by a memory diagnostic on the affected node (memtest32).
Downtime: 1 day (the whole cluster) + 2 days (the affected node).
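The "rerun on different node sets" diagnostic from War Story #2 is essentially a set-intersection exercise. A minimal sketch of the bookkeeping follows; the node lists and pass/fail outcomes in it are invented for illustration, not taken from the maintenance log.

```python
# Isolating a flaky node the War Story #2 way: rerun the same job on
# different node subsets and intersect the failing sets.  The runs below
# are invented for illustration; real runs would come from the batch system.

def suspects(runs):
    """Nodes present in every failing run but in no passing run."""
    failed = [set(nodes) for nodes, ok in runs if not ok]
    passed = [set(nodes) for nodes, ok in runs if ok]
    if not failed:
        return set()
    common = set.intersection(*failed)
    for good in passed:
        common -= good
    return common

runs = [
    ([1, 2, 3, 17], False),   # hypothetical: job fails whenever node 17 is used
    ([4, 5, 6, 17], False),
    ([1, 2, 3, 4],  True),
    ([5, 6, 7, 8],  True),
]
print("Suspect node(s):", suspects(runs))   # -> {17}; then run the memory test there
```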
War Story #3
Odysseus hardware maintenance log, Dec 10, 2001: Apparently random application failures are observed.
Reason: multiple single-bit memory failures on the nodes (bank #): 6 (#2), 7 (#2, #3), 8 (#0), 10 (#0), 11 (#0).
Resolution: replace the memory modules.
Detected by: cluster-wide memory diagnostic (memtest32).
Downtime: 3 days (the whole cluster).

Cautionary Note
• Using inexpensive, consumer-grade hardware potentially exposes you to low-quality components
• Never use components which have no built-in hardware monitoring and error detection capability
• Always configure your clusters to report corrected errors and out-of-range hardware sensor readings
• Act on the early warnings
• Otherwise, you run the risk of producing garbage science, and never knowing it

Hardware Monitoring with Linux
(A minimal threshold-check sketch follows the summary.)
Category | Parameter | Package
Motherboard | Temperature; power-supply voltages; fan status | lm_sensors#
Hard drives | Corrected error counts; impending-failure indicators | ide-smart$, S.M.A.R.T. Suite%
Memory | Corrected error counts | ecc.o^
Network | Hardware-dependent |
# http://www2.lm-sensors.nu/~lm78/
$ http://www.linux-ide.org/smart.html
% http://csl.cse.ucsc.edu/smart.shtml
^ http://www.anime.net/~goemon/linux-ecc/ (2.2 kernels only)

Summary and Conclusions
• Clusters are no longer a techno-geek's toy, and will remain the primary workhorse of many research groups, at least for a while
• Clusters give an impressive return on investment, and may remain useful longer than expected
• Many (most?) useful research codes in the molecular sciences are readily available on clusters
• Configuring and operating PC clusters can be tricky. Consider a reputable system integrator with Beowulf hardware and software experience
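Appendix, referenced from the "Hardware Monitoring with Linux" slide: the advice to report out-of-range sensor readings boils down to a per-node threshold check. The sketch below is minimal and illustrative; the sensor names, limits, and readings are placeholders, and in practice the values would be gathered on each node with the lm_sensors, SMART, and ECC tools listed in the monitoring table.

```python
# Minimal "act on the early warnings" check, as advocated on the
# Cautionary Note slide.  Sensor names, limits and readings are
# illustrative placeholders; real values would come from lm_sensors
# (motherboard) and the SMART/ECC tools in the monitoring table.

LIMITS = {                      # hypothetical acceptable ranges
    "cpu0_temp_C": (10.0, 65.0),
    "cpu1_temp_C": (10.0, 65.0),
    "vcore_V":     (1.60, 1.80),
    "fan_cpu_rpm": (3000, 8000),
}

def out_of_range(readings):
    """Return (sensor, value, limits) for every reading outside its range."""
    alerts = []
    for name, value in readings.items():
        lo, hi = LIMITS[name]
        if not lo <= value <= hi:
            alerts.append((name, value, (lo, hi)))
    return alerts

# Hypothetical snapshot from one node -- CPU #1 running hot, as in War Story #1.
node6 = {"cpu0_temp_C": 48.0, "cpu1_temp_C": 78.5, "vcore_V": 1.72, "fan_cpu_rpm": 5400}
for name, value, (lo, hi) in out_of_range(node6):
    print(f"WARNING node6: {name} = {value} outside [{lo}, {hi}]")
```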