Chemical Supercomputing on the Cheap: 94GFlops computer system at cdn$3680/gigaflop

S. Patchkovskii, R. Schmid, and T. Ziegler
Department of Chemistry, University of Calgary, 2500 University Dr. NW, Calgary, Alberta, T2N 1N4 Canada

Chemical Supercomputing on the Cheap, CSC'82, 1999

Introduction

Accurate quantum-chemical modeling of systems of chemical interest is extremely computationally intensive and requires substantial amounts of memory and secondary storage. This has traditionally consigned first-principles calculations of chemical properties to large (and expensive) vector and parallel computers, placing them out of reach of most practicing chemists. With the ever-increasing computational power of low-end workstations and commodity PCs, it is now possible to perform useful quantum-chemical calculations on inexpensive off-the-shelf hardware. Widely available and robust local area network (LAN) technologies, such as switched 100Mbit/second Ethernet, can be used to combine multiple workstations into a larger parallel system, providing supercomputer levels of performance at a favorable price/performance ratio. In this poster, we describe the COBALT cluster (Computers On Benches All Linked Together), a chemically oriented supercomputer built in our research group at the University of Calgary.

Cobalt hardware: Nodes

A node of the Cobalt cluster is a Compaq/Digital Personal Workstation model 500au.
Each workstation is configured with:

  CPU            Alpha 21164A, 500MHz
  Cache          96Kb on-chip (L1 and L2)
  Memory         64Mb to 512Mb
  Disk           4Gb, 7200RPM UltraSCSI
  CD-ROM         24x
  Graphics       none
  Network        100Mbit/sec Ethernet, full duplex
  Peak Flops     10^9 Flop/second
  SpecInt 95     15.7 (*)
  SpecFP 95      19.5 (*)
  Average price  cdn$3,468 (purchased between March 1998 and February 1999)

For comparison, a top-of-the-line 550MHz Intel Xeon workstation with 512Kb of L2 cache achieves 24.4 SpecInt 95 and 17.1 SpecFP 95, and costs about cdn$4,400 from Dell (May 1999).

(*) SpecInt and SpecFP values estimated from published results for a 500au system with 2Mb L3 cache.

Cobalt hardware: Network

Cobalt nodes communicate through a dedicated 96-port full-duplex 100BaseTx Ethernet switch, constructed from 4 24-port 3COM SuperStack II 3300 switches linked by a matrix module.

  Bandwidth (Mbytes/second)
    Peak bisection    25.00
    Peak aggregate   125.00
    Bisection (TCP)   11.24
    NFS read           3.39
    NFS write          4.14
    Local read        10.11
    Local write        5.58

  Latency (microseconds)
    Round-trip (TCP)    360
    Round-trip (UDP)    354

  Total cost: cdn$13,500

Latency and bandwidth were measured with Larry McVoy's Lmbench using otherwise idle nodes.
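The round-trip figures above were produced by Lmbench. The kind of measurement involved can be sketched with a minimal TCP ping-pong over the loopback interface; this is a simplified stand-in for illustration, not Lmbench's actual lat_tcp methodology:

```python
# Sketch: measuring TCP round-trip latency with a 1-byte ping-pong,
# in the spirit of (but much cruder than) lmbench's lat_tcp.
import socket
import threading
import time

def echo_server(srv):
    # Accept one connection and echo every byte back until EOF.
    conn, _ = srv.accept()
    with conn:
        while True:
            data = conn.recv(1)
            if not data:
                break
            conn.sendall(data)

def measure_rtt(n_iters=1000):
    """Return the average round-trip time over loopback, in microseconds."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))          # any free port
    srv.listen(1)
    threading.Thread(target=echo_server, args=(srv,), daemon=True).start()

    cli = socket.create_connection(srv.getsockname())
    # Disable Nagle's algorithm, so each 1-byte message goes out immediately.
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    start = time.perf_counter()
    for _ in range(n_iters):
        cli.sendall(b"x")
        cli.recv(1)                     # wait for the echo
    elapsed = time.perf_counter() - start

    cli.close()
    srv.close()
    return elapsed / n_iters * 1e6

if __name__ == "__main__":
    print(f"TCP round-trip over loopback: {measure_rtt():.1f} microseconds")
```

A loopback measurement only exercises the protocol stack; the 360-microsecond figure quoted above additionally includes the Ethernet links and the switch.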
Cobalt hardware: The Cluster

[Diagram: 93 cluster nodes, each on a 100BaseTx link to the central switch; a master node with 2x100BaseTx links, 128Mb of memory, and an 18Gbyte RAID-1 array (4 spindles); a half-duplex 100BaseTx uplink to the outside world. RAID, assembly, and miscellaneous costs: cdn$6,500.]

Cobalt: system software

  Operating system: Compaq/Digital Tru64 v.4.0D; remote boot setup (bundled with hardware)
  Single system image (SSI) support: NIS/YP and NFS; all scratch space and user files can be accessed from any node using cluster-wide names (bundled with hardware)
  Compilers: C, C++, Fortran (-77, -90, and -95) (Campus Software Licence Grant, CSLG)
  Batch queuing system: DQS 3.2.3; supports process quotas, resource-based scheduling, and parallel jobs (freely available)
  Parallel programming interfaces: Parallel Virtual Machine (PVM); Message Passing Interface (MPI) - MPICH (freely available)

Cobalt: Single system image

A single system image (SSI), or the ability of a group of computers to present the illusion of a single large computer system, is considered the definitive characteristic of clusters. In order to have a usability advantage over a pile of individual computers, a cluster must provide its users with an SSI covering most of the users' problem areas. Cobalt nodes present the illusion of a single computer in several important respects, namely:

Global user information: All login names and passwords are provided by the Cobalt master through the Network Information System (NIS), and are identical on all Cobalt nodes.

Node transparency: All nodes are identically configured and run the same release of the operating system. All software packages are available on all nodes, so that users can perform any task on any physical node.
Global name space: All user data and scratch areas are physically distributed between the nodes for improved performance and reliability, but are accessible on any node using identical (global) names. Translation between the logical global names and the physical names is handled by the automounter service.

Load balancing: The queuing system automatically assigns user jobs to the least loaded cluster nodes (subject to resource requirements), providing optimal performance and resource utilization.

Cobalt: application software

  Parallel Amsterdam Density Functional (ADF)
    Source: Co-developed at the UofC and the Vrije University (Netherlands)
    Parallel: Yes, PVM or MPI
  Projector-Augmented Plane Wave (PAW) first-principles molecular dynamics
    Source: Co-developed at the UofC and the IBM research laboratories in Zürich
    Parallel: Yes, MPI
  Gaussian-94 and Gaussian-98
    Source: Under site licence at the Chemistry department, no additional cost
    Parallel: No (*)
  Visualization (Xmol, Rasmol, and Viewkel)
    Source: Freely available on the Internet
    Parallel: N/A
  Mopac 6 and 7
    Source: Freely available on the Internet
    Parallel: No

(*) Gaussian supports cluster environments with Network Linda, an extra-cost package which is not available on Cobalt.

Cobalt: total cost

  Cluster nodes, 94xDPW 500au        $ 325,992.00
  Network interconnect               $  13,500.00
  Assembly and miscellaneous         $   6,500.00
  System software                    $       0.00
  Application software               $       0.00
  Tips to the system administrator   $       8.00
  Total:                             $ 346,000.00
  Per 1Gflops of peak performance:   $   3,680.85

The complete per-node construction price, including all hardware and software, is thus substantially lower than the retail price of a comparably equipped PC.

Running ADF in parallel on Cobalt

ADF has been parallelized at the Vrije University in Amsterdam, and can utilize either the MPI or the PVM message passing
libraries. Parallelization has been performed only for the computationally intensive parts of the program (numerical integration and density fitting). All relatively inexpensive parts of the calculation are repeated on all participating nodes, greatly reducing the amount of data which has to be communicated over the network. In a typical ADF run, the nodes have to synchronize only once per SCF cycle or gradient calculation.

[Diagram: communication patterns of ADF compared with a classical parallel application; the ADF nodes synchronize far less frequently.]

We illustrate the parallel performance of ADF with a full geometry optimization of nitridoporphyrinatochromium(V), a medium-sized molecule with 38 atoms shown on the left. This calculation used a polarized triple-zeta basis set on all atoms, resulting in 580 basis functions. The molecule was constrained to C4v symmetry.

For this system, a serial calculation takes 683 minutes on a single Cobalt node (using 45Mb of memory and about 100Mbytes of disk space). For the parallel runs, the execution time can be approximated by Amdahl's law:

  T(n) = T_serial + T_overhead + T_parallel/n

where T_serial is the inherently serial part of the calculation (21 minutes), T_parallel is the parallel part of the calculation (662 minutes), and T_overhead is the parallel overhead (103 minutes).

[Plot: measured ADF speedup as a function of the number of nodes.]

Running PAW in parallel on Cobalt

As a parallel application, PAW is the exact opposite of ADF. Computationally, it is dominated by fast Fourier transforms (FFTs), which place a heavy demand on both inter-node bandwidth and round-trip latency. When running on n nodes, the parallel FFT algorithm used in PAW needs to exchange all the Fourier coefficients held on each node (which can easily require several hundred megabytes of storage) n times during each molecular dynamics (MD) step (see below), resulting in heavy communications traffic.
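The ADF execution-time model can be evaluated numerically. The sketch below uses the fit constants quoted above (21, 662, and 103 minutes); the functional form, with a constant parallel overhead that is absent from the serial run, is an assumption consistent with those figures, not something spelled out on the poster:

```python
# Sketch: Amdahl-type execution-time model for the ADF geometry
# optimization, using the fit constants quoted in the text.
T_SERIAL = 21.0     # inherently serial work, minutes
T_PARALLEL = 662.0  # parallelizable work, minutes
T_OVERHEAD = 103.0  # parallel overhead, minutes (assumed absent when n = 1)

def run_time(n):
    """Predicted wall-clock time on n nodes, in minutes."""
    if n == 1:
        return T_SERIAL + T_PARALLEL          # serial run, no parallel overhead
    return T_SERIAL + T_OVERHEAD + T_PARALLEL / n

def speedup(n):
    return run_time(1) / run_time(n)

for n in (1, 2, 4, 8, 16):
    print(f"{n:3d} nodes: T = {run_time(n):6.1f} min, speedup = {speedup(n):.2f}")
```

Note that run_time(1) reproduces the measured 683-minute serial time (21 + 662), and the constant 103-minute overhead is what eventually caps the achievable speedup.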
[Diagram: all-to-all exchange of Fourier coefficients between PAW nodes during the parallel FFT.]

In a typical parallel PAW run on Cobalt, the full-duplex 100Mbit/second communication links between the nodes and the central switch continuously run at over 20% utilization (more than 2.5Mbytes/second) in each direction. In a sense, the Cobalt nodes and the communication network are perfectly matched for PAW runs: faster CPUs would have made the communication network choke on the data, while a slower communication network would have been unable to keep the CPUs busy.

To illustrate the performance of parallel PAW on Cobalt, consider an SN2 substitution reaction between CH3I and [Rh(CO)2I2]-. This medium-size simulation was performed in an 11Å periodic cell. In a serial run, a single time step requires about 83 seconds; a complete simulation consists of several thousand steps. Fitting the measured execution times at different node counts to the Amdahl law gives (all times in seconds):

  T_serial   =  7.9
  T_parallel = 74.9
  T_overhead =  4.0

Unlike the ADF case, where the inherently serial part constitutes less than 3% of the total work, PAW spends almost 10% of the total time in the inherently serial section. As a consequence, PAW cannot efficiently utilize more than four Cobalt nodes for this simulation.

[Plot: measured PAW speedup as a function of the number of nodes.]

Molecular dynamics calculations in PAW are frequently limited by the amount of memory required to perform the calculation rather than by the simulation time. In the parallel mode, PAW can significantly reduce its per-node memory requirements by distributing both the real-space and Fourier-space grids between the nodes. Since the size of the grids grows with the unit cell size R as O(R^3), they dominate PAW memory requirements for all but the smallest systems. In the CH3I and [Rh(CO)2I2]- system, memory requirements in the serial mode are relatively modest at 231 megabytes.
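The PAW fit quoted above can be reproduced with a small least-squares sketch. The "measured" times below are hypothetical values for illustration, generated from the quoted constants (a constant term of 7.9 + 4.0 = 11.9 s and a 74.9 s parallelizable part) and rounded to two decimals; a real fit would use actual timings:

```python
# Sketch: fitting parallel run times to the Amdahl-type model
# T(n) = A + B/n by linear least squares in x = 1/n.  A bundles the
# serial work and the parallel overhead; B is the parallelizable work.
def fit_amdahl(times_by_nodes):
    """Least-squares fit of T(n) = A + B/n; returns (A, B)."""
    xs = [1.0 / n for n in times_by_nodes]
    ys = [times_by_nodes[n] for n in times_by_nodes]
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (m * sxy - sx * sy) / (m * sxx - sx * sx)   # slope: parallel work
    a = (sy - b * sx) / m                           # intercept: serial + overhead
    return a, b

# Hypothetical per-MD-step timings (seconds), consistent with the quoted fit.
measured = {2: 49.35, 3: 36.87, 4: 30.63, 6: 24.38}
a, b = fit_amdahl(measured)
print(f"A = {a:.1f} s (serial + overhead), B = {b:.1f} s (parallel work)")
```

Separating A into its serial and overhead components additionally requires the measured serial run (83 s, which carries no parallel overhead).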
In the parallel regime, the per-node memory requirements are given by:

  M(n) = M_private + M_overhead + M_distributed/n

where M_private is the amount of memory holding data private to a given node (7Mb), M_distributed is the amount of memory distributed between the nodes (224Mb), and M_overhead is the parallel overhead (9Mb). Running this job on six nodes thus reduces the per-node memory requirements to just 53Mbytes.

[Plot: per-node memory usage as a function of the number of nodes.]

Parallel PAW was used to run jobs requiring almost 3Gbytes of memory on Cobalt, even though no Cobalt node has more than 512Mb of memory installed.

Summary

We have described the construction of the Cobalt cluster, a uniquely powerful and inexpensive dedicated computational chemistry resource. With a per-node construction cost typical of high-end PCs, Cobalt provides supercomputer levels of performance on several quantum-chemical applications. Multiple nodes can be utilized in parallel, resulting in increased throughput and reduced wall-clock execution time. Tens of nodes can be utilized efficiently for a single large DFT calculation using ADF.

For further information on Cobalt hardware and software, visit the Cobalt home page at http://www.cobalt.chem.ucalgary.ca

Credits

Financial support for the construction of the Cobalt cluster was provided by:
• Canada Foundation for Innovation (CFI)
• Alberta Intellectual Infrastructure Partnership program (AIIP)
• Department of Chemistry of the University of Calgary
• Scientific Chemistry Simulations Inc., Netherlands
• Mitsui Chemicals
• Nova Chemicals

References and further reading

SPEC: SpecFp95 and SpecInt95 benchmark results are available on the web site of the Standard Performance Evaluation Corp.
(SPEC) at http://www.specbench.org

Dell: Prices and system specifications of Dell workstations were taken from the Dell Canada web site at http://www.dell.ca

3COM: Technical specifications of the 3COM fast Ethernet switches are available on the 3COM web site at http://www.3com.com

Lmbench: Larry McVoy's Lmbench microbenchmark suite was downloaded from the Bitmover web site at http://www.bitmover.com/lmbench/

Clusters: Greg Pfister's In Search of Clusters, 2nd edition, published by Prentice Hall in 1998, is the definitive guide to clusters.

ADF: Additional information on the Amsterdam Density Functional code is available on the web site of Scientific Computing and Modeling at http://www.scm.com

PAW: Additional information on the PAW first-principles MD code is available on the Cobalt web site at http://www.cobalt.chem.ucalgary.ca/paw/

Gaussian: See the Gaussian Inc. web site at http://www.gaussian.com/