Chemical Supercomputing on the Cheap:
94 GFlops computer system at cdn$3,680/gigaflop
S. Patchkovskii, R. Schmid, and T. Ziegler
Department of Chemistry, University of Calgary,
2500 University Dr. NW, Calgary, Alberta,
T2N 1N4 Canada
Chemical Supercomputing on the Cheap, CSC’82, 1999
Introduction
Accurate quantum-chemical modeling of systems of chemical interest is extremely
computationally intensive and requires substantial amounts of memory and secondary
storage. This has traditionally consigned first-principles calculations of chemical
properties to large (and expensive) vector and parallel computers, thus placing them out
of reach of most practical chemists.
With the ever-increasing computational power of low-end workstations and commodity
PCs, it is now possible to perform useful quantum-chemical calculations on inexpensive
off-the-shelf hardware. Widely available and robust local area network (LAN)
technologies, such as switched 100Mbit/second Ethernet, can be used to combine
multiple workstations into a larger parallel system, providing supercomputer-level
performance at a favorable price/performance ratio.
In this poster, we describe the COBALT cluster (Computers On Benches All Linked
Together) - a chemically oriented supercomputer built in our research group at the
University of Calgary.
Cobalt hardware: Nodes
A node of the Cobalt cluster is a Compaq/Digital Personal Workstation model 500au.
Each workstation is configured with:
CPU: Alpha 21164A, 500MHz
Cache: 96Kb on-chip (L1 and L2)
Memory: 64Mb to 512Mb
Disk: 4Gb, 7200RPM UltraSCSI
CD-ROM: 24x
Graphics: none
Network: 100Mbit/sec Ethernet, full duplex
Peak Flops: 10^9 Flop/second
SpecInt 95: 15.7(*)
SpecFP 95: 19.5(*)
Average price: cdn$3,468 (purchased between March 1998 and February 1999)
For comparison, a top-of-the-line 550MHz Intel Xeon workstation with 512Kb of L2
cache achieves 24.4 SpecInt 95 and 17.1 SpecFP 95 and costs about cdn$4400 from Dell
(May 1999).
(*) SpecInt and SpecFP values are estimated from published results for a 500au system with 2Mb L3 cache.
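The comparison can be made concrete by dividing each price by each benchmark score. The short sketch below does only that arithmetic with the figures quoted above; the "price per benchmark unit" ratio and the function name are illustrative choices rather than metrics used on the poster, and the Alpha scores are the estimated values flagged by the footnote.

```python
# Prices and scores as quoted on the poster; the ratio is an illustrative
# metric introduced here, not one reported by the authors.
systems = {
    "Alpha 500au (Cobalt node)": {"price_cdn": 3468.0, "specint95": 15.7, "specfp95": 19.5},
    "Intel Xeon 550MHz (Dell)": {"price_cdn": 4400.0, "specint95": 24.4, "specfp95": 17.1},
}


def price_per_unit(price_cdn, score):
    """Return the price in cdn$ per unit of a benchmark score."""
    return price_cdn / score


for name, s in systems.items():
    fp = price_per_unit(s["price_cdn"], s["specfp95"])
    integer = price_per_unit(s["price_cdn"], s["specint95"])
    print(f"{name}: cdn${fp:.0f} per SpecFP 95 unit, cdn${integer:.0f} per SpecInt 95 unit")
```

On these numbers the Alpha node is cheaper per unit of floating-point performance, which is the figure that matters most for quantum-chemical codes, while the Xeon is cheaper per unit of integer performance.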
Cobalt hardware: Network
Cobalt nodes communicate through a dedicated 96-port full-duplex 100BaseTx Ethernet
switch, constructed from four 24-port 3COM SuperStack II 3300 switches linked by a
matrix module.

Bandwidth (Mbytes/second)
Peak bisection: 25.00
Peak aggregate: 125.00
Bisection (TCP): 11.24
NFS read: 3.39
NFS write: 4.14
Local read: 10.11
Local write: 5.58

Latency (microseconds)
Round-trip (TCP): 360
Round-trip (UDP): 354

Total cost: cdn$13,500

Latency and bandwidth were measured with Larry McVoy’s Lmbench on otherwise idle nodes.
Cobalt hardware: The Cluster
[Figure: Cobalt cluster layout. Diagram labels: World; 100BaseTx (half-duplex); Switch; Node 1 to Node 93; 93x100BaseTx; 2x100BaseTx; 128Mb memory; 18Gbytes RAID-1 (4 spindles).]
RAID, assembly and miscellaneous costs: cdn$6,500
Cobalt: system software
Category: Operating system
Software: Compaq/Digital Tru64 v.4.0D; remote boot setup
Source: Bundled with hardware

Category: Single system image (SSI) support
Software: NIS/YP and NFS; all scratch space and user files can be accessed from any node using cluster-wide names
Source: Bundled with hardware

Category: Compilers
Software: C, C++, Fortran (-77, -90, and -95)
Source: Campus Software Licence Grant (CSLG)

Category: Batch queuing system
Software: DQS 3.2.3; supports process quotas, resource-based scheduling and parallel jobs
Source: Freely available

Category: Parallel programming interfaces
Software: Parallel virtual machine (PVM); Message passing interface specification (MPI) - MPICH
Source: Freely available
Cobalt: Single system image
Single system image (SSI), the ability of a group of computers to present the illusion of a
single large computer system, is considered the defining characteristic of clusters. To
have a usability advantage over a pile of individual computers, a cluster must
provide its users with an SSI covering most of the users’ problem areas. Cobalt nodes
present the illusion of a single computer in several important aspects, namely:
Global user information
All login names and passwords are provided by the Cobalt master through
the Network Information System (NIS), and are identical on all Cobalt nodes.
Node transparency
All nodes are identically configured, and run the same release of the
operating system. All software packages are available on all nodes, so that
users can perform any task on any physical node.
Global name space
All user data and scratch areas are physically distributed between the
nodes for improved performance and reliability, but are accessible on any
node using identical (global) names. Translation between the logical
global and physical names is handled by the automounter service.
Load balancing
The queuing system automatically assigns user jobs to the least loaded
cluster nodes (subject to resource requirements), providing optimal
performance and resource utilization.
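The load-balancing rule can be pictured as "filter by resources, then take the minimum load". The sketch below is a generic illustration of that idea only: it is not DQS code, and the `Node` fields, the load metric and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str               # hypothetical node label
    load: float             # hypothetical load metric, e.g. run-queue length
    free_memory_mb: int     # memory currently available on the node


def pick_node(nodes: List[Node], required_memory_mb: int) -> Optional[Node]:
    """Return the least loaded node that satisfies the memory requirement."""
    eligible = [n for n in nodes if n.free_memory_mb >= required_memory_mb]
    if not eligible:
        return None  # no node can run the job right now; it would stay queued
    return min(eligible, key=lambda n: n.load)


cluster = [
    Node("node01", load=0.1, free_memory_mb=64),
    Node("node02", load=0.9, free_memory_mb=512),
    Node("node03", load=0.4, free_memory_mb=256),
]
chosen = pick_node(cluster, required_memory_mb=200)
print(chosen.name if chosen else "job stays queued")
```

Here the idle node01 is skipped because it lacks memory, so the job lands on node03, the least loaded of the nodes that qualify.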
Cobalt: application software
Package: Amsterdam Density Functional (ADF)
Source: Co-developed at the UofC and Vrije University (Netherlands)
Parallel: Yes, PVM or MPI

Package: Projector-Augmented Plane Wave (PAW) first-principles molecular dynamics
Source: Co-developed at the UofC and IBM research laboratories in Zürich
Parallel: Yes, MPI

Package: Gaussian-94 and Gaussian-98
Source: Under site licence at the Chemistry department, no additional cost
Parallel: No(*)

Package: Visualization (Xmol, Rasmol, and Viewkel)
Source: Freely available on the Internet
Parallel: N/A

Package: Mopac 6 and 7
Source: Freely available on the Internet
Parallel: No

(*) Gaussian supports cluster environments with Network Linda - an extra-cost package that is not available on Cobalt.
Cobalt: total cost
Cluster nodes, 94xDPW 500au: $ 325,992.00
Network interconnect: $ 13,500.00
Assembly and miscellaneous: $ 6,500.00
System software: $ -
Application software: $ -
Tips to the system administrator: $ 8.00
Total: $ 346,000.00
Per 1Gflops of peak performance: $ 3,680.85
The complete per-node construction price, including all hardware and
software (about cdn$3,681 per node), is thus substantially lower than the
retail price of a comparably equipped PC.
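The totals can be checked with a few lines of arithmetic. The sketch below uses the node count, average node price and peak rate quoted on the poster; the variable names are illustrative, and the software entries are left out because no charge is listed for them.

```python
NODES = 94                   # number of DPW 500au workstations
NODE_PRICE_CDN = 3468.00     # average price per node
PEAK_GFLOPS_PER_NODE = 1.0   # 10^9 Flop/second peak per node

costs_cdn = {
    "cluster nodes": NODES * NODE_PRICE_CDN,
    "network interconnect": 13500.00,
    "assembly and miscellaneous": 6500.00,
    "tips to the system administrator": 8.00,
}

total_cdn = sum(costs_cdn.values())
peak_gflops = NODES * PEAK_GFLOPS_PER_NODE
cost_per_gflop = total_cdn / peak_gflops

print(f"Total: cdn${total_cdn:,.2f}")
print(f"Peak performance: {peak_gflops:.0f} GFlops")
print(f"Per GFlops of peak performance: cdn${cost_per_gflop:,.2f}")
```

Because each node peaks at 1 GFlops, the cost per GFlops and the all-inclusive cost per node are the same number.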
Running ADF in parallel on Cobalt
ADF has been parallelized at the Vrije University in Amsterdam, and can utilize either
the MPI or PVM message passing libraries. Parallelization has been performed only for the
computationally intensive parts of the program (numerical integration and density
fitting). All relatively inexpensive parts of the calculation are repeated on all
participating nodes, greatly reducing the amount of data that has to be communicated
over the network. In a typical ADF run, the nodes have to synchronize only once per
SCF cycle or gradient calculation.
[Figure: computation and communication over time on Nodes 1 to 3, comparing ADF with a classical parallel application.]
We illustrate the parallel performance of ADF for a full geometry optimization of
nitridoporphyrinatochromium(V), a medium-sized molecule with 38 atoms. This
calculation used a polarized triple-ζ basis set on all atoms, resulting in 580 basis
functions. The molecule was constrained to C4v symmetry.
[Figure: structure of nitridoporphyrinatochromium(V).]
For this system, a serial calculation takes 683 minutes on a single Cobalt node (using
45Mb of memory and about 100Mbytes of disk space). For the parallel runs, the
execution time can be approximated by Amdahl’s law:
\[ T(n) = T_{\mathrm{serial}} + \frac{T_{\mathrm{parallel}} + T_{\mathrm{overhead}}}{n} \]
where T_serial is the inherently serial part of the calculation (21 minutes),
T_parallel is the parallel part of the calculation (662 minutes), and T_overhead is
the parallel overhead (103 minutes).
[Figure: speedup versus number of nodes.]
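To get a feel for what the fit implies, the model can be evaluated at a few node counts. The sketch below assumes the form T(n) = T_serial + (T_parallel + T_overhead)/n for parallel runs and treats the single-node case as the plain serial run (T_serial + T_parallel = 683 minutes, with no parallel overhead). Both the function names and that single-node convention are assumptions made for illustration; the printed values come from the fitted model, not from measurements.

```python
# ADF fit parameters quoted above, in minutes.
T_SERIAL = 21.0
T_PARALLEL = 662.0
T_OVERHEAD = 103.0


def run_time(n, t_serial=T_SERIAL, t_parallel=T_PARALLEL, t_overhead=T_OVERHEAD):
    """Modelled wall-clock time on n nodes (assumed Amdahl-type form)."""
    if n < 1:
        raise ValueError("n must be a positive node count")
    if n == 1:
        # Assumption: a serial run pays no parallel overhead.
        return t_serial + t_parallel
    return t_serial + (t_parallel + t_overhead) / n


def speedup(n):
    """Speedup relative to the serial run."""
    return run_time(1) / run_time(n)


for n in (1, 2, 4, 8, 16, 32):
    print(f"n={n:2d}  T={run_time(n):7.1f} min  speedup={speedup(n):5.2f}")
```

Under these assumptions the modelled speedup keeps growing into the tens of nodes, because the inherently serial 21 minutes is small compared with the total work.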
Running PAW in parallel on Cobalt
As a parallel application, PAW is the exact opposite of ADF. Computationally, it is
dominated by fast Fourier transforms (FFTs), which place a heavy demand on both the
inter-node bandwidth and round-trip latency. When running on n nodes, the parallel FFT
algorithm used in PAW needs to exchange all Fourier coefficients on each node (which
can easily require several hundred megabytes of storage) n times during each molecular
dynamics (MD) step, resulting in heavy communications traffic.
[Figure: data exchange between Node 1, Node 2 and Node 3.]
In a typical parallel PAW run on Cobalt, the full-duplex 100Mbit/second communication
links between the nodes and the central switch continuously run at over 20% utilization
(or more than 2.5Mbytes/second) in each direction. In a sense, Cobalt nodes and the
communication network are perfectly matched for PAW runs: faster CPUs would
have made the communication network choke on the data, while a slower communication
network would have been unable to keep the CPUs busy.
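The utilization figure converts to a data rate by simple arithmetic: a 100Mbit/second link carries at most 12.5Mbytes/second in each direction, and 20% of that is 2.5Mbytes/second. A minimal sketch of the conversion, with an illustrative function name and ignoring protocol overhead:

```python
def link_rate_mbytes_per_s(link_mbit_per_s, utilization):
    """Data rate in Mbytes/second for a link at the given fractional utilization."""
    bits_per_byte = 8.0
    return link_mbit_per_s / bits_per_byte * utilization


peak = link_rate_mbytes_per_s(100.0, 1.0)
typical = link_rate_mbytes_per_s(100.0, 0.20)
print(f"Peak per direction: {peak:.1f} Mbytes/second")
print(f"At 20% utilization: {typical:.1f} Mbytes/second")
```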
To illustrate the performance of parallel PAW on Cobalt, consider an SN2 substitution
reaction between CH3I and [Rh(CO)2I2]-. This medium-size simulation was performed in
an 11Å periodic cell. In a serial run, a single time step requires about 83 seconds; a
complete simulation consists of several thousand steps. Fitting the execution times
measured at different node counts to Amdahl’s law gives (all times in seconds):
T_serial = 7.9
T_parallel = 74.9
T_overhead = 4.0
Unlike the ADF case, where the inherently serial part constitutes less than 3% of the
total work, PAW spends almost 10% of the total time in the serial section. As a
consequence, PAW cannot efficiently utilize more than four Cobalt nodes for this
simulation.
[Figure: speedup versus number of nodes.]
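One way to see where "no more than four nodes" comes from is to look at parallel efficiency, defined as speedup divided by node count. The sketch below assumes the model T(n) = T_serial + (T_parallel + T_overhead)/n with the serial run taken as T_serial + T_parallel. The 70% efficiency threshold is an arbitrary illustrative cutoff, not a criterion stated on the poster, and the function names are hypothetical.

```python
# PAW fit parameters quoted above, in seconds per MD step.
T_SERIAL = 7.9
T_PARALLEL = 74.9
T_OVERHEAD = 4.0


def step_time(n):
    """Modelled time per MD step on n nodes (assumed Amdahl-type form)."""
    if n == 1:
        return T_SERIAL + T_PARALLEL  # assumption: no overhead in a serial run
    return T_SERIAL + (T_PARALLEL + T_OVERHEAD) / n


def efficiency(n):
    """Parallel efficiency: speedup divided by the number of nodes."""
    return step_time(1) / step_time(n) / n


def max_efficient_nodes(threshold, max_nodes=94):
    """Largest node count whose modelled efficiency is at least `threshold`."""
    best = 1
    for n in range(1, max_nodes + 1):
        if efficiency(n) >= threshold:
            best = n
    return best


for n in (2, 4, 6, 8):
    print(f"n={n}  step={step_time(n):5.1f} s  efficiency={efficiency(n):.2f}")
print("Largest n with at least 70% efficiency:", max_efficient_nodes(0.70))
```

In the limit of many nodes the modelled speedup cannot exceed (T_serial + T_parallel)/T_serial, or about 10.5, which is why adding nodes beyond a handful buys little for this simulation.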
Molecular dynamics calculations in PAW are frequently limited by the amount of
memory required to perform the calculation rather than by the simulation time. In
parallel mode, PAW can significantly reduce its per-node memory requirements by
distributing both the real-space and Fourier-space grids between the nodes. Since the
size of the grids grows with the unit cell size R as O(R^3), they dominate PAW memory
requirements for all but the smallest systems.
In the CH3I and [Rh(CO)2I2]- system, memory requirements in serial mode are
relatively modest at 231 megabytes. In the parallel regime, per-node memory
requirements are given by:
\[ M(n) = M_{\mathrm{private}} + M_{\mathrm{overhead}} + \frac{M_{\mathrm{distributed}}}{n} \]
where M_private is the amount of memory holding data private to a given node (7Mb),
M_distributed is the amount of memory divided between the nodes (224Mb), and
M_overhead is the parallel overhead (9Mb). Running this job on six nodes thus reduces
the per-node memory requirements to just 53Mbytes.
[Figure: per-node memory usage versus number of nodes.]
Parallel PAW was used to run jobs requiring almost 3Gbytes of memory on Cobalt - even
though no Cobalt node has more than 512Mb of memory installed in it.
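The six-node figure follows directly from the memory model M(n) = M_private + M_overhead + M_distributed/n. A minimal sketch, using the quoted parameters and assuming that a serial run needs only M_private + M_distributed (231Mb, with no parallel overhead); the function name is illustrative:

```python
# PAW memory parameters quoted above, in Mbytes.
M_PRIVATE = 7.0        # data private to each node
M_DISTRIBUTED = 224.0  # grid data divided between the nodes
M_OVERHEAD = 9.0       # parallel overhead per node


def per_node_memory_mb(n):
    """Modelled per-node memory requirement on n nodes."""
    if n < 1:
        raise ValueError("n must be a positive node count")
    if n == 1:
        # Assumption: a serial run carries no parallel overhead.
        return M_PRIVATE + M_DISTRIBUTED
    return M_PRIVATE + M_OVERHEAD + M_DISTRIBUTED / n


for n in (1, 2, 4, 6, 8):
    print(f"n={n}  per-node memory = {per_node_memory_mb(n):6.1f} Mbytes")
```

The private and overhead terms (16Mb together) set a floor that no number of nodes can remove, but for larger cells the distributed grid term dominates and scales down almost linearly with node count.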
Summary
We have described the construction of the Cobalt cluster - a uniquely powerful and
inexpensive dedicated computational chemistry resource. With a per-node construction
cost typical of high-end PCs, Cobalt provides supercomputer-level performance on several
quantum-chemical applications. Multiple nodes can be utilized in parallel, resulting in
increased throughput and reduced wall-clock execution time. Tens of nodes can be utilized
efficiently for a single large DFT calculation using ADF.
For further information on Cobalt hardware and software, visit the Cobalt home page at
http://www.cobalt.chem.ucalgary.ca
Credits
Financial support for the construction of the Cobalt cluster was
provided by:
• Canada Foundation for Innovation (CFI)
• Alberta Intellectual Infrastructure Partnership program (AIIP)
• Department of Chemistry of the University of Calgary
• Scientific Chemistry Simulations Inc., Netherlands
• Mitsui Chemicals
• Nova Chemicals
References and further reading
SPEC
SpecFP95 and SpecInt95 benchmark results are available on the web site of the
Standard Performance Evaluation Corp. (SPEC) at http://www.specbench.org
Dell
Prices and system specifications of the Dell workstation were taken from the Dell
Canada web site at http://www.dell.ca
3COM
Technical specifications of the 3COM fast Ethernet switches are available on
the 3COM web site at http://www.3com.com
Lmbench
Larry McVoy’s Lmbench microbenchmark suite was downloaded from the
Bitmover web site at http://www.bitmover.com/lmbench/
Clusters
Greg Pfister’s In Search of Clusters, 2nd edition, published by Prentice Hall in
1998, is the definitive guide to clusters
ADF
Additional information on the Amsterdam density functional code is available
on the web site of Scientific Computing and Modeling at http://www.scm.com
PAW
Additional information on the PAW first-principles MD code is available on the
Cobalt web site at http://www.cobalt.chem.ucalgary.ca/paw/
Gaussian
See the Gaussian Inc. web site at http://www.gaussian.com/