Parallel Computing in Chemistry
Brian W. Hopkins
Mississippi Center for Supercomputing Research
4 September 2008

What We’re Doing Here
• Define some common terms of parallel and high-performance computing (HPC).
• Discuss HPC concepts and their relevance to chemistry.
• Discuss the particular needs of various computational chemistry applications and methodologies.
• Briefly introduce the systems and applications in use at MCSR.

Why We’re Doing It
• Increasing your research throughput
  – More Data
  – Less Time
  – More Papers
  – More Grants
  – &c.
• Stop me if there are questions!

Some Common Terms
• Processors: x86, ia64, em64t, &c.
• Architectures: distributed vs. shared memory.
• Parallelism: message-passing vs. shared-memory parallelism.

Processor Types: Basics
• Integer bit length: the number of individual binary “bits” used to store an integer in memory.
  – 32-bit systems: 32 bits define an integer (2^32 distinct values)
    • Signed: −2,147,483,648 to +2,147,483,647
    • Unsigned: 0 to +4,294,967,295
  – 64-bit systems: 64 bits define an integer (2^64 distinct values)
    • Signed: −9,223,372,036,854,775,808 to +9,223,372,036,854,775,807
    • Unsigned: 0 to +18,446,744,073,709,551,615
• Because integers are used for all kinds of things (memory addresses and array indices among them), integer bit length is an important constraint on what a processor can do.

Processor Types: In Use Today
• x86: Pentium, &c.
  – The most common microprocessor family in history.
  – Technically includes 16- and 64-bit processors, but “x86” is most commonly used to describe 32-bit systems.
• em64t: Athlon 64, Opteron, Xeon, Core 2
  – A 64-bit extension to the x86 family of processors.
  – Sometimes referred to as x86-64 or amd64.
• ia64: Itanium, Itanium2
  – A different 64-bit architecture from Intel that is not x86-compatible.
• mips: R12000, R14000
  – 64-bit RISC-type processors used in SGI supercomputers

SUPER-computing
• Modern supercomputers typically consist of a very large number of off-the-shelf microprocessors hooked together in one or more ways.
  – clusters
  – symmetric multiprocessors
• All commonly used processor types are in use in current “super”-computers.

Supercomputers: Clusters
• The easiest and cheapest way to build a supercomputer is to hook many individual computers together with network cables.
• These supercomputers tend to have “distributed memory”, meaning each processor principally works with a memory bank located in the same case.

Clusters: The Interconnect
• With a cluster-type supercomputer, the performance of the network connecting the nodes is critically important.
• Network fabrics can be optimized for either latency or bandwidth, but not both.
• Due to the special needs of high-performance clusters, a number of special networking technologies have been developed for these systems
  – hardware and software together
  – Myrinet, Infiniband, &c.

Supercomputers: SMP Systems
• Alternatively, it is possible to custom-build a computer case and circuitry to hold a large number of processors.
• These systems are distinguished by having “shared” memory, meaning that all or many of the processors can access a huge shared pool of memory.

Supercomputers: Hybrid Systems
• Many modern supercomputers consist of a high-speed cluster of SMP machines.
• These systems contain large numbers of distributed-memory nodes, each of which has a set of processors using shared memory.

Parallel Applications
• Each type of supercomputer architecture presents its own unique programming challenges.
• Cluster systems require parallel programs to call special message-passing routines to communicate information between nodes.
• Symmetric multiprocessing systems require special care to prevent each spawned thread from altering or deleting data that is needed by the other threads.
• Hybrid systems typically require both of these.

Parallel Programming for Clusters
• The nodes in a cluster talk to each other via a network connection.
• Because the network connection is a bottleneck, it is important to minimize the internode communication required by the program.
• Most commonly, programs use a message-passing library for communication between nodes.

Message Passing
• Cluster computing is extremely popular.
• To streamline programming for these architectures, various libraries of standard message-passing functions have been developed.
  – MPI
  – TCGMSG
  – Linda, &c.
• These vary in portability and flexibility.

The Ubiquitous MPI
• The most popular message-passing library is the Message Passing Interface (MPI).
  – AMBER
• Most companies marketing supercomputers also market their own versions of MPI specially tuned to maximize performance of the systems they sell.
  – SGI’s MPT
  – IBM MPI
  – HP-MPI
• In principle, MPI functions are standardized, so any program built to work with Intel MPI should also work with HP-MPI. In practice…not so much.
• In addition, an open-source, portable MPI is available through the MPICH project.

TCGMSG
• The TCGMSG library is a stripped-down message-passing library designed specifically for the needs of computational chemistry applications.
  – NWChem
  – Molpro
• TCGMSG is generally more efficient than MPI for the specific internode communication operations most common in comp-chem programs.
• However, from a programmer’s point of view TCGMSG is less flexible and capable than MPI.
• There is some chatter about TCGMSG falling into disuse and becoming a legacy library.

A Word On Linda et al.
• Some software vendors want extra money to use their software on a cluster rather than a single machine.
• The surest way to get that money is to build your very own MP library, build the parallel version of your code to work only with it, and then sell the MP library for extra $$.
• Hence, Linda and similar MP interfaces.
• The nature of these libraries is such that you’re unlikely to need to code with them, and can expect system administrators to build and link them as needed.
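
A Message-Passing Sketch
The message-passing pattern described in the slides above can be sketched in a few lines. This is only an illustration, not MPI: Python threads stand in for cluster nodes and a queue stands in for the network, and the names `worker` and `parallel_sum_of_squares` are hypothetical. A real cluster code would make the equivalent calls (such as MPI_Send and MPI_Recv) through a message-passing library. The point is that each "node" works only on its own chunk of data and sends a single small message back, which keeps internode communication to a minimum.

```python
import queue
import threading


def worker(rank, chunk, to_root):
    """Hypothetical 'node': works only on its own chunk of the data."""
    partial = sum(x * x for x in chunk)
    # One small message per worker, no matter how large the chunk is.
    to_root.put((rank, partial))


def parallel_sum_of_squares(data, n_workers):
    """Split data across n_workers 'nodes' and combine their partial sums."""
    to_root = queue.Queue()  # stands in for the network link to the root node
    chunks = [data[i::n_workers] for i in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(rank, chunks[rank], to_root))
        for rank in range(n_workers)
    ]
    for t in threads:
        t.start()

    total = 0
    for _ in range(n_workers):
        _rank, partial = to_root.get()  # blocking receive on the root
        total += partial

    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10)), 3))
```

The root receives exactly `n_workers` messages regardless of how much data each worker processed. Codes that instead ship intermediate data back and forth at every step are the ones that suffer on a slow interconnect.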

Programming for SMP Systems
• There are different approaches to programming for these machines:
  – Multithreaded programs
  – Message passing
• The most common approach to SMP programming is the OpenMP interface.
  – Watch out! “MP” here stands for “multiprocessing,” not “message passing.”

Programming for Hybrid Systems
• Hybrid systems have characteristics of both clusters and SMP systems.
• The most efficient programs for hybrid systems have a hybrid design.
  – multithreaded or OpenMP parallelism within a node
  – message-passing parallelism between nodes
• Although hybrid systems are rapidly increasing in popularity, true hybrid programming remains very rare.
• Consequently it’s common for coders working on hybrid systems to use a pure message-passing approach.

Parallel Efficiency
• Whenever you use multiple procs instead of one, you incur two problems:
  – Parts of a calculation cannot be parallelized.
  – The processors must perform extra communication tasks.
• As a result, most calculations will require more resources to do in parallel than in serial.

Practical HPC, or, HPC and You
• Computational chemistry applications fall into four broad categories:
  – Quantum Chemistry
  – Molecular Simulation
  – Data Analysis
  – Visualization
• Each of these categories presents its own challenges.

Quantum Chemistry
• Lots of quantum chemistry at UM and around the state
  – UM: Tschumper, Davis, Doerksen, others
  – MSU: Saebo, Gwaltney
  – JSU: Leszczynski and many others.
• Quantum programs are the biggest consumer of resources at MCSR by far:
  – Redwood: 99% (98 of 99 jobs)
  – Mimosa: 100% (86 of 86 jobs)
  – Sweetgum: 100% (24 of 24 jobs)

Features of QC Programs
• Very memory intensive
  – Limits of 32-bit systems will show
  – Limits on total memory in a box will also show
• Very cycle intensive
  – Clock speed of procs is very important
  – Fastest procs at MCSR are in Mimosa; use them!
• Ideally not very disk intensive
  – Watch out! If there’s a memory shortage, QC programs will start building read/write files.
  – I/O will absolutely kill you.
  – To the extent that I/O cannot be avoided, do NOT do it over a network.

Project-Level Parallelism
• Quantum projects typically involve many dozens to thousands of individual calculations.
• Typically, the most efficient way to finish 100 jobs on 100 available processors is to run 100 1-proc jobs simultaneously.
• Total wall time needed increases with every increase in individual job parallelism:
  – 100x1 < 50x2 < 25x4 < 10x10 < 5x20 < 1x100

When to Parallelize a Job
• Some jobs simply will not run in serial in any useful amount of time.
• These jobs should be parallelized as little as possible.
• In addition, jobs that require extensive parallelization should be run using special high-performance QC programs.

Implications for Gaussian 03
• Do not allow G03 to choose its own calculation algorithms.
  – MP2(fulldirect), SCF(direct)
• Jobs that will run on Mimosa ought to.
• Jobs that will run in serial ought to.
• Use local disk for unavoidable I/O.
  – Copy files back en masse after a job.
• For truly intensive jobs, consider moving to another program suite.
  – We’re here to help!

Molecular Simulation
• Simulation is not as common at MCSR as quantum chemistry.
• Still, some simulation packages are available
  – AMBER
  – NWChem
  – CPMD
  – CP2K
• And some scientists here do use them
  – MedChem, Wadkins, &c.

Features of MD Programs
• At their core, MD programs perform a single, simple calculation millions of times.
• As a result, MD programs consume a LOT of clock cycles.
• These programs must constantly write large molecular configurations to output, and so are often I/O restricted.
• Memory load from ordinary MD programs is trivial.
• Communication load can be quite high: long-range electrostatics.
• Almost all MD programs are MPI-enabled.

Parallelizing Simulations
• MD simulations generally parallelize much more readily than QC jobs.
• Also, individual MD sims tend to take much longer than individual QC jobs: 1 sim vs. 100s of EPs.
• As a result, simulations are almost always run in parallel.

Notes on Parallel Simulations
• Scale better with increasing size
• Highly program- and methods-dependent
• Because jobs are long, pre-production profiling is essential.
• Help is available!

Data Analysis
• We computational chemists typically write our own data analysis programs.
• These can be constructed to run in parallel using MPI, OpenMP, or both.
• For your data analysis needs, compilers and libraries are available on all of our systems.
• MCSR consultants will work with you to help build and debug parallel analysis codes.

Visualization
• We have some molecular visualization software available and will help you with it.
• However, one of the best programs is free and will usually run better locally.
  – VMD

MCSR Systems: A Snapshot
• Redwood
  – SGI Altix SMP system
  – Fast, 64-bit IA64 processors
  – Very busy = long queue times
• Mimosa
  – Linux cluster w/ distributed memory
  – Very fast 32-bit x86 processors
  – Less busy = shorter queue times
• Sweetgum
  – SGI Origin SMP system
  – Slow 64-bit MIPS processors
  – Mostly unused = no queue time at all
• Sequoia
  – SGI Altix XE hybrid cluster
  – Very fast, multicore x86-64 processors
  – Not open for production yet, but soon

Seminars to Come
• Chemistry in Parallel Computing
• Building Computational Chemistry Applications
• Using Gaussian 03
• Using AMBER
• Using NWChem
• Whatever you need!

Packages available at MCSR
• Gaussian (our most popular)
• NWChem (flexible; better for parallel jobs)
• MOLPRO (high-tech code for high-accuracy electronic structure calculations)
• MPQC (very scalable code with limited methodologies)
• AMBER (popular MD code)
• CPMD, CP2K (ab initio MD codes for specialists)
• Anything you need (within budget)
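
A Worked Example: Project-Level Parallelism
Returning to the Project-Level Parallelism slide, the ordering 100x1 < 50x2 < … < 1x100 can be reproduced with a very rough model. This is a sketch under stated assumptions, not a measurement: it assumes every job takes the same serial time, that a fixed fraction of each job cannot be parallelized, and that communication costs nothing (real communication overhead would only make the parallel cases worse). The function names, the 5% serial fraction, and the 1-hour serial job time are all hypothetical.

```python
import math


def job_time(t_serial, procs, serial_fraction):
    """Wall time of one job on `procs` processors.

    Assumption: only the parallelizable part of the job speeds up;
    communication overhead is ignored.
    """
    return t_serial * (serial_fraction + (1.0 - serial_fraction) / procs)


def project_wall_time(n_jobs, total_procs, procs_per_job, t_serial, serial_fraction):
    """Wall time to finish n_jobs when each job uses procs_per_job processors."""
    concurrent = total_procs // procs_per_job  # jobs that fit on the machine at once
    batches = math.ceil(n_jobs / concurrent)   # rounds needed to finish every job
    return batches * job_time(t_serial, procs_per_job, serial_fraction)


if __name__ == "__main__":
    # Hypothetical project: 100 one-hour jobs, 100 processors, 5% serial work.
    for p in (1, 2, 4, 10, 20, 100):
        hours = project_wall_time(100, 100, p, 1.0, 0.05)
        print(f"{100 // p:3d} jobs at a time x {p:3d} procs each: {hours:.2f} h")
```

In this model any nonzero serial fraction makes total wall time grow with the number of processors per job, which is why a job should be parallelized only when it will not finish in serial in a useful amount of time.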