Simulating Life at the Atomic Scale James Phillips Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/ NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Beckman Institute University of Illinois at Urbana-Champaign Theoretical and Computational Biophysics Group NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC NAMD: Scalable Molecular Dynamics 2002 Gordon Bell Award ATP synthase PSC Lemieux Blue Waters Target Application Illinois Petascale Computing Facility 37,000 Users, 1700 Citations Computational Biophysics Summer School GPU Acceleration NVIDIA Tesla NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ NCSA Lincoln Beckman Institute, UIUC Computational Microscopy Ribosome: synthesizes proteins from genetic information, target for antibiotics Silicon nanopore: bionanodevice for sequencing DNA efficiently NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Molecular Mechanics Force Field NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Classical Molecular Dynamics Energy function: used to determine the force on each atom: Newton’s equation represents a set of N second order differential equations which are solved numerically via the Verlet integrator at discrete time steps to determine the trajectory of each atom. Small terms added to control temperature and pressure. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Biomolecular Time Scales Motion Bond stretching Time Scale (sec) 10-14 to 10-13 Elastic vibrations 10-12 to 10-11 Rotations of surface sidechains 10-11 to 10-10 Hinge bending 10-11 to 10-7 Max Timestep: 1 fs Rotation of buried side 10-4 to 1 sec chains Allosteric transistions 10-5 to 1 sec Local denaturations 10-5 to 10 sec NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Sizes of Simulations Over Time BPTI 3K atoms Estrogen Receptor 36K atoms (1996) NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ ATP Synthase 327K atoms (2001) Beckman Institute, UIUC Our Solution: Parallel Computing HP 735 cluster 14 processors (1994) SGI Origin 2000 128 processors (1997) PSC Lemieux AlphaServer SC 3000 processors (2002) NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC NAMD Parallel Scaling Snapshot !"" +,-+!./ 012.3 4' +,-+!.5 678.9 8:8;= ApoA1: 92K atoms > 4?@./ 012.3 4' > 4?@.5 678.9 8:8;< > 4?@.5 678.9 8:8;= !" STMV: 1M atoms )#*&$ !&)$' $!(# '"(& #"'$ !"#' %!# #%& ! !#$ ns/day +,-+!.5 678.9 8:8;< number of cores NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Parallel Programming Lab University of Illinois at Urbana-Champaign Siebel Center for Computer Science NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Develop abstractions in context of full-scale applications Protein Folding NAMD: Molecular Dynamics Quantum Chemistry (QM/MM) Computational Cosmology STM virus simulation Parallel Objects, Adaptive Runtime System Libraries and Tools Rocket Simulation Dendritic Growth Crack Propagation Space-time meshes The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC TCBG Experimental Collaborations • Nearly every collaboration relies on NAMD. • High-end simulations push scaling efforts. – Try to anticipate needs: Million-atom virus just worked. • Innovative simulations generate feature requests: What is science goal? Existing features usable? Find a scalable method. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Make it general purpose. Beckman Institute, UIUC Adaptability Through Scripting • • Tcl customizations are portable Top-level protocols: – Minimize, heat, equilibrate – Simulated annealing – Replica exchange (two modifications) • Long-range forces on selected atoms – Torques and other steering forces – Adaptive bias free energy perturbation – Coupling to external coarse-grain model • Special boundary forces – Applies potentially to every atom – Several design iterations for efficiency – Shrinking phantom pore for DNA NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC NAMD: Practical Supercomputing • 37,000 users can’t all be computer experts. – 8800 have downloaded more than one version. – 1700 citations of NAMD reference papers. • One program for all platforms. – – – – • Desktops and laptops – setup and testing Linux clusters – affordable local workhorses Supercomputers – free allocations on TeraGrid Blue Waters – sustained petaflop/s performance User knowledge is preserved. – No change in input or output files. – Run any simulation on any number of cores. • Available free of charge to all. Phillips et al., J. Comp. Chem. 26:1781-1802, 2005. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Our Goal: Practical Acceleration • Broadly applicable to scientific computing – Programmable by domain scientists – Scalable from small to large machines • Broadly available to researchers – Price driven by commodity market – Low burden on system administration • Sustainable performance advantage – Performance driven by Moore’s law – Stable market and supply chain NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Acceleration Options for NAMD • Outlook in 2005-2006: – FPGA reconfigurable computing (with NCSA) • Difficult to program, slow floating point, expensive – Cell processor (NCSA hardware) • Relatively easy to program, expensive – ClearSpeed (direct contact with company) • Limited memory and memory bandwidth, expensive – MDGRAPE • Inflexible and expensive – Graphics processor (GPU) • Program must be expressed as graphics operations NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC GPU vs CPU: Raw Performance – Calculation: 450 GFLOPS vs 32 GFLOPS – Memory Bandwidth: 80 GB/s vs 8.4 GB/s G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX GFLOPS G70 = GeForce 7800 GTX NV40 = GeForce 6800 Ultra NV35 = GeForce FX 5950 Ultra NV30 = GeForce FX 5800 NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC CUDA: Practical Performance November 2006: NVIDIA announces CUDA for G80 GPU. • CUDA makes GPU acceleration usable: – – – – – Developed and supported by NVIDIA. No masquerading as graphics rendering. New shared memory and synchronization. No OpenGL or display device hassles. Multiple processes per card (or vice versa). Fun to program (and drive) • TCBG and collaborators make it useful: – Experience from VMD development – David Kirk (Chief Scientist, NVIDIA) – Wen-mei Hwu (ECE Professor, UIUC) Stone et al., J. Comp. Chem. 28:2618-2640, 2007. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Typical CPU Architecture L2 Cache L1 I L3 Cache L1 D Dispatch/Retire FPU FPU ALU Memory Controller NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Minimize the Processor No large caches or multiple execution units L1 I L1 D Dispatch/Retire FPU Do integer arithmetic on FPU Memory Controller NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Maximize Floating Point 8 FP pipelines per SIMD unit L1 I L1 D Shared data cache Dispatch/Retire Single instruction stream FPU FPU FPU FPU One thread per FPU allows branches and gather/scatter. FPU FPU FPU FPU Memory Controller NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Add More Threads Pipeline 4 threads per FPU to hide 4-cycle instruction latency. All 32 threads in a “warp” execute the same instruction. FPU FPU FPU FPU FPU FPU FPU FPU Divergent branches allowed through predication. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Add Even More Threads Multiple warps in a “block” hide main memory latency and can synchronize to share data. FPU FPU FPU FPU FPU FPU FPU FPU NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Add More Threads Again Multiple blocks on a single multiprocessor hide both memory and synchronization latency. FPU FPU FPU FPU FPU FPU FPU FPU All blocks execute a “kernel” function independently without synchronization or memory coherency. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Add Cores to Suit Customer Kernel is invoked on a “grid” of uniform blocks. Blocks are dynamically assigned to available multiprocessors and run to completion. Synchronization occurs when all blocks complete. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Support Fine-Grained Parallelism • Threads are cheap but desperately needed. – – – – – How many can you give? 512 threads will keep all 128 FPUs busy. 1024 threads will hide some memory latency. 12,288 threads can run simultaneously. Up to 2×1012 threads per kernel invocation. NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC VMD – “Visual Molecular Dynamics” • • • Visualization and analysis of molecular dynamics simulations, sequence data, volumetric data, quantum chemistry simulations, particle systems, … User extensible with scripting and plugins http://www.ks.uiuc.edu/Research/vmd/ NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Molecular Modeling: Ion Placement • Model structures are initially constructed in vacuum • Solvent (water) and ions are added as necessary to reproduce the required biological conditions • Computational requirements scale with the size of the simulated structure NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Electrostatic Potential Maps • Electrostatic potentials evaluated on 3-D lattice: • Applications include: – Ion placement for structure building – Time-averaged potentials for simulation – Visualization and analysis Isoleucine tRNA synthetase NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Direct Summation Algorithm • Each lattice point accumulates electrostatic potential contribution from all atoms: potential[j] += atom[i].charge / rij Lattice point j being evaluated rij: distance from lattice[j] to atom[i] atom[i] NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Ion Placement via Direct Sum • 110 CPU-hours on Altix • 1.35 hours on GPU • 27 minutes on three GPUs Satellite Tobacco Mosaic Virus (STMV) Ion Placement NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC CUDA Acceleration in VMD Electrostatic field calculation, ion placement 20x to 44x faster Molecular orbital calculation and display 100x to 120x faster Imaging of gas migration pathways in proteins with implicit ligand sampling 20x to 30x faster NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC NAMD Lincoln Cluster Performance (8 Intel cores and 2 NVIDIA Telsa GPUs per node) STMV (1M atoms) s/step ~2.8 2 GPUs = 24 cores 4 GPUs 8 GPUs 16 GPUs CPU cores NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC NAMD Petascale Preparations NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Blue Waters Architecture IBM Power 7 Peak Perf ~10 PF Sustained ~1 PF 300,000+ cores 1.2+ PB Memory 18+ PB Disc 8 cores/chip 4 chips/MCM 8 MCMs/Drawer 4 Drawers/SuperNode 1024 cores/SuperNode Linux OS NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Challenges and Opportunities Support systems >= 100 Million atoms Performance requirements for 100 Million atom Scale to over 300,000 cores Power 7 Hardware − − PPC architecture Wide node at least 32 cores with 128 HT threads Blue Waters Torrent interconnect Doing research under NDA NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Planned Petascale Simulations NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC Thanks to NIH, NSF, DOE, and 15 years of NAMD and Charm++ developers and users. James Phillips Beckman Institute, University of Illinois http://www.ks.uiuc.edu/Research/namd/ NIH Resource for Macromolecular Modeling and Bioinformatics http://www.ks.uiuc.edu/ Beckman Institute, UIUC