Solving the “last mile of computing problem”: developing portals to enable simulation-based science and engineering
Tom Furlani, PhD
Center for Computational Research
University at Buffalo, SUNY
The Role of High Performance Computation in Economic Development
Rensselaer Polytechnic Institute
October 22 - 24, 2008

Outline
• How Did Computation Become so Important
• Bringing HPC to the Researcher’s Desktop
• Portals
• Grid Computing
• Example Portals
• Research
• Center for Computational Research
    • Overview
• Understanding Protein Chemistry
    • Photoactive Yellow Protein
• Toward Petascale level calculations

How did computation become critical?
Revolution in:
• Computing
• Storage
• Networking/Communication
[Figure labels: 1TB - $120; 1980’s; 1940’s; Today]

Computing Revolution
• 1890-1945: Mechanical, relay; 7 year doubling
• 1945-1985: Tube, transistor; 2.3 year doubling
• 1985-2005: Microprocessor; 1 to 1.5 year doubling

Exponentials
• Transistor density
    • 2X in ~18 months (Moore’s Law)
• Graphics: 100X in 3 years
• WAN bandwidth: 64X in 2 years
• Storage: 7X in 2 years

Microprocessor Revolution
• How long would a 1 hr calc today take on a PC from 1984? Years!
Slide courtesy: Dan Reed, RENCI

The Storage Revolution
• Megabyte
    • 5 MB: complete works of Shakespeare
• Terabyte: 1,000,000 MB, ~$120 today
    • The text in 1 million books
    • Entire U.S. Library of Congress is 10 TB of text
    • 50,000 trees made into paper and printed
    • Large Hadron Collider Experiment: 15 TB/day
• Petabyte: 1000 terabytes
    • 20 million four-drawer filing cabinets full of text

The Data Tsunami - Many sources
• Agricultural, Medical, Environmental, Engineering, Financial
Why so much data?
• More sensors, higher resolution
• Faster/cheaper storage capability
• Faster processors generate more data!
• The challenge: extracting insight!
    • Without being overwhelmed

Advanced Networking
[Figure labels: Eisenhower Interstate System; National Lambda Rail Network]
• Networks are the 21st century interstate highway system
• Expertise and information - the real product
• Removes the barriers of time and space

Enabling SBES for Non-Experts
• Bringing HPC to the desktop
    • Analogous to impact of Windows vs DOS for PC’s
        • Brought computing/internet to the home
• Many users need periodic, but infrequent access
    • Experiment driven
• Ease of use is key
    • Shouldn’t need to know about OS, compilers, queuing system, etc.
    • GUI Interface, Web-based, Access anywhere
How do we get there?
• Focus on development of portals, custom software and tools, data models, GUI’s, etc.
• Provide training on the use of these tools
• Ex: nanoHUB, a one stop resource for nanotechnology

“Old School” Computing
[Diagram labels, in slide order]
• VPN software
• Secure Shell software
• Unix commands
• Secure file transfer
• Use VPN to access network
• Secure login to front-end machine
• Create subdirectory
• Upload input data file
• Monitor job
• Input File
• Set path and variables
• PBS commands
• Submit job to queue
• Identify keywords for model
• Edit input file
• Add keywords to Input file
• Edit file
• Create PBS script file
• Application command line
• Set number of processors
• Set run time and queue
• PBS format and syntax

Portal Driven Computing
[Diagram labels, in slide order]
• Secure login to web portal
• Upload input data file
• Select model and run job
• Monitor job
• Input File
• Open Browser
• Select Model
• View Output in Browser
• Monitor Jobs
• View Output

What is an Application Portal?
• No consistent definition
• Web-based
• On-line simulation from your browser
• Simulation typically doesn’t run on your PC
• Doesn’t have to be grid enabled
• WebMO
    • Computational Chemistry Portal
• nanoHUB
    • Web-based resource for research, education and collaboration in nanotechnology
    • Includes application portals (tools)

Portal Basics
• Remote Access to simulations and compute power
[Diagram labels: ccr.buffalo.edu; Application Server; Internet; Authentication; Export Display; Remote Desktop; Run Simulation]

Application Portals
Benefits
• Scientists able to focus on research rather than details of computing environment
• Underlying infrastructure complexities are hidden
• Transparently integrate compute and data resources
• Moving application to a web-based interface provides ubiquitous access
• Single sign-on: don’t have to maintain accounts on many machines
Challenges
• Requires close collaboration between domain experts and developers
• Developers must be aware of and hide underlying complexity
• Must be easy to use (web-based, GUI)
• Must provide full application functionality

Grid Enabling Applications
Why Needed
• Scientists require an ever growing amount of compute and storage resources
• Experiments may have requirements beyond the capabilities of a single data center
• Datasets are growing at a tremendous rate
Grid Computing
• Provides infrastructure for data and job management
• Handles authentication of users across administrative and political domains
• Provides monitoring of resources and user jobs
• Allows researchers to harness the power of multiple datacenters for large experiments
• Provides a reusable interface to commonly used functions: job status, job submission, file management

Example Portals
• WebMO: Computational Chemistry
• REDfly: Bioinformatics
• iNquiry: Common web interface to many command-line tools
• GenePattern: Scientific workflow and genomic analysis tools

CCR Computational Chemistry Portal
• Based on WebMO: www.webmo.net
• CCR portal: webmo.ccr.buffalo.edu
• Extensive QC Support: Gaussian, GAMESS, NWChem, Q-Chem,
  Mopac, Molpro, Tinker
• Interfaces with batch queues on U2 and several faculty clusters
[Screenshot: CCR iNquiry Bioinformatics Portal, Glimmer page]

Computational Chemistry Portal
• Browser based login
• Menu driven
• Choose level of theory
• View output
• ... including vibrational modes

Database/Portal Development
• REDfly (Regulatory Element Database for Fly)
    • Database of transcriptional regulatory elements
    • Aggregates data from multiple offline & online sources
    • Over 2100 entries
    • Most comprehensive resource of curated animal regulatory elements
    • Fully searchable, includes DNA sequence, gene expression data, link-outs to other databases
    • Extensive collaboration with other online data sources using web services

CCR Bioinformatics Portal
• Based on iNquiry: www.bioteam.net
• Web portal: inquiry.ccr.buffalo.edu
• Extensive Application Support
    • Includes popular open-source bioinformatics packages
    • EMBOSS, *PHYLIP, HMMer, BLAST, MPI-BLAST, NCBI Toolkit, Glimmer, Wise2, *ClustalW, *BLAT, *FASTA
    • Extensible for customized application interfaces
• Uses U2 Compute Cluster as Computational Engine

TITAN - Modeling Geohazards
• Modeling of Volcanic Flows, Mud flows (flash flooding), and Avalanches
Benefits for Developers
• Developers spend too much time supporting user installations
• Support single web-based portal
• CCR supports back-end infrastructure
• Frees developers to focus on improving the models, science
• Integrate information from several sources
    • Simulation results
    • Remote sensing
    • GIS data
• Web enable for remote access

Metrics on Demand Portal
• UBMoD: Web-based Interface for On-demand Metrics
    • CPU cycles delivered, Storage, Queue Statistics, etc.
• Role based interface (User, Faculty, Staff, Admin)
• Available in open source

Center for Computational Research
• Under NYS Center for Excellence in Bioinformatics & Life Sciences
• Moved to New Buffalo Life Sciences Complex Building
• Leading Academic Supercomputing Site
• Mission: “Enabling community”
  and facilitating research within the University
• Enable Research by Providing high-end computing and visualization resources, software engineering, scientific computing/modeling, bioinformatics/computational biology, scientific and urban visualization, advanced computing systems
• Industrial Outreach/Technology Transfer to WNY
• Education, Outreach and Training in WNY

2007 Highlights
• Computational Cycles Delivered in 2007:
    • 224 different users submitted jobs (88 research groups)
    • 354,447 jobs run (almost 1000 per day)
    • 700,000 CPU days delivered
    • 200 new user accounts created
• CIT/CCR Collaboration to Improve Research Computing
• Condor deployment
• Portal/Tool Development
    • Make machines easier to use
        • WebMO (Chemistry)
        • iNquiry (Bioinformatics)
        • UBMoD (Metrics on Demand)
• Accountability
    • On-line real-time metrics
• UB 2020 Campus Master Planning
    • 3D models of all 3 campuses
• NYSGrid

CCR Research & Projects
• Groundwater Flow Modeling
• Turbulence and Combustion Modeling
• Molecular Structure Determination
• Protein Folding Prediction
• Data Mining: Digital Gov, Library
• Grid Computing
• Computational Chemistry
• Biomedical Engineering
• Bioinformatics
• Urban Simulation and Visualization
• Accident Reconstruction
• Risk Mitigation (GIS)
• Medical Imaging
• High School Workshops
• Cluster Computing
• Data Fusion

Photoactive Yellow Protein
• Simple prototype of the Rhodopsin family of proteins
• Chromophore is located completely inside the protein pocket
• Protein environment causes absorption shift from 2.70 eV (gas phase) to 2.78 eV (protein), yielding the yellow color

Chromophore Spectra Measured
• Experimental spectra of the protein active site in vacuum, in a protein and in water solution
• Provides insight into environmental effects on electronic spectra, large shift of absorption maximum
• Can gauge accuracy of theory

Modeling the System
• Combined Quantum Mechanical / Molecular Mechanical Method
• System is divided into a QM part and a MM part
• QM used to model “important” part of system; MM used to model remainder
• The QM
  part includes the active site of the protein
• The MM part includes the rest of the protein, as well as surrounding water molecules
[Figure label: QM]

QM versus MM based Methods
QM Calculations
• Advantages: Very accurate, based on first principles (ab initio, DFT; no empirical parameters are involved), can treat bond breaking and formation
• Disadvantages: Time consuming, limited to small molecular systems (~100 atoms)
MM Calculations
• Advantages: Very fast, capable of treating entire proteins or solutions (~100,000 atoms)
• Disadvantages: Less accurate, based on empirical parameters, not capable of modeling chemical reactions (electrons are not modeled explicitly)
[Figure label: QM/MM]

Why use the QM/MM Method?
• Improved accuracy (QM) and faster (MM)
• Model active site of proteins
• Drug-receptor binding
• Electrostatic effects
• Steric effects
• Interpretation of experimental data
• Vibrational spectra
• Electronic spectra
• Mechanism of enzymatic activity
• Reaction profiles
• Thermal motion effects on reactivity

Modeling Protein Dynamics
• Goal: Understand how protein thermal dynamics affects function
[Figure: protein dynamics over time]
Steps 1-4:
• Run MM based Molecular Dynamics (MD) simulation
• From MD simulation, randomly select protein conformations (snapshots)
• Run QM/MM simulation for each snapshot
• Generate results based on averages taken from snapshots

Getting Results Faster
• Carry out QM/MM calcs simultaneously for many snapshots (protein conformations)

QM/MM Calc for Each Snapshot
• After MD, protein snapshots are randomly selected (1000)
• Full geometry optimization of the ligand inside the fixed protein matrix (Q-Chem)
    • QM: DFT/B3LYP/6-31+G* (ligand)
    • MM: AMBER (protein + water)
• Electronic excitations (Q-Chem):
    • QM: TDDFT/B3LYP/aug-cc-pVTZ (ligand)
    • MM: AMBER (protein + water)
        • 4500 water molecules

CPU Demand - Current Calculation
• MD Simulation: 1600 CPU hours
• Select 1000 Snapshots
• Each Snapshot (54 CPU Hours)
    • Combined QM/MM Geometry Optimization
        • 24 CPU hours (3 hours on 8 processors)
    • Electronic Excitation Calc
        • 30 CPU Hours
• Total for all 1000 snapshots + MD Simulation: 55,600 CPU Hours (2300 CPU Days)

Results
Electronic excitations of the chromophore

Electronic Excitation   Calculated             Experiment
Gas-Phase (eV)          3.07                   2.70
Protein (eV)            3.31 (0.06)  Δ=0.24    2.78  Δ=0.08
Solution (eV)           3.52 (0.04)  Δ=0.45    3.10  Δ=0.40

( ) - standard deviation
Δ - change relative to the gas phase

Toward Petascale Level Calc
• More accurate MD simulation
    • Larger water sphere (50 Å radius)
        • ~12,000 water molecules
    • 500 hours on 32 processors = 16,000 CPU hours
• More accurate QM/MM simulations
    • Larger basis set
    • 350 hours on 16 processors = 5600 CPU hours
• Better statistics
    • 100,000 MD snapshots (560,000,000 CPU hours)
    • 2 MD simulations = 1,120,000,000 CPU hours!

Power of Parallel Processing
• Assume a modest 4X increase in processor performance/computational efficiency over the next few years
• Reduce requirement to about 10,000,000 CPU days
• Translates to about 100 days on 100,000 cores
• Combined QM/MM simulations of this scale possible on petascale level hardware

Acknowledgements
• Portal Development: Steve Gallo, Dr.
  Matt Jones, Jon Bednasz, Rob Leach
• Combined QM/MM Calculations: Dr. Marek Friendorf
• Funding: NIH
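
Worked example: CPU budget arithmetic
The CPU-hour figures on the "CPU Demand - Current Calculation", "Toward Petascale Level Calc" and "Power of Parallel Processing" slides can be checked with a short script. This is a minimal sketch: the function and variable names are illustrative assumptions, only the input numbers come from the slides, and, like the slides' 560,000,000 CPU-hour figure, the petascale total counts the QM/MM work only and leaves out the comparatively small MD cost.

```python
# Back-of-the-envelope CPU budget using the figures quoted on the slides.
# All names here are hypothetical; only the numeric inputs are from the talk.

HOURS_PER_DAY = 24


def cpu_hours(wall_hours, processors):
    """CPU hours used by a job that runs wall_hours on `processors` cores."""
    return wall_hours * processors


def campaign_cpu_hours(md_hours, snapshots, per_snapshot_hours):
    """One MD run plus an independent QM/MM calculation per snapshot."""
    return md_hours + snapshots * per_snapshot_hours


# Current calculation: 1600 CPU hours of MD, then 1000 snapshots.
geometry_opt = cpu_hours(3, 8)           # 3 hours on 8 processors
excitation = 30                          # CPU hours per snapshot
current_total = campaign_cpu_hours(1600, 1000, geometry_opt + excitation)
current_days = current_total / HOURS_PER_DAY

# Petascale scenario: larger MD run and larger basis set per snapshot.
md_large = cpu_hours(500, 32)            # more accurate MD simulation
qmmm_large = cpu_hours(350, 16)          # more accurate QM/MM per snapshot
per_simulation = 100_000 * qmmm_large    # QM/MM cost for 100,000 snapshots
petascale_total = 2 * per_simulation     # two MD simulations

# Assumed 4X gain in processor performance, then spread over 100,000 cores.
speedup = 4
cores = 100_000
cpu_days_after_speedup = petascale_total / speedup / HOURS_PER_DAY
wall_days = cpu_days_after_speedup / cores

print(f"Current total:    {current_total:,} CPU hours ({current_days:,.0f} CPU days)")
print(f"Large MD run:     {md_large:,} CPU hours")
print(f"Petascale total:  {petascale_total:,} CPU hours")
print(f"After {speedup}X speedup: {cpu_days_after_speedup:,.0f} CPU days")
print(f"On {cores:,} cores: {wall_days:,.0f} days of wall-clock time")
```

The unrounded result after the 4X speedup is roughly 11.7 million CPU days, or about 117 days on 100,000 cores, which the slides round to "about 10,000,000 CPU days" and 100 days. Because each snapshot is an independent job, the per-snapshot cost simply multiplies by the snapshot count, which is what makes this workload a natural fit for many-core hardware.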