Biomedical Computing Requirements for HPCS

Kay Howell, Federation of American Scientists, khowell@fas.org
Gerry Higgins, SimQuest, LLC, higgins@simquest.com

Federation of American Scientists

Biomedical Computing Requirements for HPCS

- Examine broad range of application areas
- Identify key applications driving computing demand
- Identify hardware/software challenges for important classes of applications
- Highlight HPCS areas critical to advances in biomedical computing
- Identify technology gaps common to biomedical, national security, and other nationally important applications
- Demonstrate market potential of HPCS in biomedical computing

Requirements Analysis

- System architecture requirements, including: processors, memory, interconnects, system software, and programming environments
- Bandwidth requirements
- System robustness
- Application development and maintenance
- System management, operation, and maintenance

Biomedical Computing Requirements

Application areas:
- Genome Bioinformatics
- Protein Biochemistry (Proteomics)
- Chemoinformatics – Drug Discovery
- Structural Bioinformatics
- Computational Biology Molecular Modeling (MD, QM, MC, MM)
- Tissue Engineering / Organ Modeling / Systems Biology
- Diagnostic Imaging and Image-Guided Interventions

Description (entries in source order):
- Genomics, DNA sequencing, microarray technologies and bioinformatics
- Includes protein structure and function

Market Projections (entries in source order):
- Moderate, but growing
- Small, but growing very rapidly
- Moderate, growing moderately
- Small, but growing rapidly
- Small, but growing rapidly
- Very large, growing moderately

People to interview:
Adam Arkin, Jack Dixon, Rick Blevins, Wah Chiu, George Church, Andrea Sinz, Donna Huryn, Rick Lathrop, Shankar Subramaniam, Barry Stoddard, Andrew McCulloch, Brian Athey, Klaus Schulten, Chris Johnson, Bill Lorensen, Ron Kikinis, Michael Vannier

People to interview – federal:
Gerard Bouffard, NISC (NIH); Parag Chitnis, NSF; Dan Zaharevitz, NCI; Bret Peterson, NCRR; Terry Yoo, NLM; Stephen Altschul, NCBI (NIH); Yawen Bai, NIH; Peter Steinbach, CMM, NIH; Sri Kumar, DARPA; Carol Lucas, NSF; Richard Swaja, NIBIB; Donna Hillmann, NSF; Ruth Prachter, AF; Larry Clarke, NCI; Francis Collins (NHGRI); Nigel Page, DoD

Companies to talk to:
Incyte, Geneva Bioinformatics, ArQule, Celera, Myriad Proteomics, Viaken, Oxford GlycoSciences, Albany Molecular Research, Trega, HPC vendors, Physiome Sciences, Entelos, SimQuest, GE Medical Systems, Medtronic, BrainLab

Focus Areas

- Resources for managing, analyzing, interpreting data
- Extending the time scale & complexity of simulations
- Combined classical/quantum chemical simulations
- Simulations of large systems
- Protein structure prediction
- Diagnostic imaging and image-guided interventions

Work Plan

- Survey existing information and materials
- Interview researchers, sponsors and industrial representatives
- Produce preliminary report summarizing findings and distribute for review and comment
- Deliver initial requirements one year after project award
- Update the report one year later

Biomedical Computing: What we'd like to be able to do…

Static / Dynamic / Functional

- Mouse/Human Genome Correlation
- Individual Pharmacogenomic analysis using Gene Expression Arrays
- Multi-modal Radiology Image Fusion
- Millisecond Structural Biology enabled by Synchrotron X-ray Sources and 900 MHz NMR
- Physiologically competent Digital Human Simulations
- your additions to the list…

Challenges in Biomedical Computing

- Non-linear: current models are simplified linear approximations
- System complexity: need to span multiple scales of biological organization
- Time scales
- Exponential increases in data

Biomedical Computing Problems

[Figure: Biomedical Computing Problems. Axes: size scale and complexity versus timescale in seconds (10^-15 to 10^9, extending to geologic and evolutionary timescales). Size-scale labels: atoms, biopolymers, cells, organisms, ecosystems and epidemiology. Methods shown: ab initio quantum chemistry, first principles molecular dynamics, empirical force field molecular dynamics, homology-based protein modeling, electrostatic continuum models, discrete automata models, finite element models. Problems shown: enzyme mechanisms, protein folding, DNA replication, cell signalling, organ function, evolutionary processes. Source: ORNL]

Biomedical Computing Requirements for HPCS: Application Areas

Biological Research Requiring ultra-HPC Resources

- Structure of proteasome, ribozyme, ribosome, ATPases, virus, membrane protein complexes
- Whole genome comparison
- Combined quantum/classical simulations
- Protein folding/threading
- Microsecond time-scale simulations
- Self-organization and self-assembly
- Protein-protein and protein-DNA recognition and assembly
- Your additions…

Sequencing and Analysis

Key Attributes:
- Integer intensive
- Significant research into new kinds of statistical models: hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees
- Clusters typically used
- Large scale database infrastructure common
- Cluster can be dedicated to single task/local data control
- Cycle requirements can be substantial because of data
- Systems often in excess of 1 Tflop (range 1-5)

Protein Structure Prediction: Summary of Computational Characteristics

- Pipeline processing (network of interrelated tasks)
- Clustering: computationally intensive. Algorithms are easier to implement using shared memory parallelism due to the tightly coupled, fine grained, non-uniform work load
- Generation of sequence fragments: ANN algorithm may be ideal for this and for clustering purposes
- Fragment library written to a database
- Compute intensive algorithms are clustering (ANN) and optimization (GA)
- Optimization is easier to implement using a loosely coupled distributed compute cluster

Protein Structure Prediction Wish List

- Hardware/software to map the processing pipeline efficiently
- Tools to schedule and checkpoint such a pipeline
- Well balanced hardware pipeline from archival storage to the compute elements without bottlenecks
- Easily programmable FPGA coprocessor boards to handle the integer and other DSP branches of the pipeline
- Hardware and software that can handle truly asynchronous computing, as it is the key to scalability (overlapped computation, communication and I/O)
- Efficient ANN and GA libraries similar to LAPACK
- Efficient skeleton/template codes for common computation/communication/IO (OO jargon: patterns) across all platforms
- Standardized framework, libraries, database providing the computational characteristics of the underlying hardware/software environment

Source: G. Chukkapalli, UCSD

Protein Structure Prediction: Future requirements

- Combine knowledge based prediction with ab initio methods to improve the prediction accuracy
- Execute the whole pipeline on demand in an automated fashion
- Generate predicted structures for whole genomes
- Protein design: inverse problem
- All these are prohibitively expensive at present

Molecular Level Modeling

- Biochemical analysis
- Protein binding / drug target evaluation
- Dynamics of molecules
- Very large systems with physics

Computational Biology HPC Challenges

Activity | Current Limit | Problem Size | Complexity | Memory
--- | --- | --- | --- | ---
Ab initio study of enzyme catalysis | 60 heavy atoms | 250 heavy atoms | O(n^3) | O(n)
X-ray refinement of large assemblies | 25,000 atoms | 125,000 atoms | O(n^2) | O(n^2)
Large scale protein motion, membrane transport | 200 residues | 1000 residues | O(n log n) | O(n)
Flexible docking of chemical databases | 3000 compounds | 1,000,000 compounds | O(n) | O(n)

Remaining entries, in source order (row alignment not recoverable): 150 sequences, 200 sequences, O(n^3), O(n x m), O(n log n), O(n); Phylogenetic mapping; RNA 3-D conformations; 10,000 bases, 1,000,000 bases; 100 residues, 1000's residues
Source: S. Burke, NIH

Biological Computing Assessment

(Assume 10^5 seconds to finish computation)

Attribute | BioCatalysis, Mn-Salen (QM) | Enzymes, ras (QM/MM) | m-Array, 8000 genes (Clustering) | Multiple Alignment, Phylogenetics (Pattern Matching) | Whole Genome Analysis (Sequence Comparison)
--- | --- | --- | --- | --- | ---
Arithmetic | float | float | integer | integer | integer
Computational requirements (Ops/sec) | 1 x 10^12 | 10 x 10^12 | 200 x 10^9 | 100 x 10^9 | 100 x 10^12
Compiler speed | Optimization | Optimization | Optimization | Optimization | Critical for FPGA
Platform porting strategies/experiences | Runs on Many CPU Platforms | Most Scalar and Many Parallel | Runs on Many CPU Platforms | Runs on Many CPU Platforms | All CPUs (ev7 opt), FPGA
Code size (Lines) | 300,000 | 400,000 | 3,000 | 5,000 | 2,000
Key algorithms and improvements | Direct, Parallel, Vector? | Linear Scaling, Parallel | Parallel, MHz | Needs Parallelization | FPGA

Rows whose cell boundaries are not recoverable (values in source order):
- Memory access patterns: Random Partitioned Random Sequential Random Sequential or Random
- I/O performance: Moderate Bandwidth Communication NA Memory Bandwidth Memory Bandwidth
- O/S speed and stability: MTBF Processor Scale Processor Scale Processor Scale
- Support for new architectures: (no values recovered)
- Performance across multiple architectures: Scalar & Parallel Spatial Decomp. Scalar & Parallel Scalar Scalar, Parallel, Vector, FPGA Parallel

Source: S. Burke, NIH

Data Management

Data management issues will be critically important
- Biological data is estimated to double every 6 months
- GenBank grew from 680,338 base pairs in 1982 to 22 billion base pairs in 2002 (compared to 13.5 billion base pairs as of August 2001)
- Rate of data acquisition is 100X higher than originally anticipated due to improved sequencing technology and methods

Redundancies and database asynchrony are increasing
- database-to-database comparisons are required for analysis and validation

To look at long-range patterns of expression, syntenic regions on the order of 10's of megabases become reasonable lengths for consideration

What other data issues should be highlighted?
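The growth figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes pure exponential growth between the two GenBank data points quoted on the slide; the function names are illustrative and not taken from the original deck. Under that assumption, the 1982 to 2002 GenBank figures imply a long-run doubling time of roughly 16 months, which is slower than the 6-month estimate quoted for biological data as a whole.

```python
import math


def doubling_time_years(start_size, end_size, years):
    """Doubling time implied by exponential growth between two measurements."""
    doublings = math.log2(end_size / start_size)
    return years / doublings


def growth_factor(months, doubling_months=6.0):
    """Multiplicative growth after `months` at a fixed doubling period."""
    return 2.0 ** (months / doubling_months)


# GenBank sizes quoted on the slide (base pairs).
genbank_1982 = 680_338
genbank_2002 = 22e9

t_double = doubling_time_years(genbank_1982, genbank_2002, 2002 - 1982)
print(f"implied GenBank doubling time: {t_double * 12:.1f} months")

# If data really doubled every 6 months, five years would give 2**10 growth.
print(f"growth over 5 years at 6-month doubling: {growth_factor(60):.0f}x")
```

The same helper makes storage planning concrete: a 6-month doubling period means roughly a thousandfold increase over five years, which is why the slide treats data management as a first-order requirement.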
Data Management Issues

New Types of Data

Support to extend existing RDBMS:
- Sequences and Strings
- Trees and Clusters
- Networks and Pathways
- Deep Images
- 3D Models and Shapes
- Molecules and Coordinate Structures
- Hierarchical Models and Systems Descriptions
- Time Series and Sets
- Probabilities and Confidence Factors
- Visualizations

Source: Davidson, Bristol-Myers Squibb Pharm. Res. Institute

Systems Biology – Modeling the Cellular System

- Combine cell signaling, gene regulatory and metabolic networks to simulate cell behavior
- Hybrid information & physics based model
- Integrating computational/experimental data at all levels
- Modeling of network connectivity (sets of reactions: proteins, small molecules, stochastic, MD)
- Difficult to handle computationally:
  - importance of spatial location within the cell
  - instability associated with reactions between small numbers of molecular species
  - combinatorial explosion of large numbers of different species
- >Petaflop problem

Systems Biology

- Need to simulate gene expression, metabolism and signal transduction for single and multiple cells
- Algorithms need to be designed precisely for biological research: the parameter optimizer needs to find as many local minima as possible, including global minima, because there are multiple possible solutions of which only one is actually used
- Must be able to simulate both high concentrations of proteins, which can be described by differential equations, and low concentrations of proteins, which need to be handled by stochastic process simulation
- Stochastic methods are being used (STOCHSIM and the Gillespie algorithm):
  - individual molecules are represented rather than concentrations of molecular species
  - Monte Carlo methods are used to predict interactions
  - rate equations are replaced by individual reaction probabilities

Digital Imaging

- Used for monitoring of disease progression, diagnosis, preoperative planning, and intraoperative guidance and monitoring
- Algorithms are computationally demanding
- Key issues are segmentation and registration
- Signal processing techniques are used to enhance features and generate the desired segmentation
- Results of the segmentation are aligned to other data acquisitions and to the actual patient during procedures
- Results of the segmentation are visualized using different rendering methods

Digital Imaging

Feature Enhancement
- Idea: Modulate selected characteristics
- Method: Spatial and frequency domain filtering: convolutions
- Parallelization: SMP and MPI style for Fourier transforms [Frigo, 1997] and convolutions
- Applications: Noise reduction [Gerig, 1992], removal of partial volume artefacts [Westin, 1997]

Classification: k-NN, Parzen window
- Idea: Classify an unknown voxel based on prototypes
- Method: Nonparametric supervised statistical classification [Duda, 1973], [Cover, 1967], [Cover, 1968], [Clarke, 1993], [Warfield, 1996], [Friedman, 1975]
- Parallelization: Each voxel treated separately [Friedman, 1975]. SMP for core, MPI
- Applications: Classification in different areas of the body [Kikinis, 1992], [Huppi, 1998], [Warfield, 1995], [Warfield, 1996]

Classification: EM
- Idea: Increase robustness of statistical approach through adaptive behaviour
- Method: Iterates between statistical classification and intensity prediction/correction [Wells, 1986], [Wells, 1996]
- Parallelization: Classification step as in k-NN; intensity correction: convolutions. SMP, NUMA
- Applications: Classification primarily of brain MRI [Morocz, 1995], [Kikinis, 1997], [Iosifescu, 1997]

Linear Registration: Intra-subject
- Idea: Use inherent contrast similarity to align images
- Method: Requires entropy and joint entropy computation [Wells, 1996a]
- Parallelization: Joint histogram computation, parallelized by computing the histogram of data chunks. Joint entropy calculated by a loop over the histogram. SMP, MPI
- Applications: Registration of slices for multichannel analysis [Huppi, 1998], [Nakajima, 1997]

Linear Registration: Inter-subject
- Idea: Measure mismatch of alignment of two subjects by counting the number of voxel labels that don't match
- Method: Multiresolution alignment using XOR function [Warfield, 1998]
- Parallelization: First data is resampled, then misalignment is used to calculate registration. MPI
- Applications: Initial alignment for template driven segmentation [Warfield, 1996]

Nonlinear Registration
- Idea: Use rubbersheet transform to align two data sets from different subjects
- Method: Multiresolution approach with fast local similarity measurement, and a simplified regularization model
- Parallelization: Low pass filter, upsampling, downsampling, arithmetic operations, solve systems of equations. SMP
- Applications: Template driven segmentation [Warfield, 1996]

Visualization: Surface Model Generation
- Idea: Generate highly optimized triangle surface models
- Method: Pipeline of marching cubes [Lorensen, 1987], triangle reduction [Schroeder, 1992], and triangle smoothing [Taubin, 1995]
- Parallelization: Distributed computation of triangle models for each structure of a data set (up to 300). LSF
- Applications: Visualization for surgical applications and for presentation purposes [Ozlen, 1998], [Chabrerie, 1998], [Chabrerie, 1998a], [Kikinis, 1996]

Visualization: Volume Rendering
- Idea: Direct visualization of volume data without prior processing
- Method: Shear warp algorithm [Ylä-Jääski, 1997], [Lacroute, 1994], [Saiviroonporn, 1998]
- Parallelization: Render subvolumes separately. SMP, MPI
- Applications: Visualize data before segmentation, interactive editing

Source: R. Kikinis, Brigham and Women's Hospital and Harvard Medical School

Biomedical Application and Kernels

BioCatalysis (Quantum and MM)

Kernels | Application | Source | Today
--- | --- | --- | ---
Ab Initio Quantum Chemistry | GAMESS | DoD HPCMP TI-03 | TeraOp/s sustained
Quantum Chemistry | GAUSSIAN | www.gaussian.com/ | TeraOp/s sustained
Quantum Mechanics | NWChem | PNNL | TeraOp/s sustained
Macromolecular Dynamics, Energy Minimization, Monte Carlo Simulation | CHARMM | http://yuri.harvard.edu/ | 10 TeraOp/s sustained
Molecular Mechanical Force Field | AMBER | http://www.amber.ucsf.edu/ | 10 TeraOp/s sustained

m-Array, 8000 Genes

Kernels | Application | Source | Today
--- | --- | --- | ---
Clustering | CLUSTALW | http://bimas.dcrt.nih.gov/sw.html | 200 GigaOps/s sustained

Multiple Alignment, Phylogenetics

Kernels | Application | Source | Today
--- | --- | --- | ---
Pattern Matching | NONMEM | http://www.globomaxservice.com/products/nonmem.html | 100 GigaOps/s sustained
Pattern Matching | PHYLIP | http://evolution.genetics.washington.edu/phylip.html | 100 GigaOps/s sustained
Pattern Matching | FastME | http://www.ncbi.nlm.nih.gov/CBBresearch/Desper/FastME.html | 100 GigaOps/s sustained

Whole Genome Analysis

Kernels | Application | Source | Today
--- | --- | --- | ---
Sequence Comparison | Needleman-Wunsch | http://www.med.nyu.edu/rcr/rcr/course/sim-sw.html | 100 TeraOps/s sustained
Sequence Comparison | FASTA | http://www.ebi.ac.uk/fasta33/ | 100 TeraOps/s sustained
Sequence Comparison | HMMER | http://hmmer.wustl.edu/ | 100 TeraOps/s sustained
Sequence Comparison | GENSCAN | http://genes.mit.edu/GENSCANinfo.html | 100 TeraOps/s sustained

Systems Biology
- Functional Genomics: http://genomics.lbl.gov/~aparkin/Group/Codebase.html
- Biological Pathway Analysis; Complex Systems Simulation and Analysis: http://ecell.sourceforge.net/
- Partial Differential Equation Solver; Ordinary Differential Equation Solver: http://www.nrcam.uchc.edu/

Digital Imaging
- Marching Cubes: Paper & Pencil for Kernels
- Triangle Reduction: Paper & Pencil for Kernels
- Triangle Smoothing: Paper & Pencil for Kernels
- Noise Reduction: Paper & Pencil for Kernels
- Artifact Removal: Paper & Pencil for Kernels
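To make the "integer intensive" character of the sequence-comparison kernels concrete, here is a minimal sketch of the Needleman-Wunsch global alignment score, one of the kernels listed above. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; production codes use substitution matrices and more elaborate gap penalties, and this is not the implementation behind the URL in the table.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b (integer arithmetic only)."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Either align a[i-1] with b[j-1], or open a gap in one sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GCATGCU"))
```

The inner loop is nothing but integer adds, compares, and array reads, with work proportional to the product of the two sequence lengths. That is consistent with the deck's interest in FPGA coprocessors and memory bandwidth for whole-genome comparison, where both sequences can be very long.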