Kay Howell

Biomedical Computing Requirements for HPCS
Kay Howell, Federation of American Scientists
khowell@fas.org
Gerry Higgins, SimQuest, LLC
higgins@simquest.com
Federation of American Scientists
Biomedical Computing
Requirements for HPCS
 Examine broad range of application areas
 Identify key applications driving computing demand
 Identify hardware/software challenges for important classes
of applications
 Highlight HPCS areas critical to advances in biomedical
computing
 Identify technology gaps common to biomedical, national
security, and other nationally important applications
 Demonstrate market potential of HPCS in biomedical
computing
Requirements Analysis
 System architecture requirements, including:
processors, memory, interconnects, system software, and
programming environments
 Bandwidth requirements
 System robustness
 Application development and maintenance
 System management, operation, and maintenance
Biomedical Computing Requirements
Application areas:
- Genome Bioinformatics
- Protein Biochemistry (Proteomics)
- Chemoinformatics – Drug Discovery
- Structural Bioinformatics
- Computational Biology / Molecular Modeling (MD, QM, MC, MM)
- Tissue Engineering / Organ Modeling / Systems Biology
- Diagnostic Imaging and Image-Guided Interventions
Description:
- Genomics, DNA sequencing, microarray technologies and bioinformatics
- Includes protein structure and function
Market projections:
- Moderate, but growing
- Small, but growing very rapidly
- Moderate, growing moderately
- Small, but growing rapidly
- Small, but growing rapidly
- Very large, growing moderately
People to interview:
- Adam Arkin, Jack Dixon, Rick Blevins, Wah Chiu, George Church, Andrea Sinz, Donna Huryn, Rick Lathrop, Shankar Subramaniam, Barry Stoddard, Andrew McCulloch, Brian Athey, Klaus Schulten, Chris Johnson, Bill Lorensen, Ron Kikinis, Michael Vannier
People to interview (federal):
- Gerard Bouffard, NISC (NIH)
- Parag Chitnis, NSF
- Dan Zaharevitz, NCI
- Bret Peterson, NCRR
- Terry Yoo, NLM
- Stephen Altschul, NCBI (NIH)
- Yawen Bai, NIH
- Peter Steinbach, CMM, NIH
- Sri Kumar, DARPA
- Carol Lucas, NSF
- Richard Swaja, NIBIB
- Donna Hillmann, NSF
- Ruth Prachter, AF
- Larry Clarke, NCI
- Francis Collins (NHGRI)
- Nigel Page, DoD
Companies to talk to:
- Incyte, Geneva Bioinformatics, ArQule, Celera, Myriad Proteomics, Viaken, Oxford GlycoSciences, Albany Molecular Research, Trega, HPC vendors, Physiome Sciences, Entelos, SimQuest, GE Medical Systems, Medtronic, BrainLab
Focus Areas
 Resources for managing, analyzing, interpreting data
 Extending the time scale & complexity of simulations
 Combined classical/quantum chemical simulations
 Simulations of large systems
 Protein structure prediction
 Diagnostic imaging and image-guided interventions
Work Plan
 Survey existing information and materials
 Interview researchers, sponsors and industrial representatives
 Produce preliminary report summarizing findings and distribute for
review and comment
 Deliver initial requirements one year after project award
 Update the report one year later
Biomedical Computing
What we’d like to be able to do…
Static
Dynamic
Functional

 Mouse/Human Genome Correlation
 Individual Pharmacogenomic analysis using Gene Expression Arrays
 Multi-modal Radiology Image Fusion
 Millisecond Structural Biology enabled by Synchrotron X-ray Sources and 900 MHz NMR
 Physiologically competent Digital Human Simulations
Your additions to the list…
Challenges in
Biomedical Computing
 Non-linear - current models are simplified linear
approximations
 System Complexity - need to span multiple scales of
biological organization
 Time Scales
 Exponential increases in data
Biomedical Computing Problems
[Figure (ORNL): biomedical computing problems plotted by size scale (atoms, biopolymers, cells, organisms) against timescale in seconds (10^-15 to 10^9, up to geologic and evolutionary timescales), with complexity increasing along with timescale. Methods shown: ab initio quantum chemistry, first principles molecular dynamics, empirical force field molecular dynamics, homology-based protein modeling, electrostatic continuum models, finite element models, discrete automata models. Problems shown: enzyme mechanisms, protein folding, DNA replication, cell signalling, organ function, evolutionary processes, ecosystems and epidemiology.]
Biomedical Computing
Requirements for HPCS
Application Areas
Biological Research Requiring
ultra-HPC Resources
 Structure of proteasome, ribozyme, ribosome, ATPases, viruses, membrane protein complexes
 Whole genome comparison
 Combined quantum/classical simulations
 Protein folding/threading
 Microsecond time-scale simulations
 Self-organization and self-assembly
 Protein-protein and protein-DNA recognition and assembly
Your additions….
Sequencing and Analysis
 Key Attributes:
- Integer intensive
- Significant research into new kinds of statistical models: hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees
- Clusters typically used
- Large scale database infrastructure common
 Cluster can be dedicated to single task/local data
control
 Cycle requirements can be substantial because of data
 Systems often in excess of 1 Tflop (range 1-5 Tflops)
Protein Structure Prediction
Summary of Computational
Characteristics
 Pipeline processing (network of interrelated tasks)
 Clustering:
- Computationally intensive
- Algorithms easier to implement using shared memory parallelism due to tight coupling, fine-grained, non-uniform work load
 Generation of sequence fragments:
- ANN algorithm may be ideal for this and for clustering purposes
- Fragment library written to a database
 Compute intensive algorithms are clustering (ANN) and optimization (GA)
 Optimization easier to implement using loosely coupled distributed compute cluster
Protein Structure Prediction
Wish List

 Hardware/software to map the processing pipeline efficiently
 Tools to schedule and checkpoint such a pipeline
 Well balanced hardware pipeline from archival storage to the compute elements without bottlenecks
 Easily programmable FPGA coprocessor boards to handle the integer and other DSP branches of the pipeline
 Hardware and software that can handle truly asynchronous computing, as it is the key to scalability (overlapped computation, communication and I/O)
 Efficient ANN and GA libraries similar to LAPACK
 Efficient skeleton/template codes for common computation/communication/IO (patterns, in OO jargon) across all platforms
 Standardized framework, libraries, and database providing the computational characteristics of the underlying hardware/software environment
Source: G. Chukkapalli, UCSD
Protein Structure Prediction
Future requirements
 Combine knowledge based prediction with ab initio methods
to improve the prediction accuracy
 Execute the whole pipeline on demand in an automated
fashion
 Generate predicted structures for whole genomes
 Protein design: inverse problem
All these are prohibitively expensive at present
Molecular Level Modeling
 Biochemical analysis
 Protein binding /drug target evaluation
 Dynamics of molecules
 Very large systems with physics
Computational Biology
HPC Challenges
Activity | Current Limit | Problem Size | Complexity | Memory
Ab initio study of enzyme catalysis | 60 heavy atoms | 250 heavy atoms | O(n^3) | O(n)
X-ray refinement of large assemblies | 25,000 atoms | 125,000 atoms | O(n^2) | O(n^2)
Large scale protein motion, membrane transport | 200 residues | 1000 residues | O(n log n) | O(n)
Flexible docking of chemical databases | 3000 compounds | 1,000,000 compounds | O(n) | O(n)
150 sequences
200 sequences
O(n^3)
O(nxm)
O(nlogn)
O(n)
Phylogenetic mapping
RNA 3-D conformations
10,000 bases
1,000,000 bases
100 residues
1000’s residues
Source: S. Burke, NIH
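The gap between the current limit and the desired problem size in the table can be turned into a rough work multiplier by applying each row's stated time complexity. The sketch below does this for the four rows whose figures are unambiguous. It ignores constant factors and lower-order terms, and the function and variable names are illustrative rather than taken from the source.

```python
import math

# Complexity classes named in the table, as cost functions of problem size n.
COST = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log(n),
    "O(n^2)": lambda n: n ** 2,
    "O(n^3)": lambda n: n ** 3,
}


def growth_factor(current_n, target_n, complexity):
    """Ratio of work at target_n to work at current_n, constants ignored."""
    cost = COST[complexity]
    return cost(target_n) / cost(current_n)


# (activity, current limit, desired problem size, time complexity)
ROWS = [
    ("Ab initio enzyme catalysis (heavy atoms)", 60, 250, "O(n^3)"),
    ("X-ray refinement (atoms)", 25_000, 125_000, "O(n^2)"),
    ("Protein motion, membrane transport (residues)", 200, 1000, "O(n log n)"),
    ("Flexible docking (compounds)", 3000, 1_000_000, "O(n)"),
]

for name, current, target, complexity in ROWS:
    factor = growth_factor(current, target, complexity)
    print(f"{name}: about {factor:.1f}x more work")
```

Under these assumptions the multipliers come out to roughly 72x, 25x, 6.5x and 333x respectively, which suggests that the docking and ab initio rows are the most demanding jumps in raw work.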
Biological Computing Assessment
(Computational requirements assume 10^5 seconds to finish computation)
Attribute | BioCatalysis: Mn-Salen (QM, float) | Enzymes: ras (QM/MM, float) | µ-Array: 8000 genes (Clustering, integer) | Multiple Alignment: Phylogenetics (Pattern Matching, integer) | Whole Genome Analysis (Sequence Comparison, integer)
Computational requirements (Ops/sec) | 1 x 10^12 | 10 x 10^12 | 200 x 10^9 | 100 x 10^9 | 100 x 10^12
Memory access patterns | Random | Partitioned Random | Sequential | Random | Sequential or Random
I/O performance | Moderate Bandwidth | Communication | NA | Memory Bandwidth | Memory Bandwidth
Compiler speed | Optimization | Optimization | Optimization | Optimization | Critical for FPGA
O/S speed and stability | MTBF | Processor Scale | Processor Scale | Processor Scale | Support for new architectures
Platform porting strategies/experiences | Runs on Many CPU Platforms | Most Scalar and Many Parallel | Runs on Many CPU Platforms | Runs on Many CPU Platforms | All CPUs (ev7 opt), FPGA
Performance across multiple architectures | Scalar & Parallel | Spatial Decomp. | Scalar & Parallel | Scalar | Scalar, Parallel, Vector, FPGA Parallel
Code size (Lines) | 300,000 | 400,000 | 3,000 | 5,000 | 2,000
Key algorithms and improvements | Direct, Parallel, Vector? | Linear Scaling, Parallel | Parallel, MHz | Needs Parallelization | FPGA
Source: S. Burke, NIH
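The assessment quotes sustained rates under the assumption that each computation finishes in 10^5 seconds (about 28 hours). A minimal sketch of the implied total operation counts follows. It reads the table's rate entries as 1 x 10^12, 10 x 10^12, 200 x 10^9, 100 x 10^9 and 100 x 10^12 ops/sec, and the 1 x 10^12 ops/sec comparison machine in the usage lines is a hypothetical figure chosen only for illustration.

```python
BUDGET_SECONDS = 1e5  # run-time assumption stated in the assessment

# Sustained rates (ops/sec) as read from the assessment table.
SUSTAINED_OPS_PER_SEC = {
    "BioCatalysis Mn-Salen (QM)": 1e12,
    "Enzymes ras (QM/MM)": 10e12,
    "Microarray 8000 genes (clustering)": 200e9,
    "Multiple alignment (pattern matching)": 100e9,
    "Whole genome analysis (sequence comparison)": 100e12,
}


def total_operations(rate, seconds=BUDGET_SECONDS):
    """Total operations implied by a sustained rate over the time budget."""
    return rate * seconds


def hours_on_machine(total_ops, machine_rate):
    """Wall-clock hours needed on a machine sustaining machine_rate ops/sec."""
    return total_ops / machine_rate / 3600.0


HYPOTHETICAL_MACHINE = 1e12  # illustrative only, not a figure from the source

for name, rate in SUSTAINED_OPS_PER_SEC.items():
    ops = total_operations(rate)
    hours = hours_on_machine(ops, HYPOTHETICAL_MACHINE)
    print(f"{name}: {ops:.1e} operations, {hours:.1f} h at 1e12 ops/sec")
```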
Data Management

 Data management issues will be critically important
- Biological data is estimated to be doubling every 6 months
- GenBank grew from 680,338 base pairs in 1982 to 22 billion base pairs in 2002 (compared to 13.5 billion base pairs as of August 2001)
- Rate of data acquisition is 100X higher than originally anticipated due to improved sequencing technology and methods
 Redundancies and database asynchrony are increasing; database-to-database comparisons are required for analysis and validation
 To look at long-range patterns of expression, syntenic regions on the order of 10's of megabases become reasonable lengths for consideration
What other data issues should be highlighted?
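The GenBank figures quoted above imply an average doubling time that can be checked with a short calculation. The sketch below assumes steady exponential growth between the two endpoints, which is a simplification, and the function name is illustrative.

```python
import math


def doubling_time_years(start_size, end_size, elapsed_years):
    """Average doubling time, assuming steady exponential growth."""
    return elapsed_years * math.log(2) / math.log(end_size / start_size)


genbank_1982 = 680_338  # base pairs, as quoted above
genbank_2002 = 22e9     # base pairs, as quoted above

years = doubling_time_years(genbank_1982, genbank_2002, 2002 - 1982)
print(f"Average GenBank doubling time 1982-2002: {years * 12:.1f} months")
```

Averaged over the whole 20-year span this works out to roughly 16 months per doubling. The 6-month estimate quoted above refers to biological data in general and to more recent growth, so the two figures are not directly comparable.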
Data Management Issues

New Types of Data Support to extend existing RDBMS:
 Sequences and Strings
 Trees and Clusters
 Networks and Pathways
 Deep Images
 3D Models and Shapes
 Molecules and Coordinate Structures
 Hierarchical Models and Systems Descriptions
 Time Series and Sets
 Probabilities and Confidence Factors
 Visualizations
Source: Davidson, Bristol-Myers Squibb Pharm. Res. Institute
Systems Biology – Modeling the
Cellular System
 Combine cell signaling, gene regulatory and metabolic networks to
simulate cell behavior
 Hybrid information- and physics-based model integrating computational/experimental data at all levels
 Modeling of network connectivity (sets of reactions: proteins, small molecules, stochastic, MD)
 Difficult to handle computationally:
 importance of spatial location within the cell
 instability associated with reactions between small numbers of
molecular species
 combinatorial explosion of large numbers of different species
 >Petaflop problem
Systems Biology
 Need to simulate gene expression, metabolism and signal transduction for single and multiple cells
 Algorithms need to be designed specifically for biological research: the parameter optimizer needs to find as many local minima as possible, including global minima, because there are multiple possible solutions of which only one is actually used
 Must be able to simulate both proteins at high concentration, which can be described by differential equations, and proteins at low concentration, which need to be handled by stochastic process simulation
 Stochastic methods are being used (STOCHSIM and Gillespie algorithm)
 individual molecules represented rather than concentrations of
molecular species; Monte Carlo methods are used to predict
interactions
 rate equations are replaced by individual reaction probabilities
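The stochastic approach described above, in which individual molecules are tracked and rate equations are replaced by reaction probabilities, can be sketched with the Gillespie direct method on a toy birth-death system. The system (constant production, first-order degradation), the rate constants and all names below are assumptions chosen for illustration. This is not StochSim, and it is not a model from the source.

```python
import random


def gillespie_birth_death(k_prod, k_deg, n0, t_end, rng):
    """Gillespie direct method for: 0 -> P (rate k_prod), P -> 0 (rate k_deg * n).

    Returns the event times and the molecule count after each event.
    """
    t, n = 0.0, n0
    times, counts = [t], [n]
    while True:
        a_prod = k_prod          # propensity of production
        a_deg = k_deg * n        # propensity of degradation
        a_total = a_prod + a_deg
        if a_total == 0.0:
            break                # no reaction can fire
        # Waiting time to the next reaction is exponential with rate a_total.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Pick the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_prod:
            n += 1
        else:
            n -= 1
        times.append(t)
        counts.append(n)
    return times, counts


rng = random.Random(0)  # fixed seed so the run is repeatable
times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, n0=0,
                                      t_end=50.0, rng=rng)
print(f"{len(times) - 1} reaction events, final count {counts[-1]}")
```

For this toy system the corresponding rate equation dn/dt = k_prod - k_deg * n has the steady state k_prod / k_deg. The stochastic trajectory fluctuates around that value, and those fluctuations matter most when molecule counts are small.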
Digital Imaging
 Used for monitoring of disease progression, diagnosis, preoperative
planning and intraoperative guidance and monitoring
 Algorithms are computationally demanding
 Key issues are segmentation and registration
 Signal processing techniques are used to enhance features and
generate the desired segmentation
 Results of the segmentation are aligned to other data acquisitions and
to the actual patient during procedures
 Results of the segmentation are visualized using different rendering
methods
Digital Imaging
Feature Enhancement
- Idea: Modulate selected characteristics
- Method: Spatial and frequency domain filtering: convolutions
- Parallelization: SMP and MPI style for Fourier transforms [Frigo, 1997] and convolutions
- Applications: Noise reduction [Gerig, 1992], removal of partial volume artefacts [Westin, 1997]
Classification: k-NN, Parzen window
- Idea: Classify an unknown voxel based on prototypes
- Method: Nonparametric supervised statistical classification [Duda, 1973], [Cover, 1967], [Cover, 1968], [Clarke, 1993], [Warfield, 1996], [Friedman, 1975]
- Parallelization: Each voxel treated separately [Friedman, 1975]. SMP for core, MPI
- Applications: Classification in different areas of the body [Kikinis, 1992], [Huppi, 1998], [Warfield, 1995], [Warfield, 1996]
Classification: EM
- Idea: Increase robustness of statistical approach through adaptive behaviour
- Method: Iterates between statistical classification and intensity prediction/correction [Wells, 1986], [Wells, 1996]
- Parallelization: Classification step as in k-NN; intensity correction: convolutions. SMP, NUMA
- Applications: Classification primarily of brain MRI [Morocz, 1995], [Kikinis, 1997], [Iosifescu, 1997]
Linear Registration: Intra-subject
- Idea: Use inherent contrast similarity to align image
- Method: Requires entropy and joint entropy computation [Wells, 1996a]
- Parallelization: Joint histogram computation, parallelized by computing the histogram of data chunks. Joint entropy calculated by a loop over the histogram. SMP, MPI
- Applications: Registration of slices for multichannel analysis [Huppi, 1998], [Nakajima, 1997]
Linear Registration: Inter-subject
- Idea: Measure mismatch of alignment of two subjects by counting the number of voxel labels that don't match
- Method: Multiresolution alignment using XOR function [Warfield, 1998]
- Parallelization: First data is resampled, then misalignment used to calculate registration. MPI
- Applications: Initial alignment for template driven segmentation [Warfield, 1996]
Nonlinear Registration
- Idea: Use rubbersheet transform to align two data sets from different subjects
- Method: Multiresolution approach with fast local similarity measurement, and a simplified regularization model
- Parallelization: Low pass filter, upsampling, downsampling, arithmetic operations, solve systems of equations. SMP
- Applications: Template driven segmentation [Warfield, 1996]
Visualization: Surface Model Generation
- Idea: Generate highly optimized triangle surface models
- Method: Pipeline of marching cubes [Lorensen, 1987], triangle reduction [Schroeder, 1992], and triangle smoothing [Taubin, 1995]
- Parallelization: Distributed computation of triangle models for each structure of a data set (up to 300). LSF
- Applications: Visualization for surgical applications and for presentation purposes [Ozlen, 1998], [Chabrerie, 1998], [Chabrerie, 1998a], [Kikinis, 1996]
Visualization: Volume Rendering
- Idea: Direct visualization of volume data without prior processing
- Method: Shear warp algorithm [Ylä-Jääski, 1997], [Lacroute, 1994], [Saiviroonporn, 1998]
- Parallelization: Render subvolumes separately. SMP, MPI
- Applications: Visualize data before segmentation, interactive editing
Source: R. Kikinis, Brigham and Women's Hospital and Harvard Medical School
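The parallelization described for intra-subject registration (a joint histogram built from histograms of data chunks, then joint entropy computed by a loop over the histogram) can be sketched as follows. The sketch assumes the two images are already resampled onto the same grid, flattened, and quantized to small integer intensities. The chunks are processed sequentially here, and the function names are illustrative.

```python
import math
from collections import Counter


def chunk_histogram(chunk_a, chunk_b):
    """Joint histogram of one chunk: counts of (intensity_a, intensity_b) pairs."""
    return Counter(zip(chunk_a, chunk_b))


def joint_entropy(image_a, image_b, n_chunks=4):
    """Joint entropy in bits of two equally sized, flattened, quantized images."""
    if len(image_a) != len(image_b):
        raise ValueError("images must have the same number of voxels")
    n = len(image_a)
    size = max(1, math.ceil(n / n_chunks))
    total = Counter()
    # Each chunk histogram is independent, so this loop is the part that
    # could be distributed across SMP threads or MPI ranks and then merged.
    for start in range(0, n, size):
        total.update(chunk_histogram(image_a[start:start + size],
                                     image_b[start:start + size]))
    # Loop over the merged histogram to accumulate -sum(p * log2(p)).
    entropy = 0.0
    for count in total.values():
        p = count / n
        entropy -= p * math.log2(p)
    return entropy


aligned = joint_entropy([0, 0, 1, 1], [0, 0, 1, 1])
scrambled = joint_entropy([0, 0, 1, 1], [0, 1, 0, 1])
print(aligned, scrambled)
```

In this toy case the perfectly corresponding pair has lower joint entropy than the scrambled pair, which is the property an entropy-based alignment measure exploits. Because the merged histogram does not depend on how the data was chunked, the result is the same for any `n_chunks`.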
Biomedical Application
and Kernels
Kernels | Application | Source | Today
BioCatalysis (Quantum and MM)
Ab Initio Quantum Chemistry | GAMESS | DoD HPCMP TI-03 | TeraOp/s sustained
Quantum Chemistry | GAUSSIAN | www.gaussian.com/ | TeraOp/s sustained
Quantum Mechanics | NWChem | PNNL | TeraOp/s sustained
Macromolecular Dynamics, Energy Minimization, Monte Carlo Simulation | CHARMM | http://yuri.harvard.edu/ | 10 TeraOp/s sustained
Molecular Mechanical Force Field | AMBER | http://www.amber.ucsf.edu/ | 10 TeraOp/s sustained
µ-Array 8000 Genes
Clustering | CLUSTALW | http://bimas.dcrt.nih.gov/sw.html | 200 GigaOps/s sustained
Multiple Alignment Phylogenetics
Pattern Matching | NONMEM | http://www.globomaxservice.com/products/nonmem.html | 100 GigaOps/s sustained
Pattern Matching | PHYLIP | http://evolution.genetics.washington.edu/phylip.html | 100 GigaOps/s sustained
Pattern Matching | FastME | http://www.ncbi.nlm.nih.gov/CBBresearch/Desper/FastME.html | 100 GigaOps/s sustained
Whole Genome Analysis
Sequence Comparison | Needleman-Wunsch | http://www.med.nyu.edu/rcr/rcr/course/sim-sw.html | 100 TeraOps/s sustained
Sequence Comparison | FASTA | http://www.ebi.ac.uk/fasta33/ | 100 TeraOps/s sustained
Sequence Comparison | HMMER | http://hmmer.wustl.edu/ | 100 TeraOps/s sustained
Sequence Comparison | GENSCAN | http://genes.mit.edu/GENSCANinfo.html | 100 TeraOps/s sustained
Systems Biology
Functional Genomics | | http://genomics.lbl.gov/~aparkin/Group/Codebase.html |
Biological Pathway Analysis; Complex Systems Simulation and Analysis | | http://ecell.sourceforge.net/ |
Partial Differential Equation Solver; Ordinary Differential Equation Solver | | http://www.nrcam.uchc.edu/ |
Digital Imaging
Marching Cubes | | Paper & Pencil for Kernels |
Triangle Reduction | | Paper & Pencil for Kernels |
Triangle Smoothing | | Paper & Pencil for Kernels |
Noise Reduction | | Paper & Pencil for Kernels |
Artifact Removal | | Paper & Pencil for Kernels |
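Needleman-Wunsch is listed above as a sequence comparison kernel, and it illustrates why this class of application is integer intensive: the core is a dynamic-programming table filled with integer additions and comparisons. The sketch below computes only the global alignment score. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not parameters from the source.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of sequences a and b, with a linear gap penalty.

    Only two rows of the dynamic-programming table are kept, so memory is
    O(len(b)) while time is O(len(a) * len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell depends only on its left, upper and upper-left neighbours, so cells along an anti-diagonal are independent of one another. That independence is one reason such kernels are candidates for FPGA and other parallel implementations.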