Life Science Software and High
Performance Computing
Seminar Series Part IV
Craig A. Stewart
Fulbright Senior Scholar at ZIH
Associate Vice President, Research & Academic Computing
License Terms
• Please cite this presentation as: Stewart, C.A. Life Science Software
and High Performance Computing: Seminar Series Part IV. 2006.
Presentation. Presented at: Technische Universitaet Dresden (Dresden,
Germany, 27 Apr 2006). Available from: http://hdl.handle.net/2022/14767
• Portions of this document that originated from sources outside IU are shown
here and used by permission or under licenses indicated within this
document.
• Items indicated with a © are under copyright and used here with permission.
Such items may not be reused without permission from the holder of
copyright except where license terms noted on a slide permit reuse.
• Except where otherwise noted, the contents of this presentation are
copyright 2007 by the Trustees of Indiana University. This content is
released under the Creative Commons Attribution 3.0 Unported license
(http://creativecommons.org/licenses/by/3.0/). This license includes the
following terms: You are free to share – to copy, distribute and transmit the
work and to remix – to adapt the work under the following conditions:
attribution – you must attribute the work in the manner specified by the
author or licensor (but not in any way that suggests that they endorse you or
your use of the work). For any reuse or distribution, you must make clear to
others the license terms of this work.
Life Science Software and HPC Seminar
Plan as of today
• Today:
– Some thoughts and observations on US national projects and
centers
• Funding agencies
• HPC/grid computing
• Bioinformatics and computational biology
– Performance analysis
• Late June – another visit to Dresden, associated with the ISC
• Late August – another visit to Dresden, associated with Euro-PAR
US Funding agencies (1)
• National Science Foundation - $5.5B/year annual budget, funds
about 20% of all basic research in US. Basic research in comp sci,
math, biology, geology, etc. www.nsf.gov
• National Institutes of Health - $27.5B/year. Funds largest share of
medical research. 27 separate institutes and centers www.nih.gov
• Department of Energy. Funds much applied and basic research.
Funds: Argonne National Laboratory, Brookhaven National
Laboratory, Fermi National Accelerator Laboratory, Lawrence
Berkeley National Laboratory, Lawrence Livermore National
Laboratory, Oak Ridge National Laboratory, Pacific Northwest
National Laboratory, Sandia National Laboratories, Stanford Linear
Accelerator Center, Electron accelerators, Thomas Jefferson
National Accelerator Facility www.doe.gov
US Funding agencies (2)
• Department of Defense. http://www.defenselink.mil/
– Defense Advanced Research Projects Agency (DARPA) http://www.darpa.mil/
– High Productivity Computing Systems program
http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
• Military branches (esp. Army, Navy, Air Force)
• Department of Homeland Security http://www.dhs.gov/dhspublic/
• National Security Agency www.nsa.gov
• Congressional markups
Some shining successes
• DARPANet/Internet/Abilene
• NSF HPC Centers/NITRD
• “Hallmark” demos e.g. Tornado, Caterpillar bulldozer design
• It’s really possible for a good researcher to get time on a
nationally shared supercomputer and get help with it
DARPA High Productivity
Computing System program
http://www.darpa.mil/ipto/programs/hpcs/programplan.htm
IBM, Cray, Sun currently phase II industry partners
Real, not peak
http://www.darpa.mil/ipto/programs/hpcs/assessment.htm
Current Top500 list
• DOE impact on top of list!
• http://www.top500.org/lists/2005/11/basic
NSF strategies
• Office of Cyberinfrastructure. Daniel Atkins, Director
• Report of the National Science Foundation Blue-Ribbon Advisory Panel on
Cyberinfrastructure.
http://www.nsf.gov/publications/pub_summ.jsp?ods_key=cise051203 (aka
“the Atkins Report”).
• Draft – NSF’s Cyberinfrastructure vision for the 21st century.
http://www.nsf.gov/od/oci/ci_v5.pdf
• NSF Cyberinfrastructure panel
• Systems
– $30M/year x 4 solicitations for large shared systems
– $200M for a 1 PetaFLOPS *achieved* system
– Focus on science results
• Software
– National Middleware Initiative
National supercomputer centers
• Pittsburgh Supercomputing Center
• San Diego Supercomputer Center
• National Computational Science Alliance
• TeraGrid
• Other university centers of note:
– Purdue University
– Ohio Supercomputer Center
– Louisiana State University
– Texas Advanced Computing Center
– Texas Tech
– Rice
– Caltech
– Cornell
– U. Chicago (computation, electronic visualization lab)
– Florida/SURA
NIH
• National Center for Research Resources
• Really focused on clinical resources, not computing
resources
• NIH is perhaps doing more than any other funding agency to
promote openness in research as a result of its data access
policies and support for open source software
• National Library of Medicine, Protein Data Bank (also
supported by NSF)
A semirandom walk
through some
US projects
Cyberinfrastructure for
Phylogenetic Research (CIPRES)
• http://www.phylo.org/
• The largest active phylogenetics group going. “The goal of
the CIPRES project is to enable large-scale phylogenetic
reconstructions on a scale that will enable analyses of huge
datasets containing hundreds of thousands of biomolecular
sequences.” Has 5 years of funding.
• Computational phylogenetics activities: phylogenetic
reconstruction from gene order, gene sequences. Horizontal
gene transfer.
RENCI (Renaissance Computing
Institute)
• http://www.renci.org/
• Led by Dan Reed. “a major collaborative venture of Duke
University, North Carolina State University, the University of
North Carolina at Chapel Hill and the state of North
Carolina.”
• Funding through the National Middleware Initiative
• Key role in the TeraGrid
Argonne National Lab Biosciences Division
• Led by Rick Stevens. http://www.bio.anl.gov/
• LOTS of structural biology. Very focused, well funded and
dedicated group.
Cal-IT2
• Led by Larry Smarr. http://www.calit2.net/
• Lots of areas of focus, including:
– GEON: The Geosciences Network [GEON]
– Laboratory for the Ocean Observatory Knowledge
INtegration Grid [LOOKING]
– Sensor Networks
BIRN
• Biomedical Informatics Research Network
• http://www.nbirn.net/
• NIH-sponsored attempt to create health-oriented cyberinfrastructure
• Function BIRN – brain function and disorders, e.g. schizophrenia
• Morphometry BIRN – brain structural disorders, e.g. Alzheimer’s
• Mouse BIRN – studying mouse brain and mouse models of human
brain disorders
• Grid technology, using federated data system approach, based on
Globus, SRB, etc.
OptIPuter
• “The OptIPuter, so named for its use of Optical networking,
Internet Protocol, computer storage, processing and
visualization technologies, is an envisioned infrastructure that
will tightly couple computational resources over parallel
optical networks using the IP communication mechanism.
The OptIPuter exploits a new world in which the central
architectural element is optical networking, not computers - creating "supernetworks".”
• LambdaRAM
• http://www.optiputer.net/index.html
Genomes to Life
• http://www.doegenomestolife.org/
• Original goals:
– Identify and Characterize the Molecular Machines of Life — the
Multiprotein Complexes That Execute Cellular Functions and
Govern Cell Form
– Characterize Gene Regulatory Networks
– Characterize the Functional Repertoire of Complex Microbial
Communities in Their Natural Environments at the Molecular
Level
– Develop the Computational Methods and Capabilities to
Advance Understanding of Complex Biological Systems and
Predict Their Behavior
– (Goals taken directly from Genomes to Life web site)
Genomes to Life refactored
• The Department of Energy’s Office of Science announced ... that it
is revising its plans for the deployment of new research facilities to
support its Genomics:GTL program. … The specific goal of the new
facilities plan will be to accelerate GTL systems biology research in
the area of bioenergy, with the objective of developing cost-effective,
biologically based renewable energy sources to reduce U.S.
dependence on fossil fuels.
• http://www.sc.doe.gov/Sub/Newsroom/News_Releases/DOESC/2006/GTL/index.htm
Current Genomic Pipeline
• Inputs:
– sequence info: NR, PFAM
– structure info: SCOP, PDB
– Arabidopsis protein sequences
• Building FOLDLIB:
– PDB chains
– SCOP domains
– PDP domains
– CE matches PDB vs. SCOP
– 90% sequence non-identical
– minimum size 25 aa
– coverage (90%, gaps <30, ends <30)
• Prediction of:
– signal peptides (SignalP, PSORT)
– transmembrane (TMHMM, PSORT)
– coiled coils (COILS)
– low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
– Only sequences w/out A-prediction
• Structural assignment of domains by 123D on FOLDLIB
– Only sequences w/out A-prediction
• Functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Store assigned regions in the DB
• http://eol.sdsc.edu/methodology.html
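The two structural-assignment steps on this slide can be read as a cascade: a sequence is tried against one method, and only sequences that remain unassigned move on to the next. The sketch below illustrates that reading only. It assumes "w/out A-prediction" means "no assignment yet", and the stage functions are hypothetical toy stand-ins, not interfaces to the real PSI-BLAST or 123D programs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A stage takes a protein sequence and returns a fold label, or None when
# it cannot make an assignment. Hypothetical stand-ins for illustration.
Stage = Tuple[str, Callable[[str], Optional[str]]]


def run_cascade(sequences: Dict[str, str],
                stages: List[Stage]) -> Dict[str, Tuple[str, str]]:
    """Try each stage in order; pass only unassigned sequences onward."""
    assigned: Dict[str, Tuple[str, str]] = {}
    remaining = dict(sequences)
    for stage_name, assign in stages:
        still_unassigned: Dict[str, str] = {}
        for seq_id, seq in remaining.items():
            label = assign(seq)
            if label is None:
                still_unassigned[seq_id] = seq
            else:
                assigned[seq_id] = (stage_name, label)
        remaining = still_unassigned
    return assigned


def toy_profile_search(seq: str) -> Optional[str]:
    """Toy rule standing in for a profile search (assumption)."""
    return "fold_A" if seq.startswith("MK") else None


def toy_threading(seq: str) -> Optional[str]:
    """Toy rule standing in for a threading method (assumption)."""
    return "fold_B" if len(seq) >= 25 else None


sequences = {"p1": "MKVLAAGIV", "p2": "A" * 30, "p3": "ACD"}
stages = [("profile_search", toy_profile_search), ("threading", toy_threading)]
result = run_cascade(sequences, stages)
unassigned = sorted(set(sequences) - set(result))
```

The point of the ordering is cost: the cheaper method runs on everything, and the more expensive one sees only what is left over.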
Scale of Multi-genome Analysis
• Inputs:
– sequence info: NR, PFAM
– structure info: SCOP, PDB
– Genomes protein sequences: ~800 genomes @ 10k-20k per = ~10^7 ORFs
• Building FOLDLIB (10^4 entries):
– PDB chains
– SCOP domains
– PDP domains
– CE matches PDB vs. SCOP
– 90% sequence non-identical
– minimum size 25 aa
– coverage (90%, gaps <30, ends <30)
• Prediction of:
– signal peptides (SignalP, PSORT)
– transmembrane (TMHMM, PSORT)
– coiled coils (COILS)
– low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
– Only sequences w/out A-prediction
• Structural assignment of domains by 123D on FOLDLIB
– Only sequences w/out A-prediction
• Functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Store assigned regions in the DB
• CPU-year figures shown on the slide, in order of appearance: 4, 228, 3, 9, 252, 3
• http://eol.sdsc.edu/methodology.html
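The scale estimate on this slide (~800 genomes at 10k-20k ORFs each, on the order of 10^7 ORFs) is simple arithmetic, and the CPU-year figures can be turned into rough wall-clock times. The sketch below does both. The 1024-CPU machine size and the perfect-scaling assumption are illustrative choices, not figures from the talk.

```python
# Back-of-envelope check of the scale figures on the slide.
GENOMES = 800
ORFS_PER_GENOME = (10_000, 20_000)  # "10k-20k per" genome

# 800 genomes x 10k-20k ORFs each: 8 to 16 million, i.e. on the order of 10^7
orfs_low, orfs_high = (GENOMES * n for n in ORFS_PER_GENOME)


def wall_clock_days(cpu_years: float, cpus: int) -> float:
    """Ideal wall-clock time in days, assuming perfect parallel scaling."""
    return cpu_years * 365.0 / cpus


# Largest single CPU-year figure on the slide, on a hypothetical 1024-CPU system.
LARGEST_STEP_CPU_YEARS = 252
HYPOTHETICAL_CPUS = 1024  # assumption for illustration only
days = wall_clock_days(LARGEST_STEP_CPU_YEARS, HYPOTHETICAL_CPUS)

print(f"ORFs: {orfs_low:,} to {orfs_high:,}")
print(f"{LARGEST_STEP_CPU_YEARS} CPU-years on {HYPOTHETICAL_CPUS} CPUs: {days:.1f} days")
```

Real runs would take longer than this ideal figure, since scheduling, I/O, and load imbalance all eat into parallel efficiency.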
Other centers of note
• National Resource for Biomedical Supercomputing (NRBSC).
Pittsburgh. Source of MCell. http://www.nrbsc.org/.
• Scientific Computing and Imaging Institute – Christopher R. Johnson
http://www.sci.utah.edu/
• UCSD Bioinformatics Program - http://bioinformatics.ucsd.edu/
• Wash U bioinformatics http://www.ccb.wustl.edu/
• MIT, Johns Hopkins also have interesting programs
• List (incomplete) at http://zlab.bu.edu/~mfrith/BioinfoCenters.html
Some international efforts
• eScience project - http://www.nesc.ac.uk/. EDIAMOND
• Japanese Petaflops Protein Folding project http://www.jsbi.org/journal/GIW02/GIW02P121.pdf
Some activities at IU
• Flybase – authoritative source of annotated fruit fly genomic
information. http://flybase.bio.indiana.edu/
• Lifescienceweb http://www.lifescienceweb.org/
– Mutdb http://www.mutdb.org/
– SBLEST “The Structure-Based Local Environment Search Tool
uses vectors of amino acid structural environments to perform K
Nearest Neighbor queries against a database of protein
structures. Our Web services allow for authenticated (password
protected) submission of a protein structure, or selection of an
existing structure and searching it against common databases
and then visualization of the results using UCSF Chimera or
PyMOL.”
http://www.lifescienceweb.org/index.php?mode=sBlest_about
• TeraGrid – teragrid.iu.edu
• IU IT Strategic Plan
• IU Life Sciences Strategic Plan
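The SBLEST description mentions K Nearest Neighbor queries over vectors of structural environments. A minimal sketch of such a query follows. The three-component vectors, the site names, the Euclidean metric, and the brute-force search are all assumptions for illustration; the slide does not say how SBLEST encodes environments or which distance it uses.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def k_nearest(query: Sequence[float],
              database: Dict[str, Sequence[float]],
              k: int) -> List[Tuple[float, str]]:
    """Return the k entries closest to `query` as (distance, name) pairs."""
    scored = ((math.dist(query, vec), name) for name, vec in database.items())
    return heapq.nsmallest(k, scored)


# Toy environment vectors (hypothetical values, not real structural data).
database = {
    "site_a": (0.0, 0.0, 1.0),
    "site_b": (1.0, 1.0, 1.0),
    "site_c": (5.0, 5.0, 5.0),
}
query = (0.1, 0.0, 1.0)

hits = k_nearest(query, database, k=2)
hit_names = [name for _, name in hits]
```

A brute-force scan like this is linear in the database size per query; a production service over many structures would more likely use a spatial index.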
Some .orgs and commercial
activities
• Bioinformatics.org
– Includes BioBrew Linux
• BioPerl http://www.bioperl.org/wiki/Main_Page
• BioPython http://www.biopython.org/
• BioJava http://biojava.org/wiki/Main_Page
• BioMoby http://biomoby.open-bio.org/index.php/what-is-moby/
• Bio grid activities
– folding@home http://folding.stanford.edu/
– Protein predictor @ home http://predictor.scripps.edu/
– rosetta@home http://boinc.bakerlab.org/rosetta/
– Fight aids @ home http://fightaidsathome.scripps.edu/
– World community grid http://www.worldcommunitygrid.org/
• Commercial:
– Apple bioclusters (uses SGE)
– IBM Life Science Institutes of Innovation
– Sun Center of Excellence
– Dell Center of Excellence
Some Good Books
• Computational Cell Biology. 2002. Springer Verlag (Fall et al, eds).
• Foundations of systems biology. MIT Press, 2001. Kitano (ed)
• Winter, P.C., G.I. Hickey, H.L. Fletcher. 1998. Instant notes in genetics.
Springer-Verlag, NY. ISBN 0-387-91562-1
• Durbin, R., S. Eddy, A. Krogh, G. Mitchison. 2000. Biological sequence
analysis. Cambridge University Press.
• Gibas, C., and P. Jambeck. 2001. Developing bioinformatics computer
skills. O’Reilly.
• Tisdall, J. 2001. Beginning perl for bioinformatics. O’Reilly.
• Tisdall, J. 2003. Mastering perl for bioinformatics. O’Reilly.
• Gusfield, D. 1997. Algorithms on strings, trees, and sequences. Cambridge
University Press.
• Berman, F., G.C. Fox, A.J.G. Hey. (eds) 2003. Grid computing: making the
grid infrastructure a reality. Wiley, Sussex
Acknowledgments
• Funding for projects described in this talk has come from the National
Science Foundation, National Institutes of Health, Lilly Endowment, Inc.,
State of Indiana (particularly through support of I-light Initiative and the 21st
Century Fund).
• The work described here was made possible by the faculty, students, and
staff of Indiana University. Thanks especially to the staff of RAC, CPO,
Telecommunications, PTL, UITS generally, the participants in the Indiana
Genomics Initiative, and the participants in the METACyt Initiative.
• Several of the slides and ideas presented here were developed by
colleagues or collaborators – the Research and Academic Computing
Division of UITS in general, and Dick Repasky in particular.
• Stewart’s visit to Dresden is funded in part by the Center for the
International Exchange of Scholars, the Technical University of Dresden,
and Indiana University.
• And thank you very much! This has been fun and
educational for me!