The Future of Bioinformatics (with examples from structural bioinformatics)
Philip E. Bourne
The University of California San Diego
pbourne@ucsd.edu
http://www.sdsc.edu/pb/Talks
Feb. 25, 2004
World University Network - Worldwide Broadcast

Outline
• Bioinformatics thus far
• Today – a growth discipline
• Drivers
• Data
• Complexity – biological and data
• The interface to medical informatics and systems biology
• Challenges
• The devil is in the details
• Quality control
• Fundamentals versus relevance to biology

"You can observe a lot just by watching."

Bioinformatics Thus Far – Pre 1970
Bioinformatics (2003) 19 2176-2190
• 1945 Biochemical Pathways – Horowitz
• 1953 Structure of DNA – Watson & Crick
• 1969 Genetic Variation
• 1962 Molecular Homology – Florkin
• 1965 Evolutionary Patterns – Pauling
• 1966 Molecular Modeling – Levinthal
• 1967 Phylogenetic Trees – Fitch
• 1969 Properties – Ptitsyn
• 1970 Dynamic Programming – Needleman & Wunsch

• 1953 Game Theory – von Neumann and Morgenstern
• 1959 Grammars – Chomsky
• 1962 Information Theory – Shannon & Weaver
• 1966 Cellular automata – von Neumann

Bioinformatics Thus Far – 1970’s
Problem Definition
• Improved Sequence Alignments – Sankoff; Smith-Waterman Algorithm
• Exon/Introns – Gilbert
• Public Resources – Dayhoff, PDB
• Structural patterns and Properties – Richards
• Structure Prediction – Levitt; Chou and Fasman; Scheraga

Bioinformatics Thus Far – 1980’s
Computational Biology Emerges
• Domains recognized – Rashin
• Neural nets – Hopfield
• Tree of Life Emerges
• Molecular computing – Conrad
• FASTA – Lipman & Pearson
• Nanotechnology – Drexler
• Profiles – Gribskov
• Reductionism begins – Thornton, Sander
• Clustering – Shepard
• Relational Databases
• Networks – EMBLnet, BIONET

Bioinformatics Thus Far – 1990’s
Bioinformatics and Biotechnology Emerge
• Human Genome Project
• Internet/Web

So What is Bioinformatics Today?
• A relatively new term for a scientific endeavor that has been around much longer
• Medical informatics preceded it, and defined some of the foundations?
• A scientific endeavor driven by a paradigm shift in which biology became a data-driven science
• A scientific endeavor that has gained from fundamental developments in computer and information science, e.g., algorithms, ontologies, Bayesian networks, neural networks, text mining …
• A growth discipline …

"Do you mean now?" -- When asked for the time.

Bioinformatics - A Vice Chancellor’s View
[Figure, (C) Copyright Phil Bourne 1998. Flow: Biological Experiment; Data (collect); Information (characterize); Knowledge (compare, model); Discovery (infer). Axes: Complexity, Technology, Year (90, 95, 00, 05). Labels: Higher-life; Organ; Brain Mapping; Cardiac Modeling; Neuronal Modeling; Model Metabolic Pathway of E. coli; Sub-cellular Structure; Cellular Assembly; Virus Structure; Ribosome; Genetic Circuits; Gene Chips; ESTs; Sequence; Human Genome Project; Yeast Genome; E. coli Genome; C. elegans Genome; # People/Web Site; Computing Power; Sequencing Technology (1 Small Genome/Mo., Human Genome).]

Growth in Bioinformatics as Measured by ISMB Attendance
[Chart: ISMB attendance. Labels: 1500; 2002 Edmonton, Canada.]
http://www.iscb.org/history.shtml

Bioinformatics Journal
[Chart: submissions per year, 1997 to 2003; vertical axis 0 to 1400.]

Growth in the Journal Bioinformatics
[Chart: impact factor per year, 1997 to 2003; vertical axis 0 to 5.]

Drivers – Data Growth and Data Complexity
Consider Macromolecular Structure as an example

Bourne, Bioinformatics Editorial 1999 15(9):715
“Over the next 5 years there will be an estimated 10 major structural genomics efforts each yielding 200 structures per year. While these efforts will deplete regular structure determination efforts, improvements in technology and a general expansion of the field will continue to yield 50 structures per week worldwide outside of the structural genomics initiatives.”
• Net result: 35,000 structures by 2005
• There were 11,000 structures at the time of this prediction

"You can observe a lot just by watching."

PDB Growth Curve
• Approx. 24,000 structures today
• In 2003 approx. 5,000 structures were deposited

History

Predictions Can Be Good

A Data Centric View of the Future
• Data complexity
• High throughput data collection
• Database versus literature
• Bioinformatics as data driver
• Data representation
• Data integration

"If you come to a fork in the road, take it."

Numbers and Complexity
Complexity is increasing
(a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies (f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome
Courtesy of David Goodsell, TSRI

Complexity - The Ribosome
A Nanomachine
• Translates mRNA into protein
• Molecular Mass: 2.6 million
• Maximum Dimension ~25 nm
• 2/3 RNA – performs catalysis
• 1/3 protein – outer scaffold for the RNA
[Figure from J. Frank, Wadsworth Center, NY. Labels: 50S, 30S, protein, mRNA.]
"The ribosome, together with its accessories, is probably the most sophisticated machine ever made." R. Garrett (1999) Nature 400
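
The 1999 growth prediction quoted above can be checked with back-of-the-envelope arithmetic. The sketch below only restates the slides' own figures (10 efforts, 200 structures per effort per year, 50 structures per week elsewhere, 5 years, 11,000 structures at the start, approx. 24,000 today, approx. 5,000 deposited in 2003). The function names, the 52-week year, and the two-year constant-rate extrapolation are assumptions made for illustration, not something the slides state.

```python
# Back-of-the-envelope check of the 1999 PDB growth prediction.
# Input numbers come from the slides; the names, the 52-week year and the
# two-year constant-rate extrapolation are illustrative assumptions.

WEEKS_PER_YEAR = 52  # assumption used to convert "per week" into "per year"


def predicted_total(start, efforts, per_effort_per_year, other_per_week, years):
    """Starting count plus structural-genomics output plus all other depositions."""
    genomics = efforts * per_effort_per_year * years
    other = other_per_week * WEEKS_PER_YEAR * years
    return start + genomics + other


def constant_rate_total(current, per_year, years):
    """Naive extrapolation: today's count plus a fixed number of depositions per year."""
    return current + per_year * years


if __name__ == "__main__":
    # 11,000 structures in 1999; 10 efforts x 200 per year; 50 per week elsewhere; 5 years
    print(predicted_total(11_000, 10, 200, 50, 5))   # 34000
    # approx. 24,000 today and approx. 5,000 per year, held constant for 2 years (assumed)
    print(constant_rate_total(24_000, 5_000, 2))     # 34000
```

Both rough estimates come out at about 34,000, close to the 35,000 quoted on the slide.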

High Throughput - The Structural Genomics Pipeline (X-ray Crystallography)
Bioinformatics Throughout the Process
Basic Steps
• Target Selection
• Crystallomics: Isolation, Expression, Purification, Crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish
Annotations on the steps
• Bioinformatics: Distant homologs; Domain recognition
• Automation
• Bioinformatics: Empirical rules
• Automation; Better sources
• Software integration; Decision Support
• MAD Phasing
• Automated fitting
• Bioinformatics: Alignments; Protein-protein interactions; Protein-ligand interactions; Motif recognition
• Bioinformatics: No?

An Aside on the Future of Publishing
Full Description Captured as the Paper/Database is Written/Deposited
“… the p53 core domain structure consists of a β sandwich that serves as a scaffold for two large loops and a loop-sheet-helix motif ...” Science Vol. 265, p346
Corresponding structure from the PDB: 1TSR
Does away with ... ?
• β sandwich? Where?
• Large loop? Which one??
• Loop-sheet-helix???
• Oops!

BioEditor - A DTD Driven Domain Specific Editor
http://bioeditor.sdsc.edu

Structural Genomics Targets and their Status from http://targetdb.rcsb.org
Bourne et al. 2004 Pacific Symposium on Biocomputing
http://www-smi.stanford.edu/projects/helix/psb04/bourne.doc

The Data - Bioinformatics Cycle
Result – Computation and Experiment Become More Synergistic
• Data to Bioinformatics: Turn Data into Knowledge
• Bioinformatics to Data: Turn Knowledge into New Data Requirements

Deuterium Exchange Mass Spec to Predict Structure
[Figure labels: Target Protein; Structure Templates; CASP; DXMS; Threading; k (Stability); Best Structure(s); Amino Acid Profile Match Method.]
[Figure label: COREX]

Biological Representation
• The Gene Ontology changes everything
• Molecular function
• Biochemical process
• Cellular location
• DAG – machine usable
• The number of papers referencing the Gene Ontology has increased dramatically in the last year

Biological Data Representation - Future
• Tools to construct ontologies from free text?
• Ontologies for details of function, protein-protein interaction, protocols, complete pathway information

Data Integration
Web Services – the holy grail of interoperability?

Web Services
• It's not CORBA – biologists can do it
• Easy to implement
• Platform independent
• Driver to force data providers to define and publish a detailed API
• Compelling - introduces the prospect of global workflow

Perl Web Services Client Example
A small Perl program to access all PubMed abstracts containing the word ‘ferritin’

    use SOAP::Lite;

    # Call the remote pubmedAbstractQuery method with the search word
    # given on the command line; the result is an array reference.
    my $ids_ref = SOAP::Lite
        -> uri('http://server.location.edu/pdbWebServices')
        -> proxy('http://server.location.edu/pdbWebServices')
        -> pubmedAbstractQuery($ARGV[0])
        -> result;

    # Dereference the array and print the returned IDs on one line.
    my @ids = @{$ids_ref};
    print "@ids\n";

    Mycomputer(1)% web_service.pl ferritin
    1AEW 1AQO 1BCF 1BFR 1BG7 1DPS 1EUM 1FHA 1JGC 1JI5 1JIG 1MFR 1QGH 1RCC 1RCD 1RCE 1RCG 1RCI 1RYT 2FHA

A Biological Complexity Perspective

[Figure: levels of biological complexity. Headings: Representative Discipline; Example Units; Representative Technology; Scientific Research & Discovery. Labels: Anatomy; MRI; Physiology; Heart; Cell Biology; Organisms; Neuron; Migratory Sensors; Organs; Ventricular Modeling; Cells; Electron Microscopy; Proteomics; Genomics; Structure; Sequence; Macromolecules; Biopolymers; Medicinal Chemistry; Protease Inhibitor; X-ray Crystallography; Atoms & Molecules; Protein Docking; Infrastructure; Technologies; Training. Marker: "You Are Here".]

The Post-Genomic Era
The “New” Central Dogma: Genomes; Gene Products; Structure & Function; Pathways & Physiology
• Scientific Challenges - deciphering the genome, mapping the genotype-phenotype relationships, dissecting organismic function, engineering organisms with altered functionality, figuring out complex traits and polymorphism, understanding physiology.
• Algorithmic Challenges - comparisons of whole and partial genomes, metrics for similarity and homology, metabolic reconstruction, dissecting pathways, and whole cell modeling.
• Computational Challenges - creation of the informatics infrastructure; creation, annotation, curation and dissemination of databases; development of parallel computational methods.

Interaction Networks
A Protein Interaction Map of Drosophila melanogaster
L. Giot, et al. Science, Vol. 302, Issue 5651, 1727-1736, December 5, 2003

Phenomena in biological systems may be organized in several layers.
Populations
  • Ecological Communities
  • Populations of a Species
Physiology and Organisms
  • Integrative physiology, Homeostasis
  • Organs, Tissues
  • Cells
Pathways and Information Transfer
  • Integrated metabolism, regulatory, developmental pathways
  • Simple pathways for information transfer, regulation, development
  • Simple metabolic pathways for creating & using other molecules
Biological Macromolecules and Structures
  • Biomolecular Assemblies; ligand-receptor complexes
  • Molecules and Structures created by genes, gene products
  • Gene Products: RNAs; Proteins
  • Genes and Genomes
Physics and Chemistry
  • e.g. Physical Chemistry, Organic Chemistry, Information theory, Constraints of self-assembling adaptive systems

Each system layer builds from lower system layers & acquires new emergent properties
Populations
  • Ecological Communities
  • Populations of a Species
  New Emergent Properties: Ecological Processes & Populations
Physiology and Organisms
  • Integrative physiology, Homeostasis
  • Organs, Tissues
  • Cells
  New Emergent Properties: Tissue & Organismal Physiology; Developmental & Physiological Processes
Pathways and Information Transfer
  • Integrated metabolism, regulatory, developmental pathways
  • Simple pathways for information transfer, regulation, development
  • Simple metabolic pathways for creating & using other molecules
  New Emergent Properties: Biochemical Pathways & Processes
Biological Macromolecules and Structures
  • Biomolecular Assemblies; ligand-receptor complexes
  • Molecules and Structures created by genes, gene products
  • Gene Products: RNAs; Proteins
  • Genes and Genomes
  New Emergent Properties: Biomolecular Structure & Function; Genes and Genomes
Physics and Chemistry
  • e.g. Physical Chemistry, Organic Chemistry, Information theory, Constraints of self-assembling adaptive systems

The Next Response
• Translational medicine
• Personalized medicine
• Merger of medical, chem and bioinformatics
• Training in cooperative in silico and experimental research
• Centers that reflect that training, i.e. different to NCBI or EBI

"Think! How the hell are you gonna think and hit at the same time?"

Statement of the Director, NIGMS, before the House Appropriations Subcommittee on Labor, HHS, Education
Thursday, February 25, 1999

Near Term Challenges
Better Resources and Algorithms

Current Data Resources and Algorithms are Challenged by Biological Complexity
• Our understanding of biological complexity is not reflected in the current generation of biological data resources
• Hence these resources do not enable the next generation
• Algorithms are often limited since complexity implies variation
• Consider an example - the protein kinase-like superfamily

The SCOP Classification Hierarchy
Courtesy Steven Brenner

An Example of a Structural Superfamily: The Protein Kinase-Like Superfamily
SCOP grouping for kinases
1) Class: Alpha+Beta
2) Fold: Protein Kinase Catalytic Core
3) Superfamily: Protein Kinase Catalytic Core
4) Families: 7 8
   a) Ser/Thr Kinases
   b) Tyr Kinases
   c) Atypical Kinases
   d) Antibiotic Kinases
   e) Lipid Kinases
Superfamily: members are not all eukaryotic, nor all protein kinases. Some homologues discovered in bacteria phosphorylate antibiotics; others phosphorylate lipids.
[Figure: Typical Kinase Core (c-Src, PDB ID: 2SRC)]

Evolution of the Kinase Superfamily: Comparison of Three Superfamily Members
• A: Casein kinase 1 (PDB ID: 1CSN)
• B: Aminoglycoside kinase (PDB ID: 1J7L)
• C: Phosphatidylinositol 3-kinase (PDB ID: 1E8X)
• D: The previous three structures with only their shared region superposed (1CSN: light blue, 1J7L: red, 1E8X: yellow)
• The three kinases share a minimal core required for ATP binding and phosphotransfer.

Our Algorithms Need to Continue to Evolve
Consider structure comparison and alignment of the diverse protein kinases

An Example of Manual vs. Automated with Combinatorial Extension (CE)
• The manual alignment can be used to better understand the limitations of our automated method
• Alignment of helix C of two tyrosine kinases
  • Insulin Receptor Kinase, IRK (PDB ID 1IR3)
  • c-Src (PDB ID 2SRC)
• Can be aligned with 40% identity, 3.0 Å RMSD
• In Src, the C-helix is displaced and rotated outward
• The rotation pushes the N-terminal end of the helix out very far from the N-terminal end of the IRK helix
• CE gaps a part of this (yellow), splitting the helix and aligning part of IRK helix C with the loop leading to helix C in Src
[Figure legend: Orange: IRK; Blue: c-Src; Yellow: CE gap region]

An Example of Manual vs. Automated with CE
• A closer look: CE alignment
• The CE alignment puts closer C-alpha positions together but does not respect helical relationships
• The hand alignment respects the helix and aligns more distant C-alpha positions
[Figure label: Hand alignment]

Improving CEfam: Multiple Alignments with CE
• Example with strands 1 and 2 of the kinase superfamily
  • A: original
  • B: optimal parameters
  • C: manual
• The parameters also improved results with other protein superfamilies in visual analysis
• Just as sequence alignments are benchmarked against structure alignments, structure alignments should be benchmarked against manual results
• The improvement in optimization is now being folded into the next generation of CE

Near Term Challenges
Quality Control
Consider an example: the definition of domains from 3-D structure

The 3D Domain Assignment Problem
A domain is a fundamental structural, functional and evolutionary unit of a protein:
• Compact
• Stable
• Has a hydrophobic core
• Folds independently
• Performs a specific function
• Can be re-shuffled and put together in different combinations
• Evolution works at the level of the domain
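
Quality control of domain definitions comes down to comparing one per-residue assignment against another. The slides do not say how agreement between two assignments is scored, so the following is only a sketch under an assumed rule: two assignments agree when they have the same number of domains and, after the best one-to-one relabelling, at least 80% of residues fall in matching domains. The function names, the scoring rule, the 0.8 threshold and the toy data are all hypothetical.

```python
# Sketch: compare two domain assignments for the same protein chain.
# The scoring rule, the 0.8 threshold and the toy data are illustrative
# assumptions; the slides do not specify an agreement criterion.
from itertools import permutations


def domain_overlap(a, b):
    """Best fraction of residues that two assignments place in matching domains.

    a, b: equal-length sequences of per-residue domain labels (not None).
    Labels are arbitrary, so every one-to-one relabelling of b's domains
    onto a's domains is tried and the best score is kept. This is factorial
    in the number of domains, which is acceptable for a handful of domains.
    """
    if len(a) != len(b) or not a:
        raise ValueError("assignments must be non-empty and cover the same residues")
    labels_a = list(dict.fromkeys(a))  # unique labels, first-seen order
    labels_b = list(dict.fromkeys(b))
    # Pad with None so b may have more domains than a (extras match nothing).
    targets = labels_a + [None] * max(0, len(labels_b) - len(labels_a))
    best = 0
    for perm in permutations(targets, len(labels_b)):
        mapping = dict(zip(labels_b, perm))
        matched = sum(1 for x, y in zip(a, b) if mapping[y] == x)
        best = max(best, matched)
    return best / len(a)


def assignments_agree(a, b, threshold=0.8):
    """Assumed rule: same number of domains and overlap at or above the threshold."""
    return len(set(a)) == len(set(b)) and domain_overlap(a, b) >= threshold


if __name__ == "__main__":
    method_1 = [1] * 50 + [2] * 50        # two domains, boundary after residue 50
    method_2 = ["A"] * 45 + ["B"] * 55    # two domains, boundary after residue 45
    method_3 = ["X"] * 100                # a single domain
    print(domain_overlap(method_1, method_2), assignments_agree(method_1, method_2))  # 0.95 True
    print(domain_overlap(method_1, method_3), assignments_agree(method_1, method_3))  # 0.5 False
```

A boundary shifted by a few residues still counts as agreement under this rule, while a different domain count never does; a real benchmark would need to choose and justify its own criterion.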

Exact assignment of domains remains a difficult and unresolved problem.
• There is no complete agreement among experts on domain assignment given a protein structure.
• Expert methods agree on 80% of all existing manual assignments; the remaining 20% represent “difficult” cases
[Figure labels: Expert assignment #1; Expert assignment #2; Expert assignment #3]

Manual vs. automatic consensuses: do they overlap?
• Chains with manual consensus: 375 (80% of entire dataset)
• Chains with automatic consensus: 374 (80% of entire dataset)
• Chains with consensus (automatic or manual): 424 (90.6% of entire dataset)
• Automatic consensus only: 46 chains (10.9% of chains with consensus)
• Manual consensus only: 47 chains (11.1% of chains with consensus)
• Manual and automatic consensus agree: 328 chains (77.3% of chains with consensus)
• Automatic consensus and manual consensus disagree: 3 chains (0.7% of chains with consensus)
Veretnik et al. 2003 JMB submitted

1cjaa (actin-fragmin kinase, slime mold): an unusual kinase [complex interface]
• SCOP, PDP, DomainParser: 1 domain
• CATH: 1 domain + unassigned
• DALI: 4 domains
[Figure label: typical kinase]

Near Term Challenges – High Throughput

integrated Genomic Annotation Pipeline - iGAP
• structure info: SCOP, PDB
• sequence info: Deduced protein sequences; NR, PFAM
• Building FOLDLIB:
  • PDB chains
  • SCOP domains
  • PDP domains
  • CE matches PDB vs. SCOP
  • 90% sequence non-identical
  • minimum size 25 aa
  • coverage (90%, gaps <30, ends <30)
• Prediction of:
  • signal peptides (SignalP, PSORT)
  • transmembrane (TMHMM, PSORT)
  • coiled coils (COILS)
  • low complexity regions (SEG)
• Create PSI-BLAST profiles for protein sequences
• Structural assignment of domains by PSI-BLAST on FOLDLIB
• Only sequences w/out A-prediction
• Structural assignment of domains by 123D on FOLDLIB
• Only sequences w/out A-prediction
• Functional assignment by PFAM, NR, PSIPred assignments
• Domain location prediction by sequence
• Store assigned regions in the DB

integrated Genomic Annotation Pipeline - iGAP
The same pipeline, annotated with scale and cost:
• ~800 genomes @ 10k-20k ORFs per genome =~ 10^7 ORFs
• 10^4 entries
• CPU-year labels on the pipeline steps, in order of appearance: 4, 228, 3, 9, 252, 3 CPU years
Li, et al., (2003) Genome Biology

Towards Workflows and the Grid
[Figure labels: iGAP; APST Scheduler; Executables; Parameters; Input; Output; Resources; XML; Grid Resource Information; MDS/NWS/Ganglia; Data Manager; SCP/GASS/SRB/FTP; Storage; Compute; Compute Manager; SSH/GRAM/GASS; PBS/LoadLeveler/Condor; Grid Middleware.]

THE EOL GRID CONSORTIUM
[Figure labels: SDSC Blue Horizon; The EOL Cluster; Sun Enterprise Server; Industrial Partners; IBM; Ceres; EOL; BII Singapore; Encyclopedia; Proteomics Inc.; Titech Japan.]

Near Term Challenges – We need to overcome the “high noon” problem

High Noon – A Working Definition
[Figure label: 12:00]
• The cost:benefit ratio of entry to bioinformatics tools and resources is too high for the majority of biologists
• Thus, those who could gain and contribute most from the services provided are not users

One Approach - MBT
• Java toolkit for developing custom molecular visualization applications
• High-quality interactive rendering of:
  • sequence
  • structure
  • function
http://mbt.sdsc.edu

MBT Functionality
Provides
• Data loading
  • Local files (PDB, mmCIF, Fasta, etc)
  • Compressed files (zip, gzip)
  • Remote (http, ftp, OpenMMS?, EJB?)
• Efficient data access
  • Raw data
  • Derived data (StructureMap)
• Visualization (plug-in viewers)

MBT Architecture

Future - The Structure Should be the User Interface
• Ligand - What other entries contain this?
• Chain - What other entries have chains with >90% sequence identity?
• Residue - What is the environment of this residue?

On-going and Longer Term Challenges

Outstanding Problems in Sequence Analysis & Comparison
• Exon recognition
• Protein coding gene modeling
• Protein/EST alignment
• Large scale sequence comparison and alignment
• Synteny recognition
• Polymorphism / variation detection
• Regulatory pattern recognition
• Repetitive DNA characterization
• RNA gene modeling

Exemplar Bioinformatics Problems
1. Full genome comparisons
2. Rapid assessment of polymorphic variations
3. Complete construction of orthologous and paralogous groups
4. Structure resolution of large assemblies/complexes
5. Dynamical simulation of realistic systems
6. Rapid structural/topological clustering of proteins
7. Protein folding
8. Computer simulation of membrane insertion
9. Simulation of cellular pathways / sensitivity analysis of pathways, stoichiometry and kinetics

Bringing the Data View and the Complexity View Together to Define the Bioinformatics “Engineering” Challenge
• Easy access to any type of biological data across databases
• Ability to go across databases and types of data
• Rapidly infer knowledge from new genome sequences
• Find relationships between sequence, structure and function of gene products
• Relate genotype to phenotype in species
• Access and apply polymorphism data seamlessly

• A single computer interface (Web browser?)
• Computer platform independence
• Total opaqueness of format differences
• Compute on a point and click mode
• Seamless access to files, file uploads and downloads
• Multimedia capabilities on the interface
• Ability to integrate new tools/databases painlessly

Acknowledgements
• To all those who have chosen bioinformatics as a career and make the field so rich
• Particularly those who do so for lesser rewards – the data providers and annotators
• My group for the fun we had discussing this topic
http://rinkworks.com/said/yogiberra.shtml

"I didn't really say everything I said."