The Future of Bioinformatics
Philip E. Bourne
The University of California San Diego
pbourne@ucsd.edu
http://www.sdsc.edu/pb
April 12, 2004 Michael Conrad Memorial Lecture

Many of Michael’s contributions are now being more fully realized in the fields of bioinformatics and systems biology. We will explore current and future trends in these fields to further appreciate Michael’s vision.

We Have Come a Long Way…

It will be through the increasing merger of computer science, computational science, information science and the life sciences that Michael’s foresight will be fully appreciated. Large amounts of complex data put these disciplines on the same page, and the book of bioinformatics can be written. It is therefore appropriate that today we spend time looking at the immediate future of bioinformatics.

Today’s Outline
We will address the following questions from two perspectives, data complexity and biological complexity:
• How did bioinformatics get here?
• What are the challenges today?
• What will the short- and long-term future hold?
Apology: many illustrations are drawn from our own work in structural bioinformatics.

Disclaimer - Plotting Change
[Figure: plot of ANYTHING against TIME, with a "You Are Here" marker]
“The thing about change is that things will be different afterwards.” - Alan McMahon

Rules of Prediction
• Looking back, everything appears to have developed faster than it actually did
• Looking forward, everything will develop faster than you predict
• Hence, we are all very poor at predicting beyond the next 5 years. Examples:
  The Next Fifty Years: Science in the First Half of the Twenty-first Century, John Brockman (Editor)
  CACM Volume 40, Issue 2 (February 1997)
"This is like deja vu all over again."

Can I Even Do 5 Years?
Bourne, Bioinformatics Editorial 1999 15(9):715
“Over the next 5 years there will be an estimated 10 major structural genomics efforts each yielding 200 structures per year. While these efforts will deplete regular structure determination efforts, improvements in technology and a general expansion of the field will continue to yield 50 structures per week worldwide outside of the structural genomics initiatives.”
• Net result: 35,000 structures by 2005
• There were 11,000 structures at the time of this prediction
"You can observe a lot just by watching."

PDB Growth Curve
Approx. 25,000 structures today
In 2003 approx.
5,000 structures were deposited

History

Predictions Can Be Good
So Let Us Review the History of Bioinformatics Thus Far - General Observations
• A scientific endeavor driven by a paradigm shift in which biology became a data-driven science. Today macromolecular structure data will be used to illustrate this paradigm shift
• A relatively new term for a scientific endeavor that has been around much longer
• Medical informatics preceded it, and defined some of the foundations
• A scientific endeavor that has gained from fundamental developments in computer and information science (e.g., algorithms, ontologies, Bayesian networks, simulation, neural networks, text mining) and which in turn defines new problem domains for computer science
• Systems biology may overtake it
"Do you mean now?" (when asked for the time)

A More Specific Chronology - Pre 1970
Bioinformatics (2003) 19 2176-2190
Biology:
• 1945 Biochemical Pathways - Horowitz
• 1953 Structure of DNA - Watson & Crick
• 1969 Genetic Variation
• 1962 Molecular Homology - Florkin
• 1965 Evolutionary Patterns - Pauling
• 1966 Molecular Modeling - Levinthal
• 1967 Phylogenetic Trees - Fitch
• 1969 Properties - Ptitsyn
• 1970 Dynamic Programming - Needleman & Wunsch
• 1970 Adaptability - Conrad
Theory:
• 1953 Game Theory - von Neumann and Morgenstern
• 1959 Grammars - Chomsky
• 1962 Information Theory - Shannon & Weaver
• 1966 Cellular Automata - von Neumann

A More Specific Chronology - 1970’s
Problem Definition
• Improved Sequence Alignments - Sankoff; Smith-Waterman algorithm
• Structure Prediction - Levitt; Chou and Fasman; Scheraga
• Exons/Introns - Gilbert
• Public Resources - Dayhoff, PDB
• Structural Patterns and Properties - Richards
• Information Processing in Molecular Systems - Conrad

A More Specific Chronology - 1980’s
Computational Biology Emerges
• Domains recognized - Rashin
• Neural nets -
  Hopfield
• Tree of Life emerges
• Molecular computing - Conrad
• FASTA - Lipman & Pearson
• Nanotechnology - Drexler
• Profiles - Gribskov
• Reductionism begins - Thornton, Sander
• Clustering - Shepard
• Relational databases
• Networks - EMBLnet, BIONET

A More Specific Chronology - 1990’s
Bioinformatics and Biotechnology Emerge
• Human Genome Project
• Internet/Web
• Conrad, M., Adaptability theory as a guide for interfacing computers and human society, Systems Research 10, 3-23 (1993).

2004 - Overview of the Current Challenges
Genomes, Gene Products, Structure & Function, Pathways & Physiology
• Scientific Challenges - deciphering the genome, mapping genotype-phenotype relationships, dissecting organismic function, engineering organisms with altered functionality, figuring out complex traits and polymorphism, understanding physiology.
• Algorithmic Challenges - comparisons of whole and partial genomes, metrics for similarity and homology, metabolic reconstruction, dissecting pathways, and whole cell modeling.
• Computational Challenges - creating the informatics infrastructure, information integration, annotation, curation and dissemination of databases, development of parallel computational methods.

Sociological Challenge
[Figure: Bioinformatics journal submissions per year, 1997-2003 (vertical axis 0 to 1400)]
[Figure: Bioinformatics journal impact factor per year, 1997-2003 (vertical axis 0 to 5)]
Data from Bioinformatics
Growth outweighs readership, particularly among biologists

Bioinformatics - A Vice Chancellor’s View
[Figure, (C) Copyright Phil Bourne 1998. Labels recoverable from the slide:
  Process: Biological Experiment, Collect, Data, Characterize, Information, Compare, Knowledge, Model, Infer, Discovery
  Complexity axis: Sequence, Structure, Sub-cellular, Cellular, Assembly, Organ, Higher-life
  Technology axis: Computing Power, Sequencing Technology (1 Small Genome/Mo. to Human Genome), # People/Web Site, Data
  Year axis: 90, 95, 00, 05
  Examples: ESTs, Gene Chips, Genetic Circuits, E. coli Genome, Yeast Genome, C. elegans Genome, Human Genome Project, Virus Structure, Ribosome, Model Metabolic Pathway of E. coli, Neuronal Modeling, Cardiac Modeling, Brain Mapping]

A Data Centric View of the Future
• Data complexity
• High throughput data collection
• Database vs. literature
• Bioinformatics as data driver
• Data representation
• Data integration
"If you come to a fork in the road, take it."

Numbers and Complexity
Complexity is increasing
[Figure, courtesy of David Goodsell, TSRI: (a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies (f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome]

High Throughput - The Structural Genomics Pipeline (X-ray Crystallography)
Bioinformatics Throughout the Process
[Figure: pipeline of basic steps, with supporting technology noted along the way]
Basic steps: Target Selection; Crystallomics (Isolation, Expression, Purification, Crystallization); Data Collection; Structure Solution; Structure Refinement; Functional Annotation; Publish
Supporting labels, in the order they appear on the slide:
• Bioinformatics: distant homologs, domain recognition
• Automation; Bioinformatics: empirical rules
• Automation; better sources
• Software integration; decision support; MAD phasing; automated fitting
• Bioinformatics: alignments, protein-protein interactions, protein-ligand interactions, motif recognition
• Bioinformatics: No?

An Aside on the Future of Publishing
Full description captured as the paper/database is written/deposited
Does away with ... ?
“… the p53 core domain structure consists of a β sandwich that serves as a scaffold for two large loops and a loop-sheet-helix motif ...” - Science Vol. 265, p. 346
Corresponding structure from the PDB: 1TSR
Oops! β sandwich? Where? Large loop? Which one?? Loop-sheet-helix???

BioEditor - A DTD Driven Domain Specific Editor
http://bioeditor.sdsc.edu
Bioinformatics 2003 19(7) 897-898

The Data - Bioinformatics Cycle
Result: computation and experiment become more synergistic
[Figure: cycle between Data and Bioinformatics]
• Turn data into knowledge
• Turn knowledge into new data requirements

Deuterium Exchange Mass Spec to Predict Structure
Woods, Baker et al.
[Figure labels: Target Protein, Structure Templates, CASP, DXMS, Threading, COREX, k (Stability), Amino Acid Profile, Match Method, Best Structure(s)]

Biological Representation
The Gene Ontology changes everything
• Molecular function
• Biological process
• Cellular component
• Directed acyclic graph (DAG): machine usable
The number of papers referencing the Gene Ontology has increased dramatically in the last year

Biological Data Representation - Future
• Tools to construct ontologies from free text?
• Ontologies for details of function, protein-protein interaction, protocols, complete pathway information

Data Integration
Web Services: the holy grail of interoperability?

Web Services
• It's not CORBA: biologists can do it
• You no longer have to remember where you left it, i.e. registries
• Platform independent
• Driver to force data providers to define and publish a detailed API
• Compelling: introduces the prospect of global workflow

Perl Web Services Client Example
A small Perl program to access all PubMed abstracts containing the word 'ferritin'

    use strict;
    use warnings;
    use SOAP::Lite;

    # Call the remote pubmedAbstractQuery method with the search term
    # from the command line; the result is a reference to a list of IDs.
    my $ids_ref = SOAP::Lite
        -> uri('http://server.location.edu/pdbWebServices')
        -> proxy('http://server.location.edu/pdbWebServices')
        -> pubmedAbstractQuery($ARGV[0])
        -> result;

    my @ids = @{$ids_ref};   # dereference the returned array reference
    print "@ids\n";

    Mycomputer(1)% web_service.pl ferritin
    1AEW 1AQO 1BCF 1BFR 1BG7 1DPS 1EUM 1FHA 1JGC 1JI5 1JIG 1MFR 1QGH 1RCC 1RCD 1RCE 1RCG 1RCI 1RYT 2FHA

The Future
A Biological Complexity Perspective

[Figure: biological scales from organisms to atoms. Labels recoverable from the slide follow.]
Representative discipline and example units: Anatomy, MRI; Physiology, Heart; Cell Biology, Neuron; Proteomics; Genomics; Structure; Sequence; Medicinal Chemistry, Protease Inhibitor
Other labels: Scientific Research & Discovery; Simulation; Organisms
Representative technology:
Migratory Sensors, Ventricular Modeling, Electron Microscopy, X-ray Crystallography, Protein Docking
[Other figure labels: Organs, Cells, Macromolecules, Biopolymers, Atoms & Molecules; Data, Infrastructure, Technologies, Training]

Exploring Biological Complexity
Requires:
• That we do NOT neglect the details
• Synergy between theory and experiment, which highlights the need for better algorithms and quality control
But…
• We have existing and emerging technologies to measure complex systems
• This provides the opportunity to address some of biology’s fundamental questions

Structure is a Useful Tool to Study Biological Complexity as Nature has Provided a Helping Hand…
• An average protein is 350 amino acids in length; with 20 amino acids there are 20^350 possible proteins, far more than all the atoms in the universe
• In actuality there may be only 2-5 x 10^6 proteins
• There are likely between 1,000 and 5,000 unique folds
• Fold is far more conserved than sequence and permits us to look back farther in evolutionary time than sequence

But... much detail remains and our current methodologies fall short...
Consider structure comparison and alignment of the diverse protein kinases

An Example of a Structural Superfamily: The Protein Kinase-Like Superfamily
SCOP grouping for kinases
1) Class: Alpha+Beta
2) Fold: Protein Kinase Catalytic Core
3) Superfamily: Protein Kinase Catalytic Core
4) Families:
   a) Ser/Thr Kinases
   b) Tyr Kinases
   c) Atypical Kinases
   d) Antibiotic Kinases
   e) Lipid Kinases
Superfamily: not all members are eukaryotic or protein kinases. Some homologues discovered in bacteria phosphorylate antibiotics; others phosphorylate lipids.
[Figure: Typical Kinase Core (c-Src, PDB ID: 2SRC)]

Evolution of the Kinase Superfamily: Comparison of Three Superfamily Members
• A: Casein kinase 1 (PDB ID: 1CSN)
• B: Aminoglycoside kinase (PDB ID: 1J7L)
• C: Phosphatidylinositol 3-kinase (PDB ID: 1E8X)
• D: The previous three structures with only their shared region superposed (1CSN: light blue, 1J7L: red, 1E8X: yellow)
• The three kinases share a minimal core required for ATP binding and phosphotransfer

An accurate alignment would allow us to look back farther in evolutionary time than sequence alone. Alignment algorithms need to simulate what humans can do and go beyond it.

An Example of Manual vs.
Automated with Combinatorial Extension (CE)
• The manual alignment can be used to better understand the limitations of our automated method
• Alignment of helix C of two tyrosine kinases
  • Insulin Receptor Kinase (IRK, PDB ID: 1IR3)
  • c-Src (PDB ID: 2SRC)
• Can be aligned with 40% identity, 3.0 Å RMSD
• In Src, the C-helix is displaced and rotated outward
• The rotation pushes the N-terminal end of the helix out very far from the N-terminal end of the IRK helix
• CE gaps a part of this (yellow), splitting the helix and aligning part of IRK helix C with the loop leading to helix C in Src
[Figure legend: Orange: IRK, Blue: c-Src, Yellow: CE gap region]

Improving CEfam: Multiple Alignments with CE
• Example with strands 1 and 2 of the kinase superfamily
  • A: original
  • B: optimal parameters
  • C: manual
• The parameters also improved results with other protein superfamilies in visual analysis
• Just as sequence alignments are benchmarked against structure alignments, structure alignments should be benchmarked against manual results
• The improvement from optimization is now being folded into the next generation of CE

Quality Control
Consider an example: the definition of domains from 3-D structure

The 3D Domain Assignment Problem
A domain is a fundamental structural, functional and evolutionary unit of a protein. Domains:
• Are compact
• Are stable
• Have a hydrophobic core
• Fold independently
• Perform a specific function
• Can be re-shuffled and put together in different combinations
Evolution works at the level of the domain

Exact assignment of domains remains a difficult and unresolved problem.
• There is no complete agreement among experts on domain assignment given a protein structure.
• Expert methods agree on 80% of all existing manual assignments; the remaining 20% represent “difficult” cases
[Figure: Expert assignment #1, Expert assignment #2, Expert assignment #3]

Manual vs.
automatic consensuses: do they overlap?
• Chains with manual consensus: 375 (80% of entire dataset)
• Chains with automatic consensus: 374 (80% of entire dataset)
• Chains with consensus (automatic or manual): 424 (90.6% of entire dataset)
  • Automatic consensus only: 46 chains (10.9% of chains with consensus)
  • Manual consensus only: 47 chains (11.1% of chains with consensus)
  • Manual and automatic consensus agree: 328 chains (77.3% of chains with consensus)
  • Automatic consensus and manual consensus disagree: 3 chains (0.7% of chains with consensus)
Veretnik et al. 2004 JMB in press

1cjaa (actin-fragmin kinase, slime mold): an unusual kinase [complex interface]
• SCOP, PDP, DomainParser: 1 domain
• CATH: 1 domain + unassigned
• DALI: 4 domains
[Figure label: typical kinase]

Exemplar Bioinformatics Problems - The Next 5 Years…
1. Full genome comparisons
2. Rapid assessment of polymorphic variations
3. Complete construction of orthologous and paralogous groups
4. Structure resolution of large assemblies/complexes
5. Dynamical simulation of realistic systems
6. Rapid structural/topological clustering of proteins
7. Protein folding
8. Computer simulation of membrane insertion
9. Simulation of cellular pathways; sensitivity analysis of pathway stoichiometry and kinetics
10. Comparison of complex networks and pathways
11. Deciphering the metabolome
12. Integration and interpretation of data at different biological scales, genomic to population
13. Identification of biomarkers for use in diagnostic medicine

These problems will be dealt with by a new generation of scientists comfortable at both the bench and the computer.
Until then bioinformaticians need to work hard to overcome the “high noon” problem

High Noon - A Working Definition
[Figure: clock showing 12:00]
• The cost:benefit ratio of entry to bioinformatics tools and resources is too high for the majority of biologists
• Thus, those who could gain and contribute most from the services provided are not users

One Approach - MBT
• Java toolkit for developing custom molecular visualization applications
• High-quality interactive rendering of sequence, structure and function
http://mbt.sdsc.edu

MBT Architecture
[Figure: MBT architecture diagram]

Future - The Structure Should be the User Interface
• Ligand - What other entries contain this?
• Chain - What other entries have chains with >90% sequence identity?
• Residue - What is the environment of this residue?

Beyond 5 Years…
• Translational medicine
• Personalized medicine
• Merger of medical-, chem- and bio-informatics
• Societies that reflect this
• Training in cooperative in silico and experimental research
• Centers that reflect that training, i.e. different from NCBI or EBI
"Think! How the hell are you gonna think and hit at the same time?"

Beyond 5 Years
• Simulations used in the clinical setting
• Smart {genome} cards
• A ubiquitous life sciences Web that permits views from populations to atoms
"I knew I was going to take the wrong train, so I left early."

Acknowledgements
• To all those who have chosen bioinformatics as a career and make the field so rich
• Particularly those who do so for lesser rewards: the data providers and annotators
• My group, for the fun we had discussing this topic
• http://rinkworks.com/said/yogiberra.shtml
"I didn't really say everything I said."