Modeling Protein Function
MED260
Philip E. Bourne
Department of Pharmacology, UCSD
pbourne@ucsd.edu
http://www.sdsc.edu/pb
Slides on-line at: http://www.sdsc.edu/pb/edu/med260/med260.ppt
MED260 Modeling Protein Function - October 11, 2006

Agenda
• Why model protein function?
• Where does it fit as a technique in modern medical research?
• The data deluge as a motivator
• The extent of what can be modeled
• Ontologies – establishing order from chaos
• Examples of what can be learnt
• Accuracy – a word of caution

Why Model Protein Function
• The rate of discovery of new proteins far outpaces our ability to functionally characterize them
• Functional discovery of new proteins has implications in:
  – Drug discovery
  – Biomarker identification
  – Understanding of biological processes
  – Identification of disease states and treatment regimes
Why model protein function?

Where does it fit as a technique in modern medical research?
[Figure: levels of scientific research & discovery]
  Representative discipline (example unit): Anatomy (MRI), Physiology (Heart), Cell Biology (Neuron), Proteomics (Structure), Genomics (Sequence), Medicinal Chemistry (Protease Inhibitor)
  Scientific research & discovery levels: Organisms, Organs, Cells, Macromolecules, Biopolymers, Atoms & Molecules
  Representative technology: Migratory Sensors, Ventricular Modeling, Electron Microscopy, X-ray Crystallography, Protein Docking

[Figure: the same diagram repeated, with the label "Translational Medicine" added]

The Ability to Model Protein Function Influences and Can Be Influenced by Any Level of Biological Complexity - Examples
• Genome – rapid increase in sequenced genomes provides new raw material
• Proteome – large increase in the number of 3D structures highlights new functions
• Interactome – identification of a binding partner points to a new function
• Metabolome – isolation of a protein within a metabolic pathway
• Cell – localization points to function
• Organ – gene expression in heart tissue points to function
• Organism – different physiology observed in species can be related to protein functions

[Figure: the levels-of-research diagram (disciplines, example units, levels from Organisms to Atoms & Molecules, technologies), annotated "We will focus here" beside the Proteomics, Genomics and Medicinal Chemistry rows]

At All Levels We Are Being Driven By Data
[Figure: labels recoverable from the diagram]
  Stages: Biological Experiment, Data, Information, Knowledge, Discovery
  Steps: Collect, Characterize, Compare, Model, Infer

The Data Deluge
[Figure: labels recoverable from the chart]
  Axes: Complexity, Technology, Year (1990 to 2005), Data, Computing Power, Sequencing Technology, # People/Web Site
  Complexity levels: Sequence, Sub-cellular Structure, Cellular Assembly, Organ, Higher-life
  Milestones and examples: ESTs, E. coli genome, yeast genome, C. elegans genome, Human Genome Project, Gene Chips, Virus Structure, Ribosome, Genetic Circuits, Model Metabolic Pathway of E. coli, Neuronal Modeling, Cardiac Modeling, Brain Mapping, Virtual Communities

Metagenomics: A First Look
• New type of genomics
• New data (and lots of it) and new types of data
  – 17M new (predicted) proteins! 4-5x growth in just a few months, and much more coming
  – New challenges and exacerbation of old challenges
The Data Deluge

Metagenomics: First Results
• More than 99.5% of DNA in every environment studied represents unknown organisms
  – Culturable organisms are exceptions, not the rule
• Most genes represent distant homologs of known genes, but there are thousands of new families
• Everything we touch turns out to be a gold mine
• Environments studied:
  – Water (ocean, lakes)
  – Soil
  – Human body (gut, oral cavity, human microbiome)

Metagenomics: New Discoveries
[Figure: Environmental (red) vs. currently known PTPases (blue)]

The Good News and the Bad News
• Good news
  – Data pointing towards function are growing at near exponential rates
  – IT can handle it on a per dollar basis
• Bad news
  – Data are growing at near exponential rates
  – Quality is highly variable
  – Accurate functional annotation is sparse

Genomes - 2004
• We all know about the human – what is not so well known is:
  – 191 completed microbial genomes
  – 44 archaea
  – 727 bacteria
  – 785 eukaryotes (complete or in progress)
  – Viroids ….
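
As a rough illustration of what the "near exponential" growth on the Data Deluge slides implies, the sketch below converts a growth factor observed over some period into a doubling time, assuming strictly exponential growth. The 4x and 5x factors come from the metagenomics slide; the 3-month period is only an assumption standing in for "a few months", and the function name doubling_time is hypothetical.

```python
import math


def doubling_time(growth_factor: float, elapsed_months: float) -> float:
    """Return the doubling time (in months) implied by exponential growth.

    If a data set grows by `growth_factor` over `elapsed_months`, then
    N(t) = N0 * growth_factor ** (t / elapsed_months), and the time needed
    to double is elapsed_months * ln(2) / ln(growth_factor).
    """
    if growth_factor <= 1.0:
        raise ValueError("growth_factor must be greater than 1")
    if elapsed_months <= 0.0:
        raise ValueError("elapsed_months must be positive")
    return elapsed_months * math.log(2.0) / math.log(growth_factor)


# ASSUMPTION: "a few months" is taken as 3 months purely for illustration.
ASSUMED_MONTHS = 3.0

for factor in (4.0, 5.0):  # the 4-5x growth quoted on the slide
    months = doubling_time(factor, ASSUMED_MONTHS)
    print(f"{factor:.0f}x in {ASSUMED_MONTHS:.0f} months -> doubling every {months:.2f} months")
```

Under these assumptions the point is qualitative: growth of this kind doubles the volume of unannotated sequence on a timescale of months, far faster than experimental characterization can follow.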

Proteome
• We are reasonably good, but not perfect, at finding proteins in genomes with intergenic regions – e.g. alternative initiation codons
• Regulatory elements provide a different set of challenges
• We are not so good at assigning functions to those proteins
• Moreover, the devil is in the details
The Extent of What Can Be Modeled

Estimated Functional Roles (by % of Proteins) of the Proteome in a Complex Organism
[Figure: functional roles by % of proteins]

Functional Nomenclature Needs to be Consistent for Orderly Progress – Enter EC and GO
• EC classifies all enzymes: http://www.chem.qmul.ac.uk/iubmb/enzyme/
• The Gene Ontology Consortium characterizes by molecular function, biological process and cellular location: http://www.geneontology.org/
Ontologies – establishing order from chaos

Functional Coverage of the Human Genome
• 40% covered
http://function.rcsb.org:8080/pdb/function_distribution/index.html

Step 1. Learn What You Can from the Protein Sequence
• Find it
• Pay attention to the quality of the functional annotation – errors are transitive
• Understand its 1-D structure – domain organization, {signatures, fingerprints}
Examples of what can be learnt

Step 2. Is there a 3D Structure? If So, What Can You Learn from That?
• Find it
• Understand it
• Characterize it
• Understand its function(s) – these follow a power law at the fold level – some folds are promiscuous (many functions), others are solitary or of unknown function

First, Why Bother with Structure?
[Figure: (a) myoglobin (b) hemoglobin (c) lysozyme (d) transfer RNA (e) antibodies (f) viruses (g) actin (h) the nucleosome (i) myosin (j) ribosome. Courtesy of David Goodsell, TSRI]

An Example: Protein Kinase A
This “molecular scene” for cAMP-dependent protein kinase depicts years of collective knowledge. Beyond the basics, only the atomic coordinates are captured by the PDB. Functional annotation requires the literature.

What Did that Picture Tell Us?
• Two domains with associated functions
• ATP binding & substrate binding
• Through conserved residues and their spatial location, details of ATP and substrate binding and the mechanism of the phospho-transfer reaction
• So is structure the answer to functional modeling?

Question: So is structure the answer to functional modeling?
Answer: Partly. The number of unique protein sequences still outnumbers the number of unique structures by 100:1
Enter Structural Genomics
Enter Structure Prediction

The Structural Genomics Pipeline (X-ray Crystallography)
Basic Steps
• Target Selection
• Crystallomics
  – Isolation
  – Expression
  – Purification
  – Crystallization
• Data Collection
• Structure Solution
• Structure Refinement
• Functional Annotation
• Publish

Structural Genomics Will Give Us..
• Good news
  – More structures (definitely)
  – New folds (some but not as anticipated)
  – New understanding of specific diseases and pathways (maybe)
  – Representatives from each major protein family (maybe)
• Bad news
  – Many new structures that are functionally unclassified (definitely)

What About Structure Prediction?
• Current rule: We will be able to predict a structure when we know all the structures

Why is Structure Prediction so Hard?
[Figure: 1000 random structurally similar PDB polypeptide chains with z > 4.5, plotted as % sequence identity vs. alignment length; regions labeled Twilight Zone and Midnight Zone]

Approaches to Structure Prediction
• Homology modeling
• Threading (aka fold recognition)
• Ab initio
• How well do we do? – see CASP
• Consensus servers
  – Eva - http://cubic.bioc.columbia.edu/eva/
  – LiveBench - http://bioinfo.pl/meta/

Step 3. What Can Be Got from Structure When You Have it?
From Structural Bioinformatics, Ed. Bourne and Weissig, p394, Wiley 2002

Specific Example
• Mj0577 – putative ATP molecular switch
Mj0577 is an open reading frame (ORF) of previously unknown function from Methanococcus jannaschii. Its structure was determined at 1.7 Å (Figure 7a) (Zarembinski et al, 1998). The structure contains a bound ATP molecule, picked up from the E. coli host. The presence of bound ATP led to the proposition that Mj0577 is either an ATPase or an ATP-binding molecular switch. Further experimental work showed that Mj0577 cannot hydrolyse ATP by itself, and can only do so in the presence of M. jannaschii crude cell extract.
Therefore it is more likely to act as a molecular switch, in a process analogous to ras-GTP hydrolysis in the presence of GTPase activating protein.
From Structural Bioinformatics, Ed. Bourne and Weissig, p402, Wiley 2002

Step 4. Proteins Do Not Function in Isolation But are Part of Complex Interaction Networks
http://www.genome.jp/kegg/

Accuracy - A Word of Caution
• Errors are transitive
  – Proteins A and B are observed to have similar functions through sequence homology
  – Proteins B and C are observed to have similar functions through sequence homology
  – Is protein A related to protein C?
  – Up to 30% of current annotation may be wrong

Questions?

Demo of Steps 1-4
• Step 1. Learn What You Can from the Protein Sequence
• Step 2. Is there a 3D Structure? If So, What Can You Learn from That?
• Step 3. What Can Be Got from Structure When You Have it?
• Step 4. Proteins Do Not Function in Isolation But are Part of Complex Interaction Networks
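
The "errors are transitive" caution from the Accuracy slide can be made concrete with a small sketch. It copies a function label along pairwise homology links (A-B, B-C) and records how many transfers each annotation is removed from the experimentally characterized protein. The function names, the "kinase" label, the 0.9 per-transfer accuracy and the assumption that transfer errors are independent are all illustrative choices, not values or methods from the lecture.

```python
from collections import deque


def propagate_annotations(seed, homology_edges):
    """Transfer function labels along homology links, breadth first.

    seed: dict mapping protein -> experimentally known function
    homology_edges: iterable of (protein, protein) pairs judged homologous
    Returns a dict mapping protein -> (function, number of transfer hops).
    """
    neighbors = {}
    for a, b in homology_edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    annotations = {protein: (function, 0) for protein, function in seed.items()}
    queue = deque(seed)
    while queue:
        current = queue.popleft()
        function, hops = annotations[current]
        for other in sorted(neighbors.get(current, ())):
            if other not in annotations:
                # The label is copied; no new evidence is added at this step.
                annotations[other] = (function, hops + 1)
                queue.append(other)
    return annotations


def chain_reliability(per_hop_accuracy, hops):
    """Probability that a label survives `hops` transfers unchanged.

    ASSUMPTION: each transfer is correct independently with the same
    probability; real annotation errors need not behave this way.
    """
    return per_hop_accuracy ** hops


# Proteins A-B and B-C are each linked by sequence homology, as on the slide.
edges = [("A", "B"), ("B", "C")]
result = propagate_annotations({"A": "kinase"}, edges)  # "kinase" is hypothetical

for protein in sorted(result):
    function, hops = result[protein]
    # 0.9 is a hypothetical per-transfer accuracy, used only for illustration.
    print(protein, function, hops, round(chain_reliability(0.9, hops), 3))
```

Protein C ends up with A's label although A and C were never compared directly, and under the independence assumption its reliability is lower than B's. Tracking the hop count alongside each annotation is one simple way to flag labels that rest on a chain of inferences rather than on direct evidence.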