Creating Data Resources for Biology
Helen M. Berman

History of the PDB archive

1960’s
Protein crystallography begins to take off
Emerging interest in protein folding
Use of computer graphics to represent structure
Nobel Prize awarded for the first 3D protein structures: myoglobin and hemoglobin

[Figure: Myoglobin, Hemoglobin, Lysozyme, Ribonuclease]
Myoglobin: Kendrew, Bodo, Dintzis, Parrish, Wyckoff, Phillips (1958) Nature 181: 662-666
Hemoglobin: Perutz (1962) Proc. R. Soc. A265: 161-187
Lysozyme: Blake, Koenig, Mair, North, Phillips, Sarma (1965) Nature 206: 757
Ribonuclease: Kartha, Bello, Harker (1967) Nature 213: 862-865; Wyckoff, Hardman, Allewell, Inagami, Johnson, Richards (1967) J. Biol. Chem. 242: 3753-3757

1970’s
Grassroots efforts to archive data
Protein crystallographers discuss how to archive data
June 1971: Cold Spring Harbor meeting brings groups together (Cold Spring Harbor Symposia on Quantitative Biology, vol. XXXVI, 1972)
October 1971: PDB is announced in Nature New Biology (7 structures; vol 233, 1971, page 223)
1975: PDB receives first funding from NSF (~32 structures)

[Figure: Proportion of enzyme classes relative to total enzyme structures (percent of enzymes, by decade): ligases, isomerases, lyases, hydrolases, transferases, oxidoreductases]

RNA-containing structures
tRNA: J.L. Sussman, S.-H. Kim (1976) Biochem. Biophys. Res. Commun. 68: 89-96; J.D. Robertus, J.E. Ladner, J.T. Finch, D. Rhodes, R.S. Brown, B.F.C. Clark, & A. Klug (1974) Nature 250: 546-551.
[Figure legend: Protein/RNA complexes; RNA only; DNA/RNA hybrid; Protein/DNA/RNA complexes; by decade]

1980’s
Technology takes off
Structural biology is able to focus on medical problems
Community efforts to promote data sharing
IUCr guidelines requiring data deposition in the PDB are published

DNA-containing structures
B-DNA: 1bna, Dickerson & Drew (1981) J. Mol. Biol. 149: 761-786
Z-DNA: 2dcg, Wang, Quigley, Kolpak, Crawford, van Boom, van der Marel, Rich (1979) Nature 282: 680-686
[Figure legend: Protein/DNA complexes; DNA only; DNA/RNA hybrid; Protein/DNA/RNA complexes]

Protein-nucleic acid complexes
Phage 434 repressor-operator: 2or1, Aggarwal, Rodgers, Drottar, Ptashne, & Harrison (1988) Science 242: 899-907
[Figure legend: Protein/DNA complexes; Protein/RNA complexes; Protein/DNA/RNA complexes; by year]

Viruses
Hopper, Harrison, Sauer (1984) Structure of tomato bushy stunt virus. V. Coat protein sequence determination and its structural implications. J. Mol. Biol. 177: 701-713
Silva, Rossmann (1985) The refinement of southern bean mosaic virus in reciprocal space. Acta Crystallogr. B41: 147-157

Cooperative community action
Individual letters to editors of journals
Committees
– IUCr commission on Biological Macromolecules
– ACA/USNCCr
– Richards committee
Funding agencies
Articles in journals
[Photos: Marvin Cassman, Fred Richards, Richard Dickerson]

1990’s
Number of structures increases exponentially
Complexity of structures increases
mmCIF dictionary created
New databases begin to emerge
User base expands dramatically
PDB archive moves
[Photo: mmCIF Working Group Members]

Ribosome structures; Electron Microscopy structures
Bacteriorhodopsin. Henderson, Baldwin, Ceska, Zemlin, Beckmann, Downing (1990) J. Mol. Biol. 213: 899-929.
30S 50S Ribosome.
Ban, Nissen, Hansen, Moore, & Steitz (2000) Science 289: 905-920; Clemons Jr., May, Wimberly, McCutcheon, Capel, & Ramakrishnan (1999) Nature 400: 833-840; Schluenzen, Tocilj, Zarivach, Harms, Gluehmann, Janell, Bashan, Bartels, Agmon, Franceschi, Yonath (2000) Cell 102: 615-623; Yusupova, Yusupov, Cate, & Noller (2001) Cell 106: 233-241.

2000’s
wwPDB is formed
Continued growth in structures
Structural genomics takes off
[Figure: Structures solved as of 2007]
[Photos: wwPDB AC 2009; wwPDB Directors]

Worldwide Protein Data Bank
Formalization of current working practice
Members
– RCSB PDB (Research Collaboratory for Structural Bioinformatics)
– PDBj (Osaka University)
– PDBe (EMBL-EBI)
– BioMagResBank (University of Wisconsin-Madison)
MOU signed July 1, 2003
Announced in Nature Structural Biology November 21, 2003
wwpdb.org

Guidelines and Responsibilities
All members issue PDB IDs and serve as distribution sites for data
One member is the archive keeper (RCSB PDB)
All format documentation publicly available
Strict rules for redistribution of PDB files
All sites can create their own websites
www.pdb.org
www.ebi.ac.uk/pdbe/
www.pdbj.org

[Figure: Depositions to the PDB by decade (number of released entries, by year)]

Archive Contents
Public archive
– More than 400,000 files (as of June 2009)
– Requires over 93 GB of storage
– Data dictionaries
– Derived data files
For each entry
– Atomic coordinates
– Sequence information
– Description of structure
– Experimental data
– Release status information
Internal archive
– Depositor correspondence
– Depositor contact information
– Paper records
– Documentation
– Historical records from Day One

What can the PDB archive tell us?
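One simple way to start asking that question is to tally entries by what they contain. The sketch below is illustrative only: the `ArchiveEntry` record, the `structure_class` and `tally` helpers, and the rule that treats any entry holding both DNA and RNA as a hybrid are assumptions made for this example, not the wwPDB data model. The three sample IDs (1bna, 2or1, 1MBN) are entries cited elsewhere in these slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveEntry:
    """Simplified, hypothetical stand-in for one public archive entry."""
    pdb_id: str                       # identifier issued at deposition
    polymer_types: frozenset          # subset of {"protein", "DNA", "RNA"}
    release_status: str = "released"  # illustrative status value


def structure_class(entry):
    """Label an entry by its polymer content (assumed, simplified rules)."""
    nucleic = sorted(entry.polymer_types & {"DNA", "RNA"})
    if "protein" in entry.polymer_types:
        if not nucleic:
            return "Protein only"
        return "Protein-" + "/".join(nucleic) + " complexes"
    if nucleic == ["DNA", "RNA"]:
        return "RNA-DNA hybrid"  # simplification: both types present
    if nucleic:
        return nucleic[0] + " only"
    return "Other"


def tally(entries):
    """Count entries per structure class."""
    return Counter(structure_class(e) for e in entries)


# Toy sample built from entries named in the slides.
SAMPLE_ENTRIES = [
    ArchiveEntry("1bna", frozenset({"DNA"})),             # B-DNA
    ArchiveEntry("2or1", frozenset({"protein", "DNA"})),  # repressor-operator
    ArchiveEntry("1MBN", frozenset({"protein"})),         # whale myoglobin
]

if __name__ == "__main__":
    for label, count in sorted(tally(SAMPLE_ENTRIES).items()):
        print(label, count)
```

Run over the full archive rather than a three-entry sample, the same kind of tally would give a structure distribution by type.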
Structure distribution
[Figure: Number of structures by type. Labels: Protein only; Protein-DNA complexes; Protein-RNA complexes; DNA only; RNA only; RNA-DNA hybrid; Other. Values shown: 46157, 1301, 1093, 755, 655, 582, 39]

[Figure: Structure determination methods, by year]
[Figure: Resolution distribution: protein structures]
[Figure: Resolution distribution: other structures]
[Figure: Resolution distribution: all structures, by year]

Redundancy: protein clusters

Cluster #  Protein cluster                          Total distinct chains  First structure  Deposition date
1          Bacteriophage T4 lysozyme                459                    2LZM             1977-03-28
2          Hen egg white lysozyme                   297                    2LYZ             1975-02-01
3          Human lysozyme                           196                    1GFE             1984-10-12
4          Mouse immunoglobulin Fc&Fab fragments    445                    1GIG             1993-01-20
5          Human immunoglobulin Fc&Fab fragments    218                    1FC1             1981-05-21
6          HIV-1 protease                           330                    2HVP             1989-04-10
7          Trypsin (serine protease)                302                    5PTP             1977-12-19
8          Thrombin                                 254                    2HGT             1991-06-03
9          Human carbonic anhydrase II              229                    1CA2             1976-05-22
10         Whale myoglobin                          185                    1MBN             1973-04-05
11         Human leukocyte antigen                  182                    1HLA             1987-10-15
12         Human hemoglobin α-subunit               178                    3HHB             1975-04-01
13         Human hemoglobin β-subunit               176                    3HHB             1975-04-01
14         Ribonuclease A                           160                    2RNS             1973-04-01
15         Human cyclin-dependent kinase 2 (CDK2)   153                    1HCK             1996-06-03

Distinct and novel protein sequences
[Figure: Percent of distinct/novel structures by period (1972-1979, 1980-1989, 1990-1999, 2000-2008). Series: structures containing distinct protein sequences (<98%); structures containing novel protein sequences (<30%); subset of PSI structures; subset of other SG structures]

Lysozyme: Lessons learned
T4 bacteriophage (459 structures)
– Amino acid replacement studies suggest that the fraction of amino acid residues that define the structure of T4 lysozyme is about 50%
  B.W. Matthews (1996) FASEB J. 10: 35-41.
– Insight into folding and catalysis
  Blake, Koenig, Mair, North, Phillips, Sarma (1965) Nature 206: 757.
Hen egg white (297 structures)
– Low sequence identity
– Structural similarity of active site to T4
  B.W. Matthews, M.G. Remington, M.G.
Grutter, W.F. Anderson (1981) J. Mol. Biol. 147: 545-58.
– Insight into evolution and catalysis

Myoglobin and hemoglobin: Lessons learned
Whale myoglobin (185 structures)
– Different ligands: oxygen, carbon dioxide [1]
– Amino acid substitution studies [2]
– Laue studies [3]
– Insight into function and dynamics
Other species myoglobin
– Low sequence identity, same structure [4]
– Insight into evolution
Human hemoglobin (178 structures)
– Insight into function and disease (sickle cell anemia, thalassemia) [5]
Other species hemoglobin
– Low sequence identity, same structure [4]
– Profound insight into evolution
[Figure credit: Lodish et al. [6]]

[1] Kuriyan, Wilz, Karplus, Petsko (1986) J. Mol. Biol. 192: 133–154
[2] Quillin, Arduini, Olson, Phillips Jr. (1993) J. Mol. Biol. 234: 140–155; Carver, Brantley Jr., Singleton, Arduini, Quillin, Phillips Jr., Olson (1992) J. Biol. Chem. 267: 14443–14450
[3] Bourgeois, Vallone, Schotte, Arcovito, Miele, Sciara, Wulff, Anfinrud, Brunori (2003) PNAS 100: 8704-8709
[4] Dickerson, Geis (1983) Hemoglobin: structure, function, and pathology
[5] Kidd, Baker, Mathews, Brittain, Baker (2001) Prot. Sci. 10: 1739-1749; Harrington, Adachi, Royer Jr. (1998) J. Biol. Chem. 273: 32690-32696
[6] Lodish, Berk, Zipursky, Matsudaira, Baltimore, Darnell (2000) Molecular Cell Biology, WH Freeman & Co.

TIM barrel proteins: Lessons learned
TIM barrel structures (1727) http://www.cathdb.info
– Share the same fold but represent significant sequence and functional diversity
– Are enzymes or enzyme-related proteins involved in molecular or energy metabolism
– Comparative structure analysis indicates evolutionary relatedness of TIM barrel proteins
Banner, Bloomer, Petsko, Phillips, Wilson (1976) Biochem. Biophys. Res. Commun. 72: 146-155
Nagano, Orengo, Thornton (2002) J. Mol. Biol. 321: 741-65.
HIV-related structures
[Figure: HIV-related structures by year and by target. Labels: HIV-1 reverse transcriptase; HIV-1 protease; Gag protein; Integrase; Other. Values shown: 311, 122, 110, 39, 27]
[Figure: Drug-bound structures of HIV-1 protease and reverse transcriptase]
Drugs shown: Abacavir (GSK), Nevirapine (BI), Stavudine (BMS), Amprenavir (GSK), Fosamprenavir (GSK), Efavirenz (BMS), Lamivudine (GSK), Zidovudine (GSK), Emtricitabine (Gilead), Atazanavir (BMS), Lopinavir (Abbott), Nelfinavir (Agouron), Darunavir (Tibotec), Tenofovir (Gilead), Zalcitabine (Hoffmann-La Roche), Tipranavir (BI), Etravirine (Tibotec), Delavirdine (Pfizer), Indinavir (Merck), Ritonavir (Abbott), Saquinavir (Roche)
PDB IDs shown, grouped as on the slide:
– 2HND, 2HNY, 1S1U, 1S1X, 1LW0, 1LWE, 1LWC, 1LWF, 1JLB, 1JLF, 1FKP, 1VRT, 3HVT
– 1T7J, 1HPV
– 2FXE, 2FXD, 2O4K, 2AQU, 2FND
– 2RKG, 2RKF, 2QHC, 2Z54, 2Q5K, 2O4S, 1RV7, 1MUI
– 1JKH, 1IKW, 1IKV, 1FKO, 1FK9
– 2QAK, 2PYM, 2Q63, 2PYN, 2Q64, 2R5Q, 1OHR
– 2R5P, 2B7Z, 2AVV, 2AVO, 2AVS, 1SGU, 1SDT, 1SDV, 1SDU, 1K6C, 1C6Y, 2BPX, 1HSG, 1HSH
– 2O4N, 2O4L, 2O4P, 1D4Y, 1D4S
– 1T05
– 1S6P
– 2B60, 1RL8, 1SH9, 1N49, 1HXW
– 3D1X, 3D1Y, 3CYX, 2NMW, 2NMZ, 2NNP, 2NMY, 2NNK, 1C6Z, 1FB7

Scientific challenges to the PDB
Number of data files continues to increase
Information content of each data file is increasing
Many more very large macromolecular complexes
New structure determination methods

[Figure: Growth of PDB Depositions, 1972-2009, by experimental method (X-ray, NMR, EM) and by deposition and processing site; *(projected) (8322)]
[Figure: Location of PDB Depositors (1999-2009), number of structures by year; increase of PDB data depositions from Asia and Oceania regions]

Technical challenges in data management
How do we represent diverse data?
How do we make a searchable database?
How do we integrate with other data resources?
How do we make a scalable system?
How do we meet the needs of a diverse community?
The pipeline: deposition to release
[Figure: PDB pipeline stages: Structure Determination; Deposition; Data Processing; Data Archiving; Data Distribution & Query]

Data In: What happens with PDB depositions?

RCSB and wwPDB Full Data Flow
[Figure: Stages: Deposition; Processing and Annotation; Integration; Dissemination. Labels: depositors; web communication with depositor; deposition tools (ADIT at RCSB, AutoDep at PDBe, ADIT-NMR at BMRB, ADIT and ADIT-NMR at PDBj); Harvest, Prepare, Prevalidate; Validation; Annotation; PDB ID; Shared DB; RCSB Database; External Loaders; Data Exchange file (daily upload); Release; Archive; Master PDB FTP Archive (RCSB at RU); PDB FTP (RCSB at UCSD); PDBe PDB FTP mirror; PDBj PDB FTP mirror; web access to data at RCSB, PDBe, and PDBj; wwPartners; consumers]

After deposition: annotation and validation
Check all incoming files
– Sequence/structure correspondences
– Small molecule ligands
– Biological assembly (PISA, author-defined)
– Agreement with experimental data
– Agreement with known geometrical features (Molprobity, Procheck, SFCheck, NUCheck)
Update and maintain data processing database daily

Developing method-specific standards
X-ray Validation Task Force
– April 14-16, 2008 at EMBL-EBI, Hinxton, UK
– Randy Read (Chair)
NMR Validation Task Force
– September 21, 2009 in Paris, France
– Guy Montelione, Michael Nilges (Co-chairs)

Electron Microscopy
EMDataBank.org: Unified Data Resource for CryoEM
Collaborative project between RCSB PDB, PDBe, and Baylor-NCMI is funded by NIH
Unified tool for collecting model coordinates and map files in a one-stop shop
Merge with wwPDB as part of Common Tool by 2011
[Figure: EM Coordinate and Map Depositions (EMDataBank.org)]

Planning for the Future: wwPDB Deposition and Annotation Tool
Goal: To collaboratively develop the new processes and supporting systems that will support the wwPDB over the next 10 years.
The new systems will provide a high quality and dependable resource that will effectively:
– support increases in deposition throughput
– address the anticipated increase in complexity and experimental variety of submissions
– focus on quality enhancement through the use of community-based validation tools

Common Tool for Deposition and Annotation
Manage increased data load without an increase in resources
Create global deposition and annotation tools
Distribute worldwide data load and eliminate individual points of failure
Anticipate new developments in structural biology to keep tools up to date

Continuous data annotation to support searching and reporting
[Figure: Data quality; Data standardization; Extended annotation; Improved search functionality; Extended search options]

Example: 2007 Archive Updates
Sequences & taxonomy updated for sequences
All primary citations verified
Improved biological representation
Symmetry and coordinate transformations for virus entries [Figure: before and after]
Ligand stereochemistry and nomenclature for monomers and non-polymer molecules
Diffraction source & beamline updates
Miscellaneous uniformity issues
C.L. Lawson, S. Dutta, J.D. Westbrook, K. Henrick and H.M. Berman (2008) Representation of viruses in the remediated PDB archive. Acta Cryst. D64: 874-882

Data Out: What happens when data are released?
FTP site for wwPDB
Data downloaded by hundreds of external resources
Each wwPDB member maintains websites with different services

RCSB PDB portal
www.pdb.org

MyPDB: Keep up-to-date with new structures...automatically!
Framework to store user preferences
Saves queries in a private account
Notifies users via email when new structures match stored queries

Interactive Views of Domain Annotations

Structure Explorer Summary Page
Information summarized in easy-to-read 2-column format
Related information presented in customizable “widgets”
Abstract from PubMed is displayed

Visualization Options
3D Viewers are context-sensitive
– Asymmetric unit
– Biological assembly
Biological assembly is displayed by default
Presumed oligomeric state of biological molecule is displayed (for X-ray structures)

Protein-Ligand Interaction View
Simplified user interface
Added metal interactions
Display of bond orders from Chemical Component Dictionary

Integrating sequence, structure, and function

Knowledgebase
http://kb.psi-structuralgenomics.org/KB/
The Structural Genomics Knowledgebase is a free online resource that gives access to protein information determined by the Protein Structure Initiative (PSI) and other key biological resources to enable a better understanding of the molecular basis of biology and disease.
Scope of PSI SGKB
[Figure: Experimental Tracking; Target Selection; Materials; Genomic Based Target Selection; Isolation, Expression, Purification, Crystallization; Data Collection; Structure Determination; PDB Deposition & Release; Models; Annotations; Publications; Technology; Metrics]
To capture, organize, and provide access to key elements of the structural determination high-throughput pipelines
To leverage such information through the generation of molecular models and integration of functional annotation for use by the scientific community

Navigating the PSI SGKB Homepage
Database searchable by sequence, text, and PDB ID and delivers aggregate reports, inventories
Links to PSI projects, external resources, and publications
Link to central Community-Nominated Targets Proposal system
SG Gateway with Nature delivers research findings, technologies, news and events related to the PSI and structural genomics
Publicizes recently solved PSI structures or new editorial content
[Figure labels: Target information; Protocols; Technologies; Models; Publications; Links to Biological Resources]

The PSI SGKB enables knowledge…
By connecting protein sequence information to 3D structures and homology models
By providing centralized access to experimental protocols, materials, and technologies
By fostering community collaborations

Structural Views of Biology and Medicine

What we have learned so far
Sequence-structure-function relationships are complex
– Low sequence identity/same structure (hemoglobin)
– Same structure/different function (TIM)
– Different overall structure/same function (lysozyme)
New protein targets lead to new drugs (HIV protease)
Technology-science cycle closely coupled in structural biology
A structural view of biology is closer than we thought
“If it can be done, it will be done”

Acknowledgements
Funding Agencies for all Projects:
NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS, NIDDK
Wellcome Trust, EU, CCP4, BBSRC, MRC, EMBL
BIRD-JST, MEXT