Creating Data Resources for Biology Helen M. Berman

History of the PDB archive
1960’s
 Protein
crystallography
begins to take off
 Emerging interest
in protein folding
 Use of computer
graphics to
represent
structure
 Nobel Prize
awarded for the
first 3D protein
structures:
myoglobin and
hemoglobin
Myoglobin: Kendrew, Bodo, Dintzis,
Parrish, Wyckoff, Phillips (1958) Nature 181
662-666; Hemoglobin: Perutz (1962) Proc.
R. Soc. A265, 161-187; Lysozyme: Blake,
Koenig, Mair, North, Phillips, Sarma (1965)
Nature 206 757; Ribonuclease: Kartha,
Bello, Harker (1967) Nature 213, 862-865;
Wyckoff, Hardman, Allewell, Inagami,
Johnson, Richards (1967) J. Biol. Chem.
242, 3753-3757.
Myoglobin
Hemoglobin
Lysozyme
Ribonuclease
1970’s



Grassroots efforts to
archive data
Protein
crystallographers
discuss how to
archive data
June 1971 Cold
Spring Harbor
meeting brings
groups together
(Cold Spring Harbor
Symposia on Quantitative
Biology, vol. XXXVI, 1972)

October 1971 PDB is
announced in Nature
New Biology
(7 structures; vol 233,
1971, page 223)

1975 PDB receives
first funding from NSF
(~32 structures)
[Figure: Proportion of enzyme classes relative to total enzyme structures, by decade. Y-axis: percent of enzymes. Classes: ligases, isomerases, lyases, hydrolases, transferases, oxidoreductases.]
RNA-containing structures
tRNA: J.L. Sussman, S.-H. Kim (1976) Biochem. Biophys. Res. Commun. 68: 89-96; J.D. Robertus, J.E. Ladner, J.T. Finch, D. Rhodes, R.S. Brown, B.F.C. Clark, A. Klug (1974) Nature 250: 546-551.
[Figure: RNA-containing structures by decade. Categories: protein/RNA complexes, RNA only, DNA/RNA hybrid, protein/DNA/RNA complexes.]
1980’s

Technology takes
off

Structural biology
is able to focus on
medical problems

Community
efforts to promote
data sharing

IUCr guidelines
requiring data
deposition in the
PDB are
published
DNA-containing structures
B-DNA: 1bna, Dickerson & Drew (1981) J. Mol. Biol. 149: 761-786
Z-DNA: 2dcg, Wang, Quigley, Kolpak, Crawford, van Boom, van der Marel, Rich (1979) Nature 282: 680-686
[Figure: DNA-containing structures. Categories: protein/DNA complexes, DNA only, DNA/RNA hybrid, protein/DNA/RNA complexes.]
Protein-nucleic acid complexes
Phage 434
repressor-operator
2or1 Aggarwal,
Rodgers, Drottar,
Ptashne, & Harrison
(1988) Science 242:
899-907
Protein/DNA complexes
Protein/RNA complexes
Prot/DNA/RNA complexes
Year
Viruses
Hopper, Harrison, Sauer (1984)
Structure of tomato bushy stunt
virus. V. Coat protein sequence
determination and its structural
implications J.Mol.Biol. 177: 701-713
Silva, Rossmann (1985) The
refinement of southern bean mosaic
virus in reciprocal space Acta
Crystallogr. B41: 147-157
Cooperative community action
 Individual letters to editors of
journals
 Committees
 IUCr commission on
Biological Macromolecules
 ACA/USNCCr
 Richards committee
 Funding agencies
 Articles in journals
Marvin Cassman
Fred Richards
Richard Dickerson
1990’s
 Number of structures
increases exponentially
 Complexity of structures
increases
 mmCIF dictionary created
 New databases begin to
emerge
 User base expands
dramatically
 PDB archive moves
mmCIF Working Group Members
Ribosome
structures
Electron Microscopy
structures
Bacteriorhodopsin. Henderson, Baldwin,
Ceska, Zemlin, Beckmann, Downing (1990)
J.Mol.Biol. 213: 899-929.
30S
50S
Ribosome. Ban, Nissen, Hansen,
Moore, & Steitz (2000) Science 289:
905-920; Clemons Jr., May,
Wimberly, McCutcheon, Capel, &
Ramakrishnan (1999) Nature 400:
833-840; Schluenzen, Tocilj, Zarivach,
Harms, Gluehmann, Janell, Bashan,
Bartels, Agmon, Franceschi, Yonath
(2000) Cell 102: 615-623; Yusupova,
Yusupov, Cate,& Noller (2001) Cell
106: 233-241.
2000’s
 wwPDB is formed
 Continued growth in structures
 Structural genomics takes off
Structures solved as of 2007
wwPDB AC 2009
wwPDB Directors
Worldwide Protein Data Bank
 Formalization of current
working practice
 Members
– RCSB PDB (Research
Collaboratory for Structural
Bioinformatics)
– PDBj (Osaka University)
– PDBe (EMBL-EBI)
– BioMagResBank (University
Wisconsin, Madison)
 MOU signed July 1, 2003
 Announced in Nature
Structural Biology
November 21, 2003
wwpdb.org
Guidelines and Responsibilities
 All members issue PDB IDs and serve as
distribution sites for data
 One member is the archive keeper (RCSB PDB)
 All format documentation publicly available
 Strict rules for redistribution of PDB files
 All sites can create their own websites
www.pdb.org
www.ebi.ac.uk/pdbe/
www.pdbj.org
Number of released entries
Depositions to the PDB by decade
Year:
Archive Contents
 Public archive
– More than 400,000 files (as
of June 2009)
– Requires over 93 GB of
storage
– Data dictionaries
– Derived data files
 For each entry
– Atomic coordinates
– Sequence information
– Description of structure
– Experimental data
– Release status information
 Internal archive
– Depositor correspondence
– Depositor contact information
– Paper records
– Documentation
– Historical records from Day One
What can the PDB archive tell us?
Structure distribution
[Figure: number of structures by molecule type. Protein only: 46157; protein-DNA complexes: 1301. The remaining categories (protein-RNA complexes, RNA only, DNA only, RNA-DNA hybrid, other) carry the values 1093, 755, 655, 582, and 39.]
[Figure: Structure determination methods, by year]
[Figure: Resolution distribution by year, in three panels: all structures, protein structures, other structures]
Redundancy: protein clusters

Cluster #  Protein cluster                          Total distinct chains  First structure  Deposition date
1          Bacteriophage T4 lysozyme                459                    2LZM             1977-03-28
2          Hen egg white lysozyme                   297                    2LYZ             1975-02-01
3          Human lysozyme                           196                    1GFE             1984-10-12
4          Mouse immunoglobulin Fc&Fab fragments    445                    1GIG             1993-01-20
5          Human immunoglobulin Fc&Fab fragments    218                    1FC1             1981-05-21
6          HIV-1 protease                           330                    2HVP             1989-04-10
7          Trypsin (serine protease)                302                    5PTP             1977-12-19
8          Thrombin                                 254                    2HGT             1991-06-03
9          Human carbonic anhydrase II              229                    1CA2             1976-05-22
10         Whale myoglobin                          185                    1MBN             1973-04-05
11         Human leukocyte antigen                  182                    1HLA             1987-10-15
12         Human hemoglobin -subunit                178                    3HHB             1975-04-01
13         Human hemoglobin -subunit                176                    3HHB             1975-04-01
14         Ribonuclease A                           160                    2RNS             1973-04-01
15         Human cyclin-dependent kinase 2 (CDK2)   153                    1HCK             1996-06-03

Distinct and novel protein sequences
[Figure: percent of distinct/novel structures by period (1972-1979, 1980-1989, 1990-1999, 2000-2008). Series: structures containing distinct protein sequences (<98%); structures containing novel protein sequences (<30%); subset of PSI structures; subset of other SG structures. Data labels as extracted: 63%, 51%, 39%, 37%, 32%, 27%, 7%, 14%, 7%, 25%, 16%, 4%, 2%, 10%.]
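The two cutoffs on the slide (<98% for "distinct", <30% for "novel") can be read as thresholds on sequence identity against what is already in the archive. The sketch below is only an illustration of that reading, not the procedure used to produce the chart: the `identity` function is a crude stand-in built on Python's `difflib` rather than a real alignment, and the function names and example sequences are hypothetical. Because every novel sequence is also distinct under these cutoffs, `classify` returns the most specific label.

```python
from difflib import SequenceMatcher

DISTINCT_CUTOFF = 0.98  # slide: "distinct" sequences fall below 98%
NOVEL_CUTOFF = 0.30     # slide: "novel" sequences fall below 30%


def identity(a: str, b: str) -> float:
    """Rough identity estimate (assumption: not a true sequence alignment).

    Counts characters in matching blocks and divides by the longer length.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(a), len(b))


def classify(new_seq: str, archive: list[str]) -> str:
    """Label a sequence by its best identity to any archived sequence."""
    best = max((identity(new_seq, old) for old in archive), default=0.0)
    if best < NOVEL_CUTOFF:
        return "novel"
    if best < DISTINCT_CUTOFF:
        return "distinct"
    return "redundant"


if __name__ == "__main__":
    archive = ["MKTAYIAKQR"]  # hypothetical archived sequence
    for seq in ("MKTAYIAKQR", "MKTAYIAKQA", "WWWWWWWWWW"):
        print(seq, classify(seq, archive))
```

A production pipeline would replace `identity` with a proper alignment-based measure; the threshold logic would stay the same.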
Lysozyme: Lessons learned
T4 bacteriophage (459 structures)
 Amino acid replacement studies
suggest that fraction of amino
acid residues that define the
structure of T4 lysozyme is about
50%
B.W. Matthews (1996) FASEB J.10: 35-41.
Insight into folding and catalysis
Blake, Koenig, Mair, North, Phillips, Sarma
(1965) Nature 206: 757.
Hen egg white (297 structures)
 Low sequence identity
 Structural similarity of active site
to T4
B.W. Matthews, M.G. Remington, M.G. Grutter, W.F. Anderson (1981)
J.Mol.Biol. 147: 545-58.
Insight into evolution and catalysis
Myoglobin and hemoglobin:
Lessons learned
Whale myoglobin (185 structures)
 Different ligands: oxygen, carbon dioxide [1]
 Amino acid substitution studies [2]
 Laue studies [3]
Insight into function and dynamics
Other species myoglobin
 Low sequence identity, same structure [4]
Insight into evolution
Human hemoglobin (178 structures)
Insight into function and disease (sickle cell anemia, thalassemia) [5]
Other species hemoglobin
 Low sequence identity, same structure [4]
Profound insight into evolution
Lodish et al. [6]
[1] Kuriyan, Wilz, Karplus, Petsko (1986) J. Mol. Biol. 192: 133-154.
[2] Quillin, Arduini, Olson, Phillips Jr. (1993) J. Mol. Biol. 234: 140-155; Carver, Brantley Jr., Singleton, Arduini, Quillin, Phillips Jr., Olson (1992) J. Biol. Chem. 267: 14443-14450.
[3] Bourgeois, Vallone, Schotte, Arcovito, Miele, Sciara, Wulff, Anfinrud, Brunori (2003) PNAS 100: 8704-8709.
[4] Dickerson, Geis (1983) Hemoglobin: structure, function, and pathology.
[5] Kidd, Baker, Mathews, Brittain, Baker (2001) Prot. Sci. 10: 1739-1749; Harrington, Adachi, Royer Jr. (1998) J. Biol. Chem. 273: 32690-32696.
[6] Lodish, Berk, Zipursky, Matsudaira, Baltimore, Darnell (2000) Molecular Cell Biology, WH Freeman & Co.
TIM barrel proteins: Lessons learned
TIM barrel structures (1727)
http://www.cathdb.info
 Share the same fold but represent
significant sequence and
functional diversity
 Are enzymes or enzyme-related
proteins involved in molecular or
energy metabolism
 Comparative structure analysis
indicates evolutionary relatedness
of TIM barrel proteins
Banner, Bloomer, Petsko,
Phillips, Wilson, (1976)
Biochem.Biophys.Res.
Commun. 72: 146-155
Nagano, Orengo, Thornton (2002) J.Mol. Biol. 321: 741-65.
HIV-related structures
[Figure: HIV-related structures by year. Categories: protease, reverse transcriptase, Gag protein, integrase, other. Targets labelled: HIV-1 protease, HIV-1 reverse transcriptase. Counts shown: 122, 311, 27, 39, 110.]
Drugs shown: Abacavir (GSK), Nevirapine (BI), Stavudine (BMS), Amprenavir (GSK), Fosamprenavir (GSK), Efavirenz (BMS), Lamivudine (GSK), Lopinavir (Abbott), Zidovudine (GSK), Atazanavir (BMS), Emtricitabine (Gilead), Nelfinavir (Agouron), Darunavir (Tibotec), Tenofovir (Gilead), Zalcitabine (Hoffmann-La Roche), Tipranavir (BI), Etravirine (Tibotec), Delavirdine (Pfizer), Indinavir (Merck), Ritonavir (Abbott), Saquinavir (Roche)
PDB IDs shown alongside the drugs, in the groups in which they appear:
– 2HND, 2HNY, 1S1U, 1S1X, 1LW0, 1LWE, 1LWC, 1LWF, 1JLB, 1JLF, 1FKP, 1VRT, 3HVT
– 1T7J, 1HPV
– 2FXE, 2FXD, 2O4K, 2AQU, 2FND
– 2RKG, 2RKF, 2QHC, 2Z54, 2Q5K, 2O4S, 1RV7, 1MUI
– 1JKH, 1IKW, 1IKV, 1FKO, 1FK9
– 2QAK, 2PYM, 2Q63, 2PYN, 2Q64, 2R5Q, 1OHR
– 2R5P, 2B7Z, 2AVV, 2AVO, 2AVS, 1SGU, 1SDT, 1SDV, 1SDU, 1K6C, 1C6Y, 2BPX, 1HSG, 1HSH
– 2O4N, 2O4L, 2O4P, 1D4Y, 1D4S
– 1T05
– 1S6P
– 2B60, 1RL8, 1SH9, 1N49, 1HXW
– 3D1X, 3D1Y, 3CYX, 2NMW, 2NMZ, 2NNP, 2NMY, 2NNK, 1C6Z, 1FB7
Scientific challenges to the PDB
 Number of data files continues to increase
 Information content of each data file is increasing
 Many more very large macromolecular complexes
 New structure determination methods
Growth of PDB Depositions
[Figure: depositions per year, 1972-2009, in two panels: by experimental method (X-ray, NMR, EM) and by deposition and processing site. Y-axis: 0 to 9000. Labels: * (projected), (8322).]
Location of PDB Depositors (1999-2009)
Number of Structures
Increase of PDB data depositions from
Asia and Oceania regions
Year
Technical challenges in data
management
 How do we represent diverse data?
 How do we make a searchable database?
 How do we integrate with other data resources?
 How do we make a scalable system?
 How do we meet the needs of a diverse community?
The pipeline: deposition to release
Structure
Determination
PDB
Deposition
Data
Processing
Data
Archiving
Data
Distribution
& Query
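The five boxes on this slide form a strictly ordered pipeline. As a minimal sketch of that ordering (the class and function names are hypothetical and do not describe any wwPDB software), the stages can be modelled as an ordered enumeration with a single step function:

```python
from enum import IntEnum


class Stage(IntEnum):
    """Pipeline stages in the order shown on the slide."""
    STRUCTURE_DETERMINATION = 1
    DEPOSITION = 2
    PROCESSING = 3
    ARCHIVING = 4
    DISTRIBUTION_AND_QUERY = 5


def next_stage(stage: Stage) -> Stage:
    """Return the stage that follows `stage`; the last stage has no successor."""
    if stage is Stage.DISTRIBUTION_AND_QUERY:
        raise ValueError("entry has already reached distribution")
    return Stage(stage + 1)


if __name__ == "__main__":
    stage = Stage.STRUCTURE_DETERMINATION
    while stage is not Stage.DISTRIBUTION_AND_QUERY:
        print(stage.name)
        stage = next_stage(stage)
    print(stage.name)
```

Using an ordered enumeration keeps the sequence explicit and makes it an error to advance an entry past the final stage.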
Data In: What happens with PDB
depositions?
RCSB and wwPDB Full Data Flow
[Diagram: data flow from depositors to consumers. Stages: Deposition, Processing and Annotation, Integration, Dissemination.
Deposition tools by site: ADIT (RCSB); Autodep (PDBe); ADIT NMR (BMRB); ADIT and ADIT NMR (PDBj). Web communication with depositor.
Processing steps: Harvest, Prepare, Prevalidate; PDB ID; Validation; Annotation; shared DB.
Exchange and release: wwPartners data exchange file (daily upload); Release; Archive; master PDB FTP archive (RCSB at RU); external loaders; RCSB database.
Distribution: PDB FTP (RCSB at UCSD); PDBe and PDBj PDB FTP mirrors; RCSB, PDBe, and PDBj web access to data.]
After deposition: annotation and
validation
 Check all incoming files
– Sequence/structure
correspondences
– Small molecule ligands
– Biological assembly (PISA, author-defined)
– Agreement with experimental data
– Agreement with known geometrical
features (MolProbity, Procheck,
SFCheck, NUCheck)
 Update and maintain data
processing database daily
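To make the first check in the list concrete, here is a minimal sketch of a sequence/structure correspondence test. It is an illustration under stated assumptions, not the annotation software itself: the function name, the `Mismatch` record, and the input shape (a mapping from 1-based sequence position to the one-letter residue found in the coordinates) are all hypothetical. Positions absent from the coordinates, such as disordered residues, are simply not checked.

```python
from dataclasses import dataclass


@dataclass
class Mismatch:
    position: int  # 1-based position in the deposited sequence
    expected: str  # residue in the deposited sequence ("-" if out of range)
    observed: str  # residue found in the coordinates


def check_sequence_correspondence(
    deposited: str, observed: dict[int, str]
) -> list[Mismatch]:
    """Compare modelled residues against the deposited sequence."""
    problems = []
    for pos, residue in sorted(observed.items()):
        if pos < 1 or pos > len(deposited):
            # Coordinates contain a residue the sequence does not account for.
            problems.append(Mismatch(pos, "-", residue))
        elif deposited[pos - 1] != residue:
            problems.append(Mismatch(pos, deposited[pos - 1], residue))
    return problems


if __name__ == "__main__":
    # Hypothetical entry: position 3 is unmodelled, 4 disagrees, 6 is extra.
    for problem in check_sequence_correspondence(
        "MKTAY", {1: "M", 2: "K", 4: "G", 6: "L"}
    ):
        print(problem)
```

An empty result means every modelled residue agrees with the deposited sequence; any entries in the list would be raised with the depositor during annotation.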
Developing method-specific
standards

 X-ray Validation Task Force
– April 14-16, 2008 at EBI-EMBL, Hinxton, UK
– Randy Read (Chair)
 NMR Validation Task Force
– September 21, 2009 in Paris, France
– Guy Montelione, Michael Nilges (Co-chairs)
EMDataBank.org
Electron Microscopy
Unified Data Resource for CryoEM
 Collaborative project between RCSB PDB,
PDBe, and Baylor-NCMI is funded by NIH
 Unified tool for collecting model coordinates
and map files in a one-stop shop
 Merge with wwPDB as part of Common Tool
by 2011
EM Coordinate and Map Depositions
EMDatabank.org
Planning for the Future:
wwPDB Deposition and Annotation Tool
Goal: To collaboratively develop the new processes
and supporting systems that will support the wwPDB
over the next 10 years.
The new systems will provide a high
quality and dependable resource that
will effectively:



 support increases in deposition throughput
 address the anticipated increase in complexity and experimental variety of submissions
 focus on quality enhancement through the use of community-based validation tools
Common Tool for
Deposition and Annotation
Manage increased data load without an
increase in resources
Create global deposition and annotation tools
Distribute worldwide data load and eliminate
individual points of failure
Anticipate new developments in structural
biology to keep tools up to date
Continuous data annotation to
support searching and reporting
Data quality
Data standardization
Extended annotation
Improved search functionality
Extended search options
Example: 2007
Archive Updates
 All primary citations verified
 Sequences & taxonomy updated for sequences
 Improved biological representation
 Symmetry and coordinate transformations for virus entries
 Ligand stereochemistry and nomenclature for monomers and non-polymer molecules
 Diffraction source & beamline updates
 Miscellaneous uniformity issues
[Figure: Before / After]
C.L. Lawson, S. Dutta, J.D. Westbrook,
K. Henrick and H.M. Berman (2008)
Representation of viruses in the
remediated PDB archive Acta Cryst.
D64: 874-882
Data Out: What happens when data
are released?
 FTP site for wwPDB
 Data downloaded by hundreds of external
resources
 Each wwPDB member maintains websites with
different services
RCSB PDB portal
www.pdb.org
MyPDB: Keep up-to-date
with new structures...automatically!
 Framework to store user
preferences
 Saves queries in a
private account
 Notifies users via email
when new structures
match stored queries
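The stored-query behaviour described above can be sketched in a few lines. This is a hypothetical illustration, not the MyPDB implementation: the `StoredQuery` class, the keyword-in-title matching rule, and the placeholder IDs and titles are all assumptions. The key idea is that each saved query remembers which entries it has already reported, so a user is only notified about new matches.

```python
from dataclasses import dataclass, field


@dataclass
class StoredQuery:
    owner_email: str                      # where a notification would be sent
    keyword: str                          # assumed match rule: keyword in title
    seen_ids: set[str] = field(default_factory=set)


def new_matches(query: StoredQuery, released: dict[str, str]) -> list[str]:
    """Return IDs matching the query that have not been reported before.

    `released` maps an entry ID to its title. The query's record of seen
    entries is updated so a second call does not report the same IDs again.
    """
    hits = {
        entry_id
        for entry_id, title in released.items()
        if query.keyword.lower() in title.lower()
    }
    fresh = sorted(hits - query.seen_ids)
    query.seen_ids |= hits
    return fresh


if __name__ == "__main__":
    # Placeholder IDs and titles, not real archive entries.
    released = {"XXX1": "Lysozyme mutant", "XXX2": "Kinase domain"}
    query = StoredQuery("user@example.org", "lysozyme")
    print(new_matches(query, released))
    released["XXX3"] = "T4 lysozyme complex"
    print(new_matches(query, released))
```

In a real service the returned list would feed an email step after each release; here it is simply printed.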
Interactive Views of Domain Annotations
Structure Explorer Summary Page
 Information
summarized in easy-to-read 2-column
format
 Related information
presented in
customizable
“widgets”
 Abstract from
PubMed is displayed
Visualization Options
 3D Viewers are
context-sensitive
– Asymmetric unit
– Biological assembly
 Biological assembly
is displayed by
default
 Presumed
oligomeric state of
biological molecule
is displayed (for X-ray structures)
Protein-Ligand Interaction View
 Simplified
user interface
 Added metal
interactions
 Display of
bond orders
from
Chemical
Component
Dictionary
Integrating sequence,
structure, and function
http://kb.psi-structuralgenomics.org/KB/
Knowledgebase
The Structural Genomics Knowledgebase is a free online resource that
gives access to protein information determined by the Protein Structure
Initiative (PSI) and other key biological resources to enable a better
understanding of the molecular basis of biology and disease.
Scope of PSI SGKB
Experimental Tracking
Target Selection
Materials
Genomic
Based Target
Selection
Isolation,
Expression,
Purification,
Crystallization
Data
Collection
Structure
Determination
PDB Deposition
& Release
Models
Annotations
Publications
Technology
Metrics
 To capture, organize, and provide access to key elements of the
structural determination high-throughput pipelines
 To leverage such information through the generation of molecular
models and integration of functional annotation for use by the
scientific community
Navigating the PSI SGKB Homepage
 Database searchable by
sequence, text, and PDB ID
and delivers aggregate
reports, inventories
 Links to PSI projects, external
resources, and publications
 Link to central Community-Nominated Targets Proposal
system
 SG Gateway with Nature
delivers research findings,
technologies, news and
events related to the PSI and
structural genomics
 Publicizes recently solved PSI
structures or new editorial
content
Target information
Protocols
Technologies
Models
Publications
Links to Biological Resources
The PSI SGKB enables knowledge…
 By connecting protein sequence
information to 3D structures and
homology models
 By providing centralized access
to experimental protocols,
materials, and technologies
 By fostering community
collaborations
Structural Views of Biology and Medicine
What we have learned so far
 Sequence-structure-function relationships are
complex




– Low sequence identity, same structure (hemoglobin)
– Same structure, different function (TIM)
– Different overall structure, same function (lysozyme)
– New protein targets lead to new drugs (HIV protease)
 Technology-science cycle closely coupled in
structural biology
 A structural view of biology is closer than we
thought
 “If it can be done, it will be done”
Acknowledgements
Funding Agencies for all Projects:
NSF, NIGMS, DOE, NLM, NCI,
NCRR, NIBIB, NINDS, NIDDK
Wellcome Trust, EU,
CCP4, BBSRC, MRC, EMBL
BIRD-JST, MEXT