What is in a PDB file?
Shuchismita Dutta
May 7, 2009

Overview
 Exploring the PDB format file
 File formats and dictionaries
 Validation
 Finding PDB format files and other files
Exploring PDB file
Meta data
Coordinates
Title section
OBSLTE 18-JUL-84 1HHB 2HHB 3HHB 4HHB
SPLIT 1JGP 1JGQ 1JGO
CAVEAT 1B86 THERE ARE CHIRALITY ERRORS IN C-ALPHA CENTERS
REVDAT 4 24-FEB-09 4HHB 1 VERSN
REVDAT 3 01-APR-03 4HHB 1 JRNL
REVDAT 2 15-OCT-89 4HHB 3 MTRIX
REVDAT 1 17-JUL-84 4HHB 0
SPRSDE 17-JUL-84 4HHB 1HHB
Remarks: the numbers mean something
REMARK 0
REMARK 0 THIS ENTRY (2Q41) REFLECTS AN ALTERNATIVE MODELING OF THE
REMARK 0 ORIGINAL STRUCTURAL DATA (R1XJ5SF) DETERMINED BY AUTHORS
REMARK 0 OF THE PDB ENTRY 1XJ5: G.E.WESENBERG,D.W.SMITH,
REMARK 0 G.N.PHILLIPS JR.,E.BITTO,C.A.BINGMAN,S.T.M.ALLARD,
REMARK 0 CENTER FOR EUKARYOTIC STRUCTURAL GENOMICS (CESG).
Data collection details:
X-ray source, detector, data collection details (200)
Fiber diffraction (205)
NMR (210, 215, 217)
Neutron diffraction (230)
Electron crystallography (240)
Electron Microscopy (245)
Crystallographic details:
Matthews coefficient (Vm)
Crystallographic symmetry
Remark 3
Each refinement program has its own template and details
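The remark numbers above can be used to route a REMARK line to its topic. A minimal Python sketch: the `REMARK_TOPICS` table and the `remark_topic` helper are illustrative names, and the table covers only the numbers named on these slides, not the full list in the format guide.

```python
# Topics for the remark numbers mentioned on the slides (not exhaustive).
REMARK_TOPICS = {
    0: "alternative modeling of existing data",
    3: "refinement details",
    200: "X-ray data collection",
    205: "fiber diffraction",
    210: "NMR",
    215: "NMR",
    217: "NMR",
    230: "neutron diffraction",
    240: "electron crystallography",
    245: "electron microscopy",
    800: "site details",
}

def remark_topic(line):
    """Return the topic of a REMARK line, or None if it is not a REMARK."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "REMARK" or not parts[1].isdigit():
        return None
    return REMARK_TOPICS.get(int(parts[1]), "other")

print(remark_topic("REMARK 200 EXPERIMENT TYPE : X-RAY DIFFRACTION"))
```

Splitting on whitespace is used here so that the helper works both on real files and on the flattened records shown in these slides.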
Remarks: the numbers mean something
Biological assembly information
Example of a virus (1AYN)
Remarks
Compound details
Missing residues, atoms
Geometry: close contacts, bond length, angle and
torsion deviations, stereochemistry
Ligand details
Related entries
Sequence details
Chemistry sections :
Primary Structure & Ligand
DBREF 1BH0 A 1 29 UNP P01275 GLUC_HUMAN 53 81
SEQADV 1BH0 LYS A 17 UNP P01275 ARG 69 ENGINEERED
SEQADV 1BH0 LYS A 18 UNP P01275 ARG 70 ENGINEERED
SEQADV 1BH0 GLU A 21 UNP P01275 ASP 73 ENGINEERED
SEQRES 1 A 29 HIS SER GLN GLY THR PHE THR SER ASP TYR SER LYS TYR
SEQRES 2 A 29 LEU ASP SER LYS LYS ALA GLN GLU PHE VAL GLN TRP LEU
SEQRES 3 A 29 MET ASN THR
MODRES 2F4K NLE A 65 LEU NORLEUCINE
MODRES 2F4K NLE A 70 LEU NORLEUCINE
HET PO4 D 147 1
HET PO4 B 147 1
HET HEM A 142 43
HET HEM B 148 43
HET HEM C 142 43
HET HEM D 148 43
HETNAM PO4 PHOSPHATE ION
HETNAM HEM PROTOPORPHYRIN IX CONTAINING FE
HETSYN HEM HEME
FORMUL 5 PO4 2(O4 P 3-)
FORMUL 7 HEM 4(C34 H32 FE N4 O4)
FORMUL 11 HOH *221(H2 O)
Secondary Structure & Connectivity
HELIX 1 AA SER A 3 GLY A 18 1 16
HELIX 2 AB HIS A 20 SER A 35 1 16
HELIX 3 AC PHE A 36 TYR A 42 1 7
SHEET 1 A 4 ILE A 18 LEU A 23 0
SHEET 2 A 4 LEU A 111 VAL A 118 -1 O GLY A 115 N TRP A 19
SSBOND 1 CYS A 6 CYS A 127 1555 1555 2.02
SSBOND 2 CYS A 30 CYS A 115 1555 1555 2.02
SSBOND 3 CYS A 64 CYS A 80 1555 1555 2.03
SSBOND 4 CYS A 76 CYS A 94 1555 1555 2.01
LINK NE2 HIS A 87 FE HEM A 143 1555 1555 1.94
LINK NE2 HIS B 92 FE HEM B 147 1555 1555 2.07
LINK FE HEM B 147 O1 OXY B 150 1555 1555 1.87
LINK FE HEM A 143 O1 OXY A 150 1555 1555 1.66
CISPEP 1 PRO A 98 PRO A 99 0 0.53
CISPEP 2 GLY A 109 PRO A 110 0 -0.01
Miscellaneous
SITE 1 ACT 3 HIS H 57 ASP H 102 SER H 195
SITE 1 AC1 12 HIS H 57 ASN H 98 LEU H 99 ILE H 174
SITE 2 AC1 12 ASP H 189 ALA H 190 SER H 195 TRP H 215
SITE 3 AC1 12 GLY H 216 GLY H 219 HOH H 264 HOH H 270
REMARK 800 SITE
REMARK 800 SITE_IDENTIFIER: ACT
REMARK 800 EVIDENCE_CODE: AUTHOR
REMARK 800 SITE_DESCRIPTION: CATALYTIC SITE
REMARK 800 SITE_IDENTIFIER: AC1
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE MID H 1
Crystallographic info, Coordinate Transformations & coordinates
CRYST1 88.814 95.207 89.164 90.00 104.96 90.00 P 1 21 1 8
ORIGX1 1.000000 0.000000 0.000000 0.00000
ORIGX2 0.000000 1.000000 0.000000 0.00000
ORIGX3 0.000000 0.000000 1.000000 0.00000
SCALE1 0.011259 0.000000 0.003009 0.00000
SCALE2 0.000000 0.010503 0.000000 0.00000
SCALE3 0.000000 0.000000 0.011609 0.00000
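The SCALE records are not independent data: they can be recomputed from the unit cell in CRYST1. A minimal Python sketch, assuming the usual orthogonal frame (a along x, b in the xy plane) and the identity ORIGX shown above; `scale_matrix` is an illustrative helper name, not part of any library.

```python
import math

def scale_matrix(a, b, c, alpha, beta, gamma):
    """Return the 3x3 matrix mapping orthogonal (Å) to fractional coordinates."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    sg = math.sin(math.radians(gamma))
    # Unit-cell volume divided by a*b*c.
    v = math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    # Upper-triangular orthogonalization matrix (fractional -> Å).
    m11, m12, m13 = a, b * cg, c * cb
    m22, m23 = b * sg, c * (ca - cb * cg) / sg
    m33 = c * v / sg
    # Closed-form inverse of an upper-triangular 3x3 matrix.
    return [
        [1 / m11, -m12 / (m11 * m22), (m12 * m23 - m13 * m22) / (m11 * m22 * m33)],
        [0.0, 1 / m22, -m23 / (m22 * m33)],
        [0.0, 0.0, 1 / m33],
    ]

# Cell parameters from the CRYST1 record above.
scale = scale_matrix(88.814, 95.207, 89.164, 90.00, 104.96, 90.00)
for row in scale:
    print([round(value, 6) for value in row])
```

For this monoclinic cell the nonzero entries reduce to 1/a, -cos(beta)/(a sin(beta)), 1/b and 1/(c sin(beta)), which agree with SCALE1 to SCALE3 above to within rounding of the sixth decimal.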
MODEL 1
ATOM 1 N SER A 41 -9.122 -10.304 89.511 0.12 51.94 N
ATOM 2 CA SER A 41 -8.282 -11.187 88.650 0.12 52.75 C
ATOM 3 C SER A 41 -7.051 -11.693 89.414 0.12 52.51 C
ATOM 4 O SER A 41 -6.646 -11.108 90.421 0.12 53.15 O
ATOM 5 CB SER A 41 -7.845 -10.416 87.393 0.12 51.93 C
ATOM 6 OG SER A 41 -7.250 -11.264 86.423 0.12 52.59 O
ATOM 7 N THR A 42 -6.473 -12.792 88.935 0.12 51.75 N
ATOM 8 CA THR A 42 -5.290 -13.380 89.552 0.12 50.38 C
...
ENDMDL
Coordinate section: A Closer look
Microheterogeneity (1ENM)
Columns: record, S.#, Atom name, Alternate conformer ID + Residue name, Chain ID, Residue #, x coordinate, y coordinate, z coordinate, Occupancy, B-factor, Atom type
ATOM 49 N GLY A 8 2.326 4.110 1.416 1.00 42.03 N
ATOM 50 CA GLY A 8 3.121 3.079 2.065 1.00 42.27 C
ATOM 51 C GLY A 8 3.533 3.408 3.476 1.00 42.32 C
ATOM 52 O GLY A 8 4.302 2.642 4.092 1.00 44.09 O
ATOM 53 N GLY A 9 3.080 4.526 4.038 1.00 40.18 N
ATOM 54 CA GLY A 9 3.330 4.880 5.396 1.00 40.11 C
ATOM 55 C GLY A 9 4.552 5.685 5.709 1.00 39.75 C
ATOM 56 O GLY A 9 4.720 6.098 6.885 1.00 40.96 O
ATOM 57 N ASER A 10 5.404 6.014 4.753 0.33 39.21 N
ATOM 58 CA ASER A 10 6.598 6.814 5.042 0.33 38.11 C
ATOM 59 C ASER A 10 6.236 8.234 5.479 0.33 36.87 C
ATOM 60 O ASER A 10 5.150 8.733 5.233 0.33 32.77 O
ATOM 61 CB ASER A 10 7.516 6.864 3.822 0.33 39.46 C
ATOM 62 OG ASER A 10 8.894 6.884 4.237 0.33 40.79 O
ATOM 63 N BGLY A 10 5.404 6.014 4.753 0.67 39.21 N
ATOM 64 CA BGLY A 10 6.598 6.814 5.042 0.67 38.11 C
ATOM 65 C BGLY A 10 6.236 8.234 5.479 0.67 36.87 C
ATOM 66 O BGLY A 10 5.150 8.733 5.233 0.67 32.77 O
Residue numbering (1DWD)
Insertion codes
Columns: record, S.#, Atom name, Residue name, Chain ID, Residue # (with insertion code), x coordinate, y coordinate, z coordinate, Occupancy, B-factor, Atom type
ATOM 1 N GLU L 1C 63.677 26.331 17.947 1.00 31.77 N
ATOM 2 CA GLU L 1C 64.338 26.818 16.736 1.00 35.78 C
ATOM 3 C GLU L 1C 63.351 27.360 15.717 1.00 41.73 C
ATOM 4 O GLU L 1C 63.320 28.565 15.489 1.00 49.37 O
ATOM 5 CB GLU L 1C 65.320 25.825 16.101 1.00 38.64 C
ATOM 6 N ALA L 1B 62.537 26.499 15.096 1.00 36.03 N
ATOM 7 CA ALA L 1B 61.571 26.988 14.116 1.00 33.01 C
ATOM 8 C ALA L 1B 60.631 28.018 14.729 1.00 32.42 C
ATOM 9 O ALA L 1B 60.238 27.865 15.872 1.00 31.68 O
ATOM 10 CB ALA L 1B 60.810 25.845 13.511 1.00 33.36 C
ATOM 11 N ASP L 1A 60.262 29.089 14.012 1.00 33.13 N
ATOM 12 CA ASP L 1A 59.378 30.016 14.691 1.00 35.05 C
ATOM 13 C ASP L 1A 57.965 29.526 14.760 1.00 31.74 C
ATOM 14 O ASP L 1A 57.476 28.873 13.851 1.00 36.72 O
ATOM 15 CB ASP L 1A 59.593 31.557 14.587 1.00 41.32 C
ATOM 16 CG ASP L 1A 58.724 32.268 13.564 1.00 46.17 C
ATOM 17 OD1 ASP L 1A 57.452 32.455 13.924 1.00 47.60 O
ATOM 18 OD2 ASP L 1A 59.188 32.658 12.472 1.00 48.99 O
ATOM 19 N CYS L 1 57.321 29.802 15.860 1.00 22.52 N
ATOM 20 CA CYS L 1 56.005 29.353 16.036 1.00 15.35 C
ATOM 21 C CYS L 1 55.351 30.160 17.077 1.00 15.83 C
ATOM 22 O CYS L 1 56.002 30.636 17.968 1.00 18.73 O
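The alternate conformer ID and the insertion code each occupy a single fixed column, so a reader should slice ATOM records by position rather than split on whitespace. A minimal Python sketch, assuming the column layout in the wwPDB format guide; `parse_atom` is an illustrative helper, and the two sample lines are rebuilt field by field from the 1ENM and 1DWD examples so the column boundaries are visible.

```python
def parse_atom(line):
    """Slice one ATOM/HETATM record into named fields by column position."""
    return {
        "serial": int(line[6:11]),         # columns 7-11
        "name": line[12:16].strip(),       # columns 13-16
        "alt_loc": line[16].strip(),       # column 17
        "res_name": line[17:20].strip(),   # columns 18-20
        "chain_id": line[21],              # column 22
        "res_seq": int(line[22:26]),       # columns 23-26
        "i_code": line[26].strip(),        # column 27
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),    # columns 77-78
    }

# Atom 57 of 1ENM: conformer A of residue 10, occupancy 0.33.
ALT_LOC_LINE = "".join([
    "ATOM  ", "   57", " ", " N  ", "A", "SER", " ", "A", "  10", " ", "   ",
    "   5.404", "   6.014", "   4.753", "  0.33", " 39.21", " " * 10, " N",
])

# Atom 11 of 1DWD: residue 1 with insertion code A.
I_CODE_LINE = "".join([
    "ATOM  ", "   11", " ", " N  ", " ", "ASP", " ", "L", "   1", "A", "   ",
    "  60.262", "  29.089", "  14.012", "  1.00", " 33.13", " " * 10, " N",
])

for sample in (ALT_LOC_LINE, I_CODE_LINE):
    atom = parse_atom(sample)
    print(atom["res_name"], atom["res_seq"], repr(atom["alt_loc"]), repr(atom["i_code"]))
```

A whitespace split would read the first sample's residue name as ASER and the second sample's residue number as 1A, which is why positional slicing is the safer choice.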
Connectivity & Book keeping
CONECT 73 80
CONECT 80 73 81
CONECT 81 80 82 84
CONECT 82 81 83 88
MASTER 2487 0 28 47 52 0 0 6 73322 31 280 104
MASTER counts labeled on the slide: remarks (2487), Nonstd residues (28), Helix (47), Sheet (52), coordinates (73322), SEQRES (104)
END
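The MASTER record shows why fixed-width parsing matters: adjacent counts can touch, as the transformation and coordinate counts do in this example. A minimal Python sketch, assuming the format guide's layout of twelve 5-character integer fields starting at column 11; reading the fused token as 6 and 73322 is an interpretation based on that layout, and `parse_master` is an illustrative helper.

```python
MASTER_FIELDS = [
    "numRemark", "zero", "numHet", "numHelix", "numSheet", "numTurn",
    "numSite", "numXform", "numCoord", "numTer", "numConect", "numSeq",
]

def parse_master(line):
    """Read the twelve 5-character counts that start at column 11."""
    values = [int(line[10 + 5 * i : 15 + 5 * i]) for i in range(12)]
    return dict(zip(MASTER_FIELDS, values))

# Rebuild the record from the slide with right-justified 5-character fields.
counts = [2487, 0, 28, 47, 52, 0, 0, 6, 73322, 31, 280, 104]
line = "MASTER    " + "".join(f"{n:5d}" for n in counts)

print(line.split())          # whitespace split fuses two counts into one token
print(parse_master(line))    # column slicing recovers all twelve counts
```

These counts are bookkeeping: a validator can compare them against the number of REMARK, HET, HELIX, SHEET, coordinate, CONECT and SEQRES records actually present in the file.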
The PDB format guide
 Located at
– http://www.wwpdb.org/documentation/format32/v3.2.html
 Defines all the records that appear in the PDB
file
 Includes templates for all records and
remarks
www.wwpdb.org
Keeping track of all the information
 PDB format file is a report from a database
 The database is built on the PDB exchange and
chemical component dictionaries
 The PDB exchange dictionary captures every
piece of data from the PDB format file to build the
mmCIF format file
 Validation uses dictionaries to
– Check inter-relationships between different data components
– Match information to chemical component dictionary
PDB Format vs mmCIF Format
PDB format file
 80 characters wide
 Includes header and coordinates (x, y, z, occupancy and B-factors) for all atoms.
 Includes name, source and sequence of all polymers
 Can include a maximum of 62 chains and 99999 atoms.
mmCIF format file
 Free format
 Includes header and coordinates (x, y, z, occupancy and B-factors) for all atoms.
 Includes name, source and sequence of all polymers
 No restriction on the number of chains or atoms in file.
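The two PDB-format limits follow from its fixed-width columns: the chain ID is a single character and the atom serial number is five digits wide. A small Python check, assuming chain IDs are drawn from uppercase letters, lowercase letters and digits:

```python
import string

# One-character chain IDs: A-Z, a-z and 0-9.
CHAIN_ID_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits
MAX_CHAINS = len(CHAIN_ID_ALPHABET)

# Atom serial numbers occupy a 5-character integer field.
SERIAL_WIDTH = 5
MAX_ATOMS = 10 ** SERIAL_WIDTH - 1

print(MAX_CHAINS, MAX_ATOMS)
```

mmCIF has no such ceilings because its values are delimited tokens rather than fixed columns.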
Dictionaries
 PDB Exchange (pdbx) dictionary
– (http://mmcif.pdb.org/)
– Includes the syntax, definitions, relations, boundaries
– Includes examples for the contents of the mmCIF format file.
 Chemical Component Dictionary
– Describes all residues in the PDB files (standard, non-standard amino
acids, nucleotides and other ligands - ions, drugs, cofactors, inhibitors)
– 1-3 alphanumeric character identifier
– Includes model & idealized coordinates for components, connectivities,
name, formula, SMILES strings
– Maintained by the wwPDB.
– Used for data processing and validation of structures
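The identifier rule above is easy to check before looking a residue up in the dictionary. A minimal Python sketch, assuming identifiers are written in uppercase; `looks_like_component_id` is a hypothetical helper, and passing the check does not mean the component actually exists in the dictionary.

```python
import re

# 1-3 alphanumeric characters, per the slide; uppercase is an assumption.
COMPONENT_ID = re.compile(r"[A-Z0-9]{1,3}")

def looks_like_component_id(text):
    """Return True if text has the shape of a chemical component identifier."""
    return COMPONENT_ID.fullmatch(text) is not None

for candidate in ("HEM", "PO4", "NLE", "HEME"):
    print(candidate, looks_like_component_id(candidate))
```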
Ligand cif file
Ligand Expo - Search Options
Also use for component building
Ligand Expo – Substructure Search
Ligand Expo - Browse Options
[Figure: excerpts of a PDB format file and an mmCIF format file. The PDB Exchange Dictionary includes syntax & definitions for mmCIF format files. An instance of valine is matched to VAL in the Chemical Component Dictionary.]
Validation
 Quality assessment
 Is the structure well determined overall?
 Is the structure suitable for your analysis and/or
modeling requirements?
 Are local regions that you are interested in well
determined?
When to Validate?
Step 0: Validation
Step 1: PDB ID
Step 2: Validation Report
Step 3: Corrections
Step 4: Depositor Approval
Step 5: Functional Annotation
[Workflow diagram labels: Refinement; Depositor Data; Deposition; Primary Annotation; Validation; PDB Entry; Archival Data; Core Database; Distribution Site; Download Data; Validation; Use of PDB data]
What is validated?
 Chemistry
– Of polymer (match to DB and internal consistency)
– Of ligands, ions, inhibitors (match to dictionary)
 Geometry
– Close contacts
– Bond length, angle, torsion etc. deviations
– Ramachandran plot
 Experimental data
– SF check
– R factors
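As an illustration of the geometry checks listed above, a bond-length deviation can be computed directly from two coordinate records. A minimal Python sketch using the N and CA atoms of GLU L 1C from the 1DWD example; the reference bond length and standard deviation are assumed values for illustration, not taken from these slides, and `bond_deviation` is a hypothetical helper.

```python
import math

# Coordinates (Å) of N and CA of GLU L 1C from the 1DWD example.
n_xyz = (63.677, 26.331, 17.947)
ca_xyz = (64.338, 26.818, 16.736)

# Assumed reference values for the N-CA bond (not from the slides).
IDEAL_N_CA = 1.458
SIGMA_N_CA = 0.019

def bond_deviation(p, q, ideal, sigma):
    """Return (bond length, deviation from ideal in units of sigma)."""
    length = math.dist(p, q)
    return length, (length - ideal) / sigma

length, z = bond_deviation(n_xyz, ca_xyz, IDEAL_N_CA, SIGMA_N_CA)
print(f"N-CA = {length:.3f} Å, deviation = {z:+.2f} sigma")
```

Validation software applies the same idea to every bond, angle and torsion, and reports the outliers.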
How to validate?
 Molprobity
 EDS server
 Procheck
 Whatcheck/Whatif
 Validation server at RCSB PDB
Electron Density Server report
Real-space R-value
Validation at RCSB PDB
Close contacts
Bond length deviations
Torsion angles
Planarity
Missing residues
Link records
SFCheck report
Downloading files
 Coordinate
– PDB
 80 characters wide
 Created for X-ray structures
 Updated for NMR, EM and other methods
– mmCIF
 More flexible format
 Based on mmCIF (PDBX) dictionary
– PDBML
 XML translation of mmCIF format files
– Biological Unit
 Experimental data
– SF files
 Distributed in mmCIF format
– Constraints file
 Validated by BMRB
Archive download
The ftp archive
RCSB PDB website
The Structure Summary page
Asymmetric and Biological Unit
Structure Analysis (RCSB tables)
Summary
 Exploring the PDB format file
– Documentation available from the wwPDB website
 File formats and dictionaries
– Documentation and links available from the wwPDB website
 Validation
– Links available from RCSB PDB website
 Finding PDB format files and other files
– Links available from wwPDB and RCSB PDB websites
Funding
NSF, NIGMS, DOE, NLM, NCI,
NINDS, NIDDK
Wellcome Trust, EU,
CCP4, BBSRC, MRC, EMBL
BIRD-JST, MEXT
NLM