High Throughput Processing of the Structural Information of the Protein Data Bank
Zoltán Szabadka, Vince Grolmusz
Department of Computer Science, Eötvös University, Budapest

What is wrong with the PDB?
• It is not uniform; each author has a different style
• It is hard to process automatically
– Residue numbering is not always sequential
– The chemical symbols of the atoms are often missing
– It is not easy to tell how many ligands there are in an entry; chain ids are not used consistently
– It is not clearly indicated whether a molecule has missing atoms, and which atoms are missing
• There is a need for a “front-end” database to the PDB

Flow of data
• Internet → local PDB mirror (download and check for updates)
• Local PDB mirror → database of structure and coordinate data (structural decomposition)
• Database → test sets for docking algorithms (SQL query)
• Database → list of binding sites (SQL query)
• Database → statistical information (SQL query)

What type of molecules are there in a PDB entry?
• Protein chains (P)
• DNA/RNA chains (N)
• Ligands (L)
• Metals and other small ions (I)
• Water molecules (W)

Information stored in the database
• Covalent structure of molecules
• List of components of each entry
• Coordinate data for each atom
• Interactions between molecules

E/R diagram of the database: covalent structure
[Figure: E/R diagram. Entities: molecule, atom, monomer. Relationships: contains, bond. Attribute labels: id, symbol, type, num.]

E/R diagram of the database: component structure
[Figure: E/R diagram. Entities: entry, component, molecule, atom. Relationships: contains, interaction. Attribute labels: id, pdbid, type, length, (x,y,z).]

PDB file formats
• PDB format: the original PDB file format. It contains data records on separate lines, each with a fixed length and format, e.g. ATOM, HETATM, SEQRES, CONECT, etc.
• mmCIF format: a relational database description language. A file contains data tables called categories.
• XML format: the same tables are described by XML tags. The files are huge; a file contains more tags than data.
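
The E/R diagrams above suggest a relational layout for the front-end database. The following is only a minimal sketch of what the component part might look like, using Python's built-in sqlite3 module. The table names follow the entity names in the diagrams, but the assignment of attributes to tables, the foreign keys, the helper names (build_demo_db, components_by_type) and the toy entry "demo" are all assumptions made for illustration; they are not the authors' actual schema or data.

```python
import sqlite3

# Hypothetical schema: attribute placement and keys are assumptions,
# only the entity names come from the E/R diagrams.
SCHEMA = """
CREATE TABLE entry     (id INTEGER PRIMARY KEY, pdbid TEXT UNIQUE NOT NULL);
CREATE TABLE molecule  (id INTEGER PRIMARY KEY,
                        type TEXT NOT NULL CHECK (type IN ('P','N','L','I','W')));
CREATE TABLE component (id INTEGER PRIMARY KEY,
                        entry_id    INTEGER NOT NULL REFERENCES entry(id),
                        molecule_id INTEGER NOT NULL REFERENCES molecule(id));
CREATE TABLE atom      (id INTEGER PRIMARY KEY,
                        molecule_id INTEGER NOT NULL REFERENCES molecule(id),
                        symbol TEXT NOT NULL);
CREATE TABLE atom_coord(component_id INTEGER NOT NULL REFERENCES component(id),
                        atom_id      INTEGER NOT NULL REFERENCES atom(id),
                        x REAL, y REAL, z REAL,
                        PRIMARY KEY (component_id, atom_id));
"""


def build_demo_db():
    """Create an in-memory database holding one made-up entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO entry VALUES (1, 'demo')")
    # One protein, one ligand and one water molecule (toy data).
    conn.executemany("INSERT INTO molecule VALUES (?, ?)",
                     [(1, "P"), (2, "L"), (3, "W")])
    # The protein occurs twice in the entry, the others once.
    conn.executemany("INSERT INTO component VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 1, 3)])
    return conn


def components_by_type(conn, pdbid):
    """Count the components of an entry, grouped by molecule type."""
    rows = conn.execute(
        """SELECT m.type, COUNT(*)
           FROM component c
           JOIN entry e    ON e.id = c.entry_id
           JOIN molecule m ON m.id = c.molecule_id
           WHERE e.pdbid = ?
           GROUP BY m.type""",
        (pdbid,),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    print(components_by_type(build_demo_db(), "demo"))
```

Keeping the covalent structure (molecule, atom) apart from the per-entry data (component, atom_coord) means a molecule that occurs in many entries is stored once, which is one plausible reading of why the slides show two separate diagrams.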

Structural units of an entry
• The basic structural unit of both the PDB and the mmCIF format is the so-called monomer. It can be a molecule, a molecule fragment or just a single atom.
• Each monomer has a code of at most three letters, called the monomer id, e.g. ALA for alanine, MG for magnesium ion, ACE for acetyl group, or HOH for water.
• A protein chain consists of many amino acid monomers, each having a sequence number that indicates its position within the chain.
• Similarly, DNA/RNA chains consist of many nucleotide monomers.
• Metals, small ions, water and most ligands consist of a single monomer with a unique monomer id.
• The basic problem is that certain ligand molecules consist of two or more monomers, and this information is not always properly annotated in the PDB entries in either format.

mmCIF data categories
• entity: list of the molecules in the entry. There are three types: polymer, non-polymer and water. Each molecule has an entity id.
• entity_poly: contains the type of the polymer entities, e.g. polypeptide(L).
• struct_asym: list of the components in the asymmetric unit. Each component has an asym id and an entity id.
• pdbx_poly_seq_scheme: describes the sequence of monomers in a polymer entity.
• pdbx_nonpoly_scheme: list of the monomers belonging to the non-polymer entities.
• atom_site: coordinate data for the atoms whose positions could be experimentally determined.

Structural decomposition based on the mmCIF format
• First we read the list of components in the asymmetric unit.
• For each component, we read its entity type, and for each polymer entity, its polymer type.
• Then we read the sequence of monomers for the polymer entities, and the list of monomers belonging to the non-polymer entities.
• The structure of the monomers is known a priori from a file named components.cif, which can be found at RCSB’s web site. So for each monomer, we have a list of atoms, lacking coordinate information.
• Now we go through the atom_site table, and for each atom we find the monomer it belongs to and fill in the coordinates of that atom. If an atom of a monomer is not found, it is marked as missing.

Definition of molecule types
• Protein chain: a polymer entity of type “polypeptide(L)” that is at least 10 monomers long
• DNA/RNA chain: a polymer entity that is at least 5 monomers long and either has the type “polydeoxyribonucleotide” or “polyribonucleotide”, or has more than half of its monomers being nucleotides (monomer id A, C, G, I, T or U)
• Ion: there is a predefined list of monomer ids containing metals and small ions
• Water: the monomers of the water entity
• Ligand: all monomers that do not belong to the above categories form the set of ligand monomers

Ligands and binding sites
• We define a graph on the atoms that have coordinate data. It has two types of edges:
– covalent: if the distance between the two atoms is less than 1.25 times the sum of their covalent radii
– VdW: if the edge is not covalent, but the distance between the two atoms is less than the sum of their van der Waals radii
• The graph is built using a 3-dimensional kd-tree in O(n log n) time
• We go through the edges:
– if a covalent edge connects two ligand molecules, they are joined together into one new molecule
– if an edge connects a ligand to a protein chain, this intermolecular interaction is recorded in the protein-ligand interaction table, marking the binding site of this ligand on the protein surface

PDB version: June 6, 2005
• Number of PDB entries: 31,217
• Number of entries processed: 26,445
• Number of protein chains: 59,842
• Number of different sequences: 18,333
• Number of ligands: 53,834
• Number of different ligand molecules: 6,016
• Number of all atoms: 269,237,779
– Number of atoms in protein chains: 240,243,785
– Number of atoms in DNA/RNA chains: 7,709,842
– Number of atoms in ligands and ions: 2,479,339
– Number of atoms in water: 18,804,813

Distribution of elements in ligands and ions
[Figure: two pie charts. Organic elements: H, C, O, N, P, S, Other. Inorganic elements: MG, FE, CA, ZN, CL, NA, MN, F, K, CU, CD, W, I, BR, HG, X, CO, NI, Other.]
The distribution of the organic and the most frequent inorganic elements among the ligands and ions. We found 70 different elements.

Distribution of elements in protein chains
There were 17 different elements in the protein chains. The tables show the number of occurrences and, for the non-standard elements, the monomers that contain them.

Element   Number       %
H         120638461    50.22
C         75710684     31.51
O         22672185     9.44
N         20660541     8.60
S         540432       0.22
SE        20730        0.01

Element   Number   Monomers
P         499      MIS, CSP, PTR, LLP, SEP, TPO, CYQ, GPL, PAS, ASQ, NEP, SDP, LYX
F         116      EFC, FTR, YOF, BFD, LEF, 4FW, 4F3, MFC
AS        53       CAS, CAF, CZZ, CSR, CZ2
HG        48       CMH
I         13       TYI, PHI
BE        9        BFD
B         4        CLB, CLD, SBL, SBD
BR        4        DBY
CL        2        CLB, CLD
PB        2        CSB
V         2        SVA

Distribution of protein monomers

Monomer   %      %
LEU       8.81   8.77
ALA       8.09   8.25
GLY       7.58   7.66
VAL       6.97   7.08
GLU       6.50   6.57
SER       6.22   6.10
LYS       5.99   5.93
ASP       5.75   5.73
THR       5.72   5.71
ILE       5.44   5.56
ARG       4.97   4.95
PRO       4.68   4.65
ASN       4.40   4.34
PHE       3.91   3.90
GLN       3.81   3.73
TYR       3.52   3.47
HIS       2.44   2.46
MET       2.05   2.14
TRP       1.47   1.43
CYS       1.45   1.39
MSE       0.17   0.14

The table shows the distribution (in %) of the 20 natural amino acids and selenomethionine in the different chains and in all chains. The other non-standard monomers are listed below.
Monomers with their number of occurrences:
ACE 186, MLY 172, CGU 147, PCA 122, SEP 85, NH2 83, CME 76, PTR 55, KCX 48, CSD 48,
MLE 46, TPO 44, YOF 39, CEA 37, CAS 30, LLP 28, CSO 24, OCS 22, CSW 21, TYS 20,
ABA 19, CXM 18, CSS 16, DAL 16, CSX 15, TPQ 15, FME 15, MLZ 14, MVA 11, IIL 10,
SME 10, CSE 9, MHO 9, STY 9, NLE 8, M3L 8, SAR 8, SEB 7, BMT 7, MEN 7,
5HP 7, YCM 7, SCY 6, FTR 6, SAC 5, MIS 5, DLE 5, AYA 5, TRQ 5, IAS 4,
TRN 4, BFD 4, CMH 4, DSN 4, CSR 4, NEM 4, OMT 4, HIC 4, DAR 3, CYG 3,
ASI 3, ALY 3, HMR 3, ORN 3, SET 3, NEP 3, TYI 3, CAF 3, HTR 3, TA4 3,
SEC 2, DOH 2, CSB 2, DTR 2, DMT 2, STA 2, MME 2, DGL 2, ASQ 2, CSP 2,
BHD 2, CYM 2, NVA 2, MSA 2, CMT 2, DAH 2, 143 2, CZ2 2, TRO 2, LEF 2,
HSL 2, DCY 2, DVA 2, MSO 2, NIY 2, LYZ 2, CCS 2, CSZ 2, C5C 1, PAS 1

Further monomers, listed without counts:
AEI PAQ OSE SNC TBM DHN CR5 LLY EFC IML DBY 2MR SEG CYD GHG DMG LYX ASB DDE CYQ
MHL MCL MFC CLD GLZ PCC DHA DPN SVA TMD CSA S1H AHP AHB 4F3 SBD GPL TYQ CAY PHI
ARO LAL CLB BAL C6C DAS OAS 5CS MPT NPH DSE CY4 TRF SOC DHI TMB GLH CZZ 4HT DTY
EHP 3AH DHL MTY BUC MGY DAB PEC HLU MDO SBL GLQ TYY BCS 175 PYX MNV SDP TYN 4FW

Protein-Ligand interactions
[Figure: example labelled 10gs]

Conditions:
1   bond type = VDW
2a  no missing atom from protein
2b  <10% missing atoms from protein
3   no missing atom from ligand
4   protein size between 1000 and 10000
5   ligand size between 10 and 100

condition     interaction   entry   int. type
1             50988         12798   15289
1&4           45872         12072   14196
1&4&2b        20055         5752    6558
1&4&2b&5      13176         4660    4900
1&4&2b&3&5    10285         3655    3691
1&4&2a&3&5    6053          2193    2261

The table above shows the number of protein-ligand interactions, the number of entries they occur in, and the number of different interaction types as more and more conditions are met.

Distribution of missing atoms
[Figure: bar chart. x axis: number of missing atoms, in bins 0, 1-10, 11-100, 101-1000, 1001-10000, 10001-. y axis: number of PDB entries, 0 to 10000.]
The distribution of the number of missing atoms from protein chains in the PDB entries. Note that there are relatively few entries where only a few atoms are missing.
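
The numbered conditions on the protein-ligand interactions slide above can be combined mechanically, which is how a test set for a docking algorithm might be carved out. The sketch below is only an illustration in Python: the Interaction record, its field names, the "VDW"/"COV" labels and the toy records are hypothetical. It also assumes that sizes are measured in atoms, that the size bounds are inclusive, and that "<10%" is relative to the total atom count of the protein chain; the slides do not state any of these.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record for one protein-ligand interaction."""
    bond_type: str        # "VDW" or "COV" (labels are assumptions)
    protein_atoms: int    # total atoms of the protein chain (assumed unit)
    protein_missing: int  # protein atoms without coordinates
    ligand_atoms: int     # total atoms of the ligand (assumed unit)
    ligand_missing: int   # ligand atoms without coordinates


# Condition ids follow the slide; the exact thresholds' inclusiveness is assumed.
CONDITIONS = {
    "1":  lambda i: i.bond_type == "VDW",
    "2a": lambda i: i.protein_missing == 0,
    "2b": lambda i: i.protein_missing < 0.10 * i.protein_atoms,
    "3":  lambda i: i.ligand_missing == 0,
    "4":  lambda i: 1000 <= i.protein_atoms <= 10000,
    "5":  lambda i: 10 <= i.ligand_atoms <= 100,
}


def select(interactions, spec):
    """Keep the interactions meeting every condition in a spec like '1&4&2b'."""
    keys = spec.split("&")
    return [i for i in interactions if all(CONDITIONS[k](i) for k in keys)]


# Made-up records, not taken from the PDB.
TOY = [
    Interaction("VDW", 2500, 0, 40, 0),    # meets every condition
    Interaction("VDW", 2500, 100, 40, 0),  # 4% of the protein atoms missing
    Interaction("VDW", 500, 0, 40, 0),     # protein too small for condition 4
    Interaction("COV", 2500, 0, 40, 0),    # covalent, fails condition 1
    Interaction("VDW", 2500, 0, 5, 0),     # ligand too small for condition 5
]

if __name__ == "__main__":
    for spec in ("1", "1&4", "1&4&2b", "1&4&2b&5", "1&4&2a&3&5"):
        print(spec, len(select(TOY, spec)))
```

Each added condition can only shrink the selection, which matches the monotonically decreasing counts in the table.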

Distribution of missing segments
[Figure: three histograms of missing segment lengths. x axis: segment length, 1 to 40 amino acids. y axis: number of segments.]
The distribution of the lengths of missing chain segments at the beginning, in the middle and at the end of the chains. The length is measured in amino acids. Note that in the middle of the chain, typically 4-7 amino acids are missing.

Thank You!