Functional Annotation of Uncharacterized Human Proteins in the Protein Data Bank

By Swagata Naha Das

Guided By:
1) Dr. Shuchismita Dutta, Assistant Research Professor, Rutgers University, NJ; Senior Education Coordinator and Senior Biocurator, Research Collaboratory for Structural Bioinformatics, Center for Integrative Proteomics Research
2) Dr. Sona Vasudevan, Director, MS/MD Dual Master's Degree in Systems Medicine; Assistant Professor, Medical Education, Biochemistry and Molecular & Cellular Biology, Georgetown University, DC

Project: Functional Annotation, Georgetown University, DC, 2012

PROJECT TITLE: Functional Annotation of uncharacterized human proteins in Protein Data Bank (PDB)

INTRODUCTION:

On an evolutionary time scale, protein structure appears to be more conserved than protein sequence. Although sequence comparisons are commonly used for functional annotation, structural information can also provide insight into, or evidence about, the biological functions of proteins [Forouhar et al., 2007].

All experimentally determined structures of biological macromolecules are deposited in the Protein Data Bank (PDB, Berman et al., 2003). The PDB is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. The RCSB PDB supports a website where simple and complex queries on the data can be performed and the results can be analyzed and visualized [Deshpande et al., 2004].

Various genome sequencing projects and structural genomics initiatives have produced a tremendous amount of genomic and proteomic data. The structural genomics initiatives have determined protein structures using methods such as X-ray crystallography and solution Nuclear Magnetic Resonance (NMR). The PSI-Nature Structural Biology Knowledgebase (SBKB, http://www.sbkb.org) works to integrate structural biology and structural genomics resources into one site for easy navigation.
The knowledge gathered from this large body of information can be applied to the understanding of various biological systems and diseases. Because structural genomics and other initiatives determine structures in a high-throughput manner, a large number of structures are being produced. However, their analysis and functional annotation proceed at a slower pace because of the limitations of the standard methods of data analysis and assimilation. As a result, there are many proteins whose functions are still unknown.

The growing gap between structure determination and publications describing functional annotation has led to many new approaches that explore the structural information, analyze it and eventually use it for functional annotation. For example, Weekes et al., 2010 developed The Open Protein Structure Annotation Network (TOPSAN), a web-based platform that combines the openness of the wiki model with the quality control of scientific communication. TOPSAN combines features of automated annotation databases with those of the formal, peer-reviewed scientific research literature. It provides the opportunity to explore scientific research globally and collaboratively, so that this knowledge can be reviewed and validated by experts.

In another approach, Forouhar et al., 2007 reported several proteins whose structures were determined by X-ray crystallography. Careful structural analysis, serendipity and structure-guided activity screening provided valuable information about the functions of these proteins. Examples discussed in the study include a novel methyl salicylate esterase with an important role in plant innate immunity, and the identification of the protein YggJ as a novel RNA methyltransferase that methylates U1498 in the 16S ribosomal RNA.
As more structures are determined, various databases have emerged to store this structural information. However, in order to extract and use these data meaningfully for interpreting biological function, proper validation of the data sources is necessary [Mazumder and Vasudevan, 2008]. The structure-guided comparative analysis of proteins and the protocol for predicting protein function described in that paper are organized along a scale of percent sequence identity. The authors define a ten-step procedure that can be considered a general rule when annotating uncharacterized proteins. The paper insists that several layers of validation are needed before functional annotation is transferred from characterized proteins to uncharacterized proteins. The paper also lists the relevant tools and resources that can be used for this purpose.

The goal of this analysis is to identify and propose functional annotations for some of the uncharacterized human proteins in the PDB whose structures are already determined. Since there is no well-defined method for predicting probable biological functions, a thorough and systematic comparative analysis of the proteins at both the sequence and the structural level is needed. The uncharacterized human proteins were obtained from the Functional Sleuth of the SBKB and analyzed using various bioinformatics tools, databases and literature searches. One of the interesting aspects of this analysis is the use of different resources that help to validate the findings regarding the potential functions of a protein. If the majority of the resources provide the same information, the conclusion about the possible function of a protein can be made more reliably, based on the evidence available from the different resources.

METHODS:

A few steps were followed to obtain the set of structures to be analyzed for annotation. These steps are described in Figure 1.
Figure 1: Workflow of Overall Process.

Identification of the truly unknown Human structures

The Functional Sleuth list of the SBKB is supposed to contain the PDB IDs of all uncharacterized proteins in the PDB. The PDB ID is the unique 4-character identifier of every entry in the Protein Data Bank. The text file of structures with unknown functions (3119 PDB IDs as of June 17, 2012) was examined. A few PDB IDs were identified whose functions were already known. These annotated PDB IDs (approximately 32) were reported to the SBKB staff so that they could revise the logic for selecting these entries from the PDB. The rest of the PDB IDs (with no classification or annotation) were selected for further investigation. As this is a very large list, only structures from Homo sapiens were included in this project, with the goal of determining their probable functions.

As a first step in examining the availability of functional annotation for each protein, UniProt (UniProt consortium, 2012) was explored to determine the domains and their possible functions. UniProt is a comprehensive resource for protein sequence and annotation data. This analysis revealed that some structures had the same accession ID and domain ranges. These PDB entries were inferred to refer to the same structure.

Identification of the MISSED entries:

In the initial UniProt exploration it was observed that a few PDB IDs had related PDB IDs (same domain and same UNP ID) that were already annotated, while the functional annotation of the entries themselves had somehow been missed. They are termed 'MISSED' entries in Figure 1. This list of 'MISSED' PDB entries was reported to the PDB annotators so that the entries could be updated appropriately.
This further cleaned up the list of human protein structures with truly unknown functions. In order to identify the human structures with truly unknown functions, the MISSED entries had to be identified first. The MISSED entries and the truly unknown PDB IDs were sorted programmatically using a Python script (see Figure 2).

Figure 2: Finding the Missing entries (workflow of the program).

The script takes the human PDB IDs as input and queries the Protein Data Bank (using RESTful web services) to get the corresponding UniProt accession numbers. In the next step, all the corresponding PDB structures are retrieved from UniProt using these accession numbers. If there was functional annotation for any of the related structures having the same accession number and domain range, it was assumed that the particular uncharacterized structure (the query structure) had the same biological function, and it was therefore termed a MISSED entry. Here, 62 such cases were identified (Figure 3).

Figure 3: Snapshot of the MISSED (already annotated) human PDB IDs. FS-617M means a missed ("M") entry taken on June 17th ("617") from the Functional Sleuth ("FS").

There were also a few PDB IDs that had related uncharacterized PDB IDs with the same accession number and domain. These PDB IDs were inferred to refer to the same structures; 10 such PDB IDs were identified. Two structures were discarded because their UniProt identification or accession number was unavailable. Finally, 117 PDB IDs without any functional annotation were sorted out; these are the human structures that do not have any known function.
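The sorting logic described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original script: the function name `classify_entries`, the record layout and the toy PDB IDs and accessions are all hypothetical, and the web-service lookups against the PDB and UniProt are replaced by an in-memory dictionary. The rule that the first entry of an unannotated group is kept while later ones are treated as duplicates is also an assumption.

```python
from collections import defaultdict

# Hypothetical records: PDB ID -> (UniProt accession, domain range, annotation).
# In the real workflow these values would come from PDB and UniProt lookups.
RECORDS = {
    "0AA1": ("X00001", (10, 95), None),
    "0AA2": ("X00001", (10, 95), "protein kinase"),   # annotated relative
    "0BB1": ("X00002", (1, 80), None),
    "0BB2": ("X00002", (1, 80), None),                # same structure as 0BB1
    "0CC1": ("X00003", (5, 120), None),
    "0DD1": (None, None, None),                       # no UniProt accession
}


def classify_entries(query_ids, records):
    """Sort uncharacterized PDB IDs into missed, duplicate, no_accession, unknown."""
    # Group every known entry by (accession, domain range).
    groups = defaultdict(list)
    for pdb_id, (accession, domain, _annotation) in records.items():
        if accession is not None:
            groups[(accession, domain)].append(pdb_id)

    result = {"missed": [], "duplicate": [], "no_accession": [], "unknown": []}
    seen = set()
    for pdb_id in query_ids:
        accession, domain, _annotation = records[pdb_id]
        if accession is None:
            result["no_accession"].append(pdb_id)
            continue
        key = (accession, domain)
        related = [other for other in groups[key] if other != pdb_id]
        if any(records[other][2] for other in related):
            # A related structure is annotated, so this entry was MISSED.
            result["missed"].append(pdb_id)
        elif key in seen:
            # Same accession and domain as an earlier unannotated query entry.
            result["duplicate"].append(pdb_id)
        else:
            seen.add(key)
            result["unknown"].append(pdb_id)
    return result


if __name__ == "__main__":
    queries = ["0AA1", "0BB1", "0BB2", "0CC1", "0DD1"]
    for category, ids in classify_entries(queries, RECORDS).items():
        print(category, ids)
```

Keeping the lookup step separate from the classification step means the same function could be fed from live web services or from a cached file without changing the sorting rules.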
General approaches followed for prediction of function:

In the effort to predict the possible functions of the 117 PDB structures, a combination of 3 methods was used: (1) RCSB tools for exploring sequence and structure similarity clusters, (2) a 10-step method for structure-guided comparison of protein structures and (3) exploration of primary and related literature.

Tools from RCSB PDB:

The pre-calculated protein sequence and structure alignments at the RCSB Protein Data Bank (PDB) website were used (Prlić et al., 2010). There is a structure alignment web service that calculates pairwise alignments, and a stand-alone application that runs alignments locally and visualizes the results. These resources were also used during the analysis in this project.

Sequence clustering in the PDB is achieved by Blastclust. This algorithm clusters all protein chains of at least 20 amino acids at 100%, 95%, 90%, 70%, 50%, 40% and 30% sequence identity. At the higher percentages, proteins from the same or similar families can be identified; at the lower levels, structural neighbors are most likely to be present.

While examining the structure alignments, the top hits were evaluated based on the percent identity, the lowest P-value, and Coverage1 and Coverage2. Coverage1 is the percentage of residues covered in the query (chain 1) and Coverage2 is the percentage of residues covered in the matched protein (chain 2). The percent identity is the percentage of identical residues between two sequences in an alignment, and the P-value is the probability that an alignment with this score occurs by chance in a database of this size. The lower the P-value, the better the alignment. If one of the best hits in the sequence or structure comparisons had a functional annotation, the query structure was assumed to have the same or a similar function.
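To make the identity and coverage measures concrete, the following Python sketch computes them from a gapped pairwise alignment. It is an illustration only: the function name `alignment_stats`, the toy sequences and the exact definitions (identity over aligned, non-gap columns; coverage as aligned residues divided by the full chain length) are assumptions and may differ from the way the RCSB PDB tools compute these values.

```python
def alignment_stats(aligned_query, aligned_subject, query_len, subject_len, gap="-"):
    """Return (percent identity, Coverage1, Coverage2) for a gapped alignment.

    aligned_query / aligned_subject: equal-length strings with gap characters.
    query_len / subject_len: full lengths of chain 1 and chain 2.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")

    aligned_pairs = 0   # columns where both chains have a residue
    identical = 0       # aligned columns where the residues match
    for q, s in zip(aligned_query, aligned_subject):
        if q != gap and s != gap:
            aligned_pairs += 1
            if q == s:
                identical += 1

    identity = 100.0 * identical / aligned_pairs if aligned_pairs else 0.0
    coverage1 = 100.0 * aligned_pairs / query_len     # fraction of chain 1 aligned
    coverage2 = 100.0 * aligned_pairs / subject_len   # fraction of chain 2 aligned
    return identity, coverage1, coverage2


if __name__ == "__main__":
    # Toy alignment (hypothetical sequences, not taken from any PDB entry).
    identity, cov1, cov2 = alignment_stats("MKT-AYIAK", "MKSQAY-AR", 10, 8)
    print(f"identity={identity:.2f}% coverage1={cov1:.1f}% coverage2={cov2:.1f}%")
```

A hit with high identity but low coverage may match only a single domain of the query, which is why the two coverage values are considered together with the percent identity.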
The structural alignment (Jmol view) of the query protein (uncharacterized) and the subject protein (best hit) was also visualized to investigate the degree of structural similarity. In cases where ligands were present in the subject (annotated) protein, residues at the ligand-binding site were reviewed to see whether they were conserved in the query protein and could provide insight into the biological function of the query (uncharacterized) protein.

The RCSB PDB website provides abstracts and links to PubMed for the primary citations of PDB IDs. These were scanned for clues about the function of the protein. The RCSB PDB also provides SCOP and Pfam annotations and details, which are available under the annotation section of the protein under investigation. The Structural Classification of Proteins (SCOP) database describes the relationships of known protein structures in a detailed and comprehensive way. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and distant evolutionary relationships; the third, fold, describes geometrical relationships [Lo Conte et al., 2000]. Pfam, a domain database, is mostly used for its comprehensive coverage. It is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. Where available, these annotations were included programmatically while identifying the truly unknown human structures.

Resources from 10-Step structure-guided comparative analysis:

The 10-step structure-guided comparative analysis [Mazumder and Vasudevan, 2008] involves homology determination at both the full-length sequence and the 3D structural level, as well as the analysis of sequence and structural motifs, depending on the level of percent identity.
At higher levels of percent identity in the pairwise alignments from the BLAST result, if another structure is available, the query structure is considered likely to have the same overall function. At lower levels of similarity, structural neighbors are found. All of this was done using different resources (databases such as PIRSF, COG, Pfam and SCOP, and tools such as BLAST and Cn3D) to provide various levels of annotation validation.

In the current project all these resources were queried (by PDB ID or by the protein sequence obtained from the PDB structures) to obtain information about entries that matched at the sequence and/or structural level, irrespective of percent identity. Study of the primary citations of matched entries and of some related papers also provided information about biological functions. A short description of the resources used is provided in the following paragraphs.

PIRSF [Mazumder and Vasudevan, 2008] classifies UniProtKB sequences primarily into homeomorphic (end-to-end similarity) families and subfamilies based on their evolutionary relationships; domain-level superfamilies are also included. The PIRSF classification system is based on whole proteins rather than on component domains, so it allows annotation of generic biochemical and specific biological functions, as well as classification of proteins without well-defined domains.

COGs (prokaryotes) and KOGs (eukaryotes) [Mazumder and Vasudevan, 2008] consist of clusters of orthologous (and co-orthologous/inparalogous) proteins from completed genomes. Each COG includes orthologous proteins (i.e. proteins connected through vertical evolutionary descent). The identification of orthologous protein sets is based on automatic clustering of proteins from three or more distantly related organisms using reciprocal BLAST.
This is followed by additional automatic recruitment based on a rigorous BLAST-based algorithm, and by subsequent extensive manual curation of membership (including splitting full-length proteins and assigning them to different clusters if necessary) and annotation.

For a protein that lacks end-to-end sequence similarity, it is safe to evaluate its domain architecture. For proteins with low percent identity, examination of a protein's structural neighbors and fold comparisons can reveal distant evolutionary relationships. At the lowest levels of identity, analysis of sequence and structural motifs is very important for inferring function, as motifs are evolutionarily conserved and stable. Along with the other analyses, a CDD cluster analysis of the structures was also performed to check whether conserved domains are available for the structures.

Literature Study:

The PDB provides information about the primary citation of a structure, if there is one. Study of the primary citation and other related papers may yield important biological, structural and biochemical information that in turn can be used to predict the general function of a particular protein. In the absence of a primary citation, general annotation information for the protein was sometimes found in UniProt.

RESULTS AND DISCUSSIONS:

Figure 4: Analysis Workflow (input: PDB ID or FASTA; resources: RCSB, PIRSF, COG, Pfam, CDD, BLAST, Cn3D; output: evidence and possible functions)

The quick analysis based on the findings from the different resources was helpful for assessing the possible functional annotation of the uncharacterized proteins. A snapshot of the results and conclusions from this analysis is included in Figure 5. The complete analysis is included in Appendix 1.
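The idea of combining evidence from several resources can be sketched as a simple majority vote. The sketch below is hypothetical: the function name `consensus_function`, the 50% threshold and the example annotations are assumptions for illustration and do not reproduce the manual assessment used in this project.

```python
from collections import Counter


def consensus_function(evidence, min_fraction=0.5):
    """Return the annotation supported by a majority of resources, else None.

    evidence: dict mapping resource name -> annotation string, or None when
    the resource returned nothing for the query.
    """
    votes = Counter(annotation for annotation in evidence.values() if annotation)
    if not votes:
        return None  # inconclusive: no resource returned any information
    annotation, count = votes.most_common(1)[0]
    if count / sum(votes.values()) > min_fraction:
        return annotation
    return None      # inconclusive: the resources disagree


if __name__ == "__main__":
    # Hypothetical evidence table for one query structure.
    evidence = {
        "RCSB": "PDZ domain",
        "PIRSF": None,
        "COG": None,
        "Pfam": "PDZ domain",
        "CDD": "PDZ domain",
        "BLAST": "SH3 domain",
    }
    print(consensus_function(evidence))
```

Returning `None` in both the empty and the conflicting case mirrors the two kinds of inconclusive outcome: no information at all, or conflicting conclusions from different resources.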
There were a few inconclusive cases, either due to the absence of any information or due to conflicting conclusions from the various resources.

Figure 5: Snapshot of the Quick analysis of the human PDB IDs

In looking through the results of the quick analysis it was noticed that several structures matched a few specific conserved domains in the Conserved Domain Database (CDD), while some structures did not match any conserved domain. Three domains appeared more frequently: PDZ (9 times), FN3 (7 times) and SH3 (7 times). For further investigation the CD clusters for these domains were viewed in Cn3D; the corresponding figures (Figures 7, 8 and 9) are shown later in the study. The next few paragraphs provide some insight into their structure and general functions.

PDZ domain:

The PDZ domain is a structural domain that consists of 80 to 90 residues. It is common and is found in signaling proteins of various organisms such as bacteria, yeast, viruses, animals and plants. PDZ domains are protein-protein interaction domains; their N and C termini lie close together in the folded structure, which gives them a modular character. The PDZ domain has six beta strands and two alpha helices. This domain primarily recognizes specific motifs of about 5 residues located at the C terminus of the protein it binds to, or structurally related internal motifs.

FN3 domain:

The fibronectin type III (FN3) domain is a protein domain of about 100 amino acid residues. It is evolutionarily conserved and possesses a beta sandwich structure. Fibronectins bind to various substances including heparin, collagen, DNA, actin, fibrin and fibronectin receptors on cell surfaces, which suggests their role in various functions such as wound healing; cell adhesion; blood coagulation; cell differentiation and migration; maintenance of the cellular cytoskeleton; and tumour metastasis.
Figure 6: Structures of PDZ, FN3 and SH3 domains

SH3 domain:

The SRC Homology 3 (SH3) domain has about 60 amino acid residues and a beta-barrel fold, which consists of five or six β-strands arranged as two tightly packed anti-parallel β sheets. The linker regions may contain short helices. SH3 domains are present in proteins of signaling pathways. They also regulate the activity state of adaptor proteins and other tyrosine kinases, and are thought to increase the substrate specificity of some tyrosine kinases by binding far away from the active site of the kinase.

To understand the evolutionary relationships between homologous sequences, the sequences of PDB entries in the unknown function list were aligned to other sequences in the CD cluster and visualized using Cn3D. Cn3D is a visualization tool for biomolecular structures, sequences and sequence alignments. It can correlate structure and sequence information.

Figure 7 shows the cytokine receptor motif of the aligned structures of the FN3 domain (using Cn3D); the highlighted region shows the residues conserved throughout the structures of the FN3 domain. None of the 7 structures in the study had the signature WSXWS motif.

Figure 7: The Cytokine receptor motif of the aligned structures of FN3 domain (Using Cn3D)

Figure 8 shows the proline-rich binding site of the aligned structures of the SH3 domain (using Cn3D); the highlighted region shows the residues conserved throughout the structures of the SH3 domain. The majority of the structures matched the conserved acidic and hydrophobic signature sequence for this cluster.
Figure 8: The proline rich binding site of the aligned structures of SH3 domain (Using Cn3D)

Figure 9 shows the protein-binding site of the aligned structures of the PDZ domain (using Cn3D). The highlighted region shows the residues conserved throughout the structures of the PDZ domain. Only 1 of the 9 structures in this study has the typical GLGF motif in the protein-binding site.

Figure 9: The protein-binding site of the aligned structures of PDZ domain (Using Cn3D)

Since none of the FN3 domain structures in the current study matched the conserved sequence motif, this set of structures was explored in further detail using UCSF Chimera (Pettersen et al., 2004). Interestingly, 4 of the structures represent domains that belong to the same protein. Even though the topology and fold of the structures are the same, local changes in residues alter the overall shape and surface properties, which in turn may cause a difference in the function or behavior of these domains. These local differences at the residue level may differentiate the function of each structure within a particular domain or family, and so introduce the specificity of the structure.

For the structures studied here, the Serines at position 6 in Table 1 are buried in the beta sandwich and are conserved throughout all seven structures (Figure 10). It is assumed that this Serine is responsible for the stability of each structure and that any change at that position may disrupt the whole structure.
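The motif checks reported above for the FN3 domains (WSXWS, where X stands for any residue) and the PDZ domains (GLGF) were made by inspecting the Cn3D alignments. A programmatic equivalent could look like the sketch below, which is an assumption-laden illustration: the helper name `find_motif` and the toy sequences are hypothetical, and exact string matching is stricter than the alignment-based inspection used in this study.

```python
import re

# Signature motifs expressed as regular expressions.
MOTIFS = {
    "WSXWS": "WS.WS",   # "." matches any single residue (the X position)
    "GLGF": "GLGF",
}


def find_motif(sequence, pattern):
    """Return 1-based start positions of every (possibly overlapping) match."""
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() + 1 for m in re.finditer(f"(?={pattern})", sequence.upper())]


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not PDB sequences.
    examples = {
        "toy_fn3_with_motif": "AAWSEWSKK",
        "toy_fn3_without_motif": "AAWSEWTKK",
        "toy_pdz_with_motif": "PRGLGFSI",
    }
    for name, seq in examples.items():
        hits = {motif: find_motif(seq, rx) for motif, rx in MOTIFS.items()}
        print(name, hits)
```

An empty list for a motif means the signature is absent from the sequence as written; a degenerate or shifted motif would need a looser pattern or an alignment-based check.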
Figure 10: The conserved Serine residues (at position 6 in Table 1) in all seven structures of FN3 domain

Table 1: Conserved Residues

PDB ID    Positions 1-3    Positions 4-6
1WK0      IKG              TPS
1X3D      GTS              GFS
1X5X      GKS              NPS
1X4X      GAG              PFS
1UEM      GLS              DPS
1UJT      FQG              MDS
1WIS      GTS              PPS

Table 1 shows the 7 structures of the FN3 domain and the conserved residues taken from Figure 7.

The other group of Serines (at position 3 in Table 1) protrudes towards the surface and is sometimes replaced with Glycine. Since Glycine has no side chain, it is able to compensate for neighboring residues with a larger side chain, either from adjacent residues on the same beta strand (e.g. in protein structures 1X4X and 1UJT) or from a different part of the same protein (e.g. in protein structure 1WK0). For example, in 1WK0 a Serine from a different part of the protein occupies the space left free by the smaller Glycine (Figure 11).

Figure 11: Serine in 1WK0

Figure 12: 3 structures 1X5X, 1X3D, 1X4X have variable amino acids exposed to the surface (T in 1X3D, K in 1X5X, A in 1X4X)

For the residues located at position 2 in Table 1, the nature of the amino acids varies from basic (K) to hydrophobic (L) to hydrophilic (T, Q), which suggests that they interact with different binding partners depending on their nature. The three structures 1X5X, 1X3D and 1X4X in Figure 12 are from the same protein. However, because their surface residues differ in nature, they are likely to bind to different binding partners. It therefore appears that the variation of the amino acid at this location determines the specificity of the domain. To gain full knowledge of the specific functions, structures of the complexes are needed to show how these domains bind to their partner proteins.
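The per-position observations drawn from Table 1 can be tallied with a few lines of Python. The residues below are copied from Table 1; joining the two triplets of each row into one six-letter string and the helper name `column_counts` are choices made for this sketch only.

```python
from collections import Counter

# Residues at positions 1-6 for each FN3 structure, as listed in Table 1.
TABLE1 = {
    "1WK0": "IKGTPS",
    "1X3D": "GTSGFS",
    "1X5X": "GKSNPS",
    "1X4X": "GAGPFS",
    "1UEM": "GLSDPS",
    "1UJT": "FQGMDS",
    "1WIS": "GTSPPS",
}


def column_counts(rows):
    """Count the residues observed at each position across all structures."""
    length = len(next(iter(rows.values())))
    return [Counter(seq[i] for seq in rows.values()) for i in range(length)]


if __name__ == "__main__":
    counts = column_counts(TABLE1)
    for position, counter in enumerate(counts, start=1):
        status = "conserved" if len(counter) == 1 else "variable"
        print(position, dict(counter), status)
```

Only position 6 is occupied by a single residue type (Serine) in all seven structures, position 3 is split between Serine and Glycine, and position 2 shows the widest mix of residue types, consistent with the discussion above.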
CONCLUSION:

The goal of this analysis was to transfer functional annotation to the uncharacterized human structures. The study implies that the resources used may be a powerful source for inferring the biological activities of uncharacterized proteins. Transfer of annotation depends greatly on the availability of structures with known functions. It should be done with great caution, because any mistaken transfer propagates errors. A single resource may not be sufficient to predict function, and considering multiple sources may prove to be a very effective way of validating annotation at various levels. Although the study could not point to exact functional annotations, it gives direction by gathering potential evidence about the possible functions of the domain structures or of the whole protein. In summary, the methods applied in this research project attempted to provide the overall probable functions of the uncharacterized human proteins in the PDB, based on information from different sources combined in a simple approach that also covered the biological, structural and bioinformatics aspects of the study.

Resources Used

PDB: http://www.pdb.org
UniProt: http://www.uniprot.org
NCBI: http://www.ncbi.nlm.nih.gov
PIRSF: http://pir.georgetown.edu/pirsf/
COGs/KOGs: http://www.ncbi.nlm.nih.gov/COG/
SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
Cn3D/CDTree: http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml
Python: www.python.org, www.biopython.org

References:

1. "Structure-Guided Comparative Analysis of Proteins: Principles, Tools, and Applications for Predicting Function", Raja Mazumder, Sona Vasudevan, PLoS Comput Biol 4(9): e1000151, September 26, 2008, PMID: 18818720.

2. "TOPSAN: a collaborative annotation environment for structural genomics", Dana Weekes, S Sri Krishna, Constantina Bakolitsa, Ian A Wilson, Adam Godzik and John Wooley, BMC Bioinformatics 2010; 11: 426, August 17, 2010, PMID: 20716366.

3. "Functional insights from structural genomics", Farhad Forouhar, Alexandre Kuzin, Jayaraman Seetharaman, Insun Lee, Weihong Zhou, Mariam Abashidze, Yang Chen, Wei Yong, Haleema Janjua, Yingyi Fang, Dongyan Wang, Kellie Cunningham, Rong Xiao, Thomas B. Acton, Eran Pichersky, Daniel F. Klessig, Carl W. Porter, Gaetano T. Montelione, Liang Tong, J Struct Funct Genomics (2007) 8:37–44, June 23, 2007, PMID: 17588214.

4. "Pre-calculated protein structure alignments at the RCSB PDB website", Andreas Prlić, Spencer Bliven, Peter W. Rose, Wolfgang F. Bluhm, Chris Bizon, Adam Godzik, Philip E. Bourne, Bioinformatics (2010) 26: 2983-2985, PMID: 20937596.

5. "SCOP: a Structural Classification of Proteins database", Loredana Lo Conte, Bart Ailey, Tim J. P. Hubbard, Steven E. Brenner, Alexey G. Murzin and Cyrus Chothia, Nucleic Acids Res. 2000 January 1; 28(1): 257–259, PMCID: PMC102479.

6. "Announcing the worldwide Protein Data Bank", Berman H, Henrick K, Nakamura H, Nat Struct Biol. 2003 Dec; 10(12): 980, PMID: 14634627.

7. "The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema", Nita Deshpande, Kenneth J. Addess, Wolfgang F. Bluhm, Jeffrey C. Merino-Ott, Wayne Townsend-Merino, Qing Zhang, Charlie Knezevich, Lei Xie, Li Chen, Zukang Feng, Rachel Kramer Green, Judith L. Flippen-Anderson, John Westbrook, Helen M. Berman and Philip E. Bourne, Nucleic Acids Res. 2005 January 1; 33(Database Issue): D233–D237, published online 2004 December 17, doi: 10.1093/nar/gki057, PMCID: PMC540011.

8. "UCSF Chimera--a visualization system for exploratory research and analysis", Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE, J Comput Chem. 2004 Oct; 25(13): 1605-12, PMID: 15264254.
Appendix 1

1. Functional Annotation spreadsheet