Functional Annotation of Uncharacterized human proteins in Protein Data Bank

By Swagata Naha Das
Guided By:
1) Dr. Shuchismita Dutta, Assistant Research Professor, Rutgers University, NJ
Senior Education Coordinator and Senior Biocurator
Research Collaboratory for Structural Bioinformatics
Center for Integrative Proteomics Research
2) Dr. Sona Vasudevan, Director, Georgetown University, DC
MS/MD Dual Master's Degree in Systems Medicine,
Assistant Professor, Medical Education,
Biochemistry and Molecular & Cellular Biology.
Project: Functional Annotation, Georgetown University, DC | 2012
PROJECT TITLE: Functional Annotation of uncharacterized human proteins in Protein Data Bank (PDB)
INTRODUCTION:
On an evolutionary time scale, protein structure appears to be more conserved than protein
sequence. Although sequence comparisons are commonly used for functional annotation, structural
information can also provide insight into, or evidence about, a protein's biological function
[Forouhar et al., 2007]. Experimentally determined structures of biological macromolecules are
deposited in the Protein Data Bank (PDB; Berman et al., 2003). The PDB is the single worldwide
repository of information about the 3D structures of large biological molecules, including proteins
and nucleic acids. The RCSB PDB supports a website where simple and complex queries on the data
can be performed and the results can be analyzed and visualized [Deshpande et al., 2004].
With various genome sequencing projects and structural genomics initiatives, we now have
a tremendous amount of genomic and proteomic data. The structural genomics initiatives have
produced protein structures using methods such as X-ray crystallography and solution Nuclear
Magnetic Resonance (NMR). The PSI-Nature Structural Biology Knowledgebase
(SBKB, http://www.sbkb.org) integrates structural biology and structural genomics
resources into one site for easy navigation. The knowledge gathered from this large body of
information can be applied to the understanding of various biological systems and diseases.
Because structural genomics and other initiatives determine structures in a high-throughput
manner, a large number of structures are being produced. However, their analysis and functional
annotation proceed at a slower pace because of the limitations of the standard methods of data
analysis and assimilation. As a result, there are many proteins whose functions are still unknown.
The growing gap between structure determination and publications describing functional annotation
has led to many new approaches that explore the structural information, analyze it and eventually
use it for functional annotation. For example, Weekes et al., 2010 developed The Open Protein
Structure Annotation Network (TOPSAN), a web-based platform that combines the openness of the
wiki model with the quality control of scientific communication. TOPSAN has features of both
automated annotation databases and the formal, peer-reviewed scientific research literature. It
provides the opportunity to explore scientific research globally and collaboratively, so that this
knowledge can be reviewed and validated by experts. In another approach, Forouhar et al., 2007
reported several proteins whose structures were determined by X-ray crystallography. Careful
structural analysis, serendipity and structure-guided activity screening provided valuable
information about the functions of these proteins. Examples discussed in that study include a
novel methyl salicylate esterase with an important role in plant innate immunity, and the
identification of the protein YggJ as a novel RNA methyltransferase with a role in the methylation
of U1498 in the 16S ribosomal RNA.
As more structures are determined, various databases have emerged to store this structural
information. However, in order to extract and use these data meaningfully for
interpreting biological function, proper validation of the data sources is necessary [Mazumder and
Vasudevan, 2008]. The protocol for structure-guided comparative analysis and prediction of protein
function described in that paper is based on a scale of percent protein sequence identity. The
authors define a ten-step procedure that can be considered a general rule when annotating
uncharacterized proteins. The paper emphasizes that several layers of validation are needed to
transfer functional annotation from characterized proteins to uncharacterized proteins. It also
lists relevant tools and resources that can be used for this purpose.
The goal of this analysis is to identify and propose functional annotations for some of the
uncharacterized human proteins in the PDB whose structures have already been determined. Since
there is no well-defined method for predicting probable biological functions, a thorough and
systematic comparative analysis of the proteins is needed at both the sequence and the structural
level. The uncharacterized human proteins were obtained from the Functional Sleuth of the SBKB and
analyzed using various bioinformatics tools, databases and literature searches. One of the
interesting aspects of this analysis is the use of different resources to validate the
findings regarding the potential functions of a protein. If the majority of the resources provide
the same information, the conclusion about the possible function of a protein can be drawn more
reliably, based on the evidence available from the different resources.
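The majority rule described above can be illustrated with a minimal Python sketch. It is not the procedure used in the project, where the evidence was weighed manually; the function name, the 0.5 threshold and the example labels are assumptions made purely for illustration.

```python
from collections import Counter


def consensus_function(predictions, min_fraction=0.5):
    """Return the function supported by a majority of informative resources.

    predictions maps a resource name to the function it suggests, or to
    None when the resource gave no information. Returns None when no
    suggestion is backed by more than min_fraction of the informative
    resources.
    """
    votes = Counter(value for value in predictions.values() if value)
    if not votes:
        return None
    function, count = votes.most_common(1)[0]
    total = sum(votes.values())
    return function if count / total > min_fraction else None


# Hypothetical evidence for one query structure (illustrative labels only).
evidence = {
    "Pfam": "protein-protein interaction",
    "CDD": "protein-protein interaction",
    "SCOP": "protein-protein interaction",
    "PIRSF": None,
    "literature": "cell adhesion",
}
result = consensus_function(evidence)
```

Resources that return nothing are ignored rather than counted against a prediction, and a tie yields no conclusion, which mirrors the inconclusive cases that arise when resources disagree.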
METHODS:
The set of structures to be analyzed for annotation was obtained in a few steps, which are
summarized in Figure 1.
Figure 1: Workflow of Overall Process.
Identification of the truly unknown Human structures
The Functional Sleuth list of the SBKB is intended to contain the PDB IDs of all uncharacterized
proteins in the PDB. The PDB ID is the 4-character unique identifier of every entry in the Protein
Data Bank. The text file of structures with unknown functions (3119 PDB IDs as of June 17, 2012)
was examined. A few PDB IDs were identified whose functions were already known. These
annotated PDB IDs (approximately 32) were reported to the SBKB staff so that they could revise
the logic used to select these entries from the PDB.
The remaining PDB IDs (with no classification or annotation) were selected for further
investigation. As this is a very large list, only structures from Homo sapiens were included in
this project, with the goal of determining their probable functions.
As a first step in examining the availability of functional annotation for each protein, UniProt
(UniProt consortium, 2012) was explored to determine the domains and their possible functions.
UniProt is a comprehensive resource for protein sequence and annotation data. This analysis
revealed that some structures shared the same accession ID and domain range. These PDB
entries were inferred to refer to the same structure.
Identification of the MISSED entries:
In the initial UniProt exploration it was observed that a few PDB IDs had related PDB IDs (same
domain and same UNP ID) that were already annotated, but the functional annotation of these
entries had somehow been missed. They are termed ‘MISSED’ entries in Figure 1. The list of
‘MISSED’ PDB entries was reported to the PDB annotators so that the entries could be updated
appropriately. This further cleaned up the list of human protein structures with truly unknown
functions.
In order to identify the human structures with truly unknown functions, the MISSED entries
had to be identified first. The MISSED entries and the truly unknown PDB IDs were sorted
programmatically using a Python script (see Figure 2).
Figure 2: Finding the Missing entries – Workflow of the Program
The script takes the human PDB IDs as input and queries the Protein Data Bank (using RESTful
web services) to get the corresponding UniProt accession numbers. In the next step, all PDB
structures associated with these accession numbers are retrieved from UniProt.
If any related structure with the same accession number and domain range had a functional
annotation, the uncharacterized structure (the query structure) was assumed to have the same
biological function and was termed a MISSED entry. Here, 62 such cases were identified (Figure 3).
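The sorting rule just described can be sketched in Python as follows. This is not the project's script: the web-service lookups are not reproduced, and the sketch assumes the accession number, domain range and annotation of every structure have already been collected into a dictionary. The record layout, the function name and the placeholder identifiers are all hypothetical.

```python
from collections import defaultdict


def classify_entries(query_ids, entries):
    """Split query PDB IDs into MISSED and truly unknown entries.

    entries maps a PDB ID to a dict with the keys 'accession' (UniProt
    accession), 'domain' (a (start, end) residue range) and 'function'
    (annotation text, or None when the structure is uncharacterized).
    """
    # Group every structure by its (accession, domain range) pair.
    groups = defaultdict(list)
    for pdb_id, record in entries.items():
        groups[(record["accession"], record["domain"])].append(pdb_id)

    missed, unknown = {}, []
    for pdb_id in query_ids:
        record = entries[pdb_id]
        relatives = groups[(record["accession"], record["domain"])]
        annotated = [r for r in relatives
                     if r != pdb_id and entries[r]["function"]]
        if annotated:
            # MISSED: borrow the annotation of the first annotated relative.
            missed[pdb_id] = entries[annotated[0]]["function"]
        else:
            unknown.append(pdb_id)
    return missed, unknown


# Placeholder identifiers and annotations, not real PDB or UniProt records.
example_entries = {
    "QRY1": {"accession": "ACC1", "domain": (10, 95), "function": None},
    "REL1": {"accession": "ACC1", "domain": (10, 95), "function": "function A"},
    "QRY2": {"accession": "ACC2", "domain": (1, 80), "function": None},
    "REL2": {"accession": "ACC2", "domain": (100, 180), "function": "function B"},
}
missed, unknown = classify_entries(["QRY1", "QRY2"], example_entries)
```

In this toy input QRY1 shares both accession and domain range with an annotated relative and is therefore MISSED, while QRY2 matches its relative only by accession and stays in the truly unknown set.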
Figure 3: Snapshot of the MISSED (already annotated) human PDB IDs. FS-617M means missed
(“M”) entry taken on June 17th (“617”) from functional sleuth (“FS”)
There were also a few PDB IDs that had related uncharacterized PDB IDs with the same
accession number and domain. It was inferred that these PDB IDs refer to the same structures;
10 such PDB IDs were identified. Two structures were discarded because their UniProt
identifier or accession number was unavailable. Finally, 117 PDB IDs without any functional
annotation were sorted out; these are the human structures that do not have any known function.
General approaches followed for prediction of function:
In the effort to predict the possible functions of the 117 PDB structures, a combination of three
methods was used: (1) RCSB tools for exploring sequence and structure similarity clusters, (2) a
10-step method for structure-guided comparison of protein structures and (3) exploration of
primary and related literature.
Tools from RCSB PDB:
The pre-calculated protein sequence and structure alignments at the RCSB Protein Data
Bank (PDB) website were used (Prlić et al., 2010). There is a structure alignment web service that
calculates pairwise alignments and a stand-alone application that runs alignments locally
and visualizes the results. These resources were also used during the analysis in this project.
The sequence clustering in the PDB is performed with Blastclust. This algorithm clusters all
protein chains of at least 20 amino acids at 100%, 95%, 90%, 70%, 50%, 40% and 30% sequence
identity. At the higher percentages, proteins from the same or similar families can be identified;
at the lower levels, structural neighbors are most likely to be present.
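As a rough illustration of how these cut-offs nest, the Python sketch below lists the levels at which two chains would meet the threshold, given their pairwise percent identity. It is a deliberate simplification and not Blastclust itself: real clustering also groups chains through intermediate members, and the function name and pairwise rule are assumptions.

```python
# Cut-offs and minimum chain length as stated in the text above.
CLUSTER_LEVELS = (100, 95, 90, 70, 50, 40, 30)
MIN_CHAIN_LENGTH = 20


def shared_cluster_levels(percent_identity, len_a, len_b):
    """Return the cut-offs that a pair of chains satisfies directly."""
    if min(len_a, len_b) < MIN_CHAIN_LENGTH:
        return []  # chains shorter than 20 residues are not clustered
    return [level for level in CLUSTER_LEVELS if percent_identity >= level]


levels = shared_cluster_levels(92.0, 120, 118)
```

A pair at 92% identity satisfies every cut-off from 90% downwards, so a hit found in a high-identity cluster is also present in all looser clusters.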
While examining the structure alignments, the top hits were evaluated based on the percent
identity, the lowest P-value, and Coverage1 and Coverage2. Coverage1 is the percentage of residues
covered in the query (chain1) and Coverage2 is the percentage of residues covered in the matched
protein (chain2). The percent identity is the percentage of identical residues between two
sequences in an alignment, and the P-value is the probability that an alignment with this score
occurs by chance in a database of this size. The lower the P-value, the better the alignment. If
one of the best hits in the sequence or structure comparisons had a functional annotation, the
query structure was assumed to have the same or a similar function. The structural alignment
(Jmol view) of the query protein (uncharacterized) and the subject protein (best hit) was also
visualized to investigate the degree of structural similarity. In cases where ligands were present
in the subject (annotated) protein, residues at the ligand-binding site were reviewed to see if
they were conserved in the query protein and could provide insight about the biological functions
of the query (uncharacterized) protein.
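The quantities used to judge a hit can be made concrete with a small Python sketch. It is not the RCSB implementation: it assumes the two aligned strings contain the full chains with '-' for gaps, it computes identity over the aligned (non-gap) positions, which is only one of several conventions, and the function names, toy sequences and hit records are hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Return (percent identity, Coverage1, Coverage2) for a gapped alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    query_len = len(aligned_query.replace("-", ""))
    subject_len = len(aligned_subject.replace("-", ""))
    # Keep only the columns where both chains contribute a residue.
    paired = [(q, s) for q, s in zip(aligned_query, aligned_subject)
              if q != "-" and s != "-"]
    if not paired:
        return 0.0, 0.0, 0.0
    identical = sum(q == s for q, s in paired)
    identity = 100.0 * identical / len(paired)
    coverage1 = 100.0 * len(paired) / query_len
    coverage2 = 100.0 * len(paired) / subject_len
    return identity, coverage1, coverage2


def rank_hits(hits):
    """Sort hits by lowest P-value, then by highest identity and coverage."""
    return sorted(
        hits,
        key=lambda h: (h["p_value"], -h["identity"],
                       -min(h["coverage1"], h["coverage2"])),
    )


# Toy alignment: 9 aligned columns, 8 of them identical.
identity, cov1, cov2 = alignment_stats("MKT-AYIAKQR", "MKTWAYLA-QR")

# Hypothetical hits; the IDs and scores are placeholders, not real results.
hits = [
    {"id": "HIT_A", "p_value": 1e-3, "identity": 40.0, "coverage1": 90.0, "coverage2": 80.0},
    {"id": "HIT_B", "p_value": 1e-8, "identity": 35.0, "coverage1": 95.0, "coverage2": 92.0},
]
best = rank_hits(hits)[0]["id"]
```

Sorting on the P-value first reflects the statement that a lower P-value means a better alignment; identity and coverage only break ties here, which is a design choice of this sketch rather than a rule from the source.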
The RCSB PDB website provides abstracts and links to PubMed for the primary citations of PDB IDs.
These were scanned for clues about the function of the protein. The RCSB PDB also provides SCOP
and Pfam annotations and details, which are available under the annotation section of the protein
under investigation. The Structural Classification of Proteins (SCOP) database describes the
relationships of known protein structures in a detailed and comprehensive way. The classification
is hierarchical: the first two levels, family and superfamily, describe near and distant
evolutionary relationships; the third, fold, describes geometrical relationships [Conte et al.,
2000]. Pfam is a domain database that is widely used for its comprehensive coverage. It is a
database of protein families that includes their annotations and multiple sequence alignments
generated using hidden Markov models. Where available, these annotations were included
programmatically while identifying the truly unknown human structures.
Resources from 10-Step structure-guided comparative analysis:
The 10-step structure-guided comparative analysis [Mazumder and Vasudevan, 2008] involves
homology determination at both the full-length sequence and the 3D structural level, and also the
analysis of sequence and structural motifs, depending on the level of percent identity. At the
higher levels of percent identity in the pairwise alignments from the BLAST result, if another
structure is available, the query structure is considered likely to share its overall function.
At the lower levels of similarity, structural neighbors are found. All of this was done using
different resources (databases such as PIRSF, COG, Pfam and SCOP, and tools such as BLAST and
Cn3D) to provide various levels of annotation validation. In the current project all these
resources were queried (by PDB ID or by the protein sequence obtained from the PDB structure) to
obtain information about entries that matched at the sequence and/or structural level,
irrespective of percent identity. Study of the primary citations of matched entries and of some
related papers also provided information about the biological functions.
The resources used are described briefly in the following paragraphs.
The PIRSF [Mazumder and Vasudevan, 2008] classifies UniProtKB sequences primarily
into homeomorphic (end-to-end similarity) families and subfamilies
(domain-level superfamilies are also included) based on their evolutionary relationships. The PIRSF
classification system is based on whole proteins rather than on component domains, so it allows
annotation of generic biochemical and specific biological functions, as well as classification of
proteins without well-defined domains.
COGs (prokaryotes) and KOGs (eukaryotes) [Mazumder and Vasudevan, 2008] consist of
clusters of orthologous (and co-orthologous/inparalogous) proteins from completed genomes. Each
COG includes orthologous proteins (i.e. proteins connected through vertical evolutionary descent).
The identification of orthologous protein sets is based on automatic clustering of proteins from
three or more distantly related organisms based on reciprocal BLAST. This is followed by additional
automatic recruitment based on a rigorous BLAST-based algorithm, and subsequent extensive
manual curation of membership (including splitting of full-length proteins and assigning them to
different clusters if necessary) and annotation. For a protein that does not have end-to-end
sequence similarity, it is safer to evaluate its domain architecture.
For proteins with low percent identity, examination of a protein’s structural neighbors
and fold comparisons can reveal distant evolutionary relationships. At very low levels of
identity, analysis of sequence and structural motifs is very important for inferring function, as
these motifs are evolutionarily conserved and stable.
Along with the other analyses, a CDD cluster analysis of the structures was also performed to
check whether a conserved domain is available for each structure.
Literature Study:
The PDB provides information about the primary citation of a structure, if there is one. Study of
the primary citation and other related papers may help to obtain important biological, structural
and biochemical information that in turn can be used to predict the general function of a
particular protein. In the absence of a primary citation, general annotation information for the
protein was sometimes found in UniProt.
RESULTS AND DISCUSSIONS:
Figure 4: Analysis Workflow (figure labels: PDB ID or FASTA; RCSB, PIRSF, COG, PFAM, CDD, BLAST, Cn3D; Evidence; Possible Functions)
The quick analysis based on the findings from the different resources was helpful for assessing
the possible functional annotation of the uncharacterized proteins. A snapshot of the results and
conclusions from this analysis is included in Figure 5. The complete analysis is included in
Appendix 1. There were a few inconclusive cases, due either to the absence of any information or
to conflicting conclusions from the various resources.
Figure 5: Snapshot of the Quick analysis of the human PDB IDs
In looking through the results of the quick analysis, it was noticed that several
structures matched a few specific conserved domains in the Conserved Domain Database
(CDD), while some structures did not match any conserved domain. Three domains appeared
more frequently: PDZ (9 times), FN3 (7 times) and SH3 (7 times). For further investigation, the CD
clusters for these domains were viewed in Cn3D; Figures 7, 8 and 9 are shown later in the study.
The next few paragraphs provide some insight into their structure and general functions.
PDZ domain:
The PDZ domain is a structural domain of 80 to 90 residues. It is common
and found in signaling proteins of various organisms including bacteria, yeast, viruses, animals
and plants. PDZ domains are protein-protein interaction domains; their C and N termini are
found to be very close together in the folded domain, which gives them a modular structure. The
PDZ domain has six beta strands and two alpha helices. This domain primarily recognizes specific
motifs of about 5 residues located at the C terminus of the protein it binds to, or structurally
related internal motifs.
FN3 domain:
The fibronectin type III or FN3 domain is a protein domain of about 100 amino acid
residues. It is evolutionarily conserved and possesses a beta sandwich structure. Fibronectins
bind to various substances including heparin, collagen, DNA, actin, fibrin and fibronectin
receptors on cell surfaces, which suggests their role in various functions such as wound healing,
cell adhesion, blood coagulation, cell differentiation and migration, maintenance of the cellular
cytoskeleton, and tumour metastasis.
Figure 6: Structures of PDZ, FN3, SH3 domains (panels labeled PDZ, FN3 and SH3)
SH3 domain:
The SRC Homology 3 or SH3 domain has about 60 amino acid residues and a
beta-barrel fold, which consists of five or six β-strands arranged as two tightly packed
anti-parallel β-sheets. The linker regions may contain short helices.
SH3 domains are present in proteins of signaling pathways, where they regulate the activity state
of adaptor proteins and other tyrosine kinases, and are thought to increase the substrate
specificity of some tyrosine kinases by binding far away from the active site of the kinase.
To understand the evolutionary relationships between homologous sequences, the sequences of
PDB entries in the unknown function list were aligned to other sequences in the CD cluster and
visualized using Cn3D. Cn3D is a visualization tool for biomolecular structures, sequences, and
sequence alignments. It can correlate the structure and sequence information.
Figure 7 shows the cytokine receptor motif of the aligned structures of the FN3 domain (using
Cn3D); the highlighted region shows the residues conserved throughout the structures of the
FN3 domain. None of the 7 structures in the study had the signature WSXWS motif.
Figure 7: The Cytokine receptor motif of the aligned structures of FN3 domain (Using Cn3D)
Figure 8 shows the proline-rich binding site of the aligned structures of the SH3 domain (using
Cn3D); the highlighted region shows the residues conserved throughout the structures of the
SH3 domain. The majority of the structures matched the conserved acidic and hydrophobic signature
sequence for this cluster.
Figure 8: The proline rich binding site of the aligned structures of SH3 domain (Using Cn3D)
Figure 9 shows the protein-binding site of the aligned structures of the PDZ domain (using Cn3D).
The highlighted region shows the residues conserved throughout the structures of the PDZ
domain. Only 1 out of the 9 structures in this study has the typical GLGF motif in the
protein-binding site.
Figure 9: The protein-binding site of the aligned structures of PDZ domain (Using Cn3D)
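In this study the motifs were inspected visually in Cn3D. As a complementary illustration, the Python sketch below scans a sequence for the two signature motifs named above, WSXWS and GLGF. Exact string matching, the function name and the toy sequences are assumptions; real domains may carry variants that an exact pattern would miss.

```python
import re

# 'X' in WSXWS stands for any residue, so it becomes '.' in the pattern.
MOTIFS = {
    "WSXWS": re.compile(r"WS.WS"),
    "GLGF": re.compile(r"GLGF"),
}


def find_motifs(sequence):
    """Return {motif name: [1-based start positions]} for a protein sequence."""
    seq = sequence.upper()
    return {
        name: [match.start() + 1 for match in pattern.finditer(seq)]
        for name, pattern in MOTIFS.items()
    }


# Toy sequences made up for illustration; they are not from the PDB entries.
with_wsxws = find_motifs("AAWSEWSAA")
with_glgf = find_motifs("ttglgftt")
without_any = find_motifs("MKTAYIAKQR")
```

A sequence with empty lists for both motifs corresponds to the situation reported for the FN3 set, where none of the structures carried the signature motif.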
Since none of the FN3 domain structures in the current study matched the conserved
sequence motif, this set of structures was explored in further detail using UCSF Chimera (Pettersen
et al., 2004). Interestingly, 4 of the structures represent domains that belong to the same protein.
Even though the topology and folds of the structures are the same, local changes in residues
alter the overall shape and surface properties, which in turn may cause a difference in the
function or behavior of these domains. These local differences at the residue level may
differentiate the function of each structure within a particular domain or family, and so
introduce the specificity of the structure.
For the structures studied here, the Serines at position 6 in Table 1 are buried in the beta
sandwich and are conserved in all seven structures (Figure 10). It is assumed
that this Serine is responsible for the stability of each structure and that any change at
that position may disrupt the whole structure.
Figure 10: The conserved Serine residues (at position 6 in Table 1) in all seven structures of
the FN3 domain
Table 1: The 7 structures of the FN3 domain and the conserved residues (highlighted) taken
from Figure 7

PDB ID    Conserved residues
          1  2  3    4  5  6
1WK0      I  K  G    T  P  S
1X3D      G  T  S    G  F  S
1X5X      G  K  S    N  P  S
1X4X      G  A  G    P  F  S
1UEM      G  L  S    D  P  S
1UJT      F  Q  G    M  D  S
1WIS      G  T  S    P  P  S
The other group of Serines (at position 3 in Table 1), which protrude towards the surface, is
sometimes replaced with Glycine. Since Glycine has no side chain, it can make room for a
neighboring residue with a larger side chain, either from an adjacent residue on the same beta
strand (e.g. in protein structures 1X4X, 1UJT) or from a different part of the same protein
(e.g. in protein structure 1WK0). For example, in 1WK0 a Serine occupies the space that the
smaller Glycine residue leaves free (Figure 11).
Figure 11: Serine in 1WK0
Figure 12: 3 structures 1X5X, 1X3D, 1X4X have variable amino acids exposed to the surface (panel labels: T in 1X3D, K in 1X5X, A in 1X4X)
The residues located just before the Serine (that is, at position 2 in Table 1) vary in nature
from basic (K) to hydrophobic (L) to hydrophilic (T, Q), which suggests that they interact with
different binding partners depending on their nature.
The three structures 1X5X, 1X3D and 1X4X in Figure 12 are from the same protein. However, because
their surface residues differ in nature, they are likely to bind to different binding partners.
It therefore appears that the variation of the amino acid at this location determines the
specificity of the domain.
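The observations drawn from Table 1 can be checked with a few lines of Python. The residues below are transcribed from Table 1; the helper names are hypothetical and the sketch only tabulates the columns, it makes no structural prediction.

```python
# Residues at positions 1-6 for each structure, transcribed from Table 1.
TABLE_1 = {
    "1WK0": "IKGTPS",
    "1X3D": "GTSGFS",
    "1X5X": "GKSNPS",
    "1X4X": "GAGPFS",
    "1UEM": "GLSDPS",
    "1UJT": "FQGMDS",
    "1WIS": "GTSPPS",
}


def residues_at(position):
    """Map each PDB ID to its residue at a 1-based Table 1 position."""
    return {pdb_id: row[position - 1] for pdb_id, row in TABLE_1.items()}


def is_invariant(position):
    """True when every structure has the same residue at the position."""
    return len(set(residues_at(position).values())) == 1


buried_serine_conserved = is_invariant(6)
glycine_at_3 = sorted(pdb_id for pdb_id, res in residues_at(3).items() if res == "G")
variable_position_2 = residues_at(2)
```

Position 6 is Serine in all seven rows, position 3 is either Serine or Glycine (Glycine in 1UJT, 1WK0 and 1X4X), and position 2 differs even among 1X3D (T), 1X5X (K) and 1X4X (A), which belong to the same protein.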
To gain full knowledge of the specific functions, structures of complexes are needed to
show how these domains bind to their partner proteins.
CONCLUSION:
The goal of this analysis was to transfer functional annotation to the human structures. The study
of the human structures suggests that the resources used may be a powerful source for inferring
the biological activities of uncharacterized proteins. Transfer of annotation depends greatly on
the presence of structures with known functions. It should be done with great caution, since any
mistaken transfer increases the chance of further errors.
A single resource may not be sufficient to predict function, and considering multiple sources
may prove to be a very effective way of validating at various levels. Although the study could not
pinpoint exact functional annotations, it points in a direction where there is potential
evidence about the possible functions of the domain structures or the whole protein. In
summary, the methods applied in this research project attempted to provide the overall probable
functions of the uncharacterized human proteins in the PDB based on information from different
sources in a combined and simple approach, which also covered the biological, structural and
bioinformatics aspects of the study.
Resources Used
PDB            http://www.pdb.org
UniProt        http://www.uniprot.org
NCBI           http://www.ncbi.nlm.nih.gov
PIRSF          http://pir.georgetown.edu/pirsf/
COGs/KOGs      http://www.ncbi.nlm.nih.gov/COG/
SCOP           http://scop.mrc-lmb.cam.ac.uk/scop/
Cn3D/CDTree    http://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml
Python         www.python.org, www.biopython.org
References:
1. “Structure-Guided Comparative Analysis of Proteins: Principles, Tools, and Applications for
Predicting Function”, Raja Mazumder, Sona Vasudevan, PLoS Comput Biol 4(9): e1000151,
September 26, 2008, PMID: 18818720.
2. “TOPSAN: a collaborative annotation environment for structural genomics”, Dana Weekes, S Sri
Krishna, Constantina Bakolitsa, Ian A Wilson, Adam Godzik and John Wooley, BMC Bioinformatics
2010; 11: 426, August 17, 2010, PMID: 20716366.
3. “Functional insights from structural genomics”, Farhad Forouhar, Alexandre Kuzin, Jayaraman
Seetharaman, Insun Lee, Weihong Zhou, Mariam Abashidze, Yang Chen, Wei Yong, Haleema
Janjua, Yingyi Fang, Dongyan Wang, Kellie Cunningham, Rong Xiao, Thomas B. Acton, Eran
Pichersky, Daniel F. Klessig, Carl W. Porter, Gaetano T. Montelione, Liang Tong, J Struct Funct
Genomics (2007) 8:37–44, June 23, 2007, PMID: 17588214.
4. “Pre-calculated protein structure alignments at the RCSB PDB website”, Andreas Prlić, Spencer
Bliven, Peter W. Rose, Wolfgang F. Bluhm, Chris Bizon, Adam Godzik, Philip E. Bourne,
Bioinformatics (2010) 26: 2983-2985, PMID: 20937596.
5. “SCOP: a Structural Classification of Proteins database”, Loredana Lo Conte, Bart Ailey, Tim
J. P. Hubbard, Steven E. Brenner, Alexey G. Murzin, and Cyrus Chothia, Nucleic Acids Res. 2000
January 1; 28(1): 257–259, PMCID: PMC102479.
6. “Announcing the worldwide Protein Data Bank”, Berman H, Henrick K, Nakamura H, Nat Struct
Biol. 2003 Dec;10(12):980, PMID: 14634627.
7. “The RCSB protein Data Bank: a redesigned query system and relational database based on the
mmCIF schema”, Nita Deshpande, Kenneth J. Addess, Wolfgang F. Bluhm, Jeffrey C. Merino-Ott,
Wayne Townsend-Merino, Qing Zhang, Charlie Knezevich, Lie Xie, Li Chen, Zukang Feng, Rachel
Kramer Green, Judith L. Flippen-Anderson, John Westbrook, Helen M. Berman, and Philip E. Bourne,
Nucleic Acids Res. 2005 January 1; 33(Database Issue): D233–D237, published online 2004
December 17, doi: 10.1093/nar/gki057, PMCID: PMC540011.
8. “UCSF Chimera--a visualization system for exploratory research and analysis”, Pettersen EF,
Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE, J Comput Chem. 2004
Oct;25(13):1605-12, PMID: 15264254.
Appendix 1
1. Functional Annotation spread sheet