PREDICTING PROTEIN FUNCTION BASED ON 3D RESIDUE
MOTIF SEARCH: A LINEAR PROGRAMMING APPROACH
LAC Fonseca¹, RC de Melo-Minardi¹, DEV Pires¹,², F Ferré¹,², W Meira Jr.¹ and MM Santoro²
¹ Department of Computer Science - Universidade Federal de Minas Gerais
² Department of Biochemistry and Immunology - Universidade Federal de Minas Gerais
According to the Pfam database, about 20% of protein domain families still have no known
function, and this proportion is growing as genome and metagenome projects produce huge
quantities of biological sequences. In this scenario, computational methods to predict the
function of proteins in newly sequenced genomes are very important, since experimental
methods to characterize protein function are expensive and labor-intensive.
In this work we use a linear programming approach to predict protein function (and especially
enzyme function) based on homology of 3D residue motifs (or active sites). We compare the
proposed method to Pints, which is the only competing software with binaries available.
We model the residues of the query 3D motif as points, each represented by the last heavy
atom (LHA) of its side chain. We generate a clique in which each edge is labeled with the
distance between its two endpoint vertices. We then match query edges to search space edges so as
to minimize the sum of the differences between the distances on matched edges.
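The graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_clique`, the residue labels, and the coordinates are all hypothetical, and each point is assumed to already be the LHA of a residue side chain.

```python
import itertools
import math


def build_clique(points):
    """Return {(label_a, label_b): distance} for every pair of labeled 3D points.

    `points` maps a residue label to the (x, y, z) coordinates of its
    last heavy atom (LHA). Labels are sorted so each edge key is stable.
    """
    edges = {}
    for (label_a, xyz_a), (label_b, xyz_b) in itertools.combinations(
        sorted(points.items()), 2
    ):
        # Each edge of the clique is labeled with the Euclidean distance
        # between its two endpoint vertices.
        edges[(label_a, label_b)] = math.dist(xyz_a, xyz_b)
    return edges


# Hypothetical LHA coordinates (in angstroms) for a three-residue motif.
query_motif = {
    "SER195": (0.0, 0.0, 0.0),
    "HIS57": (3.0, 0.0, 0.0),
    "ASP102": (3.0, 4.0, 0.0),
}
query_edges = build_clique(query_motif)
```

A motif with n residues yields n(n-1)/2 labeled edges; the same function could be applied to candidate residues in the search space to obtain the edges to be matched.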
We optimize this subject to two constraints: (1) every edge of the query graph must be
matched with an edge of the search space graph, and (2) each edge of the search space graph must
be matched to at most one edge of the query graph.
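One way to write the two constraints down as a linear program is sketched below. The notation is an assumption introduced here for illustration: E_Q and E_S are the edge sets of the query and search space graphs, d_e is the distance labeling edge e, and x_{ef} indicates that query edge e is matched to search space edge f. The cost |d_e - d_f| is one plausible reading of the objective, not a formula stated in the abstract.

```latex
\begin{aligned}
\min_{x} \quad & \sum_{e \in E_Q} \sum_{f \in E_S} \lvert d_e - d_f \rvert \, x_{ef} \\
\text{s.t.} \quad & \sum_{f \in E_S} x_{ef} = 1 && \forall e \in E_Q
    && \text{(1) every query edge is matched} \\
& \sum_{e \in E_Q} x_{ef} \le 1 && \forall f \in E_S
    && \text{(2) each search space edge is used at most once} \\
& 0 \le x_{ef} \le 1 && \forall e \in E_Q,\ f \in E_S
\end{aligned}
```

With only these two constraint families the problem has the structure of a bipartite assignment, so the linear relaxation has an integral optimal solution and no explicit integrality constraint is needed in this sketch.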
Our goal is to predict function or to classify proteins in terms of their families based on
homology of hypothetical active sites. We built a dataset with known active sites to test whether
our algorithm was able to recall proteins with the same function based on these known active sites.
Thus we have two main segments in the dataset. The first is a database of proteins which
have a SCOP family assignment and the second is a database of about 1,827 randomly
chosen PDB chains. To build our families dataset, we used the Catalytic Site Atlas (CSA), a
database that documents enzyme active sites and catalytic residues in enzyme 3D structures.
Since SCOP family is related to function, we tried to retrieve active sites in proteins from the
same SCOP family as the original active sites from CSA.
We compared our method with the Pints algorithm and the results show that our method
performs better than Pints on metrics such as AUC and accuracy. These results show that we
were able not only to retrieve active sites but also to infer a possible function for a given protein.
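For readers unfamiliar with the two metrics, the sketch below shows how AUC and accuracy could be computed from per-protein match scores. The scores, labels, threshold, and function names are hypothetical and are not results from this work; a higher score is assumed to mean a better match (for example, the negated matching cost).

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, with ties counted as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def accuracy(scores, labels, threshold):
    """Fraction of proteins classified correctly when score >= threshold means 'hit'."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


# Hypothetical match scores; label 1 marks a protein from the query's family.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
example_auc = auc(example_scores, example_labels)
example_accuracy = accuracy(example_scores, example_labels, threshold=0.5)
```

AUC is independent of any score threshold, whereas accuracy depends on the cutoff chosen, which is why the two are usually reported together.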
As future work we intend to implement a statistical analysis in order to improve our method
and to state with more confidence whether a given protein really has an active site.
Beyond this scope, we are now investigating whether it is possible to predict the action of
known drugs on proteins of pathogens for which this activity has not yet been described, through
site homology, since drugs can act on active sites. For instance, we are now trying to find sites in
HIV proteins that could be homologous to a site targeted by a known drug. In this example, our
information about drugs comes from a database called DrugBank. The DrugBank database is a
bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical,
pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure,
and pathway) information.
Supported by: CAPES, CNPq, FAPEMIG and FINEP.