VICMpred: SVM-based method for prediction of functional

advertisement
VICMpred: SVM-based method for prediction of
functional proteins of gram-negative bacteria using
amino acid patterns and composition.
Sudipto Saha and G.P.S.Raghava*
Institute of Microbial Technology
Sector-39A, Chandigarh, India
*Address for Correspondence
Dr. G. P. S. Raghava, Scientist
Email: raghava@imtech.res.in
Bioinformatics Centre
Web: http://www.imtech.res.in/raghava/
Institute of Microbial Technology
Phone: +91-172-2690557
Sector 39A, Chandigarh, INDIA
Fax: +91-172-2690632
Abstract
In this study, an attempt has been made to predict major functions of gram-negative
bacterial proteins from its amino acid sequence. The dataset used for training and testing
consists of 670 non-redundant gram-negative bacterial proteins (255 cellular process, 60
information molecule, 285 metabolism proteins and 70 virulence factor). First we
developed a SVM based method using amino acid and dipeptide composition and
achieved an overall accuracy 52.39% and 47.01% respectively. In this study we introduce
a new concept for classification of proteins based on tetra-peptide where we identify the
unique tetra-peptides significantly found in a class of proteins. These tetra-peptides were
used as input feature for predicting function of a protein and achieved an overall accuracy
68.66%. We also developed a hybrid method where tetra-peptide information was used
with amino acids composition and achieved overall accuracy 70.75%. The five-fold cross
validation was used to evaluate the performance method. The webserver VICMpred has
been
developed
for
predicting
function
(http://www.imtech.res.in/raghava/vicmpred/).
of
gram-negative
bacterial
proteins
Introduction
Though there is exponential growth in sequence database of proteins in last decade, but
function of small fraction of proteins has been experimentally characterized. The
experimental assessment of the function of every protein of each newly sequenced
genome is beyond foreseeable resources, our knowledge of most of the new proteins will
be from predictions. Functional prediction is a major challenge in the field of
bioinformatics (1). In the past number of methods have been developed to predict the
function of proteins (2,3,4), but the results obtained by analyzing a significant number of
true sequence similarities, point to the complexity of function prediction. Most of the
methods are indirect methods where attempt have been made to predict subcellular
localization of proteins rather than function. The subcellular localization methods are
based on observation that protein belongs to same compartment of protein has similar
amino acid composition (5,6) and has similar functions. In this study, an attempt has been
made to develop direct method for predicting major functions (virulence factors,
information molecule, cellular process and metabolism molecule) of gram-negative
bacterial proteins including virulence factors that cause pathogenicity to the host system.
Most of the proteins in an organism involves in cellular process, metabolism and in
information storage, remaining can be classified in virulence factors, which allow the
germs to establish themselves in the host. Virulence factors include adhesions (7), toxins
(8) and hemolytic molecules (9). Identification of virulence factors is crucial for drug
development. So, we made an attempt to classify the bacterial proteins into four broad
functional classes. The three broad functional classes were taken from COGs functional
annotation (10). They are i) cellular process, which includes cell division, cell envelope
biogenesis, cell motility and signal transduction molecule; ii) information storage and
processing, in which transcription, translation and DNA replication and repair molecule
is included; iii) metabolic includes energy production and carbohydrate, amino acid,
nucleotide, lipid transport and metabolism.
The similarity search tools like BLAST or FASTA (11) and PSI-BLAST (12) are
commonly used for annotation of genomes. Besides similarity search tools, machine
learning tools also used for classification of proteins where amino acid, pseudo, dipeptide
and property compositions are used as features of protein. The prediction of function of
protein is much more complex than other classifications because sequence similarity is
very poor in proteins having same function, thus most of methods based on similarity
search fail to predict function of a protein (13,14).
In this study we made a systematic attempt to develop better method for predicting
function of proteins. First, we tried traditional strategies for classification of proteins that
includes i) similarity search using PSI-BLAST; ii) SVM-based method using amino acid
composition and iii) SVM-based method using dipeptide composition which also
consider local order of amino acids. It was observed that performance of traditional
approaches was very poor in functional classification of proteins. In order to improve the
performance we used tetra-peptides as features of protein similar to deterministic pattern
of Class A as defined by Brazma et al., (15). The approach relies on identifying short
signaling patterns and group of patterns of each four broad functional class present in
higher number (16). The performance of method based on tetra-peptide was much higher
than traditional methods based on residue composition. The performance was further
improved when new and traditional approaches were combined. In this study we classify
gram-negative
bacterial
proteins
obtained
from
PSORTdb
v.20
(http://www.psort.org/dataset) (17), which is used in the development of SubLoc (18).
Based
on
our
study,
we
have
made
a
webserver,
VICMpred
(http://www.imtech.res.in/raghava/vicmpred/) for predicting function of proteins from its
amino acid sequence.
Materials Methods
Data sets
We obtained 1572 proteins from Hua and Sun (2001) work. We examine the function of
these proteins using SWISS-PROT (19) version 33.0. and kept 1048 proteins for further
processing, whose functions was known. We used PROSET software to create a dataset
of non-redundant proteins where no two proteins have more than 90% sequence identity.
Final dataset consist of 670 non-redudant gram-negative bacterial proteins (255 cellular
process, 60 information molecule, 285-metabolism protein and 70 virulence factors
protein).
Evaluation of the predictive performances
The performance modules constructed in this study were evaluated using a 5-fold crossvalidation technique. In the 5-fold cross-validation, the relevant dataset was randomly
divided into five sets. The training and testing was carried out five times, each time using
one distinct set for testing and the remaining four sets for training. For evaluating the
performance of various modules, accuracy and Matthew’s correlation coefficient (MCC)
were calculated using the following equations:
Accuracy (x) =
MCC (x)=
p( x)
Exp( x)
p ( x ) n( x )  u ( x )o( x )
[ p( x)  u ( x)][ p( x)  o( x)][ n( x)  u ( x)][ n( x)  o( x)]
where x can be any functional class (cellular, information, metabolism and virulence
protein), exp(x) is the number of sequences observed in function x, p(x) is the number of
correctly predicted sequences of function x, n(x) is the number of correctly predicted
sequences not of function x, u(x) is the number of under-predicted sequences and o(x) is
the number of over-predicted sequences.
Support vector machine
The SVM was implemented using freely downloadable software package SVM_light
written by Joachims (20). The software enables the user to define a number of parameters
as well as to select from a choice of inbuilt kernal functions, including a radial basis
function (RBF) and a polynomial kernal. Preliminary tests show that the radial basis
function (RBF) kernel gives results better than other kernels. Therefore, in this work we
use the RBF kernel for all the experiments. The prediction of functional class is a multiclass classification problem. We developed a series of binary classifiers to handle the
multi-classification problem. We constructed N SVMs for N-class classification using 1
vs r (one against rest) strategy. Here, the class number was equal to four for bacterial
protein sequences, The ith SVM was trained with all samples in the ith class with positive
labels and all other samples with negative labels. In this way, four SVMs were
constructed for functional class of bacterial protein to cellular, information, metabolic
and virulence.
Protein features
Amino acid composition. Amino acid composition is the fraction of each amino acid in a
protein. The fraction of all 20 natural amino acids was calculated using the following
equations:
Fraction of amino acid i =
Total number of a min o acid ( i )
Total number of a min o acids in protein
where i can be any amino acid.
Dipeptide composition. Dipeptide
composition was used to encapsulate the global
information about each protein sequence, which gives a fixed pattern length of 400 (20 
20). This representation encompassed the information about amino acid composition
along local order of amino acid. The fraction of each dipeptide was calculated using
following equation:
fraction of dipep (i) =
total number of dipep (i )
total number all possible dipeptides
where dipep(i) is one out of 400 dipeptides.
Ab-initio patterns : We have calculated the frequency of all possible tetra peptides
(20202020=160,000) in each class of proteins. Then we identify the tetra peptides,
which are found more than a threshold for a class of proteins, called significant tetra
peptides for that class. In our case we consider a tetra peptide significant if it is found 6
times in case of cellular proteins; 3 times in case of information molecules; 6 times in
case of metabolic proteins and 4 times in case of virulence proteins. In next step we
compute number of significant tetra peptides of each class are present in a protein. Thus,
four features represented a protein, where each feature represents significant number of
tetra peptides of a class of proteins. Finally we used SVM for classification of proteins
based on these four features. In our study significant tetra peptides were only calculated
from proteins in training set in order to avoid any biasness in prediction. An out line of
this method is shown in the Fig 1.
PSI-BLAST
A module of PSI-BLAST was designed in which query sequences in test dataset were
searched against proteins in training dataset using PSI-BLAST. Three iterations of PSIBLAST were carried out at a cut-off E-value of 0.001. PSI-BLAST was used instead of
normal standard BLAST because PSI-BLAST has the capability to detect remote
homologies. The module could predict any of the four functions (cellular, information,
metabolic and virulence) depending upon the similarity of the query protein to the protein
in the dataset.
Prediction results
The performance of all the modules developed in this study is shown in Table 1. The
performance of all modules was evaluated through 5-fold cross-validation. The
composition-based module (kernal=RBF, =80 ,C=2 j=4) was able to predict with
52.39% accuracy. In the case of the dipeptides composition-based module the
performance of the RBF kernel (=100and C=50, j=1 ) was 5% lower than the amino
acid composition. The results of the PSI-BLAST module were evaluated through 5-fold
cross-validation. The module predicted cellular, information, metabolism and virulence
protein sequence with 23.13, 8.33, 28.77, 25.71 % accuracy, respectively. During 5-fold
cross-validation, only 172 hits were obtained out of total 670 proteins. Therefore, the
performance of this module is poorer in comparison to amino acid and dipeptide
compositions based SVM modules.
It was interesting to note that performance for dipeptide based module was lower than
simple amino acid composition based module, despite dipeptide composition provides
composition as well as order of local amino acids. It is because in case of dipeptide total
number of features are 400 (2020) which is too high, number of features do not occurs
in small number of proteins. Thus SVM is unable to learn properly on too many features.
In order to avoid this problem we introduce a new concept for prediction where we
consider peptides which are occurs in each class of proteins in significant amount. Here
we used frequency of significant tetra peptides found in a class of proteins. This module
called ab inito pattern based module were able to predict functions of protein with
accuracy of 68.66% (kernal=RBF, = 0.001and C=50 and j=5), which is higher than
amino acid composition based and dipeptide-based module.
To further improve the prediction accuracy, hybrid modules on the basis of various
features of proteins were constructed. The first hybrid (hybrid 1) was developed on the
basis of the pattern information and amino acid composition. The prediction accuracy of
the hybrid1 module was 70.30%, which is better than any individual features-based
module. Another module (hybrid 2) was developed on the basis pattern information and
of amino acid composition; its performance was similar to hybrid1 module. Finally, a
hybrid module based on pattern information, amino acid and dipeptide composition was
developed. This hybrid used an input vector of 424 dimensions, comprising 4 for pattern
information, 20 for amino acid composition, 400 for dipeptide composition. As shown in
Table 1 the performance of this module is better than any individual feature-based or
other hybrid modules (hybrid1 and hybrid 2). Finally, a hybrid module with the RBF
kernel (=0.001 and C=100000 j=1), which used pattern, amino acid and dipeptide
composition information, was able to achieve 70.75% overall accuracy.
VICMpred SERVER
Based on our study, we have developed a web server that allows users to predict the
function of a protein (e.g., virulence factors, information molecule, cellular process and
metabolism molecule) from its amino acid sequences. VICMperd is freely available at
http://www.imtech.res.in/raghava/vicmpred/. The common gateway interface (CGI) script
for VICMpred is written using PERL version 5.03. This server is installed on a Sun
Server (420E) under a UNIX (Solaris 7) environment. Users can enter the primary amino
acid sequence for prediction using file uploading or cut-and-paste options.
Discussion
The functional annotation of proteins is one of the major challenges in the era of post
genomics. The most widely used methods for predicting the function of a new protein
involve sequence alignment or similarity search or profile search, like FASTA, BLAST,
PSI-BLAST (10,11). These methods fail in absence of significant similarity between
query protein and annotated proteins. One of the reasons of failure of similarity-based
method is variation in size of proteins either belongs to same or different classes.
The problems with profiles is that they are complicated models with many free
parameters. One is faced with a number of difficult problems like the best ways to set the
position-specific residues scores, to score gaps and insertions and to combine structural
and multiple sequence information.
An alternative way for predicting function of protein is to predict is location in cell,
which is based on assumption that proteins reside in same location also have same
functions. Most of these subcellular localization methods are based on composition
(amino or dipeptide) of protein.
In this study, an attempt have been made to develop direct method for predicting function
of proteins. First we tried traditional approaches which are commonly used in prediction
of subcellular localization. It was observed that performance of PSI-BLAST was poor to
composition based methods (Table 1). This demonstrate that similarity search based
methods are not very effective in function prediction. It was also observed that dipeptide
composition based method perform poor than amino acid composition. This was
unexpected as dipeptide provides more information (composition with local order) than
simple amino acid. In past we observed that dipeptide perform better than amino acid
composition in subcellular localization of proteins. We examined our data and observed
that number of dipeptide was either rare or completely absent due to small number of
proteins used for classification. This demonstrates that higher order composition is not
successful on small set of data. In order to overcome this problem we tried a new
approach. In this approach we used tetra peptides that provides more local order than
dipeptide and tripeptide. Instead of using composition of all tetra peptide, we identify the
tetra peptides found in significant number in each class of protein. We used only
significant tetra peptide found in proteins for classification. We calculate the number of
tetra peptides of each class present in a query sequence. This information is used to
classify the proteins using SVM. We obtain very high accuracy using this approach. One
may compare this approach with pattern searching approach (like PROSITE) where one
need to detect known pattern in a sequence. Here patterns are tetra peptides instead of
PROSITE patterns. There is limited number of PROSITE, so number of proteins does not
have any PROSITE pattern. Where as in our case we are using all tetra peptides found in
significant amount in each class of proteins so number of patterns in our case is too high
(1248, 381, 1443 and 1168 for cellular, information molecule, metabolic and virulence
proteins respectively). Thus there is chance that each query protein will have large
number of tetra peptides of each class. Though specificity of our tetra peptides is lower
than PROSITE patterns but number is 100 times more.
We also developed hybrid modules, which combine our composition-based modules, and
pattern based approach, in order to improve the performance of method further. The
performance of hybrid methods was better than individual. In summary we developed a
effective method for prediction function of bacterial proteins. This method will be very
useful in development of drug and vaccine as it allows predicting virulence proteins.
Though we tried our best to improve the accuracy of prediction still it is not very high.
Another limitation of this method is that it predict single function of protein, where as in
realistic situation it is observed that protein may have multiple functions.
Acknowledgments
We acknowledge the financial support from the Council of Scientific and Industrial
Research (CSIR) and Department of Biotechnology (DBT), Govt. of India.
References
1. Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000 Oct
1;41(1):98-107.
2. Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic prediction of protein
function. Cell Mol Life Sci. 2003 Dec;60(12):2637-50. Review.
3. Panchenko AR, Kondrashov F, Bryant S. Prediction of functional sites by analysis of
sequence and structure conservation. Protein Sci. 2004 Apr;13(4):884-92. Epub 2004
Mar 09.
4. Cai YD, Doig AJ. Prediction of Saccharomyces cerevisiae protein functional class
from functional domain composition. Bioinformatics. 2004 May 22;20(8):1292-300.
Epub 2004 Feb 19.
5. Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of
eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res.
2004 Jul 1;32(Web Server issue):W414-9.
6. Garg A, Bhasin M, Raghava GP. SVM-based method for subcellular localization of
human proteins using amino acid compositions, their order and similarity search.
J Biol Chem. 2005 Jan 12; [Epub ahead of print]
7. Irie Y, Mattoo S, Yuk MH. The Bvg virulence control system regulates biofilm
formation in Bordetella bronchiseptica. J Bacteriol. 2004 Sep;186(17):5692-8.
8. Geric B, Rupnik M, Gerding DN, Grabnar M, Johnson S. Distribution of Clostridium
difficile variant toxinotypes and strains with binary toxin genes among clinical isolates in
an American hospital. J Med Microbiol. 2004 Sep; 53(Pt 9):887-94.
9. Ethelberg S, Olsen KE, Scheutz F, Jensen C, Schiellerup P, Enberg J, Petersen AM,
Olesen B, Gerner-Smidt P, Molbak K. Virulence factors for hemolytic uremic syndrome,
Denmark. Emerg Infect Dis. 2004 May;10(5):842-7.
10. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for
genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000 Jan
1;28(1):33-6.
11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search
tool. J Mol Biol. 1990 Oct 5;215(3):403-10.
12. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-402.
13. Li L, Shakhnovich EI, Mirny LA. Amino acids determining enzyme-substrate
specificity in prokaryotic and eukaryotic protein kinases. Proc Natl Acad Sci U S A. 2003
Apr 15;100(8):4463-8. Epub 2003 Apr 04.
14. Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from
protein sequence alignments.J Mol Biol. 2000 Oct 13;303(1):61-76.
15. Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic
discovery of patterns in biosequences. J Comput Biol. 1998 Summer;5(2):279-305.
16. Perez,A.J., Rodriguez,A., Trelles,O. and Thode,G (2002) A computational strayegy
for protein function assigment which addresses the multidomain problem. Comp. Funct.
Genomics, 3, 423-440.
17. Jennifer L. Gardy, Cory Spencer, Ke Wang, Martin Ester, Gabor E. Tusnady, Istvan
Simon, Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai and Fiona S.L.
Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for
Gram-negative bacteria, Nucleic Acids Research 31(13):3613-17.
18. Hua S, Sun Z. Support vector machine approach for protein subcellular localization
prediction. Bioinformatics. 2001 Aug;17(8):721-8.
19. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its
supplement TrEMBL in 2000. Nucleic Acids Res. 2000 Jan 1;28(1):45-8.
20. Joachims,T. (1999) Making large-scale SVM learning particle. In Scholkopf,B.,
Burges,C, and Smola,A.(eds), Advances in kernal Methods Support Vector Learning.
MIT Pres, Cambridge, MA and London, pp 42-56.
Legend of Figure
Figure 1. An out line of Ab-initio patterns prediction method.
Gram –negative Bacterial Proteins
(670)
Virulence Factors
(70)
Information molecule
(60)
1168 significant
tetramers
(frequency >=4)
381 significant
tetramers
(frequency >=3)
Cellular Process
(255)
Metabolism molecule
(285)
1248 significant
tetramers
(frequency >=6)
1443 significant
tetramers
(frequency >=6)
Search for significant tetramers for each class in query sequence
Number of tetramers
( virulence factors) in
query protein
Number of tetramers
(information class) in
query protein
Number of tetramers
(cellular process) in
query protein
Input for SVM
Predicted functional class
Fig 1
Number of tetramers
(Metabolism ) in
query protein
Table 1.The performance of various modules including SVM modules based on various features of protein
sequence and PSI-BLAST.
________________________________________________________________________
Approach
Cellular
ACC
MCC
Composition-based (A) 47.06
Dipeptide-based (B)
Pattern-based (C)
PSI-BLAST
Information
ACC
MCC
Virulence
ACC
MCC
66.67
27.14
0.32
52.39
27.14
0.20
47.01
0.61
68.66
0.12
36.67
45.10
0.11
15.00
0.21
60.35
70.20
0.46
48.33
0.57
72.98
23.13
0.41
Metabolism
ACC
MCC
8.33
0.31
0.23
0.51
28.77
25.71
69.41
0.48
50.0
0.59
77.19
Hybrid (C+B)
69.02
0.54
48.33
0.52
74.04
0.53
Hybrid (C+B+A)
69.80
0.51
53.33
0.58
77.54
0.56
Hybrid (C+A)
ACC: Accuracy; MCC: Matthew’s correlation coefficient.
62.86
Overall
ACC
0.54
62.86
0.65
70.30
58.57
0.54
68.21
0.59
70.75
61.43
Download