VICMpred: SVM-based method for prediction of functional proteins of gram-negative bacteria using amino acid patterns and composition. Sudipto Saha and G.P.S.Raghava* Institute of Microbial Technology Sector-39A, Chandigarh, India *Address for Correspondence Dr. G. P. S. Raghava, Scientist Email: raghava@imtech.res.in Bioinformatics Centre Web: http://www.imtech.res.in/raghava/ Institute of Microbial Technology Phone: +91-172-2690557 Sector 39A, Chandigarh, INDIA Fax: +91-172-2690632 Abstract In this study, an attempt has been made to predict major functions of gram-negative bacterial proteins from its amino acid sequence. The dataset used for training and testing consists of 670 non-redundant gram-negative bacterial proteins (255 cellular process, 60 information molecule, 285 metabolism proteins and 70 virulence factor). First we developed a SVM based method using amino acid and dipeptide composition and achieved an overall accuracy 52.39% and 47.01% respectively. In this study we introduce a new concept for classification of proteins based on tetra-peptide where we identify the unique tetra-peptides significantly found in a class of proteins. These tetra-peptides were used as input feature for predicting function of a protein and achieved an overall accuracy 68.66%. We also developed a hybrid method where tetra-peptide information was used with amino acids composition and achieved overall accuracy 70.75%. The five-fold cross validation was used to evaluate the performance method. The webserver VICMpred has been developed for predicting function (http://www.imtech.res.in/raghava/vicmpred/). of gram-negative bacterial proteins Introduction Though there is exponential growth in sequence database of proteins in last decade, but function of small fraction of proteins has been experimentally characterized. The experimental assessment of the function of every protein of each newly sequenced genome is beyond foreseeable resources, our knowledge of most of the new proteins will be from predictions. Functional prediction is a major challenge in the field of bioinformatics (1). In the past number of methods have been developed to predict the function of proteins (2,3,4), but the results obtained by analyzing a significant number of true sequence similarities, point to the complexity of function prediction. Most of the methods are indirect methods where attempt have been made to predict subcellular localization of proteins rather than function. The subcellular localization methods are based on observation that protein belongs to same compartment of protein has similar amino acid composition (5,6) and has similar functions. In this study, an attempt has been made to develop direct method for predicting major functions (virulence factors, information molecule, cellular process and metabolism molecule) of gram-negative bacterial proteins including virulence factors that cause pathogenicity to the host system. Most of the proteins in an organism involves in cellular process, metabolism and in information storage, remaining can be classified in virulence factors, which allow the germs to establish themselves in the host. Virulence factors include adhesions (7), toxins (8) and hemolytic molecules (9). Identification of virulence factors is crucial for drug development. So, we made an attempt to classify the bacterial proteins into four broad functional classes. The three broad functional classes were taken from COGs functional annotation (10). They are i) cellular process, which includes cell division, cell envelope biogenesis, cell motility and signal transduction molecule; ii) information storage and processing, in which transcription, translation and DNA replication and repair molecule is included; iii) metabolic includes energy production and carbohydrate, amino acid, nucleotide, lipid transport and metabolism. The similarity search tools like BLAST or FASTA (11) and PSI-BLAST (12) are commonly used for annotation of genomes. Besides similarity search tools, machine learning tools also used for classification of proteins where amino acid, pseudo, dipeptide and property compositions are used as features of protein. The prediction of function of protein is much more complex than other classifications because sequence similarity is very poor in proteins having same function, thus most of methods based on similarity search fail to predict function of a protein (13,14). In this study we made a systematic attempt to develop better method for predicting function of proteins. First, we tried traditional strategies for classification of proteins that includes i) similarity search using PSI-BLAST; ii) SVM-based method using amino acid composition and iii) SVM-based method using dipeptide composition which also consider local order of amino acids. It was observed that performance of traditional approaches was very poor in functional classification of proteins. In order to improve the performance we used tetra-peptides as features of protein similar to deterministic pattern of Class A as defined by Brazma et al., (15). The approach relies on identifying short signaling patterns and group of patterns of each four broad functional class present in higher number (16). The performance of method based on tetra-peptide was much higher than traditional methods based on residue composition. The performance was further improved when new and traditional approaches were combined. In this study we classify gram-negative bacterial proteins obtained from PSORTdb v.20 (http://www.psort.org/dataset) (17), which is used in the development of SubLoc (18). Based on our study, we have made a webserver, VICMpred (http://www.imtech.res.in/raghava/vicmpred/) for predicting function of proteins from its amino acid sequence. Materials Methods Data sets We obtained 1572 proteins from Hua and Sun (2001) work. We examine the function of these proteins using SWISS-PROT (19) version 33.0. and kept 1048 proteins for further processing, whose functions was known. We used PROSET software to create a dataset of non-redundant proteins where no two proteins have more than 90% sequence identity. Final dataset consist of 670 non-redudant gram-negative bacterial proteins (255 cellular process, 60 information molecule, 285-metabolism protein and 70 virulence factors protein). Evaluation of the predictive performances The performance modules constructed in this study were evaluated using a 5-fold crossvalidation technique. In the 5-fold cross-validation, the relevant dataset was randomly divided into five sets. The training and testing was carried out five times, each time using one distinct set for testing and the remaining four sets for training. For evaluating the performance of various modules, accuracy and Matthew’s correlation coefficient (MCC) were calculated using the following equations: Accuracy (x) = MCC (x)= p( x) Exp( x) p ( x ) n( x ) u ( x )o( x ) [ p( x) u ( x)][ p( x) o( x)][ n( x) u ( x)][ n( x) o( x)] where x can be any functional class (cellular, information, metabolism and virulence protein), exp(x) is the number of sequences observed in function x, p(x) is the number of correctly predicted sequences of function x, n(x) is the number of correctly predicted sequences not of function x, u(x) is the number of under-predicted sequences and o(x) is the number of over-predicted sequences. Support vector machine The SVM was implemented using freely downloadable software package SVM_light written by Joachims (20). The software enables the user to define a number of parameters as well as to select from a choice of inbuilt kernal functions, including a radial basis function (RBF) and a polynomial kernal. Preliminary tests show that the radial basis function (RBF) kernel gives results better than other kernels. Therefore, in this work we use the RBF kernel for all the experiments. The prediction of functional class is a multiclass classification problem. We developed a series of binary classifiers to handle the multi-classification problem. We constructed N SVMs for N-class classification using 1 vs r (one against rest) strategy. Here, the class number was equal to four for bacterial protein sequences, The ith SVM was trained with all samples in the ith class with positive labels and all other samples with negative labels. In this way, four SVMs were constructed for functional class of bacterial protein to cellular, information, metabolic and virulence. Protein features Amino acid composition. Amino acid composition is the fraction of each amino acid in a protein. The fraction of all 20 natural amino acids was calculated using the following equations: Fraction of amino acid i = Total number of a min o acid ( i ) Total number of a min o acids in protein where i can be any amino acid. Dipeptide composition. Dipeptide composition was used to encapsulate the global information about each protein sequence, which gives a fixed pattern length of 400 (20 20). This representation encompassed the information about amino acid composition along local order of amino acid. The fraction of each dipeptide was calculated using following equation: fraction of dipep (i) = total number of dipep (i ) total number all possible dipeptides where dipep(i) is one out of 400 dipeptides. Ab-initio patterns : We have calculated the frequency of all possible tetra peptides (20202020=160,000) in each class of proteins. Then we identify the tetra peptides, which are found more than a threshold for a class of proteins, called significant tetra peptides for that class. In our case we consider a tetra peptide significant if it is found 6 times in case of cellular proteins; 3 times in case of information molecules; 6 times in case of metabolic proteins and 4 times in case of virulence proteins. In next step we compute number of significant tetra peptides of each class are present in a protein. Thus, four features represented a protein, where each feature represents significant number of tetra peptides of a class of proteins. Finally we used SVM for classification of proteins based on these four features. In our study significant tetra peptides were only calculated from proteins in training set in order to avoid any biasness in prediction. An out line of this method is shown in the Fig 1. PSI-BLAST A module of PSI-BLAST was designed in which query sequences in test dataset were searched against proteins in training dataset using PSI-BLAST. Three iterations of PSIBLAST were carried out at a cut-off E-value of 0.001. PSI-BLAST was used instead of normal standard BLAST because PSI-BLAST has the capability to detect remote homologies. The module could predict any of the four functions (cellular, information, metabolic and virulence) depending upon the similarity of the query protein to the protein in the dataset. Prediction results The performance of all the modules developed in this study is shown in Table 1. The performance of all modules was evaluated through 5-fold cross-validation. The composition-based module (kernal=RBF, =80 ,C=2 j=4) was able to predict with 52.39% accuracy. In the case of the dipeptides composition-based module the performance of the RBF kernel (=100and C=50, j=1 ) was 5% lower than the amino acid composition. The results of the PSI-BLAST module were evaluated through 5-fold cross-validation. The module predicted cellular, information, metabolism and virulence protein sequence with 23.13, 8.33, 28.77, 25.71 % accuracy, respectively. During 5-fold cross-validation, only 172 hits were obtained out of total 670 proteins. Therefore, the performance of this module is poorer in comparison to amino acid and dipeptide compositions based SVM modules. It was interesting to note that performance for dipeptide based module was lower than simple amino acid composition based module, despite dipeptide composition provides composition as well as order of local amino acids. It is because in case of dipeptide total number of features are 400 (2020) which is too high, number of features do not occurs in small number of proteins. Thus SVM is unable to learn properly on too many features. In order to avoid this problem we introduce a new concept for prediction where we consider peptides which are occurs in each class of proteins in significant amount. Here we used frequency of significant tetra peptides found in a class of proteins. This module called ab inito pattern based module were able to predict functions of protein with accuracy of 68.66% (kernal=RBF, = 0.001and C=50 and j=5), which is higher than amino acid composition based and dipeptide-based module. To further improve the prediction accuracy, hybrid modules on the basis of various features of proteins were constructed. The first hybrid (hybrid 1) was developed on the basis of the pattern information and amino acid composition. The prediction accuracy of the hybrid1 module was 70.30%, which is better than any individual features-based module. Another module (hybrid 2) was developed on the basis pattern information and of amino acid composition; its performance was similar to hybrid1 module. Finally, a hybrid module based on pattern information, amino acid and dipeptide composition was developed. This hybrid used an input vector of 424 dimensions, comprising 4 for pattern information, 20 for amino acid composition, 400 for dipeptide composition. As shown in Table 1 the performance of this module is better than any individual feature-based or other hybrid modules (hybrid1 and hybrid 2). Finally, a hybrid module with the RBF kernel (=0.001 and C=100000 j=1), which used pattern, amino acid and dipeptide composition information, was able to achieve 70.75% overall accuracy. VICMpred SERVER Based on our study, we have developed a web server that allows users to predict the function of a protein (e.g., virulence factors, information molecule, cellular process and metabolism molecule) from its amino acid sequences. VICMperd is freely available at http://www.imtech.res.in/raghava/vicmpred/. The common gateway interface (CGI) script for VICMpred is written using PERL version 5.03. This server is installed on a Sun Server (420E) under a UNIX (Solaris 7) environment. Users can enter the primary amino acid sequence for prediction using file uploading or cut-and-paste options. Discussion The functional annotation of proteins is one of the major challenges in the era of post genomics. The most widely used methods for predicting the function of a new protein involve sequence alignment or similarity search or profile search, like FASTA, BLAST, PSI-BLAST (10,11). These methods fail in absence of significant similarity between query protein and annotated proteins. One of the reasons of failure of similarity-based method is variation in size of proteins either belongs to same or different classes. The problems with profiles is that they are complicated models with many free parameters. One is faced with a number of difficult problems like the best ways to set the position-specific residues scores, to score gaps and insertions and to combine structural and multiple sequence information. An alternative way for predicting function of protein is to predict is location in cell, which is based on assumption that proteins reside in same location also have same functions. Most of these subcellular localization methods are based on composition (amino or dipeptide) of protein. In this study, an attempt have been made to develop direct method for predicting function of proteins. First we tried traditional approaches which are commonly used in prediction of subcellular localization. It was observed that performance of PSI-BLAST was poor to composition based methods (Table 1). This demonstrate that similarity search based methods are not very effective in function prediction. It was also observed that dipeptide composition based method perform poor than amino acid composition. This was unexpected as dipeptide provides more information (composition with local order) than simple amino acid. In past we observed that dipeptide perform better than amino acid composition in subcellular localization of proteins. We examined our data and observed that number of dipeptide was either rare or completely absent due to small number of proteins used for classification. This demonstrates that higher order composition is not successful on small set of data. In order to overcome this problem we tried a new approach. In this approach we used tetra peptides that provides more local order than dipeptide and tripeptide. Instead of using composition of all tetra peptide, we identify the tetra peptides found in significant number in each class of protein. We used only significant tetra peptide found in proteins for classification. We calculate the number of tetra peptides of each class present in a query sequence. This information is used to classify the proteins using SVM. We obtain very high accuracy using this approach. One may compare this approach with pattern searching approach (like PROSITE) where one need to detect known pattern in a sequence. Here patterns are tetra peptides instead of PROSITE patterns. There is limited number of PROSITE, so number of proteins does not have any PROSITE pattern. Where as in our case we are using all tetra peptides found in significant amount in each class of proteins so number of patterns in our case is too high (1248, 381, 1443 and 1168 for cellular, information molecule, metabolic and virulence proteins respectively). Thus there is chance that each query protein will have large number of tetra peptides of each class. Though specificity of our tetra peptides is lower than PROSITE patterns but number is 100 times more. We also developed hybrid modules, which combine our composition-based modules, and pattern based approach, in order to improve the performance of method further. The performance of hybrid methods was better than individual. In summary we developed a effective method for prediction function of bacterial proteins. This method will be very useful in development of drug and vaccine as it allows predicting virulence proteins. Though we tried our best to improve the accuracy of prediction still it is not very high. Another limitation of this method is that it predict single function of protein, where as in realistic situation it is observed that protein may have multiple functions. Acknowledgments We acknowledge the financial support from the Council of Scientific and Industrial Research (CSIR) and Department of Biotechnology (DBT), Govt. of India. References 1. Devos D, Valencia A. Practical limits of function prediction. Proteins. 2000 Oct 1;41(1):98-107. 2. Rost B, Liu J, Nair R, Wrzeszczynski KO, Ofran Y. Automatic prediction of protein function. Cell Mol Life Sci. 2003 Dec;60(12):2637-50. Review. 3. Panchenko AR, Kondrashov F, Bryant S. Prediction of functional sites by analysis of sequence and structure conservation. Protein Sci. 2004 Apr;13(4):884-92. Epub 2004 Mar 09. 4. Cai YD, Doig AJ. Prediction of Saccharomyces cerevisiae protein functional class from functional domain composition. Bioinformatics. 2004 May 22;20(8):1292-300. Epub 2004 Feb 19. 5. Bhasin M, Raghava GP. ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W414-9. 6. Garg A, Bhasin M, Raghava GP. SVM-based method for subcellular localization of human proteins using amino acid compositions, their order and similarity search. J Biol Chem. 2005 Jan 12; [Epub ahead of print] 7. Irie Y, Mattoo S, Yuk MH. The Bvg virulence control system regulates biofilm formation in Bordetella bronchiseptica. J Bacteriol. 2004 Sep;186(17):5692-8. 8. Geric B, Rupnik M, Gerding DN, Grabnar M, Johnson S. Distribution of Clostridium difficile variant toxinotypes and strains with binary toxin genes among clinical isolates in an American hospital. J Med Microbiol. 2004 Sep; 53(Pt 9):887-94. 9. Ethelberg S, Olsen KE, Scheutz F, Jensen C, Schiellerup P, Enberg J, Petersen AM, Olesen B, Gerner-Smidt P, Molbak K. Virulence factors for hemolytic uremic syndrome, Denmark. Emerg Infect Dis. 2004 May;10(5):842-7. 10. Tatusov RL, Galperin MY, Natale DA, Koonin EV. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000 Jan 1;28(1):33-6. 11. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990 Oct 5;215(3):403-10. 12. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997 Sep 1;25(17):3389-402. 13. Li L, Shakhnovich EI, Mirny LA. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proc Natl Acad Sci U S A. 2003 Apr 15;100(8):4463-8. Epub 2003 Apr 04. 14. Hannenhalli SS, Russell RB. Analysis and prediction of functional sub-types from protein sequence alignments.J Mol Biol. 2000 Oct 13;303(1):61-76. 15. Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998 Summer;5(2):279-305. 16. Perez,A.J., Rodriguez,A., Trelles,O. and Thode,G (2002) A computational strayegy for protein function assigment which addresses the multidomain problem. Comp. Funct. Genomics, 3, 423-440. 17. Jennifer L. Gardy, Cory Spencer, Ke Wang, Martin Ester, Gabor E. Tusnady, Istvan Simon, Sujun Hua, Katalin deFays, Christophe Lambert, Kenta Nakai and Fiona S.L. Brinkman (2003). PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria, Nucleic Acids Research 31(13):3613-17. 18. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001 Aug;17(8):721-8. 19. Bairoch A, Apweiler R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 2000 Jan 1;28(1):45-8. 20. Joachims,T. (1999) Making large-scale SVM learning particle. In Scholkopf,B., Burges,C, and Smola,A.(eds), Advances in kernal Methods Support Vector Learning. MIT Pres, Cambridge, MA and London, pp 42-56. Legend of Figure Figure 1. An out line of Ab-initio patterns prediction method. Gram –negative Bacterial Proteins (670) Virulence Factors (70) Information molecule (60) 1168 significant tetramers (frequency >=4) 381 significant tetramers (frequency >=3) Cellular Process (255) Metabolism molecule (285) 1248 significant tetramers (frequency >=6) 1443 significant tetramers (frequency >=6) Search for significant tetramers for each class in query sequence Number of tetramers ( virulence factors) in query protein Number of tetramers (information class) in query protein Number of tetramers (cellular process) in query protein Input for SVM Predicted functional class Fig 1 Number of tetramers (Metabolism ) in query protein Table 1.The performance of various modules including SVM modules based on various features of protein sequence and PSI-BLAST. ________________________________________________________________________ Approach Cellular ACC MCC Composition-based (A) 47.06 Dipeptide-based (B) Pattern-based (C) PSI-BLAST Information ACC MCC Virulence ACC MCC 66.67 27.14 0.32 52.39 27.14 0.20 47.01 0.61 68.66 0.12 36.67 45.10 0.11 15.00 0.21 60.35 70.20 0.46 48.33 0.57 72.98 23.13 0.41 Metabolism ACC MCC 8.33 0.31 0.23 0.51 28.77 25.71 69.41 0.48 50.0 0.59 77.19 Hybrid (C+B) 69.02 0.54 48.33 0.52 74.04 0.53 Hybrid (C+B+A) 69.80 0.51 53.33 0.58 77.54 0.56 Hybrid (C+A) ACC: Accuracy; MCC: Matthew’s correlation coefficient. 62.86 Overall ACC 0.54 62.86 0.65 70.30 58.57 0.54 68.21 0.59 70.75 61.43