Transmembrane Protein Topology Prediction Using Support Vector Machines

Tim Nugent and David Jones
Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT

Due to the paucity of alpha-helical transmembrane protein crystal structures, in silico approaches are essential for structural analysis. We present a support vector machine-based topology predictor that integrates both signal peptide and re-entrant helix prediction, and report the results of applying it to a number of complete genomes.

Introduction

Alpha-helical transmembrane (TM) proteins constitute roughly 30% of a typical genome and are involved in a wide variety of important biological processes. However, due to the experimental difficulties involved in obtaining high-quality crystals, this class of protein is severely under-represented in structural databases, making up only 1% of known structures in the PDB. Given the biological and pharmacological importance of TM proteins, an understanding of their topology (the total number of TM helices, their boundaries, and their in/out orientation relative to the membrane) is essential for structural and functional analysis and for directing further experimental work. In the absence of structural data, bioinformatic strategies therefore turn to sequence-based prediction methods.

Method Correct helix count Correct N-terminal Correct helix Correct count and signal locations peptide Correct re-entrant helix Correct overall topology
TMSVM OCTOPUS MEMSAT3
91% 84% 84% 95% 86% 84% 91% 83% 76% 64% 73% 64% 89% 79% 76% 93% 21% 57%

Table 2. Overall prediction accuracy. OCTOPUS [1] results were not cross-validated and are therefore likely to be overestimated, as there is considerable overlap between test and training sets.

The graphical output from the program is shown in Figure 2.
Signal Peptides and Re-entrant Helices

One problem faced by modern topology predictors is discriminating between TM helices and other features composed largely of hydrophobic residues. These include targeting motifs such as signal peptides and signal anchors, amphipathic helices, and re-entrant helices: membrane-penetrating helices that enter and exit the membrane on the same side, common in many ion channel families (Figure 1). The close similarity between the hydrophobic profile of such features and that of a TM helix frequently leads to crossover between the different types of prediction. If these elements are predicted as TM helices, the ensuing topology prediction is likely to be disrupted.

Figure 1. A chain from a potassium channel (PDB code 1r3j) showing a re-entrant helix, thought to function as a selectivity filter.

Figure 2. Results for ubiquinol oxidase showing correct topology and signal peptide prediction. The raw SVM scores are shown below the topology schematic.

Discriminating between Globular and Transmembrane Proteins

An additional SVM was trained to discriminate between globular and transmembrane proteins, using a data set of 2685 non-redundant chains from globular proteins of known structure, combined with our novel set of 131 TM proteins. PSI-BLAST profiles were generated for all sequences and 10-fold cross-validation was used to assess performance. A false positive (FP) rate of 0% and a false negative (FN) rate of 0.4% were achieved, an improvement on the MEMSAT3 [2] neural network-based method (0.5% FP, 0.5% FN).

A Novel Topology Predictor

We have developed a new TM topology predictor, trained and benchmarked with full cross-validation on a novel data set of 131 sequences whose topologies are derived solely from crystal structures.
The method uses evolutionary information and four support vector machine (SVM) classifiers, whose outputs are combined by a dynamic programming algorithm to return a list of predicted topologies ranked by overall likelihood. It incorporates both signal peptide and re-entrant helix prediction.

To train the SVMs, PSI-BLAST profiles were generated for each sequence and a sliding window approach was applied, with values normalised by Z-score to improve convergence time. Jackknife cross-validation was used to assess SVM performance, with sequences sharing >25% sequence identity removed from the training sets. Window size and SVM parameters were optimised using the Matthews correlation coefficient (Table 1).

SVM                                   Window size   Kernel       MCC
Helix/Loop                            33            RBF          0.80
Inside Loop/Outside Loop              35            Polynomial   0.63
Re-entrant Helix/¬Re-entrant Helix    27            RBF          0.34
Signal Peptide/¬Signal Peptide        27            RBF          0.76

Table 1. Per-residue SVM prediction accuracy. MCC: Matthews correlation coefficient.

A modified version of the original MEMSAT dynamic programming algorithm was used, treating TM helices as discrete units rather than separating them into inside, outside and middle components. Re-entrant helix and signal peptide states were added. Residues were therefore predicted to lie in one of five topological regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide.

The SVM-based method ('TMSVM') was benchmarked against a selection of leading topology predictors (Table 2), scoring 89% overall accuracy, an improvement of 10 percentage points over the next best method. TMSVM was able to detect signal peptides with 92% accuracy and re-entrant helices with 44% accuracy, with no false positives predicted.

Figure 3. Ten genomes were filtered using the TM/globular discriminator. Those predicted to be TM proteins were subjected to full TM topology prediction. X-axis: TM helix count. Y-axis: number of proteins.
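
The sliding-window encoding with Z-score normalisation, and the Matthews correlation coefficient used to tune window size and SVM parameters, can be illustrated with a minimal sketch. The function names, the per-column normalisation, and the zero-padding at the sequence termini are assumptions made for illustration; the poster does not specify these details, and a real profile would come from PSI-BLAST rather than a hand-written list.

```python
import math
from statistics import mean, pstdev


def zscore_columns(profile):
    """Z-score each column of a profile (one row of scores per residue).

    Assumption: normalisation is per column over the whole sequence.
    """
    columns = list(zip(*profile))
    means = [mean(c) for c in columns]
    # A constant column has zero deviation; divide by 1.0 instead.
    sds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)]
            for row in profile]


def window_features(profile, window):
    """Build one flat feature vector per residue from a centred window.

    Positions that fall outside the sequence are zero-padded (assumption).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    half = window // 2
    padding = [0.0] * len(profile[0])
    features = []
    for i in range(len(profile)):
        vector = []
        for j in range(i - half, i + half + 1):
            vector.extend(profile[j] if 0 <= j < len(profile) else padding)
        features.append(vector)
    return features


def matthews_cc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

With a window size of 33 (the Helix/Loop value in Table 1) and a 20-column profile, each residue would be represented by a vector of 660 values. Per-residue predictions from a classifier trained on such vectors can then be scored against the known labels with `matthews_cc`.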
Conclusions

Overall, the method predicted the correct topology and location of TM helices for 89% of the test set, a significant improvement over recent methods. The SVM trained to discriminate between TM and globular proteins achieved a false positive rate of 0% and a false negative rate of 0.4%, making this method highly suitable for whole-genome analysis (Figure 3). However, there is still room for significant improvement in the detection of re-entrant helices.

[1] Viklund H., Elofsson A. (2008) OCTOPUS: Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics (In press).
[2] Jones D.T. (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics 23:538-544.

This work was funded by the Biotechnology and Biological Sciences Research Council.