Transmembrane Protein Topology Prediction Using Support Vector Machines

Tim Nugent and David Jones
Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT
Due to the paucity of alpha-helical transmembrane protein crystal structures, in silico approaches are essential for
structural analysis. We present a support vector machine-based topology predictor that integrates both signal
peptide and re-entrant helix prediction, and report the results of applying it to a number of complete genomes.
Introduction
Alpha-helical transmembrane (TM) proteins constitute roughly 30% of a typical
genome and are involved in a wide variety of important biological processes.
However, due to the experimental difficulties involved in obtaining high-quality
crystals, this class of protein is severely under-represented in structural databases,
making up only 1% of known structures in the PDB. Given the biological and
pharmacological importance of TM proteins, an understanding of their topology - the
total number of TM helices, their boundaries and in/out orientation relative to the
membrane - is essential for structural and functional analysis, and directing further
experimental work. In the absence of structural data, bioinformatic strategies thus
turn to sequence-based prediction methods.
Method
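Columns: Correct helix count; Correct helix count and locations; Correct N-terminal;
Correct signal peptide; Correct re-entrant helix; Correct overall topology.
Methods: TMSVM, OCTOPUS, MEMSAT3.
Values per column, in source order (TMSVM / OCTOPUS / MEMSAT3); the 89% / 79% / 76%
column is overall topology, as quoted in the text:
91% / 84% / 84%
95% / 86% / 84%
91% / 83% / 76%
64% / 73% / 64%
89% / 79% / 76%  (Correct overall topology)
93% / 21% / 57%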
Table 2. Overall prediction accuracy.
OCTOPUS [1] results were not cross-validated and are therefore likely to be
overestimated, as there is considerable overlap between its test and training sets. The
graphical output from the program is shown in Figure 2.
Signal Peptides and Re-entrant Helices
One problem faced by modern topology predictors is the discrimination between TM
helices and other features composed largely of hydrophobic residues. These include
targeting motifs such as signal peptides and signal anchors, amphipathic helices,
and re-entrant helices – membrane-penetrating helices that enter and exit the
membrane on the same side, common in many ion channel families (Figure 1). The
high similarity between such features and the hydrophobic profile of a TM helix
frequently leads to crossover between the different types of predictions. Should these
elements be predicted as TM helices, the ensuing topology prediction is likely to be
disrupted.
Figure 2. Results for Ubiquinol Oxidase showing correct topology and signal peptide
prediction. The raw SVM scores are shown below the topology schematic.
Discriminating between Globular and Transmembrane Proteins
Figure 1. A chain from a Potassium channel (PDB code 1r3j) showing a re-entrant helix,
thought to function as a selectivity filter.
An additional SVM was trained to discriminate between globular and transmembrane
proteins, using a data set of 2685 non-redundant chains from globular proteins of
known structure, combined with our novel set of 131 TM proteins. PSI-BLAST
profiles were generated for all sequences and 10-fold cross validation was used to
assess performance. A false positive (FP) rate of 0% and a false negative (FN) rate of 0.4%
were achieved, improving on the MEMSAT3 [2] neural network-based method
(0.5% FP, 0.5% FN).
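As a minimal sketch of how the two error rates can be computed for a binary TM/globular classifier: the function name and toy labels below are hypothetical, and the denominators (FP over actual globular proteins, FN over actual TM proteins) are an assumption, since the poster does not state how the rates were defined.

```python
def error_rates(true_labels, predicted_labels):
    """Return (false positive rate, false negative rate).

    Labels are truthy for TM proteins and falsy for globular proteins.
    Assumed definitions: FP rate = FP / actual negatives,
    FN rate = FN / actual positives.
    """
    pairs = list(zip(true_labels, predicted_labels))
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    negatives = sum(1 for t, _ in pairs if not t)
    positives = len(pairs) - negatives
    fp_rate = fp / negatives if negatives else 0.0
    fn_rate = fn / positives if positives else 0.0
    return fp_rate, fn_rate


# Toy data only: 4 TM proteins (1) and 5 globular proteins (0).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1]
fp_rate, fn_rate = error_rates(truth, predicted)
```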
A Novel Topology Predictor
We have developed a new TM topology predictor trained and benchmarked with full
cross-validation on a novel data set of 131 sequences, with topologies derived solely
from crystal structures. The method uses evolutionary information and four support
vector machine (SVM) classifiers, combining the outputs using a dynamic
programming algorithm, to return a list of predicted topologies ranked by overall
likelihood, and incorporates signal peptide and re-entrant helix prediction.
In training the SVMs, PSI-BLAST profiles were generated for each sequence and a
sliding window approach was applied, with values normalised by Z-score to improve
convergence time. Jackknife cross-validation was used to assess SVM
performance, with sequences sharing >25% sequence identity removed from training
sets. Window size and SVM parameters were optimised using the Matthews correlation
coefficient (Table 1).
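The feature extraction described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' code: the profile is a random stand-in for a PSI-BLAST profile, Z-scores are taken per profile column, windows are zero-padded at the sequence ends, and the function names are hypothetical.

```python
import numpy as np


def zscore_columns(profile):
    """Normalise each profile column to zero mean and unit variance (assumed scheme)."""
    mean = profile.mean(axis=0)
    std = profile.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (profile - mean) / std


def sliding_windows(profile, window_size):
    """Return one flattened window of profile rows per residue.

    Positions beyond either end of the sequence are zero-padded (an assumption).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    half = window_size // 2
    length, width = profile.shape
    padded = np.zeros((length + 2 * half, width))
    padded[half:half + length] = profile
    return np.stack([padded[i:i + window_size].ravel() for i in range(length)])


# Toy stand-in for a profile: 50 residues x 20 amino-acid scores.
rng = np.random.default_rng(0)
profile = rng.normal(size=(50, 20))
features = sliding_windows(zscore_columns(profile), window_size=33)
```

Each row of `features` is the input vector for the residue at the centre of its window; a window size of 33 corresponds to the Helix/Loop SVM in Table 1.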
SVM                                  Window Size   Kernel       MCC
Helix/Loop                           33            RBF          0.80
Inside Loop/Outside Loop             35            Polynomial   0.63
Re-entrant Helix/¬Re-entrant Helix   27            RBF          0.34
Signal Peptide/¬Signal Peptide       27            RBF          0.76
Table 1. Per-residue SVM prediction accuracy. MCC: Matthews correlation coefficient.
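For reference, the Matthews correlation coefficient for a binary classifier is computed from the confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A small sketch follows; the function name is illustrative, and returning 0 when the denominator vanishes is a common convention rather than something stated in the poster.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from per-residue confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention (assumed): report 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to +1 (perfect prediction), which makes it more informative than raw accuracy when classes are unbalanced, as they are for re-entrant helix residues.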
A modified version of the original MEMSAT dynamic programming algorithm was
used, treating TM helices as discrete units, rather than separating them into inside,
outside and middle components. Re-entrant helix and signal peptide states were
added. Residues were therefore predicted to lie in one of five different topological
regions: inside loop, outside loop, TM helix, re-entrant helix and signal peptide.
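To illustrate how per-residue scores can be combined into a consistent topology, the sketch below runs a generic Viterbi-style dynamic programme over the five regions. It is not the modified MEMSAT algorithm: helix length limits and the ranked list of alternative topologies are omitted, the internal state names and transition rules are assumptions, and the scores are toy values rather than SVM outputs.

```python
import numpy as np

# Internal states: helix and re-entrant states remember the side they started
# from, so that a TM helix must cross the membrane and a re-entrant helix
# must return to the same side. These names are hypothetical.
STATES = ["S", "i", "o", "Hio", "Hoi", "Ri", "Ro"]
LABEL = {"S": "signal peptide", "i": "inside loop", "o": "outside loop",
         "Hio": "TM helix", "Hoi": "TM helix",
         "Ri": "re-entrant helix", "Ro": "re-entrant helix"}
ALLOWED = {  # assumed transition rules: state -> permitted next states
    "S": {"S", "o"},
    "i": {"i", "Hio", "Ri"},
    "o": {"o", "Hoi", "Ro"},
    "Hio": {"Hio", "o"},
    "Hoi": {"Hoi", "i"},
    "Ri": {"Ri", "i"},
    "Ro": {"Ro", "o"},
}
START = {"S", "i", "o"}  # assumed: a chain starts in a loop or signal peptide


def decode(scores):
    """Return the highest-scoring allowed label path.

    `scores` maps each of the five region labels to a per-residue score array.
    """
    emit = np.array([scores[LABEL[s]] for s in STATES], dtype=float)
    k, n = emit.shape
    best = np.full((k, n), -np.inf)
    back = np.zeros((k, n), dtype=int)
    for idx, state in enumerate(STATES):
        if state in START:
            best[idx, 0] = emit[idx, 0]
    for t in range(1, n):
        for cur_idx, cur in enumerate(STATES):
            for prev_idx, prev in enumerate(STATES):
                if cur not in ALLOWED[prev]:
                    continue
                candidate = best[prev_idx, t - 1] + emit[cur_idx, t]
                if candidate > best[cur_idx, t]:
                    best[cur_idx, t] = candidate
                    back[cur_idx, t] = prev_idx
    # Trace back from the best final state.
    state = int(np.argmax(best[:, n - 1]))
    path = [state]
    for t in range(n - 1, 0, -1):
        state = int(back[state, t])
        path.append(state)
    path.reverse()
    return [LABEL[STATES[s]] for s in path]


# Toy scores: inside loop, then a helix, then outside loop.
n = 12
scores = {label: np.full(n, -1.0) for label in set(LABEL.values())}
scores["inside loop"][:4] = 1.0
scores["TM helix"][4:9] = 1.0
scores["outside loop"][9:] = 1.0
topology = decode(scores)
```

Because each helix state records its starting side, the decoder cannot return an inside-helix-inside path even when the raw scores favour one, which is the kind of consistency a per-residue argmax cannot guarantee.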
The SVM-based method ('TMSVM') was benchmarked against a selection of leading
topology predictors (Table 2), scoring 89% overall accuracy, an improvement of 10
percentage points over the next best method. TMSVM was able to detect signal peptides with 92%
accuracy and re-entrant helices with 44% accuracy, with no false positives predicted.
Figure 3. Ten genomes were filtered using the TM/globular discriminator. Those predicted
to be TM proteins were subjected to full TM topology prediction. X-axis: TM helix count. Y-axis: Number of proteins.
Conclusions
Overall, the method predicted the correct topology and location of TM helices for
89% of the test set, a significant improvement over recent methods. The SVM
trained to discriminate between TM and globular proteins achieved a false positive
rate of 0% and false negative rate of 0.4%, making this method highly suitable for
whole genome analysis (Figure 3). However, there is still room for significant
improvement in the detection of re-entrant helices.
[1] Viklund H., Elofsson A. (2008) OCTOPUS: Improving topology prediction by two-track ANN-based preference scores and an extended topological grammar. Bioinformatics (In press).
[2] Jones D.T. (2007) Improving the accuracy of transmembrane protein topology prediction
using evolutionary information. Bioinformatics 23:538-544.
This work was funded by the Biotechnology and Biological Sciences Research Council