Alpha-helical transmembrane protein fold prediction using residue contacts

Timothy Nugent and David Jones
Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London WC1E 6BT

Despite significant efforts to predict alpha-helical transmembrane protein topology, relatively little attention has been directed towards developing a method to pack the helices together. We present a novel approach that uses predicted lipid exposure, residue contacts, sequence statistics and a force-directed algorithm to find the optimal helix packing arrangement.

Introduction

Alpha-helical transmembrane (TM) proteins constitute roughly 30% of a typical genome and are involved in a wide variety of important biological processes including cell signaling, transport of membrane-impermeable molecules and cell recognition. Many are also prime drug targets, and it has been estimated that more than half of all drugs currently on the market target membrane proteins.

Despite significant efforts to predict TM protein topology, little attention has been directed toward developing a method to pack the helices together. Since the membrane-spanning region is predominantly composed of alpha-helices with a common alignment, this task should, in principle, be easier than predicting the fold of globular proteins. However, structural features including re-entrant, tilted and kinked helices render simple lattice models that may work for regularly packed proteins unable to model the diverse packing arrangements now present in structural databases.

We present a novel method to predict lipid exposure, residue-residue contacts, helix-helix contacts and finally the helical packing arrangement of TM proteins, benchmarked with full cross-validation on a data set of 74 sequences, each containing at least two TM helices, and all with crystal structures available.

Method
Predicting lipid exposure

To predict lipid exposure, a support vector machine (SVM) classifier was trained on PSI-BLAST profile data using a sliding window approach. To label training examples, we used the Coarse Grained Database, a repository of molecular dynamics simulation data that records the proportion of simulation time for which each residue is exposed to lipid. All residues within the membrane that were exposed to lipid for more than half of the simulation time were labelled as positive training examples. Under stringent cross-validation using a jackknife test, we were able to predict lipid exposure for each residue with 70% accuracy and a Matthews correlation coefficient (MCC) of 0.39. This compares favourably with results from the LIPS server when benchmarked on the same test set (62% accuracy, MCC 0.23).

Method                  Contact  Precision  FPR     FNR   MCC    Accuracy
TM Contact Predictor    5.5 Å    0.93       0.015   0.78  0.33   63.2%
TMHCon L5               5.5 Å    0.49       0.37    0.63  0.022  52.3%
TM Contact Predictor    6 Å      0.98       0.0037  0.79  0.35   66.7%
TM Contact Predictor+   6 Å      0.91       0.047   0.34  0.65   82.6%
TMHIT L5                6 Å      0.77       0.12    0.47  0.45   73.2%
TM Contact Predictor    8 Å      0.98       0.0037  0.79  0.36   66.7%
SVMCon                  8 Å      0.57       0.090   0.84  0.11   59.3%
SVMCon L5               8 Å      0.82       0.0074  0.95  0.13   59.5%
ProfCon                 8 Å      0.43       0.83    0.16  0.017  45.4%
ProfCon L5              8 Å      0.72       0.043   0.84  0.19   62.0%

Table 2. Helix-helix interaction results. + No cross-validation on 41 sequences common to the TMHIT training set.

Results show a significant improvement in accuracy and MCC scores against all methods other than TMHIT. TMHIT was, however, trained on 41 sequences that were common to our test set, so its score is likely to be an overestimate, as we were unable to cross-validate TMHIT results. In the absence of cross-validation for these 41 sequences, our method performs substantially better. Full cross-validation on a smaller test set of 14 sequences resulted in scores of 68.4% accuracy for our method and 66.5% for TMHIT.
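
The precision, FPR, FNR, MCC and accuracy values reported for these benchmarks all derive from the four counts of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not the evaluation code used for the benchmark; the function name `binary_metrics` and the example counts are hypothetical, and returning 0.0 for an empty denominator is a convention of this sketch.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return precision, FPR, FNR, MCC and accuracy from confusion counts."""

    def ratio(num, den):
        # Assumption of this sketch: report 0.0 when a denominator is empty.
        return num / den if den else 0.0

    # MCC denominator: geometric mean of the four marginal totals.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": ratio(tp, tp + fp),
        "fpr": ratio(fp, fp + tn),   # false positive rate
        "fnr": ratio(fn, fn + tp),   # false negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical counts for illustration only; not taken from the benchmark.
example = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Because MCC uses all four counts, it stays informative when the positive and negative classes differ in size, which is why it is reported alongside accuracy.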
A graph-based approach to find the optimal helix packing arrangement

To find the optimal helix packing arrangement, the structure is represented as a graph with helices forming vertices and helix-helix interactions forming edges. By employing a force-directed algorithm, the method attempts to minimise edge crossing while maintaining uniform edge length, attributes common in native structures. Once the helix positions are determined, a genetic algorithm is used to rotate all helices such that the sum of predicted residue-residue contact distances is minimised. Results for Halorhodopsin are shown in figure 1.

Predicting residue contacts and helix-helix interactions

Using topology information derived from crystal structures, we analysed interactions between residues on different TM helices. For this study, we only considered interactions within a single chain, rather than between chains. To define residue-residue contacts, and to compare our approach with other methods, three contact definitions were used to label data: (i) backbone/side chain heavy atoms are within 5.5 Å, (ii) C-beta atoms are within 6 Å or the distance between the interacting pair is less than the sum of their VDW radii + 0.6 Å and (iii) C-beta atoms are within 8 Å (C-alpha for glycine).

To predict residue contacts, we also used an SVM classifier. Features included a 7-residue window centred on each residue in the interacting pair (again using PSI-BLAST profiles), predicted lipid exposure for each residue, and a number of sequence-based statistics. These included a binary vector representing the distance between residues and values representing the relative position of residues in each TM helix, equivalent to a Z coordinate. SVM training files were roughly balanced, producing a positive/negative ratio of 1:1.25. Results are shown in table 1.
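
The force-directed placement described in the graph-based approach above can be illustrated with a small spring embedder. The text does not say which force-directed algorithm was used, so this sketch assumes Fruchterman-Reingold-style force laws (repulsion k²/d between all vertices, attraction d²/k along edges) with a simple damped, capped update instead of a cooling schedule. The function name, parameters and the seven-helix edge list are hypothetical, and the genetic-algorithm rotation step is not included.

```python
import math


def force_directed_layout(n_nodes, edges, ideal_len=10.0, rate=0.05, iterations=500):
    """Place graph vertices in 2D using edge attraction and pairwise repulsion."""
    # Deterministic start: vertices evenly spaced on a circle of radius ideal_len.
    pos = [
        [ideal_len * math.cos(2 * math.pi * i / n_nodes),
         ideal_len * math.sin(2 * math.pi * i / n_nodes)]
        for i in range(n_nodes)
    ]
    for _ in range(iterations):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsion between every pair of vertices keeps helices apart.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx = pos[i][0] - pos[j][0]
                dy = pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = ideal_len ** 2 / dist
                disp[i][0] += force * dx / dist
                disp[i][1] += force * dy / dist
                disp[j][0] -= force * dx / dist
                disp[j][1] -= force * dy / dist
        # Attraction along edges pulls interacting helices together.
        for i, j in edges:
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist ** 2 / ideal_len
            disp[i][0] -= force * dx / dist
            disp[i][1] -= force * dy / dist
            disp[j][0] += force * dx / dist
            disp[j][1] += force * dy / dist
        # Damped update; each move is capped at ideal_len to avoid blow-ups.
        for i in range(n_nodes):
            step_x = rate * disp[i][0]
            step_y = rate * disp[i][1]
            length = math.hypot(step_x, step_y)
            if length > ideal_len:
                step_x *= ideal_len / length
                step_y *= ideal_len / length
            pos[i][0] += step_x
            pos[i][1] += step_y
    return pos


# Hypothetical helix-helix edges for a seven-helix bundle (illustration only,
# not the observed Halorhodopsin interactions).
helix_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (0, 6), (1, 6)]
for helix, (x, y) in enumerate(force_directed_layout(7, helix_edges), start=1):
    print(f"helix {helix}: x={x:.2f}, y={y:.2f}")
```

With these force laws, two connected vertices in isolation settle at a separation of `ideal_len`, which is what drives edge lengths towards uniformity; repulsion from non-neighbouring vertices stretches edges somewhat in larger graphs.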
Method                Contact  Precision  FPR     FNR   MCC
TM Contact Predictor  5.5 Å    0.65       0.0014  0.86  0.29
TMHCon L5             5.5 Å    0.089      0.0021  0.99  0.024
TM Contact Predictor  6 Å      0.64       0.001   0.86  0.30
TMHIT L5              6 Å      0.57       0.0012  0.88  0.26
TM Contact Predictor  8 Å      0.64       0.0017  0.85  0.30
SVMCon                8 Å      0.062      0.0083  0.97  0.029
SVMCon L5             8 Å      0.086      0.0003  1.00  0.0089
ProfCon               8 Å      0.025      0.46    0.41  0.035
ProfCon L5            8 Å      0.057      0.0018  0.99  0.012

Table 1. Residue-residue contact prediction results. L5 = top L/5 scoring predictions assessed, where L is the combined length of all TM helices. Our method is labelled 'TM Contact Predictor'.

In order to assess helix-helix interactions, one pair of residue-residue contacts was required to be correctly predicted. Our data set therefore contained 593 interacting helices and 815 non-interacting helices. Helix-helix prediction results are shown in table 2.

Figure 1. Predicted packing arrangement for Halorhodopsin [PDB: 1E1K] (above) and PDB structure with observed helix-helix interactions labelled (below).

Conclusions

Our results demonstrate that the use of predicted lipid exposure data combined with evolutionary information and sequence-based statistics can be used to accurately predict the packing arrangement of TM proteins. This method can be used to gain insights into TM protein folding and direct further experimental work.

This work was funded by the Biotechnology and Biological Sciences Research Council, and supported by the BioSapiens project, which is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LSHG-CT-2003-503265.