V6-SecondaryStructur.. - Chair of Computational Biology

V6 – Secondary Structure of TM proteins
suggested reading for this lecture:
Appl. Bioinf. 1, 21 (2002)
Introduction
Prediction of secondary structure elements
Performance on test sets
Membrane Bioinformatics – Part II
V6 SS 2006
Introduction
Membrane proteins are crucial for survival:
- they are key components for cell-cell signaling
- they mediate the transport of ions and solutes across the membrane
- they are crucial for recognition of self.
The pharmaceutical industry preferably targets membrane-bound receptors.
Particularly important: large super-family of G protein-coupled receptors (GPCRs)
- receptors for hormones, neurotransmitters, growth factors, light and
odor-related ligands.
More than 50% of prescription drugs act on GPCRs.
Topology of Membrane Proteins
Inside the lipid bilayer, the protein backbone cannot form hydrogen bonds with the
aliphatic chains of the phospholipid molecules
→ the backbone atoms need to form H-bonds among each other.
→ they adopt either α-helical or β-sheet conformations.
Topology of Membrane Proteins
http://www.biologie.uni-konstanz.de/folding/Structure%20gallery%201.html
History of membrane protein structure determination
1984  bacterial reaction center (Nobel Prize to Michel, Deisenhofer, Huber, 1988)
1990  EM map of bacteriorhodopsin (Henderson); 1997 high-resolution structure by Lücke; now several intermediates of the photocycle
1992  porin (complete β-barrel)
1998  halorhodopsin
1995  cytochrome c oxidase
1998  F1-ATPase (Nobel Prize to John Walker, 1997)
1998  KcsA ion channel (Nobel Prize to Roderick MacKinnon, 2003)
2000  aquaporin
2000  rhodopsin (Palczewski)
2002  SERCA Ca2+ ATPase (Toyoshima)
2003  voltage-gated ion channel
2005  Na+/H+ antiporter (Hunte)
Lipid bilayer simplifies the prediction problem
TM proteins are forced into two classes: α-helical or β-sheet.
α-helices are typically tilted with respect to the membrane normal
by 10 – 45°.
The hydrophobic lipid bilayer reduces the three-dimensional structure formation
almost to a 2D problem.
Predicting TM helix location
Hydrophobicity scales provide simple criteria to predict membrane helices.
TMH can be predicted based on the distinctive patterns of hydrophobic (TM) and
polar (non-membrane or water-soluble) regions within the sequence.
Observed patterns:
(1) TM helices are predominantly apolar and 12-35 residues long.
(2) Globular regions between TMH are typically shorter than 60 residues.
(3) Most TMH proteins have a specific distribution of the positively charged amino
acids arginine and lysine, the "positive-inside rule" (Gunnar von Heijne).
Connecting "loop" regions on the inside of the membrane have more positive
charges than "loop" regions on the outside.
(4) Long globular regions (> 60 residues) differ in their composition from those
globular regions subject to the "inside-out rule".
Kyte-Doolittle hydrophobicity scale (1982)
Assign a hydropathy value to each amino acid.
Use a sliding window to identify membrane regions.
Sum the hydropathy values over all w residues in the window of length w.
Use a threshold T to assign a segment as a predicted membrane helix.
A window of w = 19 residues discriminated best between membrane and globular proteins.
A threshold of T > 1.6 was suggested for the average over 19 residues.
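The sliding-window procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the original program: the function names are made up for this example, the scale values are the commonly tabulated Kyte-Doolittle hydropathy values (check them against the 1982 paper before relying on them), and the window is scored by its mean so that the T > 1.6 threshold applies directly.

```python
# Kyte-Doolittle hydropathy values as commonly tabulated (verify against the 1982 paper).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, w=19):
    """Mean hydropathy of every full window of length w along the sequence."""
    values = [KD[aa] for aa in seq]
    return [sum(values[i:i + w]) / w for i in range(len(values) - w + 1)]


def predict_tm_windows(seq, w=19, threshold=1.6):
    """0-based start positions of windows whose mean hydropathy exceeds the threshold."""
    return [i for i, h in enumerate(hydropathy_profile(seq, w)) if h > threshold]


if __name__ == "__main__":
    # Hypothetical toy sequence: a poly-Leu stretch flanked by charged residues.
    toy = "D" * 10 + "L" * 19 + "K" * 10
    print(predict_tm_windows(toy))
```

Overlapping windows above the threshold would normally be merged into one predicted helix; that step is left out here to keep the sketch short.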
More refined indices
One drawback of pure hydropathy-based methods is that they fail to discriminate
accurately between membrane regions and highly hydrophobic globular segments.
PRED-TMR algorithm: combine with propensities of finding certain amino acid
residues at the termini of TM helices.
Other hydrophobicity scales:
- Wimley & White: based on partition experiments of peptides between water/lipid bilayer and water/octanol
- TMFinder (Liu & Deber scale): based on HPLC retention times of peptides and on their non-polar phase helicity.
http://blanco.biomol.uci.edu/hydrophobicity_scales.html
Folding of helical membrane proteins
White, FEBS Lett. 555, 116 (2003)
Hydrophobicity Scales
White, FEBS Lett. 555, 116 (2003)
Translocon-assisted folding of TM proteins?
Upper picture (model!):
the newly synthesized polypeptide chain of a membrane protein is inserted from the ribosome into the membrane via interaction with a TM complex, the “translocon” (EM map shown).
Lower picture:
experiment largely supports the concerted view.
What determines insertion into the membrane?
White, FEBS Lett. 555, 116 (2003)
Integration of H-segments into the microsomal membrane
Ingenious experiment! Introduce marker that shows whether helix segment H
is inserted into membrane or not.
a, Wild-type Lep has two N-terminal TM segments (TM1 and TM2) and a
large luminal domain (P2). H-segments were inserted between residues 226
and 253 in the P2-domain. Glycosylation acceptor sites (G1 and G2) were
placed in positions 96–98 and 258–260, flanking the H-segment. For H-segments that integrate into the membrane, only the G1 site is glycosylated
(left), whereas both the G1 and G2 sites are glycosylated for H-segments that
do not integrate in the membrane (right).
b, Membrane integration of H-segments with the
Leu/Ala composition 2L/17A, 3L/16A and 4L/15A. Bands
of unglycosylated protein are indicated by a white dot;
singly and doubly glycosylated proteins are indicated by
one and two black dots, respectively.
Hessa et al., Nature 433, 377 (2005)
Insertion determined by simple physical chemistry
Measure the fraction of singly glycosylated (f1g) vs. doubly glycosylated (f2g) Lep molecules:
$$p = \frac{f_{1g}}{f_{1g} + f_{2g}}, \qquad K_{app} = \frac{f_{1g}}{f_{2g}}, \qquad \Delta G_{app} = -RT \ln K_{app}$$
c, ΔG_app values for H-segments with 2–4 Leu residues.
Individual points for a given n show ΔG_app values obtained when the position of Leu is changed.
d, Mean probability of insertion (p) for H-segments with n = 0–7 Leu residues.
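As a worked illustration of how the three quantities on this slide (p, K_app and the apparent free energy) relate to the measured glycosylation fractions, here is a small Python sketch. The function name, the choice of kcal/mol as the unit and the default temperature of 298.15 K are assumptions made for this example, not values taken from the paper.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def insertion_stats(f1g, f2g, temperature=298.15):
    """Return (p, K_app, dG_app) from the fractions of singly (f1g) and
    doubly (f2g) glycosylated molecules. f2g must be > 0 and f1g > 0.
    The temperature (in K) is an assumed default, not a value from the source."""
    p = f1g / (f1g + f2g)            # probability of membrane insertion
    k_app = f1g / f2g                # apparent equilibrium constant
    dg_app = -R_KCAL * temperature * math.log(k_app)  # apparent free energy
    return p, k_app, dg_app


if __name__ == "__main__":
    # Hypothetical measurement: 90% singly, 10% doubly glycosylated.
    p, k_app, dg_app = insertion_stats(0.9, 0.1)
    print(f"p = {p:.2f}, K_app = {k_app:.1f}, dG_app = {dg_app:.2f} kcal/mol")
```

Equal fractions give p = 0.5, K_app = 1 and an apparent free energy of zero; a segment that inserts more often than not has a negative value.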
Hessa et al., Nature 433, 377 (2005)
Biological and biophysical ΔG^aa scales
a, ΔG_app^aa scale derived from H-segments with the indicated amino acid placed in
the middle of the 19-residue hydrophobic stretch.
Only Ile, Leu, Phe and Val really favor membrane insertion. All polar and charged
residues are strongly unfavorable.
b, Correlation between ΔG_app^aa values measured in vivo and in vitro.
c, Correlation between ΔG_app^aa and the Wimley–White water/octanol free
energy scale for partitioning of peptides.
Hessa et al., Nature 433, 377 (2005)
Positional dependencies in ΔG_app
Tyr and Trp are favorable in the interface region.
a, Symmetrical H-segment scans with pairs of Leu (red), Phe (green), Trp (pink) or Tyr (light blue)
residues. The Leu scan is based on symmetrical 3L/16A H-segments with a Leu-Leu separation of one
residue (sequence shown at the top; the two red Leu residues are moved symmetrically outwards) up to
a separation of 17 residues. For the Phe scan, the composition of the central 19 residues of the H-segments is 2F/1L/16A, for the Trp scan it is 2W/2L/15A, and for the Tyr scan it is 2Y/3L/14A. The ΔG_app
value for the 4L/15A H-segment GGPGAAALAALAAAAALAALAAAGPGG is also shown (dark blue).
b, Red lines show ΔG_app values for symmetrical scans of 2L/17A (triangles), 3L/16A (circles), and
4L/15A (squares) H-segments.
c, Same as b but for a symmetrical scan with pairs of Ser residues in H-segments with the composition
2S/4L/13A.
Hessa et al., Nature 433, 377 (2005)
Using observed amino acid propensities
With availability of more and more 3D structures, it became possible to train
statistical approaches based on the observed frequencies of amino acids in
membrane proteins vs. non-membrane proteins.
The concept is similar to that used in secondary structure prediction for globular proteins.
TMpred: uses statistical amino acid preferences for scoring
SPLIT (Juretic et al.):
- uses amino acid preferences for the "state" membrane helix, derived from a data set
of integral membrane proteins with partially known secondary structure
- combines these with preferences for β-strand, turn and non-regular secondary structure
based on sets of soluble proteins with known structure.
This method can identify shorter, unstable or movable membrane helices.
Incorporating more information: TopPred
TopPred (von Heijne 1992)
predicts the complete topology of membrane proteins by using
- hydrophobicity analysis
- automatic generation of possible topologies
- ranking these topologies by the positive-inside rule.
TopPred uses a particular sliding trapezoid window to detect segments of
outstanding hydrophobicity.
The two bases of the trapezoid are 11 and 21 residues long.
TopPred decides which candidate segments to count as TM helices by choosing the
combination that yields the largest difference between the number of positively
charged residues on the inside and on the outside.
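The two ingredients named on this slide, a trapezoid window and ranking by the positive-inside rule, can be sketched as follows. This is an illustration under stated assumptions rather than the TopPred implementation: the linear ramp of the flank weights, the normalisation by the weight sum, the function names and the rule of counting only Lys and Arg in every loop are all choices made for this example. The hydrophobicity scale is passed in by the caller.

```python
def trapezoid_weights(core=11, full=21):
    """Weights of a trapezoid window: 1.0 over the short base (core) and a
    linear ramp on each flank up to the long base (full). Ramp shape is assumed."""
    flank = (full - core) // 2
    ramp = [(k + 1) / (flank + 1) for k in range(flank)]
    return ramp + [1.0] * core + ramp[::-1]


def trapezoid_profile(seq, scale, core=11, full=21):
    """Weighted mean hydrophobicity for every full window along the sequence."""
    weights = trapezoid_weights(core, full)
    total = sum(weights)
    values = [scale[aa] for aa in seq]
    return [
        sum(w * v for w, v in zip(weights, values[i:i + full])) / total
        for i in range(len(values) - full + 1)
    ]


def n_terminus_side(seq, helices):
    """Apply the positive-inside rule to a fixed set of TM helices.
    helices: sorted list of (start, end) 0-based half-open segments.
    Returns 'in' if the N-terminal loop is predicted to be inside, else 'out'."""
    bounds = [0] + [x for helix in helices for x in helix] + [len(seq)]
    loops = [seq[bounds[i]:bounds[i + 1]] for i in range(0, len(bounds), 2)]
    positives = [loop.count("K") + loop.count("R") for loop in loops]
    same_side_as_n_term = sum(positives[0::2])  # loops alternate sides
    opposite_side = sum(positives[1::2])
    return "in" if same_side_as_n_term >= opposite_side else "out"
```

For example, `n_terminus_side("KKRR" + "L" * 20 + "DDEE", [(4, 24)])` places the positively charged N-terminal loop on the inside.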
Improvements from dynamic programming: MEMSAT
MEMSAT (1994) implemented statistical tables (log likelihoods) compiled from
well-characterized TM proteins and a dynamic programming algorithm to recognize
membrane topology models by expectation maximisation.
Residues are classified as being in one of 5 structural states:
Li  inside loop
Lo  outside loop
Hi  inside helix end
Hm  helix middle
Ho  outside helix end
Helix end caps are defined to span 4 adjacent residues (one helical turn).
Compile propensities of amino acids for the 5 states.
Calculate a score relating a given sequence to a predicted topology.
Dynamic programming guarantees that the optimal score is found.
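To make the five-state labelling and the scoring step concrete, here is a minimal Python sketch. It only labels and scores one given topology; it does not reproduce the dynamic programming search that finds the best topology. The function names and the nested-dictionary layout of the log-likelihood table are assumptions for this example, and any table values used with it are hypothetical.

```python
CAP = 4  # helix end caps span 4 residues (one helical turn)


def label_states(length, helices, n_term_inside=True):
    """Assign Li, Lo, Hi, Hm or Ho to every residue of a sequence.
    helices: sorted, non-overlapping (start, end) 0-based half-open segments,
    each at least 2 * CAP residues long."""
    states = []
    inside = n_term_inside
    pos = 0
    for start, end in helices:
        states += ["Li" if inside else "Lo"] * (start - pos)
        first_cap, last_cap = ("Hi", "Ho") if inside else ("Ho", "Hi")
        middle = (end - start) - 2 * CAP
        states += [first_cap] * CAP + ["Hm"] * middle + [last_cap] * CAP
        inside = not inside  # each helix crosses the membrane once
        pos = end
    states += ["Li" if inside else "Lo"] * (length - pos)
    return states


def topology_score(seq, states, log_odds):
    """Sum of per-residue log-likelihoods; missing table entries count as 0."""
    return sum(log_odds.get(s, {}).get(aa, 0.0) for s, aa in zip(states, seq))


if __name__ == "__main__":
    # Hypothetical toy table: Lys favoured in inside loops, Leu in the helix middle.
    table = {"Li": {"K": 1.0}, "Hm": {"L": 0.5}}
    seq = "K" * 5 + "L" * 20 + "D" * 5
    for inside in (True, False):
        states = label_states(len(seq), [(5, 25)], n_term_inside=inside)
        print(inside, topology_score(seq, states, table))
```

Comparing the two orientations of the same helix set, as in the usage lines, is the simplest form of topology ranking; a full search would also vary the number and position of helices.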
Using evolutionary information
It is known from predicting secondary structures of globular proteins that using
multiple sequence alignment information improves prediction accuracy
significantly.
PHDtm: predict location and topology of TM helices by a system of neural
networks.
It was later combined with dynamic programming.
Using evolutionary information
TMAP (1996):
uses propensity values determined for segments of 21 consecutive residues in
transmembrane segments (Pm),
and for the flanking 4-residue caps of TM helices (Pe).
Residues with high Pm tend to be hydrophobic;
residues with high Pe tend to be polar and basic.
Compute compositional difference in the protein segments exposed to the
two surfaces of a membrane for 12 important residues:
mostly at the outside of membranes: Asn, Asp, Gly, Phe, Pro, Trp, Tyr, Val
mostly inside: Ala, Arg, Cys, Lys.
Use consensus over these 12 residues to predict topology.
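The consensus step can be illustrated with a simple vote over the 12 residue types listed above. This is a sketch of the idea only: comparing raw frequencies between the two faces, one vote per residue type, and the function name are assumptions for this example and not the published TMAP procedure.

```python
OUTSIDE = set("NDGFPWYV")  # Asn, Asp, Gly, Phe, Pro, Trp, Tyr, Val
INSIDE = set("ARCK")       # Ala, Arg, Cys, Lys


def vote_side_a_outside(side_a, side_b):
    """side_a, side_b: concatenated loop sequences from the two membrane faces.
    Each of the 12 residue types casts one vote, based on which face it is
    enriched on. Returns (votes for 'side A is outside', votes against)."""
    votes_for = votes_against = 0
    for aa in sorted(OUTSIDE | INSIDE):
        freq_a = side_a.count(aa) / max(len(side_a), 1)
        freq_b = side_b.count(aa) / max(len(side_b), 1)
        if freq_a == freq_b:
            continue  # no preference, no vote
        enriched_on_a = freq_a > freq_b
        # An outside-type residue enriched on A, or an inside-type residue
        # enriched on B, both support "A is outside".
        if enriched_on_a == (aa in OUTSIDE):
            votes_for += 1
        else:
            votes_against += 1
    return votes_for, votes_against
```

A topology would then be chosen by whichever count is larger, for instance `vote_side_a_outside("NDGFPWYV", "ARCK")` gives a unanimous vote for side A being outside.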
Using grammatical rules
The lipid bilayer constrains the structure of the membrane-passing regions of
proteins in many ways.
TMHMM (Sonnhammer et al. 1998, Krogh et al. 2001) and HMMTOP (Tusnady &
Simon 1998, 2001) implement Hidden Markov Models.
TMHMM: uses a cyclic model with 7 states for
- TM helix core
- TM helix caps on the N- and C-terminal side
- non-membrane region on the cytoplasmic side
- 2 non-membrane regions on the non-cytoplasmic side (for short and long loops, to
account for different membrane insertion mechanisms)
- a globular domain state in the middle of each non-membrane region
Using grammatical rules
HMMTOP: uses a hidden Markov model distinguishing 5 structural states
- inside non-membrane regions
- inside TMH-cap
- membrane helix
- outside TMH-cap
- outside non-membrane region
This model is similar to MEMSAT.
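How a cyclic hidden Markov model with these five structural states turns a sequence into a topology can be sketched with a small Viterbi decoder. Everything numeric here is hypothetical: the three-letter residue classes, the emission and transition probabilities and the duplication of the cap and helix states to keep track of the crossing direction are assumptions for this example, not the parameters or architecture of HMMTOP or TMHMM.

```python
import math

HYDROPHOBIC, POSITIVE = set("AILMFVWC"), set("KR")


def residue_class(aa):
    """Reduce the alphabet to hydrophobic (h), positive (+) and other (p)."""
    return "h" if aa in HYDROPHOBIC else "+" if aa in POSITIVE else "p"


# Hypothetical emission probabilities for the 5 structural states.
EMIT = {
    "inside loop": {"h": 0.2, "+": 0.4, "p": 0.4},
    "inside cap": {"h": 0.4, "+": 0.2, "p": 0.4},
    "helix": {"h": 0.9, "+": 0.02, "p": 0.08},
    "outside cap": {"h": 0.4, "+": 0.2, "p": 0.4},
    "outside loop": {"h": 0.2, "+": 0.1, "p": 0.7},
}
# Hidden states in cyclic order (in -> out -> in); caps and helix appear twice
# so that a helix always connects opposite sides of the membrane.
CYCLE = ["inside loop", "inside cap", "helix", "outside cap",
         "outside loop", "outside cap", "helix", "inside cap"]
# Hypothetical probability of staying in a state; otherwise move to the next one.
STAY = {"inside loop": 0.9, "outside loop": 0.9, "helix": 0.9,
        "inside cap": 0.5, "outside cap": 0.5}


def viterbi(seq):
    """Most probable structural-state label for each residue of a non-empty sequence."""
    n = len(CYCLE)

    def emit(k, aa):
        return math.log(EMIT[CYCLE[k]][residue_class(aa)])

    # A protein may start in either loop state, with equal probability.
    score = [math.log(0.5) + emit(k, seq[0]) if CYCLE[k].endswith("loop")
             else -math.inf for k in range(n)]
    back = []
    for aa in seq[1:]:
        new_score, pointers = [], []
        for k in range(n):
            prev = (k - 1) % n
            stay = score[k] + math.log(STAY[CYCLE[k]])
            move = score[prev] + math.log(1 - STAY[CYCLE[prev]])
            best, source = (stay, k) if stay >= move else (move, prev)
            new_score.append(best + emit(k, aa))
            pointers.append(source)
        score = new_score
        back.append(pointers)
    k = max(range(n), key=lambda s: score[s])
    path = [k]
    for pointers in reversed(back):  # trace the best path backwards
        k = pointers[k]
        path.append(k)
    return [CYCLE[k] for k in reversed(path)]
```

Real models additionally constrain helix lengths and learn their parameters from known structures; the sketch leaves both out.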
Availability of prediction methods.
Many of these servers are also available through a Meta-Server META-PP at the
site of Burkhard Rost.
Prediction accuracy
Often, authors claimed that their methods are > 90% accurate.
However, Chen and Rost claim that most authors have significantly overestimated
the accuracy of their methods.
(1) there are not enough high-resolution structures to allow a statistically
significant analysis.
Training and test sets may share members or contain homologous proteins.
Using low-resolution experiments, e.g. gene fusion, is no workaround:
low-resolution experiments differ from high-resolution structures almost as much
as prediction methods do.
(2) All methods optimise some parameters.
Methods perform much better on proteins for which they were developed than on
new proteins.
Prediction accuracy
(3) Methods using evolutionary information failed due to the surprising fact that
membrane helices are not entirely conserved across species.
This is surprising since it implies either that those proteins do not perform similar
cellular functions, e.g. GPCRs, or that the function can actually be realized with a
different number of membrane regions in some cases.
(4) Levels of prediction accuracy between methods can often not be compared
appropriately to one another since they are frequently based on different
measures for prediction accuracy and on different data sets.
Most methods get number of helices right
All methods based on advanced algorithms tend to underestimate TM helices %obs > %prd.
a Data set: Sequence-unique subset of 36 high-resolution TM helical proteins from PDB. This is the largest subset of all 105 high-resolution membrane chains which fulfils the condition that no pair in the set has significant sequence similarity as defined in Rost (1999).
b Methods
c Per-segment accuracy: Qok, percentage of proteins for which all TM helices are predicted correctly (allowed deviation of up to 3 residues); Q%obs_htm, percentage of all observed helices that are correctly predicted; Q%prd_htm, percentage of all predicted helices that are correctly predicted; TOPO, percentage of proteins for which the topology (orientation of helices) is correctly predicted (empty for methods that do not predict topology).
d Per-residue accuracy: Q2, percentage of correctly predicted residues in two states (membrane helix / non-membrane helix); Q%obs_2T, percentage of all observed TMH residues that are correctly predicted; Q%prd_2T, percentage of all predicted TMH residues that are correctly predicted; Q%obs_2N, percentage of all observed non-TMH residues that are correctly predicted; Q%prd_2N, percentage of all predicted non-TMH residues that are correctly predicted.
e ERROR: the estimates for per-segment accuracy resulted from a bootstrap experiment with M = 100 and K = 18; the estimates for per-residue accuracy were obtained by standard deviations over Gaussian distributions for the respective score.
f Numbers in italics: two standard deviations below the numerically highest value in each column (set in bold letters).
NOTE: all methods are tested on the same set of proteins. However, the numbers are NOT from a cross-validation experiment, i.e. some methods may have used some of the proteins for training. Generally, newer methods are more likely to be overestimated than older ones. In particular, HMMTOP2, TMHMM1, and WW have been developed using ALL the proteins listed here.
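The per-residue measures defined in footnote d can be computed directly from an observed and a predicted two-state string. The sketch below assumes a hypothetical encoding with "T" for a TM helix residue and "N" for any other residue; the function name and the dictionary keys are made up for this example.

```python
def per_residue_accuracy(observed, predicted):
    """Two-state per-residue accuracy in percent.
    observed, predicted: equal-length strings of 'T' (TM helix) and 'N' (not)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    pairs = list(zip(observed, predicted))
    both_t = sum(o == "T" and p == "T" for o, p in pairs)
    both_n = sum(o == "N" and p == "N" for o, p in pairs)

    def pct(part, whole):
        return 100.0 * part / whole if whole else float("nan")

    return {
        "Q2": pct(both_t + both_n, len(pairs)),        # all residues correct
        "Q2T_obs": pct(both_t, observed.count("T")),   # observed TMH residues found
        "Q2T_prd": pct(both_t, predicted.count("T")),  # predicted TMH residues right
        "Q2N_obs": pct(both_n, observed.count("N")),
        "Q2N_prd": pct(both_n, predicted.count("N")),
    }


if __name__ == "__main__":
    # Hypothetical example: the prediction finds only half of the observed helix.
    print(per_residue_accuracy("NNTTTTNN", "NNNNTTNN"))
```

In the usage example the predicted helix is too short, so the observed-based score is lower than the predicted-based score.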
Prediction accuracy
About 86% of the TMH residues predicted by the best methods are correctly
predicted.
Assume that we consider a prediction of a membrane helix correct if the predicted
and the observed helical regions differ by less than 3 residues.
→ the best current methods correctly predict all membrane helices for 70 – 75% of
all proteins.
However, the topology is predicted correctly for only about half of all proteins.
The best method, HMMTOP2, had all proteins listed in its training set.
Simple hydrophobicity scales are less accurate than advanced methods.
All methods confuse TM helices with signal peptides
Signal peptides that are cleaved off secreted proteins usually contain stretches of
hydrophobic residues resembling membrane helices.
The most accurate specialists for membrane prediction (TMHMM and PHDhtm)
falsely predict about 30 – 40% of all signal peptides as TM helices.
Simple hydrophobicity scales predict more than 90% of the signal peptides as TM
helices.
Many methods predict TM helices in globular proteins
Simple hydrophobicity scales reach levels close to 100% false positives.
Advanced methods (SOSUI, TMHMM1, PHDhtm) predict TM helices in less than
2% of all globular proteins.
Different methods predict similar numbers of TM proteins in genomes:
about 10 – 30%.
The overall content of TM proteins in genomes of different complexity is similar.
However, eukaryotes have significantly more proteins with > 10 TM helices than
all other species.
Also, the distribution is different:
eukaryotes have more 7TM proteins (receptors),
prokaryotes have more 6TM and 12TM proteins (ABC transporters).
Future directions
Meta servers yield improved predictions.
> 90% correct topologies can be obtained by a simple majority vote between the
results of various methods.
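A simple majority vote of the kind mentioned here can be sketched as follows. The encoding of a topology as a short string (N-terminus side plus helix count) and the method names in the usage example are hypothetical; a real meta server would also have to reconcile helix boundaries, which this sketch ignores.

```python
from collections import Counter


def majority_topology(predictions):
    """predictions: dict mapping method name -> predicted topology string.
    Returns (winning topology, fraction of methods that voted for it).
    Ties are resolved in favour of the topology that was seen first."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions.values())
    topology, votes = counts.most_common(1)[0]
    return topology, votes / len(predictions)


if __name__ == "__main__":
    # Hypothetical outputs of three methods for one protein.
    votes = {"method_a": "in:7", "method_b": "in:7", "method_c": "out:7"}
    print(majority_topology(votes))
```

The returned vote share can serve as a rough confidence estimate: a unanimous vote is more trustworthy than a narrow majority.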
TM helix prediction and signal peptide prediction should be combined.
Useful: databases for particular families of TM proteins and sequence motifs,
e.g. GPCR database.
Membrane-specific substitution matrices improve database searches,
e.g. PHAT by Henikoff & Henikoff improved alignments of TM proteins.
Summary
TM helices are typically continuous stretches of mostly hydrophobic residues.
Simple methods based on summing up hydrophobicities work acceptably but not really
well.
Advanced methods include additional features such as the "positive-inside rule".
The currently most successful methods are based on Hidden Markov Models or
Neural Networks.
Evaluating performance accuracy should be done using carefully separated
training and test sets.
It is possible to discriminate signal peptides and TM helices.
Only Split 4.0 may detect short non-membrane spanning helices.