Supplementary data 1: The 16 Protein Blocks (PBs)

advertisement
Supplementary data 1: The 16 Protein Blocks (PBs)
From left to right and top to bottom, the 16 Protein Blocks (de Brevern et al., 2000),
labelled from a to p, are displayed using DINO (Philippsen, 2003). They correspond to 5residue fragments, defined by 8 dihedral angles ( and ). For each PB, the N-cap is on the
reader's left and the C-cap on the right. They were obtained by an unsupervised classifier
similar to Kohonen Maps (Kohonen, 1982, 2001) and Hidden Markov Models (Rabiner,
1989).
de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting
backbone structures in terms of protein blocks. Proteins 2000;41:271-287.
Ansgar Philippsen. DINO: Visualizing structural biology (2003) (http://www.dino3D.org).
Kohonen T. Self-organizing formation of topologically correct feature maps. Biol Cybernet
1982;43:59-69.
Kohonen T. Self-Organizing Maps, 3rd edition. 2001;Springer-Verlag, Berlin, Germany.
Rabiner LR. A tutorial on Hidden Markov Models and selected applications in speech
recognition. Proc. of the IEEE 1989;77:257-285.
Supplementary data 2: Hybrid Protein Model (HPM)
The main principles of HPM (de Brevern and Hazout, 2001, 2003; Benros et al., 2003)
are summarized in the following sections: (i) Hybrid Protein (HP) topology, (ii) HP
initialization, (iii) training strategy.
(i) HP topology. The Hybrid Protein corresponds to a self-organizing neural network.
Its topology is a ring of N neurons or clusters (Figure a), which is represented by a matrix of
PB probability distributions of dimensions 16 x N (the structural alphabet (de Brevern et al.,
2000) is composed of 16 PBs; cf. Figure b). Each site s (i.e., column, varying from 1 to N) of
the matrix corresponds to the vector of the frequencies of the different PBs in site s, noted
Fs(PB). A neuron centred in position s is defined by L successive probability distributions (in
this study, L = 2w + 1 = 7) located in positions (s - w) to (s + w). Two consecutive neurons
overlap and have (L-1) probability distributions in common.
(ii) HP initialization. The Hybrid Protein matrix is initialized with N column vectors
corresponding to the reference frequencies of the 16 PBs, that is, their frequencies as observed
in the databank (noted FR(PB)), modified by adding a weak random noise :
Fs ( PB )  FR ( PB )(1   ) .
The value of is randomly drawn within the range [-0.10; +0.10]. The probability sum is then
readjusted to 1 at each site s.
(iii) HPM training. The HPM training relies on the same concept of competition as the
“Self-Organizing Maps” (SOM; Kohonen, 1982, 2001). As in the SOM method, the training
is iterative: several cycles are necessary to stabilize the Hybrid Protein matrix, that is, obtain
the PB probability distributions. A cycle is carried out when the entire fragment training
databank is presented to the Hybrid Protein. Moreover, since the different neurons overlap,
the modification of the PB distributions associated with the winner (that is, the neuron
selected by competition) influences the neurons located in its neighbourhood. Thus,
information is implicitly diffused.
The learning of a given encoded protein fragment F: {PBx} (x varying from –w to +w)
is a two-step procedure, with an identification step and a local enrichment step. For each
fragment F taken randomly from the databank, the identification step consists of searching for
the most probable neuron in the HPM library (Figure c). For this purpose, a score
corresponding to a logarithm of likelihood ratio is computed at each site s (s varying from 1 to
N) along the Hybrid Protein matrix, as follows:
Sc( s ) 
x w
 Fs  x ( PBx ) 
.
F
(
PB
x
)
 R

 ln 
x w
PBx corresponds to the protein block located in position x in fragment F. Fs+x(PBx) is the
frequency of PBx in position (s + x) in the HP matrix. FR(PBx) is the observed frequency of
PBx in the databank. This score measures the compatibility of fragment F with a given neuron
centred in position s and represented by L successive PB probability distributions. Fragment F
is assigned to the winner, that is, the neuron associated with the maximum score: Scmax =
max[Sc(s)]. It is centred in position sopt of the HP matrix. In the local enrichment step (Figure
d), the L PB probability distributions of the winner are slightly modified to increase its
likeness to the protein fragment F presented. This procedure is applied to HP matrix sites
from (sopt - w) to (sopt + w). For the protein block PBx observed in position x in the protein
fragment F and located at position (sopt + x) in the HP matrix, we increase its frequency:
Fsopt  x (PBx ) 
Fsopt  x (PBx )  
1 
and decrease the frequencies of the other 15 PBs:
Fsopt  x (PB) 
Fsopt  x (PB)
1 
.
These equations allow us to keep the frequency values within the range [0; 1] and the sum per
site equal to 1. As in the SOM method, the training parameter  decreases during the training
according to the equation:  = 0/(1 + t/T), where t denotes the number of protein fragments
already presented to the Hybrid Protein and T the total number of fragments in the training
databank. The initial training parameter 0 is set to 0.2 in this study.
de Brevern AG, Hazout S. Compacting local protein folds with a Hybrid Protein Model.
Theor Chem Acc 2001;106(1/2):36-47.
de Brevern AG, Hazout S. Hybrid Protein Model for optimally defining 3D protein structure
fragments. Bioinformatics 2003;19:345-353.
Benros C, de Brevern AG, Hazout S. Hybrid Protein Model (HPM): A method for building a
library of overlapping local structural prototypes. Sensitivity study and improvements of the
training. IEEE Int Work NNSP 2003;1:53-70.
de Brevern AG, Etchebest C, Hazout S. Bayesian probabilistic approach for predicting
backbone structures in terms of protein blocks. Proteins 2000;41:271-287.
Kohonen T. Self-organizing formation of topologically correct feature maps. Biol Cybernet
1982;43:59-69.
Kohonen T. Self-Organizing Maps, 3rd edition. 2001;Springer-Verlag, Berlin, Germany.
Supplementary data 3: Distribution of the C rmsd of all the protein fragments of each
cluster superimposed with their representative local structure prototype (in red), and
distribution of the C rmsd of pairs of unrelated 11-residue protein fragments selected
randomly from the databank (in black).
The black histogram shows the distribution of C rmsd computed with 100 000 pairs
of 11-residue protein fragments. A pair of protein fragments was randomly drawn from the
databank and the C rmsd was calculated if these fragments encoded into series of 7 PBs
differed by more than 5 PBs. This distribution has a mean of 4.5 Å (standard deviation, sd =
1.1 Å). The p-value for a random match with C rmsd < 2.0 Å and with C rmsd < 2.5 Å is
respectively 10-3 and 10-2. These p-values could be even smaller if the fragments compared
were completely different. Yang and Wang (2003), for instance, compared nine-residue
fragments from all- proteins with nine-residue fragments from all- proteins and obtained a
p-value for a random match with rmsd < 2.4 Å equal to 10-2 for shorter fragments. The red
histogram shows that the 120 prototypes of the library ensure a good 3D local approximation
with a mean accuracy of 1.61 Å C rmsd (sd = 0.77 Å).
Yang AS, Wang L. Local structure prediction with local structure-based sequence profiles.
Bioinformatics 2003;19:1267-74.
Supplementary data 4: Mean C rmsd value of the 120 clusters of the library
Cluster
mean C rmsd ( Å)
± sd
Cluster
mean C rmsd ( Å)
± sd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
2.10 ± 0.45
2.08 ± 0.52
2.31 ± 0.45
2.14 ± 0.52
2.03 ± 0.48
2.15 ± 0.44
1.65 ± 0.50
1.66 ± 0.51
1.58 ± 0.37
1.52 ± 0.38
1.57 ± 0.39
1.79 ± 0.43
1.89 ± 0.45
1.85 ± 0.35
1.96 ± 0.48
2.02 ± 0.40
2.14 ± 0.52
2.07 ± 0.71
1.90 ± 0.63
1.62 ± 0.66
1.29 ± 0.71
0.76 ± 0.65
0.54 ± 0.41
0.49 ± 0.41
0.42 ± 0.40
0.28 ± 0.28
0.42 ± 0.40
1.07 ± 0.66
1.25 ± 0.76
1.68 ± 0.74
1.88 ± 0.72
2.03 ± 0.58
1.87 ± 0.44
2.39 ± 0.52
2.05 ± 0.49
2.01 ± 0.54
1.84 ± 0.59
2.09 ± 0.64
1.42 ± 0.63
1.13 ± 0.70
0.57 ± 0.52
1.12 ± 0.54
1.85 ± 0.76
1.46 ± 0.64
1.76 ± 0.54
1.93 ± 0.50
1.96 ± 0.40
1.99 ± 0.40
2.28 ± 0.51
2.06 ± 0.47
2.06 ± 0.35
1.96 ± 0.33
2.09 ± 0.44
2.24 ± 0.39
2.01 ± 0.32
2.12 ± 0.44
1.88 ± 0.38
1.85 ± 0.42
1.77 ± 0.47
1.76 ± 0.46
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
1.99 ± 0.44
1.70 ± 0.51
1.70 ± 0.50
1.91 ± 0.54
1.97 ± 0.64
1.51 ± 0.61
1.05 ± 0.71
1.11 ± 0.74
1.56 ± 0.64
1.57 ± 0.60
2.01 ± 0.66
2.19 ± 0.64
2.10 ± 0.64
2.10 ± 0.40
2.10 ± 0.49
2.03 ± 0.35
1.88 ± 0.43
1.73 ± 0.49
1.90 ± 0.36
1.74 ± 0.33
1.94 ± 0.43
2.11 ± 0.51
1.99 ± 0.50
2.26 ± 0.48
2.42 ± 0.46
2.30 ± 0.44
2.19 ± 0.43
2.18 ± 0.41
2.14 ± 0.36
2.12 ± 0.31
1.79 ± 0.39
2.30 ± 0.46
1.89 ± 0.57
1.68 ± 0.43
1.72 ± 0.39
1.84 ± 0.47
1.74 ± 0.40
1.66 ± 0.32
1.70 ± 0.43
1.78 ± 0.45
1.74 ± 0.42
2.04 ± 0.39
2.30 ± 0.41
1.98 ± 0.55
1.82 ± 0.45
1.79 ± 0.49
1.82 ± 0.52
1.84 ± 0.41
1.82 ± 0.48
1.84 ± 0.43
1.86 ± 0.32
1.90 ± 0.37
2.44 ± 0.38
2.12 ± 0.39
1.89 ± 0.46
2.17 ± 0.43
2.23 ± 0.45
2.04 ± 0.48
1.86 ± 0.38
2.13 ± 0.37
The C rmsd values are obtained by superimposing all protein fragments in a cluster with
their representative prototype. The fragments are 11 C long.
Supplementary data 5: Discriminating power of cluster #67’s expert.
Distribution of the probabilities computed for protein fragments (a) that do not belong
to cluster #67 and (b) that belong to cluster #67. The probability value pRmin, which equals
0.53, is associated with the minimal error risk Rmin, which corresponds to the minimal average
fraction of false-positive (FP0.53) and false-negative (FN0.53) fragments. It equals 20%. In the
prediction strategy, the optimal probability threshold for sequence-structure compatibility is
set at p0 = 0.8, that is, well above pRmin. This stringent threshold has the advantage of ensuring
a reduced proportion of false positives (FP0.8 equals about 5%).
Supplementary data 6: Definition of four categories of prototypes
We defined four categories of prototypes. They were obtained by hierarchical
clustering of the 120 prototypes of the library, from the C rmsd values obtained after we
optimally superimposed them pairwise. Figure (a) shows the tree obtained by hierarchical
clustering of the prototypes (package R; Ihaka and Gentlemen, 1996). We analyzed the
different groups of prototypes obtained by cutting the tree at the level displayed in red. Four
categories of prototypes are considered: helical structures (H), core of extended structures (E),
edges of extended structures and short extended structures (Ed, grouping Ed1 and Ed2), and
connecting structures (C). The table in (b) reports the number of prototypes for each category
and identifies them. Figure (c) shows the location in the Hybrid Protein Model (HPM) of the
prototypes included in the different categories: (H) in blue, (E) in red, (Ed1) in brown, (Ed2) in
orange, and (C) in yellow.
Ihaka R, Gentleman R. R: a language for data analysis and graphics. J Comp Graph Stat
1996;5:299-314.
Supplementary data 7: Prediction rates per prototype
Prototype
Category
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
C
C
C
C
C
Ed
Ed
E
E
E
E
Ed
Ed
Ed
C
C
C
C
C
C
C
H
H
H
H
H
H
H
H
C
C
C
C
Ed
Ed
C
C
C
C
H
H
H
H
H
C
C
C
C
C
Ed
Ed
Ed
Ed
Ed
Ed
Ed
Ed
Ed
Ed
E
Prediction rate (%)
1.5 Å
2Å
2.5 Å
1.3
4.6
0.2
6.1
3.1
1.6
28.7
20.1
20.1
17.8
14.6
7.3
3.7
2.2
5.5
2.6
4.9
10.6
12.5
25.7
38.8
56.0
60.3
62.7
63.4
68.6
66.5
47.6
47.6
28.2
23.3
14.7
17.3
3.9
9.1
8.9
12.2
5.6
36.4
52.4
56.8
47.9
15.0
34.5
19.7
15.2
3.9
6.0
5.4
10.4
1.3
1.4
1.2
1.3
2.6
0.6
1.2
3.0
11.8
6.0
9.9
17.2
5.7
24.9
12.2
4.8
54.3
41.8
42.8
36.9
36.4
26.4
18.0
11.6
17.6
12.9
16.4
23.9
25.7
38.3
51.4
68.3
67.5
69.0
70.3
74.6
73.0
61.8
58.6
40.7
34.1
28.3
44.5
10.5
18.5
24.6
26.7
14.6
48.8
68.6
65.6
59.0
28.9
53.8
36.1
33.0
24.3
21.5
14.9
29.1
14.5
7.2
9.4
6.7
9.6
10.3
6.8
12.8
31.5
27.8
39.7
40.4
23.7
40.6
31.3
18.6
73.4
62.9
61.1
53.3
51.7
48.7
43.8
41.1
40.2
38.7
31.9
37.9
43.4
50.2
62.0
71.4
71.2
73.9
73.3
76.9
76.1
73.0
68.2
51.6
46.0
43.1
71.4
21.7
41.5
47.4
46.4
28.4
59.9
76.4
70.6
68.0
42.6
65.7
52.4
50.7
51.7
43.0
37.2
58.8
49.7
29.8
28.3
19.2
41.7
29.5
33.5
38.5
49.9
49.0
Prototype
Category
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
E
Ed
Ed
Ed
C
C
C
H
H
H
C
C
C
C
C
C
Ed
Ed
Ed
E
Ed
Ed
Ed
C
C
C
C
C
C
C
C
C
Ed
Ed
Ed
E
E
E
E
Ed
Ed
Ed
C
C
C
Ed
E
E
Ed
Ed
Ed
C
C
C
Ed
Ed
Ed
Ed
C
C
Prediction rate (%)
1.5 Å
2Å
2.5 Å
1.0
18.2
10.2
9.1
11.9
27.2
54.1
43.1
20.4
20.9
10.1
6.0
9.9
2.4
5.7
1.2
6.8
12.7
1.3
4.5
1.9
1.3
4.3
1.1
0.7
2.0
3.1
2.8
6.0
2.0
9.4
1.2
13.0
21.1
9.5
10.2
9.1
8.4
11.8
10.4
11.0
2.2
1.0
12.9
13.9
15.0
9.8
7.8
7.4
9.3
2.9
3.5
0.1
0.7
4.8
0.0
0.8
5.1
8.6
1.5
13.9
37.1
28.3
23.0
23.8
44.3
60.4
56.0
38.1
36.8
21.4
13.4
17.2
14.7
16.1
14.5
20.3
33.3
24.2
23.1
13.8
10.3
17.7
10.8
5.4
8.2
15.2
15.3
16.1
12.1
27.2
5.9
24.4
42.6
37.3
25.9
27.8
30.6
30.6
26.0
38.4
20.1
10.7
25.8
33.2
32.1
28.4
27.3
24.3
27.7
20.9
26.3
2.8
15.2
21.6
7.1
5.8
19.4
32.9
8.9
36.1
52.4
49.7
40.0
39.2
54.2
66.3
66.0
58.1
52.6
34.1
26.3
34.9
40.8
34.8
35.8
43.2
57.0
60.1
54.5
40.3
27.4
36.5
29.1
21.0
25.5
39.8
36.2
35.1
44.3
60.3
26.0
48.8
65.7
60.2
50.6
51.5
50.4
51.0
47.0
56.2
46.4
31.5
46.2
55.9
59.0
51.1
51.1
46.6
48.6
51.8
56.9
22.2
40.7
49.0
29.8
22.3
40.6
63.5
33.5
This table reports for each prototype: (i) its category, i.e., helical structures (H), core of
extended structures (E), edges of extended structures and short extended structures (Ed), or
connecting structures (C), and (ii) the individual prediction rates at thresholds of 1.5 Å, 2 Å
and 2.5 Å C rmsd.
Supplementary data 8: Prediction rate per prototype at threshold 2.5 Å function of the
mean C rmsd value of the clusters.
With the geometric evaluation, the prediction rate of a local structure prototype seems
related to its cluster mean C rmsd value, that is, to how well the prototype approximates the
local structures belonging to its cluster. This point can be explained in part by the fact that the
prototypes corresponding to repetitive structures ensure the best local approximations (e.g.
those located in the helical region [#23 - #27] of the Hybrid Protein). Their clusters are the
most heavily populated, and a structural redundancy is observed in the library mainly for
these prototypes.
Supplementary data 9: Distribution of the protein fragments belonging to the four
categories of prototypes in the six confidence levels of the prediction
Distribution of the prototype categories (%)
Confidence
level
Helical
Extended
core
Extended
edges
Connecting
structures
6
5
4
3
2
1
22.9
16.7
23.6
22.3
8.5
6.0
5.7
16.3
32.3
28.3
9.9
7.5
2.2
9.3
29.7
35.5
14.1
9.2
3.4
7.6
26.7
38.4
14.5
9.4
All
100.0
100.0
100.0
100.0
Supplementary data 10: Comparison of the backbone torsion angle prediction
accuracies of the HPM+experts method with other published methods.
For comparison purposes, we encoded the local structure candidates proposed with our
method (HPM+experts) into the four conformational states: A, B, G, E (see Figure 1 of Yang
and Yang, 2003) and computed a backbone torsion angle consensus prediction, either with the
first candidate proposed (First rank, MNAC = 1) or with the candidate the closest to the true
local structure among those proposed (Best candidate, MNAC = 5).
The LSBSP1+consensus results (Yang and Wang, 2003), the HMMSTR results
(Bystroff et al., 2000) and the SVM results (Kuang et al., 2004) are all reproduced from Table
2 of Kuang et al. (2004).
It must be noted that the comparisons are not straightforward, notably due to the use of
different test sets. Moreover, it must be emphasized that the HPM+experts method makes
predictions from a single sequence without use of information from homologous proteins
(profiles) and without taking advantage of the high prediction rates of secondary structure
prediction methods such as PSI-PRED (Jones, 1999).
HPM+experts
Accuracy (%)
Test cases
A
B
G
E
Total
33521
25783
3086
1135
63525
LSBSP1+consensus (level=1)
First rank
Best candidate
65.3
68.6
27.9
0
63.6
77.8
81.7
32.3
0
75.8
Test cases
17466
12732
1491
461
32150
Accuracy (%)
82.7
71.2
32.8
6.5
74.6
HMMSTR
SVM profile x sec
Test cases Accuracy (%)
Test cases Accuracy (%)
9625
7749
837
199
18410
82.0
71.6
15.5
22.6
74.0
50689
40268
4760
1648
97365
Yang AS, Wang L. Local structure prediction with local structure-based sequence profiles.
Bioinformatics 2003;19:1267-74.
Bystroff C, Thorsson V, Baker D. HMMSTR: a hidden Markov Model for local sequencestructure correlations in proteins. J Mol Biol 2000;301:173-190.
Kuang R, Leslie CS, Yang AS. Protein backbone angle prediction with machine learning
approaches. Bioinformatics 2004;20:1612-1621.
Jones DT. Protein secondary structure prediction based on position-specific scoring matrix. J
Mol Biol 1999;292:195-202.
82.5
79.6
32.9
0.3
77.3
Supplementary data 11: Proportion of correctly predicted residues.
1) Nine-residue fragments
9-residue fragments < 1.4 Å
HPM+experts
% correct
Yang and Wang (2003)
First rank
Best candidate
53.5
72.2
62.1
The rmsd prediction accuracy measure corresponds to the proportion of test residues
for which at least one of the overlapping nine-residue segments is predicted correctly, that is
less than 1.4 Å from the true local structure (% correct; Yang and Wang, 2003; Bystroff and
Baker, 1998). We computed this criterion by considering only the nine central residues of our
eleven-residue fragments and compared our results with those of Yang and Wang (2003).
2) Eleven-residue fragments
11-residue fragments
< 1.5 Å
< 2.0 Å
< 2.5 Å
First rank
Best candidate
First rank
Best candidate
First rank
Best candidate
Prediction rate (%)
13.6
22.2
20.6
35.1
29.5
51.2
% correct
44.5
59.7
62.7
80.1
78.9
92.7
For the different C rmsd thresholds (1.5 Å, 2 Å and 2.5 Å), the table reports the
proportion of correctly predicted eleven-residue fragments (Prediction rate) and the
corresponding proportion of correctly predicted residues (% correct), when considering either
the top scoring candidate or the best candidate among those proposed by our method.
Yang AS, Wang L. Local structure prediction with local structure-based sequence profiles.
Bioinformatics 2003;19:1267-74.
Bystroff C, Baker D. Prediction of local structure in proteins using a library of sequence-
structure motif. J Mol Biol 1998;281:565-77.
Download