A normalized root-mean-square distance for comparing protein

advertisement
FOR THE RECORD
A normalized root-mean-square distance for comparing protein
three-dimensional structures
OLIVIERO CARUGO1,2
AND
SÁNDOR PONGOR1
1
Protein Structure and Function Group, International Centre for Genetic Engineering and Biotechnology,
34012 Trieste, Italy
2
Department of General Chemistry, University of Pavia, 27100 Pavia, Italy
(RECEIVED February 15, 2001; FINAL REVISION April 3, 2001; ACCEPTED April 10, 2001)
Abstract
The degree of similarity of two protein three-dimensional structures is usually measured with the rootmean-square distance between equivalent atom pairs. Such a similarity measure depends on the dimension
of the proteins, that is, on the number of equivalent atom pairs. The present communication presents a simple
procedure to make the root-mean-square distances between pairs of three-dimensional structures independent of their dimensions. This normalization may be useful in evolutionary and fold classification studies
as well as in simple comparisons between different structural models.
Keywords: Root-mean-square distance; structure classification; structure comparison; three-dimensional
similarity
Quantitative comparison of three-dimensional structures is a
fundamental task in structural biology (Carugo and Eisenhaber 1997; Peters-Libeu and Adman 1997), especially in
such fields as domain fold classification and structural evolution studies (Domingues et al. 2000; Yang and Honig
2000). A very popular quantity used to express the structural
similarity is the root-mean-square distance (rmsd) calculated between equivalent atoms in two structures, defined as
rmsd =
冑
兺d
i
n
2
i
(1)
where d is the distance between each of the n pairs of
equivalent atoms in two optimally superposed structures.
The rmsd is 0 for identical structures, and its value increases
as the two structures become more different. Rmsd values
are considered as reliable indicators of variability when applied to very similar proteins, like alternative conformations
Reprint requests to: Dr. Oliviero Carugo, International Centre for Genetic Engineering and Biotechnology, Area Science Park, Padriciano 99,
34012 Trieste, Italy; e-mail: carugo@ icgeb.trieste.it; fax: 39 040 22 65 55.
Article and publication are at www.proteinscience.org/cgi/doi/10.1110/
ps.690101.
1470
of the same protein. On the other hand, rmsd data calculated
for structure pairs of different sizes cannot be directly compared, because the rmsd value obviously depends on the
number of atoms included in the structural alignment.
Clearly, an rmsd value of, say, 3 Å has a different significance for proteins of 500 residues than for those of 50
residues; accordingly, the structural variability of fold types
cannot be easily compared in quantitative terms (Irving et
al. 2001). In other words, rmsd is a good indicator for structural identity, but less so for structural divergence.
The present communication aims to define a normalized,
size-independent rmsd formula that could help to overcome
this problem. In order to derive a formula between rmsd and
protein dimension, one would need a database of structural
alignments, in which all other parameters, such as secondary structure content and amino acid composition of the
protein, are either constant (which is not possible) or are
evenly distributed with respect to protein chain length. Such
experimental data are presently not available. For example,
the FSSP database (Holm and Sander 1996) contains a reasonably high number of structural alignments (about
23,000), but 80% of these have small rmsd values (0–2 Å),
which reflects the fact that the percentage of sequence identity is very high (more than 90% residue identity in 60% of
the alignments).
Protein Science (2001), 10:1470–1473. Published by Cold Spring Harbor Laboratory Press. Copyright © 2001 The Protein Society
Normalized root-mean-square distance
We therefore decided to create a large artificial set of
rmsd values via extensive self-comparison of 180 nonhomologous (maximal identity 25%) protein structures, selected from the protein data bank (Berman et al. 2000) using
the PDB_SELECT (Hobohm and Sander 1994) algorithm.
These proteins were selected so as to represent the largest
possible variability of amino acid content, sequence length
as well as secondary structure content (Table 1). Each structure was compared, using the algorithm of Kabsch (1976,
1978), with 400,000 of its randomized variants created
through random shuffling of the C␣ equivalencies. All C␣
atoms were included in superposing each structure with all
its variants. Overall, we obtained 400,000 rmsd observations in each of the 180 randomization experiments, which
corresponds to a database of 72 million structural alignments. As expected, the distribution of rmsd values thus
obtained depends on the size of the protein. The rmsd values
are not evenly distributed, rather, the histograms are biased
toward the high rmsd values (Fig. 1a). Moreover, there are
characteristic differences between proteins of different
length, illustrated by, for example, the different rmsd limits
of the 2000 smallest rmsd values in the two experiments, as
shown by the shaded areas in Figure 1a.
In order to check the effect of the uneven distribution of
rmsd values, we prepared separate rmsd-versus-chainlength plots for different subsets of the database, selected to
represent different rmsd ranges without changing the other
parameters (secondary structure content, amino acid composition, etc.). This was achieved by first ordering the structural alignments in growing order of the rmsd values in each
of the 180 data sets and then selecting the first n smallest
rmsd values from each data set. This procedure guarantees
that the data sets will be equal with respect to all parameters; only the range of the rmsd values will be different:
that is, gradually increasing the number n of observations in
the data sets means not only an increase of the data size but
Fig. 1. (a) Typical distributions of rmsd values after 400,000 random superpositions for proteins of different sizes (PDB codes
indicated in parentheses); the percentage of observations in each range of 0.4 Å is reported. (b) A typical rmsd-versus-chain-length plot;
the upper limits of the smallest 2000 observations (examples indicated by perpendicular lines in a) are plotted for each of the 180
experiments. (c) The dependence of the rmsd-versus-chain-length plots as a function of the different number of smallest observations
(indicated in parentheses); the lines were determined by fitting a logarithmic equation of the form y ⳱ a + b ln(x) to the data
(0.95 < r < 1.00); 䊊 is a reference value, corresponding to 100 residues, chosen to normalize the curves. (d) Dividing the rmsd values
by the corresponding reference value (indicated with 䊊 in c) causes the curves in the previous figure to collapse into one single curve.
www.proteinscience.org
1471
Carugo and Pongor
Table 1. Protein structures examined in the present work
idcode
n
h
e
t
o
idcode
n
h
e
t
o
idcode
n
h
e
t
o
1a7kA
1amx_
1auxA
1b12A
1b87A
1bq2_
1bor_
1bu2A
1by1A
1c20A
1ckv_
1dldA
1dgvA
1eus_
1gcf_
1hcd__
1ihfA
1ksr_
1mtyB
1pgs_
1pyaA
1qleC
1qrjB
1qstA
1r63_
1stu_
1tig_
1upuA
2af8_
2ezl_
2pcbA
2qwc_
2yfpA
3sil_
8rucI
1a7m_
1aru_
1b3kA
1beg_
1bw6A
1cby_
1cmyA
1d8lA
1dkdA
1ej3A
1gdoA
1irl_
1mut_
1otfA
1pdnC
1qhgA
1qkfA
1qu8A
1rypL
1tiiD
1vcaA
2aviA
2bidA
2shl_
3stdA
358
150
292
239
181
323
56
229
209
128
141
220
183
358
109
118
96
100
384
311
81
273
199
160
63
68
88
224
86
99
294
385
224
378
123
180
336
373
97
56
227
141
140
146
187
238
133
129
59
123
163
73
46
212
98
199
121
197
48
162
31
8
29
8
27
32
0
60
64
65
16
58
46
3
0
3
42
0
64
5
27
68
57
35
67
38
35
39
50
59
49
3
9
6
22
58
43
31
61
55
30
74
39
34
61
25
52
12
42
47
56
22
0
39
26
5
2
57
0
30
25
54
31
46
30
28
4
0
0
3
21
0
2
44
44
49
29
43
1
47
27
1
1
29
0
31
35
27
0
0
7
45
53
47
24
0
7
31
4
0
30
0
29
29
4
34
3
23
32
5
10
21
0
33
39
57
51
0
52
48
25
21
20
20
28
21
55
27
25
15
36
25
31
28
20
30
9
36
19
22
22
16
22
22
16
12
16
15
21
19
22
25
24
25
27
22
24
21
14
23
16
15
19
17
18
21
22
28
10
24
20
27
50
16
21
20
22
21
27
14
19
17
20
26
15
19
41
14
11
17
27
17
20
25
36
19
20
21
16
26
23
16
20
14
17
19
14
18
29
22
21
26
14
22
28
19
26
17
21
21
24
11
14
19
17
20
23
37
15
24
15
30
50
12
14
19
24
22
21
9
1agnA
1aojA
1avgI
1b6bA
1b9lA
1bhu_
1bp3A
1buyA
1bynA
1ceuA
1cn3A
1d2hA
1dtjB
1evtD
1gnhA
1hoe_
1iyu_
1lbd_
1nfdA
1pho_
1qhkA
1qmcA
1qslA
1qu0C
1rgs_
1svpA
1tiv_
1xikA
2cgpA
2jhbA
2pcfB
2tbd_
3cla_
7prcC
1a3k_
1a8p_
1avwB
1b65A
1bmy_
1bx9A
1cczA
1cpzA
1dipA
1dztA
1eqfA
1ghj_
1lkfA
1nflA
1ounB
1pex_
1ghlA
1qovM
1r2aA
1sxl_
1tuc_
2abd_
2ayh_
2nlrA
2trxA
5daaA
373
60
142
168
119
102
186
166
128
51
283
252
62
192
206
74
79
238
203
330
47
52
402
183
264
160
86
340
200
143
250
134
213
332
137
257
171
363
107
210
171
68
77
183
267
79
292
259
121
192
203
302
46
97
61
86
214
222
108
277
30
5
8
26
32
4
60
57
5
51
7
32
44
9
9
0
0
66
5
2
32
6
34
3
34
6
0
70
40
26
8
28
30
42
2
26
2
31
48
56
4
28
40
10
64
0
3
58
28
7
37
64
63
18
5
60
6
6
33
27
26
45
41
30
39
25
0
1
48
0
41
31
32
52
44
49
47
3
43
56
28
56
29
51
27
46
0
4
30
25
40
27
29
4
58
34
42
26
0
10
51
29
0
44
0
42
59
0
49
42
27
4
0
13
44
0
52
50
28
31
21
22
27
21
17
35
22
20
18
25
25
19
6
13
21
26
25
20
22
25
21
21
19
22
17
26
50
13
19
24
23
21
22
30
18
23
20
18
27
23
22
28
31
21
21
25
20
29
12
33
19
14
11
30
21
19
19
23
27
23
23
28
25
23
12
36
18
22
29
24
27
18
18
26
26
26
28
12
30
16
19
17
18
24
21
23
50
14
12
24
28
24
20
24
21
18
36
24
25
11
24
15
29
25
15
33
18
13
11
18
16
18
26
39
30
21
23
21
12
19
1ak0_
1ap8_
1axn_
1b6rA
1b9xA
1boeA
1bqv_
1bxwA
1cl7M
1cflD
1d0mA
1de9A
1dujA
1ewiA
1gsa_
1hsm_
1jlyA
1liaA
1oczB
1pslA
1qklA
1qqvA
1qsoA
1qu5A
1rip_
1tbaA
1tnm_
1xrc_
2def_
2myo_
2pii_
2tmvP
3csmA
8prn_
1a79A
1aep_
1awj_
1bec_
1bu9A
1cd3_
1cjkA
1d4uA
1dj7B
1eioA
1exg_
1hhhB
1mrj_
1nfa_
1p32A
1qgiA
1qj8A
1qpvA
1rof_
1tif_
1u2fA
2atcB
2bby_
2nmbA
3ncmA
6gsvA
264
213
323
349
340
46
110
172
142
368
312
276
187
114
314
79
299
164
227
304
127
67
149
182
81
67
91
378
146
118
112
154
252
289
171
153
77
238
168
294
189
111
73
127
110
100
247
178
182
259
148
133
60
76
90
152
69
147
92
217
64
21
79
34
7
0
46
0
84
7
49
27
28
11
35
62
7
76
30
72
20
42
34
10
0
21
0
32
22
47
26
45
64
4
34
80
0
5
48
63
38
23
4
9
3
0
40
2
36
58
0
30
12
36
16
6
51
20
4
49
4
23
0
29
46
24
0
60
0
49
9
26
29
25
29
0
45
0
25
0
14
3
30
22
11
6
46
25
20
0
29
5
0
55
31
0
5
49
1
2
30
11
44
58
56
49
25
24
26
6
82
30
10
30
18
5
7
20
51
9
20
26
15
19
29
43
23
12
7
21
22
22
18
31
18
22
30
15
22
14
32
30
19
35
28
39
27
26
25
34
15
26
17
24
20
14
27
20
27
20
18
38
27
21
22
21
21
26
17
21
14
23
40
17
38
41
10
31
27
28
12
29
6
19
18
33
31
28
9
24
20
25
25
33
18
16
18
9
23
14
33
25
17
33
60
34
26
17
33
19
29
25
19
17
15
7
68
26
24
16
15
29
25
11
19
30
13
49
20
15
5
17
38
17
29
48
32
29
17
14
Each entry is identified by its four-letter identification code, followed by the chain identifier. The following features are indicated for each entry: the number
of residues (n) and the percentages of residues in helical (h), extended (e), turn (t), and other (o) backbone conformation. The secondary structures, as
assigned by DSSP, were simplified as follows: helical if 310-␣ or ␲-helix (G, H, and I respectively in DSSP), extended if ␤-bulge or strand (B and E),
turn if bend or reverse turn (S and T), and others in the remaining cases.
1472
Protein Science, vol. 10
Normalized root-mean-square distance
also an inclusion of higher rmsd values. The data of each
subset could be fitted with a logarithmic function with correlation coefficients higher than 0.95 (an example is shown
in Fig. 1b). The fitted curves are different as higher rmsd
data are included in the calculation, which results in the
series of curves shown in Figure 1c. This observation therefore confirms that the uneven distribution of rmsd values
would bias the parameters obtained by simple curve fitting.
Interestingly, dividing the rmsd values with a reference
value, chosen here as the value of the fitted rmsd curve at
100 residues, rmsd100 (Fig. 1c), makes the curves collapse
into one single logarithmic curve (Fig. 1d) that is described
by the following equation:
rmsd
= −1.3 + 0.5 ln N
rmsd100
(2)
where N is the number of amino acid residues. This curve is
accordingly independent of both the number n of observations included in the calculation and the magnitude of rmsd
values; a statistical bias is therefore not likely. Given that
−1.3 ≅ 1 − ln(10), the equation can be rearranged to give
冑
rmsd
= 1 + ln
rmsd100
N
100
(3)
can be considered as a normalized, size-independent indicator of structural variability. For example, suppose that the
C␣ atoms of two pairs of protein structures, 50 and 200
residues long, respectively, can be superposed to give a final
rmsd value of 1.0 Å. For the first pair of sequences sharing
N ⳱ 50 equivalent residues, the corresponding rmsd100
value will be 1.524 Å The second pair of structures
(N ⳱ 200) is considerably more similar to each other
(rmsd100 ⳱ 0.741 Å) despite the fact that the crude rmsd
values are the same. In other words, the normalized rmsd100
qualitatively reflects the intuitive view that larger structures
have a higher probability to differ one from the other. Because the data were derived from proteins with more than 40
residues we suggest that the rmsd100 formula should be
applied to alignments that include more than 40 residues.
On the other hand, it follows from the mathematical form of
the equation that the formula can be applied only for structural alignments with more than 14 residues; for smaller N
values the ratio in equation 2 would be negative.
We think that the normalized rmsd can be useful in estimating the quality of an NMR ensemble of models, in
applying multivariate statistical techniques to structural bioinformatic problems, as well as in comparing limited sets of
protein three-dimensional structures.
Acknowledgments
It is interesting to note that the value 100, the residue
number corresponding to the chosen reference value,
rmsd100, appears in the equation. We repeated the normalization procedure on the entire data set with residue numbers of 50, 75, 150, and 200, respectively, and in fact found
that a generalized equation is valid with correlation coefficients 0.96–0.99:
冑
N
L
rmsd
= 1 + ln
rmsdL
References
(4)
where L is the number of residues chosen as a reference. In
other words, the relative root-mean-square distance rmsd/
rmsdL is a simple function of the relative dimension N/L.
Equation 3 can be simply rearranged to give a formula for
a normalized rmsd value:
rmsd100 =
rmsd
冑
1 + ln
N
100
We thank János Murvai and Alessandro Pintar for helpful discussions.
The publication costs of this article were defrayed in part by
payment of page charges. This article must therefore be hereby
marked “advertisement” in accordance with 18 USC section 1734
solely to indicate this fact.
(5)
The chain length of 100 residues was primarily chosen
because this is the mean number of amino acids per domain
(Xu and Nussinov 1998). rmsd100 is therefore an rmsd value
that would be observed for a pair of structures of 100 residues exhibiting the same degree of similarity as the structures actually compared. In other words, the rmsd100 value
Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T., Weissig, H., Shindyalov, I., and Bourne, P. 2000. The Protein Data Bank. Nucleic Acid Res.
28: 235–242.
Carugo, O. and Eisenhaber, F. 1997. Probabilistic evaluation of similarity between pairs of three-dimensional protein structures utilizing temperature
factors. J. Appl. Cryst. 30: 547–549.
Domingues, F.S., Koppensteiner, W.A., and Sippl, M.J. 2000. The role of protein structure in genomics. FEBS Lett. 476: 98–102.
Hobohm, U. and Sander, C. 1994. Enlarged representative set of protein structures. Protein Sci. 3: 522–531.
Holm, L. and Sander, C. 1996 Mapping the protein universe. Science 273:595–
602.
Irving J.A., Whisstock J.C., and Lesk A.M. 2001. Protein structural alignments
and functional genomics. Proteins 42:378–382.
Kabsch, W. 1976. A solution for the best rotation to relate two sets of vectors.
Acta Crystallogr. A 32: 922–923.
———. 1978. A discussion of the solution for the best rotation to relate two sets
of vectors. Acta Crystallogr. A 34: 827–828.
Peters-Libeu, C. and Adman, E.T. 1997. Displacement-parameter weighted coordinate comparison: I. Detection of significant structural differences between oxidation states. Acta Crystallogr. D 53: 56–76.
Xu, D. and Nussinov, R. 1998. Favorable domain size in proteins. Fold. Des. 3:
11–17.
Yang, A.-S. and Honig, B. 2000. An integrated approach to the analysis and
modeling of protein sequences and structures. I. Protein structural alignment
and a quantitative measure for protein structural distance. J. Mol. Biol. 301:
665–678.
www.proteinscience.org
1473
Download