prot24916-sup-0001-suppinfo


Supporting Material

Methods

The Fourier method differs from currently prevalent alignment-based algorithms in two fundamental ways, which were set forth in detail in the references given in the main text. We briefly summarize them here.

1. The 20 naturally occurring amino acids are represented by numerical parameters derived, by factor analysis, from their physical properties. Ten property factors have been shown to account for essentially all the variance of the physical properties, and it is therefore possible to represent each amino acid as a ten-vector, and an N-residue sequence as a set of 10 numerical chains of length N. The property factors are complete and orthonormal by construction, so sequences are represented numerically by parameters which, in addition to being physically based, are both exhaustive and non-redundant.

2. The resulting numerically encoded sequences are Fourier transformed. The result of this operation is a set of Fourier coefficients, indexed by two parameters: the wave number k and an index l, which indicates which of the 10 property factor strings gives rise to the coefficient. Each individual Fourier coefficient is global in character, since it contains information from the entire sequence. The Fourier coefficients, like the property factors, are complete and orthonormal by construction and, taken together, provide a complete numerical representation of the protein sequence. Note that, in k space, chain length is removed as a variable, and therefore chains of different lengths can be compared rigorously. The Fourier coefficients describe properties of the chains which scale with length. It has been shown that the average and variance of the Fourier coefficients can be calculated analytically, so that the statistical significance of the magnitude of a given coefficient can be determined exactly.

A. The Sequence Distance Function

In recent work we demonstrated that architectural families are distinguished from one another by Fourier coefficients at a limited set of low-k wave numbers. MANOVA (multivariate analysis of variance) demonstrated that the only values of k at which there are statistically significant differences, in the full 10-dimensional property space, between sets of sequences with different folds are 0 ≤ k ≤ 6. The sequence-space metric used in that work was based on those k values. In the present work we extend that metric, using ANOVA results from earlier work (cited in the main text), to include contributions, for particular property factors, from k values in the range 7 ≤ k ≤ 10. This extension can be shown to slightly increase the correlation between sequence and structure distances.

These results provide a basis for the construction of the intersequence distance function.

As in previous work, we define a standard score for the Fourier coefficient (denoted in this case by $Z_k^{(l)}$) for property l, at wave number k, as

\[ Z_k^{(l)} = \frac{c_k^{(l)} - \langle c_k^{(l)} \rangle}{\sigma(c_k^{(l)})_N}, \tag{S1} \]

where c is the unnormalized sine or cosine Fourier coefficient, the angle brackets denote an average over all permutations of the original N-residue wild-type sequence, and σ is the associated standard deviation. This normalization removes any dependence on sequence composition alone, and creates a function which explicitly reflects the influence of the specific linear arrangement of amino acids along the sequence. We then define a k-dependent distance between any two sequences P and Q, $\Delta_k(P,Q)$, and the total distance between the sequences as

\[ D(P,Q) = \left[ \sum_{k=0}^{10} \Delta_k^2(P,Q) \right]^{1/2}. \tag{S2} \]

The exact definition of $\Delta_k(P,Q)$, which depends on some or all of the 10 property factors, and details of the performance of the extended function, are given below in the section "Definition of the Sequence Distance Function". The distance function is a simple Cartesian metric in the space of centered, normalized Fourier coefficients $Z_k^{(l)}$ (equation S1), but different combinations of sine and cosine coefficients are used at different k values, reflecting statistically significant differences found previously.
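The standard score of equation S1 can be illustrated with a minimal Python sketch. This is not the implementation used in this work. The function names are hypothetical, the phase convention 2πkn/N for the unnormalized sine and cosine sums is an assumption, and the average over all permutations (which, as noted above, can be calculated analytically) is replaced here by a random sample of permutations. The input x stands for one of the 10 property-factor chains of a sequence.

```python
import numpy as np

def fourier_coefficients(x, k_max=10):
    """Unnormalized sine and cosine sums of one numeric chain for k = 0..k_max."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    k = np.arange(k_max + 1)[:, None]
    angle = 2.0 * np.pi * k * n / len(x)  # assumed phase convention
    a = (np.sin(angle) * x).sum(axis=1)   # sine coefficients
    b = (np.cos(angle) * x).sum(axis=1)   # cosine coefficients
    return a, b

def permutation_z_scores(x, k_max=10, n_perm=2000, seed=0):
    """Standard scores of the coefficients relative to shuffled copies of x.

    Assumption of this sketch: a random sample of n_perm permutations
    stands in for the exact average over all permutations.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    observed = np.concatenate(fourier_coefficients(x, k_max))
    samples = np.array([
        np.concatenate(fourier_coefficients(rng.permutation(x), k_max))
        for _ in range(n_perm)
    ])
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    # Coefficients that do not vary under permutation (e.g. k = 0) are
    # given a score of 0 here; this is a choice made for the sketch only.
    z = np.zeros_like(observed)
    np.divide(observed - mean, std, out=z, where=std > 1e-9)
    return z[:k_max + 1], z[k_max + 1:]
```

A chain with a strong period-2 component, for example, yields a large positive score for the k = 2 cosine coefficient, while a shuffled copy of the same chain (identical composition) generally does not.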

B. The Structure Distance Function

We must define a parallel, independent distance function which measures the degree of structural similarity between proteins, without reference to sequence. We devised such a function in previous work and applied it to the quantitative classification of known protein structures. This approach, the Generalized Bond Matrix (GBM) method, describes a structure in terms of a set of matrices of bond lengths, bond angles and bond dihedral angles. In the present case we use the nearest-neighbor virtual-bond (Cα) backbone. The size of each matrix is determined by a preselected fragment length; in this work, a 4-Cα fragment. The representation is therefore sensitive to local structural characteristics. At the same time, the complete distribution of these matrices (which describe the overlapping fragments which make up a structure) is a global characteristic of the structure, and can be used as a fingerprint. A distance function is then defined which acts on two fingerprints to quantitate the degree of similarity between the associated structures. Because the fingerprints are normalized by sequence length, it is possible to meaningfully compare the structures of proteins of different size. In contrast to other methods, the GBM method is suitable for the rapid, simultaneous pairwise comparison of very large sets of structures.

Here we use a low-resolution (LRGBM) version of the algorithm, which was demonstrated in earlier work to give results very similar to the full-resolution comparison method, and which is even more rapid in execution. In the LRGBM formulation, the structure of a protein is represented in a four-dimensional space by integrating over the populations of predefined regions of the high-resolution GBM fingerprint, which are denoted as $A_R$, $E_R$, $E_0$ and $E_L$. We computed these coordinates for the members of a very large dataset of proteins of known structure (described below). A principal component analysis shows that, in this representation, the space of structures is actually 3-dimensional, and the coordinates of a structure are given by

\[ w_1 = -0.522\,p(E_L) - 0.522\,p(E_0) - 0.298\,p(E_R) + 0.605\,p(A_R), \tag{S3} \]

\[ w_2 = -0.068\,p(E_L) - 0.354\,p(E_0) - 0.928\,p(E_R) + 0.093\,p(A_R), \tag{S4} \]

\[ w_3 = -0.776\,p(E_L) + 0.602\,p(E_0) + 0.179\,p(E_R) + 0.062\,p(A_R), \tag{S5} \]

where p(X) is the fractional occupation of region X. The distance δ(P,Q) between proteins P and Q in structure space is taken as a simple Cartesian metric, given by

\[ \delta(P,Q) = \left[ \sum_{m=1}^{3} \left( w_m(P) - w_m(Q) \right)^2 \right]^{1/2}. \tag{S6} \]

This distance can also be given in the form of a standard score which, for clarity, we denote in this case as ζ(P,Q), defined by

\[ \zeta(P,Q) = \frac{\delta(P,Q) - \langle \delta(P,Q) \rangle}{\sigma(\delta)}. \tag{S7} \]
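The structure-space quantities of equations S3 to S7 can be sketched in a few lines of Python. The coefficients below are transcribed from equations S3 to S5; the function names are hypothetical, and the standard score assumes that the average and standard deviation in equation S7 are taken over whatever collection of pairwise distances is supplied, which the text does not specify.

```python
import numpy as np

# Rows: w1, w2, w3. Columns: p(E_L), p(E_0), p(E_R), p(A_R).
# Coefficients as transcribed from equations S3-S5.
W = np.array([
    [-0.522, -0.522, -0.298, 0.605],
    [-0.068, -0.354, -0.928, 0.093],
    [-0.776,  0.602,  0.179, 0.062],
])

def structure_coordinates(p):
    """Map fractional occupations (E_L, E_0, E_R, A_R) to (w1, w2, w3)."""
    return W @ np.asarray(p, dtype=float)

def structure_distance(p_P, p_Q):
    """Cartesian distance of equation S6 between two structures."""
    diff = structure_coordinates(p_P) - structure_coordinates(p_Q)
    return float(np.linalg.norm(diff))

def standard_scores(distances):
    """Standard scores in the form of equation S7.

    Assumption: mean and standard deviation are computed over the
    supplied set of distances.
    """
    d = np.asarray(distances, dtype=float)
    return (d - d.mean()) / d.std()
```

Because the map from occupations to coordinates is linear, the coordinates for a whole dataset can be obtained with a single matrix product before pairwise distances are computed.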

C. The Database

We use a protein dataset based on the CATH sequence/structure database, in which proteins are classified by four parameters: C, A, T and H, denoting Class, Architecture, Topology and Homology. We use a set of 12011 domains drawn from the CathDomainSeqs.S60.ATOM.v.3.2.0 dataset. The sequences in this set have no more than 60% sequence identity. To the best of our knowledge, this is one of the largest datasets ever used in studies of this type.

It should be noted that the CATH database makes no distinction, at the highest hierarchical level (specified by the value of the parameter C), between subtypes of mixed helix-sheet/barrel structures. Differentiation of subtypes appears at the A and T levels of classification. In future work, the effect of such distinctions will be examined.

Definition of the Sequence Distance Function

The sequence distance function is a generalization of that used in previous work. As noted in the body of the article, we include ANOVA results from earlier work (cited in the main text), and thereby extend the distance function to include contributions from 7 ≤ k ≤ 10, in addition to MANOVA results indicating the significance of differences between sequences in the range 0 ≤ k ≤ 6. As in the previous references, we denote sine and cosine Fourier coefficients at wave number k for property factor l by $a_k^{(l)}$ and $b_k^{(l)}$, respectively. We define standard scores in terms of averages of the Fourier coefficients over an ensemble of all possible permutations of the N-member protein sequence, denoted by angle brackets, and the corresponding standard deviation, as follows:

\[ Z(a_k^{(l)}) = \frac{a_k^{(l)} - \langle a_k^{(l)} \rangle}{\sigma(a_k^{(l)})_N}, \]

with an analogous expression for $Z(b_k^{(l)})$.

We are interested in calculating the distance between the sequences of two proteins P and Q. We define auxiliary functions

\[ \lambda_k^{(l)}(P,Q) = \left[ Z(a_k^{(l)}[P]) - Z(a_k^{(l)}[Q]) \right]^2 \]

and

\[ \rho_k^{(l)}(P,Q) = \left[ Z(b_k^{(l)}[P]) - Z(b_k^{(l)}[Q]) \right]^2 . \]

We also define k-dependent weighting vectors with components $\mu_k^{(l)}$ and $\nu_k^{(l)}$. We write the square of the distance function as follows:

\[ \Delta_k^2(P,Q) = \sum_{l=1}^{10} \mu_k^{(l)} \lambda_k^{(l)} + \sum_{l=1}^{10} \nu_k^{(l)} \rho_k^{(l)}, \]

which can be written in vector form as

\[ \Delta_k^2(P,Q) = \boldsymbol{\mu}_k \cdot \boldsymbol{\lambda}_k + \boldsymbol{\nu}_k \cdot \boldsymbol{\rho}_k . \]

The weighting vectors $\boldsymbol{\mu}_k$ and $\boldsymbol{\nu}_k$ are given by the following explicit formulae:

$\mu_0 = \mu_1 = \mu_3 = \mu_4 = \mu_6 = (0,0,0,0,0,0,0,0,0,0)$;
$\mu_2 = \mu_5 = (1,1,1,1,1,1,1,1,1,1)$;
$\mu_7 = (1,1,1,0,1,0,0,0,0,0)$;
$\mu_8 = (1,1,0,0,1,1,0,1,0,1)$;
$\mu_9 = (1,1,0,1,1,0,0,1,0,1)$;
$\mu_{10} = (1,0,0,1,1,0,0,1,1,0)$;

$\nu_0 = \nu_1 = \nu_2 = \nu_3 = \nu_4 = \nu_6 = (1,1,1,1,1,1,1,1,1,1)$;
$\nu_5 = (0,0,0,0,0,0,0,0,0,0)$;
$\nu_7 = (1,0,1,0,1,1,1,0,0,0)$;
$\nu_8 = (1,0,0,1,1,0,0,1,1,1)$;
$\nu_9 = (1,1,1,1,0,1,0,1,1,0)$;
$\nu_{10} = (1,0,0,0,0,1,0,0,0,0)$.
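To make the role of the weighting vectors concrete, the following Python sketch evaluates the weighted k-dependent squared distances and the total distance of equation S2 from precomputed standard scores. It is an illustration under stated assumptions, not the code used in this work: the function name and array layout (rows k = 0..10, columns l = 1..10) are hypothetical, and the standard scores for the sine (a) and cosine (b) coefficients are assumed to be supplied by the caller.

```python
import numpy as np

ZEROS, ONES = [0] * 10, [1] * 10

# Row k holds the weighting vector for wave number k (k = 0..10).
MU = np.array([
    ZEROS, ZEROS, ONES, ZEROS, ZEROS, ONES, ZEROS,   # k = 0..6
    [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],                  # k = 7
    [1, 1, 0, 0, 1, 1, 0, 1, 0, 1],                  # k = 8
    [1, 1, 0, 1, 1, 0, 0, 1, 0, 1],                  # k = 9
    [1, 0, 0, 1, 1, 0, 0, 1, 1, 0],                  # k = 10
], dtype=float)
NU = np.array([
    ONES, ONES, ONES, ONES, ONES, ZEROS, ONES,       # k = 0..6
    [1, 0, 1, 0, 1, 1, 1, 0, 0, 0],                  # k = 7
    [1, 0, 0, 1, 1, 0, 0, 1, 1, 1],                  # k = 8
    [1, 1, 1, 1, 0, 1, 0, 1, 1, 0],                  # k = 9
    [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],                  # k = 10
], dtype=float)

def sequence_distance(za_P, zb_P, za_Q, zb_Q):
    """Total sequence distance and the per-k squared distances.

    Each argument is an (11, 10) array of standard scores:
    za_* for sine coefficients, zb_* for cosine coefficients.
    """
    lam = (np.asarray(za_P, float) - np.asarray(za_Q, float)) ** 2
    rho = (np.asarray(zb_P, float) - np.asarray(zb_Q, float)) ** 2
    delta_sq = (MU * lam).sum(axis=1) + (NU * rho).sum(axis=1)
    return float(np.sqrt(delta_sq.sum())), delta_sq
```

Storing the weights as 0/1 masks keeps the calculation a pair of elementwise products, so the contribution of any wave number or property factor can be inspected directly in the returned per-k array.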

Comparison with the Previous Distance Function

The table below is an expansion of table 1 of reference 21 of the main text, and compares the performance of the extended distance function with that of the function used in previous work, described in that paper. The comparison is with respect to fold identification. We ask in what fraction of cases a match to the fold of a given protein (as expressed by various subsets of its CATH parameters) will be found in at least one of the NN nearest sequence neighbors of that protein. Results are shown for 3 different values of NN, and for all four levels of CATH classification. It should be remembered, however, that actual fold identification requires a match only to the CAT level. By that criterion, it can be seen that the extended sequence distance function clearly outperforms the earlier version. The degree of improvement ranges from 3% for NN=20 to 17% for NN=1. The p values for all cases of CAT matching are in the extreme significance range (p<<0.001).

The predicted fraction and associated standard deviation, which can be calculated by elementary combinatorial methods, are shown, as are Z values for the new and old distance functions. We note also that the correlation coefficient between sequence distances given by the new distance function and the associated structure distances remains R~0.8.

Level   Observed   Observed   Predicted   SD        Z (new)   Z (old)
        fraction   fraction   fraction
        (new)      (old)

NN=1
CATH    0.34       0.27       0.0051      0.00067   499.85    395.37
CAT     0.34       0.29       0.03        0.0015    206.67    173.33
CA      0.39       0.34       0.117       0.0029     94.14     76.90
C       0.59       0.56       0.395       0.0042     46.43     39.29

NN=10
CATH    0.48       0.44       0.047       0.002     216.50    196.50
CAT     0.55       0.52       0.2         0.005      70.00     64.00
CA      0.78       0.77       0.625       0.009      17.22     16.11
C       0.97       0.97       0.967       0.013       0.23      0.23

NN=20
CATH    0.53       0.5        0.089       0.003     147.00    137.00
CAT     0.62       0.6        0.289       0.0068     48.68     45.74
CA      0.88       0.87       0.789       0.019       4.79      4.26
C       0.99       0.99       0.998       0.019      -0.42     -0.42
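The observed fractions in the table correspond to a nearest-neighbor test that can be sketched as follows. This is a minimal Python illustration, not the code used to produce the table: the function names are hypothetical, the CATH labels are assumed to be dot-separated strings, and the pairwise sequence distances are assumed to be available as a square matrix.

```python
import numpy as np

def cath_level(label, depth):
    """Truncate a dot-separated CATH label to its first `depth` fields
    (depth 1 = C, 2 = CA, 3 = CAT, 4 = CATH)."""
    return ".".join(label.split(".")[:depth])

def match_fraction(dist, labels, nn):
    """Fraction of proteins with at least one label match among
    their nn nearest neighbors (the protein itself is excluded)."""
    dist = np.array(dist, dtype=float)   # copy, so the input is not modified
    np.fill_diagonal(dist, np.inf)       # never count a protein as its own neighbor
    labels = np.asarray(labels)
    nearest = np.argsort(dist, axis=1)[:, :nn]
    hits = (labels[nearest] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```

For the CAT row of the table, for instance, the labels would first be truncated with `cath_level(label, 3)` and the function evaluated at nn = 1, 10 and 20.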
