From: ISMB-99 Proceedings. Copyright ' 1999, AAAI (www.aaai.org). All rights reserved.
Using Sequence Motifs for Enhanced Neural Network
Prediction of Protein Distance Constraints
Jan Gorodkin1,2, Ole Lund1, Claus A. Andersen1, and Søren Brunak1
1 Center for
Biological Sequence Analysis, Department of Biotechnology,
The Technical University of Denmark, Building 208, DK-2800 Lyngby, Denmark
gorodkin@cbs.dtu.dk, olund@strubix.dk, ca2@cbs.dtu.dk, brunak@cbs.dtu.dk
Phone: +45 45 25 24 77, Fax: +45 45 93 15 85
2 Department of
Genetics and Ecology, The Institute of Biological Sciences
University of Aarhus, Building 540, Ny Munkegade, DK-8000 Aarhus C, Denmark
Abstract
Correlations between sequence separation (in residues) and
distance (in Angstrom) of any pair of amino acids in polypeptide chains are investigated. For each sequence separation we
define a distance threshold. For pairs of amino acids where
the distance between Cα atoms is smaller than the threshold,
a characteristic sequence (logo) motif is found. The motifs change as the sequence separation increases: for small
separations they consist of one peak located in between the
two residues, then additional peaks at these residues appear,
and finally the center peak smears out for very large separations. We also find correlations between the residues in the
center of the motif. This and other statistical analyses are
used to design neural networks with enhanced performance
compared to earlier work. Importantly, the statistical analysis explains why neural networks perform better than simple statistical data-driven approaches such as pair probability
density functions. The statistical results also explain characteristics of the network performance for increasing sequence
separation. The improvement of the new network design is
significant in the sequence separation range 10–30 residues.
Finally, we find that the performance curve for increasing sequence separation is directly correlated to the corresponding
information content. A WWW server, distanceP, is available
at http://www.cbs.dtu.dk/services/distanceP/.
Keywords: Distance prediction; sequence motifs; distance
constraints; neural network; protein structure.
Introduction
Much work has over the years been put into approaches
which either analyze or predict features of the three-dimensional structure using distributions of distances, correlated mutations, and more recently neural networks, or combinations of these, e.g. (Tanaka & Scheraga 1976; Miyazawa
& Jernigan 1985; Bohr et al. 1990; Sippl 1990; Maiorov
& Crippen 1992; Göbel et al. 1994; Mirny & Shaknovich
1996; Thomas, Casari, & Sander 1996; Lund et al. 1997;
Olmea & Valencia 1997; Skolnick, Kolinski, & Ortiz 1997;
Fariselli & Casadio 1999). The ability to adopt structure
from sequences depends on constructing an appropriate cost
function for the native structure. In search of such a function
we here concentrate on finding a method to predict distance
constraints that correlate well with the observed distances
in proteins. As the neural network approach is the only approach so far which includes sequence context for the considered pair of amino acids, neural networks are expected not only to
perform better, but also to capture more features relating distance constraints and sequence composition.
Present address: Structural Bioinformatics Advanced Technologies A/S, Agern Alle 3, DK-2970 Hørsholm, Denmark
The analysis includes investigation of the distances between amino acids as well as sequence motifs and correlations for separated residues. We construct a prediction
scheme which significantly improves on an earlier approach
(Lund et al. 1997).
For each particular sequence separation, the corresponding distance threshold is computed as the average of all
physical distances in a large data set between any two amino
acids separated by that number of residues (Lund et al.
1997). Here, we include an analysis of the distance distributions relative to these thresholds and use them to explain the qualitative behavior of the neural network prediction
scheme, thus extending earlier studies (Reese et al. 1996).
For the prediction scheme used here it is essential to relate
the distributions to their means. Analysis of the network
weight composition reveals intriguing properties of the distance constraints: the sequence motifs can be decomposed
into sub-motifs associated with each of the hidden units in
the neural network.
Further, as the sequence separation increases there is a
clear correspondence between the changes in the mean value, the distance distributions, and the sequence motifs describing the
distance constraints of the separated amino acids. The predicted distance constraints may be used as
inputs to threading, ab initio, or loop modeling algorithms.
Materials and Method
Data extraction
The data set was extracted from the Brookhaven Protein
Data Bank (Bernstein et al. 1977), release 82 containing
5762 proteins. In brief, entries were excluded if: (1) the secondary structure of the proteins could not be assigned by the
program DSSP (Kabsch & Sander 1983), as the DSSP assignment is used to quantify the secondary structure identity
in the pairwise alignments, (2) the proteins had any physical chain breaks (defined as neighboring amino acids in the
sequence having Cα distances exceeding 4.0 Å) or the
DSSP program detected chain breaks or incomplete backbones, (3) they had a resolution value greater than
2.5 Å, since proteins with worse resolution are less reliable
as templates for homology modeling of the Cα trace (unpublished results).
Individual chains of entries were discarded if (1) they had
a length of less than 30 amino acids, (2) they had less than
50% secondary structure assignment as defined by the program DSSP, (3) they had more than 85% non-amino acids
(nucleotides) in the sequence, and (4) they had more than
10% of non-standard amino acids (B, X, Z) in the sequence.
A representative set with low pairwise sequence similarity was selected by running algorithm #1 of Hobohm et al.
(1992) implemented in the program RedHom (Lund et al.
1997).
In brief the sequences were sorted according to resolution
(all NMR structures were assigned resolution 100). The sequences with the same resolution were sorted so that higher
priority was given to longer proteins.
The sequences were aligned utilizing the local alignment
program ssearch (Myers & Miller 1988; Pearson 1990) using the pam120 amino acid substitution matrix (Dayhoff &
Orcutt 1978), with gap penalties −12, −4. As a cutoff for
sequence similarity we applied the threshold $T = 290/\sqrt{L}$, where
T is the percentage of identity in the alignment and L the
length of the alignment. Starting from the top of the list,
each sequence was used as a probe to exclude all sequence-similar
proteins further down the list.
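The selection step described above can be sketched as a greedy pass over the priority-sorted list. This is only an illustration under stated assumptions: the function names are hypothetical, `align` stands in for a call to an alignment program and is not part of RedHom or ssearch, and treating "similar" as strictly greater than the cutoff is a choice made here.

```python
import math

def similarity_threshold(alignment_length):
    """Percent-identity cutoff T = 290 / sqrt(L) for an alignment of length L."""
    return 290.0 / math.sqrt(alignment_length)

def select_representatives(chains, align):
    """Greedy selection over a priority-sorted list of chains.

    chains: chain identifiers, highest priority first.
    align:  hypothetical callable (a, b) -> (percent_identity, alignment_length).
    A chain is kept only if no previously kept chain is similar to it, which
    gives the same result as letting each kept chain exclude similar chains
    further down the list.
    """
    kept = []
    for candidate in chains:
        similar = False
        for probe in kept:
            identity, length = align(probe, candidate)
            # Assumption: "similar" means identity strictly above the cutoff.
            if length > 0 and identity > similarity_threshold(length):
                similar = True
                break
        if not similar:
            kept.append(candidate)
    return kept
```

For an alignment of 100 residues the cutoff is 29% identity, so shorter alignments need a higher identity to count as similar.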
By visual inspection seven proteins were removed from
the list, since their structure is either ’sustained’ by DNA or
predominantly buried in the membrane. The resulting 744
protein chains are composed of the residues where the Cα atom position is specified in the PDB entry.
Ten cross-validation sets were selected such that they all
contain approximately the same number of residues, and all
have the same length distribution of the chains. All the data
are made publicly available through the world wide web
page http://www.cbs.dtu.dk/services/distanceP/.
Information Content / Relative entropy measure
Here we use the relative entropy to measure the information content (Kullback & Leibler 1951) of aligned regions
between sequence-separated residues. The information content is obtained by summing over the positions in
the alignment, $I = \sum_{i=1}^{L} I_i$, where $I_i$ is the information content of position i in the alignment. The information content
at each position will sometimes be displayed as a sequence
logo (Schneider & Stephens 1990). The position-dependent
information content is given by
$$ I_i = \sum_k q_{ik} \log_2 \frac{q_{ik}}{p_k}, \qquad (1) $$
where k refers to the symbols of the alphabet considered
(here amino acids). The observed fraction of symbol k at
position i is $q_{ik}$, and $p_k$ is the background probability of finding symbol k by chance in the sequences. $p_k$ will sometimes
be replaced by a position-dependent background probability, that is, the probability of observing letter k at some position in the alignment in another data set one wishes to compare to. Symbols in logos turned 180 degrees indicate that
$q_{ik} < p_k$.
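As a minimal sketch of Eq. (1), the per-position relative entropy can be computed from a matrix of observed fractions. The function name and the array layout are assumptions made for illustration; the background may be a single distribution or a position-dependent one.

```python
import numpy as np

def information_content(q, p):
    """Relative entropy per alignment position, in bits.

    q: array (L, K) of observed symbol fractions q_ik (each row sums to 1).
    p: array (K,) background, or (L, K) position-dependent background.
       p must be non-zero wherever q is non-zero.
    Returns (array of I_i for each position, total I).
    """
    q = np.asarray(q, dtype=float)
    p = np.broadcast_to(np.asarray(p, dtype=float), q.shape)
    terms = np.zeros_like(q)
    mask = q > 0  # terms with q_ik = 0 contribute 0 by convention
    terms[mask] = q[mask] * np.log2(q[mask] / p[mask])
    per_position = terms.sum(axis=1)
    return per_position, per_position.sum()
```

A position whose observed fractions equal the background contributes zero bits, while a fully conserved position against a uniform two-symbol background contributes one bit.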
Neural networks
As in the previous work, we apply two-layer feed-forward
neural networks, trained by standard back-propagation, see
e.g. (Brunak, Engelbrecht, & Knudsen 1991; Bishop 1996;
Baldi & Brunak 1998), to predict whether two residues are
below or above a given distance threshold in space. The
sequence input is sparsely encoded. In (Lund et al. 1997)
the inputs were processed as two windows centered around
each of the separated amino acids. However, here we extend
that scheme by allowing the windows to grow towards each
other, and even merge to a single large window covering the
complete sequence between the separated amino acids. Even
though such a scheme increases the computational requirements, it allows us to search for the optimal coverage between the
separated amino acids.
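To make the input representation concrete, the following sketch shows a sparse (one-hot) encoding of a sequence window and the forward pass of a two-layer network with one output. The 20-letter alphabet, the all-zero encoding of unknown symbols, and the sigmoid activations are assumptions made here; the source does not spell out these details.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and ordering

def sparse_encode(window):
    """One block of 20 inputs per window position, with a single 1 per block.
    Symbols outside the alphabet (e.g. X) give an all-zero block."""
    x = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        idx = AMINO_ACIDS.find(aa)
        if idx >= 0:
            x[pos, idx] = 1.0
    return x.ravel()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b_hidden, W, b_out):
    """Two-layer feed-forward pass.

    x: encoded input; w: (hidden, inputs) input-to-hidden weights;
    W: (hidden,) hidden-to-output weights. Returns a value in (0, 1),
    read as below (close to 1) or above (close to 0) the threshold.
    """
    hidden = sigmoid(w @ x + b_hidden)
    return sigmoid(W @ hidden + b_out)
```

With this encoding every input-to-hidden weight corresponds to one amino acid at one window position.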
As there can be a large difference in the number of positive (contact) and negative (no contact) sequence windows,
for a given separation, we apply the balanced learning approach (Rost & Sander 1993). Training is done by a 10 set
cross-validation approach (Bishop 1996), and the result is
reported as the average performance over the partitions. The
performance on each partition is evaluated by the Matthews
correlation coefficient (Matthews 1975)
$$ C = \frac{P_t N_t - P_f N_f}{\sqrt{(N_t + N_f)(N_t + P_f)(P_t + N_f)(P_t + P_f)}}, \qquad (2) $$
where $P_t$ is the number of true positives (contact, predicted
contact), $N_t$ the number of true negatives (no contact, no
contact predicted), $P_f$ the number of false positives (no contact, contact predicted), and $N_f$ is the number of false negatives (contact, no contact predicted).
The analysis of the patterns stored in the weights of the
networks is done through the saliency of the weights, that
is the cost of removing a single weight while keeping the
remaining ones. Due to the sparse encoding each weight
connected to a hidden unit corresponds exactly to a particular amino acid at a given position in the sequence windows
used as inputs. We can then obtain a ranking of symbols on
each position in the input field. To compute the saliencies
we use the approximation for two-layer one-output networks
(Gorodkin et al. 1997), who showed that the saliencies for
the weights between input and hidden layer can be written
as
$$ s^k_{ji} = s_{ji} = w_{ji}^2 W_j^2 K, \qquad (3) $$
where $w_{ji}$ is the weight between input i and hidden unit j,
and $W_j$ the weight between hidden unit j and the output. K
is a constant. The kth symbol is implicitly given due to the
sparse encoding. If the signs of $w_{ji}$ and $W_j$ are opposite,
we display the corresponding symbol upside down in the
weight-saliency logo.
[Figure 1 plot: nine histogram panels, one each for sequence separations 2, 3, 4, 11, 12, 13, 16, 20, and 60. x-axis: Physical distance (Angstrom); y-axis: Number.]
Figure 1: Length distributions of residue segments for corresponding sequence separations, relative to the respective mean values. Sequence
separations 2, 3, 4, 11, 12, 13, 16, 20, and 60 are shown. These show the representative shapes of the distance distributions. The vertical line
through zero indicates the displacement with respect to the mean distance.
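A minimal sketch of how the saliencies of Eq. (3) could be turned into a ranking of symbols per window position is given below. The function names, the 20-letter alphabet, and the layout of the weight matrix (one block of 20 inputs per position) are assumptions for illustration, and K is set to 1 since it only rescales the values.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and ordering

def weight_saliencies(w, W, K=1.0):
    """s_ji = w_ji^2 * W_j^2 * K for weights w (hidden, inputs) and W (hidden,)."""
    w = np.asarray(w, dtype=float)
    W = np.asarray(W, dtype=float)
    return (w ** 2) * (W ** 2)[:, None] * K

def rank_symbols(w, W, hidden_unit, n_positions):
    """Rank symbols at each window position for one hidden unit.

    Returns, per position, a list of (symbol, saliency, upside_down) sorted
    by decreasing saliency; upside_down is True when w_ji and W_j differ in sign.
    """
    w = np.asarray(w, dtype=float)
    W = np.asarray(W, dtype=float)
    shape = (n_positions, len(AMINO_ACIDS))
    s = weight_saliencies(w, W)[hidden_unit].reshape(shape)
    signs = (w[hidden_unit] * W[hidden_unit]).reshape(shape)
    ranking = []
    for pos in range(n_positions):
        order = np.argsort(-s[pos])
        ranking.append([(AMINO_ACIDS[k], s[pos, k], signs[pos, k] < 0) for k in order])
    return ranking
```

Running this once per hidden unit yields one ranking per unit, which is the information a weight-saliency logo displays.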
Results
We conduct statistical analysis of the data and the distance
constraints between amino acids, and subsequently use the results to design and explain the behavior of a neural network
prediction scheme with enhanced performance.
Statistical analysis
For each sequence separation (in residues), we derive the
mean of all the distances between pairs of Cα atoms. We use
these means as distance constraint thresholds, and in a prediction scheme we wish to predict whether the distance between any pair of amino acids is above or below the threshold corresponding to a particular separation in residues. To
analyze which pairs are above and below the threshold, it
is relevant to compare: (1) the distribution of distances between amino acid pairs below and above the threshold, and
(2) the sequence composition of segments where the pairs of
amino acids are below and above the threshold.
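The thresholds described above can be sketched as follows: pool all Cα–Cα distances at a given sequence separation and take their mean, then label each pair as below or above it. The function names and the input layout (one coordinate array per chain) are assumptions made for illustration.

```python
import numpy as np

def mean_distance_thresholds(chains, max_separation):
    """Mean C-alpha distance for each sequence separation, pooled over chains.

    chains: list of (n_residues, 3) coordinate arrays, one per chain.
    Returns a dict mapping separation -> mean distance (the threshold).
    """
    sums = np.zeros(max_separation + 1)
    counts = np.zeros(max_separation + 1)
    for ca in chains:
        ca = np.asarray(ca, dtype=float)
        for s in range(1, min(max_separation, len(ca) - 1) + 1):
            d = np.linalg.norm(ca[s:] - ca[:-s], axis=1)
            sums[s] += d.sum()
            counts[s] += len(d)
    return {s: sums[s] / counts[s]
            for s in range(1, max_separation + 1) if counts[s] > 0}

def label_pairs(ca, separation, threshold):
    """True where the pair (i, i + separation) is closer than the threshold."""
    ca = np.asarray(ca, dtype=float)
    d = np.linalg.norm(ca[separation:] - ca[:-separation], axis=1)
    return d < threshold
```

Subtracting the threshold from each distance gives the displaced distributions whose shapes are compared across separations.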
First we investigate the length distribution of the distances
as a function of the sequence separation. A complete investigation of physical distances for increasing sequence separation is given by (Reese et al. 1996). In particular it
was found that the α-helices caused a distinct peak up to
sequence separation 20, whereas β-strands are seen up to
separation 5 only. However, when we perform a simple
translation of these distributions relative to their respective
means, the same plots provide essentially new qualitative information, which is anticipated to be strongly correlated to
the performance of a predictor (above/below the threshold).
In particular we focus on the distributions shown in Figure 1,
but we also use the remaining distributions to make some
important observations.
When inspecting the distance distributions relative to their
mean, two main observations are made. First, the distance
distribution for sequence separation 3 is the one
where the data are most bimodal. Thus sequence
separation 3 provides the most distinct partition of the data
points. Hence, in agreement with the results in (Lund et
al. 1997), we anticipate that the best prediction of distance
constraints can be obtained for sequence separation 3. Furthermore, we observe that the α-helix peak shifts relative to
the mean when going from sequence separation 11 to 13.
The length of the helices becomes longer than the mean distances. Interestingly, this shift places the peak
at the mean value itself for sequence separation 12. Due to
this phenomenon, we anticipate that, for an optimized predictor, it can be slightly harder to predict distance constraints
for separation 12 than for separations 11 and 13.
The peak present at the mean value for sequence separation 12 does indeed reflect the length of helices, as demonstrated clearly in Figure 2. Rather than using the simple rule
that each residue in a helix increases the physical distance
by 1.5 Angstrom (Branden & Tooze 1999), we computed
the actual physical lengths for helices of each size to obtain a
more accurate picture. The physical length of the α-helices
was calculated by finding the helical axis and measuring the
translation per Cα atom along this axis. The helical axis
was determined by the mass center of four consecutive Cα atoms. Helices of length four are therefore not included,
since only one center of mass was present. We see that helices at 12 residues coincide with sequence separation 12.
Again we use this as an indication that at sequence separation 12 it may be slightly harder to predict distance constraints than at separation 13.
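One way to read the helix-length procedure above is sketched below, under assumptions that go beyond the text: axis points are taken as the centers of mass of every run of four consecutive Cα atoms (equal masses), the rise per residue is the mean spacing between consecutive axis points, and the helix length is that rise times the number of residue steps. The function name is hypothetical.

```python
import numpy as np

def helix_length(ca):
    """Estimate the physical length of a helix from C-alpha coordinates.

    ca: (n, 3) array of consecutive C-alpha positions in one helix, n >= 5.
    """
    ca = np.asarray(ca, dtype=float)
    if len(ca) < 5:
        raise ValueError("need at least five residues to get two axis points")
    # Centers of mass of residues (i, i+1, i+2, i+3), assuming equal masses.
    axis = (ca[:-3] + ca[1:-2] + ca[2:-1] + ca[3:]) / 4.0
    # Mean translation between consecutive axis points = rise per residue.
    rise = np.linalg.norm(np.diff(axis, axis=0), axis=1).mean()
    # Assumption: length is the rise times the number of residue steps.
    return rise * (len(ca) - 1)
```

Because the centers of mass lie close to, but not exactly on, the axis, this estimate is slightly larger than the ideal 1.5 Angstrom per residue for a regular α-helix.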
[Figure 2 plot: "Separation and helix length". x-axis: Separation/Length (Residues); y-axis: Physical distance (Angstrom); curves: Average sequence separation, Average helix length.]
Figure 2: Mean distances for increasing sequence separation and
computed average physical helix lengths for increasing number of
residues.
As the sequence separation increases, the distance distribution approaches a universal shape, presumably independent of structural elements, which are governed by more local distance constraints. As mentioned by (Reese et al. 1996) (who merge large
separations), the traits from the local elements vanish for separations 20–25. (Here we considered separations up to 100 residues.) The transition seen in Figure 1
from bimodal distributions to unimodal
distributions centered around their means indicates that prediction of distance constraints must become harder with increasing sequence separation. In particular, when the universal shape has been reached (sequence separation larger than
30), we should not expect a change in the prediction ability
as the sequence separation increases. The universal distribution has its mode value at approximately −3.5 Angstrom,
indicating that the most probable physical distance for large
sequence separations corresponds to the distance between
two Cα atoms.
Notice that the universality only appears when the distribution is displaced by its mean distance. This is interesting
since the mean of the physical distances grows as the sequence separation increases. From a predictability perspective, it therefore makes good sense to use the mean value as
a threshold to decide the distance constraint for an arbitrary
sequence separation.
A useful prediction scheme must rely on the information
available in the sequences. To investigate if there exists a
detectable signal between the sequence-separated residues,
for each sequence separation we constructed sequence logos (Schneider & Stephens 1990) as follows: The sequence
segments for which the physical distance between the separated amino acids was above the threshold were used to generate a position-dependent background distribution of amino
acids. The segments with corresponding physical distance
below the threshold were all aligned and displayed in sequence logos using the computed background distribution
of amino acids. The pure information content curves are
shown for increasing sequence separation in Figure 3. The corresponding sequence logos are displayed in Figure 4. We used a
“margin” of 4 residues to the left and right of the logos, e.g.,
for sequence separation 2, the physical distance is measured
between position 5 and 7 in the logo.
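The logo construction above can be sketched as counting amino acids per position in two sets of aligned segments: those below the threshold (the observed fractions) and those above it (the position-dependent background). The function name, the pseudocount, and the decision to skip pairs whose margins run off the chain are assumptions made here, not details from the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet and ordering
MARGIN = 4  # residues shown on each side of the separated pair

def segment_frequencies(sequence, is_below, separation, pseudocount=1.0):
    """Position-specific amino acid frequencies for one chain.

    sequence: amino acid string.
    is_below: one boolean per pair (i, i + separation), True if the pair
              is closer than the threshold.
    Returns (q_below, p_above), each of shape (separation + 1 + 2*MARGIN, 20).
    """
    width = separation + 1 + 2 * MARGIN
    counts = {True: np.full((width, len(AMINO_ACIDS)), pseudocount),
              False: np.full((width, len(AMINO_ACIDS)), pseudocount)}
    for i, below in enumerate(is_below):
        start = i - MARGIN
        if start < 0 or start + width > len(sequence):
            continue  # assumption: skip pairs without a full margin
        for pos, aa in enumerate(sequence[start:start + width]):
            k = AMINO_ACIDS.find(aa)
            if k >= 0:
                counts[bool(below)][pos, k] += 1
    q = counts[True] / counts[True].sum(axis=1, keepdims=True)
    p = counts[False] / counts[False].sum(axis=1, keepdims=True)
    return q, p
```

For separation 2 the segment is 11 residues wide and the separated pair sits at logo positions 5 and 7 (1-based), in line with the margin convention described above. The information content per position then follows as the sum over symbols of q log2(q/p).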
The change in the sequence patterns is consistent with the
change in the distribution of physical distances. Up to separations 6–7, the distribution of distances (Figure 1) contains
two peaks, with the β-strand peak vanishing completely for
separations of 7–8 residues. For the same separations, the sequence motif changes from containing one to three peaks.
For larger sequence separations, the motif consists of three
characteristic peaks, two of them located at positions exactly corresponding to the separated amino acids. The third peak appears in the center. This peak tells us that for physical distances within the threshold, it indeed matters whether the
residues in the center can bend or turn the chain. We see
that exactly such amino acids are located in the center peak:
glycine, aspartic acid, glutamic acid, and lysine. All of these
(except glycine) are medium-size residues, being hydrophilic,
thereby having affinity for the solvent, but not being on the
outermost surface of the protein (see e.g., (Creighton 1993)
for amino acid properties). As the sequence separation increases from about 20 to 30 the center peak smears out, in
agreement with the transition of the distance distribution that
shifts to the universal distribution in this range of sequence
separation. The sequence motif likewise becomes “universal”. Only the peaks located at the separated residues are
left.
[Figure 3 plot: eight panels of information content per position, one each for separations 2, 7, 9, 12, 13, 20, 30, and 60. x-axis: Position; y-axis: Bits.]
Figure 3: Sequence information profiles for increasing sequence separation.
The composition of single-peak motifs resembles to a large
degree the composition of the center of motifs having three
peaks. However, the slightly increased amount of leucine
moves from the center of the one-peak motifs to the separated amino acid positions in the three-peak motifs. The
reverse happens for glycine. Interestingly, as the sequence
separation increases valine (and isoleucine) become present
at the outermost peaks. In agreement with the helix length
shift from below to above the threshold at separations 12
and 13, the center peak shifts from slight over- to slight underrepresentation of alanine: helices are no longer dominant
for distances below the threshold.
[Figure 4 plot: sequence logos, one each for separations 2, 7, 9, 12, 13, 20, 30, and 60. x-axis: position in the logo; y-axis: bits.]
Figure 4: Sequence logos for increasing sequence separation. A margin of 4 residues on both ends of the logo is used. Symbols displayed upside down indicate that their amount is lower than the background amount. This figure is available in full color at
http://www.cbs.dtu.dk/services/distanceP/.
The smearing out of the center peak for large sequence
separations does not indicate a lack of bending/turn properties, but reflects that such amino acids can be placed in a
large region in between the two separated amino acids. What
we see for small separations is that the signal must, in addition,
be located in a very specific position relative to the
separated amino acids.
Neural networks: predictions and behavior
As found above, optimized prediction of distance constraints
for varying sequence separation should be performed by a
non-linear predictor. Here we use neural networks. Clearly,
by far most of the information is found in the sequence logos,
so we expect only a few hidden neurons (units) to be necessary in the design of the neural network architecture. As
earlier, we only use two-layer networks with one output unit
(above/below threshold). We fix the number of hidden units
to 5, but the size of the input field may vary.
We start out by quantitatively investigating the relation between the sequence motifs in the logos above and the amount of sequence context needed in the prediction scheme. First, we choose the amount of sequence context by starting with local windows around the separated amino acids and then extending the sequence region, r, to be included. We only include additional residues in the direction towards the center. The number of residues used in the direction away from the center is fixed to 4, except in the cases of r = 0 and r = 2, where that number is zero and two, respectively. At some point the two sequence regions may reach each other. Whether they overlap or just connect does not affect the performance. For each r = 0, 2, 4, 6, 8, 10, 12, 14, we trained networks for sequence separations 2 to 99, resulting in the training of about 8000 networks (8 values of r × 98 separations × 10 cross-validation sets = 7840). The performances in “fraction correct” and “correlation coefficient” are shown in Figure 5.
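The window scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `context_windows` and `count_networks` are hypothetical, residue positions are plain integer indices, and chain ends are not handled. The margin away from the center is taken as min(r, 4), which reproduces the stated values of 0, 2, and 4.

```python
def context_windows(i, separation, r, max_margin=4):
    """Return inclusive residue index ranges for the pair (i, i + separation).

    r          -- residues added from each amino acid toward the center
    max_margin -- residues kept in the direction away from the center
                  (4 in the text, reduced to r when r is 0 or 2)
    Assumption: indices may run past the chain ends; padding is not handled.
    """
    j = i + separation
    margin = min(r, max_margin)       # 0 for r = 0, 2 for r = 2, otherwise 4
    left = (i - margin, i + r)        # region around the first residue
    right = (j - r, j + margin)       # region around the second residue
    if left[1] + 1 >= right[0]:       # regions connect or overlap
        return [(left[0], right[1])]  # treat them as one merged window
    return [left, right]


def count_networks(r_values, separations, n_cv_sets):
    """One network per (context size, separation, cross-validation set)."""
    return len(r_values) * len(separations) * n_cv_sets


if __name__ == "__main__":
    print(context_windows(10, 20, 4))   # two separate windows
    print(context_windows(10, 8, 4))    # regions merge into one window
    print(count_networks(range(0, 16, 2), range(2, 100), 10))
```

With this rule the two regions merge whenever the separation is at most 2r + 1, so for r = 14 a single window is used for separations up to 29.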
Figure 5(b) shows the performance in terms of correlation coefficient. Whenever the sequence separation becomes so large that the two context regions do not connect, the performance drops compared to contexts that merge into a single window. This behavior is generic: it happens consistently for increasing size of the sequence context region, though the effect decreases asymptotically. An effect can be found up to a region size of about 30, see Figure 6. Several features appear on the performance curve, and they can all be explained by the statistical observations made earlier. First, the curve with no context (r = 0) corresponds to the curve one obtains using pair probability density functions (Lund et al. 1997). The poor performance of such a predictor is to be expected, since the motif is not included in the model. The green curve (r = 4) is the curve that corresponds to the neural network predictor in (Lund et al. 1997). As observed therein, a significant improvement is obtained when including just a little sequence context. As the size of the context region is increased, so is the performance, up to about sequence separation 30. Beyond that, the performance remains the same independent of the sequence separation. We even note the drop in performance for sequence separation 12, as predicted from the statistical analysis above. It is also characteristic that, as the distribution of physical distances approaches the universal shape and as the center peak of the sequence logos vanishes, the performance drops and levels off at a correlation coefficient of approximately 0.15. The change in prediction performance takes place at the sequence separations at which the distribution of physical distances and the sequence motifs change. Because the signal in the logos does not vanish for large separations, it is possible to predict distance constraints significantly better than random, and better than with a no-context approach. The conclusion is not surprising: the best performing network is the one that uses as much context as possible. However, for large sequence separations, more than 30–35 residues, the same performance is obtained by using just a small amount of sequence context around the separated amino acids, as in (Lund et al. 1997). We found that the difference in performance between the worst and best cross-validation set is about 0.1 for separations up to 12 residues, then about 0.07 for separations up to 30 residues, and then 0.1–0.2 for separations larger than 30. Hence, at sequence separation 31 the performance starts to fluctuate, indicating that the sequence motif becomes less well defined throughout the sequences in the respective cross-validation sets. We can therefore use the networks as an indicator of when a sequence motif is well defined.
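For readers who want to reproduce the two performance measures, the sketch below computes the fraction correct and a correlation coefficient for binary above/below-threshold predictions. It assumes that "correlation coefficient" means the Matthews coefficient (the 1975 T4 lysozyme paper appears in the reference list); the function names are hypothetical and the convention of returning 0 for an undefined coefficient is a choice made here, not taken from the text.

```python
import math


def confusion_counts(predicted, observed):
    """Count true/false positives/negatives for two equal-length binary sequences."""
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    tn = sum(1 for p, o in zip(predicted, observed) if not p and not o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    fn = sum(1 for p, o in zip(predicted, observed) if not p and o)
    return tp, tn, fp, fn


def fraction_correct(predicted, observed):
    """Share of pairs whose above/below-threshold class is predicted correctly."""
    tp, tn, fp, fn = confusion_counts(predicted, observed)
    return (tp + tn) / (tp + tn + fp + fn)


def correlation_coefficient(predicted, observed):
    """Matthews correlation coefficient; 0.0 when a marginal count is zero."""
    tp, tn, fp, fn = confusion_counts(predicted, observed)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumption: treat the undefined case as no correlation
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1, 0]   # 1 = predicted below threshold
    observed = [1, 0, 0, 0, 1, 1]    # 1 = observed below threshold
    print(fraction_correct(predicted, observed))
    print(correlation_coefficient(predicted, observed))
```

Unlike the fraction correct, this coefficient is not inflated when one class dominates, which is one reason to report both measures.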
As an independent test, we predicted on nine CASP3 (http://predictioncenter.llnl.gov/casp3/) targets (and submitted the predictions). None of these targets (T0053, T0067, T0071, T0077, T0081, T0082, T0083, T0084, T0085) has sequence similarity to any sequence with known structure. The previous method (Lund et al. 1997) had on average 64.5% correct predictions and an average correlation coefficient of 0.224. The method presented here has on average 70.3% correct predictions and an average correlation coefficient of 0.249. The performance of the original method corresponds to the performance previously published (Lund et al. 1997). However, the average is a less convenient way to report the performance, as the good performance on small separations is obscured by the many large-separation networks that have essentially the same performance. Therefore we have also constructed performance curves similar to the one in Figure 5 (not shown). The main features of the prediction curve were present. In Figure 8 we show a prediction example of the distance constraints for T0067. The server (http://www.cbs.dtu.dk/services/distanceP/) provides predictions up to a sequence separation of 100. Notice that predictions up to a sequence separation of 30 clearly capture the main part of the distance constraints.
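As an illustration of what a "distance constraint" is in this setting, the sketch below labels each residue pair as below or above a threshold that depends on the sequence separation. It assumes the threshold is the mean Cα–Cα distance at that separation (the Figure 8 caption refers to thresholds as mean values); the function names and the toy coordinates are hypothetical and do not come from the paper.

```python
import math


def mean_distance_thresholds(chains, max_separation):
    """Mean C-alpha distance for each sequence separation over a set of chains."""
    totals, counts = {}, {}
    for coords in chains:
        for i in range(len(coords)):
            for s in range(1, max_separation + 1):
                j = i + s
                if j >= len(coords):
                    break
                totals[s] = totals.get(s, 0.0) + math.dist(coords[i], coords[j])
                counts[s] = counts.get(s, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


def distance_constraints(coords, thresholds):
    """Map (i, j) to True when the pair is closer than the threshold for j - i."""
    constraints = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            s = j - i
            if s in thresholds:
                constraints[(i, j)] = math.dist(coords[i], coords[j]) < thresholds[s]
    return constraints


if __name__ == "__main__":
    # Toy chains (hypothetical coordinates in Angstrom): one extended, one bent.
    straight = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)]
    bent = [(0, 0, 0), (3.8, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0)]
    thresholds = mean_distance_thresholds([straight, bent], max_separation=3)
    print(distance_constraints(bent, thresholds)[(0, 3)])      # close pair
    print(distance_constraints(straight, thresholds)[(0, 3)])  # distant pair
```

The observed constraints in the upper triangle of a plot such as Figure 8 correspond to one such boolean per residue pair.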
We also investigated the qualitative relation between the network performance and information content in the sequence logos. Interestingly, and in agreement with the observations made earlier, we found that the two curves have the same qualitative behavior as the sequence separation increases (Figure 7). Both curves peak at separation 3, both curves drop at separation 12, and both curves reach the plateau at sequence separation 30. We note that the relative entropy curve increases slightly as the separation increases.
This is due to a well-known artifact of entropy measures for decreasing sample size. The correlation between the curves also indicates that most of the information used by the predictor stems from the motifs in the sequence logos.
[Figure 5 plots: mean performances versus sequence separation (residues, 0–100) for r = 0, 2, 4, 6, 8, 10, 12, 14; (a) fraction correct, (b) correlation coefficient.]
Figure 5: The performance as a function of increasing sequence context when predicting distance constraints. The sequence context is the number of additional residues r = 0, 2, 4, 6, 8, 10, 12, 14 from the amino acid and toward the other amino acid. The performance is shown as fraction correct (a) and as correlation coefficient (b). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
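The information content compared with the network performance is a relative entropy. A minimal sketch of a per-position relative entropy (Kullback-Leibler divergence, in bits) against background amino acid frequencies is given below. The exact formula used for the logos is defined in the Methods section and is not reproduced here, so this form, the function names, and the two-letter toy alphabet are assumptions made for illustration.

```python
import math


def column_frequencies(column):
    """Relative frequency of each symbol in one alignment column."""
    counts = {}
    for aa in column:
        counts[aa] = counts.get(aa, 0) + 1
    n = len(column)
    return {aa: c / n for aa, c in counts.items()}


def relative_entropy_bits(observed, background):
    """Sum over symbols of p * log2(p / q) for one logo position."""
    total = 0.0
    for aa, p in observed.items():
        if p > 0:
            total += p * math.log2(p / background[aa])
    return total


def logo_information(columns, background):
    """Relative entropy for every position of a set of aligned windows."""
    return [relative_entropy_bits(column_frequencies(col), background)
            for col in columns]


if __name__ == "__main__":
    # Toy two-letter alphabet with a uniform background (hypothetical values).
    background = {"A": 0.5, "G": 0.5}
    print(logo_information(["AAAA", "AGAG"], background))
```

Because frequencies estimated from few sequences deviate from the background by chance, such estimates tend to rise as the sample shrinks, which is the artifact mentioned in the text.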
[Figure 6 plot: single window curves, correlation coefficient versus sequence separation (residues, 0–100); curves: Plain, Profile.]
Figure 6: The performance curve using a single window only. We also show the performance when including homologues in the respective tests. This completes a profile.
[Figure 7 plot: relative entropy and NN performance (arbitrary scale) versus sequence separation (residues, 0–100).]
Figure 7: Information content and network performance for increasing sequence separation.
[Figure 8 plot: CASP3 target T0067; both axes give the position in the sequence; scale bins for the predicted probability: ]-; 0.2], ]0.2; 0.3], ]0.3; 0.4], ]0.4; 0.5], ]0.5; 0.6], ]0.6; 0.7], ]0.7; 0.8], ]0.8; 0.9], ]0.9; 1.0].]
Figure 8: Prediction of distance constraints for the CASP3 target T0067. The upper triangle shows the distance constraints of the published coordinates, where dark points indicate distances above the thresholds (mean values) and light points distances below the thresholds. The lower triangle shows the actual neural network predictions. The scale indicates the predicted probability for sequence separations having distances below the thresholds. Thus light points represent predictions of distances below the thresholds and dark points predictions of distances above the thresholds. The numbers along the edges of the plot show the position in the sequence. This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
Finally, we conclude with an analysis of the neural network weight composition with the aim of revealing significant motifs used in the learning process. In general, we found the same motifs appearing for each of the 10 cross-validation sets for a given sequence separation. As described in the Methods section, we compute saliency logos of the network parameters, that is, we display the amino acids for which the corresponding weights, if removed, cause the largest increase in network error. We made saliency logos for each of the 5 hidden neurons. For the short sequence separations the network strongly punishes having a proline in the center of the input field, that is, a proline in between the separated amino acids. This makes sense as proline is not located within secondary structure elements, and the helix lengths for small separations are smaller than the mean distances. (The network has learnt that proline is not present in helices.) In contrast, the presence of aspartic acid can have a positive or negative effect depending on the position in the input field. For larger sequence separations, some of the neurons display a three-peak motif resembling what was found in the sequence logos. The center of such logos can be very glycine rich, but can also show a strong punishment for valine or tryptophan. Sometimes two neurons share the motif that is present for one neuron in another cross-validation set. Other details can be found as well. Here we have just outlined the main features: the networks not only look for favorable amino acids, they also detect those that are not favorable at all. In Figure 9 we show an example of the motifs for 2 of the 5 hidden neurons for a network with sequence separation 13.
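The notion of saliency used here, the increase in network error when a weight is removed, can be illustrated with a brute-force sketch. The paper's actual saliency computation is defined in its Methods section and may rely on an analytical approximation; the code below simply sets each input-to-hidden weight to zero in turn and measures the change in error. The tiny network, its weights, the omission of biases, and all function names are hypothetical.

```python
import math


def forward(x, w_hidden, w_out):
    """Two-layer network: tanh hidden units and one sigmoid output (no biases)."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    z = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-z))


def error(data, w_hidden, w_out):
    """Summed squared error over (input, target) pairs."""
    return sum((forward(x, w_hidden, w_out) - t) ** 2 for x, t in data)


def ablation_saliencies(data, w_hidden, w_out):
    """Error increase when each input-to-hidden weight is set to zero in turn."""
    base = error(data, w_hidden, w_out)
    saliencies = []
    for h, row in enumerate(w_hidden):
        row_saliency = []
        for k in range(len(row)):
            pruned = [list(r) for r in w_hidden]  # copy so the original is untouched
            pruned[h][k] = 0.0
            row_saliency.append(error(data, pruned, w_out) - base)
        saliencies.append(row_saliency)
    return saliencies


if __name__ == "__main__":
    # Toy problem: only the first input carries signal (hypothetical values).
    data = [([1, 0], 1), ([-1, 0], 0), ([1, 1], 1), ([-1, 1], 0)]
    w_hidden = [[3.0, 0.0]]   # one hidden unit, two inputs
    w_out = [4.0]
    print(ablation_saliencies(data, w_hidden, w_out))
```

In a saliency logo, values of this kind would be grouped by input position and amino acid, with the sign of the weight deciding whether a letter is drawn upright or upside down.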
[Figure 9 plots: three saliency-weight logos; vertical axis: saliency; horizontal axis: position in the input field.]
Figure 9: Three saliency-weight logos from the trained neural network at sequence separations 9 and 13. For each position of the input field the saliency of the respective amino acids is displayed. Letters turned upside down contribute negatively to the weighted sum of the network input. The top logo (sequence separation 9) illustrates strong punishment for proline and glycine upstream. On the middle logo (sequence separation 13), N is turned upside down on the peaks near the edges. On the bottom logo the I’s close to the edges contribute positively, and the N in the center peak also contributes positively. The I’s in the center peak contribute negatively (they are upside down). This figure is available in full color at http://www.cbs.dtu.dk/services/distanceP/.
Final remarks
We studied prediction of distance constraints in proteins. We found that sequence motifs were related to the distribution of distances. Using the motifs we showed how to construct an optimal neural network predictor which improves on an earlier approach. The behavior of the predictor could be completely explained from a statistical analysis of the data; in particular, a drop in performance for sequence separation 12 residues was predicted from the statistical analysis. We also correctly predicted separation 3 as having optimal performance. We found that the information content of the logos has the same qualitative behavior as the network performance for increasing sequence separation. Finally, the weight composition of the network was analyzed and we found that the sequence motif appearing was in agreement with the sequence logos.
The perspectives are many: the network performance may be improved further by using three windows for sequence separations between 20 and 30 residues. The performance of the network can be used as input to a new collection of networks which can clean up certain types of false predictions. Combining the method with a secondary structure prediction should give a significant performance improvement on the short sequence separations. The relation between information content and network performance might be explained quantitatively through extensive algebraic considerations.
Acknowledgments
This work was supported by The Danish National Research Foundation. OL was supported by The Danish 1991 Pharmacy Foundation.
References
Baldi, P., and Brunak, S. 1998. Bioinformatics — The
machine learning approach. Cambridge Mass.: MIT Press.
Bernstein, F. C.; Koetzle, T. G.; Williams, G. J. B.;
Meyer, E. F.; Brice, M. D.; Rogers, J. R.; Kennard,
O.; Shimanouchi, T.; and Tasumi, M. 1977. The
protein data bank: A computer based archival file for
macromolecular structures. J. Mol. Biol. 122:535–542.
(http://www.pdb.bnl.gov/).
Bishop, C. M. 1996. Neural Networks for Pattern Recognition. Oxford: Oxford University Press.
Bohr, H.; Bohr, J.; Brunak, S.; Cotterill, R. M. C.; Fredholm, H.; Lautrup, B.; and Petersen, S. B. 1990. A novel
approach to prediction of the 3-dimensional structures of
protein backbones by neural networks. FEBS Lett. 261:43–46.
Branden, C., and Tooze, J. 1999. Introduction to Protein
Structure. New York: Garland Publishing Inc., 2nd edition.
Brunak, S.; Engelbrecht, J.; and Knudsen, S. 1991. Prediction of human mRNA donor and acceptor sites from the
DNA sequence. J. Mol. Biol. 220:49–65.
Creighton, T. E. 1993. Proteins. New York: W. H. Freeman
and Company, 2nd edition.
Dayhoff, M. O.; Schwartz, R. M.; and Orcutt, B. C.
1978. A model of evolutionary change in proteins. Atlas of
Protein Sequence and Structure 5, Suppl. 3:345–352.
Fariselli, P., and Casadio, R. 1999. A neural network based
predictor of residue contacts in proteins. Prot. Eng. 12:15–21.
Göbel, U.; Sander, C.; Schneider, R.; and Valencia, A.
1994. Correlated mutations and residue contacts in proteins. Proteins 18:309–317.
Gorodkin, J.; Hansen, L. K.; Lautrup, B.; and Solla, S. A.
1997. Universal distribution of saliencies for pruning in
layered neural networks. Int. J. Neural Systems 8:489–498.
(http://www.wspc.com.sg/journals/ijns/85_6/gorod.pdf).
Hobohm, U.; Scharf, M.; Schneider, R.; and Sander, C.
1992. Selection of a representative set of structures from
the Brookhaven Protein Data Bank. Prot. Sci. 1:409–417.
Kabsch, W., and Sander, C. 1983. Dictionary of protein
secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637.
Kullback, S., and Leibler, R. A. 1951. On information and
sufficiency. Ann. Math. Stat. 22:79–86.
Lund, O.; Frimand, K.; Gorodkin, J.; Bohr, H.; Bohr,
J.; Hansen, J.; and Brunak, S. 1997. Protein distance constraints predicted by neural networks and probability density functions.
Prot. Eng. 10:1241–1248.
(http://www.cbs.dtu.dk/services/CPHmodels).
Maiorov, V. N., and Crippen, G. M. 1992. Contact potential
that recognizes the correct folding of globular proteins. J.
Mol. Biol. 227:876–888.
Matthews, B. W. 1975. Comparison of the predicted
and observed secondary structure of T4 phage lysozyme.
Biochim. Biophys. Acta 405:442–451.
Mirny, L. A., and Shakhnovich, E. I. 1996. How to derive a protein folding potential? A new approach to an old
problem. J. Mol. Biol. 264:1164–1179.
Miyazawa, S., and Jernigan, R. L. 1985. Estimation of
effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18:534–552.
Myers, E. W., and Miller, W. 1988. Optimal alignments in
linear space. CABIOS 4:11–7.
Olmea, O., and Valencia, A. 1997. Improving contact
predictions by the combination of correlated mutations and
other sources of sequence information. Folding & Design
2:S25–S32.
Pearson, W. R. 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183:63–98.
Reese, M. G.; Lund, O.; Bohr, J.; Bohr, H.; Hansen, J. E.;
and Brunak, S. 1996. Distance distributions in proteins: A
six parameter representation. Prot. Eng. 9:733–740.
Rost, B., and Sander, C. 1993. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol.
232:584–599.
Schneider, T. D., and Stephens, R. M. 1990. Sequence
logos: a new way to display consensus sequences. Nucl.
Acids Res. 18:6097–6100.
Sippl, M. J. 1990. Calculation of conformational ensembles from potentials of mean force. An approach to the
knowledge-based prediction of local structures in globular
proteins. J. Mol. Biol. 213:859–83.
Skolnick, J.; Kolinski, A.; and Ortiz, A. R. 1997. MONSSTER: A method for folding globular proteins with a
small number of distance restraints. J. Mol. Biol.
265:217–241.
Tanaka, S., and Scheraga, H. A. 1976. Medium- and long-range
interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules 9:945–950.
Thomas, D. J.; Casari, C.; and Sander, C. 1996. The prediction of protein contacts from multiple sequence alignments.
Prot. Eng. 9:941–948.