Distance Geometry of Molecules

Ågren Sara
Göteborg, Sweden
Braghieri Elena
Milano, Italy
Gramsch Simone
Kaiserslautern, Germany
Ikonen Samuli
Jyväskylä, Finland
Pantazi Chara
Barcelona, Spain
Sabak Anna
Warsaw, Poland
Weenink Jan Willem
Eindhoven, Holland
Instructor: Simon Kokkendorff
Lyngby, Denmark
Abstract
This project deals with distance geometry of proteins. One problem is to
realize the 3-d structure starting from experimental measurements of the distances between atoms, either when these distances are exact or when we just
know some upper and lower bounds of them.
On the other hand, knowing the distance matrix of the whole set of atoms in
the protein, we made an attempt to classify them on the basis of their secondary
structure. In order to do that, we first tried to recognize and characterize local
regular structures in proteins directly from the distance matrix.
14th ECMI modelling week Lund 2000
1 Introduction
This project deals with the geometric structure of proteins, with an approach based
on distance geometry. The project is divided into two related subproblems. One of
our aims is to realize the 3-dimensional structure, the so called conformation, from
knowledge of inter-atomic distances in the protein. The other problem is to try to
classify proteins using these distances, that is knowledge of the distance matrix. Both
subproblems are of relevance to any kind of industry that deals with biochemistry.
All the proteins in the biological world are formed from the same 20 amino acids.
A single amino acid has as its center a tetrahedral carbon atom called the $\alpha$-carbon,
$C_\alpha$. It is bonded on one side to an amino group ($\mathrm{NH_2}$) and on the other side to a
carboxyl group ($\mathrm{COOH}$). The third bond is always to hydrogen ($\mathrm{H}$) and the fourth
bond is to the functional side chain ($\mathrm{R}$). Amino acids differ from each other in the
structure of the functional side chain. The molecules of amino acids can join with
each other by the so-called peptide bond, creating a polymer molecule, the protein.
A typical protein can consist of several hundred amino acids.
We want to investigate the geometric structure of a protein. Since the entire geometry
can essentially be deduced from the positions of the $C_\alpha$-atoms, we shall for the sake
of simplicity only be interested in the so-called backbone, i.e. the polygonal curve
with the $C_\alpha$-atoms as corners (see footnote 1).
In the classification part of the project we look at the so-called secondary structure.
These are local regular structures, which are formed when there are interactions between
different parts of the protein chain, i.e. interactions between atoms belonging
to different amino acids. The most typical examples of these secondary structures
are $\alpha$-helices and $\beta$-pleated sheets, see section 4 and section 2.2. For more details
on chemistry and protein terminology we refer to [5] or [4]. In our work we
have used data from real proteins found in the Protein Data Bank on the web at
http://www.rcsb.org/pdb.
Section 2 gives a more detailed description of the problems we are dealing with.
The problem of finding the coordinates of atoms in the molecule is described in section 3.
Sections 4 and 5 investigate the secondary structure
and the classification of proteins in terms of this structure. Section 6 contains conclusions
on our work and section 7 summarises what further research could be done.
2 Problem description
2.1 The Embedding Problem
We start out by describing the first subproblem, realization of the 3-dimensional
structure of the backbone using distance information. Suppose that we have information
about distances between atoms in the protein, perhaps from experimental
measurements. We have collected these distances in the distance matrix $D = (d_{ij})$,
where $d_{ij}$ is the distance between atom $i$ and atom $j$. We are now interested in finding
coordinates of the atoms, such that the resulting Euclidean distances correspond
to our given distances. So what we are looking for is an isometric embedding
$$\varphi : \{a_1, a_2, \ldots, a_n\} \to \mathbb{R}^3,$$
where $a_i$ denotes atom $i$.
It would be usual mathematical practice to assume exact knowledge about distances
between atoms in the protein, but in real life we would not have any exact measuring
results. In order to make things more clear, we have divided this problem further into
increasingly difficult subproblems.
Footnote 1: We shall often use the word protein, even though we implicitly consider only the backbone.
(a) In this first subproblem we assume that we know the exact distances between
the atoms in the protein.
(b) In this problem we have a complete set of upper and lower bounds of the distances between the atoms in the protein. The problem is to find coordinates
consistent with the bounds. These bounds could be derived from theoretical
knowledge of the chemistry involved or may come from actual measurements.
(c) In this final case we have only an incomplete set of bounds on distances between atoms in the protein. For example some distances could be impossible to
measure.
Some natural questions arise from these considerations: how much information do we
need in order to describe the structure of the protein? And also questions dealing with
uniqueness of the solution: how much information from distances do we need in order to
have a unique solution (up to rigid motions of $\mathbb{R}^3$) to the embedding problem?
2.2 Classification of proteins
Our task in this subproblem is to classify proteins into groups with similar structure,
which can be done in several ways. We focus our attention on classifying the proteins
according to their secondary structure. When we look at the secondary structure we look at
the backbone of the molecule, which consists of the $C_\alpha$ atoms; the structure can be
characterised by how the backbone of carbon atoms is folded.
There are (at least) three different categories of secondary structure. A protein can
consist of one, two or all three of these structures. One type of structure is when
the folding has a random structure, in which no pattern can be found at all. In the
other two, there are certain patterns. The $\alpha$-helix structure is one of these. Here
the backbone forms a helical coil with approximately 3.6 amino acids per turn,
see figure 1. The other type of folding is called $\beta$-sheet structure. In
this structure the chain of carbon atoms that constitutes the backbone of the protein is
wound to form a sort of two-dimensional sheet. Again we refer to e.g. [4] for more
details.
To classify the proteins we therefore investigate which secondary structure they have,
and we proceed in the following two different ways:
(a) using a filtered binary version of the distance matrix
(b) making a graph of the chain and using graph theory to study the data.
3 Embedding in $\mathbb{R}^3$
We present the mathematical background on which we based our studies. Hence we
first present two theorems and then describe in detail how they’re involved in our
algorithm.
3.1 Finding coordinates when all data are available
Definition 1. The Gram matrix of a set of vectors $\{x_1, \ldots, x_n\} \subset \mathbb{R}^3$ is the $n \times n$ matrix
of the dot products of the vectors.
Definition 2. The distance matrix is the $n \times n$ matrix $D = (d_{ij})$, where $d_{ij}$ is the distance
between atom $i$ and atom $j$.
We consider a protein which consists of $n$ atoms $\{a_1, a_2, \ldots, a_n\}$ with given
internal distances $d_{ij} = d(a_i, a_j)$.
These distances could actually be results of measurements, and if they are precise
enough, they should form a metric space in the mathematical sense. Hence, since a
distance function satisfies the symmetry property, it follows that $D$ is a symmetric
matrix. Moreover, the elements of the diagonal are all equal to zero, since the distance
of an atom from itself is zero.
We are interested in finding a system of Euclidean coordinates such that the Euclidean
distances are equal to the metric distances $d_{ij}$. We choose one of the $n$ atoms,
denoted $a_o$, to be in the origin of the Euclidean system of coordinates. (For computational
reasons we actually select this atom such that the sum of its distances from the other
atoms is minimal.) We define the metric matrix for $n$ points in a metric space to be
the square matrix $M = (m_{ij})$ of dimension $n \times n$, where:
$$m_{ij} = \frac{d_{io}^2 + d_{jo}^2 - d_{ij}^2}{2}.$$
We are going to prove that the transformation from distance coordinates to Euclidean
coordinates is equivalent to diagonalization of the metric matrix. For more details
see [2].
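The definition of the metric matrix can be checked numerically on a small example. The following is a minimal Python sketch (not the Matlab procedures used in the project, which are not included in the appendix); the function names `distance_matrix`, `choose_origin`, `metric_matrix` and `gram_matrix` and the four test points are assumptions made for illustration. It builds the metric matrix from distances only and compares it with the Gram matrix of the position vectors relative to the chosen origin atom.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between 3-D points."""
    return [[math.dist(p, q) for q in points] for p in points]

def choose_origin(d):
    """Index of the atom whose summed distance to all others is minimal."""
    return min(range(len(d)), key=lambda i: sum(d[i]))

def metric_matrix(d, origin):
    """m_ij = (d_io^2 + d_jo^2 - d_ij^2) / 2, with atom `origin` as o."""
    n = len(d)
    return [[(d[i][origin] ** 2 + d[j][origin] ** 2 - d[i][j] ** 2) / 2.0
             for j in range(n)] for i in range(n)]

def gram_matrix(points, origin):
    """Dot products of the position vectors relative to atom `origin`."""
    o = points[origin]
    vecs = [[c - oc for c, oc in zip(p, o)] for p in points]
    return [[sum(a * b for a, b in zip(u, v)) for v in vecs] for u in vecs]

# Hypothetical test configuration (not protein data).
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0)]
D = distance_matrix(points)
o = choose_origin(D)
M = metric_matrix(D, o)   # uses distances only
G = gram_matrix(points, o)  # uses coordinates
```

In this sketch the row and column of `M` belonging to the origin atom are zero, which does not affect positive semidefiniteness or the rank.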
Theorem 1. A necessary and sufficient condition that a configuration of $n$ points
in a metric space is embeddable in $\mathbb{R}^3$, is that the corresponding metric matrix is
positive semidefinite of rank at most 3.
Proof. We first suppose that we know the distances $d_{ij}$ for the $n$ atoms, and that
every atom corresponds to a point in the Euclidean space. Moreover, we choose an
atom of the protein to be the origin of our Euclidean system. We are going to prove
that the metric matrix is positive semidefinite of rank at most 3. Let $x_i = (x_{i1}, x_{i2}, x_{i3})$ be
the Euclidean coordinates of the $i$-th point, where $i = 1, \ldots, n$, and define the matrix
$X$ of dimension $3 \times n$ whose $i$-th column is $x_i$. The Gram matrix of the coordinate vectors is:
$$G = X^T X, \qquad g_{ij} = x_i \cdot x_j. \qquad (1)$$
It is easy to see that a Gram matrix is positive semidefinite of rank at most 3 (for
3-dimensional vectors). For example one can extend $X$ to an $n \times n$ matrix $\tilde{X}$ by filling
in some zero-rows. We clearly have $G = \tilde{X}^T \tilde{X}$, from which the announced
properties follow since the product of any square matrix with its transpose is positive
semidefinite. Moreover, by the law of cosines we have:
$$m_{ij} = \frac{d_{io}^2 + d_{jo}^2 - d_{ij}^2}{2} = |x_i|\,|x_j| \cos\theta_{ij} = x_i \cdot x_j, \qquad (2)$$
where $\theta_{ij}$ is the angle between $x_i$ and $x_j$,
so the metric matrix is equal to the Gram matrix and hence positive semidefinite of
rank at most 3.
We suppose now that we have $n$ points in a metric space, and that the metric matrix $M$
is positive semidefinite of rank at most 3. Since the metric matrix is symmetric,
it can be diagonalized by an orthogonal matrix $U$:
$$M = U \Lambda U^T, \qquad (3)$$
where $\Lambda$ is the $n \times n$ diagonal matrix of the eigenvalues. We can choose $U$ such that the
nonzero, positive eigenvalues are among the first three elements on the diagonal of
$\Lambda$. Scaling the eigenvectors by the square roots of the corresponding eigenvalues,
we obtain the new square matrix $X = \Lambda^{1/2} U^T$ of dimension $n \times n$, which has only zeros
below the third row. We have:
$$X^T X = (\Lambda^{1/2} U^T)^T \Lambda^{1/2} U^T = U \Lambda^{1/2} \Lambda^{1/2} U^T = U \Lambda U^T. \qquad (4)$$
From (3) we have that:
$$M = U \Lambda U^T. \qquad (5)$$
From (4) and (5) we obtain that the two matrices $M$ and $X^T X$ are equal. This means that
the first three elements of the columns of $X$ are the coordinates of a set of vectors $x_i$
for which the corresponding Gram matrix is equal to the metric matrix. Since the
elements of the Gram matrix are dot products of the vectors we can apply the law
of cosines. Hence along the diagonal of the matrix equation we obtain that
$|x_i|^2 = d_{io}^2$ for all $i$.
Using this we obtain also for the off-diagonal elements that $|x_i - x_j|^2 = d_{ij}^2$ for all $i, j$. Hence
the distance between the atoms of the molecule is invariant in the two structures
(distance metric and Euclidean metric are the same).
In the proof we assume that the metric matrix has at most three positive eigenvalues,
and that the rest of them are zero. In practice, errors in measurements may
cause some of them to be nonzero, or even negative, which means that
it is impossible to obtain a real square root of the matrix $\Lambda$. Hence, at first
sight, the algorithm is going to fail. So we need an algorithm that is
able to calculate the Euclidean coordinates from approximations of the real distances
of the atoms.
What we could do, and will do, is to perform the steps described in the proof, but
at the point when we need the square root of $\Lambda$, we keep the three largest positive
eigenvalues and set all others to zero. This choice of approximate embedding can be
justified mathematically in the sense that what we obtain is the Euclidean distance
matrix closest to our original measured distance matrix, with respect to an appropriate norm on
the set of distance matrices. In practice we find that if the data are "sensible", we
will obtain three large positive eigenvalues for the metric matrix, while all others are
orders of magnitude smaller in absolute value. The reader can find more details in [2].
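The steps of the constructive proof can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Matlab routine of the appendix: the names `metric_matrix`, `top_eigenpairs` and `embed` are hypothetical, and a simple power iteration with deflation is swapped in for a library eigenvalue solver such as Matlab's `eig` so that the example needs only the standard library. Power iteration picks eigenvalues by largest magnitude, which coincides with "the three largest positive eigenvalues" only when the data are sensible in the sense described above; any negative value that is picked up is set to zero.

```python
import math

def metric_matrix(d, origin=0):
    """m_ij = (d_io^2 + d_jo^2 - d_ij^2) / 2, with atom `origin` as o."""
    n = len(d)
    return [[(d[i][origin] ** 2 + d[j][origin] ** 2 - d[i][j] ** 2) / 2.0
             for j in range(n)] for i in range(n)]

def top_eigenpairs(m, k=3, iters=500):
    """Largest-magnitude eigenpairs of a symmetric matrix, found by
    power iteration with deflation (illustrative stand-in for `eig`)."""
    n = len(m)
    a = [row[:] for row in m]
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(n)]      # fixed start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:                       # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((lam, v))
        for i in range(n):                         # deflate: remove this pair
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return pairs

def embed(d, origin=0):
    """One 3-D coordinate tuple per atom, computed from distances only."""
    pairs = top_eigenpairs(metric_matrix(d, origin), 3)
    # Negative eigenvalues (e.g. from measurement noise) are set to zero.
    scales = [math.sqrt(max(lam, 0.0)) for lam, _ in pairs]
    return [tuple(s * v[i] for s, (_, v) in zip(scales, pairs))
            for i in range(len(d))]

# Hypothetical test configuration (not protein data).
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 2.0, 0.0),
          (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
D = [[math.dist(p, q) for q in points] for p in points]
coords = embed(D)
D_new = [[math.dist(p, q) for q in coords] for p in coords]
```

The recovered coordinates generally differ from the original ones by a rigid motion (possibly including a reflection), so the comparison is made on the distance matrices `D` and `D_new` rather than on the coordinates.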
3.2 Finding Coordinates when the Data are Given with Error Bounds
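The distance geometry problem with given error bounds consists of finding a set of
positions $\{x_1, \ldots, x_n\}$ in $\mathbb{R}^3$ such that:
$$l_{ij} \le \|x_i - x_j\| \le u_{ij} \quad \text{for all } i, j,$$
where $l_{ij}$ and $u_{ij}$ denote the given lower and upper bounds on the distance between atom $i$ and atom $j$.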
The strategy we used to solve the problem consists of two different parts. The main
idea is to use optimisation, but in order to be able to solve the problem for large
proteins, which have many atoms, we need a good initial guess. We can get this
by taking a random distance matrix. Below we first describe how we
generated this initial infeasible solution, and then how
we can transform our problem into an optimisation problem and how we can find a
solution of it.
3.2.1 Generation of an Infeasible Initial Solution Using Random Distance Matrices
In order to get a good initial guess to the optimisation problem, we start by calculating
an approximate solution to the problem, where the data are given with error bounds.
This solution will not be feasible, in the sense that not all distances will be within
the given error bounds. The initial guess is obtained by taking the interval between the
error bounds and computing the middle point of this interval. Then a random number
generator is used that produces normally distributed random numbers, for every
interval, with the mean value corresponding to the middle point of the interval and
the variance equal to the length of the interval.
After this procedure we test if the generated random distance matrix is approximately
embeddable. "Approximately" means that we check the three largest eigenvalues of
the corresponding metric matrix following theorem 1. If there are negative eigenvalues
we put them equal to zero, otherwise we have a positive semidefinite matrix.
Then we use our algorithm which finds an embedding in $\mathbb{R}^3$ to get the approximate
coordinates.
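The sampling step described above can be sketched as follows. This is a minimal Python illustration, not the project's `newdistmatrix` procedure (which is not reproduced in the appendix); the function name `random_distance_matrix`, the seeded generator and the small bounds matrices are assumptions. Taking the absolute value of the sample, so that no negative "distance" is produced, is an extra safeguard that the text does not mention.

```python
import math
import random

def random_distance_matrix(lower, upper, rng):
    """Symmetric matrix with zero diagonal; entry (i, j) is drawn from a
    normal distribution centred on the interval midpoint, with variance
    equal to the interval length."""
    n = len(lower)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mid = 0.5 * (lower[i][j] + upper[i][j])
            length = upper[i][j] - lower[i][j]
            # variance = length, hence standard deviation = sqrt(length)
            value = abs(rng.gauss(mid, math.sqrt(length)))  # abs(): assumption
            d[i][j] = d[j][i] = value
    return d

# Hypothetical bounds for three atoms (diagonal entries are unused).
lower = [[0.0, 3.7, 5.4], [3.7, 0.0, 3.7], [5.4, 3.7, 0.0]]
upper = [[0.0, 3.9, 5.6], [3.9, 0.0, 3.9], [5.6, 3.9, 0.0]]
rng = random.Random(0)            # seeded for reproducibility
ND = random_distance_matrix(lower, upper, rng)
```

Note that with a variance equal to the interval length, many samples fall outside the bounds, which is consistent with the remark that the initial solution is infeasible.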
3.2.2 Optimisation Procedure to Find Feasible Embedding
By converting our problem into an optimisation problem we can use the initial guess
given above to get a feasible solution. We have used an unconstrained
optimisation formulation, where the feasibility is measured by the cost function. That
is, we add a penalty term to the cost function that is big when the bounds are broken.
This has been suggested by Moré and Wu (see [3]). The formulation of our distance
geometry problem as a global optimisation problem is as follows: find the global
minimum of the function
$$f(x) = \sum_{i,j} p_{ij}(x_i - x_j),$$
where the pairwise function $p_{ij} : \mathbb{R}^3 \to \mathbb{R}$ is defined by:
$$p_{ij}(x) = \min{}^2\left\{\frac{\|x\|^2 - l_{ij}^2}{l_{ij}^2},\, 0\right\} + \max{}^2\left\{\frac{\|x\|^2 - u_{ij}^2}{u_{ij}^2},\, 0\right\}.$$
The set of positions $x = \{x_1, \ldots, x_n\}$ solves the distance geometry problem if and
only if $x$ is a global minimiser of $f$ and $f(x) = 0$.
3.2.3 Computational Experiments
We implemented both the algorithm for finding an initial guess to the optimisation
problem and the optimisation algorithm in Matlab. The code is found in the appendix.
In both cases we used distance matrices created from the original data found in the
Protein Data Bank, but we made our error bounds by adding random numbers to the exact
distances.
For the approximate infeasible solutions obtained from our first algorithm, usually
about 50 of the bounds were violated.
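Both the cost function of section 3.2.2 and the count of violated bounds can be evaluated for any candidate set of coordinates. The sketch below is a Python illustration, not the Matlab code used in the experiments; it assumes the penalty has the squared min/max form associated with Moré and Wu [3], that all lower bounds are strictly positive, and that each pair is counted once. The names `pair_penalty` and `cost_and_violations` and the three-atom example are hypothetical.

```python
import math

def pair_penalty(sq_dist, low, up):
    """Penalty for one pair: zero exactly when low <= distance <= up.
    Assumes low > 0."""
    below = min((sq_dist - low ** 2) / low ** 2, 0.0)
    above = max((sq_dist - up ** 2) / up ** 2, 0.0)
    return below ** 2 + above ** 2

def cost_and_violations(coords, lower, upper):
    """Total penalty and number of violated bounds over all pairs i < j."""
    n = len(coords)
    cost, violated = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            p = pair_penalty(sq, lower[i][j], upper[i][j])
            cost += p
            if p > 0.0:
                violated += 1
    return cost, violated

# Hypothetical example: bounds are the exact distances +/- 0.1.
exact = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
dist = [[math.dist(p, q) for q in exact] for p in exact]
lower = [[d - 0.1 for d in row] for row in dist]
upper = [[d + 0.1 for d in row] for row in dist]

feasible = cost_and_violations(exact, lower, upper)
moved = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
infeasible = cost_and_violations(moved, lower, upper)
```

A cost of zero means that every pair is within its bounds, which matches the characterisation of a solution as a global minimiser with function value zero.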
Unfortunately we could not get the optimisation algorithm to work. Possible reasons
are that the problem was too large for the Matlab optimisation
routine we used, or that we used the routine in an ineffective way. If there had been
more time we would have implemented a steepest descent algorithm ourselves, so
that we could control what happens in the optimisation loop.
3.2.4 Concluding Remarks
The paper of Moré and Wu (see [3]) gives ways of improving the minimisation
algorithm. They suggest transforming the cost function by a Gaussian
transform into a smoother function with fewer local minima before applying the
minimisation routine. This of course gives an approximate solution, but one can use our
random-distance-matrix algorithm as a starting point and then iterate the improved
optimisation algorithm with a smoothing factor that tends to zero.
4 Classification of Proteins
A common wish when working with molecules or, in this case, proteins is to classify
different proteins into different groups. This could be done in several different ways.
One would be to sort the proteins according to their function. Another way would be
to look at the structure of the protein molecules and classify them by similarities
in the structure. When one looks at 3d pictures of the backbone structure of different
proteins it is easy to see that some proteins have similar structures while some do not
(see figure 1).
Figure 1 (3d plots of the protein backbones; panels: a) 256b1-protein, b) 1brx-protein, c) 5ebx-protein, d) 2phy-protein): a), b) show proteins with $\alpha$-helix structure and c), d) proteins with $\beta$-sheet structure.
This way of classifying the proteins can also be interesting from a mathematical point
of view, so this will be our way of classifying proteins. We want to take into account
the types of substructures that the protein has, but it would also be interesting to try to
find a more global measurement of how compact the molecule is, that is, whether the atoms
are situated all together or form different groups. The latter seems to be the
more difficult part, so the main focus in our work was on a classification
based on the different types of substructures in the molecule. A large part
of the coming sections will therefore be about how we can locate these different
substructures just by using the information about the distances between the atoms.
We use two different, but yet similar strategies to do this. The first one is to use a
filtered binary version of the distance matrix. The other one is to make a graph of
the atoms in the chain and use graph theory to interpret the data. In the latter we also
discuss some ideas about how graph theory can be used to also get a more global
measure of the protein.
4.1 Classification Using the Distance Matrix
The first approach when attempting to classify the protein molecules was to look at
the distance matrix. By studying the colour visualisation of the distance matrix for
several proteins one can easily see that different types of molecules give different
patterns, and the similarities in the patterns for molecules of the same
type are striking. Because of this the distance matrix seems to be a good tool for
classifying the molecules.
Figure 2 shows distance matrices for different proteins; the first two proteins are
of one type and the other two are of another type.
Figure 2 (colour images of the distance matrices, both axes indexed by atom number; panels: a) 256b1-protein, b) 1brx-protein, c) 5ebx-protein, d) 2phy-protein): a), b) show distance matrices of proteins with $\alpha$-helix structure and c), d) distance matrices
of proteins with $\beta$-sheet structure.
In order to make it easier to analyse the patterns, we filtered the distance matrix and
turned it into a binary matrix. We decided to concentrate on the small distances below
a certain cut-off value. The distances below the cut-off value are set to 1 and the rest
are set to 0.
The cut-off value was decided through a trial and error process, and we found that
a suitable cut-off value was 8 Å for most of the proteins we looked at. In figure 3 the
filtered, binary versions of the distance matrices shown in figure 2 are visualised.
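The filtering step is a simple threshold. A minimal Python sketch follows (the project itself used Matlab); the function name `filter_matrix`, the helper `nonzeros` and the small example matrix are assumptions made for illustration.

```python
def filter_matrix(d, cutoff=8.0):
    """Binary version of a distance matrix: 1 where the distance is
    below the cut-off (in Angstrom), 0 elsewhere."""
    return [[1 if x < cutoff else 0 for x in row] for row in d]

def nonzeros(b):
    """Number of entries equal to 1."""
    return sum(sum(row) for row in b)

# Hypothetical 4-atom distance matrix in Angstrom.
D = [[0.0, 3.8, 6.1, 9.5],
     [3.8, 0.0, 3.8, 7.2],
     [6.1, 3.8, 0.0, 3.8],
     [9.5, 7.2, 3.8, 0.0]]
B = filter_matrix(D)
```

In this sketch the diagonal entries (distance zero) are also set to 1, so they are included in the count returned by `nonzeros`.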
Figure 3 (sparsity plots of the filtered matrices, both axes indexed by atom number; panels: a) 256b1-protein, nz = 996; b) 1brx-protein, nz = 2054; c) 5ebx-protein, nz = 564; d) 2phy-protein, nz = 1174): a), b) show filtered distance matrices of proteins with $\alpha$-helix structure and c), d) filtered
distance matrices of proteins with $\beta$-sheet structure.
The main focus was now to find the patterns in the binary distance matrix that correspond
to the different substructures ($\alpha$-helices and $\beta$-sheets). The spiral structure
of the $\alpha$-helix is such that atoms that are next to each other in the chain, in groups
of four, are also close in the Euclidean sense. In the binary distance matrix this can
be seen as bands along the diagonal. Figure 1a) shows the backbone structure of the
protein 256b1, which contains four $\alpha$-helices. Figure 3a) shows the visualisation of
the corresponding binary distance matrix, which clearly has four thick bands along
the diagonal indicating the different helices.
By just searching for the bands along the diagonal that exceed a certain length, the
$\alpha$-helix structures can easily be found. A search routine was implemented in Matlab
to do this. For the code see Appendix. The program worked well on most molecules
that we tested: for most of the proteins that contain helix structures the program
found them, and it found no helices in the proteins with only sheet
structures. There were however some proteins with helix structure that the program
could not deal with. The problem was probably that the cut-off value was too small
for these proteins. The radius of the helices can differ quite a bit, and in the cases
where the program failed we believe that the radius was so large that the atoms in
the helix were too far apart to be recognised as helix structure.
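A band search of this kind can be sketched as follows. This Python illustration is not the Matlab routine mentioned above; the function name `helix_segments`, the band width of three neighbours (atoms in "groups of four") and the minimum run length are assumptions chosen for the example.

```python
def helix_segments(b, width=3, min_len=4):
    """Find runs along the diagonal of a binary distance matrix `b` where
    every atom i is close to atoms i+1, ..., i+width.
    Returns a list of (first_atom, number_of_atoms) tuples."""
    n = len(b)
    in_band = [all(b[i][i + k] == 1 for k in range(1, width + 1))
               for i in range(n - width)]
    segments, start = [], None
    for i, flag in enumerate(in_band + [False]):   # sentinel closes last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                # the run covers atoms start .. (i - 1 + width)
                segments.append((start, i - start + width))
            start = None
    return segments

# Hypothetical 12-atom matrix: chain neighbours are always close, and
# atoms 2..9 form a band of width 3 (a toy "helix").
n = 12
B = [[1 if abs(i - j) <= 1
      or (abs(i - j) <= 3 and 2 <= i <= 9 and 2 <= j <= 9) else 0
      for j in range(n)] for i in range(n)]
found = helix_segments(B)
```

For the toy matrix the sketch reports one segment that starts at atom 2 and consists of 8 atoms. Making `width` or the cut-off of the underlying filter variable would be one way to address helices with a larger radius.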
A search routine to find the $\beta$-sheets was not implemented, but the patterns that
correspond to these can also be recognised in the binary distance matrix. These
patterns constitute islands that are anti-parallel to the diagonal (for sheets with
anti-parallel strands). See figure 3c), which shows the visualisation of the binary distance
matrix of the protein 5ebx.
In figure 4 there is a simplified version of a sheet. In this figure all atoms that are
considered close to atom $m$ are marked, and from this picture it should be clear
how the pattern in the distance matrix occurs. A search routine that finds these sheets
should be based on an algorithm which sums the elements in each row anti-parallel to
the diagonal. For those rows where the sum is high, a local search should be performed
to find the whole island and see which atoms constitute the sheet.
Figure 4 (schematic with labelled atoms m, m+1, m+2, m+7, m+8, m+9, m+14, m+15, m+16): The figure shows a simplified version of a $\beta$-sheet from a protein.
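The first stage of such a search, summing along the lines anti-parallel to the diagonal, might look as follows in Python. Since this routine was not implemented in the project, everything here is a hypothetical sketch: the function name `antidiagonal_sums`, the restriction to the upper triangle and the parameter `min_sep`, which skips the trivial contacts between chain neighbours, are assumptions.

```python
def antidiagonal_sums(b, min_sep=4):
    """Sum the entries of a binary distance matrix along each anti-diagonal
    (entries with i + j constant), using only pairs with j - i >= min_sep.
    Returns a list indexed by i + j."""
    n = len(b)
    sums = [0] * (2 * n - 1)
    for i in range(n):
        for j in range(i + min_sep, n):
            sums[i + j] += b[i][j]
    return sums

# Hypothetical 12-atom matrix: chain neighbours are close, and atom i is
# close to atom 11 - i (a toy pair of anti-parallel strands).
n = 12
B = [[1 if abs(i - j) <= 1 or i + j == 11 else 0 for j in range(n)]
     for i in range(n)]
sums = antidiagonal_sums(B)
best = sums.index(max(sums))
```

Anti-diagonals with a high sum, here the one with i + j = 11, would be the starting points for the local search that delimits the island.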
When these different substructures of the proteins are found, the proteins can be
classified according to which substructures they contain. We can then form four different
groups: those proteins that have $\alpha$-helices, those that have $\beta$-sheets, those that have
both structures and those that have none, the so-called randomly folded proteins.
A more global measurement of the protein structure might be possible if one uses
some sort of image processing on the filtered distance matrix. We think image processing
would enable us to compare two filtered distance matrices in a more intelligent
way than just calculating the difference matrix, and such a measure could perhaps
also give us a way of "globally" classifying the proteins.
5 Classification using graph theory
5.1 Why graph theory?
In the previous sections we represented the data essential for our investigations in
the form of a matrix. In a distance matrix all distances in 3D-space between the
atoms were given, in the same order as the atoms follow each other in the protein
chain. The same information can also be represented in the form of a graph. The
question is what added value this will give. We show in this section that the operations
available in graph theory bring new insight into how one can easily identify
sub-structures in the proteins and generate global measurements for classifying the
proteins as well. The way in which these insights are obtained is more or less the
same for all the methods mentioned. The protein is modelled as a graph. Some operations
on this graph give some simplified vector. These vectors are more easily comparable
than complete matrices, and in general give enough information to distinguish
different types of proteins.
5.2 Constructing graphs
We can represent the 3D-structure of the protein by a graph $G = (V, E)$,
where the set of vertices $V$ represents the $C_\alpha$-atoms. The set of edges $E$
we can construct in different ways. The most obvious way to do this is similar to the
way we created the binary distance matrices. That is:
$$E = \{(v_i, v_j) : d_{ij} \le \delta,\ i \ne j\},$$
where $d_{ij}$ is the 3D-distance between the $i$-th and $j$-th
$C_\alpha$-atoms in the protein chain, and $\delta$ a certain threshold that has to be chosen such
that the important substructures are distinguishable (a value in Å is the default; section 4.1 found 8 Å suitable for the binary matrices).
5.2.1 Generalized definition of the edge set
An alternative definition of the edge set can be very interesting. Because the edges
between subsequent vertices are trivial it is a good idea to leave them out, which gives
us the following definition:
$$E_k = \{(v_i, v_j) : d_{ij} \le \delta,\ |i - j| \ge k\}.$$
Remark that $E_1$ gives us the same definition for the edge set as given above.
So this alternative definition is in fact a generalization. $E_2$ and $E_3$ give
us new graphs with similar features. In $E_k$ with higher $k$ an analysis of the components
gives us results about some local structures. We suppose that each component
(especially the components that have a regular structure, like a block) represents a
substructure. Some other structures however will be lost with higher $k$, so one has to
be very careful with this.
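The generalized edge set can be built directly from a distance matrix. The following Python sketch is an illustration only; the function name `edge_set`, the use of index pairs with i < j to represent undirected edges, the threshold value and the straight five-atom example chain are assumptions.

```python
def edge_set(d, threshold=8.0, k=1):
    """Edges (i, j), i < j, between atoms that are at most `threshold`
    apart and at least k positions apart along the chain."""
    n = len(d)
    return {(i, j) for i in range(n) for j in range(i + k, n)
            if d[i][j] <= threshold}

# Hypothetical chain: five atoms on a straight line, 3.8 Angstrom apart.
n = 5
D = [[3.8 * abs(i - j) for j in range(n)] for i in range(n)]
E1 = edge_set(D, k=1)   # all close pairs
E2 = edge_set(D, k=2)   # chain neighbours left out
E3 = edge_set(D, k=3)   # only pairs at least three positions apart
```

In this stretched toy chain only neighbours and next-nearest neighbours are within the threshold, so the edge set for k = 3 is empty; in a folded protein the higher values of k keep exactly the contacts that are not explained by chain order alone.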
5.3 Local sub-structures
5.3.1 Degrees of the graph
A basic idea to be able to say something about the sub-structures in the protein is to
look at the array of degrees of the graph. The conjecture is that certain sub-structures
have certain properties in this array, namely a sub-array of degrees that are higher
than the average degree. We call this local regularity. The distribution of the degrees
is defined as the frequency with which each degree occurs. By looking at the distribution
of the degrees, we can say something about the number of different local
sub-structures. This distribution can also be useful for the global classification of the
proteins. Apart from this, other ways of finding local sub-structures can be thought
of; we discuss one of these ideas in the next sections. Remark that the generalized
definition of the edge set gives the same results in this case.
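The degree array, its distribution and the sub-arrays of above-average degree can be computed as in the Python sketch below. The function names `degrees`, `degree_distribution` and `above_average_runs`, the minimum run length and the eight-vertex example graph are assumptions made for illustration.

```python
from collections import Counter

def degrees(n, edges):
    """Degree of each of the n vertices for an undirected edge list."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg

def degree_distribution(deg):
    """Frequency with which each degree occurs, sorted by degree."""
    return dict(sorted(Counter(deg).items()))

def above_average_runs(deg, min_len=3):
    """Runs of consecutive vertices whose degree exceeds the average.
    Returns (first_vertex, run_length) tuples."""
    avg = sum(deg) / len(deg)
    runs, start = [], None
    for i, x in enumerate(deg + [float("-inf")]):  # sentinel closes last run
        if x > avg and start is None:
            start = i
        elif x <= avg and start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    return runs

# Hypothetical graph: a chain of 8 vertices with extra contacts in the middle.
edges = [(i, i + 1) for i in range(7)] + [(2, 4), (3, 5), (2, 5)]
deg = degrees(8, edges)
dist = degree_distribution(deg)
runs = above_average_runs(deg)
```

For the example the degree array is `[1, 2, 4, 3, 3, 4, 2, 1]`, and the single run of above-average degrees covers vertices 2 to 5.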
5.3.2 Divide graph in blocks
To explain the idea of this section we first have to explain some terms. A vertex $v$ of
a connected graph $G$ is a cut-vertex of $G$ if and only if there exist vertices $u$ and $w$
($u, w \ne v$) such that $v$ is on every $u$-$w$ path of $G$. A non-trivial connected graph
with no cut-vertices is called a non-separable graph. A block of a graph is a maximal
non-separable subgraph. The idea now is that having a block in the graph gives
some indication of some sub-structure. This idea has to be tested to see if it works
for real-life cases.
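Cut-vertices can be found with a depth-first search. The sketch below uses the standard low-point technique in Python; it was not part of the project, and the function name `cut_vertices` and the example graph are assumptions. The recursive formulation is adequate for small examples, but for chains of many hundreds of vertices an iterative version (or a raised recursion limit) would be needed.

```python
def cut_vertices(n, edges):
    """Cut-vertices of a simple undirected graph with vertices 0..n-1,
    found by depth-first search with low-point values."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    disc = [None] * n     # discovery time of each vertex
    low = [0] * n         # earliest discovery time reachable from the subtree
    cuts = set()
    counter = 0

    def dfs(v, parent):
        nonlocal counter
        disc[v] = low[v] = counter
        counter += 1
        children = 0
        for w in adj[v]:
            if disc[w] is None:
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                # a non-root v separates w's subtree from the rest
                if parent is not None and low[w] >= disc[v]:
                    cuts.add(v)
            elif w != parent:
                low[v] = min(low[v], disc[w])
        # the root is a cut-vertex if it has more than one DFS child
        if parent is None and children > 1:
            cuts.add(v)

    for v in range(n):
        if disc[v] is None:
            dfs(v, None)
    return sorted(cuts)

# Hypothetical graph: two triangles sharing vertex 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]
cuts = cut_vertices(5, edges)
```

In the example the blocks are the two triangles {0, 1, 2} and {2, 3, 4}, which meet in the single cut-vertex 2.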
5.3.3 Other ideas
There are a lot of other similar concepts in graph theory that all can help us, like local
regularity, looking at the cut-vertices, graph-coloring, etc. Some trial and error can
help to determine which concepts work best.
5.4 Global classifying
Apart from finding local sub-structures we were also interested in classifying the
proteins in different classes without first identifying the sub-structures.
5.4.1 Connectivity and edge-connectivity
Let $G$ be a connected graph (a graph with a path between every two vertices). A
subset $S$ of the edge set of a connected graph $G$ is an edge cutset of $G$ if $G - S$ is
disconnected.
By the next iterative process an idea of the global structure can be obtained. In this
process the vector $c$ is similar if the global structure is similar.
1. Start with the graph $G$ and $i = 0$.
2. Find the smallest cutset $S$ of the components of $G$.
3. Define $c_i = |S|$ and replace $G$ by $G - S$ and $i$ by $i + 1$.
4. If there are still nontrivial components of $G$, go to 2; otherwise finish the algorithm.
Similar results are obtained by (vertex) connectivity. The connectivity of a graph $G$
is defined as the minimum number of vertices whose deletion from $G$ produces a
disconnected or trivial graph.
5.4.2 Distance matrix
For a nontrivial graph $G$ and a pair $u, v$ of vertices of $G$, the distance $d(u, v)$ between
$u$ and $v$ is the length of a shortest $u$-$v$ path in $G$, if such a path exists. This
shortest path can be obtained by Moore's Breadth-First Search Algorithm
(see [6], page 101). The distance matrix is the matrix with this distance for each
pair of vertices. What to do with this distance matrix? Well, if we simply count the
frequency of all numbers in this matrix, we again have an array which gives us a good
idea about the global structure of the graph.
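The graph distance matrix and the frequency array can be sketched with a breadth-first search from every vertex. This Python illustration is not project code; the function names `graph_distances` and `distance_frequencies`, the use of `None` for unreachable pairs and the four-vertex example are assumptions.

```python
from collections import Counter, deque

def graph_distances(n, edges):
    """Shortest-path lengths (number of edges) between all vertex pairs,
    by breadth-first search from each vertex. Unreachable pairs get None."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    dist = [[None] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if dist[s][w] is None:
                    dist[s][w] = dist[s][v] + 1
                    queue.append(w)
    return dist

def distance_frequencies(dist):
    """How often each graph distance occurs among the pairs i < j."""
    n = len(dist)
    counts = Counter(dist[i][j] for i in range(n) for j in range(i + 1, n)
                     if dist[i][j] is not None)
    return dict(sorted(counts.items()))

# Hypothetical graph: chain 0-1-2-3 with one extra contact between 0 and 2.
edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
GD = graph_distances(4, edges)
freq = distance_frequencies(GD)
```

A compact, strongly folded molecule would be expected to give a frequency array concentrated on small distances, while an extended chain spreads it over larger values.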
6 Conclusions
The first part of our project was to find the coordinates of atoms in the molecule,
knowing only distances between atoms. We are able to do that. It can be done using
theorem 1. The proof of this theorem is constructive and can be directly used to find
the embedding in $\mathbb{R}^3$. This is possible when all the distances between the atoms are
given. In that case we have an exact solution of our problem, unique up to rigid motions. Another
goal was to find the coordinates when we only have lower and upper bounds on the
distances instead of exact numbers. We tried to find a random set of coordinates
and then optimise it until all the constraints are fulfilled. However, we did not manage
to get the optimisation part of our algorithm to work.
We also tried to check how much information we need to find the coordinates, i.e.
whether we need all the distances in the distance matrix or not, and if not, what
the least number of elements of the matrix is that we need to obtain a unique solution. But
we did not manage to do much research on that topic.
The second part of our project was to investigate the secondary structure of the proteins.
Since $\alpha$-helices and $\beta$-sheets are the main (most common) structures, we have
focused on finding them in the protein. We have implemented an algorithm which
finds helices in the molecule using the filtered distance matrix. As output the program
gives the number of the atom where a helix starts and the number of atoms of which
the helix consists. It also draws the filtered distance matrix for the protein currently
investigated, the molecule in 3d and the helices cut out of the protein. So our program gives
information about the location and length of helices, provided there are any in the
molecule. A similar algorithm can be applied to find $\beta$-sheets. Knowledge about helices
and sheets is the most important knowledge about the secondary structure of the
proteins. We can get it using our program.
We have also tried to achieve the same aim (investigating the secondary structure) using
a different tool, graph theory. We have implemented an algorithm which draws
a graph and calculates the degrees of the vertices based on a filtered distance matrix.
Looking at the graph we can find parallel and anti-parallel $\beta$-sheets, while the degrees give us
information about how compact (folded) the molecule is. We think that graph theory
can also be useful in the general classification of proteins. Two similar-looking
graphs should correspond to two similarly built molecules.
7 Further Research
Of course some aspects of our work could be more deeply studied, in order to reach
a better understanding of the solution of the problems.
Concerning the first problem, when there are errors in the data, we would like to
investigate the solution space of the optimisation problem, and again write an
optimisation routine to find a solution, for example by implementing a steepest descent
algorithm.
In the second problem some more work could be done, for example implementing a
search routine to find $\beta$-sheet structures in the proteins (as we have done for $\alpha$-helix
structures), or including a variable cut-off limit in the search routine. Moreover,
it would be interesting to try to use image processing to classify molecules, and to
make further implementations using graph theory to get a global measurement of the
structure of the molecule.
Appendix
Algorithm for Finding Approximate Solution
function [C,counter] = approximation(Matrix)
% Approximate 3-d coordinates C (3 x N) from the metric matrix.
% counter is the number of positive eigenvalues among the 3 largest.
M=Matrix;
N=size(M,2);
% --- eigendecomposition (test for positive semidefiniteness)
[Eigenvectors, Eigenvalues] = eig(M);
helpvector=diag(Eigenvalues);
% --- sort the eigenvalues in descending order
[sortedvec,backindex] = sort(helpvector,'descend');
% --- consider the 3 largest eigenvalues
counter = 0;
C=zeros(3,N);
for i=1:3
  if (sortedvec(i)>0)
    counter=counter+1;
    % coordinate row i: sqrt(eigenvalue) times its eigenvector
    C(i,:)=sqrt(sortedvec(i))*Eigenvectors(:,backindex(i))';
  end;
end;
counter
Algorithms for Calculating a Distance Matrix with Random Errors
function main(iterations)
% This program uses random distances to solve the distance
% geometry problem with given error bounds.
% The variable iterations is the maximum number of tries to
% find a random distance matrix that fits.
load pdb2PHY.dat -ascii;
A=pdb2PHY';
n=size(A,2);           % number of atoms
epsilon=0.1;           % error bounds
D=distancematrix(A);   % computation of the distance matrix
                       % (procedure not included here)
M=metricmatrix(D);     % computation of the metric matrix
                       % (procedure not included here)
for i=1:iterations
  % computation of random distance matrix
  ND=newdistmatrix(D,epsilon);
  % corresponding metric matrix
  NM=metricmatrix(ND);
  % test if embeddable
  [C,counter]=approximation(NM);
  if counter == 3
    break;
  end;
end;
% difference between the original and the random distance matrix
errormatrix=D-ND;
figure(1)
image(D)            % visualization of the original distance matrix
figure(2)
image(ND)           % visualization of the random distance matrix
figure(3)
image(errormatrix)  % visualization of the error matrix
function New = newdistmatrix(Matrix,epsilon)
% Perturb every distance by a normally distributed random error
% with standard deviation epsilon.
D=Matrix;
N=size(D,2);
New=zeros(N);
errormatrix=zeros(N);
hurtedbounds=0;       % number of errors larger than epsilon
for i=1:N
  for j=1:N
    if (i>j)
      rn=randn;
      errormatrix(i,j)=(abs(rn)*epsilon)/D(i,j);   % relative error
      if (abs(rn)>1)
        hurtedbounds=hurtedbounds+1;
      end;
      New(i,j)=D(i,j) + rn*epsilon;
      New(j,i)=New(i,j);
    end;
  end;
end;
% --- average relative error in percent
errorsum=0;
number=0;
for i=1:N
  for j=1:N
    if (i>j && errormatrix(i,j)~=0)
      errorsum=errorsum+errormatrix(i,j);
      number=number+1;
    end;
  end;
end;
erroraverage=errorsum/number*100
hurtedbounds
totalnumberofbounds=0.5*N*(N-1)
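The routines distancematrix and metricmatrix called above are not listed. As a stand-in, they might be sketched as follows. The sketch assumes that distancematrix takes a coordinate array with one atom per column, and that the metric matrix is the Gram matrix of the coordinates relative to their centroid; the routines actually used may place the origin differently. The helper name centering and the toy data are hypothetical.

```matlab
% Hypothetical stand-in: distance matrix of the columns of a d-by-n array A
distancematrix = @(A) sqrt(max( ...
    bsxfun(@plus, sum(A.^2,1)', sum(A.^2,1)) - 2*(A'*A), 0));

% Hypothetical stand-in: metric (Gram) matrix relative to the centroid,
% M = -1/2 * J * D.^2 * J with the centering matrix J = I - ones/n
centering = @(n) eye(n) - ones(n)/n;
metricmatrix = @(D) -0.5 * centering(size(D,1)) * (D.^2) * centering(size(D,1));

% Toy check on 4 atoms in 3-d (one atom per column)
A = [0 1 0 0; 0 0 1 0; 0 0 0 1];
D = distancematrix(A);
M = metricmatrix(D);

% M should equal the Gram matrix of the centred coordinates
Ac = bsxfun(@minus, A, mean(A,2));
gramerror = norm(M - Ac'*Ac);

% for exact 3-d data, M has exactly 3 positive eigenvalues
e = sort(eig((M+M')/2), 'descend');
npositive = sum(e > 1e-10);
fprintf('Gram error %g, positive eigenvalues %d\n', gramerror, npositive);
```

With exact distances the number of positive eigenvalues matches the embedding dimension, which is the property the test counter == 3 in main relies on.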
Optimisation Algorithm
load pdb1NXB.dat -ascii
A=pdb1NXB;
D=distancematrix(A');
N=size(D,1);
[k,d]=size(A);
% --- error bound: smallest nonzero distance minus 0.5
epsilon=min(D(D>0))-0.5;
% --- starting point: randomly perturbed coordinates
for i=1:k
  for j=1:d
    ANew(i,j)=A(i,j)+randn*epsilon;
  end
end
% --- lower and upper bounds on the distances
global L;
global U;
for i=1:N
  for j=1:N
    L(i,j)=D(i,j)-epsilon;
    U(i,j)=D(i,j)+epsilon;
  end;
end;
% --- box constraints (only used by the fmincon variant below)
for i=1:3
  BOX1(:,i)=-100;
  BOX2(:,i)=100;
end
%options=optimset('TolFun',10,'Display','iter');
%Xmin=fmincon('funktion',ANew,[],[],[],[],BOX1,BOX2,[],options)
fminsearch('funktion',ANew)
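The objective function funktion passed to fminsearch is not listed. One plausible form, given here purely as an assumption, penalises every squared distance that falls outside its bounds, similar in spirit to the bound-violation penalties used in distance geometry optimisation [3]. The name penalty and the toy data are hypothetical.

```matlab
% Squared pairwise distances of the rows of X (n-by-3)
sqdist = @(X) bsxfun(@plus, sum(X.^2,2), sum(X.^2,2)') - 2*(X*X');
% Hypothetical objective: squared violations of the bounds, pairs i<j only
penalty = @(X,L,U) sum(sum(triu( ...
    max(L.^2 - sqdist(X), 0).^2 + max(sqdist(X) - U.^2, 0).^2, 1)));

% Toy data (illustrative): 4 atoms and bounds D -/+ epsilon
X = [0 0 0; 1 0 0; 0 1 0; 0 0 1];
D = sqrt(max(sqdist(X), 0));
epsilon = 0.1;
L = D - epsilon;
U = D + epsilon;

p_exact = penalty(X, L, U);       % all distances within their bounds
p_scaled = penalty(2*X, L, U);    % all distances doubled, bounds violated
fprintf('%g %g\n', p_exact, p_scaled);
```

Calling fminsearch(@(X) penalty(X,L,U), ANew) would pass the bounds explicitly rather than through global variables.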
Search Algorithm for Finding α-helix Structures in the Distance Matrix
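clear;
limit=8                  % cut-off distance of the filter
load icse.dat -ascii
M=icse;
[n,dimension]=size(M)
% --- plot the molecule and number the atoms
figure(1)
plot3(M(:,1),M(:,2),M(:,3));
hold on
for i=1:n
  text(M(i,1),M(i,2),M(i,3),int2str(i));
end
hold off
% --- distance matrix D and filtered distance matrix DMIN
for i=1:n
  for j=1:n
    D(i,j)=sqrt((M(i,1)-M(j,1))^2+(M(i,2)-M(j,2))^2+ ...
                (M(i,3)-M(j,3))^2);
    if D(i,j)>limit
      DMIN(i,j)=0;
    else
      DMIN(i,j)=D(i,j);
    end
    DMIN(i,i)=1;
  end
end
figure(2)
spy(DMIN);
figure(3)
image(D)
% --- search for runs in which atoms i, i+3 and i, i+4 are close
j=1;
i=1;
while i<=n-4
  if DMIN(i,i+3)~=0 && DMIN(i,i+4)~=0
    m=1;
    istar=i;
    i=i+1;
    % the bound on i keeps the indices i+3 and i+4 inside the matrix
    while i<=n-4 && DMIN(i,i+3)~=0 && DMIN(i,i+4)~=0
      m=m+1;
      i=i+1;
    end
    if m>4
      H(j,1)=istar;     % atom where the helix starts
      H(j,2)=m;         % length of the run
      j=j+1;
    end
  else
    i=i+1;
  end
end
% --- plot the helices cut out of the protein
figure(4)
hold on
for i=1:j-1
  a=H(i,1);
  b=H(i,1)+H(i,2);
  plot3(M(a:b,1),M(a:b,2),M(a:b,3));
end
view(3)                  % 3-d view, since hold on was set before plotting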
References
[1] I.J. Schoenberg, "Remarks to Maurice Fréchet's article 'Sur la définition
axiomatique d'une classe d'espace distanciés vectoriellement applicable sur
l'espace de Hilbert'", Annals of Mathematics, 1935.
[2] Timothy F. Havel, Irwin D. Kuntz and Gordon M. Crippen, "The theory and
practice of distance geometry", Bulletin of Mathematical Biology, 45: 665-720,
1983.
[3] Jorge J. Moré and Zhijun Wu, "Distance Geometry Optimization for Protein
Structures", Journal of Global Optimization, 15: 219-234, 1999.
[4] N.J. Darby and T.E. Creighton, "Protein Structure", IRL Press, 1993.
[5] Holtzclaw, Robinson, Odom, "General Chemistry", Ninth Edition, D.C. Heath
and Company, 1991.
[6] Chartrand and Oellermann, "Applied and Algorithmic Graph Theory",
McGraw-Hill, 1993.