Distance Geometry of Molecules

Ågren Sara, Göteborg, Sweden
Braghieri Elena, Milano, Italy
Gramsch Simone, Kaiserslautern, Germany
Ikonen Samuli, Jyväskylä, Finland
Pantazi Chara, Barcelona, Spain
Sabak Anna, Warsaw, Poland
Weenink Jan Willem, Eindhoven, Holland

Instructor: Simon Kokkendorff, Lyngby, Denmark

Abstract

This project deals with distance geometry of proteins. One problem is to realize the 3-d structure starting from experimental measurements of the distances between atoms, either when these distances are exact or when we only know upper and lower bounds on them. On the other hand, knowing the distance matrix of the whole set of atoms in a protein, we made an attempt to classify proteins on the basis of their secondary structure. In order to do that, we first tried to recognize and characterize local regular structures in proteins directly from the distance matrix.

14th ECMI modelling week Lund 2000

1 Introduction

This project deals with the geometric structure of proteins, with an approach based on distance geometry. The project is divided into two related subproblems. One of our aims is to realize the 3-dimensional structure, the so-called conformation, from knowledge of inter-atomic distances in the protein. The other problem is to try to classify proteins using these distances, that is, using knowledge of the distance matrix. Both subproblems are of relevance to any kind of industry that deals with biochemistry.

All the proteins in the biological world are formed from the same 20 amino acids. A single amino acid has as its center a tetrahedral carbon atom called the α-carbon, $C_\alpha$. It is bonded on one side to an amino group (NH$_2$) and on the other side to a carbonyl group (C=O). The third bond is always to hydrogen (H) and the fourth bond is to the functional side chain (R). Amino acids differ from each other in the structure of the functional side chain.
The molecules of amino acids can join with each other by the so-called peptide bond, creating a polymer molecule, the protein. A typical protein can consist of several hundred amino acids.

We want to investigate the geometric structure of a protein. Since the entire geometry can essentially be deduced from the positions of the $C_\alpha$-atoms, we shall for the sake of simplicity only be interested in the so-called backbone, i.e. the polygonal curve with the $C_\alpha$-atoms as corners (see footnote 1).

In the classification part of the project we look at the so-called secondary structure. These are local regular structures, which are formed when there are interactions between different parts of the protein chain, i.e. interactions between atoms belonging to different amino acids. The most typical examples of these secondary structures are α-helices and β-pleated sheets, see section 4 and section 2.2. For more details on chemistry and protein terminology we refer to [5] or [4].

In our work we have used data from real proteins found in the Protein Data Bank on the web at http://www.rcsb.org/pdb.

Section 2 gives a more detailed description of the problems we are dealing with. The problem of finding the coordinates of the atoms in the molecule is described in section 3. Sections 4 and 5 deal with the secondary structure and with the classification of proteins in terms of this structure. Section 6 contains conclusions on our work and section 7 summarises what further research could be done.

2 Problem description

2.1 The Embedding Problem

We start out by describing the first subproblem, realization of the 3-dimensional structure of the backbone using distance information. Suppose that we have information about distances between atoms in the protein, perhaps from experimental measurements. We have collected these distances in the distance matrix $D = (d_{ij})$, where $d_{ij}$ is the distance between atom $i$ and atom $j$.
We are now interested in finding coordinates of the atoms, such that the resulting Euclidean distances correspond to our given distances. So what we are looking for is an isometric embedding

$f : \{a_1, \dots, a_n\} \to \mathbb{R}^3,$

where $a_i$ denotes atom $i$.

It would be usual mathematical practice to assume exact knowledge about distances between atoms in the protein, but in real life we would not have any exact measuring results. In order to make things clearer, we have divided this problem further into increasingly difficult subproblems.

Footnote 1: We shall often use the word protein, even though we implicitly consider only the backbone.

(a) In this first subproblem we assume that we know the exact distances between the atoms in the protein.
(b) In this problem we have a complete set of upper and lower bounds on the distances between the atoms in the protein. The problem is to find coordinates consistent with the bounds. These bounds could be derived from theoretical knowledge of the chemistry involved or may come from actual measurements.
(c) In this final case we have only an incomplete set of bounds on distances between atoms in the protein. For example, some distances could be impossible to measure.

Some natural questions arise from these considerations: how much information do we need in order to describe the structure of the protein? And, concerning uniqueness of the solution: how much distance information do we need in order to have a unique solution (up to rigid motions of $\mathbb{R}^3$) to the embedding problem?

2.2 Classification of proteins

Our task in this subproblem is to classify proteins into groups with similar structure, which can be done in several ways. We focus our attention on classifying the proteins by their secondary structure.
When we look at the secondary structure we look at the backbone of the molecule, which consists of the $C_\alpha$-atoms; the structure can be characterised by how the backbone of carbon atoms is folded. There are (at least) three different categories of secondary structure. A protein can consist of one, two or all three of these structures. One type of structure is when the folding has a random structure, in which no pattern can be found at all. In the other two, there are certain patterns. The α-helix structure is one of these. Here the backbone forms a helical coil with approximately 3.6 amino acids per turn, see figure 1. The other type of folding is called β-sheet structure. In this structure the chain of carbon atoms that constitutes the backbone of the protein is wound to form a sort of two-dimensional sheet. Again we refer to e.g. [4] for more details.

To classify the proteins we therefore investigate which secondary structure they have, and we proceed in the following two different ways:

(a) using a filtered binary version of the distance matrix
(b) making a graph of the chain and using graph theory to study the data.

3 Embedding in $\mathbb{R}^3$

We present the mathematical background on which we based our studies. We first present two definitions and a theorem, and then describe in detail how they are involved in our algorithm.

3.1 Finding coordinates when all data are available

Definition 1. The Gram matrix of a set of vectors $\{x_1, \dots, x_n\} \subset \mathbb{R}^3$ is the matrix of the dot products of the vectors.

Definition 2. The distance matrix is the $n \times n$ matrix $D = (d_{ij})$, where $d_{ij}$ is the distance between atom $i$ and atom $j$.

We consider a protein which consists of $n$ atoms $\{a_1, a_2, \dots, a_n\}$ with given internal distances $d_{ij} = d(a_i, a_j)$. These distances could actually be results of measurements, and if they are precise enough, they should form a metric space in the mathematical sense.
Hence, since a distance function satisfies the symmetry property, it follows that $D$ is a symmetric matrix. Moreover, the elements of the diagonal are all equal to zero, since the distance of an atom from itself is zero.

We are interested in finding a system of Euclidean coordinates such that the Euclidean distances are equal to the metric distances $d_{ij}$. We choose one of the atoms, $a_o$, to be in the origin of the Euclidean system of coordinates. (For computational reasons we actually select this atom such that the sum of its distances from the other atoms is minimal.) We define the metric matrix for $n$ points in a metric space to be the square matrix $M = (m_{ij})$ of dimension $n \times n$, where:

$m_{ij} = \frac{1}{2}\left(d_{io}^2 + d_{jo}^2 - d_{ij}^2\right).$

We are going to prove that the transformation from distance coordinates to Euclidean coordinates is equivalent to diagonalization of the metric matrix. For more details see [2].

Theorem 1. A necessary and sufficient condition that a configuration of $n$ points in a metric space is embeddable in $\mathbb{R}^3$ is that the corresponding metric matrix is positive semidefinite of rank at most 3.

Proof. We first suppose that we know the distances for the atoms, and that every atom corresponds to a point in Euclidean space. Moreover, we choose an atom of the protein to be the origin of our Euclidean system. We are going to prove that the metric matrix is positive semidefinite of rank at most 3. Let $x_i = (x_{i1}, x_{i2}, x_{i3})^T$ be the Euclidean coordinates of the $i$-th point, and define the matrix $X = (x_1 \; \dots \; x_n)$ of dimension $3 \times n$. The Gram matrix of the coordinate vectors is:

$G = X^T X, \qquad g_{ij} = x_i \cdot x_j. \qquad (1)$

It is easy to see that a Gram matrix is positive semidefinite of rank at most 3 (for 3-dimensional vectors). For example, one can extend $X$ to an $n \times n$ matrix $\tilde{X}$ by filling in some zero-rows. We clearly have $G = \tilde{X}^T \tilde{X}$, from which the announced properties follow: the product of any square matrix with its transpose is positive semidefinite, and its rank cannot exceed the rank of $\tilde{X}$, which has at most three nonzero rows.
Moreover, by the law of cosines we have:

$x_i \cdot x_j = \frac{1}{2}\left(|x_i|^2 + |x_j|^2 - |x_i - x_j|^2\right) = \frac{1}{2}\left(d_{io}^2 + d_{jo}^2 - d_{ij}^2\right), \qquad (2)$

so the metric matrix is equal to the Gram matrix and hence positive semidefinite of rank at most 3.

We suppose now that we have $n$ points in a metric space, and that the metric matrix $M$ is positive semidefinite of rank at most 3. Since the metric matrix is symmetric, it can be diagonalized by an orthogonal matrix $Y$:

$M = Y \Lambda Y^T, \qquad (3)$

where $\Lambda$ is the diagonal matrix of the eigenvalues. We can choose $Y$ such that the nonzero, positive eigenvalues are among the first three elements on the diagonal of $\Lambda$. Scaling the eigenvectors by the square roots of the corresponding eigenvalues, we obtain the new square matrix $X = \Lambda^{1/2} Y^T$ of dimension $n \times n$, which has only zeros below the third row. We have:

$X^T X = Y \Lambda^{1/2} \Lambda^{1/2} Y^T = Y \Lambda Y^T. \qquad (4)$

From (3) we have that:

$M = Y \Lambda Y^T. \qquad (5)$

From (4) and (5) we obtain that the two matrices $X^T X$ and $M$ are equal. This means that the first three elements of the columns of $X$ are the coordinates of a set of vectors $x_1, \dots, x_n$ for which the corresponding Gram matrix is equal to the metric matrix. Since the elements of the Gram matrix are dot products of the vectors, we can apply the law of cosines. Along the diagonal of the matrix equation we obtain that $|x_i|^2 = d_{io}^2$ for all $i$. Using this we obtain also for the off-diagonal elements that $|x_i - x_j|^2 = d_{ij}^2$ for all $i, j$. Hence the distances between the atoms of the molecule are the same in the two structures (distance metric and Euclidean metric agree).

In the proof we assume that the metric matrix has at most three positive eigenvalues, and that the rest of them are zero. In practice, errors in measurements may cause some of them to be nonzero, or even worse negative, which means that it is impossible to obtain a real square root of the matrix $\Lambda$. Hence, at first sight, the algorithm is going to fail. So we need an algorithm that is able to calculate the Euclidean coordinates from approximations of the real distances between the atoms.
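For exact distances, the constructive steps of the proof can be sketched in Matlab as follows. This is a minimal sketch and not the code used in the project: the four test points, the variable names and the explicit computation of the metric matrix by $m_{ij} = \frac{1}{2}(d_{io}^2 + d_{jo}^2 - d_{ij}^2)$ are assumptions made for illustration.

```matlab
% Sketch: recover coordinates from an exact distance matrix (Theorem 1).
% Assumption: D is built from four made-up points so the result can be checked.
P = [0 0 0; 3 0 0; 0 4 0; 1 1 5]';       % 3 x n test coordinates (hypothetical)
n = size(P, 2);

% Distance matrix D(i,j) = |p_i - p_j|
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i,j) = norm(P(:,i) - P(:,j));
    end
end

% Origin atom: the one with the smallest sum of distances to the others
[~, o] = min(sum(D, 2));

% Metric matrix by the law of cosines: m_ij = (d_io^2 + d_jo^2 - d_ij^2)/2
d2 = D.^2;
M = 0.5 * (repmat(d2(:,o), 1, n) + repmat(d2(o,:), n, 1) - d2);

% Diagonalise and order the eigenvalues from largest to smallest
[Y, L] = eig((M + M') / 2);              % symmetrise against round-off
[lambda, idx] = sort(diag(L), 'descend');
Y = Y(:, idx);

% Coordinates: first three rows of Lambda^(1/2) * Y'
X = diag(sqrt(max(lambda(1:3), 0))) * Y(:, 1:3)';

% Check: the embedded points reproduce the given distances
Dnew = zeros(n);
for i = 1:n
    for j = 1:n
        Dnew(i,j) = norm(X(:,i) - X(:,j));
    end
end
maxError = max(abs(D(:) - Dnew(:)));
```

The recovered coordinates X generally differ from P by a rigid motion (and possibly a reflection), so the check compares distances rather than coordinates.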
What we could do, and will do, is to perform the steps described in the proof, but at the point where we need the square root of $\Lambda$, we keep the three largest positive eigenvalues and set all others to zero. This choice of approximate embedding can be justified mathematically in the sense that what we obtain is the Euclidean distance matrix closest to our original measured distance matrix, with respect to an appropriate norm on the set of distance matrices. In practice we find that if the data are "sensible", we will obtain three large positive eigenvalues for the metric matrix, while all others are orders of magnitude smaller in absolute value. The reader can find more details in [2].

3.2 Finding Coordinates when the Data are Given with Error Bounds

The distance geometry problem with given error bounds consists of finding a set of positions $x_1, \dots, x_n$ in $\mathbb{R}^3$ such that:

$l_{ij} \le |x_i - x_j| \le u_{ij}$ for all $i, j$,

where $l_{ij}$ and $u_{ij}$ are the given lower and upper bounds on the distance between atom $i$ and atom $j$.

The strategy we used to solve the problem consists of two different parts. The main idea is to use optimisation, but in order to be able to solve the problem for large proteins, which have many atoms, we need a good initial guess. We can get this by taking a random distance matrix. Below we first describe how we generated this initial infeasible solution, and then how we can transform our problem into an optimisation problem and find a solution of it.

3.2.1 Generation of an Infeasible Initial Solution Using Random Distance Matrices

In order to get a good initial guess for the optimisation problem, we start by calculating an approximate solution to the problem where the data are given with error bounds. This solution will not be feasible, in the sense that not all distances will be within the given error bounds.

The initial guess is obtained by taking the interval between the error bounds and computing the midpoint of this interval.
Then a random number generator is used that produces a normally distributed random number for every interval, with the mean equal to the midpoint of the interval and the variance equal to the length of the interval. After this procedure we test whether the generated random distance matrix is approximately embeddable. "Approximately" means that we check the three largest eigenvalues of the corresponding metric matrix, following theorem 1. If there are negative eigenvalues we set them equal to zero; otherwise we already have a positive semidefinite matrix. Then we use our algorithm which finds an embedding in $\mathbb{R}^3$ to get the approximate coordinates.

3.2.2 Optimisation Procedure to Find Feasible Embedding

By converting our problem into an optimisation problem we can use the initial guess from above to get a feasible solution. We have used an unconstrained optimisation formulation, where the feasibility is measured by the cost function. That is, we add a penalty term to the cost function that is large when the bounds are violated. This has been suggested by Moré and Wu (see [3]).

The formulation of our distance geometry problem as a global optimisation problem is as follows: find the global minimum of the function

$f(x) = \sum_{i,j} p_{ij}(x_i - x_j),$

where the pairwise function $p_{ij} : \mathbb{R}^3 \to \mathbb{R}$ is defined by:

$p_{ij}(x) = \min{}^2\left\{\frac{|x|^2 - l_{ij}^2}{l_{ij}^2}, 0\right\} + \max{}^2\left\{\frac{|x|^2 - u_{ij}^2}{u_{ij}^2}, 0\right\}.$

The set of positions $x = \{x_1, \dots, x_n\}$ solves the distance geometry problem if and only if $x$ is a global minimiser of $f$ and $f(x) = 0$.

3.2.3 Computational Experiments

We implemented both the algorithm for finding an initial guess to the optimisation problem and the optimisation algorithm in Matlab. The code is found in the appendix.

In both cases we used distance matrices created from the original data found in the Protein Data Bank, but we made our error bounds by adding random numbers to the exact distances.

For the approximate infeasible solutions we got from our first algorithm, usually about 50 of the bounds were violated.
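The penalty cost of section 3.2.2 and the count of violated bounds can be sketched as below. This is only an illustration: the three trial positions and the bounds are made up, and the pairwise penalty is assumed to have the form $\min^2\{(|x|^2 - l_{ij}^2)/l_{ij}^2, 0\} + \max^2\{(|x|^2 - u_{ij}^2)/u_{ij}^2, 0\}$ used by Moré and Wu [3], which may differ in detail from the cost function of the project (not included in the appendix).

```matlab
% Sketch: penalty cost function and count of violated bounds.
% Assumptions: X, Lb, Ub and the penalty form are illustrative only.
X  = [0 0 0; 3.8 0 0; 7.6 0 0]';          % 3 x n trial coordinates (hypothetical)
n  = size(X, 2);
Lb = 3.5 * ones(n);                       % lower bounds on every distance
Ub = 4.0 * ones(n);                       % upper bounds on every distance

f = 0;          % total cost, zero exactly when all bounds hold
violated = 0;   % number of pairs outside [Lb, Ub]
for i = 1:n
    for j = i+1:n
        r2 = sum((X(:,i) - X(:,j)).^2);               % squared distance
        lo = min((r2 - Lb(i,j)^2) / Lb(i,j)^2, 0);    % negative if too close
        hi = max((r2 - Ub(i,j)^2) / Ub(i,j)^2, 0);    % positive if too far
        f = f + lo^2 + hi^2;
        if lo < 0 || hi > 0
            violated = violated + 1;
        end
    end
end
fprintf('cost = %g, violated bounds = %d of %d\n', f, violated, n*(n-1)/2);
```

In this toy configuration only the pair (1,3) lies outside its bounds, so the cost is positive; a feasible configuration would give a cost of exactly zero.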
Unfortunately we could not get the optimisation algorithm to work. Possible reasons are that the problem was too large for the Matlab optimisation routine we used, or that we used the routine in an ineffective way. If there had been more time we would have implemented a steepest descent algorithm ourselves, so that we could control what happens in the optimisation loop.

3.2.4 Concluding Remarks

The paper of Moré and Wu (see [3]) gives ways of improving the minimisation algorithm. They suggest transforming the cost function by a Gaussian transform into a smoother function with fewer local minima before applying the minimisation routine. This of course gives an approximate solution, but one can use our random-distance-matrix algorithm as a starting point and then iterate the improved optimisation algorithm with a smoothing factor that tends to zero.

4 Classification of Proteins

A common wish when working with molecules or, in this case, proteins is to classify different proteins into different groups. This could be done in several different ways. One would be to sort the proteins by their function. Another way would be to look at the structure of the protein molecules and classify them by similarities in the structure. When one looks at 3d pictures of the backbone structure of different proteins it is easy to see that some proteins have similar structures while some do not (see figure 1).

Figure 1: 3d plots of the backbones of a) 256b1-protein, b) 1brx-protein, c) 5ebx-protein, d) 2phy-protein. a), b) show proteins with α-helix structure and c), d) proteins with β-sheet structure.

This way of classifying the proteins can also be interesting from a mathematical point of view, so this will be our way of classifying proteins.
We want to take into account the types of substructures that the protein has, but it would also be interesting to try to find a more global measure of how compact the molecule is, that is, whether the atoms are situated all together or form different groups. The latter seems to be the more difficult part, so the main focus in our work was a classification based on the different types of substructures in the molecule. A large part of the coming sections will therefore be about how we can locate these different substructures just by using the information about the distances between the atoms. We use two different, yet similar, strategies to do this. The first one is to use a filtered binary version of the distance matrix. The other one is to make a graph of the atoms in the chain and use graph theory to interpret the data. In the latter we also discuss some ideas about how graph theory can be used to get a more global measure of the protein.

4.1 Classification Using the Distance Matrix

The first approach when attempting to classify the protein molecules was to look at the distance matrix. By studying the colour visualisation of the distance matrix for several proteins one can easily see that different types of molecules give different patterns, and the similarity of the patterns for molecules of the same type is striking. Because of this the distance matrix seems to be a good tool for classifying the molecules.
Figure 2 shows distance matrices for different proteins; the first two proteins are of one type and the other two are of another type.

Figure 2: Distance matrices of a) 256b1-protein, b) 1brx-protein, c) 5ebx-protein, d) 2phy-protein. a), b) show distance matrices of proteins with α-helix structure and c), d) distance matrices of proteins with β-sheet structure.

In order to make it easier to analyse the patterns, we filtered the distance matrix and turned it into a binary matrix. We decided to concentrate on the small distances below a certain cut-off value. The distances below the cut-off value are set to 1 and the rest are set to 0. The cut-off value was decided through a trial and error process, and we found that a suitable cut-off value was 8 Å for most of the proteins we looked at. In figure 3 the filtered, binary versions of the distance matrices shown in figure 2 are visualised.

Figure 3: Filtered distance matrices of a) 256b1-protein (nz = 996), b) 1brx-protein (nz = 2054), c) 5ebx-protein (nz = 564), d) 2phy-protein (nz = 1174). a), b) show filtered distance matrices of proteins with α-helix structure and c), d) filtered distance matrices of proteins with β-sheet structure.

The main focus was now to find the patterns in the binary distance matrix that correspond to the different substructures (α-helices and β-sheets). The spiral structure of the α-helix is such that atoms that are next to each other in the chain, in groups of four, are also close in the Euclidean sense. In the binary distance matrix this can be seen as bands along the diagonal.
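The filtering step and the resulting band can be illustrated on an idealised helix. The sketch below is not taken from the project code: the helix parameters (radius 2.3 Å, rise 1.5 Å and a rotation of 100 degrees per residue, i.e. 3.6 residues per turn) are textbook approximations chosen as assumptions, and the variable names are hypothetical.

```matlab
% Sketch: filter a distance matrix into a binary contact matrix.
% Assumption: an idealised alpha-helix stands in for real C-alpha coordinates.
n = 20;
cutoff = 8;                               % cut-off value in Angstrom
t = (0:n-1) * (100 * pi / 180);           % rotation angle of each residue
M = [2.3 * cos(t); 2.3 * sin(t); 1.5 * (0:n-1)]';   % n x 3 coordinates

D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i,j) = norm(M(i,:) - M(j,:));
    end
end

B = double(D < cutoff);                   % 1 = close pair, 0 = far apart

% Largest chain separation |i - j| that still counts as close
sep = abs(repmat((1:n)', 1, n) - repmat(1:n, n, 1));
bandwidth = max(max(sep .* B));
```

With these parameters atoms up to four positions apart in the chain are closer than 8 Å, while atoms five or more positions apart are not, which produces the band along the diagonal described above.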
Figure 1a) shows the backbone structure of the protein 256b1, which contains four α-helices. Figure 3a) shows the visualisation of the corresponding binary distance matrix, which clearly has four thick bands along the diagonal indicating the different helices. By just searching for the bands along the diagonal that exceed a certain length, the α-helix structures can easily be found. A search routine was implemented in Matlab to do this. For the code see the Appendix. The program worked well on most molecules that we tested: for most of the proteins that contain helix structures the program found these, and the program found no helices in the proteins with only sheet structures. There were however some proteins with helix structure that the program could not deal with. The problem was probably that the cut-off value was too small for these proteins. The radius of the helices can differ quite a bit, and in the cases where the program failed we believe that the radius was so large that the atoms in the helix were too far apart to be recognised as helix structure.

A search routine to find the β-sheets was not implemented, but the patterns that correspond to these can also be recognised in the binary distance matrix. These patterns constitute islands that are antiparallel to the diagonal (for sheets with antiparallel strands). See figure 3c), which shows the visualisation of the binary distance matrix of the protein 5ebx.

Figure 4 shows a simplified version of a sheet.

Figure 4: A simplified version of a β-sheet from a protein (atoms labelled m, m+1, m+2, m+7, m+8, m+9, m+14, m+15, m+16).

In this figure all atoms that are considered close to atom $m$ are marked, and from this picture it should be clear how the pattern in the distance matrix occurs. A search routine that finds these sheets should be based on an algorithm which sums the elements in each row antiparallel to the diagonal.
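The summation along rows antiparallel to the diagonal could look roughly as follows. Since no sheet search was implemented in the project, this is purely a sketch under assumptions: the toy matrix B contains one hypothetical antiparallel pairing with $i + j = 13$, and the band of width 2 around the diagonal is skipped by choice.

```matlab
% Sketch: sum a binary contact matrix along its anti-diagonals.
% Assumption: B is a toy matrix with one antiparallel pairing, i + j = 13.
n = 12;
B = eye(n);                               % every atom is close to itself
for i = 1:5
    B(i, 13 - i) = 1;                     % hypothetical strand pairing
    B(13 - i, i) = 1;
end

s = zeros(1, 2*n);                        % s(k) = sum of B(i,j) with i + j = k
for i = 1:n
    for j = 1:n
        if abs(i - j) > 2                 % skip the trivial band at the diagonal
            s(i + j) = s(i + j) + B(i,j);
        end
    end
end
[best, k] = max(s);                       % strongest anti-diagonal
```

Here the largest sum is found for $k = 13$, the anti-diagonal on which the pairing was placed; on real data several neighbouring anti-diagonals would have to be inspected together.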
For those rows where the sum is high, a local search should be performed to find the whole island and to see which atoms constitute the sheet.

When these different substructures of the proteins are found, the proteins can be classified by the substructures that they contain. We can then form four different groups: those proteins that have α-helices, those that have β-sheets, those that have both structures, and those that have none, the so-called randomly folded proteins.

A more global measure of the protein structure might be possible if one uses some sort of image processing on the filtered distance matrix. We think image processing would enable us to compare two filtered distance matrices in a more intelligent way than just calculating the difference matrix, and such a measure could perhaps also give us a way of "globally" classifying the proteins.

5 Classification using graph theory

5.1 Why graph theory?

In the previous sections we represented the data essential for our investigations in the form of a matrix. In a distance matrix all distances in 3D-space between the atoms were given, in the same order as the atoms follow each other in the protein chain. The same information can also be represented in the form of a graph. The question is what added value this gives. We show in this section that the operations available in graph theory bring new insight into how one can easily identify substructures in the proteins and also generate global measures for classifying the proteins.

The way in which these insights are obtained is more or less the same for all the methods mentioned. The protein is modelled as a graph. Some operations on this graph give a simplified vector. These vectors are more easily comparable than complete matrices, and in general give enough information to distinguish different types of proteins.

5.2 Constructing graphs

We can represent the 3D-structure of the protein by a graph $G = (V, E)$, where the set of vertices $V$ represents the $C_\alpha$-atoms.
The set of edges $E$ can be constructed in different ways. The most obvious way to do this is similar to the way we created the binary distance matrices. That is:

$E = \{(v_i, v_j) : d_{ij} \le t\},$

where $d_{ij}$ is the 3D-distance between the $i$-th and $j$-th $C_\alpha$-atoms in the protein chain, and $t$ is a certain threshold that has to be chosen such that the important substructures are distinguishable ($t = 8$ Å is default).

5.2.1 Generalized definition of the edge set

An alternative definition of the edge set can be very interesting. Because the edges between subsequent vertices are trivial, it is a good idea to leave them out, which gives us the following definition:

$E_k = \{(v_i, v_j) : d_{ij} \le t, \ |i - j| \ge k\}.$

Note that $E_1$ gives us the same definition for the edge set as given above, so this alternative definition is in fact a generalization. $E_2$ and $E_3$ give us new graphs with similar features. In $E_k$ with higher $k$, an analysis of the components gives us results about some local structures. We suppose that each component (especially the components that have a regular structure, like a block) represents a substructure. Some other structures, however, will be lost with higher $k$, so one has to be very careful with this.

5.3 Local sub-structures

5.3.1 Degrees of the graph

A basic idea for saying something about the substructures in the protein is to look at the array of degrees of the graph. The conjecture is that certain substructures have certain properties in this array, namely a sub-array of degrees that are higher than the average degree. We call this local regularity.

The distribution of the degrees is defined as the frequency with which each degree occurs. By looking at the distribution of the degrees, we can say something about the number of different local substructures. This distribution can also be useful for the global classification of the proteins.

Apart from this, other ways of finding local substructures can be thought of; we discuss one of these ideas in the next sections.
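The construction of the graph and its degree array can be sketched as follows. The sketch assumes an edge set of the form $E_k = \{(v_i, v_j) : d_{ij} \le t, |i - j| \ge k\}$; the six-atom chain, the threshold and the value of $k$ are made-up illustration values, not data from the project.

```matlab
% Sketch: edge set E_k and degree array of the contact graph.
% Assumptions: P is a made-up chain; t and k are free parameters.
P = [0 0 0; 3.8 0 0; 7.6 0 0; 7.6 3.8 0; 3.8 3.8 0; 0 3.8 0];  % hypothetical chain
n = size(P, 1);
t = 8;                                    % distance threshold in Angstrom
k = 2;                                    % minimal chain separation |i - j|

A = zeros(n);                             % adjacency matrix of the graph
for i = 1:n
    for j = 1:n
        if abs(i - j) >= k && norm(P(i,:) - P(j,:)) <= t
            A(i,j) = 1;
        end
    end
end

deg = sum(A, 2)';                         % degree of every vertex
maxDeg = max(deg);
freq = zeros(1, maxDeg + 1);              % freq(d+1) = number of vertices of degree d
for v = 1:n
    freq(deg(v) + 1) = freq(deg(v) + 1) + 1;
end
```

The vector freq is the distribution of the degrees; two proteins could then be compared through these short vectors instead of through their full distance matrices.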
Note that the generalized definition of the edge set gives the same results in this case.

5.3.2 Divide graph in blocks

To explain the idea of this section we first have to explain some terms. A vertex $v$ of a connected graph $G$ is a cut-vertex of $G$ if and only if there exist vertices $u$ and $w$ ($u, w \ne v$) such that $v$ is on every $u$–$w$ path of $G$. A non-trivial connected graph with no cut-vertices is called a non-separable graph. A block of a graph $G$ is a maximal non-separable subgraph of $G$.

The idea now is that having a block in the graph gives some indication of a substructure. This idea has to be tested to see if it works for real-life cases.

5.3.3 Other ideas

There are a lot of other similar concepts in graph theory that can all help us, like local regularity, looking at the cut-vertices, graph colouring, etc. Some trial and error can help to determine which concepts work best.

5.4 Global classifying

Apart from finding local substructures we were also interested in classifying the proteins into different classes without first identifying the substructures.

5.4.1 Connectivity and edge-connectivity

Let $G$ be a connected graph (a graph with a path between every two vertices). A subset $S$ of the edge set of a connected graph $G$ is an edge cutset of $G$ if $G - S$ is disconnected.

By the following iterative process an idea of the global structure can be obtained. In this process the vector $c$ is similar if the global structure is similar.

1. Start with the graph $G$ and $i = 1$.
2. Find the smallest cutset $S$ of the components of $G$.
3. Define the entry $c_i$ of the vector from this cutset, and replace $G$ by $G - S$ and $i$ by $i + 1$.
4. If there are still nontrivial components of $G$, go to 2; otherwise finish the algorithm.

Similar results are obtained by (vertex) connectivity. The connectivity of a graph $G$ is defined as the minimum number of vertices whose deletion from $G$ produces a disconnected or trivial graph.
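Cut-vertices, and with them the case of connectivity 1, can be found by brute force: delete each vertex in turn and test whether the rest of the graph is still connected. The following sketch assumes a small toy graph (two triangles sharing vertex 3) and uses a plain breadth-first search; it is meant as an illustration of the definitions, not as an efficient block-finding algorithm.

```matlab
% Sketch: find the cut-vertices of a graph by brute force.
% Assumption: A is a toy adjacency matrix, two triangles sharing vertex 3.
E = [1 2; 2 3; 1 3; 3 4; 4 5; 3 5];       % hypothetical edge list
A = zeros(5);
for e = 1:size(E, 1)
    A(E(e,1), E(e,2)) = 1;
    A(E(e,2), E(e,1)) = 1;
end
n = size(A, 1);

cutVertices = [];
for v = 1:n
    keep = setdiff(1:n, v);               % delete vertex v
    S = A(keep, keep);
    % breadth-first search from the first remaining vertex
    visited = false(1, n - 1);
    visited(1) = true;
    queue = 1;
    while ~isempty(queue)
        u = queue(1);
        queue(1) = [];
        nb = find(S(u,:) & ~visited);     % unvisited neighbours of u
        visited(nb) = true;
        queue = [queue nb];
    end
    if ~all(visited)                      % the remaining graph is disconnected
        cutVertices(end+1) = v;
    end
end
```

For this graph only vertex 3 is a cut-vertex, so the graph has connectivity 1 and its two triangles are its blocks.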
5.4.2 Distance matrix

For a nontrivial graph $G$ and a pair $u, v$ of vertices of $G$, the distance $d(u, v)$ between $u$ and $v$ is the length of a shortest $u$–$v$ path in $G$, if such a path exists. This shortest path can be obtained by Moore's Breadth-First Search Algorithm (see [6], page 101). The distance matrix is the matrix with this distance for each pair of vertices. What to do with this distance matrix? If we simply count the frequency of all numbers in this matrix, we again have an array which gives us a good idea about the global structure of the graph.

6 Conclusions

The first part of our project was to find the coordinates of the atoms in the molecule, knowing only the distances between atoms. We are able to do that using theorem 1. The proof of this theorem is constructive and can be directly used to find the embedding in $\mathbb{R}^3$. This is possible when all the distances between the atoms are given. In that case we have an exact solution of our problem, unique up to rigid motions.

Another goal was to find the coordinates when we only have lower and upper bounds on the distances instead of exact numbers. We tried to find a random set of coordinates and then optimise it until all the constraints are fulfilled. However, we did not manage to get our implementation of this algorithm to work.

We also tried to check how much information we need to find the coordinates, i.e. whether we need all the distances in the distance matrix or not, and if not, what is the least number of elements of the matrix we need to obtain the unique solution. But we did not manage to do much research on that topic.

The second part of our project was to investigate the secondary structure of the proteins. Since α-helices and β-sheets are the main (most common) structures, we have focused on finding them in the protein. We have implemented an algorithm which finds helices in the molecule using the filtered distance matrix.
As output the program gives the number of the atom where a helix starts, and the number of atoms of which the helix consists. It also draws the filtered distance matrix for the protein under investigation, the molecule in 3d, and the helices cut out of the protein. So our program gives information about the location and length of helices, provided there are any in the molecule. A similar algorithm can be applied to find β-sheets. Knowledge about helices and sheets is the most important knowledge about the secondary structure of proteins, and we can get it using our program.

We have also tried to achieve the same aim (investigating the secondary structure) using a different tool, graph theory. We have implemented an algorithm which draws a graph and calculates the degrees of the vertices based on a filtered distance matrix. Looking at the graph we can find parallel and antiparallel β-sheets, while the degrees give us information about how compact (folded) the molecule is.

We think that graph theory can also be useful in a general classification of proteins. Two similar-looking graphs should correspond to two similarly built molecules.

7 Further Research

Of course some aspects of our work could be studied more deeply, in order to reach a better understanding of the solution of the problems. Concerning the first problem, when there are errors in the data, we would like to investigate the solution space of the optimisation problem, and again make an optimisation routine to find a solution, for example by implementing a steepest descent algorithm.

In the second problem some more work could be done, for example implementing a search routine to find β-sheet structures in the proteins (as we have done for α-helix structures), or including a variable cut-off limit in the search routine.

Moreover, it would be interesting to try to use image processing to classify molecules, and to make further implementations using graph theory to get a global measure of the structure of the molecule.
Appendix

Algorithm for Finding Approximate Solution

function [C,counter] = approximation(Matrix)
% Approximate 3-d coordinates C (3 x N) from the metric matrix, using its
% three largest eigenvalues. counter is the number of these eigenvalues
% that are positive (counter == 3 means the matrix passed the test).
M=Matrix;
N=size(M,2);
% --- test pos. semidefinite
[Eigenvectors, Eigenvalues] = eig(M);
helpvector=diag(Eigenvalues);
% --- sorting of the eigenvalues after their size
[sortedvec,index] = sort(helpvector);
for i=1:N
  backindex(i)=index(N-i+1);    % backindex(1) points to the largest eigenvalue
end;
% --- consider the 3 largest eigenvalues
counter = 0;
C=zeros(3,N);
for i=1:3
  lambda=helpvector(backindex(i));
  if (lambda>0)
    counter=counter+1;
  end;
  % row i of C belongs to the i-th largest eigenvalue; a negative
  % eigenvalue is replaced by zero so that C stays real
  C(i,:)=sqrt(max(lambda,0))*Eigenvectors(:,backindex(i))';
end;
counter

Algorithms for Calculating a Distance Matrix with Random Errors

function main(iterations)
% This program uses random distances to solve the distance geometry
% problem with given error bounds. The variable iterations is the
% maximum number of tries to find a random distance matrix that fits.
load pdb2PHY.dat -ascii;
A=pdb2PHY';
n=size(A,2);                    % number of atoms
epsilon=0.1;                    % error bounds
D=distancematrix(A);            % computation of the distancematrix
                                % -procedure not included here
M=metricmatrix(D);              % computation of the metricmatrix
                                % -procedure not included here
for i=1:iterations
  % computation of random distance-matrix
  ND=newdistmatrix(D,epsilon);
  % corresponding metric matrix
  NM=metricmatrix(ND);
  % test if embeddable
  [C,counter]=approximation(NM);
  if counter == 3
    break;
  end;
end;
% distancematrix expects coordinates, so the error is taken between the
% original distances and the distances of the embedded coordinates C
err=D-distancematrix(C);
figure(1)
image(D)                        % visualization of the original distance matrix
figure(2)
image(ND)                       % visualization of the random distance matrix
figure(3)
image(err)                      % visualization of the error-matrix

function New = newdistmatrix(Matrix,epsilon)
% Adds a normally distributed error of size epsilon to every distance.
D=Matrix;
N=size(D,2);
eps=epsilon;
hurtedbounds=0;                 % number of distances outside [D-eps, D+eps]
New=zeros(N);
errormatrix=zeros(N);
for i=1:N
  for j=1:N
    if (i>j)
      rn=randn;
      errormatrix(i,j)=(abs(rn).*eps)./D(i,j);    % relative error
      if (abs(rn)-1>0)
        hurtedbounds=hurtedbounds+1;
      end;
      New(i,j)=D(i,j) + rn.*eps;
      New(j,i)=New(i,j);
    end;
  end;
end;
errorsum=0;
number=0;
for i=1:N
  for j=1:N
    if (i>j && errormatrix(i,j)~=0)
      errorsum=errorsum+errormatrix(i,j);
      number=number+1;
    end;
  end;
end;
erroraverage=errorsum./number.*100    % average relative error in percent
hurtedbounds
totalnumberofbounds=0.5*(N^2-N)       % number of atom pairs, N(N-1)/2

Optimisation Algorithm

load pdb1NXB.dat -ascii
A=pdb1NXB;
D=distancematrix(A');
N=size(D,1);
[k,d]=size(A);
% smallest distance between two different atoms minus 0.5
% (the zero diagonal of D is excluded, otherwise epsilon would be negative)
epsilon=min(D(D>0))-0.5;
% random perturbation of the coordinates as starting point
for i=1:k
  for j=1:d
    ANew(i,j)=A(i,j)+randn*epsilon;
  end
end
% lower and upper bounds on the distances
global L;
global U;
for i=1:N
  for j=1:N
    L(i,j)=D(i,j)-epsilon;
    U(i,j)=D(i,j)+epsilon;
  end;
end;
% box constraints, only needed for the fmincon call below
for i=1:3
  BOX1(:,i)=-100;
  BOX2(:,i)=100;
end
%global ANew;
%options=optimset('TolFun',10,'Display','iter');
%Xmin=fmincon('funktion',ANew,[],[],[],[],BOX1,BOX2,[],options)
% the objective function file funktion.m is not included here
fminsearch('funktion',ANew)

Search Algorithm for Finding α-helix Structures in the Distance Matrix

clear;
limit=8
load icse.dat -ascii
M=icse;
[n,dimension]=size(M)
% --- the molecule in 3-d with numbered atoms
figure(1)
plot3(M(:,1),M(:,2),M(:,3));
for i=1:n
  text(M(i,1),M(i,2),M(i,3),int2str(i));
end
% --- distance matrix D and filtered distance matrix DMIN
for i=1:n
  for j=1:n
    D(i,j)=sqrt((M(i,1)-M(j,1))^2+(M(i,2)-M(j,2))^2+(M(i,3)-M(j,3))^2);
    if D(i,j)>limit
      DMIN(i,j)=0;
    else
      DMIN(i,j)=D(i,j);
    end
    DMIN(i,i)=1;
  end
end
figure(2)
spy(DMIN);
figure(3)
image(D)
% --- search for runs of atoms i that are close to atoms i+3 and i+4
j=1;
i=1;
while i<n-4
  if DMIN(i,i+3)~=0 && DMIN(i,i+4)~=0
    m=1;
    istar=i;
    i=i+1;
    % the test i<=n-4 keeps the index i+4 inside the matrix
    while i<=n-4 && DMIN(i,i+3)~=0 && DMIN(i,i+4)~=0
      m=m+1;
      i=i+1;
    end
    if m>4
      H(j,1)=istar;             % atom where the helix starts
      H(j,2)=m;                 % length of the helix
      j=j+1;
    end
  else
    i=i+1;
  end
end
% --- the helices cut out of the protein
figure(4)
hold on
view(3)
for i=1:j-1
  a=H(i,1);
  b=H(i,1)+H(i,2);
  plot3(M(a:b,1),M(a:b,2),M(a:b,3));
end

References

[1] I.J. Schoenberg, "Remarks to Maurice Fréchet's article 'Sur la définition axiomatique d'une classe d'espace distanciés vectoriellement applicable sur l'espace de Hilbert'", Annals of Mathematics, 1935.
[2] Timothy F. Havel, Irwin D. Kuntz and Gordon M. Crippen, "The theory and practice of distance geometry", Bulletin of Mathematical Biology, 45: 665-720, 1983.
[3] Jorge J. Moré and Zhijun Wu, "Distance Geometry Optimization for Protein Structures", Journal of Global Optimization, 15: 219-234, 1999.
[4] N.J. Darby and T.E. Creighton, "Protein Structure", IRL Press, 1993.
[5] Holtzclaw, Robinson, Odom, "General Chemistry", Ninth Edition, D.C. Heath and Company, 1991.
[6] Chartrand and Oellermann, "Applied and Algorithmic Graph Theory", McGraw-Hill, 1993.