A Study on Pairwise Comparison of Protein Structure by Dali and the Multiple Structural Alignment of Protein Structures by MultiProt

Rajalekshmy Usha
CS 7301-501, Fall 2006
University of Texas at Dallas

INTRODUCTION

The structure of a protein can throw light on its function and its evolutionary history. To obtain this information we need to know the structure and its relationship with other proteins, which in turn requires knowing the different folds that proteins adopt and having detailed structural information for many proteins. Nearly all proteins have structural similarities with other proteins, and most of them share a common evolutionary origin. Individual structures provide explanations of specific biochemical functions and mechanisms, whereas comparisons of structures give insight into the general principles governing these molecules, the interactions they make and their biological roles. Three-dimensional structures form the foundation of structural bioinformatics, and all structural analyses depend on them. To facilitate the comparison of known protein structures, many algorithms and resulting web resources provide protein structure comparison at the level of the domain or the complete polypeptide chain. DALI is one such method.

HOW IT ALL STARTED

The first comprehensive collection of macromolecular sequences appeared in the Atlas of Protein Sequence and Structure, which was published from 1965 through 1978 under the editorship of Dr. Margaret O. Dayhoff. Dr. Dayhoff and her research team pioneered the development of computational methods for comparing protein sequences, detecting duplications within sequences, detecting distantly related sequences and deducing evolutionary relationships from alignments of protein sequences.
Since then many sequence databases have been established, and the relationships among their entries are detected using dynamic programming algorithms, which are popular in computer science. These methods handle the insertions and deletions occurring between distant evolutionary relatives very efficiently.

Structural data has always been sparser than sequence data. While the PDB (Protein Data Bank) has 39,464 structural entries to date, NCBI (National Center for Biotechnology Information) has over 12 million sequence entries. The main reason is the technical difficulty of structure determination. Protein structures can be determined experimentally by NMR spectroscopy and X-ray crystallography. In NMR spectroscopy, a highly purified biological sample is placed in a strong magnetic field and probed with radio-frequency pulses. The process is time consuming because it requires interactive analysis of the data by a trained scientist. X-ray crystallography is the other common method used to determine the structure of macromolecules such as proteins and nucleic acids. The three-dimensional structure is determined from the crystallographic data. The protein samples must first be crystallized. Since most proteins are sensitive to high temperature and to high concentrations of organic solvents, it is difficult to obtain protein crystals of acceptable quality. X-rays are passed through the crystallized sample and the diffraction patterns are analyzed. A model of the molecule is built from the diffraction data. The model goes through several rounds of refinement, and the final model is deposited in a crystallographic database such as the Protein Data Bank (PDB) or the Cambridge Structural Database. It is interesting to note that about 90% of the protein structures in the PDB were obtained by X-ray crystallography and around 9% by NMR.
[6]

Structural classifications began to emerge during the mid-1990s, although the first protein crystal structure was solved in the late 1950s. The main ones were the Structural Classification of Proteins (SCOP) (Murzin et al., 1995; Lo Conte, 2000), DALI (Holm and Sander, 1996), and CATH (Orengo et al., 1997; Pearl et al., 2001) databases and data resources. Other classifications include DDBASE (Sowdhamini et al., 1998), 3Dee (Dengler, Siddiqui, and Barton, 2001) and DaliDD (Holm and Sander; Dietmann and Holm, 2001). These databases use different algorithms for comparing three-dimensional structures, and each uses different criteria to measure similarity between protein structures. [5]

WHY USE PROTEIN STRUCTURE FOR COMPARISON?

Traditionally, proteins with similar amino acid sequences have been used to infer the structure and function of a protein, on the assumption that proteins with similar sequences have similar functions and structures and are evolutionarily related. However, sequence similarity searches can detect evolutionary relationships reliably only when the sequence identity is above roughly 25%. Proteins below this level of identity enter a “twilight zone” of similarity. A structural similarity search can extend the detection of evolutionary relationships beyond the borders of the “twilight zone”. It has been confirmed that structures are much better conserved than sequences over long periods of time (Chothia and Lesk, 1986). The possible reason is that a structure is able to accommodate a wide range of mutations and that physical forces favor certain structures.

Protein structures are stored in the form of a PDB file, which consists of a list of the three-dimensional coordinates of all the atoms in the protein. The PDB file has little or no information on function. However, by studying the structure we can derive information relating to biochemical function.
This is especially helpful if the protein in question has unknown function. This fact is illustrated in figure 1.

Homologous proteins are expected to have similar structures and functions, but there are exceptions: there is diversity of function within homologous superfamilies. For example, take the case of lysozyme and α-lactalbumin (Acharya et al., 1991). These two homologous proteins have high sequence identity but different functions (see figure 2(a) and (b)). In contrast, the globin family of proteins has undergone multiple amino acid changes through the course of evolution but their function remains the same (see figure 3(a) and (b)). Analogous proteins, on the other hand, may share structural similarity but have no sequence similarity. These analogues are examples of convergent evolution toward the same function. Proteins that exhibit these features are trypsin and subtilisin (Wallace, Laskowski, and Thornton, 1996) (see figure 4(a) and (b)).

Figure 1. The relationship between 3D structure and function.

Figure 2(a). Lysozyme and α-lactalbumin have 40% sequence identity and similar structures but different functions. Lysozyme has O-glycosyl hydrolase activity, but α-lactalbumin does not have this enzymatic activity. However, both lysozyme and α-lactalbumin have a sugar-binding site. The catalytic residues of α-lactalbumin have changed over the course of evolution and it has lost its enzymatic activity.

Figure 2(b). Superimposed image of lysozyme (PDB ID: 1gd6:A) and α-lactalbumin (PDB ID: 1b9o:A).

Figure 3(a). The two hemoglobins share only 8% sequence identity but their overall fold and function are identical.

Figure 3(b). The superimposed image of the two hemoglobins (1VHB:A and 2LHB) from Figure 3(a).

[Figure 4(a) panel labels: Ser-His-Asp triad]

Figure 4(a). Subtilisin and chymotrypsin are both endopeptidases. They do not share any sequence identity and they have unrelated folds.
However, each of them has an identical Serine-Histidine-Aspartate catalytic triad, which catalyzes peptide bond hydrolysis. The two enzymes are a classic example of convergent evolution.

Figure 4(b). Superimposed image of subtilisin (1GNV) and chymotrypsin (1GL0:E). Note that there are few structural similarities.

By comparing 3D structures of proteins with computational tools like DALI and MultiProt, biologists can identify new types of protein architecture, identify common structural cores and discover evolutionary relationships between protein molecules. Such comparison helps to organize the growing set of known protein shapes. It is also useful for comparing the structure of a protein of unknown function with the structures of proteins of known function. From these comparative studies, we can derive functional information from the closest match within defined bounds. However, caution must be exercised because, as the examples above illustrate, two proteins that are structurally similar but have no sequence identity may have arisen by convergent evolution. Similarly, two proteins with similar structures and similar functions may not be evolutionarily related.

1. DALI

The 1993 paper of Holm and Sander (1993a) popularized protein structure comparison. This paper describes the use of distance matrix alignment for comparing protein structures. Distance matrices have been used to compare two protein structures since the 1970s (Phillips, 1970; Nishikawa and Ooi, 1974; Liebman, 1980; Sippl, 1982). DALI stands for Distance matrix ALIgnment. Liisa Holm is the creator of DALI. DALI is completely automated, and it is too large and complex to be installed at external sites. The Dali Server is a standalone version of the search engine and is an automatic network service for the comparison of protein structures in 3D. DaliLite is a program in DALI for pairwise structure comparison and for structure database searching.
There are two ways to compose a request in DALI. The first is a database search and the second is a pairwise protein comparison using a program called DaliLite.

There are several other popular methods that perform structure comparison like DALI. Among them are the Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998), COMPARER (Sali and Blundell, 1990), SARF2 (Alexandrov, Takahashi, and Go, 1992), SSAP (Taylor and Orengo, 1989), and VAST (Vector Alignment Search Tool; Gibrat et al., 1996). Each uses a different algorithm. For example, CE uses a distance approach at the level of octameric fragments; that is, C-alpha distance matrices are compared for every combination of eight-residue fragments in each protein chain. COMPARER compares the properties of residues and of segments of residues, together with the relations between residues and the relations between segments.

GENERAL METHODOLOGY USED IN STRUCTURE COMPARISON AND ALIGNMENT

The methodology common to the various methods of protein structure comparison and alignment usually involves three to four steps:

(1) Represent the two proteins A and B (which are essentially polypeptide chains, domains or other amino acid fragments) in some coordinate-independent space so that they can be readily compared.
(2) Compare the two proteins A and B.
(3) Optimize the alignment between A and B.
(4) Measure the statistical significance of the alignment against a random set of structure comparisons.

Steps 3 and 4 can be combined into one. This methodology is commonly used for pairwise structure comparison and alignment. The general problem is NP-hard, so all the methods solve it heuristically [5]. To find the optimal alignment between the two proteins A and B, find the largest number of equivalent atoms with the lowest RMSD (Root Mean Square Deviation).
There is also a need to balance local regions with very good alignments against the overall global alignment. The methodology used in multiple structure alignment is different and is discussed in detail later in the paper. The MultiProt algorithm is used in this paper specifically to illustrate multiple protein structure comparison.

PROTEIN STRUCTURE COMPARISON ALIGNMENT IN DALI

DALI (Distance Matrix Alignment; Holm and Sander 1993a) uses distance matrices to represent each protein structure. In this method, each structure is represented as a two-dimensional array of distances between all the C-alpha atoms. Figure 19 illustrates how a distance matrix is obtained for a protein structure. The objective is to perform a one-to-one comparison between the residues and remove any non-matching residues between the two proteins being compared. The input structures are broken into hexapeptide fragments. The Dali method then calculates the distance matrix after evaluating the contact patterns between successive fragments. Similar 3D structures have similar inter-residue distances, and in the most favorable cases the 3D superimposition of a pair of secondary structure elements leads to superimposition of the entire structures. Therefore, similarities in secondary structure, i.e., in the backbones of the proteins, are reflected along the matrix’s main diagonal. Off-diagonal similarities in the matrix reflect tertiary structure similarity. [4] When the off-diagonal features are parallel to the main diagonal, the corresponding features of the proteins are parallel. When they are perpendicular to the main diagonal, the features are anti-parallel. This representation is memory intensive, and because the square matrix is symmetrical about the diagonal, half of the stored values are redundant. [6] The distance matrices are collapsed into regions of overlap, i.e., sub-matrices of fixed size. These sub-matrices are subsequently stitched together if there is an overlap between adjacent fragments.
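As a minimal sketch of the distance-matrix representation described above, the following Python builds the C-alpha distance matrix for a toy chain and extracts one hexapeptide-hexapeptide contact pattern. The function names (`distance_matrix`, `contact_pattern`) and the toy coordinates are assumptions made for illustration; this is not code from Dali.

```python
import math


def distance_matrix(ca_coords):
    """Return the symmetric matrix of pairwise C-alpha distances (Angstroms)."""
    n = len(ca_coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(ca_coords[i], ca_coords[j])
            # The matrix is symmetric, so each distance is stored twice.
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


def contact_pattern(matrix, i, j, size=6):
    """Return the size x size sub-matrix of distances between the fragment
    starting at residue i and the fragment starting at residue j."""
    return [row[j:j + size] for row in matrix[i:i + size]]


# Hypothetical toy chain: 12 C-alpha atoms on a straight line, 3.8 A apart.
coords = [(3.8 * k, 0.0, 0.0) for k in range(12)]
dm = distance_matrix(coords)

# Contact pattern between hexapeptide 0..5 and hexapeptide 6..11.
pattern = contact_pattern(dm, 0, 6)
```

Because the matrix depends only on distances within one chain, it is unchanged by rotating or translating the structure, which is what makes it a coordinate-independent representation.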
As a result, a common structural motif made up of several disjoint regions of backbone becomes visible once residues that have no structural equivalent in the other structure are removed.

COMPARISON ALGORITHM AND OPTIMIZATION

Dali uses a branch-and-bound algorithm as its comparison algorithm, followed by Monte Carlo optimization to obtain the final alignment. The branch-and-bound search (Lathrop and Smith, 1996) performs a systematic pairwise comparison between all the elementary contact patterns in the two distance matrices. The hexapeptide-hexapeptide contact patterns (i_A … i_A+5, j_A … j_A+5) of protein A are paired with (i_B … i_B+5, j_B … j_B+5) of protein B. Here the hexapeptide i_A … i_A+5 is matched with i_B … i_B+5 and the hexapeptide j_A … j_A+5 is matched with j_B … j_B+5. Similar contact patterns are stored in a non-exclusive list of pairs (the pair list), which is the raw material for structural alignment. [2] The branch-and-bound search then iteratively decomposes the search space into smaller subsets, and an upper bound of the evaluation function is computed for each subset. Each subsequent iteration repeats the process on the subset with the highest upper bound. The branch-and-bound algorithm is guaranteed to obtain an optimal alignment for the locally similar region, but it is slow, requiring an exponential number of steps in the worst case.

In Monte Carlo alignment, the pairs of contact patterns are assembled into a larger set of aligned pairs. Intervening rows and columns that fall outside the equivalence are removed. This helps to maximize the similarity score, which is built from the score for each pair of aligned residues (Equation 1).

$$\mathrm{score}_{i,j} = \left(\theta - \frac{\left|\mathrm{distance}^{A}_{i,j} - \mathrm{distance}^{B}_{i,j}\right|}{\overline{\mathrm{distance}}_{i,j}}\right)\,\omega\!\left(\overline{\mathrm{distance}}_{i,j}\right), \qquad \theta = 0.2 \qquad \text{(Equation 1)}$$

Here i and j represent a pair of aligned residues in proteins A and B, and $\overline{\mathrm{distance}}_{i,j}$ is the arithmetic mean of $\mathrm{distance}^{A}_{i,j}$ and $\mathrm{distance}^{B}_{i,j}$. $\omega$ is the envelope function, $\omega(d) = e^{-d^{2}/r^{2}}$, where d is that mean distance and r = 20 Å. The envelope function gives lower weights to pairs of residues that are far apart and so reduces their relative contribution to the overall score. The score is summed over all pairs of aligned residues in the two protein structures, so a higher similarity score corresponds to a better fit or alignment between the structures. Using both the branch-and-bound and the Monte Carlo algorithms has increased the sensitivity and specificity of pairwise protein comparison by Dali.

THE BRANCH AND BOUND ALGORITHM

Only non-gapped segment pairs are used, to reduce the complexity of the structural alignment problem. A natural segmentation uses the secondary structure elements of the query structures. The details of the algorithm used to maximize the alignment between two proteins are illustrated in figure 5(a).

Figure 5(a). The algorithm is followed from top to bottom and is represented in three different schematics: the 3D chain on the left, a two-dimensional distance matrix in the middle and the sequence alignment on the right. The structurally equivalent fragments in the two proteins, protein 1 and protein 2, are labeled a, b, c and a', b', c' respectively. In the two-dimensional distance matrices, equivalent fragments have boxes with the same patterns. The boxes along the diagonal of a matrix represent intra-fragment distances and the off-diagonal boxes represent inter-fragment distances. The similarity score is calculated from the pairwise distances between all the equivalent elements in the distance matrices. First Row: Start the alignment by finding equivalent hexapeptide-hexapeptide fragment pairs ab and a'b'. Second Row: Next, take only the fragments b and b' and search for additional matching contact patterns that extend the alignment. This process is called “branching”.
The fragment c of protein 1 and c' of protein 2 overlap with b and b'. Third Row: The overlapping pair bc and b'c' is merged with the previously aligned pair ab and a'b'. Fourth Row: The two aligned pairs are merged, or collapsed, into a new alignment abc and a'b'c'. The collapse is performed after the removal of insertions or deletions, followed by reordering of the segments b' and c' in the second protein. This process is called “bounding”.

Figure 5(b). Another view of the branch-and-bound algorithm. Image source: [3]

The branch-and-bound algorithm is explained in further detail below with reference to figure 5(b). Proteins A and B are represented as distance matrices (as labeled in figure 5(b)), and each point in a matrix is a residue-residue distance. Each square shown within a matrix is the set of contacts made by two segments, and the secondary structure segments (i.e., the natural segmentation mentioned above) of matrix A are labeled in the figure. The solution space consists of all possible placements of residues in protein B relative to a segment of residues in A. The idea is to divide the solution space into smaller and smaller subspaces, always following the highest upper bound, until a single alignment trace is left. To explain the algorithm schematically, consider the solution space to be a circle, as shown in the upper left and right of figure 5(b). Every point in this circle is a possible solution. A line drawn within the circle divides the solution space into two subspaces (i.e., two semi-circles). Each subspace is given an upper bound: 9 for the left semi-circle and 12 for the right semi-circle. We choose the right one since it has the higher upper bound. The right semi-circle is further divided into two subspaces (i.e., quarters), with an upper bound of 10 for the top quarter and an upper bound of 16 for the bottom quarter.
We choose the bottom quarter and split it again. The upper bounds for the two resulting areas are 14 for the right and 12 for the left. We choose the right one to proceed with the algorithm. The process of splitting the subspace continues until it shrinks to a single alignment trace. The single solution is indicated with an arrow in figure 5(b). I might be wrong, but I believe the final single solution is supposed to be in the area marked with a star in figure 5(b).

The upper bound for each segment-segment submatrix of the matrix for protein A is obtained by calculating the similarity scores between that submatrix and each of the accessible submatrices in B. In other words, the similarity score is calculated for a predefined set of residues in protein A (a single highlighted square in matrix A in figure 5(b)) against a set of squares in B (also highlighted in the same figure). The predefined set consists of residues in secondary structure elements (i.e., the natural segmentation). The upper bound for the total similarity score of one set of solutions, which is the sum over all the segment-segment submatrices in A, is calculated by summing the separate upper bounds of all the segment-segment pairs of matrix A [3].

In summary, the branch-and-bound algorithm recursively splits the search space into smaller subsets of paired fragments. The maximal similarity score is computed for each of the subsets, and these scores are summed to obtain the highest bound on the sum-of-pairs similarity score, which corresponds to the optimal alignment.

MONTE CARLO ALGORITHM

In Dali, Monte Carlo optimization achieves iterative improvement by a random-walk exploration of the search space, with occasional excursions into non-optimal regions of the structural alignment (Holm and Sander 1993a). At first, a random basic move is made.
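The decision of whether to keep such a random move can be sketched as follows. This is a minimal illustration that assumes the usual Metropolis convention: a move from similarity score S to S' is accepted with probability min(1, e^(beta * (S' - S))), so improving moves are always kept and worsening moves are kept only occasionally. The function name `accept_move` and its `rng` parameter are hypothetical and are not taken from Dali.

```python
import math
import random


def accept_move(old_score, new_score, beta, rng=random):
    """Metropolis-style acceptance test for one Monte Carlo move.

    Accepts with probability min(1, exp(beta * (new_score - old_score))).
    `rng` is any object with a random() method returning a float in [0, 1).
    """
    delta = new_score - old_score
    if delta >= 0:
        # exp(beta * delta) >= 1 for beta >= 0, so the move is always kept.
        return True
    # A worsening move is kept only occasionally; this is the
    # "excursion" into non-optimal parts of the search space.
    return rng.random() < math.exp(beta * delta)
```

A larger beta makes worsening moves rarer, so the walk behaves more greedily; a smaller beta allows more exploration.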
The probability of accepting this move is e^(beta * (S’ − S)), where S’ is the new similarity score, S is the old similarity score and beta is a parameter. A move always involves the addition or deletion of a residue equivalence assignment. The Monte Carlo algorithm uses two basic modes of operation: the expansion mode and the trimming mode. In expansion mode, the alignment is extended whenever a matching contact pattern is found for a residue pair, which allows for further possible extensions of the alignment. Since the alignment is a one-to-one mapping between the two proteins, say A and B, the addition of a new fragment almost always involves the removal of a previous, inconsistent equivalence assignment. As a result there is a corresponding increment and decrement of the total similarity score. In trimming mode, any fragments that make a negative contribution to the total similarity score are removed from the alignment. Trimming is done after the first expansion cycle and after every fifth subsequent expansion cycle.

DALI INPUT INTERFACE

The Dali Email Server for Database Searches

In a database search, the user submits the coordinates of the query protein structure and Dali compares it against the structures in the PDB. A multiple alignment of the structural neighbors is sent to the user by email. The user can submit the protein structure by email. Alternatively, a *.pdb file containing the coordinates of the protein structure can be uploaded interactively through the webpage http://www.ebi.ac.uk/dali/Interactive.html. The user needs to type an email address in the textbox provided. Use plain text, not encoded messages (MIME or BinHex), in the textbox. Commercial users need to type in the password. Press the “Submit Query” button to submit the request. The request is sent through the email address dali@ebi.ac.uk.
The result is emailed back to the address provided by the user within a few minutes to a few days of submission, depending on the protein structure submitted. In case of longer delays, you can send a notification to dali-help@ebi.ac.uk. The set is constructed such that the sequence identity between any two chains in the set is less than 25%. The interface for the database search request is shown in figure 7.

The DaliLite Server for comparing two protein structures

The server for pairwise protein comparison can be accessed at http://www.ebi.ac.uk/DaliLite. Two inputs are required to submit a request. The user can either enter the PDB codes of two known protein structures or upload the structural coordinates in pdb format. If a protein structure has several domains or chains, the chain can be entered in the space provided. To submit the request, click the “Run DaliLite” button.

Dali Database

The Dali database is based on an exhaustive all-against-all three-dimensional comparison of the protein structures currently available in the PDB (Protein Data Bank). The input interface for the Dali Database is at http://www.bioinfo.biocenter.helsinki.fi/dali/start. The classification and alignments of protein structures are automatically maintained and regularly updated from the PDB by the Dali search engine. The input interface of the Dali Database is shown in figure 13. You can enter a PDB ID, a protein name or a keyword in the textbox. As you scroll down the page, you will see links for downloading the sequence files, the MySQL dump files and the DALI standalone application. The dump files are computer-readable database dumps for large-scale studies. To download the DALI standalone application, read and sign the license agreement, keep a copy for yourself and return the signed agreement by paper mail to Liisa Holm’s address, which is provided. Once that is done you can proceed to download the program.
The download is available for academic use only; commercial use is prohibited. Moreover, this application does not create or update the database. Once you have downloaded the application, unzip and untar the distribution files. The INSTALL file explains how to install and run the application. For usage instructions, run DaliLite with the option -help.

DALI OUTPUT INTERFACE

The DaliLite Server for comparing two protein structures

The output interface shown after submitting the PDB ID codes of two protein structures appears in figure 9(a). Note that Dali names the first structure, the query structure, mol1 and the second structure, the subject structure, mol2. If chain A of the query structure is submitted, Dali names it mol1A, and so on. If no chain is specified for the structures, DALI produces comparisons for all the chains of the subject and the query structures. Click the “Submit Another Job” button to submit a new pair of protein structures. The link under “Structural Alignment” directs the user to a page containing the two-dimensional sequence alignment between the two protein chains. It also includes the DSSP secondary structure information (see figure 10). The column “Superimposed C-alpha Traces” gives details of the matching superimposed C-alpha atoms of both structures. The column “PDB Files” gives links to two files: the original pdb file (i.e., for the protein with PDB ID 1CDK in figure 9(a)) and a second pdb file containing the protein with PDB ID 1CJA (as given in figure 9) after it has been rotated and translated to match the first protein, 1CDK. To view the full superposition of the original, unchanged protein and the second, transformed protein, load both pdb files in a structure viewer such as PyMOL or RasMol (see figure 9(b), (c) and (d)).
The output also reports the number of residues used for the alignment between the two structures, the Z-score, the RMSD (Root Mean Square Deviation) and the percentage of sequence identity. How Z-scores and RMSD are calculated is described in the next section. Scrolling down the same screen as in figure 9 reveals the “Additional data” section and the “Inputs” section (see figure 11(a)). The links under “Additional data” provide the following files: the rotation-translation matrix used to rotate and translate the second protein structure onto the first; a list of structurally equivalent residue ranges (see figure 12); and the log file, which details all the steps taken by the DaliLite application. The “Inputs” section displays the pdb files for the two protein structures that the user entered at the DaliLite input interface (figure 11(a) and 11(b)). This lets the user check that the DaliLite server used the intended pdb files for the two proteins.

Dali Database

Figure 14 shows the result for the query “hemoglobin”. Note that the PDB entry for the hemoglobin with PDB ID 1a00 has two chains, B and D. The representative for these two PDB entries is given as 1dxtB (3rd column) and the domain fold class is given as 806 (5th column). Clicking on the fold index (i.e., 806) gives details of all the proteins belonging to the same fold class (see figure 15). The browse link shows details of the structural neighbors of each domain. It also shows the 1D structural alignment between hemoglobin and each of the top 50 neighbors (figure 16 and 17). The “interact” link directs you to another page that lets you see multiple structure and sequence alignments in 1D (see figure 18). Clicking on “Structure Alignment” gives details of the multiple structure alignment in a form similar to a sequence alignment.
Click the “Structure/Sequence Alignment” button to see the structural alignment expanded with related sequences. These are computed by PSI-Blast and stored in the Adda Database (Heger and Holm 2003). This view is especially helpful in tracking down the parts of the protein sequence that have been conserved across protein families over time. Conserved functional sites are a strong indicator of common evolutionary origin.

STATISTICAL ANALYSIS OF RESULT FROM DALI

The similarity score for the structural comparison is derived from an all-against-all comparison of structures in the PDB. The DALI score is expressed as the number of standard deviations by which the raw score differs from the mean of the database background distribution. These scores are also known as Z-scores. The Z-score is an important estimate of the quality of the structural alignment. A score greater than 20 implies that the two structures are definitely homologous. A score in the range of 8 to 20 means that the two structures are probably homologous. A score in the range of 2 to 8 indicates a grey area, and a score below 2 means that the two proteins are structurally dissimilar. Dali does not return any result for proteins with a score below 2. [7]

RMSD (Root Mean Square Deviation) gives the average distance between the backbones of the superimposed proteins. RMSD is a scoring system used to find an optimal alignment between two protein structures. The unit of measurement for RMSD is the Angstrom (Å). Identical structures have an RMSD value of 0; when a pair of proteins has similar structures, the value falls in the range of 1 to 3 Å. Protein structures with little similarity have an RMSD value greater than 3 Å. A scoring scheme based on RMSD is very good at finding alignments where the similarity is global. However, it is not sufficient for finding alignments where the similarities are local [8].
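The two statistics can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, `rmsd` assumes that the two coordinate lists are already superimposed and listed in equivalent order, `z_score` uses the standard definition of a standardized score, and the exact handling of the boundary values 2, 8 and 20 in `interpret_z` is an assumption, since the text gives only the ranges.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between equivalent atoms (Angstroms).

    Assumes both structures are already superimposed and that the i-th
    atom of coords_a is equivalent to the i-th atom of coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def z_score(raw_score, mean, std_dev):
    """Standardize a raw score against a background distribution."""
    return (raw_score - mean) / std_dev


def interpret_z(z):
    """Map a Z-score to the ranges quoted in the text (boundaries assumed)."""
    if z > 20:
        return "definitely homologous"
    if z >= 8:
        return "probably homologous"
    if z >= 2:
        return "grey area"
    return "structurally dissimilar"
```

For example, `interpret_z(z_score(raw, mean, std_dev))` turns a raw similarity score into one of the four verdicts once the background mean and standard deviation are known.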
The pitfalls of using RMSD as a scoring criterion are: (1) all atoms are treated equally (for example, residues on the surface of protein structures such as the hydrophilic molecules have higher degrees of freedom than those in the core (the hydrophobic molecules). (2) protein structures with best alignment does not always produce a minimal RMSD value. (3) the significance of RMSD always depends on the size of the atom. The formula used for calculating RMSD (Equation (3)) and Z-scores (Equation (2)) is given below: Z x x …. Equation2 is the raw score to be standardized; is the standard deviation and 1 i N 2 RMSD i N i 1 …. is the distance between N pairs of equivalent C atoms. Figure 6. An example on how to calculate RMSD is the mean. Equation3 Figure 7. Input interface for the Database search request in DALI server Figure 8. The input interface for the pair wise comparison of protein structures. 1CPC is the PDB code for Cphycocyanin of cyanobacterium. It has 4 chains – A, B, K and L. 1KTP is the PDB code for C-phycocyanin of synechococcus vulcanus. It has two chains - A and B. Alternatively one can upload their pdb files. Figure 9(a). The figure shows the output interface for the pair wise comparison of protein structure. This page gives details on the protein structure used and the statistical data from the comparison of the two structures. The table gives value for Z-score, RMSD (Root Mean Square Distance), Sequence Identity expressed as a percentage. Figure 9(b). The superimposed C-alpha traces of the pdb file (CA_1.pdb – see figure 9(a) above) provided by DaliLite as viewed in Jmol. The percentage of sequence identity is just 11%. Figure 9(c) The figure shows original 1CDK:A protein that was used in the request as viewed in Jmol. DaliLite give a link for the pdb file (“mol1_original.pdb”) for the original molecule. (See figure 9(a)). Figure 9(d). 
The figure shows the rotated and translated 1CJK:A protein that was used by DaliLite to align with 1CDK:A (shown in Fig. 9(c)) for structural comparison, as viewed in Jmol. The coordinates for this rotated and translated protein are from the PDB file provided by DaliLite (“mol2_1.pdb”) (see Figure 9(a)).

Figure 10. Details of the structural alignment using the amino acid sequence.

Figure 11(a). The additional data section and the Inputs section.

Figure 11 (b)

Figure 12. Details of the residue ranges shown after clicking on the corresponding link in Fig. 11(a). The numbers below the arrows at the bottom of the figure indicate the columns that the comment area at the top of the figure refers to.

Figure 13. The Dali Database input interface. You can enter a PDB identifier, a protein name or a keyword to get results. Further down this screen there are links for downloading sequence files, MySQL dump files and the DALI standalone application.

Figure 14. The output for the query “hemoglobin” in the Dali Database input interface.

Figure 15. Fold Query Result.

Figure 16. The structural neighbor list for hemoglobin, shown when the browse link in Figure 14 is clicked.

Figure 17. The details of the 1D structural alignment between hemoglobin and deoxyhemoglobin from the Dali Database. Note that deoxyhemoglobin is the second structural neighbor of hemoglobin, as shown in Figure 16 above. Statistical data on the structural alignment are also provided.

Figure 18. Clicking any of the 4 buttons after selecting the given structures gives information on the multiple structure and sequence alignment among the proteins and the PDB codes of their superimposed structures.

Figure 19. An illustration of how to calculate a distance matrix for protein A.

2. MULTIPROT

The MultiProt method performs simultaneous multiple structural alignment of protein structures.
Although the algorithm is robust and efficient, it treats proteins as rigid structures and performs rigid structural comparison, whereas proteins are in reality flexible molecules [10]. Multiple structural alignment is naturally more powerful than pairwise structural alignment of protein molecules, since it involves much more information. Moreover, there are only a few freely available tools or web services for three-dimensional protein structural analysis. Most of these tools and web services use all of the input molecules in the structural alignment, whereas MultiProt does not require every input molecule to take part in the match. The main concept used in the MultiProt method is to detect a common geometrical core among the protein structures [9]. The method is able to distinguish between structurally similar and dissimilar molecules and to exclude the dissimilar ones from the alignment. The method can efficiently perform structural alignment on tens and possibly hundreds of protein structures simultaneously. Performing the alignment simultaneously eliminates the issue of bias in the superposition and finds the structures with similar fragments of maximal length, disregarding the order of residues in the chains. Depending on the size of the protein structures used, a run takes from a few seconds to a few minutes. The MultiProt server can be reached at http://bioinfo3d.cs.tau.ac.il/MultiProt/ . The application is available for download from the same address but works only on the Linux operating system.

THE MULTIPROT ALGORITHM

The MultiProt algorithm performs three different structural alignments in three stages.
The first stage consists of multiple structural alignment of contiguous fragments, the second stage finds the best multiple alignments based on global structural similarity, and the third stage consists of bio-core detection, where another scoring scheme is enforced. That scoring scheme requires that the aligned points be of the same biological type; in essence, a bio-core classification is done. The goal of the method is to detect structurally similar fragments of maximal length.

Stage 1: Multiple Structural Fragment Alignment: Consider two protein molecules $M_p$ and $M_q$. A fragment of molecule $M_p$ starts at position $i$ and has length $l$; similarly, a fragment of molecule $M_q$ starts at position $j$ and has the same length $l$. The fragment of $M_p$ is denoted $F_i^p(l)$ and the fragment of $M_q$ is denoted $F_j^q(l)$. The two fragments are $\varepsilon$-congruent if there exists a rigid 3-D transformation $T$ that superimposes them with an RMSD of at most $\varepsilon$, where $\varepsilon$ is a predefined threshold. The default value of $\varepsilon$ in the MultiProt method is 3 Å. Here the $(i+t)$-th residue of protein $M_p$ is matched against the $(j+t)$-th residue of protein $M_q$, for $t = 0, \dots, l-1$. The congruent pair of $M_p$ and $M_q$ is denoted $F_i^p F_j^q(l)$ and satisfies $\mathrm{RMSD}_{opt}(F_i^p F_j^q(l)) \le \varepsilon$, where

$$\mathrm{RMSD}_{opt}(F_i^p F_j^q(l)) = \min_T \mathrm{RMSD}\big(F_i^p(l),\, T(F_j^q(l))\big) \qquad \text{(Equation 4)}$$

and $T$ is a rigid 3D transformation [9, 10]. All the $\varepsilon$-congruent fragment pairs $F_i^p F_j^q(l)$ can be obtained in polynomial time. To have an $\varepsilon$-congruent multiple alignment, the algorithm requires a pivot molecule to participate in the alignment. All the matched points must be within distance $\varepsilon$ of the corresponding points of the pivot molecule. So that the algorithm does not depend on the choice of the pivot molecule, the method iteratively chooses every molecule as the pivot molecule.
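The congruence test of Stage 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the Kabsch superposition (via NumPy's singular value decomposition) as one standard way of computing the minimum over rigid transformations in Equation 4, and the source does not say which procedure MultiProt uses internally. The function names `fragment`, `optimal_rmsd` and `is_eps_congruent` are hypothetical, and fragment start positions are 0-based here.

```python
import numpy as np


def fragment(coords, start, length):
    """Return the contiguous fragment of `length` points beginning at `start` (0-based)."""
    coords = np.asarray(coords, dtype=float)
    if start < 0 or start + length > len(coords):
        raise ValueError("fragment lies outside the molecule")
    return coords[start:start + length]


def optimal_rmsd(frag_p, frag_q):
    """Minimum RMSD between two equal-length fragments over all rigid transformations."""
    P = np.asarray(frag_p, dtype=float)
    Q = np.asarray(frag_q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("fragments must both have shape (l, 3)")
    # Remove the translational part by centering both fragments.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Kabsch: the best rotation comes from the SVD of the 3x3 covariance matrix.
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    # Correct for a possible reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc - Qc @ R.T  # rotate Q onto P and compare point by point
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def is_eps_congruent(frag_p, frag_q, eps=3.0):
    """True if the fragments can be superimposed with RMSD of at most eps (in Angstroms)."""
    return optimal_rmsd(frag_p, frag_q) <= eps
```

Because the points are paired by position, the sketch compares the $t$-th point of one fragment with the $t$-th point of the other, matching the sequential correspondence used in Stage 1.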
The pseudo code [9] below illustrates how the rest of the molecules are aligned with respect to the pivot molecule:

    Input: m molecules S = {M_1, ..., M_m}
    for i = 1 to m - 1
        M_pivot = M_i
        S' = S \ {M_pivot}
        Alignments = MultipleFragmentAlignment(M_pivot, S')
        GlobalMultipleAlignment(Alignments)
    end

Essentially, when performing multiple fragment alignment, a set of multiple transformations $T_{i_1}, \dots, T_{i_r}$ aligns the molecules $M_{i_1}, \dots, M_{i_r}$ with $M_{pivot}$. At this point, the multiple structural alignment only computes the 3D transformations and aligns fragments as short as 3 amino acids. It does not yet determine which points (amino acids) are matched in 3D space. It is possible to get more than one alignment when several fragments from the same molecule align with the pivot molecule, because the algorithm tries to get all possible solutions through the alignment. To pare down redundant alignments, the method performs a cut so that only one fragment for each molecule is selected for the multiple alignment. The number of possible alignments for a given cut grows exponentially with the number of molecules [9, 11]. The time complexity of computing the RMSD of a congruent pair $F_i^p F_j^q(l)$ is $O(l)$, and the greedy iterative approach toward alignment takes $O(|M_p| \cdot |M_q|)$ [11]. We then proceed to global structural alignment in the second stage, which detects larger structural cores among the aligned structures.

Stage 2: Global Multiple Alignment: The goal at this stage is to find the best multiple alignments based on global structural similarity, as defined below:

(*) Given $m$ molecules, a parameter $K$ and a threshold value $\varepsilon$, for each $r$ such that $2 \le r \le m$, find the $K$ largest $\varepsilon$-congruent multiple alignments containing exactly $r$ molecules [9].

This is a hard problem because the MTSA (MulTiple Structural Alignment) problem is NP-hard even for exact congruence, that is, with $\varepsilon = 0$, so a heuristic solution is applied.
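The pivot iteration in the pseudo code earlier in this section can be written as a short Python outline. This is a structural sketch only: `fragment_alignment` and `global_alignment` are hypothetical callables standing in for the MultipleFragmentAlignment and GlobalMultipleAlignment steps, whose internals are not reproduced here, and the loop bounds follow the pseudo code (the last molecule is not used as a pivot).

```python
from typing import Callable, List, Sequence, TypeVar

Molecule = TypeVar("Molecule")
Alignments = TypeVar("Alignments")
Result = TypeVar("Result")


def iterate_pivots(
    molecules: Sequence[Molecule],
    fragment_alignment: Callable[[Molecule, List[Molecule]], Alignments],
    global_alignment: Callable[[Alignments], Result],
) -> List[Result]:
    """Run the two alignment steps once per pivot, as in the pseudo code."""
    results: List[Result] = []
    # for i = 1 to m - 1 (0-based here, so the last molecule is never the pivot)
    for i in range(len(molecules) - 1):
        pivot = molecules[i]
        # S' = S \ {M_pivot}: every molecule except the current pivot
        others = [mol for k, mol in enumerate(molecules) if k != i]
        alignments = fragment_alignment(pivot, others)
        results.append(global_alignment(alignments))
    return results
```

Passing the two steps in as functions keeps the outline independent of any particular implementation of the fragment or global alignment stages.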
From every cut, at most one fragment for each molecule is selected, such that the resulting multiple structural alignment has the highest possible score and also preserves the $\varepsilon$-congruence. After the multiple correspondences between the pivot molecule and the other molecules are established, a rigid transformation is applied. This helps to minimize the RMSD between the matching points. Generally, multiple iterations (the default is 3) are performed to get the best global alignment.

Stage 3: Bio-Core Detection: In bio-core detection, another scoring scheme is applied, which specifies that the matching points must be of the same biological class. The residues are classified as hydrophobic, charged/polar, aromatic, or glycine. The highest scoring solution is achieved at this stage, and it is complementary to stages 1 and 2.

Complexity of the Algorithm: If $m$ is the number of input molecules and $n$ is the size of the longest fragment, the complexity of Stage 1 is $O(m n^2)$. The complexity of Stages 2 and 3 is $O(m^2 n^3)$, which dominates the $O(m n^2)$ of Stage 1. Theoretically, the overall time complexity of the algorithm is bounded by $O(m^2 n^3)$ [9].

INPUT FORMAT OF MULTIPROT

The input format for MultiProt is shown in Figure 20. The interface is simple. There is a textbox for entering multiple PDB entries, separated by spaces. Alternatively, you can upload a zip file containing the PDB structures. The default value for the RMSD threshold is 3 Å, which the user can change. Once all the required fields are entered, click the “Submit Query” button to submit the request.

OUTPUT FORMAT OF MULTIPROT

The output format for the alignment request is shown in Figure 21. It gives, in tabular form, a summary of each protein structure submitted and the number of Cα atoms present in each of them.
It also gives details of the various combinations of structural alignments, the alignment sizes (i.e., the number of atoms involved in each alignment), the RMSD values and a link to the PDB file for each alignment. Clicking on the PDB link allows the user to download the file and later use it to view the superimposed image of the protein structures involved in the structural alignment with viewers such as PyMOL or Jmol. Clicking on the number link under the Alignment size column gives the user a detailed view of the residues involved in the alignment for each protein structure submitted.

Figure 20. The input interface for MultiProt.

Figure 21. The output interface for MultiProt.

Figure 22. Details of the residues aligned for each protein structure the user has submitted.

Figure 23. The superimposed image of the four protein structures shown in Figures 21 and 22.

Figure 23 shows the superimposed image of all four protein structures shown in Figures 21 and 22. All the proteins belong to the superfamily “Trimeric LpxA-like enzymes”. Each of the proteins is taken from a different family. All four proteins have a helical structure as a common core. For the multiple structural alignment in our example, the alignment size was 70 with an RMSD of 1.03 Å.

MULTIPROT AND DALI

In order to compare the performance of DALI and MultiProt, a pairwise comparison is given in tabular form in Table 1. The original set of “hard to detect” pairwise alignments was obtained from Shindyalov and Bourne [8]. I compared my data with the data in Table 2 [11]. From the comparison, we can see that DALI was unable to perform a pairwise comparison between the proteins with PDB IDs 2AZA:A and 1PAZ in both cases, and the results for the rest of the proteins were nearly the same.
There is a slight improvement in performance when I compare the MultiProt1 and MultiProt2 results in Table 1 with those in Table 2. The improvement is that an equal or slightly smaller number of residues is aligned with the same or a slightly lower RMSD value. I believe this is due to improvements in the algorithm used in MultiProt. In short, MultiProt is a better structural alignment tool than DALI.

Table 1. Pairwise Structural Alignment Test between DALI and MultiProt

Molecule 1       Molecule 2       DALI             MultiProt1       MultiProt2
PDB ID   Size    PDB ID   Size    Sal   RMSD (Å)   Sal   RMSD (Å)   Sal   RMSD (Å)
1fxi:A    96     1ubq      76      60   2.6         44   1.67        50   1.81
1ten      89     3hhr:B   195      86   1.9         81   1.35        82   1.35
3hla:B    99     2rhe     114      75   3.0         60   1.8         67   1.9
2aza:A   129     1paz     120       X   X           70   1.70        78   1.75
1cew:I   108     1mol:A    94      81   2.3         73   1.69        72   1.68
1cid     177     2rhe     114      97   3.2         81   1.64        81   1.64
1crl     534     1ede     310     211   3.5        153   1.92       195   2.04
2sim     381     1nsb:A   390       X   X          199   1.92       226   2.11
1bge:B   159     2gmf:A   121      94   3.3         73   1.8         83   2.0
1tie     166     4fgf     124     114   3.1         79   1.75        81   2.01

Table 1 above gives a comparison of pairwise structural alignment using DALI and MultiProt. MultiProt1 refers to the result obtained when the sequence order is preserved. MultiProt2 refers to the result obtained when the sequence order is not preserved. Sal refers to the number of residues that are aligned. An X means that DALI returned no result for the pair. Table 2 is from Shatsky, Nussinov and Wolfson [11].

Table 2: [11]

CONCLUSIONS

Here we have presented two powerful methods for comparing the structures of proteins. DALI is limited to pairwise structural comparison of proteins, whereas MultiProt deals with simultaneous multiple structural alignment of proteins. However, both methods treat proteins as rigid molecules. From the discussions and figures, it is obvious that MultiProt is more powerful and faster than DALI. Each of the protein comparison methods uses different algorithms for its comparison.
Whereas DALI uses branch-and-bound and Monte Carlo optimization algorithms, MultiProt works in three stages, each of which performs a different alignment. Compared to DALI, the comparison method in MultiProt is much more complex, and it is extremely efficient and fast. I conclude this paper with a figure (Figure 24) of 10 proteins of the transferases family aligned together using the multiple alignment method of MultiProt:

Figure 24(a). The proteins of the transferases family aligned together using the MultiProt method. The proteins involved in the alignment are: 10gs:A, 1axd:A, 1b48:A, 1c72:A, 1f2e:A, 1gnw:A, 1gwc:B, 1jlv:A, 1ljr:A, and 1pd2:1. The number of residues aligned is 82 and the RMSD is 1.57 Å. The tabular data at the lower left corner of the figure gives the number of C-alpha atoms present in each protein molecule.

REFERENCES

[1] Hobohm U, Scharf M, Schneider R, Sander C (1992): Selection of representative protein data sets. Protein Sci 1: 409-17.
[2] Holm L, Sander C (1993a): Protein structure comparison by alignment of distance matrices. J Mol Biol 233(1): 123-38.
[3] Holm L, Sander C (1996): Mapping the protein universe. Science 273: 595-603.
[4] Holm L, Sander C (1998): Protein folds and families: sequence and structure alignments. Nucleic Acids Research: 244-47.
[5] Bourne PE, Weissig H: Structural Bioinformatics. Wiley-Liss, Hoboken, New Jersey.
[6] http://www.wikipedia.org/
[7] Holm L et al.: Dali - Structural comparison in Proteins. Current Protocols in Bioinformatics (PDF obtained from the internet; publication details and date not available).
[8] Shindyalov IN, Bourne PE (1998): Protein structure alignment by incremental combinatorial extension of the optimum path. Protein Engineering 11: 739-747.
[9] Shatsky M, Nussinov R, Wolfson HJ (2002): MultiProt - a Multiple Protein Structural Alignment Algorithm. 2nd Workshop on Algorithms in Bioinformatics (WABI'02, as part of ALGO'02), Rome, Italy, Sept. 2002. Lecture Notes in Computer Science 2452: 235-250, Springer Verlag.
[10] Shatsky M, Dror O, Schneidman-Duhovny D, Nussinov R, Wolfson HJ (2004): BioInfo3D: a suite of tools for structural bioinformatics. Nucleic Acids Research 32: W503-W507.
[11] Shatsky M, Nussinov R, Wolfson HJ (2004): A Method for Simultaneous Alignment of Multiple Protein Structures. Proteins: Structure, Function, and Bioinformatics 56: 143-156.