GENERAL METHODOLOGY USED IN STRUCTURE COMPARISON

A Study on Pairwise Comparison of Protein Structure by
Dali and the Multiple Structural Alignment of Protein
Structures by Multiprot
Rajalekshmy Usha
CS 7301-501, Fall 2006
University of Texas at Dallas
INTRODUCTION
The structure of a protein can throw light on its function and its evolutionary history. To acquire
this information, we need to know the structure and its relationship with other proteins, which in
turn requires knowing the different folds that proteins adopt and detailed structural information
about many proteins. Nearly all proteins have structural similarities with other proteins and most
of them share a common evolutionary origin.
Individual structures provide explanations of specific biochemical functions and mechanisms,
whereas comparisons of structures give insight into the general principles governing these
molecules, the interactions they make and their biological roles. Three-dimensional structures
form the foundation of structural bioinformatics, and all structural analyses depend on them. To
facilitate the comparison of known protein structures, many algorithms and associated web
resources provide protein structure comparison at the level of the domain or the complete
polypeptide chain. DALI is one such method.
HOW IT ALL STARTED
The first comprehensive collection of macromolecular sequences appeared in the Atlas of Protein
Sequence and Structure, which was published from 1965 through 1978 under the editorship of Dr.
Margaret O. Dayhoff. Dr. Dayhoff and her research team pioneered the development of
computational methods for comparing protein sequences, detecting duplications within sequences,
detecting distantly related sequences, and deducing evolutionary relationships from alignments of
protein sequences. Since then, many sequence databases have been established, and relationships
among their entries are detected using dynamic programming algorithms that are popular in
computer science. These methods handle the insertions and deletions occurring between distant
evolutionary relatives very efficiently.
Structural data have always been sparser than sequence data. While the PDB (Protein Data
Bank) has 39,464 structural entries to date, NCBI (National Center for Biotechnology
Information) has over 12 million sequence entries. The main reason is the technical difficulty of
structure determination. Protein structures are determined experimentally by NMR spectroscopy
and X-ray crystallography.
NMR spectroscopy uses radio-frequency waves in a strong magnetic field. These waves can penetrate
highly purified biological samples. The process is time consuming because it requires
interactive analysis of the data by a trained scientist. X-ray crystallography is the other common
method used to determine the structure of macromolecules such as proteins and nucleic acids.
The three-dimensional structure is determined from the crystallographic data. The protein
samples must first be crystallized. Since most proteins are sensitive to high temperature and
high concentrations of organic solvents, it is difficult to obtain protein crystals of adequate
quality. X-rays are passed through the crystallized protein sample and the resulting diffraction
patterns are analyzed. A model of the molecule is built from the diffraction data. The model
goes through several rounds of refinement, and the final model is deposited in structural
databases such as the Protein Data Bank (PDB) and the Cambridge Structural Database. It is
interesting to note that about 90% of the protein structures in PDB are obtained from X-ray
crystallography and around 9% from NMR. [6]
Structural classifications began to emerge during the mid-1990s, although the first protein
crystal structures were solved in the late 1950s. The main classifications are the Structural
Classification of Proteins (SCOP) (Murzin et al., 1995; Lo Conte, 2000), DALI
(Holm and Sander, 1996), and CATH (Orengo et al., 1997; Pearl et al., 2001) databases and data
resources. Other classifications include DDBASE (Sowdhamini et al., 1998), 3Dee (Dengler,
Siddiqui, and Barton, 2001), and DaliDD (Holm and Sander; Dietmann and Holm, 2001). These
databases use different algorithms for comparing three-dimensional structures, and each of them
uses different criteria to measure the similarity between protein structures. [5]
WHY USE PROTEIN STRUCTURE FOR COMPARISON?
Traditionally, proteins with similar amino acid sequences are used to infer the structure and
function of a protein. This is because proteins with similar sequences are assumed to have
similar functions and structures and to be evolutionarily related. However, sequence similarity
searches can detect evolutionary relationships reliably only when the sequence identity is above
about 25%. Proteins below this level of identity enter a “twilight zone” of similarity. A
structural similarity search can extend the detection of evolutionary relationships beyond the
borders of the “twilight zone”. It has been shown that structures are much better conserved than
sequences over long periods of time (Chothia and Lesk, 1986). A possible reason is that a
structure can tolerate a wide range of mutations and that physical forces favor certain structures.
Protein structures are stored in the form of a PDB file, which consists of a list of the
three-dimensional coordinates of all the atoms in the protein. The PDB file has little or no
functional information. However, by studying the structure we can derive information relating to
biochemical function. This is especially helpful if the protein in question has unknown function.
This is illustrated in figure 1.
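As a minimal sketch of how such a coordinate list can be read, the Python function below collects the C-alpha coordinates from the ATOM records of PDB-format text. The column positions follow the fixed-width PDB format; the function name `parse_ca_coordinates` and its `chain_id` argument are illustrative and are not part of DALI or any other tool discussed here. Alternate locations and multi-model (NMR) files are not handled.

```python
def parse_ca_coordinates(pdb_text, chain_id=None):
    """Return a list of (x, y, z) tuples, one per C-alpha atom.

    Fixed-width PDB columns (1-based): atom name 13-16, chain 22,
    x 31-38, y 39-46, z 47-54.
    """
    coords = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue                      # skip HETATM, REMARK, etc.
        if line[12:16].strip() != "CA":
            continue                      # keep only C-alpha atoms
        if chain_id is not None and line[21] != chain_id:
            continue                      # optional chain filter
        x = float(line[30:38])
        y = float(line[38:46])
        z = float(line[46:54])
        coords.append((x, y, z))
    return coords
```

Calling `parse_ca_coordinates(open("1cdk.pdb").read(), chain_id="A")` on a downloaded file (the file name is only an example) would give one coordinate triple per residue of chain A, which is the input that distance-based comparison methods start from.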
Homologous proteins are expected to have similar structures and functions, but there are
exceptions. There is diversity of function within homologous superfamilies. For example, take the
case of lysozyme and α-lactalbumin (Acharya et al., 1991). These two homologous proteins have
high sequence identity but different functions (see figure 2(a) and (b)). In
contrast, the globin family of proteins has undergone multiple amino acid changes through the
course of evolution but their function remains the same (see figure 3(a) and (b)).
Analogous proteins, by contrast, may share similar structural features, such as the arrangement
of their catalytic residues, but have no sequence similarity. These analogues are examples of
convergent evolution toward the same function. Proteins that exhibit these features are trypsin
and subtilisin (Wallace, Laskowski, and Thornton, 1996) (see figure 4(a) and (b)).
Figure 1. The relationship between 3D structure and function.
Figure 2(a). Lysozyme and alpha-lactalbumin have 40% sequence identity and similar structures but
different functions. Lysozyme has O-glycosyl hydrolase activity, but lactalbumin does not have
this enzymatic activity. However, both lysozyme and alpha-lactalbumin have a sugar-binding site.
The catalytic residues of alpha-lactalbumin have changed over the course of evolution and it has
lost its enzymatic property.
Figure 2(b). Superimposed image of lysozyme (PDB ID: 1gd6:A) and lactalbumin (PDB ID: 1b9o:A).
Figure 3(a). The two hemoglobins share only 8% sequence identity, but their overall fold and
function are identical.
Figure 3(b). The superimposed image of the two hemoglobins (1VHB:A and 2LHB) from Figure 3(a).
Figure 4(a). Subtilisin and chymotrypsin are both endopeptidases. They do not share any sequence
identity and they have unrelated folds. However, each of them has an identical
Serine-Histidine-Aspartate (Ser-His-Asp) catalytic triad, marked on each structure in the figure,
which catalyses peptide bond hydrolysis. The two enzymes are a classic example of convergent
evolution.
Figure 4(b). Superimposed image of subtilisin (1GNV) and chymotrypsin (1GL0:E). Note that
there are few structural similarities.
By comparing 3D structures of proteins with computational tools like DALI and MultiProt,
biologists can identify new types of protein architecture, identify common structural cores and
discover evolutionary relationships between protein molecules. Such comparison helps to organize
the growing set of known protein shapes. It is also useful for comparing the structure of a
protein of unknown function with structures of proteins of known function. From these comparative
studies, we can derive functional information from the closest match within defined bounds.
However, caution must be exercised because, as the examples above illustrate, two proteins that
are structurally similar but have no sequence identity may have arisen by convergent
evolution. Similarly, two proteins with similar structures and similar functions may not be
evolutionarily related.
1. DALI
The 1993 paper of Holm and Sander (1993a) popularized protein structure comparison. The
paper describes the use of distance matrix alignment to compare protein structures. Distance
matrices have been used to compare protein structures since the 1970s
(Phillips, 1970; Nishikawa and Ooi, 1974; Liebman, 1980; Sippl, 1982). DALI stands for
Distance matrix ALIgnment. Liisa Holm is the creator of DALI. DALI is completely automated,
but it is too large and complex to be installed at external sites. The Dali server is a standalone
version of the search engine, offered as an automatic network service for the comparison of
protein structures in 3D. DaliLite is the DALI program for pairwise structure comparison and for
structure database searching. There are two ways to compose a request in DALI: a database
search, and a pairwise protein comparison using DaliLite.
Several other popular methods perform structure comparison like DALI. They include the
Combinatorial Extension (CE) algorithm (Shindyalov and Bourne, 1998), COMPARER (Sali
and Blundell, 1990), SARF2 (Alexandrov, Takahashi, and Go, 1992), SSAP (Taylor and
Orengo, 1989), and VAST (Vector Alignment Search Tool; Gibrat et al., 1996). Each of them
uses a different algorithm. For example, CE uses a distance approach at the level of octameric
fragments; that is, C-alpha distance matrices are compared for every combination of
eight-residue fragments in each protein chain. COMPARER compares the properties of residues
and of segments of residues, together with the relations between residues and the relations
between segments.
GENERAL METHODOLOGY USED IN STRUCTURE COMPARISON AND
ALIGNMENT
The common methodology followed in various methods of protein structure comparison and
alignment usually involves three to four steps. They are listed below:
(1) Represent the two proteins A and B (which are essentially polypeptide chains, domains or
other amino acid fragments) in some coordinate independent space so that they can be readily
compared.
(2) Compare the two proteins A and B.
(3) Optimize the alignment between A and B.
(4) Measure the statistical significance of the alignment against a random set of structure
comparisons.
Steps 3 and 4 can be combined into one. This methodology is commonly used for pairwise
structure comparison and alignment. The general problem is NP-hard, so all the methods solve it
heuristically [5]. The optimal alignment between the two proteins A and B is the one that
matches the highest number of atoms with the lowest RMSD (Root Mean Square Deviation). A
balance must also be struck between local regions with very good alignments and an overall
global alignment. The methodology used in multiple structure alignment is different and is
discussed in detail later in the paper. The MultiProt algorithm is used in this paper
specifically to illustrate multiple protein structure comparison.
PROTEIN STRUCTURE COMPARISON ALIGNMENT IN DALI
DALI (Distance Matrix Alignment; Holm and Sander 1993a) uses distance matrices to represent
each protein structure. In this method, each structure is represented as a two-dimensional array
of the distances between all pairs of C-alpha atoms. Figure 19 illustrates how a distance matrix
is obtained for a protein structure. The objective is to establish a one-to-one correspondence
between the residues of the two proteins and to remove any residues that have no match in the
other protein.
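The representation can be sketched in a few lines of Python. This is an illustration of the data structure only, not DALI's own code; the function name `distance_matrix` is hypothetical, and the input is assumed to be a list of C-alpha coordinates in Angstroms, one per residue.

```python
import math

def distance_matrix(ca_coords):
    """Return the N x N matrix of distances between all pairs of C-alpha atoms."""
    return [[math.dist(p, q) for q in ca_coords] for p in ca_coords]

# Three made-up C-alpha positions (Angstroms), chosen so the distances are round.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 12.0)]
matrix = distance_matrix(coords)
# matrix[0][1] is 5.0, matrix[0][2] is 12.0 and matrix[1][2] is 13.0
```

Because the matrix depends only on distances within one structure, it does not change when the protein is rotated or translated, which is why two structures can be compared without superimposing them first.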
The input structures are broken into hexapeptide fragments. The Dali method then compares the
distance matrices by evaluating the contact patterns between fragments. Similar 3D
structures have similar inter-residue distances, and in the most favorable cases the 3D
superimposition of a pair of secondary structure elements leads to superimposition of the
entire structures. Similarities in secondary structure, i.e., in the backbones of the
proteins, are therefore reflected along the matrix’s main diagonal. Off-diagonal similarities in
the matrix reflect tertiary structure similarity. [4] When the off-diagonal features are parallel
to the main diagonal, the corresponding features of the proteins are parallel. When they are
perpendicular to the main diagonal, the features are anti-parallel. This representation is memory
intensive: the matrix is square, so its size grows with the square of the chain length, even
though it is symmetrical about the diagonal and half of its entries are redundant. [6]
The distance matrices are collapsed into regions of overlap, i.e., sub-matrices of fixed size.
These sub-matrices are subsequently stitched together if adjacent fragments overlap. As a
result, a common structural motif made up of several disjoint regions of backbone becomes
visible once residues with no structural equivalent in the other structure are removed.
COMPARISON ALGORITHM AND OPTIMIZATION
Dali uses a branch-and-bound algorithm as its comparison algorithm, followed by Monte Carlo
optimization to produce the final alignment. The branch-and-bound search (Lathrop and
Smith, 1996) performs a systematic pairwise comparison of all the elementary contact patterns
in the two distance matrices. The hexapeptide-hexapeptide contact pattern
$(i_A \ldots i_A + 5,\; j_A \ldots j_A + 5)$ of protein A is paired with
$(i_B \ldots i_B + 5,\; j_B \ldots j_B + 5)$ of protein B. Here the hexapeptide
$i_A \ldots i_A + 5$ is matched with $i_B \ldots i_B + 5$ and the hexapeptide
$j_A \ldots j_A + 5$ is matched with $j_B \ldots j_B + 5$.
Similar contact patterns are stored in a non-exclusive list of pairs (the pair list), which is the
raw material for structural alignment. [2] The branch-and-bound search then iteratively
decomposes the search space into smaller subsets, and an upper bound of the evaluation function
is computed for each subset. Each subsequent iteration repeats the process on the subset with the
highest upper bound. The branch-and-bound algorithm is guaranteed to obtain an optimal
alignment for the locally similar region, but it is slow, requiring an exponential number of
steps in the worst case.
In Monte Carlo alignment, the pairs of contact patterns are assembled into a larger set of aligned
pairs. Intervening rows and columns that fall outside the equivalences are removed. This helps to
maximize the similarity score, which is built from a score for each pair of aligned residues
(Equation 1):

$$\mathrm{score}(i,j) = \left(0.2 - \frac{\left|d^{A}_{ij} - d^{B}_{ij}\right|}{d^{*}_{ij}}\right)\,\omega\!\left(d^{*}_{ij}\right) \qquad \text{(Equation 1)}$$

Here i and j represent a pair of aligned residues in proteins A and B, $d^{A}_{ij}$ and
$d^{B}_{ij}$ are the corresponding distances in the distance matrices of A and B, and
$d^{*}_{ij}$ is the average of $d^{A}_{ij}$ and $d^{B}_{ij}$. $\omega$ is the envelope function,
$\omega(d) = e^{-d^{2}/r^{2}}$, where d is the arithmetic mean of the distances between residues
i and j in proteins A and B, and r = 20 Å. The envelope function gives lower weights to
residues that are far apart, which reduces their relative contribution to the overall score.
This score is summed over all pairs of aligned residues from both protein structures, so a higher
similarity score corresponds to a better fit, i.e., a better alignment between the structures.
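The score of Equation 1 can be sketched in Python as follows. This is an illustrative sketch, not DALI's implementation: the function names are hypothetical, the distance matrices are assumed to be nested lists in Angstroms, and the alignment is assumed to be given as a list of `(residue_in_A, residue_in_B)` index pairs. Scoring a residue paired with itself (zero average distance) as 0.2 is an assumption made here to avoid dividing by zero.

```python
import math

THETA = 0.2   # threshold in Equation 1: relative deviations above 20% score negative
R = 20.0      # envelope radius r in Angstroms

def pair_score(d_a, d_b):
    """Score one pair of aligned residues from their distances in A and in B."""
    d_mean = (d_a + d_b) / 2.0
    if d_mean == 0.0:
        # Assumption: a residue compared with itself gets the maximum score.
        return THETA
    envelope = math.exp(-(d_mean ** 2) / (R ** 2))   # down-weights distant pairs
    return (THETA - abs(d_a - d_b) / d_mean) * envelope

def similarity_score(dist_a, dist_b, equivalences):
    """Sum pair_score over all pairs of aligned residues."""
    total = 0.0
    for i_a, i_b in equivalences:
        for j_a, j_b in equivalences:
            total += pair_score(dist_a[i_a][j_a], dist_b[i_b][j_b])
    return total
```

With this definition, two distances that agree exactly contribute a positive term, while a pair whose distances differ by more than 20% of their average contributes a negative term, so adding badly matching residues to an alignment lowers the total.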
Using both the branch-and-bound and the Monte Carlo algorithms has increased the sensitivity and
specificity of pairwise protein comparison by Dali.
THE BRANCH AND BOUND ALGORITHM
Only non-gapped segment pairs are used, to reduce the complexity of the structural alignment
problem. A natural segmentation uses the secondary structure elements of the query structures.
The details of the algorithm used to maximize the alignment between two proteins are illustrated
in figure 5(a).
Figure 5(a). The algorithm is followed from top to bottom and is represented in three different
schematics: the 3D chain on the left, a two-dimensional distance matrix in the middle, and the
sequence alignment on the right.
The structurally equivalent fragments in the two proteins, protein 1 and protein 2, are labeled
a, b, c and a', b', c' respectively. In the two-dimensional distance matrices, equivalent
fragments have boxes with the same patterns. The boxes along the diagonal of the matrix represent
the intra-fragment distances and the off-diagonal boxes represent the inter-fragment distances.
One can calculate the similarity score from the pairwise distances between all the equivalent
elements in the distance matrices.
First Row: Start the alignment by finding equivalent hexapeptide-hexapeptide fragment pairs a-b
and a'-b'.
Second Row: Next, take only the fragments b and b' and search for additional matching contact
patterns that extend the alignment. This process is called “branching”. The fragment c of
protein 1 and c' of protein 2 overlap with b and b'.
Third Row: The overlapping pair b-c and b'-c' is merged with the previously aligned pair
a-b and a'-b'.
Fourth Row: The two aligned pairs are merged, or collapsed, into a new alignment a-b-c
and a'-b'-c'. The collapse is performed after the removal of insertions or deletions, followed by
reordering of the segments b' and c' in the second protein. This process is called “bounding”.
Figure 5(b). Another view of the branch-and-bound algorithm. Image source: [3]
The branch-and-bound algorithm is explained in more detail below with reference to
figure 5(b) above:
Proteins A and B are represented as distance matrices (as labeled in figure 5(b)), and each point
in a matrix is a residue-residue distance. Each square shown within a matrix is the set of
contacts made by two segments; the secondary structure segments of A (i.e., the natural
segmentation mentioned above) are marked along matrix A. The solution space consists of all
possible placements of residues in protein B relative to the segments of residues in A. The idea
is to divide the solution space into smaller subspaces, always following the subspace with the
highest upper bound, until a single alignment trace is left.
To explain the algorithm schematically, consider the solution space as a circle, as shown at
the upper left and right of figure 5(b). Every point in the circle is a possible solution. A line
drawn within the circle divides the solution space into two subspaces (i.e., two semi-circles).
Each subspace receives an upper bound: 9 for the left semi-circle and 12 for the right
semi-circle. We choose the right one since it has the larger upper bound. The right semi-circle
is further divided into two subspaces (i.e., quarters) with an upper bound of 10 for the top
quarter and 16 for the bottom quarter. We choose the bottom quarter and split it again. The upper
bounds for the two resulting areas are 14 to the right and 12 to the left. We choose the right
one to proceed. The splitting continues until the subspace shrinks to a single alignment trace.
The single solution is indicated with an arrow in figure 5(b). I might be wrong, but I believe
the final single solution is supposed to lie in the area marked with a star in figure 5(b).
The upper bound for each segment-segment submatrix of protein A is obtained by calculating the
similarity scores between that submatrix of A and each of the accessible submatrices in B. In
other words, the similarity score is calculated for a predefined set of residues in protein A
(a single highlighted square in matrix A in figure 5(b)) against a set of squares in B (also
highlighted in the same figure). The predefined set consists of residues in secondary
structure elements (i.e., the natural segmentation). The upper bound for the total similarity
score of one set of solutions is then calculated by summing the separate upper bounds of all the
segment-segment pairs of matrix A [3].
Therefore, the branch-and-bound algorithm recursively splits the search space into smaller
subsets of paired fragments. The maximal similarity score is computed for each segment pair
within a subset, and these maxima are summed to give the subset's upper bound on the
sum-of-pairs similarity score. The single alignment that remains at the end corresponds to the
optimal alignment.
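The search strategy can be illustrated with a small, generic best-first branch-and-bound in Python. This is a sketch of the technique under simplifying assumptions, not DALI's implementation: each segment of protein A has a short list of candidate placements in B, the total score is a sum of pairwise terms, and the upper bound of a subset is the sum of the best possible term for every segment pair. All names and numbers (`branch_and_bound`, `spacing_score`, the candidate lists) are hypothetical.

```python
import heapq
from itertools import combinations

def upper_bound(candidates, pair_score):
    """Sum, over segment pairs, of the best score any remaining placements allow."""
    total = 0.0
    for s, t in combinations(range(len(candidates)), 2):
        total += max(pair_score(s, p, t, q)
                     for p in candidates[s] for q in candidates[t])
    return total

def branch_and_bound(candidates, pair_score):
    """Best-first search: always split the subset with the highest upper bound."""
    tie = 0  # tie-breaker so the heap never has to compare candidate lists
    heap = [(-upper_bound(candidates, pair_score), tie, candidates)]
    while True:
        neg_bound, _, node = heapq.heappop(heap)
        widest = max(range(len(node)), key=lambda s: len(node[s]))
        if len(node[widest]) == 1:
            # One placement per segment is left: the bound is the exact score.
            return [c[0] for c in node], -neg_bound
        half = len(node[widest]) // 2
        for part in (node[widest][:half], node[widest][half:]):
            child = node[:widest] + [part] + node[widest + 1:]
            tie += 1
            heapq.heappush(heap, (-upper_bound(child, pair_score), tie, child))

# Toy problem: three segments of protein A start at residues 0, 10 and 20;
# each may be placed at a few start residues in protein B.
A_START = [0, 10, 20]

def spacing_score(s, p, t, q):
    """Penalize placements in B whose spacing differs from the spacing in A."""
    return -abs((q - p) - (A_START[t] - A_START[s]))

placements, score = branch_and_bound([[0, 5, 7], [3, 15, 18], [9, 25, 30]],
                                     spacing_score)
```

The search returns as soon as the subset with the highest bound contains a single solution. At that point its exact score is at least as large as the upper bound of every subset still waiting in the queue, so no unexplored solution can beat it.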
MONTE CARLO ALGORITHM
In Dali, Monte Carlo optimization achieves iterative improvement by a random-walk exploration of
the search space, with occasional excursions into non-optimal regions of the structural
alignment (Holm and Sander 1993a). At first, a random basic move is made. The probability of
accepting this move is p = exp(beta * (S' - S)), where S' is the new similarity score, S is the
old similarity score and beta is a parameter; a move that improves the score is therefore always
accepted. A move always involves an addition or deletion of a residue equivalence assignment.
The Monte Carlo algorithm uses two basic modes of operation: the expansion mode and the trimming
mode. In expansion mode, the alignment is extended whenever a residue pair yields a matching
contact pattern. Since the alignment is a one-to-one mapping between the two proteins, say A and
B, the addition of a new fragment almost always involves the removal of a previous, inconsistent
equivalence assignment. As a result, the total similarity score receives both an increment and a
decrement. In trimming mode, any fragments that make a negative contribution to the total
similarity score are removed from the alignment. Trimming is done after the first expansion
cycle and after every fifth subsequent expansion cycle.
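The acceptance rule can be sketched in Python as below. This is a toy illustration of the random walk, not DALI's optimizer: each move adds or deletes one equivalence, and each equivalence is assumed to contribute a fixed, independent amount to the score (in DALI the contributions depend on the rest of the alignment). The function names, the contribution values, `beta=1.0` and the step count are all hypothetical.

```python
import math
import random

def acceptance_probability(old_score, new_score, beta):
    """Probability exp(beta * (S' - S)), capped at 1 for improving moves."""
    if new_score >= old_score:
        return 1.0
    return math.exp(beta * (new_score - old_score))

def monte_carlo(contributions, beta=1.0, steps=2000, seed=0):
    """Random walk over sets of equivalences; each move adds or deletes one."""
    rng = random.Random(seed)
    current = set()
    score = 0.0
    best, best_score = set(current), score
    for _ in range(steps):
        k = rng.randrange(len(contributions))
        delta = -contributions[k] if k in current else contributions[k]
        if rng.random() < acceptance_probability(score, score + delta, beta):
            current ^= {k}            # toggle: add if absent, delete if present
            score += delta
            if score > best_score:
                best, best_score = set(current), score
    return best, best_score

# Toy contributions of five candidate equivalences to the similarity score.
best, best_score = monte_carlo([3.0, -2.0, 5.0, -1.0, 4.0])
```

Because score-lowering moves are sometimes accepted, the walk can leave a local optimum; keeping track of the best alignment seen so far means such excursions never lose a good result.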
DALI INPUT INTERFACE
The Dali Email Server for Database Searches
In database searching, the user submits the coordinates of the query protein structure and Dali
compares it against the structures in the PDB. A multiple alignment of the structural neighbors
is sent to the user by email. The user can submit the protein structure by email, or upload a
*.pdb file containing the coordinates of the protein structure interactively through the webpage
http://www.ebi.ac.uk/dali/Interactive.html. The user needs to type an email address in the
textbox provided. Use plain text, not encoded messages (MIME or BinHex). Commercial users need
to type in a password. Press the “Submit Query” button to submit the request. The request is
sent through the email address dali@ebi.ac.uk. The result is emailed back to the address
provided by the user within a few minutes to a few days of submission, depending on the protein
structure submitted. In case of longer delays, the user can send a notification to
dali-help@ebi.ac.uk. The set of structures searched is constructed such that the sequence
identity between any two chains in the set is less than 25%. The interface for the database
search request is shown in figure 7.
The DaliLite Server for comparing two protein structures
The server for pairwise protein comparison can be accessed at http://www.ebi.ac.uk/DaliLite.
Two inputs are required to submit a request. The user can enter the PDB codes of two known
protein structures or upload the structural coordinates in PDB format. If a protein
structure has several domains or chains, they can be entered in the space provided. To submit
the request, click the “Run DaliLite” button.
Dali Database
The Dali database is based on an exhaustive all-against-all three-dimensional comparison of the
protein structures currently available in the PDB (Protein Data Bank). The input interface for
the Dali Database is at http://www.bioinfo.biocenter.helsinki.fi/dali/start. The classification
and alignments of protein structures are automatically maintained and regularly updated from the
PDB by the Dali search engine. The input interface of the Dali Database is shown in figure 13.
You can enter a PDB ID, a protein name or a keyword in the textbox. As you scroll
down the page, you will see links for downloading the sequence files, the MySQL dump files and
the DALI standalone application. The dump files are computer-readable database dumps for
large-scale studies.
To download the DALI standalone application, read and sign the license agreement, keep a copy
for yourself and return the license agreement by paper mail to Liisa Holm’s address, which is
provided. Once that is done you can proceed to download the program. The download is
available for academic use only; commercial use is prohibited. Moreover, this application
does not create or update the database. Once you have downloaded the application, unzip and
untar the distribution files. The INSTALL file provides information on how to install and run the
application. For usage instructions, run DaliLite with the option –help.
DALI OUTPUT INTERFACE
The DaliLite Server for comparing two protein structures
The output interface after submitting the PDB ID codes of two protein structures is shown in
Figure 8. Note that the first structure, the query structure, is named mol1 and the second
structure, the subject structure, is named mol2 by Dali. If chain A of the query structure is
submitted, Dali names it mol1A, and so on. If no chain is specified for the structures,
DALI produces comparison results for all the chains of the subject and the query structures.
Click the “Submit Another Job” button to submit a new pair of protein structures.
The link under “Structural Alignment” directs the user to a page containing the sequence
alignment between the two protein chains that follows from the structural alignment. It also
includes the DSSP secondary structure information (see figure 10). The column “Superimposed
C-alpha Traces” gives details on the matching superimposed C-alpha atoms of both structures. The
column “PDB Files” gives links to two pdb files: the original pdb file (i.e., for the protein
with PDB ID 1CDK in figure 9(a)) and a second pdb file containing the protein with PDB ID 1CJA
(as given in figure 9) after it has been rotated and translated to match the first protein,
1CDK. To view the full superposition of the original, unchanged protein and the transformed
second protein, load both pdb files in a structure viewer such as PyMOL or RasMol (see figure
9(b), (c) and (d)). The page also reports the number of residues used for the alignment between
the two structures, the Z-score, the RMSD (Root Mean Square Deviation) and the percentage of
sequence identity. Details on how to calculate Z-scores and RMSD are given in the next section.
As you scroll down the same screen as in figure 9, you reach the “Additional data” section
and the “Inputs” section (see figure 11(a)). The links under “Additional data” provide the
following files: the rotation-translation matrix used to rotate and translate the second
protein structure onto the first structure; a list of structurally equivalent residue ranges
(see figure 12); and the log file, which gives details on all the steps taken by the DaliLite
application.
The “Inputs” section displays the pdb files for the two protein structures that the user
entered at the DaliLite input interface (Figure 11(a) and 11(b)). This lets the user check that
the DaliLite server has used the desired pdb files for the two proteins.
Dali Database
Figure 14 shows the result for the query “hemoglobin”. Note that the PDB entry for the
hemoglobin with PDB ID 1a00 has two chains, B and D. The representative for these two PDB
entries is given as 1dxtB (3rd column) and the domain fold class is given as 806 (5th column).
Clicking on the fold index (i.e., 806) gives details of all the proteins belonging to the same
fold class (see figure 15). The browse link shows details on the structural neighbors of each
domain. It also provides the 1D structural alignment between hemoglobin and each of the top 50
neighbors (Figure 16 and 17). The “interact” link directs you to another page that lets you see
multiple structure and sequence alignments in 1D (see Figure 18). Clicking on “Structure
Alignment” gives details on the multiple structure alignment, displayed in the same way as a
sequence alignment. Click the “Structure/Sequence Alignment” button to see the structural
alignment expanded with related sequences. These sequences are found by PSI-Blast and stored in
the ADDA database (Heger and Holm 2003). This view is especially helpful for tracking down the
parts of the protein sequence that have been conserved across protein families over time.
Conserved functional sites are a strong indicator of common evolutionary origin.
STATISTICAL ANALYSIS OF RESULT FROM DALI
The similarity score for a structural comparison is assessed against an all-on-all comparison of
the structures in the PDB. The DALI score is expressed as the number of standard deviations by
which the raw score exceeds the mean of the database background distribution. These scores are
known as Z-scores. The Z-score is an important estimate of the quality of the structural
alignment. A score greater than 20 implies that the two structures are definitely homologous. A
score in the range of 8 to 20 means that the two structures are probably homologous. A score in
the range of 2 to 8 indicates a grey area, and a score below 2 means that the two proteins are
structurally dissimilar. Dali does not return any result for proteins with a score below 2. [7]
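The interpretation of Z-scores described above can be written as a small helper. This is a sketch only: the function names are hypothetical, the background scores in the usage line are made up, the treatment of scores falling exactly on 2, 8 or 20 is an assumption (the text does not say which side they belong to), and the population standard deviation is used as one possible choice.

```python
import statistics

def z_score(raw_score, background_scores):
    """Standardize a raw score against a background distribution of scores."""
    mu = statistics.mean(background_scores)
    sigma = statistics.pstdev(background_scores)   # assumption: population form
    return (raw_score - mu) / sigma

def interpret_z(z):
    """Map a Z-score to the qualitative categories given in the text."""
    if z > 20:
        return "definitely homologous"
    if z >= 8:
        return "probably homologous"
    if z >= 2:
        return "grey area"
    return "structurally dissimilar"

# A raw score of 30 against a toy background with mean 15 and deviation 5.
label = interpret_z(z_score(30, [10, 20]))
```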
RMSD (Root Mean Square Deviation) measures the average distance between the backbones of the
superimposed proteins. RMSD is used as a score to find an optimal alignment between two
protein structures. It is measured in Angstroms (Å). Identical structures have an RMSD of 0;
when a pair of proteins has similar structures, the value falls in the range 1 to 3 Å. Protein
structures with little similarity have an RMSD greater than 3 Å.
A scoring scheme based on RMSD is very good at finding alignments with global similarity.
However, it is not sufficient for finding alignments where the similarities are local [8]. The
pitfalls of using RMSD as a scoring criterion are:
(1) all atoms are treated equally (for example, residues on the surface of a protein structure,
such as the hydrophilic residues, have higher degrees of freedom than the hydrophobic residues
in the core).
(2) the best alignment of two protein structures does not always produce a minimal RMSD value.
(3) the significance of an RMSD value depends on the size of the structures being compared.
The formulas used for calculating Z-scores (Equation 2) and RMSD (Equation 3) are given below:

$$Z = \frac{x - \mu}{\sigma} \qquad \text{(Equation 2)}$$

where $x$ is the raw score to be standardized, $\mu$ is the mean and $\sigma$ is the standard
deviation.

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\delta_i^{2}} \qquad \text{(Equation 3)}$$

where $\delta_i$ is the distance between the i-th of N pairs of equivalent C-alpha atoms.

Figure 6. An example of how to calculate RMSD
Figure 7. Input interface for the Database search request in DALI server
Figure 8. The input interface for the pairwise comparison of protein structures. 1CPC is the PDB code for
C-phycocyanin of a cyanobacterium. It has 4 chains: A, B, K and L. 1KTP is the PDB code for C-phycocyanin of
Synechococcus vulcanus. It has two chains, A and B. Alternatively, one can upload PDB files.
Figure 9(a). The output interface for the pairwise comparison of protein structures. This page gives
details on the protein structures used and the statistical data from the comparison of the two structures. The table gives
values for the Z-score, RMSD (Root Mean Square Deviation) and sequence identity expressed as a percentage.
Figure 9(b). The superimposed C-alpha traces from the pdb file (CA_1.pdb, see figure 9(a) above) provided by DaliLite
as viewed in Jmol. The percentage of sequence identity is just 11%.
Figure 9(c). The original 1CDK:A protein that was used in the request as viewed in Jmol. DaliLite gives a
link to the pdb file (“mol1_original.pdb”) for the original molecule. (See figure 9(a)).
Figure 9(d). The rotated and translated 1CJK:A protein that was used by DaliLite to align with
1CDK:A (shown in fig. 9(c)) for structural comparison as viewed in Jmol. The coordinates for this rotated-translated
protein are from the pdb file provided by DaliLite (“mol2_1.pdb”) (See figure 9(a)).
Figure 10. This figure shows the details on the structural alignment using the amino acid sequence.
Figure 11(a). The additional data section and the Inputs section
Figure 11 (b)
Figure 12. The figure shows the details on residue ranges after clicking on the corresponding link shown in Fig. 11(a).
The numbers below the arrows at the bottom of the figure indicate the columns that the comment area at the top of the
figure refers to.
Figure 13. The Dali Database input interface. You can enter a PDB identifier, a protein name or a keyword to
get results. Further down the screen there are links for downloading sequence files, MySQL dump files and the
DALI standalone application.
Figure 14. The output for the query “hemoglobin” in the Dali Database input interface.
Figure 15. Fold Query Result
Figure 16. The figure shows the structural neighbor list for Hemoglobin when the browse link shown in figure 14 is
clicked.
Figure 17. The details on the 1D structural alignment between hemoglobin and deoxyhemoglobin from the Dali
Database. Note that deoxyhemoglobin is the second structural neighbor to hemoglobin as shown in the figure 16 above.
Statistical data on the structural alignment are also provided.
Figure 18. Clicking any of the 4 buttons after selecting the given structures gives information on the multiple structure and
sequence alignment among the proteins and the pdb code of their superimposed structures.
Figure 19. The given figure illustrates how to calculate a distance matrix for protein A.
2. MULTIPROT
The MultiProt method performs simultaneous multiple structural alignment of protein structures.
Although the algorithm is robust and efficient, it treats proteins as rigid structures and performs
rigid structural comparison, whereas proteins are in fact flexible molecules [10].
Multiple structural alignment is naturally more powerful than pairwise structural alignment of
protein molecules, since it involves much more information than pairwise structural alignment.
Moreover, there are only a few freely available tools or web services for three-dimensional protein
structural analysis. Most of these tools and web services require every input molecule to take part
in the structural alignment, whereas MultiProt does not need all the input molecules to take part
in a match.
The main concept used in the MultiProt method is to detect a common geometrical core among
the protein structures [9]. The method is able to distinguish between structurally similar and
dissimilar molecules and to exclude the dissimilar ones from the alignment. The method can
efficiently perform structural alignment on tens and possibly hundreds of protein structures
simultaneously. Performing the alignment simultaneously eliminates the issue of bias in the
superposition. The method can also find structures with similar fragments of maximal length
while disregarding the order of residues in the chains. Depending on the size of the protein
structures used, a run takes from a few seconds to a few minutes.
The MultiProt server can be reached at http://bioinfo3d.cs.tau.ac.il/MultiProt/ . The application is
available for download from the same address but works only on the Linux operating
system.
THE MULTIPROT ALGORITHM
The MultiProt algorithm performs three different structural alignments in
three stages. The first stage consists of multiple structural alignment of contiguous fragments,
the second stage finds the best multiple alignments based on global structural similarity,
and the third stage consists of bio-core detection, where another scoring scheme is enforced. This
scoring scheme requires that the aligned points are of the same biological type. In essence, a
bio-core classification is done.
The goal of the method is to detect structurally similar fragments of maximal length.
Stage 1: Multiple Structural Fragment Alignment:
Consider two protein molecules $M_p$ and $M_q$. A fragment of molecule $M_p$ that starts at position $i$
and has length $l$ is denoted $F_i^p(l)$. Similarly, a fragment of molecule $M_q$ that starts at
position $j$ and has length $l$ is denoted $F_j^q(l)$. The two fragments are $\varepsilon$-congruent if
there exists a rigid 3-D transformation $T$ that superimposes them with RMSD $\le \varepsilon$, where
$\varepsilon$ is a predefined threshold. The default value of $\varepsilon$ in the MultiProt method is 3 Å.
The correspondence is sequential: atom $i + t$ of protein $M_p$ is matched against atom $j + t$ of
protein $M_q$, for $t = 0, \dots, l - 1$. An $\varepsilon$-congruent pair of fragments of $M_p$ and $M_q$ is
denoted $F_i^p F_j^q(l)$ and satisfies $\mathrm{RMSD}_{opt}(F_i^p F_j^q(l)) \le \varepsilon$, i.e.,
$$\mathrm{RMSD}_{opt}(F_i^p F_j^q(l)) = \min_T \mathrm{RMSD}\big(F_i^p(l),\, T(F_j^q(l))\big)$$ …. Equation 4
where $T$ is a rigid 3D transformation [9, 10]. All the $\varepsilon$-congruent fragment pairs $F_i^p F_j^q(l)$
can be obtained in polynomial time. To have an $\varepsilon$-congruent multiple alignment, the algorithm
requires a pivot molecule to participate in the alignment. All the matched points must be within
distance $\varepsilon$ of the corresponding point of the pivot molecule. So that the algorithm does not
depend on the choice of the pivot molecule, the method iteratively chooses every molecule as the
pivot. The pseudocode [9] below illustrates how the rest of the molecules are aligned with respect
to the pivot molecule:
Input: m molecules S = {M_1, ..., M_m}
for i = 1 to m - 1
    M_pivot = M_i
    S' = S \ {M_pivot}
    Alignments = MultipleFragmentAlignment(M_pivot, S')
    GlobalMultipleAlignment(Alignments)
end
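The pivot loop can be outlined in Python as follows. This is a structural sketch only: `fragment_alignment` and `global_alignment` are hypothetical callables standing in for the two stages, and the loop follows the pseudocode as printed here (pivots 1 to m - 1, each aligned against all other molecules), which is an assumption about how [9] should be read.

```python
def multiprot_outline(molecules, fragment_alignment, global_alignment):
    """Iterate over pivot molecules as in the pseudocode above.

    fragment_alignment(pivot, others) and global_alignment(alignments) are
    placeholders supplied by the caller; they are not MultiProt functions.
    """
    results = []
    for i in range(len(molecules) - 1):             # i = 1 .. m-1 in the pseudocode
        pivot = molecules[i]
        others = molecules[:i] + molecules[i + 1:]  # S' = S \ {M_pivot}
        alignments = fragment_alignment(pivot, others)
        results.append((pivot, global_alignment(alignments)))
    return results


if __name__ == "__main__":
    # Stub stages: pair the pivot with every other molecule, then count the pairs.
    pair_up = lambda pivot, others: [(pivot, other) for other in others]
    print(multiprot_outline(["A", "B", "C"], pair_up, len))
```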
Essentially, when performing multiple fragment alignment, a set of transformations
$T_{i_1}, \dots, T_{i_r}$ aligns the molecules $M_{i_1}, \dots, M_{i_r}$ with $M_{pivot}$. At this point, the multiple
structural alignment only provides the 3D transformations and aligns fragments as short as three
amino acids. It does not yet determine which points (amino acids) are matched in 3D space. It is
possible to get more than one solution when several fragments from the same molecule
align with the pivot molecule, because the algorithm tries to collect all possible solutions.
To pare down redundant alignments, the method performs a cut so that only one
fragment from each molecule is selected for the multiple alignment. The number of possible alignments
for a given cut grows exponentially with the number of molecules [9, 11]. The time complexity
of computing the congruent pair $F_i^p F_j^q(l)$ is $O(l)$, and the greedy iterative approach toward
alignment takes $O(|M_p| \cdot |M_q|)$ [11].
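To make Equation 4 and the congruence test of Stage 1 concrete, here is a minimal pure-Python sketch. It uses the Kabsch singular-value formulation of optimal superposition, which is a standard technique swapped in for illustration and not necessarily how MultiProt computes the optimum. The names `rmsd_opt` and `eps_congruent` are hypothetical, and fragments are assumed to be lists of (x, y, z) C-alpha coordinates.

```python
import math


def _centered(points):
    n = len(points)
    c = [sum(p[k] for p in points) / n for k in range(3)]
    return [[p[k] - c[k] for k in range(3)] for p in points]


def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _sym_eigenvalues(a):
    """Eigenvalues (descending) of a symmetric 3x3 matrix, trigonometric closed form."""
    off = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if off == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p = math.sqrt((sum((a[k][k] - q) ** 2 for k in range(3)) + 2.0 * off) / 6.0)
    b = [[(a[r][c] - (q if r == c else 0.0)) / p for c in range(3)] for r in range(3)]
    phi = math.acos(max(-1.0, min(1.0, _det3(b) / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def rmsd_opt(frag_p, frag_q):
    """Equation 4: minimum RMSD over rigid transformations of two equal-length fragments."""
    n = len(frag_p)
    P, Q = _centered(frag_p), _centered(frag_q)
    # Covariance matrix S and the symmetric product S^T S.
    s = [[sum(P[i][r] * Q[i][c] for i in range(n)) for c in range(3)] for r in range(3)]
    sts = [[sum(s[k][r] * s[k][c] for k in range(3)) for c in range(3)] for r in range(3)]
    sig = [math.sqrt(max(e, 0.0)) for e in _sym_eigenvalues(sts)]  # singular values of S
    if _det3(s) < 0:  # exclude reflections: only proper rotations are allowed
        sig[2] = -sig[2]
    e0 = sum(x * x for pt in P + Q for x in pt)
    return math.sqrt(max(e0 - 2.0 * sum(sig), 0.0) / n)


def eps_congruent(mol_p, i, mol_q, j, l, eps=3.0):
    """Sequential correspondence: atom i+t of mol_p is paired with atom j+t of mol_q."""
    return rmsd_opt(mol_p[i:i + l], mol_q[j:j + l]) <= eps
```

Because the closed-form eigenvalue step loses some precision, a fragment compared with a rotated copy of itself yields a value very close to, but not exactly, zero.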
We now proceed to global structural alignment in the second stage, which detects larger structural
cores among the aligned structures.
Stage 2: Global Multiple Alignment:
The goal at this stage is to find the best multiple alignments based on global structural
similarity, as defined below:
(*) Given m molecules, a parameter $K$ and a threshold value $\varepsilon$, for each $r$ such that
$2 \le r \le m$, find the $K$ largest $\varepsilon$-congruent multiple alignments containing exactly $r$
molecules. [9]
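The bookkeeping implied by (*) can be sketched as below: candidate alignments are grouped by the number r of molecules they contain, and only the largest few are kept for each r. The function name `k_largest_per_r`, the parameter name `k` and the (molecule ids, size) input format are hypothetical, and the candidates are assumed to have already passed the congruence threshold.

```python
import heapq
from collections import defaultdict


def k_largest_per_r(candidates, k):
    """Keep the k largest alignments for every r >= 2.

    candidates: iterable of (molecule_ids, alignment_size) pairs.
    Returns {r: [(size, sorted_ids), ...]} with each list in descending size order.
    """
    by_r = defaultdict(list)
    for mols, size in candidates:
        by_r[len(set(mols))].append((size, tuple(sorted(mols))))
    return {r: heapq.nlargest(k, items) for r, items in sorted(by_r.items()) if r >= 2}


if __name__ == "__main__":
    found = [(("a", "b"), 50), (("a", "c"), 70), (("b", "c"), 60), (("a", "b", "c"), 40)]
    print(k_largest_per_r(found, k=2))
```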
This is a hard problem because the MTSA (MulTiple Structural Alignment) problem is NP-hard even
for exact congruence ($\varepsilon = 0$), so a heuristic solution is applied. From every cut, at most
one fragment per molecule is selected such that the resulting multiple structural alignment has
the highest possible score and preserves $\varepsilon$-congruence. Once the multiple correspondence
between the pivot molecule and the other molecules has been established, a rigid transformation is
applied that minimizes the RMSD between the matching points. Generally, multiple iterations
(default is 3) are performed to get the best global alignment.
Stage 3: Bio-Core Detection:
In bio-core detection, another scoring scheme is applied, which specifies that the matching points
must be of the same biological class. The residues are classified as hydrophobic,
charged/polar, aromatic or glycine. The highest scoring solution is obtained at this stage, and it
is complementary to stages 1 and 2.
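The bio-core criterion can be illustrated with a small lookup table. This is a sketch under stated assumptions: the text names only the four classes, so the assignment of individual amino acids to classes in `BIO_CLASS` is an assumption made for illustration, and `bio_core_size` is a hypothetical helper rather than part of MultiProt.

```python
# Assumed assignment of three-letter residue codes to the four classes named above.
BIO_CLASS = {}
for cls, residues in {
    "hydrophobic": "ALA VAL ILE LEU MET CYS",
    "polar_charged": "SER THR PRO ASN GLN LYS ARG HIS ASP GLU",
    "aromatic": "PHE TYR TRP",
    "glycine": "GLY",
}.items():
    for res in residues.split():
        BIO_CLASS[res] = cls


def bio_core_size(matched_columns):
    """Count aligned columns whose residues all fall in one biological class.

    matched_columns: iterable of tuples, one residue name per aligned molecule.
    """
    return sum(1 for col in matched_columns
               if len({BIO_CLASS[res] for res in col}) == 1)


if __name__ == "__main__":
    columns = [("ALA", "VAL", "LEU"), ("PHE", "TYR", "ALA"),
               ("GLY", "GLY", "GLY"), ("LYS", "ASP", "SER")]
    print(bio_core_size(columns))
```

In the example, the second column mixes aromatic and hydrophobic residues, so it does not count toward the bio-core.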
Complexity of the Algorithm:
If m is the number of input molecules and n is the size of the longest fragment, the complexity of
Stage 1 is O m  n 2 . The complexity of stage 2 and 3 are O m 2  n 3  n  O m  n 2 .








Theoretically, the overall time complexity of the algorithm is bounded by $O(m^2 \cdot n^3)$ [9].
INPUT FORMAT OF MULTIPROT
The input format for MultiProt is shown in Figure 20. The interface is simple. There is a
textbox for entering multiple PDB entries, separated by spaces. Alternatively, you
can upload a zip file containing the PDB structures. The default value of the RMSD threshold
is 3 Å, which the user can change. Once all the required fields are filled in, click
“Submit Query” to submit the request.
OUTPUT FORMAT OF MULTIPROT
The output format for the alignment request is shown in Figure 21. It gives, in tabular form, a
summary of each protein structure submitted and the number of Cα atoms present in each
structure. It also gives details on the various combinations of structural
alignments, the alignment sizes (i.e., the number of atoms involved in each alignment), the
RMSD values and a link to the pdb file for each alignment. Clicking on the pdb link
allows the user to download the file and later use it to view the superimposed image of the protein
structures involved in the structural alignment with molecular viewers such as PyMOL or Jmol. Clicking
on the number link under the Alignment size column gives the user a detailed view of the residues
involved in the alignment for each protein structure submitted.
Figure 20. The figure shows the input interface for MultiProt.
Figure 21. The output interface for MultiProt
Figure 22. Details on the residues aligned for each protein structure the user has submitted.
Figure 23. The superimposed image of the four protein structures that you see in Figures 21 through 22.
Figure 23 shows the superimposed image of all four protein structures shown in Figures
21 through 22. All the proteins belong to the superfamily “Trimeric LpxA-like enzymes”. Each
of the proteins is taken from a different family. All four of the proteins have a helical structure as a
common core. For the multiple structural alignment in our example, the alignment size was 70
with an RMSD of 1.03 Å.
MULTIPROT AND DALI
To compare the performance of DALI and MultiProt, a pairwise comparison is given in
tabular form in Table 1. The “hard to detect” pairwise alignments were originally
compiled by Shindyalov and Bourne [8]. I compared my
data with the data in Table 2 [11]. From the comparison, we can see that DALI was unable to
perform a pairwise comparison between the proteins with PDB IDs 2AZA:A and 1PAZ in both
cases, and the results for the rest of the proteins were almost the same in the two tables. There is a slight
improvement in performance when the results of MultiProt1 and MultiProt2 are compared between
Table 1 and Table 2. The improvement consists in an equal or slightly smaller
number of residues being aligned with the same or a slightly lower RMSD value. I believe it has to do with
improvements in the algorithm used in MultiProt. In short, MultiProt is a better structural
alignment tool when compared to DALI.
Table 1. Pairwise Structural Alignment Test between DALI and MultiProt
Molecule 1        Molecule 2        DALI           MultiProt1     MultiProt2
PDB ID    Size    PDB ID    Size    Sal    RMSD    Sal    RMSD    Sal    RMSD
1fxi:A    96      1ubq      76      60     2.6     44     1.67    50     1.81
1ten      89      3hhr:B    195     86     1.9     81     1.35    82     1.35
3hla:B    99      2rhe      114     75     3.0     60     1.8     67     1.9
2aza:A    129     1paz      120     X      X       70     1.70    78     1.75
1cew:I    108     1mol:A    94      81     2.3     73     1.69    72     1.68
1cid      177     2rhe      114     97     3.2     81     1.64    81     1.64
1crl      534     1ede      310     211    3.5     153    1.92    195    2.04
2sim      381     1nsb:A    390     X      X       199    1.92    226    2.11
1bge:B    159     2gmf:A    121     94     3.3     73     1.8     83     2.0
1tie      166     4fgf      124     114    3.1     79     1.75    81     2.01
Table 1 above gives a comparison of pairwise structural alignment using DALI and MultiProt.
MultiProt1 refers to the result obtained when the sequence order is preserved. MultiProt2 refers to
the result obtained when the sequence order is not preserved. Sal refers to the number of residues
that are aligned; RMSD values are in Å. Table 2 is from Shatsky, Nussinov and Wolfson [11].
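The columns of Table 1 can be summarized with a few lines of Python. The sketch below only re-types the values from Table 1; the names `TABLE_1` and `summarize` are hypothetical, and restricting the averages to the eight pairs for which DALI returned a result is a choice made here so that the three methods are compared on the same rows.

```python
from statistics import mean

# Rows of Table 1: (pair, DALI Sal, DALI RMSD, MP1 Sal, MP1 RMSD, MP2 Sal, MP2 RMSD).
# None marks the "X" entries, where DALI returned no alignment.
TABLE_1 = [
    ("1fxi:A/1ubq", 60, 2.6, 44, 1.67, 50, 1.81),
    ("1ten/3hhr:B", 86, 1.9, 81, 1.35, 82, 1.35),
    ("3hla:B/2rhe", 75, 3.0, 60, 1.8, 67, 1.9),
    ("2aza:A/1paz", None, None, 70, 1.70, 78, 1.75),
    ("1cew:I/1mol:A", 81, 2.3, 73, 1.69, 72, 1.68),
    ("1cid/2rhe", 97, 3.2, 81, 1.64, 81, 1.64),
    ("1crl/1ede", 211, 3.5, 153, 1.92, 195, 2.04),
    ("2sim/1nsb:A", None, None, 199, 1.92, 226, 2.11),
    ("1bge:B/2gmf:A", 94, 3.3, 73, 1.8, 83, 2.0),
    ("1tie/4fgf", 114, 3.1, 79, 1.75, 81, 2.01),
]


def summarize(rows):
    """Mean (Sal, RMSD) per method over the pairs that DALI could align."""
    shared = [r for r in rows if r[1] is not None]
    out = {"pairs": len(shared)}
    for name, col in (("DALI", 1), ("MultiProt1", 3), ("MultiProt2", 5)):
        out[name] = (mean(r[col] for r in shared), mean(r[col + 1] for r in shared))
    return out


if __name__ == "__main__":
    for key, value in summarize(TABLE_1).items():
        print(key, value)
```

On the eight shared pairs, DALI aligns about 102 residues on average at an average RMSD of about 2.86 Å, while MultiProt1 aligns 80.5 residues at about 1.70 Å and MultiProt2 about 89 residues at about 1.80 Å.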
Table 2: [11]
CONCLUSIONS
Here we have presented two powerful methods for comparing the structures of proteins. DALI is limited to
pairwise structural comparison of proteins, whereas MultiProt deals with simultaneous
multiple structural alignment of proteins. However, both methods treat proteins as rigid
molecules. The discussion and figures indicate that MultiProt is more powerful and
faster than DALI. The two comparison methods use different algorithms.
Whereas DALI uses branch-and-bound and Monte Carlo optimization algorithms,
MultiProt works in three stages, each of which performs a
different alignment. Compared to DALI, the comparison method in MultiProt is much more
complex, and it is extremely efficient and fast. I conclude this paper with a figure (Figure 24) of
10 proteins of the transferases family aligned together using the multiple alignment method of
MultiProt:
Figure 24(a). The proteins of the transferases family are aligned together using the MultiProt method. The proteins
involved in the alignment are: 10gs:A, 1axd:A, 1b48:A, 1c72:A, 1f2e:A, 1gnw:A, 1gwc:B, 1jlv:A, 1ljr:A, and 1pd2:1.
The number of residues aligned is 82 and the RMSD is 1.57 Å. The tabular data at the lower left corner of the
figure gives the number of C-alpha atoms present in each protein molecule.
REFERENCES
[1] Hobohm U, Scharf M, Schneider R, Sander C (1992): Selection of representative protein data
sets. Protein Sci 1: 409-17.
[2] Holm L, Sander C (1993a): Protein structure comparison by alignment of distance matrices. J.
Mol Biol 233(1): 123-38.
[3] Holm L, Sander C (1996): Mapping the protein universe. Science 273: 595-603.
[4] Holm L, Sander C (1998): Protein folds and families: sequence and structure alignments.
Nucleic Acids Research: 244-47.
[5] Bourne P.E., Weissig H. Structural Bioinformatics. Wiley-Liss, Hoboken, New Jersey.
[6] http://www.wikipedia.org/
[7] Holm L et al. Dali - Structural comparison in Proteins. Current Protocols in Bioinformatics.
(Retrieved as a PDF file from the internet; no details on the publication or the date available.)
[8] Shindyalov I.N., Bourne P.E. (1998): Protein structure alignment by incremental combinatorial
extension of the optimum path. Protein Engineering 11: 739-747.
[9] M. Shatsky, R. Nussinov, H.J. Wolfson (2002): MultiProt - a Multiple Protein Structural
Alignment Algorithm. 2nd Workshop on Algorithms in Bioinformatics (WABI’02 as part of
ALGO’02), Rome, Italy, Sept. 2002, Lecture Notes in Computer Science 2452: 235-250, Springer
Verlag.
[10] M. Shatsky, O. Dror, D. Schneidman-Duhovny, R. Nussinov, H.J. Wolfson (2004):
BioInfo3D: a suite of tools for structural bioinformatics. Nucleic Acids Research 32:
W503-W507.
[11] M. Shatsky, R. Nussinov, H.J. Wolfson (2004): A Method for Simultaneous Alignment of
Multiple Protein Structures. Proteins: Structure, Function, and Bioinformatics 56: 143-156.