
The basic idea of sequence comparison

Dr Mumthas Yahiya
The concept of sequence similarity &
Idea of Scoring matrices
Aligning two sequences is the foundation of Bioinformatics. It may be fairly said
that sequence alignment is the operation upon which everything else is built.
➢ Sequence alignments are the starting point for methods that predict the
secondary structure of proteins de novo.
➢ They are a prerequisite for all knowledge-based tertiary structure predictions.
➢ They are used to estimate the total number of different types of protein folds.
➢ They are used to interpret data from genome sequencing projects.
➢ They are used to infer phylogenetic trees and to resolve questions of ancestry
between species.
This power of sequence alignments stems from the empirical observation that if two
biological sequences are sufficiently similar, they almost invariably have
similar biological functions and are descended from a common ancestor.
The simplest measure of similarity is of course identity:
Identity: Two proteins that have a certain number of amino acids in common at
aligned positions are said to be identical to that degree. For example, if they have
43 residues out of a total of 144 in common, they are 29.9% identical.
Similarity: Often a number of residues will be replaced by ones with similar
physicochemical properties. Such mutations may be termed conservative, and one may
define various scoring schemes to quantify how similar the two sequences are, taking
conservative mutations into account. Such scores are measures of similarity.
Homology: If (and only if!) two proteins are evolutionarily related and stem from a
common ancestor, they are called homologous. Similarity and homology are distinct
concepts and must not be confused: similarity is a quantity that can be measured,
whereas homology is a conclusion about ancestry.
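As a minimal sketch of the identity measure defined above, the following Python function counts identical residues at aligned positions. The function name `percent_identity`, the use of `-` as a gap character, and the choice to divide by the full alignment length are assumptions made for illustration; other denominators (for example, the length of the shorter sequence) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of aligned positions that hold the same residue in both sequences.

    Assumption: both strings come from the same alignment, so they have equal
    length, and `gap` marks an inserted gap. Gap-gap columns do not count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# The example from the text: 43 identical residues out of 144 aligned positions.
print(round(100.0 * 43 / 144, 1))  # 29.9
```

Note that the result depends on the alignment itself: the same two proteins can yield different identity values under different alignments.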
Metrics for Similarity scores
A concept of sequence similarity always implies a metric: a quantitative statement
of how similar we judge two sequences to be.
The distribution of similar or identical residues within two sequences can itself be a
source of valuable information.
If we compile the similarity scores that we give to different amino-acid pairs into a
matrix, this is frequently called a scoring matrix.
The term mutation data matrix is also frequently used: this is a scoring matrix
compiled from the observation of point mutations between aligned sequences.
A metric of similarity between amino acid pairs - e.g. quantitatively how similar a
valine residue is to an isoleucine, or how related it is to a threonine - can be
defined in a number of ways. It is very important to realize that all subsequent
results depend critically on just how this is done and on what model underlies the
construction of a specific scoring matrix.
A scoring matrix is a tool to quantify how well a certain model is represented
in the alignment of two sequences, and any result obtained by its application is
meaningful exclusively in the context of that model.
The simplest metric in use is the identity metric. If two amino acids are the same,
they are given one score; if they are not, they are given a different score,
regardless of what the replacement is.
One may give a score of 1 for matches and 0 for mismatches - this leads to the
frequently used unitary matrix. Or one could assign +6 for a match and -1 for a
mismatch. Such a matrix is useful for local alignment procedures, where a
negative expectation value for randomly aligned sequences is required to ensure that
the score will not grow simply from extending the alignment in a random way.
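The two identity-based matrices described above can be sketched in Python as follows. The helper names (`identity_matrix`, `score_alignment`, `expected_random_score`) are hypothetical, and the expectation is computed under the simplifying assumption that all 20 amino acids occur with equal frequency, which real proteins do not satisfy.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def identity_matrix(match: float, mismatch: float) -> dict:
    """Scoring matrix that only distinguishes 'same residue' from 'different residue'."""
    return {
        (a, b): (match if a == b else mismatch)
        for a in AMINO_ACIDS
        for b in AMINO_ACIDS
    }


def score_alignment(seq_a: str, seq_b: str, matrix: dict) -> float:
    """Sum of pair scores over an ungapped alignment of two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


def expected_random_score(matrix: dict) -> float:
    """Expected score per aligned position for random sequences.

    Assumption: every amino acid has the same frequency (1/20), so every
    ordered pair is equally likely and the expectation is the plain mean.
    """
    return sum(matrix.values()) / len(matrix)


unitary = identity_matrix(match=1, mismatch=0)
local = identity_matrix(match=6, mismatch=-1)

print(expected_random_score(unitary))  # 0.05
print(expected_random_score(local))    # -0.65
```

Under this assumption the unitary matrix has a positive expected score per position, so a random alignment tends to score higher as it gets longer, while the +6/-1 matrix has a negative expectation, which is the property local alignment requires.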
A very crude model of an evolutionary relationship could be implemented in a
scoring matrix in the following way: since all point mutations arise from nucleotide
changes, the probability that an observed amino acid pair is related by chance rather
than by inheritance should depend on the number of point mutations necessary to
transform one codon into the other.
A metric resulting from this model would define the distance between two amino
acids by the minimal number of nucleotide changes required.
Indeed, this genetic code matrix already improves the sensitivity and specificity of
alignments compared with the identity matrix.
The genetic code matrix works for aligning related proteins in much the same way as
matrices derived from amino-acid properties do, which suggests that the genetic code
itself tends to minimize the effects of point mutations.
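The distance used by the genetic code matrix, the minimal number of nucleotide changes between any codon of one amino acid and any codon of another, can be sketched in Python as below. The sketch assumes the standard genetic code; the names `CODON_TABLE`, `codons_for` and `genetic_code_distance` are illustrative and not taken from any library.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
# '*' marks a stop codon.
AMINO_ACIDS_BY_CODON = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS_BY_CODON)
}


def codons_for(amino_acid: str) -> list:
    """All codons that encode the given one-letter amino acid."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    return codons


def genetic_code_distance(aa1: str, aa2: str) -> int:
    """Minimal number of nucleotide changes needed to turn a codon of aa1 into one of aa2."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for(aa1)
        for c2 in codons_for(aa2)
    )


print(genetic_code_distance("V", "I"))  # 1
print(genetic_code_distance("V", "T"))  # 2
print(genetic_code_distance("M", "Y"))  # 3
```

A scoring matrix would then assign higher scores to smaller distances; how the distances 0 to 3 are mapped to scores is a further modelling choice not fixed by the text.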
Other similarity scoring matrices might be constructed from any property of amino
acids that can be quantified (partition coefficients between hydrophobic and
hydrophilic phases, charge, molecular volume, to name only a few).
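As one hedged illustration of a property-based matrix, the sketch below derives a similarity score from a single hydrophobicity scale. The Kyte-Doolittle hydropathy index is used here only as an example of a quantifiable property; it is not named in the text, and the linear rescaling to the range 0 to 1 as well as the function name `hydropathy_similarity` are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy index (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Width of the scale, used to normalise differences into the range 0..1.
_SPAN = max(KYTE_DOOLITTLE.values()) - min(KYTE_DOOLITTLE.values())


def hydropathy_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 for equal hydropathy, 0 for the two extremes of the scale.

    Illustrative assumption: similarity falls off linearly with the
    difference in hydropathy between the two residues.
    """
    return 1.0 - abs(KYTE_DOOLITTLE[a] - KYTE_DOOLITTLE[b]) / _SPAN


# Valine compared with isoleucine and with threonine, the pairs mentioned earlier.
print(hydropathy_similarity("V", "I") > hydropathy_similarity("V", "T"))  # True
```

A matrix built from one property captures only that property: two residues with equal hydropathy but very different size or charge would be scored as fully similar, which is why the choice of underlying model matters.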
• The Dayhoff Matrix
• The Evolutionary Distance Scale
• The PET matrices
• The GCB Matrix
• The BLOSUM Matrices
• Dotplots - graphical analysis of similarity