Dr Mumthas Yahiya

The concept of sequence similarity & Idea of Scoring Matrices

Aligning two sequences is the foundation of bioinformatics. It may fairly be said that sequence alignment is the operation upon which everything else is built. Sequence alignments are:
➢ the starting points for methods predicting de novo the secondary structure of proteins,
➢ a prerequisite for all knowledge-based tertiary structure predictions,
➢ the basis for estimating the total number of different types of protein folds,
➢ the basis for interpreting data of genome sequencing projects, and
➢ the basis for inferring phylogenetic trees and resolving questions of ancestry between species.

This power of sequence alignments stems from the empirical finding that if two biological sequences are sufficiently similar, they almost invariably have similar biological functions and are descended from a common ancestor. The simplest measure of similarity is of course identity:

Identity: Two proteins that have a certain number of amino acids in common at aligned positions are said to be identical to that degree. For example, if they have 43 residues out of a total of 144 in common, they are 29.9% identical.

Similarity: Often a number of residues will be replaced by ones with similar physicochemical properties. Such mutations may be termed conservative, and one may define various scoring schemes that quantify how similar the two sequences are, taking conservative mutations into account. Such scores are measures of similarity.

Homology: If (and only if!) two proteins are evolutionarily related and stem from a common ancestor, they are called homologous. Similarity and homology are distinct concepts and must not be confused: similarity is a measured quantity, whereas homology is a statement about ancestry.

Metrics for Similarity scores

A concept of sequence similarity always implies a metric - a quantitative statement of how similar we judge two sequences to be. The distribution of similar or identical residues within two sequences can itself be a source of valuable information.
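The simplest such metric, percent identity, can be computed directly from an existing alignment. The following is a minimal sketch, not a definitive implementation: the function name `percent_identity`, the use of `-` as the gap character, and the choice to divide by the number of gap-free columns are all assumptions made for illustration. The notes only state "out of a total of 144", and other denominators (for example, the length of the shorter sequence) are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences of equal length.

    Assumption: columns containing a gap in either sequence are ignored,
    so the denominator is the number of gap-free aligned positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Keep only the columns where both sequences have a residue.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0

    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Three of four aligned residues agree.
print(percent_identity("ACDE", "ACDF"))   # 75.0

# The worked figure from the text: 43 identical residues out of 144.
print(round(100.0 * 43 / 144, 1))         # 29.9
```

Note that the result depends on the alignment supplied: the function only counts matches and does not search for the best alignment.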
If we compile the similarity scores that we give to different amino-acid pairs into a matrix, the result is called a scoring matrix. The term mutation data matrix is also frequently used: this is a scoring matrix compiled from the observation of point mutations between aligned sequences.

A metric of similarity between amino acid pairs - e.g. quantitatively how similar a valine residue is to an isoleucine, or how related it is to a threonine - can be defined in a number of ways. It is very important to realize that all subsequent results depend critically on just how this is done and on what model underlies the construction of a specific scoring matrix. A scoring matrix is a tool to quantify how well a certain model is represented in the alignment of two sequences, and any result obtained by its application is meaningful exclusively in the context of that model.

The simplest metric in use is the identity metric. If two amino acids are the same, they are given one score; if they are not, they are given a different score, regardless of what the replacement is. One may give a score of 1 for matches and 0 for mismatches - this leads to the frequently used unitary matrix. Or one could assign +6 for a match and -1 for a mismatch. This would be a matrix useful for local alignment procedures, where a negative expectation value for randomly aligned sequences is required to ensure that the score will not grow simply from extending the alignment in a random way.

A very crude model of an evolutionary relationship could be implemented in a scoring matrix in the following way: since all point mutations arise from nucleotide changes, the probability that an observed amino acid pair is related by chance rather than by inheritance should depend on the number of point mutations necessary to transform one codon into the other. A metric resulting from this model would define the distance between two amino acids as the minimal number of nucleotide changes required.
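The genetic-code metric just described can be sketched in a few lines. This is an illustrative sketch only: it assumes the standard genetic code (written in the compact TCAG-ordered form), ignores stop codons, and the names `CODON_TABLE`, `CODONS_FOR`, `codon_distance`, and `genetic_code_distance` are hypothetical helpers introduced here rather than part of any library.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (first base varies slowest); '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group codons by the amino acid they encode, skipping stop codons.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        CODONS_FOR.setdefault(aa, []).append(codon)


def codon_distance(c1: str, c2: str) -> int:
    """Number of positions at which two codons differ (0 to 3)."""
    return sum(x != y for x, y in zip(c1, c2))


def genetic_code_distance(aa1: str, aa2: str) -> int:
    """Minimal number of nucleotide changes needed to turn some codon
    for aa1 into some codon for aa2 (one-letter amino acid codes)."""
    return min(codon_distance(c1, c2)
               for c1 in CODONS_FOR[aa1]
               for c2 in CODONS_FOR[aa2])


print(genetic_code_distance("V", "I"))  # 1  (e.g. GTT -> ATT)
print(genetic_code_distance("V", "T"))  # 2  (GTx -> ACx)
print(genetic_code_distance("M", "Y"))  # 3  (ATG -> TAT or TAC)
```

To use these distances as a scoring matrix for alignment, they would still need to be converted into similarity scores (for example, by subtracting the distance from a constant); that conversion is a design choice not specified in the text.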
Indeed, this genetic code matrix already improves the sensitivity and specificity of alignments compared with the identity matrix. The genetic code matrix works for aligning related proteins in much the same way as matrices derived from amino-acid properties, since the genetic code itself is said to minimize the effects of point mutations.

Other similarity scoring matrices might be constructed from any property of amino acids that can be quantified (partition coefficients between hydrophobic and hydrophilic phases, charge, and molecular volume, to name only a few).

• The Dayhoff Matrix
• The Evolutionary Distance Scale
• The PET matrices
• The GCB Matrix
• The BLOSUM Matrices
• Dotplots - graphical analysis of similarity