Lecture 4 Sequence alignment and searching Sequence alignment What is the purpose of sequence alignment? Identification of homology and homologous sites in related sequences Inference of evolutionary history that lead to the differences in observed sequences Types of comparisons and alignment methods According to sequence According to Coverage: number od sequences: LOCAL GLOBAL TWO SEQUENCES (Pairwise alignment) Database seqrch against query sequences BLAST algorithm Comparoson of two sequences; First step in multiple alignment THREE OR MORE SEQUENCES (Multiple alignment) Defining consensus sequences, protein structural motifs and domains, regulatory elements in DNA etc. Determination of conserved residues and domains; Introductory step in mplecular phylogenetic analysis Introduction to sequence alignment Given two text strings: First string = a b c d e Second string = a c d e f a reasonable alignment would be a b c d e a - c d e f For the sequences gctgaacg and ctataatc: An uninformative alignment: -------gctgaacg ctataatc------- An alignment without gaps gctgaacg ctataatc An alignment with gaps We must choose criteria so that algorithm can choose the best alignment. gctga-a--cg --ct-ataatc And another gctg-aa-cg -ctataatc- Source: Lesk, Introduction to Bioinformatics The dotplot (1) A simple picture that gives an overview of the similarities between two sequences Dotplot showing identities between short name (DOROTHYHODGKIN) and full name (DOROTHYCROWFOOTHODGKIN)of a famous protein crystallographer: Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are the first and last names DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and crowfoOTHodgkin, or the RO of doROthy and cROwfoot, are noise. Source: Lesk, Introduction to Bioinformatics The dotplot (2) Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal. Source: Lesk, Introduction to Bioinformatics The dotplot (3) Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal. Source: Lesk, Introduction to Bioinformatics The dotplot (4) Source: Lesk, Introduction to Bioinformatics Dotplots and sequence alignments Any path through the dotplot from upper left to lower right passes through a succession of cells, each of which picks out a pair of positions, one from the row and one from the column, that correspond in the alignment; or that indicates a gap in one of the sequences. The path need not pass through filled-in points only. However, the more filled-in points on the diagonal segments of the path, the more matching residues in the alignment. Corrseponding alignment: DOROTHY--------HODGKIN DOROTHYCROWFOOTHODGKIN Source: Lesk, Introduction to Bioinformatics Measures of sequence similarity Two measures of distance between two character strings: agtc cgta The Hamming distance, defined between two strings of equal length, is the number of positions with mismatching characters. The Levenshtein, or edit distance, between two strings of not necessarily equal length, is the minimal number of 'edit operations' required to change one string into the other, where an edit operation is a deletion, insertion or alteration of a single character in either sequence. A given sequence of edit operations induces a unique alignment, but not vice versa. Hamming distance = 2 ag-tcc cgctca Levenshtein distance = 3 Source: Lesk, Introduction to Bioinformatics Scoring schemes In molecular biology, certain changes are more likely to occur than others Amino acid substitutions tend to be conservative In nucleotide sequences, transitions are more frequent than transversions -> We want to give different weights to different edit operations Hamming and Lenenshtein distance are measues of dissimilarity. In MB it is more common to measure similarity (higher similarity = higher score) Example: a DNA substitution matrix: Source: Lesk, Introduction to Bioinformatics a g c t a 20 10 5 5 g 10 20 5 5 c 5 5 20 10 t 5 5 10 20 Observed number of substitutions Start frequencies for a genetic matrix (upper triangular matrix) Expected number of random mutation events Start frequencies for PAM matrices (lower triangular matrix) Derivation of substitution matrices PAM Matrices First substitution matrices widely used Based on the point-accepted-mutation (PAM) model of evolution (Dayhoff, 1978) PAMs are relative measures of evolutionary distance 1 PAM = 1 accepted mutation per 100 AAs Does not mean that after 100 PAMs every AA will be different? Why or why not? To view and compare different substitution matrices, visit http://eta.embl-heidelberg.de:8000/misc/mat/ PAM Matrices Derived from global alignments of closely related sequences. Matrices for greater evolutionary distances are extrapolated from those for lesser ones. The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances. Does not take into account different evolutionary rates between conserved and non-conserved regions. BLOSUM Matrices Henikoff, S. & Henikoff J.G. (1992) Use blocks of protein sequence fragments from different families (the BLOCKS database) Amino acid pair frequencies calculated by summing over all possible pairs in block Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over particular threshold = same cluster) Similar idea to PAM matrices Probabilities estimated from blocks of sequence fragments Blocks represent structurally conserved regions BLOSUM Matrices Target frequencies are identified directly instead of extrapolation. Sequences more than x% identitical within the block where substitutions are being counted, are grouped together and treated as a single sequence BLOSUM 50 : >= 50% identity BLOSUM 62 : >= 62 % identity To view and compare different substitution matrices, visit http://eta.embl-heidelberg.de:8000/misc/mat/ BLOSUM Matrices Derived from local, ungapped alignments of distantly related sequences All matrices are directly calculated; no extrapolations are used – no explicit model The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances. The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches (Proteins 17:49). How do you prefer it: done right or done fast? Exact methods - the result is guaranteed to be (mathematically) optimal Pairwise alignment Multiple alignment Needleman-Wunsch (global) Smith-Waterman (local) Rarely used; quickly becomes computationally intractable with increasing number of sequences and/or their length Heuristic methods: make some assumptions that hold most, but not all of the time Pairwise alignment – BLAST Multiple alignment – all widely used multiple alignment programs (ClustalW, TCofee…) Exact pairwise methods: Dynamic programming Equivalence matrix Cumulative matrix (can be computed from Cor N-terminus) An example of a cumulative matrix (computed from Cterminus) with traces of possible alignments) What does it mean? Principle of reduction of number of paths that need to be examined: If a path from X->Z passes through Y, the best path from X->Y is independent of the best path from Y->Z Source: Lesk, Introduction to Bioinformatics BLAST – the workhorse of bioinformatcs http://www.ncbi.nlm.nih.gov/BLAST BLAST = Basic local alignment search tool When you have a nucleotide or protein sequence that you want to search against sequence databases to determine what the sequence is to find related sequences (homologs) Bioinformatics researchers use standalone version of BLAST, run from the command line or other programs Different BLAST programs BLAST algorithm principles (Basic Local Alignment Search Tool) Main idea: 1. 2. query Construct a dictionary of all the words in the query Initiate a local alignment for each word match between query and DB Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman DB BLAST Original Version Dictionary: All words of length k (~11) Alignment initiated between of alignment score T (typically T = k) Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold …… query words …… scan DB query BLAST Original Version A C G A A G T A A G G T C C A G T k = 4, T=4 The matching word GGTC initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A Example: Gapped BLAST A C G A A G T A A G G T C C A G T Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A Added features: Gapped BLAST A C G A A G T A A G G T C C A G T Pairs of words can initiate alignment Nearby alignments are merged Extensions with gaps until score < T below best score so far Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A Added features: BLAST: Example of sequence search Query sequence or accession number Sequence database to search BLAST: Example of sequence search Please come back later BLAST results: List of hits E–value - number of unrelated databank sequences expected to yield same or higher score by pure chance BLAST results: High scoring pairs (HSPs) Fundamental unit of the BLAST algorithm output HSP (high scoring pair) Aligned fragments of query and detected sequence with similarity score exceeding a set cutoff value E-value (Expectation) Score = 61 (27.8 bits), Expect = 1.8e-65, Sum P(4) = 1.8e-65 Identities = 10/17 (58%), Positives = 16/17 (94%) Query: 81 SGDLSMLVLLPDEVSDL 97 +GD+SM +LLPDE++D+ Sbjct: 259 AGDVSMFLLLPDEIADV 275 HSP BLAST Confidence measures Score and bit-score : depend on scoring method E-value (Expect value) : number of unrelated database sequences expected to yield same or higher score by pure chance P-value (Probability) : probability that database yields by pure chance at least one alignment with same or higer score The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported. (E-value approching zero => significant alignment. Less than 0,01 = almost always found to be homologous; 1e-10 for nucleotide searches of 1e-4 for protein searches= frequently related) In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. Be careful when comparing E- or P-values from different searches Comparison is only meaningful for different query sequences searched agains the same database with the same BLAST parameters Variants of BLAST MEGABLAST: Optimized to align very similar sequences PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated WU-BLAST: (Washington University BLAST) Works best when k = 4i 16 Linear gap penalty Optimized, added features BlastZ Combines BLAST/PatternHunter methodology Good for local alignments of non-coding (regulatory) regions IMPORTANT: When NOT to use BLAST A typical situation: you have a piece of human DNA sequence and want to extend it or find what gene it belongs to. In other words, you want an exact or near-exact match to a sequence that is part of an assembled genome. Genome browsers are equipped with very fast algorithms for finding near-exact matches in genomic sequences: BLAT (with UCSC Genome Browser) Highly recommended: the BLAT paper (Kent WJ (2003) Genome Res 12:656-64) - it is famous for its unorthodox writing style SSAHA (with Ensembl) BLAT: BLAST-like alignment tool Differences with respect to BLAST: Dictionary of the DB (BLAST: query) Initiate alignment on any # of matching hits (BLAST: 1 or 2 hits) More aggressive stitching-together of alignments Special code to align mature mRNA to DNA Result: very efficient for: Aligning many sequences to a DB Aligning two genomes