New stuff

Dynamic programming
We want to align the following two sequences:
ABCDE
PQRST
If you already have the optimal solution for:
A…D
P…R
then you know the next pair of characters will be one of these:
A…DE    A…D-    A…DE
P…RS    P…RS    P…R-
You can extend the match by determining which of these has the highest score.

New best alignment = previous best + local best
[Diagram: best previous alignment of Sequence A and Sequence B]

Dynamic programming (DP)
• General class of algorithms typically applied to optimization problems.
• Recursive approach.
• The original problem is broken into smaller subproblems, which are then solved.
• The pieces of the larger problem have a sequential dependency.
• The 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on…

DP algorithms
• Global alignment - Needleman-Wunsch
• Local alignment - Smith-Waterman
• Guaranteed to provide the optimal alignment.
• Disadvantages:
  • Slow due to the very large number of computational steps: O(n²).
  • Computer memory requirements also increase with the square of the sequence lengths.
  • Therefore, it is difficult to use the method for very long sequences.
• Many alignments may give the same optimum score, and none of them is guaranteed to correspond to the biologically correct alignment.

Homology vs. similarity again
• Just a reminder of an important concept in sequence analysis: homology. It is a conclusion about a common ancestral relationship drawn from sequence similarity.
• Sequence similarity is a direct observation from the sequence alignment. It can be quantified using percentages, but homology cannot!
• It is important to understand this difference between homology and similarity.
• If the similarity is high enough, a common evolutionary relationship can be inferred.

Limits of the alignment detection
• However, what is enough? How many mutations can occur before the differences make two sequences unrecognizable?
• Intuitively, at some point two homologous sequences become so divergent that they no longer align well.

Twilight zone
• The identity level at which a homologous relationship can be inferred depends on the type of sequence (protein or nucleic acid) and on the length of the alignment.
• At any given position, two unrelated DNA sequences have at least a 25% chance of being identical. For proteins, it is 5%. If gaps are allowed, this percentage can increase to 10-20%.
• The shorter the sequence, the higher the chance that an alignment is due to random chance.
• This suggests that shorter sequences require a higher cutoff for inferring homology than longer sequences.
[Figure: twilight zone, with the 30% level marked. Source: Essential Bioinformatics, Xiong]

Determining homology
• It must be stressed that percentage identity values provide only tentative guidance for homology identification.
• This is not a precise rule for determining sequence relationships, especially for sequences in the twilight zone.
• A statistically more rigorous approach to determining homologous relationships exists: the statistical significance of the alignment (i.e. of its score) can be tested.
• However, I will not cover this advanced topic in this lecture.
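
The identity percentages discussed in the slides above can be computed directly from an alignment. The following Python sketch is only an illustration under stated assumptions: the function names `percent_identity` and `random_dna`, the gap symbol `-`, and the choice of dividing by the full alignment length are all assumptions made here (other definitions of percent identity use a different denominator). The second part compares two random, unrelated DNA sequences with equal base frequencies and without gaps; their identity should come out close to 25%.

```python
import random


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns holding the same non-gap character.

    Assumption: the denominator is the total number of alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both characters are equal and not gaps.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA sequence with equal base frequencies (an assumption)."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    # Small aligned example with one mismatch and one gap column.
    print(percent_identity("ABCDE-", "ABQDEF"))

    # Two unrelated random sequences, compared position by position.
    rng = random.Random(0)
    seq_a = random_dna(10_000, rng)
    seq_b = random_dna(10_000, rng)
    print(percent_identity(seq_a, seq_b))  # expected to be close to 25
```

Short random sequences fluctuate much more around the expected value than long ones, which is one way to see why shorter alignments need a higher identity cutoff.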
Database similarity searching

Sequence database searching
[Diagram: query sequence → pairwise alignment against the target sequence database → closely related matches]

Database searching requirements
• sensitivity – the ability to find as many correct hits (TP) as possible
• selectivity (specificity) – the ability to exclude incorrect hits (FP)
• speed
• ideally: high sensitivity, high specificity, high speed
• reality: an increase in sensitivity leads to a decrease in specificity, and an improvement in speed often comes at the cost of lowered sensitivity and selectivity

Types of algorithms
• exhaustive
  • uses a rigorous algorithm to find the exact solution for a particular problem by examining all mathematical combinations
  • example: dynamic programming
• heuristic
  • a computational strategy to find an empirical or near-optimal solution by using rules of thumb

Heuristic algorithms
• Perform faster searches because they examine only a fraction of the possible alignments examined in regular dynamic programming.
• Currently, there are two major algorithms:
  • FASTA
  • BLAST - Basic Local Alignment Search Tool, the "Google of the sequence world"
• Not guaranteed to find the optimal alignment or true homologs, but 50–100 times faster than DP.
• The increased computational speed comes at a moderate expense of sensitivity and specificity of the search, which is easily tolerated by working molecular biologists.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990 Oct 5;215(3):403-10.

Two components of BLAST
• BLAST consists of two components:
  • a search algorithm and
  • the evaluation of the quality of solutions

BLAST – ALGORITHM

BLAST strategy
• Basic Local Alignment Search Tool
• Find short stretches (words) of identical or nearly identical letters in two sequences.
• The basic assumption is that two related sequences must have at least one word in common.
• By first identifying word matches, a longer alignment can be obtained by extending the similarity regions outward from the words.
• Once regions of high sequence similarity are found, adjacent high-scoring regions can be joined into a full alignment.

How BLAST works – 1st step
Divide the query sequence into words of length W (W = 3 for proteins).
Query: LGQALWGQIWW
Words: LGQ GQA QAL ALW LWG WGQ GQI QIW IWW

How BLAST works – 1st step
For each of these words, a list of similar words is created using a substitution matrix (default: BLOSUM62).
Query: LGQALWGQIWW
Example for the word LWG (scored against itself: L = 4, W = 11, G = 6):
Word   Score
LWG    21
IWG    19
MWG    19
VWG    18
FWG    17
LYG    12
LFG    11
FWS    11
AWS    10
...    ...
Only words that reach the threshold T are kept.

How BLAST works – 2nd step
Scan the database sequences for exact matches with the high-scoring words.
High-scoring words: LWG IWG MWG VWG FWG LYG

How BLAST works – 3rd step
Extend the exact matches to a high-scoring segment pair (HSP).
Matched word: LYG
query sequence:      L  G  Q  A  L  W  G  Q  I  W  W
score per position: -1 -4 -1  4  4  2  6  1  4 -4 -2
database sequence:   Y  I  T  A  L  Y  G  R  I  N  C   (within WTDFGYITALYGRINC)
Score S as the extension proceeds: 12, 17, 20, 12

How BLAST works – 3rd step
Extend the exact matches to a high-scoring segment pair (HSP).
A recent improvement (BLAST 2.0) enables the explicit treatment of gaps.
Matched word: LYG
query sequence:      LGQALWGQIWW
score per position:  -1 -4 -1 4 4 2 6 1 4 -4 -2
database sequence:   WTDFGYITALYGRINC
HSP score: S = 20

How BLAST works
• Under certain conditions, HSPs can be joined to extend the alignment:
  • overlapping HSPs
  • HSPs that are not too distant from each other

1. The query sequence is cut into words of length W. For each word, a list of similar words is created using a substitution matrix.
2. The database sequences are scanned for matches with the word list.
3. Each match is extended on both sides of the word to give a high-scoring pair.

BLAST parameters
W: Word size. Find W-mers in target/query; 2-3 (3) for proteins, 6-11 (28) for nucleic acids.
T: Neighborhood word score threshold. Focus on word pairs scoring more than T; usually 11-13.
X: Drop-off. Stop extending when the score loss is higher than X.
S: Score. The final score of an HSP (this is not a parameter, just a result).

BLAST variants

BLAST parameters
• Adjusting T and W controls both the speed and the sensitivity (TP) of BLAST.
• When T is raised, the search becomes faster, but fewer hits are registered, so distantly related database matches may be missed.
• When T is lowered, the search proceeds more slowly, but many more word hits are evaluated, and thus sensitivity is increased.
• To speed up BLASTN, increase W (T is not used in BLASTN; words are always identical).
• To speed up BLASTP, set W = 3 and T to a large value.
• W and T are better suited for controlling speed than X.

Which sequence to search?
• The choice of the type of sequence also influences the sensitivity of the search.
• There is a clear advantage in using protein sequences for detecting homologs.
• If the input sequence is a protein-encoding DNA sequence, use BLASTX (the query is translated in all six reading frames before sequence comparison).
• If you're looking for protein homologs encoded in newly sequenced genomes, you may use TBLASTN. This may help to identify protein-coding genes that have not yet been annotated.
• If a DNA sequence is to be used as the query, a protein-level comparison can be done with TBLASTX.
• TBLASTN and TBLASTX are very computationally intensive, and the search can be very slow.

BLAST – quality assessment

E-value
• expected value
• The E-value estimates the expected number of records in the database that would be returned by chance with a score as good as or better than the score of the record under scrutiny.
• An E-value of 1 means that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
• A value close to zero means that you would expect practically no unrelated sequence to score as high against your query sequence.

The interpretation of E-value
• The primary use of the E-value is to help answer the question 'Is this alignment statistically meaningful?'. It does not tell whether the alignment has biological meaning!
• What is the highest E-value that I should consider significant?
  • There is no definite answer; it depends on your goals and sequences.
  • Generally, the lower the better. Commonly used value: 1E-6.
  • But, in some cases, this may be too restrictive.

Bit score
• A typical BLAST output reports E-values and scores.
• There are two kinds of scores: raw scores and bit scores.
• Raw scores S are calculated from the substitution matrix and the gap penalty parameters.
• The bit score S' is calculated from the raw score by normalizing with the statistical variables that define a given scoring system.
• Bit scores from different alignments, even those employing different scoring matrices in separate BLAST searches, can be compared.
• E-values cannot be compared when searching in different databases. The bit scores, however, will remain the same.
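
The relationship between raw score, bit score and E-value can be sketched with the standard Karlin-Altschul formulas, S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m is the query length and n the total database length. The slides do not give these formulas, so the Python sketch below is an illustration only. The function names `bit_score` and `e_value` are hypothetical, the values of λ and K are illustrative placeholders that depend on the scoring system actually used, the query and database lengths are made up, and the length corrections that BLAST applies in practice are ignored.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score S to a bit score: S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S').

    Assumption: plain sequence lengths are used, with no length correction.
    """
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Illustrative statistical parameters (assumed, not taken from the slides).
    LAMBDA, K = 0.267, 0.041
    QUERY_LEN = 250  # hypothetical query length

    bits = bit_score(100, LAMBDA, K)
    # Same alignment, same bit score, two hypothetical database sizes.
    for db_len in (1_000_000, 100_000_000):
        e = e_value(bits, QUERY_LEN, db_len)
        print(f"db_len={db_len:>11,}  bit score={bits:.1f}  E-value={e:.2e}")
```

In this sketch the bit score is identical for both database sizes, while the E-value grows in proportion to the database length, which is why E-values from searches in different databases are not directly comparable.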