BCB 444/544 F07 ISU Dobbs #8 - Finish DP, Scoring Matrices, Stats & BLAST 9/7/07

BCB 444/544 Lecture 8
Finish: Dynamic Programming, Global vs Local Alignment
Scoring Matrices & Alignment Statistics
BLAST
#8_Sept7

Required Reading (before lecture)
√ Last week - for Lectures 4-7: Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
  • Xiong: Chp 3
  • Eddy: What is Dynamic Programming? 2004 Nature Biotechnol 22:909
    http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
√ Wed Sept 5 - for Lecture 7 & Lab 3: Database Similarity Searching: BLAST (nope, more DP)
  • Chp 4 - pp 51-62
Fri Sept 7 - for Lecture 8 (will finish on Monday): BLAST variations; BLAST vs FASTA
  • Chp 4 - pp 51-62

Assignments & Announcements
√ Tues Sept 4 - Lab #2 Exercise Writeup due by 5 PM
  Send via email to Pete Zaback petez@iastate.edu (For now, no late penalty - just send ASAP)
√ Wed Sept 5 - Notes for Lecture 5 posted online
  - HW#2 posted online & sent via email & handed out in class
Fri Sept 14 - HW#2 Due by 5 PM
Fri Sept 21 - Exam #1

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Chp 3 - Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• Methods - cont
• Scoring Matrices
• Statistical Significance of Sequence Alignment

Methods
• √ Global and Local Alignment
• √ Alignment Algorithms
  • √ Dot Matrix Method
  • Dynamic Programming Method - cont
    • Gap penalties
    • DP for Global Alignment
    • DP for Local Alignment
• Scoring Matrices
  • Amino acid scoring matrices
  • PAM
  • BLOSUM
  • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Dynamic Programming - 4 Steps:
1. Define score of optimal alignment, using recursion
2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
3. Calculate score of optimal alignment(s)
4. Trace back through matrix to recover optimal alignment(s) that generated optimal score

1- Define Score of Optimal Alignment using Recursion
Define:
  x1..i = Prefix of length i of x
  y1..j = Prefix of length j of y
  S(i,j) = Score of optimal alignment of x1..i and y1..j
  α = Match Reward
  β = Mismatch Penalty
  γ = Gap penalty
  σ(xi,yj) = α (match) or β (mismatch)

Initial conditions:
$$S(0,0) = 0, \qquad S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma$$

Recursive definition, for 1 ≤ i ≤ N, 1 ≤ j ≤ M:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$

2- Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
• Construct sequence vs sequence matrix
• Fill in from [0,0] to [N,M] (row by row), calculating best possible score for each alignment ending at residues at [i,j]
[Figure: DP matrix with S(0,0) = 0 in the upper left corner, S(i,j) in the interior and S(N,M) in the lower right corner]

How do we calculate S(i,j)?
i.e., Score for alignment of x[1..i] to y[1..j]?
1 of 3 cases ⇒ optimal score for this subproblem:

  Case                  x row                 y row                 Score
  xi aligns to yj       x1 x2 . . . xi-1 xi   y1 y2 . . . yj-1 yj   S(i-1,j-1) + σ(xi,yj)
  xi aligns to a gap    x1 x2 . . . xi-1 xi   y1 y2 . . . yj  -     S(i-1,j) - γ
  yj aligns to a gap    x1 x2 . . . xi  -     y1 y2 . . . yj-1 yj   S(i,j-1) - γ

Specific Example:
Note: I changed sequences on this slide (to match the rest of DP example)
Scoring Consequence?
  Case 1: Line up xi with yj      ⇒ Match Bonus (added to score at [i-1, j-1])
  Case 2: Line up xi with space   ⇒ Space Penalty (subtracted from score at [i-1, j])
  Case 3: Line up yj with space   ⇒ Space Penalty (subtracted from score at [i, j-1])
[Figure: example alignments of x and y illustrating the three cases]

Ready? Fill in DP Matrix
Use the initialization and recursion defined in step 1; S(i,j) depends only on S(i-1,j-1), S(i-1,j) and S(i,j-1).

Fill in the DP matrix !!
Keep track of dependencies of scores (in a pointer matrix)
+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5
   A  -10
   T  -15
   T  -20
   C  -25
   A  -30
   C  -35

3- Calculate Score S(N,M) of Optimal Alignment - for Global Alignment
+10 for match, -2 for mismatch, -5 for space

        λ    C    T    C    G    C    A    G    C
   λ    0   -5  -10  -15  -20  -25  -30  -35  -40
   C   -5   10    5    0   -5  -10  -15  -20  -25
   A  -10    5    8    3   -2   -7    0   -5  -10
   T  -15    0   15   10    5    0   -5   -2   -7
   T  -20   -5   10   13    8    3   -2   -7   -4
   C  -25  -10    5   20   15   18   13    8    3
   A  -30  -15    0   15   18   13   28   23   18
   C  -35  -20   -5   10   13   28   23   26   33

S(N,M) = 33

4- Trace back through matrix to recover optimal alignment(s) that generated the optimal score
How? "Repeat" alignment calculations in reverse order, starting from position with highest score and following path, position by position, back through matrix
Result? Optimal alignment(s) of sequences

Traceback - for Global Alignment
Start in lower right corner & trace back to upper left
Each arrow introduces one character at end of alignment:
• A horizontal move puts a gap in left sequence
• A vertical move puts a gap in top sequence
• A diagonal move uses one character from each sequence
Can have >1 optimal alignment; this example has 2

Traceback to Recover Alignment
Where did red arrows come from?
• Where did 33 come from? Match = 10, so 33 - 10 = 23
  Must have come from diagonal
• Where did 23 come from? (Not a match)
  Left? 28 - 5 = 23; Diag? 13 - 2 = 11; Top? 8 - 5 = 3
  Must have come from left
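The four DP steps above can be sketched in Python as follows. This is a minimal sketch, not code from the lecture: the function name `global_align`, its parameter names, and the tie-breaking order in the traceback (diagonal, then vertical, then horizontal) are assumptions made for illustration. It uses the lecture's linear gap penalty and returns only one of the optimal alignments.

```python
def global_align(x, y, match=10, mismatch=-2, gap=5):
    """Global alignment by DP with a linear gap penalty (illustrative sketch).

    x is the "left" sequence (rows), y is the "top" sequence (columns).
    Returns the score matrix S and ONE optimal alignment (aligned x, aligned y).
    """
    n, m = len(x), len(y)
    # Steps 1-2: initialize S(i,0) = -i*gap and S(0,j) = -j*gap, then fill row by row.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma,   # x_i aligned to y_j
                          S[i - 1][j] - gap,         # x_i aligned to a gap
                          S[i][j - 1] - gap)         # y_j aligned to a gap

    # Step 3: the optimal score is S[n][m].
    # Step 4: trace back from the lower right corner to the upper left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:   # diagonal move
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:   # vertical move: gap in top sequence
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                        # horizontal move: gap in left sequence
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return S, "".join(reversed(ax)), "".join(reversed(ay))


S, left_aln, top_aln = global_align("CATTCAC", "CTCGCAGC")
print(S[-1][-1])   # 33, the lower right cell of the matrix above
print(top_aln)     # C-TCGCAGC
print(left_aln)    # CAT-TCA-C
```

Because the traceback here commits to the first move that explains a cell's score, it follows a single path; recovering every optimal alignment requires following all ties.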
Traceback to Recover Alignment (cont)
+10 for match, -2 for mismatch, -5 for space
• Where did 8 come from?
  Two possibilities: 13 - 5 = 8 (from left) or 10 - 2 = 8 (from diagonal)
• Then, follow both paths (#1 and #2)
[Figure: DP matrix for Top: C T C G C A G C vs Left: C A T T C A C, with the two traceback paths #1 and #2 marked from 33 back to 0]

Great - but what are the alignments?

What are the 2 Global Alignments with Optimal Score = 33?
Top:  C T C G C A G C
Left: C A T T C A C

1:  C - T C G C A G C
    C A T - T C A - C

2:  C - T C G C A G C
    C A T T - C A - C

or, Check Traceback?
• h = horizontal move puts a gap in left sequence
• v = vertical move puts a gap in top sequence
• d = diagonal move uses one character from each sequence
Moves along each path, read from upper left (0) to lower right (33):
  1:  d v d h d d d h d
  2:  d v d d h d d h d

Check the scores: +10 for match, -2 for mismatch, -5 for space
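To "follow both paths" and "check the scores" mechanically, the traceback can branch at every tie. The sketch below is an illustration under the lecture's scoring scheme (+10 match, -2 mismatch, -5 space); the names `all_global_alignments` and `score_alignment` are hypothetical helpers, not part of the lecture.

```python
def all_global_alignments(x, y, match=10, mismatch=-2, gap=5):
    """Return (optimal score, list of ALL optimal global alignments).

    x is the left sequence, y is the top sequence; each alignment is a
    pair (aligned x, aligned y). Illustrative sketch with a linear gap penalty.
    """
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * gap
    for j in range(1, m + 1):
        S[0][j] = -j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sigma, S[i - 1][j] - gap, S[i][j - 1] - gap)

    results = []

    def walk(i, j, ax, ay):
        # ax, ay hold the alignment built so far, in reverse order.
        if i == 0 and j == 0:
            results.append((ax[::-1], ay[::-1]))
            return
        if i > 0 and j > 0:
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sigma:      # diagonal move
                walk(i - 1, j - 1, ax + x[i - 1], ay + y[j - 1])
        if i > 0 and S[i][j] == S[i - 1][j] - gap:      # vertical move
            walk(i - 1, j, ax + x[i - 1], ay + "-")
        if j > 0 and S[i][j] == S[i][j - 1] - gap:      # horizontal move
            walk(i, j - 1, ax + "-", ay + y[j - 1])

    walk(n, m, "", "")
    return S[n][m], results


def score_alignment(ax, ay, match=10, mismatch=-2, gap=5):
    """Re-score a finished alignment column by column."""
    total = 0
    for a, b in zip(ax, ay):
        if a == "-" or b == "-":
            total -= gap
        else:
            total += match if a == b else mismatch
    return total


best, alignments = all_global_alignments("CATTCAC", "CTCGCAGC")
for left_aln, top_aln in alignments:
    print(top_aln, left_aln, score_alignment(left_aln, top_aln))
```

For the lecture's example this prints two alignments, each re-scoring to 33.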
Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example), often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"

Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
  vs
Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
  G G T C T G A G
  A A A C G A
Match: +2
Mismatch or space: -1
Best local alignment (the aligned region is C T G A over C - G A):
  G G T C T G A G
  A A A C - G A -
Score = 5

Local Alignment: Algorithm
This slide has been changed!
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix: In local alignment, no negative scores
   Assign "0" to cells with negative scores
3) Optimal score? in highest scoring cell(s)
4) Optimal alignment(s)? Traceback from each cell containing the optimal score, until a cell with "0" is reached (not just from lower right corner)

Local Alignment DP: Initialization & Recursion
New Slide
$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \\ 0 \end{cases}$$

Filling in DP Matrix for Local Alignment
No negative scores - fill in "0"
+1 for match, -1 for mismatch, -5 for space

        λ   C   T   C   G   C   A   G   C
   λ    0   0   0   0   0   0   0   0   0
   C    0   1   0   1   0   1   0   0   1
   A    0   0   0   0   0   0   2   0   0
   T    0   0   1   0   0   0   0   1   0
   T    0   0   1   0   0   0   0   0   0
   C    0   1   0   2   0   1   0   0   1
   A    0   0   0   0   1   0   2   0   0
   C    0   1   0   1   0   2   0   1   1

Traceback - for Local Alignment
+1 for match, -1 for mismatch, -5 for space
Start from each cell that holds the optimal score (2 in this example; there are 4 such cells) and trace back until a cell with "0" is reached.
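The local alignment algorithm above differs from the global one only in the zero floor and in where the traceback starts and stops. A minimal Python sketch, assuming a linear gap penalty; the function name `local_align`, its parameters, and the diagonal-first tie-breaking are illustrative assumptions, and only one path is traced per optimal cell.

```python
def local_align(x, y, match=1, mismatch=-1, gap=5):
    """Local alignment by DP (illustrative sketch, linear gap penalty).

    x is the left sequence (rows), y is the top sequence (columns).
    Returns (best score, list of (aligned x segment, aligned y segment)),
    one alignment per cell that holds the best score.
    """
    n, m = len(x), len(y)
    # 1) Top row and leftmost column are initialized with 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    # 2) Fill in the matrix; negative scores are replaced by 0.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sigma,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
    # 3) The optimal score is in the highest scoring cell(s).
    best = max(max(row) for row in H)
    alignments = []
    if best == 0:
        return best, alignments
    # 4) Trace back from each optimal cell until a cell with 0 is reached.
    for si in range(1, n + 1):
        for sj in range(1, m + 1):
            if H[si][sj] != best:
                continue
            i, j, ax, ay = si, sj, [], []
            while i > 0 and j > 0 and H[i][j] > 0:
                sigma = match if x[i - 1] == y[j - 1] else mismatch
                if H[i][j] == H[i - 1][j - 1] + sigma:      # diagonal move
                    ax.append(x[i - 1]); ay.append(y[j - 1])
                    i, j = i - 1, j - 1
                elif H[i][j] == H[i - 1][j] - gap:          # vertical move
                    ax.append(x[i - 1]); ay.append("-")
                    i -= 1
                else:                                       # horizontal move
                    ax.append("-"); ay.append(y[j - 1])
                    j -= 1
            alignments.append(("".join(reversed(ax)), "".join(reversed(ay))))
    return best, alignments


# The "Local Alignment: Example" slide: match +2, mismatch or space -1.
print(local_align("GGTCTGAG", "AAACGA", match=2, mismatch=-1, gap=1))
# The matrix example: +1 for match, -1 for mismatch, -5 for space.
print(local_align("CATTCAC", "CTCGCAGC"))
```

With the first scoring scheme the sketch reports a best score of 5 for `CTGA` over `C-GA`; with the second it reports a best score of 2 reached in four cells.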
The 4 local alignments with optimal score = 2 (Top: C T C G C A G C, Left: C A T T C A C), listed as top segment over left segment:

1:  C A        (top positions 5-6)
    C A        (left positions 1-2)

2:  T C        (top positions 2-3)
    T C        (left positions 4-5)

3:  C A        (top positions 5-6)
    C A        (left positions 5-6)

4:  T C G C    (top positions 2-5)
    T C A C    (left positions 4-7)

Check the scores: +1 for match, -1 for mismatch, -5 for space

Some Results re: Alignment Algorithms
(for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to solve a problem
• expressed as a function of size of the problem

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties
used to reflect cost differences between opening a gap and extending an existing gap

Total Gap Penalty is a linear function of gap length:
$$W = \gamma + \delta \times (k - 1)$$
where
  γ = gap opening penalty
  δ = gap extension penalty
  k = length of gap

Can also be solved in O(nm) time using DP

Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
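As a small worked illustration of the gap penalty formula W = γ + δ × (k - 1), the sketch below compares affine, constant and linear penalties. The function names and the numeric values (γ = 11, δ = 1, linear penalty 5 per position) are assumptions chosen only for the example, not values given in the lecture.

```python
def affine_gap_penalty(k, gamma=11, delta=1):
    """W = gamma + delta * (k - 1) for a gap of length k >= 1; 0 if there is no gap."""
    if k <= 0:
        return 0
    return gamma + delta * (k - 1)


def constant_gap_penalty(k, gamma=11):
    """Same penalty for any gap, regardless of its length."""
    return gamma if k > 0 else 0


def linear_gap_penalty(k, per_position=5):
    """Every gap position costs the same (the scheme used in the DP examples)."""
    return per_position * max(k, 0)


# One gap of length 4 versus four separate gaps of length 1:
print(affine_gap_penalty(4))        # 11 + 1 * 3 = 14
print(4 * affine_gap_penalty(1))    # 4 * 11 = 44
print(linear_gap_penalty(4))        # 20, identical to 4 * linear_gap_penalty(1)
```

Under the affine scheme a single long gap is much cheaper than several short ones, which is the cost difference between opening and extending a gap that the slide describes; the linear scheme cannot express it.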
"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM

PAM Matrix
PAM = Point Accepted Mutation
relies on "evolutionary model" based on observed differences in alignments of closely related proteins
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  • PAM1 - for less divergent sequences (shorter time)
  • PAM250 - for more divergent sequences (longer time)

BLOSUM Matrix
BLOSUM = BLOck SUbstitution Matrix
based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM62 - for less divergent sequences

PAM250 vs BLOSUM62
See Text: Fig 3.5 = PAM250, Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed (it is symmetric)
Here: s(a,b) corresponds to score of aligning character a with character b

Which is Better? PAM or BLOSUM
• PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees - but, not very good for highly divergent sequences
• BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

Which Type of Matrix Should You Use?
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
  • very robust in tree construction
• "Best" matrix depends on task:
  • different matrices for different applications
ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result

Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed distribution with a long tail

How to Assess Statistical Significance of an Alignment?
• Compare score of an alignment with distribution of scores of alignments for many "randomized" (shuffled) versions of the original sequence
• If score is in extreme margin, then unlikely due to random chance
• P-value = probability that original alignment is due to random chance (lower P is better)
    P = 10^-5 to 10^-50    sequences have clear homology
    P > 10^-1              no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 4 Database Similarity Searching

Chp 4 - Database Similarity Searching
• Unique Requirements of Database Searching
• Heuristic Database Searching
• Basic Local Alignment Search Tool (BLAST)
• FASTA
• Comparison of FASTA and BLAST
• Database Searching with Smith-Waterman Method

Today's Lab (Lab 3): focus on BLAST - Basic Local Alignment Search Tool

Exhaustive vs Heuristic Methods
Exhaustive - tests every possible solution
• guaranteed to give best answer (identifies optimal solution)
• can be very time/space intensive!
• e.g., Dynamic Programming as in Smith-Waterman algorithm
Heuristic - does NOT test every possibility
• no guarantee that answer is best (but, often can identify optimal solution)
• sacrifices accuracy (potentially) for speed
• uses "rules of thumb" or "shortcuts"
• e.g., BLAST & FASTA

BLAST - Basic Local Alignment Search Tool
STEPS:
1. Create list of every possible "word" (e.g., 3-11 letters) from query sequence
2. Search database to identify sequences that contain matching words
3. Score match of word with sequence, using a substitution matrix
4. Extend match (seed) in both directions, while calculating alignment score at each step
5. Continue extension until score drops below a threshold (due to mismatches)
High Scoring Segment Pair (HSP) - contiguous aligned segment pair (no gaps)

BLAST - a few details
Developed by Stephen Altschul at NCBI in 1990
• Word length?
  • Typically: 3 aa for protein sequence, 11 nt for DNA sequence
• Substitution matrix?
  • Default is BLOSUM62
  • Can change under Algorithm Parameters
  • Choose other BLOSUM or PAM matrices
• Stop-Extension Threshold?
  • Typically: 22 for proteins, 20 for DNA

BLAST Results?
• Original version of BLAST?
  • List of HSPs (High Scoring Segment Pairs), i.e., the maximal scoring segment pairs
• More recent, improved version of BLAST?
  • Allows gaps: Gapped Alignment
  • How? Allows score to drop below threshold (but only temporarily)

BLAST - Statistical Significance?
1. E-value: E = m x n x P
   m = total number of residues in database
   n = number of residues in query sequence
   P = probability that an HSP is result of random chance
   lower E-value, less likely to result from random chance, thus higher significance
2. Bit Score: S'
   normalized score, to account for differences in sequence length & size of database
3. Low Complexity Masking
   remove repeats that confound scoring
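The BLAST STEPS listed above (word list, word lookup, ungapped extension of each seed) can be illustrated with a toy Python sketch. This is not the BLAST implementation: it assumes exact word matches instead of substitution-matrix-scored neighborhood words, a simple +1/-1 nucleotide score, and a hypothetical `drop` parameter that stops extension once the running score falls more than `drop` below the best score seen so far. All function and parameter names are illustrative.

```python
from collections import defaultdict


def extend_seed(query, subject, qi, si, w, match=1, mismatch=-1, drop=2):
    """Ungapped extension of a word hit at query[qi:qi+w] / subject[si:si+w].

    Returns (score, query start, subject start, length) of the best segment found.
    """
    best = score = w * match
    # Extend to the right of the seed.
    best_right, k = 0, 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > drop:
            break
    # Extend to the left of the seed, starting again from the best score so far.
    score = best
    best_left, k = 0, 0
    while qi - k - 1 >= 0 and si - k - 1 >= 0:
        score += match if query[qi - k - 1] == subject[si - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
        elif best - score > drop:
            break
    return best, qi - best_left, si - best_left, w + best_left + best_right


def find_hsps(query, subject, w=4):
    """Steps 1-5 in miniature: list words, find exact word hits, extend each hit."""
    # Index every word of length w in the subject (stands in for the database).
    index = defaultdict(list)
    for si in range(len(subject) - w + 1):
        index[subject[si:si + w]].append(si)
    hsps = set()
    # Every word of the query is looked up; each hit is a seed to extend.
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            hsps.add(extend_seed(query, subject, qi, si, w))
    return sorted(hsps, reverse=True)   # highest scoring segment pairs first


hits = find_hsps("GGACGTACGG", "TTACGTACTT")
score, q_start, s_start, length = hits[0]
print(score, "GGACGTACGG"[q_start:q_start + length], "TTACGTACTT"[s_start:s_start + length])
```

In this toy case the seeds ACGT, CGTA and GTAC all extend to the same six-letter segment ACGTAC, so they collapse into one HSP; a second, weaker HSP comes from the word TACG matching at a different offset.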