BCB 444/544
Lecture 9

Finish: Scoring Matrices & Alignment Statistics
BLAST vs FASTA (not yet!)
Smith-Waterman Algorithm

BCB 444/544 F07 ISU Dobbs #9 - Scoring Statistics 9/10/07

Required Reading (before lecture)

Mon Sept 10 - for Lecture 9
BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62

Wed Sept 12 - for Lecture 10 & Lab 4
Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74

Fri Sept 14 - for Lecture 11
Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)

• Good Additional Resource re: Sequence Alignment?
  • Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

Assignments & Announcements - #1

Revised Grading Policy has been posted online (see Handout) - Please review!

Mon Sept 10 - Lab 3 Exercise due 5 PM: to: terrible@iastate.edu
Thu Sept 13 - Graded Lab 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
              Study Guide for Exam 1 will be posted by 5 PM

Assignments & Announcements - #2

Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming

Chp 3 - Sequence Alignment

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 3 Pairwise Sequence Alignment
• √ Evolutionary Basis
• √ Sequence Homology versus Sequence Similarity
• √ Sequence Similarity versus Sequence Identity
• √ Methods - (Dot Plots, DP; Global vs Local Alignment)
• √ Scoring Matrices (PAM vs BLOSUM)
• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
First, let's re-visit DP for Local Alignment:

• Email explaining the "confusion" in Lecture 8 on Friday was sent on Sunday (so you wouldn't try to do HW2 without a better explanation!)
• Answers to the DP examples given in lectures are included in the Lecture PPTs for Lectures 8 (Friday) & 9 (Today):
  • Global Alignment
  • Local Alignment

What are the 2 Global Alignments with Optimal Score = 33?

Top:  C T C G C A G C
Left: C A T T C A C

1:  C - T C G C A G C
    C A T - T C A - C

2:  C - T C G C A G C
    C A T T - C A - C

Check the scores: +10 for match, -2 for mismatch, -5 for space
(each alignment has 5 matches, 1 mismatch and 3 spaces: 50 - 2 - 15 = 33)

Local Alignment: Motivation

• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"

Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example

G G T C T G A G
A A A C G A

Match: +2   Mismatch or space: -1

Best local alignment:
G G T C T G A G
A A A C - G A -

Score = 5 (aligned region C T G A / C - G A: 2 - 1 + 2 + 2 = 5)

Local Alignment: Algorithm (This slide has been changed!)
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix: in local alignment there are no negative scores; assign "0" to any cell whose score would be negative
3) Optimal score? In the highest-scoring cell(s)
4) Optimal alignment(s)? Trace back from each cell containing the optimal score until a cell with "0" is reached (not just from lower right corner)

Local Alignment DP: Initialization & Recursion (New Slide)

$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \delta \\ S(i,j-1) - \delta \\ 0 \end{cases}$$

where $\sigma(x_i, y_j)$ is the match/mismatch score and $\delta$ is the penalty for a space.

Filling in DP Matrix for Local Alignment

No negative scores - fill in "0"

      λ  C  T  C  G  C  A  G  C
  λ   0  0  0  0  0  0  0  0  0
  C   0  1  0  1  0  1  0  0  1
  A   0  0  0  0  0  0  2  0  0
  T   0  0  1  0  0  0  0  1  0
  T   0  0  1  0  0  0  0  0  0
  C   0  1  0  2  0  1  0  0  1
  A   0  0  0  0  1  0  2  0  0
  C   0  1  0  1  0  2  0  1  1

+1 for match, -1 for mismatch, -5 for space

Traceback - for Local Alignment

      λ  C  T  C  G  C  A  G  C
  λ   0  0  0  0  0  0  0  0  0
  C   0  1  0  1  0  1  0  0  1
  A   0  0  0  0  0  0  2  0  0    <- the 2 ends alignment 1
  T   0  0  1  0  0  0  0  1  0
  T   0  0  1  0  0  0  0  0  0
  C   0  1  0  2  0  1  0  0  1    <- the 2 ends alignment 4
  A   0  0  0  0  1  0  2  0  0    <- the 2 ends alignment 2
  C   0  1  0  1  0  2  0  1  1    <- the 2 ends alignment 3

+1 for match, -1 for mismatch, -5 for space

What are the 4 Local Alignments with Optimal Score = 2?

Top:  C T C G C A G C
Left: C A T T C A C

(Try it: place Left against Top for each of alignments 1-4.)
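The four steps and the recursion above can be sketched in Python as follows. This is a minimal illustration, not the course's reference solution: the function names (`smith_waterman`, `best_cells`, `traceback`) are made up for this sketch, the space penalty is passed as a negative score (`space=-5`) rather than as a positive δ, and `traceback` follows only one path per starting cell (preferring the diagonal), which is enough for the lecture example.

```python
def smith_waterman(top, left, match=1, mismatch=-1, space=-5):
    """Fill the local-alignment DP matrix; rows follow `left`, columns follow `top`."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]          # step 1: top row and left column are 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(0,                        # step 2: never go below 0
                          S[i - 1][j - 1] + sub,    # align left[i-1] with top[j-1]
                          S[i - 1][j] + space,      # space in top
                          S[i][j - 1] + space)      # space in left
    best = max(max(row) for row in S)               # step 3: highest-scoring cell(s)
    return S, best


def best_cells(S, best):
    """All (row, column) cells holding the optimal score, in row-major order."""
    return [(i, j) for i, row in enumerate(S) for j, v in enumerate(row) if v == best]


def traceback(S, top, left, i, j, match=1, mismatch=-1, space=-5):
    """Step 4: walk back from cell (i, j) until a 0 is reached."""
    a_top, a_left = [], []
    while S[i][j] > 0:
        sub = match if left[i - 1] == top[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:        # came from the diagonal
            a_top.append(top[j - 1]); a_left.append(left[i - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + space:        # came from above: space in top
            a_top.append("-"); a_left.append(left[i - 1])
            i -= 1
        else:                                       # came from the left: space in left
            a_top.append(top[j - 1]); a_left.append("-")
            j -= 1
    return "".join(reversed(a_top)), "".join(reversed(a_left))


if __name__ == "__main__":
    top, left = "CTCGCAGC", "CATTCAC"
    S, best = smith_waterman(top, left)
    print("optimal score:", best)
    for i, j in best_cells(S, best):
        print((i, j), traceback(S, top, left, i, j))
```

Run on the lecture's sequences with +1 / -1 / -5, the sketch should report the same optimal score and the same four end cells as the traceback matrix above.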
Answers:

Top:  C T C G C A G C
Left: C A T T C A C

1:  C T C G C A G C
            C A T T C A C      aligned region: C A / C A (Top positions 5-6, Left positions 1-2)

2:  C T C G C A G C
    C A T T C A C              aligned region: C A / C A (positions 5-6 of both)

3:      C T C G C A G C
    C A T T C A C              aligned region: T C G C / T C A C

4:      C T C G C A G C
    C A T T C A C              aligned region: T C / T C

Check the scores: +1 for match, -1 for mismatch, -5 for space

Some Results re: Alignment Algorithms (for ComS, CprE & Math types)

• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to solve a problem
• expressed as a function of size of the problem

"Scoring" or "Substitution" Matrices

2 Major types for Amino Acids: PAM & BLOSUM

• PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
• BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix: Point Accepted Mutation (I added 2 bullets to this slide)

Relies on "evolutionary model" based on observed differences in closely related proteins [Dayhoff78]

• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
  • e.g., PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed
• PAM1 matrix is used as basis for calculating other matrices: assumes that repeated mutations would follow the same pattern as those in the PAM1 matrix, and that multiple substitutions can occur at the same site
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM: BLOck SUbstitution Matrix (I added 2 bullets to this slide)

Based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins (in BLOCKS database) [Henikoff & Henikoff92]

• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: avg % aa identity in the MSA from which the matrix was generated
  • e.g., BLOSUM62 is derived from sequence alignments of proteins with no more than 62% identity
• Blocks database contains ungapped aligned segments corresponding to the most highly conserved regions of proteins
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

Scoring Matrices: What are the scores?

See Xiong Textbook:
  Fig 3.5 = PAM250
  Fig 3.6 = BLOSUM62

Usually only 1/2 of the matrix is displayed (it is symmetric)

s(a,b) corresponds to the score of aligning character a with character b

These are log-odds scores: each entry ≈ log( freq(observed) / freq(expected) )
  +  more likely than random
  0  at random base rate
  -  less likely than random

Log-odds scoring

• What are the odds that this alignment is meaningful?
$$\begin{array}{l} x_1 x_2 x_3 \cdots x_N \\ y_1 y_2 y_3 \cdots y_N \end{array}$$

• If sequences are not related: we're observing a chance event, & the probability is:

$$\prod_i p_{x_i} \prod_i p_{y_i}$$

  where $p_x$ is the probability of x, $p_y$ is the probability of y

• If sequences are related by evolution: they are derived from a common ancestor, & the probability is:

$$\prod_i p_{x_i y_i}$$

  where $p_{xy}$ is the joint probability that x and y evolved from the same ancestor

Log-odds scoring matrix

• Odds ratio = relative likelihood of the 2 possibilities:

$$\frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$

• Alignment score = log-odds ratio:

$$S = \sum_i s(x_i, y_i), \qquad \text{where} \quad s(x_i, y_i) = \log \frac{p_{x_i y_i}}{p_{x_i}\, p_{y_i}}$$

• Thus, $s(x_i, y_i)$ gives the substitution matrix score for the pair $x_i$, $y_i$.
• Together, all the scores $s(x_i, y_i)$ define the log-odds scoring matrix

How do we estimate s(x, y)?

• The score for matching x and y is:

$$s(x, y) = \log \frac{p_{xy}}{p_x\, p_y}$$

• $p_{xy}$ is the probability of substituting x and y
• $p_x$ is the probability of amino acid x (on average ~ 5% with 20 amino acids, similarly for $p_y$)

Trusted (manual) alignments of related sequences provide information about biologically permissible mutations

Frequency of amino acid substitutions in trusted alignments is used to generate substitution matrices

A Few Words about Parameter Selection in Sequence Alignment

Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function

$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i, y_j) \\ S(i-1,j) - \delta \\ S(i,j-1) - \delta \end{cases}$$

(here $\sigma$ comes from the substitution matrix and $\delta$ from the gap penalty function)

In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)

How do we pick parameters that give the most biologically meaningful alignments and alignment scores?

Which is the Better Substitution Matrix?
PAM or BLOSUM?

• PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees - but not very good for highly divergent sequences
• BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

Empirical Tests May be Needed:

Several other types of matrices are available:
• Gonnet & Jones-Taylor-Thornton:
  • very robust in tree construction
• "Best" matrix depends on task:
  • different matrices for different applications

ADVICE: if unsure, try several different matrices & choose the one that gives the best alignment result

How Should Gaps be Scored?

So far, we've used a simple linear gap penalty function:
a gap of length k contributes $-k \times \delta$ to the score

However, in biological sequences, gaps often occur in clusters:

AGKLAVRSTMIESTRVILTWRKW
AGKLAVRS------RVILTWRKW

More realistic? "Affine" gap penalty:

$$w(k) = \gamma + (k - 1) \times \delta$$

where $\gamma$ = gap opening penalty and $\delta$ = gap extension penalty

Penalty for one long gap is smaller than penalty for many smaller gaps that add up to the same size

Affine Gap Penalty Functions

Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap

Total Gap Penalty is a function of gap length:

$$W = \gamma + \delta \times (k - 1)$$

where
  $\gamma$ = gap opening penalty
  $\delta$ = gap extension penalty
  k = length of gap

Can also be solved in O(nm) time using DP

Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty

Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty

• Alignment score is the sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap

Alignment:                 a  b  c  -  -  d
                           a  c  c  e  f  d
Match score:               9  2  7        6    => 24   (values from substitution matrix)
Gap opening + extension:   (10 + 2) = 12
Alignment score:           24 - 12 = 12

Sequence Alignment Statistics

• Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed distribution with a long tail

How to Assess Statistical Significance of an Alignment?
• Compare the score of an alignment with the distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
• If the score is in the extreme margin, it is unlikely to be due to random chance
• P-value = probability that a score at least as high as that of the original alignment arises by random chance (lower P means alignment is more significant)

  P = $10^{-5}$ to $10^{-50}$    sequences have clear homology
  P > $10^{-1}$                  alignment is no better than random

Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
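The shuffling idea above can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions, not how PRSS itself works internally: the names `local_score` and `shuffle_p_value` are hypothetical, the scoring defaults (+1 / -1 / -2, linear gap) are arbitrary choices for the sketch, and the P-value is a plain empirical fraction of shuffles that score at least as well, rather than a fit to an extreme value distribution.

```python
import random


def local_score(a, b, match=1, mismatch=-1, space=-2):
    """Best local alignment score (linear gap penalty), keeping only two DP rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            sub = match if ca == cb else mismatch
            cur.append(max(0,
                           prev[j - 1] + sub,     # align ca with cb
                           prev[j] + space,       # space in b
                           cur[j - 1] + space))   # space in a
        best = max(best, max(cur))
        prev = cur
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0, **scoring):
    """Empirical P-value: how often does a shuffled b score >= the real b?"""
    rng = random.Random(seed)                      # fixed seed for repeatability
    observed = local_score(a, b, **scoring)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)                       # same composition, random order
        if local_score(a, "".join(letters), **scoring) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    score, p = shuffle_p_value("ACGTACGTTAGC", "ACGTACGTTAGC")
    print("observed score:", score, "empirical P:", p)
```

With only a few hundred shuffles the smallest P this sketch can report is about 1/(n_shuffles + 1), so values such as $10^{-5}$ or below require either many more shuffles or a fitted score distribution.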