#9 Scoring Statistics 9/10/07
BCB 444/544 F07 ISU Dobbs #9 - Scoring Statistics 9/10/07

BCB 444/544 Lecture 9
Finish: Scoring Matrices & Alignment Statistics
BLAST vs FASTA (not yet!)
Smith-Waterman Algorithm
#9_Sept10

Required Reading (before lecture)
Mon Sept 10 - for Lecture 9: BLAST variations; BLAST vs FASTA, SW
• Chp 4 - pp 51-62
Wed Sept 12 - for Lecture 10 & Lab 4: Multiple Sequence Alignment (MSA)
• Chp 5 - pp 63-74
Fri Sept 14 - for Lecture 11: Position Specific Scoring Matrices & Profiles
• Chp 6 - pp 75-78 (but not HMMs)
• Good Additional Resource re: Sequence Alignment?
• Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

Assignments & Announcements - #1
Revised Grading Policy has been posted online (see Handout) - Please review!
Mon Sept 10 - Lab 3 Exercise due 5 PM, to: terrible@iastate.edu
Thu Sept 13 - Graded Lab 3 will be returned at beginning of Lab 4
Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
Study Guide for Exam 1 will be posted by 5 PM

Assignments & Announcements - #2
Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1

• Answers to DP Examples given in Lectures are included in Lecture PPTs for Lectures 8 (Friday) & 9 (Today):
• Global Alignment
• Local Alignment

SEQUENCE ALIGNMENT
Pairwise Sequence Alignment
√ Evolutionary Basis
√ Sequence Homology versus Sequence Similarity
√ Sequence Similarity versus Sequence Identity
√ Methods - (Dot Plots, DP; Global vs Local Alignment)
√ Scoring Matrices (PAM vs BLOSUM)
Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.
Exam 1 (Fri Sept 21) will cover:
• Lectures 2-12
• Labs 1-4
• HW2
• All assigned reading: Chps 2-6 (but not HMMs)
• Eddy: What is Dynamic Programming

• Email explaining "confusion" in Lecture 8 on Friday was sent on Sunday (so you wouldn't try to do HW2 without a better explanation!)

SECTION II: Chp 3 - Sequence Alignment (Xiong: Chp 3)

First, let's re-visit DP for Local Alignment:

What are the 2 Global Alignments with Optimal Score = 33?
Top: C T C G C A G C
Left: C A T T C A C
1: C - T C G C A G C
   C A T - T C A - C
2: C - T C G C A G C
   C A T T - C A - C
Check the scores: +10 for match, -2 for mismatch, -5 for space
(each alignment has 5 matches, 1 mismatch and 3 spaces: 50 - 2 - 15 = 33)

Local Alignment: Motivation
• To "ignore" stretches of non-coding DNA:
  • Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  • Local alignment between two protein-encoding sequences is likely to be between two exons
• To locate protein domains or motifs:
  • Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities
  • Local sequence similarities may indicate "functional modules"
Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes
vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein

Local Alignment: Example
Sequences: G G T C T G A G and A A A C G A
Match: +2   Mismatch or space: -1
Best local alignment:
G G T C T G A G
A A A C - G A -
(aligned region: C T G A over C - G A)
Score = 2 - 1 + 2 + 2 = 5

Local Alignment: Algorithm
(This slide has been changed!)
1) Initialize top row & leftmost column of matrix with "0"
2) Fill in DP matrix: in local alignment, no negative scores; assign "0" to cells with negative scores
3) Optimal score? In highest scoring cell(s)
4) Optimal alignment(s)? Trace back from each cell containing the optimal score until a cell with "0" is reached (not just from the lower right corner)

Local Alignment DP: Initialization & Recursion
$$S(0,0) = 0, \qquad S(i,0) = 0, \qquad S(0,j) = 0$$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \\ 0 \end{cases}$$
where s(x_i, y_j) is the match/mismatch score for aligning x_i with y_j and γ is the penalty for a space

Filling in DP Matrix for Local Alignment
(New Slide)
No negative scores - fill in "0"
+1 for match, -1 for mismatch, -5 for space

      λ  C  T  C  G  C  A  G  C
  λ   0  0  0  0  0  0  0  0  0
  C   0  1  0  1  0  1  0  0  1
  A   0  0  0  0  0  0  2  0  0
  T   0  0  1  0  0  0  0  1  0
  T   0  0  1  0  0  0  0  0  0
  C   0  1  0  2  0  1  0  0  1
  A   0  0  0  0  1  0  2  0  0
  C   0  1  0  1  0  2  0  1  1

Traceback - for Local Alignment
Same matrix as above; trace back from each of the four cells containing the optimal score 2 (marked 1-4 on the slide).
+1 for match, -1 for mismatch, -5 for space

What are the 4 Local Alignments with Optimal Score = 2?
Top: C T C G C A G C
Left: C A T T C A C
Each of the four alignments ends at one of the cells with score 2:
• C A (Top 5-6) with C A (Left 1-2)
• C A (Top 5-6) with C A (Left 5-6)
• T C G C (Top 2-5) with T C A C (Left 4-7): 3 matches, 1 mismatch
• T C (Top 2-3) with T C (Left 4-5)
Check the scores: +1 for match, -1 for mismatch, -5 for space

Some Results re: Alignment Algorithms (for ComS, CprE & Math types)
• Most pairwise sequence alignment problems can be solved in O(mn) time
• Space requirement can be reduced to O(m+n), while keeping run-time fixed [Myers88]
• Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]

for Biologists: Big O notation
• used when analyzing algorithms for efficiency
• refers to time or number of steps it takes to solve a problem
• expressed as a function of size of the problem

"Scoring" or "Substitution" Matrices
2 Major types for Amino Acids: PAM & BLOSUM
• PAM = Point Accepted Mutation
  relies on "evolutionary model" based on observed differences in alignments of closely related proteins
• BLOSUM = BLOck SUbstitution Matrix
  based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins

PAM Matrix: Point Accepted Mutation
(I added 2 bullets to this slide)
Relies on "evolutionary model" based on observed differences in closely related proteins [Dayhoff78]
• Model includes defined rate for each type of sequence change
• Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
• e.g., PAM1 matrix estimates what rate of substitution would be expected if 1% of the amino acids had changed
• PAM1 matrix is used as basis for calculating other matrices: assumes that repeated mutations would follow the same pattern as those in
PAM1 matrix, and multiple substitutions can occur at the same site
• PAM1 - for less divergent sequences (shorter time)
• PAM250 - for more divergent sequences (longer time)

BLOSUM: BLOck SUbstitution Matrix
(I added 2 bullets to this slide)
Based on % aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins (in BLOCKS database) [Henikoff & Henikoff92]
• Doesn't rely on a specific evolutionary model
• Suffix number (n) reflects expected similarity: avg % aa identity in MSA from which matrix was generated
• e.g., BLOSUM62 is derived from sequence alignments of proteins with no more than 62% identity
• Blocks database contains ungapped aligned segments corresponding to the most highly conserved regions of proteins
• BLOSUM45 - for more divergent sequences
• BLOSUM62 - for less divergent sequences

Scoring Matrices: What are the scores?
See Xiong Textbook: Fig 3.5 = PAM250, Fig 3.6 = BLOSUM62
Usually only 1/2 of matrix is displayed (it is symmetric)
s(a,b) corresponds to score of aligning character a with character b
These are log-odds scores: each entry ~ log(freq(observed)/freq(expected))
  + → more likely than random
  0 → at random base rate
  - → less likely than random

Log-odds scoring
• What are the odds that this alignment is meaningful?
  x1 x2 x3 … xN
  y1 y2 y3 … yN
• If sequences are related by evolution: they are derived from a common ancestor, & the probability is:
$$\prod_i p_{x_i y_i}$$
  where p_xy is the joint probability that x and y evolved from the same ancestor
• If sequences are not related: we're observing a chance event, & the probability is:
$$\prod_i p_{x_i} \prod_i p_{y_i}$$
  where p_x is the probability of x, p_y is the probability of y

Log-odds scoring matrix
• Odds ratio = relative likelihood of the 2 possibilities:
$$\frac{\prod_i p_{x_i y_i}}{\prod_i p_{x_i} \prod_i p_{y_i}} = \prod_i \frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}$$
• Alignment score = log-odds ratio:
$$S = \sum_i s(x_i, y_i), \qquad s(x_i, y_i) = \log\left(\frac{p_{x_i y_i}}{p_{x_i} p_{y_i}}\right)$$
• Thus, s(x_i, y_i) gives the substitution matrix score for the pair x_i, y_i
• Together, all the scores s(x_i, y_i) define the log-odds scoring matrix

How do we estimate s(x, y)?
• The score for matching x and y is:
$$s(x, y) = \log\left(\frac{p_{xy}}{p_x p_y}\right)$$
• p_xy is the probability of substituting x and y
• p_x is the probability of amino acid x (on average ~5% with 20 amino acids; similarly for p_y)
Trusted (manual) alignments of related sequences provide information about biologically permissible mutations
Frequency of amino acid substitutions in trusted alignments is used to generate substitution matrices

A Few Words about Parameter Selection in Sequence Alignment
Optimal alignment between a pair of sequences depends critically on the selection of substitution matrix & gap penalty function:
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + s(x_i, y_j) \\ S(i-1,j) - \gamma \\ S(i,j-1) - \gamma \end{cases}$$
In using BLAST or similar software, it is important to understand and, sometimes, to adjust these parameters (default is NOT always best!)
How do we pick parameters that give the most biologically meaningful alignments and alignment scores?

Which is Better Substitution Matrix?
PAM or BLOSUM
• PAM matrices
  • derived from evolutionary model
  • often used in reconstructing phylogenetic trees - but not very good for highly divergent sequences
• BLOSUM matrices
  • based on direct observations
  • more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment

Empirical Tests May be Needed
Several other types of matrices available:
• Gonnet & Jones-Taylor-Thornton:
  • very robust in tree construction
• "Best" matrix depends on task:
  • different matrices for different applications
ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result

How Should Gaps be Scored?
So far, we've used a simple linear gap penalty function: a gap of length k incurs penalty -k x γ
However, in biological sequences, gaps often occur in clusters:
AGKLAVRSTMIESTRVILTWRKW
AGKLAVRS------RVILTWRKW
More realistic? "Affine" gap penalty: penalty for one long gap is smaller than penalty for many smaller gaps that add up to the same size

Affine Gap Penalty Functions
Affine Gap Penalties = Differential Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap
Total gap penalty is a function of gap length: W = w(k)
$$w(k) = \gamma + (k - 1)\,\delta$$
where
γ = gap opening penalty
δ = gap extension penalty
k = length of gap
Can also be solved in O(nm) time using DP
Sometimes, a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty

Calculating an Alignment Score using a Substitution Matrix & an Affine Gap Penalty
• Alignment score is sum of all match/mismatch scores (from substitution matrix) with an affine penalty subtracted for each gap

               a  b  c  -  -  d
               a  c  c  e  f  d
Match score:   9  2  7        6   => 24   (values from substitution matrix)
Gap opening + extension: (10 + 2) = 12
Alignment score = 24 - 12 = 12

Sequence Alignment Statistics
• Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
• "Gumbel extreme value distribution" - a highly skewed distribution (unlike the symmetric normal distribution) with a long tail

How to Assess Statistical Significance of an Alignment?
• Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
• If score is in extreme margin, then unlikely due to random chance
• P-value = probability that a score at least as high as that of the original alignment arises by random chance (lower P means alignment more significant)
  P = 10^-5 - 10^-50: sequences have clear homology
  P > 10^-1: alignment is no better than random
Check out: PRSS (Probability of Random Shuffles)
http://www.ch.embnet.org/software/PRSS_form.html
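
The shuffle test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how PRSS itself is implemented: the function names `local_alignment_score` and `shuffle_p_value`, the default of 200 shuffles, and the add-one correction in the estimate are all choices made for this sketch. Scoring uses the simple linear scheme from the lecture examples (+1 match, -1 mismatch, -5 space) rather than a substitution matrix with affine gaps.

```python
import random


def local_alignment_score(a, b, match=1, mismatch=-1, space=-5):
    """Best local alignment score of a and b (DP with a linear gap penalty)."""
    prev = [0] * (len(b) + 1)            # row i-1 of the DP matrix (row 0 is all 0)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)        # row i; column 0 stays 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: no negative scores, so 0 is always an option.
            curr[j] = max(0,
                          prev[j - 1] + sub,     # align a[i-1] with b[j-1]
                          prev[j] + space,       # space in b
                          curr[j - 1] + space)   # space in a
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a, b, n_shuffles=200, seed=0, **scoring):
    """Estimate P by comparing the real score with scores for shuffled copies of b.

    Assumption of this sketch: P is estimated as (hits + 1) / (n_shuffles + 1),
    so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = local_alignment_score(a, b, **scoring)
    letters = list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)             # same composition, random order
        if local_alignment_score(a, "".join(letters), **scoring) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    score, p = shuffle_p_value("CTCGCAGC", "CATTCAC")
    print(f"score = {score}, estimated P = {p:.3f}")
```

With sequences as short as the lecture examples, shuffled versions often score as well as the original, so the estimated P is expected to be large. Very small P-values such as 10^-5 cannot be estimated directly from a few hundred shuffles, which is one reason to fit the shuffled scores to an extreme value distribution instead of only counting them.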