BCB 444/544
Lecture 13
Star Alignment & Clustal (for MSA)
Perhaps: Profiles & Hidden Markov Models (HMMs)
#13_Sept19

BCB 444/544 F07 ISU Dobbs #13- Star Alignment; HMMs 9/19/07

Required Reading (before lecture)

√ Mon Sept 17 - Lecture 12
Position Specific Scoring Matrices & PSI-BLAST
• Chp 6 - pp 75-78 (but not HMMs)

Wed Sept 19 - Lecture 13 (not covered on Exam 1)
Profiles & Hidden Markov Models
• Chp 6 - pp 79-84
• Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
  http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html

Fri Sept 21 - EXAM 1

Assignments & Announcements

√ Sun Sept 16 - Study Guide for Exam 1 was posted
√ Mon Sept 17 - Answers to HW#2 were posted
Thu Sept 20 - Lab = Optional Review Session for Exam
Fri Sept 21 - Exam 1 - Will cover:
• Lectures 2-12 (thru Mon Sept 17)
• Labs 1-4
• HW2
• All assigned reading:
  Chps 2-6 (but not HMMs)
  Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 5 Multiple Sequence Alignment
• √ Scoring Function
• √ Exhaustive Algorithms
• Heuristic Algorithms
  • Star Alignment
  • Clustal
• √ Practical Issues

First, review MSA scoring briefly, then back to Star Alignment & ClustalW

Scoring an Alignment - in Lecture 12, so will be covered on Exam 1

In practice, simple scoring functions are used.
Usually, columns are scored independently:

  $S(m) = \sum_i S(m_i) + G$

where G is the gap penalty and $m_i$ is the i-th column of alignment m.

[Figure: example multiple alignment, scored column by column]

Sum of Pairs (SP) Score

• SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix
• Compute for each column
c: F F I -

  $S(m_i) = \sum_{k < l} s(m_i^k, m_i^l)$

where $m_i^l$ is residue l of column i and s is the PAM or BLOSUM score.

[Figure: example alignment with one column highlighted]

Example: Calculating SP Score

[Table: substitution scores for F, Y, G, D; the entries used below are s(F,F) = 5, s(G,G) = 4, s(G,D) = -3]

Columns of alignment M:
  m1 = (F, F, F)
  m2 = (-, -, Y)
  m3 = (G, G, D)

BLOSUM 60
Gap penalty = -8
s(-,-) = 0

S(m) = S(m1) + S(m2) + S(m3)
     = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
     = 15 - 16 + 0 + 4 - 6
     = -3

Algorithms & Software for MSA? #1

Exhaustive Methods
• √ Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!

Heuristic Methods
• Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Dynamic Programming for MSA

• As with pairwise alignments, MSAs can be computed by dynamic programming*
  *(if you're not in a rush!)

[Figure: 2D vs. 3D dynamic programming matrices]

Generalized Needleman-Wunsch Algorithm

Given 3 sequences x, y, and z:
Main iteration loop:

$$
S(i,j,k) = \max
\begin{cases}
S(i-1,\ j-1,\ k-1) + \delta(x_i, y_j, z_k) \\
S(i-1,\ j-1,\ k) + \delta(x_i, y_j, -) \\
S(i-1,\ j,\ k-1) + \delta(x_i, -, z_k) \\
S(i-1,\ j,\ k) + \delta(x_i, -, -) \\
S(i,\ j-1,\ k-1) + \delta(-, y_j, z_k) \\
S(i,\ j-1,\ k) + \delta(-, y_j, -) \\
S(i,\ j,\ k-1) + \delta(-, -, z_k)
\end{cases}
$$

where $\delta(\cdot,\cdot,\cdot)$ is the score of a single alignment column.

What Happens to Computational Complexity?

Given k sequences of length n:
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$

Wow!!!

What's so bad about those exponents?
Example: Running Time of DP for MSA

• Overall runtime: $O(k^2 2^k n^k)$

# Sequences   Running Time
2             1 second
3             2 minutes
4             5 hours
5             3 weeks
6             9 years

Sequences? Globins, only ≈150 aa !!

But: There are fast heuristics

Progressive Alignment

Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively

Multiple alignment is built by adding sequences.
Often: use guide tree to determine order of alignments

[Figure: sequences 1-4 joined in guide-tree order]

2 Examples:
• Star Alignment
• ClustalW

Guide Trees

Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA

[Figure: guide tree with leaves ATC, ATG, TCG, TCC; internal nodes hold the intermediate alignments and the root holds the final MSA]

Star Alignment - skipped on Monday: will NOT be covered on Exam 1

Back to 2 Examples of Progressive Alignment Heuristics for MSA:
1. STAR Alignment
2. Clustal

Star Alignment

• Fast heuristic to compute MSA
• Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality

Algorithm:
1. Compute pairwise similarities
2. Select center $s_c$ that maximizes $\sum_{i \ne c} S(s_c, s_i)$
3. Add sequences in decreasing order of similarity to center $s_c$
4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

Step 2 - Select center $s_c$ that maximizes $\sum_{i \ne c} S(s_c, s_i)$

Does that function look familiar?
Recall: Consensus sequence = single sequence (more accurately, a "model") that represents most common residue of each column in MSA

  FGGHL-GF
  F-GHLPGF
  FGGHP-FG
  --------
  FGGHL-GF   (consensus)

Steiner consensus sequence or string:
Given sequences $s_1, \dots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$

"String" equivalent of arithmetic mean: consensus sequence is string that minimizes sum of edit distances to members of a family of strings (thus, maximizing similarity score…)

Step 3 - Add sequences in decreasing order of similarity to center s_c

Sequences:
  s1: MPE
  s2: MKE   (center)
  s3: MSKE
  s4: SKE

Pairwise alignments with the center:
  s1  MPE        s3  MSKE       s2  MKE
      | |            | ||            ||
  s2  MKE        s2  M-KE       s4  SKE

Building the multiple alignment:
  s2 + s3       + s1          + s4
  MSKE          M-PE          M-PE
  M-KE          MSKE          MSKE
                M-KE          M-KE
                              S-KE

Step 4 - Produce a multiple alignment M such that for every i: the induced pairwise alignment of s_c and s_i is the same as the optimal alignment of s_c and s_i

Optimal pairwise alignments:
  Sc  AA--CCTT        Sc  A-ACC-TT
  S1  AATGCC--        S2  AGACCGT-

Resulting multiple alignment:
  S1  A-ATGCC---
  Sc  A-A--CC-TT
  S2  AGA--CCGT-

Complexity of Star Alignment?

Given k sequences of length n, and an upper bound l for alignment length
We need:
• $O(k^2 n^2)$ to compute the alignments
• $O(k^2)$ to compute the center
• $O(k^2 l)$ to build multiple alignment

Overall: $O(k^2 n^2)$

Duh - Is this really much better than $O(k^2 2^k n^k)$? YES!
Remember: k = # of sequences, n = length of sequences

CLUSTAL: Overview

[Figure: pairwise alignments → distance matrix → guide tree → progressive alignment]

1. Compute pairwise alignments (DP)
2. Convert similarities into distances
   Distance between a pair = # of mismatched positions in alignment (divided by total # of matches)
3. Build guide tree from distances by Neighbor Joining
4.
Align with respect to guide tree

CLUSTAL: Example

[Figure: CLUSTAL example with sequences 1-5]

One "small" problem? Finding the Guide Tree

Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves
Problem: Such a tree might not exist!

[Figure: distance matrix and guide tree for sequences 1-5]

CLUSTAL W Tree

Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

Algorithms & Software for MSA? #2

√ Exhaustive Methods
• Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!

Heuristic Methods
• √ Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Algorithms & Software for MSA? #3 - will NOT be covered on Exam 1

Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
  • Others: T-Coffee, DbClustal - see text: can be better than Clustal
  • Match closely-related sequences first using a guide tree
• Partial order alignments (POA)
  • Doesn't rely on guide tree; adds sequences in order given
• PRALINE
  • Preprocesses input sequences by building profiles for each
• Iterative methods
  • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (eg: PRRN)
• Block-based Alignment
  • Multiple re-building attempts to find best alignment (eg: DIALIGN2 & Match-Box)
• Local alignments
  • Profiles, Blocks, Patterns - more on these soon!
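As a recap of how the alignments produced by these methods are scored, here is a minimal Python sketch of the sum-of-pairs (SP) score. It reproduces the worked SP example from earlier in the lecture (S(m) = -3). The score table is an assumption limited to the three residue pairs that example uses, the row layout of the toy alignment is inferred from the score expansion, and the function names are illustrative only, not part of any alignment package.

```python
from itertools import combinations

# Assumed toy score table: only the residue pairs needed for the worked example.
SCORES = {("F", "F"): 5, ("G", "G"): 4, ("D", "G"): -3}
GAP_PENALTY = -8  # residue aligned with a gap


def pair_score(a, b):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # s(-,-) = 0
    if a == "-" or b == "-":
        return GAP_PENALTY
    # Keys are stored in sorted order so (G, D) and (D, G) hit the same entry.
    return SCORES[tuple(sorted((a, b)))]


def sp_score(alignment):
    """Sum-of-pairs score: sum over columns of all pairwise symbol scores."""
    total = 0
    for column in zip(*alignment):  # iterate column by column
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


# Rows inferred from the example: columns (F,F,F), (-,-,Y), (G,G,D).
msa = ["F-G", "F-G", "FYD"]
print(sp_score(msa))  # prints -3
```

Because each column is scored independently, the cost of scoring one column grows with the number of sequence pairs, which is the O(k^2) term in the complexity slide.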
Chp 6 - Profiles & Hidden Markov Models

SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 6 Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST

First, review above briefly, then:
• Profiles
• Markov Models & Hidden Markov Models

PSI-BLAST (Covered in Lecture 12, so will be covered on Exam 1)

• Position Specific Iterated BLAST
• Intuition: substitution matrices should be "sensitive" to protein context
  • e.g., larger penalty for Ala→Gly substitution if in a helix rather than in a loop
• Basic idea:
  • Use BLAST with high stringency to generate a set of closely related sequences
  • Align those sequences to create a new substitution matrix for each position
  • Use this matrix (iteratively) to find additional sequences

PSI-BLAST Pseudocode

(PSSM = Position-Specific Scoring Matrix)

Convert query to PSSM (or a Profile)
do {
    BLAST database with PSSM
    Stop if no new homologs are found
    Add new homologs to PSSM    (this step requires a user-defined threshold)
}
Print current set of homologs

Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs - other authors refer to pattern used by PSI-BLAST as a PSSM.

What is a PSSM?
Position-Specific Scoring Matrix

A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position

Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA
Also, sometimes called: Position Weight Matrix (PWM)

PSSM for an 8 residue sequence (columns = positions 1-8):

    1   2   3   4   5   6   7   8
A  -1  -2  -1   0  -1  -2   0  -2
R   5   0   5  -2   1  -3  -2   0
N   0   6   0   0   0  -3   0   1
D  -2   1  -2  -1   0  -3  -1  -1
C  -3  -3  -3  -3  -3  -2  -3  -3
Q   1   0   1  -2   5  -3  -2   0
E   0   0   0  -2   2  -3  -2   0
G  -2   0  -2   6  -2  -3   6  -2
H   0   1   0  -2   0  -1  -2   8
I  -3  -3  -3  -4  -3   0  -4  -3
L  -2  -3  -2  -4  -2   0  -4  -3
K   2   0   2  -2   1  -3  -2  -1
M  -1  -2  -1  -3   0   0  -3  -2
F  -3  -3  -3  -3  -3   6  -3  -1
P  -2  -2  -2  -2  -1  -4  -2  -2
S  -1   1  -1   0   0  -2   0  -1
T  -1   0  -1  -2  -1  -2  -2  -2
W  -3  -4  -3  -2  -2   1  -2  -2
Y  -2  -2  -2  -3  -1   3  -3   2
V  -3  -3  -3  -3  -2  -1  -3  -3

"K" at position 3 gets a score of 2
Note: Assumes positions are independent

Assigning a "Match" Score with a PSSM

PSSM assigns sequence NMFWAFGH a score of:
0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12

Each term is read from the PSSM above: N at position 1, M at position 2, F at position 3, W at position 4, A at position 5, F at position 6, G at position 7, H at position 8.

Creating a PSSM from 1 Sequence

Query sequence (length L): RNRGQFGH
BLOSUM62 matrix: 20 by 20
For each query residue (e.g., R), the BLOSUM62 column for that residue becomes the PSSM column at that position.

    R   N   R   G   Q   F   G   H
A  -1  -2  -1   0  -1  -2   0  -2
R   5   0   5  -2   1  -3  -2   0
N   0   6   0   0   0  -3   0   1
D  -2   1  -2  -1   0  -3  -1  -1
C  -3  -3  -3  -3  -3  -2  -3  -3
Q   1   0   1  -2   5  -3  -2   0
E   0   0   0  -2   2  -3  -2   0
G  -2   0  -2   6  -2  -3   6  -2
H   0   1   0  -2   0  -1  -2   8
I  -3  -3  -3  -4  -3   0  -4  -3
L  -2  -3  -2  -4
   -2   0  -4  -3
K   2   0   2  -2   1  -3  -2  -1
M  -1  -2  -1  -3   0   0  -3  -2
F  -3  -3  -3  -3  -3   6  -3  -1
P  -2  -2  -2  -2  -1  -4  -2  -2
S  -1   1  -1   0   0  -2   0  -1
T  -1   0  -1  -2  -1  -2  -2  -2
W  -3  -4  -3  -2  -2   1  -2  -2
Y  -2  -2  -2  -3  -1   3  -3   2
V  -3  -3  -3  -3  -2  -1  -3  -3

(PSSM: 20 by L)

Creating a PSSM from Multiple Sequences

1. Discard columns that contain gaps in query sequence
2. Compute relative sequence weights
3. Compute PSSM entries, taking into account
   • Observed residues in column
   • Sequence weights
   • Substitution matrix

1 - Discard Columns with Gaps in Query

Before:
  EEFG----SVDGLVNNA
  QKYG----RLDVMINNA
  RRLG----TLNVLVNNA
  GGIG----PVD-LVNNA
  KALG----GFNVIVNNA
  ARFG----KID-LIPNA
  FEPEGPEKGMWGLVNNA
  AQLK----TVDVLINGA

After:
  EEFGSVDGLVNNA
  QKYGRLDVMINNA
  RRLGTLNVLVNNA
  GGIGPVD-LVNNA
  KALGGFNVIVNNA
  ARFGKID-LIPNA
  FEPEGMWGLVNNA
  AQLKTVDVLINGA

2 - Compute Sequence Weights

  Sequence        Weight
  EEFGSVDGLVNNA   1.2
  QKYGRLDVMINNA   1.2
  RRLGTLNVLVNNA   0.8
  GGIGPVDLLVNNA   0.8
  KALGGFNVIVNNA   1.1
  ARFGKIDTLIPNA   0.9
  FEPEGMWGLVNNA   1.1
  AQLKTVDVLINGA   1.3

• Smaller weights are assigned to redundant sequences
• Larger weights are assigned to unique sequences

How are weights determined?
Based on branch lengths in guide tree: value for each sequence is then used to multiply raw alignment scores

Goal of weighting?
to decrease matching scores of frequent characters in MSA & increase scores of infrequent characters

3 - Compute PSSM Entries (simplified version)

  Observed residues (one alignment column): E Q R G K A F A
    divided by
  Background frequencies (usually derived from large sequence database)
    = PSSM column

Background frequencies:

  A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
  G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
  M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
  T 0.063   V 0.073   W 0.016   Y 0.034

PSSM Entries = Log-Odds Scores

1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
3. Take log so that scores can be added (rather than multiplied)

  $\log_2 \frac{\Pr(A \mid M)}{\Pr(A \mid B)}$

where Pr(A | M) is the observed frequency of residue "A" under the foreground model (i.e., the PSSM) and Pr(A | B) is its frequency under the background model.

Why (not) PSI-BLAST?

• PSI-BLAST weights sequences according to observed diversity specific to family under investigation
• Advantage: If sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
• Disadvantage: However, if any non-homologous sequences are included in PSSMs, they become "corrupted" and "pull in" additional non-homologous sequences, resulting in false positive hits

How to Use PSI-BLAST Effectively

• Set initial thresholds high
• Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!)
• Do several iterations (~5), or until no new sequences are found
• Make initial search very broad
  • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
  • Then use that PSSM to search in a more restricted domain, if possible
• Be particularly cautious about matches to sequences with highly biased amino acid content

Summary: DP, BLAST & PSI-BLAST

• Dynamic programming is O(NM) for pairwise alignment
• BLAST is O(M)
  • BLAST produces an index of words in query sequence that allows fast matching to the database
  • At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
• PSI-BLAST iterates BLAST, adding new homologs at each iteration

Applications of MSA

• Building phylogenetic trees
• Finding conserved patterns:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns

Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent

TATA box = transcriptional promoter element

[Figure: sequence logo of the TATA box]

Sequence Motifs (Patterns)

Other types of representations?
• √ Consensus Sequence
• √ PSSM - Position-Specific Scoring Matrix
• √ Sequence Logo - "enhanced" consensus sequence, in which symbol size reflects information entropy
• Information entropy???
  In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data. - Wikipedia
• Check out this fun website: Tom Schneider, NCIF
  • http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• Profile
• HMM - Hidden Markov Model
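To make "information entropy" in a sequence logo concrete, here is a minimal Python sketch that computes the Shannon entropy (in bits) of one alignment column and the corresponding information content, log2(alphabet size) minus entropy. That quantity is commonly used for the total letter height at a logo position. The sketch assumes an ungapped DNA column (alphabet size 4) and applies no small-sample correction; the function names are illustrative only.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of the symbols observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_content(column, alphabet_size=4):
    """Maximum possible entropy minus observed entropy (bits).

    alphabet_size=4 assumes DNA; use 20 for proteins.
    """
    return math.log2(alphabet_size) - column_entropy(column)


# A fully conserved column carries 2 bits; a uniform column carries 0 bits.
for col in ["TTTT", "AATT", "ACGT"]:
    print(col, round(information_content(col), 3))
# TTTT 2.0, AATT 1.0, ACGT 0.0
```

A highly conserved position (such as the core of the TATA box) therefore gets tall letters, while a position where all four bases are equally frequent is nearly flat.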