#13 - Star Alignment; HMMs 9/19/07
BCB 444/544 F07 ISU Dobbs

Lecture 13
Star Alignment & Clustal (for MSA)
Perhaps: Profiles & Hidden Markov Models (HMMs)

Required Reading (before lecture)
√ Mon Sept 17 - Lecture 12
  Position Specific Scoring Matrices & PSI-BLAST
  • Chp 6 - pp 75-78 (but not HMMs)
Wed Sept 19 - Lecture 13 (not covered on Exam 1)
  Profiles & Hidden Markov Models
  • Chp 6 - pp 79-84
  • Eddy: What is a hidden Markov Model? 2004 Nature Biotechnol 22:1315
    http://www.nature.com/nbt/journal/v22/n10/abs/nbt1004-1315.html
Fri Sept 21 - EXAM 1

Assignments & Announcements
• √ Sun Sept 16 - Study Guide for Exam 1 was posted
• √ Mon Sept 17 - Answers to HW#2 were posted
• Thu Sept 20 - Lab = Optional Review Session for Exam
• Fri Sept 21 - Exam 1 - Will cover:
  • Lectures 2-12 (thru Mon Sept 17)
  • Labs 1-4
  • HW2
  • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming?

Chp 5 - Multiple Sequence Alignment
SECTION II  SEQUENCE ALIGNMENT
Xiong: Chp 5
• √ Scoring Function
• √ Exhaustive Algorithms
• Heuristic Algorithms
  • Star Alignment
  • Clustal
• √ Practical Issues

Algorithms & Software for MSA? #1
Exhaustive Methods
• √ Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!

• First, review MSA scoring briefly, then back to Star Alignment & ClustalW

Scoring an Alignment - in Lecture 12, so will be covered on Exam 1
In practice, simple scoring functions are used.
Usually, columns are scored independently:

  $S(m) = \sum_i S(m_i) + G$

where $G$ = gap penalty and $m_i$ = ith column of alignment $m$.
[Figure: example multiple alignments of short peptide sequences (e.g., AFPGQIK), with columns highlighted]

Sum of Pairs (SP) Score
• SP = sum of pairs = sum of scores of all possible pairs of sequences in an MSA, based on a particular scoring matrix
• Compute for each column $m_i$:

  $S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$

where $m_i^l$ = residue $l$ of column $i$ and $s$ = PAM or BLOSUM score.

Example: Calculating SP Score
Alignment M of 3 sequences, with 3 columns m1, m2, m3:
• m1 = (F, F, F)
• m2 = one Y and two gaps
• m3 = (G, G, D)
[Table: substitution scores for the residues F, Y, G, D]
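The column-wise SP scoring defined above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the pair scores are the ones used in this lecture's worked example (s(F,F) = 5, s(G,G) = 4, s(G,D) = -3, gap penalty = -8, s(-,-) = 0), and the assignment of Y and D to particular rows is an assumption, since only the column contents matter for the SP score.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, scores, gap_penalty=-8):
    """Score one pair of symbols taken from the same alignment column."""
    if a == GAP and b == GAP:
        return 0                       # s(-,-) = 0
    if a == GAP or b == GAP:
        return gap_penalty             # residue aligned to a gap
    return scores[frozenset((a, b))]   # symmetric substitution score

def column_score(column, scores, gap_penalty=-8):
    """S(m_i): sum of pair scores over all pairs k < l in one column."""
    return sum(pair_score(a, b, scores, gap_penalty)
               for a, b in combinations(column, 2))

def sp_score(alignment, scores, gap_penalty=-8):
    """S(m): sum of column scores; zip(*rows) iterates over columns."""
    return sum(column_score(col, scores, gap_penalty)
               for col in zip(*alignment))

# Only the substitution scores needed for the lecture example.
EXAMPLE_SCORES = {
    frozenset("F"): 5,    # s(F,F)
    frozenset("G"): 4,    # s(G,G)
    frozenset("GD"): -3,  # s(G,D)
}

# Hypothetical row assignment: columns are (F,F,F), (-,Y,-), (G,G,D).
EXAMPLE_ALIGNMENT = ["F-G", "FYG", "F-D"]

print(sp_score(EXAMPLE_ALIGNMENT, EXAMPLE_SCORES))
```

Scoring each column independently keeps the computation at O(k^2) pair scores per column for k sequences, which is the cost quoted later for the SP score.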
Example: Calculating SP Score (continued)
Gap penalty = -8, s(-,-) = 0, BLOSUM 60

  S(m) = S(m1) + S(m2) + S(m3)
       = 3s(F,F) + 2s(-,Y) + s(-,-) + s(G,G) + 2s(G,D)
       = 15 - 16 + 0 + 4 - 6
       = -3

Algorithms & Software for MSA? #1 (continued)
Heuristic Methods
• Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Dynamic Programming for MSA
• As with pairwise alignments, MSAs can be computed by dynamic programming*
  *(if you're not in a rush!)
[Figure: 2D vs. 3D dynamic programming matrices]

Generalized Needleman-Wunsch Algorithm
Given 3 sequences x, y, and z:
Main iteration loop:

$$
S(i,j,k) = \max \begin{cases}
S(i-1,\,j-1,\,k-1) + \sigma(x_i, y_j, z_k) \\
S(i-1,\,j-1,\,k) + \sigma(x_i, y_j, -) \\
S(i-1,\,j,\,k-1) + \sigma(x_i, -, z_k) \\
S(i-1,\,j,\,k) + \sigma(x_i, -, -) \\
S(i,\,j-1,\,k-1) + \sigma(-, y_j, z_k) \\
S(i,\,j-1,\,k) + \sigma(-, y_j, -) \\
S(i,\,j,\,k-1) + \sigma(-, -, z_k)
\end{cases}
$$

What Happens to Computational Complexity?
Given k sequences of length n:
• Space for matrix: $O(n^k)$
• Neighbors/cell: $2^k - 1$
• Time to compute SP score: $O(k^2)$
• Overall runtime: $O(k^2 2^k n^k)$

What's so bad about those exponents?
Example: Running Time of DP for MSA
• Overall runtime: $O(k^2 2^k n^k)$

    # Sequences    Running Time
    2              1 second
    3              2 minutes
    4              5 hours
    5              3 weeks
    6              9 years

Wow!!! Sequences? Globins: only ≈150 aa!!
But: There are fast heuristics

Progressive Alignment
Heuristic procedure:
1. Align most similar sequences first
2. Add sequences progressively
Often: use guide tree to determine order of alignments
Examples:
• Star Alignment
• ClustalW
[Figure: multiple alignment built by adding sequences (TCG, TCC, ATC, ATG) one at a time]

Guide Trees
Binary tree
• Leaves correspond to sequences
• Internal nodes represent alignments
• Root corresponds to final MSA
[Figure: guide tree with sequences at the leaves and partial alignments at internal nodes]

Star Alignment - skipped on Monday: will NOT be covered on Exam 1
Back to 2 Examples of Progressive Alignment Heuristics for MSA:
1. STAR Alignment
2. Clustal

Star Alignment
• Fast heuristic to compute MSA
• Good approximation of optimal MSA, if scoring scheme satisfies triangle inequality
Algorithm:
1. Compute pairwise similarities
2. Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
3. Add sequences in decreasing order of similarity to center $s_c$
4. Produce a multiple alignment M such that, for every i, the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

Step 2 - Select center $s_c$ that maximizes $\sum_{i \neq c} S(s_c, s_i)$
Does that function look familiar?
Recall: Consensus sequence = single sequence (more accurately, "model") that represents most common residue of each column in MSA

    FGGHL-GF
    F-GHLPGF
    FGGHP-FG
    --------
    FGGHL-GF   (consensus)

Steiner consensus sequence or string: Given sequences $s_1, \ldots, s_k$, find a sequence $s^*$ that maximizes $\sum_i S(s^*, s_i)$
"String" equivalent of arithmetic mean: consensus sequence is the string that minimizes the sum of edit distances to members of a family of strings (thus, maximizing similarity score…)

Step 3 - Add sequences in decreasing order of similarity to center $s_c$

    s1: MPE
    s2: MKE   (center)
    s3: MSKE
    s4: SKE

Pairwise alignments with the center s2:

    s1  MPE      s3  MSKE      s2  MKE
    s2  MKE      s2  M-KE      s4  SKE

Adding sequences one at a time:

    S2+S3     +S1       +S4
    MSKE      M-PE      M-PE
    M-KE      MSKE      MSKE
              M-KE      M-KE
                        S-KE

Step 4 - Produce a multiple alignment M
such that for every i: the induced pairwise alignment of $s_c$ and $s_i$ is the same as the optimal alignment of $s_c$ and $s_i$

    Sc  AA--CCTT      Sc  A-ACC-TT
    S1  AATGCC--      S2  AGACCGT-

    S1  A-ATGCC---
    Sc  A-A--CC-TT
    S2  AGA--CCGT-

Complexity of Star Alignment?
Given k sequences of length n, and an upper bound l for alignment length
We need:
• $O(k^2 n^2)$ to compute the alignments
• $O(k^2)$ to compute the center
• $O(k^2 l)$ to build multiple alignment
Overall: $O(k^2 n^2)$
Duh - Is this really much better than $O(k^2 2^k n^k)$? YES!
Remember: k = # of sequences, n = length of sequences

CLUSTAL: Overview
1. Compute pairwise alignments (DP)
2. Convert similarities into distances
   Distance between a pair = # of mismatched positions in alignment (divided by total # of matches)
3. Build guide tree from distances by Neighbor Joining (one "small" problem: finding the guide tree)
4. Align with respect to guide tree

CLUSTAL: Example
[Figure: Pairwise Alignments (1+2, 1+3, 1+4, 2+3, 2+4, 3+4) → Distance Matrix → Guide Tree → Progressive Alignment]
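Steps 1-3 of the star alignment algorithm above (pairwise similarities, center selection, order of addition) can be sketched as follows. This is an illustrative sketch under stated assumptions: the scoring values (match = 2, mismatch = -1, gap = -2) and all function names are hypothetical choices for this example, not values from the lecture. The four sequences are the ones from the Step 3 example (MPE, MKE, MSKE, SKE). Step 4, merging the pairwise alignments, is not shown.

```python
def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Optimal global (Needleman-Wunsch) alignment score, linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]      # row for empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]                             # a[:i] against empty b
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,       # align a[i-1] with b[j-1]
                            prev[j] + gap,           # gap in b
                            curr[j - 1] + gap))      # gap in a
        prev = curr
    return prev[-1]

def star_center(seqs, score=nw_score):
    """Step 2: index of the sequence maximizing the sum of scores to all others."""
    totals = [sum(score(s, t) for j, t in enumerate(seqs) if j != i)
              for i, s in enumerate(seqs)]
    center = max(range(len(seqs)), key=totals.__getitem__)
    return center, totals

def addition_order(seqs, center, score=nw_score):
    """Step 3: remaining sequences in decreasing similarity to the center."""
    others = [i for i in range(len(seqs)) if i != center]
    return sorted(others, key=lambda i: score(seqs[center], seqs[i]),
                  reverse=True)

seqs = ["MPE", "MKE", "MSKE", "SKE"]   # s1..s4 from the lecture example
center, totals = star_center(seqs)
print(center, totals, addition_order(seqs, center))
```

With these assumed scores the center is s2 (MKE) and the sequences are added in the order s3, s1, s4, which agrees with the S2+S3, +S1, +S4 sequence shown on the slide. A different scoring scheme could produce ties or a different center.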
Finding the Guide Tree
Goal: Given k sequences and their pairwise distances, find a tree such that all distances correspond to path lengths between leaves
Problem: Such a tree might not exist!
[Figure: distance matrix for sequences 1-5 and the corresponding guide tree]

CLUSTAL W Tree
Tree calculated from an alignment of >1100 ring finger domains, using ClustalW 1.83

Algorithms & Software for MSA? #2
√ Exhaustive Methods
• Multidimensional dynamic programming (DP)
• Divide-and-Conquer Alignment (DCA) - "semi-exhaustive"; web-based version available - see textbook
• Full DP Optimal Global Alignment? Prohibitive in both time & space requirements for more than 10 sequences!!
Heuristic Methods
• √ Progressive (Star Alignment, Clustal)
• Iterative
• Block-based

Algorithms & Software for MSA? #3 - will NOT be covered on Exam1
Heuristic Methods - continued
• Progressive alignments (Star Alignment, Clustal)
  • Match closely-related sequences first using a guide tree
  • Others: T-Coffee, DbClustal - see text: can be better than Clustal
• Partial order alignments (POA)
  • Doesn't rely on guide tree; adds sequences in order given
• PRALINE
  • Preprocesses input sequences by building profiles for each
• Iterative methods
  • Idea: optimal solution can be found by repeatedly modifying existing suboptimal solutions (eg: PRRN)
• Block-based Alignment
  • Multiple re-building attempts to find best alignment (eg: DIALIGN2 & Match-Box)
• Local alignments
  • Profiles, Blocks, Patterns - more on these soon!
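Returning to the guide-tree problem noted above (a tree whose leaf-to-leaf path lengths reproduce all pairwise distances might not exist): one standard way to test this is the four-point condition, which is not part of the lecture and is named here as a swapped-in illustration. For a metric, such a tree exists exactly when, for every four sequences, the two largest of the three pairwise-sum combinations are equal. The function name and the two small matrices below are hypothetical.

```python
from itertools import combinations

def is_additive(dist, tol=1e-9):
    """Four-point check on a symmetric distance matrix (list of lists).

    Assumes dist is already a metric; with fewer than four sequences
    there is nothing to check and the function returns True.
    """
    n = len(dist)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([dist[i][j] + dist[k][l],
                       dist[i][k] + dist[j][l],
                       dist[i][l] + dist[j][k]])
        if abs(sums[2] - sums[1]) > tol:   # two largest sums must match
            return False
    return True

# Distances read off a tree: leaves 0,1 and 2,3 form two cherries,
# every leaf edge has length 1 and the internal edge has length 2.
tree_like = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]

# Same matrix with d(0,3) changed to 5: no tree fits all six distances.
not_tree_like = [
    [0, 2, 4, 5],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [5, 4, 2, 0],
]

print(is_additive(tree_like), is_additive(not_tree_like))
```

Distances derived from real alignments rarely pass this check exactly, which is one reason guide trees are built with a heuristic such as Neighbor Joining rather than by exact fitting.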
Profiles & HMMs
• √ Position Specific Scoring Matrices (PSSMs)
• √ PSI-BLAST
First, review above briefly, then:
• Profiles
• Markov Models & Hidden Markov Models

PSI-BLAST
• e.g., a context-sensitive substitution matrix gives a larger penalty for an Ala→Gly substitution in a helix than in a loop
• Basic idea:
  • Use BLAST with high stringency to generate a set of closely related sequences
  • Align those sequences to create a new substitution matrix for each position
  • Use this matrix (iteratively) to find additional sequences

PSI-BLAST pseudocode:

    Convert query to PSSM (or a Profile)
    do {
        BLAST database with PSSM
        Stop if no new homologs are found
        Add new homologs to PSSM
    }
    Print current set of homologs

What is a PSSM?
A PSSM is:
• a representation of a motif
• an n by m matrix, where n is size of alphabet & m is length of sequence
• a matrix of scores in which entry at (i, j) is score assigned by PSSM to letter i at the jth position

Xiong: PSSM = table that contains probability information re: residues at each position of an ungapped MSA
Note: Xiong textbook distinguishes between PSSMs (which have no gaps) & Profiles (can include gaps). Thus, based on these definitions, PSI-BLAST uses a Profile to iteratively add new homologs - other authors refer to pattern used by PSI-BLAST as a PSSM.
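Because a PSSM treats positions as independent, scoring a sequence is a sum of one table entry per position. The sketch below shows this lookup using the example PSSM from this lecture (20-letter alphabet by 8 positions, the BLOSUM62 columns for the query RNRGQFGH). The values were re-entered by hand for this example, and the names `PSSM` and `pssm_score` are hypothetical.

```python
# Rows = residue letters, columns = positions 1..8 of the motif.
PSSM = {
    "A": [-1, -2, -1,  0, -1, -2,  0, -2],
    "R": [ 5,  0,  5, -2,  1, -3, -2,  0],
    "N": [ 0,  6,  0,  0,  0, -3,  0,  1],
    "D": [-2,  1, -2, -1,  0, -3, -1, -1],
    "C": [-3, -3, -3, -3, -3, -2, -3, -3],
    "Q": [ 1,  0,  1, -2,  5, -3, -2,  0],
    "E": [ 0,  0,  0, -2,  2, -3, -2,  0],
    "G": [-2,  0, -2,  6, -2, -3,  6, -2],
    "H": [ 0,  1,  0, -2,  0, -1, -2,  8],
    "I": [-3, -3, -3, -4, -3,  0, -4, -3],
    "L": [-2, -3, -2, -4, -2,  0, -4, -3],
    "K": [ 2,  0,  2, -2,  1, -3, -2, -1],
    "M": [-1, -2, -1, -3,  0,  0, -3, -2],
    "F": [-3, -3, -3, -3, -3,  6, -3, -1],
    "P": [-2, -2, -2, -2, -1, -4, -2, -2],
    "S": [-1,  1, -1,  0,  0, -2,  0, -1],
    "T": [-1,  0, -1, -2, -1, -2, -2, -2],
    "W": [-3, -4, -3, -2, -2,  1, -2, -2],
    "Y": [-2, -2, -2, -3, -1,  3, -3,  2],
    "V": [-3, -3, -3, -3, -2, -1, -3, -3],
}

def pssm_score(seq, pssm):
    """Sum the PSSM entry for each residue at its (0-based) position."""
    width = len(next(iter(pssm.values())))
    if len(seq) != width:
        raise ValueError(f"sequence length {len(seq)} != PSSM width {width}")
    return sum(pssm[residue][pos] for pos, residue in enumerate(seq))

print(PSSM["K"][2])                  # "K" at position 3
print(pssm_score("NMFWAFGH", PSSM))  # 0 + -2 + -3 + -2 + -1 + 6 + 6 + 8
```

Storing one row per letter mirrors the slide layout; a real implementation would more likely use a 2D array and slide this lookup along a longer sequence to find the best-scoring window.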
Chp 6 - Profiles & Hidden Markov Models
SECTION II  SEQUENCE ALIGNMENT
Xiong: Chp 6

Position-Specific Scoring Matrix
Also, sometimes called: Position Weight Matrix (PWM)
Example: 8 residue sequence, 20 letter alphabet

         1   2   3   4   5   6   7   8
    A   -1  -2  -1   0  -1  -2   0  -2
    R    5   0   5  -2   1  -3  -2   0
    N    0   6   0   0   0  -3   0   1
    D   -2   1  -2  -1   0  -3  -1  -1
    C   -3  -3  -3  -3  -3  -2  -3  -3
    Q    1   0   1  -2   5  -3  -2   0
    E    0   0   0  -2   2  -3  -2   0
    G   -2   0  -2   6  -2  -3   6  -2
    H    0   1   0  -2   0  -1  -2   8
    I   -3  -3  -3  -4  -3   0  -4  -3
    L   -2  -3  -2  -4  -2   0  -4  -3
    K    2   0   2  -2   1  -3  -2  -1
    M   -1  -2  -1  -3   0   0  -3  -2
    F   -3  -3  -3  -3  -3   6  -3  -1
    P   -2  -2  -2  -2  -1  -4  -2  -2
    S   -1   1  -1   0   0  -2   0  -1
    T   -1   0  -1  -2  -1  -2  -2  -2
    W   -3  -4  -3  -2  -2   1  -2  -2
    Y   -2  -2  -2  -3  -1   3  -3   2
    V   -3  -3  -3  -3  -2  -1  -3  -3

"K" at position 3 gets a score of 2
Note: Assumes positions are independent

Assigning a "Match" Score with a PSSM
PSSM (table above) assigns sequence NMFWAFGH a score of:
  0 + -2 + -3 + -2 + -1 + 6 + 6 + 8 = 12

Creating a PSSM from 1 Sequence
Query: R N R G Q F G H
BLOSUM62 matrix (20 by 20) → PSSM (20 by L), shown in the table above: column j of the PSSM is the BLOSUM62 column for the jth residue of the query

Creating a PSSM from Multiple Sequences
1. Discard columns that contain gaps in query sequence
2. Compute relative sequence weights
3. Compute PSSM entries, taking into account
   • Observed residues in column
   • Sequence weights
   • Substitution matrix

1- Discard Columns with Gaps in Query

    EEFG----SVDGLVNNA          EEFGSVDGLVNNA
    QKYG----RLDVMINNA          QKYGRLDVMINNA
    RRLG----TLNVLVNNA          RRLGTLNVLVNNA
    GGIG----PVD-LVNNA    →     GGIGPVD-LVNNA
    KALG----GFNVIVNNA          KALGGFNVIVNNA
    ARFG----KID-LIPNA          ARFGKID-LIPNA
    FEPEGPEKGMWGLVNNA          FEPEGMWGLVNNA
    AQLK----TVDVLINGA          AQLKTVDVLINGA

2- Compute Sequence Weights

    EEFGSVDGLVNNA   1.2
    QKYGRLDVMINNA   1.2
    RRLGTLNVLVNNA   0.8
    GGIGPVDLLVNNA   0.8
    KALGGFNVIVNNA   1.1
    ARFGKIDTLIPNA   0.9
    FEPEGMWGLVNNA   1.1
    AQLKTVDVLINGA   1.3

How are weights determined?
Based on branch lengths in guide tree: value for each sequence is then used to multiply raw alignment scores
• Larger weights are assigned to unique sequences
• Smaller weights are assigned to redundant sequences
Goal of weighting?
To decrease matching scores of frequent characters in MSA & increase scores of infrequent characters

3- Compute PSSM Entries
Observed residues (first column of the alignment): E Q R G K A F A
Observed residues / Background frequencies = PSSM column
Background frequencies: usually derived from large sequence database

    A 0.085   C 0.019   D 0.054   E 0.065   F 0.040
    G 0.072   H 0.023   I 0.058   K 0.056   L 0.096
    M 0.024   P 0.053   Q 0.042   R 0.054   S 0.072
    T 0.063   V 0.073   W 0.016   Y 0.034

PSSM Entries = Log-Odds Scores
1. Estimate probability of observing each residue (probability of A given M, where M is PSSM model)
2. Divide by background probability of observing each residue (probability of A given B, where B is background model)
3. Take log so that scores can be added (rather than multiplied)

$$
\log_2 \left( \frac{\Pr(A \mid M)}{\Pr(A \mid B)} \right)
$$

where $\Pr(A \mid M)$ = observed frequency of residue "A" under the foreground model (i.e., the PSSM) and $\Pr(A \mid B)$ = frequency of "A" under the background model.

PSI-BLAST (Covered in Lecture 12, so will be covered on Exam1)
• Position Specific Iterated BLAST
• Intuition: substitution matrices should be "sensitive" to protein context
• PSI-BLAST Pseudocode: one step requires a user-defined threshold

Why (not) PSI-BLAST?
• PSI-BLAST weights sequences according to observed diversity specific to family under investigation
• Advantage: If sequences used to construct PSSMs are all homologous, sensitivity for a given level of specificity improves significantly
• Disadvantage: However, if any non-homologous sequences are included in PSSMs, they become "corrupted" and "pull in" additional non-homologous sequences, resulting in false positive hits

How to Use PSI-BLAST Effectively
• Set initial thresholds high
• Inspect each iteration's result for suspicious sequences (When in doubt, leave it out!)
• Do several iterations (~5), or until no new sequences are found
• Make initial search very broad
  • First, use NR (large, inclusive database) with up to 5 iterations to set PSSM
  • Then use that PSSM to search in a more restricted domain, if possible
• Be particularly cautious about matches to sequences with highly biased amino acid content

Summary: DP, BLAST & PSI-BLAST
• Dynamic programming is O(NM) for pairwise alignment
• BLAST is O(M)
  • BLAST produces an index of words in query sequence that allows fast matching to the database
  • At NCBI, target databases are also pre-indexed to indicate positions in all database sequences that match each possible search word above some score threshold
• PSI-BLAST iterates BLAST, adding new homologs at each iteration

Applications of MSA
• Building phylogenetic trees
• Finding conserved patterns:
  • Regulatory motifs (TF binding sites)
  • Splice sites
  • Protein domains
• Identifying and characterizing protein families
  • Find out which protein domains have same function
• Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
• DNA fragment assembly (in genomic sequencing)

Application: Discover Conserved Patterns
Is there a conserved cis-acting regulatory sequence?
Rationale: if sequences are homologous (derived from a common ancestor), they may be structurally/functionally equivalent
TATA box = transcriptional promoter element
[Figure: Sequence Logo]

Sequence Motifs (Patterns)
Other types of representations?
• √ Consensus Sequence
• √ PSSM - Position-Specific Scoring Matrix
• √ Sequence Logo - "enhanced" consensus sequence, in which symbol size ∝ information entropy
  • Information entropy??? In information theory, the Shannon entropy or information entropy is a measure of the [decrease in] uncertainty associated with a random variable. Entropy quantifies information in a piece of data. - Wikipedia
  • Check out this fun website: Tom Schneider, NCIF
  • http://www.ccrnp.ncifcrf.gov/~toms/glossary.html#sequence_logo
• Profile
• HMM - Hidden Markov Model
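To make the "information entropy" behind sequence logos concrete, the following sketch computes the Shannon entropy of a single alignment column and the corresponding information content in bits. It assumes a DNA alphabet of 4 letters (so the maximum is log2(4) = 2 bits) and ignores the small-sample correction that logo software typically applies; the function names and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(column):
    """Shannon entropy H = -sum(p * log2(p)) over residues in one column."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_content(column, alphabet_size=4):
    """Bits of information in a column: log2(alphabet size) minus entropy."""
    return log2(alphabet_size) - column_entropy(column)

# Toy columns from a hypothetical 4-sequence DNA alignment.
for col in ("AAAA", "AATT", "ACGT"):
    print(col, column_entropy(col), information_content(col))
```

A perfectly conserved column (AAAA) has zero entropy and carries the full 2 bits, while a column with all four bases equally represented (ACGT) carries none; in a logo, the total letter height at each position reflects this information content.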