BCB 444/544 F07 ISU Dobbs #11 - Multiple Sequence Alignment 9/14/07

BCB 444/544 Lecture 11
  First: BLAST vs FASTA
  Plus some Gene Jargon
  Multiple Sequence Alignment (MSA)
  #11_Sept14

Required Reading (before lecture)
  √ Mon Sept 10 - for Lecture 9/10: BLAST variations; BLAST vs FASTA, SW
    • Chp 4 - pp 51-62
  √ Wed Sept 12 - for Lecture 11 & Lab 4: Multiple Sequence Alignment (MSA)
    • Chp 5 - pp 63-74
  Fri Sept 14 - for Lecture 12: Position Specific Scoring Matrices & Profiles
    • Chp 6 - pp 75-78 (but not HMMs)
  • Good additional resource re: sequence alignment?
    • Wikipedia: http://en.wikipedia.org/wiki/Sequence_alignment

Assignments & Announcements - #1
  Revised Grading Policy has been sent via email. Please review!
  √ Mon Sept 10 - Lab 3 Exercise due 5 PM; send to: terrible@iastate.edu
  ? Thu Sept 13 - Graded Labs 2 & 3 will be returned at beginning of Lab 4
  Fri Sept 14 - HW#2 due by 5 PM (106 MBB)
  Study Guide for Exam 1 will be posted by 5 PM

Review: Gene Jargon #1 (for HW2, 1c)
  Exons = "protein-encoding" (or "kept") parts of eukaryotic genes
  vs
  Introns = "intervening sequences" = segments of eukaryotic genes that "interrupt" exons
  • Introns are transcribed into pre-RNA
  • but are later removed by RNA processing
  • & do not appear in mature mRNA
  • so are not translated into protein

Assignments & Announcements - #2
  Mon Sept 17 - Answers to HW#2 will be posted by 5 PM
  Thu Sept 20 - Lab = Optional Review Session for Exam
  Fri Sept 21 - Exam 1 - Will cover:
    • Lectures 2-12 (thru Mon Sept 17)
    • Labs 1-4
    • HW2
    • All assigned reading: Chps 2-6 (but not HMMs); Eddy: What is Dynamic Programming

Chp 4 - Database Similarity Searching
  SECTION II SEQUENCE ALIGNMENT
  Xiong: Chp 4 Database Similarity Searching
  • √ Unique Requirements of Database Searching
  • √ Heuristic Database Searching
  • √ Basic Local Alignment Search Tool (BLAST)
  • FASTA
  • Comparison of FASTA and BLAST
  • Database Searching with Smith-Waterman Method

Why search a database?
  • Given a newly discovered gene,
    • Does it occur in other species?
    • Is its function known in another species?
  • Given a newly sequenced genome, which regions align with genomes of other organisms?
    • Identification of potential genes
    • Identification of other functional parts of chromosomes
  • Find members of a multigene family
  Alignment types: local alignment, global alignment

FASTA and BLAST
  • Both FASTA and BLAST are based on heuristics
  • Tradeoff: Sensitivity vs Speed
    • DP is slower, but more sensitive
  • FASTA
    • user defines value for k = word length
    • Slower, but more sensitive than BLAST at lower values of k
      (preferred for searches involving a very short query sequence)
  • BLAST family
    • Family of different algorithms optimized for particular types of queries,
      such as searching for distantly related sequence matches
    • BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy

BLAST - a Family of Programs: Different BLAST "flavors"
  BLAST algorithms can generate both "global" and "local" alignments
  • BLASTP - protein sequence query against protein DB
  • BLASTN - DNA/RNA seq query against DNA DB (GenBank)
  • BLASTX - 6-frame translated DNA seq query against protein DB
  • TBLASTN - protein query against 6-frame DNA translation
  • TBLASTX - 6-frame DNA query to 6-frame DNA translation
  • PSI-BLAST - protein "profile" query against protein DB
  • PHI-BLAST - protein pattern against protein DB
  • Newest: MEGA-BLAST - optimized for highly similar sequences
  Which tool should you use?
    http://www.ncbi.nlm.nih.gov/blast/producttable.shtml

Detailed Steps in BLAST algorithm
  1. Remove low-complexity regions (LCRs)
  2. Make a list (dictionary): all words of length 3 aa or 11 nt
  3. Augment list to include similar words
  4. Store list in a search tree (data structure)
  5. Scan database for occurrences of words in search tree
  6. Connect nearby occurrences
  7. Extend matches (words) in both directions
  8. Prune list of matches using a score threshold
  9. Evaluate significance of each remaining match
  10. Perform Smith-Waterman to get alignment

1: Filter low-complexity regions (LCRs)
  (This slide has been changed!)
  • Low complexity sequences can yield false positives.
  • Screen them out of your query sequences! When appropriate!
  • Low complexity regions, transmembrane regions and coiled-coil regions
    often display significant similarity without homology.

  Complexity of a sequence window:

  $$K = \frac{1}{L}\,\log_N\!\left(\frac{L!}{\prod_i n_i!}\right)$$

    K   = compositional complexity of the window; varies from 0 (very low complexity) to 1 (high complexity)
    N   = alphabet size (4 or 20)
    L   = window length (usually 12)
    n_i = frequency of the i-th letter in the window

  e.g., for GGGG:
    L! = 4! = 4 x 3 x 2 x 1 = 24
    n_G = 4, n_T = n_A = n_C = 0
    Π n_i! = 4! x 0! x 0! x 0! = 24
    K = 1/4 log4 (24/24) = 0
  For CGTA:
    K = 1/4 log4 (24/1) = 0.57

2: List all words in query
  YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
    YGG
     GGF
      GFM
       FMT
        MTS
         TSE
          SEK
           …

3: Augment word list
  YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
  Query words: YGG GGF GFM FMT MTS TSE SEK …
  Candidate words: AAA AAB AAC … YYY
  20^3 = 8000 possible matches

3: Augment word list
  YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ
  Query words: YGG GGF GFM FMT MTS TSE SEK …
  Similar words added for GGF: GGI GGL GGM GGF GGW GGY …
  A user-specified threshold, T, determines which 3-letter words are
  considered matches and non-matches

3: Augment word list
  Observation:
  Selecting only words with score > T greatly reduces the number of possible
  matches; otherwise, there are 20^3 candidates for 3-letter words from amino acid sequences!

3: Augment word list
  Example: Find all words that match EAM with a score greater than or equal to 11
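Steps 1-3 above (complexity of a window, listing the query words, and augmenting the list with similar words) can be sketched in a few lines of Python. This is only an illustrative sketch, not the BLAST implementation: the function names `complexity`, `words` and `neighborhood` are made up for this example, only the three BLOSUM62 rows needed for the EAM example (E, A, M) are included, and the cutoff is applied as "score >= T", as in the EAM example.

```python
import math
from collections import Counter
from itertools import product

AA = "ARNDCQEGHILKMFPSTWYV"  # column order of the BLOSUM62 table

# Three rows of BLOSUM62 (A, E, M): enough to score candidates against "EAM".
ROWS = {
    "A": "4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0",
    "E": "-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2",
    "M": "-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1",
}
SCORE = {r: dict(zip(AA, map(int, vals.split()))) for r, vals in ROWS.items()}


def complexity(window, alphabet_size):
    """Step 1: K = (1/L) * log_N( L! / prod(n_i!) ) for one window."""
    L = len(window)
    denom = math.prod(math.factorial(n) for n in Counter(window).values())
    # L! / prod(n_i!) is a multinomial coefficient, so integer division is exact.
    return math.log(math.factorial(L) // denom, alphabet_size) / L


def words(query, w=3):
    """Step 2: every overlapping word of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Step 3: all candidate words scoring >= threshold against `word`."""
    hits = []
    for cand in product(AA, repeat=len(word)):
        s = sum(SCORE[q][c] for q, c in zip(word, cand))
        if s >= threshold:
            hits.append(("".join(cand), s))
    return sorted(hits, key=lambda h: -h[1])


QUERY = "YGGFMTSEKSQTPLVTLFKNAIIKNAHKKGQ"
print(complexity("GGGG", 4))            # 0.0
print(round(complexity("CGTA", 4), 2))  # 0.57
print(words(QUERY)[:7])
print(neighborhood("EAM", 11))          # EAM (14) plus DAM, QAM, EAL, ESM (11 each)
```

Lowering the threshold to 10 in this sketch grows the EAM list from 5 to 12 words, which shows how T controls the size of the augmented list.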
3: Augment word list - BLOSUM62 scores
  Non-match:  G G F  vs  A A A :  0 + 0 + (-2) = -2
  Match:      G G F  vs  G G Y :  6 + 6 + 3 = 15

  BLOSUM62 matrix:

      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
  A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
  R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
  N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
  D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
  C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
  Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
  E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
  G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
  H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
  I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
  L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
  K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
  M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
  F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
  P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
  S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
  T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
  W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
  Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
  V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

  Words that match EAM with score >= 11:
    EAM   5 + 4 + 5 = 14
    DAM   2 + 4 + 5 = 11
    QAM   2 + 4 + 5 = 11
    ESM   5 + 1 + 5 = 11
    EAL   5 + 4 + 2 = 11

4: Store words in search tree
  Augmented list of query words: GGF GGL GGM GGW GGY
  [Figure: search tree with path G -> G -> {F, L, M, W, Y}]
  "Does this query contain GGF?"
  "Yes, at position 2."

Example
  Put this word list into a search tree:
    DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV
  [Figure: search tree built from this word list]

5: Scan the database sequences
  [Figure: dot plot of database sequence vs query sequence; each dot marks a word hit]

Example
  Scan this "database" for occurrences of your words:
    MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA

6: Connect nearby occurrences (diagonal matches in Gapped BLAST)
  [Figure: dot plot of database sequence vs query sequence]
  Two dots are connected if and only if they are less than A letters apart
  and lie on the same diagonal

7: Extend matches in both directions
  Scan DB:
                          <word>
    Query sequence:     L P P Q G L L
    Database sequence:  M P P E G L L
    BLOSUM62 scores:    2 7 7 2 6 4 4
  word score (P Q G vs P E G) = 7 + 2 + 6 = 15
  HSP SCORE = 32 (High Scoring Pair)

7: Extend matches, calculating score at each step
  • Each match is extended to left & right until a negative BLOSUM62 score is encountered
  • Extension step typically accounts for > 90% of execution time

8: Prune matches
  • Discard all matches that score below defined threshold

9: Evaluate significance
  (This slide has been changed!)
  • BLAST uses an analytical statistical significance calculation
  RECALL:
  1. E-value: E = m x n x P
       m = total number of residues in database
       n = number of residues in query sequence
       P = probability that an HSP is the result of random chance
     Lower E-value: less likely to result from random chance, thus higher significance
  2. Bit Score: S' = normalized score, to account for differences in size of
     database (m) & sequence length (n)

     $$S' = \frac{\lambda S - \ln K}{\ln 2}$$

     where:
       λ = Gumbel distribution constant
       S = raw alignment score
       K = constant associated with scoring matrix
     Note that the bit score is linearly related to the raw alignment score, so:
     higher S' means the alignment has higher significance
  For more details - see text & BLAST tutorial

10: Use Smith-Waterman algorithm (DP) to generate alignment
  • ONLY significant matches are re-analyzed using Smith-Waterman DP algorithm.
  • Alignments reported by BLAST are produced by dynamic programming

BLAST: What is a "Hit"?
  • A hit is a w-length word in database that aligns with a word from query sequence with score > T
  • BLAST looks for hits instead of exact matches
    • Allows word size to be kept larger for speed, without sacrificing sensitivity
  • Typically, w = 3-5 for amino acids, w = 11-12 for DNA
  • T is the most critical parameter:
    • ↑T ⇒ ↓ "background" hits (faster)
    • ↓T ⇒ ↑ ability to detect more distant relationships (at cost of increased noise)

Tips for BLAST Similarity Searches
  • If you don't know, use default parameters first
  • Try several programs & several parameter settings
  • If possible, search on protein sequence level
  • Scoring matrices:
      PAM1 / BLOSUM80:   if expect/want less divergent proteins
      PAM120 / BLOSUM62: "average" proteins
      PAM250 / BLOSUM45: if need to find more divergent proteins
  • Proteins:
      >25-30% identity (and >100 aa) -> likely related
      15-25% identity -> twilight zone
      <15% identity -> likely unrelated

Practical Issues
  Searching on DNA or protein level?
  In general, protein-encoding DNA should be translated!
  • DNA yields more random matches:
    • 25% for DNA vs. 5% for proteins
  • DNA databases are larger and grow faster
  • Selection (generally) acts on protein level
    • Synonymous mutations are usually neutral
    • DNA sequence similarity decays faster

BLAST vs FASTA
  • Seeding:
    • BLAST integrates scoring matrix into first phase
    • FASTA requires exact matches (uses hashing)
    • BLAST increases search speed by finding fewer, but better, words during initial screening phase
    • FASTA uses shorter word sizes - so can be more sensitive
  • Results:
    • BLAST can return multiple best scoring alignments
    • FASTA returns only one final alignment

BLAST & FASTA References
  • FASTA - developed first
    • Pearson & Lipman (1988) Improved Tools for Biological Sequence Comparison. PNAS 85:2444-2448
  • BLAST
    • Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990)
    • Altschul, Madden, Schaffer, Zhang, Zhang, Miller, Lipman (1997) Gapped BLAST and PSI-BLAST:
      a new generation of protein database search programs. Nucleic Acids Res. 25:3389-402

BLAST Notes
  • BLAST uses heuristics: it may miss some good matches
  • But, it's fast: 50 - 100X faster than Smith-Waterman (SW) DP
  • Large impact:
    • NCBI's BLAST server handles more than 100,000 queries/day
    • Most used bioinformatics program in the world!
  But - Xiong says: "It has been estimated that for some families of protein
  sequences BLAST can miss 30% of truly significant matches."
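The scan, extend and bit-score steps of the BLAST walkthrough (steps 5, 7 and 9) can be illustrated with a small Python sketch. Several things here are assumptions made for illustration: a Python set stands in for the search tree, the extension uses the simplified rule from the slides (stop at the first negative score) rather than the drop-off rule used by production BLAST, only the handful of BLOSUM62 pairs needed for the LPPQGLL / MPPEGLL example are included, and the names `scan`, `extend` and `bit_score` are hypothetical. The caller must supply λ and K; no values are assumed for them.

```python
import math

# Excerpt of BLOSUM62: only the residue pairs needed for the slide example.
PAIR = {("L", "M"): 2, ("P", "P"): 7, ("Q", "E"): 2, ("G", "G"): 6, ("L", "L"): 4}


def score(a, b):
    """Symmetric lookup in the excerpt; raises KeyError for pairs not listed."""
    return PAIR[(a, b)] if (a, b) in PAIR else PAIR[(b, a)]


def scan(database, word_set, w=3):
    """Step 5: report (position, word) for every listed word found in the database."""
    return [(i, database[i:i + w])
            for i in range(len(database) - w + 1)
            if database[i:i + w] in word_set]


def extend(query, subject, q_start, s_start, w=3):
    """Step 7 (simplified): grow a word hit left and right until a negative score."""
    total = sum(score(query[q_start + k], subject[s_start + k]) for k in range(w))
    lo_q, lo_s = q_start, s_start
    while lo_q > 0 and lo_s > 0:
        s = score(query[lo_q - 1], subject[lo_s - 1])
        if s < 0:
            break
        total, lo_q, lo_s = total + s, lo_q - 1, lo_s - 1
    hi_q, hi_s = q_start + w, s_start + w
    while hi_q < len(query) and hi_s < len(subject):
        s = score(query[hi_q], subject[hi_s])
        if s < 0:
            break
        total, hi_q, hi_s = total + s, hi_q + 1, hi_s + 1
    return total, query[lo_q:hi_q], subject[lo_s:hi_s]


def bit_score(raw, lam, k):
    """Step 9: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


DB = ("MKFLILLFNILCLDAMLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAV"
      "EAIGVKYLQVQHGSNVNIHRLVEGNVKAMENA")
WORDS = set("DAM QAM EAM KAM ECM EGM ESM ETM EVM EAI EAL EAV".split())

print(scan(DB, WORDS))                       # hits for DAM, EAI and KAM
print(extend("LPPQGLL", "MPPEGLL", 2, 2))    # (32, 'LPPQGLL', 'MPPEGLL')
```

The extension call starts from the word PQG / PEG (score 15) and reproduces the HSP score of 32 from the example.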
DP Alternatives to BLAST
  • Increased availability of parallel processing has made DP-based approaches feasible:
  • 2 DP-based web servers: both more sensitive than BLAST
    • Scan Protein Sequence: http://www.ebi.ac.uk/scanps/index.html
      Implements modified SW optimized for parallel processing
    • ParAlign: www.paralign.org - parallel SW or heuristics

NCBI - BLAST Programs, Glossary & Tutorials
  • http://www.ncbi.nlm.nih.gov/BLAST/
  • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
  • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

Chp 5 - Multiple Sequence Alignment
  SECTION II SEQUENCE ALIGNMENT
  Xiong: Chp 5 Multiple Sequence Alignment
  • Scoring Function
  • Exhaustive Algorithms
  • Heuristic Algorithms
  • Practical Issues

Multiple Sequence Alignments
  Credits for slides: Caragea & Brown, 2007; Fernandez-Baca, Heber & Hunter

Overview
  1. What is a multiple sequence alignment (MSA)?
  2. Where/why do we need MSA?
  3. What is a good MSA?
  4. Algorithms to compute an MSA

Multiple Sequence Alignment
  • Generalize pairwise alignment of sequences to include > 2 homologous sequences
  • Analyzing more than 2 sequences gives us much more information:
    • Which amino acids are required? Correlated?
    • Evolutionary/phylogenetic relationships
  • Similar to PSI-BLAST idea (not yet covered in lecture):
    use a set of homologous sequences to provide more "sensitivity"

What is an MSA?
    ATT-GC        AT-TGC        AT-T-GC
    ATTTGC        ATTTGC        ATTT-GC
    ATTTG         ATTTG-        ATTT-G-
    Not an MSA    MSA           Not an MSA
  Why?

Definition: MSA
  Given a set of sequences, a multiple sequence alignment is an assignment of
  gap characters, such that
  • resulting sequences have same length
  • no column contains only gaps

    ATT-GC        AT-TGC        AT-T-GC
    ATTTGC        ATTTGC        ATTT-GC
    ATTTG         ATTTG-        ATTT-G-
    NO            YES           NO

What is a Consensus Sequence?
  A single sequence that represents the most common residue of each column in an MSA
  Example:
    FGGHL-GF
    F-GHLPGF
    FGGHP-FG
    --------
    FGGHL-GF   (consensus)
  Steiner consensus sequence: Given sequences s1, …, sk, find a sequence s*
  that maximizes Σi S(s*, si)

Displaying MSAs: using CLUSTAL W
  Colors:
    RED:     AVFPMILW (small)
    BLUE:    DE       (acidic, negative chg)
    MAGENTA: RHK      (basic, positive chg)
    GREEN:   STYHCNGQ (hydroxyl + amine + basic)
  Symbols:
    *  entirely conserved column
    :  all residues have ~ same size AND hydropathy
    .  all residues have ~ same size OR hydropathy

Applications of MSA
  • Building phylogenetic trees
  • Finding conserved patterns, e.g.:
    • Regulatory motifs (TF binding sites)
    • Splice sites
    • Protein domains
  • Identifying and characterizing protein families
    • Find out which protein domains have same function
  • Finding SNPs (single nucleotide polymorphisms) & mRNA isoforms (alternatively spliced forms)
  • DNA fragment assembly (in genomic sequencing)

Application: Recover Phylogenetic Tree
  What was the series of events that led to the current species?
  [Figure: phylogenetic tree relating the sequences NYLS, NFLS, NYLS]

Application: Discover Conserved Patterns
  Is there a conserved cis-acting regulatory sequence?
  TATA box = transcriptional promoter element

Goal: Characterize Protein Families
  Which parts of globin sequences are most highly conserved?
  Rationale: if they are homologous (derived from a common ancestor), they may be structurally equivalent
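The MSA definition (equal row lengths, no all-gap column) and the majority-rule consensus sequence can be checked mechanically. The sketch below is illustrative only: the function names `is_msa` and `consensus` are invented for this example, the gap character is assumed to be "-", a gap is counted like any other symbol when picking the most common entry in a column, and ties are broken by whichever symbol appears first in the column.

```python
from collections import Counter

GAP = "-"


def is_msa(rows):
    """True if all rows have the same length and no column consists only of gaps."""
    if len({len(r) for r in rows}) != 1:
        return False
    return not any(all(c == GAP for c in col) for col in zip(*rows))


def consensus(rows):
    """Most common symbol in each column (gaps count as symbols)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))


print(is_msa(["ATT-GC", "ATTTGC", "ATTTG"]))       # False: rows differ in length
print(is_msa(["AT-TGC", "ATTTGC", "ATTTG-"]))      # True
print(is_msa(["AT-T-GC", "ATTT-GC", "ATTT-G-"]))   # False: column 5 is all gaps
print(consensus(["FGGHL-GF", "F-GHLPGF", "FGGHP-FG"]))  # FGGHL-GF
```

This majority-rule consensus is the simple per-column definition; the Steiner consensus is a different, optimization-based notion and is not computed here.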