#4 - Sequence Alignment 8/27/07
BCB 444/544 F07 ISU Dobbs

BCB 444/544 Lecture 4
Finish: Lecture 2- Biological Databases
Sequence Alignment
#4_Aug27

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
• Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
• Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
• Chp 3 - pp 41-49

HW#2:

Back to: Chp 2- Biological Databases
• Xiong: Chp 2 Introduction to Biological Databases
   • What is a Database?
   • Types of Databases
   • Biological Databases
   • Pitfalls of Biological Databases
   • Information Retrieval from Biological Databases

What is a Database?
Duh!! OK: we'll skip that!

Types of Databases
3 major types of electronic databases:
1. Flat files - simple text files
   • no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
   • shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
   • objects associated hierarchically

Biological Databases
Currently - all 3 types, but MANY flat files
What are goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
   • "simple" archives of sequences, structures, images, etc.
   • raw data, minimal annotations, not always well curated!
2- Secondary
   • enhanced with more complete annotation of sequences, structures, images, etc.
   • usually curated!
3- Specialized
   • focused on a particular research interest or organism
   • usually - not always - highly curated

Examples of Biological Databases
1- Primary
   • DNA sequences
      • GenBank - US
      • European Molecular Biology Lab - EMBL
      • DNA Data Bank of Japan - DDBJ
   • Structures (Protein, DNA, RNA)
      • PDB - Protein Data Bank
      • NDB - Nucleic Acid Data Bank
2- Secondary
   • Protein sequences
      • Swiss-Prot, TrEMBL, PIR
      • these recently combined into UniProt
3- Specialized
   • Species-specific (or "taxonomic" specific)
      • FlyBase, WormBase, AceDB, PlantDB
   • Molecule-specific, disease-specific

Pitfalls of Biological Databases
• Errors! & lack of documentation re: quality or reliability of data
• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
• Redundancy
• Inconsistency
• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases
2 most popular retrieval systems:
• ENTREZ - NCBI
   • will use a LOT - was introduced in Lab 1
• SRS - Sequence Retrieval System - EBI
   • will use less, similar to ENTREZ
Both:
• Provide access to multiple databases
• Allow complex queries

Web Resources: Bioinformatics & Computational Biology
• NCBI - National Center for Biotechnology Information
• ISCB - International Society for Computational Biology
• JCB - Jena Center for Bioinformatics
• Pitt - OBRC Online Bioinformatics Resources Collection
• UBC - Bioinformatics Links Directory
• UWash - BioMolecules
• ISU - Bioinformatics Resources - Andrea Dinkelman
• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
• Wikipedia: Bioinformatics

ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
• LH Baker Center - Bioinformatics & Biological Statistics
• BCB - Bioinformatics & Computational Biology
• BCB Lab - (Student-Led Consulting & Resources)
• CIAG - Center for Integrated Animal Genomics
• CCILD - Computational Intelligence, Learning & Discovery
• IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
• Biotechnology - Instrumentation Facilities
• PSI - Plant Sciences Institute
• PSI Centers

SUMMARY: #2- Biological Databases
BEWARE!

Chp 3- Sequence Alignment
SECTION II
SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
• Evolutionary Basis
• Sequence Homology versus Sequence Similarity
• Sequence Similarity versus Sequence Identity
• Methods
• Scoring Matrices
• Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Why Align Sequences?
"Sequence comparison lies at the heart of bioinformatics analysis." - Jin Xiong
Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
• Search for common patterns of characters
• Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is the basis for:
• Database searching (e.g., BLAST)
• Multiple sequence alignment (MSA)

Motivation for Sequence Alignment
Databases contain many sequences with known functions & many sequences with unknown functions.
Genes (or proteins) with similar sequences may have similar structures and/or functions.
Sequence alignment can provide important clues to the function of a novel gene or protein

Examples of Bioinformatics Tasks that Rely on Sequence Alignment
• Genomic sequencing (> 500 complete genomes sequenced!)
• Assembling multiple sequence reads into contigs, scaffolds
• Aligning sequences with chromosomes
• Finding genes and regulatory regions
• Identifying gene products
• Identifying function of gene products
• Studying the structural organization of genomes
• Comparative genomics

Evolutionary Basis
• DNA, RNA and proteins are "molecular fossils"
   • they encode the history of millions of years of evolution
• During evolution, molecular sequences accumulate random changes (mutations/variants)
   • some of which provide a selective advantage or disadvantage, and some of which are neutral
• Sequences that are structurally and/or functionally important tend to be conserved
   • (e.g., chromosomal telomeric sequences; enzyme active sites)
• Significant sequence conservation allows inference of evolutionary relatedness

Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us:
Homology = similarity due to descent from a common evolutionary ancestor
But,
HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)

Orthologs vs Paralogs
2 types of homologous sequences:
• Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions
   (e.g., human α-globin & mouse α-globin)
• Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions
   (e.g., human α-globin & human β-globin)
[Figure: gene tree with nodes A, B, C, C']
A is the parent gene
Speciation leads to B & C
Duplication leads to C'
B and C are Orthologous
C and C' are Paralogous

Sequence Homology vs Similarity
• Homologous sequences - sequences that share a common evolutionary ancestry
• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
• Sequence homology:
   • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
   • Homology is qualitative
• Sequence similarity:
   • The direct result of observation from a sequence alignment
   • Similarity is quantitative; can be described using percentages

Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
• Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing
• Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
• Identity = % of exact matches between two aligned sequences
• Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
• Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)

What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.
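
To make "a scoring scheme for evaluating matching letters" concrete, here is a minimal Python sketch that scores an alignment that has already been written out with gaps, and reports its percent identity. The function names are hypothetical, the default values (+1 match, -1 mismatch, -2 gap) are simply the example values used in these slides, and dividing by the number of alignment columns is only one of several possible conventions for percent identity. It does not search for the optimal alignment; it only evaluates a given one.

```python
def score_alignment(row1, row2, match=1, mismatch=-1, gap=-2):
    """Sum per-column scores for two equal-length aligned rows ('-' marks a gap).

    Assumption: match, mismatch and gap are additive scores, so the
    penalties are passed as negative numbers.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # a letter paired with a gap
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # two different letters
    return total


def percent_identity(row1, row2):
    """Percent of alignment columns that hold an exact match.

    Assumption: the denominator is the number of alignment columns.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * matches / len(row1)


if __name__ == "__main__":
    # DNA and protein example pairs taken from the slides.
    print(score_alignment("TTGACAC", "TTTACAC"))
    print(score_alignment("RKVA-GMA", "RKIAVAMA"))
    print(percent_identity("TTGACAC", "TTTACAC"))
```

With a scheme like this, an alignment padded with many gaps scores poorly even when every remaining letter matches, which is the point of penalizing gaps.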
Goal: Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
• DNA
   4 letter alphabet (+ gap)
   TTGACAC
   TTTACAC
• Proteins
   20 letter alphabet (+ gap)
   RKVA-GMA
   RKIAVAMA

Align:
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.
OR
1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"? Which is better?

Statement of Problem
Given:
• 2 sequences
• Scoring system for evaluating match (or mismatch) of two characters
• Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
• Retains the order of characters
• Introduces gaps where needed
• Maximizes total score

Types of Sequence Variation
• Sequences can diverge from a common ancestor through various types of mutations:
   • Substitutions   ACGA → AGGA
   • Insertions      ACGA → ACCGA
   • Deletions       ACGA → AGA
• Insertions or deletions ("indels") result in gaps in alignments
• Substitutions result in mismatches
• No change? match

Gaps
Indels of various sizes can occur in one sequence relative to the other
e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
• Introducing too many gaps generates nonsense alignments:
   s--e-----qu---en--ce
   sometimesquipsentice
• Need to distinguish between alignments that occur due to homology and those that occur by chance
• Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
   Match:     +m   e.g., +1
   Mismatch:  -s         -1
   Gap:       -d         -2
F = m(#matches) - s(#mismatches) - d(#gaps)

Not All Mismatches are the Same
• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

Methods
• Global and Local Alignment
• Alignment Algorithms
   • Dot Matrix Method
   • Dynamic Programming Method
      • Gap penalties
      • DP for Global Alignment
      • DP for Local Alignment
• Scoring Matrices
   • Amino acid scoring matrices
   • PAM
   • BLOSUM
   • Comparisons between PAM & BLOSUM
• Statistical Significance of Sequence Alignment

Global vs Local Alignment
Global alignment
• Finds best possible alignment across entire length of 2 sequences
• Aligned sequences assumed to be generally similar over entire length
Local alignment
• Finds local regions with highest similarity between 2 sequences
• Aligns these without regard for rest of sequence
• Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment
   CTGTCG-CTGCACG
   -TGC-CG-TG----
Local alignment
   CTGTCGCTGCACG--------TGC-CGTG

   CTGTCG-CTGCACG
   -TGCCG--TG----
Which is better?

Global vs Local Alignment: When use which?
Both are important, but it is critical to use the right method for a given task!
Global alignment:
• Good for: aligning closely related sequences of approx. same length
• Not good for: divergent sequences or sequences with different lengths
Local alignment:
• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
• Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues

Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
• Place 1 sequence along top row of matrix
• Place 2nd sequence along left column of matrix
• Plot a dot each time there is a match between an element of row sequence and an element of column sequence
• For proteins, usually use more sophisticated scoring schemes than "identical match"
• Diagonal lines indicate areas of match
• Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: dot plot comparing the sequences ACGCG and ACACG]
Exploring Dot Plots
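
The dot matrix steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, a dot means an exact character match (the "identical match" scheme, not the more sophisticated scoring usually used for proteins), and the two short sequences are the ones that label the axes of the slide's dot plot figure, with the choice of which goes along the top being arbitrary.

```python
def dot_matrix(row_seq, col_seq):
    """Return a grid of booleans: grid[i][j] is True when
    col_seq[i] (left column) equals row_seq[j] (top row)."""
    return [[a == b for a in row_seq] for b in col_seq]


def render(row_seq, col_seq):
    """Draw the dot matrix as text, with '*' for a match and '.' otherwise."""
    grid = dot_matrix(row_seq, col_seq)
    lines = ["  " + " ".join(row_seq)]           # top row: first sequence
    for ch, row in zip(col_seq, grid):            # left column: second sequence
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Assumption: ACGCG along the top row, ACACG down the left column.
    print(render("ACGCG", "ACACG"))
```

Runs of '*' along a diagonal correspond to stretches where the two sequences match position by position; for real sequences, a sliding window and a match threshold are commonly added to suppress scattered chance matches.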