BCB 444/544
Finish: Lecture 2 - Biological Databases
Lecture 4: Sequence Alignment
#4_Aug27
BCB 444/544 F07 ISU Dobbs #4 - Sequence Alignment 8/27/07

Required Reading (before lecture)
Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment
  • Chp 3 - pp 31-41, Xiong Textbook
Wed Aug 29 - for Lecture #5: Dynamic Programming
  • Eddy: What is Dynamic Programming?
Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment
Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics
  • Chp 3 - pp 41-49

HW#2:

Back to: Chp 2 - Biological Databases
Xiong: Chp 2 Introduction to Biological Databases
  • What is a Database?
  • Types of Databases
  • Biological Databases
  • Pitfalls of Biological Databases
  • Information Retrieval from Biological Databases

What is a Database?
Duh!! OK: we'll skip that!

Types of Databases
3 major types of electronic databases:
1. Flat files - simple text files
  • no organization to facilitate retrieval
2. Relational - data organized as tables ("relations")
  • shared features among tables allow rapid search
3. Object-oriented - data organized as "objects"
  • objects associated hierarchically

Biological Databases
Currently - all 3 types, but MANY flat files
What are the goals of biological databases?
1. Information retrieval
2. Knowledge discovery
Important issue: Interconnectivity

Types of Biological Databases
1- Primary
  • "simple" archives of sequences, structures, images, etc.
  • raw data, minimal annotations, not always well curated!
2- Secondary
  • enhanced with more complete annotation of sequences, structures, images, etc.
  • usually curated!
3- Specialized
  • focused on a particular research interest or organism
  • usually - not always - highly curated

Examples of Biological Databases
1- Primary
  • DNA sequences
      • GenBank - US
      • European Molecular Biology Lab - EMBL
      • DNA Data Bank of Japan - DDBJ
  • Structures (Protein, DNA, RNA)
      • PDB - Protein Data Bank
      • NDB - Nucleic Acid Database
2- Secondary
  • Protein sequences
      • Swiss-Prot, TrEMBL, PIR
      • these recently combined into UniProt
3- Specialized
  • Species-specific (or "taxonomic" specific)
      • FlyBase, WormBase, AceDB, PlantDB
  • Molecule-specific, disease-specific

Pitfalls of Biological Databases
  • Errors!
      • Lack of documentation re: quality or reliability of data
      • Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)
  • Redundancy
  • Inconsistency
  • Incompatibility (format, terminology, data types, etc.)
Information Retrieval from Biological Databases
2 most popular retrieval systems:
  • ENTREZ - NCBI
      • will use a LOT - was introduced in Lab 1
  • SRS - Sequence Retrieval System - EBI
      • will use less, similar to ENTREZ
Both:
  • Provide access to multiple databases
  • Allow complex queries

Web Resources: Bioinformatics & Computational Biology
  • NCBI - National Center for Biotechnology Information
  • ISCB - International Society for Computational Biology
  • JCB - Jena Center for Bioinformatics
  • Pitt - OBRC Online Bioinformatics Resources Collection
  • UBC - Bioinformatics Links Directory
  • UWash - BioMolecules
  • ISU - Bioinformatics Resources - Andrea Dinkelman
  • ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)
  • Wikipedia: Bioinformatics

ISU Resources & Experts
ISU Research Centers & Graduate Training Programs:
  • LH Baker Center - Bioinformatics & Biological Statistics
  • BCB - Bioinformatics & Computational Biology
  • BCB Lab - (Student-Led Consulting & Resources)
  • CIAG - Center for Integrated Animal Genomics
  • CCILD - Computational Intelligence, Learning & Discovery
  • IGERT Training Grant - Computational Molecular Biology
ISU Facilities:
  • Biotechnology - Instrumentation Facilities
  • PSI - Plant Sciences Institute
  • PSI Centers

SUMMARY: #2 - Biological Databases
BEWARE!
Chp 3 - Sequence Alignment
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 3 Pairwise Sequence Alignment
  • Evolutionary Basis
  • Sequence Homology versus Sequence Similarity
  • Sequence Similarity versus Sequence Identity
  • Methods
  • Scoring Matrices
  • Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Motivation for Sequence Alignment
"Sequence comparison lies at the heart of bioinformatics analysis." - Jin Xiong
Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins
Pairwise sequence alignment is fundamental; it is used to:
  • Search for common patterns of characters
  • Establish pair-wise correspondence between related sequences
Pairwise sequence alignment is the basis for:
  • Database searching (e.g., BLAST)
  • Multiple sequence alignment (MSA)

Why Align Sequences?
Databases contain many sequences with known functions & many sequences with unknown functions.
Genes (or proteins) with similar sequences may have similar structures and/or functions.
Sequence alignment can provide important clues to the function of a novel gene or protein.

Examples of Bioinformatics Tasks that Rely on Sequence Alignment
  • Genomic sequencing (> 500 complete genomes sequenced!)
  • Assembling multiple sequence reads into contigs, scaffolds
  • Aligning sequences with chromosomes
  • Finding genes and regulatory regions
  • Identifying gene products
  • Identifying function of gene products
  • Studying the structural organization of genomes
  • Comparative genomics

Evolutionary Basis
  • DNA, RNA and proteins are "molecular fossils"
      • they encode the history of millions of years of evolution
  • During evolution, molecular sequences accumulate random changes (mutations/variants)
      • some of which provide a selective advantage or disadvantage, and some of which are neutral
  • Sequences that are structurally and/or functionally important tend to be conserved
      • (e.g., chromosomal telomeric sequences; enzyme active sites)
  • Significant sequence conservation allows inference of evolutionary relatedness

Homology
Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly
For us: Homology = similarity due to descent from a common evolutionary ancestor
But, HOMOLOGY ≠ SIMILARITY
When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous
We can infer homology from similarity (can't prove it!)
Orthologs vs Paralogs
2 types of homologous sequences:
  • Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions (e.g., human α-globin & mouse α-globin)
  • Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions (e.g., human α-globin & human β-globin)
[Figure: gene tree. A is the parent gene; speciation leads to B & C; duplication leads to C'. B and C are orthologous; C and C' are paralogous.]

Sequence Homology vs Similarity
  • Homologous sequences - sequences that share a common evolutionary ancestry
  • Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)
IMPORTANT:
  • Sequence homology:
      • An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity
      • Homology is qualitative
  • Sequence similarity:
      • The direct result of observation from a sequence alignment
      • Similarity is quantitative; can be described using percentages

Sequence Similarity vs Identity
For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:
  • Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing
  • Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)
For protein sequences, sequence similarity and identity have different meanings:
  • Identity = % of exact matches between two aligned sequences
  • Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)
  • Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)

What is Sequence Alignment?
Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of the other sequence.
Align:
  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A SHORT SENTENCE.

  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A ######SHORT## SENTENCE##############.
OR
  1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
  2: THIS IS A ##SHORT###SENT#EN###CE##############.
Is one of these alignments "optimal"? Which is better?
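One rough way to compare the two candidate alignments above is to tally their columns. The sketch below assumes `#` marks a gap, as on the slide, and counts identical columns, mismatched columns, and gap columns. The function name `tally_columns` is made up for illustration, and this is only a bookkeeping aid, not a scoring scheme defined in the lecture.

```python
def tally_columns(row1, row2, gap="#"):
    """Count (matches, mismatches, gap columns) for two aligned rows."""
    matches = mismatches = gaps = 0
    for a, b in zip(row1, row2):  # walk the alignment column by column
        if a == gap or b == gap:
            gaps += 1          # a letter paired with a gap symbol
        elif a == b:
            matches += 1       # identical characters (spaces included)
        else:
            mismatches += 1    # two different characters paired up
    return matches, mismatches, gaps


# The two candidate alignments from the slide ('#' assumed to be the gap symbol).
seq1 = "THIS IS A RATHER LONGER SENTENCE THAN THE NEXT."
first = "THIS IS A ######SHORT## SENTENCE##############."
second = "THIS IS A ##SHORT###SENT#EN###CE##############."

for label, row2 in (("first", first), ("second", second)):
    m, x, g = tally_columns(seq1, row2)
    print(f"{label} alignment: {m} matches, {x} mismatches, {g} gap columns")
```

Counting alone does not settle which alignment is "better"; that depends on how matches, mismatches, and gaps are weighted.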
Goal of Sequence Alignment
Find the best pairing of 2 sequences, such that there is maximum correspondence between residues
  • DNA: 4 letter alphabet (+ gap)
      TTGACAC
      TTTACAC
  • Proteins: 20 letter alphabet (+ gap)
      RKVA-GMA
      RKIAVAMA

Statement of Problem
Given:
  • 2 sequences
  • Scoring system for evaluating match (or mismatch) of two characters
  • Penalty function for gaps in sequences
Find: Optimal pairing of sequences that
  • Retains the order of characters
  • Introduces gaps where needed
  • Maximizes total score

Types of Sequence Variation
  • Sequences can diverge from a common ancestor through various types of mutations:
      • Substitutions: ACGA → AGGA
      • Insertions: ACGA → ACCGA
      • Deletions: ACGA → AGA
  • Insertions or deletions ("indels") result in gaps in alignments
  • Substitutions result in mismatches
  • No change? match

Gaps
Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein

Avoiding Random Alignments with a Scoring Function
  • Introducing too many gaps generates nonsense alignments:
      s--e-----qu---en--ce
      sometimesquipsentice
  • Need to distinguish between alignments that occur due to homology and those that occur by chance
  • Define a scoring function that accounts for mismatches and gaps
Scoring Function (F):
  Match:    +m   e.g., +1
  Mismatch: -s   e.g., -1
  Gap:      -d   e.g., -2
  F = m(#matches) - s(#mismatches) - d(#gaps)

Not All Mismatches are the Same
  • Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala
  • A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions
  • Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix
s(a,b) corresponds to the score of aligning character a with character b
Match scores are often calculated based on frequency of mutations in very similar sequences (more details later)

Methods
  • Global and Local Alignment
  • Alignment Algorithms
  • Dot Matrix Method
  • Dynamic Programming Method
      • Gap penalties
      • DP for Global Alignment
      • DP for Local Alignment
  • Scoring Matrices
      • Amino acid scoring matrices
      • PAM
      • BLOSUM
      • Comparisons between PAM & BLOSUM
  • Statistical Significance of Sequence Alignment

Global vs Local Alignment
Global alignment
  • Finds best possible alignment across entire length of 2 sequences
  • Aligned sequences assumed to be generally similar over entire length
Local alignment
  • Finds local regions with highest similarity between 2 sequences
  • Aligns these without regard for rest of sequence
  • Sequences are not assumed to be similar over entire length

Global vs Local Alignment - example
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment
  CTGTCG-CTGCACG
  -TGC-CG-TG----
Local alignment
  CTGTCGCTGCACG--------TGC-CGTG
  CTGTCG-CTGCACG
  -TGCCG--TG---
Which is better?

Global vs Local Alignment
When to use which?
Both are important, but it is critical to use the right method for a given task!
Global alignment:
  • Good for: aligning closely related sequences of approx. same length
  • Not good for: divergent sequences or sequences with different lengths
Local alignment:
  • Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences
  • Not good for: generating alignment of closely related sequences
Global and local alignments are fundamentally similar and differ only in the optimization strategy used in aligning similar residues

Alignment Algorithms
3 major methods for alignment:
1. Dot matrix analysis
2. Dynamic Programming
3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)
  • Place 1 sequence along top row of matrix
  • Place 2nd sequence along left column of matrix
  • Plot a dot each time there is a match between an element of row sequence and an element of column sequence
  • For proteins, usually use more sophisticated scoring schemes than "identical match"
  • Diagonal lines indicate areas of match
  • Reverse diagonals (perpendicular to diagonal) indicate inversions
[Figure: example dot plot; axis residues as extracted: A C G C G A C A C G]
Exploring Dot Plots
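The dot matrix steps above can be sketched in a few lines. This is a minimal sketch that assumes exact-match scoring with no window or threshold filtering; the function names `dot_matrix` and `render` and the two short sequences are chosen here for illustration and are not from the lecture.

```python
def dot_matrix(row_seq, col_seq):
    """Return a grid where cell [i][j] is True when col_seq[i] == row_seq[j]."""
    return [[c == r for r in row_seq] for c in col_seq]


def render(row_seq, col_seq):
    """Draw the dot plot as text: row_seq along the top, col_seq down the left."""
    lines = ["  " + " ".join(row_seq)]  # header row with the first sequence
    for c, row in zip(col_seq, dot_matrix(row_seq, col_seq)):
        # '*' marks a match (a "dot"); '.' marks no match
        lines.append(c + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Illustrative sequences (an assumption, not the lecture's example).
print(render("ACGCG", "ACACG"))
```

In the printed grid, a run of `*` along a diagonal marks a stretch where the two sequences match position by position.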