What are Math and Computer Science doing in Biology?
Dan Gusfield, UC Davis
March 29, 2012, Denison University

One limited perspective

Short Answer:
• Bioinformatics
• Computational Biology
• Statistical Biology
• Mathematical Biology
• …..

My focus: computational biology – “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)

[Diagram labels: Biology; Computer Science; Math & Statistics; Computational biology, Bioinformatics]

How can non-biologists, non-chemists understand or contribute to biology? Where does our license come from?

My fear 30 years ago was that I would first need to master material like:
Citric Acid Cycle
Amylase + starch substrate
Bond representation of triplex DNA. This view is down the long axis. The “third” strand is colored.
MYOGLOBIN - An oxygen carrier in muscle
Here is another way of visualising the tertiary structure
Tertiary Structure: Spot the tertiary folding.
Quaternary Structure: Spot the Haem group
LYSOZYME: Including the side chains. Can you see any active site now?

It looked very daunting!

But, by some wonderful fact or fluke of nature, a huge simplification is possible and very productive. Molecular information is (partially) digital. And nature takes notes (leaves historical footnotes).

PRIMARY STRUCTURE
Primary structure is described by the sequence of amino acids in the chain. This diagram shows the primary structure of PIG INSULIN, a protein hormone, as discovered by Frederick Sanger. He was given a Nobel prize in 1958.

Hemoglobin – Primary Structure
NH2-Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-…..
beta subunit amino acid sequence

It has been amazingly productive to treat protein and DNA molecules just as text: collecting, comparing, creating molecular sequences.
No hard-core chemistry or biology - just text comparison and analysis.

Fluke of nature? An imposition of the human mind? Lucky break for us?

The first major success story:
Simian Sarcoma Virus onc Gene, v-sis is derived from the Gene (or Genes) of a Platelet-Derived Growth Factor. R.F. Doolittle et al., Science 1983

“The transforming protein of a primate sarcoma virus and a platelet-derived growth factor are derived from the same or closely related cellular genes. This conclusion is based on the demonstration of extensive sequence similarity.” From the abstract

Sequence similarity suggested that genes involved in cancer were functionally related to genes involved in blood platelet growth, two biological phenomena that had previously seemed unrelated. This was a very surprising result, and a novel kind of reasoning.

But Biology via Sequence Analysis is now completely accepted and mainstream. Some biologists have even replaced their wet-labs with computer labs, doing biology only by sequence analysis.

“The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains …” J. Monod

“The rosetta stone of modern biology appears to be sequence comparative analysis.” T. Smith

Success stories from sequence analysis are now routine. Why? Mostly shared history and duplication with modification, but also shared physical and chemical constraints.

“We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.” Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine

Take-home message
High sequence similarity implies significant functional and/or structural similarity.

[Diagram labels: Ancestor; paralogs; Species A; Species B; orthologs]

Can we reverse the statement? Two sequences with high functional similarity should have similar sequences.
The success of sequence comparison and analysis, and the development of efficient DNA sequencing, has led to huge projects to capture, accumulate, store, curate, and annotate bio-molecular sequences.

GenBank, BLAST, Human Genome Project, specialized databases.

Today it has around 300 trillion bases!

Examples of large-scale sequencing projects
1,000 Genomes Project. http://www.1000genomes.org/.
BGI, 10,000 whole human genomes.
BGI, 1,000 individuals with IQ>145 versus 1,000 random individuals.
BGI, Autism Genetic Resource Exchange, 10,000 individuals.
BGI, CHOP, many childhood diseases.
Genome Institute, Washington U. St. Louis, 600 childhood cancer patients; $65 million over three years. 150 tumor & normal cancer genome pairs.
Epitwin: TwinsUK & BGI, $30 million for epigenetic differences in 5,000 twins.
Netherlands Genome Project: BGI, 750 genomes (250 trios) in Dutch biobanks.
Epi4K: Duke et al., $25M to sequence 4,000 genomes for epilepsy research.
U. Michigan Cancer Center: Clinical next-gen sequencing of cancer patients.
R. Michelmore

In the near future: DNA sequence = an inexpensive commodity generated on a variety of platforms
$1,000 ($100?) human genome coming
=> $1,000 genome for many animals and plants
   $100 genome for fungi
   $10 genome for bacteria en masse
Metagenomics: sequencing of communities
   biomes (humans = 100x more bacteria)
   novel & unculturable organisms
   characterization of diversity & unique genes
Not just genomic DNA sequence:
   DNA modifications, epigenomics & copy number variation (CNV)
   expression analysis (RNAseq not arrays)
Enormous amounts of sequence data
Need for major data handling capabilities
Vital role for bioinformatics just to manage the data
R. Michelmore

More recently: Metagenomics, metabolomics, proteomics, microbiomics, epigenomics, transcriptomics, methylomics….

High-throughput biology is generating massive amounts of data, sometimes too large even to store.
NYT November 30, 2011:
“The Beijing Genome Center has enough sequencing capacity to sequence 2,000 human genomes per day.”
“World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high.”

OK, so sequences and sequence analysis are important, but where’s the promised computer science and math?

Simple sequence comparison, comparing new sequences against sequences in databases, has been extremely productive. But how do we extract the most biological value from sequences?

The Larger Challenge and Opportunity: How to utilize the deluge of sequence data?

What significant patterns do you see in:
[Figure: “Making sense of the code”. Labels: Regulatory Motifs; Genes; Noncoding regions; are located in a “sea” of. Credit: Damien Peltier]

How do we analyze so much data? How do we know that patterns we see are meaningful? How do we know that similarities we see are based in biology and not just random happenstance?

Humans are good at seeing patterns, even in random events and data.
[Figures: From Mars; From the bible code]

What we need:
• Clear, biologically meaningful definitions of similarity and patterns. Biological models of mutation and evolution - how sequences evolve.
• Metrics - how similar, how good the fit.
• Efficient methods to compute similarities, find patterns, and compute the metrics.
• Efficient methods to assess the “significance” of the finds.

For those tasks, we need
• Biology - to define and model meaningful types of similarities and patterns to look for.
• Mathematics - to propose and understand the models and metrics.
• Computer Science - for efficient sequence analysis and search algorithms.
• Statistics - to measure the “significance” (deviation from random happenstance) of the finds.
“It costs more to analyze a genome than to sequence a genome.” D. Haussler

A small part of the story in greater detail

Basic problem: define and compute the similarity of two sequences
• Biological-Mathematical model: Two sequences are similar when…
• Algorithmic problem: How do you compute the sequence similarity of two sequences S1 and S2?

“All models are wrong, but some are useful.” George Box

Modeling sequence evolution
S1: AATCCAGTTTTACAGATCCTC    length m=21
S2: AATAGTTTTACAGACTCAT      length n=19

Alignment: Insert spaces into, or before or after, the two sequences to make them the same length.
S1: -AATCCAGTTTTACAGA-TCCTC    length 23
S2: AATA--GTTTTACAGACTCAT--    length 23

Match, Mismatch, Space, Gap

One measure of the goodness of the alignment is
(# of matches) - (# of mismatches) - (# of spaces)

Given a metric to measure the goodness of any specific alignment, we define the Similarity of two sequences S1 and S2 as the maximum of
(# of matches) - (# of mismatches) - (# of spaces)
over all possible alignments of S1 and S2.

But how do we compute similarity?

Mathematics finds a formula: [formula not preserved in this text]
So there are a huge number of alignments.

Mathematics counts the number of alignments

Length of the sequences    Number of alignments
10                         184,756
20                         ~1.4e11
100                        ~9.0e58

There are too many alignments to try each one out, but clever, efficient algorithms, using the technique of Dynamic Programming, allow the efficient computation of similarity (the Computer Science contribution).

For any length n, the number of operations needed to compute the similarity of two n-length sequences, via Dynamic Programming, is proportional to n squared (i.e., n^2).
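The dynamic programming idea can be sketched in a few lines of Python. This is only an illustration under the scoring stated on the slides (+1 per match, -1 per mismatch, -1 per space); the function name `similarity` is our own, and the talk does not say which implementation the author uses. The last lines check an observation, not a claim from the slides: the central binomial coefficient C(2n, n) happens to reproduce the figures in the table of alignment counts, so it is used here as an assumed stand-in for the formula that is not preserved in this text.

```python
from math import comb


def similarity(s1: str, s2: str) -> int:
    """Maximum of (#matches - #mismatches - #spaces) over all alignments.

    Fills a table row by row; the work is proportional to len(s1) * len(s2).
    """
    m, n = len(s1), len(s2)
    # prev[j] holds the best score for aligning the current prefix of s1
    # with s2[:j]. Aligning an empty prefix with s2[:j] needs j spaces.
    prev = [-j for j in range(n + 1)]
    for i in range(1, m + 1):
        # Aligning s1[:i] with an empty prefix of s2 needs i spaces.
        cur = [-i] + [0] * n
        for j in range(1, n + 1):
            # Last column pairs s1[i-1] with s2[j-1]: match or mismatch.
            pair = prev[j - 1] + (1 if s1[i - 1] == s2[j - 1] else -1)
            # Last column pairs s1[i-1] with a space.
            space_in_s2 = prev[j] - 1
            # Last column pairs a space with s2[j-1].
            space_in_s1 = cur[j - 1] - 1
            cur[j] = max(pair, space_in_s2, space_in_s1)
        prev = cur
    return prev[n]


# The two sequences from the slides.
S1 = "AATCCAGTTTTACAGATCCTC"
S2 = "AATAGTTTTACAGACTCAT"
print(similarity(S1, S2))

# Assumed counting formula: C(2n, n) matches the table's entry for n = 10.
print(comb(20, 10))
```

Only one row of the table is kept at a time, so memory grows with the length of the shorter dimension rather than with n^2; the number of operations is still proportional to n^2.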
Number of operations needed to compute Similarity

Length of the sequences    Operations using explicit enumeration    Operations using Dynamic Programming
10                         184,756                                  100
20                         ~1.4e11                                  400
100                        ~9.0e58                                  10,000

So similarity can be found quickly, but is the similarity significant?

Elegant statistical methods can be used to determine the probability that two random sequences would have that level of similarity or more. We don’t reject the possibility that two sequences are similar due only to chance, unless the computed probability is very low.

Extensions: Finding patterns in multiple sequences

human  ACTAACCGGGAGATTTCAGA
chimp  AAGTTCCGGGAGATTTCCA
mouse  TAGTTATCCGGGAGATTAGA
rat    AAAACCGGTAGATTTCAGG

Multiple Sequence Alignment

human  AC--TAACCGGGAGATTTCAGA
chimp  AAGTT--CCGGGAGATTTCC-A
mouse  TAGTTATCCGGGAGATT--AGA
rat    AA---AACCGGTAGATTTCAGG

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch      TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton     CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea        C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco    AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant  ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip     CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat      GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------A
Duckweed   TATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch      TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton     ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea        GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco    GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant  GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip     CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat      CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed   TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch      CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton     T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea        TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco    CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant  TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch      TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip     TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat      GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed   CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Again we need a model of which multiple sequence alignments are biologically meaningful; a metric to score the goodness of a multiple alignment; an algorithm to compute multiple alignments, based on the metric; and statistical methods to evaluate the significance of an alignment.

Summarizing
• Biology by sequence analysis opens the door widely to non-biologists.
• Models of sequence evolution and metrics used in sequence analysis are articulated by Biology and Mathematics.
• Computer Science contributes efficient algorithms to do the analysis and compute the metrics.
• Statistics is needed to evaluate the significance of the computed results.
• Sequence analysis is just one of many ways that computer science and mathematics have entered biology.

In general: The computational biology work flow

Biological Knowledge: e.g., DNA mutates
Biological model: e.g., DNA replication infidelity model, mutagens, radiation models, etc.
Mathematical model and assumptions: e.g., assumption about mutation distribution or preferential attachment
Mathematical problem: e.g., given the mathematical model, find spots where mutation rates are high or low in a statistically significant way
Algorithmic problem: e.g., what algorithm should I develop to efficiently find hotspots?
Programming problem: e.g., data storage, memory, OOP and languages, optimizations, GUI

Another illustration, involving phylogenetic trees rather than sequences.

Comparing Trees: Tanglegrams
• A Tanglegram is a pair of phylogenetic trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.
• One tree represents the evolution of a set of species, and the other tree represents the evolution of a set of parasites that inhabit the species.
• A straight line connects each leaf in one tree to the leaf with the same label in the other tree.
• The number of crossing lines is a measure of the similarity of the trees.
• A small measure suggests that the species and parasites coevolved.

Images courtesy of NTBG

But the trees can be redrawn to reduce the number of crossings. So we have the algorithmic problem of finding planar layouts of the two trees that minimize the number of crossings of the lines between the leaves. That minimum number is the metric of similarity. How do we compute it, and how can we evaluate significance?

Thank you