Challenges for computer science as a part of Systems Biology
Benno Schwikowski
Institute for Systems Biology
Seattle, WA

Towards integrative models

DNA
- Sequence
- Genomic locus
- Domain content
- Intron/exon structure
- Regulatory motifs
- Chemical modifications
- SNPs
- Accessibility
- Variation

mRNA
- Abundance
- Regulatory information
- Initiation/termination signals
- Splice variants

Protein interaction
- Interaction partner
- Direct/indirect
- Affinity
- Effect

Protein
- Abundance
- State
- Localization
- 3D structure
- Functional characterization
- Half-life
- Active sites
- Biochemical function
- Cellular role

[Diagram labels: Species, Conditions/time, Genes]

Math and Computer Science Challenges Benno Schwikowski

Challenge: Integrative models
• …Across genes and proteins: Many genes involved (e.g., multifactorial diseases)
• …Across model systems: Lack of experimental platforms in target system
• …Across levels of biological organization (e.g., gene regulatory processes involving phosphorylation)
• …Across experiments: Robustness against errors in mass spectrometry, mRNA measurements
• …Across timescales

Challenge: Capturing evolutionary constraints
DNA, RNA, Proteins, Modules, Organelles, Cells, Organs, Individuals, Populations, Ecologies
"Nothing in biology makes sense except in the light of evolution."
Theodosius Dobzhansky

Challenge: Which tools and experiments to use
Challenge: Choosing experiments
• Machine Learning: Determine the most likely classification/parameterization on the basis of a randomly sampled dataset
• Active Learning: Allow an algorithm to query selected data points, using the result of previous queries.
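
The contrast between the two bullets on the "Choosing experiments" slide can be illustrated with a toy problem that is not from the slides: locating an unknown threshold on the interval [0, 1] when each "experiment" only reports whether a chosen point lies at or above it. The sketch below is an assumption-laden illustration; `make_oracle`, `passive_learn` and `active_learn` are hypothetical names, and the bisection strategy merely stands in for "querying selected data points using the result of previous queries".

```python
import random


def make_oracle(threshold):
    """Hypothetical experiment: reports True when x is at or above the threshold."""
    return lambda x: x >= threshold


def passive_learn(oracle, n_queries, rng):
    """Bracket the threshold using randomly sampled query points."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = rng.random()  # point chosen without looking at earlier answers
        if oracle(x):
            hi = min(hi, x)
        else:
            lo = max(lo, x)
    return lo, hi


def active_learn(oracle, n_queries):
    """Bracket the threshold by always querying the midpoint of the current bracket."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = (lo + hi) / 2  # next query depends on all previous answers
        if oracle(x):
            hi = x
        else:
            lo = x
    return lo, hi


if __name__ == "__main__":
    oracle = make_oracle(0.3141)  # arbitrary value chosen for the demo
    p_lo, p_hi = passive_learn(oracle, 10, random.Random(0))
    a_lo, a_hi = active_learn(oracle, 10)
    print("passive bracket width:", p_hi - p_lo)
    print("active bracket width: ", a_hi - a_lo)
```

In this toy setting the actively chosen queries halve the remaining uncertainty at every step, whereas the random queries only help when they happen to fall inside the current bracket.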
Challenge: Relations between system variables can be quite complex
Yuh, Bolouri, Davidson, Science, 1998

Challenge: Develop models that allow extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...

CLUSTALW (1.74) multiple sequence alignment

Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch
ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch
CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch
ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton Pea Tobacco Ice-plant Larch Turnip Wheat Duckweed
T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Challenge: Developing models that allow extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: tree over the five sequences; internal nodes labeled ACGT, ACGT, ACGT, ACGG]
Parsimony score: 1
J. Comp Biol. 2002

An Exact Algorithm (generalizing Sankoff and Rousseau 1975)
$W_u[s]$ = best parsimony score for the subtree rooted at node $u$, if $u$ is labeled with string $s$.
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \left( W_v[t] + d(s, t) \right)$
[Figure: tree with a table of $4^k$ entries at each node, with entries such as ACGG: 1, ACGT: 0 and ACGG: 2, ACGT: 1. The leaf AGTCGTACGTG has ACGG: +∞, ACGT: 0; the leaf TCGTGACGGTG has ACGG: 0, ACGT: +∞. Leaf sequences: AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG]
J. Comp Biol. 2002

What are good challenges to tackle?
• Biological/medical questions asked
• Experimental technologies to acquire a lot of relevant data
• Available datasets with a formalized notion of "data quality"

Memory complexity: $O(k \cdot 4^{2k})$ per node
Time complexity: Total time $O(n k (4^{2k} + l))$
- $n$: Number of species
- $l$: Average sequence length
- $k$: Motif length
J. Comp Biol. 2002

Technology-based challenges: Universal DNA Tag Systems

Existing applications in high-throughput technologies
• Universal DNA arrays
• Padlock probes
• LYNX mRNA technology

Formalization
Define: weight(A/T) = 1, weight(C/G) = 2
weight(AACTTG) = 1+1+2+1+1+2 = 8
melting temperature(AACTTG) = 2·weight

l-u code problem
Given two integers, l < u, find the largest set of tags such that
- Each tag has weight u
- Each string of weight l occurs at most once
J. Comp Biol. 2000 & 2003

Challenge: Visualization
Andrea Weston et al. @ ISB & Cytoscape

Challenge: Visualization
Cytoscape, pre-release 2.0

A computer scientist's perspective
"Biology is so digital, and incredibly complicated […] I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level."
Donald Knuth, 7 Dec 1993
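
Returning to the "An Exact Algorithm" slide, the table-filling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: it takes d(s, t) to be the Hamming distance, gives a leaf score 0 for every k-mer that occurs in its sequence and infinity otherwise, and uses a made-up tree shape over the five example sequences (the slides do not give the topology). The names `parsimony_table` and `best_parsimony` are hypothetical.

```python
from itertools import product
from math import inf


def hamming(s, t):
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(s, t))


def parsimony_table(node, k, alphabet="ACGT"):
    """Return {s: W[s]} over all 4^k strings s for the subtree rooted at node.

    A node is either a leaf (a sequence string) or a tuple of child nodes.
    """
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    if isinstance(node, str):
        # Assumed leaf rule: 0 if s occurs in the sequence, infinity otherwise.
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else inf) for s in all_kmers}
    child_tables = [parsimony_table(child, k, alphabet) for child in node]
    # Internal node: sum over children of the cheapest (child label, edge cost) pair.
    return {
        s: sum(
            min(w + hamming(s, t) for t, w in table.items())
            for table in child_tables
        )
        for s in all_kmers
    }


def best_parsimony(tree, k):
    """Return the minimum root score and all root labels that achieve it."""
    table = parsimony_table(tree, k)
    score = min(table.values())
    return score, sorted(s for s, w in table.items() if w == score)


# Hypothetical tree shape; only the leaf sequences come from the slides.
tree = (
    (("AGTCGTACGTGAC", "AGTAGACGTGCCG"), "ACGTGAGATACGT"),
    ("GAACGGAGTACGT", "TCGTGACGGTGAT"),
)
score, motifs = best_parsimony(tree, 4)
print(score, motifs)
```

Each internal node compares every candidate label against every child label, which is where a $4^{2k}$ factor of the kind quoted on the complexity slide comes from; the sketch makes no attempt at the optimizations a practical implementation would need.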