Challenges for computer science and math as a part of Systems Biology

Challenges for computer science
as a part of Systems Biology
Benno Schwikowski
Institute for Systems Biology
Seattle, WA
Towards integrative models
DNA
- Sequence
- Genomic locus
- Domain content
- Intron/exon structure
- Regulatory motifs
- Chemical modifications
- SNPs
- Accessibility
- Variation
mRNA
- Abundance
- Regulatory information
- Initiation/termination signals
- Splice variants
Protein interaction
- Interaction partner
- Direct/indirect
- Affinity
- Effect
Protein
- Abundance
- State
- Localization
- 3D structure
- Functional characterization
- Half-life
- Active sites
- Biochemical function
- Cellular role
Figure labels: Species, Conditions/time, Genes
Math and Computer Science Challenges
Benno Schwikowski
Challenge: Integrative models
• …Across genes and proteins: Many genes involved
(e.g., multifactorial diseases)
• …Across model systems: Lack of experimental
platforms in target system
• …Across levels of biological organization
(e.g. gene regulatory processes involving
phosphorylation)
• …Across experiments: Robustness against errors
in mass spectrometry, mRNA measurements
• …Across timescales
Challenge: Capturing evolutionary constraints
DNA
RNA
Proteins
Modules
Organelles
Cells
Organs
Individuals
Populations
Ecologies
“Nothing in biology makes sense except in the light of evolution.”
Theodosius Dobzhansky
Challenge: Which tools and experiments to use
Challenge: Choosing experiments
• Machine Learning
Determine most likely
classification/parameterization on the basis of a
randomly sampled dataset
• Active Learning
Allow an algorithm to query selected data points,
using the result of previous queries.
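The difference between the two settings can be sketched on a toy problem: locating a hidden threshold on the interval [0, 1]. This is only an illustration, not a method from the talk. The `oracle` function stands in for an experiment, and its threshold value, the function names, and the query budget are all assumptions made for this sketch.

```python
import random

def oracle(x, threshold=0.3141):
    """Hypothetical experiment: True if x is at or above the hidden threshold."""
    return x >= threshold

def passive_estimate(n_queries, rng):
    """Machine learning setting: the data points are sampled at random."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = rng.random()
        if oracle(x):
            hi = min(hi, x)   # smallest point known to be above the threshold
        else:
            lo = max(lo, x)   # largest point known to be below the threshold
    return (lo + hi) / 2

def active_estimate(n_queries):
    """Active learning setting: each query is chosen from the previous answers."""
    lo, hi = 0.0, 1.0
    for _ in range(n_queries):
        x = (lo + hi) / 2     # query the midpoint of the remaining interval
        if oracle(x):
            hi = x
        else:
            lo = x
    return (lo + hi) / 2

if __name__ == "__main__":
    print("passive:", passive_estimate(10, random.Random(0)))
    print("active: ", active_estimate(10))
```

With the same budget of n queries, the active learner halves the remaining interval at every step, so its error is at most 2^-(n+1). The error of the randomly sampled estimate depends on where the samples happen to fall.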
Challenge: Relations between system
variables can be quite complex
Yuh, Bolouri, Davidson, Science, 1998
Challenge: Develop models that allow
extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
CLUSTALW(1.74) multiple sequence alignment
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------A
TATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA
Cotton
Pea
Tobacco
Ice-plant
Larch
Turnip
Wheat
Duckweed
T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
Challenge: Developing models that allow
extremely efficient algorithms
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: tree over these sequences; node labels shown are ACGT (three times) and ACGG (once)]
Parsimony score: 1
J. Comp Biol. 2002
An Exact Algorithm
(generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.
$$W_u[s] = \sum_{v:\ \text{child of } u} \min_{t} \bigl( W_v[t] + d(s, t) \bigr)$$
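[Figure: phylogenetic tree with a table of 4^k entries at each node, one entry per string of length k; the entries for ACGG and ACGT are shown.
Leaf sequences: AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG.
Leaf tables: 0 for a string that occurs in the leaf sequence, +∞ otherwise (e.g. ACGG: +∞, ACGT: 0 for AGTCGTACGTG; ACGG: 0, ACGT: +∞ for TCGTGACGGTG).
Internal-node tables shown: (ACGG: 1, ACGT: 0), (ACGG: 2, ACGT: 1), (ACGG: 1, ACGT: 1), (ACGG: 0, ACGT: 2).]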
J. Comp Biol. 2002
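The recurrence W_u[s] can be sketched as a small bottom-up dynamic program. This is a minimal illustration under stated assumptions, not the published implementation. It takes d(s, t) to be the Hamming distance. It sets a leaf entry to 0 when the string occurs in the leaf sequence and to infinity otherwise. The tree topology in `EXAMPLE_TREE` is invented for this sketch, because only the leaf sequences survive from the slide. All function and variable names are hypothetical.

```python
from itertools import product
from math import inf

def hamming(s, t):
    """Number of positions at which two equal-length strings differ (assumed d)."""
    return sum(a != b for a, b in zip(s, t))

def table(node, k, kmers):
    """Return W_node as a dict: k-mer -> best parsimony score of the subtree.

    A node is either a leaf sequence (str) or a tuple of child nodes.
    """
    if isinstance(node, str):
        present = {node[i:i + k] for i in range(len(node) - k + 1)}
        return {s: (0 if s in present else inf) for s in kmers}
    child_tables = [table(child, k, kmers) for child in node]
    # W_u[s] = sum over children v of min over t of (W_v[t] + d(s, t))
    return {
        s: sum(min(w[t] + hamming(s, t) for t in kmers) for w in child_tables)
        for s in kmers
    }

def best_parsimony(tree, k):
    """Return (root label, score) minimizing W_root over all 4^k strings."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    w = table(tree, k, kmers)
    best = min(w, key=w.get)
    return best, w[best]

# Leaf sequences from the slide; the grouping into a tree is an assumption.
EXAMPLE_TREE = (
    ("AGTCGTACGTG", "ACGGGACGTGC"),
    ("ACGTGAGATAC", ("GAACGGAGTAC", "TCGTGACGGTG")),
)

if __name__ == "__main__":
    print(best_parsimony(EXAMPLE_TREE, 4))
```

Each internal node compares all pairs of length-k strings for every child, which is where the 4^{2k} term in the running time comes from.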
What are good challenges to tackle?
• Biological/medical questions asked
• Experimental technologies to acquire a lot of
relevant data
• Available datasets with a formalized notion of “data
quality”
Memory complexity: $O(k \cdot 4^{2k})$ per node
Time complexity: total time $O(n k (4^{2k} + l))$
where n = number of species, k = motif length, l = average sequence length
J. Comp Biol. 2002
Technology-based challenges:
Universal DNA Tag Systems
Existing applications in high-throughput technologies
• Universal DNA arrays
• Padlock probes
• LYNX mRNA technology
Formalization
Define: weight(A/T) = 1, weight(C/G) = 2
weight(AACTTG) = 1+1+2+1+1+2 = 8
⇒ melting temperature(AACTTG) = 2·weight(AACTTG)
l-u code problem
Given two integers l < u, find the largest set of tags such that
• Each tag has weight ≥ u
• Each string of weight ≥ l occurs at most once
J. Comp Biol. 2000 & 2003
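The weight definition and the two conditions of the l-u code problem can be checked by brute force for a small tag set. The sketch below only verifies a given set; it does not search for the largest one. It reads "occurs at most once" as "occurs at most once as a substring anywhere in the tag set", which is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Base weights as defined on the slide.
WEIGHT = {"A": 1, "T": 1, "C": 2, "G": 2}

def weight(s):
    """Sum of base weights: A/T count 1, C/G count 2."""
    return sum(WEIGHT[base] for base in s)

def melting_temperature(s):
    """Melting temperature as formalized on the slide: 2 * weight."""
    return 2 * weight(s)

def is_lu_code(tags, l, u):
    """Check both l-u code conditions for a list of tags (brute force)."""
    if any(weight(tag) < u for tag in tags):
        return False          # some tag is lighter than u
    occurrences = Counter()
    for tag in tags:
        for i in range(len(tag)):
            for j in range(i + 1, len(tag) + 1):
                sub = tag[i:j]
                if weight(sub) >= l:
                    occurrences[sub] += 1
    # every substring of weight >= l may occur only once across all tags
    return all(count == 1 for count in occurrences.values())

if __name__ == "__main__":
    print(weight("AACTTG"), melting_temperature("AACTTG"))
    print(is_lu_code(["AACTTG"], 4, 8))
    print(is_lu_code(["AACTTG", "CCAACT"], 4, 8))
```

In the second call the substring AAC (weight 4) appears in both tags, so the set is rejected.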
Challenge: Visualization
Andrea Weston et al.
@ ISB & Cytoscape
Challenge: Visualization
Cytoscape, pre-release 2.0
A computer scientist’s perspective
Donald Knuth
“Biology is so digital, and incredibly
complicated […] I can't be as
confident about computer
science as I can about biology.
Biology easily has 500 years of
exciting problems to work on, it's
at that level.”
Donald Knuth, 7 Dec 1993