What are Math and Computer Science doing in Biology?

Dan Gusfield
UC Davis
March 29, 2012
Denison University
One limited perspective
Short Answer:
• Bioinformatics
• Computational Biology
• Statistical Biology
• Mathematical Biology
• …..
My focus
computational biology
–“An interdisciplinary field that applies the
techniques of computer science, applied
mathematics and statistics to address
biological problems” (Wikipedia)
[Diagram: Biology, Computer Science, and Math & Statistics meet in Computational biology, Bioinformatics]
How can non-biologists,
non-chemists understand
or contribute to biology?
Where does our license
come from?
My Fear 30 years ago was
that I would first need to
master material like:
Citric Acid Cycle
Amylase + starch substrate
Bond representation of triplex DNA. This view is down the long axis. The “third” strand is colored.
MYOGLOBIN - An oxygen carrier in muscle
Here is another way of visualising the tertiary structure
Tertiary
Structure
Spot the
Tertiary
folding.
Quaternary
Structure
Spot the
Haem group
LYSOZYME
Including the Side
chains.
Can you see any
active site now?
It looked very daunting!
But,
By some wonderful
fact or fluke of nature,
a huge simplification is
possible and very
productive.
Molecular information is (partially)
Digital.
And, nature takes notes
(leaves historical footnotes).
PRIMARY STRUCTURE
Primary structure is described by the sequence of Amino Acids in the chain
This diagram shows the primary structure of
PIG INSULIN, a protein hormone as
discovered by Frederick Sanger.
He was given a Nobel prize in 1958.
Hemoglobin – Primary
Structure
NH2-Val-His-Leu-Thr-Pro-Glu-Glu-
Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-…..
beta subunit amino acid sequence
It has been amazingly
productive to treat
protein and DNA
molecules just as text:
collecting, comparing,
creating molecular
sequences.
No hard-core chemistry or
biology - just text comparison
and analysis.
Fluke of nature?
An imposition of the human
mind?
Lucky break for us?
The first major success
story:
Simian Sarcoma Virus onc Gene, v-sis is
derived from the Gene (or Genes) of a
Platelet-Derived Growth Factor.
R.F. Doolittle et al, Science 1983
“The transforming protein of a
primate sarcoma virus and a
platelet-derived growth factor are
derived from the same or closely
related cellular genes. This
conclusion is based on the
demonstration of extensive
sequence similarity.”
From the abstract
Sequence similarity suggested
that genes involved in cancer
were functionally related to genes
involved in blood platelet growth,
two biological phenomena that
had previously seemed unrelated.
This was a very surprising result,
and a novel kind of reasoning.
But,
Biology via Sequence Analysis
is now completely accepted,
main-stream.
Some biologists have even
replaced their wet-labs with
computer labs, doing biology
only by sequence analysis.
“The ultimate rationale behind all
purposeful structures and behavior
of living things is embodied in the
sequence of residues of nascent
polypeptide chains …” J. Monod
“The rosetta stone of modern biology
appears to be sequence comparative
analysis.” T. Smith
Success stories from sequence
analysis are now routine. Why?
Mostly shared history and duplication
with modification, but also
shared physical, chemical
constraints.
“We didn't know it at the time,
but we found out everything
in life is so similar, that the same
genes that work in flies are the
ones that work in humans.”
Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine
Take-home message
High sequence similarity implies significant
functional and/or structural similarity
[Figure: gene tree from a common Ancestor to Species A and Species B, with paralogs and orthologs labeled]
Can we reverse the
statement?
Two sequences with high
functional similarity should
have similar sequences.
The success of sequence
comparison and analysis, and the
development of efficient DNA
sequencing, have led
to huge projects to capture,
accumulate, store, curate, and
annotate bio-molecular sequences.
Genbank, Blast, Human Genome
Project, specialized databases.
Today it has around 300 trillion bases!
Examples of large-scale sequencing projects
1,000 Genomes Project. http://www.1000genomes.org/.
BGI, 10,000 whole human genomes.
BGI, 1,000 individuals with IQ>145 versus 1,000 random individuals.
BGI, Autism Genetic Resource Exchange, 10,000 individuals.
BGI, CHOP, many childhood diseases.
Genome Institute, Washington U. St. Louis, 600 childhood cancer patients;
$65 million over three years. 150 tumor & normal cancer genome pairs.
Epitwin: TwinsUK & BGI $30 million for epigenetic differences in 5,000 twins.
Netherlands Genome Project: BGI 750 genomes (250 trios) in Dutch biobanks.
Epi4K: Duke et al. $25M to sequence 4,000 genomes for epilepsy research.
U. Michigan Cancer Center: Clinical next-gen sequencing of cancer patients.
R. Michelmore
In near future: DNA sequence = an inexpensive commodity
generated on a variety of platforms
$1,000 ($100?) human genome coming =>
$1,000 genome for many animals and plants
$100 genome for fungi
$10 genome for bacteria en masse
Metagenomics: sequencing of communities
biomes (humans = 100x more bacteria)
novel & unculturable organisms
characterization of diversity & unique genes
Not just genomic DNA sequence:
DNA modifications
epigenomics & copy number variation (CNV)
expression analysis (RNAseq not arrays)
Enormous amounts of sequence data
Need for major data handling capabilities
Vital role for bioinformatics just to manage the data
R. Michelmore
More recently: Metagenomics,
metabolomics, proteomics,
microbiomics, epigenomics,
transcriptomics, methylomics….
High-throughput biology generating
massive amounts of data;
sometimes too large even to store.
NYT November 30, 2011:
“The Beijing Genome Center has
enough sequencing capacity to
sequence 2,000 human genomes
per day.”
“World capacity is now 13 quadrillion
DNA bases a year, an amount that
would fill a stack of DVDs two miles
high.”
OK, so sequences and
sequence analysis are
important, but where’s the
promised computer science
and math?
Simple sequence comparison,
comparing new sequences against
sequences in databases, has been
extremely productive.
But how do we extract the most
biological value from sequences?
The Larger Challenge and Opportunity:
How to utilize the deluge of sequence data?
What significant patterns do
you see in:
Making sense of the code
Genes and Regulatory Motifs are located in a “sea” of Noncoding regions
Damien Peltier
How do we analyze so much data?
How do we know that patterns we
see are meaningful? How do we
know that similarities we see are
based in biology and not just
random happenstance?
Humans are good at seeing
patterns, even in random events
and data.
From
Mars
From the bible code
What we need:
• Clear, biologically meaningful definitions of
similarity, patterns. Biological models of
mutation and evolution - how sequences
evolve.
• Metrics - how similar, how good the fit.
• Efficient methods to compute similarities, and
find patterns, and compute the metrics.
• Efficient methods to assess the “significance”
of the finds.
For those tasks, we need
• Biology - to define and model meaningful
types of similarities and patterns to look for.
• Mathematics - to propose and understand the
models and metrics.
• Computer Science - for efficient sequence
analysis and search algorithms.
• Statistics - to measure the “significance”
(deviation from random happenstance) of the
finds.
computational biology
–“An interdisciplinary field that applies the
techniques of computer science, applied
mathematics and statistics to address
biological problems” (Wikipedia)
[Diagram: Biology, Computer Science, and Math & Statistics meet in Computational biology, Bioinformatics]
“It costs more to analyze a genome
than to sequence a genome.”
D. Haussler
A small part of the story in
greater detail
Basic problem: define and
compute the similarity of two
sequences
• Biological-Mathematical model: Two
sequences are similar when…
• Algorithmic problem: How do you compute
the sequence similarity of two sequences
S1 and S2.
“All models are
wrong, but some
are useful.”
George Box
Modeling sequence evolution
S1: AATCCAGTTTTACAGATCCTC   (length m=21)
S2: AATAGTTTTACAGACTCAT     (length n=19)
Alignment: Insert spaces into, or before or after
the two sequences to make them the same length.
S1: -AATCCAGTTTTACAGA-TCCTC   (length 23)
S2: AATA--GTTTTACAGACTCAT--   (length 23)
Match, Mismatch, Space, Gap
One measure of the goodness of the alignment is the
(# of matches) - (# of mismatches) - (# of spaces)
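As a minimal sketch of this measure, the function below scores one fixed alignment. The name `alignment_score` and the use of "-" as the space character are assumptions made for illustration; a column containing a space counts as one space.

```python
def alignment_score(row1: str, row2: str, space: str = "-") -> int:
    """Return (# matches) - (# mismatches) - (# spaces) for one fixed alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    matches = mismatches = spaces = 0
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column may not consist of two spaces")
        if a == space or b == space:
            spaces += 1          # one sequence has a space in this column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mismatches - spaces


# 3 matches, 1 mismatch, 1 space
print(alignment_score("AC-GT", "ACTGA"))  # prints 1
```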
Given a metric to measure the
goodness of any specific alignment,
we define the Similarity of two
sequences S1 and S2 as:
The Maximum
(# matches) - (# mismatches) - (# spaces)
over all possible alignments of
S1 and S2.
But how do we compute similarity?
Mathematics finds a formula:
So there are a huge number of
alignments.
Mathematics counts the number of alignments
Length of the sequences    Number of alignments
10                         184,756
20                         ~1.4e11
100                        ~9.0e58
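The counting formula itself did not survive in this transcript. The three figures in the table are consistent with the central binomial coefficient C(2n, n) for two sequences of length n, so the sketch below assumes that formula; the function name `count_alignments` is hypothetical.

```python
import math


def count_alignments(n: int) -> int:
    # Assumption: C(2n, n), chosen because it reproduces the table's figures.
    return math.comb(2 * n, n)


for n in (10, 20, 100):
    print(n, f"{count_alignments(n):.2e}")
```

For n = 10 this gives exactly 184,756, the first entry in the table.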
There are too many alignments
to try each one out, but clever,
efficient algorithms, using the
technique of Dynamic Programming,
allow the efficient computation of
similarity. (Computer Science
contribution).
For any length n, the number of
operations needed to compute
the similarity of two n-length
sequences, via Dynamic
Programming, is proportional to
n squared (i.e., n^2).
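A minimal dynamic-programming sketch of this computation follows, assuming the scoring used earlier in the talk (+1 per match, -1 per mismatch, -1 per space). The function name `similarity` is hypothetical. The table has (m+1) x (n+1) cells and each cell takes constant work, which is where the n^2 growth comes from.

```python
def similarity(s1: str, s2: str) -> int:
    """Maximum of (# matches) - (# mismatches) - (# spaces) over all alignments."""
    m, n = len(s1), len(s2)
    # prev[j]: best score for aligning the first i-1 characters of s1
    # with the first j characters of s2 (row 0: j spaces, score -j).
    prev = [-j for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [-i] + [0] * n                      # i characters against nothing
        for j in range(1, n + 1):
            diag = prev[j - 1] + (1 if s1[i - 1] == s2[j - 1] else -1)
            up = prev[j] - 1                       # s1[i-1] opposite a space
            left = curr[j - 1] - 1                 # s2[j-1] opposite a space
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[n]


S1 = "AATCCAGTTTTACAGATCCTC"
S2 = "AATAGTTTTACAGACTCAT"
print(similarity(S1, S2))
```

Only two rows of the table are kept at a time, since the score alone is needed; recovering the alignment itself would require keeping the full table or traceback pointers.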
Number of operations needed to compute Similarity
Length of the    Number of operations          Number of operations
sequences        using explicit enumeration    using Dynamic Programming
10               184,756                       100
20               ~1.4e11                       400
100              ~9.0e58                       10,000
So similarity can be found quickly, but
Is the similarity significant?
Elegant statistical methods can
be used to determine the probability
that two random sequences would
have that level of similarity or more.
We don’t reject the possibility that
two sequences are similar due only
to chance, unless the computed
probability is very low.
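One simple way to estimate such a probability empirically is a shuffle test: repeatedly shuffle one sequence, recompute the similarity, and count how often the shuffled score reaches the observed one. This is a sketch of that generic idea, not necessarily the statistical methods the talk has in mind; the names `similarity` and `shuffle_p_value`, the trial count, and the seed are all assumptions.

```python
import random


def similarity(s1: str, s2: str) -> int:
    """Best (# matches) - (# mismatches) - (# spaces), by dynamic programming."""
    prev = [-j for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [-i] + [0] * len(s2)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (1 if s1[i - 1] == s2[j - 1] else -1)
            curr[j] = max(diag, prev[j] - 1, curr[j - 1] - 1)
        prev = curr
    return prev[len(s2)]


def shuffle_p_value(s1: str, s2: str, trials: int = 200, seed: int = 0) -> float:
    """Estimate P(similarity of s1 and a shuffled s2 >= observed similarity)."""
    rng = random.Random(seed)
    observed = similarity(s1, s2)
    chars = list(s2)
    hits = 0
    for _ in range(trials):
        rng.shuffle(chars)                      # same composition, random order
        if similarity(s1, "".join(chars)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


print(shuffle_p_value("AATCCAGTTTTACAGATCCTC", "AATAGTTTTACAGACTCAT"))
```

Shuffling preserves the base composition of the sequence, so the test asks whether the order of the bases, and not just their frequencies, explains the similarity.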
Extensions: Finding patterns
in multiple sequences
human  ACTAACCGGGAGATTTCAGA
chimp  AAGTTCCGGGAGATTTCCA
mouse  TAGTTATCCGGGAGATTAGA
rat    AAAACCGGTAGATTTCAGG
Multiple Sequence Alignment
human  AC--TAACCGGGAGATTTCAGA
chimp  AAGTT--CCGGGAGATTTCC-A
mouse  TAGTTATCCGGGAGATT--AGA
rat    AA---AACCGGTAGATTTCAGG
CLUSTALW multiple sequence alignment (rbcS gene)
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------A
TATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA
Cotton
Pea
Tobacco
Ice-plant
Larch
Turnip
Wheat
Duckweed
T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
Again we need a model of what
multiple sequence alignments
are biologically meaningful;
a metric to score the
goodness of a multiple alignment;
an algorithm to compute multiple
alignments, based on the metric;
and statistical methods to evaluate
the significance of an alignment.
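As one illustration of a metric for a multiple alignment, the sketch below computes a sum-of-pairs score: every column is scored by summing a pairwise score over all pairs of rows. This is a commonly used choice, not one the talk specifies. The pairwise values (+1 match, -1 mismatch, -1 character against a space, 0 for two spaces) and the names `pair_score` and `sum_of_pairs` are assumptions.

```python
from itertools import combinations


def pair_score(a: str, b: str, space: str = "-") -> int:
    if a == space and b == space:
        return 0                  # assumption: two spaces in a column cost nothing
    if a == space or b == space:
        return -1
    return 1 if a == b else -1


def sum_of_pairs(rows: list[str]) -> int:
    """Sum the pairwise score over every column and every pair of rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of a multiple alignment must have equal length")
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


msa = [
    "AC--TAACCGGGAGATTTCAGA",
    "AAGTT--CCGGGAGATTTCC-A",
    "TAGTTATCCGGGAGATT--AGA",
    "AA---AACCGGTAGATTTCAGG",
]
print(sum_of_pairs(msa))
```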
Summarizing
• Biology by sequence analysis opens the door
widely to non-biologists.
• Models of sequence evolution and metrics
used in sequence analysis are articulated by
biology and Mathematics.
• Computer Science contributes efficient
algorithms to do the analysis and compute
the metrics.
• Statistics is needed to evaluate the
significance of the computed results.
• Sequence analysis is just one of many ways
that computer science and mathematics have
entered biology.
In general: The computational biology work flow
Biological Knowledge
E.g. DNA mutates
Biological model
E.g. DNA replication infidelity model, mutagens, radiation models etc.
Mathematical model and assumptions
E.g. assumption about mutation distribution or preferential attachment
Mathematical problem
E.g. Given the mathematical model, find spots where mutation rates are high or low in a statistically significant way
Algorithmic problem
E.g. What algorithm should I develop to efficiently find hotspots?
Programming problem
E.g. Data storage, Memory, OOP and languages, optimizations, GUI
Another illustration, involving
phylogenetic trees rather than
sequences.
Comparing Trees:
Tanglegrams
• A Tanglegram is a pair of phylogenetic trees drawn in the plane
with no crossing edges, with the same labeled leaf set. The
leaves of one tree are displayed on a line, and the leaves of the
other tree are displayed on a parallel line.
• One tree represents the evolution of a set of species, and the
other tree represents the evolution of a set of parasites that
inhabit the species.
• A straight line connects each leaf in one tree to the leaf with the
same label in the other tree.
• The number of crossing lines is a measure of the similarity of
the trees.
• A small measure suggests that the species and parasites coevolved.
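For one fixed drawing, the crossing count can be sketched as follows. Two straight lines between parallel leaf rows cross exactly when their endpoints appear in opposite orders on the two rows, so the count equals the number of inverted pairs. The function name `count_crossings` and the quadratic loop are illustrative choices.

```python
def count_crossings(order1, order2) -> int:
    """Count crossing lines when leaves are drawn in these two left-to-right orders."""
    position = {label: i for i, label in enumerate(order2)}
    seq = [position[label] for label in order1]
    # A pair of lines crosses iff the pair is inverted between the two orders.
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


print(count_crossings(["a", "b", "c"], ["b", "a", "c"]))  # prints 1
```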
Images courtesy of NTBG
But the trees can be redrawn to
reduce the number of crossings.
So we have the algorithmic
problem of finding planar layouts
of the two trees, to minimize the
number of crossings of the lines
between the leaves. That minimum
number is the metric of similarity.
How do we compute it, and how can
we evaluate significance?
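For very small trees, the minimum can be found by brute force: try every planar layout of each tree (obtained by swapping the two children at any internal node) and keep the smallest crossing count. The sketch below assumes binary trees written as nested tuples with string leaves; the names `layouts`, `count_crossings`, and `min_crossings` are hypothetical. The number of layouts doubles with every internal node, so this only illustrates the problem and is not an efficient algorithm.

```python
from itertools import product


def layouts(tree):
    """Yield every leaf order obtainable by swapping children at internal nodes."""
    if isinstance(tree, str):          # a leaf
        yield (tree,)
        return
    left, right = tree                 # assumption: binary internal nodes
    for a, b in product(list(layouts(left)), list(layouts(right))):
        yield a + b
        yield b + a


def count_crossings(order1, order2) -> int:
    position = {label: i for i, label in enumerate(order2)}
    seq = [position[label] for label in order1]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )


def min_crossings(tree1, tree2) -> int:
    orders2 = list(layouts(tree2))
    return min(count_crossings(o1, o2) for o1 in layouts(tree1) for o2 in orders2)


hosts = (("a", "b"), ("c", "d"))
parasites = (("a", "c"), ("b", "d"))
print(min_crossings(hosts, parasites))  # prints 1
```

Significance could then be approached in the same spirit as for sequences, for example by comparing the observed minimum with the minima obtained after randomly relabeling the leaves of one tree.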
Thank you