Computational Analysis of DNA Sequences Peter Krusche Warwick Systems Biology Centre

Computational Analysis of
DNA Sequences
Peter Krusche
Warwick Systems Biology Centre
Thursday, 8 December 2011
Some Biology
[ cells, clocks, cancer ]
Sequences and Motifs
[ computational sequence analysis ]
Computation
[ an example and some challenges ]
Some Biology
It’s real science!
The goal: understand and explain the mechanisms that
underlie life.
Another way to look at it: “Reverse-engineer” nature.
Main Systems Biology Activities
1) Analysis of experimental data
[ statistics, sequence analysis, image analysis, ... ]
2) Create models e.g. of complex protein
interactions and biochemical reactions
[ kinetics, network models, model parameter estimation, ...]
3) Suggest new experiments to provide evidence for
models
[ communicate with biologists ]
Cells and the Cell Cycle
Living organisms consist
of cells.
Dividing cells go
through different phases
which make up the
Cell Cycle.
So, how do cells work?
DNA encodes the instructions for
making proteins.
In eukaryotic cells, the DNA is normally
stored in the nucleus.
DNA can be transcribed into
mRNA, which can be translated into
proteins.
Proteins participate in most
processes in cells.
Source: http://www.genome.gov/Glossary/
Proteins
Complex macromolecules, which
can have specific functions, such as:
Catalyse biochemical reactions
(Enzymes)
Structural or mechanical functions
(e.g. maintaining cell shape)
Cell signalling, immune responses,
cell adhesion, and the cell cycle.
Transcription factors start up the
transcription of genes.
Source: http://www.genome.gov/Glossary/
Protein Sequences
Transcription and translation represented by character
sequences:
DNA
mRNA
Protein sequence
Source: http://en.wikipedia.org/wiki/Protein
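A minimal Python sketch of transcription and translation as operations on character sequences. It assumes the input is the coding strand read 5' to 3' and uses the standard genetic code; the function names and the example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: every T becomes U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Read the mRNA codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAGTGA"  # hypothetical example sequence
mrna = transcribe(dna)
print(mrna, translate(mrna))  # prints: AUGGCCAAGUGA MAK
```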
Promoter Sequences
Promoter sequences are the DNA sequences that
surround “coding areas” that will be transcribed and translated
into protein.
The promoter sequence normally controls when a gene
gets expressed as a protein.
Source: http://www.genome.gov/Glossary/
How do promoters “control” expression?
Promoter sequences contain motifs.
Motifs are short sequence fragments to which
transcription factor proteins bind
(binding then triggers transcription and production of a protein).
Example:
E-box: CACGTG
commonly present in promoters of genes
that are expressed rhythmically.
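As an illustration, an exact search for the E-box in a promoter fragment can be sketched in a few lines of Python. CACGTG is its own reverse complement, so scanning one strand covers both. The function name and the promoter string below are hypothetical.

```python
def find_motif(sequence, motif="CACGTG"):
    """Return the 0-based start positions of all exact motif occurrences."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Continue one position later so overlapping matches are also found.
        start = sequence.find(motif, start + 1)
    return positions


promoter = "TTGACACGTGAATTCCACGTGTT"  # made-up fragment with two E-boxes
print(find_motif(promoter))  # prints: [4, 15]
```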
Proteins in Action
The Zebrafish Clock
Image Source: Vatine et al. 2011
A Simplified Model
[Figure: diagram of the simplified clock model; labels not recoverable]
(Model by Alex Esparza)
Model Output
[Figure: model output plot; axis labels not recoverable]
Cell clock produces a light-entrainable oscillation
Experimental Validation
[Figure: experimental validation plots; labels not recoverable]
The C5Sys Project
Circadian and cell cycle clock systems in cancer
We believe clock and cell cycle regulation are
connected.
What is the connection to cancer?
Some drugs work better or worse depending on the
time of day they are given.
Shift workers can have a higher cancer risk.
http://www.erasysbio.net/index.php?index=272
Sequences and Motifs
What do we want to do?
We would like to learn how the clock and
cell cycle are connected
(and compare to experimental data).
Test if certain types of motifs
(clock/cell cycle) are present in
‣ Clock genes
‣ Cell cycle genes
... and compare to random sets of genes.
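One way to make "compare to random sets of genes" concrete is an empirical p-value: count how many genes in the set of interest carry the motif, then check how often a random gene set of the same size does at least as well. The sketch below is an assumption about how such a test could look, not the analysis used in the talk; all names and the toy data are hypothetical.

```python
import random


def motif_enrichment_p_value(genes_with_motif, all_genes, gene_set,
                             n_random=1000, seed=0):
    """Empirical p-value for motif enrichment in gene_set.

    genes_with_motif: set of gene names whose promoter contains the motif.
    all_genes: the background population to draw random sets from.
    """
    rng = random.Random(seed)
    all_genes = list(all_genes)
    observed = sum(gene in genes_with_motif for gene in gene_set)
    at_least_as_many = 0
    for _ in range(n_random):
        sample = rng.sample(all_genes, len(gene_set))
        if sum(gene in genes_with_motif for gene in sample) >= observed:
            at_least_as_many += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return (at_least_as_many + 1) / (n_random + 1)


# Toy data: 100 genes, the motif occurs only in the first five.
genes = [f"gene{i}" for i in range(100)]
with_motif = set(genes[:5])
print(motif_enrichment_p_value(with_motif, genes, genes[:5]))
```

A small p-value says the motif is more common in the gene set than expected by chance under this simple background model.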
Promoter Analysis
Where are the regulatory sequences of a gene?
Source: http://www.ncbi.nlm.nih.gov
Motif Search
‣ Fix p-value threshold, search for motifs in all
regulatory sequences for a genome
‣ Many ways exist to evaluate motif
presence
(we use position weight matrices)
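A rough sketch of position weight matrix (PWM) scanning in Python. It builds a log-odds matrix from example binding sites and scores every window of a sequence. Two simplifications are assumptions of this sketch: a uniform background with a pseudocount of 1, and a raw score threshold standing in for the p-value threshold (converting scores to p-values is not shown). Sequences are assumed to contain only A, C, G and T.

```python
import math

BASES = "ACGT"


def log_odds_pwm(sites, pseudocount=1.0, background=0.25):
    """Build one {base: log2 odds} column per motif position."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total)
                            / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at least threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pwm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical training sites resembling an E-box.
pwm = log_odds_pwm(["CACGTG", "CACGTG", "CACGTG", "CACGTT"])
for start, score in scan("TTTCACGTGAAA", pwm, threshold=4.0):
    print(start, round(score, 2))
```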
Single-species analysis can be very noisy.
Example (Zebrafish p21)
We want to find the two known regulatory CACGTG sites in the middle.
We also find lots of other matches.
Sequence Conservation
Sequences that are very similar across
multiple species may be evolutionarily conserved.
Conservation suggests similarity in function
for these sequence regions.
Conserved regions in promoters are likely to
contain areas that are relevant to the regulation
of expression.
Sequence Conservation
Sequence Alignment can be used to find
conserved blocks.
Once we know where to look...
E-Boxes identified in
promoter fragment
Some Computations
String Comparison
1) Hamming distance: count mismatches.
dist( AACACCTACG, AAGACCAACT ) = 3
2) String alignment: align maximum number of letters,
preserving order.
-AACTACCCTAA-CG
 ||  |||  || |
GAA-GACC--AAGCT
The aligned characters form the longest common
subsequence (LCS).
LCS distance:
dist(x,y) = |x| + |y| - 2 * |LCS(x, y)|
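Both distances can be sketched in Python as follows. The LCS length uses the textbook dynamic program with O(|x| |y|) time, which is not necessarily what the talk's implementation does; the function names are illustrative.

```python
def hamming_distance(x, y):
    """Count mismatching positions of two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence, keeping one DP row at a time."""
    previous = [0] * (len(y) + 1)
    for cx in x:
        current = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(x, y):
    """dist(x, y) = |x| + |y| - 2 * |LCS(x, y)|"""
    return len(x) + len(y) - 2 * lcs_length(x, y)


print(hamming_distance("AACACCTACG", "AAGACCAACT"))  # prints: 3
print(lcs_length("AACTACCCTAACG", "GAAGACCAAGCT"))   # prints: 8
print(lcs_distance("AACTACCCTAACG", "GAAGACCAAGCT")) # prints: 9
```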
Alignment Plots
Naive approach: O(|x| |y| w^2), aligning every pair of length-w windows.
Ott et al. ’08: Heuristic
improvements, 25x speedup, same
worst case.
Rasmussen et al. ’05: Very fast
algorithm for computing the scores
of window pairs with >90%
similarity.
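The naive approach can be sketched in Python as below: every length-w window of x is compared against every length-w window of y. Using the LCS length as the window score is an assumption of this sketch, and the example strings are made up.

```python
def lcs_length(x, y):
    """Longest common subsequence length, O(|x| |y|) dynamic program."""
    previous = [0] * (len(y) + 1)
    for cx in x:
        current = [0]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def alignment_plot(x, y, w):
    """plot[i][j] = score of x[i:i+w] against y[j:j+w].

    There are about |x| * |y| window pairs and each costs O(w^2),
    which gives the O(|x| |y| w^2) bound of the naive approach.
    """
    rows = len(x) - w + 1
    cols = len(y) - w + 1
    return [
        [lcs_length(x[i:i + w], y[j:j + w]) for j in range(cols)]
        for i in range(rows)
    ]


plot = alignment_plot("ACGTACGT", "TTACGTTT", w=4)
for row in plot:
    print(row)
```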
Using Alignment Plots
Local self-similarity:
find repetitive/low complexity regions
Using Alignment Plots
We compute pairwise alignment plots of the promoter
areas of the same gene in multiple species.
The maximum window score over each column of windows
in the plot can be plotted as a profile.
Peak shows possible location
of regulatory motifs
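Reducing a plot to a profile is a column-wise maximum. A small sketch, assuming the plot is a list of rows of window scores; the score values and the peak cut-off here are invented for illustration.

```python
def column_profile(plot):
    """Maximum window score in each column of the alignment plot."""
    return [max(column) for column in zip(*plot)]


def peak_positions(profile, min_score):
    """Column indices whose best window score reaches min_score."""
    return [i for i, score in enumerate(profile) if score >= min_score]


# Hypothetical 3 x 5 plot of window scores.
plot = [
    [1, 2, 4, 2, 1],
    [2, 1, 3, 2, 0],
    [0, 2, 2, 1, 1],
]
profile = column_profile(plot)
print(profile)                     # prints: [2, 2, 4, 2, 1]
print(peak_positions(profile, 4))  # prints: [2]
```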
From single-Gene to Genome-wide
For individual genes, all types of analysis we showed
are pretty quick:
• Searching for motifs in 5000bp of sequence:
a few seconds
• Computing pairwise local alignments for multiple
5000bp sequences: 1-2 minutes (*)
• Finding related genes (orthology): ~5min
We want to do this for an entire genome
(tens of thousands of genes!)
(*) This increases quadratically with sequence size!
80kb sequences => 2h!
From single-Gene to Genome-wide
Applying our analyses on a genome-wide
scale has some difficulties:
• New sequence database versions => new
sequences
• Many incremental results, we need to be
able to
‣ change parameters,
‣ repeat partial computations.
Distributed Computation
Data Storage:
A MySQL database running on the WSBC
cluster.
Computation:
APPLES Perl framework handles data
collection + DB access.
Computationally expensive tasks are
implemented in C/C++/Assembler.
Distributed Computation
All computation objects and results are
versioned and serializable:
We can convert them between Perl objects and a
(canonical) JSON format.
We can uniquely identify equal sets of parameters.
We know for each result how it was obtained,
including source code and input database versions.
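The idea of a canonical JSON form can be illustrated in Python (APPLES itself is Perl; this is only a sketch, and the parameter names are hypothetical). Sorting keys and fixing separators gives equal parameter sets the same text, so a hash of that text can serve as an identifier.

```python
import hashlib
import json


def canonical_json(params):
    """Serialize with sorted keys and fixed separators, so equal
    parameter sets always produce the same string."""
    return json.dumps(params, sort_keys=True, separators=(",", ":"))


def computation_id(params):
    """Hash of the canonical form, usable as a lookup key for results."""
    return hashlib.sha256(canonical_json(params).encode("utf-8")).hexdigest()


a = {"method": "pwm", "threshold": 0.001, "db_version": "v1"}
b = {"db_version": "v1", "threshold": 0.001, "method": "pwm"}
print(canonical_json(a))
print(computation_id(a) == computation_id(b))  # prints: True
```

Recording the code version and the input database version as ordinary fields of the parameter set makes them part of the identifier as well.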
Distributed Computation
Fault tolerance
We serialize all our computations to the DB.
Runner instances can (re-)run any computation.
Automatic task parallelism
Many runners can be active at the same time.
Locking mechanism prevents running jobs twice.
(work-stealing type scheduling)
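A sketch of how a database-backed lock can stop two runners from taking the same job. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL; the table layout and function names are assumptions, not the APPLES schema. The key step is a conditional UPDATE that only succeeds while the job is still pending.

```python
import sqlite3


def create_queue(conn, payloads):
    """Create a hypothetical jobs table and fill it with pending jobs."""
    conn.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, payload TEXT, "
        "status TEXT NOT NULL DEFAULT 'pending', runner TEXT)"
    )
    conn.executemany("INSERT INTO jobs (payload) VALUES (?)",
                     [(p,) for p in payloads])
    conn.commit()


def claim_job(conn, runner):
    """Claim the oldest pending job, or return None if there is none."""
    while True:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        job_id, payload = row
        # The status check in the WHERE clause is the lock: if another
        # runner claimed the job first, no row is updated.
        cursor = conn.execute(
            "UPDATE jobs SET status = 'running', runner = ? "
            "WHERE id = ? AND status = 'pending'",
            (runner, job_id),
        )
        conn.commit()
        if cursor.rowcount == 1:
            return job_id, payload
        # Lost the race for this job; try the next pending one.


conn = sqlite3.connect(":memory:")
create_queue(conn, ["align gene A", "align gene B"])
print(claim_job(conn, "runner-1"))  # prints: (1, 'align gene A')
print(claim_job(conn, "runner-2"))  # prints: (2, 'align gene B')
print(claim_job(conn, "runner-1"))  # prints: None
```

Because idle runners pull the next pending job themselves, faster runners naturally end up doing more of the work.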
Outlook
Buy faster CPUs.
Use parallel computing.
Use GPUs.
Implement stuff faster and better.
Design better algorithms.
Faster Computation
Improving the sensitivity of our local alignment
algorithms costs more computation time.
Better statistics, e.g. for motif scoring, can reduce
noise but also take more time.
➡ Design better algorithms.
➡ Use GPU and parallel computation.
Distributed Data Storage
Single-node MySQL is not the ideal tool
for our purposes.
If it becomes necessary, we could use a distributed
database for result caching and merging
(something like Riak/...).
We will use a graph database (Neo4j) for
storing network-type results.
Visualisation
Show computational results in a way useful
to biologists.
Some interesting projects:
Protovis, Cytoscape, Arbor.js, ...
(graphs and charts on the web)
Cube, Chronoscope, ...
(show and keep timeseries)
Many more...
Thanks! Questions?