Computational Analysis of DNA Sequences
Peter Krusche
Warwick Systems Biology Centre
Thursday, 8 December 2011

Some Biology [ cells, clocks, cancer ]
Sequences and Motifs [ computational sequence analysis ]
Computation [ an example and some challenges ]

Some Biology

It's real science!
The goal: understand and explain the mechanisms that underlie life.
Another way to look at it: "reverse-engineer" nature.

Main Systems Biology Activities
1) Analysis of experimental data [ statistics, sequence analysis, image analysis, ... ]
2) Create models, e.g. of complex protein interactions and biochemical reactions [ kinetics, network models, model parameter estimation, ... ]
3) Suggest new experiments to provide evidence for models [ communicate with biologists ]

Cells and the Cell Cycle
Living organisms consist of cells.
Dividing cells go through different phases, which make up the cell cycle.

So, how do cells work?
DNA describes how to make proteins.
In living cells, the DNA is normally stored in the nucleus.
DNA can be transcribed into mRNA, which can be translated into proteins.
Proteins participate in most processes in cells.
Source: http://www.genome.gov/Glossary/

Proteins
Complex macromolecules, which can have specific functions, such as:
• Catalysing biochemical reactions (enzymes)
• Structural or mechanical functions (e.g. maintaining cell shape)
• Cell signalling, immune responses, cell adhesion, and the cell cycle
Transcription factors start up the transcription of genes.
Source: http://www.genome.gov/Glossary/

Protein Sequences
Transcription and translation represented by character sequences: DNA → mRNA → protein sequence
Source: http://en.wikipedia.org/wiki/Protein

Promoter Sequences
Promoter sequences are the DNA sequences that surround "coding areas" that will be transcribed and translated into protein.
The promoter sequence normally controls when a gene gets expressed as a protein.
Source: http://www.genome.gov/Glossary/

How do promoters "control" expression?
Promoter sequences contain motifs.
Motifs are short sequence fragments that attract binding of transcription factor proteins (which then causes transcription and production of a protein).
Example: the E-box CACGTG, normally present in promoters of genes which are expressed rhythmically.

Proteins in Action: The Zebrafish Clock
[Figure: the zebrafish clock]
Image Source: Vatine et al. 2011

A Simplified Model
[Figure: simplified model of the clock]
(Model by Alex Esparza)

Model Output
[Figure: model output]
Cell clock produces a light-entrainable oscillation

Experimental Validation
[Figure: experimental validation]

The C5Sys Project
Circadian and cell cycle clock systems in cancer
We believe clock and cell cycle regulation are connected. What is the connection to cancer?
• Some drugs work better/worse depending on the time of day they are given.
• Shift-workers can have higher cancer risk.
http://www.erasysbio.net/index.php?index=272

Sequences and Motifs

What do we want to do?
We would like to learn how the clock and cell cycle are connected (and compare to experimental data).
Test if certain types of motifs (clock/cell cycle) are present in
‣ Clock genes
‣ Cell cycle genes
... and compare to random sets of genes.

Promoter Analysis
Where are the regulatory sequences of a gene?
Source: http://www.ncbi.nlm.nih.gov

Motif Search
‣ Fix p-value threshold, search for motifs in all regulatory sequences for a genome
‣ Many ways exist to evaluate motif presence (we use position weight matrices)
Single-species analysis can be very noisy.

Example (Zebrafish p21)
We want to find the two known regulatory CACGTG sites in the middle.
We also find lots of other hits.
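The motif search described above can be illustrated with a small position weight matrix (PWM) scan. This is only a minimal sketch and not the APPLES implementation: the example binding sites, the uniform background, the pseudocount, and the fixed score threshold (used here in place of a p-value threshold) are all assumptions made for illustration.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length example sites (hypothetical data)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def scan(pwm, seq, threshold):
    """Return (position, strand, score) for every window scoring >= threshold."""
    hits = []
    w = len(pwm)
    for pos in range(len(seq) - w + 1):
        window = seq[pos:pos + w]
        for strand, s in (("+", window), ("-", reverse_complement(window))):
            sc = score(pwm, s)
            if sc >= threshold:
                hits.append((pos, strand, sc))
    return hits


# Hypothetical E-box-like training sites and a made-up promoter fragment.
example_sites = ["CACGTG", "CACGTG", "CACGTG", "CACGTT"]
example_pwm = build_pwm(example_sites)
example_seq = "TTACACGTGAAGGCACGTGTT"
example_hits = scan(example_pwm, example_seq, threshold=7.0)
print(sorted({pos for pos, _, _ in example_hits}))
```

Because CACGTG is its own reverse complement, each site is reported on both strands. Lowering the threshold admits weaker matches, which is one way the noise mentioned on the slide can arise.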
Sequence Conservation
Sequences that are very similar across multiple species may be evolutionarily conserved.
Conservation indicates similarity in function for genetic sequence regions.
Conserved regions in promoters are likely to contain areas that are relevant to expression regulation.
Sequence alignment can be used to find conserved blocks.

Once we know where to look...
E-boxes identified in promoter fragment

Some Computations

String Comparison
1) Hamming distance: count mismatches.
   dist( AACACCTACG, AAGACCAACT ) = 3
2) String alignment: align maximum number of letters, preserving order.
   [Figure: alignment of AACTACCCTAACG and GAAGACCAAGCT with 8 aligned letters]
   The aligned characters form the longest common subsequence (LCS).
   LCS distance: dist(x,y) = |x| + |y| - 2 * |LCS(x, y)|

Alignment Plots
Naive approach: O(|x| |y| w^2)
Ott et al. '08: heuristic improvements, 25x speedup, same worst case.
Rasmussen et al. '05: very fast algorithm for computing the scores of window pairs with >90% similarity.

Using Alignment Plots
Local self-similarity: find repetitive/low-complexity regions
We compute pairwise alignment plots of the promoter areas of the same gene in multiple species.
The maximum window score over each column of windows in the plot can be plotted as a profile.
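The distances and the naive alignment plot from the slides above can be sketched in a few lines. This is a minimal illustration, not the optimised C/C++ code used in the project: the function names are hypothetical, and using the LCS length as the window score is an assumption, since the slides do not specify the scoring scheme.

```python
def hamming(x, y):
    """Count mismatching positions of two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence, O(|x| |y|) dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(x, y):
    """dist(x, y) = |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def window_profile(x, y, w):
    """Naive alignment plot profile in O(|x| |y| w^2) time.

    Every length-w window of x is compared with every length-w window of y;
    the best score per window of x (one column of the plot) forms the profile.
    """
    if len(x) < w or len(y) < w:
        raise ValueError("both sequences must be at least w long")
    profile = []
    for i in range(len(x) - w + 1):
        wx = x[i:i + w]
        best = max(lcs_length(wx, y[j:j + w]) for j in range(len(y) - w + 1))
        profile.append(best)
    return profile


print(hamming("AACACCTACG", "AAGACCAACT"))         # the Hamming example from the slide
print(lcs_distance("AACTACCCTAACG", "GAAGACCAAGCT"))
print(window_profile("ACGTACGT", "TTACGTTT", 4))   # made-up toy sequences
```

Each window pair costs O(w^2) for the LCS computation, and there are roughly |x| |y| pairs, which gives the naive bound quoted on the slide.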
A peak in the profile shows a possible location of regulatory motifs.

From Single-Gene to Genome-Wide
For individual genes, all the types of analysis we showed are fairly quick:
• Searching for motifs in 5000 bp of sequence: a few seconds
• Computing pairwise local alignments for multiple 5000 bp sequences: 1-2 minutes (*)
• Finding related genes (orthology): ~5 min
We want to do this for an entire genome (tens of thousands of genes!)
(*) This increases quadratically with sequence size! 80 kb sequences => 2 h!

Applying our analyses on a genome-wide scale has some difficulties:
• New sequence database versions => new sequences
• Many incremental results; we need to be able to
  ‣ change parameters,
  ‣ repeat partial computations.

Distributed Computation
Data storage: a MySQL database running on the WSBC cluster.
Computation: the APPLES Perl framework handles data collection + DB access. Computationally expensive tasks are implemented in C/C++/Assembler.

All computation objects and results are versioned and serializable:
• We can convert them between Perl objects and a (canonical) JSON format.
• We can uniquely identify equal sets of parameters.
• We know for each result how it was obtained, including source code and input database versions.

Fault tolerance
We serialize all our computations to the DB.
Runner instances can (re-)run any computation.

Automatic task parallelism
Many runners can be active at the same time.
A locking mechanism prevents running jobs twice (work-stealing type scheduling).

Outlook
• Buy faster CPUs.
• Use parallel computing.
• Use GPUs.
• Implement stuff faster and better.
• Design better algorithms.

Faster Computation
Improving the sensitivity of our local alignment algorithms consumes more time.
Better statistics, e.g. for motif scoring, can reduce noise but take more time.
➡ Design better algorithms.
➡ Use GPU and parallel computation.

Distributed Data Storage
Single-node MySQL is not the ideal tool for our purposes.
Once necessary, we could use a distributed database for result caching and merging (something like Riak/...).
We will use a graph database (Neo4j) for storing network-type results.

Visualisation
Show computational results in a way useful to biologists.
Some interesting projects:
• Protovis, Cytoscape, Arbor.js, ... (graphs and charts on the web)
• Cube, Chronoscope, ... (show and keep time series)
• Many more...

Thanks! Questions?