Algorithmic Analysis of Human DNA Replication Timing from Discrete Microarray Data Christopher Taylor

advertisement
Algorithmic Analysis of Human
DNA Replication Timing from
Discrete Microarray Data
Christopher Taylor
Gabriel Robins & Anindya Dutta
Thesis Statement
The DNA replication timing profile can be
reconstructed efficiently and accurately
from discrete time points.
(Glossary)
2
Presentation Outline
• Biology background
• Microarray technology
• Experimental data
– Challenges
• Algorithms
• Research Plans
– Replication timing
– Origins
– Scale up
3
Why Study DNA Replication?
• Natural Science
– DNA is the blueprint for organisms
• It must be passed on (organism, cell)
• Engineering
– Gene therapy
• Insertion, deletion, modification
– Cancer is unchecked replication
4
•
•
•
•
... A
G
G
T
C
G
A
C
A
C ...
... T
C
C
A
G
C
T
G
T
G ...
Human genome > 3 billion bp
Replication rate ~ 1000 bp/min
Serial replication  5.7 years
6 to 10 hours (speedup > 5000)
5
Background
• Prokaryotes
– E. Coli
• DnaA binds to oriC
• Eukaryotes – ORC
– S. Cerevisiae (yeast)
• ARS 11 bp consensus
– Mapping of origins
– Human
• No known consensus
• Few origins characterized
6
Genome Tiling Microarrays
• Interrogation at genomic scale
– Large increase in data
• Microarray data analysis
• Array of probes tiles genome
ATGGACTACGGATCAGTAAATCGATTAGGCACCAGATCAAGTACGATCCAGAGTACATAGCATACCATGACTAGA
GAGTACATAGCATACCATGACTAGA
TACCTGATGCCTAGTCATTTAGCTAATCCGTGGTCTAGTTCATGCTAGGTCTCATGTATCGTATGGTACTGATCT
CTCATGTATCGTATGGTACTGATCT
• Cross-hybridization
– Repeats not tiled
• Gaps in genome
PM probe
MM probe
GAGTACATAGCATACCATGACTAGA
A
7
Image analysis computes intensity of each array probe
8
The Cell Cycle
S-Phase
Start of S-phase
(0 hour)
9
Profiling DNA Replication Timing
• Ideal: f(chr, bp) = rtime
• Isolate DNA replicated in
discrete parts of S-phase
– One cell is not enough
– Synchronize S-phase entry
• Apply drugs
• Release together
– Synchronization error
– Label in two hour intervals
• Allelic Variation
– mf(chr, bp) = {rtime1, rtime2, …}
10
0hr
0hr
Allelic Variation
2hr
2hr
• Fluorescent in-situ Hybridization
(FISH)
4hr
– Replication timing at a given site
4hr
Temporally non-specific replication (TNS)
6hr
Temporally specific replication (TS)
6hr
8hr
8hr
10hr
10hr
11
11
What is the Problem?
Reconstruct a continuous replication profile
– Temporally (time points)
– Spatially (probes)
from noisy data
– Biological experiments
– Synchronization error
– Microarray artifacts
efficiently
– Genomic data (> 3 billion bp)
12
Initial Analysis
• Tiling Analysis Software (TAS)
– Wilcoxon Rank Sum test in sliding window
• Assess enrichment of treatment over control
– Window slides to get p-value for each probe
• O(kn) time complexity
– n = # probes on array
– k = # probes in a window
» k scales linearly with window size
13
New Analysis
• Thesis Statement (revisited):
The DNA replication timing profile can be
reconstructed efficiently and accurately from
discrete time points.
• Incorporate information from all time points
– Continuous view of replication timing (TR50)
• Address temporally non-specific replication
• Scale up to the whole genome efficiently
14
Allelic Variation Examples
Temporally specific replication
0
0
0
2
1/1
4
5
TR50
0
6
0
8
10
Temporally non-specific replication
1/6
0
1/6
2
1/3
4
5
TR50
0
6
1/3
8
10
Challenge: From distribution of array signal, determine replication category.
15
Temporal Specificity Algorithm
// Is there evidence that all alleles are replicating together?
If (max sum of two adjacent time points ≥ 5/6 * total sum)
then {probe is temporally specific}
// Is at least one allele replicating apart from the majority?
Else If (max sum of two adjacent time points not including
the maximum time point ≥ 1/3 * total sum)
then {probe is temporally non-specific}
// Isolated signal is not strong enough to be an allele.
Else
{probe is temporally specific}
16
Plotting TR50
8
6
4 2
TR50 (hours)
33
33.5
34
Chromosomal Position (in millions of bp)
• Smoothed TR50 curve recovers replication pattern
• Local minima  Possible locations of replication origin
17
Segregation Algorithm
Ratio ≥ 2-to-1 &
3.4 ≤ Avg ≤ 3.9
Avg < 3.4
TNS
Ratio < 2-to-1
Early
Avg > 3.9
Avg ≥ 3.4
Avg < 3.4
Avg > 3.9
Mid
Avg ≤ 3.9
Late
Ratio < 2-to-1
Ratio < 2-to-1
• Sliding window passes over probes to generate intervals
– Ratio of TSP to TNSP determines temporal specificity
– Average TR50 determines timing category
18
Research Plan: Profile Generation
0-2hr
2-4hr
4-6hr
6-8hr
8-10hr
Probe Classification
(Temporal Specificity
Algorithm) &
TR50 Calculation
No Signal Probes
TNS Probes
TS Probes & TR50
Low Probe Density
Segregation
Algorithm
(Sliding Window)
TNS Regions
TS Regions
Join Intervals
TS Probes that fall into JTS Regions
Mask TS probes
with JTS Regions
Joined TNS Regions
Joined TS Regions
TR50 Smoothing
Early
Smoothed TR50
Segregate JTS Regions into
1/3’s based on STR50
Mid
Late
Joined Early
Join
Intervals
Joined Mid
Joined Late
• Parameters to evaluate:
– Segregation Algorithm: sliding window size, minimum probe density
– Join Intervals: minimum interval size
19
Evaluation
• Concordance of biological phenomena
– Segregation intervals ↔ FISH
– STR50 local minima ↔ Other origin methods
– Correlation with other biological data
•
•
•
•
Gene density ↔ Early replication
AT content ↔ Late replication
Gene expression ↔ Early replication
Activating acetylation/methylation ↔ Early replication
• Performance on random data
– Large quantity of TNS replication
20
Research Plan: Replication Origins
• Drive DNA replication pattern
• Smoothed TR50 local minima
– Cleaned up with new profiles
• Other biological assays
–
–
–
–
Early labeling fragments
Nascent strands
Bubble trapping
ORC binding
21
Approach and Evaluation
• Correlation between methods
– Consensus sets
• Motif analysis
– Positional attributes
• Replication timing
• Proximity to genes
• Evaluation is difficult (few validated origins)
– Agreement between methods
– Testing proposed correlations
– Paper in preparation
22
Scaling Up to Whole Genome
• Pilot 1%
100% of human genome
– Algorithms developed with scalability in mind
• Incremental update sliding windows  Linear time
• Performance based evaluation
– If 100% data available
• Profile multiple runs
– Else
• Profile many 1% runs
23
Implementation Details
• Java
– Class representation of proprietary microarray files
– Algorithms to process raw microarray data
– Diagnostic tools
• Perl
– Scripts to process intermediate and final data
– Correlations, data transformation, quality assurance
• R statistical language
– Smoothing, statistical plots, correlation studies
• Shell scripts
– Automated processing of microarray sets
24
Current/Expected Contributions
• Algorithms, Software Infrastructure, Analysis
• Probe-by-probe TR50 analysis
– Temporal Specificity Algorithm
• Combinatorial analysis of allele locations
• Segregation Algorithm
– TNS, Early, Mid, Late replicating areas
• Used to design validation experiments
• Smoothed TR50 profile
– Local minima provide candidate origin set
• Linear algorithms enable scale up
• Randomness testing
25
Publications
Completed:
• ENCODE Project Consortium. The ENCODE
(ENCyclopedia Of DNA Elements) Project. Science.
2004 Oct 22; 306(5696):636-40.
• ENCODE Project Consortium. Identification and
analysis of functional elements in 1% of the human
genome by the ENCODE pilot project. Nature.
{In Press, to appear in June 14, 2007 issue}
• Karnani N., Taylor C., Malhotra A., Dutta A. Pan-S
replication patterns and chromosomal domains
defined by genome tiling arrays of encode genomic
areas. Genome Research.
{In Press, to appear in June 2007 issue}
• UCSC Browser Tracks:
TR50,
Smoothed TR50, Local Minima, Segregation
In Progress:
• Multi-million dollar NIH grant for scale up to full
human genome
• Paper detailing origin methods, correlations, etc.
26
Timeline
Spring 2007 (present to June 20):
Implement proposed replication profile generation algorithms
– Generate new profiles for existing data and evaluate against FISH
– Collect new origin sets and continue analysis for paper completion
Summer 2007 (June 21 to September 21):
Explore correlations of new profiles with other data sets
Submit paper to PSB 2008 based on new method and results
Develop random data sets to test profile generation algorithms
Fall 2007 (September 22 to December 21):
Evaluate performance for scale up to whole genome
Tie up loose ends and begin writing the dissertation
Winter 2007-2008 (December 22 to March 19):
Finish dissertation and schedule defense before May 2008
27
Acknowledgements
• Advising:
– Anindya Dutta, Gabriel Robins
• Biological Experiments:
– Neerja Karnani, Patrick Boyle, Larry Mesner,
Jamie Teer, Hakkyun Kim
• Collaborative Analysis:
– Ankit Malhotra
• Discussions of Analysis:
– Stefan Bekiranov
28
THE END
29
Why is this work computer science?
• Fred Brooks: The Computer Scientist as Toolsmith II
– “Hitching our research to someone else’s driving problems, and
solving those problems on the owners’ terms, leads us to richer
computer science research.”
• Not an incremental improvement
– Algorithmic techniques and analysis used to solve a problem
previously addressed inadequately with a statistical approach that
performed poorly
• Collaboration outside of engineering disciplines enhances
visibility, funding opportunities, and demand for CS work
• Developed algorithms, time complexity analysis,
combinatorial analysis, feedback to experimental design
30
Will this work lead to any CS
publications?
• The Nature article focused on analysis of the
biological data and includes descriptions of
some of my algorithms
• The Genome Research paper and origins paper
will also contain writeups of my algorithms and
analysis techniques
• The Pacific Symposium on Biocomputing
focuses on algorithms and computational
techniques
31
Isn't your approach too simple?
• The approach isn’t simple:
–
–
–
–
Combinatorial analysis
Temporal specificity algorithm (many iterations)
Probewise computation to deal with binding affinity
Incremental updating sliding windows
• Cross-hybridiztion
• Synchronization error
– Smoothing
• Parameterization
– Linear algorithms for scale up
32
Can't your algorithm be replaced by
a well-known statistical method?
• HMM’s were used for segregation of intervals
– Performed poorly in comparison to my algorithm
• Less accurate categorization of replication intervals
• Prone to rapid oscillation, producing tiny intervals
• Parameterization was difficult
• Lowess smoothing is a statistical method
– Parameterization was not easy
33
What are the biggest challenges in
this work?
• Noise!
– The data to analyze comes from biological experiments with
several sources of noise that compound upon one another
• Biology
– I haven’t had a course in biology since 10th grade
• Microarrays
– New, evolving technology we’re still learning to deal with
• Data size
– Hundreds of GB of data to process
– Replicates, failed experiments
– Algorithms must be efficient
34
What kind of career are you aiming
for after graduation, and why?
• Teaching Computer Science (Small College)
– I enjoyed learning in my undergraduate curriculum with
meaningful interactions with professors
– I taught Discrete Math at UVa in Fall ’02 and Spring ’03
• Enjoyable, but 60-70 students too large
• Post-doctoral (Biological Computing)
– Many opportunities around the world
– Further exploration of the field
35
How will you know when your
work/thesis is done?
• Research is never really done, but you have to
declare victory at some point
• The replication profiling algorithms I’ve developed
already perform quite well
– I have concrete plans to improve and finalize them
36
Download