RECOMBINOMICS: Myth or Reality?

Laxmi Parida
IBM Watson Research
New York, USA
IBM Computational Biology Center
RoadMap
1. Motivation
2. Reconstructability
(Random Graphs Framework)
3. Reconstruction Algorithm
(DSR Algorithm)
4. Conclusion
www.nationalgeographic.com/genographic
www.ibm.com/genographic
- Five-year study, launched in April 2005, to address anthropological questions on a global scale using genetics as a tool
- Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth.
  How did we, each of us, end up where we are? (the phylogeographic question)
- Samples from all around the world are being collected, and the mtDNA and Y-chromosome are being sequenced and analyzed
DNA material in use
under unilinear transmission
16000 bp
58 million bp
0.38%
Missing information in
unilinear transmissions
past
present
Paradigm Shift in Locus & Analysis
Using recombining DNA sequences
- Why?
  Nonrecombining gives a partial story
  1. represents only a small part of the genome
  2. behaves as a single locus
  3. unilinear (exclusively male or female) transmission
  Recombining moves towards more complete information
- Challenges
  Computationally very complex
  How to comprehend complex reticulations?
RoadMap
1. Motivation
2. Reconstructability
(Random Graphs Framework)
3. Reconstruction Algorithm
(DSR Algorithm)
4. Conclusion
L Parida,
Pedigree History: A Reconstructability Perspective using Random-Graphs Framework,
Under preparation.
RoadMap
1. Motivation
2. Reconstructability
(Random Graphs Framework)
3. Reconstruction Algorithm
(DSR Algorithm)
4. Conclusion
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium
Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns
Journal of Computational Biology, vol. 15(9), pp. 1-22, 2008
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium,
Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009
INPUT: Chromosomes (haplotypes)
OUTPUT: Recombinational Landscape (Recotypes)
Our Approach
1. Choose granularity g
2. Acceptable p-value? (statistical)
   NO: return to step 1
   YES: continue to step 3
3. IRiS (combinatorial)
4. Analyze Results (statistical)
M Mele, A Javed, F Calafell, L Parida, J Bertranpetit and Genographic Consortium
Recombination-based genomics: a genetic variation analysis in human populations,
under submission.
Preprocess:
Dimension reduction via Clustering
[Figure: clustering of items labeled 0 to 24]
Analysis Flow
1. Choose granularity g
2. Acceptable p-value? (statistical)
   NO: return to step 1
   YES: continue to step 3
3. IRiS (combinatorial)
4. Analyze Results (statistical)
p-value Estimation
Comparison of the Randomization Schemes
SNP Blocks (granularity g=3)
IRiS
(Identifying Recombinations in Sequences)
Stage Haplotypes: use SNP block patterns
biological insights
Segment along the length: infer trees
computational insights
Infer network (ARG)
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium
Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns
Journal of Computational Biology, vol. 15(9), pp. 1-22, 2008
Segmentation
12345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345
11111111111111111111111111111111111111112222222222222222222222222222222222233333333344444444455555555555555----
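The second row above assigns each column position to a segment, with `-` apparently marking columns left unassigned (an assumption; the slide does not say). A minimal sketch of collapsing such a label row into (label, start, end) runs is given here. The function name `label_runs` and the toy input are hypothetical and not part of IRiS.

```python
from itertools import groupby
from typing import List, Tuple


def label_runs(labels: str) -> List[Tuple[str, int, int]]:
    """Collapse a per-position label string into (label, start, end) runs.

    Positions are 1-based and inclusive, matching the ruler row on the slide.
    """
    runs: List[Tuple[str, int, int]] = []
    pos = 1
    for label, group in groupby(labels):
        length = sum(1 for _ in group)  # number of consecutive equal labels
        runs.append((label, pos, pos + length - 1))
        pos += length
    return runs


# Toy input (not the slide's data): three segments, two unassigned columns.
for label, start, end in label_runs("1112233--"):
    print(f"segment {label}: columns {start}..{end}")
```

Each run corresponds to one stretch of the sequence for which a separate tree would be inferred.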
Consensus of Trees
Algorithm Design
1. Ensure compatibility of component trees
2. Parsimony model:
minimize the number of recombinations
Theorem:
The problem is NP-Hard.
No known algorithm guarantees an optimal solution in polynomial time.
DSR Scheme
(Dominant, Subdominant, Recombinant)
DSR Scheme: Level 1
DSR Assignment Rules
1. Each row and each column has at most one D; if it has no D, it has at most one S
2. A non-R can have other non-Rs either in its row or in its column, but NOT both
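The two assignment rules can be checked mechanically on a labeled matrix. The sketch below reads each rule literally and is not the authors' implementation. The cell encoding (`"D"`, `"S"`, `"R"`, and `None` for an empty cell) and the function name `satisfies_dsr_rules` are assumptions made for illustration.

```python
from typing import List, Optional

Cell = Optional[str]  # "D", "S", "R", or None for an empty cell (assumed encoding)


def satisfies_dsr_rules(matrix: List[List[Cell]]) -> bool:
    """Return True if the labeled matrix obeys both DSR assignment rules."""
    rows = [list(r) for r in matrix]
    cols = [list(c) for c in zip(*matrix)]

    # Rule 1: at most one D per row/column; if there is no D, at most one S.
    for line in rows + cols:
        d_count = line.count("D")
        s_count = line.count("S")
        if d_count > 1 or (d_count == 0 and s_count > 1):
            return False

    # Rule 2: a non-R cell may share its row OR its column with other
    # non-R cells, but not both.
    def is_non_r(cell: Cell) -> bool:
        return cell in ("D", "S")

    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if not is_non_r(cell):
                continue
            in_row = any(is_non_r(c) for k, c in enumerate(row) if k != j)
            in_col = any(is_non_r(c) for k, c in enumerate(cols[j]) if k != i)
            if in_row and in_col:
                return False
    return True


# Hypothetical 2x2 examples.
print(satisfies_dsr_rules([["D", "R"], ["R", "D"]]))
print(satisfies_dsr_rules([["D", "S"], ["S", "R"]]))
```

In the second example the D at the top left has a non-R neighbour in both its row and its column, so rule 2 rejects it even though rule 1 holds.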
DSR Scheme: Level 2
DSR Scheme: Level 3
DSR Scheme: Level 4
DSR Scheme: Level 5
Mathematical Analysis:
Approximation Factor
- Greedy DSR Scheme
- Z and Y are computable functions of the input
L Parida, A Javed, M Mele, F Calafell, J Bertranpetit and Genographic Consortium,
Minimizing Recombinations in Consensus Networks for Phylogeographic Studies, BMC Bioinformatics 2009
IRiS Output: RECOTYPE
Recombination vectors
      R1  R2  R3  R4  R5  R6  R7  R8  R9  R10 R11 R12 R13 R14
s1    1   0   0   0   1   1   1   1   0   0   0   0   1   0
s2    0   1   0   1   1   1   0   1   0   0   1   0   0   0
...   ...
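Because each recotype is a binary vector over the recombination events R1 to R14, samples can be compared position by position. The sketch below uses the two rows shown for s1 and s2 and assumes that a 1 means the sample carries that recombination event. The helper names `hamming` and `shared_events` are hypothetical, not IRiS functions.

```python
from typing import Dict, List

# The two recotype rows shown on the slide (R1..R14).
recotypes: Dict[str, List[int]] = {
    "s1": [1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0],
    "s2": [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
}


def hamming(a: List[int], b: List[int]) -> int:
    """Number of recombination events on which two recotypes differ."""
    if len(a) != len(b):
        raise ValueError("recotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def shared_events(a: List[int], b: List[int]) -> List[str]:
    """Names of the events (R1, R2, ...) present in both recotypes."""
    return [f"R{i}" for i, (x, y) in enumerate(zip(a, b), start=1) if x == 1 and y == 1]


print(hamming(recotypes["s1"], recotypes["s2"]))        # 6
print(shared_events(recotypes["s1"], recotypes["s2"]))  # ['R5', 'R6', 'R8']
```

A pairwise distance of this kind is one possible input to a distance-based clustering of samples; the choice of Hamming distance here is illustrative only.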
Quick Sanity Check:
Ultrametric Network on RECOTYPES
IRiS
(Identifying Recombinations in Sequences)
Stage Haplotypes: use SNP block patterns
biological insights
Segment along the length: infer trees
computational insights
Infer network (ARG)
IRiS software will be released by the end of summer ’09
Asif Javed
L Parida, M Mele, F Calafell, J Bertranpetit and Genographic Consortium
Estimating the Ancestral Recombinations Graph (ARG) as Compatible Networks of SNP Patterns
Journal of Computational Biology, vol. 15(9), pp. 1-22, 2008
What’s in a name?
RECOMBIN-OMICS (Jaume Bertranpetit)
RECOMBIN-OMETRICS (Robert Elston)
1. Allele-frequency variation between populations is also reflected in the purely recombination-based variation
2. Detects subcontinental divide from short segments
   - based on population-level analysis
3. Detects populations from short segments
   - based on recombination-event analysis
Are we ready for the OMICS / OMETRICS?
1. Allele-frequency variation between populations is also reflected in the purely recombination-based variation
2. Detects subcontinental divide from short segments
   - based on population-level analysis
3. Detects populations from short segments
   - based on recombination-event analysis
o population-specific signals?
o other critical signals?
o anything we didn’t already know?
Thank you!!