Presentation - Duke ECE

A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space
Paper by E. P. Xing and K-A. Sohn
Presented by Chunping Wang
Machine Learning Group, Duke University
February 26, 2007
Outline
• Terminology and Introduction
• DP Mixtures for Non-recombination Inheritance
• HMDP for Recombination
• Results
• Conclusions
Terminology and Introduction (1)
• Allele: a viable DNA coding on a chromosome – observation
• Locus: the location of an allele – index of an observation
• Haplotype: a sequence of alleles – data sequence
• Recombination: exchange of pieces between paired chromosomes – state transition
• Mutation: any change to a haplotype during inheritance – emission
Terminology and Introduction (2)
Ancestors
Descendants
Terminology and Introduction (3)
Problems:
1. Ancestral inference: recovering ancestral haplotypes;
2. Recombination analysis: inferring the recombination
hotspots;
3. Ancestral mapping: inferring the ancestral origin of
each allele in each modern haplotype.
DP Mixtures for Non-recombination Inheritance (1)
Non-recombination:
• Only mutation may occur during inheritance;
• Each modern haplotype originates from a single ancestor.
This assumption holds only for haplotypes spanning a short region of a chromosome.
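DP Mixtures for Non-recombination Inheritance (2)
$Q \mid \alpha, Q_0 \sim \mathrm{DP}(\alpha, Q_0)$
$\phi_i \mid Q \sim Q$
$h_i \mid \phi_i \sim P_h(\cdot \mid \phi_i)$
where $\phi_k^* = (a_k, \theta_k)$, $k = 1, \ldots, K$, the distinct values of $\{\phi_i\}_{i=1}^{n}$, denote the kth ancestor paired with the mutation parameter corresponding to the kth ancestor.
Figure: graphical model with nodes $Q_0$, $\alpha$, $Q$, $\phi_i$ and $h_i$, with a plate over the $n$ haplotypes.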
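The generative process of this DP mixture can be sketched with the Chinese restaurant process representation. The sketch below is illustrative only and makes several assumptions that are not in the slides: alleles are binary, a single mutation parameter `theta` is shared by all ancestors and read as the probability of copying an ancestral allele faithfully, new ancestors are drawn uniformly at random, and the function name and arguments are hypothetical.

```python
import random


def sample_dp_mixture_haplotypes(n, T, alpha, theta, seed=0):
    """Draw n binary haplotypes of length T from a DP mixture (no recombination).

    Assumptions (not from the slides): binary alleles, one shared theta,
    theta = probability that an allele is copied without mutation.
    """
    rng = random.Random(seed)
    ancestors = []    # ancestral haplotypes a_k, created on demand
    counts = []       # number of modern haplotypes assigned to each ancestor
    assignments = []  # ancestor index for each modern haplotype
    haplotypes = []
    for i in range(n):
        # Chinese restaurant process: an existing ancestor k is chosen with
        # probability counts[k] / (i + alpha), a new one with alpha / (i + alpha).
        u = rng.random() * (i + alpha)
        k, acc = None, 0.0
        for idx, c in enumerate(counts):
            acc += c
            if u < acc:
                k = idx
                break
        if k is None:
            # Open a new ancestor drawn from a uniform base measure.
            ancestors.append([rng.randint(0, 1) for _ in range(T)])
            counts.append(0)
            k = len(ancestors) - 1
        counts[k] += 1
        assignments.append(k)
        # Mutation only: each allele is kept with probability theta, else flipped.
        haplotypes.append(
            [a if rng.random() < theta else 1 - a for a in ancestors[k]]
        )
    return ancestors, assignments, haplotypes
```

The number of ancestors is not fixed in advance; it grows with `n` at a rate governed by `alpha`, which is the property that makes the DP mixture suitable for an open ancestral space.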
DP Mixtures for Non-recombination Inheritance (3)
HMDP for Recombination (1)
For long haplotypes that may descend from multiple ancestors, we consider recombinations (state transitions across discrete spatial intervals).
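Figure: graphical model of the HMDP, with base measure $F$, master DP $Q_0$, lower-level DPs $Q_1, Q_2, \ldots, Q_j$, and haplotype observations $h_i$ on plates of size $m_1, m_2, \ldots, m_j$.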
HMDP for Recombination (2)
• Each row of the transition matrix in the HMM is a DP. These DPs are linked by the top-level master DP and share the same set of target states.
• The mixing proportions for each lower-level DP are denoted $\pi_j = [\pi_{j,1}, \pi_{j,2}, \ldots]$; the jth row of the transition matrix is then $\pi_j$.
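One way to picture rows that are separate DPs yet share target states is a finite truncation: draw top-level weights by stick-breaking, then draw each row from a Dirichlet centred on those weights. This is a common truncated approximation, not necessarily the construction used in the paper. The concentration names `gamma` and `tau`, the truncation level `K`, and both function names are assumptions introduced here for illustration.

```python
import random


def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking weights of length K that sum to 1."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last stick absorbs the leftover mass
    return weights


def sample_transition_rows(gamma, tau, K, seed=0):
    """Sample K transition rows that all share the same K target states.

    Each row is Dirichlet(tau * beta), drawn by normalising gamma variates,
    where beta holds the top-level (master) weights.
    """
    rng = random.Random(seed)
    beta = stick_breaking(gamma, K, rng)
    rows = []
    for _ in range(K):
        # The floor keeps the gamma shape strictly positive for tiny weights.
        draws = [rng.gammavariate(max(tau * b, 1e-12), 1.0) for b in beta]
        total = sum(draws)
        rows.append([g / total for g in draws])
    return beta, rows
```

Because every row is centred on the same `beta`, states that are popular at the top level tend to be popular targets in every row, which is the coupling the master DP provides.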
HMDP for Recombination (3)
Modern haplotype
Ancestor haplotype
The indicators $C_{i,t}$ of the ith modern haplotype, one per locus $t$, specify the corresponding ancestral haplotype:
• when no recombination takes place during the inheritance process producing haplotype $H_i$, $C_{i,t} = k, \ \forall t$;
• when a recombination occurs between loci $t$ and $t+1$, $C_{i,t} \neq C_{i,t+1}$.
HMDP for Recombination (4)
Introduce a Poisson point process to control the duration of non-recombinant inheritance (space-inhomogeneous):
$p(x \mid \mu) = \frac{1}{x!} \mu^{x} e^{-\mu}$
where $x$ is the number of recombinations.
Denote
d: the physical distance between loci t and t+1;
r: recombination rate per unit distance.
Then, with $\mu = dr$,
$p(x = 0 \mid dr) = e^{-dr} \equiv 1 - \lambda$
$p(x > 0 \mid dr) = 1 - e^{-dr} \equiv \lambda$
HMDP for Recombination (5)
Combining this with the standard stationary HMDP gives the non-stationary state-transition probability:
$p(C_{i,t+1} = k' \mid C_{i,t} = k) = \lambda \pi_{k,k'} + (1 - \lambda)\,\delta(k, k')$
where $\lambda = 1 - e^{-dr}$. As d or r goes to infinity, $e^{-dr} \to 0$ and $\lambda \to 1$, so the inhomogeneous HMDP model reduces to a standard HMDP.
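As a small numerical illustration, the sketch below computes the probability of at least one recombination between two loci, 1 - exp(-d r), and uses it to mix a stationary transition row with the "stay in the same ancestor" indicator. The function names and the list-based row representation are hypothetical; `lam` stands for the recombination probability.

```python
import math


def recombination_probability(d, r):
    """P(at least one recombination) = 1 - exp(-d * r) for distance d and rate r."""
    return -math.expm1(-d * r)  # expm1 keeps precision when d * r is small


def nonstationary_transition_row(pi_k, k, d, r):
    """Row k of the distance-dependent transition matrix.

    pi_k is the stationary row for state k; the result is
    lam * pi_k[k'] + (1 - lam) * [k' == k] for every target state k'.
    """
    lam = recombination_probability(d, r)
    return [
        lam * p + (1.0 - lam) * (1.0 if k_next == k else 0.0)
        for k_next, p in enumerate(pi_k)
    ]
```

For adjacent loci with a tiny `d * r` the row is close to one-hot on the current ancestor, while for very distant loci it approaches the stationary row `pi_k`, matching the limiting behaviour described above.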
HMDP for Recombination (6)
Inference:
The prior base: $F(A, \theta) = p(A)\,p(\theta)$
$p(A)$ uniform
$\theta \sim \mathrm{Beta}(\alpha_h, \beta_h)$
The emission function: $p(h \mid c, a, \theta)$
Integrating over $\theta$, where $\theta \sim p(\theta)$, gives the marginal likelihood $p(h \mid c, a)$.
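To make the integration step concrete, here is a sketch under an assumed emission model that the slides do not spell out: alleles are binary, and each allele of a haplotype matches its ancestor independently with probability theta. With a Beta prior on theta, the marginal likelihood then has the closed form B(alpha_h + m, beta_h + m') / B(alpha_h, beta_h), where m and m' count matches and mismatches. The function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_likelihood(haplotype, ancestor, alpha_h, beta_h):
    """log p(h | a) with theta integrated out under Beta(alpha_h, beta_h).

    Assumed emission (not from the slides): each allele matches the ancestor
    with probability theta and mismatches with probability 1 - theta.
    Both sequences are expected to have the same length.
    """
    matches = sum(1 for h, a in zip(haplotype, ancestor) if h == a)
    mismatches = len(haplotype) - matches
    return (
        log_beta(alpha_h + matches, beta_h + mismatches)
        - log_beta(alpha_h, beta_h)
    )
```

Working in log space avoids underflow for long haplotypes, which matters when this quantity is evaluated many times inside a sampler.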
HMDP for Recombination (7)
Inference:
Combining the HDP prior with the marginal likelihood, we can infer the posterior over $\{C_{i,t}\}$ and $\{A_{k,t}\}$, which are the variables of interest.
Two sampling stages:
1. Sample $\{C_{i,t}\}$ given all haplotypes h and the most recently sampled ancestor pool a;
2. Sample every ancestor $A_k$ given all haplotypes h and the current $\{C_{i,t}\}$.
Results (1)
Simulated data:
30 populations, each consisting of 200 haplotypes derived from K=5 ancestral haplotypes; T=100 loci.
Compared models: HMDP and HMMs with K=3, 5 and 10.
Figure: the average ancestor reconstruction errors for the five ancestors.
Even the HMM with the true K=5 does not outperform the HMDP.
Results (2)
Figure: box plot of the empirical recombination rates, with Threshold 1 and Threshold 2 marked. The vertical gray lines are the pre-specified recombination hotspots.
Results (3)
Population maps: 1. true map; 2. HMDP;
3-5. HMMs with K=3,5,10
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.
Measure for accuracy: the mean squared distance to the true map
Results (4)
Real haplotype data set 1: Daly data – single population
512 haplotypes, T=103 loci.
Bottom: empirical recombination rates
Upper vertical lines: recombination hotspots.
Red dotted lines: HMM; blue dashed lines: MDL; black solid
lines: HMDP
Results (5)
Choose the threshold
A Gaussian mixture fitting of empirical recombination rates
Results (6)
Estimated population map
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.
Conclusions
• The HMDP model is an application and extension of the HDP to population genetics;
• The HDP allows the state space of the HMM to be infinite, which makes it suitable for inferring an unknown number of ancestral haplotypes;
• The HMDP model also allows the recombination rates to be non-stationary;
• The HMDP model can jointly infer a number of important genetic variables.