A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space
Paper by E. P. Xing and K-A. Sohn
Presented by Chunping Wang
Machine Learning Group, Duke University
February 26, 2007

Outline
• Terminology and Introduction
• DP Mixtures for Non-recombination Inheritance
• HMDP for Recombination
• Results
• Conclusions

Terminology and Introduction (1)
• Allele: a viable DNA coding on a chromosome – observation
• Locus: the location of an allele – index of an observation
• Haplotype: a sequence of alleles – data sequence
• Recombination: exchange of pieces between paired chromosomes – state-transition
• Mutation: any change to a haplotype during inheritance – emission

Terminology and Introduction (2)
[Figure: Ancestors and Descendants]

Terminology and Introduction (3)
Problems:
1. Ancestral inference: recovering ancestral haplotypes;
2. Recombination analysis: inferring the recombination hotspots;
3. Ancestral mapping: inferring the ancestral origin of each allele in each modern haplotype.

DP Mixtures for Non-recombination Inheritance (1)
Non-recombination:
• Only mutation may occur during inheritance;
• Each modern haplotype originates from a single ancestor.
This holds only for haplotypes spanning a short region of a chromosome.

DP Mixtures for Non-recombination Inheritance (2)
$Q \mid \alpha, Q_0 \sim \mathrm{DP}(\alpha, Q_0)$
$\phi_i \mid Q \sim Q$
$h_i \mid \phi_i \sim P_h(\phi_i)$
where $\phi_k^* = (a_k, \theta_k)$, $k = 1, \ldots, K$, are the distinct values of $\{\phi_i\}_{i=1}^{n}$; each pairs the kth ancestor $a_k$ with its mutation parameter $\theta_k$.
[Figure: graphical model with nodes $\alpha$, $Q_0$, $Q$, $\phi_i$ and $h_i$, and a plate over the n haplotypes]

DP Mixtures for Non-recombination Inheritance (3)

HMDP for Recombination (1)
For long haplotypes that may descend from multiple ancestors, we consider recombinations (state-transitions across discrete space-intervals).
[Figure: graphical model of the HMDP with base measure $F$, master DP $Q_0$, lower-level DPs $Q_1, Q_2, \ldots, Q_j$, and plates of sizes $m_1, m_2, \ldots, m_j$ over the haplotype variables]

HMDP for Recombination (2)
Each row of the transition matrix of the HMM is a DP. These DPs are linked by the top-level master DP and share the same set of target states.
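As a rough illustration of transition rows that share one set of target states through a master DP, the sketch below draws truncated stick-breaking weights for the master DP and then draws each row as a Dirichlet centred on those weights. The finite truncation, the Dirichlet stand-in for each lower-level DP, and the names `gamma`, `alpha`, `beta`, `stick_breaking` and `transition_row` are assumptions made for illustration; they are not taken from the paper.

```python
import random


def stick_breaking(gamma, truncation, rng):
    """Truncated stick-breaking weights for the master DP (assumed concentration gamma)."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, gamma)  # fraction broken off the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # leftover mass, so the weights sum to 1
    return weights


def transition_row(alpha, beta, rng):
    """One transition row drawn as Dirichlet(alpha * beta).

    This is the finite-truncation stand-in for a lower-level DP whose base
    measure is the master DP, so every row puts its mass on the same states.
    """
    # A Dirichlet draw is a vector of normalised Gamma draws; the tiny floor
    # only guards against a zero shape parameter.
    draws = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
beta = stick_breaking(gamma=2.0, truncation=6, rng=rng)  # shared master weights
rows = [transition_row(alpha=5.0, beta=beta, rng=rng) for _ in range(6)]

for row in rows:
    print(" ".join(f"{p:.3f}" for p in row))
```

Every row is a probability vector over the same six states. A larger `alpha` pulls the rows closer to the shared weights `beta`, while a smaller `alpha` lets individual rows concentrate on fewer states.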
The mixing proportions of the jth lower-level DP are denoted $\pi_j = [\pi_{j,1}, \pi_{j,2}, \ldots]$; the jth row of the transition matrix is then $\pi_j$.

HMDP for Recombination (3)
[Figure: Modern haplotype and Ancestor haplotype]
The indicators $C_{i,t}$ of the ith modern haplotype at all loci t specify the corresponding ancestral haplotype:
• when no recombination takes place during the inheritance process producing haplotype $H_i$: $C_{i,t} = k, \ \forall t$
• when a recombination occurs between loci t and t+1: $C_{i,t} \neq C_{i,t+1}$

HMDP for Recombination (4)
Introduce a Poisson point process to control the duration of non-recombinant inheritance (space-inhomogeneous):
$p(x \mid \lambda) = \frac{1}{x!} \lambda^{x} e^{-\lambda}$
where x is the number of recombinations.
Let d be the physical distance between loci t and t+1, and r the recombination rate per unit distance. Then
$p(x = 0 \mid dr) = e^{-dr}$
$p(x > 0 \mid dr) = 1 - e^{-dr}$

HMDP for Recombination (5)
Combined with the standard stationary HMDP, this gives the non-stationary state transition probability:
$p(C_{i,t+1} = k' \mid C_{i,t} = k) = \rho\, \pi_{k,k'} + (1 - \rho)\, \delta(k, k')$
where $\rho = 1 - e^{-dr}$ is the probability of at least one recombination between loci t and t+1.
As d or r goes to infinity, $e^{-dr} \to 0$ and $\rho \to 1$, so the inhomogeneous HMDP model reduces to a standard HMDP.

HMDP for Recombination (6)
Inference:
The base measure of the prior: $F(A, \theta) = p(A)\, p(\theta)$, with $p(A)$ uniform and $\theta \sim \mathrm{Beta}(\alpha_h, \beta_h)$.
The emission function is integrated over $\theta$, where $\theta \sim p(\theta)$, to give the marginal likelihood $p(h \mid c, a)$.

HMDP for Recombination (7)
Inference:
Combining the HDP prior and the marginal likelihood, we can infer the posterior of $\{C_{i,t}\}$ and $\{A_{k,t}\}$, the variables of interest.
Two sampling stages:
1. Sample $\{C_{i,t}\}$ given all haplotypes h and the most recently sampled ancestor pool a;
2. Sample every ancestor $A_k$ given all haplotypes h and the current $\{C_{i,t}\}$.

Results (1)
Simulated data: 30 populations, each consisting of 200 haplotypes derived from K=5 ancestral haplotypes.
T=100
Compare: HMDP, HMMs with K=3, 5 and 10
Figure: the average ancestor reconstruction errors for the five ancestors
Even the HMM with K=5 cannot beat the HMDP

Results (2)
Figure: box plot of the empirical recombination rates, with Threshold 1 and Threshold 2 marked
The vertical gray lines – the pre-specified recombination hotspots

Results (3)
Population maps: 1. true map; 2. HMDP; 3-5. HMMs with K=3, 5, 10
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.
Measure for accuracy: the mean squared distance to the true map

Results (4)
Real haplotype data set 1: Daly data – single population, 512 haplotypes, T=103
Bottom: empirical recombination rates
Upper vertical lines: recombination hotspots.
Red dotted lines: HMM; blue dashed lines: MDL; black solid lines: HMDP

Results (5)
Choose the threshold
Figure: a Gaussian mixture fitting of empirical recombination rates

Results (6)
Estimated population map
Each vertical thin line – one modern haplotype;
Each color – one ancestral haplotype.

Conclusions
• This HMDP model is an application and extension of the HDP to the field of population genetics;
• The HDP allows the state space of the HMM to be infinite, which makes it suitable for inferring an unknown number of ancestral haplotypes;
• The HMDP model also allows the recombination rates to be non-stationary;
• The HMDP model can jointly infer a number of important genetic variables.
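
The distance-dependent transition from HMDP for Recombination (5) can be checked numerically. The sketch below mixes a stationary transition row with a "stay with the same ancestor" indicator, using the probability of at least one recombination, 1 − e^{−dr}, as the mixing weight. The symbol name `rho`, the function name `transition_prob` and the 3-state matrix `pi` are hypothetical and chosen only for illustration.

```python
import math


def transition_prob(k, k_next, pi, d, r):
    """p(C_{t+1} = k_next | C_t = k) for loci separated by distance d.

    rho = 1 - exp(-d * r) is the probability of at least one recombination
    under a Poisson process with rate r per unit distance.
    """
    rho = 1.0 - math.exp(-d * r)
    stay = 1.0 if k == k_next else 0.0  # delta(k, k_next)
    return rho * pi[k][k_next] + (1.0 - rho) * stay


# Hypothetical stationary transition matrix over three ancestors.
pi = [
    [0.70, 0.20, 0.10],
    [0.30, 0.40, 0.30],
    [0.25, 0.25, 0.50],
]

# Loci at zero distance: no recombination is possible, so the ancestor is kept.
row_close = [transition_prob(0, j, pi, d=0.0, r=1e-3) for j in range(3)]

# Loci very far apart: exp(-d * r) vanishes and the stationary row is recovered.
row_far = [transition_prob(0, j, pi, d=1e6, r=1e-3) for j in range(3)]

print(row_close)
print(row_far)
```

With d = 0 the row collapses onto the current ancestor, and with a very large d it matches the first row of `pi`, which is the limiting behaviour stated on the slide. For any d in between, the row is a convex combination of the two and still sums to 1.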