Lecture 9: Linkage Analysis II
Date: 9/24/02

• Unknown linkage phase
• Mixture of linkage phase
• Mixture of self and random mating

Unknown Linkage Phase for Backcross
[Diagram: backcross configurations at loci A and B in three panels: coupling, repulsion, and no information (progeny marked "?").]

Unknown Linkage Phase F2
[Diagram: F2 parental configurations in three panels: coupling-coupling, repulsion-repulsion, and coupling-repulsion (dealt with later).]

Determining Linkage Phase: F2-CD
Goal: Calculate the likelihood for an F2 with one codominant and one dominant locus. Show that the coupling and repulsion likelihoods are symmetric about 0.5.

1. Determine the possible gametes and their probabilities. Assume coupling of A and B in both parents.

Gamete        AB         Ab      aB      ab
Probability   (1-q)/2    q/2     q/2     (1-q)/2

2. Determine the observable genotypes and their probabilities.

Genotype   Count   Probability
AAB-       f1      (1-q^2)/4
AAbb       f2      q^2/4
AaB-       f3      (1-q+q^2)/2
Aabb       f4      q(1-q)/2
aaB-       f5      q(2-q)/4
aabb       f6      (1-q)^2/4

3. Write an expression for the likelihood, then the log likelihood.

$$L_C(q) = \left[\frac{1-q^2}{4}\right]^{f_1} \left[\frac{q^2}{4}\right]^{f_2} \left[\frac{1-q+q^2}{2}\right]^{f_3} \left[\frac{q(1-q)}{2}\right]^{f_4} \left[\frac{q(2-q)}{4}\right]^{f_5} \left[\frac{(1-q)^2}{4}\right]^{f_6}$$

Dropping constants that do not depend on q:

$$l_C(q) = f_1 \log(1-q^2) + 2 f_2 \log q + f_3 \log(1-q+q^2) + f_4 \log[q(1-q)] + f_5 \log[q(2-q)] + 2 f_6 \log(1-q)$$

4. Repeat the whole process, now assuming repulsion phase, and obtain an expression for lR(q).
5. Confirm lC(q) = lR(1-q).

[Figure: Symmetry Around 0.5. Log likelihood versus recombinant fraction for the coupling phase and the repulsion phase.]

An Ad Hoc Linkage Phase Determination Method I
When the likelihood surfaces for the coupling and repulsion phases are symmetric about 0.5 (backcross, and F2 with 1 codominant marker), a single test is sufficient.
Calculate the G statistic under the coupling assumption (use lC(q)).
If it is significant and q<0.5, then the linkage is coupling.
If it is significant and q>0.5, then the linkage is repulsion.
If it is not significant, no determination can be made.

An Ad Hoc Linkage Phase Determination Method II
Use this method when the likelihood surface is not symmetric (e.g. F2 with dominant markers).
Calculate GC under the coupling model and GR under the repulsion model.
If either is significant and GC > GR, then the linkage is coupling.
If either is significant and GR > GC, then the linkage is repulsion.
Otherwise, no determination can be made.

Statistical Phase Determination: Error
• There is a high chance of making an error when linkage is loose.
• When q<0.3, the chance of error is small, even with sample sizes of ~20, except for F2-DD.
• An F2-DD cross needs a sample size >100 to keep the error down.
• The sample size needed decreases as linkage becomes tighter.

Once Linkage Phase Determined
Once the linkage phase has been determined, the analysis continues as before: assume the linkage phase is now known and do a phase-known analysis.

Phase-Unknown Gametes
[Diagram: pedigree with a doubly heterozygous father (AaBb). The gametes produced by the father are AB, ab, Ab, aB; the genotypes shown are AaBb, aabb, Aabb, aaBb.]
• There are multiple reasons why you may not know phase.
• One reason is that grandparents are unavailable.

Likelihood for Phase-Unknown Gametes
Let X be the count of AB and ab gametes. Let Y be the count of Ab and aB gametes.

$$\begin{aligned}
L(q) &= P(\text{data} \mid q) \\
     &= P(\text{data}, \text{coupled} \mid q) + P(\text{data}, \text{repulsion} \mid q) \\
     &= P(\text{data} \mid \text{coupled}, q)\,P(\text{coupled}) + P(\text{data} \mid \text{repulsion}, q)\,P(\text{repulsion}) \\
     &= \tfrac{1}{2}\, q^{Y} (1-q)^{X} + \tfrac{1}{2}\, q^{X} (1-q)^{Y}
\end{aligned}$$

Distribution of the Log Likelihood Ratio Test Statistic
Unfortunately, the test statistic G = 2(ln L1 - ln L0) does not have a regular distribution under the null of no linkage. Numerical approximation of the distribution is required. On the other hand, there is usually insufficient data in one family to get a significant test statistic.

Distribution When There Are Multiple Families
With X_k and Y_k the gamete counts in family k, the log likelihood sums over families:

$$l(q) = \sum_{k} \ln\left[\tfrac{1}{2}\, q^{Y_k} (1-q)^{X_k} + \tfrac{1}{2}\, q^{X_k} (1-q)^{Y_k}\right]$$

The distribution of G approaches a 50:50 mixture of a probability mass at 0 and a chi-squared distribution with one degree of freedom.
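The phase-unknown likelihood and the mixture reference distribution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the grid search over q, and the example family counts are hypothetical and are not taken from the lecture. Each family is represented by its pair of counts (X, Y), and the chi-squared(1) tail probability uses the identity P(chi2_1 > g) = erfc(sqrt(g/2)).

```python
import math

def phase_unknown_loglik(q, families):
    """Log likelihood summed over families; each family is a pair (X, Y)."""
    total = 0.0
    for x, y in families:
        coupled = q**y * (1 - q)**x      # AB and ab are the nonrecombinants
        repulsion = q**x * (1 - q)**y    # Ab and aB are the nonrecombinants
        total += math.log(0.5 * coupled + 0.5 * repulsion)
    return total

def linkage_test(families, grid_size=500):
    """Grid-search estimate of q on (0, 0.5], G statistic, and mixture p-value."""
    grid = [i / (2 * grid_size) for i in range(1, grid_size + 1)]
    q_hat = max(grid, key=lambda q: phase_unknown_loglik(q, families))
    g = 2 * (phase_unknown_loglik(q_hat, families)
             - phase_unknown_loglik(0.5, families))
    # 50:50 mixture of a point mass at 0 and chi-squared(1):
    # half of the chi-squared(1) tail probability when g > 0.
    p_value = 0.5 * math.erfc(math.sqrt(g / 2)) if g > 0 else 1.0
    return q_hat, g, p_value

# Hypothetical counts (X, Y) for three families.
families = [(9, 1), (1, 9), (8, 2)]
q_hat, g, p_value = linkage_test(families)
print(q_hat, g, p_value)
```

Because the two phase terms are swapped when X and Y are swapped, a family with counts (9, 1) contributes exactly as much as one with counts (1, 9), which is why the search can be restricted to q <= 0.5.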
In other words, when a large number of families is included in the study, linkage can be tested with a simple one-tailed chi-square test.

General Analysis with Missing Information: Step 1
Offspring genotypes: AaBb, aabb, Aabb, aaBb
1. Identify all possible mating types that could produce these offspring and their expected frequencies. (Retain phase information.)

All Possible Mating Types
Here p1 and p2 are the frequencies of alleles A and a, and q1 and q2 are the frequencies of alleles B and b (not the recombination fraction q).

Mating Type        Expected Frequency
AB/ab x AB/ab      (2p1p2q1q2)^2
AB/ab x Ab/aB      2(2p1p2q1q2)^2
Ab/aB x Ab/aB      (2p1p2q1q2)^2
AB/ab x Ab/ab      2(2p1p2q1q2)(2p1p2q2q2)
Ab/aB x Ab/ab      2(2p1p2q1q2)(2p1p2q2q2)
AB/ab x aB/ab      2(2p1p2q1q2)(2p2p2q1q2)
Ab/aB x aB/ab      2(2p1p2q1q2)(2p2p2q1q2)
AB/ab x ab/ab      2(2p1p2q1q2)(p2p2q2q2)
Ab/aB x ab/ab      2(2p1p2q1q2)(p2p2q2q2)
Ab/ab x aB/ab      2(2p1p2q2q2)(2p2p2q1q2)

General Analysis with Missing Information: Step 2
2. Conditional on parental mating type, calculate the probability of each offspring genotype.

Probability of Offspring Conditional on Mating Type
e.g. AB/ab x Ab/aB. Columns are gametes of the AB/ab parent, rows are gametes of the Ab/aB parent.

               AB (1-q)/2     Ab q/2         aB q/2         ab (1-q)/2
AB q/2         0.25q(1-q)     0.25q^2        0.25q^2        0.25q(1-q)
Ab (1-q)/2     0.25(1-q)^2    0.25q(1-q)     0.25q(1-q)     0.25(1-q)^2
aB (1-q)/2     0.25(1-q)^2    0.25q(1-q)     0.25q(1-q)     0.25(1-q)^2
ab q/2         0.25q(1-q)     0.25q^2        0.25q^2        0.25q(1-q)

General Analysis with Missing Information: Step 3
3. Calculate the unconditional probability of each offspring genotype.

For the example mating type, four gamete combinations give AaBb, each with probability 0.25q(1-q):

$$P(AaBb \mid AB/ab \times Ab/aB) = 4 \times 0.25\, q(1-q) = q(1-q)$$

Summing over mating types:

$$P(AaBb) = \sum_{\text{mating types}} P(AaBb \mid \text{mating type})\, P(\text{mating type})$$

General Analysis with Missing Information: Step 4
4. Sum the log-likelihood contributions over all possible offspring genotypes.

$$l(q) = \sum_{j \,\in\, \text{offspring genotypes}} f_j \log P_j$$

General Analysis with Missing Information: Step 5
5. The log-likelihood ratio statistic is asymptotically a 50:50 mixture of a point mass at 0 and a chi-squared distribution with one degree of freedom.

$$G = 2(\ln L_1 - \ln L_0)$$

Mixture of Linkage Phase
A mixture of linkage phase results when the two parents have different phases. Consider the F2 with coupling-repulsion parents.
AB/ab x Ab/aB

Mixture of Linkage Phase: Expected Genotype Frequency

Genotype   Count   Expected Frequency        Pi(R|G)
AABB       f1      0.25q(1-q)                0.5
AABb       f2      0.25[(1-q)^2+q^2]         q^2/[(1-q)^2+q^2]
AAbb       f3      0.25q(1-q)                0.5
AaBB       f4      0.25[(1-q)^2+q^2]         q^2/[(1-q)^2+q^2]
AaBb       f5      q(1-q)                    0.5
Aabb       f6      0.25[(1-q)^2+q^2]         q^2/[(1-q)^2+q^2]
aaBB       f7      0.25q(1-q)                0.5
aaBb       f8      0.25[(1-q)^2+q^2]         q^2/[(1-q)^2+q^2]
aabb       f9      0.25q(1-q)                0.5

Each single-heterozygote class (AABb, AaBB, Aabb, aaBb) arises either from two nonrecombinant gametes, with probability 0.25(1-q)^2, or from two recombinant gametes, with probability 0.25q^2, so its expected frequency is 0.25[(1-q)^2+q^2] and the nine frequencies sum to 1.

Mixture of Linkage Phase: Log Likelihood
Dropping constants, with N the total count:

$$L(q) = (f_1 + f_3 + f_5 + f_7 + f_9) \log[q(1-q)] + (f_2 + f_4 + f_6 + f_8) \log[(1-q)^2 + q^2]$$

Analytic MLE available. Setting 2q(1-q) equal to the observed proportion (f1+f3+f5+f7+f9)/N and taking the root with q <= 0.5:

$$\hat{q} = \frac{1}{2}\left[1 - \sqrt{1 - \frac{2(f_1 + f_3 + f_5 + f_7 + f_9)}{N}}\right]$$

When linkage is tight this is approximately (f1+f3+f5+f7+f9)/(2N).

Mixture of Self and Random Mating (MSR)
Controlled crosses are not always available; frequently, crosses resulting from open-pollinated populations are. These lead to MSR.
• Assume loci A and B are linked in coupling phase with recombination fraction q.
• Assume alleles A and a at locus A, and alleles B and b at locus B.
• Assume u and v are the frequencies of A and B in the pollen pool (e.g. the frequency of a is 1-u).
• Assume linkage equilibrium in the pollen.

MSR - Expected Frequencies for Codominant Alleles

Genotype   Count   Outcross                            Self
AABB       f1      0.5uv(1-q)                          0.25(1-q)^2
AABb       f2      0.5u[(1-v)(1-q)+vq]                 0.5q(1-q)
AAbb       f3      0.5u(1-v)q                          0.25q^2
AaBB       f4      0.5v[(1-u)(1-q)+uq]                 0.5q(1-q)
AaBb       f5      0.5[(1-q)-(1-2q)(u+v-2uv)]          0.5[(1-q)^2+q^2]
Aabb       f6      0.5(1-v)(u-2uq+q)                   0.5q(1-q)
aaBB       f7      0.5(1-u)vq                          0.25q^2
aaBb       f8      0.5(1-u)(v-2vq+q)                   0.5q(1-q)
aabb       f9      0.5(1-u)(1-v)(1-q)                  0.25(1-q)^2

MSR - Log Likelihood Function

$$L(q) = \sum_{i=1}^{9} f_i \log\left[t\, p_{oi} + (1-t)\, p_{si}\right]$$

• t is the probability of outcrossing (vs. selfing).
• p_oi is the expected frequency of type i progeny from outcrossing.
• p_si is the expected frequency of type i progeny from selfing.
• q enters through the expected frequencies given in the previous table.

Estimating Allelic Frequencies in Pollen Pool (u and v)
Use a single locus, say A.
Consider heterozygous maternal plants (Aa).
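Write an expression for the log-likelihood in the MSR population.
Condition on the outcrossing rate t.
Solve analytically for the MLE of u.

Estimating the Outcrossing Rate t
The prior analysis conditioned on the outcrossing rate t. Unfortunately, a heterozygous (Aa) mother is necessary to determine linkage but is the least informative for t.

MSR - Estimating Recombination Fraction q
EM: Calculate the conditional probabilities of recombination given the genotype. The superscripts Ab and aB mark the recombinant gamete types.

$$q_{n+1} = \frac{1}{N} \sum_{i=1}^{9} f_i \left[\left(p_{oi}^{Ab} + p_{oi}^{aB}\right) t + \left(p_{si}^{Ab} + p_{si}^{aB}\right)(1-t)\right]$$

NR: Calculate the score and information.

$$S(q) = \sum_{i=1}^{9} f_i\, \frac{d \log\left[(1-t)\, p_{si} + t\, p_{oi}\right]}{dq}$$

$$I(q) = -E\left[\sum_{i=1}^{9} f_i\, \frac{d^2 \log\left[(1-t)\, p_{si} + t\, p_{oi}\right]}{dq^2}\right]$$

MSR - Estimating q, u, and v (EM)
1. Pick initial estimates (u0, v0, q0).
2. Calculate expected gametic frequencies in the selfed and outcrossed populations, conditional on the current estimates and the observed genotype frequencies (terms of the form $t f_i p_{oi}^{g}$).
3. Calculate the MLE for (u1, v1, q1).
4. Iterate.

MSR - Estimating q, u, and v (NR)

$$S(q,u,v) = \begin{pmatrix} \partial L / \partial u \\ \partial L / \partial v \\ \partial L / \partial q \end{pmatrix}$$

$$I(q,u,v) = -\begin{pmatrix}
\dfrac{\partial^2 L}{\partial u^2} & \dfrac{\partial^2 L}{\partial u\, \partial v} & \dfrac{\partial^2 L}{\partial u\, \partial q} \\
\dfrac{\partial^2 L}{\partial u\, \partial v} & \dfrac{\partial^2 L}{\partial v^2} & \dfrac{\partial^2 L}{\partial q\, \partial v} \\
\dfrac{\partial^2 L}{\partial u\, \partial q} & \dfrac{\partial^2 L}{\partial q\, \partial v} & \dfrac{\partial^2 L}{\partial q^2}
\end{pmatrix}$$

$$\begin{pmatrix} u_{n+1} \\ v_{n+1} \\ q_{n+1} \end{pmatrix} = \begin{pmatrix} u_n \\ v_n \\ q_n \end{pmatrix} + \frac{1}{N}\, I^{-1} S$$

MSR - Linkage Information
• Linkage information content is sensitive to allele frequencies when outcrossing is high.
• Linkage information content decreases rapidly as the allelic frequencies approach 0.5.
• When linkage is tight, MSR provides less information relative to F2 than when linkage is loose, but high linkage is always more informative than low linkage.

MSR - Bias and Variance
• Bias and mean square error are higher for dominant markers than for codominant markers.
• Bias and mean square error are acceptable for q<0.2 only when the dominant allele frequency is less than the true q.
• When the dominant allele frequency is > 0.5, there is a high negative bias on q.
• Allele frequency cannot be accurately estimated when the true frequency is <0.1 or >0.5 and outcrossing is low.

Summary
Unknown linkage phase
Reducing the problem to a phase-known problem
Likelihood when phase unknown
Likelihood for general pedigree with missing information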
Likelihood for mixture of linkage phase
Mixture of Self and Random mating (MSR)
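
As an illustration of the MSR log likelihood listed in the summary, the following Python sketch builds the outcross and self progeny frequencies directly from gamete frequencies (a coupling-phase AB/ab mother, pollen in linkage equilibrium) and evaluates the sum of f_i log[t p_oi + (1-t) p_si]. It is a minimal sketch, not the lecture's own code: the function names, the parameter values (t = 0.6, u = 0.3, v = 0.4), the use of expected counts as data, and the grid search over q are all assumptions made for the example.

```python
import math
from itertools import product

GAMETES = ("AB", "Ab", "aB", "ab")

def maternal_gametes(q):
    """Gamete frequencies from a coupling-phase AB/ab mother."""
    return {"AB": (1 - q) / 2, "Ab": q / 2, "aB": q / 2, "ab": (1 - q) / 2}

def pollen_pool(u, v):
    """Pollen gamete frequencies under linkage equilibrium."""
    return {"AB": u * v, "Ab": u * (1 - v),
            "aB": (1 - u) * v, "ab": (1 - u) * (1 - v)}

def genotype(g1, g2):
    """Unphased two-locus genotype label, e.g. 'AaBb'."""
    locus_a = "".join(sorted(g1[0] + g2[0]))  # 'AA', 'Aa' or 'aa'
    locus_b = "".join(sorted(g1[1] + g2[1]))  # 'BB', 'Bb' or 'bb'
    return locus_a + locus_b

def progeny_freqs(female, male):
    """Expected genotype frequencies from two sets of gamete frequencies."""
    freqs = {}
    for g1, g2 in product(GAMETES, repeat=2):
        key = genotype(g1, g2)
        freqs[key] = freqs.get(key, 0.0) + female[g1] * male[g2]
    return freqs

def msr_loglik(q, t, u, v, counts):
    """Sum over genotypes of f_i * log[t * p_oi + (1 - t) * p_si]."""
    mother = maternal_gametes(q)
    p_out = progeny_freqs(mother, pollen_pool(u, v))
    p_self = progeny_freqs(mother, mother)
    return sum(f * math.log(t * p_out[g] + (1 - t) * p_self[g])
               for g, f in counts.items() if f > 0)

# Hypothetical data: expected counts for 1000 progeny at q = 0.1.
t, u, v = 0.6, 0.3, 0.4
mother = maternal_gametes(0.1)
p_out = progeny_freqs(mother, pollen_pool(u, v))
p_self = progeny_freqs(mother, mother)
counts = {g: 1000 * (t * p_out[g] + (1 - t) * p_self[g]) for g in p_out}

# Grid search for q on (0, 0.5] with t, u and v held fixed.
grid = [i / 200 for i in range(1, 101)]
q_hat = max(grid, key=lambda q: msr_loglik(q, t, u, v, counts))
print(q_hat)
```

Because the counts here are exact expectations at q = 0.1, the grid search is expected to return that value; with real counts, t, u and v would also have to be estimated, for example by the EM or NR schemes outlined in the lecture.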