Short Communication

A Note on the Optimal Measure of Allelic Association

S. Shete∗
Department of Epidemiology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, 77030, USA

Summary

Association of pairs of loci due to departure of gamete frequencies from expected frequencies is called allelic association. This measure is useful for fine-scale mapping. There are several measures of allelic association in the literature. With the availability of a SNP map, the use of linkage disequilibrium to map genes is one of the central issues in the human genome project. The size of linkage disequilibrium blocks is of considerable interest. In this note, we investigate the statistical properties of a measure which is known to have the strongest theoretical population basis and to be least sensitive to the marker allele frequencies. In particular, we show that this measure has intuitive appeal and that the estimator for this measure has several statistically optimal properties such as consistency, asymptotic unbiasedness and asymptotic efficiency.

Introduction

Association of pairs of loci due to departure of gamete frequencies from the expected frequencies is called allelic association. This is also known as linkage disequilibrium or gametic phase disequilibrium. Allelic association is an important measure for localising disease genes. There are several ways of estimating this quantity. With the availability of a SNP map, the strength of linkage disequilibrium between loci across the genome is of considerable interest (Reich et al. 2001; Gabriel et al. 2002). It has been suggested that knowledge of such linkage disequilibrium blocks in the human genome will be useful in association studies for common genetic variations. Devlin & Risch (1995) compared several measures of linkage disequilibrium for fine-scale mapping. In this communication, we comment on the statistical efficiency of the estimator of ρ given by Collins & Morton (1998).
The optimal properties of this metric of allelic association were discussed in Morton et al. (2001), who showed that the measure ρ has the strongest population-theory basis and is least sensitive to marker allele frequencies when compared with other measures of association, and hence should be used for localisation of disease loci. In view of its usefulness, it is important to study the statistical properties of the estimator of ρ.

∗ Correspondence: Department of Epidemiology, Box 189, The University of Texas M. D. Anderson Cancer Center, 1515 Holcombe Blvd, Houston, TX, 77030, U.S.A. Telephone: (713) 745-2483. Fax: (713) 792-8261. E-mail: sshete@mdanderson.org

© University College London 2003. Annals of Human Genetics (2003) 67, 189–191

Methods and Discussion

Consider a disease locus with two alleles, D and d, and a marker locus with two alleles, M and m. Let Q be the frequency of disease allele D and R be the frequency of marker allele M, which is associated with D. The frequencies of the four possible gametes can be written as shown in Table 1 of Morton et al. (2001). These frequencies are:

\[
\begin{aligned}
\text{Frequency of MD haplotype} &= \pi_{11} = Q\rho + QR(1-\rho) \\
\text{Frequency of mD haplotype} &= \pi_{12} = (1-\rho)Q(1-R) \\
\text{Frequency of Md haplotype} &= \pi_{21} = (R-Q)\rho + R(1-Q)(1-\rho) \\
\text{Frequency of md haplotype} &= \pi_{22} = (1-R)\rho + (1-Q)(1-R)(1-\rho)
\end{aligned}
\]

These frequencies can be arranged in a two-by-two matrix. It is assumed that $\pi_{11}\pi_{22} \geq \pi_{12}\pi_{21}$, which can always be satisfied by interchanging columns of the two-by-two matrix. It is also assumed that Q ≤ R and Q ≤ 1 − Q, which can also be satisfied by interchanging rows of the two-by-two matrix. Collins & Morton (1998) defined the allelic association metric ρ as

\[
\rho = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{Q(1-R)}. \tag{1}
\]

If we let the random variable X equal one when allele D is present at the disease locus and zero when allele d is present, and let the random variable Y equal one when allele M is present at the marker locus and zero when allele m is present, then the numerator of (1) is the same as the covariance between X and Y. Hill (1974) gave the sampling variance of the numerator in (1).

This metric has intuitive appeal for measuring allelic association. One can measure the dependence between alleles M and D based on P(M | D), the conditional probability of obtaining allele M given that allele D is observed at the disease locus. If the two alleles are not associated, the conditional probability should be the same as the marginal probability, P(M), and hence any measure of association should be zero. Thus, one can define the measure of association as

\[
\gamma = P(M \mid D) - P(M). \tag{2}
\]

But this definition of association is not consistent with respect to haplotype frequencies. For example, if P(MD) = 0.4, P(M) = 0.7, and P(D) = 0.5, then γ = 0.1. If P(MD) = 0.3, P(M) = 0.5, and P(D) = 0.5, then also γ = 0.1. But in the first case there are 1.3 times more MD haplotypes than in the second case. One rational way to avoid this inconsistency is by dividing γ given in (2) by P(m), the frequency of allele m. When we do this, in the first case the measure of association would be 0.1/0.3 = 0.333, and in the second case it would be 0.1/0.5 = 0.2. But γ/P(m) is the same as ρ defined in (1): since γ = [P(MD) − P(M)P(D)]/P(D), dividing by P(m) = 1 − R gives the covariance divided by Q(1 − R).

This discussion shows the intuitive appeal of using the association metric ρ. Also note that (1) is equal to the absolute value of the allelic association measure D′ defined by Lewontin (1964). Metric D′ is obtained by dividing the covariance defined above by either the minimum of (P(d)P(M), P(D)P(m)) or the maximum of (−P(D)P(M), −P(d)P(m)), depending upon whether the covariance is positive or negative, respectively. This 'Minimum' or 'Maximum' function in the denominator of D′ makes it difficult to investigate the statistical distribution of the metric D′.

Next we consider the statistical properties of the estimator of ρ. Let $N_1$, $N_2$, and $N_3$ be the observed counts of haplotypes MD, mD, and Md, respectively, and N be the total number of haplotypes. Then, we can model $(N_1, N_2, N_3)$ as following a multinomial distribution. Let $N_4 = N - N_1 - N_2 - N_3$. The likelihood function belongs to the class of exponential families (see 1.4.1 of Lehmann, 1983) and can be written as

\[
L(N_1, N_2, N_3) = \exp\left\{ \sum_{i=1}^{3} \eta_i(\theta) T_i(x) - B(\theta) \right\} h(x), \tag{3}
\]

where θ = (ρ, Q, R),

\[
\eta_1(\theta) = \ln\left(\frac{\pi_{11}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right), \quad
\eta_2(\theta) = \ln\left(\frac{\pi_{12}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right), \quad
\eta_3(\theta) = \ln\left(\frac{\pi_{21}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right),
\]

\[
T_i(x) = N_i, \quad B(\theta) = -N \ln(1 - \pi_{11} - \pi_{12} - \pi_{21}), \quad \text{and} \quad h(x) = \frac{N!}{N_1!\,N_2!\,N_3!\,N_4!}.
\]

The maximum likelihood estimators of ρ, Q, and R can be obtained and are given below:

\[
\hat{\rho} = \frac{N_1 N_4 - N_2 N_3}{(N_1 + N_2)(N_2 + N_4)}, \quad
\hat{Q} = \frac{N_1 + N_2}{N}, \quad
\hat{R} = \frac{N_1 + N_3}{N}.
\]

Because the multinomial distribution is an exponential family of distributions, as shown by example 5.3 and theorem 4.1 of Lehmann (1983), it follows that the maximum likelihood estimator of ρ is consistent for estimating ρ and is asymptotically efficient, that is, it has the smallest possible asymptotic variance among all consistent estimators, and is also asymptotically unbiased (see Theorem 4.1 of Lehmann, 1983). These statistically optimal properties are also true for $\hat{Q}$ and $\hat{R}$. Moreover, $\sqrt{N}(\hat{\rho} - \rho, \hat{Q} - Q, \hat{R} - R)$ is asymptotically trivariate normal with a mean vector of zero; equivalently, for large N, $(\hat{\rho}, \hat{Q}, \hat{R})$ is approximately normal with mean (ρ, Q, R) and the variance-covariance matrix given below:

\[
I^{-1}(\theta) =
\begin{pmatrix}
\dfrac{[R(1-Q)(1-\rho) - 2Q(1-R)\rho(1-\rho) + \rho](1-\rho)}{NQ(1-R)} & -\dfrac{\rho(1-\rho)Q}{N} & \dfrac{\rho(1-\rho)(1-R)}{N} \\[2ex]
-\dfrac{\rho(1-\rho)Q}{N} & \dfrac{Q(1-Q)}{N} & \dfrac{Q(1-R)\rho}{N} \\[2ex]
\dfrac{\rho(1-\rho)(1-R)}{N} & \dfrac{Q(1-R)\rho}{N} & \dfrac{R(1-R)}{N}
\end{pmatrix} \tag{4}
\]

where I(θ) is Fisher's information matrix for the sample of N haplotypes.
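
As a numerical illustration of the quantities above, the following Python sketch computes the haplotype frequencies from (ρ, Q, R), the maximum likelihood estimators from a set of haplotype counts, and the large-sample variance of the estimator of ρ taken from the first diagonal entry of (4). The function names and the example counts are hypothetical and do not come from this note. Two further steps are assumptions made only for this sketch: the standard error is obtained by plugging the estimates into the variance formula, and the chi-squared statistic is formed as the squared estimate divided by its variance evaluated at ρ = 0.

```python
import math


def haplotype_freqs(rho, Q, R):
    """Frequencies (pi11, pi12, pi21, pi22) of haplotypes MD, mD, Md, md."""
    pi11 = Q * rho + Q * R * (1 - rho)
    pi12 = (1 - rho) * Q * (1 - R)
    pi21 = (R - Q) * rho + R * (1 - Q) * (1 - rho)
    pi22 = (1 - R) * rho + (1 - Q) * (1 - R) * (1 - rho)
    return pi11, pi12, pi21, pi22


def mle(n1, n2, n3, n4):
    """Maximum likelihood estimates (rho, Q, R) from counts of MD, mD, Md, md."""
    n = n1 + n2 + n3 + n4
    rho_hat = (n1 * n4 - n2 * n3) / ((n1 + n2) * (n2 + n4))
    q_hat = (n1 + n2) / n
    r_hat = (n1 + n3) / n
    return rho_hat, q_hat, r_hat


def var_rho_hat(rho, Q, R, n):
    """Large-sample variance of rho_hat: first diagonal entry of matrix (4)."""
    bracket = R * (1 - Q) * (1 - rho) - 2 * Q * (1 - R) * rho * (1 - rho) + rho
    return bracket * (1 - rho) / (n * Q * (1 - R))


# Hypothetical counts of MD, mD, Md, md haplotypes (illustration only).
counts = (400, 100, 300, 200)
n_total = sum(counts)

rho_hat, q_hat, r_hat = mle(*counts)

# Assumption: plug the estimates into the variance formula to get a standard error.
se_rho = math.sqrt(var_rho_hat(rho_hat, q_hat, r_hat, n_total))

# Assumption: test rho = 0 with rho_hat^2 divided by the variance under rho = 0.
chi2 = rho_hat ** 2 / var_rho_hat(0.0, q_hat, r_hat, n_total)

print("rho_hat =", rho_hat, "Q_hat =", q_hat, "R_hat =", r_hat)
print("plug-in standard error of rho_hat =", se_rho)
print("chi-squared statistic (1 d.f.) =", chi2)
```

With these counts the estimates reproduce the first worked example in the text (P(MD) = 0.4, P(M) = 0.7, P(D) = 0.5), so the estimate of ρ agrees with the value γ/P(m) obtained there.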
Of particular interest is that, for large N, $\hat{\rho}$ is approximately normal with mean ρ and variance equal to

\[
\frac{[R(1-Q)(1-\rho) - 2Q(1-R)\rho(1-\rho) + \rho](1-\rho)}{NQ(1-R)}.
\]

Under the null hypothesis that ρ = 0, we can see that

\[
\mathrm{Var}(\hat{\rho}) = \frac{R(1-Q)}{NQ(1-R)},
\]

as shown in Collins & Morton (1998). The asymptotic distribution theory presented here is useful for testing allelic association by using the chi-squared test statistic.

Recently, there has been interest in understanding the structure of haplotypes in the human genome by studying the linkage disequilibrium pattern (LD blocks) between loci (Reich et al. 2001; Gabriel et al. 2002). In view of the optimal statistical properties and intuitive features of the association metric ρ discussed in this note, and its population genetics theory basis discussed by Morton et al. (2001), we recommend that this metric be used to measure the linkage disequilibrium strength between loci across the genome.

Acknowledgments

I am grateful to Prof. Newton E. Morton for helpful suggestions and comments on an earlier version of the paper. I thank the two anonymous reviewers for their helpful constructive comments. I also thank Dr. Maureen Goode for comments which led to a better presentation of the material in this paper.

References

Collins, A. & Morton, N. E. (1998) Mapping a disease locus by allelic association. Proc Natl Acad Sci USA 95, 1741–1745.
Devlin, B. & Risch, N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322.
Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S. N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E. S., Daly, M. J. & Altshuler, D. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229.
Hill, W. G. (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33, 229–239.
Lehmann, E. L. (1983) Theory of point estimation. New York: John Wiley and Sons, Inc.
Lewontin, R. C. (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49, 49–67.
Morton, N. E., Zhang, W., Taillon-Miller, P., Kwok, P-Y. & Collins, A. (2001) The optimal measure of allelic association. Proc Natl Acad Sci USA 98, 5217–5221.
Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R. & Lander, E. S. (2001) Linkage disequilibrium in the human genome. Nature 411, 199–204.

Received: 4 April 2002
Accepted: 30 July 2002