Short Communication
A Note on the Optimal Measure of Allelic Association
S. Shete∗
Department of Epidemiology, The University of Texas M. D. Anderson Cancer Center, Houston, Texas, 77030, USA
Summary
Association of pairs of loci due to departure of gamete frequencies from expected frequencies is called allelic
association. This measure is useful for fine-scale mapping. There are several measures of allelic association in the
literature. With the availability of a SNP map, the use of linkage disequilibrium to map genes is one of the central
issues in the human genome project. The size of linkage disequilibrium blocks is of considerable interest. In this note,
we investigate the statistical properties of a measure which is known to have the strongest theoretical population basis
and be least sensitive to the marker allele frequencies. In particular, we have shown that this measure has intuitive
appeal and the estimator for this measure has several statistically optimal properties such as consistency, asymptotic
unbiasedness and asymptotic efficiency.
Introduction
Association of pairs of loci due to departure of gamete
frequencies from the expected frequencies is called allelic association. This is also known as linkage disequilibrium or gametic phase disequilibrium. Allelic association is an important measure for localising disease genes.
There are several ways of estimating this quantity. With
the availability of a SNP map, the strength of linkage disequilibrium between loci across the genome is of considerable interest (Reich et al. 2001; Gabriel et al. 2002).
It has been suggested that knowledge of such linkage disequilibrium blocks in the human genome will be useful
in association studies for common genetic variations.
Devlin & Risch (1995) compared several measures of
linkage disequilibrium for fine-scale mapping. In this
communication, we comment on statistical efficiency of
the estimator of ρ given by Collins & Morton (1998).
The optimal properties of this metric of allelic association were discussed in Morton et al. (2001). Morton
et al. (2001) showed that the measure ρ has the strongest
population-theory basis, and is least sensitive to marker
allele frequencies when compared with other measures
of association, and hence should be used for localisation
of disease loci. In view of its usefulness, it is important
to study the statistical properties of the estimator of ρ.
∗ Correspondence: Department of Epidemiology, Box 189, The
University of Texas M. D. Anderson Cancer Center, 1515
Holcombe Blvd, Houston, TX, 77030, U.S.A. Telephone: (713)
745-2483. Fax: (713) 792-8261. E-mail: sshete@mdanderson.org
© University College London 2003
Annals of Human Genetics (2003) 67, 189–191
Methods and Discussion
Consider a disease locus with two alleles, D and d, and
a marker locus with two alleles, M and m. Let Q be the
frequency of disease allele D and R be the frequency
of marker allele M, which is associated with D. The
frequencies of four possible gametes can be written as
shown in Table 1 of Morton et al. (2001). These frequencies are:
Frequency of MD haplotype = π11 = Qρ + QR(1 − ρ)
Frequency of mD haplotype = π12 = (1 − ρ)Q(1 − R)
Frequency of Md haplotype = π21 = (R − Q)ρ + R(1 − Q)(1 − ρ)
Frequency of md haplotype = π22 = (1 − R)ρ + (1 − Q)(1 − R)(1 − ρ)
These frequencies can be arranged in a two-by-two matrix. It is assumed that π11π22 ≥ π12π21, which can always be satisfied by interchanging columns of the two-by-two matrix. It is also assumed that Q ≤ R, 1 − Q,
which can also be satisfied by interchanging rows of the
two-by-two matrix. Collins & Morton (1998) defined
the allelic association metric ρ as
$$\rho = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{Q(1 - R)}. \tag{1}$$
If we let the random variable X equal one when allele D is present at the disease locus, and zero when
allele d is present, and if the random variable Y equals
one when allele M is present at the marker locus and
zero when allele m is present, then the numerator of
(1) is the same as the covariance between X and Y .
Hill (1974) gave the sampling variance of the numerator
in (1).
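The relationship between the parameters (Q, R, ρ), the four haplotype frequencies, and equation (1) can be checked numerically. The following is a minimal Python sketch; the function names and the numerical values of Q, R, and ρ are illustrative assumptions and do not come from the paper.

```python
def haplotype_frequencies(Q, R, rho):
    """Return (pi11, pi12, pi21, pi22) for haplotypes MD, mD, Md, md."""
    pi11 = Q * rho + Q * R * (1 - rho)
    pi12 = (1 - rho) * Q * (1 - R)
    pi21 = (R - Q) * rho + R * (1 - Q) * (1 - rho)
    pi22 = (1 - R) * rho + (1 - Q) * (1 - R) * (1 - rho)
    return pi11, pi12, pi21, pi22


def rho_from_frequencies(pi11, pi12, pi21, pi22):
    """Association metric rho of equation (1)."""
    Q = pi11 + pi12            # frequency of disease allele D
    one_minus_R = pi12 + pi22  # frequency of marker allele m
    return (pi11 * pi22 - pi12 * pi21) / (Q * one_minus_R)


# Illustrative parameter values only (assumed, not taken from the paper).
Q, R, rho = 0.2, 0.4, 0.5
pi = haplotype_frequencies(Q, R, rho)

# Cov(X, Y) = P(MD) - P(D)P(M), which equals the numerator of (1).
covariance_xy = pi[0] - Q * R
numerator = pi[0] * pi[3] - pi[1] * pi[2]
```

With these values the four frequencies sum to one, `rho_from_frequencies(*pi)` returns the ρ that was put in, and `covariance_xy` agrees with `numerator` up to floating-point rounding.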
This metric has intuitive appeal for measuring allelic
association. One can measure the dependence between
alleles M and D based on P (M | D), the conditional
probability of obtaining allele M given that allele D is
observed at the disease locus. If the two alleles are not associated, the conditional probability should be the same
as the marginal probability, P(M), and hence any measure of association should be zero. Thus, one can define
the measure of association as
$$\gamma = P(M \mid D) - P(M). \tag{2}$$
But this definition of association is not consistent
with respect to haplotype frequencies. For example,
if P(MD) = 0.4, P(M) = 0.7, and P(D) = 0.5, then
γ = 0.1. If P(MD) = 0.3, P(M) = 0.5, and P(D) =
0.5, then also γ = 0.1. But in the first case there are 1.3
times as many MD haplotypes as in the second case. One
rational way to avoid this inconsistency is by dividing γ
given in (2) by P(m), the frequency of allele m. When
we do this, in the first case the measure of association
would be 0.1/0.3 = 0.333, and in the second case it
would be 0.1/0.5 = 0.2. But γ/P(m) is the same as ρ
defined in (1), because γ/P(m) = (π11/Q − R)/(1 − R) = (π11 − QR)/[Q(1 − R)]
and π11π22 − π12π21 = π11 − QR. This discussion shows the intuitive appeal
of using the association metric ρ. Also note that (1) is
equal to the absolute value of allelic measure D′ defined
by Lewontin (1964). Metric D′ is obtained by dividing
the covariance defined above by either the minimum of
(P(d)P(M), P(D)P(m)) or the maximum of (−P(D)P(M),
−P(d)P(m)), depending upon whether the covariance
is positive or negative, respectively. This ‘Minimum’ or
‘Maximum’ function in the denominator of D′ makes
it difficult to investigate the statistical distribution of
the metric D′.
Next we consider the statistical properties of the estimator of ρ. Let N1, N2, and N3 be the observed frequencies (counts) of
haplotypes MD, mD, and Md, respectively, and N be the
total number of haplotypes. Then, we can model (N1,
N2, N3) as a multinomial distribution. Let N4 = N −
N1 − N2 − N3. The likelihood function belongs to
the class of exponential families (see 1.4.1 of Lehmann,
1983) and can be written as
$$L(N_1, N_2, N_3) = \exp\left\{\sum_{i=1}^{3} \eta_i(\theta)\, T_i(x) - B(\theta)\right\} h(x), \tag{3}$$
where
$$\eta_1(\theta) = \ln\left(\frac{\pi_{11}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right), \quad \eta_2(\theta) = \ln\left(\frac{\pi_{12}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right), \quad \eta_3(\theta) = \ln\left(\frac{\pi_{21}}{1 - \pi_{11} - \pi_{12} - \pi_{21}}\right),$$
$$T_i(x) = N_i, \quad B(\theta) = -N \ln(1 - \pi_{11} - \pi_{12} - \pi_{21}), \quad \text{and} \quad h(x) = \frac{N!}{N_1!\, N_2!\, N_3!\, N_4!}.$$
The maximum likelihood estimators of ρ, Q, and
R can be obtained and are given below:
$$\hat{\rho} = \frac{N_1 N_4 - N_2 N_3}{(N_1 + N_2)(N_2 + N_4)}, \qquad \hat{Q} = \frac{N_1 + N_2}{N}, \qquad \hat{R} = \frac{N_1 + N_3}{N}.$$
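The maximum likelihood estimators of ρ, Q, and R given in this note are simple functions of the four haplotype counts. A minimal Python sketch follows; the function name `estimate_association` and the example counts are hypothetical. The sketch assumes the table has already been arranged so that π11π22 ≥ π12π21 and Q ≤ R, and it does not enforce those conventions.

```python
def estimate_association(n1, n2, n3, n4):
    """Return (rho_hat, Q_hat, R_hat) from counts of MD, mD, Md, md."""
    n = n1 + n2 + n3 + n4
    d_count = n1 + n2  # haplotypes carrying disease allele D
    m_count = n2 + n4  # haplotypes carrying marker allele m
    if d_count == 0 or m_count == 0:
        raise ValueError("rho_hat is undefined when D or m is never observed")
    rho_hat = (n1 * n4 - n2 * n3) / (d_count * m_count)
    q_hat = d_count / n
    r_hat = (n1 + n3) / n
    return rho_hat, q_hat, r_hat


# Hypothetical counts matching the two worked cases in the text:
# P(MD) = 0.4, P(M) = 0.7, P(D) = 0.5 and P(MD) = 0.3, P(M) = 0.5, P(D) = 0.5.
first_case = estimate_association(4, 1, 3, 2)
second_case = estimate_association(3, 2, 2, 3)
```

Here `first_case[0]` and `second_case[0]` reproduce the values 0.1/0.3 and 0.1/0.5 obtained in the text by dividing γ by P(m).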
Because the multinomial distribution is an exponential family of distributions, as shown by example 5.3
and theorem 4.1 of Lehmann (1983), it follows that
the maximum likelihood estimator of ρ is consistent for
estimating ρ and is asymptotically efficient, that is, it
has the smallest possible asymptotic variance among all
consistent estimators and is also asymptotically unbiased
(see Theorem 4.1 of Lehmann, 1983). These statistically
optimal properties are also true for Q̂ and R̂.
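Consistency can be illustrated informally by simulation: as the number of sampled haplotypes grows, the estimate of ρ should settle near the true value. The sketch below uses only the Python standard library; the function name `simulate_rho_hat`, the seed, and the parameter values are assumptions chosen for illustration, and a single simulated run is not a proof of the asymptotic results.

```python
import random


def simulate_rho_hat(Q, R, rho, n, seed=0):
    """Draw n haplotypes from the model and return the MLE of rho."""
    probs = [
        Q * rho + Q * R * (1 - rho),                    # MD
        (1 - rho) * Q * (1 - R),                        # mD
        (R - Q) * rho + R * (1 - Q) * (1 - rho),        # Md
        (1 - R) * rho + (1 - Q) * (1 - R) * (1 - rho),  # md
    ]
    rng = random.Random(seed)
    draws = rng.choices(range(4), weights=probs, k=n)
    n1, n2, n3, n4 = (draws.count(i) for i in range(4))
    return (n1 * n4 - n2 * n3) / ((n1 + n2) * (n2 + n4))


# Estimates for increasing sample sizes; the true rho is 0.5 here.
for n in (100, 10_000, 200_000):
    print(n, round(simulate_rho_hat(0.2, 0.4, 0.5, n), 4))
```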
Moreover, (ρ̂ − ρ, Q̂ − Q, R̂ − R) is asymptotically
trivariate normal with a mean vector of zero and the
variance-covariance matrix given below:
$$
I^{-1}(\theta) =
\begin{pmatrix}
\dfrac{[R(1-Q)(1-\rho) - 2Q(1-R)\rho(1-\rho) + \rho](1-\rho)}{NQ(1-R)} & \dfrac{-\rho(1-\rho)Q}{N} & \dfrac{\rho(1-\rho)(1-R)}{N} \\[2ex]
\dfrac{-\rho(1-\rho)Q}{N} & \dfrac{Q(1-Q)}{N} & \dfrac{Q(1-R)\rho}{N} \\[2ex]
\dfrac{\rho(1-\rho)(1-R)}{N} & \dfrac{Q(1-R)\rho}{N} & \dfrac{R(1-R)}{N}
\end{pmatrix}
\tag{4}
$$
where I(θ) is Fisher’s information matrix. Of particular
interest is that ρ̂ − ρ is asymptotically normal
with a mean of zero and variance equal to [R(1 −
Q)(1 − ρ) − 2Q(1 − R)ρ(1 − ρ) + ρ](1 − ρ)/
[NQ(1 − R)]. Under the null hypothesis that ρ = 0,
we can see that Var(ρ̂) = R(1 − Q)/[NQ(1 − R)], as shown in Collins
& Morton (1998). The asymptotic distribution theory
presented here is useful for testing allelic association by
using the chi-squared test statistic.
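A test of ρ = 0 based on this variance can be sketched as follows. The function names are hypothetical, and substituting the estimates of Q and R into the null variance is an assumption of this sketch rather than a procedure spelled out in the note. With that substitution the statistic is algebraically identical to the usual Pearson chi-squared statistic for a two-by-two table, with one degree of freedom.

```python
def asymptotic_variance(rho, Q, R, n):
    """Large-sample variance of rho_hat: the (1,1) entry of matrix (4)."""
    bracket = R * (1 - Q) * (1 - rho) - 2 * Q * (1 - R) * rho * (1 - rho) + rho
    return bracket * (1 - rho) / (n * Q * (1 - R))


def association_chi_squared(n1, n2, n3, n4):
    """Statistic rho_hat**2 / Var(rho_hat | rho = 0) from haplotype counts."""
    n = n1 + n2 + n3 + n4
    q_hat = (n1 + n2) / n
    r_hat = (n1 + n3) / n
    rho_hat = (n1 * n4 - n2 * n3) / ((n1 + n2) * (n2 + n4))
    # Plug-in estimate of the null variance R(1 - Q) / [N Q (1 - R)].
    null_variance = asymptotic_variance(0.0, q_hat, r_hat, n)
    return rho_hat ** 2 / null_variance


def pearson_chi_squared(n1, n2, n3, n4):
    """Standard Pearson statistic for a 2x2 table, for comparison."""
    n = n1 + n2 + n3 + n4
    margins = (n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4)
    return n * (n1 * n4 - n2 * n3) ** 2 / margins


# Hypothetical counts of MD, mD, Md, md haplotypes.
statistic = association_chi_squared(140, 60, 260, 540)
```

The resulting `statistic` would be compared with the upper tail of a chi-squared distribution with one degree of freedom.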
Recently, there has been interest in understanding
the structure of haplotypes in the human genome by
studying the linkage disequilibrium pattern (LD blocks)
between loci (Reich et al. 2001; Gabriel et al. 2002).
In view of the optimal statistical properties and intuitive features of the association metric ρ discussed in
this note, and its population genetics theory basis discussed by Morton et al. (2001), we recommend that this
metric be used to measure the linkage disequilibrium
strength between loci across the genome.
Acknowledgments
I am grateful to Prof. Newton E. Morton for helpful suggestions and comments on an earlier version of the paper. I thank
the two anonymous reviewers for their helpful constructive
comments. I also thank Dr. Maureen Goode for comments
which led to a better presentation of the material in this
paper.
References
Collins, A. & Morton, N. E. (1998) Mapping a disease locus by allelic association. Proc Natl Acad Sci USA 95, 1741–1745.
Devlin, B. & Risch, N. (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics 29, 311–322.
Gabriel, S. B., Schaffner, S. F., Nguyen, H., Moore, J. M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., Liu-Cordero, S. N., Rotimi, C., Adeyemo, A., Cooper, R., Ward, R., Lander, E. S., Daly, M. J. & Altshuler, D. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229.
Hill, W. G. (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33, 229–239.
Lehmann, E. L. (1983) Theory of point estimation. New York: John Wiley and Sons, Inc.
Lewontin, R. C. (1964) The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49, 49–67.
Morton, N. E., Zhang, W., Taillon-Miller, P., Kwok, P-Y. & Collins, A. (2001) The optimal measure of allelic association. Proc Natl Acad Sci USA 98, 5217–5221.
Reich, D. E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P. C., Richter, D. J., Lavery, T., Kouyoumjian, R., Farhadian, S. F., Ward, R. & Lander, E. S. (2001) Linkage disequilibrium in the human genome. Nature 411, 199–204.
Received: 4 April 2002
Accepted: 30 July 2002