Further investigation of linkage disequilibrium
between SNPs and their ability to identify associated
susceptibility loci
Introduction
The ability to fine-map disease susceptibility genes using closely spaced markers
depends on the linkage disequilibrium (LD) relationships between the polymorphisms
involved and on the nature of the effect of the disease polymorphism on susceptibility.
The process is complicated by the fact that the relationship between physical distance
and LD is not monotonic but is affected by historical events such as genetic drift,
admixture, multiple mutations and natural selection (Nordborg and Tavare, 2002) and
by variations in recombination rates between regions leading to phenomena such as
the observation of haplotype blocks (Cardon and Abecasis 2003). The ability to detect
association of a marker with disease will also depend on the marker’s allele
frequencies. We have shown (Sham et al. 2000) that for relatively rare mutations the
ability to detect association with disease increases as the number of alleles of a marker
rises to around 5 or 10, and that this effect is especially strong when more than one
pathogenic mutation event has occurred in the gene during evolution. Finally, for a
given pattern of linkage disequilibrium it will be easier to detect a susceptibility
polymorphism which has a major rather than a minor effect on risk of affection.
To explore these issues, a previous study investigated 60 SNPs in the region
surrounding the APOE locus in a sample of cases suffering from Alzheimer’s disease
and controls (Martin et al, 2000). The APOE locus provides a good example of a
polymorphism affecting susceptibility to a disease with a complex mode of
inheritance. Studying the surrounding SNPs allowed insights to be gained regarding
the ability to fine-map loci with similar effects, assuming that similar
patterns of LD might be present. SNPs were analysed in a region ~1.5 cM distal of
APOE and ~190 kb proximal. 16 of the 60 SNPs showed evidence for association
with disease at p<0.05 using either allelic or genotypic tests for association, and of
these 7 lay within 40 kb of APOE. Of these 7, 4 yielded a significance at p<0.05 when
corrected for the number of markers analysed. The APOE-4 polymorphism itself
produced highly significant evidence for association, as did 2 other SNPs nearby. The
evidence for association with APOE-4 was nevertheless appreciably stronger than for
these other two loci. The authors concluded that a high density of SNPs would be
necessary in order to have a good chance of including SNPs with detectable levels
of allelic association with a disease mutation.
A recent study of markers around the DBH locus has produced results which are in
some ways similar (Zabetian et al 2003). This found that only markers within a 10 kb
haplotype block were strongly associated with the phenotype influenced by a DBH
polymorphism, and the authors concluded that if by chance these markers had not
been included in an association study then the role of the DBH locus could have been
overlooked.
In the current study we sought to gain further information regarding the likely
usefulness of SNPs in case-control studies by using the APOE dataset but treating
each SNP in turn as if it were a disease susceptibility locus. While the previous study
included 59 marker SNPs along with the susceptibility polymorphism itself, this
would allow us to investigate thousands of pairs of marker-disease loci (the pairs
admittedly being non-independent). To do this, we used the observed patterns of LD
in the dataset and calculated the allele frequencies we would expect to observe at each
SNP given a specified disease model. We sought in particular to address three issues
regarding the power of association studies, namely: the expected observed association
with the susceptibility polymorphism itself; the ability of nearby markers to detect
association; the ability to distinguish between the susceptibility polymorphism and
marker loci also showing association.
Method
As previously described (Martin et al, 2000) the data consisted of 220
unrelated cases of late onset Alzheimer’s disease and 220 controls collected by the
Bryan Alzheimer’s Disease Research Center at Duke University Medical Center, all
subjects being white. 60 SNPs in the region surrounding APOE, including APOE-4
itself, were genotyped, new SNPs being generated by screening seven controls using
YAC truncation and random sequencing (Lai et al, 1998). For the current study one of
the SNPs, SNP474, which lies a long way distant from the others was discarded. For
the remaining 59 SNPs, the allele frequencies from the combined dataset of cases and
controls were calculated and two-locus haplotype frequencies were estimated using
the EH+ program (Zhao et al, 2000). These frequency estimates were then used as if
they represented population allele and haplotype frequencies. It should be noted that
because the case and control samples were of equal size this would lead to an
overestimate of the frequency of the APOE-4 allele and other alleles in LD with it
which are also associated with Alzheimer’s disease. Likewise, the magnitude of LD
occurring with associated markers might be somewhat different from that which
would be observed in an unselected sample. We do not believe that this poses a
problem for the current study, since we wished merely to obtain some representative
estimates for allele frequencies of SNPs generally and for typical patterns of LD
between pairs of SNPs near to each other. The observed allele frequencies and
estimated haplotype frequencies were used to calculate the D’ measure of LD
(Lewontin 1964) and this was plotted against the distance between loci in each pair.
We used the observed allele frequencies and estimated haplotype frequencies to
calculate the allele frequencies we would expect to observe for all SNPs if we
obtained a sample of cases and controls, contingent on each SNP in turn being a
susceptibility polymorphism for a disease. Thus, for each SNP we calculated the
expected numbers of cases and controls having each genotype if that SNP itself
influenced susceptibility to disease. Next, using the estimated haplotype frequencies
we calculated the expected genotypes of every other SNP in cases and controls given
the expected genotypes of the disease SNP. We could then calculate what evidence
we would obtain for association if the disease SNP itself or any of the 58 other SNPs
were analysed. Repeating this process for each SNP would yield 59 studies of
association with a disease locus and 3,422 studies of a marker lying close to a disease
locus.
To model the effects of the susceptibility locus, we assigned penetrance values
previously estimated in a Finnish population (Kuusisto et al, 1994) to be the
probabilities of affection with Alzheimer’s disease conditional on possessing 0, 1 or
2 copies of the APOE-4 allele. These values were 0.029, 0.076 and 0.214
respectively. They were used for every SNP,
regardless of its allele frequency. What this means is that we were assuming that the
polymorphism would have a constant biological effect in terms of its effect on
absolute risk of the disease. However using various allele frequencies would mean
that the diseases being modelled had different prevalences in the population and also
that the risk fraction attributable to the locus under consideration would vary. We took
these penetrance values as an example of a susceptibility locus having a known effect
size. Of course, the power studies we performed would be critically dependent on the
disease model and our results can thus only be taken to provide some illustration of
the kinds of effects which might be seen in practice.
Taking each SNP as the disease locus and having observed allele frequencies $p_1$ and
$p_2$, we calculated the expected frequency for each genotype as follows.
For affected cases: $m^A_{11} = f_{11}p_1^2/K$, $m^A_{12} = 2f_{12}p_1p_2/K$ and $m^A_{22} = f_{22}p_2^2/K$
For unaffected controls: $m^U_{11} = (1-f_{11})p_1^2/(1-K)$, $m^U_{12} = 2(1-f_{12})p_1p_2/(1-K)$ and $m^U_{22} = (1-f_{22})p_2^2/(1-K)$
where the population prevalence of the disease is $K = f_{11}p_1^2 + 2f_{12}p_1p_2 + f_{22}p_2^2$ and the
penetrances have the previously stated values $f_{11}=0.029$, $f_{12}=0.076$ and $f_{22}=0.214$.
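As a rough illustration of the calculation above, the following Python sketch computes the prevalence K and the expected genotype frequencies in cases and controls for a single biallelic disease locus. The function name `disease_genotype_freqs` and the example allele frequency of 0.2 are illustrative assumptions and not part of the original analysis. Population genotypes are assumed to be in Hardy-Weinberg proportions, and allele 2 is taken to be the risk allele, which matches the ordering of the penetrances.

```python
# Penetrances for genotypes 11, 12 and 22 (0, 1 or 2 copies of the risk
# allele), as stated in the text.
PENETRANCES = (0.029, 0.076, 0.214)


def disease_genotype_freqs(p2, penetrances=PENETRANCES):
    """Return (K, case_freqs, control_freqs) for a biallelic disease locus.

    p2 is the population frequency of allele 2 (assumed here to be the risk
    allele). Genotype frequencies are ordered 11, 12, 22 and are assumed to
    follow Hardy-Weinberg proportions in the population.
    """
    p1 = 1.0 - p2
    population = (p1 * p1, 2.0 * p1 * p2, p2 * p2)
    # Prevalence: probability of being affected, summed over genotypes.
    K = sum(f * g for f, g in zip(penetrances, population))
    cases = tuple(f * g / K for f, g in zip(penetrances, population))
    controls = tuple(
        (1.0 - f) * g / (1.0 - K) for f, g in zip(penetrances, population)
    )
    return K, cases, controls


# 0.2 is an arbitrary example frequency, not a value from the dataset.
K, cases, controls = disease_genotype_freqs(0.2)
print(f"K = {K:.4f}")
print("cases:   ", [round(m, 4) for m in cases])
print("controls:", [round(m, 4) for m in controls])
```

Because the penetrances are held fixed, changing `p2` changes K, which is the sense in which different allele frequencies correspond to diseases of different prevalence.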
To calculate the frequency of marker genotypes for the other SNPs we use the
estimated haplotype frequencies, using $h_{dj}$ to denote the frequency of the haplotype
having allele d at the disease locus and allele j at the marker locus. Expected
frequencies for marker homozygotes (jj) and heterozygotes (jk, with j≠k) were
calculated according to the formulae given in Sham (1998).
In cases:
$m^A_{jj} = (f_{11}h_{1j}^2 + 2f_{12}h_{1j}h_{2j} + f_{22}h_{2j}^2)/K$
and:
$m^A_{jk} = 2(f_{11}h_{1j}h_{1k} + f_{12}(h_{1j}h_{2k} + h_{1k}h_{2j}) + f_{22}h_{2j}h_{2k})/K$
In controls:
$m^U_{jj} = ((1-f_{11})h_{1j}^2 + 2(1-f_{12})h_{1j}h_{2j} + (1-f_{22})h_{2j}^2)/(1-K)$
and:
$m^U_{jk} = 2((1-f_{11})h_{1j}h_{1k} + (1-f_{12})(h_{1j}h_{2k} + h_{1k}h_{2j}) + (1-f_{22})h_{2j}h_{2k})/(1-K)$
We used these genotype frequencies to calculate expected allele counts in samples of
$N_0$ controls and $N_1$ cases, so that for each allele j of a biallelic marker
$n_{0j} = N_0(2m^U_{jj} + m^U_{jk})$ and $n_{1j} = N_1(2m^A_{jj} + m^A_{jk})$.
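The marker calculation can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used for the study: the function name `marker_allele_counts`, the 0-based indexing of alleles and the example haplotype frequencies are all hypothetical. It assumes random pairing of haplotypes, so that a marker heterozygote can arise in two haplotype orders.

```python
def marker_allele_counts(h, penetrances, n_cases, n_controls):
    """Expected marker allele counts in cases and controls.

    h[d][j] is the frequency of the haplotype carrying allele d at the
    disease locus and allele j at the marker locus (both indexed 0 and 1).
    penetrances gives the risk for disease genotypes 11, 12 and 22.
    """
    f11, f12, f22 = penetrances
    p1, p2 = sum(h[0]), sum(h[1])
    K = f11 * p1 * p1 + 2 * f12 * p1 * p2 + f22 * p2 * p2

    def genotype_freq(j, k, affected):
        # Weights are penetrances for cases and their complements for controls.
        if affected:
            w11, w12, w22 = f11, f12, f22
        else:
            w11, w12, w22 = 1 - f11, 1 - f12, 1 - f22
        total = w11 * h[0][j] * h[0][k]
        total += w12 * (h[0][j] * h[1][k] + h[0][k] * h[1][j])
        total += w22 * h[1][j] * h[1][k]
        if j != k:
            total *= 2  # a heterozygote can arise in two haplotype orders
        return total / (K if affected else 1 - K)

    counts = {}
    for label, affected, n in (
        ("cases", True, n_cases),
        ("controls", False, n_controls),
    ):
        het = genotype_freq(0, 1, affected)
        counts[label] = [
            n * (2 * genotype_freq(j, j, affected) + het) for j in (0, 1)
        ]
    return counts


# Hypothetical haplotype frequencies with LD between disease and marker loci.
h_example = [[0.70, 0.10], [0.05, 0.15]]
print(marker_allele_counts(h_example, (0.029, 0.076, 0.214), 220, 220))
```

Two limiting cases give a quick check on the formulae: with no LD the expected marker allele counts are the same in cases and controls, and with complete LD and matching allele frequencies the marker behaves exactly like the disease locus.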
In order to determine the strength of evidence for association that might be observed
in a case-control study we then calculated the value for a likelihood ratio test statistic
based on these expected counts:
$G^2 = 2 \sum_{i,j} n_{ij} \ln(n_{ij}/\nu_{ij})$
where the expected counts assuming no association are given by
$\nu_{ij} = N_i(n_{0j}+n_{1j})/(N_0+N_1)$. This test statistic is asymptotically distributed as a chi-squared
with one degree of freedom. It has the advantage over the commonly used Pearson’s
chi-squared test that it scales linearly with sample size so that once one knows the
value of the statistic one can expect from a given sample one can readily calculate the
sample size which would be expected to produce any given value of the test statistic.
We denote the statistic obtained for the disease polymorphism itself as G2D and for
each marker SNP as G2M.
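A minimal Python sketch of the statistic is given below, assuming a table of expected allele counts with one row for controls and one for cases. The names `g2_statistic` and `sample_size_for` and the example counts are hypothetical. The second function simply applies the linear scaling with sample size described above.

```python
import math


def g2_statistic(control_counts, case_counts):
    """Likelihood ratio statistic G2 for a 2 x k table of allele counts."""
    rows = (control_counts, case_counts)
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    total = 0.0
    for i, row in enumerate(rows):
        for j, n in enumerate(row):
            if n > 0:  # a zero cell contributes nothing to the sum
                nu = row_totals[i] * col_totals[j] / grand_total
                total += n * math.log(n / nu)
    return 2.0 * total


def sample_size_for(target_g2, g2_observed, n_subjects):
    """Subjects needed for an expected statistic of target_g2, by linear scaling."""
    return n_subjects * target_g2 / g2_observed


# Hypothetical expected allele counts for 220 controls and 220 cases.
g2 = g2_statistic([350, 90], [300, 140])
print(f"G2 = {g2:.2f}")
print(f"subjects for G2 = 20: {sample_size_for(20.0, g2, 440):.0f}")
```

The row totals here are allele totals (twice the number of subjects), so the expected counts agree with the expression for the expected counts given in the text.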
For every pair of SNPs we calculated and plotted the LD measure D’ against distance
between them. We also calculated the chi-squared statistic indicating the statistical
significance of deviance from the haplotype frequencies expected assuming no LD,
using the full sample of 440 case and control subjects. We then assumed the same
sample size of 440 to calculate the statistics we would expect to obtain using the
procedures above, taking each SNP in turn as if it affected disease susceptibility. For
each SNP we calculated and plotted the value of G2D, indicating the evidence for
association with the disease locus itself. We then paired it with each other SNP in
turn, treating the other SNP as a marker, and calculated the value for
G2M, indicating the evidence for association of the marker SNP. For each of these
pairs of SNPs we then plotted the value of G2M against the distance between the
marker and the disease locus.
When carrying out a case-control association study, association may be observed
both with the disease polymorphism itself and with markers in linkage disequilibrium
with it. If the observed association is much stronger with the disease polymorphism
then one may use this observation to distinguish which of the associated
polymorphisms directly affects susceptibility and which ones are markers showing
association due to LD. However if the LD is very strong then a marker may be as
strongly associated as the disease polymorphism and it may be difficult to distinguish
which is which on genetic grounds. If all the SNPs in a particular region were typed
then a number might show association with disease status. However provided only
one actually influenced risk directly then it would be expected to show stronger
association than the others, which would just be acting as markers. In order to
investigate the power of a case-control sample to distinguish disease from marker
polymorphisms, for each SNP treated as a disease polymorphism we compared the
value of G2D against the maximum value for G2M obtained for any of the other SNPs.
The difference between these values, G2D-max(G2M), provides a measure of the
strength of evidence for the disease polymorphism rather than the most strongly
associated marker polymorphism to be the one which directly influences susceptibility
to disease. For example, since the statistics are based on 2ln(L), if the value of G2D-max(G2M) exceeds 2ln(100)=9.2 then the likelihood ratio in favour of the disease
polymorphism rather than the marker polymorphism affecting susceptibility would be
greater than 100. We therefore measured G2D-max(G2M) for each SNP and plotted it
against G2D.
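The comparison can be sketched in a few lines of Python. The function name `support_for_disease_snp` and the statistics passed to it are hypothetical values chosen for illustration. The conversion to a likelihood ratio follows the interpretation given above, in which a difference of 2ln(100) corresponds to a ratio of 100.

```python
import math


def support_for_disease_snp(g2_disease, g2_markers):
    """Return (G2D - max(G2M), implied likelihood ratio) for one disease SNP."""
    difference = g2_disease - max(g2_markers)
    return difference, math.exp(difference / 2.0)


# Hypothetical expected statistics for one disease SNP and three markers.
difference, ratio = support_for_disease_snp(30.0, [4.2, 18.5, 11.0])
print(f"G2D - max(G2M) = {difference:.1f}, likelihood ratio = {ratio:.0f}")
print(f"threshold for a ratio of 100: {2 * math.log(100):.2f}")
```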
Results
Figure 1 shows the value of D’ plotted against inter-SNP distance for every pair of
SNPs. It can be seen that high values of D’ can be found even between SNPs which
are separated by distances of over 1,000 kb. It does not appear that D’ declines
particularly rapidly with distance and within 1000 kb there is only a fairly weak
negative correlation of r=−0.20 between D’ and distance. However Figure 2 throws a
somewhat different light on this finding. This shows that the statistical significance of
the LD which we might expect to observe with a sample size of 440, as measured by a
chi-squared statistic having one degree of freedom, in fact falls off very rapidly with
distance. Several extremely high chi-squared statistics are obtained between SNPs
within 150 kb of each other and most of the statistically significant statistics occur
within this range, although a few highly significant statistics occur at distances up to
300kb. The explanation for the discrepancy between the distances over which high
values of D’ are found and over which these values are statistically significant is that
the high D’ values at large distances tend to be produced by SNPs having low allele
frequencies. These can lead to the absence or near-absence of one haplotype,
producing a high D’ but one which is not statistically significant (Zapata et al 2001).
This is an important point and shows that relying on D’ as a measure of LD may not
tell us much about what we might call useful LD, that is the kind of LD which can
detect statistically significant association between loci.
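The contrast between D’ and statistical significance can be illustrated with a short Python sketch. It assumes two biallelic loci and takes the chi-squared statistic to be the number of haplotypes multiplied by r², which is the Pearson statistic for a 2 x 2 table of haplotype counts with known phase. The original analysis estimated haplotype frequencies from unphased genotypes, so this is a simplification, and the function name `ld_summary` and the example frequencies are hypothetical.

```python
def ld_summary(h11, p1, q1, n_haplotypes):
    """Return (D', chi-squared) for two biallelic loci.

    h11 is the frequency of the haplotype carrying allele 1 at both loci;
    p1 and q1 are the allele 1 frequencies at the two loci.
    """
    d = h11 - p1 * q1
    if d >= 0:
        d_max = min(p1 * (1 - q1), (1 - p1) * q1)
    else:
        d_max = min(p1 * q1, (1 - p1) * (1 - q1))
    d_prime = d / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p1 * (1 - p1) * q1 * (1 - q1))
    return d_prime, n_haplotypes * r_squared


# 440 subjects give 880 haplotypes.
# Rare alleles with one haplotype absent: |D'| is 1 but chi-squared is small.
print(ld_summary(0.0, 0.02, 0.1, 880))
# Common alleles in complete LD: D' is 1 and chi-squared is very large.
print(ld_summary(0.4, 0.4, 0.4, 880))
```

In the first case the two rare alleles never occur together, which is enough to give D’ a magnitude of 1, yet the expected number of such haplotypes under no LD is so small that their absence carries little statistical weight.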
Figure 3 shows the values for G2D, the statistical evidence produced by the
susceptibility polymorphism itself showing association with disease status, which we
might expect to observe with a sample size of 440. These values are entirely
dependent on the allele frequencies of the SNP in question, and show that for this
sample size many susceptibility loci having the kind of effect modelled can be
expected to produce statistically significant evidence for association with disease,
which will be strongest when the alleles have similar frequencies. This is the case
when the rarer allele still has a frequency of more than 0.15-0.2. However as this
frequency becomes lower the expected value for G2D falls rapidly and it can be seen
that several SNPs would produce values which would be regarded as only of
borderline significance or not significant at all. It should be remembered that the
expected value of G2D is directly proportional to the sample size used, so another
way of putting this observation would be to say that larger sample sizes would be
required to detect the effect of those SNPs in which one allele was relatively rare.
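The dependence of G2D on allele frequency can be explored with the following Python sketch, which combines the genotype frequency formulae and the likelihood ratio statistic from the Method section. It assumes 220 cases and 220 controls, Hardy-Weinberg proportions, and that allele 2 is the risk allele. The function name `expected_g2d` and the list of frequencies are hypothetical, and the numbers it prints depend on these assumptions, so they are not meant to reproduce Figure 3.

```python
import math

PENETRANCES = (0.029, 0.076, 0.214)  # genotypes 11, 12, 22, from the text


def expected_g2d(p2, n_cases=220, n_controls=220, penetrances=PENETRANCES):
    """Expected G2 at the disease SNP when the risk allele has frequency p2."""
    p1 = 1.0 - p2
    f11, f12, f22 = penetrances
    K = f11 * p1 * p1 + 2 * f12 * p1 * p2 + f22 * p2 * p2
    # Frequency of allele 2 among cases and among controls.
    q_case = (f12 * p1 * p2 + f22 * p2 * p2) / K
    q_ctrl = ((1 - f12) * p1 * p2 + (1 - f22) * p2 * p2) / (1 - K)
    rows = (
        [2 * n_controls * (1 - q_ctrl), 2 * n_controls * q_ctrl],
        [2 * n_cases * (1 - q_case), 2 * n_cases * q_case],
    )
    col_totals = [rows[0][j] + rows[1][j] for j in (0, 1)]
    grand_total = sum(col_totals)
    g2 = 0.0
    for row in rows:
        row_total = sum(row)
        for n, col_total in zip(row, col_totals):
            g2 += n * math.log(n / (row_total * col_total / grand_total))
    return 2.0 * g2


for p2 in (0.02, 0.05, 0.1, 0.2, 0.5):
    print(f"risk allele frequency {p2:.2f}: expected G2D = {expected_g2d(p2):.1f}")
```

Doubling both `n_cases` and `n_controls` doubles the returned value, which is the proportionality to sample size noted above.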
Figure 4 shows the values for G2M which we might expect to obtain when we study a
marker SNP rather than the susceptibility polymorphism itself. Again, what we see is
that association with disease which is statistically significant at p<0.05 is only
observed at distances less than 300 kb and that only markers less than 100 kb from the
disease locus produce strong evidence for association. Within 100 kb, the correlation
between G2M and distance is low, with r=−0.16. Even within this distance, many
disease-marker pairs fail to show any detectable evidence for association and out of
530 pairs which are not more than 100 kb apart only 77 produce a result significant at
p<0.05 and only 52 are significant at p<0.01. If attention is restricted to disease-marker pairs within 10 kb of each other then the proportion of significant results
increases. However out of 88 such pairs there were still only 33 which were significant at
p<0.05 and 25 which were significant at p<0.01.
Figure 5 shows the value for max(G2M) plotted against G2D. This demonstrates the
ability to detect association using a marker SNP rather than the disease polymorphism
itself, and also enables one to assess the extent to which the effect of a disease
polymorphism could be distinguished from a marker polymorphism in LD with it.
One thing which is apparent is that for a number of disease polymorphisms there is no
other SNP which gives notable evidence for association with the disease phenotype.
Of the 59 SNPs treated as a disease polymorphism, 51 are expected to produce a value
of G2D significant at p<0.01. However only 28 have max(G2M) significant at p<0.01
and only 33 have max(G2M) significant at p<0.05. Thus for about a third of disease
polymorphisms which themselves produce evidence for association one would fail to
detect any association unless this polymorphism itself were studied. Even if all the
surrounding SNPs were genotyped one still would not find any evidence implicating
the region. The figure shows that this can happen even with polymorphisms which
themselves produce very strong evidence for association with values of G2D
exceeding 25.
For the majority of polymorphisms max(G2M) is statistically significant but is
considerably lower than G2D. This means that one could detect association through
genotyping marker SNPs but when the disease polymorphism was genotyped it would
be distinguished by being more strongly associated with the disease phenotype and so
there would be good genetic evidence to favour it over other polymorphisms.
However there are a few disease loci for which G2D is very high, ranging from 20 to
30, but which have a value for max(G2M) which is not much lower. For these cases, it
would be hard to distinguish between the effects of the disease and marker
polymorphisms. This arises because of very strong LD between some pairs of SNPs.
On detailed examination of the data it could be seen that SNPs SNP06, SNP07,
SNP09 and SNP10, which are physically very close to each other, were in strong LD,
making their individual effects difficult to distinguish.
Discussion
Although our results pertain only to one particular model for the effect of a
polymorphism on risk of affection, they do provide some insights into the kinds of
effects we may expect to see when carrying out case-control studies of complex
diseases. We should begin by emphasising that we have considered only a very simple
model, with there being only one polymorphism influencing susceptibility. In a real
situation, it would be quite possible that different polymorphisms could occur within
the same gene, each having a different effect on susceptibility. Discriminating the
effects of two or more such polymorphisms would represent a far more complex task
than the simple detection of association. Also, we have ignored haplotype
relationships between markers (rather than between marker and disease loci). Making
use of marker haplotypes would introduce many additional complexities. It is possible
that making use of haplotypes could at least in some circumstances enhance the
ability to detect association, as has previously been demonstrated when studying
markers around APOE in a family sample (Martin et al, 2000). Treating two or more
markers as providing haplotypes can allow the creation of multiallelic multilocus
genotypes which can be more informative than biallelic markers (Sham et al 2000). If
haplotype blocks can be reliably and replicably identified then they may also be
treated as multiallelic markers and redundancies in genotyping may be avoided (Daly
et al, 2001). However it is by no means straightforward to derive haplotypes and the
applicability of such approaches remains to be fully explored. Another issue to be
borne in mind is that our results are based entirely on expected genotype and
haplotype frequencies. In real samples chance factors can result in marked deviations
from the expected frequencies of observed genotypes. Such random effects might act
to strengthen or weaken the overall evidence for association or even, for example, to
produce datasets in which a marker showed stronger association with the affection
phenotype than did the susceptibility polymorphism itself. Finally, we would add that
the results we have obtained relate only to the genetic region and sample subjects
which we have studied. The study of markers around DBH seemed to show LD
extending over shorter distances than we detected around APOE (Zabetian et al
2003).
Even from the simple studies which we have done we feel it is possible to derive a
few useful conclusions. Firstly, it does seem that at least in some circumstances case-control studies should have reasonable power to detect association using realistic
sample sizes, provided that the disease polymorphism itself is genotyped. However
the power to detect association is sensitive to the allele frequencies. Where one allele
is rare the polymorphism will make only a small contribution to the overall risk of
affection in the population, even when the absolute effect on risk remains constant. In
such a case far larger samples would be required to detect association. In the current
study only a few SNPs had an allele so rare as to make the detection of association
problematic, although of course to some extent the method used for detecting SNPs
will have meant that there are rarer ones present in the region which have not been
discovered. It is very simple to calculate the expected power to detect association
given any particular model incorporating effect of genotype on risk and allele
frequencies. However for non-Mendelian diseases these parameters are not known in
advance. Even if one can use linkage information to estimate the overall effect on risk
due to a particular region, one has no idea how many different susceptibility
polymorphisms may be present in that region. Hence one may be able to put an upper
limit on the power of a study but the actual power will be much lower than this if the
effect is due to the action of several rare polymorphisms.
The second main finding is that, broadly speaking, detecting association using
markers in the region of a disease polymorphism is far more problematic than using
the disease polymorphism itself. Association is unlikely to be detected unless the
marker is very close to the disease polymorphism. This finding holds in spite of the
fact that D’ can be close to 1 even between pairs of SNPs at distances of 1,000 kb or
more from each other, and in fact it could be argued that D’ is a somewhat misleading
measure of LD in this situation. Even within 300 kb many markers will show no
appreciable evidence for association with disease and for the vast majority the
evidence will certainly be much weaker than for the disease polymorphism itself. For
about half of the SNPs we studied, there was no other marker which would have
produced notable evidence for association if that SNP had been a disease
polymorphism. It seems difficult to avoid the conclusion that when searching for a
disease polymorphism one should make efforts to identify and study every SNP
within a region. If one fails to do this but instead only studies a sub-sample one runs a
substantial risk of failing to detect association and falsely concluding that the region
does not harbour a susceptibility polymorphism. It could be argued that using
haplotype information would mitigate this problem, but we have not been able
to assess the extent to which this would be the case in this dataset. However it seems
doubtful that haplotypes could be guaranteed to provide strong evidence for
association when no marker individually was even weakly positive. This may be
especially the case for common polymorphisms rather than rare disease mutations.
One might hope that a disease mutation would occur on the background of only one
or two haplotypes, but a common polymorphism having a more modest effect on
susceptibility might well be shared between several haplotypes, reducing the gain in
information which examining haplotypes might otherwise provide.
By contrast, there were a small number of SNPs which were in very strong LD with
each other, to the extent that it would have been problematic to determine which one
directly influenced susceptibility and which ones were acting as markers. The LD was
so strong that very large samples would be required to provide statistical evidence
based on differences in strength of association. If confronted with this situation one
might need either to obtain samples from other populations in which LD between the
pairs of SNPs happened to be lower or else to carry out functional studies in order to
decide which SNP was more likely to be having a direct effect on susceptibility.
Our results are in line with previous studies which indicate that detecting
polymorphisms influencing complex disease is at best a risky affair. Even if a
susceptibility polymorphism is itself genotyped the statistical significance of the
difference in allele frequencies between cases and controls is critically dependent not
only on the biological effect of the polymorphism but also on the frequencies of its
alleles. If nearby marker SNPs are genotyped then they may provide evidence for
association, but equally there is a strong possibility that none will do so, even if they
are very close to the susceptibility polymorphism. This finding would lead to the
conclusion that extremely dense SNP maps would be needed to avoid missing
important loci. The extent to which the use of haplotype blocks will mitigate this
problem remains to be seen – certainly we seem to see many SNPs which do not
appear to be in LD with any of those nearby. Further investigations into the LD
relationships between different polymorphisms in different parts of the genome and
the extent to which the use of haplotypes can enhance the ability to detect association
with a disease phenotype should be carried out to elucidate these issues further.
References
Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution
haplotype structure in the human genome. Nat Genet 29 229–232.
Kuusisto J, Koivisto K, Kervinen K, Mykkanen L, Helkala EL, Vanhanen M,
Hanninen T, Pyorala K, Kesaniemi YA, Riekkinen P, et al. (1994). Association of
apolipoprotein E phenotypes with late onset Alzheimer's disease: population based
study. British Medical Journal 309 636-8
Martin ER, Lai EH, Gilbert JR, Rogala AR, Afshari AJ, Riley J, Finch KL, Stevens
JF, Livak KJ, Slotterbeck BD, Slifer SH, Warren LL, Conneally PM, Schmechel DE,
Purvis I, Pericak-Vance MA, Roses AD, Vance JM (2000). SNPing away at complex
diseases: analysis of single-nucleotide polymorphisms around APOE in Alzheimer
disease. Am J Hum Genet. 67: 383-94.
Nordborg M, Tavare S (2002) Linkage disequilibrium: what history has to tell us.
Trends Genet 18 83-90.
Sham PC (1998) Statistics in Human Genetics. Arnold.
Sham PC, Zhao JH, Curtis D (2000) The effect of marker characteristics on the power
to detect linkage disequilibrium due to single or multiple ancestral mutations. Ann
Hum Genet 64 161-169.
Zabetian CP, Buxbaum SG, Elston RC, Kohnke MD, Anderson GM, Gelernter J,
Cubells JF (2003) The structure of linkage disequilibrium at the DBH locus strongly
influences the magnitude of association between diallelic markers and plasma
dopamine beta-hydroxylase activity. Am J Hum Genet 72 1389–1400.
Zapata C, Carollo C, Rodriguez C (2001) Sampling variance and distribution of the
D' measure of overall gametic disequilibrium between multiallelic loci.
Ann Hum Genet 65 395-406.
Zhao JH, Curtis D, Sham PC (2000) Model-free and permutation analysis for allelic
associations. Hum Hered 50 133-139.