Figure 1. Tagging effect decreases localization success rates with or

Why I chose:
First reading results seemed counterintuitive
Introduction full of references I didn’t know
Useful? Or Gee Whizz so what?...Needed to read in detail
Seemed relevant to our MND study GWAS + imputation + sequencing
Nicely laid out for journal club presentation
Localisation success rate = probability that the causal SNP is top ranked within
an associated region
It depends on the joint effects of p-value-based selection, tagging and genotyping accuracy
Consider 2 SNPs
One causal from sequencing or imputation – imperfect genotyping accuracy
One tag from GWAS perfect genotyping accuracy
MAF both SNPs = 0.12
Causal SNP OR =1.25
Selection at tag SNP based on p-value < 0.05 in 1000 cases & 1000 controls
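To get a feel for how weak the selection step is under this setup, the expected test statistic and power can be approximated as below. This is a minimal sketch under assumptions not stated on the slide: an allelic two-proportion z-test, control allele frequency equal to the MAF, a per-allele (multiplicative) odds ratio, and an expected statistic at a tag SNP that scales with its correlation r to the causal SNP. The function names and the value r = 0.7 are illustrative only.

```python
import math

def case_allele_freq(maf, odds_ratio):
    """Risk-allele frequency in cases, assuming controls carry the MAF
    and the odds ratio acts per allele (an assumption of this sketch)."""
    odds = odds_ratio * maf / (1.0 - maf)
    return odds / (1.0 + odds)

def expected_z(maf, odds_ratio, n_cases, n_controls, r=1.0):
    """Approximate expected allelic z statistic; r scales it for a tag SNP."""
    p1 = case_allele_freq(maf, odds_ratio)
    p_bar = (n_cases * p1 + n_controls * maf) / (n_cases + n_controls)
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / (2 * n_cases) + 1 / (2 * n_controls)))
    return r * (p1 - maf) / se

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sided(mean_z, alpha_z=1.96):
    """P(|Z| > alpha_z) for Z ~ N(mean_z, 1); alpha_z = 1.96 matches p < 0.05."""
    return normal_cdf(mean_z - alpha_z) + normal_cdf(-mean_z - alpha_z)

# Slide parameters: MAF 0.12, OR 1.25, 1000 cases and 1000 controls
z_causal = expected_z(0.12, 1.25, 1000, 1000)
z_tag = expected_z(0.12, 1.25, 1000, 1000, r=0.7)  # r = 0.7 is illustrative
print(round(z_causal, 2), round(power_two_sided(z_causal), 2))
print(round(z_tag, 2), round(power_two_sided(z_tag), 2))
```

Under these assumptions the tag SNP passes the p < 0.05 filter only part of the time, which is the low-power setting where selection matters most.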
Model terms (annotations on the slide equation); these generate Figs 1-3:
• Association test statistic at the causal or genotyped SNP
• Call rate at the causal SNP
• Correlation between actual and estimated genotype at the causal SNP
• Correlation between actual genotypes at the causal and genotyped SNPs
Figure 1. Tagging effect decreases localization success rates with or without the selection effect.
A & B: Tight linkage disequilibrium between SNPs can obscure the causal SNP
C & D: Selection at the tag SNP inflates the association evidence at the tag, increasing the probability that it outranks the causal SNP
Parameters: causal MAF 0.12; tag MAF 0.12; OR = 1.25; correlation between causal and non-causal sequenced SNP 0.9; perfect genotyping accuracy
Fig S8: Tagging effect decreases localization success rates with or without the selection effect, 3 SNPs: 1 tag, 1 causal, 1 non-causal sequencing SNP.
Parameters: causal MAF 0.12; tag MAF 0.12; OR = 1.25; correlation between causal and non-causal sequenced SNP 0.9; perfect genotyping accuracy
Parameters: causal MAF 0.02; tag MAF 0.02; OR = 1.5; correlation between causal and non-causal sequenced SNP 0.9; perfect genotyping accuracy
Fig S9: Tagging effect decreases localization success rates with or without the selection effect, 5 SNPs: 1 tag, 1 causal, 3 non-causal sequencing SNPs.
Parameters: causal MAF 0.02; tag MAF 0.02; OR = 1.5; correlation between causal and non-causal sequenced SNP 0.9; perfect genotyping accuracy
Figure 2. Low genotyping accuracy at causal SNP further reduces localization success rates with or without the
selection effect.
Sequencing or imputation error decreases the localization success rate, with or without tag selection
Parameters: causal MAF 0.12; tag MAF 0.12; OR = 1.25; perfect genotyping accuracy for tag SNP
S4. Low genotyping accuracy at causal SNP further reduces localization success rates with or without the selection effect: rare causal SNP
Parameters: causal MAF 0.02; tag MAF 0.02; OR = 1.5; perfect genotyping accuracy for tag SNP
S5. Low genotyping accuracy at causal SNP further reduces localization success rates with or without the selection effect: common causal SNP
Parameters: causal MAF 0.25; tag MAF 0.25; OR = 1.25; perfect genotyping accuracy for tag SNP
Figure 3. Counter-intuitively, a larger sample size can reduce localization success rate
Well-tagged causal SNPs sequenced with low accuracy are unlikely to be correctly identified even as sample size
increases.
Parameters: causal MAF 0.12; tag MAF 0.12; OR = 1.25; correlation between causal and non-causal sequenced SNP 0.9; perfect genotyping accuracy
When the causal SNP is less accurately genotyped than one of its highly correlated
proxies (i.e. rC < rG and rCG is large), the proxy SNP may capture the association better
than the causal SNP. As a result, this proxy SNP will out-rank the causal SNP more than
50% of the time.
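The claim above, and the sample-size result in Figure 3, can be illustrated with a small simulation at the level of test statistics. This is a sketch, not the paper's simulation: it assumes normally distributed z statistics, genotyping errors that are independent of the other SNP (so the observed statistics have correlation rC × rCG × rG), and attenuation of each expected statistic in proportion to its accuracy. The parameter values and function name are illustrative.

```python
import math
import random

def localization_success(ncp, r_c, r_g, r_cg, n_sim=20000, seed=1):
    """Fraction of simulations in which the causal SNP has the larger |z|.

    ncp  : expected z at the causal SNP if it were genotyped without error
    r_c  : genotyping accuracy at the causal SNP
    r_g  : genotyping accuracy at the proxy SNP
    r_cg : correlation between true causal and proxy genotypes
    Assumes errors are independent of the other SNP, so the two observed
    statistics have correlation r_c * r_cg * r_g.
    """
    rng = random.Random(seed)
    mean_c = r_c * ncp
    mean_g = r_g * r_cg * ncp
    rho = r_c * r_cg * r_g
    wins = 0
    for _ in range(n_sim):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        z_c = mean_c + e1
        z_g = mean_g + rho * e1 + math.sqrt(1.0 - rho * rho) * e2
        if abs(z_c) > abs(z_g):
            wins += 1
    return wins / n_sim

# Causal SNP sequenced less accurately than its well-correlated proxy
low_n = localization_success(ncp=3.0, r_c=0.7, r_g=1.0, r_cg=0.9)
# Doubling ncp corresponds to roughly four times the sample size
high_n = localization_success(ncp=6.0, r_c=0.7, r_g=1.0, r_cg=0.9)
print(low_n, high_n)
```

With rC below rG × rCG the proxy has the larger expected statistic, so adding samples sharpens the ranking in favour of the wrong SNP.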
Panels: MAF = 0.02, MAF = 0.12, MAF = 0.25
Results so far demonstrate the need to correct for the joint effects of selection,
tagging and genotyping accuracy on the localization success rate.
How to correct?
Notation: G = genotyped (tag) SNP, S = sequenced SNP, C = causal SNP
Test statistic at the sequenced SNP, annotated terms:
• Call rates, i.e. missingness (joint vs individual)
• Correlation between genotyped and sequenced SNPs in the sample when there are no errors
• Estimate of the selection bias of the genetic effect at the tag SNP, a form of winner’s curse
Revised test statistic at the sequenced SNP, annotated terms:
• Missingness rate
• Correlation between true genotype and sequenced genotype in the sample. When this is low, the difference between the test statistic and the revised test statistic increases
• Selection term is zero if independent samples are used for sequencing and for identification of the tag SNP
• rCG = correlation between genotyped and causal SNPs
Selection effect is most pronounced when power at the tag SNP is low
Annotated terms for the expected statistic at the sequenced SNP:
• Unconditional expected association at the sequenced SNP
• Distortion due to the tag SNP selection, propagated through correlation
The higher the correlation between the genotyped and sequenced SNP, the higher the test statistic at the sequenced SNP and the lower its variance
SNPs in high LD with the tag are more likely to be top-ranked = “tagging effect”
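The point that selection bias is largest when power at the tag SNP is low can be checked with the mean of a truncated normal. This is a sketch under simplifying assumptions: a z statistic with unit variance and one-sided selection Z > 1.96. It illustrates the winner's curse in general and is not the paper's bias estimator.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def selection_bias(mean_z, threshold=1.96):
    """E[Z | Z > threshold] - E[Z] for Z ~ N(mean_z, 1) (one-sided selection)."""
    a = threshold - mean_z
    return normal_pdf(a) / (1.0 - normal_cdf(a))

def power(mean_z, threshold=1.96):
    """Probability that Z ~ N(mean_z, 1) exceeds the threshold."""
    return 1.0 - normal_cdf(threshold - mean_z)

# Bias shrinks as the expected statistic (and hence power) grows
for mean_z in (1.0, 2.0, 3.0, 4.0):
    print(mean_z, round(power(mean_z), 3), round(selection_bias(mean_z), 3))
```

Because the inflated tag statistic is correlated with statistics at nearby SNPs, this upward bias carries over to SNPs in LD with the tag.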
How the terms are estimated:
• Missingness: counts of missingness, estimated from the sample
• Bootstrap resampling at the genome-wide level: incorporates information across the whole genome to account for effects of LD and rank on bias
• Genotyping accuracy: from the mean posterior genotype (e.g. MACH ratio-of-variance estimate) or from full genotype posterior probabilities (e.g. BEAGLE r2)
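For the ratio-of-variance accuracy estimate mentioned above, a minimal sketch is given below. It assumes the common form of the estimate: observed variance of the imputed dosages divided by the variance expected under Hardy-Weinberg equilibrium, 2p(1-p). The exact estimator used by a given software version is not confirmed here, and the function name and dosages are hypothetical.

```python
from statistics import fmean, pvariance

def dosage_r2(dosages):
    """Ratio of observed dosage variance to the variance expected under
    Hardy-Weinberg equilibrium, 2p(1-p), with p estimated from the dosages."""
    p = fmean(dosages) / 2.0
    expected = 2.0 * p * (1.0 - p)
    if expected == 0.0:
        return 0.0  # monomorphic: accuracy is undefined, reported as 0 here
    return pvariance(dosages) / expected

confident = [0.0, 1.0, 2.0, 1.0]  # dosages close to hard genotype calls
uncertain = [0.5, 1.0, 1.5, 1.0]  # same mean, shrunk toward it
print(dosage_r2(confident), dosage_r2(uncertain))
```

Dosages shrunk toward the mean have less variance than expected, so the ratio falls below 1 as imputation becomes less certain.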
Scenario 1: GWAS used for discovery, and sequencing/imputation used for fine-mapping around GWAS “hits” using the same GWAS sample.
GWAS-focused design based on the WTCCC Type 1 Diabetes study
A significant region is identified by a significant GWAS tag SNP (p < 5 × 10^-7) and followed
by fine-mapping with post-GWAS data (sequenced or imputed SNPs) in the region
surrounding the tag SNP. The SNP with the largest test statistic in the region is selected
as the best candidate causal SNP.
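The naive Scenario 1 procedure (select the region on the tag SNP, then take the top-ranked SNP) can be sketched as follows. The SNP names and z statistics are hypothetical, and treating the test statistics as standard normal z scores is an assumption of this sketch.

```python
import math

GWAS_THRESHOLD = 5e-7  # significance threshold quoted for the tag SNP

def two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def best_candidate(tag_z, region_z):
    """Return the top-ranked SNP in the region, or None if the tag SNP
    does not reach the GWAS threshold (region not followed up)."""
    if two_sided_p(tag_z) >= GWAS_THRESHOLD:
        return None
    return max(region_z, key=lambda snp: abs(region_z[snp]))

# Hypothetical z statistics for a tag SNP and two sequenced SNPs
region = {"tag": 5.2, "seq_1": 5.5, "seq_2": 4.0}
print(best_candidate(region["tag"], region))
```

Regions only enter the ranking step when the tag statistic happens to be large, which is the source of the selection effect discussed earlier.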
Scenario 2: All GWAS and sequenced/imputed SNPs used for discovery and fine-mapping
in the same dataset.
Scenario 3: Discovery and fine-mapping using different datasets.
Scenario 4: Discovery and fine-mapping using different datasets + Multiple causal SNPs.
Scenario 5: Discovery and fine-mapping using different datasets + missing data (imperfect call rate)
Table 2. Parameters and parameter values of the main simulation studies.
Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
No good if the tag is causal
After re-ranking, localisation success rate is “similar” to when the tag is not causal. “Minor tradeoff”, as the GWAS SNP is unlikely to be causal
Scenario 1: GWAS used for discovery, and sequencing/imputation used for fine-mapping around GWAS “hits” using the same GWAS sample.
Adverse effects of tagging (down the table) and genotyping accuracy (across the table) are
highest when the causal SNP is well tagged (larger r) and less accurately sequenced (low
rho), e.g. high-density GWAS followed by low-density sequencing
Well-tagged causal SNPs suffer lower localisation success rates because the perfectly
genotyped tag captures the association better than the imperfectly
sequenced/imputed causal SNP
Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 2: All GWAS and sequenced/imputed SNPs used for discovery and fine-mapping
in the same dataset, i.e. significance is not required at the GWAS SNP.
Scenario 2: All GWAS and sequenced/imputed SNPs used for discovery and fine-mapping in the same dataset.
Impact of sample size, with the correlation between tag and causal SNP fixed
Genotyping accuracy alone has an impact
Re-ranking has a big impact when sequencing coverage is low and sample size is large
Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Scenario 3: Discovery and fine-mapping using different datasets.
Very similar rates to Scenario 2
Table 3. Localization success rates for simulation Scenarios 1, 2, 3, 4.
Improves with re-ranking for both causal SNPs
Scenario 4: Discovery and fine-mapping using different datasets (as Scenario 3) + multiple causal SNPs
Table 4. Localization success rates for simulation Scenarios 5a.
Scenario 5: Discovery and fine-mapping using different datasets + missing data (imperfect call
rate) (across-table variable changed)
Missing data affect localisation success rates in a similar manner to imperfect genotyping
accuracy
Summary from simulation
• GWAS-based region selection or moderate genotype error substantially reduces the
probability of correctly identifying the causal SNP
• Proposed re-ranking can recover lost power, increasing localisation success rates by 1.5
to 3 times
• When genotyping accuracy is high, power lost due to tagging is small, so re-ranking has no
effect
Figure 4. Naïve test statistics and re-ranking statistics for regions surrounding rs78246868 in the 8q24.21 region
for association with prostate cancer risk.
Michaela et al Prostate cancer
Consortium different genotyping platforms
Imputed to 1000 Genomes
Fixed-effect meta-analysis
Cohorts excluded from association analysis if
imputation r2 < 0.8
Report 5 statistically independent regions
within 8q24.21 locus plus 11q13.3 and
17q24.3
Selected all SNPs in LD r2 > 0.2 with index SNP
Didn’t exclude studies based on imputation r2
Only correct for imputation accuracy, i.e. deltaG = 0
New top SNPs for 8q24.21 and 17q24.3
8q24.21: 2 SNPs move from lower ranks to top 10%
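The fixed-effect meta-analysis step that produces the per-SNP statistics can be sketched with standard inverse-variance weighting. The per-study estimates below are hypothetical, and whether the consortium used exactly this weighting is an assumption.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted pooled estimate, its standard error and z."""
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se

# Hypothetical per-study log odds ratios and standard errors
beta, se, z = fixed_effect_meta([0.20, 0.40, 0.25], [0.10, 0.10, 0.20])
print(round(beta, 3), round(se, 3), round(z, 2))
```

A study with a small standard error dominates the pooled statistic, which is why a single study can drive a SNP's rank.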
Figure 5. Naïve test statistics and re-ranking statistics for regions surrounding rs8071558 in the 17q24.3 region
for association with prostate cancer risk.
8 SNPs move from lower ranks to top 10%
SNPs naively ranked in top 10% stay highly ranked
When most SNPs are well genotyped, re-ranking only makes subtle changes
One poorly imputed SNP (yellow) moves from rank 245 to 16.
Association driven by one study (rank 10); when that study is removed, the SNP rank is 306, changing to 106
DISCUSSION
• Tagging and genotyping accuracy are non-trivial sources of bias that could obscure
association evidence at the causal SNP
• Proposed re-ranking is simple to implement and can substantially increase the
probability of identifying the causal SNP
• For low coverage sequencing we recommend the re-ranking method
• For imputation and high coverage sequencing we recommend that unfiltered SNPs
in associated regions be used with the re-ranking method
• Large changes in rank should be carefully examined for heterogeneity between
studies
• Re-ranking is most beneficial when genotyping accuracy is low
• High-density genotyping followed by low-density sequencing can generate
misleading results. Don’t do it
• Imputation and sequencing software output accurate estimates of rho needed for
the re-ranking
DISCUSSION
Re-ranking important when study specific factors exacerbate GWAS-based selection
and genotyping error
• High genetic diversity, so sequence reads are difficult to align
• Low LD among SNPs or lack of population-specific reference panel so poor
imputation
• Low MAF SNPs tend to suffer from both low power and high genotyping error
When genotyping accuracy is very poor, re-ranking may not be able to generate
useful results. First consider accuracy thresholds recommended by the genotype calling
or imputation algorithm
Re-ranking only improves localization success when applied to SNPs under the
alternative, i.e. SNPs that are themselves causal or in LD with a causal SNP
Existing methods that incorporate genotyping uncertainty into tests for
association do not completely recover lost power
This paper considered frequentist and Bayesian methods of incorporating
uncertainty
We anticipate that re-ranking to correct for the adverse effects of selection, tagging and
differential genotyping accuracy rates will continue to be important because cost-effective designs favour low coverage and large sample sizes