Association Tests for Rare Variants Using Sequence Data
Guimin Gao, Wenan Chen, & Xi Gao
Department of Biostatistics, VCU

Introduction to association tests: two hypotheses
• Common variant-common disease
  • Common variant: minor allele frequency (MAF) >= 5%
  • Associations detected using linkage disequilibrium (LD)
• Rare variant-common disease
  • Rare variant: MAF < 1% (or 5%)
  • High allelic heterogeneity: disease caused collectively by multiple rare variants with moderate to high penetrance
  • Associations through LD would not be suitable

Association tests for common variants
• Test a single marker each time
• Cochran-Armitage's trend test (CATT), assuming an additive (ADD) model
  • Power: high for additive (ADD) or multiplicative (MUL) models; low for recessive (REC) or dominant (DOM) models
• Genotype association test (GAT) using a chi-square statistic
  • Power: a little lower for ADD, higher for REC
• MAX3 = maximum of three trend test statistics across the REC, ADD, and DOM models (Freidlin et al. 2002, Hum Hered.)
  • Power: lower than CATT under ADD; higher than CATT & GAT under REC

Association tests for common variants
• Test for a single marker (CATT, GAT, & MAX3)
  • Low power when MAF < 10%
  • No power for rare variants with MAF < 1%
• Multivariate test
  • Considers a group of variants (ex. SNPs in a gene) each time
  • Multiple logistic regression (or Hotelling test, Fisher's product):
    $\text{Logit} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$
  • $x_{ij}$ = 0, 1, 2, the count of minor alleles of indiv i at locus j
  • Power: higher than a single-marker test, but still very low due to the large d.f. = No. of SNPs = k
• Need new methods for rare variants
  • Collapsing SNPs into a single marker to reduce d.f.
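The collapsing idea in the last bullet can be sketched in a few lines. This is only an illustration under stated assumptions: the toy genotypes, the function names `collapse` and `two_proportion_test`, and the choice of a pooled two-proportion z-test are ours, not taken from the slides.

```python
from math import erfc, sqrt

def collapse(genotypes):
    """Collapse each individual's minor-allele counts (0/1/2 per variant)
    into one indicator: 1 if any minor allele is carried, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def two_proportion_test(carrier, is_case):
    """Two-sided pooled z-test comparing carrier proportions in cases
    and controls (an assumed choice of test; 1 d.f. instead of k)."""
    n_case = sum(1 for c in is_case if c)
    n_ctrl = len(is_case) - n_case
    k_case = sum(x for x, c in zip(carrier, is_case) if c)
    k_ctrl = sum(x for x, c in zip(carrier, is_case) if not c)
    pooled = (k_case + k_ctrl) / (n_case + n_ctrl)
    se = sqrt(pooled * (1 - pooled) * (1 / n_case + 1 / n_ctrl))
    if se == 0:
        return 0.0, 1.0  # no variation in carrier status: nothing to test
    z = (k_case / n_case - k_ctrl / n_ctrl) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical toy data: 8 individuals x 3 rare variants.
genotypes = [
    [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0],   # cases
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0],   # controls
]
is_case = [1, 1, 1, 1, 0, 0, 0, 0]

carrier = collapse(genotypes)
z, p_value = two_proportion_test(carrier, is_case)
print(carrier, round(z, 3), round(p_value, 3))
```

The point of the sketch is the degrees of freedom: three variants become one collapsed marker, so a single test replaces a k-parameter model.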
Outline
• Introduction to association tests
• Three well-known collapsing methods for rare variants: CAST, CMC, & Weighted Sum methods
• An evaluation using GAW 17 data
• Extension to the three collapsing methods
• Future research

Three association tests for rare variants
• Collapsing a set of rare variants (into a single marker)
  • A cohort allelic sums test (CAST) (Morgenthaler & Thilly 2007, Mutat. Res.)
• Combined Multivariate and Collapsing (CMC) (Li & Leal 2008, AJHG)
  • Division into subgroups, collapsing in each subgroup
• Weighted Sum statistic (Madsen & Browning 2009, PLoS Genet.; Price et al. 2010, AJHG)

A cohort allelic sums test (CAST)
• A group of n variants (SNPs) in a unit (ex. one gene, LD block)
• Collapsing the genotypes across the variants
• Indicator coding for individual j
  • $x_j = 1$, if rare alleles are present at any of the n variants;
  • $x_j = 0$, otherwise
• Tests whether the proportions of individuals with rare variants ($x_j = 1$) differ between cases and controls
• Higher power than methods testing a single variant each time
• Only for rare variants

Combined Multivariate and Collapsing (CMC) Method (Li & Leal 08)
• Consider SNPs in a unit with MAF < a threshold (0.01 or 0.05)
• Division and collapsing
  • SNPs are divided into several subgroups based on the MAF
  • Ex. subgroups: (0, 0.001), [0.001, 0.005), [0.005, 0.01)
  • SNPs are collapsed in each subgroup
  • $x_{ij} = 1$, if indiv j has rare alleles present in the i-th subgroup;
  • $x_{ij} = 0$, otherwise

Combined Multivariate and Collapsing (CMC) Method (Li & Leal 08)
• Multivariate test of the collapsed subgroups
  • Hotelling T2 test, MANOVA, Fisher's product method
• Power: often higher than CAST
• Different thresholds may give different power

Weighted Sum Method (Madsen & Browning 09)
• A group of variants (SNPs) in a unit
• A weight for SNP i given by the std. dev. of the No. of minor alleles in the sample:
  $w_i = \sqrt{n_i q_i (1 - q_i)}$
• $q_i$ is the minor allele freq. in controls
• Calculate a weighted genetic score for indiv j:
  $v_j = \sum_{i=1}^{L} I_{ij} / w_i$
• $I_{ij}$ = 0, 1, 2, the count of the minor allele of indiv j at locus i
• Obtain rank($v_j$); sum the ranks of the affected indivs (set A):
  $x = \sum_{j \in A} \text{rank}(v_j)$

Permutation for p-value estimation
• From observed data: $x = \sum_{j \in A} \text{rank}(v_j)$
• Permutation to estimate the p-value:
  • Phenotype labels are permuted 1000 times, giving $x_1, \ldots, x_{1000}$
  • Calculate the mean ($\mu$) and standard deviation ($\sigma$) of the 1000 x's:
    $z = (x - \mu) / \sigma$
  • Assume z ~ N(0, 1) under the null hypothesis
  • Obtain the p-value from N(0, 1)
• Fast; p-value ~ U[0, 1] under the null hypothesis

Weighted Sum Method (Madsen & Browning 09)
• Power comparison:
  • Simulations assuming the genotypic relative risk depends on the MAF at disease loci (Madsen & Browning 09)
  • Weighted Sum Method (WSM) > CMC > CAST
  • WSM > CMC may not be true in other situations
• Can be applied to rare variants & common variants
• Disadvantage:
  • Gives very high weights to very rare alleles (singletons), very low weights to common variants

An evaluation of the CMC method and the Weighted Sum method using GAW 17 data
• Both methods are powerful (based on the authors' simulations)
• Our evaluation is based on simulated datasets from GAW 17
• GAW 17 data:
  • A subset of genes with real sequence data available in the 1000 Genomes Project
  • Simulated phenotypes
  • Unrelated individuals, families
• Dataset of 697 unrelated individuals
  • 24487 SNPs in 3205 genes from 22 autosomal chromosomes
  • Only test the 2196 genes with non-synonymous SNPs

GAW 17 dataset of unrelated individuals
• Four phenotypes: Q1, Q2, Q4 and disease status
  • Q1, Q2, and Q4 are quantitative traits
  • Q1 is associated with 39 SNPs in 9 genes
  • Q2 is associated with 72 SNPs in 13 genes
  • Q4 is not related to any genes
  • Disease status is a binary trait (affected or unaffected), associated with 37 genes
• 200 simulated phenotype replicates
• Only one replicate of genotype data (original data)

Transforming phenotypes
• Methods: case-control design
• Transform Q1, Q2, Q4 into binary traits
  • Splitting at the top 30% of each distribution

Criteria for evaluation of tests
• Familywise error rate (FWER)
  • 2196 genes with non-synonymous SNPs, 2196 tests
  • 2196 null hypotheses $H_{j0}$: gene j is not associated with the trait
  • Q1 is associated with 9 genes, so 9 null hypotheses are not true
  • (2196 - 9) null hypotheses are true
  • FWER = Pr(reject at least one true null hypothesis) = $N_f$/200
  • $N_f$: No. of replicates in which at least one true null hypothesis is rejected
• Average power
  • Mean of the power over all genes that affect the phenotype (ex. the 9 genes for Q1)
• Evaluating power: Q1, Q2, Disease
• Evaluating FWER: Q4

Distribution of MAF in the GAW 17 dataset
Figure 1. Distribution of MAF of 24487 SNPs in GAW 17
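The two evaluation criteria defined above (FWER and average power) reduce to simple counting over replicates. The sketch below is a minimal illustration: the gene labels, the four toy replicates, and the function names `fwer` and `average_power` are hypothetical, and a real run would use the 200 GAW 17 replicates.

```python
def fwer(rejections, true_null_genes):
    """Fraction of replicates in which at least one true null is rejected
    (N_f divided by the number of replicates).
    `rejections` holds one set of rejected gene ids per replicate."""
    n_f = sum(1 for rejected in rejections if rejected & true_null_genes)
    return n_f / len(rejections)

def average_power(rejections, causal_genes):
    """Mean, over causal genes, of the fraction of replicates in which
    that gene is rejected."""
    n_rep = len(rejections)
    per_gene = [
        sum(1 for rejected in rejections if gene in rejected) / n_rep
        for gene in sorted(causal_genes)
    ]
    return sum(per_gene) / len(per_gene)

# Hypothetical toy example: 5 genes, 2 causal, 4 replicates.
causal = {"g1", "g2"}
nulls = {"g3", "g4", "g5"}
rejections = [{"g1"}, {"g1", "g4"}, set(), {"g2"}]

print(fwer(rejections, nulls))            # 1 of 4 replicates rejects a true null
print(average_power(rejections, causal))  # mean of 2/4 (g1) and 1/4 (g2)
```

For a trait such as Q4, where no gene is associated, every gene belongs in `true_null_genes` and the FWER is simply the fraction of replicates with any rejection.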
Group SNPs based on MAFs for CMC
• MAF subgroups: 0 - 0.01, 0.01 - 0.1, >= 0.1
• Similar to Madsen & Browning (2009)

Table 1: Average power
Trait     CMC method   Weighted sum method
Q1        0.144        0.112
Q2        0.00615      0.00308
Disease   0.00444      0.00500

Table 2: FWER (nominal α = 0.05)
Trait     CMC method   Weighted sum method
Q4        0.115        0.0100

• CMC has FWER inflation
  • Population stratification or admixture: samples from Asia, Europe, …
  • Relatedness among samples
• Similar results in power and FWER were reported at GAW 17

Variable-Threshold Approach (Price et al 2010)
• Given a threshold T, calculate a score for indiv j, summing over the L SNPs with MAF below T:
  $V_j(T) = \sum_{i=1}^{L} I_{ij}$
• $I_{ij}$ = 0, 1, 2, the count of the minor allele of indiv j at locus i
• Calculate the sum of the scores over the $N_D$ cases:
  $V(T) = \sum_{j=1}^{N_D} V_j(T)$
• Calculate $Z(T) = V(T) / \sqrt{\text{Var}(V(T))}$
• Find T to maximize Z(T): $Z_{max} = \max_T Z(T)$
• Permutation to estimate the p-value for $Z_{max}$
• Power: > CMC; extended to quantitative traits

A weighted approach (Price et al 2010)
• Calculate a weighted score for indiv j:
  $V_j = \sum_{i=1}^{L} w_i I_{ij}$
• $I_{ij}$ = 0, 1, 2
• Calculate the sum of the scores over the cases:
  $V = \sum_{j=1}^{N_D} V_j$
• Possible weight: $w_i = 1 / \sqrt{q_i (1 - q_i)}$
• Power: similar to the weighted sum method (Madsen & Browning 09)

A weighted approach (Price et al 2010)
• Calculate the sum of the scores over the cases:
  $V = \sum_{j=1}^{N_D} V_j$, with $V_j = \sum_{i=1}^{L} w_i I_{ij}$
• $I_{ij}$ = 0, 1, 2
• Calculate weights from the prediction of functional effects
  • PolyPhen-2 is used to predict damaging effects of missense mutations with probabilistic scores
  • Probabilistic scores as weights may reduce the noise from nonfunctional variants
• Higher power than other methods

A data-adaptive sum test (Han & Pan 2010, Hum Hered)
• Logistic model:
  $\text{Logit} = \beta_0 + \beta_c \sum_{j=1}^{k} x_{ij}$
• $x_{ij}$ = 0, 1, 2, the count of the minor allele of indiv i at locus j
• Effects in opposite directions: fit a marginal model for each locus j
  $\text{Logit} = \beta_0 + \beta_j x_{ij}$
• If $\beta_j < 0$ with p-value < threshold (0.1), change $x_{ij}$ into $2 - x_{ij}$
• Permutation to estimate the p-value

Conclusion
• Collapsing methods have higher power than single-marker tests
• For genome-wide data analysis, collapsing methods don't have much power after multiple testing adjustment
• Weighted sum methods are promising, but need prior information from biological data

Future research
• Modifying the weighted sum method (in progress)
  • Very high weights are given to very rare variants
  • Smoothing weights: $w' = 0.5\,w + 0.5\,\bar{w}$, where $\bar{w}$ is the average of all w

Thank you
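As a rough illustration of the weighted sum statistic, its permutation z-score, and the smoothed weights mentioned under Future research, here is a minimal sketch. Several choices are assumptions on our part: the toy data, all function names, the add-one adjustment that keeps q_i away from zero, the one-sided p-value, and applying the smoothing to the effective per-SNP weights 1/w_i.

```python
import random
from math import erfc, sqrt
from statistics import mean, stdev

def mb_weights(genotypes, is_case):
    """w_i = sqrt(n * q_i * (1 - q_i)), with q_i estimated in controls."""
    n = len(genotypes)
    n_ctrl = sum(1 for c in is_case if not c)
    weights = []
    for i in range(len(genotypes[0])):
        m_ctrl = sum(row[i] for row, c in zip(genotypes, is_case) if not c)
        q = (m_ctrl + 1) / (2 * n_ctrl + 2)  # assumed add-one adjustment
        weights.append(sqrt(n * q * (1 - q)))
    return weights

def smooth(weights):
    """w' = 0.5 * w + 0.5 * (average of all w)."""
    avg = mean(weights)
    return [0.5 * w + 0.5 * avg for w in weights]

def midranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks

def rank_sum(genotypes, is_case, smoothed=False):
    """x = sum of ranks of v_j over affected individuals."""
    effective = [1 / w for w in mb_weights(genotypes, is_case)]
    if smoothed:
        effective = smooth(effective)  # assumption: smooth the 1/w_i
    scores = [sum(e * g for e, g in zip(effective, row)) for row in genotypes]
    return sum(r for r, c in zip(midranks(scores), is_case) if c)

def permutation_z(genotypes, is_case, n_perm=1000, smoothed=False, seed=0):
    """z = (x - mu) / sigma from label permutations; one-sided p from N(0, 1)."""
    rng = random.Random(seed)
    x_obs = rank_sum(genotypes, is_case, smoothed)
    labels = list(is_case)
    xs = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        xs.append(rank_sum(genotypes, labels, smoothed))
    z = (x_obs - mean(xs)) / stdev(xs)
    return z, 0.5 * erfc(z / sqrt(2))  # upper-tail probability

# Hypothetical toy data: 10 individuals x 3 SNPs, first 4 are cases.
genotypes = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0],
             [0, 0, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 2]]
is_case = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(permutation_z(genotypes, is_case))
print(permutation_z(genotypes, is_case, smoothed=True))
```

Because the weights depend on which individuals are labelled controls, they are recomputed inside every permutation. Smoothing pulls the largest effective weights (those of singletons) toward the average, which is the behaviour the Future research slide is after.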
