Confidential
SUPPLEMENTARY METHODS
AMBROSIA: The pseudocode for the AMBROSIA algorithm is shown in Supplementary Figure 1.
Case Study 1, Interactions in Data with Simulated LD Patterns: The gene-gene interaction model used for this case study is shown in Supplementary Figure 2A. We simulated 60 SNPs in four groups, G1 through G4, with linkage disequilibrium within each group (Supplementary Figure 2B). The disease-causing SNPs corresponding to SNPs i and j in Supplementary Figure 2A were SNP 7 (red in G1) and SNP 22 (red in G2), respectively. The minor allele frequency (MAF) of the central SNP in each group was 0.5; the MAFs of the flanking SNPs and the LD pattern were the same within each group and are shown in Supplementary Figure 2B. The penetrance matrix corresponding to the disease-associated gene-gene interaction between SNP 7 and SNP 22 was:

        BB       Bb       bb
AA      p        p        1 − p
Aa      p        p        1 − p
aa      1 − p    1 − p    1 − p

The major and minor alleles for SNP 7 are A and a, respectively, whereas the major and minor alleles for SNP 22 are B and b, respectively. The probability of disease in individuals with the risk-associated genotype combinations, {AA, BB}, {AA, Bb}, {Aa, BB} and {Aa, Bb}, is p. The probability of disease in individuals without the risk-associated genotypes was set to (1 − p). In this context, the relative risk was the ratio of p to (1 − p). The case-control status phenotype variable was denoted by Y.

The relative risk values used in simulation were 1.2, 1.5, 1.8, 2.0 and 2.5. A case-control design with 2500 cases and 2500 controls was used.
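
The penetrance model above can be illustrated with a minimal Python sketch. It is not the simulation code used for the study: it covers only the two causal SNPs, ignores the flanking SNPs and their LD structure, and assumes that genotypes are drawn in Hardy-Weinberg proportions with a MAF of 0.5. The function names are hypothetical. Because the relative risk is defined as p / (1 − p), p is recovered as RR / (1 + RR).

```python
import random

RISK_GENOTYPES = {("AA", "BB"), ("AA", "Bb"), ("Aa", "BB"), ("Aa", "Bb")}

def risk_probability(relative_risk):
    """Solve RR = p / (1 - p) for p."""
    return relative_risk / (1.0 + relative_risk)

def penetrance_matrix(relative_risk):
    """Map (SNP 7 genotype, SNP 22 genotype) to probability of disease."""
    p = risk_probability(relative_risk)
    return {
        (g7, g22): (p if (g7, g22) in RISK_GENOTYPES else 1.0 - p)
        for g7 in ("AA", "Aa", "aa")
        for g22 in ("BB", "Bb", "bb")
    }

def sample_genotype(rng, maf, major, minor):
    """Draw two alleles independently (assumption: Hardy-Weinberg proportions)."""
    n_minor = sum(rng.random() < maf for _ in range(2))
    return (major * 2, major + minor, minor * 2)[n_minor]

def simulate_case_control(relative_risk, n_cases, n_controls, maf=0.5, seed=0):
    """Sample individuals until the requested numbers of cases and controls are filled."""
    rng = random.Random(seed)
    matrix = penetrance_matrix(relative_risk)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g7 = sample_genotype(rng, maf, "A", "a")
        g22 = sample_genotype(rng, maf, "B", "b")
        affected = rng.random() < matrix[(g7, g22)]
        group, limit = (cases, n_cases) if affected else (controls, n_controls)
        if len(group) < limit:
            group.append((g7, g22))
    return cases, controls

# Small illustrative run; the study used 2500 cases and 2500 controls.
cases, controls = simulate_case_control(relative_risk=2.0, n_cases=50, n_controls=50)
```

For a relative risk of 2.0, this gives p = 2/3 for the four risk-associated genotype combinations and 1/3 for the remaining five.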

Power was assessed from 100 independent repetitions of the simulation procedure. Power was defined as the proportion of models that contained the interacting combinations. The average number of false combinations per model (FCM) was defined as the number of false combinations in each model averaged over the 100 independent repetitions. For Case Study 1, the AMBIENCE input parameter values were  = 10 and  = 2. All power calculations in AMBROSIA were done with  = 0.001.
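
The two summary measures can be sketched as follows. This is an illustration under two assumptions that the text does not spell out: a model "contains the interacting combinations" when every true combination is a member of it, and a "false combination" is any member of the model that is not a true combination. The names are hypothetical.

```python
def power_and_fcm(models, true_combinations):
    """Return (power, FCM) over a list of models.

    models: one collection of combinations (frozensets) per simulation repetition.
    true_combinations: the combinations used to generate the data.
    """
    true_set = set(true_combinations)
    n = len(models)
    # Assumption: a repetition counts towards power only if all true combinations are present.
    hits = sum(1 for model in models if true_set <= set(model))
    # Assumption: every model member outside the true set is a false combination.
    false_counts = [len(set(model) - true_set) for model in models]
    return hits / n, sum(false_counts) / n

truth = [frozenset({"7", "Y"}), frozenset({"22", "Y"}), frozenset({"7", "22", "Y"})]
example_models = [
    truth,                                   # exact recovery
    truth + [frozenset({"23", "Y"})],        # recovery plus one false combination
    [frozenset({"22", "Y"})],                # interaction missed
    truth,
]
power, fcm = power_and_fcm(example_models, truth)
```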
Case Study 2, Gene-Gene Interactions with Genetic Heterogeneity: To challenge the capabilities of AMBROSIA, we created a simulation with two pairs of gene-gene interactions.

The model for Case Study 2, summarized in Supplementary Figure 2C, contained 120 SNPs in eight groups. The allele frequencies of the central SNPs were all 0.5; the MAF of the flanking SNPs and the LD patterns within each group were similar to Case Study 1.

The model contained genetic heterogeneity (GH) with two pairs of interacting loci, SNP 7 with SNP 22 and SNP 67 with SNP 82, each of which increased risk in half of the cases. Both pairs of interactions followed the penetrance matrix of Case Study 1. The relative risk values used in simulation were 1.2, 1.5, 1.8, 2.0 and 2.5, and a case-control design with 2500 cases and 2500 controls was used. For Case Study 2, the AMBIENCE input parameter values were  = 10 and  = 2.
Comparisons with Multi-factor Dimensionality Reduction (MDR)

The AMBROSIA method was compared head to head with MDR [2]. The MDR implementation was downloaded from http://sourceforge.net/projects/mdr/.

Comparisons to MDR using Case Study 2. The MDR method was first run to conduct an exhaustive search of all possible 1- to 4-locus interactions with each data repetition of Case Study 2. However, each of these MDR runs failed to complete successfully after 48 hours of run time. To enable comparison of AMBROSIA to MDR, we created a smaller data set by simplifying Case Study 2 to contain only 24 SNPs. The central SNP, one SNP with R² = 0.9 and one SNP with R² = 0.8 were selected from each of the eight groups. The LD patterns of this reduced data set are summarized in Supplementary Figure 2D.
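
To give a sense of scale for the exhaustive search, the following sketch counts the 1- to 4-locus SNP combinations in the full (120 SNPs) and reduced (24 SNPs) data sets. It only counts candidate combinations; it says nothing about the per-combination cost of MDR, and the function name is hypothetical.

```python
from math import comb

def exhaustive_search_size(n_snps, max_order=4):
    """Number of distinct SNP subsets of size 1..max_order."""
    return sum(comb(n_snps, k) for k in range(1, max_order + 1))

full_size = exhaustive_search_size(120)     # 8,502,670 combinations
reduced_size = exhaustive_search_size(24)   # 12,950 combinations
```

The reduction from 120 to 24 SNPs therefore shrinks the number of candidate combinations by a factor of roughly 650.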
The model contained genetic heterogeneity (GH) with two pairs of interacting loci: Interaction 1 consisted of SNP 1 and SNP 4, and Interaction 2 consisted of SNP 13 and SNP 16; each increased risk in half of the cases. Both pairs of interactions followed the penetrance matrix of Case Study 1. The simulation strategy was similar to Case Study 1.

The relative risk value of 2.0 was studied and a case-control design with 2500 cases and 2500 controls was used.
The MDR method was first run to conduct an exhaustive search of all possible 1- to 4-locus interactions with each data repetition. Statistical significance in MDR was obtained by comparing the observed prediction error for each MDR model to the null distribution obtained from 10,000 permutations.

The AMBIENCE input parameter values were  = 50 and  = 2. The power and FCM of MDR and AMBROSIA were computed from 100 independent simulations.
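
The permutation procedure described for MDR can be sketched generically as below. This is not the MDR software's implementation: the statistic is passed in as a function, the example statistic `carrier_rule_error` is a hypothetical stand-in for an MDR model's prediction error, and the add-one correction in the p-value is a common convention assumed here rather than taken from the text.

```python
import random

def permutation_p_value(statistic, genotypes, labels, n_permutations=10000, seed=0):
    """Empirical p-value for a prediction error (smaller is better).

    The null distribution is built by shuffling case-control labels.
    """
    rng = random.Random(seed)
    observed = statistic(genotypes, labels)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(genotypes, shuffled) <= observed:
            at_least_as_good += 1
    # Assumption: add-one correction so the p-value is never exactly zero.
    return (at_least_as_good + 1) / (n_permutations + 1)

def carrier_rule_error(genotypes, labels):
    """Hypothetical classifier: predict 'case' when the genotype code is > 0."""
    wrong = sum((g > 0) != bool(y) for g, y in zip(genotypes, labels))
    return wrong / len(labels)

# Toy data in which carrier status separates cases from controls perfectly.
toy_genotypes = [1] * 10 + [0] * 10
toy_labels = [1] * 10 + [0] * 10
p_value = permutation_p_value(carrier_rule_error, toy_genotypes, toy_labels,
                              n_permutations=1000, seed=1)
```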
Comparisons with MDR using Models Based on the MDR Power Paper. Four two-locus interaction models employed in the MDR power evaluation paper by Ritchie et al. [1] were used for comparison against AMBROSIA. The penetrance matrices for the models used are shown in Table 1 of the main paper.

A case-control study design with 2500 cases and 2500 controls was assumed. Twenty-four diallelic SNPs were simulated in eight groups (Supplementary Figure 2D). Within each 3-SNP group, the central SNP is in LD with the other two SNPs of that group, with R² = 0.9 and R² = 0.8, respectively.
Four interaction models were simulated with genetic heterogeneity. Models 1-GH, 2-GH, 3-GH and 4-GH contained genetic heterogeneity (GH) with two pairs of interacting loci: SNP(1) with SNP(4), defined as Interaction 1, and SNP(13) with SNP(16), defined as Interaction 2. The corresponding penetrance matrices in Table 1 were used for simulations for both pairs of interacting loci for each model such that each Interaction increased disease risk in half of the cases. The remaining SNPs were not associated with the phenotype. For models 1-GH and 2-GH, within each 3-SNP group, the allele frequency for the central SNP was 0.5; the minor allele frequencies (MAF) for the other two SNPs were 0.445 (R² of 0.9 with the central SNP) and 0.474 (R² of 0.8 with the central SNP), respectively. For model 3-GH, within each 3-SNP group, the MAF for the central SNP was 0.25; the MAFs for the other two SNPs were 0.232 (R² of 0.9 with the central SNP) and 0.211 (R² of 0.8 with the central SNP), respectively. For model 4-GH, within each 3-SNP group, the MAF for the central SNP was 0.1; the MAFs for the other two SNPs were 0.091 (R² of 0.9 with the central SNP) and 0.082 (R² of 0.8 with the central SNP), respectively. For each model, we simulated 100 data sets. Genotypes were assumed to be in Hardy-Weinberg equilibrium proportions.
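
As a small illustration of the Hardy-Weinberg assumption, the expected genotype proportions for a diallelic SNP follow directly from its MAF q: (1 − q)², 2q(1 − q) and q². The sketch below evaluates these for the three central-SNP MAFs used in models 1-GH to 4-GH; the function name and dictionary keys are hypothetical.

```python
def hwe_genotype_frequencies(maf):
    """Expected genotype proportions for a diallelic SNP under Hardy-Weinberg equilibrium."""
    major = 1.0 - maf
    return {
        "major/major": major ** 2,
        "major/minor": 2.0 * major * maf,
        "minor/minor": maf ** 2,
    }

# Central-SNP MAFs from the text: 0.5 (models 1-GH, 2-GH), 0.25 (3-GH), 0.1 (4-GH).
central_snp_frequencies = {maf: hwe_genotype_frequencies(maf) for maf in (0.5, 0.25, 0.1)}
```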
SUPPLEMENTARY RESULTS
Demonstration Run with Case Study 1, Interactions in Data with Simulated LD Patterns: We used the top 10 one-SNP and two-SNP combinations with the highest KWII values identified by AMBIENCE to build the most parsimonious model using AMBROSIA.

In the first step of AMBROSIA, we confirmed that all the combinations identified by AMBIENCE were significant using permutation-based approaches with  = 0.001. The KWII values for the combinations are summarized in Supplementary Table 1. The value of  was set to 0.001.
AMBROSIA starts with the combination {22, Y}, which has the highest KWII, as the first member combination in the model. In the next step, it assesses whether {7, Y} can be added to the model. It evaluates KWII(7, 22, Y) = 0.012 > 0 and MICC = 254, which makes e^(−MICC) smaller than the complexity-filter threshold. As a result, {7, Y} is added to the model, so that M = {{22, Y}, {7, Y}}. In the same step, the significant combination {7, 22, Y} is also added to the model because it has positive KWII and, since its sub-combinations are already present in the model, the overall model complexity is not increased. Next, the combination {23, Y} is evaluated; however, KWII(22, 23, Y) = −0.036 < 0, so {23, Y} is rejected. Repeating these steps for the 1-way and 2-way combinations listed in Table 2 does not add any more combinations to M. Finally, M = {{22, Y}, {7, Y}, {7, 22, Y}} emerged as the most informative and parsimonious model explaining the disease phenotype. The significance of the overall model was p < 0.0001. These results from AMBROSIA are consistent with the model used for simulation.

These promising results provide proof of concept that AMBROSIA can be used for model synthesis in the presence of the confounding effects of LD.
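
The two checks applied in this walk-through can be written out as a short sketch using the values quoted above. Two points are assumptions: the threshold value of 0.001 (the text sets a parameter to 0.001 without the symbol surviving, and treating it as the complexity-filter threshold is a guess), and the reading of the redundancy check as "reject a candidate whose joint KWII with a model member and Y is negative". The function names are hypothetical.

```python
import math

def passes_complexity_filter(micc, threshold):
    """True when exp(-MICC) < threshold, evaluated on the log scale to avoid overflow."""
    return -micc < math.log(threshold)

def is_redundant(joint_kwii):
    """Assumed reading of the redundancy check: a negative joint KWII marks redundancy."""
    return joint_kwii < 0

ASSUMED_THRESHOLD = 0.001  # assumption, see lead-in

# Candidate {7, Y} against model member {22, Y}: MICC = 254, KWII(7, 22, Y) = 0.012.
accept_snp7 = passes_complexity_filter(254, ASSUMED_THRESHOLD) and not is_redundant(0.012)

# Candidate {23, Y} against model member {22, Y}: KWII(22, 23, Y) = -0.036.
reject_snp23 = is_redundant(-0.036)
```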
TABLES
Supplementary Table 1. Top 10 one-way and two-way combinations for Case Study 1.

1-way Combinations    KWII     2-way Combinations    KWII
{22, Y}               0.026    {7, 22, Y}            0.450
{7, Y}                0.026    {7, 21, Y}            0.380
{6, Y}                0.023    {8, 21, Y}            0.379
{8, Y}                0.023    {7, 23, Y}            0.379
{21, Y}               0.023    {6, 22, Y}            0.378
{23, Y}               0.023    {8, 22, Y}            0.323
{5, Y}                0.020    {8, 23, Y}            0.322
{24, Y}               0.020    {6, 21, Y}            0.322
{20, Y}               0.020    {6, 23, Y}            0.321
{9, Y}                0.020    {7, 24, Y}            0.319
Supplementary Table 2. Interactions in the GAW15 data set.

Locus     Chr   SNP #     Phenotype      Effects
DR        6     152-155   RA             Affects risk of RA
A         16    30-31     RA             Controls effect of DR on RA risk
B         8     442       RA             Controls effect of smoking on RA risk
C         6     152-155   RA             Increases RA risk only in women
D         6     161-162   RA             Rare allele increases RA risk 5-fold
E         18    268-269   RA, Anti-CCP   Affects of DR on anti-CCP and increases RA risk
F         11    387-389   IgM            QTL for IgM
G         9     185-186   Severity       25% QTL for severity
H         9     192-193   Severity       25% QTL for severity
Age       -     -         RA             Affects RA risk through smoking and sex ratio
Sex       -     -         RA             Affects RA risk with Locus C
Smoking   -     -         RA, IgM        Affects RA risk with Locus B and through IgM
FIGURE LEGENDS
Supplementary Figure 1. Pseudocode for the AMBROSIA algorithm.

Supplementary Figure 2. Panel A shows the gene-gene interaction model used to generate the data for Case Study 1. The red bars correspond to the disease-associated interacting central SNPs whereas the blue bars represent the non-interacting central SNPs in each group. The height of the bars corresponds to the extent of linkage disequilibrium with the central SNP as measured by R². The R² values are indicated and range from 0.2 to 0.9 in intervals of 0.1 for Supplementary Figures 2B and 2C and are 0.9 and 0.8 for Supplementary Figure 2D.
SUPPLEMENTARY FIGURE 1
Algorithm : Ambrosia
Input :
 : Set of combinations sorted by KWII.
 : Threshold for complexity filter.
Output :
 : Model
1. Compute combination p-values with permutations.
2.   0; # MIC; N  no. of samples in the data;   1;
3. 1  2N    2   ;
4.   ;
5. while  is not empty do
6.     C  Combination with sub-combinations already in M or the one with highest KWII from
7.      temp    {C};
8.       Degrees of freedom of M temp;
9.      2  2N  (  KWII(C))  2   ;
10.       2  1; # MICC
11.    if exp(- )   # Complexity Filter#
12.        if redundancy(M, C) is false # Redundancy Filter#
13.            M  M temp ;
14.                KWII(C);
15.        endif
16.    endif
17.       \ {C};
18. endwhile
19. return M;
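
Because several symbols and operators in the pseudocode did not survive, the following Python sketch is one possible reading of the loop rather than a transcription of it. The assumptions are: (i) MICC is an AIC-style quantity, 2N times the KWII of the candidate minus twice the increase in degrees of freedom; (ii) the reference degrees of freedom are updated whenever a combination is accepted; (iii) candidates are visited in descending KWII order, which simplifies the selection rule in step 6; (iv) the degrees-of-freedom function and the redundancy predicate are supplied by the caller. All names are hypothetical.

```python
import math

def greedy_model_synthesis(ranked, n_samples, threshold, degrees_of_freedom, is_redundant):
    """Greedy model building with a complexity filter and a redundancy filter.

    ranked: list of (combination, KWII) pairs sorted by KWII, highest first.
    degrees_of_freedom: callable taking a list of combinations (assumed helper).
    is_redundant: callable taking (model, combination) (assumed helper).
    """
    model = []
    mic = 0.0          # information accumulated by accepted combinations
    df_current = 1     # the figure initialises a quantity to 1 next to "# MIC"
    for combination, kwii in ranked:
        candidate = model + [combination]
        df_candidate = degrees_of_freedom(candidate)
        # ASSUMPTION (i): AIC-style change in model score.
        micc = 2 * n_samples * kwii - 2 * (df_candidate - df_current)
        # exp(-MICC) <= threshold, evaluated on the log scale to avoid overflow.
        if -micc <= math.log(threshold):
            if not is_redundant(model, combination):
                model = candidate
                mic += kwii
                df_current = df_candidate   # ASSUMPTION (ii)
    return model

# Toy run with hypothetical helpers: two degrees of freedom per combination,
# and SNP 23 treated as redundant with what is already in the model.
ranked_example = [
    (frozenset({"22", "Y"}), 0.026),
    (frozenset({"7", "Y"}), 0.026),
    (frozenset({"23", "Y"}), 0.023),
]
example_model = greedy_model_synthesis(
    ranked_example,
    n_samples=5000,
    threshold=0.001,
    degrees_of_freedom=lambda m: 1 + 2 * len(m),
    is_redundant=lambda m, c: "23" in c,
)
```

The toy helpers are chosen only to make the example run; they are not expected to reproduce the MICC value reported for Case Study 1.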

SUPPLEMENTARY FIGURE 2
REFERENCES
1. Ritchie MD, Hahn LW, Moore JH (2003). Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 24(2): 150-157.

2. Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF et al (2001). Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 69(1): 138-147.