Please refer to the references in the main manuscript.
How to compute p-values in the restandardization GSA (pooled version)
$N$ is the number of gene sets considered and $P$ is the number of permutations taken. $T_j$, $j = 1, 2, \ldots, N$, are the gene set scores. Let $T = (T_j)_{1 \le j \le N}$.
Step 1. Permute the sample labels and compute each gene set score to generate the matrix of randomized summary set statistics $T^{perm} = (T_j^k)$, $1 \le j \le N$, $1 \le k \le P$, where $T_j^k$ is the score of the $j$th gene set for the $k$th permutation, and $P$ is the number of permutations executed.
Step 2. (Pooled and randomized gene set scores) Compute
$$ g(T^{perm}) = \left( \frac{g(T_j^k) - \mu^*}{\upsilon^*} \right), \quad 1 \le j \le N, \ 1 \le k \le P, $$
where $g(T_j^k)$ represents a generalized gene set statistic and $\mu^*$, $\upsilon^*$ are the mean and standard deviation of this statistic for randomly drawn gene sets over all the permutations performed.
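Step 3. For each gene set, compute the generalized statistic $g(T_j)$ as well as its mean and standard deviation $\mu$, $\upsilon$, then compute the ratio of scores in $g(T^{perm})$ that satisfy
$$ \frac{g(T_j) - \mu}{\upsilon} \le \frac{g(T_i^k) - \mu^*}{\upsilon^*}. $$
This ratio is the p-value of the $j$th gene set.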
If $g$ is a linear transform of individual scores, such as the mean of absolute values, the $p$th moment, or the maxmean statistic, $\mu^*$, $\upsilon^*$ and $\mu$, $\upsilon$ can simply be replaced by those computed from the mean and variance of the individual gene scores, without randomly drawing gene sets.
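The three steps above can be sketched in Python as follows. This is a minimal illustration, not the GSA-SNP implementation: the function name and argument names are hypothetical, the means and standard deviations are assumed to have been computed beforehand, and larger standardized scores are assumed to be more extreme (upper-tail p-value).

```python
import numpy as np

def pooled_restandardized_pvalues(obs, perm, mu, upsilon, mu_star, upsilon_star):
    """Pooled restandardization p-values (illustrative sketch).

    obs          : shape (N,), observed statistics g(T_j)
    perm         : shape (N, P), permuted statistics g(T_j^k)
    mu, upsilon  : mean and standard deviation used for the observed scores
                   (scalars or arrays of shape (N,))
    mu_star, upsilon_star : mean and standard deviation used for the
                   permuted scores (scalars)
    """
    obs = np.asarray(obs, dtype=float)
    perm = np.asarray(perm, dtype=float)

    # Standardize the observed and the permuted scores separately.
    z_obs = (obs - mu) / upsilon
    z_perm = (perm - mu_star) / upsilon_star

    # Pool the permuted scores of all gene sets into one null distribution.
    pooled = np.sort(z_perm.ravel())

    # For each gene set, count pooled null scores >= its observed score.
    n_ge = pooled.size - np.searchsorted(pooled, z_obs, side="left")
    return n_ge / pooled.size
```

Because the null scores of all N gene sets are pooled, each p-value is a proportion out of N × P scores rather than P scores.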
Comparison of GWAS on height between European and Korean populations
Weedon et al. (25) and Gudbjartsson et al. (26) recently reported 20 ($P_{GWA} < 5 \times 10^{-7}$) and 27 ($P_{GWA} < 1.6 \times 10^{-7}$) regions, respectively, that were highly associated with adult height in multiple cohorts of primarily European ancestry. In contrast, the GWA analysis of Korean samples identified only eight regions significantly associated with height ($P_{GWA} < 4 \times 10^{-6}$), and five of them largely overlapped the results of the preceding European studies (25,26). The European studies also analyzed the gene ontology (GO) terms or pathways of the genes neighboring the loci associated with adult height and suggested several biological processes that were likely implicated in height variation. By comparison, only seven genes were identified within a 200-kb window of the variants significantly associated in the Korean population; five of them overlapped with the European results. Although the small number of implicated genes makes a GO analysis of them uninformative, most of them are also found in the gene lists of the European studies. This implies that some of the strongest signals in the European studies are well captured in the Korean samples, while other signals may be weaker in the Korean samples and fall below the significance threshold.
Imputation of KARE genotype data
The KARE genotype data were supplemented by imputing SNP genotypes based on the genotypes of the JPT+CHB panel of the International HapMap Phase II (The International HapMap Consortium 2005). Details of SNP imputation and filtering have been published elsewhere (28). Briefly, the genotypes of a total of 2,168,896 SNPs were imputed using PLINK, and the 799,492 SNPs passing $P_{HWE} > 10^{-6}$, IMPUTED > 0.9, and INFO > 0.9 (concordance > 99%) were kept for subsequent analyses. The GWA scan for height was carried out with the trend test after adjustment for age and sex using PLINK. The GWA p-values for all the filtered SNPs were gathered and used as input to GSA-SNP.
Guidance on how to choose a GSA method in GSA-SNP
By and large, there are three kinds of GSA methods: gene-randomizing methods, sample-randomizing methods, and hybrids of the two. Briefly, gene-randomizing methods (the Z-statistic method of GSA-SNP, GeSBAP) assess the enrichment of the association signal in a gene set compared to the background genes, whereas sample-randomizing methods (SRT) assess the existence of the signal in a gene set. In other words, the two approaches have slightly different foci.
From a practical perspective, the gene-randomizing methods have an advantage: they are applicable to a small number of data samples, while the sample-randomizing methods require more data samples. From a statistical perspective, the sample-randomizing methods have an advantage. Gene-randomizing methods assume that each gene set is a collection of independent samples (genes), which is not valid because most gene sets share a common biological function and may have some degree of correlation structure in their gene expression. The sample-randomizing methods are free from this assumption. However, the problem of independent sampling may be ameliorated in the context of GWAS, because the correlation of association depends on the haplotype structure of the genome rather than on the biological functions shared by gene sets.
In the context of GWAS, the sample-randomizing methods incur high computational costs in both memory and time. For a p-value approach, for example, a thousand simulated p-values should be generated to obtain a reasonable level of significance, which is very time-consuming. For this reason, we implemented two methods that hybridize gene- and sample-randomizing methods (Restandardization and GSEA) in our software. By pooling the gene set scores of different gene sets, we can obtain significant scores with only a hundred simulated samples. Therefore, users who prefer sample-randomizing methods may choose the Restandardization or GSEA method. Comparing the two hybrid methods, the restandardization method yields many significant results, while GSEA provides the most conservative results. See Nam and Kim (2008) for a detailed comparison of GSA methods.
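The saving from pooling can be illustrated by the resolution of an empirical p-value, that is, the smallest nonzero value it can take. The arithmetic below is a rough sketch that assumes p-values are estimated as simple proportions of null scores and uses a hypothetical number of gene sets, N = 1000, which is not a figure from this study.

```latex
\begin{align*}
\text{unpooled, } P = 1000:&\quad p_{\min} = \frac{1}{P} = 10^{-3},\\
\text{pooled, } P = 100,\ N = 1000:&\quad p_{\min} = \frac{1}{N P} = \frac{1}{10^{5}} = 10^{-5}.
\end{align*}
```

Under these assumptions, pooling with a tenth of the permutations still gives a finer p-value resolution than the unpooled approach.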
Comparison of best, 2nd best and 3rd best p-values for height and triglyceride traits
Supplementary Figures 1 and 2 show the distributions of the kth versus (k+1)th best p-values, k = 1, 2, 3, and 4, for the two traits, height and triglyceride. In both cases, the correlation between the p-values increased as k increased. In particular, the difference between the best and the second best p-values was pronounced for the triglyceride example, where the use of the second best p-values removed many doubtful best p-values.
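As a rough illustration of how a kth best p-value could be extracted, the sketch below assumes that it means the kth smallest SNP p-value assigned to a gene. The function name, the dictionary inputs, and the choice to skip genes with fewer than k SNPs are all assumptions made for this example, not a description of GSA-SNP's internals.

```python
from collections import defaultdict

def kth_best_pvalues(snp_pvalues, snp_to_gene, k=2):
    """Return {gene: kth smallest SNP p-value} (illustrative sketch).

    snp_pvalues : dict mapping SNP id -> GWA p-value
    snp_to_gene : dict mapping SNP id -> gene (SNPs without a gene are ignored)
    k           : 1 for the best p-value, 2 for the second best, and so on
    """
    per_gene = defaultdict(list)
    for snp, p in snp_pvalues.items():
        gene = snp_to_gene.get(snp)
        if gene is not None:
            per_gene[gene].append(p)

    result = {}
    for gene, pvals in per_gene.items():
        pvals.sort()
        # Assumption: genes with fewer than k SNPs are skipped.
        if len(pvals) >= k:
            result[gene] = pvals[k - 1]
    return result
```

Calling the function with k=1 and k=2 on the same inputs gives the pairs of values that a best versus second best scatter plot would compare.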
Comparison of corrected p-values
We also observed that using the second best p-values overall improved the corrected p-values (Z-statistic) or FDR values (GSEA-Maxmean), while the predicted gene sets themselves were similar between the best and the second best p-value results. However, when we used the third or higher order p-values, the FDRs in the GSEA method became worse than those for the best p-value option.
Supplementary Figure 1. (Height example) Scatter plots of the kth vs. (k+1)th best p-values, k = 1, 2, 3, and 4, for the height data. Blue areas represent densely distributed data points.
Supplementary Figure 2. (Triglyceride example) Scatter plots of the kth vs. (k+1)th best p-values, k = 1, 2, 3, and 4, for the triglyceride data. Blue areas represent densely distributed data points.