Supplementary material for “Random Forest Fishing: A Novel Approach to
Identifying Organic Group of Risk Factors in Genome wide Association
Studies” by Yang and Gu.
Table of Contents
1. Brief review of Random Forest (RF)
2. Cooling scheme used to control bait set sizes
3. SNP coverage
4. Tuning parameters in RF and RFF
5. Computational time of RFF tests
6. Power for detecting multiple risk SNPs using 500 cases and 500 controls
7. Functional annotation of genes assigned to the RFF-found SNPs
8. Overlap of RFF results with other tests in real GWAS data analyses
9. References
1. Brief review of Random Forest (RF)
RF is an ensemble method that combines the results of many classification and
regression trees (CART) to make a prediction. The trees are built after introducing
two levels of randomization [1]. Specifically, each tree in the forest is grown as follows:
1. Randomly sampling subjects from the data to grow each tree. For each tree, as
many subjects as in the original data are sampled with replacement and used
as the training dataset. This sampling leaves out about one third of the
original subjects. These are called out-of-bag (OOB) samples and are used as
the test (validation) dataset to obtain unbiased estimates of the prediction
rate and variable importance, so RF requires no external testing samples.
2. Randomly selecting candidate variables to determine the splitting criterion at
each node of the tree. If the dataset contains M variables in total, a number m
much smaller than M is specified such that, at each node, only m variables are
selected at random for evaluation, and the one that best differentiates the
trait being predicted is chosen to split the node. The splitting procedure is
repeated to obtain all nodes of the tree.
3. The above steps are repeated to grow a pre-determined number of trees to
form a “random forest”. The predictions of all the trees are pooled to
“vote” for the overall prediction by the random forest.
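The two levels of randomization can be illustrated with a minimal NumPy sketch. It is not the randomForest or Random Jungle implementation; the sample size, the number of variables, and the seed are illustrative assumptions. It draws one bootstrap sample, identifies the OOB subjects (expected to be close to 1/e, about 37%, of the data, in line with the "about one third" quoted above), and picks m candidate variables for a single node.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

n_subjects = 1000     # assumed number of subjects
M = 2500              # assumed total number of variables (SNPs)
m = int(np.sqrt(M))   # candidates per node; sqrt(M) is the usual default for mtry

# Level 1: bootstrap sample of subjects (same size as the data, with replacement).
in_bag = rng.integers(0, n_subjects, size=n_subjects)
oob = np.setdiff1d(np.arange(n_subjects), in_bag)  # subjects never drawn
oob_fraction = oob.size / n_subjects

# Level 2: at a node, only m randomly chosen variables are evaluated for the split.
candidates = rng.choice(M, size=m, replace=False)

print(f"OOB fraction: {oob_fraction:.3f}")
print(f"candidate variables at this node: {candidates.size} of {M}")
```

In a full forest both draws are repeated for every tree and every node; the sketch shows a single draw of each.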
RF provides excellent prediction accuracy. The importance of each variable for
predicting the trait can be measured as the difference in prediction accuracy
before and after randomly permuting the values of that variable. This measure captures
not only the marginal effect of the variable, but also its interaction effects with other factors.
Using this feature removes the need to explicitly model each possible interaction,
which is desirable when analyzing large-scale GWAS datasets.
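The following toy sketch shows why a permutation-based importance can pick up a variable that acts purely through interaction. It is a simplified illustration under stated assumptions: the trait is the XOR of two binary variables, a hand-written `predict` function stands in for a fitted forest, and accuracy is computed on the full sample rather than on OOB samples.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
n = 2000

# Two binary variables that affect the trait only through their interaction.
x1 = rng.integers(0, 2, size=n)
x2 = rng.integers(0, 2, size=n)
y = x1 ^ x2  # trait is the XOR of the two variables


def predict(a, b):
    """Stand-in for a fitted model that has learned the interaction."""
    return a ^ b


def accuracy(pred, truth):
    return float(np.mean(pred == truth))


baseline = accuracy(predict(x1, x2), y)
permuted = accuracy(predict(rng.permutation(x1), x2), y)
importance_x1 = baseline - permuted  # drop in accuracy after permuting x1

# Marginal association: trait rate differs very little between x1 = 0 and x1 = 1.
marginal_gap = abs(y[x1 == 1].mean() - y[x1 == 0].mean())

print(f"baseline={baseline:.2f} permuted={permuted:.2f} importance={importance_x1:.2f}")
print(f"marginal difference in trait rate by x1: {marginal_gap:.3f}")
```

Permuting x1 removes roughly half of the prediction accuracy even though x1 shows almost no marginal association with the trait, which is the behaviour a single-variable test would miss.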
Direct application of RF to GWAS analysis poses real challenges. First, to roughly
cover all SNPs and their interactions, a large number of trees has to be grown. As
a simple example, suppose we grow a random forest on a dataset with 500,000 SNPs
and each tree involves ~700 SNPs. The probability that two specific SNPs
occur in the same tree is then about (700/500,000)² ≈ 2×10⁻⁶. This means that at
least 500,000 trees have to be grown in the forest for this combination to be
expected to appear at least once. The requirement for a huge number of trees, plus
the cost of exploring node-splitting variables when growing each tree, makes the
analysis extremely computationally intensive, if not entirely impossible. Second,
even if we managed to grow enough trees for a GWAS dataset, the variable
importance measure based on these trees would not be reliable. Since the vast
majority of the chromosomal SNPs are noise (irrelevant to the disease of interest),
at each node of a tree the handful of candidate splitting variables tend to be all
noise. That means most splits within a tree use noise variables, and estimates
such as prediction rates and variable importance based on these trees are
“fitting to noise” and unreliable.
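The arithmetic behind the tree-count example can be checked with a few lines of Python. The only assumption added here is that the inclusion of the two SNPs in a tree is treated as independent.

```python
n_snps = 500_000      # SNPs in the dataset (from the example above)
snps_per_tree = 700   # approximate number of SNPs involved in one tree

# Probability that one specific SNP is used in a given tree.
p_single = snps_per_tree / n_snps

# Probability that two specific SNPs appear in the same tree,
# treating their inclusion as independent (a simplifying assumption).
p_pair = p_single ** 2

# Number of trees for the pair to be expected to co-occur once.
trees_needed = 1 / p_pair

print(f"p_pair = {p_pair:.2e}")
print(f"trees needed = {trees_needed:,.0f}")
```

The result is a pair probability of about 2×10⁻⁶ and a little over 500,000 trees, matching the figures quoted in the text.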
To overcome these challenges, the new RFF method uses the idea of “fishing” to
improve the power of RF to detect interacting SNPs and to autonomously determine the
size of a globally important group of relevant SNPs. To avoid “fitting to noise”, RFF
handles the dimensionality problem by using an iterative process to traverse the
space of all variables while limiting the number of variables in each iteration of RF
analysis.
2. Cooling scheme used to control bait set sizes
In all experiments reported herein, the rate at which bait set sizes decrease is chosen so
that over any interval of a given length, on average, RFF always evaluates about the
same number of variables. Let the distribution of pool sizes (the same as bait set
sizes) follow a density function $f(x)$. For any interval from $x$ to $x + \Delta$ of
a given length $\Delta$, the average number of variables to be sampled by all pools over the
interval equals $\int_{x}^{x+\Delta} t\, f(t)\, dt$, where $x$ is an arbitrary bait size. To make the integral
dependent only on $\Delta$ (constant for any given value of $\Delta$), we used $f(x) = c/x$, where $c$
is a normalization constant. This cooling scheme ensures that more iterations are
used for smaller bait set sizes. Figure S1 shows the distribution used by this cooling
scheme in the real data analysis, which decreases bait set sizes from 2500 to 50.
Figure S1. Distribution of bait set sizes used in RFF analysis of real GWAS data of
the HHD study. Bait set sizes decrease from 2500 to 50. The distribution is approximated
by a density proportional to $1/x$, where $x$ is the bait set size.
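A cooling schedule of this kind can be sketched as follows, assuming the bait set sizes follow a density proportional to 1/x. Equally spaced quantiles of such a density are geometrically spaced, so the sizes shrink by a constant ratio per generation. The function name `bait_size_schedule` and the rounding to integers are illustrative choices, not the authors' implementation.

```python
import numpy as np


def bait_size_schedule(initial_size, last_size, n_generations):
    """Decreasing bait set sizes whose distribution has density proportional to 1/x.

    Equally spaced quantiles of a 1/x density on [last_size, initial_size]
    are geometrically spaced, so sizes shrink by a constant ratio.
    """
    u = np.linspace(0.0, 1.0, n_generations)
    sizes = initial_size * (last_size / initial_size) ** u
    return np.rint(sizes).astype(int)


# Values reported for the real GWAS analysis: 2500 down to 50 over 624 generations.
sizes = bait_size_schedule(2500, 50, 624)

# Compare two intervals of equal length (500 sizes wide).
small = sizes[(sizes >= 50) & (sizes < 550)]
large = sizes[(sizes >= 2000) & (sizes <= 2500)]
n_small, n_large = small.size, large.size                 # generations spent in each interval
vars_small, vars_large = int(small.sum()), int(large.sum())  # variables evaluated in each

print(n_small, n_large)
print(vars_small, vars_large)
```

Many more generations fall in the interval of small sizes, yet the total number of variables evaluated in the two intervals is roughly the same, which is the property the cooling scheme is designed to have.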
An added benefit of using decreasing bait set sizes is that it also determines the number
of variables to output after fishing. In a typical RFF test, the prediction accuracy of the bait
sets tends to increase at first as more and more important factors are caught. Once the
set is saturated with important factors, further decreases in bait set size begin to
drop important factors, so the prediction accuracy stops
increasing and starts to fall. We select the bait set at the peak of the
prediction rate curve as the output set of important variables from RFF, i.e., the smallest set
with the best prediction accuracy.
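The selection rule can be written as a short function. This is a minimal sketch: the function name is hypothetical, the trace values are made up, and ties are resolved by exact equality of accuracies, which a real analysis might relax with a tolerance.

```python
def select_output_bait_set(sizes, accuracies):
    """Return the index of the smallest bait set that attains the best accuracy.

    sizes: bait set size at each generation.
    accuracies: prediction accuracy of the bait set at each generation.
    """
    best = max(accuracies)
    candidates = [i for i, acc in enumerate(accuracies) if acc == best]
    return min(candidates, key=lambda i: sizes[i])


# Toy fishing trace (made-up numbers): accuracy rises, peaks, then drops.
sizes = [100, 80, 60, 40, 20, 10]
accuracies = [0.55, 0.62, 0.70, 0.70, 0.66, 0.58]

idx = select_output_bait_set(sizes, accuracies)
print(sizes[idx], accuracies[idx])
```

Here two generations share the best accuracy, and the rule returns the later one because its bait set is smaller.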
3. SNP coverage
For the iterative process employed by RFF, we set a target SNP coverage to
ensure that, on average, every GWAS SNP is evaluated the expected number of
times. The coverage achieved in the real data analysis is summarized in Table
S1. When the targeted coverage is 10, only 2 SNPs were not tested, and more than
of the GWAS SNPs were tested at least 5 times. The actual average coverage
in the analysis was 12.1 times. A histogram of the actual number of times each SNP
was sampled (covered) is shown in Figure S2.
Table S1. SNP coverage in real analysis for target coverage of 10.
Targeted coverage: 10
Covered times
SNPs      %
2         0%
264       0.07%
2970      0.76%
17137     4.40%
58932     15.14%
132459    34.02%
Figure S2. Histogram of actual times a SNP was
sampled in the real data analysis of the HHD study.
The target coverage was 10.
4. Tuning parameters in RF and RFF
Table S2. Tunable randomForest [2] / Random Jungle [3] parameters and their default values used by RFF in evaluation tests

Parameter: mtry
Description: Number of variables to be randomly selected and analyzed to choose a splitting variable at each node of a tree
Value used in the reported RFF tests: Default value (square root of the number of input variables) is used in RFF.

Parameter: ntree
Description: Total number of trees to construct in RF
Value used in the reported RFF tests: Default value (500 trees) is used in evaluating RFF.

Parameter: weight
Description: Weights for each subject in the dataset; to be tuned for unbalanced case/control datasets
Value used in the reported RFF tests: Not applicable because we used a balanced case-control design.

Parameter: Tree type
Description: Decision tree or regression tree, depending on the nature of the phenotype
Value used in the reported RFF tests: Decision trees are used for binary traits.

Parameter: Importance measure
Description: Measure of variable importance: mean decrease in prediction accuracy or node impurity (Gini index)
Value used in the reported RFF tests: The default choice of the RF engine was used: in the simulation tests, mean decrease in prediction accuracy was used by randomForest; in the real data test, decrease in node Gini index was used by Random Jungle.
Table S3. Overview of RFF tuning parameters

Parameter: Number of populations
Description: Number of populations (parallel bait sets) in the genetic algorithm
Value used in the reported RFF tests: 10

Parameter: Initial bait set size
Description: The bait set size at the beginning
Value used in the reported RFF tests: 100 in simulation; 2500 in real GWAS analysis

Parameter: Last bait set size
Description: The last bait set size
Value used in the reported RFF tests: 3 in simulation; 50 in real GWAS analysis

Parameter: Targeted SNP coverage
Description: The targeted average number of times that each SNP will be sampled for evaluation in fishing
Value used in the reported RFF tests: 4-24 in simulation analysis; 10 in real GWAS analysis

Parameter: Number of generations
Description: Total number of generations or iterations
Value used in the reported RFF tests: The value is set automatically after generating the vector of bait set sizes, given the initial bait set size, the last bait set size, and the target SNP coverage. The total number of generations is 603-3614 in simulations and 624 in real GWAS analysis.

Parameter: Pairwise interaction guidance
Description: (Optional) If pairwise interaction guidance is to be used, the pairwise interaction test result should be given.
Value used in the reported RFF tests: For both simulation and real data analysis, guidance is given by the PLINK fast-epistasis test result.

Parameter: Initial bait sets
Description: (Optional) The bait sets before RFF starts can be specified, so that previous analysis results serve as the starting point for fishing. If they are not specified, RFF starts by randomly sampling GWAS SNPs.
Value used in the reported RFF tests: Both the simulated and the real data analyses reported in the manuscript use random initial bait sets. We also tried initial sets taken from the top single-SNP test results and compared them with the random sets. Starting from the top SNPs gives a much better prediction rate at the beginning, but in the end it reaches a similar best prediction rate and a similar best bait set over the whole fishing process.
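How the number of generations might follow from the other parameters can be sketched as below. This is an assumption-laden illustration, not the RFF code: it assumes a geometric bait size schedule (density proportional to 1/x) and defines coverage as the number of populations times the sum of bait set sizes, divided by the number of SNPs. The function names and the SNP count of 390,000 are hypothetical.

```python
import numpy as np


def bait_size_schedule(initial_size, last_size, n_generations):
    """Geometrically decreasing sizes (density proportional to 1/x); an assumed form."""
    u = np.linspace(0.0, 1.0, n_generations)
    return np.rint(initial_size * (last_size / initial_size) ** u).astype(int)


def generations_for_coverage(n_snps, n_populations, initial_size, last_size,
                             target_coverage, max_generations=100_000):
    """Smallest number of generations whose total sampling reaches the target.

    Assumes coverage = n_populations * sum(bait sizes) / n_snps, i.e. every
    slot in a bait set counts as one evaluation of one SNP.
    """
    needed = target_coverage * n_snps
    for n in range(2, max_generations + 1):
        sizes = bait_size_schedule(initial_size, last_size, n)
        if n_populations * int(sizes.sum()) >= needed:
            return n
    raise ValueError("target coverage not reachable within max_generations")


n_gen = generations_for_coverage(
    n_snps=390_000,      # illustrative SNP count, not taken from the source
    n_populations=10,
    initial_size=2500,
    last_size=50,
    target_coverage=10,
)
print(n_gen)
```

With these illustrative inputs the sketch returns a value a little above 600, the same order as the 624 generations reported for the real GWAS analysis; it is not meant to reproduce that number exactly.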
Tunable parameters used by RF (RJ) outside of RFF
The parameters in Table S2 and Table S3 were used in RFF when calling RF on
subsets of variables. RFF was also compared with direct RF applications using RJ. In
the direct RJ analysis of simulated datasets, SNP importance was estimated based on
1,000 trees. To check whether this number of trees was adequate, we tested a few randomly
selected datasets with up to 10,000 trees and found no considerable change in either
importance values or variable ranks. Other parameters in the RJ tests were set to the
default values in Table S2.
RJ analysis using tuned parameters as in Goldstein et al. [4] was performed
with mtry set to 1/10 of total GWAS SNPs, ntree set to 8000, and SNPs in high LD
pruned out (SNP
<0.9).
5. Computational time of RFF tests
For the simulation tests, RFF using the R package with randomForest [2] as the RF
engine took ~10 hours on a single CPU (Quad-Core AMD Opteron(tm) Processor
2354) to analyze the 50K dataset of 1000 subjects at a coverage of 4x, and ~65 hours at
24x. For the real data tests, RFF using the C++ package with Random Jungle [3] as the
RF engine took ~3 hours on a single CPU (Intel(R) Xeon(R) X5660 2.80GHz) to
analyze the QC’ed 500K dataset of 140 individuals.
6. Power for detecting multiple risk SNPs using 500 cases and 500 controls
Figure S3. Power of detecting risk SNPs using 500 cases and 500 controls
The power is shown in all scenarios to detect any of the risk SNPs (“>=1 SNP”), any of
the weak risk SNPs (“>= 1 weak SNPs”), and more weak SNPs. A weak risk SNP is
one that has no marginal effect at all and contributes to disease through interactions.
The power to detect the 6 risk SNPs is shown for the 5 scenarios. “Chi2” represents
χ2 tests; “RJ” the Random Jungle test; “Pairwise” the pairwise interaction test using
PLINK fast-epistasis. For these methods, we declare the top 31 SNPs as “detected” to
measure the power. “RFF.nointx” and “RFF.intx” are RFF tests without and with
interaction guidance. “RFF.intx1” and “RFF.intx2” used empirical guidance based on the
pairwise interaction tests, with SNP coverage of 4 and 24, respectively; and
“RFF.intx3” used theoretical guidance based on synthetic interactions clustered over
the 6 risk SNPs.
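For the ranking-based methods, the power definition above can be sketched as the fraction of simulated datasets in which enough risk SNPs land among the top 31. The function name and the toy rankings are hypothetical; for RFF itself the "detected" set would be the output bait set rather than a top-31 cut.

```python
import random


def detection_power(rankings, risk_snps, top_n=31, min_hits=1):
    """Fraction of replicates with at least `min_hits` risk SNPs in the top `top_n`.

    rankings: one sequence of SNP ids per simulated dataset, ordered from
              most to least significant (or important).
    """
    risk = set(risk_snps)
    hits = [len(risk.intersection(r[:top_n])) >= min_hits for r in rankings]
    return sum(hits) / len(hits)


# Toy example: 200 made-up random rankings of 100 SNPs; ids 0-5 play the 6 risk SNPs.
rng = random.Random(2)  # illustrative seed
rankings = [rng.sample(range(100), 100) for _ in range(200)]

power_any = detection_power(rankings, risk_snps=range(6), min_hits=1)
power_two = detection_power(rankings, risk_snps=range(6), min_hits=2)
print(power_any, power_two)
```

Raising `min_hits` gives the power to detect two or more risk SNPs, and restricting `risk_snps` to the weak SNPs gives the corresponding curves for weak SNPs only.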
7. Functional annotation of genes assigned to the RFF-found SNPs
Table S4. Functional enrichment detected by DAVID [5] as over-representation of genes involved in known biological pathways or with known
disease associations. Terms are shown with their P values.

Category | Term | P-Value
OMIM_DISEASE | A Genome-Wide Association Study Identifies Protein Quantitative Trait Loci (pQTLs) | 0.0028
KEGG_PATHWAY | Arrhythmogenic right ventricular cardiomyopathy (ARVC) | 0.044
PANTHER_PATHWAY | P00053:T cell activation | 0.052
KEGG_PATHWAY | Androgen and estrogen metabolism | 0.056
GENETIC_ASSOCIATION_DB_DISEASE | atherosclerosis, coronary, lipoprotein | 0.071
KEGG_PATHWAY | Purine metabolism | 0.074
REACTOME_PATHWAY | REACT_11061:Signalling by NGF | 0.081
GENETIC_ASSOCIATION_DB_DISEASE | Post-transplantation diabetes mellitus (PTDM) | 0.084
PANTHER_PATHWAY | P00033:Insulin/IGF pathway-protein kinase B signaling cascade | 0.085
GENETIC_ASSOCIATION_DB_DISEASE | long QT syndrome | 0.097
8. Overlap of RFF results with other tests in real GWAS data analyses
Table S5. Overlap of top 213 SNPs from different tests

            RFF   Chi2    epi     rj   rj.tuned
RFF*        213     92      1     95        105
Chi2         92    213      0     69         85
epi           1      0    213      0          0
rj           95     69      0    213         84
rj.tuned    105     85      0     84        213

*RFF: random forest fishing; Chi2: single SNP test; epi: pairwise interaction test using
PLINK fast-epistasis; rj: direct application of Random Jungle using default parameters; rj.tuned:
Random Jungle test with parameters tuned as in Goldstein et al. [4]: mtry is set to 1/10 of total
GWAS SNPs, and ntree is set to 8000; SNPs in high LD are pruned out (threshold as in Section 4). These
parameters lead to the best results in Goldstein et al. [4]
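An overlap table of this kind can be computed from the top-SNP lists of each test with plain set intersections. The sketch below uses made-up SNP identifiers and a hypothetical function name; it only illustrates the bookkeeping, not the study data.

```python
from itertools import product


def overlap_matrix(top_sets):
    """Pairwise overlap counts between the top-SNP sets of several tests.

    top_sets: dict mapping a test name to an iterable of SNP identifiers.
    Returns a dict keyed by (test_a, test_b) with the size of the intersection.
    """
    sets = {name: set(snps) for name, snps in top_sets.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in product(sets, repeat=2)}


# Toy input with made-up SNP identifiers (not the study data).
top_sets = {
    "RFF": ["rs1", "rs2", "rs3", "rs4"],
    "Chi2": ["rs2", "rs3", "rs5", "rs6"],
    "epi": ["rs7", "rs8", "rs9", "rs1"],
}

m = overlap_matrix(top_sets)
names = list(top_sets)

# Print the matrix in the same layout as the overlap tables.
print("".ljust(6) + "".join(n.rjust(6) for n in names))
for a in names:
    print(a.ljust(6) + "".join(str(m[(a, b)]).rjust(6) for b in names))
```

The diagonal holds the size of each list (213 in the real analysis) and the matrix is symmetric, as in the table above.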
Table S6. Overlap of genes assigned to the top 213 SNPs from different tests as in Table
S4.

            RFF   Chi2    epi     rj   rj.tuned
RFF         196     95      9     98        121
Chi2         95    159      5     73         98
epi           9      5    138      8          9
rj           98     73      8    185        103
rj.tuned    121     98      9    103        204
9. References
1. Breiman L: Random Forests. Machine Learning 2001; 45: 5-32.
2. Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002; 2: 18-22.
3. Schwarz DF, Konig IR, Ziegler A: On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics 2010; 26: 1752-1758.
4. Goldstein BA, Hubbard AE, Cutler A, Barcellos LF: An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings. BMC Genetics 2010; 11: 49.