Supplementary material for “Random Forest Fishing: A Novel Approach to Identifying Organic Group of Risk Factors in Genome wide Association Studies” by Yang and Gu. Table of Contents 1. 2. 3. 4. 5. 6. 7. 8. 9. Brief review of Random Forest (RF) ................................................................................... 2 Cooling scheme used to control bait set sizes .................................................................. 3 SNP coverage......................................................................................................................... 4 Tuning parameters in RF and RFF ...................................................................................... 6 Computational time of RFF tests ......................................................................................... 8 Power for detecting multiple risk SNPs using 500 cases and 500 controls .................. 9 Functional annotation of genes assigned to the RFF-found SNPs .............................. 10 Overlap of RFF results with other tests in real GWAS data analyses .......................... 12 References ............................................................................................................................ 12 1 1. Brief review of Random Forest (RF) RF is an ensemble method that combines the result of many classification and regression trees (CART) to make a prediction. The trees were built after introducing two levels of randomization1. Specifically, each tree in the forest is grown as follows: 1. Randomly sampling subjects from the data to grow each tree. The same number of subjects are randomly sampled with replacement from the original data, and used as the training dataset to grow the trees. The sampling leaves out about one third of the subjects from the original. They are called out of bag (OOB) samples, and are used as the test (validation) dataset to get unbiased estimation of the prediction rate and variable importance, so RF requires no external testing samples. 2. Randomly selecting candidate variables to determine splitting criteria at each node of the tree. If there are totally M variables in the dataset, a number m much less than M is specified such that at each node, only m variables are selected at random for evaluation and the one that most differentiate the predicting trait is chosen to split the node. The splitting procedure is repeated to get all nodes of the tree. 3. The above steps are repeated to grow a pre-determined number of trees to form a “random forest”. The prediction results of all the trees are pooled to “vote” for the overall prediction by the random forest. RF provides excellent prediction accuracy. The importance of each variable for predicting the trait could be measured using the difference in prediction accuracy before and after randomly permuting the values of that variable. This measure entails not only marginal effects of the variable, but also interaction effects with other factors. Using this feature precludes the need to explicitly model each possible interaction, desirable when analyzing large-scale GWAS datasets. Direct application of RF to GWAS analysis poses a real challenge. First, to roughly cover all SNPs and their interactions, a large number of trees have to be grown. To give a simple example, if we grow a random forest in a dataset with 500,000 SNPs, and assume it involves ~700 SNPs per tree. The probability that two specific SNPs occur in the same tree is about 2×10-6. This means that at least 500,000 trees have to be grown in the forest to ensure that this combination is expected to be represented at least once. The requirement for a huge number of trees, plus the complexity to explore node-splitting variables when growing each tree, make it extremely computation intensive if not entirely impossible. Second, even if we managed to grow enough trees for a GWAS dataset, the variable importance measure based on these trees would not be reliable. Since the vast majority of the chromosomal SNPs are noise (irrelevant to the disease of interest), at each node of a tree, the handful of candidate splitting variables tend to be all noise. That means most splits within a tree are using noise variables, and estimations such as prediction rates and variable importance based on these trees are “fitting-to-noise” and unreliable. To overcome these challenges, the new RFF method uses the idea of “fishing” to 2 improve the power of RF to detect interacting SNPs, and autonomously determine the size of a globally important group of relevant SNPs. To avoid “fitting to noises”, RFF handles the dimensionality problem by using an iterative process to traverse the space of all variables and limiting the number of variables in each iteration of RF analysis. 2. Cooling scheme used to control bait set sizes In all experiments reported herein, the rate of decreasing baits sizes is chosen so that over any interval of a given length, on average, RFF always evaluates about the same number of variables. Let the distribution of pool sizes (the same as bait set sizes) follows a density function a given length . For any interval covering from to of the average number of variables to be sampled by all pools over the interval equals only dependent on , where is an arbitrary bait size. To make the integral (constant for any given value of ), we used , where is a normalization constant. This cooling scheme ensures that more iterations are used for smaller bait set sizes. Figure S1 shows the distribution used by this cooling scheme in real data analysis that decreases bait set sizes from 2500 to 50. Figure S1. Distribution of bait set sizes used in RFF analysis of real GWAS data of 3 the HHD study. Bait sets decreases from 2500 to 50. The distribution is approximated by . An added benefit of using decreasing bait set sizes is that it also gives the number of variables to output after fishing. In a typical RFF test, the prediction accuracy of bait sets tends to increase at first as more and more important factors are caught. After the set gets saturated with important factors, the important factors are beginning to be dropped off as bait set size decreases, which makes the prediction accuracy stop increasing and start dropping down. We select the bait set at the peak of the prediction rate curve as the output important variables from RFF, i.e., the smallest set with best prediction accuracy. 3. SNP coverage For the iterative process employed by RFF, we set a target SNP coverage to ensure that, on average, all GWAS SNPs are evaluated for the expected number of times. The covered times in real data analysis were summarized and shown in Table S1. When the targeted coverage is 10, only 2 SNPs were not tested, and more than of the GWAS SNPs were tested at least 5 times. The actual average coverage in analysis is 12.1 times. The histograms of actual times a SNP was sampled (covered) were shown in Figure S2. Table S1. SNP coverage in real analysis for target coverage of 10. Targeted coverage 10 Covered times SNPs % 4 2 0% 264 0.07% 2970 0.76% 17137 4.40% 58932 15.14% 132459 34.02% Figure S2. Histogram of actual times a SNP was sampled in the real data analysis of the HHD study. The target coverage was 10. 5 4. Tuning parameters in RF and RFF Table S2. Tunable randomForest2 /Random Jungle3 parameters and their default values used by RFF in evaluation tests Parameter Description Value used in the reported RFF tests mtry Number of variables to be randomly selected and analyzed to choose a splitting variable at each node of a tree Default value (square root of number of input variables) is used in RFF ntree Total number of trees to construct in RF Default value (500 trees) is used in evaluating RFF. weight Weights for each subject in the dataset; to be tuned for unbalanced case/control No applicable because we used a balanced case-control design. datasets Tree type Decision tree or regression tree depending on nature of the phenotype Decision trees are used for binary traits. Importance measure Measure of variable importance, mean decrease in prediction accuracy, or node impurity (Gini index) Default choice by the RF engine was used: in the simulation tests, mean decrease in prediction accuracy was used by randomForest; in the real data test, decrease in node Gini index was used by Random Jungle. 6 Table S3. Overview of RFF tuning parameters Parameter Description Number of Number of populations(parallel bait sets) populations in genetic algorithm Value used in the reported RFF tests 10 Initial bait set size The bait sets size at the beginning 100 in simulation; 2500 in real GWAS analysis Last bait set size The last bait set size The targeted average times that each SNPs will be sampled for evaluation in fishing 3 in simulation; 50 in real GWAS analysis Targeted SNP coverage 4-24 in simulation analysis; 10 in real GWAS analysis Number of generations Total number of generations or iterations The values is automatically set after generating vector bait set sizes given the initial bait set size, the last bait set size, and the target SNP coverage times. The total number of generations is 603-3614 in simulations, and 624 in real GWAS analysis. Pairwise interaction guidance (Optional) if pairwise interaction guidance is to be used, the pairwise interaction test result should be given. For both simulation and real data analysis, guidance is given by using PLINK fast-epistasis test result. Initial bait sets (Optional) the bait sets before RFF starts could be specified. It uses previous analysis results as starting point for fishing. Otherwise, if it is not specified, RFF starts by randomly sample GWAS SNPs to start. Both simulated and real data analysis reported in the manuscript use random initial bait sets. We also tried using initial sets from top single SNP test results and compared to the random sets. Using top SNPs at start gives much better prediction rate at the beginning, but in the end, it has close best prediction and similar best bait set during the whole fishing process. 7 Tunable parameters used by RF(RJ) outside of RFF The parameters in Table S2 and Table S3 were used in RFF when calling RF on subsets of variables. RFF was also compared with direct RF applications using RJ. In direct RJ analysis of simulated datasets, SNP importance is estimated based on 1,000 trees. To check if the number of trees is adequate, we tested a few randomly selected datasets with up to 10,000 trees and found no considerable change in either importance values or variable ranks. Other parameters in RJ tests were set using default values as in Table S2. RJ application using tuned parameters as by Goldstein el. al.4 was performed with mtry set to 1/10 of total GWAS SNPs, ntree set to 8000, and SNPs in high LD pruned out (SNP <0.9). 5. Computational time of RFF tests For the simulation tests, RFF using the R package and randomForest2 as RF engine took ~10 hours on a single CPU (Quad-Core AMD Opteron(tm) Processor 2354) to analyze the 50K dataset in 1000 subjects at coverage of 4x, and ~65 hours at 24x. For the real data tests, RFF using the C++ package and Random Jungle3 as RF engine took ~3 hours on a single CPU (Intel(R) Xeon(R) X5660 2.80GHz) to analyze the QC’ed 500K dataset in 140 individuals. 8 6. Power for detecting multiple risk SNPs using 500 cases and 500 controls Figure S3. Power of detecting risk SNPs using 500 cases and 500 controls The power is shown in all scenarios to detect any of the risk SNPs (“>=1 SNP”), any of the weak risk SNPs (“>= 1 weak SNPs”), and more weak SNPs. A weak risk SNP is one that has no marginal effect at all, and contributes to disease through interactions. The power to detect the 6 risk SNPs are shown for the 5 scenarios. “Chi2” represent χ2 tests; “RJ” for Random Jungle test; “Pairwise” for pairwise interaction test using PLINK fast epitasis. In these methods, we declare the top 31 SNPs as “detected” to measure the power. “RFF.nointx” and “RFF.intx” are RFF tests without and with interaction. “RFF.intx1” and “RFF.intx2” used empirical guidance based on the pairwise interaction tests, with SNP coverage of 4 and 24, respectively; and “RFF.intx3” used theoretical guidance based on synthetic interactions clustered over the 6 risk SNPs. 9 7. Functional annotation of genes assigned to the RFF-found SNPs Table S4. Functional enrichment detected by DAVID5 as over representation of genes involved in known biological pathways or with known disease associations. Terms with p value Category OMIM_DISEASE KEGG_PATHWAY PANTHER_PATHWAY KEGG_PATHWAY GENETIC_ASSOCIATION_DB_DISEASE KEGG_PATHWAY REACTOME_PATHWAY GENETIC_ASSOCIATION_DB_DISEASE PANTHER_PATHWAY GENETIC_ASSOCIATION_DB_DISEASE are shown with their P values. Term A Genome-Wide Association Study Identifies Protein Quantitative Trait Loci (pQTLs) Arrhythmogenic right ventricular cardiomyopathy (ARVC) P00053:T cell activation Androgen and estrogen metabolism atherosclerosis, coronary, lipoprotein Purine metabolism REACT_11061:Signalling by NGF Post-transplantation diabetes mellitus (PTDM) P00033:Insulin/IGF pathway-protein kinase B signaling cascade long QT syndrome 10 P-Value 0.0028 0.044 0.052 0.056 0.071 0.074 0.081 0.084 0.085 0.097 11 8. Overlap of RFF results with other tests in real GWAS data analyses Table S5. Overlap of top 213 SNP from different tests RFF Chi2 epi rj rj.tuned RFF* 213 92 1 95 105 Chi2 92 213 0 69 85 epi 1 0 213 0 0 *RFF: random forest fishing; Chi2: single SNP rj 95 69 0 213 84 rj.tuned 105 85 0 84 213 test; epi: pairwise interaction test using PLINK fast-epistasis; rj: direct application of Random Jungle using default parameters; rj.tuned: Random Jungle test by tuning parameters as in [Goldstein el. al.], mtry is set to GWAS SNPs, and ntree is set to 8000; SNPs in high LD are pruned out ( of total ). These parameters lead to the best results in Goldstein el. al.4 Table S6. Overlap genes assigned to the top 213 SNP from different tests as in Table S4. RFF Chi2 epi rj rj.tuned RFF 196 95 9 98 121 Chi2 95 159 5 73 98 epi 9 5 138 8 9 rj 98 73 8 185 103 rj.tuned 121 98 9 103 204 9. References 1. Breiman L: Random Forest. Machine Learning 2001; 45: 5-32. 2. Liaw A, Wiener M: Classification and Regression by randomForest. R News 2002; 2: 18-22. 3. Schwarz DF, Konig IR, Ziegler A: On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics 2010; 26: 1752-1758. 4. Goldstein BA, Hubbard AE, Cutler A, Barcellos LF: An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings. BMC Genetics 2010; 11: 49. 12 13