Machine Learning Approaches to Detect Epistasis and Application to the Neuregulin 1 Pathway in Schizophrenia
Kristin K. Nicodemus, NIHR BRC-MH, IoP, KCL

Overview
• Why machine learning?
• Random Forest: Step-by-step
• Random Forest: Selection of Parameters
• RF: Methodology Issues
• MLAs: Application to NRG1 pathway in schizophrenia

Why Machine Learning?
• Size of datasets increasing exponentially
• In genome-wide association studies, assume the underlying biologic model = network or pathway
• Current approaches consider single markers or simple combinations
• A brute-force search over all possible interaction models is not computationally tractable
• Use machine learning to detect pathways?

RF: Step-by-Step
• [series of diagram slides walking through the random forest algorithm; original figures not reproduced here — see the sketch below]
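The step-by-step slides above were presented as diagrams. As a rough companion, here is a minimal sketch of the same workflow in R using the randomForest package referenced later in the deck; the data objects (snp.geno, status) and settings are hypothetical placeholders, not taken from the slides.

```r
## Minimal sketch of the random forest workflow on case-control SNP data.
## snp.geno (a data.frame of SNP genotypes coded as factors) and status
## (a case/control factor) are hypothetical example objects.
library(randomForest)

set.seed(2010)
rf <- randomForest(x = snp.geno,
                   y = status,
                   ntree = 2000,       # number of trees in the forest
                   importance = TRUE)  # compute permutation-based VIMs in addition to Gini

print(rf)                 # confusion matrix and out-of-bag (OOB) error estimate
head(importance(rf))      # per-SNP variable importance measures
```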
Two Types of Variable Importance Measures (VIMs)
• Gini index: average across all trees in the forest of the decrease in node impurity for each predictor (based on the in-bag samples)
• Permutation-based, unscaled: average across all trees in the forest of the difference between the prediction accuracy (e.g., case or control) using the observed data and the same data with that predictor randomly permuted
• Permutation-based, scaled = unscaled / SE

RF: Selection of Parameters
• How to select the best parameters for RF?
• Using the R package randomForest (a sketch of these settings appears after the nodesize slide below):
• replace = subsampling or bootstrap sampling
• mtry = the number of predictors to sample at each split from the total set of predictors
• nodesize = minimum terminal node size
• ntree = number of trees in the forest
• flavour of variable importance measure

RF: replace = T or F?
• When the scale of measurement varies across predictors, the use of bootstrapping (resampling with replacement) may lead to biased VIMs due to inflation of VIMs for predictors with more categories (Strobl et al., 2007)
• Sampling without replacement (subsampling) is computationally faster
• Generally use 63.2% subsampling

RF: Setting mtry
• Original paper suggests mtry should be "tuned" for each dataset, with a default value of √(N predictors)
• Tuning = start at the default, then try larger and smaller values to find the one with the lowest out-of-bag error rate
• Recent work suggests that setting mtry to a much larger number when there are many "noise" variables reduces the error rate and improves signal detection (Schwarz, Nicodemus, Koenig, Ziegler, in preparation)

RF: Setting nodesize
• Original paper (and default): nodesize = 1 for classification and 5 for regression; claimed no overfitting
• Stability of VIMs and the ability to detect weak signals are increased when nodesize is set larger than the default (Nicodemus, submitted)
• The Gini index VIM showed strong evidence of overfitting when nodesize was set to the default and predictors were correlated (Nicodemus & Malley, 2009)
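A hedged sketch of how the parameter choices above map onto randomForest arguments. The specific values (mtry = p/2, nodesize = 20, ntree = 5000) are illustrative assumptions; the slides only state the general directions (63.2% subsampling without replacement, mtry larger than the √p default when many noise SNPs are present, nodesize larger than the default of 1).

```r
## Sketch: parameter choices from the preceding slides expressed as
## randomForest arguments. snp.geno / status are the hypothetical objects
## from the earlier sketch; the numeric values are illustrative only.
library(randomForest)

n <- nrow(snp.geno)
p <- ncol(snp.geno)

rf <- randomForest(x = snp.geno, y = status,
                   replace    = FALSE,             # subsample rather than bootstrap
                   sampsize   = floor(0.632 * n),  # "generally use 63.2% subsampling"
                   mtry       = floor(p / 2),      # larger than the default sqrt(p)
                   nodesize   = 20,                # larger than the classification default of 1
                   ntree      = 5000,              # grow a large forest (see the ntree slide below)
                   importance = TRUE)

## mtry can also be tuned against the OOB error rate, e.g. with tuneRF():
## tuneRF(x = snp.geno, y = status, mtryStart = floor(sqrt(p)), stepFactor = 2)
```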
RF: Setting ntree
• ntree is the number of trees to grow in the forest
• ntree should be set to a large size (where "large" depends on the number of predictors) because:
• it allows all predictors sufficient chance to be selected into the mtry subset and thus used in building trees
• it creates more accurate permutation-based VIMs
• it helps increase the stability of VIM rankings
• ntree is also the most computationally expensive parameter

RF: Which VIM? Gini index VIM
• Sensitive to predictors with differing numbers of categories (Strobl et al., 2007) – prefers those with more cutpoints
• Sensitive to predictor correlation (Nicodemus & Malley, 2009; Nicodemus et al., 2010) – prefers uncorrelated predictors
• Sensitive to predictor category frequencies – prefers predictors with large category frequencies (Nicodemus, submitted)

RF: Which VIM? Scaled Permutation VIM
• Dependent on the size of the forest (Díaz-Uriarte & Alvarez de Andrés, 2006) – the larger the forest, the more often a predictor may be used; this reduces the standard error and boosts VIMs for frequently used predictors, which is a problem because the split criterion is based on the Gini index
• Sensitive to predictor correlation (Nicodemus & Malley, 2009; Nicodemus et al., 2010), because correlation influences selection frequencies

RF: Which VIM? Unscaled Permutation VIM
• Not dependent on the size of the forest
• Not sensitive to predictor correlation
• Not biased under H0
• Not sensitive to predictors with different numbers of categories
• Not sensitive to predictor category frequency
• Best all-round!
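To make the three flavours concrete, here is a small sketch of how each VIM is extracted from a fitted randomForest object (rf, as in the earlier hypothetical sketches; the forest must have been grown with importance = TRUE).

```r
## Extracting the three VIM flavours discussed above from a fitted forest
## (rf from the earlier sketches, grown with importance = TRUE).
gini.vim      <- importance(rf, type = 2)                 # Gini index VIM (mean decrease in node impurity)
perm.scaled   <- importance(rf, type = 1, scale = TRUE)   # scaled permutation VIM (mean decrease in accuracy / SE)
perm.unscaled <- importance(rf, type = 1, scale = FALSE)  # unscaled permutation VIM ("best all-round" per the slide)

## Rank SNPs by the unscaled permutation VIM
head(sort(perm.unscaled[, 1], decreasing = TRUE))
```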
RF: Methodology Issues
• Simulation study assessed the impact of correlation on VIMs under H0 and HA
• Genetic simulation: 5 genes, 199 SNPs, 500 replicates per condition, retained LD structure
• [figures: distributions of Gini and permutation VIMs under H0, and of the Gini VIM under HA, OR 1.75; original plots not reproduced here]

RF: Methodology Issues — VIM stability
• Current work: assessing stability of rankings using RF VIMs
• Stability is a key feature in the interpretation of ranked lists of predictors
• Previous work suggested using the Gini VIM as it was more stable
• [figure: "real" SNP data (723 SNPs); x-axis: rank in the original dataset, y-axis: rank in 100 90% jackknife samples]
• Gini VIM is biased when predictors vary in the number of categories (Strobl et al., 2007) – how about when the numbers of categories are equal but category frequencies are not?

Under H0
• Gini VIM: SNP median rank = 1 (range 1–2) across 100 90% jackknife samples: stable!
• Unscaled permutation VIM: SNP median rank = 198 (range 105–199)
• χ² p-value: 0.98; MAF: 0.41; largest r² = 0.21
• Genotype frequencies: cases (0.341, 0.500, 0.159), controls (0.341, 0.503, 0.156)

Under HA, OR = 2.0
• Gini VIM: SNP median rank = 101.5 (range 60–138) across 100 90% jackknife samples: unstable!
• Unscaled permutation VIM: SNP median rank = 19 (range 6–35)
• χ² p-value: 1.3E-16
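The jackknife stability check above can be approximated with a short loop: repeatedly refit the forest on 90% subsamples and record each SNP's importance rank. This is a sketch under the assumed object names (snp.geno, status), not the code used for the slides.

```r
## Sketch of a 90% jackknife rank-stability check for one VIM flavour
## (unscaled permutation importance). Not the original analysis code.
library(randomForest)

set.seed(2010)
n.jack <- 100
ranks  <- matrix(NA, nrow = n.jack, ncol = ncol(snp.geno),
                 dimnames = list(NULL, colnames(snp.geno)))

for (i in seq_len(n.jack)) {
  keep <- sample(nrow(snp.geno), size = floor(0.9 * nrow(snp.geno)))
  rf.i <- randomForest(x = snp.geno[keep, ], y = status[keep],
                       ntree = 2000, importance = TRUE)
  vim  <- importance(rf.i, type = 1, scale = FALSE)[, 1]
  ranks[i, names(vim)] <- rank(-vim)      # rank 1 = most important
}

## Median and range of ranks across jackknife samples, per SNP
apply(ranks, 2, function(r) c(median = median(r), min = min(r), max = max(r)))
```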
MLAs: Application
• Application to the NRG1 pathway in schizophrenia: NRG1, ERBB4, AKT1, NOS1AP, NOS1 and DLG4 (121 SNPs)
• Schizophrenia case (N = 296) – control (N = 365) sample
• Follow-up: validation in an independent sample of healthy controls (N = 172) using fMRI
• Nicodemus KK et al. NRG1, ERBB4 and AKT1 epistasis increases schizophrenia risk and is biologically validated via functional neuroimaging in healthy controls. Arch Gen Psychiatry (in press)

MLAs: Study Design
• Consensus approach: 3 different MLAs (random forest, conditional inference forest, Monte Carlo logic regression)
• SNPs selected for follow-up had empirical p-values < 0.05 for VIMs using at least 2 of the 3 algorithms
• 7 SNPs were then subjected to formal tests of interaction (likelihood ratio tests from logistic regression; a sketch of such a test appears at the end of this deck)
• Significant interactions were tested in the fMRI sample

MLAs: Results
• 3 significant case–control interactions, all validated via neuroimaging, including one 3-way interaction between NRG1–ERBB4–AKT1
• Using fMRI in healthy controls, schizophrenia-risk-associated interacting alleles showed inefficient processing during the N-back task, reminiscent of altered cognitive processing in schizophrenia
• Nicodemus KK et al. Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging. Hum Genet, Epub Jan 19, 2010

MLAs: Results
• [figure: comparison of risk-genotype groups – R: Akt1 and ErbB4; R: Akt1 and NRG1; R: Akt1 only; R (all) vs. NR (no risk)]
• Case–control: OR for carrying all 3 risk genotypes = 27.13 (3.30–223.03), p = 0.0020 (N cases = 17, N controls = 5); LRT p = 0.042; 3 other genotype combinations showed ORs with p < 0.05
• fMRI: p-value for the risk vs. no-risk interaction = 0.0050, FWE corrected; controls carrying all 3 schizophrenia risk genotypes showed ~5x less efficient processing in PFC during the N-back task

MLAs: Future Directions
• Andreas Ziegler, Inke Koenig, Daniel Schwarz (U Luebeck): backward elimination for feature selection using Random Jungle (www.randomjungle.org)
• Yan Meng (Broad): meta-machine learning, mixture-of-experts approaches for integration of different data types
• Carolin Strobl (U Munich) and Yan Meng: assessment/development of new VIMs for correlated predictors

kristin.nicodemus@kcl.ac.uk
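As an appendix-style note, here is a sketch of the kind of formal interaction test mentioned on the Study Design slide: a likelihood ratio test comparing logistic regression models with and without an interaction term. The variable names (snp1, snp2, status) are hypothetical, and this is not the analysis code from the study.

```r
## Sketch of a likelihood ratio test (LRT) for a SNP x SNP interaction
## in logistic regression. snp1, snp2 (genotype factors) and status
## (case/control factor) are hypothetical example objects.
main.only <- glm(status ~ snp1 + snp2, family = binomial)
with.int  <- glm(status ~ snp1 * snp2, family = binomial)

anova(main.only, with.int, test = "Chisq")   # LRT p-value for the interaction term
```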