Machine Learning Approaches to Detect Epistasis and Application to the Neuregulin 1 Pathway in Schizophrenia
Kristin K. Nicodemus
NIHR BRC-MH, IoP, KCL
Why Machine Learning?
• Size of datasets increasing exponentially
• In genome-wide association studies, we assume the underlying biologic model is a network or pathway
• Current approaches consider single markers or simple combinations of markers
• A brute-force search of all possible interaction models is not computationally tractable
• Use machine learning to detect pathways?
Overview
• Why machine learning?
• Random Forest: Step-by-step
• Random Forest: Selection of Parameters
• Random Forest: Methodology Issues
• MLAs: Application to NRG1 pathway in schizophrenia
RF: Step-by-Step
(Figure slides: the random forest procedure, step by step.)
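The step-by-step slides above are figures, so here is a minimal R sketch of the general procedure they describe: draw a bootstrap sample, grow a tree on a random subset of predictors, repeat for many trees, and aggregate the votes. It uses hypothetical simulated data, and it simplifies a real random forest by sampling mtry predictors once per tree rather than at every split.

```r
## Minimal, simplified sketch of the RF procedure (hypothetical data).
## NOTE: a real random forest samples mtry predictors at EVERY split;
## for brevity this sketch samples them once per tree.
library(rpart)

set.seed(42)
n <- 300; p <- 20
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", seq_len(p))
y <- factor(ifelse(X$x1 + X$x2 + rnorm(n) > 0, "case", "control"))

ntree <- 100
mtry  <- floor(sqrt(p))
votes <- matrix("", n, ntree)

for (b in seq_len(ntree)) {
  boot <- sample(n, replace = TRUE)                    # 1. bootstrap sample
  vars <- sample(names(X), mtry)                       # 2. random predictor subset
  fit  <- rpart(y[boot] ~ ., data = X[boot, vars, drop = FALSE],
                method = "class")                      # 3. grow a classification tree
  votes[, b] <- as.character(predict(fit, X, type = "class"))  # 4. record each tree's vote
}

## 5. aggregate: majority vote across the forest
pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(pred == y)                                        # training-set vote accuracy
```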
Two Types of Variable Importance Measures
• Gini index: the average, across all trees in the forest, of the decrease in node impurity attributable to each predictor
• Permutation-based, unscaled: the average, across all trees in the forest, of the difference between the prediction accuracy (e.g., case or control) using the observed data and the same data with that predictor randomly permuted
• Permutation-based, scaled = unscaled / SE
• Note: the Gini decrease is computed on the in-bag samples used to grow each tree
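Both flavours can be pulled from a fitted randomForest object via importance(); the scale argument switches between the unscaled and scaled (divided-by-SE) permutation VIM. A minimal sketch on the built-in iris data, purely for illustration:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500,
                   importance = TRUE)   # importance = TRUE is needed for permutation VIMs

importance(rf, type = 2)                 # Gini index VIM (mean decrease in node impurity)
importance(rf, type = 1, scale = FALSE)  # unscaled permutation VIM
importance(rf, type = 1, scale = TRUE)   # scaled permutation VIM (unscaled / SE)
```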
RF: Selection of Parameters
• How to select the best parameters for RF?
• Using the R package randomForest:
• replace = subsampling or bootstrap sampling
• mtry = the number of predictors sampled at each split from the total set of predictors
• nodesize = minimum terminal node size
• ntree = number of trees in the forest
• flavour of variable importance measure
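A minimal sketch of how these parameters map onto a randomForest() call; the values below are arbitrary placeholders for illustration, not recommendations:

```r
library(randomForest)

rf <- randomForest(
  Species ~ ., data = iris,
  replace    = TRUE,    # bootstrap sampling (FALSE = subsampling)
  mtry       = 2,       # number of predictors sampled at each split
  nodesize   = 5,       # minimum terminal node size
  ntree      = 1000,    # number of trees in the forest
  importance = TRUE     # compute permutation-based VIMs as well as the Gini VIM
)
```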
RF: replace = T or F?
• When the scale of measurement varies across predictors, bootstrapping (resampling with replacement) may lead to biased VIMs, inflating the VIMs of predictors with more categories (Strobl et al., 2007)
• Sampling without replacement (subsampling) is computationally faster
• Generally use 63.2% subsampling
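A sketch of the 63.2% subsampling scheme mentioned above, using the sampsize argument (iris again as a stand-in dataset):

```r
library(randomForest)

n  <- nrow(iris)
rf <- randomForest(Species ~ ., data = iris, ntree = 1000, importance = TRUE,
                   replace  = FALSE,             # sample without replacement
                   sampsize = floor(0.632 * n))  # 63.2% subsample per tree
```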
RF: Setting mtry
• The original paper suggests mtry should be "tuned" for each dataset, with a default value of √(number of predictors)
• Tuning = start at the default, then try larger and smaller values to find the one with the lowest out-of-bag error rate
• Recent work suggests that, when many "noise" variables are present, setting mtry to a much larger value reduces the error rate and improves signal detection (Schwarz, Nicodemus, Koenig, Ziegler, in preparation)
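The tuning loop described above (start at the default, step mtry up and down, keep the value with the lowest out-of-bag error) is available as randomForest::tuneRF(); a minimal sketch:

```r
library(randomForest)

set.seed(1)
x <- iris[, 1:4]
y <- iris$Species
tuneRF(x, y,
       mtryStart  = floor(sqrt(ncol(x))),  # begin at the default value
       stepFactor = 2,                     # double / halve mtry at each step
       improve    = 0.01,                  # keep stepping while OOB error improves by > 1%
       ntreeTry   = 500)                   # trees grown per candidate mtry
```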
RF: Setting nodesize
• The original paper (and the default) sets nodesize = 1 for classification and 5 for regression, claiming no overfitting
• Stability of VIMs and the ability to detect weak signals increase when nodesize is set larger than the default (Nicodemus, submitted)
• The Gini index VIM showed strong evidence of overfitting when nodesize was set to the default and predictors were correlated (Nicodemus & Malley, 2009)
RF: Setting ntree
• ntree is the number of trees to grow in the forest
• ntree should be set to a large value (where "large" depends on the number of predictors) because:
• it gives all predictors a sufficient chance to be selected into the mtry subset and thus used in building trees
• it creates more accurate permutation-based VIMs
• it helps increase the stability of VIM rankings
• ntree is the most computationally expensive parameter
RF: Which VIM?
Gini index VIM
• Sensitive to predictors with differing numbers of categories (Strobl et al., 2007): prefers those with more cutpoints
• Sensitive to predictor correlation (Nicodemus & Malley, 2009; Nicodemus et al., 2010): prefers uncorrelated predictors
• Sensitive to predictor category frequencies: prefers predictors with large category frequencies (Nicodemus, submitted)
RF: Which VIM?
Scaled Permutation VIM
• Dependent on the size of the forest (Díaz-Uriarte & Alvarez de Andrés, 2006): the larger the forest, the more often a predictor may be used; this reduces the standard error and boosts VIMs for frequently used predictors, which is a problem because the split criterion is based on the Gini index
• Sensitive to predictor correlation (Nicodemus & Malley, 2009; Nicodemus et al., 2010) because correlation influences selection frequencies
RF: Which VIM?
Unscaled Permutation VIM
• Not dependent on the size of the forest
• Not sensitive to predictor correlation
• Not biased under H0
• Not sensitive to predictors with different numbers of categories
• Not sensitive to predictor category frequency
• Best all-round!
RF: Methodology Issues
• Simulation study assessed the impact of correlation on VIMs under H0 and HA
• Genetic simulation: 5 genes, 199 SNPs, 500 replicates per condition, retaining LD structure
RF: Methodology Issues
(Figures: simulation results for the Gini and permutation VIMs under H0, and for the Gini VIM under HA with OR 1.75.)
RF: Methodology Issues
• Current work: assessing the stability of rankings based on RF VIMs
• Stability is a key feature in the interpretation of ranked lists of predictors
• Previous work suggested using the Gini VIM because it was more stable
RF: Methodology Issues
• "Real" SNP data (723 SNPs)
(Figure: x-axis = rank in the original dataset; y-axis = rank in 100 90% jackknife samples.)
RF: Methodology Issues
• The Gini VIM is biased when predictors vary in the number of categories (Strobl et al., 2007)
• What about when the numbers of categories are equal but the category frequencies are not?
RF: Methodology Issues
• Under H0: Gini VIM SNP median rank = 1 (range 1-2) across 100 90% jackknife samples: stable!
• Unscaled permutation VIM SNP median rank = 198 (range 105-199)
• χ² p-value: 0.98; MAF: 0.41; largest r² = 0.21
• Genotype frequencies: cases (0.341, 0.500, 0.159), controls (0.341, 0.503, 0.156)
RF: Methodology Issues
• Under HA, OR = 2.0: Gini VIM SNP median rank = 101.5 (range 60-138) across 100 90% jackknife samples: unstable!
• Unscaled permutation VIM SNP median rank = 19 (range 6-35)
• χ² p-value: 1.3E-16
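The jackknife stability checks above can be sketched as follows: repeatedly fit a forest to a 90% subsample of the data and record the rank of a given SNP's VIM. The SNP data below are entirely hypothetical stand-ins (the real analyses used the datasets described on the slides).

```r
## Hedged sketch: rank stability of a VIM across 100 90% jackknife samples
## (hypothetical simulated SNP data).
library(randomForest)

set.seed(1)
n <- 200; p <- 50
X <- as.data.frame(matrix(sample(0:2, n * p, replace = TRUE), n, p))
X[] <- lapply(X, factor)                                # code genotypes as 3-level factors
names(X) <- paste0("snp", seq_len(p))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

rank_of <- function(fit, snp) {
  vim <- importance(fit, type = 1, scale = FALSE)[, 1]  # unscaled permutation VIM
  rank(-vim)[snp]                                       # rank 1 = most important
}

ranks <- replicate(100, {
  keep <- sample(n, size = floor(0.9 * n))              # 90% jackknife sample
  fit  <- randomForest(X[keep, ], y[keep], ntree = 500, importance = TRUE)
  rank_of(fit, "snp1")
})
summary(ranks)  # median and range of the SNP's rank across the jackknife samples
```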
MLAs: Application
• Application to the NRG1 pathway in schizophrenia: NRG1, ERBB4, AKT1, NOS1AP, NOS1 and DLG4 (121 SNPs)
• Schizophrenia case (N = 296) – control (N = 365) sample
• Follow-up: validation in an independent sample of healthy controls (N = 172) using fMRI
• Nicodemus KK et al., NRG1, ERBB4 and AKT1 epistasis increases schizophrenia risk and is biologically validated via functional neuroimaging in healthy controls. Arch Gen Psychiatry (in press)
MLAs: Study Design
• Consensus approach: 3 different MLAs (random forest, conditional inference forest, Monte Carlo logic regression)
• SNPs selected for follow-up had empirical p-values < 0.05 for VIMs using at least 2 of the 3 algorithms
• 7 SNPs were then subjected to formal tests of interaction (LRTs from logistic regression)
• Significant interactions were tested in the fMRI sample
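The slides do not spell out how the empirical VIM p-values were obtained; one standard approach, shown here purely as a hedged sketch on hypothetical data, is to permute the case/control labels and compare each SNP's observed VIM with its permutation null distribution.

```r
## Hedged sketch: empirical p-values for RF VIMs via phenotype permutation
## (hypothetical simulated SNP data; small B for illustration only).
library(randomForest)

set.seed(2)
n <- 200; p <- 20
X <- as.data.frame(matrix(sample(0:2, n * p, replace = TRUE), n, p))
X[] <- lapply(X, factor)
names(X) <- paste0("snp", seq_len(p))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

obs_vim <- importance(randomForest(X, y, ntree = 500, importance = TRUE),
                      type = 1, scale = FALSE)[, 1]

B <- 100
null_vim <- replicate(B, {
  y_perm <- sample(y)                                   # break the SNP-phenotype link
  importance(randomForest(X, y_perm, ntree = 500, importance = TRUE),
             type = 1, scale = FALSE)[, 1]
})

## empirical p-value per SNP: proportion of permuted VIMs >= observed VIM
emp_p <- (rowSums(null_vim >= obs_vim) + 1) / (B + 1)
head(sort(emp_p))
```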
MLAs: Results
• 3 significant case-control interactions, all validated via neuroimaging, including one 3-way interaction between NRG1, ERBB4 and AKT1
• Using fMRI in healthy controls, schizophrenia-risk-associated interacting alleles showed inefficient processing during the N-back task, reminiscent of altered cognitive processing in schizophrenia
• Nicodemus KK et al., Evidence of statistical epistasis between DISC1, CIT and NDEL1 impacting risk for schizophrenia: biological validation with functional neuroimaging. Hum Genet, Epub Jan 19, 2010
MLAs: Results
(Figure: risk-genotype combinations — R all vs. NR; R- only Akt1; R- Akt1 and NRG1; R- Akt1 and ErbB4.)
• Case-control: OR for the 3 risk genotypes = 27.13 (3.30, 223.03), p = 0.0020 (N cases = 17, N controls = 5); LRT p = 0.042; 3 other genotype combinations showed ORs with p < 0.05
• fMRI p-value for the interaction (risk vs. no risk) = 0.0050, FWE corrected; controls carrying the 3 schizophrenia risk genotypes showed ~5x less efficient processing in PFC during the N-back task
MLAs: Future Directions
• Andreas Ziegler, Inke Koenig, Daniel Schwarz (U Luebeck): backward elimination for feature selection using Random Jungle (www.randomjungle.org)
• Yan Meng (Broad): meta-machine learning, mixture-of-experts approaches for integration of different data types
• Carolin Strobl (U Munich) and Yan Meng: assessment and development of new VIMs for correlated predictors
kristin.nicodemus@kcl.ac.uk