class notes

Bioinformatics
Multifactor Dimensionality Reduction
Kristel Van Steen, PhD, ScD
(kristel.vansteen@ulg.ac.be)
Université de Liege - Institut Montefiore
2008-2009
Outline
Setting the scene
Analysis methods for gene-gene interactions
  Traditional vs non-traditional
  MDR, MB-MDR, FAM-MDR
The future: work in progress
Setting the scene
Genetic Architecture of Disease
  The number of genes that impact disease susceptibility
  The distribution of alleles and genotypes at those genes
  The manner in which the alleles and genotypes impact disease susceptibility
(Weiss 1993)
Complications in disentangling?
There are likely to be many susceptibility genes
each with combinations of rare and common
alleles and genotypes that impact disease
susceptibility primarily through non-linear
interactions with genetic and environmental
factors
Terminology: Epistasis
Does evidence of statistical epistasis necessarily imply
genetical or biological epistasis?
(Moore 2004)
Analysis Methods
Traditional vs Non-Traditional
Traditional methods involving single markers
have limited use and more advanced and
efficient methods are needed to identify
gene interactions and epistatic patterns of
susceptibility
Alternative Methods
Tree-based methods:
  Recursive Partitioning (Helix Tree)
  Random Forests (R, CART)
Pattern recognition methods:
  Symbolic Discriminant Analysis (SDA)
  Mining association rules
  Neural networks (NN)
  Support vector machines (SVM)
Data reduction methods:
  DICE (Detection of Informative Combined Effects)
  MDR (Multifactor Dimensionality Reduction)
  Logic regression …
(e.g., Onkamo and Toivonen 2006)
Gene Interaction Models
Non-parametric:
  Appealing because no distributional assumptions are made about the genotype-phenotype effect
Parametric:
  Appealing because of easy adjustment for confounding variables and main effects
  Severe limitations when there are too many independent variables relative to the number of observed outcome events
Out-of-control curse?
~500,000 SNPs span 80% of common variation in genome (HapMap)
# of SNPs in combination    # of combinations
1                           5 x 10^5
2                           1 x 10^11
3                           2 x 10^16
4                           3 x 10^21
5                           2 x 10^26
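The growth in the number of SNP combinations can be checked with a few lines of Python. This is a minimal sketch, assuming the counts are plain binomial coefficients "500,000 choose k"; the function name `n_combinations` is illustrative and not part of the original slides. The results agree with the figures above in order of magnitude.

```python
from math import comb

N_SNPS = 500_000  # SNP count quoted above (HapMap)

def n_combinations(n_snps: int, order: int) -> int:
    """Number of distinct SNP subsets of size `order` (binomial coefficient)."""
    return comb(n_snps, order)

for k in range(1, 6):
    # Scientific notation makes the growth per added SNP easy to read.
    print(f"{k}-SNP combinations: {n_combinations(N_SNPS, k):.1e}")
```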
Curse of Dimensionality

Bellman R (1961) Adaptive control processes: A
guided tour. Princeton University Press:
“... Multidimensional variational problems cannot be
solved routinely ... . This does not mean that we cannot
attack them. It merely means that we must employ
some more sophisticated techniques.”
Limitation of Regression
Having too many independent variables in relation to the number of observed outcome events
Assuming 10 bi-allelic loci:
# of parameters for the k-locus terms among n loci = $\binom{n}{k} \cdot 2^k$

Term                   # of parameters
Main effect            20
2-locus interaction    180
3-locus interaction    960
4-locus interaction    3360
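The parameter counts for 10 bi-allelic loci can be reproduced as follows. A minimal sketch, assuming 2 parameters per locus (three genotypes, one taken as reference); the function name `n_interaction_params` is illustrative.

```python
from math import comb

def n_interaction_params(n_loci: int, order: int, params_per_locus: int = 2) -> int:
    """Parameters needed for all `order`-locus terms among `n_loci` loci.

    There are C(n_loci, order) locus subsets, and each subset needs
    params_per_locus ** order parameters.
    """
    return comb(n_loci, order) * params_per_locus ** order

for k in range(1, 5):
    print(k, n_interaction_params(10, k))
```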
Limitation of Regression

Fewer than 10 outcome events per independent variable
can lead to biased estimates of the regression
coefficients and to an increase in Type 1 and Type 2
errors.
# of parameters: $P \le \min(n_{\text{case}}, n_{\text{control}})/10 - 1$

For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.
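The rule of thumb is simple enough to express as a helper. A minimal sketch; the function name `max_parameters` and the `events_per_param` argument are illustrative, with 10 taken from the rule stated above.

```python
def max_parameters(n_cases: int, n_controls: int, events_per_param: int = 10) -> int:
    """Upper bound on the number of logistic regression parameters.

    Uses the smaller outcome group as the number of "events".
    """
    return min(n_cases, n_controls) // events_per_param - 1

print(max_parameters(200, 200))
```

With 200 cases and 200 controls the bound is 19, so even the 20 main-effect parameters for 10 bi-allelic loci already exceed it.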
Multifactor Dimensionality
Reduction
(MDR)
to tackle the dimensionality problem of
interaction detection
Publications since Conception
MDR for Interaction Detection

MDR creates a one-dimensional multi-locus
genotype variable (high and low risk), which is
evaluated for its ability to classify and predict
disease status through cross-validation and
permutation testing.
(Ritchie et al 2001; Hahn et al 2003)
MDR Steps
9/10 training data, 1/10 test data
10-fold cross-validation → 10 best models.
The model with minimum PE is the best n-locus model.
(Ritchie et al 2003)
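The 9/10 training and 1/10 test split can be sketched as a generic 10-fold cross-validation index generator. This is an illustration under assumptions, not the MDR software's own code: the function name `k_fold_splits`, the random shuffling, and the seed are all hypothetical choices.

```python
import random

def k_fold_splits(n_subjects: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)  # assumption: subjects are assigned at random
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data are all folds except the held-out one.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

for train, test in k_fold_splits(400):
    print(len(train), len(test))
```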
Two Measures for Selection of Best n-locus model

Misclassification error:
  The proportion of incorrect classifications in the training set.
Prediction error (PE):
  The proportion of incorrect predictions in the test set.
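Both measures can be illustrated for a single candidate model. The sketch below pools multi-locus genotype cells into high and low risk on the training data and then computes an error rate on any dataset. Several details are assumptions made for illustration: genotypes coded 0/1/2, status coded 1 (case) and 0 (control), a case:control ratio threshold of 1.0, and cells unseen in training treated as low risk. The function names are hypothetical.

```python
from collections import Counter

def fit_risk_cells(genotypes, status, loci, threshold=1.0):
    """Return the set of multi-locus genotype cells labelled high risk.

    genotypes: per-subject genotype tuples (coded 0/1/2, an assumption)
    status:    1 for case, 0 for control
    loci:      indices of the SNPs in the candidate model
    """
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, status):
        cell = tuple(g[i] for i in loci)
        (cases if y == 1 else controls)[cell] += 1
    high = set()
    for cell in set(cases) | set(controls):
        # High risk when the case:control ratio meets the threshold.
        if cases[cell] >= threshold * controls[cell]:
            high.add(cell)
    return high

def error_rate(genotypes, status, loci, high_cells):
    """Proportion of subjects whose status disagrees with the high/low label."""
    wrong = 0
    for g, y in zip(genotypes, status):
        predicted = 1 if tuple(g[i] for i in loci) in high_cells else 0
        wrong += predicted != y
    return wrong / len(status)
```

Calling `error_rate` on the training subjects gives the misclassification error; calling it on the held-out subjects, with `high_cells` fitted on the training subjects only, gives the PE.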
Best Multi-factor Models
Best 2-factor model
Best 3-factor model
Best 4-factor model
Best 5-factor model
Best 6-factor model
.
.
Best n-factor model
Model Selection and Evaluation

Among the best n-factor models, the best model is:
  The model with the minimum average PE.
  The model with the maximum average CVC (cross-validation consistency).
  Rule of parsimony: If there is a tie, select the smaller model.
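The selection rule can be written down as a small function. This is a sketch under an explicit assumption about how the criteria combine: if the minimum-PE model and the maximum-CVC model differ, the smaller of the two is chosen. The `ModelSummary` type, the function name, and the numbers in the usage lines are hypothetical.

```python
from typing import NamedTuple

class ModelSummary(NamedTuple):
    n_factors: int
    avg_pe: float   # average prediction error over the CV folds
    avg_cvc: float  # average cross-validation consistency

def select_final_model(models):
    """Pick the final model among the best n-factor models."""
    # Ties within a criterion are broken in favour of the smaller model.
    by_pe = min(models, key=lambda m: (m.avg_pe, m.n_factors))
    by_cvc = max(models, key=lambda m: (m.avg_cvc, -m.n_factors))
    if by_pe == by_cvc:
        return by_pe
    # Assumption: when the two criteria disagree, parsimony decides.
    return min(by_pe, by_cvc, key=lambda m: m.n_factors)

candidates = [
    ModelSummary(2, 0.30, 9.0),  # hypothetical values
    ModelSummary(3, 0.28, 6.0),
    ModelSummary(4, 0.35, 5.0),
]
print(select_final_model(candidates))
```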
Significance of the Final Model
Via permutation tests:
  Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets.
  Run MDR on each permuted dataset.
  The maximum CVC and minimum PE identified for each permuted dataset are saved and used to create an empirical distribution for estimating a P-value.
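The permutation procedure has the following generic shape. This is a minimal sketch, not the MDR software's implementation: `statistic` stands in for a full MDR run that returns a summary where larger means stronger (for example the maximum CVC, or 1 minus the minimum PE), and the "+1" correction in the P-value is a common convention assumed here. All names are hypothetical.

```python
import random

def permutation_p_value(status, statistic, n_permutations=1000, seed=0):
    """Empirical P-value of statistic(status) under shuffled case/control labels."""
    rng = random.Random(seed)
    observed = statistic(status)
    shuffled = list(status)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # randomize the case and control labels
        if statistic(shuffled) >= observed:
            exceed += 1
    # Assumption: count the observed dataset as one member of the null set.
    return (exceed + 1) / (n_permutations + 1)

# Stand-in statistic: agreement between labels and a fixed binary predictor.
predictor = [1] * 20 + [0] * 20
labels = [1] * 20 + [0] * 20

def agreement(y):
    return sum(p == s for p, s in zip(predictor, y))

print(permutation_p_value(labels, agreement))
```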
Example: through simulation
200 cases and 200 controls;
10 SNPs: 1, 2, 3, …, 10.
Disease etiology due to interaction
between SNP 1 and SNP 6.
Over 10 CVs and
10 runs
Advantages of MDR
  Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in the absence of main effects.
  Non-parametric: overcomes the “curse of dimensionality” that limits logistic regression models.
  Three genotype groups are considered separately.
  Non-linear interactions between multiple polymorphisms can be detected in the absence of independent effects.
  Low false positive rates.
Disadvantages of MDR
Need to introduce parametrics?
  MDR in its initial layout cannot deal with main effects / confounding factors / non-dichotomous outcomes:
    GMDR / OR-MDR
  Low power in the presence of genetic heterogeneity
Power Simulation Set-Up
Noise sources:
  5% genotyping error (GE)
  5% missing data (MS)
  50% phenocopy (PC)
  50% genetic heterogeneity (GH)
Scenarios:
  no noise
  single sources: GE, MS, PC, GH
  pairs: GE+MS, … (6 models)
  triples: GE+MS+PC, … (4 models)
  all four: GE+MS+PC+GH
  Total: 16 models
Disadvantages of MDR
Noteworthy:
  Model selection on the basis of prediction accuracy
  One single higher-order interaction model is proposed
  Some important interactions could be missed due to pooling too many cells together
(Lou et al 2008)
Model Based Multifactor
Dimensionality Reduction
(MB-MDR)
Model Based MDR (MB-MDR)
MDR: X = {H, L} (high risk, low risk)
MB-MDR: X = {H, L, O} (high risk, low risk, no evidence)
Marginal Effects
MB-MDR in its simplest form

Step 1: New risk cell identification via association test on each genotype cell c_j
  Parametric or nonparametric test of association
  OR_j
Step 2: Test X on Y
  Parametric or nonparametric
  OR_H, OR_L
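Step 1 can be sketched for a case-control outcome as follows. The choices here are assumptions for illustration, not the published MB-MDR settings: each cell is compared against all other cells pooled, using a Pearson chi-square test with 1 df; a P-value cut-off of 0.1 decides between a risk label and O; the direction of OR_j decides between H and L. All function and parameter names are hypothetical.

```python
from math import erfc, sqrt

def chi2_1df_p(a, b, c, d):
    """P-value of the Pearson chi-square test (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of association
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return erfc(sqrt(stat / 2))  # upper tail of the chi-square distribution, 1 df

def label_cells(cell_counts, p_threshold=0.1):
    """Map each cell to 'H', 'L' or 'O'; cell_counts maps cell -> (cases, controls)."""
    total_cases = sum(ca for ca, _ in cell_counts.values())
    total_controls = sum(co for _, co in cell_counts.values())
    labels = {}
    for cell, (ca, co) in cell_counts.items():
        rest_ca, rest_co = total_cases - ca, total_controls - co
        if chi2_1df_p(ca, co, rest_ca, rest_co) >= p_threshold:
            labels[cell] = "O"
        else:
            # Cross-products give the direction of OR_j without dividing by zero.
            labels[cell] = "H" if ca * rest_co > co * rest_ca else "L"
    return labels

counts = {"AA/BB": (40, 10), "Aa/Bb": (10, 40), "aa/bb": (50, 50)}  # hypothetical data
print(label_cells(counts))
```

The resulting three-level variable X is what Step 2 then tests against the outcome Y.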
MB-MDR in its simplest form

Step 3: assess significance
  $W = [b/\mathrm{se}(b)]^2$, $b = \ln(\mathrm{OR})$
  Adjust for the number of combined cells in the high and low risk categories
  Adjusted p-values: permutation null distributions
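For a single 2x2 table, W can be computed directly. This is a sketch under an assumption: se(b) is taken as the usual large-sample standard error of a log odds ratio, the square root of the sum of reciprocal cell counts, which matches a logistic regression with one binary predictor. The function name and the "in/out of a risk category" framing are illustrative.

```python
from math import log

def odds_ratio_and_wald(cases_in, controls_in, cases_out, controls_out):
    """Return (OR, W) comparing subjects inside a risk category with those outside.

    W = [b / se(b)]^2 with b = ln(OR) and se(b)^2 = 1/a + 1/b + 1/c + 1/d.
    """
    counts = (cases_in, controls_in, cases_out, controls_out)
    if min(counts) <= 0:
        raise ValueError("all four counts must be positive")
    odds_ratio = (cases_in * controls_out) / (controls_in * cases_out)
    b = log(odds_ratio)
    se_squared = sum(1 / c for c in counts)
    return odds_ratio, b * b / se_squared

print(odds_ratio_and_wald(40, 10, 60, 90))  # hypothetical counts
```

The permutation step then compares the observed W with the values obtained after shuffling the outcome, which is what makes the p-value adjust for the data-driven pooling of cells.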
Improve power in the presence of
heterogeneity
Power of MDR compared to MB-MDR under aforementioned scenarios
(Calle, Urrea, Malats, Van Steen 2008, submitted)
MB-MDR in its simplest form

Step 4:
Adjusted p-values need to be corrected for
multiple testing
From MB-MDR to FAM-MDR

Extension to families
  Perform polygenic analysis using the complete pedigree structure but ignore marker data.
  Derive residuals from this model (gives rise to independent quantitative “new” traits)
  Submit to MB-MDR
  Effect sizes can be derived using measured (multilocus) genotype models on the selected combinations of markers.
Motivation for FAM-MDR

The idea of removing “family trend due to
genetic inheritance” was also adopted in the
GRAMMAR approach of Aulchenko and
colleagues.
FBAT?
“For each particular method there are situations
for which it is particularly well suited, and others
where it performs badly compared to the best
that can be done with that data…
However, it is seldom known in advance which
procedure will perform best or even well for any
given problem.”
(Hastie et al 2001)
Acknowledgements
Helpful discussions:
Marylyn Ritchie and co-workers (USA), Malu Calle and Victor Urrea
(Spain)
PhD students on the project:
Jestinah Mahachie (e.g., MDR and longitudinal measurements),
Vaness De Wit (e.g., MDR and multi-allelic markers; sparse cell
management), Lizzy De Lobel (e.g., pre-screening algorithms)
Post-doc on the project:
Tom Cattaert (e.g., FAM-MDR simulations)