Bioinformatics
Multifactor Dimensionality Reduction
Kristel Van Steen, PhD, ScD (kristel.vansteen@ulg.ac.be)
Université de Liege - Institut Montefiore
2008-2009

Outline
- Setting the scene
- Analysis methods for gene-gene interactions
  - Traditional vs non-traditional
  - MDR, MB-MDR, FAM-MDR
- The future: work in progress

Setting the scene

Genetic Architecture of Disease
- The number of genes that impact disease susceptibility
- The distribution of alleles and genotypes at those genes
- The manner in which the alleles and genotypes impact disease susceptibility
(Weiss 1993)

Complications in disentangling?
There are likely to be many susceptibility genes, each with combinations of rare and common alleles and genotypes that impact disease susceptibility primarily through non-linear interactions with genetic and environmental factors.

Terminology: Epistasis
Does evidence of statistical epistasis necessarily imply genetical or biological epistasis? (Moore 2004)

Analysis Methods: Traditional vs Non-Traditional
Traditional methods involving single markers have limited use; more advanced and efficient methods are needed to identify gene interactions and epistatic patterns of susceptibility.

Alternative Methods
- Tree-based methods:
  - Recursive Partitioning (Helix Tree)
  - Random Forests (R, CART)
- Pattern recognition methods:
  - Symbolic Discriminant Analysis (SDA)
  - Mining association rules
  - Neural networks (NN)
  - Support vector machines (SVM)
- Data reduction methods:
  - DICE (Detection of Informative Combined Effects)
  - MDR (Multifactor Dimensionality Reduction)
  - Logic regression
  - …
(e.g., Onkamo and Toivonen 2006)

Gene Interaction Models
- Non-parametric: appealing because it makes no distributional assumptions about the genotype-phenotype effect
- Parametric: appealing because adjustment for confounding variables and main effects is easy
  - Severe limitations when there are too many independent variables in relation to the number of observed outcome events

Out-of-control curse?
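The size of the search space can be made concrete by counting SNP combinations. The sketch below is only an illustration: it assumes a panel of 500,000 SNPs and counts unordered subsets of k SNPs with Python's `math.comb`; the helper name `n_combinations` is hypothetical.

```python
from math import comb


def n_combinations(n_snps: int, order: int) -> int:
    """Number of unordered SNP subsets of size `order` (hypothetical helper)."""
    return comb(n_snps, order)


if __name__ == "__main__":
    # Assumed panel size for illustration: 500,000 SNPs.
    for order in range(1, 6):
        count = n_combinations(500_000, order)
        print(f"{order}-SNP combinations: {count:.1e}")
```

Even at order 2 the count is already on the order of 10^11, which is why exhaustive parametric modelling of higher-order interactions quickly becomes impractical.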
~500,000 SNPs span 80% of common variation in the genome (HapMap)

SNPs in combination    Number of combinations
1                      5 x 10^5
2                      1 x 10^11
3                      2 x 10^16
4                      3 x 10^21
5                      2 x 10^26

Curse of Dimensionality
Bellman R (1961) Adaptive control processes: A guided tour. Princeton University Press:
“... Multidimensional variational problems cannot be solved routinely ... . This does not mean that we cannot attack them. It merely means that we must employ some more sophisticated techniques.”

Limitation of Regression
Having too many independent variables in relation to the number of observed outcome events.
Assuming n = 10 bi-allelic loci, the number of parameters needed for all k-locus terms is

# of parameters = $\binom{n}{k} \cdot 2^{k}$

Term                    # of parameters
Main effect             20
2-locus interaction     180
3-locus interaction     960
4-locus interaction     3360

For example, the 2-locus count is $\binom{10}{2} \cdot 2^{2} = 45 \cdot 4 = 180$.

Limitation of Regression
Fewer than 10 outcome events per independent variable can lead to biased estimates of the regression coefficients and to an increase in Type 1 and Type 2 errors.

# of parameters: $P \le \min(n_{case}, n_{control})/10 - 1$

For 200 cases and 200 controls, this formula suggests that no more than 19 (= 200/10 - 1) parameters should be estimated in a logistic regression model.

Multifactor Dimensionality Reduction (MDR)
to tackle the dimensionality problem of interaction detection

Publications since Conception (figure)

MDR for Interaction Detection
MDR creates a one-dimensional multi-locus genotype variable (high and low risk), which is evaluated for its ability to classify and predict disease status through cross-validation and permutation testing. (Ritchie et al 2001; Hahn et al 2003)
10-fold cross-validation gives 10 best models. The model with minimum PE is the best n-locus model.

MDR Steps
9/10 training data, 1/10 test data
(Ritchie et al 2003)

Two Measures for Selection of Best n-locus Model
- Misclassification error: the proportion of incorrect classifications in the training set.
- Prediction error (PE): the proportion of incorrect predictions in the test set.

Best Multi-factor Models
Best 2-factor model
Best 3-factor model
Best 4-factor model
Best 5-factor model
Best 6-factor model
. .
Best n-factor model

Model Selection and Evaluation
Among the best n-factor models, the best model is:
- The model with the minimum average PE.
- The model with the maximum average CVC (cross-validation consistency).
Rule of parsimony: if there is a tie, select the smaller model.

Significance of the Final Model
Via permutation tests:
- Randomize the case and control labels in the original dataset multiple times to create a set of permuted datasets.
- Run MDR on each permuted dataset.
- The maximum CVC and minimum PE identified for each permuted dataset are saved and used to create an empirical distribution for estimating a P-value.

Example: through simulation
- 200 cases and 200 controls; 10 SNPs: 1, 2, 3, …, 10.
- Disease etiology due to interaction between SNP 1 and SNP 6.
- Over 10 CVs and 10 runs

Advantages of MDR
- Simultaneous detection of multiple genetic loci associated with a discrete clinical endpoint in the absence of main effects.
- Non-parametric: overcomes the “curse of dimensionality” that limits the logistic regression model.
- Three genotype groups are considered separately
- Non-linear interactions between multiple polymorphisms can be detected in the absence of independent effects
- Low false positive rates

Disadvantages of MDR
Need to introduce parametrics?
- MDR in its initial layout cannot deal with main effects / confounding factors / non-dichotomous outcomes: GMDR / OR-MDR
- Low power in the presence of genetic heterogeneity

Power Simulation Set-Up
- No noise
- 5% genotyping error (GE)
- 5% missing data (MS)
- 50% phenocopy (PC)
- 50% genetic heterogeneity (GH)
- GE + MS, … (6 models)
- GE + MS + PC, … (4 models)
- GE + MS + PC + GH
Total: 16 models

Disadvantages of MDR
Noteworthy:
- Model selection on the basis of prediction accuracy
- One single higher-order interaction model is proposed
- Some important interactions could be missed due to pooling too many cells together
(Lou et al 2008)

Model Based Multifactor Dimensionality Reduction (MB-MDR)

Model Based MDR (MB-MDR)
MDR: X = {H, L}
MB-MDR: X = {H, L, O}
(H: high risk, L: low risk, O: no evidence)

Marginal Effects

MB-MDR in its simplest form
Step 1: New risk cell identification via an association test on each genotype cell c_j
- Parametric or nonparametric test of association
- OR_j
Step 2: Test X on Y
- Parametric or nonparametric
- OR_H, OR_L

MB-MDR in its simplest form
Step 3: Assess significance
- $W = [b/se(b)]^{2}$, $b = \ln(OR)$
- Adjust for the number of combined cells in the high and low risk categories
- Adjusted p-values: permutation null distributions

Improve power in the presence of heterogeneity
Power of MDR compared to MB-MDR under the aforementioned scenarios (figure)
(Calle, Urrea, Malats, Van Steen 2008, submitted)

MB-MDR in its simplest form
Step 4: Adjusted p-values need to be corrected for multiple testing

From MB-MDR to FAM-MDR
Extension to families:
- Perform polygenic analysis using the complete pedigree structure but ignore marker data.
- Derive residuals from this model (gives rise to independent quantitative “new” traits)
- Submit to MB-MDR
- Effect sizes can be derived using measured (multilocus) genotype models on the selected combinations of markers.

Motivation for FAM-MDR
The idea of removing “family trend due to genetic inheritance” was also adopted in the GRAMMAR approach of Aulchenko and colleagues.
FBAT?
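Step 3 of MB-MDR above defines the test statistic W = [b/se(b)]^2 with b = ln(OR). A minimal sketch of that arithmetic for a single 2x2 table follows. It rests on assumptions not stated in the slides: se(b) is estimated with Woolf's formula (square root of the summed reciprocal cell counts), all four counts are positive, and the function name `wald_statistic` and its argument names are hypothetical. It does not perform the cell-count adjustment or the permutation step.

```python
import math


def wald_statistic(case_exp: int, ctrl_exp: int, case_ref: int, ctrl_ref: int):
    """Return (OR, b, se_b, W) for a 2x2 table (hypothetical helper).

    case_exp / ctrl_exp: cases and controls in the risk category (e.g. H).
    case_ref / ctrl_ref: cases and controls in the reference category.
    Assumption: se(b) uses Woolf's formula, which the slides do not specify.
    """
    counts = (case_exp, ctrl_exp, case_ref, ctrl_ref)
    if min(counts) <= 0:
        raise ValueError("all four cell counts must be positive")

    odds_ratio = (case_exp * ctrl_ref) / (ctrl_exp * case_ref)
    b = math.log(odds_ratio)                      # b = ln(OR)
    se_b = math.sqrt(sum(1.0 / c for c in counts))  # Woolf standard error
    w = (b / se_b) ** 2                           # W = [b / se(b)]^2
    return odds_ratio, b, se_b, w


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    odds_ratio, b, se_b, w = wald_statistic(30, 10, 20, 40)
    print(f"OR={odds_ratio:.2f}  b={b:.3f}  se(b)={se_b:.3f}  W={w:.2f}")
```

In the procedure described above, W would then be referred to a permutation null distribution rather than used on its own.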
“For each particular method there are situations for which it is particularly well suited, and others where it performs badly compared to the best that can be done with that data… However, it is seldom known in advance which procedure will perform best or even well for any given problem.” (Hastie et al 2001)

Acknowledgements
- Helpful discussions: Marylyn Ritchie and co-workers (USA), Malu Calle and Victor Urrea (Spain)
- PhD students on the project: Jestinah Mahachie (e.g., MDR and longitudinal measurements), Vaness De Wit (e.g., MDR and multi-allelic markers; sparse cell management), Lizzy De Lobel (e.g., pre-screening algorithms)
- Post-doc on the project: Tom Cattaert (e.g., FAM-MDR simulations)