Lisa M. Lix, PhD P. Stat.
School of Public Health
Joint Seminar: Statistics and Collaborative Graduate Program in Biostatistics
January 5, 2012

Co-Authors: Tolu Sajobi, Bola Dansu
Funding:
◦ Canadian Institutes of Health Research
◦ Centennial Chair Program, University of Saskatchewan

Background
Description of Relative Importance Measures
Numeric Example
Monte Carlo Study: Design and Results
Discussion and Conclusions

m ≥ 2 correlated variables for N study participants, with $n_1$ participants in group 1 and $n_2$ participants in group 2 ($n_1 + n_2 = N$)
In many studies, the variables are assumed to follow a normal distribution, $N(\mu_{jk}, \sigma_{jk}^2)$, for k = 1, …, m and j = 1, 2
We will focus on the case where there are no missing observations

Do different measures of relative importance result in the same rankings of a set of correlated variables for distinguishing between two independent groups?
What factors affect the variable ranking performance of relative importance measures?

For exploratory analysis and model development
Organizational research:
◦ Relative contribution of various applicant characteristics in hire–not hire decisions made by managers
Genetics research:
◦ Relative contribution of individual genes to distinguishing between patients with and without chronic health conditions
Quality of life research:
◦ Relative importance of quality of life domains for distinguishing between patients who do and do not receive healthcare treatments

Back et al. (2008).
Journal of Biopharmaceutical Statistics
◦ Rankings of variable importance were used to identify a set of genes to classify life-threatening diseases according to prognosis or type
◦ Variable importance was assessed using a variety of techniques, including non-parametric recursive partitioning techniques

Statistical significance (e.g., t-test)
Practical significance (e.g., effect size)
Descriptive discriminant analysis (DDA): linear combination of variables that maximizes separation of the groups
Stepwise multivariate analysis of variance (MANOVA): the F-to-remove statistic measures the decrease in the inter-group Mahalanobis distance caused by removing each of the variables in sequence
Logistic regression analysis (LRA): contribution of each variable to the total predicted variance in the dichotomous outcome

Dominance analysis: Budescu, 1993
◦ General dominance analysis determines relative importance based on the average $\Delta R^2$ observed by adding a predictor to all possible subsets of the remaining predictors
Relative weights analysis: Johnson, 2000
◦ Creates a new set of variables that are orthogonal representations of the original set of variables

Denote $X_{ij}$ as the m × 1 vector of observations for the ith study participant in the jth group (i = 1, …, $n_j$; j = 1, 2)
$\bar{X}_j$ is the m × 1 vector of means for the jth group
The vector of discriminant function coefficients is estimated by

$$a = S^{-1}(\bar{X}_1 - \bar{X}_2),$$

where

$$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2}{N - 2}$$

and $S_1$ and $S_2$ are the variance-covariance matrices for groups 1 and 2, respectively

The kth standardized discriminant function coefficient is

$$a_k^* = a_k s_k,$$

where $a_k$ and $s_k$ are the kth estimated discriminant function coefficient and standard deviation, respectively
By placing a constraint on the discriminant function coefficients such that $a^T S a = 1$, where T is the transpose operator, the coefficients will range in value from -1 to +1

The parallel discriminant ratio coefficient for the kth variable is

$$q_k = a_k^* f_k,$$

where $f_k$ is the kth structure coefficient, the correlation between the kth variable and the discriminant function
Coefficients can take on positive and negative values

The total discriminant ratio coefficient for the kth variable is defined using $S_{T,kk}$, the (k,k)th element of $S_T$, where $S_T = T/(N - 1)$, $T = H + E$, and H and E are the hypothesis and error sum of squares and cross-product matrices, respectively
Coefficients have a lower bound of zero but no upper bound

For the kth variable, the F-to-remove statistic is

$$F_{(k)} = \frac{k_2\,(D^2 - D_{(k)}^2)}{k_3 + D_{(k)}^2},$$

where $k_2 = N - m$, $k_3 = N^2/(n_1 n_2)$,

$$D^2 = (\bar{X}_1 - \bar{X}_2)^T S^{-1} (\bar{X}_1 - \bar{X}_2)$$

is the squared Mahalanobis distance, and $D_{(k)}^2$ is the value of $D^2$ when the kth variable is omitted
Statistics take on positive values

The model is

$$\ln\!\left(\frac{p_l}{1 - p_l}\right) = A_l^T \beta,$$

where $A_l$ is the vector of (m + 1) observations for the lth study participant (l = 1, …, N), with the first element equal to one
$p_l = \Pr(y_l = 1 \mid A_l)$ is the probability that the lth study participant is a member of group 1, conditional on the explanatory variables
$\beta$ is the (m + 1) vector of coefficients to be estimated, with the first element equal to the model intercept, $\beta_0$

The estimated coefficient for the kth variable, $\hat{\beta}_k$, can be expressed in terms of $r_{\mathrm{logit}(\hat{p})k}$, $R_{(k)}^2$, and $R_{k|(k)}^2$, where $r_{\mathrm{logit}(\hat{p})k}$ is the correlation between the kth variable and the logit of the predicted probabilities, $R_{(k)}^2$ is the $R^2$ value for an LRA model in which the kth variable is excluded, and $R_{k|(k)}^2$ is the $R^2$ value for a model in which the kth variable is regressed on the remaining (m – 1) variables

Standardized logistic regression coefficients have also been used to assess relative importance. The kth standardized coefficient is

$$\hat{\beta}_k^* = \frac{\hat{\beta}_k s_k R}{s_{\mathrm{logit}(\hat{p})}},$$

where $\hat{\beta}_k$ is the estimated coefficient and $s_{\mathrm{logit}(\hat{p})}$ is the standard deviation of the logit of the predicted probabilities
Coefficients can take on positive and negative values

Pratt’s (1987) index for relative importance was originally proposed for multiple regression and then extended to LRA.
The index value for the kth variable is

$$d_k = \frac{\hat{\beta}_k^* \hat{\rho}_k}{R^2},$$

where $\hat{\rho}_k$ is the estimated correlation between the kth explanatory variable and the logit of the predicted probabilities
Coefficients can take on positive and negative values

Data are from the Manitoba Inflammatory Bowel Disease (IBD) Cohort Study
Started in 2002 and initially enrolled 388 patients who had recently been diagnosed with Crohn’s disease or ulcerative colitis
Health-related quality of life (HRQOL) data collected at regular intervals throughout the study
◦ SF-36: 8 domains
◦ IBD Questionnaire: 4 domains
A central theme of the study is the effect of disease activity on quality of life, stress, well-being, and coping with illness

Domain                  Active Disease (n1 = 244)   Inactive Disease (n2 = 105)
IBDQ
  Bowel Symptoms        4.92 (1.03)                 6.08 (0.76)
  Emotional Health      4.81 (1.05)                 5.85 (0.89)
  Social Function       4.09 (1.18)                 5.19 (1.05)
  Systemic Symptoms     5.62 (1.35)                 6.65 (0.64)
SF-36
  Bodily Pain           60.78 (24.15)               77.45 (26.11)
  Role Physical         63.48 (29.07)               83.65 (24.08)
  General Health        43.40 (19.52)               59.18 (17.01)
  Mental Health         60.33 (14.11)               66.62 (12.47)
  Physical Functioning  77.49 (21.73)               91.11 (14.41)
  Role Emotional        76.06 (23.98)               85.82 (20.11)
  Social Functioning    63.74 (27.20)               78.85 (27.10)
  Vitality              46.13 (16.39)               57.84 (14.49)

Domain                  t-statistic  SLRC    LPI     ALPI    SDFC    PDRC    FTR
IBDQ
  Bowel Symptoms        10.430*      0.463   0.471   0.376   0.587   0.542   5.034
  Emotional Health      8.840*       0.309   0.28    0.223   0.428   0.347   4.033
  Social Function       7.500*       0.183   0.165   0.132   0.044   -0.031  5.072
  Systemic Symptoms     7.980*       0.145   -0.117  -       0.083   -0.062  14.334
SF-36
  Bodily Pain           5.690*       0.103   0.066   0.053   0.103   0.057   0.504
  Role Physical         6.220*       0.015   -0.010  0.000   0.037   -0.022  6.099
  General Health        6.930*       0.135   0.095   0.076   0.226   0.149   12.334
  Mental Health         3.790*       0.143   -0.059  -       0.1910  -0.072  0.952
  Physical Functioning  5.890*       0.169   0.113   0.090   0.185   0.106   8.329
  Role Emotional        3.640*       0.171   -0.066  -       0.120   -0.043  0.508
  Social Functioning    4.770*       0.026   0.015   0.012   0.027   0.013   0.011
  Vitality              6.080*       0.074   0.049   0.039   0.029   0.017   6.911
Note: * denotes a test statistic that is statistically significant at α = .05/12 = .004

Domain                  SLRC  ALPI  SDFC  PDRC  FTR
IBDQ
  Bowel Symptoms        1     1     1     1     7
  Emotional Health      2     2     2     2     8
  Social Function       3     3     9     9     6
  Systemic Symptoms     6     -     8     -     1
SF-36
  Bodily Pain           9     6     7     5     11
  Role Physical         12    9     10    9     5
  General Health        8     5     3     3     2
  Mental Health         7     -     4     -     9
  Physical Functioning  5     4     5     4     3
  Role Emotional        4     -     6     -     10
  Social Functioning    11    8     12    7     12
  Vitality              10    7     11    6     4

SDFC: standardized discriminant function coefficient
PDRC: parallel discriminant ratio coefficient
TDRC: total discriminant ratio coefficient
FTR: F-to-remove statistic
SLRC: standardized logistic regression coefficient
LPI: logistic Pratt’s index

Number of variables (m = 4, 6, 8)
Total sample size (N = 60, 80, 140, 200)
Equality/inequality of group sizes
Magnitude and pattern of correlation among the variables
Group covariance homogeneity/heterogeneity
Group means
Shape of the population distribution

Let ρ denote the average correlation between the variables
◦ ρ = 0, 0.3, 0.6
Pattern of correlation
◦ Compound symmetric
◦ Unstructured
◦ Modified simplex

Mean Pattern  μ1                       D²
I             (2.5, 2, 1.5, 1)         13.5
II            (1.5, 1, 0.5, 2)         7.5
III           (1.0, 0.75, 0.5, 0.25)   1.9
IV            (0.75, 0.5, 0.25, 1.0)   1.9
Note: μ2 is the null vector

Mean Pattern  μ1                                       D²
I             (4.5, 4, 3.5, 3, 2.5, 2, 1.5, 1)         71.0
II            (2.5, 2, 1.5, 1, 0.5, 3, 3.5, 4)         47.0
III           (2, 1.75, 1.5, 1.25, 1, 0.75, 0.5, 0.25) 12.8
IV            (1.25, 1, 0.75, 0.5, 0.25, 1.5, 1.75, 2) 12.8
Note: μ2 is the null vector

Normal
◦ γ1 = 0; γ2 = 0
Skewed
◦ γ1 = 1.8; γ2 = 5.9
Heavy-Tailed
◦ γ1 = 0; γ2 = 33

All-variable correct ranking percentage: the percent of simulations in which the sample ranks were the same as the corresponding population ranks for all variables
Average per-variable correct ranking percentage: the percent of simulations in which a variable in the sample had the same rank as the variable in the population, averaged across all variables
Kendall’s concordance statistic (not reported in this presentation)

Mean Pattern  SDFC  PDRC  TDRC  FTR   SLRC  LPI
I             49.1  59.8  59.0  38.0  41.7  61.1
II            43.7  63.1  56.2  32.1  38.0  64.3
III           34.8  47.0  37.8  26.4  33.2  47.4
IV            37.0  54.3  41.1  28.3  34.8  54.7
Average       41.2  56.0  48.5  31.2  36.9  56.9

Mean Pattern  SDFC  PDRC  TDRC  FTR   SLRC  LPI
I             17.5  28.3  27.1  9.1   13.6  29.4
II            12.2  32.1  23.6  5.7   9.8   33.6
III           7.7   12.7  9.4   2.1   7.3   12.8
IV            8.1   21.1  11.0  3.8   7.6   21.4
Average       11.4  23.5  17.8  5.2   9.6   24.3

Corr. Scenario  SDFC  PDRC  TDRC  FTR   SLRC  LPI
1               60.3  63.3  63.2  40.2  55.0  66.3
2               45.9  63.2  51.0  32.6  42.4  63.6
3               32.2  65.9  42.5  25.8  25.7  65.4
4               39.7  52.1  45.1  29.7  36.5  53.1
5               25.8  34.2  38.6  27.0  24.3  33.9
6               43.0  57.6  50.5  31.8  37.8  58.5
Average         41.2  56.0  48.5  31.2  36.9  56.9

Scenario 1: ρ = 0, where ρ is the average correlation
Scenario 2: compound symmetric matrix with ρ = 0.3
Scenario 3: compound symmetric matrix with ρ = 0.6
Scenario 4: unstructured matrix with ρ = 0.3
Scenario 5: unstructured matrix with ρ = 0.6
Scenario 6: modified simplex matrix with correlations of 0.3 and 0.6 on alternating diagonals
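
The two correct ranking outcomes and the population D² values from the simulation design can be illustrated with a small Monte Carlo sketch. This is not the authors' simulation code: the function names are hypothetical, the data are normal with unit variances, only the SDFC measure is ranked, and ranking is by signed coefficient value (largest = rank 1), all of which are assumptions made for illustration.

```python
import numpy as np

def compound_symmetric(m, rho):
    """m x m correlation matrix: 1 on the diagonal, rho elsewhere."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def population_d2(mu1, sigma):
    """Squared Mahalanobis distance between mu1 and the null vector mu2."""
    mu1 = np.asarray(mu1, dtype=float)
    return float(mu1 @ np.linalg.solve(sigma, mu1))

def ranks_desc(values):
    """Rank 1 = largest value; ties broken by position (assumption)."""
    order = np.argsort(-np.asarray(values, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def correct_ranking_percentages(sample_ranks, population_ranks):
    """Return (all-variable %, average per-variable %) over replications."""
    match = np.asarray(sample_ranks) == np.asarray(population_ranks)
    all_variable = 100.0 * match.all(axis=1).mean()
    per_variable = 100.0 * match.mean(axis=0).mean()
    return float(all_variable), float(per_variable)

def simulate_sdfc(mu1, sigma, n1, n2, reps, seed=0):
    """Monte Carlo correct ranking percentages for the SDFC measure."""
    rng = np.random.default_rng(seed)
    mu1 = np.asarray(mu1, dtype=float)
    m = len(mu1)
    # Population SDFC: Sigma^{-1}(mu1 - mu2) scaled by the standard deviations.
    pop = np.linalg.solve(sigma, mu1) * np.sqrt(np.diag(sigma))
    pop_ranks = ranks_desc(pop)
    sample_ranks = np.empty((reps, m), dtype=int)
    for r in range(reps):
        x1 = rng.multivariate_normal(mu1, sigma, size=n1)
        x2 = rng.multivariate_normal(np.zeros(m), sigma, size=n2)
        # Pooled variance-covariance matrix S.
        s = ((n1 - 1) * np.cov(x1, rowvar=False)
             + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
        a = np.linalg.solve(s, x1.mean(axis=0) - x2.mean(axis=0))
        sample_ranks[r] = ranks_desc(a * np.sqrt(np.diag(s)))
    return correct_ranking_percentages(sample_ranks, pop_ranks)

# Mean patterns for m = 4 (mu2 is the null vector), with rho = 0.
patterns = {"I": [2.5, 2, 1.5, 1], "II": [1.5, 1, 0.5, 2],
            "III": [1.0, 0.75, 0.5, 0.25], "IV": [0.75, 0.5, 0.25, 1.0]}
sigma0 = compound_symmetric(4, 0.0)
for name, mu1 in patterns.items():
    print(name, population_d2(mu1, sigma0))

# Equal group sizes, N = 60, compound symmetric correlation with rho = 0.3.
print(simulate_sdfc(patterns["I"], compound_symmetric(4, 0.3), 30, 30, reps=500))
```

With ρ = 0 and unit variances, the D² values for the four-variable mean patterns come out as 13.5, 7.5, 1.875 and 1.875, in line with the table. The all-variable percentage can never exceed the average per-variable percentage, because a replication counts toward the former only when every variable's rank matches.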
The LPI and PDRC measures tended to result in the highest percentages of correct rankings and values of the concordance statistic
The FTR measure tended to result in the lowest percentages of correct rankings and concordance, followed by the SLRC measure

The LPI and PDRC measures were relatively insensitive to many of the correlation structures
However, they resulted in a substantial drop in correct ranking percentages when the data exhibited an unstructured correlation pattern with a high average correlation (ρ = 0.6)
Differences in correct ranking percentages across the correlation structures were smaller for the TDRC and SLRC measures than for other measures and were smallest for the FTR measure

Violations of the assumption of covariance homogeneity had a very small effect on the correct ranking rates
The correct ranking percentages for all measures were consistently lower for heavy-tailed than for skewed distributions

The choice of measures of relative importance depends on the perspective the researcher wants to take on the data
◦ Contribution of a variable to the discriminant function score
◦ Contribution of a variable to the grouping variable effect
◦ Contribution of a variable to explaining variation in a regression model

Inference for relative importance measures and ranks
Comparisons with recent developments in relative importance measures that are more computationally intensive (e.g., relative weights)
Extensions to more than two groups
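
For readers who want to try the discriminant-based measures from the talk on their own data, the sketch below computes SDFC, structure coefficients, PDRC and FTR for two groups. It is a hedged illustration, not the authors' implementation: `dda_measures` is a hypothetical name, the structure coefficients are assumed to use the pooled within-group covariance matrix S, and the F-to-remove statistic is assumed to have the form k2(D² − D²(k)) / (k3 + D²(k)) with k2 = N − m and k3 = N²/(n1 n2).

```python
import numpy as np

def dda_measures(x1, x2):
    """SDFC, structure coefficients, PDRC and FTR for two groups.

    x1, x2: (n_j x m) arrays of observations for groups 1 and 2.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, m = x1.shape
    n2 = x2.shape[0]
    n = n1 + n2
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    # Pooled variance-covariance matrix S.
    s = ((n1 - 1) * np.cov(x1, rowvar=False)
         + (n2 - 1) * np.cov(x2, rowvar=False)) / (n - 2)
    sd = np.sqrt(np.diag(s))

    a = np.linalg.solve(s, diff)      # a = S^{-1}(xbar1 - xbar2)
    a = a / np.sqrt(a @ s @ a)        # impose the constraint a'Sa = 1
    sdfc = a * sd                     # a*_k = a_k s_k
    # Correlation of each variable with the discriminant score,
    # based on the pooled within-group covariance (assumption).
    structure = (s @ a) / sd
    pdrc = sdfc * structure           # q_k = a*_k f_k

    d2 = diff @ np.linalg.solve(s, diff)
    k2 = n - m
    k3 = n ** 2 / (n1 * n2)
    ftr = np.empty(m)
    for k in range(m):
        idx = np.delete(np.arange(m), k)
        # D^2 with the kth variable omitted.
        d2_k = diff[idx] @ np.linalg.solve(s[np.ix_(idx, idx)], diff[idx])
        # The "+" in the denominator is an assumed reconstruction.
        ftr[k] = k2 * (d2 - d2_k) / (k3 + d2_k)
    return {"sdfc": sdfc, "structure": structure, "pdrc": pdrc, "ftr": ftr}

# Illustrative data only: normal, uncorrelated, mean pattern I for m = 4.
rng = np.random.default_rng(2012)
group1 = rng.normal(size=(30, 4)) + np.array([2.5, 2.0, 1.5, 1.0])
group2 = rng.normal(size=(30, 4))
for name, values in dda_measures(group1, group2).items():
    print(name, np.round(values, 3))
```

Under the constraint aᵀSa = 1 and the within-group definition of the structure coefficients used here, the PDRC values sum to one, which gives a quick sanity check on any implementation.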