Discriminant Analysis (DA)

1. Basic Concept

a) Dependent Variable: Categorical (nominal). If the dependent variable is ordinal or metric, convert it into mutually exclusive and exhaustive categories. If the number of categories is large, one may choose the polar extremes approach, in which only the extreme categories are analyzed and the middle ones are excluded from the analysis. The number of observations in each category (the group size) should be at least 20. If the group sizes vary significantly, randomly select observations from the larger groups so that the resulting samples are comparable in size to the smaller groups.
   (i) Two-group DA: the dependent variable has two categories, e.g. Male = 1, Female = 2
   (ii) Multiple Discriminant Analysis (MDA): the dependent variable has more than two categories

b) Independent Variables: Metric. The recommended ratio of observations to independent variables is at least 20:1; the minimum ratio is 5:1. This ratio applies to all variables initially included in the analysis (even if some are later eliminated by the stepwise procedure).

2. Randomly split the sample into two parts (e.g., 50-50, 60-40, or 75-25), using a proportionately stratified sampling procedure:

a) Analysis Sample: for estimating the discriminant function
b) Validation (Holdout) Sample: for validating the discriminant function

Ex. Split 60-40:
   Total sample: 80 males and 50 females
   Analysis Sample: 48 males and 30 females
   Holdout Sample: 32 males and 20 females

3. Assumptions of DA

a) No outliers. Check for them with both the SRE (Studentized Residuals) and the Mahalanobis distance. Remove any outliers from further analysis.
b) Linear relationships between the dependent variable and the independent variables. Check linearity with the procedures explained in the previous sessions. Correct any nonlinear relationships with appropriate variable transformations.
c) No multicollinearity among the independent variables.
Check the Pooled Within-Groups Correlation Matrix and analyze the collinearity diagnostics.
d) Equal covariance matrices for the groups defined by the dependent variable. Use Box’s M test. If Sig. > 0.05, we cannot reject Ho: The population covariance matrices are equal. Note: When some of the assumptions of DA are not clearly met (e.g. normality is doubtful), a researcher may use a threshold lower than 0.05, e.g. 0.03.
e) The independent variables are normally distributed. Use the appropriate tests explained in the previous sessions.

4. Apply DA to the Analysis Sample

a) Direct (Simultaneous) Method: when the discrimination is to be based on ALL independent variables
b) Stepwise (Elimination) DA: when only a subset of the most discriminating independent variables is to be included in the discriminant function

5. Example: Discriminant Analysis (DA)

Convert file File4ab.xls to File4ab.sav (the first 30 rows are the Analysis Sample, and the last 12 rows are the Holdout Sample).

Transform → Recode → Into Different Variables → Paste the variable “no” as Numeric Variable into Output Variable (“no1”) → Old and New Values: Range 1 thru 30 → 1; Range 31 thru 42 → 0; Continue → Change

Analyze → Classify → Discriminant → Grouping Variable (visit) → Define Range (Minimum 1; Maximum 2) → Independents (income, travel, vacation, hsize, age) → Selection Variable: no1 = 1 → Statistics (check all boxes) → Classify (Limit cases to first 30) → Save (check all boxes) → Method (Enter independents together) → OK

Interpretation of the results based on the Analysis Sample, n = 30 (going from the top of the computer printout):

(i) Group Means and Standard Deviations:
   Income: large difference in means; large standard deviations in each group
   Age: small difference in means; large standard deviations in each group
   Travel, Vacation, Hsize: small difference in means; small standard deviations in each group

   Test of Equality of Group Means (Wilks’ Lambda):
   Income: Sig. = 0.000
   Travel: Sig. = 0.143
   Vacation: Sig. = 0.021
   Hsize: Sig. = 0.001
   Age: Sig. = 0.257

   Income, Vacation, and Hsize show significant univariate differences (at the 0.05 level) between the two groups (Group 1: those who visited the resort during the last 2 years; Group 2: those who did not). There is no significant difference in resort visits based on Attitude toward travel (Travel) or Age. Income, Vacation, and Hsize are therefore the variables most likely to discriminate between the two groups of visitors.
   If you are interested in how well only these three variables discriminate between the two groups → use the Stepwise procedure.
   If you are interested in the contribution of each of the five independent variables → use the Direct procedure (which we follow below).

(ii) Pooled within-groups correlation matrix: low correlations → lack of multicollinearity among the independent variables (rule of thumb: a correlation coefficient < 0.30)

(iii) Box’s M Test of Equality of Covariance Matrices: Sig. = 0.141 > 0.05 → cannot reject Ho: The population covariance matrices are equal

(iv) Summary of Canonical Discriminant Functions:
   Canonical Correlation = 0.801 → (0.801)² ≈ 64.1% of the variance in the dependent variable (Resort Visit) is accounted for by the model (all five independent variables).
   Wilks’ Lambda = 0.359 (equivalent to Chi-square = 26.130 with 5 d.f.) has Sig. = 0.000, i.e. p < 0.001. The discriminant function computed in this procedure is therefore statistically significant. Only then can one proceed to interpret the results.

(v) Standardized Canonical Discriminant Function Coefficients:
   Income: 0.743
   Travel: 0.096
   Vacation: 0.233
   Hsize: 0.469
   Age: 0.209
   Note: the signs (+ or -) indicate a positive or a negative relationship with the dependent variable.
The discriminant function (based on the standardized discriminant coefficients):
   Z = 0.743 Income + 0.096 Travel + 0.233 Vacation + 0.469 Hsize + 0.209 Age

(vi) Structure Matrix (Discriminant Loadings, ordered from highest to lowest by the absolute size of the loading; the sign + or - indicates only a positive or a negative relationship with the dependent variable)
   The discriminant function (based on the discriminant loadings):
   Z = 0.822 Income + 0.541 Hsize + 0.346 Vacation + 0.213 Travel + 0.164 Age

(vii) (Unstandardized) Canonical Discriminant Function Coefficients
   The discriminant function (based on the unstandardized discriminant coefficients):
   Z = -7.975 + 0.085 Income + 0.050 Travel + 0.120 Vacation + 0.427 Hsize + 0.025 Age

Which discriminant function to use, and when?
   For interpretation, use the discriminant loadings. The standardized discriminant coefficients can also be used, although the literature prefers the loadings. Any variable with a loading above +0.30 or below -0.30 is considered a substantive discriminator (here: Income, Hsize, and Vacation).
   To calculate the discriminant Z scores for classification purposes, use the unstandardized discriminant coefficients.

(viii) Functions at Group Centroids: the value of the discriminant function (with the unstandardized coefficients) at the group means.
Ex.
Gc1 = -7.975 + 0.085*60.520 + 0.050*5.4 + 0.120*5.8 + 0.427*4.333 + 0.025*53.733 = 1.291
(1.291 is the value from the printout; with the rounded coefficients shown here, the arithmetic gives about 1.33.)

So, the group centroid for the visitors to the resort (Group 1) is +1.291. The group centroid for non-visitors (Group 2) is -1.291 (the two centroids do not have to be equal in absolute value).

The optimum cutting score is based on the group centroids.
For two groups of equal size: Gc = (Gc1 + Gc2)/2 = (1.291 - 1.291)/2 = 0
Thus, if Zi > 0, assign case i to Group 1; if Zi < 0, assign case i to Group 2 (see the additional columns dis_1 and dis1_1 saved by the DA in the SPSS input file).

For example, for Case 1, dis1_1 = -0.17214 was calculated as follows (based on the unstandardized discriminant coefficients):
   dis1_1 = -7.975476 + 0.0847671*50.2 + 0.04964455*5 + … + 0.0245438*43 = -0.17214 < 0
Case 1 is therefore assigned to Group 2, which is a mistake, because we know from the sample that Case 1 belongs to Group 1 (visitors). Altogether 3* such mistakes were made, i.e. a case from Group 1 was assigned to Group 2. However, 0* mistakes were made when assigning members of Group 2: all of them were correctly assigned to Group 2.

Classification Results (Original); rows = predicted group, columns = actual group:

                       Actual Group 1   Actual Group 2
   Predicted Group 1         12               0*
   Predicted Group 2          3*             15

For two groups of different sizes n1 and n2: Zcs = (n1*Gc1 + n2*Gc2)/(n1 + n2)

(ix) Classification Function Coefficients (Fisher’s Linear Discriminant Functions): these can also be used for classification purposes.
   Z1 = -57.532 + 0.678 Income + 1.509 Travel + 0.938 Vacation + 3.322 Hsize + 0.832 Age
   Z2 = -36.936 + 0.459 Income + 1.381 Travel + 0.628 Vacation + 2.218 Hsize + 0.768 Age
Ex. Consider again Case 1.
   Calculate: Z1(Case 1) = -57.532 + 0.678*50.2 + … + 0.832*43 = 37.295
   Calculate: Z2(Case 1) = -36.936 + 0.459*50.2 + … + 0.768*43 = 37.7128
   Because 37.7128 > 37.295, assign Case 1 to Group 2.

(x) Classification Results
   Based on the Analysis Sample (called in SPSS: Original):
   Hit Ratio = % of correctly classified cases = (12 + 15)/30 = 90%
   Based on the “leave-one-out” principle (called in SPSS: Cross-validated):
   Hit Ratio = % of correctly classified cases = (11 + 13)/30 = 80%
   Based on the Holdout Sample (n = 12) (called in SPSS: Cases Not Selected - Original):
   Hit Ratio = % of correctly classified cases = (4 + 6)/12 = 83.3%

In each case, compare the Hit Ratio with the Chance Ratio.

If the group sizes are equal: Chance Ratio = 1/(number of groups). In our example, Chance Ratio = 1/2 = 0.5. (For the validity of the DA to be satisfactory, the Hit Ratio should be at least 1.25 times the Chance Ratio.)
The lowest of the three Hit Ratios = 80% > 1.25*50% = 62.5% → the validity of our DA is satisfactory.

If the group sizes are different, two Chance Ratios are possible:
   Maximum Chance Criterion (MCC) = the percentage of the total sample represented by the largest group
   Ex. Group 1 = 65, Group 2 = 25, Group 3 = 10 → MCC = 0.65 = 65%
   Proportional Chance Criterion (PCC) = p1² + p2² + … + pk², where pi is the proportion of the total sample in group i
   Ex. PCC = 0.65² + 0.25² + 0.10² = 0.495 = 49.5%
If the Hit Ratio > 1.25*max(MCC, PCC) → the validity of the DA is satisfactory.

Another statistical test for the discriminatory power of the classification matrix:
   Press’s Q statistic = [N - (Ncorrect*c)]² / [N*(c - 1)]
   where:
   c = number of groups
   N = sample size
   Ncorrect = number of observations correctly classified
Ex. c = 2, N = 30, Ncorrect = 24 (based on the cross-validated procedure)
   Press’s Q = [30 - 24*2]² / [30*(2 - 1)] = 10.8
The critical value at the 0.01 level is 6.63 (chi-square with 1 d.f.; the 0.05 critical value is 3.84). Q = 10.8 > Qcritical, hence the classification matrix can be deemed statistically significantly better than chance (p < 0.01).
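
The validity checks in step (x) are simple arithmetic, so they can be reproduced outside SPSS. The sketch below is one possible way to do it in Python; the function names (hit_ratio, chance_criteria, press_q, is_satisfactory) are illustrative assumptions, not SPSS output or part of the original procedure. The numbers are the ones used in the examples above.

```python
def hit_ratio(correct_per_group, n_total):
    """Share of correctly classified cases."""
    return sum(correct_per_group) / n_total


def chance_criteria(group_sizes):
    """Return (MCC, PCC) for the given group sizes."""
    total = sum(group_sizes)
    proportions = [n / total for n in group_sizes]
    mcc = max(proportions)                  # share of the largest group
    pcc = sum(p ** 2 for p in proportions)  # p1^2 + p2^2 + ... + pk^2
    return mcc, pcc


def press_q(n_total, n_correct, n_groups):
    """Press's Q = [N - Ncorrect*c]^2 / [N*(c - 1)]."""
    return (n_total - n_correct * n_groups) ** 2 / (n_total * (n_groups - 1))


def is_satisfactory(hit, chance, factor=1.25):
    """Rule of thumb from the text: hit ratio above 1.25 times the chance ratio."""
    return hit > factor * chance


# Cross-validated result of the example: 11 + 13 correct out of 30, two equal groups.
hit = hit_ratio([11, 13], 30)
print("Hit ratio:", hit, "satisfactory:", is_satisfactory(hit, 1 / 2))

# Unequal groups example: 65, 25, 10.
mcc, pcc = chance_criteria([65, 25, 10])
print("MCC:", mcc, "PCC:", round(pcc, 3))

print("Press's Q:", press_q(30, 24, 2))
```

For unequal groups, pass max(mcc, pcc) as the chance argument of is_satisfactory, as in the rule stated above.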
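
The classification rule in step (ix), assign the case to the group with the larger classification score, can be sketched as follows. The coefficients are the Fisher classification function coefficients listed above. Because the Vacation and Hsize values of Case 1 are not shown in these notes, the example uses the Group 1 means from step (viii) as input instead. The names VARIABLES, classification_scores, and assign_group are assumptions made for this illustration.

```python
VARIABLES = ("income", "travel", "vacation", "hsize", "age")

# group -> (constant, coefficients in the order of VARIABLES), from step (ix)
CLASSIFICATION_FUNCTIONS = {
    1: (-57.532, (0.678, 1.509, 0.938, 3.322, 0.832)),
    2: (-36.936, (0.459, 1.381, 0.628, 2.218, 0.768)),
}


def classification_scores(case):
    """Evaluate every group's classification function for one case."""
    scores = {}
    for group, (constant, coefs) in CLASSIFICATION_FUNCTIONS.items():
        scores[group] = constant + sum(
            b * case[name] for b, name in zip(coefs, VARIABLES)
        )
    return scores


def assign_group(case):
    """Assign the case to the group with the highest classification score."""
    scores = classification_scores(case)
    return max(scores, key=scores.get)


# Group 1 means taken from the centroid example in step (viii).
group1_means = {
    "income": 60.520,
    "travel": 5.4,
    "vacation": 5.8,
    "hsize": 4.333,
    "age": 53.733,
}

print(classification_scores(group1_means))
print("Assigned to group", assign_group(group1_means))
```

As expected, a case with the average Group 1 profile receives the higher score on Z1 and is assigned to Group 1.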
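
The figures reported in step (iv) are linked by two standard relationships that these notes do not spell out, so treat the following as a cross-check rather than as the SPSS algorithm: with a single discriminant function, Wilks' Lambda = 1 - (canonical correlation)², and Bartlett's approximation gives Chi-square = -[n - 1 - (p + g)/2] * ln(Lambda) with p*(g - 1) d.f., where n is the sample size, p the number of independent variables, and g the number of groups. The function names are illustrative.

```python
import math


def wilks_lambda_from_canonical(r):
    """Wilks' Lambda for a single discriminant function: 1 - r^2."""
    return 1.0 - r ** 2


def bartlett_chi_square(wilks, n, p, g):
    """Bartlett's chi-square approximation and its degrees of freedom."""
    chi_square = -(n - 1 - (p + g) / 2) * math.log(wilks)
    return chi_square, p * (g - 1)


# Values from the example: r = 0.801, Lambda = 0.359, n = 30, p = 5, g = 2.
print("Lambda from r:", round(wilks_lambda_from_canonical(0.801), 3))
chi_square, df = bartlett_chi_square(0.359, n=30, p=5, g=2)
print("Chi-square:", round(chi_square, 2), "d.f.:", df)
```

Small differences from the printed values (0.359 and 26.130) come from rounding of the inputs.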
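
The proportionately stratified split described in step 2 can also be sketched in code. This is a minimal illustration under assumptions: cases are represented as (group, id) tuples, the function name stratified_split and the fixed seed are hypothetical choices, and in SPSS the split is normally done with a selection variable as in step 5.

```python
import random


def stratified_split(cases, group_of, analysis_fraction, seed=0):
    """Split cases into analysis and holdout samples, group by group."""
    rng = random.Random(seed)
    strata = {}
    for case in cases:
        strata.setdefault(group_of(case), []).append(case)

    analysis, holdout = [], []
    for members in strata.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Same fraction from every group keeps the group proportions intact.
        cut = round(len(shuffled) * analysis_fraction)
        analysis.extend(shuffled[:cut])
        holdout.extend(shuffled[cut:])
    return analysis, holdout


# Example from step 2: 80 males and 50 females, split 60-40.
cases = [("M", i) for i in range(80)] + [("F", i) for i in range(50)]
analysis, holdout = stratified_split(cases, lambda c: c[0], 0.60)

for name, sample in (("Analysis", analysis), ("Holdout", holdout)):
    males = sum(1 for c in sample if c[0] == "M")
    print(name, "males:", males, "females:", len(sample) - males)
```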