Solutions and Comments on Assignment 6
Stat 501, Spring 2005

1. (a) Use the Hotelling two-sample T^2 statistic to test for a difference in the population mean vectors. The null hypothesis is H_0: \mu_1 = \mu_2, and the value of the test statistic is

    T^2 = (\bar{X}_1 - \bar{X}_2)' [(1/n_1 + 1/n_2) S_{pooled}]^{-1} (\bar{X}_1 - \bar{X}_2) = 17.608

This yields an F-statistic of F = 8.38 with (2, 20) degrees of freedom and p-value = 0.002, which suggests that the two population mean vectors are not the same.

(b) Classify a new case with measurement vector X_0 as coming from population 1 if

    (\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - (1/2)(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) > 0

Otherwise, classify the case into population 2.

(c) If we assume equal misclassification costs and equal prior probabilities, we can use the rule from part (b). For this particular case,

    (\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 = -0.788
    (1/2)(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.774

so

    (\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - (1/2)(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.788 - (-0.774) = -0.014 < 0

Therefore we classify this case into population 2.

(d) The criterion for classification is to minimize the expected cost of misclassification (ECM). The ECM is minimized by classifying an individual with measurement vector X_0 into population 1 if

    f_1(x_0) / f_2(x_0) >= [c(1|2)/c(2|1)] (p_2/p_1)

Then the estimated minimum-ECM rule for two normal populations allocates X_0 to population 1 if

    (\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - (1/2)(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) >= ln{ [c(1|2)/c(2|1)] (p_2/p_1) }

Otherwise, it allocates X_0 to population 2. Here p_1 = 0.65 and p_2 = 0.35, and the cost of misclassifying a unit from population 1 into population 2 is ten times greater than the cost of misclassifying a unit from population 2 into population 1.
Consequently, c(1|2)/c(2|1) = 0.10 and

    (\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} X_0 - (1/2)(\bar{X}_1 - \bar{X}_2)' S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.014

    ln{ [c(1|2)/c(2|1)] (p_2/p_1) } = ln(0.10 * (0.35/0.65)) = -2.92

Since -0.014 >= -2.92, we classify this case into population 1.

2. (a) You were asked to do linear discriminant analysis. Examination of the data, however, reveals that the covariance matrices are not homogeneous for the dew hours and the non-dew hours. Furthermore, none of the variables (a, r, w, d) appears to have a normal distribution. Consequently, there is no reason to believe that a linear discriminant rule would be optimal for classifying dew events and non-events. You could have tried to look for transformations to make the data more nearly normally distributed and to promote homogeneity of covariance matrices. You could also have investigated the creation of new variables, e.g., dew point difference = (dew point) - (air temperature), but you were not asked to do so because the semester is at an end.

Here we used priors of (.5, .5) and equal costs of misclassification.

Linear Discriminant Functions:

    Variable                         dry          dew
    Constant                  -499.97095   -532.44513
    a  Air temperature (C)      38.58379     39.61301
    r  Relative humidity (%)    10.62615     10.99397
    w  Wind speed (m/sec)        0.71609      0.13709
    d  Dew point (C)           -39.17780    -40.25436

Subtracting the dry function from the dew function, we classify an hour as a dew event if

    (1.03)a + (0.37)r - (0.58)w - (1.08)d > 32.48

This makes some sense because dew is more likely to form when wind speeds are low and when humidity is high. Dew is also less likely to form when the dew point gets close to or below freezing.
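Applied to a new hour, the rule above is just a linear score compared with a cutoff. A minimal sketch in Python; the measurement values used here are hypothetical, chosen only for illustration:

```python
# Classify an hour with the estimated linear discriminant rule above.
def classify_hour(a, r, w, d):
    """a: air temp (C), r: rel. humidity (%), w: wind speed (m/sec), d: dew point (C)."""
    score = 1.03 * a + 0.37 * r - 0.58 * w - 1.08 * d
    return "dew" if score > 32.48 else "dry"

# A humid, calm hour (made-up values): high r and low w push the score up.
print(classify_hour(a=12.0, r=92.0, w=0.8, d=10.0))  # -> dew
```

Note that only the differences of the two discriminant functions matter, which is why the rule can be stated with a single score and threshold.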
Cross-validation Summary using Linear Discriminant Function

    Number of Observations and Percent Classified
    From      dry              wet              Total
    dry       4156 (73.04%)    1534 (26.96%)    5690
    wet        334 ( 7.89%)    3925 (92.11%)    4259
    Total     4490             5459             9949
    Priors    0.5              0.5

    Overall estimate of the probability of misclassification:
    Rate      dry 0.2696       wet 0.0789       Total 0.1764

Now fit a linear discriminant model with the squares of a, r, w, d and the cross-products ar, aw, ad, rw, rd, and wd added to the model (you were not asked to do this, but it was provided by the code posted on the course web page). This model does not make much improvement in the overall misclassification rate, but it makes the rates more nearly equal for the two types of errors. This is good because the previous model tended to underestimate the number of dew hours overall.

Cross-validation Summary using a More Complex Linear Discriminant Function

    Number of Observations and Percent Classified
    From      dry              wet              Total
    dry       4530 (79.61%)    1160 (20.39%)    5690
    wet        613 (14.39%)    3646 (85.61%)    4259
    Total     5143             4806             9949
    Priors    0.5              0.5

    Overall estimate of the probability of misclassification:
    Rate      dry 0.2039       wet 0.1439       Total 0.1739

Now fit a quadratic discriminant function using the variables a, r, w, d. This model is better at classifying dew events as dew events, but it misclassifies a higher proportion of non-dew events as dew events.

Cross-validation Summary using a Quadratic Discriminant Function

    Number of Observations and Percent Classified
    From      dry              wet              Total
    dry       3555 (62.48%)    2135 (37.52%)    5690
    wet        131 ( 3.08%)    4128 (96.92%)    4259
    Total     3686             6263             9949
    Priors    0.5              0.5

    Overall estimate of the probability of misclassification:
    Rate      dry 0.3752       wet 0.0308       Total 0.2030

If you have some free time this summer, you could consider transformations of variables to make the distributions more nearly normal.
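As a starting point, here is a short sketch of two candidate transformations; the sample values and the particular choices (log for wind speed, logit for humidity) are illustrative, not taken from the dew data:

```python
import math

# Made-up predictor values, skewed or bounded in the way the dew data are.
wind = [0.3, 1.1, 2.7, 5.9]           # right-skewed, nonnegative -> log(w + 1)
humidity = [55.0, 79.4, 88.2, 96.0]   # bounded percentage -> logit scale

log_wind = [math.log(w + 1.0) for w in wind]              # +1 guards against zeros
logit_rh = [math.log(p / (100.0 - p)) for p in humidity]  # maps (0,100) to the real line
print([round(v, 2) for v in log_wind])
```

Either transformed variable could then be fed back into the discriminant analysis in place of the original.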
Then investigate misclassification rates for linear and quadratic models.

(b) Use logistic regression to construct a classification rule. For binary response data, the response is either a dew event or a nonevent, and PROC LOGISTIC models the probability of the event. The CTABLE option produces a 2 x 2 frequency table by cross-classifying the observed and predicted states; CTABLE uses an approximate leave-one-out method. The accuracy of the classification is measured by its sensitivity and specificity. Sensitivity is the proportion of event responses that were predicted to be events, i.e., the ability to predict an event correctly. Specificity is the proportion of nonevent responses that were predicted to be nonevents, i.e., the ability to predict a nonevent correctly.

The estimated coefficients for a logistic regression model with a, r, w, and d are shown in the following table.

    Analysis of Maximum Likelihood Estimates
    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept    1   -41.1777       9.4533           18.9739         <.0001
    a            1     3.0468       0.5088           35.8584         <.0001
    w            1     0.7379       0.0268          757.5890         <.0001
    d            1    -3.0693       0.5149           35.5368         <.0001
    r            1     0.3826       0.0958           15.9396         <.0001

All four variables appear to be significant. Here an hour is classified as a dew event if the estimated probability of a dew event is at least 0.5.

    Classification Table
    Prob     Correct           Incorrect         Percentages
    Level    Event  NonEvent   Event  NonEvent   Correct  Sensitivity  Specificity  False POS  False NEG
    0.500     4591      3567     692      1099      82.0         80.7         83.8       13.1       23.6

Comparing these results with the results from linear discriminant analysis with a, r, w, and d, we can see that logistic regression gives higher sensitivity and lower specificity for the dew event data. The following table shows approximate cross-validation estimates of sensitivity and specificity for different values of the cutoff probability. This table was constructed using the options CTABLE and PPROB=(.1 to .9 by .05).
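As a quick sanity check first, the percentages in the 0.5-cutoff table can be recovered from its four counts:

```python
# Counts from the 0.5-cutoff classification table above.
correct_event, correct_nonevent = 4591, 3567
incorrect_event, incorrect_nonevent = 692, 1099

sensitivity = correct_event / (correct_event + incorrect_nonevent)      # events caught
specificity = correct_nonevent / (correct_nonevent + incorrect_event)   # nonevents caught
false_pos = incorrect_event / (correct_event + incorrect_event)         # among predicted events
false_neg = incorrect_nonevent / (correct_nonevent + incorrect_nonevent)

print(round(100 * sensitivity, 1), round(100 * specificity, 1))  # 80.7 83.8
print(round(100 * false_pos, 1), round(100 * false_neg, 1))      # 13.1 23.6
```

Note that the false-positive and false-negative rates are computed among the *predicted* events and nonevents, which is how PROC LOGISTIC reports them.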
    Classification Table
    Prob     Correct           Incorrect         Percentages
    Level    Event  NonEvent   Event  NonEvent   Correct  Sensitivity  Specificity  False POS  False NEG
    0.100     5511       692    3567        75      63.0         98.7         16.2       39.3        9.8
    0.150     5403      1299    2960       183      68.1         96.7         30.5       35.4       12.3
    0.200     5267      1819    2440       319      72.0         94.3         42.7       31.7       14.9
    0.250     5122      2273    1986       464      75.1         91.7         53.4       27.9       17.0
    0.300     4990      2640    1619       596      77.5         89.3         62.0       24.5       18.4
    0.350     4866      2923    1336       720      79.1         87.1         68.6       21.5       19.8
    0.400     4738      3177    1082       848      80.4         84.8         74.6       18.6       21.1
    0.450     4604      3374     885       982      81.0         82.4         79.2       16.1       22.5
    0.500     4487      3567     692      1099      81.8         80.3         83.8       13.4       23.6
    0.550     4349      3718     541      1237      81.9         77.9         87.3       11.1       25.0
    0.600     4203      3859     400      1383      81.9         75.2         90.6        8.7       26.4
    0.650     4040      3967     292      1546      81.3         72.3         93.1        6.7       28.0
    0.700     3881      4066     193      1705      80.7         69.5         95.5        4.7       29.5
    0.750     3689      4129     130      1897      79.4         66.0         96.9        3.4       31.5
    0.800     3507      4169      90      2079      78.0         62.8         97.9        2.5       33.3
    0.850     3302      4211      48      2284      76.3         59.1         98.9        1.4       35.2
    0.900     3039      4246      13      2547      74.0         54.4         99.7        0.4       37.5

It appears that something near 0.5 is a good cutoff for maximizing the number of correct classifications.

You could also consider the squares and cross-products of the explanatory variables and use the variable selection options in PROC LOGISTIC to look for a better model. With a, r, w, d, the squares of a, r, w, d, and the cross-product terms ar, aw, ad, rw, rd, and wd as candidates, backward elimination provided the following model. You can check the misclassification rates, but this model offers little improvement over logistic regression on a, r, w, d.
    Analysis of Maximum Likelihood Estimates
    Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
    Intercept    1   -616.1        158.7             15.0715         0.0001
    a            1     19.8568       4.4459          19.9478         <.0001
    r            1      6.1520       1.5868          15.0316         0.0001
    w            1      2.3911       0.5395          19.6471         <.0001
    d            1    -24.5666       5.6570          18.8592         <.0001
    w2           1      0.2021       0.0255          62.8593         <.0001
    ar           1      0.2102       0.0615          11.6908         0.0006
    ad           1      0.00593      0.000699        71.9551         <.0001
    rw           1     -0.0292       0.00591         24.4607         <.0001
    rd           1     -0.1648       0.0493          11.1742         0.0008
    wd           1      0.0115       0.00486          5.6192         0.0178

(c) Use the rpart( ) and tree( ) functions in S-PLUS and R. Since the tree( ) function uses a deviance measure of impurity and the rpart( ) function uses a measure of impurity based on the Gini index, the two algorithms can yield different classification trees. Cross-validation provided by rpart produces a plot of the complexity parameter that suggests a value around 0.0025. This yields a pruned tree with 14 terminal nodes. You should run the cross-validation several times, because it randomly divides the cases into a different set of 10 groups on each run, which yields slightly different results.

This pruned tree provided by rpart classifies an hour as dry (dew does not form) if:
♦ Relative humidity is less than 79.35 percent.
♦ Relative humidity is between 79.35 and 88.25 percent and wind speed exceeds 1.039 m/sec.
♦ Relative humidity is between 79.35 and 88.25 percent, wind speed does not exceed 1.039 m/sec, and the dew point is less than 0.337 degrees C.
♦ Wind speed exceeds 3.856 m/sec.
♦ Relative humidity exceeds 88.25 percent, wind speed is between 1.652 and 3.856 m/sec, and the dew point is less than 0.4845 degrees C.
♦ Relative humidity is between 88.25 and 90.85 percent and wind speed exceeds 2.795 m/sec.
♦ Relative humidity is between 88.25 and 90.85 percent and wind speed is between 1.652 and 2.795 m/sec
and air temperature is less than 4.205 degrees C.
♦ Relative humidity is between 88.25 and 89.85 percent, wind speed is between 1.652 and 2.795 m/sec, air temperature exceeds 4.205 degrees C, and the dew point is less than 18.96 degrees C.

It classifies an hour as a dew event if:
♦ Relative humidity is greater than 90.85 percent and wind speed is less than 3.856 m/sec.
♦ Relative humidity exceeds 88.25 percent, wind speed is less than 1.652 m/sec, and the dew point is at least 0.4845 degrees C (dew does not form at below-freezing temperatures; you get frost instead).
♦ Relative humidity is between 89.85 and 90.85 percent, wind speed is between 1.652 and 2.795 m/sec, and air temperature exceeds 4.205 degrees C.
♦ Relative humidity is between 88.25 and 89.85 percent, wind speed is between 1.652 and 2.795 m/sec, air temperature exceeds 4.205 degrees C, and the dew point exceeds 18.96 degrees C.

These results reveal some of the interaction between wind speed and relative humidity. In borderline situations, where relative humidity and wind speed are otherwise suitable for dew formation, air temperature and dew point cannot get too close to freezing.

(d) Further suggestions: The classification tree indicates an interaction between humidity and wind speed. It also suggests that dew will not form when humidity falls below a threshold value or wind speed exceeds one. One option that you could pursue is to set aside cases with humidity less than 79.35 percent or wind speed exceeding 3.856 m/sec as situations where dew is very unlikely to form. Fit one logistic regression model to the cases with humidity between 79.35 and 88.25 percent, and a second logistic regression model to the cases with humidity greater than 88.25 percent and wind speed less than 3.856 m/sec. One might consider creating new variables, such as the difference between the air temperature and the dew point. One could also consider changes in some variables from the previous hour.
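The two suggested derived variables take only a few lines to compute; the record layout and the values below are invented for illustration:

```python
# Hourly records (made-up values): a = air temperature (C), d = dew point (C).
hours = [{"a": 15.2, "d": 12.0}, {"a": 14.1, "d": 12.3}, {"a": 13.0, "d": 12.6}]

for prev, cur in zip([None] + hours[:-1], hours):
    cur["depression"] = cur["a"] - cur["d"]  # air temperature minus dew point
    # Change in air temperature from the previous hour (undefined for hour 0).
    cur["delta_a"] = None if prev is None else cur["a"] - prev["a"]

print(round(hours[1]["depression"], 2), round(hours[1]["delta_a"], 2))  # 1.8 -1.1
```

A shrinking dew point depression together with falling air temperature is exactly the pattern the tree's humidity and temperature splits are picking up.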
One could consider prior probabilities that are proportional to the overall numbers of hours with and without dew events in the training sample. You could also do this separately for each hour of the day to obtain different priors for different hours. You must be careful to check that the relative numbers of dew and dry hours in the training sample reflect the true state of affairs in the population of interest.

Consider transformations of variables to make the distributions more symmetric or more nearly normal. Note that monotone transformations, such as logarithms, have no effect on classification trees.

Try neural nets or support vector machines.
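For the frequency-based priors suggested above, the effect is simply to shift the minimum-ECM cutoff of problem 1(d). A sketch using the hour counts from the cross-validation tables, with equal misclassification costs assumed:

```python
import math

# Priors proportional to observed hours (counts from the tables above),
# equal misclassification costs assumed for this sketch.
n_dry, n_dew = 5690, 4259
p_dry = n_dry / (n_dry + n_dew)
p_dew = 1.0 - p_dry

# With dew playing the role of population 1, allocate an hour to "dew" only if
# its discriminant score exceeds ln(p_dry / p_dew) rather than 0; because dry
# hours are the majority, the cutoff moves up and dew needs stronger evidence.
threshold = math.log(p_dry / p_dew)
print(round(threshold, 3))  # about 0.29
```

Unequal costs would enter the same way, multiplying the prior ratio inside the logarithm.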