Solutions and Comments on Assignment 6
Stat 501
Spring 2005
1. (a). Use the Hotelling two-sample T² statistic to test for a difference in the population mean vectors. The null hypothesis is H₀: µ₁ = µ₂, and the value of the test statistic is

$$ T^2 = (\bar{X}_1 - \bar{X}_2)^T \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) S_{pooled} \right]^{-1} (\bar{X}_1 - \bar{X}_2) = 17.608 $$

This yields an F-statistic of F = 8.38 with (2, 20) degrees of freedom and p-value = 0.002, which suggests that the two population mean vectors are not the same.
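The T² computation can be sketched in Python with NumPy (an illustrative sketch only; the assignment used other software, and the data arrays below are hypothetical):

```python
import numpy as np

def hotelling_t2(x1, x2):
    """Two-sample Hotelling T^2 statistic with a pooled covariance matrix.

    x1, x2: (n_i, p) arrays of observations from each population.
    Returns (T^2, F), where F has (p, n1 + n2 - p - 1) degrees of freedom.
    """
    n1, p = x1.shape
    n2 = x2.shape[0]
    d = x1.mean(axis=0) - x2.mean(axis=0)          # difference of sample mean vectors
    s_pooled = ((n1 - 1) * np.cov(x1, rowvar=False) +
                (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    t2 = d @ np.linalg.solve((1 / n1 + 1 / n2) * s_pooled, d)
    # Exact conversion of T^2 to an F statistic under normality
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f
```

With p = 2 variables and F on (2, 20) degrees of freedom, the conversion factor here is 20/(2 × 21), and indeed 17.608 × 20/42 ≈ 8.38, matching the reported F.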
(b). Classify a new case with measurements X₀ as coming from population 1 if

$$ (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} X_0 - \frac{1}{2} (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) > 0 $$

Otherwise, classify the case into population 2.
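This rule can be sketched as a short Python function (a sketch under the assumption that the sample means and the inverse pooled covariance are already computed; the toy values in the usage example are hypothetical):

```python
import numpy as np

def classify_lda(x0, xbar1, xbar2, s_pooled_inv):
    """Sample linear classification rule with equal costs and equal priors:
    allocate x0 to population 1 if
        w'x0 - 0.5 * w'(xbar1 + xbar2) > 0,  where w = S_pooled^{-1}(xbar1 - xbar2).
    Returns 1 or 2, the allocated population.
    """
    w = s_pooled_inv @ (np.asarray(xbar1) - np.asarray(xbar2))
    score = w @ np.asarray(x0) - 0.5 * w @ (np.asarray(xbar1) + np.asarray(xbar2))
    return 1 if score > 0 else 2
```

For example, with xbar1 = (2, 0), xbar2 = (0, 0) and an identity pooled covariance, a point at (3, 0) falls on population 1's side of the midpoint and a point at (-1, 0) on population 2's side.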
(c). If we assume equal misclassification costs and equal prior probabilities, we can use the rule reported in part (b). For this particular case,

$$ (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} X_0 = -0.788 \quad \text{and} \quad \frac{1}{2} (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.774 $$

so

$$ (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} X_0 - \frac{1}{2} (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.788 - (-0.774) = -0.014 < 0 $$

Therefore we classify this case into population 2.
(d) The criterion for classification is to minimize the expected cost of misclassification (ECM). It is minimized by classifying an individual with measurement X₀ into population 1 if

$$ \frac{f_1(x_0)}{f_2(x_0)} \geq \frac{c(1 \mid 2)}{c(2 \mid 1)} \cdot \frac{p_2}{p_1} $$

Then the estimated minimum-ECM rule for two normal populations allocates X₀ to population 1 if

$$ (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} X_0 - \frac{1}{2} (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) \geq \ln\left[ \left( \frac{c(1 \mid 2)}{c(2 \mid 1)} \right) \left( \frac{p_2}{p_1} \right) \right] $$

Otherwise, it allocates X₀ to population 2.
Here p₁ = 0.65 and p₂ = 0.35, and the cost of misclassifying a unit from population 1 into population 2 is ten times greater than the cost of misclassifying a unit from population 2 into population 1. Consequently, c(1|2)/c(2|1) = 0.10 and

$$ (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} X_0 - \frac{1}{2} (\bar{X}_1 - \bar{X}_2)^T S_{pooled}^{-1} (\bar{X}_1 + \bar{X}_2) = -0.014 $$

$$ \ln\left[ \left( \frac{c(1 \mid 2)}{c(2 \mid 1)} \right) \left( \frac{p_2}{p_1} \right) \right] = \ln(0.10 \times (0.35 / 0.65)) = -2.92 $$

Since -0.014 ≥ -2.92, we classify this case into population 1.
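The minimum-ECM rule with unequal costs and priors can be sketched by adding the log cutoff to the function from part (b) (again an illustrative sketch; the toy means and identity covariance in the test are hypothetical):

```python
import numpy as np

def classify_ecm(x0, xbar1, xbar2, s_pooled_inv, p1, p2, c12, c21):
    """Estimated minimum-ECM rule: allocate x0 to population 1 if the linear
    score is at least ln[(c(1|2)/c(2|1)) * (p2/p1)].

    c12 = c(1|2): cost of misclassifying a population-2 unit into population 1.
    c21 = c(2|1): cost of misclassifying a population-1 unit into population 2.
    """
    w = s_pooled_inv @ (np.asarray(xbar1) - np.asarray(xbar2))
    score = w @ np.asarray(x0) - 0.5 * w @ (np.asarray(xbar1) + np.asarray(xbar2))
    cutoff = np.log((c12 / c21) * (p2 / p1))
    return 1 if score >= cutoff else 2
```

Note how a large c(2|1), as in this problem, pushes the cutoff far below zero, so borderline cases such as a score of -0.014 get allocated to population 1.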
2. (a). You were asked to do linear discriminant analysis. Examination of the data, however, reveals that the covariance matrices are not homogeneous for the dew hours and the non-dew hours. Furthermore, none of the variables (a, r, w, d) appears to have a normal distribution. Consequently, there is no reason to believe that a linear discriminant rule would be optimal for classifying dew events and non-events. You could have tried to look for transformations to make the data more nearly normally distributed and to promote homogeneity of covariance matrices. You could also have investigated the creation of new variables, e.g., dew point difference = (dew point) − (air temperature), but you were not asked to do so because the semester is at an end.
Here we used priors of (.5, .5) and equal costs of misclassification.
Linear Discriminant Functions:

Variable        dry           dew
Constant   -499.97095    -532.44513
a            38.58379      39.61301
r            10.62615      10.99397
w             0.71609       0.13709
d           -39.17780     -40.25436

where a = air temperature (°C), r = relative humidity (%), w = wind speed (m/sec), and d = dew point (°C).
Subtracting the dry formula from the dew formula, we classify an hour as a dew event if
(1.03)a + (0.37)r - (0.58)w - (1.08)d > 32.48
This makes some sense because dew is more likely to form when wind speeds are low and when
humidity is high. Dew is also less likely to form when the dew point gets close to or below
freezing.
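The resulting rule is simple enough to write as one line of code. A minimal sketch (the rounded coefficients come from subtracting the two discriminant functions above; the function name is an assumption):

```python
def classify_dew_hour(a, r, w, d):
    """Classify an hour as a dew event using the difference of the two
    linear discriminant functions (coefficients rounded as in the text).

    a = air temperature (deg C), r = relative humidity (%),
    w = wind speed, d = dew point (deg C).
    """
    score = 1.03 * a + 0.37 * r - 0.58 * w - 1.08 * d
    return "dew" if score > 32.48 else "dry"
```

A calm, humid hour (high r, low w) pushes the score above the threshold, while a windy, dry hour falls below it, consistent with the physical interpretation above.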
Cross-validation Summary using Linear Discriminant Function
Number of Observations and Percent Classified into dew

From      Dry            Wet            Total
Dry       4156 (73.04)   1534 (26.96)   5690 (100.00)
Wet        334 ( 7.89)   3925 (92.11)   4259 (100.00)
Total     4492           5457           9949
Priors     0.5            0.5

Overall estimate of the probability of misclassification:

          dry       wet       Total
Rate      0.2696    0.0789    0.17642
Priors    0.5000    0.5000
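The rates in these summaries are per-class error rates weighted by the priors. A small sketch of that computation (the function and argument names are assumptions, not SAS output):

```python
def misclassification_rates(n_dry_as_dry, n_dry_as_wet, n_wet_as_dry, n_wet_as_wet,
                            prior_dry=0.5, prior_wet=0.5):
    """Per-class cross-validation error rates and the prior-weighted
    overall misclassification rate, as reported in the summaries."""
    rate_dry = n_dry_as_wet / (n_dry_as_dry + n_dry_as_wet)   # dry hours called wet
    rate_wet = n_wet_as_dry / (n_wet_as_dry + n_wet_as_wet)   # wet hours called dry
    total = prior_dry * rate_dry + prior_wet * rate_wet
    return rate_dry, rate_wet, total
```

For instance, 1534 of the 5690 dry hours misclassified gives a dry-hour rate of about 0.27, matching the table.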
Now fit a linear discriminant model with the squares of a, r, w, d and the cross-products ar, aw, ad, rw, rd, and wd added to the model (you were not asked to do this, but it was provided by the code posted on the course web page). This model does not make much improvement in the overall misclassification rate, but it makes the rates more nearly equal for the two types of errors. This is good because the previous model tended to underestimate the number of dew hours overall.
Cross-validation Summary using a more Complex Linear Discriminant Function
Number of Observations and Percent Classified into dew

From      Dry            Wet            Total
Dry       4530 (79.61)   1160 (20.39)   5690 (100.00)
Wet        613 (14.39)   3646 (85.61)   4259 (100.00)
Total     5143           4806           9949
Priors     0.5            0.5

Overall estimate of the probability of misclassification:

          dry       wet       Total
Rate      0.2039    0.1439    0.1739
Priors    0.5000    0.5000
Now fit a quadratic discriminant function using the variables a, r, w, d. This model is better at
classifying dew events as dew events, but it misclassifies a higher proportion of non-dew events as
dew events.
Cross-validation Summary using a Quadratic Discriminant Function
Number of Observations and Percent Classified into dew

From      Dry            Wet            Total
Dry       3555 (62.48)   2135 (37.52)   5690 (100.00)
Wet        131 ( 3.08)   4128 (96.92)   4259 (100.00)
Total     3686           6263           9949
Priors     0.5            0.5

Overall estimate of the probability of misclassification:

          dry       wet       Total
Rate      0.3752    0.0308    0.2030
Priors    0.5000    0.5000
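For reference, the quadratic discriminant score that underlies this fit uses a separate covariance matrix per class. A minimal NumPy sketch (illustrative only; the course used SAS, and the toy class parameters in the test are hypothetical):

```python
import numpy as np

def qda_classify(x0, means, covs, priors):
    """Quadratic discriminant rule: compute, for each class k,
        d_k(x) = -0.5 ln|S_k| - 0.5 (x - m_k)' S_k^{-1} (x - m_k) + ln p_k
    and allocate x0 to the class with the largest score."""
    scores = []
    for m, s, p in zip(means, covs, priors):
        diff = np.asarray(x0) - np.asarray(m)
        scores.append(-0.5 * np.log(np.linalg.det(s))
                      - 0.5 * diff @ np.linalg.solve(s, diff)
                      + np.log(p))
    return int(np.argmax(scores))
```

Because each class keeps its own Sₖ, the decision boundary is quadratic, which is why this fit can trade errors between the two classes so differently from the linear rules above.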
If you have some free time this summer, you could consider transformations of variables to make the distributions more nearly normal, and then investigate misclassification rates for linear and quadratic models.
(b). Use logistic regression to construct a classification rule.
For binary response data, the response is either a dew event or a nonevent. Proc Logistic models the probability of the event. The CTABLE option produces a 2 × 2 frequency table by cross-classifying the observed and predicted states; an approximate leave-one-out method is used by CTABLE. The accuracy of the classification is measured by its sensitivity and specificity. Sensitivity is the proportion of event responses that were predicted to be events, while specificity is the proportion of nonevent responses that were predicted to be nonevents. That is, sensitivity is the ability to predict an event correctly, and specificity is the ability to predict a nonevent correctly. The estimated coefficients for a logistic regression model with a, r, w, and d are shown in the following table.
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1   -41.1777       9.4533            18.9739        <.0001
a            1     3.0468       0.5088            35.8584        <.0001
w            1     0.7379       0.0268           757.5890        <.0001
d            1    -3.0693       0.5149            35.5368        <.0001
r            1     0.3826       0.0958            15.9396        <.0001
All four variables appear to be significant. Here an hour is classified as a dew event if the
estimated probability of a dew event is at least 0.5.
Classification Table

 Prob         Correct            Incorrect                    Percentages
Level    Event  Non-Event    Event  Non-Event   Correct  Sensi-  Speci-  False  False
                                                         tivity  ficity   POS    NEG
0.500     4591      3567       692      1099      82.0    80.7    83.8    13.1   23.6
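The percentage columns follow directly from the four counts. A sketch of the definitions used by CTABLE (function and argument names are mine, not SAS's):

```python
def ctable_percentages(correct_event, correct_nonevent, incorrect_event, incorrect_nonevent):
    """Sensitivity, specificity, and false positive/negative rates from the
    counts in a CTABLE row.

    incorrect_event: nonevents predicted as events (false positives);
    incorrect_nonevent: events predicted as nonevents (false negatives).
    """
    sensitivity = correct_event / (correct_event + incorrect_nonevent)
    specificity = correct_nonevent / (correct_nonevent + incorrect_event)
    # False POS/NEG are conditioned on the *predicted* class
    false_pos = incorrect_event / (correct_event + incorrect_event)
    false_neg = incorrect_nonevent / (correct_nonevent + incorrect_nonevent)
    return sensitivity, specificity, false_pos, false_neg
```

Plugging in the counts from the row above (4591, 3567, 692, 1099) reproduces the reported 80.7, 83.8, 13.1, and 23.6 percent.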
Comparing these results with the results from linear discriminant analysis with a, r, w, and d, we can see that logistic regression gives higher sensitivity and lower specificity for the dew event data. The following table shows approximate cross-validation estimates of sensitivity and specificity for different values of the cutoff probability; it was constructed using the options ctable and pprob=(.1 to .9 by .05). It appears that something near 0.5 is a good cutoff for maximizing the number of correct classifications.
Classification Table

 Prob         Correct            Incorrect                    Percentages
Level    Event  Non-Event    Event  Non-Event   Correct  Sensi-  Speci-  False  False
                                                         tivity  ficity   POS    NEG
0.100     5511       692      3567        75      63.0    98.7    16.2    39.3    9.8
0.150     5403      1299      2960       183      68.1    96.7    30.5    35.4   12.3
0.200     5267      1819      2440       319      72.0    94.3    42.7    31.7   14.9
0.250     5122      2273      1986       464      75.1    91.7    53.4    27.9   17.0
0.300     4990      2640      1619       596      77.5    89.3    62.0    24.5   18.4
0.350     4866      2923      1336       720      79.1    87.1    68.6    21.5   19.8
0.400     4738      3177      1082       848      80.4    84.8    74.6    18.6   21.1
0.450     4604      3374       885       982      81.0    82.4    79.2    16.1   22.5
0.500     4487      3567       692      1099      81.8    80.3    83.8    13.4   23.6
0.550     4349      3718       541      1237      81.9    77.9    87.3    11.1   25.0
0.600     4203      3859       400      1383      81.9    75.2    90.6     8.7   26.4
0.650     4040      3967       292      1546      81.3    72.3    93.1     6.7   28.0
0.700     3881      4066       193      1705      80.7    69.5    95.5     4.7   29.5
0.750     3689      4129       130      1897      79.4    66.0    96.9     3.4   31.5
0.800     3507      4169        90      2079      78.0    62.8    97.9     2.5   33.3
0.850     3302      4211        48      2284      76.3    59.1    98.9     1.4   35.2
0.900     3039      4246        13      2547      74.0    54.4    99.7     0.4   37.5
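Picking the cutoff that maximizes the overall percent correct amounts to a one-line scan over the table rows. A sketch (the row tuples in the test are taken from the table above; note that 0.55 and 0.60 are essentially tied with 0.50, which is why "something near 0.5" is a reasonable choice):

```python
def best_cutoff(rows):
    """rows: list of tuples (prob_level, correct_event, correct_nonevent,
    incorrect_event, incorrect_nonevent). Returns the cutoff probability
    with the highest overall proportion correct."""
    def pct_correct(row):
        _, ce, cn, ie, inn = row
        return (ce + cn) / (ce + cn + ie + inn)
    return max(rows, key=pct_correct)[0]
```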
You could consider the squares and cross-products of the explanatory variables and use the variable selection options (e.g., stepwise selection or backward elimination) in PROC LOGISTIC to look for a better model. Backward elimination provided the following model. You can check the misclassification rates, but this model offers little improvement over logistic regression on a, w, r, d.
Analysis of Maximum Likelihood Estimates

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1   -616.1        158.7             15.0715         0.0001
a            1    19.8568        4.4459          19.9478         <.0001
r            1     6.1520        1.5868          15.0316         0.0001
w            1     2.3911        0.5395          19.6471         <.0001
d            1   -24.5666        5.6570          18.8592         <.0001
w2           1     0.2021        0.0255          62.8593         <.0001
ar           1     0.2102        0.0615          11.6908         0.0006
ad           1     0.00593       0.000699        71.9551         <.0001
rw           1    -0.0292        0.00591         24.4607         <.0001
rd           1    -0.1648        0.0493          11.1742         0.0008
wd           1     0.0115        0.00486          5.6192         0.0178
(c). Use the rpart( ) and tree( ) functions in S-Plus and R.
Since the tree( ) function uses a deviance measure of impurity and the rpart( ) function uses a measure of impurity based on the Gini index, the two algorithms can yield different classification trees. Cross-validation provided by rpart produces a plot of the complexity parameter that suggests a value around 0.0025. This yields a pruned tree with 14 terminal nodes. You should run the cross-validation several times, because it randomly divides the cases into different subsets of 10 groups on each run, which yields slightly different results. The pruned tree provided by rpart classifies an hour as dry (dew does not form) if:
♦ Relative humidity is less than 79.35 percent
♦ Relative humidity is between 79.35 and 88.25 percent and wind speed exceeds 1.039 m/sec
♦ Relative humidity is between 79.35 and 88.25 percent and wind speed does not exceed 1.039 m/sec and the dew point is less than 0.337 degrees C
♦ Wind speed exceeds 3.856 m/sec
♦ Relative humidity exceeds 88.25 percent and wind speed is between 1.652 and 3.856 m/sec and the dew point is less than 0.4845 degrees C
♦ Relative humidity is between 88.25 and 90.85 percent and wind speed exceeds 2.795 m/sec
♦ Relative humidity is between 88.25 and 90.85 percent and wind speed is between 1.652 and 2.795 m/sec and air temperature is less than 4.205 degrees C
♦ Relative humidity is between 88.25 and 89.85 percent and wind speed is between 1.652 and 2.795 m/sec and air temperature exceeds 4.205 degrees C and the dew point is less than 18.96 degrees C
It classifies an hour as a dew event if
♦ Relative humidity is greater than 90.85 percent and wind speed is less than 3.856 m/sec
♦ Relative humidity exceeds 88.25 percent and wind speed is less than 1.652 m/sec and the dew point is at least 0.4845 degrees C (dew does not form below freezing temperatures; you get frost instead)
♦ Relative humidity is between 89.85 and 90.85 percent and wind speed is between 1.652 and 2.795 m/sec and air temperature exceeds 4.205 degrees C
♦ Relative humidity is between 88.25 and 89.85 percent and wind speed is between 1.652 and 2.795 m/sec and air temperature exceeds 4.205 degrees C and the dew point exceeds 18.96 degrees C
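The rules above are nested threshold splits, so they translate directly into nested if-statements. A partial sketch covering only the first few splits (the remaining borderline branches involving air temperature are omitted for brevity, and the function name is an assumption):

```python
def tree_classify(a, r, w, d):
    """Partial sketch of the pruned rpart tree as nested thresholds.

    a = air temperature (deg C), r = relative humidity (%),
    w = wind speed (m/sec), d = dew point (deg C).
    """
    if r < 79.35:
        return "dry"               # too dry for dew regardless of other variables
    if w >= 3.856:
        return "dry"               # too windy for dew
    if r >= 90.85:
        return "dew"               # very humid and calm enough
    if r >= 88.25 and w < 1.652:
        # near-freezing dew points give frost, not dew
        return "dew" if d >= 0.4845 else "dry"
    return "dry"                   # remaining borderline branches omitted
```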
These results reveal some of the interaction between wind speed and relative humidity. In borderline situations where relative humidity and wind speed conditions are otherwise suitable for dew formation, air temperature and dew point cannot get too close to freezing.
(d). Further suggestions:
The classification tree indicates an interaction between humidity and wind speed. It also suggests that dew will not form once humidity falls below a threshold value or wind speed rises above one. One option that you could pursue is to set aside cases with humidity less than 79.35 percent or wind speed exceeding 3.856 m/sec as situations where dew is very unlikely to form. Fit a logistic regression model to the cases with humidity between 79.35 and 88.25 percent. Fit a second logistic regression model to the cases with humidity greater than 88.25 percent and wind speed less than 3.856 m/sec.
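The proposed three-way split of the data could be sketched as follows (a hypothetical helper for partitioning the hours before fitting the two logistic models; the labels are mine):

```python
def assign_stratum(r, w):
    """Hypothetical three-way partition of the hours suggested above.

    r = relative humidity (%), w = wind speed (m/sec).
    Hours where dew is very unlikely are set aside and classified as dry;
    the remaining hours are routed to one of two logistic regression models.
    """
    if r < 79.35 or w > 3.856:
        return "no-model"          # classify as dry outright
    if r <= 88.25:
        return "model-1"           # 79.35 <= r <= 88.25
    return "model-2"               # r > 88.25 and w <= 3.856
```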
One might consider creating new variables such as the difference between the air temperature and
the dew point. One could also consider changes in some variables from the previous hour.
One could consider prior probabilities that are proportional to the overall number of hours in the
training samples with dew events. You could also consider doing this for each different hour to
obtain different priors for different hours. You must be careful to check that the relative hours of
dew and dry conditions in the training sample reflect the true state of affairs in the population of
interest.
Consider transformations of variables to make the distributions more symmetric or more nearly
normal. Note that monotone transformations, such as logarithms, have no effect on classification
trees.
Try neural nets or support vector machines.