STAT 587 Homework Assignment No.1
Problem 2
A. Brand preference. In a small-scale experimental study of the relation between degree of brand
liking (Y) and moisture content (X1) and sweetness (X2) of the product, the following results
were obtained from the experiment based on a completely randomized design (data are coded):
i:     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
Xi1:   4    4    4    4    6    6    6    6    8    8    8    8   10   10   10   10
Xi2:   2    4    2    4    2    4    2    4    2    4    2    4    2    4    2    4
Yi:   64   73   61   76   72   80   71   83   83   89   86   93   88   95   94  100
a. Fit regression model to the data. State the estimated regression function. How is $\hat{\beta}_1$ interpreted
here?
The estimated regression function is $\hat{Y} = 37.6500 + 4.4250 X_1 + 4.3750 X_2$.
The coefficient $\hat{\beta}_1 = 4.425$ describes how brand liking depends on moisture content:
with sweetness held fixed, a one-unit increase in moisture content raises the mean degree of
brand liking by about 4.425.
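The fit can be reproduced with a short script. This is only a sketch using the Python standard library: it solves the normal equations (X'X)b = X'y directly, and the helper names `solve` and `fit_ols` are illustrative, not taken from any package.

```python
# Coded data from the problem statement.
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(rows, ys):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, ys)) for a in range(p)]
    return solve(xtx, xty)


# Design matrix with an intercept column.
X = [[1.0, a, c] for a, c in zip(x1, x2)]
b0, b1, b2 = fit_ols(X, y)
print(f"Y_hat = {b0:.4f} + {b1:.4f} X1 + {b2:.4f} X2")
```

Any regression routine would do here; the hand-rolled solver is used only to keep the example free of external dependencies.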
b. Obtain the residuals and prepare a box plot of the residuals. What information does this plot
provide?
The box plot of the residuals shows that they are distributed roughly symmetrically about zero. Their
mean is exactly zero, as is always the case for a least squares fit that includes an intercept.
c. Plot the residuals against $\hat{Y}$, $X_1$, $X_2$, and $X_1 X_2$ on separate graphs. Also prepare a normal
probability plot. Analyze the plots and summarize your findings.
The normal probability plot of the residuals is roughly linear, so the residuals appear approximately normal.
The plots of the residuals against the fitted values, X1, X2, and the product X1X2 all show a uniform
spread with no systematic pattern, so there is no indication of nonlinearity or of a missing interaction term.
d. Conduct a formal test for lack of fit of the first-order regression function; use $\alpha = .01$. State the
alternatives, decision rule, and conclusion.
$H_0: E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
$H_a: E\{Y\} \neq \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Replicate                 j=1    j=2    j=3    j=4    j=5    j=6    j=7    j=8
X1                          4      4      6      6      8      8     10     10
X2                          2      4      2      4      2      4      2      4
i=1                        64     73     72     80     83     89     88     95
i=2                        61     76     71     83     86     93     94    100
Mean Ybar_j              62.5   74.5   71.5   81.5   84.5   91.0   91.0   97.5
Sum_i (Yij - Ybar_j)^2    4.5    4.5    0.5    4.5    4.5    8.0   18.0   12.5
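c = 8 distinct (X1, X2) combinations
SSPE (sum of the last row above) = 57, df = n - c = 16 - 8 = 8
SSE = 94.3
Then SSLF = SSE - SSPE = 37.3, df = c - p = 8 - 3 = 5
F* = (SSLF/5)/(SSPE/8) = 1.0470175 < F(1 - 0.01; 5, 8) = 6.631825
Decision rule: if F* <= 6.631825, conclude H0; otherwise conclude Ha. Since F* < F(.99; 5, 8) at
$\alpha = .01$, we do not reject H0 and conclude that the first-order regression function is appropriate.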
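The arithmetic of the lack-of-fit test can be checked with the sketch below. It takes the estimated coefficients from part (a) and the critical value 6.631825 quoted above as given (no F quantile is computed), and the variable names are illustrative.

```python
from collections import defaultdict

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]

# Estimated coefficients from part (a).
b0, b1, b2 = 37.65, 4.425, 4.375
sse = sum((yi - (b0 + b1 * a + b2 * c)) ** 2 for a, c, yi in zip(x1, x2, y))

# Pure error: variation of Y around its mean within each (X1, X2) combination.
groups = defaultdict(list)
for a, c, yi in zip(x1, x2, y):
    groups[(a, c)].append(yi)
sspe = sum(
    sum((v - sum(vals) / len(vals)) ** 2 for v in vals)
    for vals in groups.values()
)

n, c_levels, p = len(y), len(groups), 3
sslf = sse - sspe
f_star = (sslf / (c_levels - p)) / (sspe / (n - c_levels))
f_crit = 6.631825  # F(.99; 5, 8), value quoted in the solution

print(f"SSE={sse:.1f} SSPE={sspe:.1f} SSLF={sslf:.1f} F*={f_star:.7f}")
print("reject H0" if f_star > f_crit else "do not reject H0")
```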
REMARK: We did not cover the lack of fit test when we reviewed linear regression in class. You can find
more details about this test in Section 3.7 (page 115) of the book “Applied Linear Regression Models”
(3rd edition) by Neter, Kutner, Nachtsheim and Wasserman.
B. Refer to Brand preference. The diagonal elements of the hat matrix are:
$h_{55} = h_{66} = h_{77} = h_{88} = h_{99} = h_{10,10} = h_{11,11} = h_{12,12} = .137$ and
$h_{11} = h_{22} = h_{33} = h_{44} = h_{13,13} = h_{14,14} = h_{15,15} = h_{16,16} = .237$.
i:     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
Xi1:   4    4    4    4    6    6    6    6    8    8    8    8   10   10   10   10
Xi2:   2    4    2    4    2    4    2    4    2    4    2    4    2    4    2    4
Yi:   64   73   61   76   72   80   71   83   83   89   86   93   88   95   94  100
a. Explain the reason for the pattern in the diagonal elements of the hat matrix.
i:         1       2       3       4       5       6       7       8
hii:  0.2375  0.2375  0.2375  0.2375  0.1375  0.1375  0.1375  0.1375
i:         9      10      11      12      13      14      15      16
hii:  0.1375  0.1375  0.1375  0.1375  0.2375  0.2375  0.2375  0.2375
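The diagonal element hii of the hat matrix measures the leverage of case i, which depends only on how far
its X values lie from the center of the X data. Here X2 is always one unit from its mean of 3, while X1 is
three units from its mean of 7 for cases 1-4 and 13-16 (X1 = 4 or 10) but only one unit away for cases 5-12
(X1 = 6 or 8). This gives two groups of cases with equal leverage, the larger value 0.2375 belonging to the
extreme X1 levels. Hence no outliers can be identified from the diagonal of the hat matrix in this case.
Also, the hii sum to p = 3, the number of unknown regression parameters.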
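The leverages can be recomputed from the design alone as hii = xi'(X'X)^(-1)xi. The sketch below does this with the standard library only; `solve` is an illustrative helper, and each hii is obtained by solving (X'X)z = xi rather than forming the inverse explicitly.

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


X = [[1.0, a, c] for a, c in zip(x1, x2)]
n, p = len(X), 3
xtx = [[sum(r[a] * r[c] for r in X) for c in range(p)] for a in range(p)]

# h_ii = x_i' z, where z solves (X'X) z = x_i.
h = [sum(a * z for a, z in zip(row, solve(xtx, row))) for row in X]

for i, hii in enumerate(h, start=1):
    print(i, round(hii, 4))
print("sum of h_ii:", round(sum(h), 4), " cutoff 2p/n:", 2 * p / n)
```

Note that Y never enters the calculation, which is why the pattern in the hii mirrors the layout of the X values.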
b. According to the rule of thumb stated in the chapter, are any of the observations outlying
with regard to their X values?
The rule of thumb treats cases with a hat diagonal greater than 2p/n as high-leverage points. (When p is
small, a hat diagonal greater than .2, or .5, is sometimes used instead.) Here 2p/n = 6/16 = 3/8 = 0.375
and the largest hii is 0.2375, so by this rule no observations are outlying with regard to their X values.
c. Obtain the studentized deleted residuals and identify any outlying Y observations.
The formula for the studentized deleted residual is:
$$t_i = \frac{e_i}{\sqrt{MSE_{(i)}(1 - h_{ii})}} = e_i \left[ \frac{n - p - 1}{SSE(1 - h_{ii}) - e_i^2} \right]^{1/2}$$
This gives the following table of the studentized deleted residuals for each of the values of Y:
i:        1      2      3      4      5      6      7      8
Yi:      64     73     61     76     72     80     71     83
ti:    -.04    .06  -1.36   1.39   -.37   -.66   -.77    .50
|ti|:   .04    .06   1.36   1.39    .37    .66    .77    .50
i:        9     10     11     12     13     14     15     16
Yi:      83     89     86     93     88     95     94    100
ti:     .47   -.60   1.82    .98  -1.14  -2.10   1.49    .25
|ti|:   .47    .60   1.82    .98   1.14   2.10   1.49    .25
The studentized deleted residual that is largest in absolute value belongs to case 14 (|t14| = 2.10),
followed by cases 11 and 15. However, the cut-off point used here is t(.975; 12) = 2.1788, and none of the
|ti| exceeds this value, so no case is identified as an outlying Y observation.
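The table can be reproduced from the second form of the formula, which needs only the ordinary residuals, SSE, and the leverages. The sketch below takes the fitted coefficients and the hii values from the earlier parts as given; the cut-off 2.1788 is the value quoted in the solution.

```python
import math

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]

b0, b1, b2 = 37.65, 4.425, 4.375                 # fitted coefficients
h = [0.2375] * 4 + [0.1375] * 8 + [0.2375] * 4   # hat matrix diagonal
n, p = len(y), 3

# Ordinary residuals and error sum of squares.
e = [yi - (b0 + b1 * a + b2 * c) for a, c, yi in zip(x1, x2, y)]
sse = sum(r * r for r in e)

# Studentized deleted residuals without refitting the model n times.
t = [
    ei * math.sqrt((n - p - 1) / (sse * (1 - hi) - ei ** 2))
    for ei, hi in zip(e, h)
]

t_crit = 2.1788  # t(.975; 12), value quoted in the solution
worst = max(range(n), key=lambda i: abs(t[i])) + 1
for i, ti in enumerate(t, start=1):
    print(i, round(ti, 2), "outlying" if abs(ti) > t_crit else "")
print("largest |t_i| at case", worst)
```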
d. Case 14 appears to be a borderline outlying Y observation. Obtain the DFFITS,
DFBETAS, and Cook’s distance values for this case to assess its influence. What do you
conclude?
To determine whether this borderline case is actually influential, calculate the following measures:
$$DFFITS_i = t_i \left( \frac{h_{ii}}{1 - h_{ii}} \right)^{1/2} = -2.1 \times \sqrt{\frac{0.2375}{0.7625}} = -1.17353123$$
(the numerical value uses the unrounded $t_{14} = -2.1027$).
Here |DFFITS| slightly exceeds 1, the guideline for small data sets (it also exceeds 2*sqrt(p/n) = 0.87,
the guideline for larger data sets), so we might consider the observation to be influential.
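$$DFBETAS_{k(i)} = \frac{\hat{\beta}_k - \hat{\beta}_{k(i)}}{\sqrt{MSE_{(i)} c_{kk}}}$$
where $c_{kk}$ is the k-th diagonal element of $(X'X)^{-1}$. For case 14 the values for the three coefficients are
$\beta_0$: 0.83881
$\beta_1$: -0.8077
$\beta_2$: -0.6020,
and they are the largest in absolute value among all observations, although by the rule of thumb
(|DFBETAS| > 1 for small data sets) none of them flags case 14 as a potentially influential point.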
$$D_i = \frac{e_i^2}{p \cdot MSE} \left[ \frac{h_{ii}}{(1 - h_{ii})^2} \right] = 0.363412$$
On its own this value of Cook’s distance does not say much, but the graph of Cook’s distance for every
observation (part f) shows that it is the largest. However, it is less than qf(.9, 3, 13), so by the rule of
thumb case 14 is not considered an influential point.
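The three influence measures for case 14 can be verified with the sketch below. It refits the model with case 14 deleted to get the DFBETAS; the helper names `solve` and `fit` are illustrative, and only the standard library is used.

```python
import math

x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(rows, ys):
    """Return (coefficients, X'X) for a least squares fit."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, ys)) for a in range(p)]
    return solve(xtx, xty), xtx


X = [[1.0, a, c] for a, c in zip(x1, x2)]
n, p, case = len(y), 3, 13          # case 14 is index 13

coef, xtx = fit(X, y)
resid = [yi - sum(bk * xk for bk, xk in zip(coef, row)) for row, yi in zip(X, y)]
sse = sum(r * r for r in resid)
mse = sse / (n - p)
e = resid[case]
h = sum(a * z for a, z in zip(X[case], solve(xtx, X[case])))

# MSE with case 14 deleted, then t, DFFITS and Cook's distance.
mse_del = (sse - e * e / (1 - h)) / (n - p - 1)
t = e / math.sqrt(mse_del * (1 - h))
dffits = t * math.sqrt(h / (1 - h))
cooks_d = e * e / (p * mse) * h / (1 - h) ** 2

# DFBETAS: coefficient change after deleting case 14, scaled by sqrt(MSE_(i) c_kk).
coef_del, _ = fit(X[:case] + X[case + 1:], y[:case] + y[case + 1:])
c_diag = [
    solve(xtx, [1.0 if j == k else 0.0 for j in range(p)])[k] for k in range(p)
]
dfbetas = [
    (bk - bk_del) / math.sqrt(mse_del * ckk)
    for bk, bk_del, ckk in zip(coef, coef_del, c_diag)
]

print("DFFITS:", round(dffits, 5))
print("DFBETAS:", [round(v, 4) for v in dfbetas])
print("Cook's D:", round(cooks_d, 6))
```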
e. Calculate the average absolute percent difference in the fitted values with and without
case 14. What does this measure indicate about the influence of case 14?
$$\frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{Y}_{i(14)} - \hat{Y}_i}{\hat{Y}_i} \right| \times 100 = 0.677679\%$$
so the effect of case 14 on inferences is not large, and therefore this case should be kept.
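This measure can be checked by fitting the model twice, with and without case 14, and comparing the 16 fitted values. The sketch below uses illustrative helper names (`solve`, `fit`, `predict`) and the standard library only.

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit(rows, ys):
    """Least squares coefficients from the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[a] * r[c] for r in rows) for c in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, ys)) for a in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    return sum(bk * xk for bk, xk in zip(coef, row))


X = [[1.0, a, c] for a, c in zip(x1, x2)]
case = 13  # case 14 is index 13

full = fit(X, y)
reduced = fit(X[:case] + X[case + 1:], y[:case] + y[case + 1:])

# Percent change in each of the 16 fitted values when case 14 is dropped.
pct = [
    abs((predict(reduced, row) - predict(full, row)) / predict(full, row)) * 100
    for row in X
]
avg_pct = sum(pct) / len(pct)
print(f"average absolute percent difference: {avg_pct:.6f}%")
```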
f. Calculate Cook’s distance $D_i$ for each case. Are any cases influential according to this
measure?
Cook’s distance:
i:         1        2        3        4        5        6        7        8
Di:  0.00019   0.0004   0.1804   0.1863   0.0077   0.0245   0.0323  0.01435
i:         9       10       11       12       13       14       15       16
Di:   0.0122   0.0204   0.1498   0.0510   0.1318   0.3634  0.21067   0.0068
[Graph: index plot of Cook’s distance Di for cases 1-16]
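The whole table of Cook's distances follows from the residuals and leverages already obtained. The sketch below takes the fitted coefficients and hii values from the earlier parts as given and applies the formula for Di to every case.

```python
x1 = [4, 4, 4, 4, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10]
x2 = [2, 4] * 8
y = [64, 73, 61, 76, 72, 80, 71, 83, 83, 89, 86, 93, 88, 95, 94, 100]

b0, b1, b2 = 37.65, 4.425, 4.375                 # fitted coefficients
h = [0.2375] * 4 + [0.1375] * 8 + [0.2375] * 4   # hat matrix diagonal
n, p = len(y), 3

e = [yi - (b0 + b1 * a + b2 * c) for a, c, yi in zip(x1, x2, y)]
mse = sum(r * r for r in e) / (n - p)

# D_i = e_i^2 / (p * MSE) * h_ii / (1 - h_ii)^2
cooks = [ei ** 2 / (p * mse) * hi / (1 - hi) ** 2 for ei, hi in zip(e, h)]

for i, d in enumerate(cooks, start=1):
    print(i, round(d, 5))
largest_case = max(range(n), key=lambda i: cooks[i]) + 1
print("largest Cook's distance at case", largest_case)
```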
Problem 3.
A. Car purchase. A marketing research firm was engaged by an automobile manufacturer to
conduct a pilot study to examine the feasibility of using logistic regression for ascertaining the
likelihood that a family will purchase a new car during the next year. A random sample of 33
suburban families was selected. Data on annual family income (X1, in thousand dollars) and the
current age of the oldest family automobile (X2, in years) were obtained. A follow-up interview
conducted 12 months later was used to determine whether the family actually purchased a new
car (Y = 1) or did not purchase a new car (Y = 0) during the year.
i:      1    2    3   ...   31   32   33
Xi1:   32   45   60   ...   21   32   17
Xi2:    3    2    2   ...    3    5    1
Yi:     0    0    1   ...    0    1    0
Multiple logistic regression model with two predictor variables in first-order terms is assumed to
be appropriate.
a. Find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\beta_2$. State the fitted response function.
P(new.car = 1) =exp(b0 + b1*income + b2*car.age)/[1 + exp(b0 + b1*income + b2*car.age)]
Where:
b0 = -4.73931
b1 = 0.06773
b2 = 0.59863
b. Obtain exp(b1) and exp(b2) and interpret these numbers.
With a one-unit increase in income or in the age of the oldest car, holding the other predictor fixed, the
estimated odds of a purchase are multiplied by exp(b1) = 1.070079093 or exp(b2) = 1.819627221, respectively.
That is, the odds of buying a new car increase by about 7% for every additional $1,000 of income and by
about 82% for every additional year of age of the oldest car.
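As a quick check of the odds ratios, the sketch below exponentiates the rounded estimates reported in part (a). Because the inputs are rounded, the results agree with the values above only to about five decimal places.

```python
import math

# Rounded maximum likelihood estimates from part (a).
b1 = 0.06773   # income, in thousand dollars
b2 = 0.59863   # age of oldest car, in years

odds_ratio_income = math.exp(b1)
odds_ratio_age = math.exp(b2)

# Percent change in the odds per one-unit increase in each predictor.
pct_income = (odds_ratio_income - 1) * 100
pct_age = (odds_ratio_age - 1) * 100

print(f"exp(b1) = {odds_ratio_income:.5f}, odds change {pct_income:.1f}%")
print(f"exp(b2) = {odds_ratio_age:.5f}, odds change {pct_age:.1f}%")
```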
c. What is the estimated probability that a family with annual income of $50 thousand and an
oldest car of 3 years will purchase a new car next year?
Using the fitted model with income = 50 and car.age = 3, the estimated probability that the family
purchases a new car in the next year is 0.6090245.
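The prediction can be reproduced by plugging the values into the fitted response function. The sketch below uses the rounded coefficients from part (a), so it matches the reported probability to about four decimal places; the function name `purchase_probability` is illustrative.

```python
import math

# Rounded maximum likelihood estimates from part (a).
b0, b1, b2 = -4.73931, 0.06773, 0.59863


def purchase_probability(income, car_age):
    """Fitted logistic response: income in thousand dollars, car_age in years."""
    logit = b0 + b1 * income + b2 * car_age
    return 1.0 / (1.0 + math.exp(-logit))


prob = purchase_probability(50, 3)
print(f"estimated probability: {prob:.4f}")
```

The form 1/(1 + exp(-logit)) is algebraically the same as exp(logit)/[1 + exp(logit)] used in part (a).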
B. Refer to Car purchase
a. To assess the appropriateness of the logistic regression function, form three groups of 11
cases each according to their fitted logit values $\hat{\pi}'$. Plot the estimated proportions pj against the
midpoints of the $\hat{\pi}'$ intervals. Is the plot consistent with a response function of monotonic
sigmoidal shape? Explain.
The cutpoints that divide the fitted values into three equal groups are 0.25110785 and 0.59013409. Each
group contains 11 observations, of which 3, 2 and 9 respectively have Y=1, so the estimated proportions
are 3/11 = 0.27, 2/11 = 0.18 and 9/11 = 0.82. This pattern is not monotonic and does not look sigmoidal,
but with only 3 intervals the plot gives limited information.
c. Obtain the deviance residuals and present them in an index plot. Do there appear to be any
outlying cases?
There are no extreme outlying cases seen on the plot below:
d. Construct a half-normal probability plot of the absolute deviance residuals. Do any cases here
appear to be outlying?
None of them appear to be outside the simulated envelope.
Additional Problem
Prove the two equations in Remark 3 on page 8 of the lecture notes on GLM
[Hint: Consider the expectations of the first and the second derivatives (w.r.t. theta_i) of the loglikelihood function. ]
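First, one can prove the following equations
$$E\left[\frac{\partial l}{\partial \theta_i}\right] = 0, \qquad E\left[\frac{\partial^2 l}{\partial \theta_i^2}\right] + E\left[\left(\frac{\partial l}{\partial \theta_i}\right)^2\right] = 0,$$
where $l(\theta_i, \phi \mid y_i) = \log f(y_i \mid \theta_i, \phi)$ for any density function $f(y_i \mid \theta_i, \phi)$.
Now in our case, note that the density function is
$$f(y_i \mid \theta_i, \phi) = \exp\left\{ \frac{A_i}{\phi}\left[ y_i \theta_i - \gamma(\theta_i) \right] + \tau\left(y_i, \frac{\phi}{A_i}\right) \right\}.$$
From the first equation, we have
$$E\left[\frac{\partial l}{\partial \theta_i}\right] = E\left[\frac{A_i}{\phi}\left(y_i - \gamma'(\theta_i)\right)\right] = \frac{A_i}{\phi}\left(E[y_i] - \gamma'(\theta_i)\right) = 0.$$
Hence
$$E[y_i] = \gamma'(\theta_i).$$
From the second equation,
$$E\left[\frac{\partial^2 l}{\partial \theta_i^2}\right] + E\left[\left(\frac{\partial l}{\partial \theta_i}\right)^2\right] = E\left[-\frac{A_i}{\phi}\gamma''(\theta_i)\right] + E\left[\left(\frac{A_i}{\phi}\left(y_i - \gamma'(\theta_i)\right)\right)^2\right] = 0.$$
So,
$$-\frac{A_i}{\phi}\gamma''(\theta_i) + \left(\frac{A_i}{\phi}\right)^2 E\left[\left(y_i - E[y_i]\right)^2\right] = -\frac{A_i}{\phi}\gamma''(\theta_i) + \left(\frac{A_i}{\phi}\right)^2 Var(y_i) = 0,$$
therefore,
$$Var(y_i) = \frac{\phi}{A_i}\gamma''(\theta_i).$$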
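As a numerical sanity check of the two results (the mean equals the first derivative of the cumulant function, and the variance equals the dispersion divided by the weight times its second derivative), the sketch below uses the Poisson distribution. The setup is an assumption made for illustration only: weight A_i = 1, dispersion 1, canonical parameter theta = log(mu), cumulant function exp(theta) (written `gamma` here), and an arbitrary mean mu = 3.5.

```python
import math

mu = 3.5                  # hypothetical Poisson mean, chosen for illustration
theta = math.log(mu)      # canonical parameter of the Poisson family

# For the Poisson family gamma(theta) = exp(theta), so both derivatives
# of gamma are exp(theta) as well.
gamma_prime = math.exp(theta)
gamma_double_prime = math.exp(theta)

# Mean and variance by direct summation of the Poisson pmf.
# Terms beyond k = 100 are negligible for mu = 3.5.
pmf = [math.exp(-mu) * mu ** k / math.factorial(k) for k in range(101)]
mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print("E[y]   =", round(mean, 6), " gamma'(theta)  =", round(gamma_prime, 6))
print("Var(y) =", round(var, 6), " gamma''(theta) =", round(gamma_double_prime, 6))
```

This checks one member of the family only; it illustrates the identities and is no substitute for the proof.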