Topic 8 - Pegasus @ UCF

Lecture & Examples
Topic 8: Models with Qualitative Independent Variables
Model with One Qualitative Independent Variable with k Levels:
Suppose we want to develop a model for the mean yield
per acre, E(y), of four different varieties of snow peas (A,
B, C, and D). Notice that we cannot assign a quantitative
measure for a given variety of snow pea. Although we
can assign 1, 2, 3, and 4 to these four varieties of snow
peas, these numbers have no meaningful quantitative
interpretation. To solve this problem, we introduce the
concept of a dummy variable. Let
$x_1 = 1$ if the snow pea is variety A, $0$ if it is any other variety
$x_2 = 1$ if the snow pea is variety B, $0$ if it is any other variety
$x_3 = 1$ if the snow pea is variety C, $0$ if it is any other variety
Then, we can write the following model equation:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$.
Suppose that $\mu_A$, $\mu_B$, $\mu_C$, and $\mu_D$ are the mean yields for
varieties A, B, C, and D, respectively. We can express the mean yield of
variety B by setting the dummy variables $x_1$, $x_2$, and $x_3$ to the
values that identify it. Using $x_1 = 0$, $x_2 = 1$, and $x_3 = 0$ gives
$\mu_B = E(y) = \beta_0 + \beta_1(0) + \beta_2(1) + \beta_3(0) = \beta_0 + \beta_2$.
Similarly, $\mu_A = \beta_0 + \beta_1$, $\mu_C = \beta_0 + \beta_3$, and
$\mu_D = \beta_0$.
In general, we can write the model with one qualitative
independent variable with k levels as follows:
Step 1: Use $k - 1$ dummy variables.
Step 2: Let $x_i$ be the dummy variable for level $i$, for $i = 1$
to $k - 1$.
Step 3: The model equation is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{k-1} x_{k-1} + \varepsilon$,
where $x_i = 1$ if $y$ is observed at level $i$ and $x_i = 0$ otherwise.

Step 4: The unknown parameters and the mean effect of
each level have the following relationship:
$\mu_1 = \beta_0 + \beta_1$
$\mu_2 = \beta_0 + \beta_2$
$\mu_3 = \beta_0 + \beta_3$
$\vdots$
$\mu_{k-1} = \beta_0 + \beta_{k-1}$
$\mu_k = \beta_0$.
Also, we have the following relationship:
$\beta_0 = \mu_k$
$\beta_1 = \mu_1 - \mu_k$
$\beta_2 = \mu_2 - \mu_k$
$\beta_3 = \mu_3 - \mu_k$
$\vdots$
$\beta_{k-1} = \mu_{k-1} - \mu_k$.
Step 5: The assumptions about the error terms for a
model with qualitative independent variables are similar
to the assumptions for a model with quantitative
independent variables.
• $E(\varepsilon) = 0$;
• $\mathrm{Var}(\varepsilon) = \sigma^2$;
• The error for each observation comes from a normal
population;
• Error terms are independent.
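The relationships in Step 4 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names (`dummy_row`, `means_from_betas`, `betas_from_means`) and the numeric level means are hypothetical, chosen only to show that the two sets of equations are inverses of each other.

```python
def dummy_row(level, k):
    """Return [x1, ..., x_{k-1}] for a 1-based level; level k is the base level."""
    return [1 if level == i else 0 for i in range(1, k)]


def means_from_betas(betas):
    """Map [b0, b1, ..., b_{k-1}] to the level means [mu_1, ..., mu_k]."""
    k = len(betas)
    means = []
    for level in range(1, k + 1):
        x = dummy_row(level, k)
        # E(y) = b0 + b1*x1 + ... + b_{k-1}*x_{k-1}
        means.append(betas[0] + sum(b * xi for b, xi in zip(betas[1:], x)))
    return means


def betas_from_means(means):
    """Map [mu_1, ..., mu_k] to [b0, b1, ..., b_{k-1}] with b0 = mu_k."""
    base = means[-1]
    return [base] + [m - base for m in means[:-1]]


# Hypothetical level means, used only for illustration.
example_means = [12.0, 15.0, 9.0, 10.0]
example_betas = betas_from_means(example_means)
print(example_betas)                    # [10.0, 2.0, 5.0, -1.0]
print(means_from_betas(example_betas))  # [12.0, 15.0, 9.0, 10.0]
```

The last level carries no dummy variable, so its mean is the intercept and every other coefficient is a difference from that base mean.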
Example 12.15:
The following model was used to relate E(y) to a single
qualitative variable with four levels:
$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$
where
$x_1 = 1$ if the first level, $0$ if any other level
$x_2 = 1$ if the second level, $0$ if any other level
$x_3 = 1$ if the third level, $0$ if any other level
The model was fit to n = 40 observations, and the least squares
prediction equation is
$\hat{y} = 87 + 63 x_1 + 45 x_2 + 57 x_3$.
(a) Use the least squares prediction equation to find the
estimate of E(y) for each level of the qualitative
independent variable.
Solution:
$\hat{\mu}_1 = \hat{\beta}_0 + \hat{\beta}_1 = 87 + 63 = 150$
$\hat{\mu}_2 = \hat{\beta}_0 + \hat{\beta}_2 = 87 + 45 = 132$
$\hat{\mu}_3 = \hat{\beta}_0 + \hat{\beta}_3 = 87 + 57 = 144$
$\hat{\mu}_4 = \hat{\beta}_0 = 87$
(b) Specify the null and alternative hypotheses you would
use to test whether E(y) is the same for all four levels of
the independent variable.
Solution:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$
$H_a$: at least one $\beta_i \neq 0$
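As a quick check of part (a) of Example 12.15, the prediction equation can be evaluated at the dummy settings for each level. This is a small sketch; the names `dummy_settings` and `predict` are illustrative and not from the original notes.

```python
# Coefficients of the fitted equation y-hat = 87 + 63*x1 + 45*x2 + 57*x3.
b0, b1, b2, b3 = 87, 63, 45, 57

# Dummy settings (x1, x2, x3) for each level; level 4 is the base level.
dummy_settings = {
    1: (1, 0, 0),
    2: (0, 1, 0),
    3: (0, 0, 1),
    4: (0, 0, 0),
}


def predict(x1, x2, x3):
    """Evaluate the least squares prediction equation."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x3


level_means = {level: predict(*x) for level, x in dummy_settings.items()}
print(level_means)  # {1: 150, 2: 132, 3: 144, 4: 87}
```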
Example 12.16:
A large company in Iowa is currently investigating five
varieties of snow peas. The yields produced from each
plot are shown in Table 12.13.
Table 12.13 Data for Example 12.16
Variety A   Variety B   Variety C   Variety D   Variety E
26.2        29.2        29.1        21.3        20.1
24.3        28.1        30.8        22.4        19.3
21.8        27.3        33.9        24.3        19.9
28.1        31.2        32.8        21.8        22.1
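We define the dummy variables as follows, with variety E as the base level:
$x_1 = 1$ for variety A, $0$ otherwise
$x_2 = 1$ for variety B, $0$ otherwise
$x_3 = 1$ for variety C, $0$ otherwise
$x_4 = 1$ for variety D, $0$ otherwise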
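With only dummy variables for a single factor in the model, the least squares estimates reduce to sample means: the intercept is the sample mean of the base level, and each dummy coefficient is the difference between that level's sample mean and the base mean. The sketch below uses this shortcut on the Table 12.13 data instead of a general regression routine; the variable names are illustrative.

```python
from statistics import mean

# Yields from Table 12.13, one list per variety.
yields = {
    "A": [26.2, 24.3, 21.8, 28.1],
    "B": [29.2, 28.1, 27.3, 31.2],
    "C": [29.1, 30.8, 33.9, 32.8],
    "D": [21.3, 22.4, 24.3, 21.8],
    "E": [20.1, 19.3, 19.9, 22.1],
}

base = "E"  # the variety with no dummy variable
group_means = {variety: mean(ys) for variety, ys in yields.items()}

# Intercept = base-level mean; each coefficient = level mean minus base mean.
b0 = group_means[base]
betas = {v: group_means[v] - b0 for v in yields if v != base}

print(round(b0, 2))                                 # 20.35
print({v: round(b, 2) for v, b in betas.items()})   # {'A': 4.75, 'B': 8.6, 'C': 11.3, 'D': 2.1}
```

These values agree with the parameter estimates reported in the SAS regression printout for this example.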
SAS Printout analysis with Regression
Model: MODEL1
Dependent Variable: Y

Analysis of Variance
                      Sum of        Mean
Source      DF       Squares      Square    F Value    Prob>F
Model        4     342.04000    85.51000     23.966    0.0001
Error       15      53.52000     3.56800
C Total     19     395.56000

Root MSE     1.88892    R-square    0.8647
Dep Mean    25.70000    Adj R-sq    0.8286
C.V.         7.34986

Parameter Estimates
             Parameter      Standard      T for H0:
Variable      Estimate         Error    Parameter=0    Prob > |T|
INTERCEP     20.350000    0.94445752         21.547        0.0001
X1            4.750000    1.33566463          3.556        0.0029
X2            8.600000    1.33566463          6.439        0.0001
X3           11.300000    1.33566463          8.460        0.0001
X4            2.100000    1.33566463          1.572        0.1367
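The summary figures in the Analysis of Variance part of the printout can be reproduced directly from Table 12.13. The sketch below relies on the fact that, in this dummy-variable model, the fitted value for every observation is its variety mean; the variable names are illustrative, not SAS output names.

```python
from math import sqrt

# Yields from Table 12.13, one list per variety.
yields = {
    "A": [26.2, 24.3, 21.8, 28.1],
    "B": [29.2, 28.1, 27.3, 31.2],
    "C": [29.1, 30.8, 33.9, 32.8],
    "D": [21.3, 22.4, 24.3, 21.8],
    "E": [20.1, 19.3, 19.9, 22.1],
}

all_y = [y for ys in yields.values() for y in ys]
n, k = len(all_y), len(yields)
grand_mean = sum(all_y) / n

# Error SS: squared deviations of each yield from its own variety mean.
sse = sum((y - sum(ys) / len(ys)) ** 2 for ys in yields.values() for y in ys)
# Total SS: squared deviations from the grand mean.
sst = sum((y - grand_mean) ** 2 for y in all_y)
ssr = sst - sse  # Model SS

df_model, df_error = k - 1, n - k
msr, mse = ssr / df_model, sse / df_error
f_value = msr / mse
r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / df_error
root_mse = sqrt(mse)

print(round(ssr, 2), round(sse, 2), round(sst, 2))  # 342.04 53.52 395.56
print(round(f_value, 3), round(r2, 4), round(adj_r2, 4), round(root_mse, 4))
# 23.966 0.8647 0.8286 1.8889
```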
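(a) Find $\mu_A$, $\mu_B$, $\mu_C$, $\mu_D$, and $\mu_E$.
Solution (using the parameter estimates from the printout):
$\hat{\mu}_A = \hat{\beta}_0 + \hat{\beta}_1 = 20.35 + 4.75 = 25.10$
$\hat{\mu}_B = \hat{\beta}_0 + \hat{\beta}_2 = 20.35 + 8.60 = 28.95$
$\hat{\mu}_C = \hat{\beta}_0 + \hat{\beta}_3 = 20.35 + 11.30 = 31.65$
$\hat{\mu}_D = \hat{\beta}_0 + \hat{\beta}_4 = 20.35 + 2.10 = 22.45$
$\hat{\mu}_E = \hat{\beta}_0 = 20.35$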
(b) Report the least-squares prediction model from the
SAS printout with regression analysis.
Solution:
$\hat{y} = 20.35 + 4.75 x_1 + 8.60 x_2 + 11.30 x_3 + 2.10 x_4$
(c) What null and alternative hypotheses are tested by the
global F-test for this model? Interpret the hypotheses
both in terms of the $\beta$ coefficients and the mean yields for
the five varieties of peas.
Solution:
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_a$: at least one $\beta_i \neq 0$
or
$H_0: \mu_A = \mu_B = \mu_C = \mu_D = \mu_E$
$H_a$: at least one $\mu_i \neq \mu_j$
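The two statements of the hypotheses are equivalent because, with variety E as the base level, each coefficient is a difference from the mean of variety E. A short derivation, following the Step 4 relationships:

```latex
\begin{aligned}
\beta_1 &= \mu_A - \mu_E, \quad
\beta_2 = \mu_B - \mu_E, \quad
\beta_3 = \mu_C - \mu_E, \quad
\beta_4 = \mu_D - \mu_E, \\
\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0
&\iff \mu_A = \mu_B = \mu_C = \mu_D = \mu_E .
\end{aligned}
```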
(d) Test the hypotheses in part (c) at $\alpha = 0.05$.
Solution:
Test Statistic: $F_c = 23.966$
Rejection Region: $F > F_{0.05;\,4,\,15} = 3.06$
Since $F_c = 23.966 > 3.06$, we reject the null hypothesis and conclude
that at least two of the mean yields differ.
(e) Place a 95% confidence interval on the difference
between the mean yields of varieties D and E.
Solution:
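Because $\mu_D - \mu_E = \beta_4$, the interval is
95% confidence interval $= \hat{\beta}_4 \pm t_{0.025,15} \, s_{\hat{\beta}_4}$
$= 2.10 \pm 2.131 \times 1.33566463$
$= [-0.75, 4.95]$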
(f) Place a 95% confidence interval on the difference
between the mean yields of varieties D and A.
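Note:
(1) $\mu_D - \mu_A = (\beta_0 + \beta_4) - (\beta_0 + \beta_1) = \beta_4 - \beta_1$
(2) $s_{\bar{x}_D - \bar{x}_A} = s \sqrt{\frac{1}{n_D} + \frac{1}{n_A}} = 1.88892 \sqrt{\frac{1}{4} + \frac{1}{4}} = 1.336$
95% confidence interval $= (\hat{\beta}_4 - \hat{\beta}_1) \pm t_{0.025,15} \, s_{\bar{x}_D - \bar{x}_A}$
$= (2.10 - 4.75) \pm 2.131 \times 1.336$
$= [-5.497, 0.197]$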
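The intervals in parts (e) and (f) use the same standard error, $s\sqrt{1/n_i + 1/n_j}$, which with four plots per variety also matches the standard error SAS reports for each dummy coefficient. The sketch below recomputes both intervals; `diff_ci` is a hypothetical helper, and the t value 2.131 is taken from the text rather than computed. Endpoints can differ from the hand calculation in the last digit because the text rounds the standard error to 1.336.

```python
from math import sqrt

root_mse = 1.88892  # s, from the regression printout
t_crit = 2.131      # t_{0.025, 15}, as quoted in the text


def diff_ci(estimate, n1, n2, s=root_mse, t=t_crit):
    """Return (lower, upper, standard error) for a difference of two level means."""
    se = s * sqrt(1 / n1 + 1 / n2)
    half_width = t * se
    return estimate - half_width, estimate + half_width, se


# Part (e): mu_D - mu_E is estimated by beta4-hat = 2.10.
lo_de, hi_de, se_de = diff_ci(2.10, 4, 4)
# Part (f): mu_D - mu_A is estimated by beta4-hat - beta1-hat = 2.10 - 4.75.
lo_da, hi_da, se_da = diff_ci(2.10 - 4.75, 4, 4)

print(round(se_de, 4))                   # 1.3357
print(round(lo_de, 3), round(hi_de, 3))  # -0.746 4.946
print(round(lo_da, 3), round(hi_da, 3))  # -5.496 0.196
```

Both intervals contain zero, so neither difference is significant at the 5% level on its own.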
SAS Printout Analysis with Completely Randomized Design

Analysis of Variance Procedure
Dependent Variable: Y
                                Sum of            Mean
Source             DF          Squares          Square    F Value    Pr > F
Model               4     342.04000000     85.51000000      23.97    0.0001
Error              15      53.52000000      3.56800000
Corrected Total    19     395.56000000

R-Square        C.V.        Root MSE       Y Mean
0.864698    7.349864       1.8889150    25.700000

Source      DF        Anova SS     Mean Square    F Value    Pr > F
VARIETY      4    342.04000000     85.51000000      23.97    0.0001

Analysis of Variance Procedure
Level of          --------------Y--------------
VARIETY     N           Mean              SD
1           4     25.1000000      2.69196335
2           4     28.9500000      1.69016764
3           4     31.6500000      2.12994523
4           4     22.4500000      1.31275791
5           4     20.3500000      1.21518174
(g) What are the null and alternative hypotheses tested by
the above SAS Printout?
Solution:
$H_0: \mu_A = \mu_B = \mu_C = \mu_D = \mu_E$
$H_a$: at least one $\mu_i \neq \mu_j$
Test Statistic: $F_c = 23.97$
Rejection Region: $F > 3.06$
Since $F_c = 23.97 > 3.06$, we reject the null hypothesis and conclude
that at least two of the mean yields differ.
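The F value in the completely randomized design printout can also be recovered from the table of level means and standard deviations alone. The sketch below assumes the usual one-way decomposition (between-level and within-level sums of squares); the names are illustrative, and the critical value 3.06 is the tabled value quoted in the text.

```python
# (N, mean, SD) for each VARIETY level, copied from the SAS means table.
levels = {
    1: (4, 25.10, 2.69196335),
    2: (4, 28.95, 1.69016764),
    3: (4, 31.65, 2.12994523),
    4: (4, 22.45, 1.31275791),
    5: (4, 20.35, 1.21518174),
}

n_total = sum(n for n, _, _ in levels.values())
k = len(levels)
grand_mean = sum(n * m for n, m, _ in levels.values()) / n_total

# Between-level SS uses the level means; within-level SS uses the level SDs.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in levels.values())
ss_within = sum((n - 1) * sd ** 2 for n, _, sd in levels.values())

f_value = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_crit = 3.06  # tabled F for alpha = 0.05 with (4, 15) degrees of freedom

print(round(ss_between, 2), round(ss_within, 2), round(f_value, 2))  # 342.04 53.52 23.97
print(f_value > f_crit)  # True
```

The result matches the regression analysis, which is expected: the dummy-variable regression and the one-way analysis of variance are two ways of fitting the same model.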