Lesson 21

Chapter 8
Polynomial Regression
- One covariate, 2nd order: Yi = Bo + B1X1 + B2X1² + ei (the shape is a parabola), where X is usually centered, Xi − X̄, since this reduces multicollinearity: X1 and X1² are often highly correlated.
- Notation that your book uses: Yi = Bo + B1X1 + B11X1² + ei, where the subscripts reflect the pattern of exponents. For example, B1 would be the coefficient for X1; B11 the coefficient for X1²; B111 the coefficient for X1³; and B22 the coefficient for X2².
- One covariate, 3rd order: Yi = Bo + B1X1 + B11X1² + B111X1³ + ei
- Two covariates, 2nd order: Yi = Bo + B1X1 + B2X2 + B11X1² + B22X2² + B12X1X2 + ei, where X1 = Xi1 − X̄1 and X2 = Xi2 − X̄2
- This expansion continues for more covariates and higher orders.
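The multicollinearity point above is easy to check directly. Below is a minimal Python sketch (variable names are illustrative) showing that the correlation between X and X², which is high for a typical positive-valued predictor, drops to essentially zero once X is centered at its mean:

```python
import numpy as np

# An evenly spaced, positive-valued predictor, e.g. 1..29
x = np.arange(1.0, 30.0)

# Raw X and X^2 are nearly collinear
r_raw = np.corrcoef(x, x**2)[0, 1]

# Center X at its mean; for a design symmetric about its mean,
# the correlation between (X - Xbar) and (X - Xbar)^2 is zero
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc**2)[0, 1]

print(round(r_raw, 3), r_centered)
```

For this symmetric design the raw correlation is about 0.97 while the centered correlation is zero, since the sum of the centered cubes vanishes.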
General Rules:
- For d distinct X-values, a polynomial of order at most d − 1 can be fit. For example, if we have 29 observations on the predictor variable(s), 14 of which are replicates, we are left with only 15 distinct X-values, so a polynomial of order 15 − 1 = 14 could be fit (although fitting such a model would be absurd!).
- For studies, especially in the biological or social sciences, one important consideration is whether the regression relationship can be described by a monotonic function (i.e. one that is always increasing or always decreasing). If only monotonic functions are of interest, a 2nd or 3rd order model usually suffices, although monotonicity is not guaranteed, since some parabolas increase and then decrease. A more general consideration is the number of bends in the polynomial curve one wishes to fit. A 1st order model has zero bends (a straight-line fit); a 2nd order model has at most 1 bend; and each higher-order term adds another potential bend. In practice, then, fitting polynomials of order higher than cubic usually leads to models that are neither always increasing nor always decreasing.
Example: HO 8.1 – 8.3 Minitab Data Set
- The X’s were scaled by subtracting each Xip’s respective mean and then dividing by the absolute difference between adjacent design values. That is, X1 is the scaled result of taking Xi1 − 90 and dividing by 10, and X2 is Xi2 − 55 divided by 5. This produces values of −1, 0, and 1. Using the denominator in this scaling technique is only valid if the variable’s values are evenly spaced; otherwise just center by the mean. You can do this in Minitab via Calc > Standardize. Enter X1_F as the input column and type X1 as Store Results In. Click the radio button for Subtract first, Divide second and enter First: 90, Second: 10. Repeat and edit for X2_psi.
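As a sketch of the same transformation outside Minitab (function and variable names are illustrative), the scaling is just subtract-then-divide:

```python
# Scale a predictor by subtracting a center and dividing by the spacing
# between adjacent design values, as in Minitab's Calc > Standardize
# with "Subtract first, Divide second".
def scale(x, center, spacing):
    return (x - center) / spacing

# Temperatures 80, 90, 100 with center 90 and spacing 10 -> -1, 0, 1
x1 = [scale(t, 90, 10) for t in (80, 90, 100)]
# Pressures 50, 55, 60 with center 55 and spacing 5 -> -1, 0, 1
x2 = [scale(p, 55, 5) for p in (50, 55, 60)]
print(x1, x2)  # [-1.0, 0.0, 1.0] [-1.0, 0.0, 1.0]
```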
- Look at diagnostic plots on 8.2
8.1
Temperature, Pressure and Quality of the Finished Product

x1 °F  x2 psi     y    X1   X2  X1X2  X1SQ  X2SQ
  80     50    50.8   -1   -1    1     1     1
  80     50    50.7   -1   -1    1     1     1
  80     50    49.4   -1   -1    1     1     1
  80     55    93.7   -1    0    0     1     0
  80     55    90.9   -1    0    0     1     0
  80     55    90.9   -1    0    0     1     0
  80     60    74.5   -1    1   -1     1     1
  80     60    73.0   -1    1   -1     1     1
  80     60    71.2   -1    1   -1     1     1
  90     50    63.4    0   -1    0     0     1
  90     50    61.6    0   -1    0     0     1
  90     50    63.4    0   -1    0     0     1
  90     55    93.8    0    0    0     0     0
  90     55    92.1    0    0    0     0     0
  90     55    97.4    0    0    0     0     0
  90     60    70.9    0    1    0     0     1
  90     60    68.8    0    1    0     0     1
  90     60    71.3    0    1    0     0     1
 100     50    46.6    1   -1   -1     1     1
 100     50    49.1    1   -1   -1     1     1
 100     50    46.4    1   -1   -1     1     1
 100     55    69.8    1    0    0     1     0
 100     55    72.5    1    0    0     1     0
 100     55    73.2    1    0    0     1     0
 100     60    38.7    1    1    1     1     1
 100     60    42.5    1    1    1     1     1
 100     60    41.4    1    1    1     1     1
8.2
Regression Analysis: Quality Y versus X1, X2, X1X2, X1SQ, X2SQ
The regression equation is
Quality Y = 94.9 - 9.16 X1 + 3.94 X2 - 7.27 X1X2 - 13.3 X1SQ - 28.6 X2SQ
Analysis of Variance
Source            DF      SS      MS       F      P
Regression         5  8402.3  1680.5  596.32  0.000
Residual Error    21    59.2     2.8
Total             26  8461.4

Source  DF  Seq SS
X1       1  1510.7
X2       1   279.3
X1X2     1   635.1
X1SQ     1  1067.6
X2SQ     1  4909.7
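As a check, the fitted equation above can be reproduced from the 8.1 data by ordinary least squares. Here is a hedged numpy-only sketch (no Minitab), building the coded design matrix by hand; the coefficients come out close to Minitab's 94.9, −9.16, 3.94, −7.27, −13.3, −28.6:

```python
import numpy as np

# Quality y for each (x1 deg F, x2 psi) cell, 3 replicates per cell (HO 8.1)
cells = {
    (80, 50): [50.8, 50.7, 49.4], (80, 55): [93.7, 90.9, 90.9],
    (80, 60): [74.5, 73.0, 71.2], (90, 50): [63.4, 61.6, 63.4],
    (90, 55): [93.8, 92.1, 97.4], (90, 60): [70.9, 68.8, 71.3],
    (100, 50): [46.6, 49.1, 46.4], (100, 55): [69.8, 72.5, 73.2],
    (100, 60): [38.7, 42.5, 41.4],
}

rows, y = [], []
for (t, p), ys in cells.items():
    X1, X2 = (t - 90) / 10, (p - 55) / 5      # scaled predictors
    for q in ys:
        rows.append([1, X1, X2, X1 * X2, X1**2, X2**2])
        y.append(q)

# Least-squares fit of the full second-order model
b, *_ = np.linalg.lstsq(np.array(rows), np.array(y), rcond=None)
print(b)  # intercept, X1, X2, X1X2, X1SQ, X2SQ
```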
[Figure: Scatterplot of RESI1 vs x1 °F, x2 psi, and FITS1: residuals plotted against each predictor and against the fitted values.]

[Figure: Probability Plot of RESI1, Normal - 95% CI. Mean ≈ 0 (−4.1E−14), StDev = 1.509, N = 27, AD = 0.151, P-Value = 0.955.]
8.3
Correlations: compare the correlation between the original variable and its square with the same correlation after the variables are scaled; the scaling reduces the collinearity:

Correlation of x1 °F with (x1 °F)²:  0.999
Correlation of X1 with X1SQ:         0.000
Lack of Fit – Ho: the second-order model is a good fit (no lack of fit).
Analysis of Variance
Source            DF      SS      MS       F      P
Regression         5  8402.3  1680.5  596.32  0.000
Residual Error    21    59.2     2.8
  Lack of Fit      3     8.2     2.7    0.97  0.428
  Pure Error      18    50.9     2.8
Total             26  8461.4

With P = 0.428 we fail to reject Ho and conclude there is no evidence of lack of fit for the second-order model.
Partial F-test – is a first order model sufficient?
Ho: B11 = B12 = B22 = 0 vs Ha: not all = 0
Using the Partial (Seq) SS from the output:

F* = [SSR(x1², x2², x1x2 | x1, x2) / 3] / [SSE(x1, x2, x1², x2², x1x2) / (n − p)]
   = {[SSR(x1²) + SSR(x2²) + SSR(x1x2)] / 3} / MSE(full model)
   = [(1067.6 + 4909.7 + 635.1) / 3] / (59.2 / 21)
   ≈ 782.15

(The numerator values can be found in the Seq SS column; the denominator is just the MSE of the full model.)
The p-value for this F-statistic with df = 3, 21 is ≈ 0.000. Alternatively, you could find the critical value from the F-table (Table B.4, starting on page 1320 in your text) for, say, α = 0.01 with numerator df = 3 and denominator df = 21, which gives a critical F of 4.87. With the p-value less than α = 0.01, or equivalently F* = 782.15 greater than the critical value of 4.87, we would reject Ho and conclude that at least one of B11, B12, B22 ≠ 0.
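The same computation can be done outside Minitab; here is a sketch using scipy only for the F-distribution tail area, with the numbers taken from the Seq SS output above:

```python
from scipy.stats import f

# Extra sum of squares for the three second-order terms (Seq SS output)
ssr_extra = 1067.6 + 4909.7 + 635.1
sse_full, df_error = 59.2, 21          # full-model SSE and its df

F_star = (ssr_extra / 3) / (sse_full / df_error)
p_value = f.sf(F_star, 3, df_error)    # upper-tail area

print(round(F_star, 1), p_value)       # F* near 782, p-value essentially 0
```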
Fitted model in terms of the original X's:

Ŷ = 94.9 − 9.16[(X1 − 90)/10] + 3.94[(X2 − 55)/5] − 7.27[(X1 − 90)/10][(X2 − 55)/5]
      − 13.3[(X1 − 90)/10]² − 28.6[(X2 − 55)/5]²
Interaction
In Chapter 6 we briefly addressed interaction, or cross-product, terms in multiple linear regression. For discussion, say we have two covariates, X1 and X2, with three levels for X2: 1, 2, and 3. Now consider the following graphs and regression equations. The graphs show Y on the vertical axis and X1 on the horizontal axis, and the lines represent the response functions, E(Y), as a function of X1 for each level of X2.
[Figures: three panels showing the response functions for an Additive Model, a Reinforcement Interaction Effect, and an Interference Interaction Effect]
Eq (1) E(Y) = 1 + 2X1 + X2: X2 has no effect on the X1 slope; the response functions are parallel, with equal slopes.
X2 = 1, E(Y) = 2 + 2X1
X2 = 2, E(Y) = 3 + 2X1
X2 = 3, E(Y) = 4 + 2X1
Eq (2) E(Y) = 1 + 2X1 + X2 + X1X2: not parallel; the slopes differ and increase as X2 increases.
X2 = 1, E(Y) = 2 + 3X1
X2 = 2, E(Y) = 3 + 4X1
X2 = 3, E(Y) = 4 + 5X1
Eq (3) E(Y) = 1 + 2X1 + X2 − X1X2: not parallel; the slopes differ and decrease as X2 increases.
X2 = 1, E(Y) = 2 + X1
X2 = 2, E(Y) = 3
X2 = 3, E(Y) = 4 − X1
An easy interpretation of the last two figures: if the slope coefficients are positive and the interaction coefficient is positive, we have reinforcement; if the slope coefficients are positive and the interaction coefficient is negative, we have interference (and vice versa). Note: because of possibly high correlation, the variables should be centered.
To test for interaction we simply consider the extra sum of squares from adding the interaction term to the model containing the other terms, e.g. F* = [SSR(X1X2 | X1, X2) / 1] / MSE(full).
Example – Body Fat, Table 7.1, page 312. As an exercise, try to re-create the following:
1. Open the Body Fat data set and, prior to centering, create the interaction terms x1x2, x1x3, and x2x3.
2. Create centered variables for the 3 predictors, using Calc > Standardize. You can enter
all 3 variables in the Input window and enter cx1 cx2 cx3 in the Store window.
3. Create interaction terms using the centered variables (i.e. cx1cx2, cx1cx3, cx2cx3)
4. To see the effect centering has on correlation, go to Calc > Basic Stat > Correlation and enter the 6 uncentered variables, uncheck the box for p-values (this will make the display easier to read), and click OK. Repeat this for the centered variables. Note how the correlation between the single predictors and the interactions drops markedly when centered.
5. Perform the regression on page 312 using the centered variables and produce the
sequential sum of squares as shown on page 313
Y = Bo + B1cx1 + B2cx2 + B3cx3 + B4cx1cx2 + B5cx1cx3 + B6cx2cx3 +e
6. Find, as shown on page 313:

F* = [SSR(x1x2, x1x3, x2x3 | x1, x2, x3) / 3] / MSE(full)
   = [(1.496 + 2.704 + 6.515) / 3] / 6.745
   ≈ 0.53
7. Use Minitab to find the p-value for F*. The df are 3 and 13. Go to Calc > Probability Distributions > F. Cumulative Probability should be checked (non-centrality should be 0.0); enter 3 for numerator DF and 13 for denominator DF, select Input Constant, and enter 0.53. The p-value is 1 minus the output value: 1 − 0.33 = 0.67.
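The same p-value can be found without Minitab; a sketch using scipy's F distribution:

```python
from scipy.stats import f

# Body fat example: F* = 0.53 with 3 and 13 degrees of freedom
p_value = f.sf(0.53, 3, 13)   # equivalent to 1 - f.cdf(0.53, 3, 13)
print(round(p_value, 2))      # about 0.67, matching the Minitab result
```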
Qualitative Predictor Variables
Minitab Data Set
We addressed qualitative predictor variables briefly when we discussed the cargo
example in Chapter 6 lecture notes. Now we look at these variable types in a little bit
more detail. Remember that if a variable has k categories then define only k – 1 dummy
or indicator variables. Dummy variables can be used, then, to compare two or more
regression models. We will consider two models first as the extension to more than two
is fairly straightforward. NOTE: 1. Do not center dummy variables. 2. If you use an interaction term involving a dummy variable, you do not have to center the non-dummy variable. Create the dummy variable using Calc > Calculator with Expression
'Gender'="Female"
Looking at the Minitab data set of SBP versus AGE and Gender, several questions come to mind:
Q1: Are the two slopes the same, regardless of intercepts? (Are the lines parallel?)
Q2: Are the two intercepts the same, regardless of slopes?
Q3: Are the two lines coincident, i.e. the same (equal slopes and equal intercepts)?
To answer Q1, are the lines parallel:
Model: Y = Bo + B1X1 + B2Z + B3X1Z + e, where Y = SBP, X1 = Age, and
Z = 0 if male, 1 if female
From Minitab we get the fitted model:
SBP(Y) = 110.04 + 0.96 AGE(x1) – 12.96 Dummy Gender(z) - 0.012 xz
Analysis of Variance
Source            DF       SS      MS      F      P
Regression         3  18010.3  6003.4  75.02  0.000
Residual Error    65   5201.4    80.0
Total             68  23211.8

Source            DF   Seq SS
AGE(x1)            1  14951.3
Dummy Gender(z)    1   3058.5
xz                 1      0.5
Z = 0 for Males:   Ŷ = 110.04 + 0.96 AGE
Z = 1 for Females: Ŷ = (110.04 − 12.96) + (0.96 − 0.012) AGE = 97.08 + 0.95 AGE
So the two slopes appear similar: 0.96 and 0.95.
To test for parallelism, Ho: B3 = 0, which reduces the model to Y = Bo + B1X1 + B2Z + e:
Z = 0: Y = Bo + B1 Age
Z = 1: Y = (Bo + B2) + B1 Age    (same slope)
We use again:

F* = [SSR(X1Z | X1, Z) / 1] / MSE(full) = SSR(X1Z) / MSE(full) = 0.5 / 80.0 ≈ 0.006
From the F table, B.4, for α = 0.01, F0.01, 1,65 ≈ 7.08 and the p-value from Minitab is 0.94
We would not reject Ho and conclude that the two lines are parallel.
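A sketch of this test in Python (scipy used only for the F tail area; the Seq SS and MSE come from the ANOVA output above):

```python
from scipy.stats import f

# Parallelism test: F* = [SSR(X1Z | X1, Z) / 1] / MSE(full)
ssr_x1z = 0.5            # Seq SS for the xz interaction
mse_full = 80.0          # MSE from the full-model ANOVA table

F_star = ssr_x1z / mse_full
p_value = f.sf(F_star, 1, 65)

print(round(F_star, 3), round(p_value, 2))  # F* = 0.006, p about 0.94
```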
To answer Q2, equal intercepts:
Test Ho: B2 = 0, which reduces the model to Y = Bo + B1X1 + B3X1Z + e:
Z = 0: Y = Bo + B1 Age
Z = 1: Y = Bo + (B1 + B3) Age    (same intercept)
If the slopes are equal, we could instead use Y = Bo + B1X1 + B2Z + e and compare it to Y = Bo + B1X1 + e. Since we are fairly sure from Q1 that B3 = 0, we will use this latter approach and test with:

F* = [SSR(Z | X1) / 1] / MSE(full) = SSR(Z) / MSE(full) = 3058.5 / 78.8 ≈ 38.8

Note: from the output we need to add the Seq SS of X1Z (0.5) back into the SSE of the full model and divide by 66 instead of 65 to get the new MSE(full).
From the F table, B.4, for α = 0.01, F(0.01; 1, 66) ≈ 7.08, and the p-value from Minitab is ≈ 0.000. We would reject Ho and conclude that the two lines have different intercepts, i.e. the intercepts of 110.04 for males and 97.08 for females are statistically different.
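The add-back bookkeeping for this test is easy to get wrong; here is a short sketch of the arithmetic (all values from the ANOVA output above):

```python
from scipy.stats import f

# The full model for Q2 drops X1Z, so fold its Seq SS back into the error
sse = 5201.4 + 0.5        # SSE of the 3-term model plus Seq SS of X1Z
df_error = 65 + 1         # one more error df after dropping X1Z
mse_full = sse / df_error # about 78.8

F_star = 3058.5 / mse_full        # SSR(Z | X1) / MSE(full)
p_value = f.sf(F_star, 1, df_error)

print(round(mse_full, 1), round(F_star, 1))  # 78.8 and about 38.8
```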
To answer Q3, do the lines coincide:
We test Ho: B2 = B3 = 0, reducing the model to Y = Bo + B1X1 + e.

F* = [SSR(Z, X1Z | X1) / 2] / MSE(full) = [(3058.5 + 0.5) / 2] / 80.0 = 1529.5 / 80.0 ≈ 19.1

From the F table, B.4, for α = 0.01, F(0.01; 2, 65) ≈ 4.98, and the p-value from Minitab is ≈ 0.000. We would reject Ho and conclude that the lines do not coincide.
Conclusion:
We conclude B3 = 0, meaning the slopes are equal, and therefore we could drop the interaction term. But we rejected B2 = 0, so our final regression model is
SBP(Y) = 110 + 0.956 AGE(x1) − 13.5 Dummy Gender(z)
Since B2 = −13.5, we would also conclude that SBP for females and SBP for males increase at the same rate (i.e. equal slopes), but that SBP for males is higher than SBP for females.
Important Notes!!
1. Do NOT center dummy variables
2. The tests used to answer Q1, Q2, and Q3 are only reliable if the error variances are equal for each group; in our example this means σ²(Y|M) = σ²(Y|F).