Chapter 8 Polynomial Regression

- One covariate, 2nd order: $Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon_i$ (the shape is a parabola). $X$ is usually centered, $x_i = X_i - \bar{X}$, as this reduces the multicollinearity: $X_1$ and $X_1^2$ are often highly correlated. Notation that your book uses: $Y_i = \beta_0 + \beta_1 X_1 + \beta_{11} X_1^2 + \varepsilon_i$, where the subscripts reflect the pattern of exponents. For example, $\beta_1$ would be the coefficient for $X_1$; $\beta_{11}$ the coefficient for $X_1^2$; $\beta_{111}$ the coefficient for $X_1^3$; and $\beta_{22}$ the coefficient for $X_2^2$.
- One covariate, 3rd order: $Y_i = \beta_0 + \beta_1 X_1 + \beta_{11} X_1^2 + \beta_{111} X_1^3 + \varepsilon_i$
- Two covariates, 2nd order: $Y_i = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon_i$, where $x_1 = X_{i1} - \bar{X}_1$ and $x_2 = X_{i2} - \bar{X}_2$
- This expansion continues in the same fashion for more covariates and higher orders.

General Rules:

- For d distinct X-values, a polynomial of order at most d - 1 can be fit. For example, if we have 29 observations on the predictor variable(s) with 14 that are replicates, this leaves us with only 15 distinct X-values, so a polynomial of order 15 - 1 = 14 could be fit (although fitting such a model would be absurd!).
- For studies, especially in the biological or social sciences, one important consideration is whether the regression relationship can be described by a monotonic function (i.e., one that is always increasing or always decreasing). If only monotonic functions are of interest, a 2nd or 3rd order model usually suffices, although monotonicity is not guaranteed, since a parabola, for instance, increases and then decreases. A more general consideration is the number of bends in the polynomial curve one wishes to fit. For example, a 1st order model has zero bends (a straight-line fit); a 2nd order model has at most 1 bend; and each higher-order term adds another potential bend. In practice, then, fitting polynomials of order higher than a cubic usually leads to models that are neither always increasing nor always decreasing.

Example: HO 8.1 – 8.3 Minitab Data Set

- The X's were scaled by subtracting from each $X_{ip}$ its respective mean and then dividing by the absolute value of the difference between adjacent levels. That is, X1 is the scaled result $(X_{i1} - 90)/10$ and X2 is $(X_{i2} - 55)/5$. This produces the coded values -1, 0, and 1. Dividing by this denominator is only valid if the variable's levels are evenly spaced; otherwise, just center by the mean. You can do this in Minitab by Calc > Standardize: enter X1_F for the input column and type X1 as Store Results In, click the radio button for Subtract first, Divide second, and enter First: 90, Second: 10. Repeat, with the appropriate edits, for X2_psi. A sketch of why this coding is worth the trouble follows.
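To see what the coding buys you, here is a minimal sketch (Python with NumPy assumed, rather than the Minitab session the handout uses) based on the temperature levels of HO 8.1. The correlation between the raw predictor and its square is near 1, and it drops to 0 once the predictor is coded.

```python
import numpy as np

# Temperature levels from HO 8.1: 80, 90, 100 degrees F, nine observations each
x = np.repeat([80.0, 90.0, 100.0], 9)

# Raw variable vs. its square: almost perfectly correlated
print(np.corrcoef(x, x**2)[0, 1])    # ~0.999, as in the 8.3 output below

# Code as in the handout: subtract the mean (90), divide by the spacing (10)
xc = (x - 90.0) / 10.0               # coded values -1, 0, 1
print(np.corrcoef(xc, xc**2)[0, 1])  # 0.0 -- the collinearity is gone
```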
- Look at the diagnostic plots in 8.2.

8.1 Temperature, Pressure and Quality of the Finished Product

x1 (°F)  x2 (psi)    y     X1   X2   X1X2  X1SQ  X2SQ
  80       50      50.8    -1   -1    1     1     1
  80       50      50.7    -1   -1    1     1     1
  80       50      49.4    -1   -1    1     1     1
  80       55      93.7    -1    0    0     1     0
  80       55      90.9    -1    0    0     1     0
  80       55      90.9    -1    0    0     1     0
  80       60      74.5    -1    1   -1     1     1
  80       60      73.0    -1    1   -1     1     1
  80       60      71.2    -1    1   -1     1     1
  90       50      63.4     0   -1    0     0     1
  90       50      61.6     0   -1    0     0     1
  90       50      63.4     0   -1    0     0     1
  90       55      93.8     0    0    0     0     0
  90       55      92.1     0    0    0     0     0
  90       55      97.4     0    0    0     0     0
  90       60      70.9     0    1    0     0     1
  90       60      68.8     0    1    0     0     1
  90       60      71.3     0    1    0     0     1
 100       50      46.6     1   -1   -1     1     1
 100       50      49.1     1   -1   -1     1     1
 100       50      46.4     1   -1   -1     1     1
 100       55      69.8     1    0    0     1     0
 100       55      72.5     1    0    0     1     0
 100       55      73.2     1    0    0     1     0
 100       60      38.7     1    1    1     1     1
 100       60      42.5     1    1    1     1     1
 100       60      41.4     1    1    1     1     1

8.2 Regression Analysis: Quality Y versus X1, X2, X1X2, X1SQ, X2SQ

The regression equation is
Quality Y = 94.9 - 9.16 X1 + 3.94 X2 - 7.27 X1X2 - 13.3 X1SQ - 28.6 X2SQ

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       5  8402.3  1680.5  596.32  0.000
Residual Error  21    59.2     2.8
Total           26  8461.4

Source  DF  Seq SS
X1       1  1510.7
X2       1   279.3
X1X2     1   635.1
X1SQ     1  1067.6
X2SQ     1  4909.7

[Figure: scatterplots of the residuals RESI1 against x1 °F, x2 psi, and the fitted values FITS1.]
[Figure: normal probability plot of RESI1 with 95% CI; Mean ≈ 0, StDev = 1.509, N = 27, AD = 0.151, p-value = 0.955, so normality looks fine.]

8.3 Correlations: compare the correlation between the original variable and its square with the same correlation after the variables are scaled; the scaling reduces the collinearity. From the Minitab output, the correlation between x1 °F and (x1 °F)² is 0.999, while the correlation between X1 and X1SQ is 0.000.

Lack of Fit – H0: the second-order model is a good fit (no lack of fit).

Analysis of Variance
Source          DF      SS      MS       F      P
Regression       5  8402.3  1680.5  596.32  0.000
Residual Error  21    59.2     2.8
  Lack of Fit    3     8.2     2.7    0.97  0.428
  Pure Error    18    50.9     2.8
Total           26  8461.4

With F = 0.97 and p = 0.428, we do not reject H0: there is no evidence of lack of fit.

Partial F-test – is a first-order model sufficient? H0: $\beta_{11} = \beta_{12} = \beta_{22} = 0$ vs. Ha: not all $= 0$. Using the partial SS from the output (the values can be found in the Seq SS column), and noting that the denominator is just the MSE of the full model:

$$F^* = \frac{SSR(x_1^2, x_2^2, x_1x_2 \mid x_1, x_2)/m}{SSE(x_1, x_2, x_1^2, x_2^2, x_1x_2)/(n-p)} = \frac{[SSR(x_1^2) + SSR(x_2^2) + SSR(x_1x_2)]/3}{59.2/21} = \frac{[1067.6 + 4909.7 + 635.1]/3}{2.8} = 782.15$$

The p-value for this F-statistic with df = 3, 21 is ≈ 0.000. Alternatively, you could find the critical value from the F-table (Table B.4, starting on page 1320 of your text) for, say, α = 0.01 with numerator df = 3 and denominator df = 21, which gives a critical F of 4.87. Since the p-value is less than α = 0.01 (equivalently, F* = 782.15 exceeds the critical value of 4.87), we reject H0 and conclude that at least one of $\beta_{11}, \beta_{12}, \beta_{22} \neq 0$.

Fitted model in terms of X:

$$\hat{Y} = 94.9 - 9.16\left(\frac{X_1 - 90}{10}\right) + 3.94\left(\frac{X_2 - 55}{5}\right) - 7.27\left(\frac{X_1 - 90}{10}\right)\left(\frac{X_2 - 55}{5}\right) - 13.3\left(\frac{X_1 - 90}{10}\right)^2 - 28.6\left(\frac{X_2 - 55}{5}\right)^2$$
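As a cross-check on the Minitab output above, here is a sketch (Python with NumPy assumed) that refits the second-order model to the coded variables with ordinary least squares; the data are the 27 rows of table 8.1.

```python
import numpy as np

# Quality y in the table's order: x1 blocks of -1, 0, +1 (nine rows each),
# with x2 = -1,-1,-1, 0,0,0, +1,+1,+1 within each block
y = np.array([50.8, 50.7, 49.4, 93.7, 90.9, 90.9, 74.5, 73.0, 71.2,
              63.4, 61.6, 63.4, 93.8, 92.1, 97.4, 70.9, 68.8, 71.3,
              46.6, 49.1, 46.4, 69.8, 72.5, 73.2, 38.7, 42.5, 41.4])
x1 = np.repeat([-1.0, 0.0, 1.0], 9)
x2 = np.tile(np.repeat([-1.0, 0.0, 1.0], 3), 3)

# Design matrix columns: intercept, X1, X2, X1X2, X1SQ, X2SQ
X = np.column_stack([np.ones_like(y), x1, x2, x1 * x2, x1**2, x2**2])
b, sse, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)    # ~ (94.9, -9.16, 3.94, -7.27, -13.3, -28.6), matching Minitab
print(sse)  # ~ 59.2, the SSE from the ANOVA table
```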
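The lack-of-fit and partial F-tests above are also easy to verify by hand; here is a sketch (SciPy assumed) that plugs in the sums of squares reported by Minitab.

```python
from scipy import stats

# Lack of fit: F = MS(lack of fit) / MS(pure error), df = 3 and 18
F_lof = (8.2 / 3) / (50.9 / 18)
print(F_lof, 1 - stats.f.cdf(F_lof, 3, 18))    # ~0.97, p ~ 0.43: fit is adequate

# Partial F: H0: B11 = B12 = B22 = 0; numerator SS from the Seq SS column
F_star = ((1067.6 + 4909.7 + 635.1) / 3) / (59.2 / 21)
print(F_star, 1 - stats.f.cdf(F_star, 3, 21))  # ~782, p ~ 0.000: reject H0
```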
Interaction

We briefly addressed interaction, or cross-product, terms in multiple linear regression in Chapter 6. For discussion, let's say we have two covariates, X1 and X2, with three levels for X2: 1, 2, and 3. Now consider the following graphs and regression equations. The graphs show Y on the vertical axis and X1 on the horizontal; the lines represent the response functions, E(Y), as a function of X1 for each level of X2.

[Figure with three panels: Additive Model, Reinforcement Interaction Effect, Interference Interaction Effect.]

Eq (1), additive model: E(Y) = 1 + 2X1 + X2. X2 has no effect on the slope for X1; the response functions are parallel, with equal slopes:
  X2 = 1: E(Y) = 2 + 2X1
  X2 = 2: E(Y) = 3 + 2X1
  X2 = 3: E(Y) = 4 + 2X1

Eq (2), reinforcement interaction effect: E(Y) = 1 + 2X1 + X2 + X1X2. Not parallel; the slopes differ and increase as X2 increases:
  X2 = 1: E(Y) = 2 + 3X1
  X2 = 2: E(Y) = 3 + 4X1
  X2 = 3: E(Y) = 4 + 5X1

Eq (3), interference interaction effect: E(Y) = 1 + 2X1 + X2 - X1X2. Not parallel; the slopes differ and decrease as X2 increases:
  X2 = 1: E(Y) = 2 + X1
  X2 = 2: E(Y) = 3
  X2 = 3: E(Y) = 4 - X1

Easy interpretation of the last two figures: if the slope coefficients are positive and the interaction coefficient is positive, then we have reinforcement. If the slope coefficients are positive and the interaction coefficient is negative, then we have interference (and vice versa). Note: due to possibly high correlation, the variables should be centered. To test interaction we simply consider the extra sum of squares for adding the interaction term to the model containing the other terms, e.g., $F^* = \dfrac{SSR(x_1x_2 \mid x_1, x_2)/1}{MSE(\text{full})}$.

Example – Body Fat, Table 7.1, page 312: As an exercise, try to re-create the analysis (a non-Minitab sketch of steps 6 and 7 follows the list).

1. Open the Body Fat data set and create the interaction terms prior to centering: x1x2, x1x3, and x2x3.
2. Create centered variables for the 3 predictors using Calc > Standardize. You can enter all 3 variables in the Input window and enter cx1 cx2 cx3 in the Store window.
3. Create interaction terms using the centered variables (i.e., cx1cx2, cx1cx3, cx2cx3).
4. To see the effect centering has on correlation, go to Calc > Basic Stat > Correlation and enter the 6 uncentered variables, uncheck the box for p-values (this will make the display easier to read), and click OK. Repeat this for the centered variables. Note how the correlation between the single predictors and the interactions drops markedly when centered.
5. Perform the regression on page 312 using the centered variables and produce the sequential sums of squares as shown on page 313: $Y = \beta_0 + \beta_1\,cx_1 + \beta_2\,cx_2 + \beta_3\,cx_3 + \beta_4\,cx_1cx_2 + \beta_5\,cx_1cx_3 + \beta_6\,cx_2cx_3 + \varepsilon$
6. Find, as shown on page 313,
$$F^* = \frac{SSR(x_1x_2, x_1x_3, x_2x_3 \mid x_1, x_2, x_3)/m}{MSE(\text{full})} = \frac{(1.496 + 2.704 + 6.515)/3}{6.745} = 0.53$$
7. Use Minitab to find the p-value for F*; the df are 3 and 13. Go to Calc > Probability Distributions > F. Cumulative Probability should be checked (non-centrality should be 0.0); enter 3 for numerator DF and 13 for denominator DF, select Input Constant, and enter 0.53. The p-value is 1 - the output value = 1 - 0.33 = 0.67.
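For steps 6 and 7 outside Minitab, a short sketch (SciPy assumed; the sequential sums of squares and MSE are the page-313 values quoted above):

```python
from scipy import stats

# F* = [extra SS for the three interaction terms / m] / MSE(full)
F_star = ((1.496 + 2.704 + 6.515) / 3) / 6.745
p_value = 1 - stats.f.cdf(F_star, 3, 13)     # df = 3 and n - p = 13
print(round(F_star, 2), round(p_value, 2))   # 0.53, 0.67: interactions not needed
```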
Qualitative Predictor Variables (Minitab Data Set)

We addressed qualitative predictor variables briefly when we discussed the cargo example in the Chapter 6 lecture notes. Now we look at these variable types in a little more detail. Remember that if a variable has k categories, then define only k - 1 dummy, or indicator, variables. Dummy variables can be used, then, to compare two or more regression models. We will consider two models first, as the extension to more than two is fairly straightforward.

NOTE:
1. Do not center dummy variables.
2. If using an interaction term involving a dummy variable, you do not have to center the non-dummy variable.

Create the dummy variable using Calc > Calculator with Expression 'Gender'="Female".

Looking at the Minitab data set of SBP versus AGE and Gender, several questions come to mind:

Q1: Are the two slopes the same regardless of the intercepts, i.e., are the lines parallel?
Q2: Are the two intercepts the same regardless of the slopes?
Q3: Are the two lines coincident, i.e., the same line (equal slopes and equal intercepts)?

(A sketch reproducing all three tests appears at the end of this section.)

To answer Q1, are the lines parallel: fit the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 Z + \beta_3 X_1 Z + \varepsilon$, where Y = SBP, X1 = AGE, and Z = 0 if male, 1 if female.

From Minitab we get the fitted model:

SBP(Y) = 110.04 + 0.96 AGE(x1) - 12.96 Dummy Gender(z) - 0.012 xz

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       3  18010.3  6003.4  75.02  0.000
Residual Error  65   5201.4    80.0
Total           68  23211.8

Source           DF   Seq SS
AGE(x1)           1  14951.3
Dummy Gender(z)   1   3058.5
xz                1      0.5

Z = 0 for males: $\hat{Y}_M = 110.04 + 0.96\,\text{AGE}$
Z = 1 for females: $\hat{Y}_F = 97.08 + 0.95\,\text{AGE}$

So the two slopes appear similar: 0.96 and 0.95.

To test for parallelism: H0: $\beta_3 = 0$, which reduces the model to $Y = \beta_0 + \beta_1 X_1 + \beta_2 Z + \varepsilon$:

Z = 0: $Y = \beta_0 + \beta_1\,\text{AGE}$
Z = 1: $Y = (\beta_0 + \beta_2) + \beta_1\,\text{AGE}$ (same slope)

We use again:

$$F^* = \frac{SSR(X_1Z \mid X_1, Z)/m}{MSE(\text{full})} = \frac{SSR(X_1Z)/1}{MSE(\text{full})} = \frac{0.5}{80.0} = 0.006$$

From the F table, B.4, for α = 0.01, F(0.01; 1, 65) ≈ 7.08, and the p-value from Minitab is 0.94. We would not reject H0 and conclude that the two lines are parallel.

To answer Q2, equal intercepts: test H0: $\beta_2 = 0$, which reduces the model to $Y = \beta_0 + \beta_1 X_1 + \beta_3 X_1 Z + \varepsilon$:

Z = 0: $Y = \beta_0 + \beta_1\,\text{AGE}$
Z = 1: $Y = \beta_0 + (\beta_1 + \beta_3)\,\text{AGE}$ (same intercept)

If the slopes are equal, we could instead compare $Y = \beta_0 + \beta_1 X_1 + \beta_2 Z + \varepsilon$ to $Y = \beta_0 + \beta_1 X_1 + \varepsilon$. Since we are pretty sure from Q1 that $\beta_3 = 0$, we will use this latter approach and test with

$$F^* = \frac{SSR(Z \mid X_1)/q}{MSE(\text{full})} = \frac{SSR(Z)/1}{MSE(\text{full})} = \frac{3058.5}{78.8} = 38.8$$

Note: the full model here no longer contains X1Z, so from the output we add the Seq SS of X1Z (0.5) back into the SSE of the full model and divide by 66 instead of 65 to get the new MSE(full): (5201.4 + 0.5)/66 = 78.8.

From the F table, B.4, for α = 0.01, F(0.01; 1, 66) ≈ 7.08, and the p-value from Minitab is ≈ 0.000. We would reject H0 and conclude that the two lines have different intercepts, i.e., the intercepts of 110.04 for males and 97.08 for females are statistically different.

To answer Q3, do the lines coincide: we test H0: $\beta_2 = \beta_3 = 0$, reducing the model to $Y = \beta_0 + \beta_1 X_1 + \varepsilon$:

$$F^* = \frac{SSR(Z, X_1Z \mid X_1)/q}{MSE(\text{full})} = \frac{SSR(Z, X_1Z)/2}{MSE(\text{full})} = \frac{(3058.5 + 0.5)/2}{80.0} = \frac{1529.5}{80} = 19.1$$

From the F table, B.4, for α = 0.01, F(0.01; 2, 65) ≈ 4.98, and the p-value from Minitab is ≈ 0.000. We would reject H0 and conclude that the lines do not coincide.

Conclusion: We conclude $\beta_3 = 0$, meaning the slopes are equal, and therefore we could drop the interaction term. But we rejected $\beta_2 = 0$. So our final regression model is

SBP(Y) = 110 + 0.956 AGE(x1) - 13.5 Dummy Gender(z)

From this model we conclude that SBP increases at the same rate for females and males (i.e., equal slopes) but, since $\beta_2 = -13.5$, SBP for males is higher than SBP for females.

Important Notes!!
1. Do NOT center dummy variables.
2. The tests used to answer Q1, Q2, and Q3 are only reliable if the error variances are equal for the two groups. For instance, in our example this means that $\sigma^2_{Y_M} = \sigma^2_{Y_F}$.
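As promised, here is a closing sketch (SciPy assumed) that reproduces the three tests for Q1, Q2, and Q3 from the sums of squares reported in the Minitab output above.

```python
from scipy import stats

MSE_full = 80.0   # full model SBP ~ AGE + Z + AGE*Z, with 65 error df

# Q1, parallel lines -- H0: B3 = 0
F1 = (0.5 / 1) / MSE_full
print(F1, 1 - stats.f.cdf(F1, 1, 65))    # ~0.006, p ~ 0.94: lines are parallel

# Q2, equal intercepts after dropping the interaction -- H0: B2 = 0
MSE_reduced = (5201.4 + 0.5) / 66        # add Seq SS of X1Z back to SSE, 66 df
F2 = (3058.5 / 1) / MSE_reduced
print(F2, 1 - stats.f.cdf(F2, 1, 66))    # ~38.8, p ~ 0.000: intercepts differ

# Q3, coincident lines -- H0: B2 = B3 = 0
F3 = ((3058.5 + 0.5) / 2) / MSE_full
print(F3, 1 - stats.f.cdf(F3, 2, 65))    # ~19.1, p ~ 0.000: lines do not coincide
```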