13 Nonlinear and Multiple Regression

13.4 Multiple Regression Analysis

In multiple regression, the objective is to build a probabilistic model that relates a dependent variable y to more than one independent or predictor variable. Let k represent the number of predictor variables (k ≥ 2) and denote these predictors by x1, x2, ..., xk. For example, in attempting to predict the selling price of a house, we might have k = 3 with x1 = size (ft²), x2 = age (years), and x3 = number of rooms.

Definition
The general additive multiple regression model equation is

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε     (13.15)

where E(ε) = 0 and V(ε) = σ². In addition, for purposes of testing hypotheses and calculating CIs or PIs, it is assumed that ε is normally distributed.

Let x1*, ..., xk* be particular values of x1, ..., xk. Then (13.15) implies that

μY·x1*,...,xk* = β0 + β1x1* + ... + βkxk*     (13.16)

Thus just as β0 + β1x describes the mean Y value as a function of x in simple linear regression, the true (or population) regression function β0 + β1x1 + ... + βkxk gives the expected value of Y as a function of x1, ..., xk. The βi's are the true (or population) regression coefficients.

The regression coefficient β1 is interpreted as the expected change in Y associated with a 1-unit increase in x1 while x2, ..., xk are held fixed. Analogous interpretations hold for β2, ..., βk.

Models with Interaction and Quadratic Predictors

If an investigator has obtained observations on y, x1, and x2, one possible model is Y = β0 + β1x1 + β2x2 + ε. However, other models can be constructed by forming predictors that are mathematical functions of x1 and/or x2. For example, with x3 = x1² and x4 = x1x2, the model

Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε

has the general form of (13.15).

In general, it is not only permissible for some predictors to be mathematical functions of others but also often highly desirable in the sense that the resulting model may be much more successful in explaining variation in y than any model without such predictors. This discussion also shows that polynomial regression is indeed a special case of multiple regression. For example, the quadratic model Y = β0 + β1x + β2x² + ε has the form of (13.15) with k = 2, x1 = x, and x2 = x².

For the case of two independent variables, x1 and x2, consider the following four derived models (a short coded sketch of how the derived predictors are formed follows the list).

1. The first-order model:
   Y = β0 + β1x1 + β2x2 + ε
2. The second-order no-interaction model:
   Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
3. The model with first-order predictors and interaction:
   Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
4. The complete second-order or full quadratic model:
   Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
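The following short sketch (hypothetical predictor values, not data from the text) shows that each derived predictor in these models is simply an additional column computed from x1 and x2, so every model above still has the additive form (13.15):

```python
import numpy as np

# Hypothetical observed predictor values (n = 5 observations on x1 and x2)
x1 = np.array([2.0, 3.5, 1.0, 4.2, 2.8])
x2 = np.array([0.4, 0.9, 0.5, 0.7, 0.6])

# Derived predictors: squares and the interaction product. Each one is
# simply another column, so even the full quadratic model has the
# additive form (13.15), here with k = 5 predictors.
x3 = x1 ** 2        # quadratic term in x1
x4 = x2 ** 2        # quadratic term in x2
x5 = x1 * x2        # interaction term

# Predictor columns for the four models listed above
first_order         = np.column_stack([x1, x2])
second_order_no_int = np.column_stack([x1, x2, x3, x4])
first_order_int     = np.column_stack([x1, x2, x5])
full_quadratic      = np.column_stack([x1, x2, x3, x4, x5])
print(full_quadratic.shape)   # (5, 5): n rows, one column per predictor
```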
Understanding the differences among these models is an important first step in building realistic regression models from the independent variables under study. The first-order model is the most straightforward generalization of simple linear regression. It states that for a fixed value of either variable, the expected value of Y is a linear function of the other variable and that the expected change in Y associated with a unit increase in x1 (x2) is β1 (β2), independent of the level of x2 (x1). Thus if we graph the regression function as a function of x1 for several different values of x2, we obtain as contours of the regression function a collection of parallel lines, as pictured in Figure 13.13(a) [contours of E(Y) = –1 + .5x1 – x2].

The function y = β0 + β1x1 + β2x2 specifies a plane in three-dimensional space; the first-order model says that each observed value of the dependent variable corresponds to a point which deviates vertically from this plane by a random amount ε.

According to the second-order no-interaction model, if x2 is fixed, the expected change in Y for a 1-unit increase in x1 is

β1(x1 + 1) + β3(x1 + 1)² – [β1x1 + β3x1²] = β1 + β3(2x1 + 1)

Because this expected change does not depend on x2, the contours of the regression function for different values of x2 are still parallel to one another. However, the dependence of the expected change on the value of x1 means that the contours are now curves rather than straight lines. This is pictured in Figure 13.13(b). In this case, the regression surface is no longer a plane in three-dimensional space but is instead a curved surface.

The contours of the regression function for the first-order interaction model are nonparallel straight lines. This is because the expected change in Y when x1 is increased by 1 is

β1 + β3x2

This expected change depends on the value of x2, so each contour line must have a different slope, as in Figure 13.13(c) [contours of E(Y) = –1 + .5x1 – x2 + x1x2]. The word interaction reflects the fact that an expected change in Y when one variable increases in value depends on the value of the other variable.

Finally, for the complete second-order model, the expected change in Y when x2 is held fixed while x1 is increased by 1 unit is β1 + β3 + 2β3x1 + β5x2, which is a function of both x1 and x2. This implies that the contours of the regression function are both curved and not parallel to one another, as illustrated in Figure 13.13(d).

Similar considerations apply to models constructed from more than two independent variables. In general, the presence of interaction terms in the model implies that the expected change in Y depends not only on the variable being increased or decreased but also on the values of some of the fixed variables. As in ANOVA, it is possible to have higher-way interaction terms (e.g., x1x2x3), making model interpretation more difficult.

Note that if the model contains interaction or quadratic predictors, the generic interpretation of a βi given previously will not usually apply. This is because it is not then possible to increase xi by 1 unit and hold the values of all other predictors fixed.
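As a quick numerical illustration of why interaction produces nonparallel contours, the sketch below evaluates the regression function of Figure 13.13(c) along x1 for several fixed values of x2; the slope of each contour is .5 + x2, so no two contours are parallel:

```python
import numpy as np

# Regression function from Figure 13.13(c): E(Y) = -1 + .5*x1 - x2 + x1*x2
def mean_y(x1, x2):
    return -1 + 0.5 * x1 - x2 + x1 * x2

x1 = np.linspace(0, 5, 6)          # x1 = 0, 1, ..., 5
for x2 in (0.0, 1.0, 2.0):
    contour = mean_y(x1, x2)
    # per-unit change in E(Y) along this contour is .5 + x2
    print(x2, np.diff(contour)[0])
```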
Models with Predictors for Categorical Variables

Thus far we have explicitly considered the inclusion of only quantitative (numerical) predictor variables in a multiple regression model. Using simple numerical coding, qualitative (categorical) variables, such as bearing material (aluminum or copper/lead) or type of wood (pine, oak, or walnut), can also be incorporated into a model.

Let's first focus on the case of a dichotomous variable, one with just two possible categories—male or female, U.S. or foreign manufacture, and so on. With any such variable, we associate a dummy or indicator variable x whose possible values 0 and 1 indicate which category is relevant for any particular observation.

Example 11

The article "Estimating Urban Travel Times: A Comparative Study" (Trans. Res., 1980: 173–175) described a study relating the dependent variable y = travel time between locations in a certain city and the independent variable x2 = distance between locations. Two types of vehicles, passenger cars and trucks, were used in the study. Let

x1 = 1 if the vehicle is a truck, 0 if the vehicle is a passenger car

One possible multiple regression model is

Y = β0 + β1x1 + β2x2 + ε

The mean value of travel time depends on whether a vehicle is a car or a truck:

mean time = β0 + β2x2 when x1 = 0 (cars)
mean time = β0 + β1 + β2x2 when x1 = 1 (trucks)

The coefficient β1 is the difference in mean times between trucks and cars with distance held fixed; if β1 > 0, on average it will take trucks longer to traverse any particular distance than it will for cars.

A second possibility is a model with an interaction predictor:

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε

Now the mean times for the two types of vehicles are

mean time = β0 + β2x2 when x1 = 0
mean time = β0 + β1 + (β2 + β3)x2 when x1 = 1

For each model, the graph of the mean time versus distance is a straight line for either type of vehicle, as illustrated in Figure 13.14 [regression functions for models with one dummy variable (x1) and one quantitative variable x2: (a) no interaction; (b) interaction]. The two lines are parallel for the first (no-interaction) model, but in general they will have different slopes when the second model is correct. For this latter model, the change in mean travel time associated with a 1-mile increase in distance depends on which type of vehicle is involved—the two predictors "vehicle type" and "distance" interact. Indeed, data collected by the authors of the cited article suggested the presence of interaction.

You might think that the way to handle a three-category situation is to define a single numerical variable with coded values such as 0, 1, and 2 corresponding to the three categories. This is incorrect, because it imposes an ordering on the categories that is not necessarily implied by the problem context. The correct approach to incorporating three categories is to define two different dummy variables.

Suppose, for example, that y is the lifetime of a certain cutting tool, x1 is cutting speed, and that there are three brands of tool being investigated. Then let

x2 = 1 if a brand A tool is used, 0 otherwise
x3 = 1 if a brand B tool is used, 0 otherwise

When an observation on a brand A tool is made, x2 = 1 and x3 = 0, whereas for a brand B tool, x2 = 0 and x3 = 1. An observation made on a brand C tool has x2 = x3 = 0, and it is not possible that x2 = x3 = 1 because a tool cannot simultaneously be both brand A and brand B. The no-interaction model would have only the predictors x1, x2, and x3.
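A minimal coding sketch of the indicator variables just defined, with made-up data values for illustration:

```python
import numpy as np

# Example 11 style coding: x1 = 1 for a truck, 0 for a passenger car
vehicle = np.array(["car", "truck", "truck", "car"])    # hypothetical categories
distance = np.array([3.2, 5.1, 2.0, 7.4])               # hypothetical x2 values
x1 = (vehicle == "truck").astype(float)                 # dummy (indicator) variable
x1x2 = x1 * distance                                    # interaction predictor x1*x2

# Three tool brands need c - 1 = 2 indicator variables
brand = np.array(["A", "B", "C", "A"])                  # hypothetical brands
x2_A = (brand == "A").astype(float)   # 1 for brand A, 0 otherwise
x3_B = (brand == "B").astype(float)   # 1 for brand B, 0 otherwise
# brand C is the baseline category: x2_A = x3_B = 0

print(np.column_stack([x1, distance, x1x2]))
print(np.column_stack([x2_A, x3_B]))
```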
The following interaction model allows the mean change in lifetime associated with a 1-unit increase in speed to depend on the brand of tool:

Y = β0 + β1x1 + β2x2 + β3x3 + β4x1x2 + β5x1x3 + ε

Construction of a picture like Figure 13.14 with a graph for each of the three possible (x2, x3) pairs gives three nonparallel lines (unless β4 = β5 = 0).

More generally, incorporating a categorical variable with c possible categories into a multiple regression model requires the use of c – 1 indicator variables (e.g., five brands of tools would necessitate using four indicator variables). Thus even one categorical variable can add many predictors to a model.

Estimating Parameters

The data in simple linear regression consists of n pairs (x1, y1), ..., (xn, yn). Suppose that a multiple regression model contains two predictor variables, x1 and x2. Then the data set will consist of n triples (x11, x21, y1), (x12, x22, y2), ..., (x1n, x2n, yn). Here the first subscript on x refers to the predictor and the second to the observation number. More generally, with k predictors, the data consists of n (k + 1)-tuples (x11, x21, ..., xk1, y1), (x12, x22, ..., xk2, y2), ..., (x1n, x2n, ..., xkn, yn), where xij is the value of the ith predictor xi associated with the observed value yj.

The observations are assumed to have been obtained independently of one another according to the model (13.15). To estimate the parameters β0, β1, ..., βk using the principle of least squares, form the sum of squared deviations of the observed yj's from a trial function y = b0 + b1x1 + ... + bkxk:

f(b0, b1, ..., bk) = Σ [yj – (b0 + b1x1j + b2x2j + ... + bkxkj)]²     (13.17)

The least squares estimates are those values of the bi's that minimize f(b0, ..., bk).

Taking the partial derivative of f with respect to each bi (i = 0, 1, ..., k) and equating all partials to zero yields the following system of normal equations:

b0n + b1Σx1j + b2Σx2j + ... + bkΣxkj = Σyj
b0Σx1j + b1Σx1j² + b2Σx1jx2j + ... + bkΣx1jxkj = Σx1jyj     (13.18)
⋮
b0Σxkj + b1Σxkjx1j + b2Σxkjx2j + ... + bkΣxkj² = Σxkjyj

These equations are linear in the unknowns b0, b1, ..., bk. Solving (13.18) yields the least squares estimates β̂0, β̂1, ..., β̂k. This is best done by utilizing a statistical software package.
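As an illustration, here is a minimal sketch of how a system like (13.18) can be solved with NumPy; the data values are hypothetical, not those of any example in the text:

```python
import numpy as np

# Hypothetical data: n = 6 observations on k = 2 predictors and y
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([0.5, 0.3, 0.8, 0.2, 0.9, 0.4])
y  = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Design matrix: the leading column of 1s corresponds to b0 in (13.18)
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares estimates b0, b1, b2 via a numerically stable solver
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)

# Equivalent direct solution of the normal equations (X'X)b = X'y
b_normal = np.linalg.solve(X.T @ X, X.T @ y)
print(b_normal)
```

In practice the lstsq route (based on an orthogonal factorization) is preferred to explicitly forming X'X, although both give the same estimates here.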
Example 12

The article "How to Optimize and Control the Wire Bonding Process: Part II" (Solid State Technology, Jan. 1991: 67–72) described an experiment carried out to assess the impact of the variables x1 = force (gm), x2 = power (mW), x3 = temperature (°C), and x4 = time (msec) on y = ball bond shear strength (gm). The following data, consisting of n = 30 observations, was generated to be consistent with the information given in the article (the data table is not reproduced here).

A statistical computer package gave the following least squares estimates:

β̂0 = –37.48   β̂1 = .2117   β̂2 = .4983   β̂3 = .1297   β̂4 = .2583

Thus we estimate that .1297 gm is the average change in strength associated with a 1-degree increase in temperature when the other three predictors are held fixed; the other estimated coefficients are interpreted in a similar manner.

The estimated regression equation is

ŷ = –37.48 + .2117x1 + .4983x2 + .1297x3 + .2583x4

A point prediction of strength resulting from a force of 35 gm, power of 75 mW, temperature of 200°, and time of 20 msec is

ŷ = –37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41 gm

This is also a point estimate of the mean value of strength for the specified values of force, power, temperature, and time.

R² and σ̂²

Predicted or fitted values, residuals, and the various sums of squares are calculated as in simple linear and polynomial regression. The predicted value ŷ1 results from substituting the values of the various predictors from the first observation into the estimated regression function:

ŷ1 = β̂0 + β̂1x11 + β̂2x21 + ... + β̂kxk1

The remaining predicted values ŷ2, ŷ3, ..., ŷn come from substituting values of the predictors from the 2nd, 3rd, ..., and finally nth observations into the estimated function. For example, the values of the 4 predictors for the last observation in Example 12 are x1,30 = 35, x2,30 = 75, x3,30 = 200, x4,30 = 20, so

ŷ30 = –37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41

The residuals y1 – ŷ1, ..., yn – ŷn are the differences between the observed and predicted values. The last residual in Example 12 is 40.3 – 38.41 = 1.89. The closer the residuals are to 0, the better the job our estimated regression function is doing in making predictions corresponding to observations in the sample.

Error or residual sum of squares is SSE = Σ(yi – ŷi)². It is again interpreted as a measure of how much variation in the observed y values is not explained by (not attributed to) the model relationship. The number of df associated with SSE is n – (k + 1) because k + 1 df are lost in estimating the k + 1 coefficients β0, β1, ..., βk.

Total sum of squares, a measure of total variation in the observed y values, is SST = Σ(yi – ȳ)². Regression sum of squares SSR = Σ(ŷi – ȳ)² = SST – SSE is a measure of explained variation. Then the coefficient of multiple determination R² is

R² = 1 – SSE/SST = SSR/SST

It is interpreted as the proportion of observed y variation that can be explained by the multiple regression model fit to the data.

Because there is no preliminary picture of multiple regression data analogous to a scatter plot for bivariate data, the coefficient of multiple determination is our first indication of whether the chosen model is successful in explaining y variation. Unfortunately, there is a problem with R²: Its value can be inflated by adding lots of predictors into the model even if most of these predictors are rather frivolous.

For example, suppose y is the sale price of a house. Then sensible predictors include x1 = the interior size of the house, x2 = the size of the lot on which the house sits, x3 = the number of bedrooms, x4 = the number of bathrooms, and x5 = the house's age. Now suppose we add in x6 = the diameter of the doorknob on the coat closet, x7 = the thickness of the cutting board in the kitchen, x8 = the thickness of the patio slab, and so on.

Unless we are very unlucky in our choice of predictors, using n – 1 predictors (one fewer than the sample size) will yield R² = 1. So the objective in multiple regression is not simply to explain most of the observed y variation, but to do so using a model with relatively few predictors that are easily interpreted.
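The quantities just defined are straightforward to compute once a model has been fit; a minimal sketch with hypothetical data (not any example from the text):

```python
import numpy as np

# Hypothetical fit: y regressed on k = 2 predictors (values made up)
X = np.column_stack([np.ones(6),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     [0.5, 0.3, 0.8, 0.2, 0.9, 0.4]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ b                        # fitted (predicted) values
e = y - y_hat                        # residuals

sse = np.sum(e ** 2)                 # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
ssr = sst - sse                      # regression (explained) sum of squares
r2 = 1 - sse / sst                   # coefficient of multiple determination
print(sse, sst, ssr, r2)
```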
It is thus desirable to adjust R², as was done in polynomial regression, to take account of the size of the model:

adjusted R² = Ra² = 1 – [(n – 1)/(n – (k + 1))] · SSE/SST

Because the ratio (n – 1)/(n – (k + 1)) in front of SSE/SST exceeds 1, Ra² is smaller than R². Furthermore, the larger the number of predictors k relative to the sample size n, the smaller Ra² will be relative to R². Adjusted R² can even be negative, whereas R² itself must be between 0 and 1. A value of Ra² that is substantially smaller than R² itself is a warning that the model may contain too many predictors.

The positive square root of R² is called the multiple correlation coefficient and is denoted by R. It can be shown that R is the sample correlation coefficient calculated from the (ŷi, yi) pairs (that is, use ŷi in place of xi in the formula for r).

SSE is also the basis for estimating the remaining model parameter:

σ̂² = s² = SSE/(n – (k + 1))

Example 13

Investigators carried out a study to see how various characteristics of concrete are influenced by x1 = % limestone powder and x2 = water–cement ratio, resulting in the accompanying data ("Durability of Concrete with Addition of Limestone Powder," Magazine of Concrete Research, 1996: 131–137).

Consider first compressive strength as the dependent variable y. Fitting the first-order model results in

ŷ = 84.82 + .1643x1 – 79.67x2,  SSE = 72.52 (df = 6), R² = .741, adjusted R² = .654

whereas including an interaction predictor gives

ŷ = 6.22 + 5.779x1 + 51.33x2 – 9.357x1x2,  SSE = 29.35 (df = 5), R² = .895, adjusted R² = .831

Based on this latter fit, a prediction for compressive strength when % limestone = 14 and water–cement ratio = .60 is

ŷ = 6.22 + 5.779(14) + 51.33(.60) – 9.357(8.4) = 39.32

Fitting the full quadratic relationship results in virtually no change in the R² value. However, when the dependent variable is adsorbability, the following results are obtained: R² = .747 when just two predictors are used, .802 when the interaction predictor is added, and .889 when the five predictors for the full quadratic relationship are used.

In general, β̂i can be interpreted as an estimate of the average change in Y associated with a 1-unit increase in xi while the values of all other predictors are held fixed. Sometimes, though, it is difficult or even impossible to increase the value of one predictor while holding all others fixed. In such situations, there is an alternative interpretation of the estimated regression coefficients. For concreteness, suppose that k = 2, and let β̂1 denote the estimate of β1 in the regression of y on the two predictors x1 and x2. Then

1. Regress y against just x2 (a simple linear regression) and denote the resulting residuals by g1, g2, ..., gn. These residuals represent variation in y after removing or adjusting for the effects of x2.
2. Regress x1 against x2 (that is, regard x1 as the dependent variable and x2 as the independent variable in this simple linear regression), and denote the residuals by f1, ..., fn. These residuals represent variation in x1 after removing or adjusting for the effects of x2.

Now consider plotting the residuals from the first regression against those from the second; that is, plot the pairs (f1, g1), ..., (fn, gn). The result is called a partial residual plot or adjusted residual plot. If a regression line is fit to the points in this plot, the slope turns out to be exactly β̂1 (furthermore, the residuals from this line are exactly the residuals e1, ..., en from the multiple regression of y on x1 and x2). Thus β̂1 can be interpreted as the estimated change in y associated with a 1-unit increase in x1 after removing or adjusting for the effects of any other model predictors.
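A quick numerical check of this two-step description, using simulated data (all values made up); the slope from the partial residual plot matches the coefficient of x1 from the multiple regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x2 = rng.uniform(0, 10, n)                      # hypothetical predictor
x1 = 2.0 + 0.5 * x2 + rng.normal(size=n)        # x1 correlated with x2
y  = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(size=n)

def fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
X2 = np.column_stack([ones, x2])
# Step 1: residuals g from regressing y on x2 alone
g = y - X2 @ fit(X2, y)
# Step 2: residuals f from regressing x1 on x2
f = x1 - X2 @ fit(X2, x1)

# Slope of the line fit to the (f, g) pairs ...
slope = fit(np.column_stack([ones, f]), g)[1]
# ... equals the coefficient of x1 in the multiple regression of y on x1 and x2
beta1_hat = fit(np.column_stack([ones, x1, x2]), y)[1]
print(slope, beta1_hat)      # the two values agree
```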
The same interpretation holds for other estimated coefficients regardless of the number of predictors in the model (there is nothing special about k = 2; the foregoing argument remains valid if y is regressed against all predictors other than x1 in Step 1 and x1 is regressed against the other k – 1 predictors in Step 2).

As an example, suppose that y is the sale price of an apartment building and that the predictors are number of apartments, age, lot size, number of parking spaces, and gross building area (ft²). It may not be reasonable to increase the number of apartments without also increasing gross area. However, if the estimated coefficient of gross building area is 16.00, then we estimate that a $16 increase in sale price is associated with each extra square foot of gross area after adjusting for the effects of the other four predictors.

A Model Utility Test

With multivariate data, there is no picture analogous to a scatter plot to indicate whether a particular multiple regression model will successfully explain observed y variation. The value of R² certainly communicates a preliminary message, but this value is sometimes deceptive because it can be greatly inflated by using a large number of predictors relative to the sample size. For this reason, it is important to have a formal test for model utility.

The model utility test in simple linear regression involved the null hypothesis H0: β1 = 0, according to which there is no useful relation between y and the single predictor x. Here we consider the assertion that β1 = 0, β2 = 0, ..., βk = 0, which says that there is no useful relationship between y and any of the k predictors. If at least one of these β's is not 0, the corresponding predictor(s) is (are) useful. The test is based on a statistic that has a particular F distribution when H0 is true.

Null hypothesis: H0: β1 = β2 = ... = βk = 0
Alternative hypothesis: Ha: at least one βi ≠ 0 (i = 1, ..., k)
Test statistic value:
f = (R²/k)/[(1 – R²)/(n – (k + 1))] = (SSR/k)/[SSE/(n – (k + 1))]     (13.19)
where SSR = regression sum of squares = SST – SSE
Rejection region for a level α test: f ≥ Fα,k,n – (k + 1)

Except for a constant multiple, the test statistic here is R²/(1 – R²), the ratio of explained to unexplained variation. If the proportion of explained variation is high relative to unexplained, we would naturally want to reject H0 and confirm the utility of the model. However, if k is large relative to n, the factor [n – (k + 1)]/k will decrease f considerably.

Example 14

Returning to the bond shear strength data of Example 12, a model with k = 4 predictors was fit, so the relevant hypotheses are

H0: β1 = β2 = β3 = β4 = 0
Ha: at least one of these four β's is not 0

Figure 13.15 shows multiple regression output from the JMP statistical package for this data. The values of s (Root Mean Square Error), R², and adjusted R² certainly suggest a useful model. The value of the model utility F ratio, computed from R² as in (13.19), appears in the F Ratio column of the ANOVA table in Figure 13.15. The largest tabled F critical value for 4 numerator and 25 denominator df is 6.49, which captures an upper-tail area of .001. Thus P-value < .001. In fact, the ANOVA table in the JMP output shows that P-value < .0001. This is a highly significant result; the null hypothesis should be rejected at any reasonable significance level. We conclude that there is a useful linear relationship between y and at least one of the four predictors in the model. This does not mean that all four predictors are useful; we will say more about this subsequently.
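A sketch of the model utility test computations, assuming SciPy is available; the R² value below is a made-up illustrative number, not the one in the JMP output, but the degrees of freedom match Example 14:

```python
from scipy import stats

n, k = 30, 4                     # Example 14: 30 observations, 4 predictors
df_num, df_den = k, n - (k + 1)

# F statistic from R^2 as in (13.19); this R^2 is hypothetical
r2 = 0.90
f_stat = (r2 / k) / ((1 - r2) / (n - (k + 1)))

p_value = stats.f.sf(f_stat, df_num, df_den)    # upper-tail area beyond f_stat
crit = stats.f.ppf(1 - 0.001, df_num, df_den)   # about 6.49, as quoted above
print(f_stat, p_value, crit)
```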
Inferences in Multiple Regression

Before testing hypotheses, constructing CIs, and making predictions, the adequacy of the model should be assessed and the impact of any unusual observations investigated. Methods for doing this are described at the end of the present section and in the next section.

Because each β̂i is a linear function of the yi's, the standard deviation of each β̂i is the product of σ and a function of the xij's. An estimate sβ̂i of this SD is obtained by substituting s for σ. The function of the xij's is quite complicated, but all standard statistical software packages compute and show the sβ̂i's. Inferences concerning a single βi are based on the standardized variable

T = (β̂i – βi)/Sβ̂i

which has a t distribution with n – (k + 1) df.

The point estimate of μY·x1*,...,xk*, the expected value of Y when x1 = x1*, ..., xk = xk*, is μ̂Y·x1*,...,xk* = β̂0 + β̂1x1* + ... + β̂kxk*. The estimated standard deviation of the corresponding estimator is again a complicated expression involving the sample xij's. However, appropriate software will calculate it on request. Inferences about μY·x1*,...,xk* are based on standardizing its estimator to obtain a t variable having n – (k + 1) df.

1. A 100(1 – α)% CI for βi, the coefficient of xi in the regression function, is
   β̂i ± tα/2,n – (k + 1) · sβ̂i
2. A test for H0: βi = βi0 uses the t statistic value t = (β̂i – βi0)/sβ̂i based on n – (k + 1) df. The test is upper-, lower-, or two-tailed according to whether Ha contains the inequality >, <, or ≠.
3. A 100(1 – α)% CI for μY·x1*,...,xk* is
   ŷ ± tα/2,n – (k + 1) · sŶ
   where ŷ is the calculated value of Ŷ = β̂0 + β̂1x1* + ... + β̂kxk*, the statistic that estimates μY·x1*,...,xk*, and sŶ is its estimated standard deviation.
4. A 100(1 – α)% PI for a future y value is
   ŷ ± tα/2,n – (k + 1) · (s² + sŶ²)^1/2

Simultaneous intervals for which the simultaneous confidence or prediction level is controlled can be obtained by applying the Bonferroni technique.

Example 15

Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. The article "Adsorption of Phosphate, Arsenate, Methanearsonate, and Cacodylate by Lake and Stream Sediments: Comparisons with Soils" (J. of Environ. Qual., 1984: 499–504) gives the accompanying data (Table 13.5) on y = phosphate adsorption index, x1 = amount of extractable iron, and x2 = amount of extractable aluminum.

The article proposed the model Y = β0 + β1x1 + β2x2 + ε. A computer analysis yielded, among other things, the estimates β̂1 = .11273 and β̂2 = .34900 with estimated standard deviations sβ̂1 = .02969 and sβ̂2 = .07131.

A 99% CI for β1, the change in expected adsorption associated with a 1-unit increase in extractable iron while extractable aluminum is held fixed, requires t.005,13 – (2 + 1) = t.005,10 = 3.169. The CI is

.11273 ± (3.169)(.02969) = .11273 ± .09409 = (.019, .207)

Similarly, a 99% interval for β2 is

.34900 ± (3.169)(.07131) = .34900 ± .22598 = (.123, .575)

The Bonferroni technique implies that the simultaneous confidence level for both intervals is at least 98%.
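A short sketch reproducing the 99% interval for β1 above, with the estimate and standard error taken from the text and SciPy assumed available:

```python
from scipy import stats

# Example 15: estimate and estimated SD of beta_1 as reported above
b1_hat, se_b1 = 0.11273, 0.02969
n, k = 13, 2
df = n - (k + 1)                         # 10 df

t_crit = stats.t.ppf(1 - 0.005, df)      # about 3.169, matching t_.005,10
ci = (b1_hat - t_crit * se_b1, b1_hat + t_crit * se_b1)
print(ci)                                # roughly (.019, .207)
```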
A 95% CI for μY·160,39, expected adsorption when extractable iron = 160 and extractable aluminum = 39, is

24.30 ± (2.228)(1.30) = 24.30 ± 2.90 = (21.40, 27.20)

A 95% PI for a future value of adsorption to be observed when x1 = 160 and x2 = 39 is

24.30 ± (2.228){(4.379)² + (1.30)²}^1/2 = 24.30 ± 10.18 = (14.12, 34.48)

Frequently, the hypothesis of interest has the form H0: βi = 0 for a particular i. For example, after fitting the four-predictor model in Example 12, the investigator might wish to test H0: β4 = 0. According to H0, as long as the predictors x1, x2, and x3 remain in the model, x4 contains no useful information about y. The test statistic value is the t ratio β̂i/sβ̂i. Many statistical computer packages report the t ratio and corresponding P-value for each predictor included in the model. For example, the JMP output in Figure 13.15 shows that as long as power, temperature, and time are retained in the model, the predictor x1 = force can be deleted.

An F Test for a Group of Predictors

The model utility F test was appropriate for testing whether there is useful information about the dependent variable in any of the k predictors (i.e., whether β1 = ... = βk = 0). In many situations, one first builds a model containing k predictors and then wishes to know whether any of the predictors in a particular subset provide useful information about Y. For example, a model to be used to predict students' test scores might include a group of background variables such as family income and education levels and also some school characteristic variables such as class size and spending per pupil. One interesting hypothesis is that the school characteristic predictors can be dropped from the model.

Let's label the predictors as x1, x2, ..., xl, xl+1, ..., xk, so that it is the last k – l that we are considering deleting. The relevant hypotheses are as follows:

H0: βl+1 = βl+2 = ... = βk = 0
(so the "reduced" model Y = β0 + β1x1 + ... + βlxl + ε is correct)
versus
Ha: at least one among βl+1, ..., βk is not 0
(so in the "full" model Y = β0 + β1x1 + ... + βkxk + ε, at least one of the last k – l predictors provides useful information)

The test is carried out by fitting both the full and reduced models. Because the full model contains not only the predictors of the reduced model but also some extra predictors, it should fit the data at least as well as the reduced model. That is, if we let SSEk be the sum of squared residuals for the full model and SSEl be the corresponding sum for the reduced model, then SSEk ≤ SSEl. Intuitively, if SSEk is a great deal smaller than SSEl, the full model provides a much better fit than the reduced model; the appropriate test statistic should then depend on the reduction SSEl – SSEk in unexplained variation.

SSEk = unexplained variation for the full model
SSEl = unexplained variation for the reduced model
Test statistic value:
f = [(SSEl – SSEk)/(k – l)] / [SSEk/(n – (k + 1))]     (13.20)
Rejection region: f ≥ Fα,k – l,n – (k + 1)
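A sketch of the computation in (13.20), with made-up SSE values standing in for hypothetical full and reduced fits:

```python
from scipy import stats

# Hypothetical fits: full model with k predictors, reduced model with l of them
n, k, l = 40, 6, 3
sse_full, sse_reduced = 250.0, 410.0        # made-up unexplained variation

# Test statistic (13.20) and its P-value
f_stat = ((sse_reduced - sse_full) / (k - l)) / (sse_full / (n - (k + 1)))
p_value = stats.f.sf(f_stat, k - l, n - (k + 1))
print(f_stat, p_value)
# Reject H0 when f >= F_{alpha, k-l, n-(k+1)}; rejection means at least one
# of the last k - l predictors provides useful information about Y.
```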
Assessing Model Adequacy

The standardized residuals in multiple regression result from dividing each residual by its estimated standard deviation; the formula for these standard deviations is substantially more complicated than in the case of simple linear regression. We recommend a normal probability plot of the standardized residuals as a basis for validating the normality assumption. Plots of the standardized residuals versus each predictor and versus ŷ should show no discernible pattern. Adjusted residual plots can also be helpful in this endeavor. The book by Neter et al. is an extremely useful reference.

Example 17

Figure 13.16 shows a normal probability plot of the standardized residuals for the adsorption data and fitted model given in Example 15. The straightness of the plot casts little doubt on the assumption that the random deviation ε is normally distributed.

Figure 13.17 shows the other suggested diagnostic plots for the adsorption data: (a) standardized residual versus x1; (b) standardized residual versus x2; (c) standardized residual versus ŷ; (d) ŷ versus y. Given that there are only 13 observations in the data set, there is not much evidence of a pattern in any of the first three plots other than randomness. The point at the bottom of each of these three plots corresponds to the observation with the large residual. We will say more about such observations subsequently. For the moment, there is no compelling reason for remedial action.
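The text does not give the formula for the estimated residual standard deviations; one common version (internally studentized residuals, based on the leverages from the hat matrix) is sketched below with simulated data, purely as an illustration of how such standardized residuals can be computed:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 25, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, k))])   # hypothetical design
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)          # simulated responses

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b                                 # ordinary residuals
s2 = e @ e / (n - (k + 1))                    # estimate of sigma^2

# Hat matrix H = X (X'X)^{-1} X'; its diagonal gives each observation's leverage
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Standardized (internally studentized) residuals: e_i / (s * sqrt(1 - h_ii))
e_star = e / np.sqrt(s2 * (1 - h))
print(e_star[:5])
```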