13 Nonlinear and Multiple Regression
Copyright © Cengage Learning. All rights reserved.
13.4 Multiple Regression Analysis
Multiple Regression Analysis
In multiple regression, the objective is to build a
probabilistic model that relates a dependent variable y to
more than one independent or predictor variable.
Let k represent the number of predictor variables (k ≥ 2)
and denote these predictors by x1, x2,..., xk.
For example, in attempting to predict the selling price of a
house, we might have k = 3 with x1 = size (ft2),
x2 = age (years), and x3 = number of rooms.
Definition
The general additive multiple regression model
equation is
Y = 0 + 1x1 + 2x2 + ... + kxk + 
(13.15)
where E() = 0 and V() =  2.
In addition, for purposes of testing hypotheses and calculating CIs or PIs, it is assumed that ε is normally distributed.
Let x1*, x2*, ..., xk* be particular values of x1, ..., xk. Then (13.15) implies that
μY·x1*, ..., xk* = β0 + β1x1* + β2x2* + ... + βkxk*    (13.16)
Thus just as 0 + 1x describes the mean Y value as a
function of x in simple linear regression, the true
(or population) regression function 0 + 1x1 + . . . + kxk
gives the expected value of Y as a function of x1,..., xk.
The i’s are the true (or population) regression
coefficients.
The regression coefficient β1 is interpreted as the expected change in Y associated with a 1-unit increase in x1 while x2, ..., xk are held fixed. Analogous interpretations hold for β2, ..., βk.
Models with Interaction and Quadratic Predictors
If an investigator has obtained observations on y, x1, and x2, one possible model is Y = β0 + β1x1 + β2x2 + ε.
However, other models can be constructed by forming predictors that are mathematical functions of x1 and/or x2.
For example, with x3 = x1² and x4 = x1x2, the model
Y = β0 + β1x1 + β2x2 + β3x3 + β4x4 + ε
has the general form of (13.15).
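As an illustrative aside (not from the text), the following minimal Python sketch shows how derived predictors such as x1² and x1x2 can be formed and the resulting model fit by least squares; the data and coefficient values are made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations on two basic predictors (values are illustrative only)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2 + 1.5 * x1 - 0.8 * x2 + 0.3 * x1**2 + 0.5 * x1 * x2 + rng.normal(0, 1, n)

# Derived predictors x3 = x1^2 and x4 = x1*x2; the design matrix includes a column of 1s
X = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2])

# Least squares estimates of beta0, ..., beta4
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 3))
```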
In general, it is not only permissible for some predictors
to be mathematical functions of others but also often highly
desirable in the sense that the resulting model may be
much more successful in explaining variation in y than any
model without such predictors.
This discussion also shows that polynomial regression is
indeed a special case of multiple regression.
For example, the quadratic model Y = β0 + β1x + β2x² + ε has the form of (13.15) with k = 2, x1 = x, and x2 = x².
For the case of two independent variables, x1 and x2, consider the following four derived models.
1. The first-order model:
   Y = β0 + β1x1 + β2x2 + ε
2. The second-order no-interaction model:
   Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + ε
3. The model with first-order predictors and interaction:
   Y = β0 + β1x1 + β2x2 + β3x1x2 + ε
4. The complete second-order or full quadratic model:
   Y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
Understanding the differences among these models is an important first step in building realistic regression models from the independent variables under study. The first-order model is the most straightforward generalization of simple linear regression.
It states that for a fixed value of either variable, the expected value of Y is a linear function of the other variable, and that the expected change in Y associated with a unit increase in x1 (or x2) is β1 (or β2), independent of the level of x2 (or x1).
Thus if we graph the regression function as a function of x1
for several different values of x2, we obtain as contours of
the regression function a collection of parallel lines, as
pictured in Figure 13.13(a).
Figure 13.13(a) Contours of the regression function E(Y) = –1 + .5x1 – x2
The function y = 0 + 1x1 + 2x2 specifies a plane in
three-dimensional space; the first-order model says that
each observed value of the dependent variable
corresponds to a point which deviates vertically from this
plane by a random amount .
According to the second-order no-interaction model, if x2 is fixed, the expected change in Y for a 1-unit increase in x1 is β1 + β3(2x1 + 1).
Because this expected change does not depend on x2, the
contours of the regression function for different values of x2
are still parallel to one another.
However, the dependence of the expected change on the
value of x1 means that the contours are now curves rather
than straight lines. This is pictured in Figure 13.13(b).
Figure 13.13(b) Contours of a second-order no-interaction regression function
In this case, the regression surface is no longer a plane in
three-dimensional space but is instead a curved surface.
The contours of the regression function for the first-order
interaction model are nonparallel straight lines.
This is because the expected change in Y when x1 is increased by 1 unit is β1 + β3x2.
This expected change depends on the value of x2, so each
contour line must have a different slope, as in
Figure 13.13(c).
Figure 13.13(c) Contours of the regression function E(Y) = –1 + .5x1 – x2 + x1x2
The word interaction reflects the fact that an expected
change in Y when one variable increases in value depends
on the value of the other variable.
Finally, for the complete second-order model, the expected change in Y when x2 is held fixed while x1 is increased by 1 unit is β1 + β3 + 2β3x1 + β5x2, which is a function of both x1 and x2.
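A small numerical sketch (illustrative only; the coefficient values below are assumed, not from the text) confirms that in the full quadratic model the change in E(Y) per unit increase in x1 depends on both x1 and x2 and matches the expression β1 + β3 + 2β3x1 + β5x2.

```python
import numpy as np

# Assumed coefficients for the complete second-order model (illustrative, not from the text)
b0, b1, b2, b3, b4, b5 = -1.0, 0.5, -1.0, 0.2, 0.1, 0.3

def mean_y(x1, x2):
    """E(Y) = b0 + b1*x1 + b2*x2 + b3*x1^2 + b4*x2^2 + b5*x1*x2"""
    return b0 + b1*x1 + b2*x2 + b3*x1**2 + b4*x2**2 + b5*x1*x2

# Change in E(Y) when x1 increases by 1 unit with x2 held fixed,
# compared with the closed-form expression beta1 + beta3 + 2*beta3*x1 + beta5*x2
for x1, x2 in [(1.0, 2.0), (3.0, 2.0), (1.0, 5.0)]:
    delta = mean_y(x1 + 1, x2) - mean_y(x1, x2)
    formula = b1 + b3 + 2*b3*x1 + b5*x2
    print(x1, x2, round(delta, 4), round(formula, 4))   # the two values agree
```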
This implies that the contours of the regression function
are both curved and not parallel to one another, as
illustrated in Figure 13.13(d).
Figure 13.13(d) Contours of a complete second-order regression function
Similar considerations apply to models constructed from
more than two independent variables.
In general, the presence of interaction terms in the model implies that the expected change in Y depends not only on the variable being increased or decreased but also on the values of some of the fixed variables.
As in ANOVA, it is possible to have higher-way interaction
terms (e.g., x1x2x3), making model interpretation more
difficult.
Note that if the model contains interaction or quadratic predictors, the generic interpretation of a βi given previously will not usually apply.
This is because it is not then possible to increase xi by 1
unit and hold the values of all other predictors fixed.
Models with Predictors for Categorical Variables
Thus far we have explicitly considered the inclusion of only
quantitative (numerical) predictor variables in a multiple
regression model.
Using simple numerical coding, qualitative (categorical)
variables, such as bearing material (aluminum or
copper/lead) or type of wood (pine, oak, or walnut), can
also be incorporated into a model.
Let’s first focus on the case of a dichotomous variable, one
with just two possible categories—male or female,
U.S. or foreign manufacture, and so on.
With any such variable, we associate a dummy or
indicator variable x whose possible values 0 and 1
indicate which category is relevant for any particular
observation.
Example 11
The article “Estimating Urban Travel Times:
A Comparative Study” (Trans. Res., 1980: 173–175)
described a study relating the dependent variable
y = travel time between locations in a certain city and the
independent variable x2 = distance between locations.
Two types of vehicles, passenger cars and trucks, were
used in the study.
Let x1 = 1 if the vehicle is a truck and x1 = 0 if it is a passenger car.
One possible multiple regression model is
Y = β0 + β1x1 + β2x2 + ε
The mean value of travel time depends on whether a vehicle is a car or a truck:
mean time = β0 + β2x2    when x1 = 0 (cars)
mean time = β0 + β1 + β2x2    when x1 = 1 (trucks)
The coefficient 1 is the difference in mean times between
trucks and cars with distance held fixed; if 1 > 0, on
average it will take trucks longer to traverse any particular
distance than it will for cars.
A second possibility is a model with an interaction
predictor:
Y = 0 + 1x1 + 2x2 + 3x1x2 + 
Now the mean times for the two types of vehicles are
mean time = β0 + β2x2    when x1 = 0
mean time = β0 + β1 + (β2 + β3)x2    when x1 = 1
For each model, the graph of the mean time versus
distance is a straight line for either type of vehicle, as
illustrated in Figure 13.14.
Figure 13.14 Regression functions for models with one dummy variable (x1) and one quantitative variable x2: (a) no interaction; (b) interaction
The two lines are parallel for the first (no-interaction)
model, but in general they will have different slopes when
the second model is correct.
For this latter model, the change in mean travel time associated with a 1-mile increase in distance depends on which type of vehicle is involved; the two variables "vehicle type" and "distance" interact.
Indeed, data collected by the authors of the cited article
suggested the presence of interaction.
You might think that the way to handle a three-category
situation is to define a single numerical variable with coded
values such as 0, 1, and 2 corresponding to the three
categories.
This is incorrect, because it imposes an ordering on the
categories that is not necessarily implied by the problem
context.
The correct approach to incorporating three categories is to
define two different dummy variables.
Suppose, for example, that y is the lifetime of a certain
cutting tool, x1 is cutting speed, and that there are three
brands of tool being investigated.
Then let x2 = 1 if a brand A tool is used and 0 otherwise, and x3 = 1 if a brand B tool is used and 0 otherwise.
When an observation on a brand A tool is made, x2 = 1 and
x3 = 0, whereas for a brand B tool, x2 = 0 and x3 = 1.
An observation made on a brand C tool has x2 = x3 = 0, and
it is not possible that x2 = x3 = 1 because a tool cannot
simultaneously be both brand A and brand B. The
no-interaction model would have only the predictors x1,
x2, and x3.
The following interaction model allows the mean change in
lifetime associated with a 1-unit increase in speed to
depend on the brand of tool:
Y = 0 + 1x1 + 2x2 + 3x3 + 4x1x2 + 5x1x3 + 
Construction of a picture like Figure 13.14 with a graph for each of the three possible (x2, x3) pairs gives three nonparallel lines (unless β4 = β5 = 0).
More generally, incorporating a categorical variable with c
possible categories into a multiple regression model
requires the use of c – 1 indicator variables (e.g., five
brands of tools would necessitate using four indicator
variables).
Thus even one categorical variable can add many
predictors to a model.
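To illustrate the coding, here is a minimal sketch (with made-up tool-lifetime data) that builds the c – 1 = 2 indicator variables for a three-brand categorical predictor, adds the speed-by-brand interaction terms, and fits the model by least squares.

```python
import numpy as np

# Illustrative tool-lifetime data: cutting speed and brand (A, B, or C); values are made up
speed = np.array([400., 425., 450., 400., 425., 450., 400., 425., 450.])
brand = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
y     = np.array([38., 34., 29., 41., 36., 32., 35., 31., 27.])

# c = 3 categories require c - 1 = 2 indicator variables (brand C serves as the baseline)
x2 = (brand == "A").astype(float)
x3 = (brand == "B").astype(float)

# Interaction model: Y = b0 + b1*speed + b2*x2 + b3*x3 + b4*speed*x2 + b5*speed*x3 + error
X = np.column_stack([np.ones(len(y)), speed, x2, x3, speed * x2, speed * x3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 4))
```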
Estimating Parameters
The data in simple linear regression consists of n pairs
(x1, y1), . . . , (xn, yn). Suppose that a multiple regression
model contains two predictor variables, x1 and x2.
Then the data set will consist of n triples (x11, x21, y1), (x12,
x22, y2), . . . , (x1n, x2n, yn). Here the first subscript on x
refers to the predictor and the second to the observation
number.
More generally, with k predictors, the data consists of n (k + 1)-tuples (x11, x21, ..., xk1, y1), (x12, x22, ..., xk2, y2), ..., (x1n, x2n, ..., xkn, yn), where xij is the value of the ith predictor xi associated with the observed value yj.
The observations are assumed to have been obtained
independently of one another according to the model
(13.15).
To estimate the parameters β0, β1, ..., βk using the principle of least squares, form the sum of squared deviations of the observed yj's from a trial function y = b0 + b1x1 + ... + bkxk:
f(b0, b1, ..., bk) = Σ [yj – (b0 + b1x1j + b2x2j + ... + bkxkj)]²    (13.17)
The least squares estimates are those values of the bi’s
that minimize f(b0,..., bk).
Taking the partial derivative of f with respect to each bi (i = 0, 1, ..., k) and equating all partials to zero yields the following system of normal equations:
b0n + b1Σx1j + b2Σx2j + ... + bkΣxkj = Σyj
b0Σx1j + b1Σx1j² + b2Σx1jx2j + ... + bkΣx1jxkj = Σx1jyj
⋮
b0Σxkj + b1Σx1jxkj + b2Σx2jxkj + ... + bkΣxkj² = Σxkjyj
(13.18)
These equations are linear in the unknowns b0, b1, ..., bk. Solving (13.18) yields the least squares estimates β̂0, β̂1, ..., β̂k. This is best done by utilizing a statistical software package.
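As a sketch of what such software does (synthetic data, illustrative only), the normal equations can be solved directly, although numerically it is preferable to use a least squares routine:

```python
import numpy as np

# With k predictors stored column-wise in x_data (n rows, k columns) and responses in y,
# the least squares estimates solve the normal equations (13.18). Data are placeholders.
rng = np.random.default_rng(1)
n, k = 30, 4
x_data = rng.uniform(0, 10, size=(n, k))
y = 5 + x_data @ np.array([0.2, 0.5, 0.1, 0.3]) + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x_data])      # design matrix with a column of 1s for b0

# Option 1: solve the normal equations (X'X) b = X'y directly
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Option 2 (numerically preferable): let lstsq handle it
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(b_normal, b_lstsq))          # True: both give the same estimates
```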
Example 12
The article “How to Optimize and Control the Wire Bonding Process: Part II” (Solid State Technology, Jan. 1991: 67–72) described an experiment carried out to assess the impact of the variables x1 = force (gm), x2 = power (mW), x3 = temperature (°C), and x4 = time (msec) on y = ball bond shear strength (gm).
The data (n = 30 observations) was generated to be consistent with the information given in the article.
A statistical computer package gave the following least squares estimates:
β̂0 = –37.48    β̂1 = .2117    β̂2 = .4983    β̂3 = .1297    β̂4 = .2583
Thus we estimate that .1297 gm is the average change in strength associated with a 1-degree increase in temperature when the other three predictors are held fixed; the other estimated coefficients are interpreted in a similar manner.
The estimated regression equation is
ŷ = –37.48 + .2117x1 + .4983x2 + .1297x3 + .2583x4
A point prediction of strength resulting from a force of 35 gm, power of 75 mW, temperature of 200°C, and time of 20 msec is
ŷ = –37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41 gm
This is also a point estimate of the mean value of strength
for the specified values of force, power, temperature, and
time.
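The point prediction is simply the dot product of the estimated coefficient vector with the new predictor values (including a leading 1 for the intercept); a quick check of the calculation above:

```python
import numpy as np

# Coefficient estimates reported in Example 12 (intercept, force, power, temperature, time)
beta_hat = np.array([-37.48, 0.2117, 0.4983, 0.1297, 0.2583])

# Predictor values: force = 35 gm, power = 75 mW, temperature = 200 C, time = 20 msec
x_new = np.array([1.0, 35.0, 75.0, 200.0, 20.0])   # the leading 1 pairs with the intercept

y_hat = beta_hat @ x_new
print(round(y_hat, 2))    # 38.41 gm
```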
R² and σ̂²
Predicted or fitted values, residuals, and the various sums
of squares are calculated as in simple linear and
polynomial regression.
The predicted value ŷ1 results from substituting the values of the various predictors from the first observation into the estimated regression function:
ŷ1 = β̂0 + β̂1x11 + β̂2x21 + ... + β̂kxk1
The remaining predicted values ŷ2, ŷ3, ..., ŷn come from substituting values of the predictors from the 2nd, 3rd, ..., and finally nth observations into the estimated function.
For example, the values of the 4 predictors for the last observation in Example 12 are x1,30 = 35, x2,30 = 75, x3,30 = 200, and x4,30 = 20, so
ŷ30 = –37.48 + .2117(35) + .4983(75) + .1297(200) + .2583(20) = 38.41
The residuals y1 – ŷ1, y2 – ŷ2, ..., yn – ŷn are the differences between the observed and predicted values. The last residual in Example 12 is 40.3 – 38.41 = 1.89.
The closer the residuals are to 0, the better the job our
estimated regression function is doing in making
predictions corresponding to observations in the sample.
Error or residual sum of squares is SSE = Σ(yi – ŷi)².
It is again interpreted as a measure of how much variation
in the observed y values is not explained by (not attributed
to) the model relationship.
The number of df associated with SSE is n – (k + 1)
because k + 1 df are lost in estimating the k + 1
coefficients.
Total sum of squares, a measure of total variation in the observed y values, is SST = Σ(yi – ȳ)².
Regression sum of squares SSR = Σ(ŷi – ȳ)² = SST – SSE is a measure of explained variation.
Then the coefficient of multiple determination R² is
R² = 1 – SSE/SST = SSR/SST
It is interpreted as the proportion of observed y variation
that can be explained by the multiple regression model fit to
the data.
Because there is no preliminary picture of multiple
regression data analogous to a scatter plot for bivariate
data, the coefficient of multiple determination is our
first indication of whether the chosen model is successful in
explaining y variation.
Unfortunately, there is a problem with R2: Its value can be
inflated by adding lots of predictors into the model even if
most of these predictors are rather frivolous.
For example, suppose y is the sale price of a house. Then
sensible predictors include
x1 = the interior size of the house,
x2 = the size of the lot on which the house sits,
x3 = the number of bedrooms,
x4 = the number of bathrooms, and
x5 = the house’s age.
Now suppose we add in
x6 = the diameter of the doorknob on the coat closet,
x7 = the thickness of the cutting board in the kitchen,
x8 = the thickness of the patio slab, and so on.
Unless we are very unlucky in our choice of predictors,
using n – 1 predictors (one fewer than the sample size) will
yield R2 = 1.
So the objective in multiple regression is not simply to
explain most of the observed y variation, but to do so using
a model with relatively few predictors that are easily
interpreted.
It is thus desirable to adjust R², as was done in polynomial regression, to take account of the size of the model:
Ra² = 1 – [(n – 1)/(n – (k + 1))] · SSE/SST
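A small helper (an illustrative sketch, not from the text) showing how R², adjusted R², and the estimate s² discussed below are computed from the observed and predicted values:

```python
import numpy as np

def r_squared_stats(y, y_hat, k):
    """Return R², adjusted R², and s² = SSE/(n - (k + 1)) for a fit with k predictors."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)          # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)       # total variation
    r2 = 1 - sse / sst
    adj_r2 = 1 - (n - 1) / (n - (k + 1)) * sse / sst
    return r2, adj_r2, sse / (n - (k + 1))

# Tiny illustration with made-up observed and fitted values (k = 2 predictors)
print(r_squared_stats([10, 12, 15, 11, 14, 16, 13, 12, 15],
                      [10.5, 11.8, 14.6, 11.2, 13.9, 15.5, 13.2, 12.4, 14.9], k=2))
```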
Because the ratio in front of SSE/SST exceeds 1, Ra² is smaller than R². Furthermore, the larger the number of predictors k relative to the sample size n, the smaller Ra² will be relative to R².
Adjusted R² can even be negative, whereas R² itself must be between 0 and 1. A value of Ra² that is substantially smaller than R² itself is a warning that the model may contain too many predictors.
The positive square root of R2 is called the multiple
correlation coefficient and is denoted by R.
It can be shown that R is the sample correlation coefficient calculated from the (ŷi, yi) pairs (that is, use ŷi in place of xi in the formula for r).
SSE is also the basis for estimating the remaining model parameter:
σ̂² = s² = SSE / [n – (k + 1)]
Example 13
Investigators carried out a study to see how various
characteristics of concrete are influenced by x1 = %
limestone powder and x2 = water-cement ratio, resulting in
the accompanying data (“Durability of Concrete with
Addition of Limestone Powder,” Magazine of Concrete
Research, 1996: 131–137).
Consider first compressive strength as the dependent
variable y.
Fitting the first-order model results in
ŷ = 84.82 + .1643x1 – 79.67x2,  SSE = 72.52 (df = 6),  R² = .741,  Ra² = .654
whereas including an interaction predictor gives
ŷ = 6.22 + 5.779x1 + 51.33x2 – 9.357x1x2,  SSE = 29.35 (df = 5),  R² = .895,  Ra² = .831
Based on this latter fit, a prediction for compressive strength when % limestone = 14 and water–cement ratio = .60 is
ŷ = 6.22 + 5.779(14) + 51.33(.60) – 9.357(8.4) = 39.32
Fitting the full quadratic relationship results in virtually no
change in the R2 value.
However, when the dependent variable is adsorbability, the
following results are obtained: R2 = .747 when just two
predictors are used, .802 when the interaction predictor is
added, and .889 when the five predictors for the full
quadratic relationship are used.
In general I ,can be interpreted as an estimate of the
average change in Y associated with a 1-unit increase in xi
while values of all other predictors are held fixed.
Sometimes, though, it is difficult or even impossible to
increase the value of one predictor while holding all others
fixed.
In such situations, there is an alternative interpretation of
the estimated regression coefficients.
For concreteness, suppose that k = 2, and let β̂1 denote the estimate of β1 in the regression of y on the two predictors x1 and x2.
Then
1. Regress y against just x2 (a simple linear regression)
and denote the resulting residuals by g1, g2, . . . , gn.
These residuals represent variation in y after removing
or adjusting for the effects of x2.
2. Regress x1 against x2 (that is, regard x1 as the
dependent variable and x2 as the independent variable
in this simple linear regression), and denote the
residuals by f1, . . . , fn. These residuals represent
variation in x1 after removing or adjusting for the effects
of x2.
Now consider plotting the residuals from the first regression
against those from the second; that is, plot the pairs
(f1, g1),..., (fn, gn).
The result is called a partial residual plot or adjusted residual plot. If a regression line is fit to the points in this plot, the slope turns out to be exactly β̂1 (furthermore, the residuals from this line are exactly the residuals e1, ..., en from the multiple regression of y on x1 and x2).
Thus β̂1 can be interpreted as the estimated change in y associated with a 1-unit increase in x1 after removing or adjusting for the effects of any other model predictors.
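The following sketch (synthetic data, illustrative only) carries out the two-step procedure and verifies numerically that the slope of the partial residual regression equals the multiple regression estimate β̂1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with two correlated predictors (illustrative only)
n = 40
x2 = rng.normal(0, 1, n)
x1 = 0.6 * x2 + rng.normal(0, 1, n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(0, 0.5, n)

def fit(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

ones = np.ones(n)

# Multiple regression of y on x1 and x2: estimate of beta1
beta1_hat = fit(np.column_stack([ones, x1, x2]), y)[1]

# Step 1: residuals g from regressing y on x2 alone
Z = np.column_stack([ones, x2])
g = y - Z @ fit(Z, y)
# Step 2: residuals f from regressing x1 on x2
f = x1 - Z @ fit(Z, x1)

# Slope of the regression of g on f equals the multiple regression estimate
slope = fit(np.column_stack([ones, f]), g)[1]
print(np.isclose(slope, beta1_hat))   # True
```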
The same interpretation holds for other estimated
coefficients regardless of the number of predictors in the
model (there is nothing special about k = 2; the foregoing
argument remains valid if y is regressed against all
predictors other than x1 in Step 1 and x1 is regressed
against the other k – 1 predictors in Step 2).
As an example, suppose that y is the sale price of an
apartment building and that the predictors are number of
apartments, age, lot size, number of parking spaces, and
gross building area (ft2).
It may not be reasonable to increase the number of
apartments without also increasing gross area.
However, if the estimated coefficient of gross building area is 16.00, then we estimate that a $16 increase in sale price is associated with each extra square foot of gross area after adjusting for the effects of the other four predictors.
A Model Utility Test
With multivariate data, there is no picture analogous to a
scatter plot to indicate whether a particular multiple
regression model will successfully explain observed y
variation.
The value of R2 certainly communicates a preliminary
message, but this value is sometimes deceptive because it
can be greatly inflated by using a large number of
predictors relative to the sample size.
For this reason, it is important to have a formal test for
model utility.
The model utility test in simple linear regression involved the null hypothesis H0: β1 = 0, according to which there is no useful relation between y and the single predictor x.
Here we consider the assertion that β1 = 0, β2 = 0, ..., βk = 0, which says that there is no useful relationship between y and any of the k predictors. If at least one of these β's is not 0, the corresponding predictor(s) is (are) useful.
The test is based on a statistic that has a particular F
distribution when H0 is true.
Null hypothesis: H0: β1 = β2 = … = βk = 0
Alternative hypothesis: Ha: at least one βi ≠ 0 (i = 1, ..., k)
Test statistic value:
f = [R²/k] / [(1 – R²)/(n – (k + 1))] = [SSR/k] / [SSE/(n – (k + 1))]    (13.19)
where SSR = regression sum of squares = SST – SSE
Rejection region for a level α test: f ≥ Fα, k, n – (k + 1)
Except for a constant multiple, the test statistic here is
R2/(1 – R2), the ratio of explained to unexplained variation.
If the proportion of explained variation is high relative to
unexplained, we would naturally want to reject H0 and
confirm the utility of the model.
However, if k is large relative to n, the factor [n – (k + 1)]/k will decrease f considerably.
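As a sketch, the F ratio of (13.19) and its P-value can be computed directly from R², k, and n; the numbers passed in below are assumed values for illustration, not the Example 14 output.

```python
from scipy import stats

def model_utility_test(r2, k, n):
    """Model utility F statistic (13.19) and its P-value, computed from R², k, and n."""
    f = (r2 / k) / ((1 - r2) / (n - (k + 1)))
    p_value = stats.f.sf(f, k, n - (k + 1))   # upper-tail area of the F distribution
    return f, p_value

# Illustration with assumed values: R² = .90, k = 4 predictors, n = 30 observations
print(model_utility_test(0.90, 4, 30))
```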
Example 14
Returning to the bond shear strength data of Example 12,
a model with k = 4 predictors was fit, so the relevant
hypotheses are
H0: 1 = 2 = 3 = 4 = 0
Ha: at least one of these four  s is not 0
Figure 13.15 shows output from the JMP statistical
package.
Figure 13.15 Multiple regression output from JMP for the data of Example 14
The values of s (Root Mean Square Error), R2, and
adjusted R2 certainly suggest a useful model.
The value of the model utility F ratio is
This value also appears in the F Ratio column of the
ANOVA table in Figure 13.15.
The largest F critical value for 4 numerator and 25
denominator df is 6.49, which captures an upper-tail area of
.001. Thus P-value < .001.
The ANOVA table in the JMP output shows that
P-value < .0001.
This is a highly significant result.
The null hypothesis should be rejected at any reasonable
significance level.
We conclude that there is a useful linear relationship
between y and at least one of the four predictors in the
model.
This does not mean that all four predictors are useful;
we will say more about this subsequently.
Inferences in Multiple Regression
Before testing hypotheses, constructing CIs, and making
predictions, the adequacy of the model should be assessed
and the impact of any unusual observations investigated.
Methods for doing this are described at the end of the
present section and in the next section.
Because each β̂i is a linear function of the yi's, the standard deviation of each β̂i is the product of σ and a function of the xij's. An estimate s_β̂i of this SD is obtained by substituting s for σ.
The function of the xij's is quite complicated, but all standard statistical software packages compute and show the s_β̂i's.
Inferences concerning a single βi are based on the standardized variable
T = (β̂i – βi) / S_β̂i
which has a t distribution with n – (k + 1) df.
The point estimate of μY·x1*, ..., xk*, the expected value of Y when x1 = x1*, ..., xk = xk*, is
μ̂Y·x1*, ..., xk* = β̂0 + β̂1x1* + ... + β̂kxk*
The estimated standard deviation of the corresponding estimator is again a complicated expression involving the sample xij's. However, appropriate software will calculate it on request.
Inferences about μY·x1*, ..., xk* are based on standardizing its estimator to obtain a t variable having n – (k + 1) df.
1. A 100(1 – α)% CI for βi, the coefficient of xi in the regression function, is
   β̂i ± tα/2, n – (k + 1) · s_β̂i
2. A test for H0: βi = βi0 uses the t statistic value t = (β̂i – βi0) / s_β̂i based on n – (k + 1) df. The test is upper-, lower-, or two-tailed according to whether Ha contains the inequality >, <, or ≠.
3. A 100(1 – α)% CI for μY·x1*, ..., xk* is
   ŷ ± tα/2, n – (k + 1) · s_Ŷ
   where ŷ = β̂0 + β̂1x1* + ... + β̂kxk* is the calculated value of μ̂Y·x1*, ..., xk* and s_Ŷ is its estimated standard deviation.
4. A 100(1 – α)% PI for a future y value is
   ŷ ± tα/2, n – (k + 1) · (s² + s_Ŷ²)^(1/2)
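A minimal sketch of items 1 and 4 above (the numerical inputs in the demonstration call are illustrative assumptions, not values from the text):

```python
from scipy import stats

def beta_ci(beta_hat, se_beta, n, k, conf=0.95):
    """100(1 - alpha)% CI for a single regression coefficient (item 1 above)."""
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - (k + 1))
    return beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta

def prediction_interval(y_hat, s, se_yhat, n, k, conf=0.95):
    """100(1 - alpha)% PI for a future y at given predictor values (item 4 above)."""
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - (k + 1))
    half_width = t_crit * (s**2 + se_yhat**2) ** 0.5
    return y_hat - half_width, y_hat + half_width

# Illustrative call with assumed numbers
print(beta_ci(beta_hat=0.113, se_beta=0.030, n=13, k=2, conf=0.99))
```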
Simultaneous intervals for which the simultaneous
confidence or prediction level is controlled can be obtained
by applying the Bonferroni technique.
Example 15
Soil and sediment adsorption, the extent to which
chemicals collect in a condensed form on the surface, is an
important characteristic influencing the effectiveness of
pesticides and various agricultural chemicals.
The article “Adsorption of Phosphate, Arsenate,
Methanearsonate, and Cacodylate by Lake and Stream
Sediments: Comparisons with Soils” (J. of Environ. Qual.,
1984: 499–504) gives the accompanying data (Table 13.5)
on y = phosphate adsorption index, x1 = amount of
extractable iron, and x2 = amount of extractable aluminum.
Table 13.5 Data for Example 15
The article proposed the model Y = β0 + β1x1 + β2x2 + ε. A computer analysis yielded the estimated coefficients and estimated standard deviations used in the calculations below.
A 99% CI for 1, the change in expected adsorption
associated with a 1-unit increase in extractable iron while
extractable aluminum is held fixed, requires
t.005,13 – (2 + 1) = t.005,10 = 3.169.
The CI is
.11273  (3.169)(.02969) = .11273  .09409  (.019, .207)
Similarly, a 99% interval for 2 is
.34900  (3.169)(.07131) = .34900  .22598  (.123, .575)
The Bonferroni technique implies that the simultaneous confidence level for both intervals is at least 98%.
A 95% CI for μY·160, 39, expected adsorption when extractable iron = 160 and extractable aluminum = 39, is
24.30 ± (2.228)(1.30) = 24.30 ± 2.90 = (21.40, 27.20)
A 95% PI for a future value of adsorption to be observed when x1 = 160 and x2 = 39 is
24.30 ± (2.228){(4.379)² + (1.30)²}^(1/2) = 24.30 ± 10.18 = (14.12, 34.48)
Frequently, the hypothesis of interest has the form H0: βi = 0 for a particular i. For example, after fitting the four-predictor model in Example 12, the investigator might wish to test H0: β4 = 0.
According to H0, as long as the predictors x1, x2, and x3 remain in the model, x4 contains no useful information about y.
The test statistic value is the t ratio t = β̂i / s_β̂i.
Many statistical computer packages report the t ratio and
corresponding P-value for each predictor included in the
model.
For example, Figure 13.15 shows that as long as power,
temperature, and time are retained in the model, the
predictor x1 = force can be deleted.
An F Test for a Group of Predictors. The model utility F test was appropriate for testing whether there is useful information about the dependent variable in any of the k predictors (i.e., whether β1 = ... = βk = 0).
In many situations, one first builds a model containing k
predictors and then wishes to know whether any of the
predictors in a particular subset provide useful information
about Y.
For example, a model to be used to predict students’ test
scores might include a group of background variables such
as family income and education levels and also some
school characteristic variables such as class size and
spending per pupil.
One interesting hypothesis is that the school characteristic
predictors can be dropped from the model.
Let’s label the predictors as x1, x2,..., xl, xl+1,..., xk,
so that it is the last k – l that we are considering deleting.
The relevant hypotheses are as follows:
H0: βl+1 = βl+2 = ... = βk = 0
(so the “reduced” model Y = β0 + β1x1 + ... + βlxl + ε is correct)
versus
Ha: at least one among βl+1, ..., βk is not 0
(so in the “full” model Y = β0 + β1x1 + ... + βkxk + ε, at least one of the last k – l predictors provides useful information)
The test is carried out by fitting both the full and reduced
models. Because the full model contains not only the
predictors of the reduced model but also some extra
predictors, it should fit the data at least as well as the
reduced model.
That is, if we let SSEk be the sum of squared residuals for the full model and SSEl be the corresponding sum for the reduced model, then SSEk ≤ SSEl.
Intuitively, if SSEk is a great deal smaller than SSEl, the full
model provides a much better fit than the reduced model;
the appropriate test statistic should then depend on the
reduction SSEl – SSEk in unexplained variation.
SSEk = unexplained variation for the full model
SSEl = unexplained variation for the reduced model
Test statistic value:
f = [(SSEl – SSEk)/(k – l)] / [SSEk/(n – (k + 1))]    (13.20)
Rejection region: f ≥ Fα, k – l, n – (k + 1)
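A short sketch of the partial F test in (13.20); the SSE values, k, l, and n used in the demonstration call are made-up numbers, not taken from the text.

```python
from scipy import stats

def group_f_test(sse_full, sse_reduced, k, l, n):
    """Partial F statistic (13.20) for H0: the last k - l predictors can be dropped."""
    f = ((sse_reduced - sse_full) / (k - l)) / (sse_full / (n - (k + 1)))
    p_value = stats.f.sf(f, k - l, n - (k + 1))
    return f, p_value

# Illustrative numbers: full model with k = 5 predictors, reduced model with l = 2, n = 40
print(group_f_test(sse_full=120.0, sse_reduced=210.0, k=5, l=2, n=40))
```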
Assessing Model Adequacy
The standardized residuals in multiple regression result
from dividing each residual by its estimated standard
deviation; the formula for these standard deviations is
substantially more complicated than in the case of simple
linear regression.
We recommend a normal probability plot of the
standardized residuals as a basis for validating the
normality assumption.
Plots of the standardized residuals versus each predictor and versus ŷ should show no discernible pattern. Adjusted residual plots can also be helpful in this endeavor. The book by Neter et al. is an extremely useful reference.
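The sketch below (synthetic data) computes standardized residuals using the common hat-matrix standardization, which is one way to implement the "estimated standard deviation" mentioned above, and then produces a normal probability plot and a residual-versus-fitted plot:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def standardized_residuals(X, y):
    """Standardized residuals e_i / (s * sqrt(1 - h_ii)) for a least squares fit.

    X is the design matrix including the column of 1s; this standardization is a
    common choice and is assumed here rather than taken from the text.
    """
    n, p = X.shape                          # p = k + 1
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix
    y_hat = H @ y
    e = y - y_hat
    s = np.sqrt(e @ e / (n - p))
    return e / (s * np.sqrt(1 - np.diag(H))), y_hat

# Example diagnostic plots with synthetic data (illustrative only)
rng = np.random.default_rng(3)
n = 13
x1, x2 = rng.uniform(0, 300, n), rng.uniform(0, 100, n)
y = 10 + 0.11 * x1 + 0.35 * x2 + rng.normal(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])

std_res, y_hat = standardized_residuals(X, y)
stats.probplot(std_res, plot=plt.subplot(1, 2, 1))   # normal probability plot
plt.subplot(1, 2, 2).scatter(y_hat, std_res)         # standardized residual vs fitted value
plt.tight_layout()
plt.show()
```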
Example 17
Figure 13.16 shows a normal probability plot of the standardized residuals for the adsorption data and fitted model given in Example 15. The straightness of the plot casts little doubt on the assumption that the random deviation ε is normally distributed.
Figure 13.16 A normal probability plot of the standardized residuals for the data and model of Example 15
Figure 13.17 shows the other suggested plots for the
adsorption data.
Figure 13.17 Diagnostic plots for the adsorption data: (a) standardized residual versus x1; (b) standardized residual versus x2; (c) standardized residual versus ŷ; (d) ŷ versus y
Given that there are only 13 observations in the data set,
there is not much evidence of a pattern in any of the first
three plots other than randomness. The point at the bottom
of each of these three plots corresponds to the observation
with the large residual. We will say more about such
observations subsequently. For the moment, there is no
compelling reason for remedial action.