CHAPTER FOUR
MULTIPLE LINEAR REGRESSION
TABLE OF CONTENTS
OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Notation for multiple regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-1
Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
ORDINARY METHOD OF LEAST SQUARES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-3
Evaluation of partial derivatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-4
System of equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-6
Solution for a known intercept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9
Solution for known slope(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
ANALYSIS OF SUM OF SQUARES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
Sum of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10
Partitioning of total sum of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Mean square . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-11
Coefficient of determination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-12
Extra sums of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-15
EXAMPLE PROBLEM #1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-17
Key matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-18
Regression coefficients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-19
ANOVA and R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
EXCEL Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-20
Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-21
REPRESENTATION FOR STATISTICAL ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-22
Properties of least squares estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-24
Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-25
Hypothesis testing of parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
F test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-26
EXAMPLE PROBLEM #2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-28
Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-28
Test significance of regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-29
Confidence interval for β1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-29
Confidence interval for regression surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-30
Test slope parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-31
Final model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-32
MULTICOLLINEARITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-32
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-32
Nearly perfect multicollinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-33
Theoretical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-35
Example problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-37
SELECTION OF REGRESSION MODEL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-42
Overview of concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-42
Guidelines for removing variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-43
Example problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-44
Stepwise regression analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-51
OTHER REGRESSION TOPICS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-52
Polynomial regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-52
Indicator variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-53
Piecewise linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-54
Common slope or intercept parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-56
Common variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-58
APPENDIX 4-A: SELECTED DERIVATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-60
Total and regression sum of squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-60
Expected value of least square estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-60
Variance-covariance matrix of b . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-61
Gauss-Markov theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-63
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-65
PROBLEM ASSIGNMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-66
SOLUTION KEY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-71
CHAPTER FOUR
MULTIPLE LINEAR REGRESSION
I am too familiar with the manner in which actual data are met with the suggestion that other data, if they were
collected, might show something else to believe it to have any value as an argument. “Statistics on the table,
please,” can be my sole reply.
Karl Pearson, 1910
OVERVIEW
Introduction
Background
Multiple linear regression is a method of fitting and evaluating a linear algebraic equation to observed data when there is more than one independent variable.
Similar to simple regression, multiple regression analysis can be viewed as having two components:
* Optimization criteria for fitting the equation
* Statistical inference
Noteworthy differences from simple regression include:
* Matrix algebra is required for an efficient solution
* Correlation among the independent variables is very important
Uses
In general, multiple regression analysis is used for:
* Description: The primary objective is to identify significant variables. Statistical inferences are of critical importance.
* Prediction: The primary objective is to develop a predictive model. Physical insight into the process should be used in the selection of parameters.
Notation for multiple regression
Model for population
Let's expand on our grain bin example used in Chapters 2 and 3 by considering the drying of grain using different systems over several years. We are now interested in the moisture content of the kernels (dependent variable) as influenced by independent variables such as the height in the bin (x1), air flow rate (x2), air temperature (x3), relative humidity (x4), and other possible factors.
If all kernels were measured in the population, then the regression model would be

yi = ηi + εi

where yi is the dependent variable (moisture content of the kernel) and ηi is the linear model defined for the population as

ηi = β0 + β1 xi1 + β2 xi2 + ... + βp xip

where β0 through βp are the population parameters and xi1 through xip are the independent variable values (height, air flow rate, etc.) corresponding to yi.

The residual, εi, is the deviation between the linear model and the observation. This deviation is assumed to be random and is the result of measurement error and/or other factors.
Linear model from sample
The corresponding linear model using a sample of y and x values is

ŷi = b0 + b1 xi1 + b2 xi2 + ... + bp xip

where b0 through bp are the sample estimates of the population parameters and ŷi is the predicted value.

For n observations, we have the following set of equations,

ŷ1 = b0 + b1 x11 + ... + bp x1p
ŷ2 = b0 + b1 x21 + ... + bp x2p
...
ŷn = b0 + b1 xn1 + ... + bp xnp

which can be written in matrix form as

ŷ = x b

where ŷ = n x 1 vector of predicted values, b = m x 1 vector of parameters, and x = n x m matrix of independent variable values (with a leading column of ones), where m = p + 1 is the number of parameters.

The errors or residuals for the n observations are simply defined as

ei = yi − ŷi,   i = 1, ..., n

which can be written in matrix form as

e = y − ŷ = y − x b

where e and y are n x 1 vectors.
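To make the matrix notation concrete, here is a minimal numpy sketch. The data values and variable names (y, x_raw, b) are illustrative assumptions, not from the text; the sketch only shows how the n x m matrix with a leading column of ones, the predictions ŷ = x b, and the residuals e = y − ŷ fit together.

```python
import numpy as np

# Hypothetical data: n = 4 observations, p = 2 independent variables
y = np.array([3.1, 4.0, 5.2, 6.1])              # observed dependent variable (n x 1)
x_raw = np.array([[1.0, 0.5],
                  [2.0, 0.7],
                  [3.0, 0.6],
                  [4.0, 0.9]])                  # independent variables (n x p)

n, p = x_raw.shape
m = p + 1                                       # number of parameters, m = p + 1
x = np.column_stack([np.ones(n), x_raw])        # n x m matrix with a leading column of ones

b = np.array([2.0, 1.0, 0.5])                   # assumed parameter vector (m x 1)
y_hat = x @ b                                   # predicted values, y-hat = x b
e = y - y_hat                                   # residuals, e = y - y-hat
```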
Extrapolation
Extrapolation outside the range of x should be done cautiously. As discussed in Chapter 3, the confidence intervals become wider as the distance from the mean increases. More importantly, the relationship may be approximately linear within the range of x but nonlinear outside it, so substantial errors are then possible. Both of these concepts are shown below. As demonstrated in an example problem later, the acceptable interpolation region of the observed data is more difficult to identify for multiple regression problems.
[Figure: y versus x, showing confidence intervals that widen away from the mean and a relationship that is linear within the observed range of x but nonlinear outside it.]

ORDINARY METHOD OF LEAST SQUARES
Definition
As discussed in Chapter 3, the ordinary least squares method minimizes the sum of squared residuals. In matrix notation, the residual sum of squares is defined as

eᵀe = (y − x b)ᵀ (y − x b)

and therefore the objective function for determining the least squares parameters is defined as

M = Σ ei² = eᵀe

A necessary condition for M to be minimized, as previously shown, is that the partial derivative of M with respect to each parameter bj equals zero.
Evaluation of partial derivatives
Partial with respect to b0
Let's first expand M as a sum of squared residuals, from which it is easily shown that the derivative of M with respect to b0 depends on the derivative of ŷi with respect to b0. For linear regression we have ŷi = b0 + b1 xi1 + ... + bp xip, and therefore the derivative is easily obtained as ∂ŷi/∂b0 = 1. By using these results, by setting the derivative equal to zero, and by dividing through by −2, we obtain the first equation for the minimum,

Σ (yi − b0 − b1 xi1 − ... − bp xip) = Σ (yi − ŷi) = 0
Important results for minimization with respect to b0
Once again, there are two important results of minimizing M with respect to b0. Since ei = (yi − ŷi), we obtain

Σ ei = Σ (yi − ŷi) = 0

that is, the sum of the residuals is zero.

Let's rearrange the above equation as

Σ yi = n b0 + b1 Σ xi1 + ... + bp Σ xip

where the definition of ŷ has been used and each term has been summed separately. By dividing through by n, we obtain

ȳ = b0 + b1 x̄1 + ... + bp x̄p

The mean values of y and x1 through xp lie on the regression surface.

[Figure: regression surface for two independent variables, passing through the point (E(x1), E(x2), E(y)).]
Evaluation with respect to other parameters
After expanding M, the partial derivative with respect to b1 is obtained in the same manner. By using the previously given equation for ŷi for linear regression, we quickly conclude that ∂ŷi/∂b1 = xi1. By using these results, by once again setting the derivative equal to zero, and by dividing through by −2, we obtain the second equation for the minimum,

Σ xi1 (yi − ŷi) = 0

The partial derivatives with respect to the other parameters result in the same form with a different independent variable. The general solution¹ for j ≠ 0 is therefore

Σ xij (yi − ŷi) = 0
System of equations
Standard equation set
Let's review our minimization results. The minimum with respect to b0 gave us

Σ (yi − ŷi) = 0

and with respect to b1

Σ xi1 (yi − ŷi) = 0

and with respect to b2 and each subsequent parameter the corresponding equation with xi2, ..., and with respect to bp

Σ xip (yi − ŷi) = 0

This system of equations can be written compactly in matrix form.

¹ A simple check of a matrix solution for multiple regression is obtained by testing Σ xij ei = 0 for each j.
For linear regression, the partial derivatives are easily evaluated to obtain the following simple matrix form:

xᵀ (y − ŷ) = 0

where xᵀ as previously defined has been used. By using the definition of the vector y, and since we have previously defined ŷ = x b, the system of equations can be written as

xᵀ y − xᵀ x b = 0

or

xᵀ x b = xᵀ y

where xᵀ = m x n matrix, y = n x 1 vector, xᵀy = m x 1 vector, x b = n x 1 vector, and xᵀ x b = m x 1 vector (where m = p + 1).
If the inverse is known
By using the definition of the identity matrix, we obtain

b = (xᵀx)⁻¹ xᵀ y

Solution approaches for nonlinear equations are given in Chapter 6. The elements of the xᵀy vector are easily evaluated directly as Σ yi, Σ xi1 yi, ..., Σ xip yi, and the elements of the xᵀx matrix are n and the corresponding sums Σ xij and sums of squares and cross products Σ xij xik.
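As a sketch of the matrix solution b = (xᵀx)⁻¹ xᵀy, the following numpy function (an illustration with hypothetical data, not a prescribed procedure) forms the normal equations and solves them; in practice np.linalg.solve is preferred to forming the inverse explicitly.

```python
import numpy as np

def ols_coefficients(x_raw, y):
    """Least squares parameters from x^T x b = x^T y, with an intercept column added."""
    n = len(y)
    x = np.column_stack([np.ones(n), x_raw])    # n x m design matrix
    xtx = x.T @ x                               # m x m sums of squares and cross products
    xty = x.T @ y                               # m x 1 vector
    b = np.linalg.solve(xtx, xty)               # solves x^T x b = x^T y
    return b, x

# Hypothetical example data
y = np.array([3.1, 4.0, 5.2, 6.1])
x_raw = np.array([[1.0, 0.5], [2.0, 0.7], [3.0, 0.6], [4.0, 0.9]])
b, x = ols_coefficients(x_raw, y)
print(b)                                        # b0, b1, b2
```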
No Intercept/Mean Difference Formulation
For some applications it is more convenient to formulate the estimation of the least squares parameters using the differences of the dependent and independent variables from their respective means. Let's start by using the solution for ∂M/∂b0 = 0, that is,

b0 = ȳ − b1 x̄1 − ... − bp x̄p

By using this result, the regression equation can alternatively be written as

ŷi − ȳ = b1 (xi1 − x̄1) + ... + bp (xip − x̄p)   or   ŷ′i = b1 x′i1 + ... + bp x′ip

where the primed terms are defined as the variables relative to their means. To solve for b1 through bp, we again find the minimum with respect to each parameter, resulting in a set of "p" equations which can be written in the following matrix form

x′ᵀx′ b = x′ᵀy′

We therefore obtain the following matrix solution for b1 through bp

b = (x′ᵀx′)⁻¹ x′ᵀy′

where b is the vector of slope parameters and y′ is the vector of (yi − ȳ) values. As discussed in more detail later, the variance-covariance matrix of b is defined as

VAR(b) = σ² (x′ᵀx′)⁻¹

where for independent and constant variance we have used E[ε εᵀ] = σ² I.

Let's now examine the matrices x′ᵀy′ and x′ᵀx′ more closely. From matrix multiplication, the elements of x′ᵀy′ are the sums Σ (xij − x̄j)(yi − ȳ) and the elements of x′ᵀx′ are the sums Σ (xij − x̄j)(xik − x̄k), that is, the sums of squares and cross products of the mean-corrected variables.
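A minimal sketch of this mean-difference formulation, using the same hypothetical arrays as the earlier sketches: the slopes come from the mean-corrected matrices and the intercept is then recovered from b0 = ȳ − Σ bj x̄j.

```python
import numpy as np

y = np.array([3.1, 4.0, 5.2, 6.1])
x_raw = np.array([[1.0, 0.5], [2.0, 0.7], [3.0, 0.6], [4.0, 0.9]])

y_prime = y - y.mean()                          # y' = y - ybar
x_prime = x_raw - x_raw.mean(axis=0)            # x' = x - xbar (column means)

# Slopes from b = (x'^T x')^-1 x'^T y'
slopes = np.linalg.solve(x_prime.T @ x_prime, x_prime.T @ y_prime)

# Intercept recovered from b0 = ybar - sum(bj * xbar_j)
b0 = y.mean() - x_raw.mean(axis=0) @ slopes
print(b0, slopes)
```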
Solution for a known intercept
Computation of bj
Let's consider the special case where the intercept has a known value of β0, usually β0 = 0, so that the linear regression model is

ŷi = β0 + b1 xi1 + ... + bp xip

By removing the first condition in the previous section, the appropriate system of equations is

Σ xij (yi − β0 − b1 xi1 − ... − bp xip) = 0,   j = 1, ..., p

where xᵀ is defined here without the leading row of ones. The system of equations can then be written as

xᵀ (y − β0 − x b) = 0

which can be rearranged as

xᵀ x b = xᵀ (y − β0)

which can be solved for the unknown vector b.

Important restrictions
Since we are no longer minimizing M with respect to b0, we lose the useful conditions that the sum of the residuals is zero and that the mean values lie on the regression surface, that is, in general Σ ei ≠ 0 and ȳ ≠ β0 + b1 x̄1 + ... + bp x̄p.
Solution for known slope(s)
Let's consider the solution where the slope of one or more terms is known. For illustration, we will assume that values are known for β1 and βp-1. The solution can be obtained from essentially the same set of equations with two fewer equations, where xᵀ is redefined here with two fewer rows and the known terms are moved to the right-hand side. The resulting system can be solved for the unknown vector b. As before, the sum of the residuals is again zero and the mean values lie on the regression surface.
ANALYSIS OF SUM OF SQUARES
Sum of squares
Similar to simple regression, the total sum of squares (SSTO) is defined as the sum of the squared deviations between each observed value and the mean of y, or

SSTO = Σ (yi − ȳ)²

or in matrix notation, SSTO = yᵀy − n ȳ².

The residual (or error) sum of squares (SSE) is defined as the sum of the squared deviations between the observed and predicted (local mean) values of y, or

SSE = Σ (yi − ŷi)²

or in matrix notation, SSE = eᵀe = (y − x b)ᵀ (y − x b).

The regression sum of squares (SSR) is defined as the sum of the squared deviations of the predicted (local mean) values from the mean of y, or

SSR = Σ (ŷi − ȳ)²

For the minimization solution for b0, we have shown that Σ ei = 0, and we conclude that Σ ŷi = Σ yi and therefore E(ŷ) = ȳ, so the mean of y can be used directly in the SSR expression. An alternative computational form that is frequently used is

SSR = bᵀ xᵀ y − n ȳ²

Details of this derivation are given in Appendix 4-A.
Partitioning of total sum of squares
Similar to the results in Chapter 2, we are interested in partitioning the total sum of squares into the residual and regression sums of squares. As shown in Appendix 4-A, we obtain for multiple regression analysis

SSTO = SSR + SSE

for the conditions of
* Linear relationship
* Least squares estimates for all bj parameters

Once again, the first condition is violated in the nonlinear regression chapter, and the second condition is violated when the intercept term is set to zero. When either of the above conditions is violated, we have

SSTO ≠ SSR + SSE
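The partitioning can be verified numerically with a short sketch (hypothetical data, all parameters including the intercept estimated by least squares):

```python
import numpy as np

y = np.array([3.1, 4.0, 5.2, 6.1])
x_raw = np.array([[1.0, 0.5], [2.0, 0.7], [3.0, 0.6], [4.0, 0.9]])
x = np.column_stack([np.ones(len(y)), x_raw])

b = np.linalg.solve(x.T @ x, x.T @ y)
y_hat = x @ b

ssto = np.sum((y - y.mean()) ** 2)              # total sum of squares
sse = np.sum((y - y_hat) ** 2)                  # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)           # regression sum of squares

print(ssto, ssr + sse)                          # equal up to roundoff for this case
```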
Mean square
Regression and residual mean squares
Similar to the result of simple regression, the “average” square deviations for the residual
sum of square is called the residual (or error) mean square and is defined a
where m is the number of estimated parameters (m = p+1) and n - m is the degrees of
freedom.
Likewise, the regression mean square is defined as
Equivalent ANOVA table
For a linear relationship with all of the coefficients estimated by the method of least squares, the results of the sum of squares analysis are conveniently summarized in the following equivalent ANOVA table (where SSR is valid only if all b's are optimized).

Source        Degrees of Freedom    Sum of Squares    Mean Square
Regression    m - 1                 SSR               MSR = SSR/(m - 1)
Residual      n - m                 SSE               MSE = SSE/(n - m)
Total         n - 1                 SSTO
In addition to the coefficient of determination, the parameters that are particularly useful in assessing the regression analysis are the residual mean square,

s² = MSE

which is an unbiased estimate of the variance of the pdf describing ε, and the overall F, which is used to evaluate the slope terms (as discussed in greater detail later) and is defined as

F = MSR / MSE

A large F indicates a significant regression model.
Coefficient of determination
Definition
A useful measure of the goodness of fit of the regression model is the coefficient of determination, defined as

Cd = 1 − SSE / SSTO

where Cd is the general definition of the coefficient of determination and the other terms are as previously defined.

Least squares estimates of all bj parameters
As previously shown, for least squares estimates of all parameters, SSTO = SSR + SSE. The coefficient of determination can then be written as

R² = SSR / SSTO = 1 − SSE / SSTO

where R² is used for the coefficient of determination of multiple regression. Similar to r², it has the following useful characteristics:
* Lies between zero and one
* Equals the fraction of the total sum of squares corresponding to the regression model
Discussion
The value of R² always increases with additional independent variables, even if there is no relationship between these variables and the dependent variable. This result will be apparent later in the chapter with our example problems. Theoretical insight can be obtained for the two-dimensional problem shown below (the raw data are given later in the chapter).
[Figure: SSE plotted as a function of b1 and b2. Left panel: y = ȳ + b1(x1 − x̄1) + (0)(x2 − x̄2), minimum SSE = 171.0986 at b1 = 13.150. Right panel: y = ȳ + b1(x1 − x̄1) + b2(x2 − x̄2), minimum SSE = 171.0983 at b1 = 13.151.]
The left figure shows the SSE using a single slope parameter in the regression equation
(actually the intercept parameter has been optimized in both figures). For this illustration, it
is useful to view this graph as a special case of a two-parameter equation where b2= 0. The
right graph shows the solution using two optimization parameters. The optimized b2 value is
0.0112. The SSE values are very similar, differing by only 0.0003. Let’s consider the
implication of adding another independent variable. If b2 = 0 (exactly), then the SSE is the
same as for the one-parameter model. If b2 ≠ 0 (exactly), there exists a value of b2 that yields a
smaller SSE.
Since a value of bi is never exactly equal to zero in practice, the SSE will always be reduced
(even if by a very small amount, as in the above figure) with the addition of another
parameter, and therefore the value of R2 will always increase. If your goal is to simply have
a multiple regression model with the largest coefficient of determination, then your solution
is to use as many independent parameters as possible, even if they are frivolous and have no
relationship to the dependent variable.
An "adjusted" R² is frequently used in statistical software to allow the coefficient of determination to decrease with the addition of unrelated independent variables. It is defined as

R̂² = 1 − [SSE / (n − m)] / [SSTO / (n − 1)]

where R̂² is the adjusted R², defined as one minus the ratio of the variance of the residuals to the variance of y relative to the mean. If SSE decreases by only a small amount with an additional parameter, R̂² can actually decrease because of the relatively larger decrease in the (n − m) term.

Since SSTO and n − 1 are constant for a given set of observations, the selection of appropriate independent variables using R̂² is really determined by the parameter set that yields the smallest MSE. As discussed at length later in the chapter, Wilson recommends that you consider R², MSE, and the overall F in determining the best parameter set. If you use this approach, little is gained by using the R̂² criterion.
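A short sketch of R² and the adjusted R̂² computed from the sums of squares; the function name and arguments are illustrative, and the example call uses the sums of squares from the flood example worked later in this chapter.

```python
def r_squared(ssto, sse, n, m):
    """Coefficient of determination and adjusted value for m estimated parameters."""
    r2 = 1.0 - sse / ssto                                 # R^2 = SSR/SSTO = 1 - SSE/SSTO
    r2_adj = 1.0 - (sse / (n - m)) / (ssto / (n - 1))     # 1 - MSE / [SSTO/(n - 1)]
    return r2, r2_adj

# Values from Example Problem #1: SSTO = 13,353.7, SSE = 171.1, n = 14, m = 3
print(r_squared(ssto=13353.7, sse=171.1, n=14, m=3))      # roughly (0.987, 0.985)
```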
Extra sums of squares
Example
Additional insight into the impact of adding an independent variable on SSE and SSR is often obtained using extra sums of squares. This concept is best introduced using an example problem. Consider the data reported by Neter and Wasserman (1974) of skin cream sales (y) in different districts as a function of population (x1) and per capita income (x2).
Obs    y     x1     x2        Obs    y     x1     x2
 1    162   274   2450         9    116   195   2137
 2    120   180   3254        10     55    53   2560
 3    223   375   3802        11    252   430   4020
 4    131   205   2838        12    232   372   4427
 5     67    86   2347        13    144   236   2660
 6    169   265   3782        14    103   157   2088
 7     81    98   3008        15    212   370   2605
 8    192   330   2450
Sum of squares
Let's consider the sums of squares obtained by including different independent variables in the regression analysis. The total sum of squares is constant for all options, SSTO = 53,902.

The following sums of squares were obtained with both x1 and x2 in the regression model:
SSR(x1,x2) = 53,845 and SSE(x1,x2) = 56.9

The following sums of squares were obtained with just x1 in the regression model:
SSR(x1) = 53,417 and SSE(x1) = 485

The following sums of squares were obtained with just x2 in the regression model:
SSR(x2) = 22,030 and SSE(x2) = 31,872
A schematic illustrating these different sums of squares is shown below.

[Figure: partitioning of SSTO, the variability around the mean (s₀² = Σ(yi − ȳ)²/(n − 1)), into SSE and SSR for the regressions on x1, on x2, and on both x1 and x2, where SSR(x1/x2) = SSE(x2) − SSE(x1,x2) and SSR(x2/x1) = SSE(x1) − SSE(x1,x2).]
Extra sum of squares
Let’s now examine the difference in the residual terms for the regression models with just x2
and with both x1 and x2, or
This corresponds to a shift of sum of squares from residual to regression when x1 is added to
the model (given that x2 is already in the model), or
Likewise the shift when x2 is added (given original model of x1) can be represented as
which again represents the marginal increase in sum of squares due to regression when x2 is
added to the model.
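The extra sums of squares above can be reproduced with a short numpy sketch on the skin cream data. The helper function simply fits a least squares model and returns its SSE; this is an illustrative computation, not the text's procedure.

```python
import numpy as np

y  = np.array([162, 120, 223, 131,  67, 169,  81, 192, 116,  55, 252, 232, 144, 103, 212], float)
x1 = np.array([274, 180, 375, 205,  86, 265,  98, 330, 195,  53, 430, 372, 236, 157, 370], float)
x2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450,
               2137, 2560, 4020, 4427, 2660, 2088, 2605], float)

def sse(*cols):
    """Residual sum of squares for a least squares fit of y on the given columns plus an intercept."""
    x = np.column_stack([np.ones(len(y))] + list(cols))
    b = np.linalg.lstsq(x, y, rcond=None)[0]
    return np.sum((y - x @ b) ** 2)

sse_12 = sse(x1, x2)              # about 57
sse_1  = sse(x1)                  # about 485
sse_2  = sse(x2)                  # about 31,872
print(sse_2 - sse_12)             # SSR(x1/x2), about 31,815
print(sse_1 - sse_12)             # SSR(x2/x1), about 428
```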
ANOVA table
The extra sums of squares are sometimes also summarized in ANOVA tables. The results obtained by adding x2 are shown below.

Source          df    Sum of Squares
Regression       2    SSR(x1,x2) = 53,845
  x1             1    SSR(x1) = 53,417
  x2 given x1    1    SSR(x2/x1) = 428
Residual        12    SSE(x1,x2) = 56.9
Total           14    53,902
The results obtained by adding x1 are shown below.

Source          df    Sum of Squares
Regression       2    SSR(x1,x2) = 53,845
  x2             1    SSR(x2) = 22,030
  x1 given x2    1    SSR(x1/x2) = 31,815
Residual        12    SSE(x1,x2) = 56.9
Total           14    53,902
EXAMPLE PROBLEM #1
Data set
Problem statement
Let's predict the mean annual flood (Q in thousands of cfs) as a function of watershed area (A in thousands of square miles) and average annual maximum rainfall depth (I in inches) using the following regression model,

Q = β0 + β1 A + β2 I

which, in our notation for sample statistics, is

ŷ = b0 + b1 x1 + b2 x2
Raw data
The data reported by Haan (1979) for fourteen different watersheds are shown below, together with the fitted values and residuals discussed later.

Obs     y        x1        x2      ŷ       ei      (ŷ - ȳ)²
       (cfs)   (sq mi)    (in)   (cfs)    (cfs)     (cfs)²
 1     15.50    1.250     1.7     18.1    -2.62       13.0
 2      8.50    0.871     2.1     13.1    -4.64       73.8
 3     85.00    5.690     1.9     76.5     8.49     3001.4
 4    105.00    8.270     1.9    110.4    -5.44     7870.2
 5     24.80    1.620     2.1     23.0     1.82        1.6
 6      3.80    0.175     2.4      4.0    -0.19      314.6
 7      1.76    0.148     3.2      3.6    -1.88      327.0
 8     18.00    1.400     2.7     20.1    -2.10        2.6
 9      8.75    0.297     2.9      5.6     3.16      260.1
10      8.25    0.322     2.9      5.9     2.33      249.6
11      3.56    0.178     2.8      4.0    -0.47      313.1
12      1.90    0.148     2.7      3.6    -1.73      327.2
13     16.50    0.872     2.1     13.1     3.35       73.5
14      2.80    0.091     2.9      2.9    -0.09      354.8
Sum   304.1    21.33     34.3    304.1     0.00    13182.6
Mean   21.7     1.52      2.45    21.7

Useful summation terms for this data set are Σy = 304.12, Σx1 = 21.332, Σx2 = 34.3, Σx1² = 108.741, Σx2² = 86.99, Σx1x2 = 43.342, Σx1y = 1465.893, Σx2y = 627.8, and Σ(yi − ȳ)² = 13,353.7.

Key matrices
Data set matrices
For the above data set, y is the 14 x 1 vector of observed flood values and x is the 14 x 3 matrix whose first column is all ones and whose second and third columns are the x1 and x2 values above.

Matrix products
By using the above matrices, we obtain the following matrix products

xᵀy = [  304.12 ]          xᵀx = [ 14      21.332   34.3    ]
      [ 1465.893 ]                [ 21.332  108.741  43.342  ]
      [  627.8   ]                [ 34.3    43.342   86.99   ]
Inverse matrix
Let’s review the procedures given in Chapter 1 for computing the inverse of a 3x3 matrix
with the following elements,
and the inverse matrix is calculated as
By using the results in the previous section, the inverse of the xTx matrix is defined as
Regression coefficients
As previously shown, the regression coefficients are defined as

b = (xᵀx)⁻¹ xᵀ y

By using the above matrices, we obtain

b0 = 1.657,  b1 = 13.151,  b2 = 0.0112

Therefore, the regression model is

ŷ = 1.657 + 13.151 x1 + 0.0112 x2
ANOVA and R²
ANOVA table
Let's now compute SSR as

SSR = Σ (ŷi − ȳ)² = 13,182.6

The SSTO is computed from previously given values as

SSTO = Σ (yi − ȳ)² = 13,353.7

Since SSE = SSTO − SSR = 171.1, the ANOVA table for this example problem can be written as

Source        df    Sum of Squares    Mean Square
Regression     2    13182.6           6591.3
Residual      11    171.1             15.6
Total         13    13353.7
Multiple coefficient of determination
The multiple coefficient of determination is

R² = SSR / SSTO = 13,182.6 / 13,353.7 = 0.987

Roughly 99% of the variation of the flow rates around the mean is "explained" by the regression equation.
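The hand calculations above can be checked with a short numpy sketch using the fourteen observations from the data set. This is an illustrative verification, not the EXCEL procedure described next.

```python
import numpy as np

y  = np.array([15.50, 8.50, 85.00, 105.00, 24.80, 3.80, 1.76,
               18.00, 8.75, 8.25, 3.56, 1.90, 16.50, 2.80])
x1 = np.array([1.250, 0.871, 5.690, 8.270, 1.620, 0.175, 0.148,
               1.400, 0.297, 0.322, 0.178, 0.148, 0.872, 0.091])
x2 = np.array([1.7, 2.1, 1.9, 1.9, 2.1, 2.4, 3.2, 2.7, 2.9, 2.9, 2.8, 2.7, 2.1, 2.9])

x = np.column_stack([np.ones(len(y)), x1, x2])
b = np.linalg.solve(x.T @ x, x.T @ y)            # about [1.657, 13.151, 0.0112]

y_hat = x @ b
ssto = np.sum((y - y.mean()) ** 2)               # about 13,353.7
sse  = np.sum((y - y_hat) ** 2)                  # about 171.1
ssr  = ssto - sse                                # about 13,182.6
mse  = sse / (len(y) - 3)                        # m = 3 parameters
print(b, ssr / ssto, (ssr / 2) / mse)            # coefficients, R^2 ~ 0.987, F ~ 424
```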
EXCEL Solution
The solution approach using Microsoft EXCEL is summarized below. The spreadsheet builds the x matrix (a column of ones plus the x1 and x2 columns), forms xᵀ with Paste Special - Transpose, and then evaluates the key matrices and summary statistics:

xᵀx = [ 14      21.332   34.3    ]        xᵀy = [  304.12  ]
      [ 21.332  108.741  43.342  ]              [ 1465.893 ]
      [ 34.3    43.342   86.99   ]              [  627.8   ]

(xᵀx)⁻¹ = [  3.7168  -0.1809  -1.3754 ]
          [ -0.1809   0.0203   0.0612 ]
          [ -1.3754   0.0612   0.5233 ]

b = (xᵀx)⁻¹ xᵀy = [1.657, 13.151, 0.0112]ᵀ

SSTO = 13,353.7 (df = 13), SSE = 171.1 (df = 11), SSR = 13,182.6 (df = 2),
MSE = 15.55, MSR = 6,591.3, R² = 0.9872, Overall F = 423.8

Var(b) = MSE (xᵀx)⁻¹ = [ 57.81   -2.814  -21.39  ]
                       [ -2.814   0.3155   0.9525 ]
                       [ -21.39   0.9525   8.139  ]
Extrapolation
Extrapolation is more difficult to identify for multiple regression problems. This can be well
illustrated for this example problem. The range of area is 0.09 to 8.3 and depth is 1.7 to 3.2.
It is therefore reasonable to use the above regression model for area = 6 and depth = 3. A
plot of this point is shown below with the raw data. Although our point is within the range of
observed values, it is outside the range of paired values of area and depth.
[Figure: observed data plotted as depth versus area. The point (area = 6, depth = 3) lies within the observed range of each variable individually but outside the cloud of observed (area, depth) pairs.]
REPRESENTATION FOR STATISTICAL ANALYSIS
Overview
Statistical model
Statistical analysis is a very powerful component of multiple regression analysis. The statistical representation of the population is assumed to be

yi = β0 + β1 xi1 + ... + βp xip + εi

where the x values are assumed to be measured with only a small random measurement error εm and εi is the random residual (or error) between predicted and observed values. As previously discussed, xi1 through xip are the independent variables.

Similar to simple regression, there is assumed to be a population of y values for the same independent variable values because of measurement error and other factors not included in the regression model. This random variability is represented by the pdf for ε. The population parameters correspond to a surface that passes through the (local) mean of y for each combination of independent variables. In general, the population parameters are unknown and unobservable.

Our goal is to make inferences on the population parameters from a sample data set used to determine the sample parameters b0 through bp. Since a different sample will result in a different set of sample parameters, the regression parameters are random variables. Pdfs for inferences can be defined using the statistical assumptions for ε given below.
Instead of discussing the pdf for b resulting from sample estimates, this chapter will only
focus on the mean and variance of b. Additional theoretical development to represent the pdf
is given in Chapter 5.
Statistical assumptions for the pdfs of residuals
Statistical inferences for the regression parameters and the regression surface can be made under the following conditions:
* Mean of ε is zero; E(ε) = 0
* Normally distributed ε
* Homoscedasticity; VAR(ε) = constant
* Uncorrelated residuals; COV(εi, εj) = 0 for i ≠ j
* Small random measurement error in x; εm ≈ 0
These concepts were previously discussed in Chapter 3.
Residual plots
Similar to simple regression analysis, the validity of these assumptions is frequently evaluated using residual plots. Residuals are frequently plotted with respect to time (or order) and with respect to the predicted values. The residual plot for the example problem is shown below.

[Figure: normalized residuals (between -2 and 2) versus predicted values (0 to 100) for Example Problem #1.]
Properties of least squares estimator
Mean
Once again, different samples result in different values of b, and therefore the mean of b is of interest. As shown in Appendix 4-A, the mean of b is

E(b) = β

that is, on average the least squares estimate gives the true population value. As in Chapter 2, the least squares estimate of β is unbiased.

Variance-covariance matrix
The variance-covariance matrix is used to define the variability in the different b resulting from sampling. As shown in Appendix 4-A, the variance-covariance matrix is defined as

VAR(b) = (xᵀx)⁻¹ xᵀ E[ε εᵀ] x (xᵀx)⁻¹

where E[ε εᵀ] is the variance-covariance matrix of the residuals. For independent and constant variance, we know that

E[ε εᵀ] = σ² I

and, as shown in Appendix 4-A, the variance-covariance matrix for b is

VAR(b) = σ² (xᵀx)⁻¹

The estimated variance-covariance matrix is computed as

s²(b) = MSE (xᵀx)⁻¹

The standard error for bj is obtained as the square root of the corresponding diagonal element.

Minimum variance unbiased estimator
Let's consider the least squares estimator of the form

b = (xᵀx)⁻¹ xᵀ y

with a variance-covariance matrix defined as

VAR(b) = σ² (xᵀx)⁻¹

Out of the class of all linear unbiased estimators, the least squares estimator has the minimum variance, that is, it results in the smallest uncertainty in the regression parameters obtained by sampling. This important conclusion is obtained from the Gauss-Markov theorem derived in Appendix 4-A. It is valid for independent and constant variance of the residuals.
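A sketch of the estimated variance-covariance matrix s²(b) = MSE (xᵀx)⁻¹ and the parameter standard errors; the function name is illustrative and the design matrix x is assumed to include the leading column of ones, as in the earlier sketches.

```python
import numpy as np

def coefficient_standard_errors(x, y):
    """Least squares b, standard errors, and estimated variance-covariance matrix MSE*(x^T x)^-1."""
    n, m = x.shape
    b = np.linalg.solve(x.T @ x, x.T @ y)
    mse = np.sum((y - x @ b) ** 2) / (n - m)         # unbiased estimate of sigma^2
    cov_b = mse * np.linalg.inv(x.T @ x)             # estimated variance-covariance matrix of b
    se_b = np.sqrt(np.diag(cov_b))                   # standard errors from the diagonal
    return b, se_b, cov_b
```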
Confidence intervals
General format
We will again use the following general format for a confidence interval,

P(L ≤ β ≤ U) = 1 − α

where the upper and lower limits are defined using the t distribution with the appropriate variance and (n − m) degrees of freedom. The above notation implies that a fraction (1 − α) of the intervals defined by L and U contains the population value.

Confidence interval for regression parameters
The lower and upper confidence limits for the parameter βj (j = 0, 1, ..., p) are

L = bj − t(1 − α/2; n − m) sbj   and   U = bj + t(1 − α/2; n − m) sbj

where sbj is the standard error of bj, defined as

sbj = [VAR(bj)]^(1/2)

The VAR(bj) is obtained from the diagonal elements of the MSE (xᵀx)⁻¹ matrix.
Variance for the mean value (regression surface) and predicted value
Similar to simple regression, the mean value for a particular set of independent variables is that obtained from the regression surface. The variance of this mean value varies with its location. Let's assume that we are interested in the regression value for independent variable values xo1, xo2, ..., xop. The regression value is then obtained as

ŷo = xoᵀ b

where xoᵀ = [1, xo1, xo2, ..., xop]. As derived in Appendix 4-A, the variance for the regression value at this point is defined using a sample estimate of σ² as

VAR(ŷo) = MSE xoᵀ (xᵀx)⁻¹ xo

By using the theoretical development in Chapters 2 and 3, the variance for the predicted value is defined as

VAR(yp) = MSE [1 + xoᵀ (xᵀx)⁻¹ xo]

Confidence/prediction intervals for mean and predicted values
The appropriate confidence limits for the mean value are

L = ŷo − t(1 − α/2; n − m) sŷo   and   U = ŷo + t(1 − α/2; n − m) sŷo

where sŷo is the standard error of the mean value, the square root of VAR(ŷo) as previously defined. Likewise, the appropriate prediction limits for yp are

L = ŷo − t(1 − α/2; n − m) syp   and   U = ŷo + t(1 − α/2; n − m) syp

where the standard error of the predicted value, syp, is the square root of VAR(yp).
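A sketch of the confidence and prediction intervals at a chosen point xo, assuming the design matrix x (with leading column of ones) from the earlier sketches; scipy is used only for the t quantile. The commented example call uses the point x1 = 4, x2 = 2 examined in Example Problem #2.

```python
import numpy as np
from scipy.stats import t

def mean_and_prediction_interval(x, y, x_o, alpha=0.05):
    """Confidence interval for the regression surface and prediction interval at x_o."""
    n, m = x.shape
    b = np.linalg.solve(x.T @ x, x.T @ y)
    mse = np.sum((y - x @ b) ** 2) / (n - m)
    leverage = x_o @ np.linalg.inv(x.T @ x) @ x_o        # x_o^T (x^T x)^-1 x_o
    y_o = x_o @ b
    t_val = t.ppf(1 - alpha / 2, n - m)
    half_mean = t_val * np.sqrt(mse * leverage)          # for the mean (regression surface)
    half_pred = t_val * np.sqrt(mse * (1 + leverage))    # for a new predicted value
    return (y_o - half_mean, y_o + half_mean), (y_o - half_pred, y_o + half_pred)

# Example: ci_mean, ci_pred = mean_and_prediction_interval(x, y, np.array([1.0, 4.0, 2.0]))
```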
Hypothesis testing of parameters
The null and alternative hypotheses for the parameters βj (j = 0, 1, ..., p) are usually defined as

H0: βj = kj   and   Ha: βj ≠ kj

which are evaluated with the following t-distribution test statistic

t = (bj − kj) / sbj

where sbj is the standard error of bj as previously defined. The null hypothesis is rejected if

|t| > t(1 − α/2; n − m)

Tests are usually conducted for kj = 0 to remove insignificant terms. Because of the possible dependency among the bj's, only one term should be removed before repeating the regression using the t test. This is explained in greater detail later.
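A minimal sketch of this t test for a single parameter; the function name and default degrees of freedom are illustrative, and the example call uses b1 = 13.151 and s_b1 = 0.562 from the flood problem.

```python
from scipy.stats import t

def t_test_parameter(b_j, se_bj, k_j=0.0, df=11, alpha=0.05):
    """Two-sided t test of H0: beta_j = k_j; returns the statistic and whether H0 is rejected."""
    t_stat = (b_j - k_j) / se_bj
    t_crit = t.ppf(1 - alpha / 2, df)
    return t_stat, abs(t_stat) > t_crit

print(t_test_parameter(13.151, 0.562))       # large t, H0: beta_1 = 0 rejected
```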
F test
Introduction
For regression analysis, we will define F as
4-27
Similar to the results derived in Appendix 3-A, the expected value of MSR is defined as
(Neter and Wasserman,1974, p. 227)
If $1 = $2 = ... = $p = 0, then E(MSR) = MSE and we conclude that
If on the other hand, $1…0, then E(MSR) > MSE and we obtain
Testing the full regression surface
To test whether the regression model, with all terms, is explaining a significant amount of the
variation in y, the following null and alternative hypotheses are used
which is evaluating using F defined as
The null hypothesis is rejected if
that is, with (m-1) degrees of freedom in the numerator and (n-m) degrees of freedom in the
denominator.
Testing one or more slope terms
Let's assume that you wish to test one or more parameters in the model. If the full model is defined as

ŷ = b0 + b1 x1 + ... + bp xp

the reduced model is then defined (with one or more variables removed) as a model with q parameters.

The null and alternative hypotheses are then defined as

H0: the slope parameters removed from the full model equal zero
Ha: not all of the removed slope parameters equal zero

We will use the following notation:
SSE(F) = residual sum of squares of the full model
SSE(R) = residual sum of squares of the reduced model
SSR(F/R) = extra sum of squares = SSE(R) − SSE(F)

The ANOVA table summary is therefore

Source          df             Sum of Squares    Mean Square
Regression      m - 1          SSR(F)            SSR(F)/(m - 1)
  R             q - 1          SSR(R)
  F, given R    m - q          SSR(F/R)          SSR(F/R)/(m - q)
Residual        dfF = n - m    SSE(F)            MSE = SSE(F)/(n - m)
  R             dfR = n - q    SSE(R)
Total           n - 1          SSTO

If SSR(F/R) is small, then the null hypothesis parameters add little to the regression analysis, that is, it supports the hypothesis of insignificant variables. If SSR(F/R) is large, then the null hypothesis parameters add much to the regression analysis, that is, it refutes the hypothesis of insignificant variables.

The appropriate test statistic is

F = [SSR(F/R) / (m − q)] / MSE

The null hypothesis is rejected if

F > F(1 − α; m − q, n − m)

that is, with (m − q = dfR − dfF) degrees of freedom in the numerator and (n − m = dfF) degrees of freedom in the denominator.
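A sketch of this partial F test; the function is illustrative, and the example call uses the skin cream sums of squares computed earlier (full model with x1 and x2, reduced model with x2 only). scipy supplies the F critical value.

```python
from scipy.stats import f

def partial_f_test(sse_full, sse_reduced, df_full, df_reduced, alpha=0.05):
    """F test for the extra sum of squares SSR(F/R) = SSE(R) - SSE(F)."""
    extra_ss = sse_reduced - sse_full
    num_df = df_reduced - df_full                    # m - q
    f_stat = (extra_ss / num_df) / (sse_full / df_full)
    f_crit = f.ppf(1 - alpha, num_df, df_full)
    return f_stat, f_stat > f_crit

# Skin cream example: SSE(x1,x2) = 56.9 with 12 df, SSE(x2) = 31,872 with 13 df
print(partial_f_test(56.9, 31872.0, 12, 13))         # very large F, H0 rejected
```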
EXAMPLE PROBLEM #2
Problem statement
The goal is to make statistical inferences for the multiple regression problem of analyzing the mean annual flood as a function of watershed area and average annual maximum rainfall depth. The previously computed regression model was

ŷ = 1.657 + 13.151 x1 + 0.0112 x2

and the ANOVA table is repeated below.

Source        df    Sum of Squares    Mean Square
Regression     2    13182.6           6591.3
Residual      11    171.1             15.6
Total         13    13353.7

and the previously computed (xᵀx)⁻¹ was

(xᵀx)⁻¹ = [  3.7168  -0.1809  -1.3754 ]
          [ -0.1809   0.0203   0.0612 ]
          [ -1.3754   0.0612   0.5233 ]
Test significance of regression
The significance of the full regression model will be tested at the 5% level. The appropriate null and alternative hypotheses are

H0: β1 = β2 = 0   and   Ha: not both β1 and β2 equal zero

which are evaluated using the test statistic

F = MSR / MSE = 6591.3 / 15.6 ≈ 424

Since F(0.95; 2, 11) = 3.98, H0 is rejected. We conclude that the regression model explains a significant amount of the variation in y.
Confidence interval for β1
Standard error of b1
Let's first determine the standard error of b1. The variance-covariance matrix is defined as

s²(b) = MSE (xᵀx)⁻¹

Since s² = MSE = 15.554 and the corresponding diagonal element of (xᵀx)⁻¹ is 0.0203, we obtain

VAR(b1) = 15.554 (0.0203) = 0.315

and the standard error of b1 is sb1 = (0.315)^(1/2) = 0.562.

Confidence interval
The 95% confidence interval for β1 can now be easily computed as

13.151 − 2.201 (0.562) ≤ β1 ≤ 13.151 + 2.201 (0.562)

or

11.9 ≤ β1 ≤ 14.4
Confidence interval for regression surface
Point of interest
The confidence interval for the regression surface (mean value) is a function of its location. We will compute the confidence interval for x1 = 4 and x2 = 2, or xoᵀ = [1, 4, 2]. The regression model value for that point is

ŷo = 1.657 + 13.151 (4) + 0.0112 (2) ≈ 54.3

Variance for regression surface
The variance for the regression value has been previously defined as

VAR(ŷo) = MSE xoᵀ (xᵀx)⁻¹ xo

from which we obtain the value for this problem by substituting MSE = 15.554 and the (xᵀx)⁻¹ matrix given above; the standard error of ŷo is the square root of this variance.

Confidence interval
The 95% confidence interval can now be computed as

ŷo − 2.201 sŷo ≤ E(yo) ≤ ŷo + 2.201 sŷo
Test slope parameters
Test β1 = 0
Although we have previously shown by the F test that at least one of the slope coefficients is significantly different than zero, let's evaluate whether each slope term is significantly different than zero at the 5% level. The null and alternative hypotheses for β1 are

H0: β1 = 0   and   Ha: β1 ≠ 0

which are evaluated using the test statistic

t = b1 / sb1 = 13.151 / 0.562 = 23.4

where the value for sb1 was previously determined. Since t(11, 0.975) = 2.201, we reject the null hypothesis and conclude that area (x1) explains a significant amount of the variation in y.

Test β2 = 0
Let's now evaluate whether β2 is significantly different than zero at the 5% level. The appropriate null and alternative hypotheses are

H0: β2 = 0   and   Ha: β2 ≠ 0

which are evaluated using the test statistic t = b2 / sb2. The variance-covariance matrix, s²(b) = MSE (xᵀx)⁻¹, can be used to obtain

VAR(b2) = 15.554 (0.5233) = 8.14

and therefore sb2 = 2.85. The test statistic can now be computed as

t = 0.0112 / 2.85 ≈ 0.004

Since t(11, 0.975) = 2.201, we do not reject the null hypothesis. The slope parameter is not significantly different than zero at the 5% level, that is, it is not "explaining" a significant amount of the variation in y (with area in the analysis).
Final model
Since depth was not significant in explaining the variation in y, it likely should be removed from the model. A regression analysis would then be conducted using just area, which gives

ŷ ≈ 1.69 + 13.15 x1

Note that the intercept changed only slightly. The slope term remains very close to the original value because of the very low contribution of depth to the sum of squares in the first model. The R² for this regression model decreased negligibly, the overall F value increased, the residual mean square decreased, and the standard errors for both b0 and b1 decreased. The interpretation of these trends is discussed in greater detail later.
MULTICOLLINEARITY
Introduction
General
Multicollinearity exists when there are correlations (or linear interrelationships) among the independent variables. This is sometimes referred to as intercorrelation.

The degree of multicollinearity is usually assessed using correlation coefficients (for more discussion see Judge et al., 1987, pp. 868-871). As discussed in Chapter 2, correlation is a measure of the strength of the linear relationship between two variables (defined for random variables). The correlation coefficient between two independent variables is computed as

r = Σ (xi1 − x̄1)(xi2 − x̄2) / [Σ (xi1 − x̄1)² Σ (xi2 − x̄2)²]^(1/2)

As discussed in this chapter, problems that result from multicollinearity include:
* Difficulty in identifying the separate effects of variables,
* Large variances for the pdfs of the b's, making it more difficult to conclude that bj ≠ 0, and
* Odd trends between independent and dependent variables.
Nearly perfect multicollinearity
Perfect correlation
For perfectly correlated variables, r = ±1. This corresponds to an exact linear relationship between two or more variables, for example,

x2 = a + c x1

that is, an independent variable can be written as a linear combination of the other independent variables.

As discussed in Chapter 1, a matrix that is composed of linear combinations of its columns is singular and a unique solution does not exist. For regression analysis, the matrix xᵀx is singular for exact multicollinearity and (xᵀx)⁻¹ cannot be computed. For nearly perfect correlation, special algorithms must be used to account for numerical instabilities in the solution technique.
Example illustration for highly correlated variables
To illustrate the impact of correlated variables, twenty values of y were obtained as a linear function of x1 plus a normally distributed residual. A second variable, x2, to be used in the regression analysis was obtained as a linear function of x1 plus a uniformly distributed residual set so that x1 and x2 are highly correlated (r = 0.99). The value of y was determined independently of x2. The data obtained using these two equations are shown below.
 x1      x2       y           x1      x2       y
 1.03    7.48    40.83       10.92   56.09   110.07
 1.92   11.88     0.99       12.06   62.48    91.66
 3.03   17.02    47.15       12.93   66.54   110.26
 4.06   21.97    55.88       13.99   71.59   130.86
 5.04   27.41    37.81       14.91   76.66   171.55
 5.94   32.16    80.64       16.04   82.03   190.78
 7.08   37.72    63.64       16.93   86.35   196.46
 8.04   42.71   108.15       18.06   92.14   215.75
 9.08   47.13   109.91       19.03   97.48   184.15
 9.94   51.35    77.76       20.05  102.26   210.77
The regression equation for x2 as a function of x1 is

x2 = a + c x1 = 2.148 + 4.986 x1

and therefore a = 2.148 and c = 4.986.

The regression equation for the first model, using x1 only, is

ŷ = b0ˢ + b1ˢ x1 = 3.748 + 10.282 x1

and therefore b0ˢ = 3.748 and b1ˢ = 10.282 (where the superscript s is used to indicate a single independent variable). The standard error of b1ˢ is 0.813 and b1ˢ is significantly different than zero at the 1% level (t = 12.6). The R² = 0.9.

The regression equation for the second model, using both x1 and x2, is

ŷ = b0 + b1 x1 + b2 x2 = 2.826 + 8.142 x1 + 0.429 x2

where b0 = 2.826, b1 = 8.142, and b2 = 0.429. Since we obtained y from an algorithm, we know that y is, in fact, not a function of x2. The standard error of b1 is now 82.5 and b1 is not significantly different than zero at the 10% level (t = 0.1). The R² = 0.9.
Insight into Results
Insight into the role of correlated independent variables can be obtained by considering nearly perfectly correlated variables. Let's consider two nearly perfectly correlated independent variables where

x2 ≈ a + c x1

and regression models of

ŷ = b0ˢ + b1ˢ x1   and   ŷ = b0 + b1 x1 + b2 x2

Since x1 and x2 are highly correlated, the second regression model can be well approximated by

ŷ ≈ (b0 + a b2) + (b1 + c b2) x1

We conclude that the intercept and slope of the one-variable model are related to those of the two-variable model as

b0ˢ ≈ b0 + a b2   and   b1ˢ ≈ b1 + c b2

Since we are typically interested in the change in y for a change in x1, the above relationship for the slope term is particularly important. When a collinear variable is added to the regression analysis (i.e., the second model), part of the change in y with respect to x1 is shifted to the second variable. For example, if c and b2 are both positive, b1 must be smaller than b1ˢ, indicating a smaller change in y for the same change in x1. The change in the slope value implies that a range of parameters can be used to represent the observed dependent values. This greater range is reflected in a larger variance for b1.

For our example data set, the above relationships are shown to hold, that is,

b1ˢ = 10.282 ≈ b1 + c b2 = 8.142 + 4.986 (0.429) = 10.28

and

b0ˢ = 3.748 ≈ b0 + a b2 = 2.826 + 2.148 (0.429) = 3.75
Theoretical Considerations
Multicollinearity effects on slope parameters
Let's consider the following regression model with two independent variables:

ŷi − ȳ = b1 (xi1 − x̄1) + b2 (xi2 − x̄2)

where terms are as previously defined. From the "no-intercept" formulation previously given, we know that the elements of the matrix x′ᵀy′ are the sums Σ x′i1 y′i and Σ x′i2 y′i, and that the x′ᵀx′ matrix contains the sums of squares and cross products Σ x′i1², Σ x′i1 x′i2, and Σ x′i2². The determinant of x′ᵀx′ is then defined using the rules given in Chapter 1, and by using the rules for determining an inverse matrix we obtain (x′ᵀx′)⁻¹.

We can now evaluate b1 and b2 using the results for (x′ᵀx′)⁻¹ and x′ᵀy′. Although relationships for both b1 and b2 can easily be determined, we will focus on b1. By using matrix multiplication and rearranging, the first term in the numerator is the alternative solution form for b1 for simple regression shown in Appendix 3-A. We can then further evaluate b1 as a function of the correlation rx1,x2 between x1 and x2. For rx1,x2 = 0, the value of b1 with two parameters is identical to that obtained from simple regression. However, if there is correlation between x1 and x2, the value is clearly different.

Evaluation of variance
We can also easily evaluate VAR(b) using (x′ᵀx′)⁻¹. We will focus on the variance of b1. From the above matrix, it is simply defined as

VAR(b1) = σ² / [Σ (xi1 − x̄1)² (1 − rx1,x2²)]

where the definition of the variance of b1 in simple regression is given by the first factor, σ² / Σ (xi1 − x̄1)². For uncorrelated x1 and x2 values, the variance of b1 is the same when a second variable is added to the regression model if the mean square of the residuals (σ²) is constant. Frequently the mean square of the residuals is smaller with more than one independent variable. The above relationship shows that with the addition of another correlated variable the variance increases, and it approaches infinity as the correlation between the variables approaches one.

The above result can be generalized for more than two independent variables. The 1/(1 − rx1,x2²) term is called the variance inflation factor.
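A sketch of the usual multi-variable generalization of this factor, VIFj = 1/(1 − Rj²), where Rj² comes from regressing xj on the remaining independent variables; the function name is illustrative.

```python
import numpy as np

def variance_inflation_factors(x_raw):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is from regressing column j on the other columns."""
    n, p = x_raw.shape
    vifs = []
    for j in range(p):
        y_j = x_raw[:, j]
        others = np.column_stack([np.ones(n), np.delete(x_raw, j, axis=1)])
        b = np.linalg.lstsq(others, y_j, rcond=None)[0]
        resid = y_j - others @ b
        r2_j = 1.0 - np.sum(resid ** 2) / np.sum((y_j - y_j.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2_j))
    return vifs
```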
Example problem
Data
The data below were reported by Neter et al. (1983) to examine the relationship of body fat
(y) to triceps skinfold thickness (x1), thigh circumference (x2) and midarm circumference (x3).
 x1     x2     x3      y         x1     x2     x3      y
19.5   43.1   29.1   11.9       31.1   56.6   30.0   25.4
24.7   49.8   28.2   22.8       30.4   56.7   28.3   27.2
30.7   51.9   37.0   18.7       18.7   46.5   23.0   11.7
29.8   54.3   31.1   20.1       19.7   44.2   28.6   17.8
19.1   42.2   30.9   12.9       14.6   42.7   21.3   12.8
25.6   53.9   23.7   21.7       29.5   54.4   30.1   23.9
31.4   58.5   27.6   27.1       27.7   55.3   25.7   22.6
27.9   52.1   30.6   25.4       30.2   58.6   24.6   25.4
22.1   49.9   23.2   21.3       22.7   48.2   27.1   14.8
25.5   53.5   24.8   19.3       25.2   51.0   27.5   21.1
The correlation coefficients among independent variables are:
x1 and x2: r = 0.92 (highly correlated)
x1 and x3: r = 0.46 (correlated)
x2 and x3: r = 0.08 (low correlation)
ryx1 = 0.84, ryx2 = 0.88 and ryx3 = 0.14
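These correlations can be reproduced with a short numpy sketch on the body fat data above; np.corrcoef returns the full correlation matrix.

```python
import numpy as np

x1 = np.array([19.5, 24.7, 30.7, 29.8, 19.1, 25.6, 31.4, 27.9, 22.1, 25.5,
               31.1, 30.4, 18.7, 19.7, 14.6, 29.5, 27.7, 30.2, 22.7, 25.2])
x2 = np.array([43.1, 49.8, 51.9, 54.3, 42.2, 53.9, 58.5, 52.1, 49.9, 53.5,
               56.6, 56.7, 46.5, 44.2, 42.7, 54.4, 55.3, 58.6, 48.2, 51.0])
x3 = np.array([29.1, 28.2, 37.0, 31.1, 30.9, 23.7, 27.6, 30.6, 23.2, 24.8,
               30.0, 28.3, 23.0, 28.6, 21.3, 30.1, 25.7, 24.6, 27.1, 27.5])
y  = np.array([11.9, 22.8, 18.7, 20.1, 12.9, 21.7, 27.1, 25.4, 21.3, 19.3,
               25.4, 27.2, 11.7, 17.8, 12.8, 23.9, 22.6, 25.4, 14.8, 21.1])

r = np.corrcoef(np.vstack([x1, x2, x3, y]))    # 4 x 4 correlation matrix
print(np.round(r, 2))                          # r(x1,x2) ~ 0.92, r(x1,x3) ~ 0.46, r(x2,x3) ~ 0.08
```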
The corresponding summation terms for this data set can also be computed for use in the regression calculations.
Regression of y on x1 (1st model)
The regression parameters, corresponding standard errors, ANOVA table, coefficient of
determination and overall F values are shown below
Variable     Parameter   Standard Error      t
Intercept     -1.496         3.319         -0.45
x1             0.857         0.129          6.64

Source        df    Sum of Squares    Mean Square
Regression     1    352.3             352.3
Residual      18    143.1             8.0
Total         19    495.4

R² = 0.71     Overall F = 44.3

Comments:
* 71% of the total sum of squares explained by regression
* b1 is significantly different than zero (1% level)
* Overall F is significantly different than zero (1% level)

Regression of y on x2 (2nd model)
The regression parameters, corresponding standard errors, ANOVA table, coefficient of
determination and overall F values are shown below
Variable     Parameter   Standard Error      t
Intercept    -23.6           5.66          -4.17
x2             0.86           0.11           7.82

Source        df    Sum of Squares    Mean Square
Regression     1    382.0             382.0
Residual      18    113.4             6.3
Total         19    495.4

R² = 0.77     Overall F = 60.6
Comments:
* 77% of the total sum of squares “explained” by regression
* bo and b1 are significantly different than zero (1% level)
* Overall F is significantly different than zero (1% level)
* Slightly improved model compared to the previous model
Regression of y on x1 and x2 (3rd model)
Let's now consider the impact of adding a highly correlated variable (x2) to the 1st model (equivalent to adding the highly correlated variable x1 to the 2nd model). Theoretically, we know from the previous sections that the "new" slope term for x1 is shifted by the contribution of the added variable (b1 ≈ b1ˢ − c b2) and that the "new" standard error is increased by the variance inflation factor. The numerical results are shown below.
Variable     Parameter   Standard Error      t
Intercept    -19.2           8.36          -2.30
x1             0.22           0.30           0.73
x2             0.66           0.29           2.28

Source        df    Sum of Squares    Mean Square
Regression     2    385.4             192.7
Residual      17    110.0             6.5
Total         19    495.4

R² = 0.78     Overall F = 29.8
In comparison to the 1st model:
* R² increased from 71% to 78%
* MSE decreased from 8.0 to 6.5
* Standard error of b1 increased from 0.129 to 0.30
* b1 decreased from 0.86 to 0.22
* b1 is not significantly different than zero at the 1% level
* Overall F decreased from 44.3 to 29.8 (still significant at the 1% level)

We therefore conclude that, in terms of the goodness-of-fit parameters R² and MSE, the addition of the second variable improved the fit. However, we have less confidence in the relationship between y and x1.
In comparison to the 2nd model:
* R² increased slightly, from 77% to 78%
* MSE increased slightly, from 6.4 to 6.5
* Standard error of b2 increased from 0.11 to 0.29
* b2 decreased from 0.86 to 0.66
* b2 is not significantly different than zero at the 1% level
* Overall F decreased from 60.6 to 29.8 (still significant at the 1% level)

Given these results, it is reasonable to conclude that the third model is inferior to the second model.
Regression of y on x2 and x3 (4th model)
Let’s consider the impact of adding a variable (x3) to the 2nd model that has a low correlation.
The numerical results are shown below.
Variable     Parameter   Standard Error      t
Intercept    -26.0           7.0           -3.71
x2             0.85           0.11           7.73
x3             0.096          0.16           0.60

Source        df    Sum of Squares    Mean Square
Regression     2    384.3             192.2
Residual      17    111.1             6.5
Total         19    495.4

R² = 0.78     Overall F = 29.4
In comparison to the 2nd model:
* R² increased slightly, from 77% to 78%
* MSE increased slightly, from 6.4 to 6.5
* Standard error of b2 was essentially the same (0.11)
* b2 was essentially the same (0.86)
* b2 is significantly different than zero at the 1% level
* Overall F decreased from 60.6 to 29.4 (still significant at the 1% level)

For independent variables that are uncorrelated, the regression parameters are the same with or without the other variables in the model. The changes in the standard errors of the regression parameters are in direct proportion to the possible decrease in MSE with more than one independent variable. The t test can then be conducted for each parameter without redoing the regression analysis. The use of x3 added very little to the representation of the data.
Regression of y on x1, x2, and x3 (final model)
Let’s consider the results obtained using all of the independent variables.
Variable     Parameter   Standard Error      t
Intercept    117.1          99.8            1.17
x1             4.33          3.02            1.43
x2            -2.86          2.58           -1.11
x3            -2.19          1.59           -1.38

Source        df    Sum of Squares    Mean Square
Regression     3    397.0             132.3
Residual      16    98.4              6.2
Total         19    495.4

R² = 0.80     Overall F = 21.5
In comparison to the 2nd model:
* R² increased slightly, from 77% to 80%
* MSE decreased slightly, from 6.4 to 6.2
* Standard error of b2 increased from 0.11 to 2.58 (about twenty times)
* Sign of b2 is negative
* No slope parameters are significantly different than zero at the 1% level
* Overall F decreased from 60.6 to 21.5 (but still significant at the 1% level)

The final model predicts a decrease in body fat with x2, which is the opposite of the trend obtained with the 2nd model and is contrary to physical insight. Using these results could lead to serious misinterpretation of the study and major errors in possible extrapolation.

To summarize, multicollinearity increases the variances of the slope parameters and makes physical interpretation more difficult.
SELECTION OF REGRESSION MODEL
Overview of concepts
Review uses of regression analysis
Description: The primary goal is identify significant variables influencing the response of a
system, including the relative importance of these variables as indicated by the slope term.
Sometimes the system is so poorly understood that there is little physical insight.
Prediction: The primary goal is to develop an equation to predict the response of the system
for a different set of values for the independent variables. Researchers should rely on their
physical insight to select variables.
It is possible that a variable removed (or retained) for a descriptive study should be retained (or removed) for a predictive study.

Why reduce number of variables?
For descriptive studies, the primary goal is to determine the most significant variables. Since
the standard error increases with correlated variables, insignificant variables should be
removed from the analysis.
For predictive studies, insignificant variables do not improve the fit of the model to the observed data while:
* Making the model more difficult and more expensive to use,
* Making the model more difficult to understand and possibly losing physical insight, which is especially a concern if you are unsure of extrapolation,
* Increasing the uncertainty (variance) of parameters and therefore reducing confidence in predicted trends,
* Increasing the risk of extrapolation by reducing the range of possible values, and
* Increasing the numerical roundoff errors in computations (at least for some numerical methods).
Additional comments for predictive model
Statistics should be used to supplement and not replace the physical insight a scientist or
engineer brings to the problem.
Do not use a model that violates your physical knowledge or understanding of the system.
This is especially important if there is a possibility of extrapolation in a predictive model.
Avoid using "canned" procedures, such as stepwise regression, to obtain parameters. This
type of approach removes variables without allowing the scientist to provide physical insight
into the problem.
Sometimes the dominant physical process cannot easily be identified. Variables are then selected for the regression that the modeler thinks may be highly correlated with that process.

A test data set is a good idea, especially when the model is used with a different or poorly defined population.

A poor fit with the regression models suggests that (1) there is not a strong linear relationship and/or (2) important variables have not been included in the study.
Guidelines for removing variables
Examine correlation matrix
As previously discussed, independent variables that are highly correlated should be avoided
because they (1) increase the standard errors and (2) can result in false trends between
dependent and independent variables. The latter issue is particularly dangerous if there is
possible extrapolation.
Significant variables
For descriptive studies, the primary goal is to determine these variables.
For predictive studies, try to retain only those variables that make a significant contribution to
the regression. Based on your physical insight, you may decide that a 10% level of
significance is acceptable for some variables, whereas, 1% level might be used for other
variables.
Multiple coefficient of determination
The multiple coefficient of determination is

R² = SSR / SSTO

A large R² indicates that a large fraction of the variation about the mean is "explained" by the regression equation. The disadvantage is that R² always increases when more variables are included in the model, even if they have no physical significance. The best model using only R² is the one with the most parameters; therefore, R² should not be the sole criterion in selecting your model.
Residual mean square
The residual mean square is an estimate of the variance of the observed values around the predicted surface. It has been previously defined as

MSE = SSE / (n − m)

A small value implies a small variance around the predicted surface and hence greater confidence in the predicted values. Let's consider the impact of adding additional parameters (m increases): SSE decreases and n − m decreases, and therefore additional parameters can increase s² if (n − m) is reduced proportionally more than SSE.
Overall F statistic
The overall F statistic has been previously defined as

F = MSR / MSE

A large F means that at least one of the slope terms is significantly different than zero. It is directly related to the significance of the regression. As shown in the previous section, the overall F value can increase with the removal of parameters.
Example problem
Problem definition
The following example data set given by Haan (1979) will be used to illustrate procedures for
selecting independent variables for a multiple regression model. The goal is to predict runoff
as a function of the following measurable watershed and rainfall characteristics:

y  = Runoff = Mean annual runoff (inches)
x1 = Prec   = Mean annual precipitation (inches)
x2 = Area   = Watershed area (sq miles)
x3 = Slope  = Average watershed slope (percent)
x4 = Len    = Axial length (miles)
x5 = Perim  = Watershed perimeter (miles)
x6 = Diam   = Diameter of the largest circle possible within the basin (miles)
x7 = Shape  = Diam/(Diameter of largest circle enclosing the basin)
x8 = Strm   = Stream frequency = # streams / Area (1/sq miles)
x9 = Relief = (Total relief)/(Largest dimension)
Observed results for 13 different watersheds are shown below.

Runoff  Prec   Area  Slope  Len   Perim  Diam  Shape  Strm  Relief
y       x1     x2    x3     x4    x5     x6    x7     x8    x9
17.38   44.37  2.21  50     2.38  7.93   0.91  0.38   1.36  332
14.62   44.09  2.53  7      2.55  7.65   1.23  0.48   2.37  55
15.48   41.25  5.63  19     3.11  11.61  2.11  0.57   2.31  77
14.72   45.50  1.55  6      1.84  5.31   0.94  0.49   3.87  68
18.37   46.09  5.15  16     4.14  11.35  1.63  0.39   3.30  68
17.01   49.12  2.14  26     1.92  5.89   1.41  0.71   1.87  230
18.20   44.03  5.34  7      4.73  12.59  1.30  0.27   0.94  44
18.95   48.71  7.47  11     4.24  12.33  2.35  0.52   1.20  72
13.94   44.43  2.10  5      2.00  6.81   1.19  0.53   4.76  40
18.64   47.72  3.89  18     2.10  9.87   1.65  0.60   3.08  115
17.25   48.38  0.67  21     1.15  3.93   0.62  0.48   2.99  352
17.48   49.00  0.85  23     1.27  3.79   0.83  0.61   3.53  300
13.16   47.03  1.72  5      1.93  5.19   0.99  0.52   2.33  39

Correlation matrix
The first step is to examine the correlations among the independent variables and between the
dependent and independent variables, as shown below. Length, perimeter, and diameter are
highly correlated with area. Slope-and-relief and length-and-perimeter are also highly
correlated. You would need to consider carefully whether you want to include these highly
correlated variables in your regression model. The data also suggest that runoff increases
with all of the independent variables except for Shape and Strm.
        Runoff  Prec   Area   Slope  Len    Perim  Diam   Shape  Strm   Relief
Runoff   1.00
Prec     0.39   1.00
Area     0.47  -0.25   1.00
Slope    0.41   0.08  -0.17   1.00
Len      0.42  -0.34   0.90  -0.21   1.00
Perim    0.46  -0.41   0.96  -0.10   0.92   1.00
Diam     0.33  -0.15   0.91  -0.16   0.67   0.81   1.00
Shape   -0.15   0.45  -0.25   0.05  -0.58  -0.41   0.15   1.00
Strm    -0.40   0.04  -0.48  -0.30  -0.53  -0.48  -0.32   0.29   1.00
Relief   0.35   0.42  -0.52   0.80  -0.54  -0.51  -0.50   0.17  -0.08   1.00
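This correlation matrix can be reproduced with a few lines of Python (a minimal sketch; it assumes each variable has been entered as a 13-element array in the order of the data table, and only the first three variables are typed out here):

import numpy as np

runoff = np.array([17.38, 14.62, 15.48, 14.72, 18.37, 17.01, 18.20,
                   18.95, 13.94, 18.64, 17.25, 17.48, 13.16])
prec   = np.array([44.37, 44.09, 41.25, 45.50, 46.09, 49.12, 44.03,
                   48.71, 44.43, 47.72, 48.38, 49.00, 47.03])
area   = np.array([2.21, 2.53, 5.63, 1.55, 5.15, 2.14, 5.34,
                   7.47, 2.10, 3.89, 0.67, 0.85, 1.72])

data = np.vstack([runoff, prec, area])     # one row per variable; add the rest the same way
print(np.round(np.corrcoef(data), 2))      # symmetric correlation matrix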
Results for full model
Let’s consider the results obtained using all of the independent variables.
Variable    Parameter  Standard Error     t     Probability
Intercept   -14.74     6.779            -2.17   0.118
Prec          0.45     0.148             3.05   0.055
Area          0.19     1.424             0.13   0.903
Slope        -0.02     0.048            -0.38   0.731
Len           0.29     0.777             0.37   0.735
Perim         0.99     0.366             2.69   0.074
Diam         -3.05     4.796            -0.64   0.569
Shape         5.67     9.029             0.63   0.574
Strm          0.37     0.267             1.40   0.255
Relief        0.01     0.006             2.14   0.121

Source       df   Sum of Squares   Mean Square
Regression    9   43.5             4.83
Residual      3    1.4             0.47
Total        12   44.9

R2 = 0.97    Overall F = 10.4
A plot of the standardized residuals is shown below. Although there appears to be a trend of
decreasing variance with the predicted runoff, there are not enough data to be conclusive. We
should nonetheless be wary about the statistical conclusions drawn with this regression model.

[Figure: standardized residuals versus predicted runoff for the full model]
Let's first determine if one or more of the slope parameters are significantly different than
zero at the 5% level. From the F table for nine and three degrees of freedom in the numerator
and denominator, respectively, we obtain F9,3,0.95 = 8.81. Since F = 10.4, we conclude that:
At least one slope parameter is significantly different than zero. The model is
better than using the mean of the data.
Let's now evaluate which (if any) of the individual slope parameters are significantly
different than zero at the 5% level. From the t table, we obtain t3,0.975 = 3.18. We therefore
conclude that:
No individual slope parameters are significant at the 5% level (assuming that the other
parameters are in the model).
Three most significant variables
Let’s consider regression using the three most significant variables of the full model, that is,
Prec, Perim, and Relief. The results are shown below.
Variable    Parameter  Standard Error     t     Probability
Intercept   -9.64      6.441            -2.17   0.058
Prec         0.43      0.093             4.62   0.001
Perim        0.62      0.075             8.24   0.000
Relief       0.01      0.002             5.19   0.001
Source       df   Sum of Squares   Mean Square
Regression    3   40.6             13.56
Residual      9    4.3              0.47
Total        12   44.9

R2 = 0.90    Overall F = 28.9
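As a check on the tabulated values, the following minimal Python sketch reproduces this three-variable fit with the matrix solution b = (xTx)-1xTy used throughout the chapter; the arrays simply repeat the watershed data listed earlier, and the printed values can be compared with the table above (small differences are rounding):

import numpy as np

y      = np.array([17.38, 14.62, 15.48, 14.72, 18.37, 17.01, 18.20,
                   18.95, 13.94, 18.64, 17.25, 17.48, 13.16])
prec   = np.array([44.37, 44.09, 41.25, 45.50, 46.09, 49.12, 44.03,
                   48.71, 44.43, 47.72, 48.38, 49.00, 47.03])
perim  = np.array([7.93, 7.65, 11.61, 5.31, 11.35, 5.89, 12.59,
                   12.33, 6.81, 9.87, 3.93, 3.79, 5.19])
relief = np.array([332.0, 55, 77, 68, 68, 230, 44, 72, 40, 115, 352, 300, 39])

x = np.column_stack([np.ones_like(y), prec, perim, relief])
n, m = x.shape
xtx_inv = np.linalg.inv(x.T @ x)
b = xtx_inv @ x.T @ y                         # regression coefficients

e = y - x @ b                                 # residuals
sse = e @ e
ssto = np.sum((y - y.mean()) ** 2)
ssr = ssto - sse
mse = sse / (n - m)
print("b         =", np.round(b, 3))
print("R^2       =", round(ssr / ssto, 3))
print("overall F =", round((ssr / (m - 1)) / mse, 1))
print("SE(b)     =", np.round(np.sqrt(mse * np.diag(xtx_inv)), 3))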
A plot of the standardized residuals is shown below. Although there is not enough data to be
conclusive, there is insufficient evidence to reject the standard statistical assumptions.
[Figure: standardized residuals versus predicted runoff for the three-variable model]
Let's evaluate whether the removed terms have significance; that is, we are interested in the
following null hypothesis

Ho: the slope parameters for Area, Slope, Len, Diam, Shape, and Strm all equal zero

and alternative hypothesis

Ha: at least one of these slope parameters is not equal to zero

Let's review the test statistic for removing one or more slope parameters,

F = [ (SSE(reduced) - SSE(full)) / (df(reduced) - df(full)) ] / MSE(full)

which from the above ANOVA table and that given for the full model gives

F = [ (4.3 - 1.4) / (9 - 3) ] / 0.47, or approximately 1.0

Since F6,3,0.95 = 8.94, we do not reject Ho and conclude that the removed slope parameters are not
significantly different than zero at the 5% level.
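The same test can be scripted directly from the two ANOVA tables (a minimal sketch; the SSE and df values are those reported above):

# Extra-sum-of-squares F test: full 9-variable model versus reduced 3-variable model.
sse_full, df_full = 1.4, 3        # residual SS and df, full model
sse_red,  df_red  = 4.3, 9        # residual SS and df, reduced model

num_df = df_red - df_full                                # number of slopes removed (6)
f = ((sse_red - sse_full) / num_df) / (sse_full / df_full)
print(f"F = {f:.2f} on ({num_df}, {df_full}) df")        # compare with F(6,3,0.95) = 8.94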
This appears to be a good model because:
* Trends for the independent variables correspond to hydrologic principles,
* The reduction in R2, compared to the full model, is relatively small,
* The MSE and F statistic are improved relative to the full model,
* All variables in the regression are significant,
* All variables excluded are statistically insignificant,
* It is easier to use than the full model, and
* The residual plot supports the statistical assumptions.
Consider area in model
Since area is more readily available for watersheds than perimeter, and since it is highly
correlated to perimeter, it is worthwhile to consider replacing perimeter with area in the
predictive model. The results are shown below.
Variable    Parameter  Standard Error     t     Probability
Intercept    0.45      6.200             0.07   0.943
Prec         0.26      0.136             1.91   0.089
Area         0.82      0.166             4.97   0.001
Relief       0.01      0.003             3.49   0.007

Source       df   Sum of Squares   Mean Square
Regression    3   35.2             11.70
Residual      9    9.7              1.08
Total        12   44.9

R2 = 0.78    Overall F = 10.8
A plot of the residuals is shown below.
[Figure: standardized residuals versus predicted runoff for the model with area]
Although there may be a trend with the variance, the data lack enough points to be conclusive,
and therefore the standard statistical assumptions will not be rejected.
The regression model using area instead of perimeter has poorer goodness-of-fit statistics.
For example,
* R2 decreases from 90% to 78% and
* MSE increases from 0.47 to 1.08
The statistical representation is also not as desirable. For example,
* Overall F decreased from 28 to 11
* Standard error for precipitation increased from 0.093 to 0.136
* Standard error for relief increased from 0.002 to 0.003
Based on these results, I would not select this model unless (1) obtaining the perimeter is a
serious limitation in using the model or (2) a compelling physical argument can be made that
runoff depth is more closely related to area than perimeter.
Summary of results
A summary of the different combinations of independent variables is given in the following
table. The MSE, R2, and F statistic for each model are reported. Variables are evaluated for
significance at the 10% level.

Model   MSE    R2     F
1       0.48   0.97   10.1
2       0.37   0.97   14.5
3       0.35   0.95   20.4
4       0.36   0.94   23.8
5       0.47   0.90   28.6
6       1.10   0.78   10.9
7       1.29   0.74    8.6
8       1.12   0.78   10.4
9       0.97   0.81   12.5
10      1.41   0.72    7.6
11      1.70   0.62    8.2
12      1.44   0.68   10.6
13      3.47   0.15    1.9
14      3.20   0.22    3.0
15      3.60   0.12    1.5

[Each model's included variables (Prec, Area, Slope, Len, Perim, Diam, Shape, Strm, Relief) are flagged in the full table: # indicates the variable was included in the model and was significant; x indicates the variable was included in the model and was insignificant.]
The above table supports the use of Model 5, in which all variables are statistically significant.
Models 6 through 15 show a noticeable increase in MSE and reductions in R2 and F.
Stepwise regression analysis
Introduction
Most statistical software has routines to automatically select variables based on set criteria.
You should use this option carefully because it removes your insight into the system when
eliminating variables.
An example of the results of stepwise regression is shown below using the F statistics. These
values were obtained from the MICROSTAT software.
1st regression
The first step is to compute all simple (one-parameter) regressions, one for each of the p
candidate variables. For each equation, compute F as

F = MSR(xj) / MSE(xj)

The parameter with the largest F value is selected as the first parameter. If this value is not
greater than the critical value of F at the chosen significance level, the regression is stopped.
For the multiple regression analysis using the Haan data set, we obtain for each parameter

Variable    bj      F
Prec        0.31    1.93
Area        0.43    3.13
Slope       0.06    2.21
Len         0.71    2.37
Perim       0.28    3.03
Diam        1.28    1.39
Shape      -2.63    0.26
Strm       -0.69    2.07
Relief      0.01    1.49

Since Area had the largest F, the model after the first regression is

yhat = bo + 0.43 (Area)
2nd regression
The second step is to compute all regression equation with two independent variables, where
Area is one of the pair. For each equation, evaluate F defined as
4-52
The parameter with the largest F value is selected as the second parameter. If this value is
not greater than the significant level of F, the regression is stopped.
For the multiple regression analysis using the Haan data set, we obtain for each parameter
(# indicates that Area is already in the model):

Variable    bj      F
Prec        2.30    5.29
Area        #       #
Slope       2.16    4.66
Len         0.18    0.03
Perim       0.83    0.68
Diam        0.13    0.02
Shape       0.13    0.02
Strm        0.72    0.52
Relief      3.50   15.60

Since Relief had the largest F, the second regression model is then obtained using Area and
Relief to obtain

yhat = bo + b1 (Area) + b2 (Relief)

Any existing parameter (Area in this example) is evaluated using the following F test:

F = SSR(Area | Relief) / MSE(Area, Relief)

If variables are correlated, it is possible that the addition of another variable may change the
need for having the original variable in the analysis. This step guarantees that only
significant variables remain in the regression. For the Haan example, Area is still a
significant variable.
Stepwise model
The above approach is continued until no significant F terms remain. For example, all
regression models are obtained with three independent variables, where Area and Relief are
two of the variables. The third variable then corresponds to the one that gives the largest F.
The significance of Area and Relief is re-evaluated with this new parameter. The final model
obtained for Haan's example includes Diam with a negative coefficient.
Based on physical arguments, one would expect runoff to increase with increasing watershed
diameter. This model violates that expectation. Therefore, it should probably not be used.
The above technique assumes that there exists a single best set of predictor variables. There
often is no unique "best" set. If variables are highly correlated, stepwise regression may
arrive at an unreasonable "best" set of parameters; that is, the more descriptive variables of
the process may not be included in the model.
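The forward-selection logic described above can be sketched as follows (an illustrative Python sketch of the general procedure, not the MICROSTAT code; x_all, names, and the entry threshold f_enter are assumed inputs, and a full implementation would also re-test the variables already in the model, as discussed above):

import numpy as np

def sse_of(y, cols, x_all):
    # Residual sum of squares and residual df for a model with the listed columns.
    x = np.column_stack([np.ones(len(y))] + [x_all[:, j] for j in cols])
    b, *_ = np.linalg.lstsq(x, y, rcond=None)
    e = y - x @ b
    return e @ e, len(y) - x.shape[1]

def forward_stepwise(y, x_all, names, f_enter=4.0):
    selected = []
    while True:
        sse_cur, _ = sse_of(y, selected, x_all)
        best = None
        for j in range(x_all.shape[1]):
            if j in selected:
                continue
            sse_new, df_new = sse_of(y, selected + [j], x_all)
            f = (sse_cur - sse_new) / (sse_new / df_new)   # partial F with 1 numerator df
            if best is None or f > best[0]:
                best = (f, j)
        if best is None or best[0] < f_enter:
            return [names[j] for j in selected]
        selected.append(best[1])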
OTHER REGRESSION TOPICS
Polynomial regression
Formulation
A polynomial of degree p (or order m = p+1) is defined as

y = β0 + β1 x + β2 x^2 + ... + βp x^p + ε

which can be represented as a linear regression

y = β0 + β1 x1 + β2 x2 + ... + βp xp + ε

where

x1 = x,   x2 = x^2,   ... ,   xp = x^p

and β0, ..., βp are coefficients that can be estimated and statistically evaluated using the
techniques previously discussed.
Discussion
Because higher-order polynomials can bend in unexpected directions between data points,
polynomial regressions of higher than third degree are usually avoided. It is recommended
that the polynomial equation be plotted to examine its shape before using it.
To avoid highly correlated variables, the polynomial regression is usually formulated using
deviations from the mean; that is, the above independent variables are redefined as

x1 = (x - xbar),   x2 = (x - xbar)^2,   ... ,   xp = (x - xbar)^p
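A minimal sketch of this formulation, assuming placeholder x and y arrays, is shown below; the design matrix is built from deviations from the mean and solved by ordinary least squares:

import numpy as np

x = np.linspace(0.0, 10.0, 21)                                   # placeholder data
y = 1.0 + 0.5 * x - 0.08 * x**2 + np.random.default_rng(0).normal(0, 0.2, x.size)

p = 3                                                            # degree of the polynomial
xc = x - x.mean()                                                # centering reduces correlation
design = np.column_stack([xc**k for k in range(p + 1)])          # columns 1, xc, xc^2, xc^3
b, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(b, 4))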
Indicator variables
Definition
Indicator variables are used to represent qualitative variables such as gender (male or female)
and status of experimental design variables (treatment or no treatment). Indicator variables
usually have a value of zero or one. They are sometimes called dummy variables or binary
variables.
Analysis of variance can be performed using a regression analysis in which all of the independent
variables are qualitative variables defined as indicator variables. This feature is sometimes used
in the analysis of data gathered in some experimental designs.
Example applications
In this chapter, we will use indicator variables for regression analyses of:
* Piecewise linear regression,
* Two different data sets with common slope or intercept terms, and
* Two different data sets with common variance.
Illustration example
Widths of river channels are sometimes represented as a power function of flow rate, or

W = a Q^b1

where W is the channel width at a flow depth of h (corresponding to flow rate Q). The parameters
"a" and b1 can be estimated using a natural log transformation as

yhat = bo + b1 x

where yhat = ln(W), bo = ln(a) and x = ln(Q).
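A minimal sketch of this log-transform fit, assuming placeholder Q and W arrays, is:

import numpy as np

def fit_power(Q, W):
    # Fit W = a * Q**b1 by regressing ln(W) on ln(Q).
    x = np.column_stack([np.ones(len(Q)), np.log(Q)])
    b, *_ = np.linalg.lstsq(x, np.log(W), rcond=None)
    return np.exp(b[0]), b[1]          # a = exp(bo), b1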
Piecewise linear regression
Potential application
To introduce the concept of piecewise linear regression, let's consider a typical channel with
a flood plain region as shown below. It is likely that separate equations could be used for the
main channel and the flood plain.

[Figure: channel cross section and the corresponding plot of y = ln(W) versus x = ln(Q), showing separate linear segments for the main channel and the flood plain with breakpoint xb = ln(Qb)]

The separate linear equations for the channel and flood plain are

yhat = bo + b1 x      (main channel)

and

yhat = bo' + b1' x    (flood plain)

where the prime terms correspond to the flood plain region. Since the width is continuous
between the two regions, it is desirable for the two equations to predict the same value at the
breakpoint flow rate xb = ln(Qb).
Regression model
Let's consider the following regression model:

yhat = bo + b1 x + b2 I (x - xb)

where I is an indicator variable defined as

I = 0 for the main channel
I = 1 for the flood plain region

The above regression model can then be solved for the main channel (I=0) as

yhat = bo + b1 x

and for the flood plain region (I=1) as

yhat = (bo - b2 xb) + (b1 + b2) x

and therefore bo' and b1' can be defined as functions of bo, b1, and b2. Furthermore, it is clear
that for x = xb we obtain the same value of yhat from either equation, that is

yhat(xb) = bo + b1 xb

The solution for the vector b, with bT = [bo, b1, b2], can be obtained for the matrices y and x defined as

yT = [yc1, yc2, ..., ycn, yf1, yf2, ..., yfq]

and x with rows [1, xci, 0] for the main channel observations and [1, xfi, (xfi - xb)] for the flood
plain observations, where yci and xci refer to the ln(W) and ln(Q) values in the main channel and
yfi and xfi refer to these values in the flood plain. In the above formulation, there are n
observations in the main channel and q observations in the flood plain.

The least squares parameters can then be obtained using the multiple regression solution
previously given as

b = (xTx)-1 xTy

After the vector b has been determined, the linear parameters for the flood plain region are
defined by the relationships bo' = bo - b2 xb and b1' = b1 + b2.
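A minimal sketch of this piecewise formulation is given below (x, y, and the breakpoint xb are assumed inputs, and fit_piecewise is an illustrative helper, not part of the chapter's notation):

import numpy as np

def fit_piecewise(x, y, xb):
    # Model: yhat = bo + b1*x + b2*I*(x - xb), with I = 1 on the flood plain (x > xb).
    ind = (x > xb).astype(float)                       # indicator variable I
    design = np.column_stack([np.ones_like(x, dtype=float), x, ind * (x - xb)])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    bo, b1, b2 = b
    # Flood-plain segment implied by the fit (continuous at xb):
    return (bo, b1), (bo - b2 * xb, b1 + b2)           # (bo, b1), (bo', b1')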
Extension to three piecewise segments
The approach can be extended to more than two piecewise segments. For example, the
formulation for two breakpoints (xb1 and xb2) is

yhat = bo + b1 x + b2 I1 (x - xb1) + b3 I2 (x - xb2)

where the I's are indicator variables defined as

I1 = 1 for x > xb1 and I1 = 0 for x < xb1
I2 = 1 for x > xb2 and I2 = 0 for x < xb2
Common slope or intercept parameter
Common slope parameter
For two approximately trapezoidal channels with the same sideslope ratio and different
bottom widths, the slope parameter would likely be constant and the intercepts would be
different. You wish to combine the information from the two channels to estimate the slope
term. This concept is shown below.

[Figure: y = ln(W) versus x = ln(Q) for the two channels, showing parallel lines with common slope b1 and intercepts bo and bo']

Let's consider the following regression model:

yhat = bo + b1 x + b2 I

where I is an indicator variable defined as

I = 0 for the first channel
I = 1 for the second channel

The above regression model can then be solved for the first channel (I=0) as

yhat = bo + b1 x

and for the second channel (I=1) as

yhat = (bo + b2) + b1 x

where the intercept parameter for the second channel is therefore defined as bo' = bo + b2. The
vector b is as previously defined. The matrices y and x are defined as

yT = [y11, y12, ..., y1n, y21, y22, ..., y2q]

and x with rows [1, x1i, 0] for the first channel and [1, x2i, 1] for the second channel, where y1i
and x1i refer to the ln(W) and ln(Q) values in the first channel and y2i and x2i refer to these
values in the second channel. In the above formulation, there are n observations in the first
channel and q observations in the second channel.

The vector b can now be computed using (xTx)-1xTy.
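A minimal sketch of this common-slope formulation is given below (the channel arrays are assumed inputs and fit_common_slope is an illustrative helper):

import numpy as np

def fit_common_slope(x1, y1, x2, y2):
    # Model: yhat = bo + b1*x + b2*I, with I = 1 for the second channel.
    x = np.concatenate([x1, x2])
    y = np.concatenate([y1, y2])
    ind = np.concatenate([np.zeros(len(x1)), np.ones(len(x2))])
    design = np.column_stack([np.ones(len(x)), x, ind])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    bo, b1, b2 = b
    return bo, bo + b2, b1        # intercepts bo and bo' = bo + b2, common slope b1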
Common intercept parameter
For two approximately trapezoidal channels with the same bottom widths and different
sideslopes, the intercept parameter would likely be constant and the slope parameters would
be different. You wish to combine the information from the two channels to estimate the
intercept as shown below.

[Figure: y = ln(W) versus x = ln(Q) for the two channels, showing lines with common intercept bo and slopes b1 and b1']

Let's use the following regression model:

yhat = bo + b1 x + b2 I x

where I is an indicator variable defined as

I = 0 for the first channel
I = 1 for the second channel

The above regression model can then be solved for the first channel (I=0) as

yhat = bo + b1 x

and for the second channel (I=1) as

yhat = bo + (b1 + b2) x

where the slope parameter for the second channel is therefore defined as b1' = b1 + b2. The
vector b and matrix y are as previously defined. The matrix x is now defined with rows
[1, x1i, 0] for the first channel and [1, x2i, x2i] for the second channel.

The vector b can again be computed using (xTx)-1xTy.
Common variance
Let's assume that you wish to fit separate equations to two different cross sections where
neither the slope nor the intercept terms are shared, but they have the same variance. You
wish to combine the data sets to obtain a better estimate of this variance. The representation
of this problem is shown below.

[Figure: y = ln(W) versus x = ln(Q) for the two locations, each with its own intercept (bo, bo') and slope (b1, b1')]

The appropriate regression model is

yhat = bo + b1 x + b2 I + b3 I x

where I is an indicator variable defined as

I = 0 for the first location
I = 1 for the second location

The above regression model can then be solved for the first location (I=0) as

yhat = bo + b1 x

and for the second location (I=1) as

yhat = (bo + b2) + (b1 + b3) x

where the intercept and slope parameters are defined similarly to the approach previously
given.
As shown by Judge et al. (1988, pp. 428-430), the parameters obtained by solving the
indicator variable formulation are identical to those obtained by solving each equation
separately. If the data have the same variance, the above formulation allows the pooled
variance to be determined directly.
APPENDIX 4-A: SELECTED DERIVATIONS
Total and regression sum of squares
The matrix form of the linear model is written in the chapter as y = yhat + e. Let's consider the
sum of squares Σyi2 = yTy defined as

yTy = (yhat + e)T (yhat + e) = yhatT yhat + yhatT e + eT yhat + eT e

where the property given in Chapter 1 of (yhat + e)T = yhatT + eT has been used. Since
yhatT e = eT yhat, we can simplify as

yTy = yhatT yhat + 2 yhatT e + eT e

By using yhat = x b and the property given in Chapter 1 of (xb)T = bT xT, we obtain

yTy = bT xTx b + 2 bT xT e + eT e

From the minimization solution for bj using the least squares method given in the chapter,
we have shown that

xT e = 0

By using this result and by subtracting n ybar2 from both sides of the equation, we obtain

yTy - n ybar2 = (bT xTx b - n ybar2) + eT e

As shown in the chapter, the first term on the right-hand side of the equation equals SSR using
the minimization solution for bo, or

SSR = bT xTx b - n ybar2

and therefore we are able to partition the total sum of squares as

SSTO = SSR + SSE

An alternative, and computationally simpler, form of the SSR can be obtained by using the
result given in the chapter of yhat = x b, or

bT xTx b = bT xT yhat = bT xT (y - e)

where the property given in Chapter 1 of (xb)T = bT xT has again been used. Since we have
previously shown that xT e = 0, we simplify as

SSR = bT xT y - n ybar2
Expected value of least squares estimator
As derived in the chapter, the least squares estimator for the vector b is defined as

b = (xTx)-1 xT y

The expected value can then be evaluated as

E[b] = E[(xTx)-1 xT y]

where y is defined using the linear model of y = x β + ε. By expanding terms, we obtain

E[b] = E[(xTx)-1 xT (x β + ε)] = E[β + (xTx)-1 xT ε]

where (xTx)-1 xTx = I has been used.

Since β and (xTx)-1 xT are constants (nonrandom) for a given formulation, the expected value
can be written as

E[b] = β + (xTx)-1 xT E[ε] = β

where E[ε] = 0 from the minimization solution for bo.
Variance-covariance matrix of b
General definition
Let's propose the following definition for the variance-covariance matrix

COV(b) = E[(b - β)(b - β)T]

where β = E[b] has been used. We will now show that this definition corresponds to
variances on the diagonal elements and covariances for the off-diagonal elements.

Let's first consider the matrix product (b - β)(b - β)T, whose (i, j) element is

(bi - βi)(bj - βj)

The variance-covariance matrix is then obtained by taking the expected value of both sides. Since
the expected value of the matrix can be evaluated from the expected values of the individual terms,
we conclude that

E[(bi - βi)(bj - βj)] = VAR(bi) for i = j, and COV(bi, bj) for i ≠ j

and therefore the original definition has been shown to result in the variance-covariance matrix
of b, with variances on the diagonal and covariances elsewhere.
Least squares variance-covariance matrix
Let's now consider more carefully our starting definition of the variance-covariance matrix.
From the previous section, it is clear that b can be written as

b - β = (xTx)-1 xT ε

and therefore the variance-covariance matrix can be written as

COV(b) = E[ (xTx)-1 xT ε ((xTx)-1 xT ε)T ]

and simplified as

COV(b) = E[ (xTx)-1 xT ε εT ((xTx)-1 xT)T ]

From the matrix rules in Chapter 1 of (xy)T = yT xT and (y-1)T = (yT)-1, we can evaluate the last
multiplicative term as

((xTx)-1 xT)T = x (xTx)-1

and therefore the variance-covariance matrix can be written as

COV(b) = (xTx)-1 xT E[ε εT] x (xTx)-1

where (xTx)-1 xT and x (xTx)-1 are constants (nonrandom). For independent and constant
variance residuals, we know that

E[ε εT] = σ2 I

where I is the identity matrix. We can therefore simplify as

COV(b) = σ2 (xTx)-1 xT x (xTx)-1

The variance-covariance matrix for b is then obtained as

COV(b) = σ2 (xTx)-1
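This result can also be checked numerically (a minimal simulation sketch, not part of the derivation; the design matrix, β, and σ are made up for illustration):

import numpy as np

rng = np.random.default_rng(1)
n, sigma = 30, 2.0
x = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])    # assumed design matrix
beta = np.array([1.0, 0.5])                                 # assumed true parameters

xtx_inv = np.linalg.inv(x.T @ x)
bs = np.array([xtx_inv @ x.T @ (x @ beta + rng.normal(0, sigma, n))
               for _ in range(20000)])

print(np.round(np.cov(bs, rowvar=False), 4))   # empirical covariance of the estimates
print(np.round(sigma**2 * xtx_inv, 4))         # theoretical COV(b) = sigma^2 (xTx)^-1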
Gauss-Markov theorem
General approach
The Gauss-Markov theorem is used to show that, of all possible unbiased linear estimators,
the least squares estimator results in the minimum variance. This powerful result is limited
by the assumptions of independence and constant variance of residuals used in the previous
section. The derivation follows that given by Judge et al. (1988). Their approach utilizes the
definition of a positive semidefinite matrix. As discussed in Chapter 1, a symmetrical nxn
matrix W is positive semidefinite if

u W uT ≥ 0

where u is a 1xn vector.

Let's first introduce a general formulation by considering the linear combination of the
regression parameters defined as

aT b = ao bo + a1 b1 + ... + ap bp

where aT = [ao, a1, ... , ap] is any set of constants. A possible alternative estimator for the
regression parameters will be represented by the vector b^. The least squares estimator b has a
smaller variance than b^ if

VAR(aT b) ≤ VAR(aT b^)

For example, if aT = [1, 0, ... , 0] then we require that VAR(bo) ≤ VAR(b^o); if aT = [0, 1, ... ,
0] then VAR(b1) ≤ VAR(b^1); and if aT = [1, 1, ... , 0] then VAR(bo+b1) ≤ VAR(b^o+b^1). The
above conditions can be represented using the covariance matrices as

aT COV(b) a ≤ aT COV(b^) a

or, for all a,

aT [COV(b^) - COV(b)] a ≥ 0

Let's consider the matrix W = COV(b^) - COV(b). We only need to show that W is positive
semidefinite to meet the above criterion.
Unbiased condition
Let's consider the class of all possible linear estimators. Previously in this appendix, we used
the following definition of the least squares estimator

b = (xTx)-1 xT y

The class of all possible linear estimators can be represented as

b^ = [(xTx)-1 xT + C] y = β + (xTx)-1 xT ε + C x β + C ε

where C is a general mxn matrix. The mathematical details to obtain the last equality have
been previously given in this appendix. Let's consider possible constraints on C to limit the
analysis to unbiased linear estimators. For an unbiased estimator, we know that E(b^) = β. By
taking the expected value of the above general linear estimator, we obtain

E(b^) = β + C x β

where β, C and x are nonrandom variables or constants. For an unbiased estimator, the matrix C
is therefore constrained by the condition Cx = 0. By using this result, the general definition of the
linear estimator is now defined for the unbiased condition as

b^ = β + [(xTx)-1 xT + C] ε
Minimum variance
Let's now compute the covariance of the general unbiased linear estimator determined in the
previous subsection. Many of the covariance manipulations are similar to those previously
given. The covariance matrix is defined as

COV(b^) = E[(b^ - β)(b^ - β)T] = E[ [(xTx)-1 xT + C] ε εT [(xTx)-1 xT + C]T ]

By using the matrix rules in Chapter 1 of (x+y)T = xT + yT, (xy)T = yT xT and (y-1)T = (yT)-1, we
can evaluate the transpose term as

[(xTx)-1 xT + C]T = x (xTx)-1 + CT

where the details of the manipulations have been previously given in this appendix. By using
this result, the covariance matrix simplifies as

COV(b^) = [(xTx)-1 xT + C] E[ε εT] [x (xTx)-1 + CT]

since x and C are nonrandom and we can use the expected value rules. By using

E[ε εT] = σ2 I

for independent and constant variance residuals, we further simplify as

COV(b^) = σ2 [ (xTx)-1 + (xTx)-1 xT CT + C x (xTx)-1 + C CT ]

We previously showed that Cx is a null matrix for an unbiased estimator. Similarly, xT CT = (Cx)T is also a
null matrix. By using COV(b) = σ2 (xTx)-1, we obtain

W = COV(b^) - COV(b) = σ2 C CT

As discussed in Chapter 1, the matrix-transpose matrix product is always positive
semidefinite. Since σ2 is also always positive, we conclude that W is positive semidefinite.
We have therefore shown that, of all of the class of linear unbiased estimators, the least
squares method results in the smallest possible variance.
REFERENCES
Beck, J.V. and K.J. Arnold. 1977. Parameter Estimation in Engineering and Science. John Wiley and
Sons, New York.
Devore, J.L. 2001. Probability and Statistics for Engineering and the Sciences. Duxbury Press,
New York.
Draper, N.R. and H. Smith. 1998. Applied Regression Analysis. John Wiley and Sons, New York.
Haan, C.T. 1979. Statistical Methods in Hydrology. Iowa State University Press.
Judge, G.G., R.C. Hill, W.E. Griffiths, H. Lutkepohl, and T.-C. Lee. 1988. Introduction to the
Theory and Practice of Econometrics. John Wiley and Sons, New York.
Neter, J. and W. Wasserman. 1974. Applied Linear Statistical Models. Irwin Inc., Homewood,
IL.
Neter, J., M.H. Kutner, C.J. Nachtsheim, and W. Wasserman. 1996. Applied Linear Statistical Models.
McGraw-Hill, Boston.
Pratt, J.W. 1965. Bayesian interpretation of standard inference statements. J. Roy. Statist. Soc. B
27: 169-203.
PROBLEM ASSIGNMENT
Set #1:
Due Date:
Set #2:
Due Date:
Please do not use commercial statistical software to solve Problems #1 and #2. These problems
should be solved using the matrix solution techniques of Chapter 1. Statistical software can be
used (and is recommended) to check your values.
Problem #1 (20 points)
You are interested in predicting the gas mileage of cars as a function of length, width,
and weight. The following data have been measured for ten different cars (4_gas.xls).
Car    Gas Mileage   Length   Width   Weight
Id     y             x1       x2      x3
1      13.9          215.5    78.5    4540
2      16.5          195.4    74.4    3885
3      16.5          185.2    69      3660
4      17.8          199.9    74      3890
5      18.75         199.9    74      3890
6      19.73         215.3    76.3    4370
7      20.07         194.1    71.8    3365
8      20.3          168.8    69.4    2700
9      30.4          165.2    65      2320
10     36.5          160.6    62.2    2009
You are required to (1) determine the xTy, xTx, and (xTx)-1 matrices, (2) compute the
parameters bo, b1, b2 and b3 using matrix multiplication, (3) compute the sums of squares and
mean squares corresponding to an ANOVA table, (4) compute the coefficient of determination,
and (5) compute the variance-covariance matrix for b. What are the standard errors of bo, b1, b2 and b3?
Problem #2 (20 points)
You are interested in predicting the heat evolved by cement (calories/gram) as a function of the
percentages (by weight) of dicalcium silicate, calcium aluminum ferrate, and tricalcium
silicate. The following data have been recorded (4_heat.xls).
Heat Evolved   Dicalcium       Aluminum       Tricalcium
(y)            Silicate (x1)   Ferrate (x2)   Silicate (x3)
78.5           60              6              26
74.3           52              15             29
104.3          20              8              56
87.6           47              8              31
95.9           33              6              52
109.2          22              9              55
102.7          6               17             71
72.5           44              22             31
93.1           22              18             54
115.9          26              4              47
83.8           34              23             40
113.3          12              9              66
109.4          12              8              68
You are required to (1) determine the xTy, xTx, and (xTx)-1 matrices, (2) compute the
parameters bo, b1, b2 and b3 using matrix multiplication, (3) compute the sums of squares and
mean squares corresponding to an ANOVA table, (4) compute the coefficient of determination,
and (5) compute the variance-covariance matrix for b. What are the standard errors of bo, b1, b2 and b3?
Problem #3 (20 points)
The data given below are the results of a small scale experiment on the effects of work crew
size and level of bonus pay on crew productivity scores (4_crew.xls).
Crew Size   Bonus Pay ($)   Productivity Score
x1          x2              y
4           2               42
4           2               39
4           3               48
4           3               51
6           2               49
6           2               53
6           3               61
6           3               60
You are to:
(1) Determine the correlation between crew size and bonus pay. Are these variables
correlated (accounting for possible roundoff error)?
(2) Calculate the regression of (a) y on x1 and x2, (b) y on x1 only, and (c) y on x2
only. Report both the ANOVA tables and slope coefficients for all three options.
Does the slope parameter for x1 change when x2 is included in the model? Does
the slope parameter for x2 change when x1 is included in the model?
(3) Calculate SSR(x2/x1) and SSR(x1/x2). Is SSR(x2/x1) equal (within roundoff error)
to SSR(x2)? Is SSR(x1/x2) equal (within roundoff error) to SSR(x1)?
Problem #4 (20 points)
Consider the following data obtained from ten shipments of chemicals in drums (4_ship.xls).
You are interested in the number of human-minutes required to handle the shipment as a
function of the number of drums and the total weight of the shipment.
Shipment   Number of      Total Weight         Human-minutes
           Barrels - x1   (100 pounds) - x2    y
1          7              5.11                 58
2          18             16.70                152
3          5              3.20                 41
4          14             7.00                 93
5          11             11.00                101
6          5              4.00                 38
7          23             22.10                203
8          9              7.00                 78
9          16             10.60                117
10         5              4.80                 44
You are to:
(1) Determine the correlation between number of barrels and total weight. Are these
values highly correlated?
(2) Calculate the regression of (a) y on x1 and x2, (b) y on x1 only, and (c) y on x2
only. Report both the ANOVA tables and slope coefficients for all three options.
Does the slope parameter for x1 change when x2 is included in the model? Does
the slope parameter for x2 change when x1 is included in the model?
(3) Calculate SSR(x2/x1) and SSR(x1/x2). Is SSR(x2/x1) equal to SSR(x2)? Is
SSR(x1/x2) equal to SSR(x1)?
Problem #5 (30 points)
Physical fitness measurements on men were made at North Carolina State University. The
results are shown below (4_fitness.xls), where age is in years, weight is in kg, o_rate is the
oxygen uptake rate in mL/kg/min, t_run is the time to run 1.5 miles in minutes, h_rest is
the heart rate while resting in beats/min, h_run is the heart rate while running in beats/min,
and h_max is the maximum recorded heart rate in beats/min.
Obtain a predictive equation for oxygen uptake rate as a function of one or more of the other
variables. Assume that all variables are equally easy to obtain. Report the correlation
coefficients for the independent variables. Summarize your results in a table showing the
significant variables and corresponding R2, MSE, and F values for each set of variables
considered in your analysis. Show the residual plots for your predictive model.
O_rate   Age   Weight   T_run   H_rest   H_run   H_max
y        x1    x2       x3      x4       x5      x6
44.609   44    89.47    11.37   62       178     182
45.313   40    75.07    10.07   62       185     185
54.297   44    85.84    8.65    45       156     168
59.571   42    68.15    8.17    40       166     172
49.874   38    89.02    9.22    55       178     180
44.811   47    77.45    11.63   58       176     176
45.681   40    75.98    11.95   70       176     180
49.091   43    81.19    10.85   64       162     170
39.442   44    81.42    13.08   63       174     176
60.055   38    81.87    8.63    48       170     186
50.541   44    73.03    10.13   45       168     168
37.388   45    87.66    14.03   56       186     192
44.754   45    66.45    11.12   51       176     176
47.273   47    79.15    10.6    47       162     164
51.855   54    83.12    10.33   50       166     170
49.156   49    81.42    8.95    44       180     185
40.836   51    69.63    10.95   57       168     172
46.672   51    77.91    10      48       162     168
46.774   48    91.63    10.25   48       162     164
50.388   49    73.37    10.08   67       168     168
39.407   57    73.37    12.63   58       174     176
46.08    54    79.38    11.17   62       156     165
45.441   52    76.32    9.63    48       164     166
54.625   50    70.87    8.92    48       146     155
45.118   51    67.25    11.08   48       172     172
39.203   54    91.63    12.88   44       168     172
45.79    51    73.71    10.47   59       186     188
50.545   57    59.08    9.93    49       148     155
48.673   49    76.32    9.4     56       186     188
47.92    48    61.24    11.5    52       170     176
47.467   52    82.78    10.5    53       170     172
Create a new independent variable in the spreadsheet or in your software package defined as
weight in pounds (1 kg = 2.2 lbm). Try a regression model with both weight in kg and lbm as
independent variables. What happens?
Problem #6 (30 points)
The following variables have been measured for 108 different diesel tractors (4_tract.xls):
* Cylinder - x1 = Number of cylinders
* Fuel_wt - x2 = Weight density of fuel (lbs of fuel/gallon)
* Bore - x3 = Bore diameter (inches)
* Stroke - x4 = Length of engine stroke (inches)
* Pto_m = Maximum recorded PTO power (hp)
* Pto_gas - x5 = PTO power-hour (energy) per gallon of fuel consumption (hp-h/gal)
* Displace - x6 = Cylinder displacement (cubic inches)
* D_bar - y = Maximum recorded drawbar power (hp)
You are required to determine a predictive equation for D_bar as a function of Cylinder,
Fuel_wt, Bore, Stroke, Pto_gas and/or Displace (note: Pto_m is not included). Assume that
all variables are equally easy to obtain (note that Displace = cylinder*stroke*bore^2*π/4).
Summarize your results in a table showing the significant variables and corresponding R2,
MSE, and F values for each set of variables considered in your analysis.
Cylinder   Fuel_wt   Bore   Stroke   Pto_m   Pto_gas   Displace   D_bar
x1         x2        x3     x4               x5        x6         y
3          6.94      3      3.23     22.06   13.26     68.495     17.8173
3          6.94      3      3.23     22.35   13.6      68.495     18.1154
2          6.94      3      3.23     15.33   12.81     45.663     12.48
2          6.94      3      3.23     15.45   13.13     45.663     12.1457
4          6.976     3      3.23     29.35   12.78     91.326     25.2997
3          7.008     3      3.23     19.59   14.17     68.495     15.3504
6          7.062     3.23   3.23     49.72   14.37     158.8      40.8637
3          6.976     3.23   3.23     26.46   13.99     79.4       21.576
3          6.976     3.23   3.23     26.21   13.24     79.4       21.7507
3          7.008     3.23   3.23     23.42   14.21     79.4       18.8695
Problem #7 (30 points)
Consider the following set of x and y values (4_piece.xls). Perform a simple regression
analysis of these data. Summarize your results in an ANOVA table. Also analyze the data using
a piecewise regression model with a breakpoint at 10. Here you will need to use indicator
variables. Summarize the results in an ANOVA table.
Sample   x       y          Sample   x       y
1        1.03    9.17       11       10.92   30.35
2        1.92    1.20       12       12.06   35.81
3        3.03    10.43      13       12.93   46.45
4        4.06    12.18      14       13.99   59.10
5        5.04    8.56       15       14.91   74.56
6        5.94    17.13      16       16.04   87.51
7        7.08    13.73      17       16.93   95.77
8        8.04    22.63      18       18.06   108.60
9        9.08    22.98      19       19.03   110.08
10       9.94    16.55      20       20.05   123.52
SOLUTION KEY
Problem #1 (20 points)
n = 10, m = 4, ybar = 21.045
(In EXCEL, xT can be formed with Paste Special - Transpose.)

Obs    y       yhat      e = y - yhat
1      13.9    13.153     0.747
2      16.5    14.834     1.666
3      16.5    18.303    -1.803
4      17.8    18.099    -0.299
5      18.75   18.099     0.651
6      19.73   18.168     1.562
7      20.07   24.597    -4.527
8      20.3    21.022    -0.722
9      30.4    29.598     0.802
10     36.5    34.576     1.924

xTy = [210.45, 39035.77, 14763.5, 682200.2]T

xTx =
 10         1899.9       714.6       34629
 1899.9     364446.2     136624.3    6726230
 714.6      136624.3     51297.74    2511871.8
 34629      6726230      2511871.8   126493231

(xTx)^-1 =
 146.892    -0.32108    -2.09983     0.018558
 -0.32108    0.005927   -0.006675   -9.47E-05
 -2.09983   -0.006675    0.055011   -0.000163
  0.018558  -9.47E-05   -0.000163    3.191E-06

b = (xTx)^-1 xTy = [39.118, 0.631, -1.245, -0.01413]T

ANOVA:
Source       df   Sum of Squares   Mean Square
Regression    3   402.88           134.29
Residual      6    34.89             5.815
Total         9   437.77

R^2 = 0.920    F = 23.09

Var(b) = MSE (xTx)^-1 =
 854.25     -1.867     -12.212     0.10793
 -1.867      0.034466   -0.03882  -0.00055
 -12.212    -0.03882     0.319916 -0.00095
  0.10793   -0.00055    -0.00095   1.86E-05

Standard errors: SE(bo) = 29.23, SE(b1) = 0.186, SE(b2) = 0.566, SE(b3) = 0.0043
Problem #2 (20 points)
n = 13, m = 4, ybar = 95.423
(In EXCEL, xT can be formed with Paste Special - Transpose.)

Obs    y        yhat      e = y - yhat
1      78.5     77.523     0.977
2      74.3     74.177     0.123
3      104.3    109.206   -4.906
4      87.6     90.251    -2.651
5      95.9     95.554     0.346
6      109.2    105.567    3.633
7      102.7    104.122   -1.422
8      72.5     74.651    -2.151
9      93.1     93.459    -0.359
10     115.9    113.966    1.934
11     83.8     80.462     3.338
12     113.3    110.980    2.320
13     109.4    110.581   -1.181

xTy = [1240.5, 34733.3, 13981.5, 62027.8]T

xTx =
 13      390     153     626
 390     15062   4628    15739
 153     4628    2293    7201
 626     15739   7201    33050

(xTx)^-1 =
 51.981     -0.59965    -0.19939    -0.65557
 -0.59965    0.007097    0.0020035   0.0075419
 -0.19939    0.0020035   0.0026370   0.0022479
 -0.65557    0.0075419   0.0022479   0.0083661

b = (xTx)^-1 xTy = [203.642, -1.557, -1.448, -0.923]T

ANOVA:
Source       df   Sum of Squares   Mean Square
Regression    3   2641.95          880.65
Residual      9     73.81            8.20
Total        12   2715.76

R^2 = 0.973    F = 107.4

Var(b) = MSE (xTx)^-1 =
 426.33    -4.918     -1.635     -5.377
 -4.918     0.058203   0.016432   0.061855
 -1.635     0.016432   0.021628   0.018437
 -5.377     0.061855   0.018437   0.068615

Standard errors: SE(bo) = 20.65, SE(b1) = 0.241, SE(b2) = 0.147, SE(b3) = 0.262
Problem #3 (20 points)
crew   bonus   prod
4      2       42
4      2       39
4      3       48
4      3       51
6      2       49
6      2       53
6      3       61
6      3       60

r (crew, bonus) = 0

Regression of y on x1 and x2:
Source       df   SS
Regression    2   402.25
Residual      5    17.63
Total         7   419.88

Variable    Coeff    Std Error
Intercept   0.375    4.74
Crew        5.375    0.66
Pay         9.25     1.33

Regression of y on x1 only:
Source       df   SS
Regression    1   231.13
Residual      6   188.75
Total         7   419.88

Variable    Coeff    Std Error
Intercept   23.5     10.11
Crew        5.375    1.98

SSR(x2/x1) = SSE(x1) - SSE(x1,x2) = 188.75 - 17.63 = 171.13

Regression of y on x2 only:
Source       df   SS
Regression    1   171.13
Residual      6   248.75
Total         7   419.88

Variable    Coeff    Std Error
Intercept   27.25    11.61
Pay         9.25     4.55

SSR(x1/x2) = SSE(x2) - SSE(x1,x2) = 248.75 - 17.63 = 231.13

The variables are uncorrelated. The slope coefficients are the same for all of the models. SSR(x2/x1)
equals SSR(x2) and SSR(x1/x2) equals SSR(x1). Differences in the standard errors are caused by the
smaller MSE for the model with both parameters.
Problem #4 (20 points)
obs   Number   Weight   Human
1     7        5.11     58
2     18       16.70    152
3     5        3.20     41
4     14       7.00     93
5     11       11.00    101
6     5        4.00     38
7     23       22.10    203
8     9        7.00     78
9     16       10.60    117
10    5        4.80     44

r (barrels, weight) = 0.93

Regression of y on x1 and x2:
Source       df   SS
Regression    2   25720.11
Residual      7      78.39
Total         9   25798.50

Variable    Coeff    Std Error
Intercept   3.37     2.34
Barrels     4.07     0.49
Weight      4.72     0.51

Regression of y on x1 only:
Source       df   SS
Regression    1   24751.65
Residual      8    1046.85
Total         9   25798.50

Variable    Coeff    Std Error
Intercept   -1.98    7.76
Barrels     8.36     0.61

SSR(x2/x1) = SSE(x1) - SSE(x1,x2) = 1046.85 - 78.39 = 968.46

Regression of y on x2 only:
Source       df   SS
Regression    1   24964.73
Residual      8     833.77
Total         9   25798.50

Variable    Coeff    Std Error
Intercept   13.70    6.03
Weight      8.61     0.56

SSR(x1/x2) = SSE(x2) - SSE(x1,x2) = 833.77 - 78.39 = 755.38

The variables are highly correlated. The slope coefficients are different for the different models.
SSR(x2/x1) does not equal SSR(x2), nor does SSR(x1/x2) equal SSR(x1).
Problem #5 (30 points)
Correlation among variables:

          Orate    Age      Weight   T_run    H-rest   H-run    H-max
Orate     1
Age      -0.305    1
Weight   -0.163   -0.234    1
T_run    -0.862    0.189    0.144    1
H-rest   -0.399   -0.164    0.044    0.450    1
H-run    -0.398   -0.338    0.182    0.314    0.352    1
H-max    -0.237   -0.433    0.249    0.226    0.305    0.930    1

All variables:

Source       df   SS        MS        F
Regression    6   722.54    120.42    22.43
Residual     24   128.84      5.368
Total        30   851.38

R Square = 0.849    Adjusted R2 = 0.811    Std Error = 2.317

Variable    Value       Std Error   t Stat    P-value
Intercept   102.9345    12.40326    8.299     1.64E-08
age         -0.226974   0.099837   -2.273     0.032235
weight      -0.074177   0.054593   -1.359     0.186866
t_run       -2.628653   0.384562   -6.835     4.54E-07
h_rest      -0.021534   0.066054   -0.326     0.74725
h_run       -0.369628   0.119853   -3.084     0.005079
h_max        0.303217   0.136495    2.221     0.036007

[Figure: standardized residuals versus predicted oxygen uptake rate for the full model]

Model   MSE    R2     F
1       5.37   0.84   22.4
2       6.21   0.82   22.4
3       6.17   0.81   28.0
4       5.95   0.81   38.6
5       7.17   0.76   45.4
6       7.27   0.76   44.7
7       7.53   0.74   84.0
8       26.6   0.09    3.0

[Each model's included variables (Age x1, Weight x2, t_run x3, h_rest x4, h_run x5, h_max x6) are flagged in the full table: # indicates the variable was included and was significant at 10%; x indicates the variable was included and was insignificant at 10%.]

All the bold models are reasonable. The best three-parameter model is Model 4. The regression
output and residual plot are shown below. There is insufficient evidence from the residual plots to
conclude that the standard assumptions for the residuals are unreasonable.

[Figure: standardized residuals versus predicted values for Model 4]

The best two-parameter model is Model 5. The regression output and residual plot are shown
below. There is insufficient evidence from the residual plots to conclude that the standard
assumptions for the residuals are unreasonable.

[Figure: standardized residuals versus predicted values for Model 5]

The best one-parameter model is Model 7. The regression output and residual plot are shown
below. There is insufficient evidence from the residual plots to conclude that the standard
assumptions for the residuals are unreasonable.

[Figure: standardized residuals versus predicted values for Model 7]

The selection of the "best" of the three relatively good models now depends on the engineer's or
scientist's expertise with the system.

Correlation matrix with weight in kg:

          Orate    Age      Weight   T_run    H-rest   H-run    H-max    wt_kg
Orate     1.00
Age      -0.30     1.00
Weight   -0.16    -0.23     1.00
T_run    -0.86     0.19     0.14     1.00
H-rest   -0.40    -0.16     0.04     0.45     1.00
H-run    -0.40    -0.34     0.18     0.31     0.35     1.00
H-max    -0.24    -0.43     0.25     0.23     0.31     0.93     1.00
wt_kg    -0.16    -0.23     1.00     0.14     0.04     0.18     0.25     1.00

Regression results in EXCEL show very large standard errors for the two weight variables:

Source       df   SS        MS        F
Regression    7   722.54    103.22    18.43
Residual     23   128.84      5.60
Total        30   851.38

R Square = 0.85    Adjusted R2 = 0.80    Std Error = 2.37

Variable    Value   Std Error    t Stat   P-value
Intercept   102.9   12.7          8.1     0.0
age         -0.2    0.1          -2.2     0.0
weight       1.3    257094.2      0.0     1.0
t_run       -2.6    0.4          -6.7     0.0
h_rest       0.0    0.1          -0.3     0.8
h_run       -0.4    0.1          -3.0     0.0
h_max        0.3    0.1           2.2     0.0
wt_kg       -0.6    116861.0      0.0     1.0
Problem #6 (30 points)
Correlation matrix:

           d_bar   Cylinder  Fuel_wt  bore   stroke  pto_m  pto_gas  displace
d_bar      1
Cylinder   0.90    1.00
Fuel_wt    0.09    0.18      1.00
bore       0.65    0.42     -0.03     1.00
stroke     0.64    0.45     -0.03     0.68   1.00
pto_m      0.99    0.91      0.10     0.64   0.64    1.00
pto_gas    0.37    0.23      0.07     0.47   0.46    0.36   1.00
displace   0.96    0.93      0.11     0.65   0.69    0.97   0.34     1.00

Regression analysis using all variables:

Source       df    SS        MS        F
Regression    6    40219.2   6703.2    226.1
Residual    101     2993.9     29.64
Total       107    43213.1

R Square = 0.931    Adjusted R2 = 0.927    Std Error = 5.444

Variable    Value      Std Error   t Stat   P-value
Intercept    59.734    85.704      0.697    0.487
Cylinder      0.151     2.281      0.066    0.947
Fuel_wt     -10.263    12.065     -0.851    0.397
bore          3.849     3.915      0.983    0.328
stroke       -3.192     2.158     -1.479    0.142
pto_gas       0.957     0.487      1.967    0.052
displace      0.206     0.043      4.784    0.000

The residuals appear to violate the condition of constant variance.

[Figure: standardized residuals versus predicted drawbar power for the full model]

Model   MSE     R2     F
1       29.64   0.93   226.1
2       29.28   0.93   342.3
3       29.53   0.93   453.2
4       30.19   0.93   663.1
5       30.82   0.92   1296.3
6       46.71   0.88   273.7
7       65.76   0.84   276.1
8       48.10   0.88   396.67
9       77.36   0.81   452.61

[Each model's included variables (Cylinder x1, Fuel_wt x2, bore x3, stroke x4, pto_gas x5, displace x6) are flagged in the full table: # indicates the variable was included and was significant at 10%; x indicates the variable was included and was insignificant at 10%.]

The three best models are shown in bold. Model 5 would likely be the best model, with a small
increase in MSE, a small decrease in R2, and the largest overall F compared to Models 3 and 4.

For comparison, the best three-parameter model is Model 3. The regression output and residual
plot are shown below. The residuals suggest that the variance is increasing with the predicted
values.

[Figure: standardized residuals versus predicted drawbar power for Model 3]

The best two-parameter model is Model 4. The regression output and residual plot are shown
below.

[Figure: standardized residuals versus predicted drawbar power for Model 4]

The best one-parameter and overall model is Model 5. The regression output and residual plot are
shown below. The residuals suggest that the variance is increasing with the predicted values.

[Figure: standardized residuals versus predicted drawbar power for Model 5]
Problem #7 (30 points)
Simple regression analysis using only x in the model (no breakpoint):

Source       df   SS        MS        F
Regression    1   26931.4   26931.4   133.8
Residual     18    3622.9     201.3
Total        19   30554.3

R Square = 0.881    Adj R2 = 0.875    Std Error = 14.187

Variable     Value     Std Error   t Stat    P-value
Intercept    -21.562   6.595       -3.270    0.004257
x1 only       6.367    0.550       11.567    9.09E-10

[Figure: y versus x with the fitted straight line]

The residual plot shows a possible trend of the residuals with x.

[Figure: standardized residuals versus x for the simple regression]

Multiple regression analysis using x in the model with a breakpoint (indicator variables):

Source       df   SS         MS         F
Regression    2   30273.52   15136.76   916.35
Residual     17     280.81      16.52
Total        19   30554.33

R Square = 0.991    Adj R2 = 0.990    Std Error = 4.064

Variable     Value    Std Error   t Stat    P-value
Intercept    4.428    2.628        1.685    0.110279
x1           1.561    0.373        4.187    0.000618
x2           8.919    0.627       14.224    7.17E-11

All of the statistics for the piecewise regression analysis are better. The residuals do not show a
trend with x.

[Figure: y versus x with the fitted piecewise lines]

[Figure: standardized residuals versus x for the piecewise regression]