Some notes on regression analysis that may be useful in preparing Case 3
The purpose of these notes is to remind you of some of the properties of regression
analysis and to provide you some guidance in interpreting the output in Exhibit 2 (p. 85).
Case 3 utilizes univariate regression analysis, in which one variable (the “independent”
variable, delivery volume in this case) is used to estimate another variable (the
“dependent” variable, delivery expense in this case). (Although the model in Case 3 has
only one independent variable, it is of course possible to have more than one.) The
relationship between the two variables is assumed to be linear:
Y = α + βX + ε.
Y is the delivery expense (which we want to predict or estimate); α is the line’s intercept
(which we will view as the fixed component of delivery expense); β is the line’s slope
(which we will view as the per-unit variable cost component of delivery expense); X is
the delivery volume; and ε is a random error term that indicates that the relationship
between X and Y is not perfectly predictable. It is assumed that the ε’s are independent
across the observations and normally distributed, with a mean of zero.
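To make these assumptions concrete, the sketch below simulates data from this model in Python. The α, β, and σ values are hypothetical placeholders, not the Case 3 estimates.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical parameters (not the Case 3 estimates): intercept alpha
    # (fixed cost), slope beta (per-unit variable cost), and error spread sigma.
    alpha, beta, sigma = 850.0, 1.25, 60.0

    # Fifty delivery volumes X, and expenses Y generated from the assumed
    # linear model with independent, normal, mean-zero errors.
    X = rng.uniform(1000, 3000, size=50)
    eps = rng.normal(0.0, sigma, size=50)
    Y = alpha + beta * X + eps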
The model is estimated using data from the company’s records. The dependent variable,
Yr, is the recorded delivery expense. We will denote the estimate of the intercept (α) as A
and the estimate of the slope (β) as B. The predicted delivery expense, then, is equal to
A + BX. The difference between the predicted delivery expense and the
“actual” delivery expense (i.e., Yr) is the prediction error. The coefficients, A and B, are
estimated so that the sum of the squared errors, Σ [Yr – (A + BX)]², is minimized. (The
summation is over the observations in the model, in this case the 50 observations from
the outlets.) The coefficients, A and B, are “optimal” in the sense that no other estimates
will provide for smaller prediction errors.
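A minimal sketch of the least-squares computation is shown below; the volume and expense arrays are made-up placeholders standing in for the 50 outlet observations, not the actual case data.

    import numpy as np

    # Placeholder observations standing in for the 50 outlets (X = delivery
    # volume, Yr = recorded delivery expense); these are not the case figures.
    X = np.array([1200.0, 1975.0, 2400.0, 2850.0])
    Yr = np.array([2350.0, 3257.0, 3800.0, 4400.0])

    # Least-squares estimates: B = cov(X, Yr) / var(X), A = mean(Yr) - B * mean(X).
    B = np.sum((X - X.mean()) * (Yr - Yr.mean())) / np.sum((X - X.mean()) ** 2)
    A = Yr.mean() - B * X.mean()

    # No other A, B pair gives a smaller sum of squared prediction errors.
    sse = np.sum((Yr - (A + B * X)) ** 2)
    print(A, B, sse)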
In the regression output (Exhibit 2), A is 848.964. That is, the estimated fixed cost per
outlet is about $849 (you might consider whether it is reasonable to interpret A as fixed
cost). B is 1.246. That is, the estimated per-unit variable cost is $1.246. For every one-unit increase in X (delivery volume), delivery expense is expected to increase by $1.246.
In the first plot in Exhibit 2, the best-fitting line through the data points would have a
slope of +1.246. The model coefficients can be used to predict the delivery expenses at
individual outlets. For example, for the McGill Univ. outlet, the model would estimate
delivery expense to be $849 + $1.246 * 1975 = $3,310. The recorded expense for this
outlet was $3,257, indicating a prediction error (also called a residual or “d” in Case 3) of
-$53. [Note that this calculation of d employs the A and B coefficients estimated from the
data and is the computation used for the d plot in Exhibit 2. The case also, however,
calculates d using the planned A and B coefficients (i.e., Ap and Bp); see equation (5).
Depending on the purpose of the calculation, either way is acceptable.]
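The McGill Univ. calculation can be reproduced directly from the Exhibit 2 coefficients; a minimal sketch:

    # Coefficients from Exhibit 2 applied to the McGill Univ. outlet.
    A, B = 848.964, 1.246
    volume, recorded_expense = 1975, 3257

    predicted_expense = A + B * volume            # about $3,310
    d = recorded_expense - predicted_expense      # about -$53
    print(round(predicted_expense), round(d))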
Each of the terms in the regression model has an associated standard error. Exhibit 2
indicates that the standard error for A is 101.185 and for B is 0.053. These standard
errors can be used to build the usual confidence intervals for the coefficients. For
example, an approximate 95% confidence interval for the slope coefficient B is equal to
1.246 ± 1.96*.053, or 1.14 to 1.35. Recall that we can interpret a 95% confidence
interval as follows: if we were to conduct this model estimation procedure many times,
95% of the confidence intervals would contain β. (In Case 3, we might want to compare
the planned B (i.e., Bp) to the confidence interval as a way to evaluate the planning
process.) Another standard error in the Exhibit 2 output is the “Standard error of
estimate” which equals 60.336. This can be viewed as the standard error of the residuals
(i.e., d’s in Case 3). It is a measure of the dispersion of the d’s. For example, we would
expect 95% of the d’s to lie in the interval -1.96 * 60.336 to 1.96 * 60.336 or between
-118 and +118. The second plot in Exhibit 2 can be inspected with this in mind.
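Both intervals can be checked with a few lines, using the estimates and standard errors reported in Exhibit 2:

    # Figures reported in Exhibit 2.
    B, se_B = 1.246, 0.053      # slope estimate and its standard error
    see = 60.336                # standard error of estimate (dispersion of the d's)

    # Approximate 95% confidence interval for the slope beta.
    ci_low, ci_high = B - 1.96 * se_B, B + 1.96 * se_B   # roughly 1.14 to 1.35

    # Band expected to contain about 95% of the residuals d.
    d_low, d_high = -1.96 * see, 1.96 * see              # roughly -118 to +118
    print((ci_low, ci_high), (d_low, d_high))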
A measure of how well the model fits the data is the “Squared multiple R” (i.e., R²) or
“Adjusted squared multiple R” (see Exhibit 2). The .920 value for R² indicates that 92%
of the variance in the Y’s can be “explained” by the X’s. The highest value possible for
R² is 1.00, so this model fits the data very well. There are other ways to assess the
goodness of the model. For example, the standard error of estimate is useful in
evaluating how well the model predicts. The lower this value, the better the model
predicts.
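Both goodness-of-fit measures can be recovered from the residuals; the sketch below assumes arrays of recorded and predicted expenses are available (the function name fit_summary is just illustrative).

    import numpy as np

    def fit_summary(Yr, Y_pred, n_coeffs=2):
        """R-squared and standard error of estimate computed from recorded
        (Yr) and predicted expenses; n_coeffs counts the coefficients A and B."""
        d = Yr - Y_pred                                  # residuals
        ss_res = np.sum(d ** 2)
        ss_tot = np.sum((Yr - Yr.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        see = np.sqrt(ss_res / (len(Yr) - n_coeffs))
        return r_squared, see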