Some notes on regression analysis that may be useful in preparing Case 3

The purpose of these notes is to remind you of some of the properties of regression analysis and to provide some guidance in interpreting the output in Exhibit 2 (p. 85).

Case 3 uses univariate regression analysis, in which one variable (the "independent" variable, delivery volume in this case) is used to estimate another variable (the "dependent" variable, delivery expense in this case). (Although the model in Case 3 has only one independent variable, it is of course possible to have more than one.) The relationship between the two variables is assumed to be linear:

Y = α + βX + ε

Y is the delivery expense (which we want to predict or estimate); α is the line's intercept (which we will view as the fixed component of delivery expense); β is the line's slope (which we will view as the per-unit variable cost component of delivery expense); X is the delivery volume; and ε is a random error term indicating that the relationship between X and Y is not perfectly predictable. The ε's are assumed to be independent across the observations and normally distributed, with a mean of zero.

The model is estimated using data from the company's records. The dependent variable, Yr, is the recorded delivery expense. We denote the estimate of the intercept (α) as A and the estimate of the slope (β) as B. The predicted delivery expense, then, is A + BX. The difference between the predicted delivery expense and the "actual" delivery expense (i.e., Yr) is the prediction error. The coefficients A and B are estimated so that the sum of the squared errors, Σ[Yr – (A + BX)]², is minimized. (The summation is over the observations in the model, in this case the 50 observations from the outlets.) The coefficients A and B are "optimal" in the sense that no other estimates would yield a smaller sum of squared prediction errors.

In the regression output (Exhibit 2), A is 848.964.
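As a sketch of how these least-squares coefficients are computed, the snippet below applies the standard closed-form formulas for A and B. The data are made up for illustration (50 observations generated around an assumed intercept of 850 and slope of 1.25); they are not the case's actual outlet data.

```python
import numpy as np

# Hypothetical data for illustration only -- NOT the 50 outlet
# observations from Case 3.
rng = np.random.default_rng(0)
X = rng.uniform(500, 3000, size=50)            # delivery volumes
Y = 850 + 1.25 * X + rng.normal(0, 60, size=50)  # delivery expenses

# Ordinary least squares: choose A and B to minimize the sum of
# squared errors, sum over observations of [Y - (A + B*X)]^2.
B = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
A = Y.mean() - B * X.mean()

predicted = A + B * X
residuals = Y - predicted   # the prediction errors (the d's)
print(round(A, 1), round(B, 3))
```

Because the intercept is included, the residuals from this fit sum to zero (up to floating-point error), one of the standard properties of least squares.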
That is, the estimated fixed cost per outlet is about $849 (you might consider whether it is reasonable to interpret A as fixed cost). B is 1.246. That is, the estimated per-unit variable cost is $1.246: for every one-unit increase in X (delivery volume), delivery expense is expected to increase by $1.246. In the first plot in Exhibit 2, the best-fitting line through the data points would have a slope of +1.246.

The model coefficients can be used to predict the delivery expenses at individual outlets. For example, for the McGill Univ. outlet, the model would estimate delivery expense to be $849 + $1.246 * 1,975 = $3,310. The recorded expense for this outlet was $3,257, indicating a prediction error (also called a residual, or "d" in Case 3) of –$53. [Note that this calculation of d employs the A and B coefficients estimated from the data and is the computation used for the d plot in Exhibit 2. The case, however, also calculates d using the planned A and B coefficients (i.e., Ap and Bp); see equation (5). Depending on the purpose of the calculation, either way is acceptable.]

Each of the terms in the regression model has an associated standard error. Exhibit 2 indicates that the standard error for A is 101.185 and for B is 0.053. These standard errors can be used to build the usual confidence intervals for the coefficients. For example, an approximate 95% confidence interval for the slope coefficient B is 1.246 ± 1.96 * 0.053, or 1.14 to 1.35. Recall that we can interpret a 95% confidence interval as follows: if we were to conduct this model estimation procedure many times, 95% of the resulting confidence intervals would contain β. (In Case 3, we might want to compare the planned B (i.e., Bp) to the confidence interval as a way to evaluate the planning process.)

Another standard error in the Exhibit 2 output is the "Standard error of estimate," which equals 60.336. This can be viewed as the standard error of the residuals (i.e., the d's in Case 3).
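The McGill prediction, its residual, and the 95% confidence interval for the slope can all be reproduced directly from the Exhibit 2 estimates quoted above:

```python
# Estimates reported in Exhibit 2 (quoted in these notes).
A, B = 848.964, 1.246      # intercept and slope
se_B = 0.053               # standard error of B

# Predicted delivery expense for the McGill Univ. outlet (X = 1975):
x_mcgill = 1975
predicted = A + B * x_mcgill       # about $3,310
d = 3257 - predicted               # residual vs. recorded expense, about -$53

# Approximate 95% confidence interval for the slope beta:
lo, hi = B - 1.96 * se_B, B + 1.96 * se_B   # about 1.14 to 1.35
print(round(predicted), round(d), round(lo, 2), round(hi, 2))
```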
The standard error of estimate is a measure of the dispersion of the d's. For example, we would expect 95% of the d's to lie in the interval –1.96 * 60.336 to +1.96 * 60.336, or between –118 and +118. The second plot in Exhibit 2 can be inspected with this in mind.

A measure of how well the model fits the data is the "Squared multiple R" (i.e., R²) or the "Adjusted squared multiple R" (see Exhibit 2). The .920 value for R² indicates that 92% of the variance in the Y's can be "explained" by the X's. The highest possible value for R² is 1.00, so this model fits the data very well.

There are other ways to assess the goodness of fit of the model. For example, the standard error of estimate is useful in evaluating how well the model predicts: the lower this value, the better the model predicts.
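To make these two summary statistics concrete, the sketch below computes a standard error of estimate and an R² from a handful of hypothetical recorded and predicted expenses (invented for illustration; only the first observation mirrors the McGill figures quoted in these notes):

```python
import numpy as np

# Hypothetical recorded (Y) and predicted (Y_hat) delivery expenses.
Y     = np.array([3257., 2980., 3410., 2750., 3105., 3620.])
Y_hat = np.array([3310., 2940., 3395., 2790., 3080., 3600.])
d = Y - Y_hat                      # residuals (the d's)

n, k = len(Y), 1                   # observations, independent variables
# Standard error of estimate: dispersion of the residuals,
# with n - k - 1 degrees of freedom.
see = np.sqrt(np.sum(d**2) / (n - k - 1))

# R^2: share of the variance in Y explained by the model.
ss_res = np.sum(d**2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(see, 1), round(r2, 3))
```

As the notes suggest, a smaller standard error of estimate and an R² closer to 1.00 both indicate a better-fitting model.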