
YALE School of Management
MGT511: HYPOTHESIS TESTING AND REGRESSION
K. Sudhir
Lecture 7
Multiple Regression: Residual Analysis and Outliers
Residual Analysis
Recall: Use of Transformations with Simple Regression (Lecture 4)
In lecture 4, in the context of simple regression (regression with one x variable), we
suggested that we should use scatter plots of y against x to check if there are nonlinearity
and non-constant variance problems in the data. Note that such plots are informative only
if there is only one relevant predictor variable (if y is affected by many predictors, it
becomes impossible to discover such problems in scatterplots of y against x).
We showed how the presence of nonlinearity and non-constant variance led to the
violation of two of the assumptions about the error term (and hence the residuals) in
regression.¹
(1) Nonlinearity in the relationship between the x and y variables. If the nonlinearity is
ignored, the assumption E(εi) = 0 that we used in regression is violated. This
assumption implies that the expected value of the error term is zero for every
possible value of x. More generally, if we apply the wrong functional form, this
assumption will be violated.
(2) Non-constant variance. The second assumption in regression is Var(εi) = σ²: the
variance of the error term is the same for every possible value of x.
We suggested the following solutions to the nonlinearity and non-constant variance
problems:
Nonlinearity Problem:
We recommended two solutions to the nonlinearity problem:
Solution 1: Add an x² term in addition to the x term. This is especially appropriate if the
marginal effect of interest not only changes in magnitude but also in sign.
y = β0 + β1 x + β2 x²
Solution 2: Under certain conditions (refer to lecture 6 for more details), we suggested
taking the natural logarithm of the x variable.
y = β0 + β1 ln(x), where ln(x) is the natural logarithm of x.
Non-constant Variance Problem:
¹ An additional assumption required in regression (for statistical inference about slopes, intercepts,
and predictions) is that the values of the error term (and hence the residuals) are normally distributed.
Strictly speaking, a violation of this assumption, like a violation of the other assumptions, usually
implies that the model is misspecified (e.g., missing variable, wrong functional form).
We suggested using the logarithm of the y variable to solve the non-constant variance
problem.
ln(y) = β0 + β1 x
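These transformations are easy to set up before running the regression. The sketch below is a minimal illustration in Python with numpy (a stand-in for the Excel steps, using a small made-up data set rather than any data from the lecture); it builds the three transformed models described above.

```python
import numpy as np

# Hypothetical (x, y) data, purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 5.2, 5.9, 6.3, 6.6, 6.8, 6.9])

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_quad = fit_ols(np.column_stack([x, x**2]), y)  # y = b0 + b1*x + b2*x^2
b_logx = fit_ols(np.log(x), y)                   # y = b0 + b1*ln(x)
b_logy = fit_ols(x, np.log(y))                   # ln(y) = b0 + b1*x (non-constant variance fix)
print(b_quad, b_logx, b_logy, sep="\n")
```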
Diagnosing the Need for Transformations with Multiple Regression
In multiple regression, it is difficult and often impossible to detect the types of problems
noted above by looking at the scatter plots of y against x, because the relationship
between y and any particular x variable is typically confounded by the effects of all of the
other x variables on y. For multiple regression, we can examine possible violations of the
assumptions by using residual plots. Residual plots have two advantages: (1) we look for
systematic patterns in the unexplained variation in y against a horizontal line at zero;
(2) the (linear, and perhaps some nonlinear) effects of the x variables have already been
accounted for, so the residuals let us examine whatever violations remain. In this lecture
we focus on the use of residual plots to detect nonlinearity and non-constant variance
problems.
Residuals and Residual Plots with Excel
Before we can diagnose the need for transformations using residual plots, we need to
know how to obtain residuals and residual plots using Excel. Obtaining residuals and
residual plots is relatively simple in Excel.
Obtaining Residual and Residual Plots
For example, suppose we want to run a multiple regression of y against x1 and x1² using
the data shown in the Excel spreadsheet. We use the standard regression dialog box and
enter the y and x variables as always. In addition, you would check the "Residuals",
"Standardized Residuals", and "Residual Plots" checkboxes. Note that there is also a
checkbox for "Line Fit Plots", which we use in the next section when we discuss the
problem of outliers, and one for "Normal Probability Plots", which we do not consider in
this lecture.
Interpreting Residual Output
On running the regression with the above boxes checked, we obtain the usual three-part
output that includes (1) the R-square, standard error of the regression, etc., (2) the
ANOVA table, and (3) the regression coefficients with their p-values, confidence
intervals, etc. In addition, we obtain a table of residuals under the label "Residual
Output", as seen below.
RESIDUAL OUTPUT

Observation   Predicted Y   Residuals       Standard Residuals
1             13            2               1
2             13            -2              -1
3             18            -2              -1
4             18            2               1
5             25            -7.105E-15      -3.55271E-15
The above regression has 5 observations, so there are 5 residuals. The columns in the
table are:
(1) The first column shows the “observation number”
(2) The second column, labeled "Predicted Y" (ŷ), shows the predicted values of Y
based on the regression. These predicted values are obtained for the sample data,
and are often referred to as fitted values. They are given by ŷ = β̂0 + β̂1 x1 + β̂2 x1².
(3) The third column, labeled "Residuals", shows the difference between actual and
predicted (fitted) values, which is given by ε̂ = y − ŷ.
(4) The fourth column, labeled "Standardized Residuals", shows the standardized
residuals, obtained by subtracting out the average of the residuals (which is zero by
definition, as long as the model includes an intercept) and dividing by the standard
deviation of the residuals:

Standardized Residual = (ε̂ − Expected(ε̂)) / Std Dev(ε̂)
For the purposes of identifying nonlinearity and non-constant variance, either residuals or
standardized residuals can be used. However, standardized residuals can be somewhat
more easily interpreted. For example, if the residuals follow a normal distribution, we
know that 95% of the values of the standardized residuals will lie between −1.96 and +1.96.
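The quantities in the "Residual Output" table can also be computed directly from the fitted model. The sketch below does this in Python with numpy; the five-observation data set is hypothetical (a stand-in for the spreadsheet), and the standardized residuals follow the definition given above, which may differ slightly from the exact scaling Excel uses.

```python
import numpy as np

# Hypothetical data standing in for the spreadsheet: y regressed on x1 and x1^2.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y  = np.array([15.0, 11.0, 16.0, 20.0, 25.0])

X = np.column_stack([np.ones_like(x1), x1, x1**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta                                   # "Predicted Y"
resid = y - fitted                                  # "Residuals"
# Subtract the mean residual (zero with an intercept) and divide by the
# standard deviation of the residuals to get "Standard Residuals".
std_resid = (resid - resid.mean()) / resid.std(ddof=1)

print("Obs  Predicted Y  Residual  Std Residual")
for i, (f, r, s) in enumerate(zip(fitted, resid, std_resid), start=1):
    print(f"{i:>3}  {f:11.3f}  {r:8.3f}  {s:12.3f}")
```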
Interpreting Residual Plots
As requested, Excel will provide plots with residuals on the y axis against each of the
included x variables. Thus for example, if we had three variables, x1, x2 and x3 in the
multiple regression, we would obtain three residual plots. If we did a regression of y
against x and x², we would get two residual plots.
Residual plots can be used to diagnose nonlinearity. What we need to check is whether
the residuals "systematically" have an average that is different from zero for any range of
x values. For example, we could "compute" the average residual separately for small,
medium, and large values of a predictor variable.
medium and large values of a predictor variable. The more these averages differ from
zero, the more reason we have to reconsider the model specification. We should
recognize that there will be some random deviations from average values, so we should
be concerned about nonlinearity if we can see clear and systematic variations of these
averages from zero. We use some examples below to illustrate when deviations are
systematic, and then look at a step-by-step approach to diagnose and correct nonlinearity
problems.
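One rough way to "compute" these averages is to bin the residuals by the value of a predictor and compare the bin means with zero. The sketch below does this in Python with numpy; the data, the choice of three bins, and the function name are all illustrative, not part of the lecture.

```python
import numpy as np

def mean_residual_by_bin(x, resid, n_bins=3):
    """Average residual within low/medium/high ranges of a predictor.
    Averages far from zero in some bins hint at nonlinearity."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bin_id = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return [resid[bin_id == b].mean() for b in range(n_bins)]

# Hypothetical example: residuals from a linear fit to a curved relationship,
# which come out positive at the extremes of x and negative in the middle.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 60)
resid = 0.5 * (x - 5.0) ** 2 - 4.0 + rng.normal(0.0, 1.0, size=60)
print(mean_residual_by_bin(x, resid))
```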
Before illustrating a problematic residual plot, we first show examples of "good"
residual plots, where nonlinearity is not a problem. Below we show residual plots from a
regression with two variables, x1 and x2; a look at the plots suggests there are no
systematic deviations from the assumption that the residuals average zero at all values
of x1 and x2. This does not mean that the residuals are exactly zero around every value
of x1 and x2.
[Figure: X1 Residual Plot and X2 Residual Plot, with residuals on the vertical axis against X1 and X2; the points scatter randomly around zero with no systematic pattern.]
Diagnosis and Correction of Nonlinearity Problems
Nonlinearity problems can seriously bias the estimates of a regression that fails to account
for nonlinearity. Hence we should accommodate nonlinearities in the regression model.
Here is a step-by-step approach to detect and correct for nonlinearity problems.
Step 1: Use of Theory or A priori Knowledge of Nonlinearity
Before doing a regression, think about whether there are any theoretical reasons to expect
some nonlinearity between the y and x variables. For example, we noted earlier that the
experience curve (the relationship between marginal cost of production and cumulative
production) is well known to be nonlinear. To accommodate this nonlinearity, we take
the logarithm of both the y and the x variables.
Another example concerns the relationship between Sales and Advertising. While we
would expect Sales to increase with an increase in advertising, we also expect the
marginal effect of advertising to decline at higher levels of advertising. This suggests that
we should use ln(Advertising) rather than advertising directly in the regression (ln =
natural logarithm).
Step 2: If we do not have a priori expectations of nonlinearity, we estimate a linear
regression model and then diagnose possible nonlinearity using residual plots.
For example, if the true relationship is curved, as in the picture on the left, and we fit a
straight regression line, the regression will systematically overpredict the y values for
very small and large values of x, while it systematically underpredicts the y values for
medium values of x. This is highlighted in the residual plot on the right-hand side.

[Figure: left panel shows the data, the fitted regression line, and regions of negative and positive residuals against x; right panel shows the corresponding residual plot against x, with negative residuals at the extremes and positive residuals in the middle.]
In this case we would take an appropriate transformation of the x variable (add an x² term
or use ln(x)) and run the regression again.
Step 3: Look at the residual plots against each of the x- variables with the new regression
and make appropriate transformations on any of the variables until “good” residual plots
emerge.
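Steps 2 and 3 amount to a short fit-diagnose-refit loop. The sketch below runs through it in Python with numpy on made-up data with a curved relationship; in practice one would judge the residual plots visually rather than rely on the summary numbers printed here.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 50)
y = 3 + 2 * x - 0.15 * x**2 + rng.normal(0, 0.3, size=50)   # hypothetical curved relationship

def fit_and_residuals(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Step 2: fit the linear model and inspect the residuals against x.
_, resid_linear = fit_and_residuals(x, y)
# Step 3: add an x^2 term and check the residual pattern again.
_, resid_quad = fit_and_residuals(np.column_stack([x, x**2]), y)

for name, r in [("linear", resid_linear), ("with x^2", resid_quad)]:
    print(f"{name:>9}: mean residual for small x = {r[x < 4].mean():6.3f}, "
          f"large x = {r[x > 7].mean():6.3f}")
```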
Example:
Consider a firm that seeks to estimate the potential profits from new product development
projects that it invests in. It decides to use two explanatory variables (1) R&D
investments used and (2) Managers’ a priori estimate of project risk prior to the beginning
of the project. It uses data from a number of past projects to estimate a regression model.
The data are as follows (Profits and R&D are in tens of thousands of dollars and Risk is
on a 1-10 scale).
PROFIT   R&D   RISK
396      133   9
130       82   8
508      146   10
172       90   8
256      114   7
 31       54   8
102       76   7
102       72   8
536      152   10
102       75   8
214      109   6
200       92   9
158       92   7
 31       61   7
116       78   8
120       90   6
270      106   9
270      112   8
Step 1: Though managers felt that there was some potential nonlinearity between R&D
investments and profits, they did not have a good a priori notion of what this relationship
might be. They decided that they would let the data tell them about any potential
nonlinearity rather than assume a priori a specific type of nonlinear relationship.
Step 2. They estimated a linear regression with Profits as the y variable and R&D and
Risk as the x variables. Residual plots were used to diagnose nonlinearity problems. The
residual plots are shown below.
[Figure: R&D Residual Plot and RISK Residual Plot from the linear regression of Profits on R&D and Risk.]

Looking at the residual plot against R&D, there appears to be evidence of nonlinearity
in the partial relationship between Profits and R&D. For small values of R&D (50-75),
the residuals are systematically positive. For moderate values of R&D (75-125), the
residuals are systematically negative. For larger values (close to 150), they are positive
again.
This nonlinearity suggests that we could include an R&D² term in the regression.
The residual plot against RISK shows no clear pattern. So we run a regression of Profits
against R&D, R&D², and Risk.
Step 3: The residual plots with the new regression are given below:
[Figure: RISK Residual Plot, R&D Residual Plot, and R&D^2 Residual Plot from the regression of Profits on R&D, R&D², and Risk; no systematic patterns remain.]
With the new residual plots, there does not appear to be any systematic nonlinearity
remaining. Hence we stop looking for further transformations and conclude that the
residuals are well behaved. Of course, we would make sure that the new results are
interpretable (i.e., substantively meaningful).
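As a cross-check, the Step 3 model can be fit directly to the project data in the table above. The sketch below does so in Python with numpy in place of Excel; the printed coefficients are simply whatever these data imply and are not figures quoted in the lecture.

```python
import numpy as np

# Project data transcribed from the table above (Profit and R&D in $10,000s, Risk on a 1-10 scale).
profit = np.array([396, 130, 508, 172, 256,  31, 102, 102, 536,
                   102, 214, 200, 158,  31, 116, 120, 270, 270], dtype=float)
rd     = np.array([133,  82, 146,  90, 114,  54,  76,  72, 152,
                    75, 109,  92,  92,  61,  78,  90, 106, 112], dtype=float)
risk   = np.array([  9,   8,  10,   8,   7,   8,   7,   8,  10,
                     8,   6,   9,   7,   7,   8,   6,   9,   8], dtype=float)

# Profit = b0 + b1*R&D + b2*R&D^2 + b3*Risk
X = np.column_stack([np.ones_like(rd), rd, rd**2, risk])
beta, *_ = np.linalg.lstsq(X, profit, rcond=None)
resid = profit - X @ beta

print("coefficients (intercept, R&D, R&D^2, Risk):", np.round(beta, 4))
print("mean residual for R&D < 80, 80-125, > 125:",
      round(resid[rd < 80].mean(), 2),
      round(resid[(rd >= 80) & (rd <= 125)].mean(), 2),
      round(resid[rd > 125].mean(), 2))
```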
Diagnosis and Correction of Non-constant Variance Problems
Unlike nonlinearity, non-constant variance problems do not bias the slope coefficients or
the predicted values. Instead, we obtain incorrect prediction intervals and confidence
intervals for our estimates. To understand why, look at the following picture, where we
plot y against x and the residuals against x.
[Figure: left panel shows y against x; right panel shows the residuals against x. The spread of the residuals grows with x.]
Here we see a clear violation of the constant variance assumption. It appears that the
residuals are much smaller for small values of x than for large values of x. Thus
intuitively, we can be far more confident about our predictions for smaller values of x,
than we can about predictions for larger values of x. However with the constant variance
assumption, the computations assume that the uncertainty with regard to the residuals (the
residual variance) is the same for all values of x.
Hence with a constant variance assumption in this example, we are “under-confident”
about our predictions for small values of x, while we are “over-confident” about our
predictions for large values of x.
As discussed earlier, we can solve this problem by using ln(y) as the dependent variable
in the regression. This enables us to get residual variances that are constant over the
entire range of the x variable.
As we move to multiple regression with several x variables, it is more useful to plot the
residuals not just against each of the x variables, as we did above, but also against the
predicted (fitted) values of the y variable. If the solution to the non-constant variance
problem is simply to transform the y variable, it is reasonable to check whether the
residuals have constant variance at all predicted values of y. However, it is also
possible that a single x variable is responsible for the non-constant variance problem. In
that case, it is useful to identify which x variable can be transformed (via residual plots
against each x variable) to achieve the desired constant residual variance.
Obtaining the plot of residuals against predicted y
Excel does not directly provide a plot of residuals against the predicted y values (ŷ).
However, the Excel “Residual Output” we showed earlier can be used to obtain the plot.
Note the second column of the “Residual Output” had predicted y, the third column had
the residuals, and the fourth column had standardized residuals. We can therefore obtain a
scatter plot of residuals (or standardized residuals) against the predicted y values easily.
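Outside Excel, the same scatter plot can be produced straight from the fitted model. The sketch below (Python with numpy and matplotlib) is a generic helper, not an Excel feature; pass it whatever y vector and X matrix the regression was run on.

```python
import numpy as np
import matplotlib.pyplot as plt

def residual_vs_fitted_plot(X, y):
    """Scatter the standardized residuals against the fitted (predicted) values."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    fitted = X1 @ beta
    resid = y - fitted
    std_resid = resid / resid.std(ddof=1)

    plt.scatter(fitted, std_resid)
    plt.axhline(0.0, linewidth=1)       # reference line at zero
    plt.xlabel("Predicted y")
    plt.ylabel("Standardized residuals")
    plt.show()

# Hypothetical usage with the project data from the earlier example:
# residual_vs_fitted_plot(np.column_stack([rd, rd**2, risk]), profit)
```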
Revisiting the earlier example of Profits versus R&D and Risk
We plot the standardized residuals (ordinary residuals would have been fine as well)
against predicted Profits, and we find no systematic evidence of non-constant variance.
So we conclude there is no need to transform the Profit variable in this regression.
[Figure: standardized residuals plotted against predicted Profits; no systematic change in spread across the range of predicted values.]
Another Approach to correct for non-constant variance: Normalizing the y-variable
So far, we had suggested that non-constant variance can be corrected by taking the natural
logarithm of the y variable. In certain situations, an appropriate transformation is to
normalize the y-variable. We illustrate the idea of normalizing the variable through an
example.
Suppose we had data on sales of a product in different cities, with different kinds of
variables measuring the marketing mix: (1) price, (2) feature, indicating whether the
product was featured in the Sunday newspaper insert, and (3) display, indicating whether
the retailer prominently displayed the product at the end of the aisle. Since cities with
different populations will have different levels of sales, we could consider the following
regression equation:
Sales = β0 + β1 Price + β2 Feature + β3 Display + β4 Population      (1)
However, this regression equation would be plagued by a non-constant variance problem,
because the same price cut should lead to a much greater change in sales in a larger city
(say, New York or Chicago) than in a smaller city (say, New Haven or Des Moines).
Similarly, the use of features or displays should also increase sales much more in a larger
city than in a smaller city. Hence we would see much greater variability in the
residuals for larger cities than we would for smaller cities.
One way to tackle this problem is to take the natural logarithm of sales as the dependent
variable, but this would usually not be sufficient. A better solution becomes clear if we
consider the reason for the non-constant variance problem.
Our intuition suggests that the impact of price or feature or display would be roughly the
same on an individual no matter which town this person lives in. A more appropriate
regression equation can be obtained by taking the dependent variable to be
Sales/Population and estimating the following regression equation:
Sales / Population = β0 + β1 Price + β2 Feature + β3 Display      (2)
Thus the non-constant variance problem can be overcome in some situations where the
problem is due to the effects of scale. By appropriately dividing by the relevant scaling
variable (in this case population), we can solve the non-constant variance problem.
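The sketch below illustrates the scaling argument in Python with numpy. All of the city-level arrays (population, price, feature, display, and the way sales are generated) are made up for the illustration; the point is only that the residual spread from equation (1) grows with city size, while the residual spread from equation (2) does not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
population = rng.uniform(0.1, 10.0, n)              # hypothetical city sizes (millions)
price      = rng.uniform(1.5, 3.0, n)
feature    = rng.integers(0, 2, n).astype(float)    # 0/1 newspaper-feature indicator
display    = rng.integers(0, 2, n).astype(float)    # 0/1 end-of-aisle display indicator

# Per-capita sales respond to the marketing mix; total sales scale with population.
per_capita = 5.0 - 1.2 * price + 0.8 * feature + 0.5 * display + rng.normal(0, 0.2, n)
sales = per_capita * population

def ols_residuals(X, y):
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r1 = ols_residuals(np.column_stack([price, feature, display, population]), sales)   # eq (1)
r2 = ols_residuals(np.column_stack([price, feature, display]), sales / population)  # eq (2)

large = population > np.median(population)
print("eq (1) residual std: small cities", round(float(r1[~large].std()), 2),
      "large cities", round(float(r1[large].std()), 2))
print("eq (2) residual std: small cities", round(float(r2[~large].std()), 2),
      "large cities", round(float(r2[large].std()), 2))
```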
The graphs below show how using the normalized variable corrects the non-constant
variance problem. The graph on the left has the residuals from regression equation (1),
while the graph on the right has the residuals from regression equation (2). The
non-constant variance problem that is evident in the graph on the left is largely absent in
the graph on the right.
[Figure: left panel shows standardized residuals against predicted Sales from equation (1), with clearly increasing spread; right panel shows standardized residuals against predicted per-capita Sales from equation (2), with roughly constant spread.]
A Solution Using Interaction Effects: Why Is This Not as Appealing?
Could we not solve this problem using interaction variables, by estimating the following
regression equation?

Sales = β0 Population + β1 Price × Population + β2 Feature × Population + β3 Display × Population      (3)
Although equations (2) and (3) are mathematically equivalent (equation (3) is simply
equation (2) multiplied through by Population), equation (3) is not as appealing from the
point of view of estimation. The reason is that by multiplying all the variables by
Population, we have introduced a high level of correlation among the X variables used in
the regression. This creates potential "multicollinearity" problems. Hence the
normalization approach is preferred over the use of these interaction variables.
What if some of the X variables may not have effects on a per-capita basis?
Suppose there is an X variable, such as advertising, that is proportional to the population
and that needs to be included in the regression. While other X variables such as Price,
Feature, and Display affect sales per capita (Sales/Population), Advertising is related to
total sales. So if we define the dependent variable to be Sales/Population, how can we
include Advertising in the regression?
The answer is to include Advertising also on a per-capita basis. Such a regression would
have the form:

Sales / Population = β0 + β1 Price + β2 Feature + β3 Display + β4 (Advertising / Population)
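Continuing the hypothetical sketch above, a variable measured in total (city-wide) dollars, such as advertising, would simply be divided by population before joining the per-capita design matrix; the helper below is illustrative only.

```python
import numpy as np

def per_capita_design(price, feature, display, advertising, population):
    """Design matrix for the per-capita regression: Price, Feature and Display
    enter directly, while total Advertising enters per capita."""
    return np.column_stack([price, feature, display, advertising / population])

# Hypothetical usage, continuing the arrays defined in the earlier sketch:
# X = per_capita_design(price, feature, display, advertising, population)
# y = sales / population
```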
When to use Normalization: Common sense helps most of the time!!
The approach of normalizing the dependent variable to solve for the non-constant
variance problem is commonly used when we have effects that should be measured on a
per-capita basis. Recall the homework problem in which we studied the effects of OSHA
on accidents in companies. There we would expect OSHA to reduce accidents more in
companies with more employees and less in companies with fewer employees. The
appropriate dependent variable in that case would not be total accidents in the company,
but Accidents/Employee. We can usually recognize where such normalization is needed
by appealing to common sense. For example, when we pool data from a
large number of countries and try to understand the impact of an environmental policy on
GNP, we would use per-capita GNP rather than total GNP in order to account for vastly
different populations in different countries.
Outliers
Outliers are data points (either x or y values) that are far away from the rest of the data.
We classify them into two types: Y-outliers and X-outliers.
Y-Outliers
A Y-outlier is an observation that lies far away from the prediction line when plotted
against one of the x variables. For example, see the graph below, called a Line Fit Plot
in Excel. This plot can be obtained by checking the "Line Fit Plots" checkbox in Excel.
[Figure: X Line Fit Plot showing Y and Predicted Y against X; one observation (X = 24, Y = 10) lies far below the rest of the data and the fitted line.]
Here the blue dots represent the data (Y) and the pink dots represent the fitted values of
Y. As can be seen from the graph, there is one outlier (Y = 10 at X = 24). This one point
shifts the prediction line (connecting the pink points) considerably from where it would
otherwise have been (roughly through the blue points). Thus Y outliers can seriously
bias the results of the regression.
What to do when we have a Y outlier?
(1) Drop the Outliers
Y outliers may occur due to errors in coding the data. For example, it is common to have
a decimal point in the wrong place. By graphing the data, we are able to recognize
such errors. We can then either correct the error or drop the observation if we cannot
recover the true value.
Another possibility is that the Y outlier is due to a non-representative observation. For
example, if we want to understand the effects of price changes on airline ticket sales, we
may wish to drop the period around September 11, 2001 because this period is quite
unrepresentative of normal behavior of airline passengers.
(2) Y Outliers may require management action
A Y outlier may call for management action. Suppose we are trying to understand the
relationship between salesperson performance and territory characteristics. If we find one
salesperson is an outlier in terms of superior performance, perhaps management should
congratulate him, reward him with a bonus, etc. If the person is an outlier with far below
average performance, it would be useful to send the person a warning or fire him. Clearly
some management action is required in these situations. However, for the purpose of
understanding representative behavior, it might be useful to drop these outlier
observations and then run the regression.
On dropping the Y outlier, we are able to obtain the unbiased regression line as seen in
the Line Fit Plot below.
[Figure: X Line Fit Plot after dropping the Y outlier, showing Y and Predicted Y against X; the fitted line now passes through the remaining data points.]
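A common way to make this check systematic is to flag observations with unusually large standardized residuals, inspect them, and refit only after deciding on substantive grounds that they are data errors or unrepresentative periods. The sketch below is a generic Python/numpy helper; the cutoff of 3 is an arbitrary illustrative choice, not a rule from the lecture.

```python
import numpy as np

def flag_y_outliers(X, y, cutoff=3.0):
    """Return the indices of observations whose absolute standardized
    residual exceeds the (illustrative) cutoff."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    std_resid = (resid - resid.mean()) / resid.std(ddof=1)
    return np.where(np.abs(std_resid) > cutoff)[0]

# Hypothetical usage: inspect the flagged rows before deciding to drop them.
# suspect = flag_y_outliers(x.reshape(-1, 1), y)
# x_clean, y_clean = np.delete(x, suspect), np.delete(y, suspect)
```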
X outliers
X outliers are data points whose values of X are far away from the other observations.
The effect of X outliers is not to bias the estimates, but to make the confidence intervals
appear far tighter than they should be and to inflate the R-square values.
To understand the intuition for why X outliers artificially narrow the confidence intervals,
recall our earlier discussion of which of the two cases below gives us more confidence in
the regression line. In the figure below, the regression on the left is better, because there
is greater variation in the x variable, relative to the regression on the right. This makes us
more confident about the slope of the regression line on the left, compared to the
regression line on the right.

[Figure: two scatter plots of y against x with fitted lines; the left panel has a wide spread of x values, the right panel a narrow spread.]
Also recall the formula for the standard error of the slope in a simple regression:

SE(β̂1) = s / √Sxx,  where Sxx = Σ (xi − x̄)²

This indicates that the standard error of the slope falls when there is greater spread in
the x values (i.e., when Sxx is larger). An X outlier makes the Sxx term much larger and
therefore makes the confidence interval appear much narrower.
Illustrating the Effects of X outliers:
Consider the following data and the line fit plot:

Y_1   X1
13    1
11    1
15    2
13    2
16    3
11    0
 9    0

[Figure: X1 Line Fit Plot showing Y_1 and Predicted Y_1 against X1 (0 to 4); the fitted line passes close to all seven points.]
As can be seen, there are no X outliers in these data. The regression results are given
below:

The regression equation is
Y_1 = 10.0 + 2.00 X1

Predictor   Coef      SE Coef   T       P
Constant    10.0000   0.6622    15.10   0.000
X1          2.0000    0.4019    4.98    0.004

S = 1.095   R-Sq = 83.2%   R-Sq(adj) = 79.8%
As can be seen the standard error of the slope coefficient is 0.4, the t-statistic is 4.98 and
the R-square is 83.2%.
Now consider the same data with one outlier included:

Y_1   X1
13    1
11    1
15    2
13    2
16    3
11    0
 9    0
210   100
The regression equation is
Y_1 = 10.0 + 2.00 X1

Predictor   Coef      SE Coef   T        P
Constant    10.0000   0.3831    26.10    0.000
X1          2.00000   0.01082   184.76   0.000

S = 1.000   R-Sq = 100.0%   R-Sq(adj) = 100.0%
Comparing the results with the previous regression we notice that the standard error of
the slope coefficient is 0.01, the t-statistic is 184.76 and the R-square is 100%.
Intuitively, we understand that one data point should not have so much influence on the
standard errors and the R-square. This happens because the one additional data point has
an X which is far away from the other values of X, causing the Sxx values to increase
dramatically. In this problem, Sxx= 7.4 without the outlier, but with the outlier it
increases to 8533.
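These Sxx figures are easy to verify. The sketch below (Python with numpy) reproduces the 7.4 and roughly 8,533 values from the X1 data above and shows how the standard error of the slope, s/√Sxx, shrinks when the outlier is added (taking s ≈ 1, close to the values reported in both regressions).

```python
import numpy as np

x_no_outlier   = np.array([1, 1, 2, 2, 3, 0, 0], dtype=float)
x_with_outlier = np.append(x_no_outlier, 100.0)

def sxx(x):
    """Sum of squared deviations of x from its mean."""
    return float(((x - x.mean()) ** 2).sum())

print("Sxx without the outlier:", round(sxx(x_no_outlier), 1))    # about 7.4
print("Sxx with the outlier:   ", round(sxx(x_with_outlier), 1))  # about 8533.9

s = 1.0   # roughly the residual standard error in both fits (1.095 and 1.000)
print("approx SE of slope without / with the outlier:",
      round(s / np.sqrt(sxx(x_no_outlier)), 3),
      round(s / np.sqrt(sxx(x_with_outlier)), 4))
```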
However, as stated earlier the X outlier does not bias the estimates. We obtain the same
estimates with and without the X outlier.
[Figure: Regression plot of Y_1 against X1 (0 to 100) with the outlier included; fitted line Y_1 = 10 + 2 X1, S = 1, R-Sq = 100.0%, R-Sq(adj) = 100.0%.]
What to do when we have an X outlier?
As with the Y outlier, it makes sense to drop the X outlier and run the regression without
the outlier.
Y and X outliers
When we have both Y and X outliers, the effect will be to combine both outlier effects.
The Y outlier will cause bias, while the X outlier will cause the estimates to have
narrower confidence intervals as well as higher R-squares.
Y_1   X1
13    1
11    1
15    2
13    2
16    3
11    0
 9    0
210   50
In the previous example, the estimated regression was Y_1 = 10 + 2 X1. Consider the last
data point: when X1 = 50, the predicted value would be Y_1 = 110. However, here we
have Y_1 = 210. Thus we have an X outlier (X1 = 50 is far away from the other X values)
and a Y outlier (the value Y_1 = 210 is far away from what the predicted value would be
without the outlier, 110).
The regression equation is
Y_1 = 7.41 + 4.05 X1

Predictor   Coef      SE Coef   T       P
Constant    7.4147    0.9678    7.66    0.000
X1          4.04547   0.05454   74.17   0.000

R-square = 99.9%
As expected, the estimates are now biased (the intercept is different from 10 and the slope
is different from 2). The standard errors are quite small and the t-statistics become large.
The R-square is close to 100%. Thus with both Y and X outliers, we have bias, artificially
narrow confidence intervals, and an inflated R-square.
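The bias itself is easy to reproduce: fitting the eight points above by least squares (a quick Python/numpy sketch) returns approximately the 7.41 intercept and 4.05 slope shown in the output, instead of the 10 and 2 that describe the rest of the data.

```python
import numpy as np

x = np.array([1, 1, 2, 2, 3, 0, 0, 50], dtype=float)
y = np.array([13, 11, 15, 13, 16, 11, 9, 210], dtype=float)

X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(b0, 2), round(b1, 3))   # roughly 7.41 and 4.045; biased away from 10 and 2
```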