Lecture 5

Some General Measures of Fit
R² (coefficient of determination). R² measures the proportion of the variation in costs (y)
explained by enrollment (x). This is a commonly cited measure of how well the line “fits” the
data. To make this concrete, imagine we initially impose the condition b_1 = 0 (i.e., a horizontal
line) so that the information provided by x (enrollment) is totally ignored in fitting our line to the
data. The horizontal line that minimizes the sum of squared errors is obtained by taking the
intercept b_0 = ȳ, which results in the line y = ȳ. The actual minimum sum of squared errors is
called SST (total sum of squares), and is given by
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.
Observe that SST is really a measure of the inherent variation in the y-data without making any
adjustments for x. SST will serve as a benchmark by which we will judge future improvements.
If the slope coefficient is now freed to take on non-zero values, then we are permitting costs (y)
to be adjusted for enrollment (x). This should help improve the overall fit of the line because
now we get to choose an intercept and a slope. The line that minimizes the sum of squared errors
will fit at least as well as the line y = ȳ discussed earlier because we have greater flexibility.
The optimal choices for the intercept and slope are denoted by b_0 and b_1, and these are
computed by Excel. These two parameters determine what we refer to as THE least squares line,
which has the equation ŷ = b_0 + b_1 x. The remaining sum of squared errors for this line is called
the error sum of squares and denoted by SSE
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
where ŷ_i = b_0 + b_1 x_i is the y-coordinate of the least squares line at x_i,
and y_i is the observed data value occurring at x_i.
The value ŷ_i = b_0 + b_1 x_i is also called the predicted value for y_i. The term y_i − ŷ_i (=
y_i − [b_0 + b_1 x_i]) is the ith residual. Observe that we must have SSE ≤ SST.
The difference between SST and SSE represents the improvement in SST obtained by including
an unconstrained slope term in the model, i.e., the improvement obtained by adjusting y to
account for x. This difference is termed SSR, the regression sum of squares:
SSR = SST − SSE.
You can think of SSR as measuring the “value added” by including x in the model compared to a
model that does not include x. Technically speaking, SSR measures the amount of variability (or
uncertainty) in y that is eliminated by including x in the model. R² simply converts this into
percentage terms using the formula

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.

This is why people often make the statement “R² is the proportion of variation in y explained by
x.” The remaining error sum of squares SSE is considered “unexplained.” The Excel output
includes SSR, SSE, and SST.
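To make the bookkeeping concrete, here is a minimal Python sketch of computing SST, SSE, SSR, and R² around a fitted line. The (x, y) arrays are hypothetical placeholders; in the HMO example, Excel reports all four quantities for you.

import numpy as np

# Hypothetical data standing in for enrollment (member months) and cost.
x = np.array([1.1e6, 1.9e6, 2.3e6, 3.0e6, 3.4e6])
y = np.array([1.4e8, 2.2e8, 2.6e8, 3.3e8, 3.7e8])

b1, b0 = np.polyfit(x, y, 1)        # least squares slope and intercept
y_hat = b0 + b1 * x                 # predicted values on the fitted line

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares (the benchmark)
sse = np.sum((y - y_hat) ** 2)      # error sum of squares ("unexplained")
ssr = sst - sse                     # regression sum of squares ("value added")

r2 = ssr / sst                      # equivalently, 1 - sse / sst
print(f"R^2 = {r2:.4f}")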
The Standard Error, s
The second measure of fit is the standard error, denoted by s. The notation and formula for this
term will be discussed in greater detail later. For now, we only state that this value measures the
variability (the “spread”) of the data about the line. Smaller is better. Generally speaking, most
data points should fall within 2s of the least squares line, and literally all of the data should fall
within 3s. Practitioners like to look at the ratio s/ȳ so that spread is measured as a percentage of
the average y-value. A ratio less than 10% implies a good fit, although a good fit may not always
have a ratio less than 10%.
Simple Linear Regression: Stochastic Assumptions
If we make additional stochastic assumptions about the random error term in our model, we can
develop some additional statistical insights.
Assumptions. The error ε at any value of x is normally distributed with constant mean μ = 0
and constant variance σ² (in our notation, ε ~ N(0, σ²)). Errors associated with different
observations are independent of one another.
Note that we are really making an infinite number of assumptions concerning an infinite number
of distributions, one for each possible value of x. Another way of stating the assumptions is that
we are assuming the distribution of y for a given value of x is y(x) ~ N(β_0 + β_1 x, σ²).
These assumptions must be checked in practice! For now, we will assume they are true.
Statistical Estimation of Parameters
Our theoretical model specifies the relationship y_i = β_0 + β_1 x_i + ε_i for i = 1, ..., n. The random
error component ε_i ~ N(0, σ²) (a normal distribution with a mean of 0 and constant variance).
An estimate of the standard deviation of the error term is the aforementioned standard error, s ,
whose formula is

s = \sqrt{\frac{SSE}{n - 2}}.
The standard error is printed out on your Excel output, and it measures the spread about the line.
We would like this value to be small. In our HMO example, s = $26,273,660.
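A quick sketch of this formula and the s/ȳ rule of thumb; the SSE and ȳ values below are hypothetical placeholders chosen for illustration (Excel reports s directly).

import numpy as np

sse = 9.66e15        # error sum of squares (hypothetical)
n = 16               # number of observations
y_bar = 2.6e8        # mean of the y-data (hypothetical)

s = np.sqrt(sse / (n - 2))                        # spread about the line
print(f"s = {s:,.0f}, s/ybar = {s / y_bar:.1%}")  # under ~10% suggests a good fit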
A Hypothesis Test for the Slope Coefficient
Observe that our slope estimate b_1 would change with a new sample, thus our computed value is
simply one observation of a random variable (much like X̄, the estimator of μ, is a random
variable). An estimate of the standard deviation of the slope estimator is the standard error of
b_1, defined by
s_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.
The value of s_b1 is given in your Excel output immediately to the right of the slope estimate. In
our HMO example it is s_b1 = 6.752009305. You will never need to compute this on a calculator.
This quantity is used as part of an important hypothesis test regarding the slope that is routinely
performed in most regression analyses. This hypothesis test concerns whether or not the
coefficient (slope) of x is truly different from 0. This is often regarded as testing whether a linear
relationship exists between y and x. Under our stochastic assumptions, one can show that the
quantity
t = \frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2 \text{ df}},

which is the test statistic for the formal hypothesis test

H_0: \beta_1 = 0
H_A: \beta_1 \neq 0.
The null hypothesis is rejected at level α if the computed t-value exceeds, in absolute value, the
critical value of a standard two-tailed t-test with n − 2 degrees of freedom. In this case we
conclude (at level α) that
the inclusion of the slope explains a significant amount of the inherent variation in y, and thus a
linear relationship exists between x and y. The t-statistic and its associated p-value are generated
as part of the Excel output.
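As a check on the Excel output, here is a short sketch reproducing the slope test from the numbers quoted above (b_1 = 91.966899, s_b1 = 6.752009305, n = 16); scipy’s t distribution supplies the critical value and p-value.

from scipy import stats

b1 = 91.966899        # slope estimate (Excel output)
s_b1 = 6.752009305    # standard error of the slope (Excel output)
n = 16                # number of observations

t_stat = (b1 - 0) / s_b1                          # test statistic under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-tailed p-value
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # critical value at alpha = 0.05

print(f"t = {t_stat:.2f}, critical value = {t_crit:.3f}, p = {p_value:.2e}")
# |t| is about 13.6, far beyond 2.145, so we reject H0: the slope is significant.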
Using Regression for Confidence/Prediction Intervals
To predict the cost of healthcare for 100,000 employees, you naturally want to “plug in” the
appropriate x value (100,000 × 12 = 1,200,000 member months) to predict y (cost). The value
1,200,000 is “given” to you and is therefore called the “given value of x,” denoted by x_g.
Plugging in this value in our health care example, we get

Estimated cost = 50,332,853 + 1,200,000 × 91.966899 ≈ $160,693,131.
How accurate is this prediction? First, we need to recall that our least squares line is intended to
estimate expected costs, and in this capacity it is not perfect (it is, after all, an estimate).
Moreover, our theoretical model permits random, unexplained deviations from this line of
expected costs (the error term ε) for individual HMOs. Both of these factors add uncertainty to
our cost prediction. It is therefore customary to use a prediction interval to capture this
uncertainty. A prediction interval includes a margin of error, just like a confidence interval.
A 100(1 − α)% Prediction Interval for a Future Observation at x_g = 1,200,000. Here, we are
trying to estimate where a single value of y will be at a given value of x.

b_0 + b_1 x_g \pm t_{\alpha/2,\, n-2} \cdot s \cdot \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1) s_x^2}}
This is an ugly formula, but we can exploit Excel output to reduce the workload. I will walk you
through the calculations in class. Don’t forget that this is a “plug and chug” problem.
(From Excel’s Regression Output)

SUMMARY OUTPUT
Regression Statistics
  Multiple R           0.964278248
  R Square             0.92983254
  Adjusted R Square    0.924820578
  Standard Error (s)   26273660.36
  Observations         16
(From Excel’s Descriptive Statistics)

TOTAL MEMBER MONTHS (x)
  Mean (x̄)                   2257346.438
  Standard Error (s_x/√n)    251178.1912
  Median                     2056727.5
  Mode                       #N/A
  Standard Deviation (s_x)   1004712.765
  Sample Variance (s_x²)     1.00945E+12
  Count (n)                  16
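Since every ingredient of the prediction interval appears in the two output blocks above, the “plug and chug” step is easy to script. A sketch of the 95% interval (the intercept and slope are the HMO estimates quoted earlier in the text):

import numpy as np
from scipy import stats

b0, b1 = 50332853.0, 91.966899   # least squares intercept and slope
s = 26273660.36                  # standard error (Regression Statistics)
n = 16                           # observations
x_bar = 2257346.438              # mean of x (Descriptive Statistics)
s2_x = 1.00945e12                # sample variance of x
x_g = 1200000.0                  # the given value of x

y_hat = b0 + b1 * x_g                             # point prediction
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)      # t for a 95% interval
margin = t_crit * s * np.sqrt(1 + 1/n + (x_g - x_bar)**2 / ((n - 1) * s2_x))

print(f"prediction: {y_hat:,.0f} +/- {margin:,.0f}")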
Regression Diagnostics
Checking the regression assumptions and diagnosing data problems is an essential step in any
regression analysis. Inspecting the residuals is the main feature of this process. We may find
evidence that our assumptions are supported or violated. In the latter case, we must take some
sort of corrective action. We use the residuals to check for:
1. Homoscedasticity (or equal variance) (a Model Assumption)
2. Independence (a Model Assumption)
3. Normality (a Model Assumption)
4. Outliers and influential points (a Data Issue)
Homoscedasticity (Model Assumption)
A visual inspection of the residual plot does not reveal any violations of the homoscedasticity
assumption.
Independence (Model Assumption)
Check the residual plots for “randomness.” You should plot the residuals against different x-axes
and see if any patterns emerge.¹ Looking for runs (consecutive residuals with the same sign) is
helpful. For example, if the signs of the residuals exhibit the pattern + + + + + + − − − − − −, then
there is a problem (only 2 runs, which is too few). Conversely, if the signs follow the pattern
+ − + − + − + − + − + −, then there is a problem (12 runs, which is too many). The residual plot for our
healthcare example does not exhibit any patterns.
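Counting runs by eye is error-prone, so here is a tiny sketch that counts them from the residual signs; the residual values below are hypothetical, chosen to exhibit the “too few runs” pattern from the text.

import numpy as np

residuals = np.array([1.2, 0.8, 0.3, 0.9, 1.1, 0.4,
                      -0.7, -1.0, -0.2, -0.9, -1.3, -0.5])

signs = np.sign(residuals)
runs = 1 + np.count_nonzero(signs[1:] != signs[:-1])  # each sign change starts a run
print(f"{runs} runs among {len(signs)} residuals")    # prints: 2 runs among 12 residuals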
Checking Normality with a Histogram (Model Assumption)
An effective visual test for normality is to simply build a histogram of the residuals. For our
healthcare example, the histogram (using Excel) is provided below.
[Figure: Histogram of the standardized residuals (“Stud. Res.”), with Frequency (0–6) on the
vertical axis and bins at −2.5, −1.5, −0.5, 0.5, 1.5, 2.5, and “More” on the horizontal axis.]
I used the standard residuals given by Excel. Standard residuals are residuals that have been
rescaled to have a mean of 0 and a standard deviation of approximately 1. With small samples
such as n=16, we are fairly liberal in our assessment of normality. The overall shape is
acceptable.
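If you want to reproduce the binning outside Excel, the following sketch rescales raw residuals into approximate standard residuals and counts them per bin. The residual values are hypothetical placeholders, and Excel’s exact “standard residual” scaling differs slightly from this simple version.

import numpy as np

raw = np.array([3.1e7, -1.2e7, 5.0e6, -2.5e7, 1.8e7, -8.0e6, 2.2e7, -3.0e7,
                9.0e6, -4.0e6, 1.1e7, -1.6e7, 6.0e6, 2.7e7, -5.8e7, 2.0e7])

std_res = (raw - raw.mean()) / raw.std(ddof=1)             # mean 0, sd ~1
bins = [-np.inf, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, np.inf]  # the bins used above
counts, _ = np.histogram(std_res, bins=bins)
print(dict(zip(["< -2.5", "-1.5", "-0.5", "0.5", "1.5", "2.5", "More"], counts)))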
Outliers and Influential Points
Residuals that are unusually large (positive or negative) correspond to “outliers,” i.e., data points
that do not conform to the rest of the data set. Outliers arise from several sources: data errors,
model misspecification, and chance. The Excel output doesn’t reveal any significant outliers, but
one residual (the one having value −2.2) is borderline and probably should be double-checked for
a transcription error.
¹ In data collected over time (called time series data), it is customary to include a plot of the residuals versus time.
Individual data points can also exert a great deal of influence on the overall regression results (the
R2, the estimated coefficients, etc.). In extreme cases, a single influential data point can be
driving the regression results. This is problematic. One simple way to assess whether a data
point is influential is to remove it and observe the impact it has on the overall regression results.
Data points that are potentially influential are typically characterized by two properties: (1) they
are positioned far away from the other data points and (2) they have a fairly large residual. The
product of these two factors determines whether a data point is influential or not. Inspection of
the residual plots for points satisfying both (1) and (2) usually identifies any influential points.
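A minimal sketch of the remove-and-refit check described above, using hypothetical data: drop each point in turn and watch how much the fitted slope moves. Here the last point sits far from the rest, so removing it should shift the slope noticeably if it is influential.

import numpy as np

x = np.array([1.1e6, 1.9e6, 2.3e6, 3.0e6, 3.4e6, 5.9e6])
y = np.array([1.4e8, 2.2e8, 2.6e8, 3.3e8, 3.7e8, 7.2e8])

b1_full, _ = np.polyfit(x, y, 1)          # slope with all points included
for i in range(len(x)):
    keep = np.arange(len(x)) != i         # leave point i out
    b1_i, _ = np.polyfit(x[keep], y[keep], 1)
    print(f"drop point {i}: slope moves by {b1_i - b1_full:+.2f}")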
Assignment #5

1. Book, Simple Linear Regression, Problem #21 (p. 584)
2. Book, Simple Linear Regression, Problem #29 (p. 595)
3. Book, Simple Linear Regression, Problem #31 (p. 595)
4. Book, Simple Linear Regression, Problem #43 (p. 610)
5. Book, Simple Linear Regression, Problem #53 (p. 627)
6. Simple Linear Regression, Case 2 (US Department of Transportation: Driver Fatalities,
   pages 639-640). Develop an appropriate regression model and answer the following
   questions:
   (a) Does the model fit reasonably well? (Do not attach any Excel output)
   (b) Is there a linear relationship between fatalities and the percentage of drivers under 21?
       (Do not attach any Excel output)
   (c) Are the regression assumptions for the error term satisfied? Display the residual
       plot(s) and comment on independence and homoscedasticity. Display a histogram
       of residuals and comment on the normality assumption.² (Supply a residual plot
       and a histogram)
   (d) Are there any influential points? Look at the residual plots and make a judgment
       call. If you think a point is influential, remove it and re-run the regression model.
       (Do not attach any Excel output)
² You may want to play with the bin separators in the histogram menu. I usually try {-2.5, -1.5, -0.5, +0.5, +1.5, +2.5} for small data sets. I usually try {-3, -2.5, -2, -1.5, -1, -0.5, 0, +0.5, +1, +1.5, +2, +2.5, +3} for larger data sets.