Some General Measures of Fit

$R^2$ (coefficient of determination). $R^2$ measures the proportion of the variation in costs ($y$) explained by enrollment ($x$). This is a commonly cited measure of how well the line "fits" the data. To make this concrete, imagine we initially impose the condition $b_1 = 0$ (a horizontal line) so that the information provided by $x$ (enrollment) is totally ignored in fitting our line to the data. The horizontal line that minimizes the sum of squared errors is obtained by taking the intercept $b_0 = \bar{y}$, which results in the line $y = \bar{y}$. The resulting minimum sum of squared errors is called SST (the total sum of squares), and is given by

$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

Observe that SST is really a measure of the inherent variation in the $y$-data without making any adjustments for $x$. SST will serve as a benchmark by which we will judge future improvements.

If the slope coefficient is now freed to take on nonzero values, then we are permitting costs ($y$) to be adjusted for enrollment ($x$). This should improve the overall fit of the line because now we get to choose both an intercept and a slope. The line that minimizes the sum of squared errors will fit at least as well as the line $y = \bar{y}$ discussed earlier because we have greater flexibility. The optimal choices for the intercept and slope are denoted by $b_0$ and $b_1$, and these are computed by Excel. These two parameters determine what we refer to as THE least squares line, which has the equation $\hat{y} = b_0 + b_1 x$. The remaining sum of squared errors for this line is called the error sum of squares and denoted by SSE:

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $\hat{y}_i = b_0 + b_1 x_i$ is the y-coordinate of the least squares line at $x_i$, and $y_i$ is the observed data value occurring at $x_i$. The value $\hat{y}_i = b_0 + b_1 x_i$ is also called the predicted value for $y_i$. The term $y_i - \hat{y}_i$ (that is, $y_i - [b_0 + b_1 x_i]$) is the $i$th residual. Observe that we must have $\text{SSE} \le \text{SST}$.

The difference between SST and SSE represents the improvement in SST obtained by including an unconstrained slope term in the model, i.e., the improvement obtained by adjusting $y$ to account for $x$. This difference is termed SSR, the regression sum of squares: $\text{SSR} = \text{SST} - \text{SSE}$. You can think of SSR as measuring the "value added" by including $x$ in the model compared to a model that does not include $x$. Technically speaking, SSR measures the amount of variability (or uncertainty) in $y$ that is eliminated by including $x$ in the model. $R^2$ simply converts this into percentage terms using the formula

$$R^2 = \frac{\text{SSR}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.$$

This is why people often make the statement "$R^2$ is the proportion of variation in $y$ explained by $x$." The remaining error sum of squares SSE is considered "unexplained." The Excel output includes SSR, SSE, and SST.

The Standard Error, $s$. The second measure of fit is the standard error, denoted by $s$. The notation and formula for this term will be discussed in greater detail later. For now, we simply state that this value measures the variability (the "spread") of the data about the line. Smaller is better. Generally speaking, most data points should fall within $2s$ of the least squares line, and virtually all of the data should fall within $3s$. Practitioners like to look at the ratio $s / \bar{y}$ so that spread is measured as a percentage of the average $y$-value. A ratio below 10% suggests a good fit, although a good fit may not always have a ratio below 10%.
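To see how these pieces fit together, here is a minimal Python sketch (using NumPy) that computes SST, SSE, SSR, $R^2$, and $s$ directly from the definitions. The data values are hypothetical stand-ins, not the actual HMO data:

    import numpy as np

    # Hypothetical data: enrollment in member months (x) and cost in dollars (y).
    x = np.array([1.1e6, 1.8e6, 2.3e6, 3.0e6, 3.9e6])
    y = np.array([1.4e8, 2.1e8, 2.6e8, 3.2e8, 4.1e8])
    n = len(y)

    # Least squares slope b1 and intercept b0.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x                # predicted values on the least squares line

    SST = np.sum((y - y.mean()) ** 2)  # total sum of squares (the benchmark)
    SSE = np.sum((y - y_hat) ** 2)     # error sum of squares ("unexplained")
    SSR = SST - SSE                    # regression sum of squares ("value added" by x)
    R2  = SSR / SST                    # equivalently, 1 - SSE/SST
    s   = np.sqrt(SSE / (n - 2))       # standard error (formula given in the next section)

    print(R2, s, s / y.mean())         # fit, spread, and spread relative to the mean of y

Excel reports these same quantities in its ANOVA and Regression Statistics tables.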
Simple Linear Regression: Stochastic Assumptions

If we make additional stochastic assumptions about the random error term in our model, we can develop some additional statistical insights.

Assumptions. The error $\varepsilon$ at any value of $x$ is normally distributed with constant mean 0 and constant variance $\sigma^2$ (in our notation, $\varepsilon \sim N(0, \sigma^2)$). Errors associated with different observations are independent of one another. Note that we are really making an infinite number of assumptions concerning an infinite number of distributions, one for each possible value of $x$. Another way of stating the assumptions is that the distribution of $y$ for a given value of $x$ is

$$y(x) \sim N(\beta_0 + \beta_1 x, \sigma^2).$$

These assumptions must be checked in practice! For now, we will assume they are true.

Statistical Estimation of Parameters

Our theoretical model specifies the relationship $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for $i = 1, \ldots, n$, where the random error component $\varepsilon_i \sim N(0, \sigma^2)$ (a normal distribution with a mean of 0 and constant variance). An estimate of the standard deviation of the error term is the aforementioned standard error, $s$, whose formula is

$$s = \sqrt{\frac{\text{SSE}}{n-2}}.$$

The standard error is printed on your Excel output, and it measures the spread about the line. We would like this value to be small. In our HMO example, $s = \$26{,}273{,}660$.

A Hypothesis Test for the Slope Coefficient

Observe that our slope estimate $b_1$ would change with a new sample; thus our computed value is simply one observation of a random variable (much like $\bar{X}$, the estimator of $\mu$, is a random variable). An estimate of the standard deviation of the slope estimator is the standard error of $b_1$, defined by

$$s_{b_1} = \frac{s}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}.$$

The value of $s_{b_1}$ is given in your Excel output immediately to the right of the slope estimate. In our HMO example it is $s_{b_1} = 6.752009305$. You will never need to compute this on a calculator.

This quantity is used as part of an important hypothesis test regarding the slope that is routinely performed in most regression analyses. This hypothesis test concerns whether or not the coefficient (slope) of $x$ is truly different from 0. This is often regarded as testing whether a linear relationship exists between $y$ and $x$. Under our stochastic assumptions, one can show that the quantity

$$t = \frac{b_1}{s_{b_1}} \sim t_{n-2\ \text{df}},$$

which is the test statistic for the formal hypothesis test

$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0.$$

The null hypothesis is rejected at level $\alpha$ if the computed t-value exceeds, in absolute value, the critical value in a standard two-tailed t-test with $n-2$ degrees of freedom. In this case we conclude (at level $\alpha$) that the inclusion of the slope explains a significant amount of the inherent variation in $y$, and thus a linear relationship exists between $x$ and $y$. The t-statistic and its associated p-value are generated as part of the Excel output.
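Continuing the sketch from the measures-of-fit section (it reuses the hypothetical arrays x and y along with b1, s, and n defined there), the slope test takes only a few lines; SciPy is assumed to be available for the t-distribution:

    import numpy as np
    from scipy import stats

    # Standard error of the slope estimator b1.
    s_b1 = s / np.sqrt(np.sum((x - x.mean()) ** 2))

    # Test statistic for H0: beta1 = 0 versus HA: beta1 != 0.
    t_stat  = b1 / s_b1                                # ~ t with n-2 df under H0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-tailed p-value

    # Reject H0 at level alpha when p_value < alpha (equivalently, when
    # |t_stat| exceeds the critical value t_{alpha/2, n-2}).
    print(t_stat, p_value)

These are the same t-statistic and p-value that Excel prints next to the slope estimate.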
Using Regression for Confidence/Prediction Intervals

To predict the cost of healthcare for 100,000 employees, you naturally want to "plug in" the appropriate $x$ value ($100{,}000 \times 12 = 1{,}200{,}000$ member months) to predict $y$ (cost). The value 1,200,000 is "given" to you and is therefore called the "given value of x," denoted by $x_g$. Plugging in this value in our health care example, we get

Estimated cost = $50{,}332{,}853 + 1{,}200{,}000 \times 91.966899 \approx \$160{,}693{,}132$.

How accurate is this prediction? First, we need to recall that our least squares line is intended to estimate expected costs, and in this capacity it is not perfect (it is, after all, an estimate). Moreover, our theoretical model permits random, unexplained deviations from this line of expected costs (the error term $\varepsilon$) for individual HMOs. Both of these factors add uncertainty to our cost prediction. It is therefore customary to use a prediction interval to capture this uncertainty. A prediction interval includes a margin of error, just like a confidence interval.

A $100(1-\alpha)\%$ Prediction Interval for a Future Observation at $x_g = 1{,}200{,}000$. Here, we are trying to estimate where a single value of $y$ will be at a given value of $x$:

$$b_0 + b_1 x_g \pm t_{\alpha/2,\, n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)\, s_x^2}}.$$

This is an ugly formula, but we can exploit the Excel output to reduce the workload. I will walk you through the calculations in class. Don't forget that this is a "plug and chug" problem.

(From Excel's Regression Output)

SUMMARY OUTPUT
Regression Statistics
Multiple R           0.964278248
R Square             0.92983254
Adjusted R Square    0.924820578
Standard Error       26273660.36      ($s$)
Observations         16

(From Excel's Descriptive Statistics)

TOTAL MEMBER MONTHS (x)
Mean                 2257346.438      ($\bar{x}$)
Standard Error       251178.1912      ($s_x / \sqrt{n}$)
Median               2056727.5
Mode                 #N/A
Standard Deviation   1004712.765      ($s_x$)
Sample Variance      1.00945E+12      ($s_x^2$)
Count                16               ($n$)
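Since this is a plug-and-chug problem, the interval can be assembled directly from the Excel values above. Here is a sketch in Python; the intercept and slope are the rounded values quoted in the prediction example, so the resulting interval is approximate:

    import numpy as np
    from scipy import stats

    n     = 16               # Observations / Count
    b0    = 50_332_853.0     # intercept (from the regression output, rounded)
    b1    = 91.966899        # slope (from the regression output, rounded)
    s     = 26_273_660.36    # Standard Error of the regression
    x_bar = 2_257_346.438    # Mean of x
    s2_x  = 1.00945e12       # Sample Variance of x
    x_g   = 1_200_000        # given value of x
    alpha = 0.05             # for a 95% prediction interval

    point  = b0 + b1 * x_g                       # predicted cost at x_g
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    margin = t_crit * s * np.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * s2_x))

    print(f"{point - margin:,.0f} to {point + margin:,.0f}")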
Regression Diagnostics

Checking the regression assumptions and diagnosing data problems is an essential step in any regression analysis. Inspecting the residuals is the main feature of this process. We may find evidence that our assumptions are supported or violated. In the latter case, we must take some sort of corrective action. We use the residuals to check for:

1. Homoscedasticity (or equal variance) (a Model Assumption)
2. Independence (a Model Assumption)
3. Normality (a Model Assumption)
4. Outliers and influential points (a Data Issue)

Homoscedasticity (Model Assumption)

A visual inspection of the residual plot does not reveal any violations of the homoscedasticity assumption.

Independence (Model Assumption)

Check the residual plots for "randomness." You should plot the residuals against different x-axes and see if any patterns emerge.¹ Looking for runs (consecutive residuals with the same sign) is helpful. For example, if the signs of the residuals exhibit the pattern + + + + + + - - - - - -, then there is a problem (only 2 runs, which is too few). Conversely, if the signs follow the pattern + + - + - + - + - + -, then there is a problem (10 runs among 11 residuals, which is too many). The residual plot for our healthcare example does not exhibit any patterns.

Checking Normality with a Histogram (Model Assumption)

An effective visual test for normality is to simply build a histogram of the residuals. For our healthcare example, the histogram (using Excel) is provided below.

[Figure: Excel histogram of the standardized residuals ("Stud. Res."), with bins at -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, and More; frequencies range from 0 to 6.]

I used the standard residuals given by Excel. Standard residuals are residuals that have been rescaled to have a mean of 0 and a standard deviation of approximately 1. With small samples such as n = 16, we are fairly liberal in our assessment of normality. The overall shape is acceptable.

Outliers and Influential Points

Residuals that are unusually large (positive or negative) correspond to "outliers," i.e., data points that do not conform to the data set. Outliers arise from several sources: data errors, model misspecification, and chance. The Excel output doesn't reveal any significant outliers, but one residual (the one having value -2.2) is borderline and probably should be double-checked for a transcription error.

Individual data points can also exert a great deal of influence on the overall regression results (the R², the estimated coefficients, etc.). In extreme cases, a single influential data point can be driving the regression results. This is problematic. One simple way to assess whether a data point is influential is to remove it and observe the impact it has on the overall regression results (see the sketch at the end of this handout). Data points that are potentially influential are typically characterized by two properties: (1) they are positioned far away from the other data points and (2) they have a fairly large residual. The product of these two factors determines whether a data point is influential or not. Inspection of the residual plots for points satisfying both (1) and (2) usually identifies any influential points.

Assignment #5

1. Book, Simple Linear Regression, Problem #21 (p. 584)
2. Book, Simple Linear Regression, Problem #29 (p. 595)
3. Book, Simple Linear Regression, Problem #31 (p. 595)
4. Book, Simple Linear Regression, Problem #43 (p. 610)
5. Book, Simple Linear Regression, Problem #53 (p. 627)
6. Simple Linear Regression, Case 2 (US Department of Transportation: Driver Fatalities, pages 639-640). Develop an appropriate regression model and answer the following questions:
   (a) Does the model fit reasonably well? (Do not attach any Excel output)
   (b) Is there a linear relationship between fatalities and the percentage of drivers under 21? (Do not attach any Excel output)
   (c) Are the regression assumptions for the error term satisfied? Display the residual plot(s) and comment on independence and homoscedasticity. Display a histogram of residuals and comment on the normality assumption.² (Supply a residual plot and a histogram)
   (d) Are there any influential points? Look at the residual plots and make a judgment call. If you think a point is influential, remove it and re-run the regression model. (Do not attach any Excel output)

¹ In data collected over time (called time series data), it is customary to include a plot of the residuals versus time.

² You may want to play with the bin separators in the histogram menu. I usually try {-2.5, -1.5, -0.5, +0.5, +1.5, +2.5} for small data sets and {-3, -2.5, -2, -1.5, -1, -0.5, 0, +0.5, +1, +1.5, +2, +2.5, +3} for larger data sets.
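For the remove-and-refit influence check referenced above (useful for question 6(d)), here is a minimal Python sketch. It assumes the data are already loaded into NumPy arrays x and y; the index of the suspect point (3) is purely illustrative:

    import numpy as np

    def fit(x, y):
        # Return (b0, b1, R2) for a simple least squares fit.
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        b0 = y.mean() - b1 * x.mean()
        SSE = np.sum((y - (b0 + b1 * x)) ** 2)
        SST = np.sum((y - y.mean()) ** 2)
        return b0, b1, 1 - SSE / SST

    full    = fit(x, y)
    reduced = fit(np.delete(x, 3), np.delete(y, 3))  # refit without the suspect point
    print(full)
    print(reduced)   # large swings in the slope or R2 suggest an influential point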