Statistics 305

ASSESSING GOODNESS OF A LEAST SQUARES FIT

• Basically, the magnitude of the set of residuals indicates relative success or failure. However, it must be remembered that in every least squares fit that includes a constant (intercept) term, the sum of the residuals is zero. This is a mathematical fact. Thus, to look at the magnitude of the residuals, we use the square of each, and often employ the sum of squared residuals as a point indicator.

• A somewhat related point indicator is the coefficient of determination, $R^2$, defined as

$$R^2 = \frac{\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}.$$

The second term in the numerator is the sum of squared residuals. The first term is a measure of total variation in the sample. The interpretation of $R^2$ is “the proportion of the variation in the data which is explained by the fitted function.” An $R^2$ close to 1 is the result of a small sum of squared residuals, which is itself the result of the fitted function being “close” to the data points. This is evidence of a “desirable” fit.

• Graphical indicators of a good fit are:
1. The normal plot of residuals is strongly linear.
2. The plot of residuals versus observation number shows a random-looking scatter about zero, i.e., no pattern.
3. The plot of residuals vs. ŷ’s shows no pattern.
4. The plot of residuals vs. x’s shows no pattern.

All of the above help one decide whether a modification in the general form of the fitted function seems to be indicated. Don’t lose sight of the fact that there is not a correct (as opposed to an incorrect) answer. A least squares fit can be judged to be pretty good or not so pretty good. In the latter case one looks for improvements.

Cautions given in your textbook include:
1. Don’t infer causality. A good fit across x values does not mean that the x variable “causes” the behavior observed in the y’s.
2. A least squares fit can be strongly influenced by one or more outlying data points. In a worst case, the fitted function can essentially be determined by a single erroneous y value.
3. Do not expect that the fitted function will be a good predictor outside the region of x values where observations were made. Quite often it won’t be.
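
The point indicators discussed in this handout can be computed by hand for a small data set. The following is a minimal sketch, assuming a straight-line fit y = b0 + b1·x and using only the Python standard library. The function names (`fit_line`, `goodness_of_fit`) and the data values are made up for illustration and are not taken from the course or textbook. It computes the residuals, their sum, the sum of squared residuals, and R^2, and then changes a single y value to show how strongly one outlying point can move the fitted slope.

```python
def fit_line(xs, ys):
    """Least squares fit of y = b0 + b1*x. Returns (b0, b1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def goodness_of_fit(xs, ys):
    """Residuals, sum of squared residuals, and R^2 for a line fit."""
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    y_bar = sum(ys) / len(ys)
    # Total variation in the sample: sum of (y_i - y_bar)^2
    sst = sum((y - y_bar) ** 2 for y in ys)
    # Sum of squared residuals: sum of (y_i - yhat_i)^2
    sse = sum(r ** 2 for r in residuals)
    r2 = (sst - sse) / sst
    return {"b0": b0, "b1": b1, "residuals": residuals,
            "sum_residuals": sum(residuals), "sse": sse, "r2": r2}


# Hypothetical data, invented for this example only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

result = goodness_of_fit(xs, ys)
print("slope:", result["b1"], "intercept:", result["b0"])
print("sum of residuals (zero up to rounding):", result["sum_residuals"])
print("sum of squared residuals:", result["sse"])
print("R^2:", result["r2"])

# Same data with one erroneous y value, to illustrate outlier influence.
ys_bad = ys[:-1] + [20.1]
result_bad = goodness_of_fit(xs, ys_bad)
print("slope with one erroneous y:", result_bad["b1"])
print("R^2 with one erroneous y:", result_bad["r2"])
```

The sum of residuals prints as zero only up to floating-point rounding, which is why the squared residuals, not the raw ones, are used to judge the fit. The numbers alone do not replace the residual plots listed among the graphical indicators; they only summarize the fit in a single value.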