Prediction, Correlation, and Lack of Fit in Regression (§11.4, 11.5, 11.7)

Outline
• Confidence interval and prediction interval.
• Regression Assumptions.
• Checking Assumptions (model adequacy).
• Correlation.
• Influential observations.

Prediction

Our regression model is
$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2),$$
so that the average value of the response at X=x is
$$E[y_x] = \beta_0 + \beta_1 x.$$

  i    Number of components x_i    Repair time y_i
  1     1                           23
  2     2                           29
  3     4                           64
  4     4                           72
  5     4                           80
  6     5                           87
  7     6                           96
  8     6                          105
  9     8                          127
 10     8                          119
 11     9                          145
 12     9                          149
 13    10                          165
 14    10                          154

The estimated average response at X=x is therefore
$$\hat{E}[y_x] = \hat{y}_x = \hat{\beta}_0 + \hat{\beta}_1 x.$$
This estimates the expected value. It is a statistic, hence a random variable with a sampling distribution. Under the regression assumptions, that sampling distribution is normal.

Sample estimate, and associated variance:
$$\hat{y}_x = \hat{\beta}_0 + \hat{\beta}_1 x, \qquad Var[\hat{y}_x] = \hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right), \qquad \hat{\sigma}^2 = MSE.$$
A (1-α)100% CI for the average response at X=x is therefore:
$$\hat{y}_x \pm t_{n-2,\,\alpha/2} \sqrt{Var[\hat{y}_x]}$$

Prediction and Predictor Confidence

The best predictor of an individual response y at X=x, $y_{x,pred}$, is simply the average response at X=x:
$$\hat{y}_{x,pred} = \hat{\beta}_0 + \hat{\beta}_1 x.$$
The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables: they vary from sample to sample. Hence the predicted value is also a random variable.

A (1-α)100% prediction interval for an individual response at X=x:
$$\hat{y}_{x,pred} \pm t_{n-2,\,\alpha/2} \sqrt{Var[\hat{y}_{x,pred}]}, \qquad Var[\hat{y}_{x,pred}] = \hat{\sigma}^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{S_{xx}} \right).$$
Variance associated with an individual prediction is larger than that for the mean value! Why? A single new observation carries its own error variance $\sigma^2$ in addition to the uncertainty in the estimated mean, which is the extra 1 inside the parentheses.

Prediction band: what we would expect for one new observation.
Confidence band: what we would expect for the mean of many observations taken at the value X=x.

Regression Assumptions and Lack of Fit

Regression Model Assumptions
$$y_i = E(y_i) + \varepsilon_i$$
• Effect additivity (multiple regression)
• Normality of the residuals
• Homoscedasticity of the residuals
• Independence of the residuals

Additivity

Additivity assumption.
$$E(y_i) = \beta_0 + \beta_1 x_i$$
“The expected value of an observation is a weighted linear combination of a number of factors.”
Which factors?
(model uncertainty)
• number of factors in the model
• interactions of factors
• powers or transformations of factors

Homoscedasticity and Normality

Observations never equal their expected values.
$$y_i = E(y_i) + \varepsilon_i, \qquad E(\varepsilon_i) = 0$$
No systematic biases.

Homoscedasticity assumption.
$$Var(\varepsilon_i) = \sigma^2$$
The unexplained component has a common variance for all values i.

Normality assumption.
$$\varepsilon_i \sim N(0, \sigma^2)$$
The unexplained component has a normal distribution.

Independence

$$y_i = E(y_i) + \varepsilon_i$$
Independence assumption.
$$Corr(\varepsilon_i, \varepsilon_j) = Corr(y_i, y_j) = 0, \quad \text{for } i \neq j.$$
Responses in one experimental unit are not correlated with, affected by, or related to, responses for other experimental units.

Correlation Coefficient

A measure of the strength of the linear relationship between two variables.

Product Moment Correlation Coefficient
$$corr(x, y) = r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$
In SLR, r is related to the slope of the fitted regression equation:
$$r = \hat{\beta}_1 \sqrt{\frac{S_{xx}}{S_{yy}}}$$
$r^2$ (or $R^2$) represents the proportion of total variability of the Y-values that is accounted for by the linear regression with the independent variable X:
$$r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{SSR}{TSS}$$
$R^2$: Proportion of variability in Y explained by X.

Properties of r

1. r lies between -1 and +1.
   r > 0 indicates a positive linear relationship.
   r < 0 indicates a negative linear relationship.
   r = 0 indicates no linear relationship.
   r = ±1 indicates a perfect linear relationship.
2. The larger the absolute value of r, the stronger the linear relationship.
3. $r^2$ lies between 0 and 1.

Checking Assumptions

How well does the model fit?
Do predicted values seem to be placed in the middle of observed values?
Do residuals satisfy the regression assumptions? (Problems seen in the plot of X vs. Y will be reflected in the residual plot.)
• Constant variance?
• Regularities suggestive of lack of independence or a more complex model?
• Poorly fit observations?

[Figure: plot of y versus x.]

Model Adequacy

Studentized residuals ($e_i$)
$$e_i = \frac{\hat{\varepsilon}_i}{\sqrt{MSE_{(i)} (1 - h_i)}}$$
Allows us to gauge whether the residual is too large.
Each studentized residual should have approximately a standard normal distribution (under the model it follows a t distribution with n - p - 1 degrees of freedom), hence it is very unlikely that any studentized residual will be outside the range [-3,3].
$MSE_{(i)}$ is the MSE calculated with observation i left out of the computations.
$h_i$ is the ith diagonal of the projection matrix for the predictor space (ith hat diagonal element).

Normality of residuals

Formal goodness-of-fit tests:
• Kolmogorov-Smirnov Test
• Shapiro-Wilk Test (n < 50)
• D’Agostino’s Test (n ≥ 50)
All quite conservative: they fail to reject the hypothesis of normality more often than they should.

Graphical Approach: Quantile-quantile plot (qq-plot)
1. Compute and sort the simple residuals e[1], e[2], …, e[n].
2. Associate to each residual a standard normal quantile, z[i] = normsinv((i-.5)/n), where normsinv is the spreadsheet function for the inverse standard normal CDF.
3. Plot z[i] versus e[i]. Compare to the 45° line.

Influence Diagnostics (Ways to detect influential observations)

Does a particular observation consisting of a pair of (X,Y) values (a case) have undue influence on the fit of the regression model? That is, which cases are greatly affecting the estimates of the p regression parameters in the model? (For simple linear regression p=2.)

Standardized/Studentized Residuals. The $e_i$ are used to detect cases that are outlying with respect to their Y values. Check cases with $|e_i| > 2$ or 3.

Hat diagonal elements. The $h_i$ are used to detect cases that are outlying with respect to their X values. Check cases with $h_i > 2p/n$.

Dffits. Measures the influence that the ith case has on the ith fitted value. Compares the ith fitted value with the ith fitted value obtained by omitting the ith case. Check cases for which $|\text{Dffits}| > 2\sqrt{p/n}$.

Cook’s Distance. Similar to Dffits, but considers instead the influence of the ith case on all n fitted values. Check when Cook’s Dist $> F_{p,\,n-p,\,0.50}$.

Covariance Ratio. The change in the determinant of the covariance matrix that occurs when the ith case is deleted. Check cases with $|\text{Cov Ratio} - 1| \geq 3p/n$.

Dfbetas.
A measure of the influence of the ith case on each estimated regression parameter. For each regression parameter, check cases with $|\text{Dfbeta}| > 2/\sqrt{n}$.

Cutoffs: Hat=0.29, CovRatio=0.43, Dffits=0.76, Dfbetas=0.53

[Figure: Regression Plot of Y versus X. Y = 7.71100 + 15.1982 X, S = 6.43260, R-Sq = 98.1 %, R-Sq(adj) = 98.0 %]

[Figure: Normal Probability Plot of the Residuals (response is Y). Normal Score versus Residual.]

[Figure: Residuals Versus the Fitted Values (response is Y). Residual versus Fitted Value; Obs 1, Obs 2, and Obs 5 are labeled.]
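
The quantities in these notes can be checked numerically against the repair-time data from the Prediction section. What follows is only a minimal sketch in plain Python, not the software that produced the plots. All variable and function names are hypothetical. The critical value `T_CRIT` is an assumed tabled t quantile for 12 degrees of freedom at α = 0.05, and the leave-one-out MSE uses the standard deletion identity, which the notes do not state.

```python
import math
from statistics import NormalDist

# Repair-time data from the Prediction section of these notes.
x = [1, 2, 4, 4, 4, 5, 6, 6, 8, 8, 9, 9, 10, 10]                      # number of components
y = [23, 29, 64, 72, 80, 87, 96, 105, 127, 119, 145, 149, 165, 154]  # repair time

n, p = len(x), 2                       # p = number of regression parameters in SLR
xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Syy = sum((yi - ybar) ** 2 for yi in y)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = Sxy / Sxx                         # slope estimate
b0 = ybar - b1 * xbar                  # intercept estimate
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
SSE = sum(e ** 2 for e in resid)
MSE = SSE / (n - 2)                    # estimate of sigma^2
r2 = Sxy ** 2 / (Sxx * Syy)            # proportion of variability in Y explained by X

T_CRIT = 2.179                         # assumption: tabled t quantile, 12 df, alpha/2 = 0.025


def intervals(x0):
    """Return (fitted value, CI half-width, PI half-width) at X = x0."""
    fit = b0 + b1 * x0
    var_mean = MSE * (1 / n + (x0 - xbar) ** 2 / Sxx)      # mean response
    var_pred = MSE * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx)  # individual response
    return fit, T_CRIT * math.sqrt(var_mean), T_CRIT * math.sqrt(var_pred)


# Hat diagonal elements for simple linear regression.
hat = [1 / n + (xi - xbar) ** 2 / Sxx for xi in x]


def mse_without(i):
    """MSE with case i left out, via the deletion identity (no refitting)."""
    return (SSE - resid[i] ** 2 / (1 - hat[i])) / (n - p - 1)


# Studentized residuals as defined in the Model Adequacy section.
stud = [resid[i] / math.sqrt(mse_without(i) * (1 - hat[i])) for i in range(n)]

# Standard normal quantiles for a qq-plot, matching normsinv((i - .5)/n).
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Rule-of-thumb cutoffs listed in the Influence Diagnostics section.
cutoffs = {
    "hat": 2 * p / n,
    "cov_ratio": 3 * p / n,
    "dffits": 2 * math.sqrt(p / n),
    "dfbetas": 2 / math.sqrt(n),
}

fit, ci_half, pi_half = intervals(5)
print(f"fit: Y = {b0:.4f} + {b1:.4f} X, S = {math.sqrt(MSE):.4f}, R-Sq = {100 * r2:.1f}%")
print(f"at x = 5: fit = {fit:.2f}, CI half-width = {ci_half:.2f}, PI half-width = {pi_half:.2f}")
print("cases with large hat values:", [i + 1 for i, h in enumerate(hat) if h > cutoffs["hat"]])
```

To draw the qq-plot, pair `z` with the sorted residuals (`sorted(resid)` or `sorted(stud)`). Dffits, Cook's distance, the covariance ratio, and Dfbetas are left out to keep the sketch short. A statistics package would normally be used for those.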