Simple Linear Regression

• Our objective is to study the relationship between two variables X and Y.
• One way to study the relationship between two variables is by means of regression. Regression analysis is the process of estimating a functional relationship between X and Y. A regression equation is often used to predict a value of Y for a given value of X.
• Another way to study the relationship between two variables is correlation, which measures the direction and the strength of the linear relationship.

First-Order Linear Model = Simple Linear Regression Model

$Y_i = b_0 + b_1 X_i + \varepsilon_i$

where
Y = dependent variable
X = independent variable
$b_0$ = y-intercept
$b_1$ = slope of the line
$\varepsilon$ = error variable

Simple Linear Model

$Y_i = b_0 + b_1 X_i + \varepsilon_i$

This model is
– Simple: only one X.
– Linear in the parameters: no parameter appears as an exponent or is multiplied or divided by another parameter.
– Linear in the predictor variable (X): X appears only to the first power.

Examples

• Multiple Linear Regression: $Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$
• Polynomial Linear Regression: $Y_i = b_0 + b_1 X_i + b_2 X_i^2 + \varepsilon_i$
• Linear Regression: $\log_{10}(Y_i) = b_0 + b_1 X_i + b_2 \exp(X_i) + \varepsilon_i$
• Nonlinear Regression: $Y_i = b_0 / (1 + b_1 \exp(b_2 X_i)) + \varepsilon_i$
The distinction is whether the model is linear or nonlinear in the parameters.

Deterministic Component of Model

[Figure: the line $\hat{y} = \hat{b}_0 + \hat{b}_1 x$, where $\hat{b}_0$ is the y-intercept and $\hat{b}_1$ (the slope) = rise/run.]

Mathematical vs Statistical Relation

[Figure: scatterplot of data with the fitted line $\hat{y} = -5.3562 + 3.3988x$.]

Error

• The scatterplot shows that the points do not fall exactly on a line, so in addition to the relationship we also describe the error:
$y_i = b_0 + b_1 x_i + \varepsilon_i$,  i = 1, 2, ..., n
• The Y's are the response (or dependent) variable, the x's are the predictors (or independent variables), and the $\varepsilon$'s are the errors. We assume that the errors are normal, mutually independent, and have variance $\sigma^2$.

Least Squares

Minimize $\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$.

The minimizing y-intercept and slope are denoted $\hat{b}_0$ and $\hat{b}_1$. We use the notation $\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_i$.

• The quantities $R_i = y_i - \hat{y}_i$ are called the residuals. If we assume a normal error, they should look normal.
• Error: $Y_i - E(Y_i)$ is unknown. Residual: $R_i = y_i - \hat{y}_i$ is estimated, i.e. known.

The Least Squares Regression Line

• The Simple Linear Regression Model: $y = b_0 + b_1 x + \varepsilon$
• The Least Squares Regression Line: $\hat{y} = \hat{b}_0 + \hat{b}_1 x$, where

$\hat{b}_1 = \dfrac{SS_{xy}}{SS_x}$,  $\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$

$SS_x = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n}$

$SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}$

What form does the error take?

• Each observation may be decomposed into two parts: $y = \hat{y} + (y - \hat{y})$. The first part is used to determine the fit, and the second to estimate the error.
• The error sum of squares is $SSE = \sum (Y_i - \hat{Y}_i)^2 = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}$.

Estimate of $\sigma^2$

• We estimate $\sigma^2$ by $s^2 = MSE = \dfrac{SSE}{n-2}$.

Example

• An educational economist wants to establish the relationship between an individual's income and education. He takes a random sample of 10 individuals and asks for their income (in $1000s) and education (in years). The results are shown below. Find the least squares regression line.

Education: 11  12  11  15  8  10  11  12  17  11
Income:    25  33  22  41  18  28  32  24  53  26

Dependent and Independent Variables

• The dependent variable is the one that we want to forecast or analyze.
• The independent variable is hypothesized to affect the dependent variable.
• In this example, we wish to analyze income, and we choose education as the variable that most affects income. Hence, y is income and x is the individual's education.
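To make the formulas for $SS_x$, $SS_{xy}$, $\hat{b}_1$ and $\hat{b}_0$ concrete, here is a minimal Python sketch (illustrative, not from the original slides; the function name is my own) that applies them to the education/income data above. The printed values anticipate the hand calculation worked out in the next section.

```python
# Least-squares slope and intercept from the SS_x / SS_xy formulas above.
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    ss_x = sum(xi ** 2 for xi in x) - sum_x ** 2 / n                   # SS_x
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n   # SS_xy
    b1 = ss_xy / ss_x                        # slope:     b1-hat = SS_xy / SS_x
    b0 = sum_y / n - b1 * sum_x / n          # intercept: b0-hat = y-bar - b1-hat * x-bar
    return b0, b1

education = [11, 12, 11, 15, 8, 10, 11, 12, 17, 11]    # years (x)
income    = [25, 33, 22, 41, 18, 28, 32, 24, 53, 26]   # $1000s (y)
b0, b1 = least_squares(education, income)
print(round(b0, 2), round(b1, 2))   # approximately -13.93 and 3.74
```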
First Step

$\sum x_i = 118$,  $\sum x_i^2 = 1450$,  $\sum y_i = 302$,  $\sum y_i^2 = 10072$,  $\sum x_i y_i = 3779$

Sum of Squares

$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 3779 - \dfrac{(118)(302)}{10} = 215.4$

$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 1450 - \dfrac{(118)^2}{10} = 57.6$

Therefore,

$\hat{b}_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{215.4}{57.6} = 3.74$

$\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x} = \dfrac{302}{10} - 3.74\left(\dfrac{118}{10}\right) = -13.93$

The Least Squares Regression Line

• The least squares regression line is $\hat{y} = -13.93 + 3.74x$.
• Interpretation of coefficients: the sample slope $\hat{b}_1 = 3.74$ tells us that, on average, for each additional year of education an individual's income rises by $3.74 thousand.
• The y-intercept is $\hat{b}_0 = -13.93$. This would be the expected (or average) income for an individual with 0 years of education, which is meaningless here since x = 0 lies outside the range of the data.

Error Variable

• $\varepsilon$ is normally distributed.
• $E(\varepsilon) = 0$.
• The variance of $\varepsilon$ is $\sigma^2$.
• The errors are independent of each other.
• The estimator of $\sigma^2$ is $s_e^2 = MSE = \dfrac{SSE}{n-2}$, where $SSE = \sum (Y_i - \hat{Y}_i)^2 = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}}$ and $SS_y = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n}$.

Example (continued)

• For the previous example, $SS_y = \sum y_i^2 - \dfrac{(\sum y_i)^2}{n} = 10072 - \dfrac{(302)^2}{10} = 951.6$.
Hence,
$SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}} = 951.6 - \dfrac{(215.4)^2}{57.6} = 146.09$
$s_e^2 = \dfrac{SSE}{n-2} = \dfrac{146.09}{10-2} = 18.26$ and $s_e = \sqrt{s_e^2} = 4.27$

Interpretation of $s_e$

• The value of $s_e$ can be compared with the mean value of y to provide a rough guide as to whether $s_e$ is small or large.
• Since $\bar{y} = 30.2$ and $s_e = 4.27$, we would conclude that $s_e$ is relatively small, indicating that the regression line fits the data quite well.

Example

• Car dealers across North America use the "red book" to determine a car's selling price on the basis of important features. One of these features is the car's current odometer reading.
• To examine this issue, 100 three-year-old cars in mint condition were randomly selected. Their selling price and odometer reading were recorded.

Portion of the data file

Odometer: 37388, 44758, 45833, 30862, ..., 34212, 33190, 39196, 36392
Price:     5318,  5061,  5008,  5795, ...,  5283,  5259,  5356,  5133

Example (Minitab Output)

Regression Analysis

The regression equation is
Price = 6533 - 0.0312 Odometer

Predictor   Coef        StDev      T        P
Constant    6533.38     84.51      77.31    0.000 (significant)
Odometer    -0.031158   0.002309   -13.49   0.000 (significant)

S = 151.6   R-Sq = 65.0%   R-Sq(adj) = 64.7%

Analysis of Variance

Source      DF   SS        MS        F        P
Regression   1   4183528   4183528   182.11   0.000
Error       98   2251362     22973
Total       99   6434890

Example

• The least squares regression line is $\hat{y} = 6533.38 - 0.031158x$.

[Figure: scatterplot of Price against Odometer with the fitted line.]

Interpretation of the coefficients

• $\hat{b}_1 = -0.031158$ means that for each additional mile on the odometer, the price decreases by an average of 3.1158 cents.
• $\hat{b}_0 = 6533.38$ means that when x = 0 (a new car) the selling price is $6533.38, but x = 0 is not in the range of x, so we cannot interpret the value of y at x = 0 for this problem.
• R² = 65.0% means that 65% of the variation in y can be explained by x. The higher the value of R², the better the model fits the data.

R² and R² adjusted

• R² measures the degree of linear association between X and Y.
• A high R² does not necessarily indicate that the estimated regression line is a good fit.
• Also, an R² close to 0 does not necessarily indicate that X and Y are unrelated (the relation can be nonlinear).
• As more and more X's are added to the model, R² always increases. R²adj accounts for the number of parameters in the model.

Scatter Plot

[Figure: Odometer vs. Price line fit plot.]
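As a quick numerical check (illustrative, not part of the Minitab output), the S and R-Sq values reported above can be recovered directly from the Analysis of Variance table: MSE is the error sum of squares divided by its degrees of freedom, S is its square root, and R-Sq is the regression sum of squares divided by the total sum of squares.

```python
# Recover S and R-Sq from the ANOVA sums of squares in the Minitab output.
import math

ss_regression = 4_183_528
ss_error      = 2_251_362
ss_total      = 6_434_890
df_error      = 98

mse  = ss_error / df_error           # 22973, the MS for Error
s    = math.sqrt(mse)                # S = 151.6
r_sq = ss_regression / ss_total      # R-Sq = 65.0%

print(round(mse), round(s, 1), round(100 * r_sq, 1))
```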
Testing the slope

• Are X and Y linearly related?

$H_0: b_1 = 0$  vs.  $H_A: b_1 \ne 0$

• Test statistic: $t = \dfrac{\hat{b}_1 - b_1}{s_{\hat{b}_1}}$, where $s_{\hat{b}_1} = \dfrac{s_e}{\sqrt{SS_x}}$

Testing the slope (continued)

• The rejection region: reject $H_0$ if $t < -t_{\alpha/2,\,n-2}$ or $t > t_{\alpha/2,\,n-2}$.
• If we are testing that high x values lead to high y values, $H_A: b_1 > 0$; then the rejection region is $t > t_{\alpha,\,n-2}$.
• If we are testing that high x values lead to low y values (or low x values lead to high y values), $H_A: b_1 < 0$; then the rejection region is $t < -t_{\alpha,\,n-2}$.

Assessing the model

Example:
• Excel output

            Coefficients   Standard Error   t Stat   P-value
Intercept   6533.4         84.512322        77.307   1E-89
Odometer    -0.031         0.0023089        -13.49   4E-24

• Minitab output

Predictor   Coef        StDev      T        P
Constant    6533.38     84.51      77.31    0.000
Odometer    -0.031158   0.002309   -13.49   0.000

Coefficient of Determination

$R^2 = \dfrac{SS_{xy}^2}{SS_x SS_y} = 1 - \dfrac{SSE}{SS_y}$

$R^2_{adj} = 1 - \left(\dfrac{n-1}{n-p}\right)\dfrac{SSE}{SS_y}$

For the data in the odometer example, we obtain:

$R^2 = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{2{,}251{,}362}{6{,}434{,}890} = 1 - 0.3499 = 0.6501$

Using the Regression Equation

• Suppose we would like to predict the selling price of a car with 40,000 miles on the odometer:
$\hat{y} = 6{,}533 - 0.0312x = 6{,}533 - 0.0312(40{,}000) = \$5{,}285$

Prediction and Confidence Intervals

• Prediction interval of y for $x = x_g$ (an interval for a particular value of y at a given x):
$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$
• Confidence interval of $E(y \mid x = x_g)$ (an interval for the expected value of y at a given x):
$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{\dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$

Solving by Hand (Prediction Interval)

• From previous calculations we have the following estimates:
$\hat{y} = 5285$, $s_e = 151.6$, $SS_x = 4{,}309{,}340{,}160$, $\bar{x} = 36{,}009$
• Thus a 95% prediction interval for x = 40,000 is:
$5{,}285 \pm 1.984(151.6)\sqrt{1 + \dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 303$
• The prediction is that the selling price of the car will fall between $4982 and $5588.

Solving by Hand (Confidence Interval)

• A 95% confidence interval for $E(y \mid x = 40{,}000)$ is:
$5{,}285 \pm 1.984(151.6)\sqrt{\dfrac{1}{100} + \dfrac{(40{,}000 - 36{,}009)^2}{4{,}309{,}340{,}160}} = 5{,}285 \pm 35$
• The mean selling price of such cars will fall between $5250 and $5320.

Prediction and Confidence Intervals' Graph

[Figure: fitted line with the prediction interval and the (narrower) confidence interval bands plotted against Odometer.]

Notes

• No matter how strong the statistical relation between X and Y, no cause-and-effect pattern is necessarily implied by the regression model. Example: although a positive and significant relationship is observed between vocabulary (X) and writing speed (Y), this does not imply that an increase in X causes an increase in Y. Other variables, such as age, may affect both X and Y: older children have a larger vocabulary and faster writing speed.

Regression Diagnostics

How do we diagnose violations of the assumptions and deal with observations that are unusually large or small?
• Residual Analysis:
– Non-normality
– Heteroscedasticity (non-constant variance)
– Non-independence of the errors
– Outliers
– Influential observations

Standardized Residuals

• The standardized residuals are calculated as $\dfrac{r_i}{s_e}$, where $r_i = y_i - \hat{y}_i$.
• The standard deviation of the i-th residual is $s_{r_i} = s_e \sqrt{1 - h_i}$, where $h_i = \dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{SS_x}$.
(A short numerical sketch of these quantities appears after the non-normality list below.)

Non-normality

• The errors should be normally distributed. To check the normality of the errors, we use a histogram of the residuals, a normal probability plot of the residuals, or tests such as the Shapiro-Wilk test.
• Dealing with non-normality:
– Transformation on Y
– Other types of regression (e.g., Poisson or logistic regression)
– Nonparametric methods (e.g., nonparametric regression, i.e. smoothing)
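The sketch below (illustrative, not from the slides; it reuses the education/income data from earlier and a common rule of thumb of flagging scaled residuals larger than 2 in absolute value) computes the leverages $h_i$ and each residual divided by its standard deviation $s_e\sqrt{1-h_i}$, as defined in the Standardized Residuals section above.

```python
# Leverages and scaled residuals for the education/income regression.
import numpy as np

x = np.array([11, 12, 11, 15, 8, 10, 11, 12, 17, 11], dtype=float)
y = np.array([25, 33, 22, 41, 18, 28, 32, 24, 53, 26], dtype=float)
n = len(x)

ss_x = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / ss_x
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)                      # residuals r_i = y_i - y_hat_i
s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))    # estimate of sigma (about 4.27)

h = 1 / n + (x - x.mean()) ** 2 / ss_x         # leverages h_i
scaled = resid / (s_e * np.sqrt(1 - h))        # residual / its standard deviation

print(np.round(scaled, 2))
for i in np.where(np.abs(scaled) > 2)[0]:      # |scaled residual| > 2: rule of thumb
    print(f"observation {i} looks unusual: scaled residual {scaled[i]:.2f}")
```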
Heteroscedasticity

• The error variance $\sigma_\varepsilon^2$ should be constant. When this requirement is violated, the condition is called heteroscedasticity.
• To diagnose heteroscedasticity (or homoscedasticity), one method is to plot the residuals against the predicted values of y (or against x). If the points are distributed evenly around the expected value of the errors, which is 0, the error variance is constant. Alternatively, use formal tests such as the Breusch-Pagan test.

Example

• A classic example of heteroscedasticity is that of income versus expenditure on meals. As one's income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption.

Dealing with heteroscedasticity

• Transform Y.
• Re-specify the model (e.g., are important X's missing?).
• Use Weighted Least Squares instead of Ordinary Least Squares: minimize $\sum_{i=1}^{n} \dfrac{\varepsilon_i^2}{Var(\varepsilon_i)}$.

Non-independence of the error variable

• The values of the error should be independent. When the data are a time series, the errors are often correlated (i.e., autocorrelated or serially correlated). To detect autocorrelation, we plot the residuals against the time periods. If there is no pattern, the errors are independent.

Outliers

• An outlier is an observation that is unusually small or large. Two possibilities that can produce an outlier are:
1. An error in recording the data: detect the error and correct it. If the outlier point should not have been included in the data (it belongs to another population), discard the point from the sample.
2. The observation is unusually small or large although it belongs to the sample and there is no recording error: do NOT remove it.

Influential Observations

• One or more observations may have a large influence on the statistics.

[Figure: scatter plot with one influential observation, and the same scatter plot without the influential observation.]

• Look into Cook's Distance, DFFITS, and DFBETAS (Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. (1996), Applied Linear Statistical Models, 4th edition, Irwin, pp. 378-384).

Multicollinearity

• A common issue in multiple regression is multicollinearity. It exists when some or all of the predictors in the model are highly correlated. In such cases, the estimated coefficient of any variable depends on which other variables are in the model, and the standard errors of the coefficients are very high.
• Look at the correlation coefficients among the X's: if Cor > 0.8, suspect multicollinearity.
• Look at the Variance Inflation Factors (VIF): VIF > 10 is usually a sign of multicollinearity.
• If there is multicollinearity (a small illustration of the centering remedy follows below):
– Use a transformation on the X's, e.g. centering or standardization. Example: Cor(X, X²) = 0.991; after standardization, Cor = 0.
– Remove the X that causes the multicollinearity.
– Factor analysis
– Ridge regression
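Below is a toy Python illustration (not from the slides; the predictor is an arbitrary evenly spaced example) of the centering remedy: X and X² are strongly correlated, but after centering X the correlation with its square drops to essentially zero (exactly zero here because the centered values are symmetric), and the corresponding variance inflation factor falls to 1.

```python
# Centering a predictor to break the collinearity between X and X^2.
import numpy as np

x  = np.arange(1.0, 11.0)          # a simple, evenly spaced predictor
x2 = x ** 2

xc  = x - x.mean()                 # centered predictor
xc2 = xc ** 2

print(np.corrcoef(x, x2)[0, 1])    # close to 1 -> collinear
print(np.corrcoef(xc, xc2)[0, 1])  # 0 here, since xc is symmetric about 0

# VIF of one predictor given the other: 1 / (1 - R^2) from regressing it
# on the other column (hand-rolled version of the usual VIF definition).
def vif(a, b):
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 / (1.0 - r ** 2)

print(vif(x, x2), vif(xc, xc2))    # large before centering, 1 after
```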
Exercise

• It is doubtful that any sport collects more statistics than baseball. Fans are always interested in determining which factors lead to successful teams. The table below lists the team batting average and the team winning percentage for the 14 league teams at the end of a recent season.

Team Batting Average   Winning %
0.254                  0.414
0.269                  0.519
0.255                  0.500
0.262                  0.537
0.254                  0.352
0.247                  0.519
0.264                  0.506
0.271                  0.512
0.280                  0.586
0.256                  0.438
0.248                  0.519
0.255                  0.512
0.270                  0.525
0.257                  0.562

y = winning % and x = team batting average.

a) Least Squares Regression Line

$\sum x_i = 3.642$,  $\sum y_i = 7.001$,  $\sum x_i^2 = 0.948622$,  $\sum y_i^2 = 3.548785$,  $\sum x_i y_i = 1.824562$

$SS_{xy} = \sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n} = 1.824562 - \dfrac{(3.642)(7.001)}{14} = 0.003302$

$SS_x = \sum x_i^2 - \dfrac{(\sum x_i)^2}{n} = 0.948622 - \dfrac{(3.642)^2}{14} = 0.001182$

$\hat{b}_1 = \dfrac{SS_{xy}}{SS_x} = \dfrac{0.003302}{0.001182} = 2.794$

$\hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x} = 0.5001 - (2.794)(0.2601) = -0.2268$

• The least squares regression line is $\hat{y} = -0.2268 + 2.794x$.
• Interpretation: $\hat{b}_1 = 2.794$ means that each additional point (0.001) of team batting average is associated with an average increase of about 0.0028, i.e. 0.28 percentage points, in winning percentage.

b) Standard Error of Estimate

$SSE = S_{yy} - \dfrac{S_{xy}^2}{S_{xx}} = \left(3.548785 - \dfrac{(7.001)^2}{14}\right) - \dfrac{(0.003302)^2}{0.001182} = 0.04778 - 0.00922 = 0.03856$

$s_e^2 = \dfrac{SSE}{n-2} = \dfrac{0.03856}{14-2} = 0.00321$, so $s_e = \sqrt{s_e^2} = 0.0567$

• Since $\bar{y} = 0.5$ and $s_e = 0.0567$, we would conclude that $s_e$ is relatively small, indicating that the regression line fits the data reasonably well.

c) Do the data provide sufficient evidence at the 5% significance level to conclude that a higher team batting average leads to a higher winning percentage?

$H_0: b_1 = 0$  vs.  $H_A: b_1 > 0$

Test statistic: $t = \dfrac{\hat{b}_1 - b_1}{s_{\hat{b}_1}} = 1.69$ (p-value = 0.058)

Conclusion: do not reject $H_0$ at $\alpha = 0.05$. There is not sufficient evidence to conclude that a higher team batting average leads to a higher winning percentage.

d) Coefficient of Determination

$R^2 = \dfrac{SS_{xy}^2}{SS_x SS_y} = 1 - \dfrac{SSE}{SS_y} = 1 - \dfrac{0.03856}{0.04778} = 0.193$

About 19.3% of the variation in the winning percentage can be explained by the batting average.

e) Predict with 90% confidence the winning percentage of a team whose batting average is 0.275.

$\hat{y} = -0.2268 + 2.794(0.275) = 0.5416$

$\hat{y} \pm t_{\alpha/2,\,n-2}\, s_e \sqrt{1 + \dfrac{1}{n} + \dfrac{(x_g - \bar{x})^2}{SS_x}}$

$0.5416 \pm (1.782)(0.0567)\sqrt{1 + \dfrac{1}{14} + \dfrac{(0.275 - 0.2601)^2}{0.001182}} = 0.5416 \pm 0.1134$

90% PI for y: (0.4282, 0.6550)

• The prediction is that the team's winning percentage will fall between 42.8% and 65.5%.
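The exercise calculations can be checked numerically. The Python sketch below (illustrative, not from the slides; scipy is used only for the t critical value and p-value) reproduces the slope, intercept, standard error, test statistic, R², and the 90% prediction interval.

```python
# Numerical check of the baseball exercise.
import numpy as np
from scipy import stats

x = np.array([0.254, 0.269, 0.255, 0.262, 0.254, 0.247, 0.264,
              0.271, 0.280, 0.256, 0.248, 0.255, 0.270, 0.257])  # batting average
y = np.array([0.414, 0.519, 0.500, 0.537, 0.352, 0.519, 0.506,
              0.512, 0.586, 0.438, 0.519, 0.512, 0.525, 0.562])  # winning %
n = len(x)

ss_x  = np.sum((x - x.mean()) ** 2)
ss_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = ss_xy / ss_x
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
sse  = np.sum(resid ** 2)
s_e  = np.sqrt(sse / (n - 2))
t    = b1 / (s_e / np.sqrt(ss_x))        # test statistic for H0: b1 = 0
p    = stats.t.sf(t, df=n - 2)           # one-sided p-value (HA: b1 > 0)
r_sq = 1 - sse / np.sum((y - y.mean()) ** 2)

x_g = 0.275
y_hat = b0 + b1 * x_g
half_width = (stats.t.ppf(0.95, df=n - 2) * s_e *
              np.sqrt(1 + 1 / n + (x_g - x.mean()) ** 2 / ss_x))

print(b0, b1)                                  # about -0.227 and 2.79
print(s_e, t, p, r_sq)                         # about 0.057, 1.69, 0.058, 0.19
print(y_hat - half_width, y_hat + half_width)  # roughly (0.43, 0.66)
```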