Chapter 17 Simple Linear Regression and Correlation

17.1 Introduction
• In this chapter we employ regression analysis to examine the relationship among quantitative variables.
• The technique is used to predict the value of one variable (the dependent variable, y) based on the values of other variables (the independent variables x1, x2, ..., xk).

17.2 The Model
• The first-order linear model is

      y = β0 + β1x + ε

  where
      y  = dependent variable
      x  = independent variable
      β0 = y-intercept
      β1 = slope of the line (rise/run)
      ε  = error variable

• β0 and β1 are unknown population parameters and are therefore estimated from the data.

17.3 Estimating the Coefficients
• The estimates are determined by
  – drawing a sample from the population of interest,
  – calculating sample statistics, and
  – producing a straight line that cuts through the data.
  (Scatter diagram: many candidate lines can be drawn through the same cloud of points. The question is: which straight line fits best?)
• The best line is the one that minimizes the sum of squared vertical differences between the points and the line.
• Let us compare two lines through the four points (1, 2), (2, 4), (3, 1.5), and (4, 3.2):
  – For the line ŷ = x, the sum of squared differences is (2 - 1)² + (4 - 2)² + (1.5 - 3)² + (3.2 - 4)² = 7.89.
  – For the horizontal line ŷ = 2.5, the sum of squared differences is (2 - 2.5)² + (4 - 2.5)² + (1.5 - 2.5)² + (3.2 - 2.5)² = 3.99.
  – The smaller the sum of squared differences, the better the fit of the line to the data; here the horizontal line fits better.
• To calculate the estimates of the coefficients that minimize the differences between the data points and the line, use the formulas

      b1 = cov(X, Y)/s²x        b0 = ȳ - b1·x̄

  The regression equation that estimates the equation of the first-order linear model is

      ŷ = b0 + b1x

• Example 17.1  Relationship between odometer reading and a used car's selling price
  – A car dealer wants to find the relationship between the odometer reading and the selling price of used cars.
  – A random sample of 100 cars is selected, and the data recorded (Odometer = independent variable x; Price = dependent variable y):

      Car   Odometer   Price
       1     37,388    5,318
       2     44,758    5,061
       3     45,833    5,008
       4     30,862    5,795
       5     31,705    5,784
       6     34,010    5,359
       .        .        .

  – Find the regression line.
• Solution (solving by hand)
  – To calculate b0 and b1 we first need several statistics (n = 100):

      x̄ = 36,009.45          s²x = Σ(xi - x̄)²/(n - 1) = 43,528,688
      ȳ = 5,411.41           cov(X, Y) = Σ(xi - x̄)(yi - ȳ)/(n - 1) = -1,356,256

  – Then

      b1 = cov(X, Y)/s²x = -1,356,256/43,528,688 = -.0312
      b0 = ȳ - b1x̄ = 5,411.41 - (-.0312)(36,009.45) = 6,533
      ŷ = b0 + b1x = 6,533 - .0312x

• Solution (using the computer; see file Xm17-01.xls)
  – Tools > Data Analysis > Regression > [select the y range and the x range] > OK

      SUMMARY OUTPUT
      Regression Statistics
      Multiple R            0.806308
      R Square              0.650132
      Adjusted R Square     0.646562
      Standard Error        151.5688
      Observations          100

      ANOVA          df    SS          MS          F          Significance F
      Regression      1    4,183,528   4,183,528   182.1056   4.4435E-24
      Residual       98    2,251,362   22,973.09
      Total          99    6,434,890

                  Coefficients   Standard Error   t Stat     P-value
      Intercept     6533.383       84.51232       77.30687   1.22E-89
      Odometer     -0.03116        0.002309      -13.4947    4.44E-24

  (Scatter diagram of Price against Odometer with the fitted line ŷ = 6,533 - .0312x.)
• Interpreting the estimates
  – The intercept is b0 = 6,533. Do not interpret the intercept as the "price of cars that have not been driven": there are no data near x = 0, so the model says nothing about that region.
  – The slope is b1 = -.0312: for each additional mile on the odometer, the price decreases by an average of $0.0312.
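As a minimal sketch of the least squares formulas above (in Python rather than the chapter's Excel workflow), the code below computes b1 = cov(X, Y)/s²x and b0 = ȳ - b1·x̄ for the four illustrative points used in the line-comparison example:

    # Least squares coefficients from b1 = cov(X, Y)/s2x and b0 = y-bar - b1*x-bar,
    # using the four illustrative points from the comparison of the two lines.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.0, 4.0, 1.5, 3.2])

    cov_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance of X and Y
    var_x = np.var(x, ddof=1)             # sample variance s2x

    b1 = cov_xy / var_x                   # slope
    b0 = y.mean() - b1 * x.mean()         # intercept

    print(f"y-hat = {b0:.4f} + {b1:.4f} x")

The same two lines of arithmetic reproduce the Example 17.1 estimates when fed the full 100-car data set.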
17.4 Error Variable: Required Conditions
• The error variable ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
  – The probability distribution of ε is normal.
  – The mean of ε is zero: E(ε) = 0.
  – The standard deviation of ε is σε, a constant, for all values of x.
  – The errors associated with different values of y are all independent.
• From the first three assumptions we have: y is normally distributed with mean E(y) = β0 + β1x and a constant standard deviation σε.
  (Diagram: at x1, x2, x3 the distribution of y is a normal curve centered on the line β0 + β1x; the standard deviation remains constant, but the mean value changes with x.)

17.5 Assessing the Model
• The least squares method will produce a regression line whether or not there is a linear relationship between x and y.
• Consequently, it is important to assess how well the linear model fits the data.
• Several methods are used to assess the model:
  – testing and/or estimating the coefficients;
  – using descriptive measurements.
• Sum of squares for errors (SSE)
  – This is the sum of squared differences between the points and the regression line:

        SSE = Σ(yi - ŷi)²,  i = 1, ..., n

  – It can serve as a measure of how well the line fits the data, and it can also be computed with the shortcut formula

        SSE = (n - 1)[s²Y - cov(X, Y)²/s²x]

  – This statistic plays a role in every statistical technique we employ to assess the model.
• Standard error of estimate
  – The mean error is equal to zero.
  – If σε is small, the errors tend to be close to zero (close to the mean error), and the model fits the data well.
  – Therefore we can use sε as a measure of the suitability of a linear model. An unbiased estimator of σε² is given by

        sε² = SSE/(n - 2),  so the standard error of estimate is  sε = √[SSE/(n - 2)]

• Example 17.2
  – Calculate the standard error of estimate for Example 17.1, and describe what it tells you about the model fit.
  – Solution:

        s²Y = Σ(yi - ȳ)²/(n - 1) = 6,434,890/99 = 64,999   (calculated before)
        SSE = (n - 1)[s²Y - cov(X, Y)²/s²x] = 99[64,999 - (-1,356,256)²/43,528,688] = 2,251,363
        sε = √[SSE/(n - 2)] = √(2,251,363/98) = 151.6

  – With sε = 151.6 and ȳ = 5,411.4, it is hard to assess the model based on sε alone, even when sε is compared with the mean value of y.
• Testing the slope
  – When no linear relationship exists between two variables, the regression line should be horizontal: different inputs (x) yield the same output (y), and the slope equals zero. When a linear relationship exists, different inputs yield different outputs, and the slope is not equal to zero.
  – We can draw inferences about β1 from b1 by testing

        H0: β1 = 0
        H1: β1 ≠ 0   (or < 0, or > 0)

  – The test statistic is

        t = (b1 - β1)/s_b1,  where  s_b1 = sε/√[(n - 1)s²x]   (the standard error of b1)

  – If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n - 2.
• Solution (solving by hand)
  – To compute t we need the values of b1 and s_b1:

        b1 = -.0312
        s_b1 = sε/√[(n - 1)s²x] = 151.6/√[(99)(43,528,688)] = .00231
        t = (b1 - β1)/s_b1 = (-.0312 - 0)/.00231 = -13.49

  – Using the computer:

                  Coefficients    Standard Error   t Stat     P-value
      Intercept    6533.383035     84.51232199     77.30687   1.22E-89
      Odometer    -0.031157739     0.002308896    -13.4947    4.44E-24

  – There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
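A minimal sketch of the same computations in Python, assuming SciPy is available; the inputs are the summary statistics given for Example 17.1, and the explicit p-value step is an addition not shown on the slides:

    # Standard error of estimate and t test of the slope, following the
    # formulas above, using the Example 17.1 summary statistics.
    from scipy import stats
    import math

    n, var_x, var_y, cov_xy = 100, 43_528_688, 64_999, -1_356_256

    b1 = cov_xy / var_x                              # slope estimate
    sse = (n - 1) * (var_y - cov_xy**2 / var_x)      # shortcut formula for SSE
    s_e = math.sqrt(sse / (n - 2))                   # standard error of estimate
    s_b1 = s_e / math.sqrt((n - 1) * var_x)          # standard error of b1

    t = (b1 - 0) / s_b1                              # test H0: beta1 = 0
    p = 2 * stats.t.sf(abs(t), df=n - 2)             # two-sided p-value
    print(f"s_e = {s_e:.1f}, t = {t:.2f}, p = {p:.2e}")

The printed values match the Excel output (sε ≈ 151.6, t ≈ -13.49) up to rounding of the summary statistics.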
• Coefficient of determination
  – When we want to measure the strength of the linear relationship, we use the coefficient of determination:

        R² = [cov(X, Y)]²/(s²x·s²y)   or equivalently   R² = 1 - SSE/Σ(yi - ȳ)²

  – To understand the significance of this coefficient, note that the overall variability in y decomposes into the part captured by the regression model and the part left in the error.
  – For two data points (x1, y1) and (x2, y2) of a certain sample:

        Total variation in y                       = (y1 - ȳ)² + (y2 - ȳ)²
        Variation explained by the regression line = (ŷ1 - ȳ)² + (ŷ2 - ȳ)²
        Unexplained variation (error)              = (y1 - ŷ1)² + (y2 - ŷ2)²

  – In general, variation in y = SSR + SSE, and R² measures the proportion of the variation in y that is explained by the variation in x:

        R² = 1 - SSE/Σ(yi - ȳ)² = [Σ(yi - ȳ)² - SSE]/Σ(yi - ȳ)² = SSR/Σ(yi - ȳ)²

  – R² takes on any value between zero and one:
        R² = 1: perfect match between the line and the data points.
        R² = 0: there is no linear relationship between x and y.
• Example 17.4
  – Find the coefficient of determination for Example 17.1. What does this statistic tell you about the model?
  – Solving by hand:

        R² = [cov(X, Y)]²/(s²x·s²y) = [-1,356,256]²/[(43,528,688)(64,999)] = .6501

  – Using the computer, from the regression output:

        Regression Statistics
        Multiple R            0.8063
        R Square              0.6501
        Adjusted R Square     0.6466
        Standard Error        151.57
        Observations          100

  – 65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.

17.6 Finance Application: Market Model
• One of the most important applications of linear regression is the market model.
• It is assumed that the rate of return on a stock (R) is linearly related to the rate of return on the overall market (Rm, the rate of return on some major stock index):

        R = β0 + β1Rm + ε

• The beta coefficient (β1) measures how sensitive the stock's rate of return is to changes in the level of the overall market.
• Example 17.5  The market model
  – Estimate the market model for Nortel, a stock traded on the Toronto Stock Exchange (TSE). The data consist of 60 monthly percentage returns for Nortel and for all the stocks in the TSE index.

        SUMMARY OUTPUT
        Regression Statistics
        Multiple R            0.560079
        R Square              0.313688
        Adjusted R Square     0.301855
        Standard Error        0.063123
        Observations          60

        ANOVA          df    SS         MS         F          Significance F
        Regression      1    0.10563    0.10563    26.50969   3.27E-06
        Residual       58    0.231105   0.003985
        Total          59    0.336734

                    Coefficients   Standard Error   t Stat     P-value
        Intercept     0.012818       0.008223       1.558903   0.12446
        TSE           0.887691       0.172409       5.148756   3.27E-06

  – The beta coefficient, .8877, is a measure of the stock's market-related risk: in this sample, for each 1% increase in the TSE return, the average increase in Nortel's return is .8877%.
  – R² = .3137 is a measure of the total risk embedded in the Nortel stock that is market-related: specifically, 31.37% of the variation in Nortel's returns is explained by the variation in the TSE's returns.

17.7 Using the Regression Equation
• Before using the regression model, we need to assess how well it fits the data.
• If we are satisfied with how well the model fits the data, we can use it to make predictions for y.
• Illustration
  – Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 17.1):

        ŷ = 6,533 - .0312x = 6,533 - .0312(40,000) = 5,285

• Prediction interval and confidence interval
  – Two intervals can be used to discover how closely the predicted value will match the true value of y:
    • the prediction interval, for a particular value of y;
    • the confidence interval, for the expected value of y.
  – The prediction interval:

        ŷ ± tα/2 sε √[1 + 1/n + (xg - x̄)²/Σ(xi - x̄)²]

  – The confidence interval:

        ŷ ± tα/2 sε √[1/n + (xg - x̄)²/Σ(xi - x̄)²]

  – The prediction interval is wider than the confidence interval. (A sketch of both computations follows.)
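A minimal sketch of both intervals in Python, assuming SciPy, with the Example 17.1 summary statistics and xg = 40,000 plugged into the formulas above:

    # Prediction and confidence intervals at a given x_g, using the
    # formulas above and the Example 17.1 summary statistics.
    from scipy import stats
    import math

    n, x_bar, var_x, s_e = 100, 36_009.45, 43_528_688, 151.6
    b0, b1, x_g = 6_533, -0.0312, 40_000

    y_hat = b0 + b1 * x_g
    sum_sq_x = (n - 1) * var_x                 # sum of (x_i - x_bar)^2
    t_crit = stats.t.ppf(0.975, df=n - 2)      # t(.025, 98), about 1.984

    d2 = (x_g - x_bar) ** 2 / sum_sq_x
    pred_half = t_crit * s_e * math.sqrt(1 + 1 / n + d2)   # a single car
    conf_half = t_crit * s_e * math.sqrt(1 / n + d2)       # mean of many cars

    print(f"prediction: {y_hat:.0f} +/- {pred_half:.0f}")  # about 5,285 +/- 303
    print(f"confidence: {y_hat:.0f} +/- {conf_half:.0f}")  # about 5,285 +/- 35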
• Example 17.6  Interval estimates for the car auction price
  – Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
  – Solution:
    • The dealer would like to predict the price of a single car, so the prediction interval (95%) is used:

        ŷ ± tα/2 sε √[1 + 1/n + (xg - x̄)²/Σ(xi - x̄)²]
        = [6,533 - .0312(40,000)] ± 1.984(151.6)√[1 + 1/100 + (40,000 - 36,009)²/4,309,340,160]
        = 5,285 ± 303

  – The car dealer wants to bid on a lot of 250 Ford Tauruses, where each car has been driven for about 40,000 miles.
  – Solution:
    • The dealer needs to estimate the mean price per car, so the confidence interval (95%) is used:

        ŷ ± tα/2 sε √[1/n + (xg - x̄)²/Σ(xi - x̄)²]
        = [6,533 - .0312(40,000)] ± 1.984(151.6)√[1/100 + (40,000 - 36,009)²/4,309,340,160]
        = 5,285 ± 35

• The effect of the given value of x on the interval
  – As xg moves away from x̄, the interval becomes longer; the shortest interval is found at xg = x̄.
  (Diagram: confidence intervals around ŷ = b0 + b1xg plotted at xg = x̄, xg = x̄ ± 1, and xg = x̄ ± 2; the interval widens symmetrically as |xg - x̄| grows.)

17.8 Coefficient of Correlation
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between -1 and 1.
  – If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
  – If r = 0, there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
• Testing the coefficient of correlation
  – When there is no linear relationship between two variables, ρ = 0.
  – The hypotheses are:

        H0: ρ = 0
        H1: ρ ≠ 0

  – The test statistic is

        t = r √[(n - 2)/(1 - r²)]

    where r is the sample coefficient of correlation, calculated by r = cov(X, Y)/(sx·sy). The statistic has a Student t distribution with d.f. = n - 2, provided the variables are bivariate normally distributed.
• Example 17.7  Testing for linear relationship
  – Test the coefficient of correlation to determine if a linear relationship exists in the data of Example 17.1.
  – Solution: we test H0: ρ = 0 against H1: ρ ≠ 0.
  – Solving by hand:
    • The rejection region is |t| > tα/2,n-2 = t.025,98 ≈ 1.984.
    • The sample coefficient of correlation is r = cov(X, Y)/(sx·sy) = -.806.
    • The value of the t statistic is t = r √[(n - 2)/(1 - r²)] = -13.49.
    • Conclusion: there is sufficient evidence at α = 5% to infer that a linear relationship exists between the two variables.
• Spearman rank correlation coefficient
  – The Spearman rank test is used to test whether a relationship exists between variables in cases where
    • at least one variable is ranked, or
    • both variables are quantitative but the normality requirement is not satisfied.
  – The hypotheses are:

        H0: ρs = 0
        H1: ρs ≠ 0

  – The test statistic is

        rs = cov(a, b)/(sa·sb)

    where a and b are the ranks of the data.
  – For a large sample (n > 30), rs is approximately normally distributed, and z = rs √(n - 1) can be used as the test statistic. (A code sketch of the rank test follows.)
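A minimal sketch of the Spearman rank test using SciPy, which ranks each variable (averaging tied ranks, as described below) and returns rs with a p-value; the score/rating arrays here are made-up illustrations, not the Example 17.8 data:

    # Spearman rank correlation test: scipy ranks each variable, breaking
    # ties by averaging ranks, then computes r_s and a p-value.
    from scipy import stats

    aptitude = [59, 47, 58, 66, 77, 57, 62, 68, 69, 36]  # made-up scores
    rating   = [ 3,  2,  4,  3,  2,  4,  3,  3,  5,  1]  # made-up ratings

    rs, p_value = stats.spearmanr(aptitude, rating)
    print(f"r_s = {rs:.3f}, p = {p_value:.3f}")
    # Reject H0: rho_s = 0 only if the p-value is below the chosen alpha.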
• Example 17.8
  – A production manager wants to examine the relationship between
    • the aptitude test score given prior to hiring (scores range from 0 to 100), and
    • the performance rating three months after starting work (scores range from 1 to 5).
  – A random sample of 20 production workers was selected, and the test scores as well as the performance ratings were recorded.
  – Solution:
    • The hypotheses are H0: ρs = 0, H1: ρs ≠ 0.
    • The problem objective is to analyze the relationship between two variables, and performance rating is ranked.
    • The test statistic is rs, and the rejection region is |rs| > rs,critical (taken from the Spearman rank correlation table).
    • Rank each variable separately; ties are broken by averaging the ranks:

        Employee   Aptitude test   Rank(a)   Performance rating   Rank(b)
            1           59            9              3              10.5
            2           47            3              2               3.5
            3           58            8              4              17
            4           66           14              3              10.5
            5           77           20              2               3.5
            .            .            .              .                .

    • Calculate sa = 5.92, sb = 5.50, cov(a, b) = 12.34. Thus rs = cov(a, b)/(sa·sb) = .379.
    • The critical value for α = .05 and n = 20 is .450.
  – Conclusion: since .379 < .450, do not reject the null hypothesis. At the 5% significance level there is insufficient evidence to infer that the two variables are related to one another.

17.9 Regression Diagnostics - I
• The three conditions required for the validity of the regression analysis are:
  – the error variable is normally distributed;
  – the error variance is constant for all values of x;
  – the errors are independent of each other.
• How can we diagnose violations of these conditions?
• Residual analysis
  – By examining the residuals (or the standardized residuals) we can identify violations of the required conditions.
  – Example 17.1 (continued): checking for nonnormality
    • Use Excel to obtain the standardized residual histogram.
    • Examine the histogram and look for a bell-shaped diagram with mean close to zero.
  – A partial list of standardized residuals:

        RESIDUAL OUTPUT
        Observation   Residuals       Standard Residuals
             1        -50.45749927    -0.334595895
             2        -77.82496482    -0.516076186
             3        -97.33039568    -0.645421421
             4        223.2070978      1.480140312
             5        238.4730715      1.58137268

  – For each residual we calculate the standard deviation as follows:

        s_ri = sε √(1 - hi),  where  hi = 1/n + (xi - x̄)²/Σ(xj - x̄)²
        standardized residual i = residual i / s_ri

  – We can also apply the Lilliefors test or the χ² test of normality.
  (Histogram of the standardized residuals for Example 17.1: roughly bell shaped, centered near zero.)
• Heteroscedasticity
  – When the requirement of a constant variance is violated, we have heteroscedasticity.
  (Plot of the residuals against ŷ: the spread increases with ŷ.)
  – When the requirement of a constant variance is not violated, we have homoscedasticity: the spread of the residuals around zero does not change much as ŷ changes, which is the desired, evenly spread situation.
• Nonindependence of the error variables
  – A time series is constituted when data are collected over time.
  – Examining the residuals over time, no pattern should be observed if the errors are independent.
  – When a pattern is detected, the errors are said to be autocorrelated.
  – Autocorrelation can be detected by graphing the residuals against time: runs of positive residuals followed by runs of negative residuals, or an oscillating behavior of the residuals around zero, indicate that autocorrelation exists. (A plotting sketch follows.)
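A plotting sketch of the three residual checks just described, assuming matplotlib; the data are simulated stand-ins with roughly the scale of Example 17.1, not the actual Xm17-01.xls data:

    # Residual diagnostics: histogram (normality), residuals vs fitted
    # (constant variance), residuals in time order (independence).
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.uniform(19_000, 49_000, 100)                 # stand-in odometer readings
    y = 6_533 - 0.0312 * x + rng.normal(0, 151.6, 100)   # stand-in prices

    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    b0 = y.mean() - b1 * x.mean()
    fitted = b0 + b1 * x
    residuals = y - fitted

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
    axes[0].hist(residuals, bins=12)           # look for a bell shape around 0
    axes[0].set_title("Histogram of residuals")
    axes[1].scatter(fitted, residuals, s=10)   # look for an even spread
    axes[1].axhline(0)
    axes[1].set_title("Residuals vs fitted")
    axes[2].plot(residuals, marker="o", ms=3)  # look for runs or oscillation
    axes[2].axhline(0)
    axes[2].set_title("Residuals in time order")
    plt.tight_layout()
    plt.show()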
• Outliers
  – An outlier is an observation that is unusually small or large.
  – Several possibilities need to be investigated when an outlier is observed:
    • there was an error in recording the value;
    • the point does not belong in the sample;
    • the observation is valid.
  – Identify outliers from the scatter diagram.
  – It is customary to suspect that an observation is an outlier if its |standardized residual| > 2 (see the sketch at the end of this section).
  – Some outliers may be very influential: an influential observation causes a shift in the regression line.
  (Diagram: an outlier far from the point cloud pulls the fitted line toward itself, shifting the regression line.)
• Procedure for regression diagnostics
  – Develop a model that has a theoretical basis.
  – Gather data for the two variables in the model.
  – Draw the scatter diagram to determine whether a linear model appears to be appropriate.
  – Check the required conditions for the errors.
  – Assess the model fit.
  – If the model fits the data, use the regression equation.
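A sketch of the customary |standardized residual| > 2 screen, using the standardized-residual formula from the residual analysis above; the helper function standardized_residuals is hypothetical (not from any library), and x and residuals are assumed to come from a fitted simple linear regression such as the one in the previous sketch:

    # Flagging suspected outliers with standardized residuals:
    # s_ri = s_e * sqrt(1 - h_i), h_i = 1/n + (x_i - x_bar)^2 / sum((x_j - x_bar)^2).
    import numpy as np

    def standardized_residuals(x, residuals):
        n = len(x)
        s_e = np.sqrt(np.sum(residuals**2) / (n - 2))   # standard error of estimate
        h = 1 / n + (x - x.mean())**2 / np.sum((x - x.mean())**2)  # leverage h_i
        return residuals / (s_e * np.sqrt(1 - h))

    # Example use with the arrays from the previous sketch:
    # sr = standardized_residuals(x, residuals)
    # suspects = np.where(np.abs(sr) > 2)[0]   # customary |sr| > 2 rule
    # print("suspected outliers at observations:", suspects)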