Chapter II: Simple Regression Analysis
Dr Hédi ESSID

Format of the simple linear regression model

We write the simple linear regression model as

$$Y_i = \beta_0 + \beta_1 X_i + u_i, \qquad i = 1, 2, \dots, n$$

Yi, the value of the dependent variable in observation i, has two components: (1) the non-random component β0 + β1Xi, X being described as the explanatory (or independent) variable and the fixed quantities β0 and β1 as the parameters of the equation, and (2) the disturbance term, ui.

Figure 2.1 illustrates how these two components combine to determine Y. X1, X2, X3, and X4 are four hypothetical values of the explanatory variable. If the relationship between Y and X were exact, the corresponding values of Y would be represented by the points Q1–Q4 on the line. The disturbance term causes the actual values of Y to be different. In the diagram, the disturbance term has been assumed to be positive in the first and fourth observations and negative in the other two, with the result that, if one plots the actual values of Y against the values of X, one obtains the points P1–P4.

It must be emphasized that in practice the P points are all one can see of Figure 2.1. The actual values of β0 and β1, and hence the location of the Q points, are unknown, as are the values of the disturbance term in the observations. The task of regression analysis is to obtain estimates of β0 and β1, and hence an estimate of the location of the line, given the P points.

Why does the disturbance term exist? There are several reasons:
1. Omission of explanatory variables
2. Aggregation of variables
3. Model misspecification
4. Functional misspecification
5. Measurement error
The disturbance term is the collective outcome of all these factors.

Least squares regression

Suppose that you are given the four observations on X and Y represented in Figure 2.1 and you are asked to obtain estimates of the values of β0 and β1. As a rough approximation, you could do this by plotting the four P points and drawing a line to fit them as best you can. This has been done in Figure 2.2.

The intersection of the line with the Y-axis provides an estimate of the intercept β0, which will be denoted β̂0, and the slope provides an estimate of the slope coefficient β1, which will be denoted β̂1. The line, known as the fitted model, will be written

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

the caret mark over Y indicating that it is the fitted value of Y corresponding to X, not the actual value.

In Figure 2.3, the fitted points are represented by the points R1–R4.

Drawing a regression line by eye is all very well, but it leaves a lot to subjective judgment. Furthermore, as will become obvious, it is not even possible when you have a variable Y depending on two or more explanatory variables instead of only one. The question arises: is there a way of calculating good estimates of β0 and β1 algebraically?
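Before turning to that algebra, here is a minimal sketch of the data-generating process just described: a known line β0 + β1X plus a random disturbance. The parameter values, the disturbance distribution, and the choice of Python are assumptions for illustration only; they do not come from the chapter. The regression problem is to recover β0 and β1 from the resulting (X, Y) points alone.

```python
# Minimal sketch (illustrative values only): four hypothetical X values, an
# assumed "true" line Y = beta0 + beta1*X, and a random disturbance term u.
# The points on the line play the role of the Q points; the observed (X, Y)
# pairs play the role of the P points.
import random

random.seed(1)

beta0, beta1 = 10.0, 2.0          # assumed true parameters (unknown in practice)
X = [1.0, 2.0, 3.0, 4.0]          # four hypothetical values of the explanatory variable

for i, x in enumerate(X, start=1):
    q = beta0 + beta1 * x         # non-random component: the Q point on the true line
    u = random.gauss(0.0, 1.5)    # disturbance term
    y = q + u                     # observed value: the P point
    print(f"obs {i}: X={x:.1f}  Q={q:.1f}  u={u:+.2f}  Y={y:.2f}")
```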
The first step is to define what is known as a residual for each observation. This is the difference between the actual value of Y in any observation and the fitted value given by the regression line, that is, the vertical distance between Pi and Ri in observation i. It will be denoted ei (or ûi):

$$e_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$

Obviously, we wish to fit the regression line, that is, choose β̂0 and β̂1, in such a way as to make the residuals as small as possible. Equally obviously, a line that fits some observations well will fit others badly, and vice versa. We need to devise a criterion of fit that takes account of the size of all the residuals simultaneously.

One way of overcoming the problem is to minimize RSS, the sum of the squares of the residuals:

$$RSS = \sum_i e_i^2 \quad \left(\text{or } \sum_i \hat{u}_i^2\right)$$

The smaller one can make RSS, the better is the fit, according to this criterion. If one could reduce RSS to 0, one would have a perfect fit, for this would imply that all the residuals are equal to 0. The line would go through all the points, but of course in general the disturbance term makes this impossible.

There are other quite reasonable solutions, but the least squares criterion yields estimates of β0 and β1 that are unbiased and the most efficient of their type, provided that certain conditions are satisfied. For this reason, the least squares technique is far and away the most popular in uncomplicated applications of regression analysis. The form used here is usually referred to as Ordinary Least Squares and abbreviated OLS.

Derivation of the Normal Equations

Least squares estimation chooses the estimators β̂0 and β̂1 so as to minimize the sum of the squares of the differences between the actual and fitted values of Y, i.e., it chooses β̂0 and β̂1 to minimize

$$RSS = \sum_i (Y_i - \hat{Y}_i)^2, \qquad \text{where } \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

Substituting for Ŷ we have

$$RSS = \sum_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

The necessary conditions for minimizing RSS with respect to β̂0 and β̂1 are

$$\frac{\partial RSS}{\partial \hat{\beta}_0} = 0, \qquad \frac{\partial RSS}{\partial \hat{\beta}_1} = 0$$

i.e.

$$\frac{\partial RSS}{\partial \hat{\beta}_0} = -2\sum_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$

and

$$\frac{\partial RSS}{\partial \hat{\beta}_1} = -2\sum_i X_i (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0$$

After rearrangement this gives the Normal Equations

$$\sum_i Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_i X_i$$
$$\sum_i X_i Y_i = \hat{\beta}_0 \sum_i X_i + \hat{\beta}_1 \sum_i X_i^2$$

These can now be solved for β̂0 and β̂1:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2} = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$$

and

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Example: Some (fictitious) sales-advertising data

Observation   Sales (Y)   Advertising (X)
 1            36          56.7
 2            48          63.9
 3            45          62.7
 4            40          59.7
 5            30          55.9
 6            56          68.7
 7            63          69.2
 8            53          65.5
 9            61          69.4
10            68          73.4
11            66          74.1
12            65          74.4

NOTE: Both variables are measured in thousands of dollars.

The sales-advertising model: regression output
[Regression output table shown on the slide; not reproduced here.]

Are the coefficient estimates plausible?

The results show an estimated intercept of -75 and a slope (X) coefficient of just under 2. What do you think about these values? Are they significantly different from zero? How good is the fit?

Scatter diagram of sales vs. advertising with the fitted regression line
[Figure shown on the slide; not reproduced here.]
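Since the regression output table is not reproduced in this text, the short sketch below recomputes the OLS estimates for the sales-advertising data directly from the closed-form solution of the normal equations. It should reproduce, up to rounding, the figures quoted above (an intercept of about -75 and a slope just under 2). This is only a check on the formulas, not the original software output.

```python
# Minimal sketch: OLS estimates for the sales-advertising data using
# beta1_hat = Cov(X, Y) / Var(X) and beta0_hat = Ybar - beta1_hat * Xbar.

Y = [36, 48, 45, 40, 30, 56, 63, 53, 61, 68, 66, 65]      # sales, thousands of dollars
X = [56.7, 63.9, 62.7, 59.7, 55.9, 68.7, 69.2, 65.5,
     69.4, 73.4, 74.1, 74.4]                               # advertising, thousands of dollars

n = len(X)
xbar = sum(X) / n
ybar = sum(Y) / n

sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))   # sum of cross-products
sxx = sum((x - xbar) ** 2 for x in X)                      # sum of squared X deviations

b1 = sxy / sxx                 # slope estimate
b0 = ybar - b1 * xbar          # intercept estimate
print(f"beta1_hat = {b1:.3f}, beta0_hat = {b0:.2f}")
```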
Analysis of Variance (ANOVA) and Sums of Squares

As you can see from the ANOVA table in the regression output, we can decompose the Total Sum of Squares of the dependent variable Y around its mean (TSS) into two parts: the Explained (or Regression) Sum of Squares (ESS) and the Residual Sum of Squares (RSS):

$$\underbrace{\sum_i (Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{RSS}$$

Goodness of fit: R squared (the coefficient of determination)

We can now define the coefficient of determination, or R squared, as the proportion of the total variation of the dependent variable (around its mean) which can be explained by, or attributed to, the regression:

$$R^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$$

R squared is taken as a measure of the "goodness of fit" of the regression, with 0 ≤ R² ≤ 1. The closer R squared is to 1, the better the fit.

Alternative interpretation of R²

It should be intuitively obvious that the better the fit achieved by the regression equation, the higher should be the correlation coefficient between the actual and fitted values of Y. We will show that R² is in fact equal to the square of this correlation coefficient, which we will denote r_{Y,Ŷ}. Because the residuals sum to zero and are uncorrelated with the fitted values, the cross-product term satisfies ∑_i (Y_i − Ȳ)(Ŷ_i − Ȳ) = ∑_i (Ŷ_i − Ȳ)², so

$$r_{Y,\hat{Y}} = \frac{\sum_i (Y_i - \bar{Y})(\hat{Y}_i - \bar{Y})}{\sqrt{\sum_i (Y_i - \bar{Y})^2 \sum_i (\hat{Y}_i - \bar{Y})^2}} = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sqrt{\sum_i (Y_i - \bar{Y})^2 \sum_i (\hat{Y}_i - \bar{Y})^2}}$$

and hence

$$r_{Y,\hat{Y}}^2 = \frac{\sum_i (\hat{Y}_i - \bar{Y})^2}{\sum_i (Y_i - \bar{Y})^2} = R^2$$
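As a quick numerical check (not part of the original slides), the sketch below verifies on the sales-advertising data that TSS = ESS + RSS and that R² coincides with the squared correlation between Y and Ŷ.

```python
# Minimal sketch: verify the ANOVA decomposition and the R^2 = r(Y, Yhat)^2
# identity on the sales-advertising data.
from math import sqrt

Y = [36, 48, 45, 40, 30, 56, 63, 53, 61, 68, 66, 65]
X = [56.7, 63.9, 62.7, 59.7, 55.9, 68.7, 69.2, 65.5, 69.4, 73.4, 74.1, 74.4]

n = len(Y)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / sum((x - xbar) ** 2 for x in X)
b0 = ybar - b1 * xbar
Yhat = [b0 + b1 * x for x in X]          # fitted values (their mean equals ybar)

TSS = sum((y - ybar) ** 2 for y in Y)
ESS = sum((yh - ybar) ** 2 for yh in Yhat)
RSS = sum((y - yh) ** 2 for y, yh in zip(Y, Yhat))

R2 = ESS / TSS
r = sum((y - ybar) * (yh - ybar) for y, yh in zip(Y, Yhat)) / sqrt(TSS * ESS)

print(f"TSS = {TSS:.2f}   ESS + RSS = {ESS + RSS:.2f}")    # equal up to rounding
print(f"R^2 = {R2:.4f}   r(Y, Yhat)^2 = {r ** 2:.4f}")      # equal
```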