Univariate Linear Regression

Selcen Cakir

November 16, 2021

In most of econometrics, we are interested in answering causal questions. What is the causal effect of studying Economics at Bogazici on an individual's lifetime income? What is the causal effect of having a new international trade partner on a country's GDP? What is the effect of recruiting more economists on a firm's profits? The examples are numerous.

We have seen that to answer a causal question, we must make ceteris paribus, i.e., apples-to-apples, comparisons. One way to do this is to conduct randomized control trials (RCTs), where we construct two samples that are on average identical except for their treatment status. However, depending on the question, conducting an RCT may be infeasible, too expensive, or unethical. Therefore, we need to find ways to answer causal questions of interest with observational data (rather than trying to get experimental data).

Sometimes we may be able to use observational data to make ceteris paribus comparisons. This is possible, for example, if the selection bias between the treatment and control groups can be captured by observable characteristics. Suppose that we want to study the effect of studying economics at Bogazici on someone's wages. We know that a simple comparison of the wages of people with and without a degree from Bogazici economics captures the treatment effect plus selection bias. Selection bias appears because people who study economics at Bogazici rank very highly in the university entrance exam, and comparing their outcomes to the outcomes of people who ranked lower is unlikely to be an apples-to-apples comparison. To make an apples-to-apples comparison, we want to compare individuals who have similar exam scores. In econometrics language, we say that we want to control for the characteristics that cause the selection bias. Regression is a valuable tool for controlling for such characteristics.

Regression has another important advantage: if our treatment variable is continuous rather than binary (as it was in the previous lectures), we can still use regression analysis to find the treatment effect. In addition, we will see that regression analysis has some very nice properties. For example, we will call the OLS estimator BLUE, which means that it is the best linear unbiased estimator. We will define what each of these words means.

This note describes the mechanics of regression analysis when we have a treatment variable and no control variables. We call this a univariate regression. In a few weeks, we will start learning how to add control variables to our regression.

1 An example

Suppose that you are hired as an econometrician by the Ministry of Education to find the causal effect of reducing the class size on students' university entrance exam performance.[1] You asked the Ministry to provide data on each school's student-teacher ratio and average test score per class. Then, you plotted those variables to obtain Figure 1.

[1] This note is based on SW Ch. 4 and 5 and some other notes. Throughout the note, many parts of the material are left incomplete to encourage the students to attend the class and read the book.

[Figure 1]

Figure 1 shows that there is a lot of dispersion in the data, even for schools with the same student-teacher ratio. This tells us that the student-teacher ratio is not a perfect predictor of average test scores. Why so? How can we summarize the relationship in the plot?

We start by finding the Conditional Expectation Function, or CEF for short. We find the CEF by computing the average value of the outcome variable for each possible value of the explanatory variable, as in Figure 2.

[Figure 2]
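To make the construction of the CEF concrete, here is a minimal Python sketch. The column names class_size and test_score and the numbers themselves are made up purely for illustration; the Ministry's actual data is not shown here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up illustrative data: one row per school, with a hypothetical
# student-teacher ratio ('class_size') and average test score ('test_score').
df = pd.DataFrame({
    "class_size": [18, 18, 20, 20, 20, 22, 22, 25, 25, 25],
    "test_score": [660, 655, 650, 648, 652, 645, 640, 630, 635, 628],
})

# The CEF evaluated at each observed class size: the average test score
# among the schools that share that class size.
cef = df.groupby("class_size")["test_score"].mean()

# Overlay the CEF on the raw scatterplot, as in Figures 2 and 3.
plt.scatter(df["class_size"], df["test_score"], alpha=0.5, label="raw data")
plt.plot(cef.index, cef.values, marker="o", color="red", label="CEF")
plt.xlabel("class size (student-teacher ratio)")
plt.ylabel("average test score")
plt.legend()
plt.show()
```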
Let's add the CEF to the scatterplot of our raw data, as in Figure 3. The CEF is a reasonable predictor of the relationship between the class size and the average test score, but it is difficult to express the CEF in a formal way. At this point, we want to know the regression line, which is the best linear approximation of the nonlinear CEF. Check out Figure 4, which replaces the CEF in Figure 3 with the regression line. See Figure 5 to get a grasp of the relationship between the CEF and the regression line.

[Figure 3]

[Figure 4]

[Figure 5]

Unlike the CEF, the regression line is linear, so it expresses the relationship between the class size and the test scores with just 2 parameters, an intercept and a slope. Check out Figure 6. Our goal is to express these two parameters as a function of the data, that is, to compute the regression coefficients. You want to quantify

$$\beta_{ClassSize} = \frac{\text{change in } TestScore}{\text{change in } ClassSize} = \frac{\Delta TestScore}{\Delta ClassSize} \;\Longrightarrow\; \Delta TestScore = \beta \, \Delta ClassSize. \qquad (1.1)$$

[Figure 6]

Suppose that β = −2.28. What would be the predicted change in TestScore when you reduce the class size by 2, 3, ..., n?

Equation 1.1 defines the slope of a straight line relating test scores and class size:

$$TestScore = \alpha + \beta\, ClassSize. \qquad (1.2)$$

However, we should not interpret equation 1.2 as an identity. At best, we can interpret it as a statement about a relationship that holds on average across the population of districts. What is the relevant population of interest here?

$$TestScore = \alpha + \beta\, ClassSize + \text{other factors}, \qquad (1.3)$$

where the other factors include everything else that affects the test scores (examples?), including luck.

We can generalize equation 1.3 to many different examples and use the following notation:

$$y_i = \alpha + \beta x_i + u_i \qquad (1.4)$$

for i = 1, 2, ..., n.

• y: dependent variable.
• x: independent variable.
• α + βxᵢ: population regression line.
• α and β: parameters of this regression line (intercept and slope).
• u: error term, which incorporates all of the factors responsible for the difference between the ith observation's average test score and the value predicted by the population regression line.

Equation 1.4 is the linear regression model with a single regressor, in which Y is the dependent variable and X is the independent variable, or the regressor. If you knew the value of X in this regression, you would predict that Y is α + βX. Explain with a picture.

Other examples:

• Traffic fatalities as a function of penalties.
• Consumption as a function of income.

1.1 Estimation

We must use data to estimate the unknown α and β. A "hat" on a variable throughout this note means that we are talking about the estimated value of that variable. What are (α̂, β̂, û, ŷ)?

1.1.1 The Ordinary Least Squares Estimator

Consider the solution to

$$(\hat\alpha, \hat\beta) = \arg\min_{a,b} \sum_{i=1}^{n} [y_i - a - b x_i]^2. \qquad (1.5)$$

We have 2 first-order conditions (FOCs). The FOC with respect to a is

$$\begin{aligned}
0 &= -2 \sum_{i=1}^{n} [y_i - a - b x_i] \\
\Longrightarrow \quad 0 &= \frac{1}{n}\sum_{i=1}^{n} [y_i - a - b x_i]
= \frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n}\sum_{i=1}^{n} a - b\,\frac{1}{n}\sum_{i=1}^{n} x_i
= \bar y - \frac{1}{n}\, n a - b \bar x
= \bar y - a - b \bar x.
\end{aligned}$$

Solving for a gives

$$\hat\alpha = \bar y - \hat\beta \bar x. \qquad (1.6)$$

Notice that α̂ is the value of a that minimizes the sum of squared residuals.
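As a quick numerical check of equation (1.6), here is a minimal Python sketch (the numbers are made up purely for illustration): holding the slope fixed at an arbitrary value b, the intercept that minimizes the sum of squared residuals is ȳ − b x̄.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up illustrative data: x = class size, y = average test score.
x = np.array([18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0])
y = np.array([660.0, 652.0, 648.0, 641.0, 636.0, 630.0, 625.0])

b = -2.0  # hold the slope fixed at an arbitrary value; the FOC w.r.t. a holds for any b

# Sum of squared residuals as a function of the intercept a alone.
ssr_in_a = lambda a: np.sum((y - a - b * x) ** 2)
a_star = minimize_scalar(ssr_in_a).x

# Equation (1.6), with b in place of beta-hat: the minimizing intercept is ybar - b*xbar.
print(a_star, y.mean() - b * x.mean())  # the two numbers coincide up to solver tolerance
```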
The second FOC is found by taking the derivative of equation 1.5 with respect to b and setting it to 0:

$$\begin{aligned}
0 &= -2 \sum_{i=1}^{n} [y_i - a - b x_i]\, x_i \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [y_i - (\bar y - b\bar x) - b x_i]\, x_i \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, x_i \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, (\tilde x_i + \bar x) \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, \tilde x_i + \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, \bar x \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, \tilde x_i + \bar x \sum_{i=1}^{n} \tilde y_i - b \bar x \sum_{i=1}^{n} \tilde x_i \\
\Longrightarrow \quad 0 &= \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]\, \tilde x_i.
\end{aligned}$$

Note that, for any variable z,

$$\sum_{i=1}^{n} \tilde z_i = \sum_{i=1}^{n} (z_i - \bar z) = \sum_{i=1}^{n} z_i - n\bar z = n\bar z - n\bar z = 0.$$

Solving for β̂, one gets

$$\hat\beta = \frac{\sum_{i=1}^{n} \tilde x_i \tilde y_i}{\sum_{i=1}^{n} \tilde x_i^2}
= \frac{\frac{1}{n-1}\sum_{i=1}^{n} \tilde x_i \tilde y_i}{\frac{1}{n-1}\sum_{i=1}^{n} \tilde x_i^2}
= \frac{\hat\sigma_{xy}}{\hat\sigma_{xx}}
= \frac{\widehat{\mathrm{cov}}(X, Y)}{\widehat{\mathrm{var}}(X)}. \qquad (1.7)$$

Note that one can take equation 1.4 and write it as ỹᵢ = βx̃ᵢ + ũᵢ, set

$$\hat\beta = \arg\min_{b} \sum_{i=1}^{n} [\tilde y_i - b \tilde x_i]^2,$$

and get the same β̂.

The OLS predicted values Ŷᵢ and residuals ûᵢ are

$$\hat Y_i = \hat\alpha + \hat\beta X_i, \qquad \hat u_i = Y_i - \hat Y_i, \qquad i = 1, ..., n.$$

The OLS estimators are the sample counterparts of the population parameters.

Let

$$\widehat{TestScore} = 698.9 - 2.28\, ClassSize. \qquad (1.8)$$

We say that an increase in ClassSize by 1 student is, on average, associated with a 2.28-point decrease in TestScore.

Suppose that in our sample the ClassSize ranges from 30 to 150. Can you use these results to predict the TestScore when ClassSize is 5? Our sample does not contain any observations with a ClassSize of 5. Thus, it would not be reliable to predict the TestScore of a student who is in such a small class based on our sample regression.

1.1.2 Properties of Residuals

1. uᵢ is the error, and ûᵢ = yᵢ − ŷᵢ = yᵢ − α̂ − β̂xᵢ ≠ uᵢ is the residual.

2. Show that the sum of the residuals is zero:

$$\begin{aligned}
\sum_{i=1}^{n} \hat u_i &= \sum_{i=1}^{n} (y_i - \hat y_i) \\
&= n\,\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat y_i) \\
&= n\,(\bar y - \bar{\hat y}) \\
&= n\,\big(\bar y - (\hat\alpha + \hat\beta \bar x)\big) \quad \text{(by the definition of } \hat y_i\text{)} \\
&= n\,(\bar y - \hat\alpha - \hat\beta \bar x) \quad \text{(because } \hat\alpha \text{ is a constant)} \\
&= n\,(\bar y - \bar y) \quad \text{(by equation 1.6)} \\
&= 0.
\end{aligned}$$

Hint: it follows from the first FOC.

3. Show that the sample average of the fitted values ŷᵢ equals ȳ.

4. Show that the residuals are orthogonal to the x's, i.e.,

$$\begin{aligned}
\sum_{i=1}^{n} \hat u_i \tilde x_i &= \sum_{i=1}^{n} (\tilde y_i - \hat\beta \tilde x_i)\, \tilde x_i \\
&= \sum_{i=1}^{n} \tilde y_i \tilde x_i - \hat\beta \sum_{i=1}^{n} \tilde x_i^2 \\
&= \sum_{i=1}^{n} \tilde y_i \tilde x_i - \frac{\sum_{i=1}^{n} \tilde y_i \tilde x_i}{\sum_{i=1}^{n} \tilde x_i^2} \sum_{i=1}^{n} \tilde x_i^2 \\
&= \sum_{i=1}^{n} \tilde y_i \tilde x_i - \sum_{i=1}^{n} \tilde y_i \tilde x_i \\
&= 0,
\end{aligned}$$

where

$$y_i = \alpha + \beta x_i + u_i, \qquad \bar y = \alpha + \beta \bar x + \bar u, \qquad \tilde y_i = \beta \tilde x_i + \tilde u_i.$$

1.1.3 Goodness of Fit

Does our regressor account for much or for little of the variation in our dependent variable? R² provides an answer. Are the observations tightly clustered around the regression line, or are they spread out? The standard error of the regression provides an answer.

Let

$$Y_i = \hat Y_i + \hat u_i, \qquad (1.9)$$

which implies that

$$\underbrace{\sum_{i=1}^{n} (Y_i - \bar Y)^2}_{TSS} = \sum_{i=1}^{n} (\hat Y_i + \hat u_i - \bar Y)^2 \qquad (1.10)$$

$$= \underbrace{\sum_{i=1}^{n} (\hat Y_i - \bar Y)^2}_{ESS} + \underbrace{\sum_{i=1}^{n} \hat u_i^2}_{SSR}, \qquad (1.11)$$

because the explained and unexplained parts of Yᵢ are orthogonal to each other. In words, the total variation in Y is equal to the sum of the explained variation and the residual variation.

R² is the ratio of the sample variance of Ŷᵢ to the sample variance of Yᵢ:

$$R^2 = \frac{\sum_{i=1}^{n} (\hat Y_i - \bar Y)^2}{\sum_{i=1}^{n} (Y_i - \bar Y)^2} = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS}, \qquad (1.12)$$

where SSR = Σᵢ ûᵢ². We have 0 ≤ R² ≤ 1. When is R² = 1? When is R² = 0?

The standard error of the regression (SER) is an estimator of the standard deviation of the regression error uᵢ. It is a measure of the spread of the observations around the regression line. Because the regression errors u₁, u₂, ..., uₙ are unobserved, the SER is computed using their sample counterparts, the OLS residuals û₁, ..., ûₙ:

$$SER^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2 = \frac{SSR}{n-2}, \qquad (1.13)$$

where the formula for SER² uses the fact that the sample average of the OLS residuals is zero. Note that we lost 2 degrees of freedom for estimating α and β. What is the unit of the SER?
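To make these formulas concrete, here is a minimal Python sketch (again with made-up illustrative numbers, not the data behind equation 1.8). It computes α̂ and β̂ from equations (1.6) and (1.7), verifies the residual properties of Section 1.1.2, and reports R² and the SER.

```python
import numpy as np

# Made-up illustrative data: x = class size, y = average test score.
x = np.array([18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0])
y = np.array([660.0, 652.0, 648.0, 641.0, 636.0, 630.0, 625.0])
n = x.size

x_t, y_t = x - x.mean(), y - y.mean()            # tilde variables: deviations from the means
beta_hat = np.sum(x_t * y_t) / np.sum(x_t ** 2)  # equation (1.7)
alpha_hat = y.mean() - beta_hat * x.mean()       # equation (1.6)

y_hat = alpha_hat + beta_hat * x                 # fitted values
u_hat = y - y_hat                                # residuals

# Residual properties from Section 1.1.2: residuals sum to zero and are orthogonal to x.
print(np.isclose(u_hat.sum(), 0.0), np.isclose(np.sum(u_hat * x_t), 0.0))

# Goodness of fit: TSS = ESS + SSR, R^2, and the standard error of the regression.
tss = np.sum((y - y.mean()) ** 2)
ess = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum(u_hat ** 2)
r2 = ess / tss                                   # equivalently 1 - ssr / tss
ser = np.sqrt(ssr / (n - 2))                     # the SER uses n - 2 degrees of freedom
print(beta_hat, alpha_hat, r2, ser)
```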
1.2 OLS assumptions required for unbiasedness and consistency

1. Conditional independence. We assume that E[u|x] = 0, i.e., how u is distributed across observations is independent of what x value each observation has. We can prove that E[u|x] = 0 implies E[u] = 0 and cov(x, u) = 0 (homework).

• The other factors captured by uᵢ sometimes lead to better test scores and sometimes lead to worse test scores, but on average uᵢ is zero.

• We assume random assignment of Xᵢ, i.e., X is distributed independently of all other personal characteristics. Is this a good assumption for observational data?

2. Random sampling. (Xᵢ, Yᵢ), i = 1, ..., n, are independently and identically distributed across observations, so Cov(uᵢ, uⱼ) = 0 for all i ≠ j. If the observations are drawn by simple random sampling from a single larger population, then the i.i.d. assumption holds. Think of some examples of i.i.d. and non-i.i.d. samples of a population.

3. No outliers. Var(uᵢ) = σ² < ∞, so large outliers are unlikely. Explain Figure 7.

[Figure 7: Outliers]

1.3 Finite Sample Properties of β̂

From now on, assume that the xᵢ, i = 1, 2, ..., n, are deterministic (i.e., we observe the xᵢ in our data and treat them as deterministic variables instead of random ones). Then

$$\begin{aligned}
\hat\beta &= \frac{\sum_{i=1}^{n} \tilde x_i \tilde y_i}{\sum_{i=1}^{n} \tilde x_i^2} \\
&= \frac{\sum_{i=1}^{n} \tilde x_i (\beta \tilde x_i + \tilde u_i)}{\sum_{i=1}^{n} \tilde x_i^2} \\
&= \beta\, \frac{\sum_{i=1}^{n} \tilde x_i^2}{\sum_{i=1}^{n} \tilde x_i^2} + \frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2} \\
&= \beta + \frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2}.
\end{aligned}$$

So now we have obtained a new expression for the OLS estimator, β̂. We will use this expression to derive the finite sample distribution, i.e., the mean and the variance, of β̂. This way, we can check whether the OLS estimator has some desired properties such as unbiasedness and efficiency.

1.3.1 OLS estimator is unbiased

$$\begin{aligned}
E\hat\beta &= E\beta + E\left[\frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2}\right] \\
&= \beta + \frac{\sum_{i=1}^{n} \tilde x_i\, E[\tilde u_i]}{\sum_{i=1}^{n} \tilde x_i^2} \\
&= \beta.
\end{aligned}$$

⟹ β̂ is an unbiased estimator of β.

1.3.2 OLS estimator is efficient

$$\begin{aligned}
Var(\hat\beta) &= E[\hat\beta - E\hat\beta]^2 \\
&= E[\hat\beta - \beta]^2 \\
&= E\left[\beta + \frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2} - \beta\right]^2 \\
&= E\left[\frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2}\right]^2 \\
&= \frac{E\big[\sum_{i=1}^{n} \tilde x_i \tilde u_i\big]^2}{\big[\sum_{i=1}^{n} \tilde x_i^2\big]^2} \\
&= \frac{E\big[\sum_{i=1}^{n}\sum_{j=1}^{n} \tilde x_i \tilde x_j \tilde u_i \tilde u_j\big]}{\big[\sum_{i=1}^{n} \tilde x_i^2\big]^2} \\
&= \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \tilde x_i \tilde x_j\, E[\tilde u_i \tilde u_j]}{\big[\sum_{i=1}^{n} \tilde x_i^2\big]^2} \\
&= \frac{\sum_{i=1}^{n} \tilde x_i^2\, \sigma^2}{\big[\sum_{i=1}^{n} \tilde x_i^2\big]^2} \\
&= \sigma^2 \Big[\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1}. \qquad (1.14)
\end{aligned}$$

To see how we moved from the 6th line to the 8th line, suppose, for example, that n = 3 and check that

$$\begin{aligned}
E\Big[\sum_{i=1}^{3} \tilde x_i \tilde u_i\Big]^2
&= E[(\tilde x_1 \tilde u_1 + \tilde x_2 \tilde u_2) + \tilde x_3 \tilde u_3]^2 \\
&= E[\tilde x_1 \tilde u_1 + \tilde x_2 \tilde u_2]^2 + 2E[(\tilde x_1 \tilde u_1 + \tilde x_2 \tilde u_2)\tilde x_3 \tilde u_3] + E[\tilde x_3 \tilde u_3]^2 \\
&= \tilde x_1^2 \underbrace{E[\tilde u_1^2]}_{\sigma_u^2} + 2\tilde x_1 \tilde x_2 \underbrace{E[\tilde u_1 \tilde u_2]}_{0} + \tilde x_2^2 \underbrace{E[\tilde u_2^2]}_{\sigma_u^2} + 2\tilde x_1 \tilde x_3 \underbrace{E[\tilde u_1 \tilde u_3]}_{0} + 2\tilde x_2 \tilde x_3 \underbrace{E[\tilde u_2 \tilde u_3]}_{0} + \tilde x_3^2 \underbrace{E[\tilde u_3^2]}_{\sigma_u^2} \\
&= \sum_{i=1}^{3} \tilde x_i^2\, \sigma_u^2.
\end{aligned}$$

Note that:

1. Var(β̂) declines as n increases.
2. Var(β̂) declines as the variation in x increases.
3. Var(β̂) increases as σ² increases.

1.3.3 Small sample distribution of β̂

We have derived the mean and variance of β̂. If we also assume that uᵢ ∼ iid N(0, σ²), then, since

$$\hat\beta = \beta + \frac{\sum_{i=1}^{n} \tilde x_i \tilde u_i}{\sum_{i=1}^{n} \tilde x_i^2}$$

is a linear combination of normal random variables, β̂ is itself normal:

$$\hat\beta \sim N\Big(\beta,\; \sigma^2 \Big[\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1}\Big).$$
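A small Monte Carlo experiment illustrates these finite-sample results. The sketch below uses arbitrary, made-up parameter values and a fixed set of xᵢ; it repeatedly draws homoskedastic errors, re-estimates β̂, and compares the simulated mean and variance of β̂ with β and with formula (1.14).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (deterministic) regressor values, as assumed in Section 1.3.
x = np.linspace(1.0, 10.0, 50)
x_t = x - x.mean()                      # x-tilde: deviations from the mean
alpha, beta, sigma = 2.0, -2.28, 5.0    # illustrative "true" parameter values

betas = []
for _ in range(10_000):
    u = rng.normal(0.0, sigma, size=x.size)       # homoskedastic, mean-zero errors
    y = alpha + beta * x + u
    y_t = y - y.mean()
    betas.append(np.sum(x_t * y_t) / np.sum(x_t ** 2))  # OLS slope, equation (1.7)

betas = np.array(betas)
print("mean of beta-hat         :", betas.mean())                     # close to beta (unbiasedness)
print("simulated Var(beta-hat)  :", betas.var())
print("formula sigma^2/sum(x~^2):", sigma ** 2 / np.sum(x_t ** 2))    # equation (1.14)
```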
1.4 Large Sample (Asymptotic) Properties of β̂

Instead of assuming uᵢ ∼ iid N(0, σ²), assume n → ∞. Then, since β̂ − β is a weighted average of the uᵢ, i = 1, 2, ..., n, where each uᵢ has Euᵢ = 0,

$$\begin{aligned}
E[\hat\beta - \beta] &= 0, \\
Var[\hat\beta - \beta] &= \sigma^2 \Big[\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1} \to 0, \\
E[\sqrt{n}(\hat\beta - \beta)] &= 0, \\
Var[\sqrt{n}(\hat\beta - \beta)] &= n\sigma^2 \Big[\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1} = \sigma^2 \Big[\frac{1}{n}\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1},
\end{aligned}$$

and, by the central limit theorem (CLT),

$$\sqrt{n}(\hat\beta - \beta) \sim N\Big(0,\; \sigma^2 \Big[\frac{1}{n}\sum_{i=1}^{n} \tilde x_i^2\Big]^{-1}\Big).$$

In either case, since √n(β̂ − β) has a known distribution, one can construct hypothesis tests involving β.

2 Hypothesis Tests and Confidence Intervals

Consider H₀ : β = β₀ vs. Hₐ : β ≠ β₀. Under the null hypothesis,

$$\frac{\sqrt{n}(\hat\beta - \beta_0)}{\sigma_\beta} \sim N(0, 1), \qquad \text{where } \sigma_\beta = \frac{\sigma}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} \tilde x_i^2}},$$

or, replacing σ² with its estimate,

$$\frac{\sqrt{n}(\hat\beta - \beta_0)}{\hat\sigma_\beta} \sim t_{n-2}, \qquad \text{where } \hat\sigma_\beta^2 = \frac{s^2}{\frac{1}{n}\sum_{i=1}^{n} \tilde x_i^2}, \quad s^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2.$$

Why n − 2?

We say that β is (statistically) significant if β̂ is significantly different from β₀ = 0.

We can also construct confidence intervals for β, which give us an interval that contains the true value of β with some probability. A 95% confidence interval contains the true value of β in 95% of all samples:

$$\text{95% confidence interval for } \beta = [\hat\beta - 1.96\,\hat\sigma_{\hat\beta},\; \hat\beta + 1.96\,\hat\sigma_{\hat\beta}]. \qquad (2.1)$$

What is the 95% confidence interval for β∆x?

2.1 Regression when X is a binary variable

Example:

$$D_i = \begin{cases} 1 & \text{if student } i \text{ is at BoUn} \\ 0 & \text{if student } i \text{ is not at BoUn} \end{cases}$$

Other examples: Female, Urban, CollegeGraduate, etc. A binary variable is also called an indicator variable or a dummy variable. Consider the regression

$$Y_i = \alpha + \beta D_i + u_i.$$

We cannot think of β as a slope anymore (why?). Notice that

$$E[Y_i \mid D_i = 1] = \alpha + \beta, \qquad (2.2)$$

$$E[Y_i \mid D_i = 0] = \alpha. \qquad (2.3)$$

So we have

$$\beta = E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]. \qquad (2.5)$$

Its sample counterpart, β̂, is the difference between the sample averages of Yᵢ in the two groups. See Figure 8.

[Figure 8: How the data look with a binary regressor]

2.2 Heteroskedasticity and Homoskedasticity, Gauss-Markov Theorem, and the WLS

So far, what we have assumed about the error terms is that E[u|x] = 0. If we also assume that Var(u|x) = σ², i.e., that the variance of this conditional distribution does not depend on X, then the errors are called homoskedastic. Otherwise, the error term is heteroskedastic. Consider

$$Wage_i = \alpha + \beta\, Female_i + u_i. \qquad (2.6)$$

Assuming homoskedasticity in this example amounts to assuming that "the variance of wages is the same for men as it is for women." Another example:

$$Wage_i = \alpha + \beta\, Teacher_i + u_i, \qquad (2.7)$$

where Teacherᵢ = 1 if person i's occupation is high school teacher in Turkey and Teacherᵢ = 0 for all other occupations. Is assuming homoskedasticity plausible here?

[Figure 9: How the data look with a binary regressor and heteroskedastic error terms]

When the homoskedasticity assumption fails, the OLS estimator remains unbiased, consistent, and asymptotically normal. However, when the errors are heteroskedastic, OLS no longer has the smallest variance among linear unbiased estimators; that is, heteroskedasticity makes the OLS estimator less efficient. Indeed, the Gauss-Markov theorem states that, when the previously stated three assumptions hold and the errors are homoskedastic, the OLS estimator β̂ is the Best (most efficient) Linear Conditionally Unbiased Estimator (BLUE).

Econometricians have developed methods for dealing with heteroskedasticity. For example, if the nature of the heteroskedasticity is known, i.e., if var(uᵢ|Xᵢ) is known up to a constant factor of proportionality, we can do weighted least squares (WLS) estimation, in which the ith observation is weighted by the inverse of the square root of var(uᵢ|Xᵢ). The errors in this weighted regression become homoskedastic (show how), so OLS applied to the weighted regression continues to be BLUE.
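To see the WLS idea in code, here is a minimal simulated sketch in which the form of the heteroskedasticity is assumed known, with Var(uᵢ|xᵢ) proportional to xᵢ (an assumption made purely for illustration). Dividing every term of yᵢ = α + βxᵢ + uᵢ by the square root of the conditional variance makes the transformed errors homoskedastic, and OLS on the transformed variables estimates α and β.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated example with known heteroskedasticity: Var(u_i | x_i) = 2 * x_i.
x = rng.uniform(1.0, 10.0, size=n)
u = rng.normal(0.0, np.sqrt(2.0 * x))        # error standard deviation grows with x
y = 1.0 + 0.5 * x + u                        # illustrative "true" alpha = 1.0, beta = 0.5

# WLS: divide every term of y_i = alpha + beta*x_i + u_i by sqrt(Var(u_i|x_i)),
# which is proportional to sqrt(x_i). The transformed errors are homoskedastic.
w = 1.0 / np.sqrt(x)
y_star, c_star, x_star = y * w, w, x * w

# OLS on the transformed model y* = alpha*c* + beta*x* + u*; no separate intercept
# column is needed because the original intercept becomes the coefficient on c*.
X = np.column_stack([c_star, x_star])
coef, *_ = np.linalg.lstsq(X, y_star, rcond=None)
print("WLS estimates of (alpha, beta):", coef)
```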