Lecture 8: Ordinary Least Squares
BUEC 333
Professor David Jacks

Overview of regression analysis

A lot of the discussion last week surrounded the population regression function,

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + \varepsilon_i$$

We said that because the coefficients ($\beta$) and the errors ($\varepsilon_i$) are population quantities, we do not and cannot observe them. This led us to a consideration of the sample analog of the regression function above, namely

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki}$$

Most of the time, our primary interest is in the coefficients themselves: $\beta_k$ measures the marginal effect of independent variable $X_{ki}$ on dependent variable $Y_i$, holding the values of the other independent variables constant. Sometimes we are more interested in predicting $Y_i$; given sample data, we can calculate predicted values.

In either case, we need some way to estimate the unknown $\beta$'s. That is, we need a way to compute $\hat{\beta}$'s from a sample of data. Unsurprisingly, there are lots and lots of ways to estimate the $\beta$'s (compute $\hat{\beta}$'s). By far the most common, and one of the most intuitive, methods is called ordinary least squares, or OLS.

What does OLS do?

Recall that we can write the following:

$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + \varepsilon_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \dots + \hat{\beta}_k X_{ki} + e_i = \hat{Y}_i + e_i$$

where the $e_i$ are known as residuals. Residuals can be thought of as: 1.) the sample counterpart to $\varepsilon_i$; 2.) how far our predicted values are from the observed values.

We want to estimate the $\beta$'s in a way that makes the residuals as small as possible. That is, we want our predicted values to be as close to "the truth" as possible. Or in other words, we want to minimize our "prediction mistakes". To accomplish this, OLS minimizes the sum of squared residuals.

But why "least squares"?

Computationally, OLS is "easy": computers can perform it in a fraction of a second, and you could do it by hand if necessary (albeit slowly). Comparatively, OLS estimates are not only unbiased but also the most efficient in the class of linear unbiased estimators (more on this later). Conceptually, minimizing squared residuals is intuitive.

If one were to minimize the sum (or average) of residuals, the positive and negative residuals would only serve to cancel one another out. Consequently, we might end up with really inaccurate predicted values. Thus, squaring penalizes "big" mistakes (large $e_i$) more heavily than small ones.

How does OLS work?

Suppose you have a linear regression model with one independent variable:

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

The OLS estimates of $\beta_0$ and $\beta_1$ are the values that minimize:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2$$

How to determine this minimum? Differentiate w.r.t. $\beta_0$ and $\beta_1$, set to zero, and solve. At the end of the day, the solutions to this minimization problem are:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

But where did these equations come from?

First, define your residual as $e_i = Y_i - \hat{Y}_i$, so that $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + e_i$. Next, set up a minimization problem for the linear regression model with one independent variable. That is, let $\hat{\beta}$ be defined as the set of estimators that solve

$$\min_{\hat{\beta}_0, \hat{\beta}_1} \; \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right)^2$$

As usual, this involves taking the first derivatives with respect to the betas and setting the two resulting equations equal to zero. Once we have the two first-order conditions (FOCs), we solve for the values of the betas where both conditions hold.
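Before grinding through the calculus, it may help to see the solution formulas at work. Below is a minimal sketch in Python; numpy, the data values, and the variable names are my assumptions for illustration, not part of the lecture. It computes $\hat{\beta}_0$ and $\hat{\beta}_1$ from the summation formulas and checks them against a library fit.

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations in X
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

# Intercept: mean of Y minus slope times mean of X
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)

# Sanity check: numpy's built-in degree-1 polynomial fit solves the
# same least squares problem and should return identical values
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)
```

The check at the end is the point: the two summation formulas are not a special trick but exactly what any least squares routine computes for one regressor.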
In this case, we need to apply the chain rule: if $y$ is a differentiable function of $u$ and $u$ is a differentiable function of $x$, then:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$

Here, $y$ and $u$ are the "outside" and "inside" functions:

$$y = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} u_i^2, \qquad u_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$

Applying the chain rule for $\hat{\beta}_0$:

$$\frac{d \sum_{i=1}^{n} e_i^2}{d \hat{\beta}_0} = \sum_{i=1}^{n} \frac{d u_i^2}{d u_i} \cdot \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0} = \sum_{i=1}^{n} 2 \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) \cdot \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0}$$

Solve for the partial derivatives:

$$\frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_0} = -1, \qquad \frac{\partial (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)}{\partial \hat{\beta}_1} = -X_i$$

Now, substitute in the partial derivatives. The two FOCs are:

$$1.) \; -2 \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0 \qquad 2.) \; -2 \sum_{i=1}^{n} X_i \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0$$

The FOCs actually tell us a few useful things: 1.) the mean of the residuals is going to be equal to zero; 2.) the covariance of the residuals and $X$ is going to be equal to zero.

Finally, solve for the values of the coefficients. For $\hat{\beta}_0$:

$$-2 \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0 \;\Rightarrow\; \sum_{i=1}^{n} Y_i = n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} X_i \;\Rightarrow\; \hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} Y_i - \hat{\beta}_1 \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Great, but what about the slope coefficient? For $\hat{\beta}_1$, substitute $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$ into the second FOC:

$$\sum_{i=1}^{n} X_i \left( Y_i - \bar{Y} + \hat{\beta}_1 \bar{X} - \hat{\beta}_1 X_i \right) = 0 \;\Rightarrow\; \sum_{i=1}^{n} X_i (Y_i - \bar{Y}) - \hat{\beta}_1 \sum_{i=1}^{n} X_i (X_i - \bar{X}) = 0 \;\Rightarrow\; \hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}$$

Great, but what do we do with this? Next, we need to work some mathemagics:

1.) Expand the previous expression:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i}{\sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i}$$

2.) Note that $\sum_{i=1}^{n} X_i = n \bar{X}$, so that

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}}{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}$$

3.) Finally, make use of the facts that $\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n \bar{X} \bar{Y}$ and $\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n \bar{X}^2$, giving

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

An important interpretation: the estimated coefficients are simply weighted averages of $Y$:

$$\hat{\beta}_1 = \sum_{i=1}^{n} \left[ \frac{X_i - \bar{X}}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] Y_i, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = \sum_{i=1}^{n} \left[ \frac{1}{n} - \frac{\bar{X} (X_i - \bar{X})}{\sum_{j=1}^{n} (X_j - \bar{X})^2} \right] Y_i$$

Thus, each is a special kind of sample mean.

A second important interpretation: dividing the numerator and denominator of the expression for $\hat{\beta}_1$ by $n - 1$ shows that the slope estimate is the sample covariance of $X$ and $Y$ divided by the sample variance of $X$.

Another way of "seeing" OLS

For the basic regression model $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, picture a Venn diagram: one circle, $y$, represents variation in $Y$; another circle, $x$, represents variation in $X$. Overlap between the two (in green) represents variation that $y$ and $x$ have in common.

OLS in practice

Knowing the summation formulas for OLS estimates is useful for understanding how it works. But once we add more than one independent variable, these formulas become problematic. In practice, we rarely do least squares calculations by hand; this is why God invented computers. Time for an example via Stata.

An example

Suppose we are interested in how an NHL hockey player's salary varies with the number of points they score. That is, variation in salary is related to variation in points scored (with causality presumably running from the latter to the former). The dependent variable ($Y_i$) will be SALARY; the independent variable ($X_i$) will be POINTS.

A few helpful steps:
1.) open NHL 1601.xlsx
2.) copy all columns
3.) open Stata > type "edit" in command window

Stata now does the heavy lifting; your results should look like the following.

[Stata regression output: SALARY on POINTS]

What the results mean

The column labeled "Coef." gives the least squares estimates of the regression coefficients. So our estimated model is:

SALARY = 365,739.80 + (40,546.89)*POINTS

Players who scored zero points earned $365,739.80 on average.
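The derivation above also made two testable claims: OLS residuals average to zero, and they are uncorrelated with $X$. Here is a quick sketch checking both facts, along with the covariance-over-variance interpretation of the slope. As before, numpy and the toy data are my assumptions; this is not the NHL file.

```python
import numpy as np

# Toy data (hypothetical, for illustration only)
x = np.array([10.0, 25.0, 40.0, 55.0, 70.0, 85.0])
y = np.array([0.5, 1.1, 1.9, 2.2, 3.1, 3.6])

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
residuals = y - (beta0 + beta1 * x)

# FOC 1: the residuals sum (and hence average) to zero
print(np.isclose(residuals.sum(), 0.0))

# FOC 2: the residuals are uncorrelated with X
print(np.isclose(np.sum((x - x.mean()) * residuals), 0.0))

# Second interpretation: slope = sample Cov(X, Y) / sample Var(X)
print(np.isclose(beta1, np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)))
```

All three checks hold by construction, not by luck: they are restatements of the two first-order conditions that defined the estimates in the first place.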
For each point scored, players were paid an additional $40,546.89, with the "average" 100-point player being paid $4,420,428.80.

The column labeled "Std. Err." gives the standard error (that is, the square root of the sampling variance) of the regression coefficients. Remember: the OLS estimates are a function of the sampled data and, therefore, RVs. Every RV has a sampling distribution, so "Std. Err." tells us how spread out the estimates are.

In particular, the column labeled "t-Statistic" is a test statistic for the null hypothesis that the corresponding regression coefficient is zero. The column labeled "Prob." is the p-value associated with this test; it is the probability of making a Type I error if we reject the null. We can ignore the rest for the time being.

Now, add a player's age and years of NHL experience to our model.

[Stata regression output: SALARY on POINTS, AGE, and YEARS_EXP]

What the results mean

First thing to notice: the estimated coefficient on POINTS and the intercept have changed; this is because they now measure different things. In our original model, the intercept (_cons) measured the average SALARY when POINTS was zero ($365,739.80). That is, the intercept originally estimated E(SALARY | POINTS = 0), which put no restriction on the values of any other variables.

In the new model, the intercept measures the average SALARY when POINTS, AGE, and YEARS_EXP are all zero ($309,516.30). That is, the new intercept estimates E(SALARY | POINTS = 0, AGE = 0, YEARS_EXP = 0). The point is that what your estimated regression coefficients measure depends on what else is included in the model.

Originally, the coefficient on POINTS was an estimate of the marginal effect of POINTS on SALARY:

$$\frac{d(\text{SALARY})}{d(\text{POINTS})} = 40{,}546.89$$

Now, the coefficient on POINTS measures the marginal effect of POINTS on SALARY, holding AGE and YEARS_EXP constant:

$$\frac{\partial(\text{SALARY})}{\partial(\text{POINTS})} = 35{,}150.35$$
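For those who want to replicate the multiple-regression step outside Stata, here is a sketch of the same setup in Python. The variable names mirror the lecture's, but the data values are invented toy numbers, so the estimates will not reproduce the coefficients above.

```python
import numpy as np

# Hypothetical toy data standing in for the NHL file
points    = np.array([12.0, 35.0, 48.0, 60.0, 72.0, 90.0])
age       = np.array([21.0, 24.0, 27.0, 28.0, 30.0, 34.0])
years_exp = np.array([1.0, 4.0, 6.0, 9.0, 11.0, 12.0])
salary    = np.array([0.8e6, 1.5e6, 2.4e6, 3.1e6, 4.0e6, 5.2e6])

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones_like(points), points, age, years_exp])

# Least squares fit: minimizes the sum of squared residuals,
# exactly as in the one-variable case
coefs, *_ = np.linalg.lstsq(X, salary, rcond=None)
intercept, b_points, b_age, b_years = coefs
print(intercept, b_points, b_age, b_years)
```

Because AGE and YEARS_EXP now sit in the design matrix, the coefficient on POINTS is the partial effect holding the other two constant, which is precisely why it differs from the simple-regression estimate.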