Economics of Antitrust: Econometrics Lecture Notes

These lecture notes are intended to refresh your memory on the basic tools of econometrics that will be necessary in order to complete the empirical exercise. It is assumed that you've had some exposure to basic probability, statistics, and econometrics. This lecture will not cover the formal proofs involved in those subjects. Rather, the focus will be on outlining the intuition behind the key assumptions and theorems. The goal is for you (1) to be able to type the right commands into the computer, (2) to understand why typing in those commands might be a good idea (i.e., to remember the key theorems that say good things happen when you do the right procedures), and (3) to understand the key assumptions underlying the theorems well enough to be able to evaluate their validity in an application.

Since only some of you may be comfortable with matrix notation, I've stated the assumptions and results both with and without using such notation. In order to keep the notation simple, when I'm not using matrix notation, I've limited attention to the univariate case. Be assured that all of this analysis generalizes to the case of multivariate regression, even if you can't follow the matrix notation.

Preliminaries (Definitions and Notation)

We start with some data. Let N equal the number of observations, with observations indexed by i = 1, 2, … N. Each observation contains the value of each variable for that observation. Our goal is to use data on some variables (called regressors, right-hand-side (RHS) variables, exogenous variables, independent variables, or explanatory variables) to explain variation in another variable (called the left-hand-side (LHS) variable, the endogenous variable, the dependent variable, or the explained variable). Let K equal the number of explanatory variables, with these variables indexed by j = 1, 2, … K. Let Xbar denote the mean value of the Xi and let Ybar denote the mean value of the Yi.

Note: The explanatory variables include the constant. For example, in the univariate model Yi = β0 + Xi β1 + εi, there are two explanatory variables: Xi, whose coefficient is β1, and a constant "variable" that takes the value 1 for all observations, whose coefficient is β0.

Note: Even though I've implied otherwise above, "exogenous" is not a synonym for "explanatory," and "endogenous" is not a synonym for "explained." As will be explained in more detail below, it is possible for an explanatory variable to be endogenous. Exogeneity of the RHS variables is an assumption that helps to justify OLS, not a definition.

Ordinary Least Squares (OLS)

We want to find the effect of X on Y. One way to do this is via ordinary least squares, illustrated in the following graph.

[Figure: a scatter of data points with a fitted line; Y on the vertical axis, X on the horizontal axis.]

The vertical distance between data point i and the line (Yi - B0 - Xi B1) is called the residual. What OLS does is pick the slope and intercept of the line that minimize the sum of squared residuals.

BOLS = argmin over (B0, B1) of Σ_{i=1}^{N} (Yi - B0 - Xi B1)²

BOLS = argmin over B of (Y - XB)'(Y - XB)

To solve for the OLS estimator, you take the derivative of the above expression with respect to each B, set the derivatives equal to zero, and solve. (The second-order condition is automatically satisfied in these problems, so any solution is guaranteed to be a minimum. Under the identification assumption described below, a unique solution always exists.)
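To make this concrete in the univariate case, here is the algebra sketched out. Writing S for the sum of squared residuals, the two first-order conditions are:

dS/dB0 = -2 Σ_{i=1}^{N} (Yi - B0 - Xi B1) = 0
dS/dB1 = -2 Σ_{i=1}^{N} Xi (Yi - B0 - Xi B1) = 0

The first equation says that the fitted line passes through the point (Xbar, Ybar), i.e., B0 = Ybar - Xbar B1; substituting this into the second equation and solving gives the slope formula below.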
The solution is:

B1 = [ Σ_{i=1}^{N} (Xi - Xbar)(Yi - Ybar) ] / [ Σ_{i=1}^{N} (Xi - Xbar)² ]

B0 = Ybar - Xbar B1

B = (X'X)^(-1) X'Y

Note that because Y (and possibly X) is random, the OLS estimates, which are functions of Y, will also be random variables. As random variables, they have means (expected values) and variances. Since our estimates are random, we would like them to have some good properties, e.g., we would like the expected value to be equal to, or close to, the true value of the parameter, and we would like the variance of the estimates to be small. The next section deals with the assumptions under which OLS can have nice properties like these.

In STATA, running OLS is easy: "regress Y X" runs the regression of Y on X and a constant.
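As a quick check that the formulas above are what "regress" computes, here is a sketch of the univariate calculation done by hand in STATA (this assumes variables Y and X are already in memory; the variable and scalar names num, den, top, B0, and B1 are mine):

summarize X
scalar Xbar = r(mean)
summarize Y
scalar Ybar = r(mean)
generate num = (X - Xbar) * (Y - Ybar)
generate den = (X - Xbar)^2
summarize num
scalar top = r(sum)
summarize den
scalar B1 = top / r(sum)
scalar B0 = Ybar - Xbar * B1
display "B1 = " B1 "   B0 = " B0

The displayed values should match the coefficients reported by "regress Y X".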
Justifications for OLS

Why might OLS be a good idea? Why not minimize the horizontal sum of squared residuals, the perpendicular sum of squared residuals, or the sum of the absolute values of the residuals? Why is minimizing the vertical sum of squared residuals better than these alternatives? The real reason why we like OLS is that, under a wide range of different assumptions, OLS has various "good" properties. Exactly what "good" properties it has depends on exactly what assumptions you make. For the purposes of this exercise, I'd like to focus on the Gauss-Markov theorem, which is one of many theorems that starts with some assumptions about the random process generating the data and proves that OLS has some nice properties.

A1: Linearity

Yi = β0 + Xi β1 + εi
Y = Xβ + ε

Note that this assumption requires linearity in the parameters (β), and not necessarily in the variables (X). If your model has a nonlinear variable (e.g., Yi = β0 + ln(Xi)β1 + εi), simply define a new variable (e.g., Wi = ln(Xi)) such that the model will be linear in the new variable (e.g., Yi = β0 + Wi β1 + εi). Similarly, εi can be defined as the difference between Yi and β0 + Xi β1. We call εi the disturbance term. It captures all of the variables that we do not observe (and therefore cannot account for) that affect Yi, conditional on Xi.

A2: There is "enough" variation in the explanatory variables (The Identification Condition)

In the univariate case, this assumption requires that the data contain at least two different values for X. In the multivariate case, this assumption requires that X be an N by K matrix with rank K. This not only means that there must be variation in each individual variable (other than the constant), but also that the variables must be linearly independent. For example, if income is always exactly 10% of wealth, it would be impossible for a regression to estimate separate coefficients for the effects of income and wealth on consumption. Here's a picture that shows what goes wrong if this assumption is violated:

[Figure: data with no variation in X. ANY line whose predicted Y equals the mean value of Y at the single observed value of X will minimize the residual sum of squares.]

What happens if you attempt to get a computer to run a regression on data that has this problem? The exact details vary by which software package you're using, but the results will typically immediately reveal that you have a problem. For example, if STATA encounters this problem, it will drop variables from the regression until the problem goes away. In the above example, STATA will drop the only variable (X), and will simply report a horizontal line at the mean value of Y.

A3: Exogeneity

E[εi | X1, X2 … XN] = 0
E[ε | X] = 0

If assumption 3 is satisfied, we say that X is exogenous. If it is not satisfied, we say that X is endogenous. Note that E[ε | Y] does not equal zero (it equals Y - β0 - β1 Xbar), so we say that Y is endogenous.

How can this assumption fail? If there are omitted variables (reflected in the disturbance term) that are correlated with X, then the expected value of the disturbance term, conditional on X, will be a function of X, and hence will not equal zero. Another example: if we omit the constant β0, and in fact the true β0 does not equal zero, then the omitted constant is absorbed into the disturbance, so E[εi] = β0, which is not zero.

Theorem 1: Under A1-A3, OLS is unbiased. An estimator is unbiased if E[B] = β.

A4: Homoscedasticity

Var[εi | X1, X2 … XN] = σ²
Var[ε | X] = σ²I

This assumption requires that the variance of the disturbance be a constant that does not vary with the explanatory variables.

Theorem 2 (Gauss-Markov Theorem): Under A1-A4, OLS is BLUE (Best Linear Unbiased Estimator).

What does this mean? Because the disturbances are random variables, any estimator based in part on the observed values of Y will also be a random variable. Like all random variables, such estimators have a mean (i.e., an expected value) and a variance. An estimator is unbiased if its expected value equals the true value of the parameter. An estimator, B, is linear if it is a linear function of Y (it can be nonlinear in X).

B = f1(X1, X2, … XN)*Y1 + … + fN(X1, X2, … XN)*YN
B = f(X)Y

One estimator is "better" than another if it has lower variance. Thus, the Gauss-Markov theorem says that, of all possible estimators that are both linear and unbiased, the OLS estimator has the lowest variance.

Some statisticians and econometricians object to calling a minimum variance estimator the "best" estimator. They point out that there may be many other criteria for deciding on the best estimator. For example, you might want to use the estimator with the least mean-square error (E[(B - β)²]), and accept some small bias in return for lowering the variance. These people prefer the phrase "MVLUE" ("minimum variance linear unbiased estimator") to "BLUE." [ASK ME ABOUT A FUNNY STORY ABOUT "BLUE" AT THIS POINT.]

Note: In calculating standard errors, most computer programs (including STATA) assume A1-A4 by default. If you only want to assume A1-A3, you can usually set an option to do so (in STATA, you use the "robust" option, e.g., reg Y X, robust). Of course, under A1-A3, OLS is still unbiased, but it is no longer guaranteed to be minimum variance; under these assumptions there may be better estimators available.

A5: Normality

εi ~ N(0, σ²)
ε ~ N(0, σ²I)

Theorem 3 (based on the Rao-Blackwell Theorem): Under A1-A5, OLS is the minimum variance unbiased estimator. This theorem tells us that, under the stronger assumption of normality, OLS not only has lower variance than all other estimators that are unbiased and linear, it also has lower variance than all nonlinear unbiased estimators.

Theorem 4 (based on the Cramer-Rao Inequality): Under A1-A5, the OLS estimator is a maximum likelihood estimator, and is consistent, asymptotically normally distributed, asymptotically efficient, and invariant to transformations of the parameter. This theorem gives us an entirely different approach to justifying the use of OLS. An estimator B is consistent if it approaches the true value β as the number of observations increases. An estimator is asymptotically normally distributed if its distribution approaches a normal distribution as the number of observations increases. An estimator is asymptotically efficient if, in the limit as the number of observations goes to infinity, its variance is smaller than that of any other consistent and asymptotically normal estimator. An estimator B is invariant to a transformation of the parameter if the corresponding estimator of f(β) equals f(B).
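Before leaving OLS, here is a minimal STATA simulation sketch showing these properties at work. The sample size, seed, and parameter values (β0 = 2, β1 = 3) are mine, chosen only for illustration; the disturbance is drawn to satisfy A1-A5. (If your copy of STATA predates the rnormal() function, invnorm(uniform()) draws a standard normal instead.)

clear
set obs 1000
set seed 12345
generate X = rnormal(0, 1)
generate eps = rnormal(0, 1)    // independent of X (A3), constant variance (A4), normal (A5)
generate Y = 2 + 3*X + eps      // linear model (A1) with true β0 = 2, β1 = 3
regress Y X                     // estimated constant and slope should be close to 2 and 3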
Simultaneous Equations and the Problem of Endogenous Variables

Now I'd like to focus on one key assumption in more detail.

A3: Exogeneity

E[εi | X1, X2 … XN] = 0
E[ε | X] = 0

Under many circumstances, including the empirical exercise, this is an unreasonable assumption, and OLS gives biased estimates. In particular, A3 is violated whenever the disturbance is correlated with one of the explanatory variables, since this means that the expected value of the disturbance, conditional on the explanatory variables, will be a nonzero function of the explanatory variables, and thus cannot equal zero.

Short example: Suppose we want to find out the effect of schooling on income. Note that "ability" may affect both schooling (smarter people get more education) and income (smarter people earn more). We can attempt to control for this by adding RHS variables that proxy for ability. But, as long as the RHS variables we include do not perfectly capture ability, there will be some residual unobserved ability that is reflected in the disturbance term. If the residual component of ability affects only income and not schooling, then we have no problem and OLS will work fine. But if the residual component of ability still affects both schooling and income, then schooling will be correlated with the disturbance, and A3 will be violated. In this case (because the correlation is positive), the OLS estimate of the effect of schooling is biased upward.

Longer example, closely related to the empirical exercise: if both the dependent variable and an explanatory variable are actually determined by a system of simultaneous equations, then A3 will almost certainly be violated. To see this, we're going to walk through a simple supply and demand example.

[Equation 1a: Demand] Q = a - b*P + εd
[Equation 2a: Supply] P = c + d*Q + εs

Suppose that εd ~ N(0, σ²d) and εs ~ N(0, σ²s), and that εd and εs are independent. We would like to estimate [1a]. Is A3 satisfied? The answer is no. To see this, suppose that there is a positive shock to demand (e.g., the price of a complement fell). This means that consumers are willing to purchase a greater quantity at each price than before, i.e., εd increased. What happens? In equilibrium, when demand shifts out, both price and quantity increase. Thus, we have established that changes in εd are positively correlated with changes in P, via movement along an upward sloping supply curve. But this means that P cannot be exogenous, and an OLS estimate of b will be biased toward zero (i.e., OLS will estimate a demand curve that is more inelastic than the true demand curve).

Here's another way of thinking about this. If we were to simply regress Q on P, would we be estimating a demand curve or a supply curve? It might be tempting to say that we would be estimating a demand curve, since above we have written demand with Q as a function of P, and supply with P as a function of Q. But this is incorrect, since we could have written the equations the other way around. As illustrated in the graphs below, the answer depends on what is changing between observations: supply and/or demand. If the demand curve is constant across the observations (εd = 0), then movements in the supply curve (variation in εs) will trace out the demand curve. Conversely, if the supply curve is constant across observations (εs = 0), then movements in the demand curve will trace out the supply curve. On the other hand, we might expect that typically both supply and demand are subject to shocks that vary across the observations. In that case, what we estimate when we regress Q on P is some mixture of the two curves, weighted toward whichever curve moves around the least.

[Figure: three panels showing the observed data when only supply shifts, when only demand shifts, and when both shift. Note: P is on the horizontal axis and Q is on the vertical axis, exactly the opposite of the usual layout.]
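The size of this bias can be seen in a simulation sketch continuing the style used above (the parameter values a = 10, b = 1, c = 1, d = 0.5 and unit shock variances are mine; the two generate lines are the reduced forms obtained by solving [1a] and [2a] simultaneously for Q and P):

clear
set obs 1000
set seed 12345
generate eps_d = rnormal(0, 1)
generate eps_s = rnormal(0, 1)
* Reduced forms of Q = 10 - P + eps_d and P = 1 + 0.5*Q + eps_s:
generate Q = (10 - 1 - eps_s + eps_d) / 1.5
generate P = (1 + 5 + 0.5*eps_d + eps_s) / 1.5
regress Q P    // slope comes out around -0.4, biased toward zero relative to the true -b = -1

With these parameter values the OLS slope converges to (d*σ²d - b*σ²s)/(d²*σ²d + σ²s) = -0.4, so the estimated demand curve is much more inelastic than the true one.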
Instrumental Variables (IV)

What can we do about this problem? One possible solution is called instrumental variables regression. Let's start by supposing that we found some way to decompose P into Pexog + Pendog in such a way that Pexog is exogenous (E[εd | Pexog] = 0), with Pendog defined to equal P - Pexog (we know that Pendog must be endogenous and positively correlated with εd, since P is positively correlated with εd and Pexog is not). We can then rewrite the demand equation as follows:

[Equation 1b] Q = a - b*Pexog + (εd - b*Pendog)

Note that if we treat Pexog as the explanatory variable and (εd - b*Pendog) as the disturbance, we have an equation that satisfies A3. Thus if we regress Q on Pexog, we will get unbiased estimates of a and b. Furthermore, if Pexog is positively correlated with P, this regression will have some power (in other words, if Pexog is uncorrelated with P, the expected variance of the estimator is infinite). Thus, we can solve the endogeneity problem if we can construct a variable that is uncorrelated with the disturbance, but still moves with the endogenous explanatory variable.

How could we construct such a variable? Let's suppose that we have data on another variable, Z, which gives the price of raw materials used to produce the good. Obviously Z should appear in the supply equation, but probably not in the demand equation.

[Equation 1a: Demand, restated] Q = a - b*P + εd
[Equation 2b: Supply] P = c + d*Q + e*Z + εs

If Z is independent of demand shocks, it will have no effect on demand except through its effect on price, and E[εd | Z] = 0 (Z is exogenous). This fact (E[εd | Z] = 0) is the key assumption that we need to be true in order to solve the endogeneity problem.

Now what happens if we regress P on Z (P = f + g*Z), and then use the results of this regression to get estimates of P for each observation (Phati = fOLS + gOLS * Zi)? Since Z is exogenous, and Phat is a linear function of Z (and some constants), Phat is exogenous (E[εd | Phat] = 0). [Quick proof: Suppose Phat were not exogenous, i.e., that E[εd | Phat] = h(Phat) for some function h that is not identically zero. Then it must be the case that E[εd | Z] = h(fOLS + gOLS * Z), which is not identically zero. But this contradicts the assumption that Z was exogenous.]

Thus, if we use Pexog = Phat, we have solved the endogeneity problem. Phat can be constructed from our exogenous variable Z. Since Z is one of the determinants of P, Phat will be positively correlated with P. Since Z is exogenous, Phat will also be exogenous. Thus, a regression of Q on Phat will yield unbiased estimates of the structural coefficients a and b that we are interested in. A sketch of this two-step procedure appears after the note below.

Note 1: The above argument relied heavily on the following assumptions.
a. E[εd | Z] = 0.
b. Z does not appear in the equation we are estimating (the demand equation). [If Z did appear in the demand equation, we could not separate Z's direct effect on Q from its indirect effect operating through P.]
c. Z and P are correlated (either positively or negatively).
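As promised, here is a sketch of the two-step procedure, continuing the simulation above with a cost shifter Z in the supply equation (the value e = 1 is mine; the generate lines are the reduced forms obtained by solving [1a] and [2b]):

clear
set obs 1000
set seed 12345
generate Z = rnormal(0, 1)
generate eps_d = rnormal(0, 1)
generate eps_s = rnormal(0, 1)
* Reduced forms of Q = 10 - P + eps_d and P = 1 + 0.5*Q + 1*Z + eps_s:
generate Q = (10 - 1 - Z - eps_s + eps_d) / 1.5
generate P = (1 + 5 + Z + 0.5*eps_d + eps_s) / 1.5
regress P Z         // step 1: regress the endogenous variable on the instrument
predict Phat        // fitted values Phat = fOLS + gOLS*Z
regress Q Phat      // step 2: the slope should now be close to the true -b = -1
ivreg Q (P = Z)     // same point estimates in one command

One caution: the standard errors from the by-hand second stage are not valid, because they ignore the fact that Phat was itself estimated; the ivreg command computes the correct ones.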
Note 2: The instrumental variables approach is much more general than this simple example indicates. In particular, you can use IV regression in models with some exogenous explanatory variables that appear in the equation of interest, multiple endogenous explanatory variables, and multiple instrumental variables. In general, the setup is as follows.

[Equation of interest] Yi = β0*1 + β1*Xexog1 + … + βK*XexogK + βK+1*Xendog1 + … + βK+M*XendogM + εd

[Instrumental variables] E[εd | Zr] = 0 for r = 1, 2, … R.

Note that the assumptions a-c above must be satisfied for each instrumental variable Zr. Furthermore, you need at least one instrumental variable for each endogenous explanatory variable (i.e., R >= M). [There is one more assumption (the rank condition). I'm not telling it to you because (1) I don't know how to express it without using linear algebra, and (2) it is almost impossible for the rank condition not to be satisfied when there is at least one instrument for each endogenous explanatory variable. If the rank condition is not satisfied, STATA will let you know.]

The general procedure is as follows:

Step 1a: Regress each of the endogenous explanatory variables on all of the exogenous variables (this latter group includes the constant, the exogenous explanatory variables, and the instrumental variables).
Step 1b: Using the regression coefficients from step 1a, construct a prediction of each endogenous explanatory variable.
Step 2: Regress the dependent variable of interest on the explanatory variables, replacing each endogenous explanatory variable with its prediction from step 1b. The regression coefficients will be unbiased estimates of the true coefficients in the structural equation of interest.

Because of this two-step procedure, IV regression is sometimes called Two-Stage Least Squares (2SLS).

Note 3: You can do an IV regression by hand, using one command for each step. But STATA and most other software packages can do it automatically. Here's how to get STATA to do it:

ivreg Y Xexog1 … XexogK (Xendog1 … XendogM = Z1 … ZR)

In the case of the empirical exercise, you will have two endogenous explanatory variables, two instrumental variables, and at least one exogenous explanatory variable. Thus, you will enter commands that look something like this:

ivreg q1 income (p1 p2 = aac1 aac2)
ivreg q2 income (p1 p2 = aac1 aac2)

For details, see the assignment.
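One last practical note of mine, not part of the original assignment: in newer versions of STATA, the ivreg command has been renamed ivregress, which takes the estimator name as its first argument but otherwise uses the same layout, so the commands above would look something like this:

ivregress 2sls q1 income (p1 p2 = aac1 aac2)
ivregress 2sls q2 income (p1 p2 = aac1 aac2)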