ECON 11020 FALL 2020
INTRODUCTION TO ECONOMETRICS
PABLO A. PEÑA
UNIVERSITY OF CHICAGO
pablo@uchicago.edu
1 Introduction
This manuscript constitutes the notes for the course. Why not just use an Econometrics
textbook? Because I want to make my teaching more effective and your learning more efficient.
In my experience, textbooks put too much emphasis on how the methods work instead of on how
those methods can be used. To use an analogy, textbooks are like a book about microwave ovens
explaining how electricity is transformed into microwaves, and how microwaves excite water
molecules inside food. A person using a microwave oven doesn’t need to know any of that to make good use of it. If the person knows what materials not to put inside the oven, and the approximate relationship between cooking times and power, she can make the most out of the
oven safely and efficiently. The point here is that econometrics textbooks explain way more about
the methods than a regular practitioner needs to know in order to put those methods to good use.
These notes emphasize how regressions can be used.
There is a second argument against using econometric textbooks in this course. Since they are
written by academics, textbooks are biased towards the types of problems academics care the
most about—namely, establishing causal relationships. The type of questions academics
passionately pursue are only a subset of the type of questions practitioners are interested in. In
my experience, practitioners try to establish causal relationships less than 10% of the time. So,
these notes give a more general perspective of what can be done with regressions.
A third argument for rethinking how econometrics is taught to practitioners is the growth in computing capacity. Back in the day—so I am told—running a regression was costly. A person had to punch holes in a card to input data into a computer and wait for a chance to
use a computer. People thought hard before running a regression. The fact that now we can draw
samples of our data thousands of times in a matter of minutes—if not seconds—has made feasible
the use of newer methods that are free from many unverifiable assumptions made in classic
theory. Computing power not only made empirical analysis easier. It also changed what we think
are more appropriate methods in practice. These notes will discuss some of those newer methods.
A fourth argument is the format. Textbooks require the reader to know matrix algebra and
probability theory. That depletes the attention of students. Keeping an eye on matrices or random
variable probability distributions prevents them from focusing on what really matters. Here we
keep those definitions to a minimum.
Lastly, the title of these notes could be “Empirical Analysis for Business Economics.” Each
part of that alternative title would illustrate an important aspect of what you will learn.
“Empirical” means we will refer to data collected in the real world. “Analysis” refers to the
statistical tools we will use. “Business” means we will consider practical questions of the kind
real organizations face out there. Lastly, “Economics” means that throughout we will keep our
perspective as economists—we are neither mathematicians nor statisticians. In sum, you will
learn how to apply statistical tools to data in order to answer relevant questions from an economic
perspective.
1.1 Primacy of the question
A frequent answer to many questions in this course is “it depends on the question you want
to answer.” As a general rule, the methods we use must fit the question at hand. The question
should be carefully examined before jumping to figuring out how to answer it. In my experience, pounding on a question is essential. Paraphrasing it multiple times and thinking about what it is and what it isn’t may be of great help. That is what I mean by the primacy of the question.
Assume there is an arm-wrestling tournament with 149 participants. When two contestants face each other, the winner advances and the loser is out. What is the total number of matches in the tournament? Perhaps you had the impulse—like me—to create in your mind a bracket structure, adding rounds to reach a number close to 149. You can proceed that way and find the solution. But it’ll take time and it won’t be general. What if the number of participants is 471 or 7,733? Carefully examining the setup of the question may give you the path of least resistance to the answer. Here is a crucial piece of information: the tournament ends when all but one of the participants are out. In every match, exactly one participant is out. If there are N participants, there must be a total of N − 1 matches. This is a quick and general answer. It may take time to figure it out, but
once you do it, you can apply it more generally, to any number of participants.
In our context, we will always refer back to the question we have in mind. Sometimes the
question is impossible to answer. Some other times the answer is obvious, and no analysis is
needed. Most frequently, the question requires some polishing to be properly (or at least
reasonably) answered.
1.2 The cake structure
Our departure point is the origin of the term regression and the method it represents. Once we
establish the general idea of what a regression is, we will proceed according to what we can call
the three-layer cake structure. The first layer is the mathematics of regression, and it is about how
we compute a regression. It has the most formulas. The second layer is the probability content of
the regression results. In this layer we will talk about why we expect results to vary and what
information they convey. The third layer is the economics of regressions. We will learn the uses
and interpretations of the results.
The economics of regressions (the third layer) can be split into three slices that correspond to
three distinct uses: descriptive, predictive and prescriptive. In the descriptive use, regressions are
used to measure relationships accounting for other factors. This is useful when trying to judge to
what extent two things move together or not, or when comparing averages under equal circumstances. The predictive use is about knowing what to expect. We will explain how
predictions are different from forecasts. The third use is prescription, and we will decompose it
into the most common methods used by practitioners: randomized control trials, regression
discontinuity designs, and difference-in-differences.
2 Regression
A fascinating question in natural and social sciences is the extent to which parental traits are
transmitted to children. In the late 19th century, Francis Galton worked on this topic. He
conducted surveys and collected data to analyze the relationship across many variables. One of
them was height. Using families with adult children, Galton computed the “mid-parent height”
(the average height of mother and father) and plotted it against the height of children. A stylized
version of the chart he produced is below. The horizontal axis represents parental height. The
vertical axis represents children’s height. The first thing to note is that there is a cloud of points.
The second is that there seems to be a positive relation. On average, taller parents have taller
children, and shorter parents have shorter children. How can we summarize this relationship?
Galton modelled the height of children as a linear function of parental height plus an error term. If $y_i$ and $x_i$ denote the heights of person $i$ and her parents, respectively, then Galton assumed $y_i = \alpha + \beta x_i + \varepsilon_i$. Galton came up with a straight line that best fitted the cloud of points. If the straight line has a positive slope, it means the relation is positive (tall parents have tall children). If the slope is negative, it means the relation is negative (tall parents have short children). Lastly, if the slope is zero, then tall and short parents have children of similar stature. The graph below depicts the line that best fits Galton’s data (in blue), and also a line with a slope of one as a benchmark. In the next section, we will discuss at length how Galton came up with that line. Put very simply, we pick the intercept and the slope that minimize the sum of the squared vertical distances between the points in the cloud and the line.
Galton found a positive relationship between the height of parents and children but didn’t
stop there. After all, it is evident that tall parents tend to have tall children. What interested Galton
the most was whether the slope of the line he produced was greater or smaller than one. If the
slope is greater than one it means differences in height across parents become even larger
differences in height across their children. Alternatively, if the slope is smaller than one,
differences in height across parents become smaller differences in the next generation.
Galton found that the slope is smaller than one. Therefore, having a tall ancestor doesn’t
matter much for the descendants. After a few generations, we expect the height of any family to
get closer to the mean. This process was called “regression toward mediocrity,” and now we call it
“regression to the mean.” The term regression was originally the description of this particular
result. It later became the name of the method used to find that result.
Today, regression is the workhorse of empirical economists. There is a wide variety of
regression models, but they all share the same essence. They all produce numbers that summarize
the relationships among variables in the data.
To analyze how regressions are used by practitioners, we will proceed according to our three-layer cake structure. The first layer is given by the mathematical aspects of regressions. Put bluntly, the mathematical layer has no probabilistic or economic content. This is very much all algebra. But first, a short introduction to how regressions look in practice.
3 The mathematics of regression
3.1 The basics
3.1.1 A cloud of points
Our starting point is data. Usually, we think of data in tables or spreadsheets. We can
generally think of any data set as being organized in “variables” and “observations.” For instance,
if we have the expenditures in a given month of a group of customers at an online store, the unit
of observation may be the customer, and each customer would constitute one observation. In
addition to the expenditures, the information we have for each customer may include number of
purchases, number of returns, shipping costs, as well as age of the customer, gender, and zip
code. Those features of behavior and customer traits would be our variables. Each variable could
take different values, which could be continuous (like amounts in dollars or age in days) or
categorical (like gender or age in five-year ranges).
In the case of Galton, each adult child constitutes an observation, and his or her height
together with the height of his or her parents constitute our two variables. For instance, if we
have:
Person        Height of the person    Height of the parents
Robert        1.82                    1.75
Anne          1.73                    1.79
Cristopher    1.78                    1.74
Laura         1.69                    1.70
Charles       1.80                    1.80
We can think more generally in terms of variables x and y, and the subscript i:
๐‘–
๐‘–=1
๐‘ฆ
๐‘ฅ
๐‘ฆ1 = 1.82
๐‘ฅ1 = 1.75
๐‘–=2
๐‘ฆ2 = 1.73
๐‘ฅ2 = 1.79
๐‘–=3
๐‘ฆ3 = 1.78
๐‘ฅ3 = 1.74
๐‘–=4
๐‘ฆ4 = 1.69
๐‘ฅ4 = 1.70
๐‘–=5
๐‘ฆ5 = 1.80
๐‘ฅ5 = 1.80
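To make the toy data concrete, here is a minimal sketch in Python of the five observations above stored as arrays (the variable names are just illustrative):

```python
import numpy as np

# Heights of five adult children (y) and their parents' heights (x), in meters
y = np.array([1.82, 1.73, 1.78, 1.69, 1.80])  # height of the person
x = np.array([1.75, 1.79, 1.74, 1.70, 1.80])  # height of the parents

# Each index i = 1, ..., 5 is one observation (Robert, Anne, Cristopher, Laura, Charles)
for i, (yi, xi) in enumerate(zip(y, x), start=1):
    print(f"i={i}: y_{i}={yi}, x_{i}={xi}")
```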
The variables in our data are not a chaotic mass of information. We usually have something in mind that provides them with structure. We usually think that one variable is a function of the other variables. For instance, Galton thought of children’s height as a function of parental height, or $y_i = f(x_i)$. In math, we usually plot the value of the function on the vertical axis and the argument of the function on the horizontal axis. Thus, in the chart of height we saw before, parental height is on the horizontal axis and children’s height is on the vertical axis.
For the purpose of understanding how regressions work, we will think of our data visually, as a cloud of points, like in Galton’s problem. The height of the points in the cloud represents the height of children, whereas the location of the points along the horizontal axis represents parental height. The height of children is a function of the height of parents.
3.1.2 A model
In general, in a regression model like
$$y_i = \alpha + \beta x_i + \varepsilon_i,$$
we refer to $y_i$ as the dependent variable or left-hand side variable. We refer to $x_i$ as the regressor, the explanatory variable, the independent variable, or the right-hand side variable. Lastly, we usually refer to $\varepsilon_i$ (the Greek letter epsilon) as the error term or idiosyncratic shock. Notice the subscript $i$, which denotes the observation. In contrast, $x$ and $y$ denote variables, whereas $x_i$ and $y_i$ denote the values that those two variables take in the case of observation $i$.
The error term catches anything else unaccounted for in the model. In the case of Galton’s analysis of height, for instance, it could include phenotypic differences across individuals or malnourishment of some children or parents.
In the model above, $y_i$ is expressed as a linear function of $x_i$ and an error term. Thus, $\alpha$ (the Greek letter alpha) is the intercept and $\beta$ (the Greek letter beta) is the slope. The slope is interpreted as the expected change in $y$ associated with a one-unit change in $x$. Mathematically, the slope is the partial derivative of $y$ with respect to $x$:
$$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(\alpha + \beta x) = \beta$$
At the same time, we can interpret the model in terms of what we would expect. Suppose that we are told the explanatory variable takes a value of $x'$. What is the corresponding value of the dependent variable that we should expect to observe? The answer is:
$$y' = \alpha + \beta x'$$
These two ways of interpreting the model (the partial derivative and the conditional expectation) are very useful and we will revisit them multiple times.
3.1.3 The minimization problem
In a nutshell, a regression simply fits a cloud of points with a straight line. To do that, it minimizes the sum of the squared vertical distances between the points in the cloud and the line. Notice that
it doesn’t minimize the vertical distance (we use the square) or the square of the distance (the vertical part is crucial). Mathematically, to find the straight line that minimizes the sum of the squared vertical distances to the points in the cloud, we set up the following problem:
$$\min_{a,b} \sum_{i=1}^{N} (y_i - a - b x_i)^2$$
We take first-order conditions with respect to $a$ and $b$:
$$\sum_{i=1}^{N} (y_i - a - b x_i) = 0$$
$$\sum_{i=1}^{N} (y_i - a - b x_i)\, x_i = 0$$
The first-order conditions above produce a linear system with two equations and two unknowns. We can easily find the solution. Let us introduce some useful nomenclature. The solutions are the regression coefficients, and we denote them with a “hat”: $\hat{\alpha}$ and $\hat{\beta}$. The fitted value of $y_i$ is:
$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$
Notice the hat is also on $\hat{y}_i$. The difference between the actual and fitted values of $y_i$ is the residual, which is also denoted with a hat:
$$\hat{\varepsilon}_i = y_i - \hat{y}_i$$
The residual $\hat{\varepsilon}_i$ is an estimate of the error term $\varepsilon_i$. It is convenient to establish the following identities:
$$y_i = \alpha + \beta x_i + \varepsilon_i = \hat{y}_i + \hat{\varepsilon}_i = \hat{\alpha} + \hat{\beta} x_i + \hat{\varepsilon}_i$$
Going back to our minimization problem, it is easy to show that the first-order conditions imply that $\sum_{i=1}^{N}\hat{\varepsilon}_i = 0$ and $\sum_{i=1}^{N}\hat{\varepsilon}_i x_i = 0$. In words, the residuals average zero and the covariance between the residuals and the explanatory variable is zero.
The first-order conditions can be arranged to provide formulas for the regression coefficients:
$$\hat{\beta} = \frac{cov(x,y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
Where ๐‘๐‘œ๐‘ฃ(๐‘ฅ, ๐‘ฆ) represents the covariance between ๐‘ฅ and ๐‘ฆ, ๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฅ) represents the variance of ๐‘ฅ,
and ๐‘ฅฬ… and ๐‘ฆฬ… are the averages of ๐‘ฅ and ๐‘ฆ, respectively. Notice the similarity between the regression
coefficient ๐›ฝฬ‚ and the correlation coefficient between ๐‘ฅ and ๐‘ฆ, which is usually denoted by ๐œŒ:
๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฆ)
๐›ฝฬ‚ = ๐œŒ√
๐‘ฃ๐‘Ž๐‘Ÿ(๐‘ฅ)
In other words, ๐›ฝฬ‚ is a re-scaled correlation coefficient. The factor for the re-scaling is a positive
number equal to the ratio of the standard deviation of ๐‘ฆ to the standard deviation of ๐‘ฅ.
Notice that the fitted value $\hat{y}_i$ can be interpreted as the expected value of $y$ conditional on $x$ taking a particular value, say $x = x_i$:
$$E[y \mid x = x_i] = \hat{\alpha} + \hat{\beta} x_i + E[\hat{\varepsilon}] = \hat{\alpha} + \hat{\beta} x_i = \hat{y}_i$$
In general, the regression coefficients can be interpreted as partial correlation coefficients (as in “partial” derivatives), and the fitted values can be interpreted as conditional expectations. In the case of Galton’s regression, $\hat{\beta}$ is interpreted as the difference in child height given a difference of one unit in parental height, and $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ is interpreted as the expected height of a child with parents of height $x_i$. The following chart summarizes these concepts.
The orange points represent our cloud. The green points are the fitted values. They lie on the regression line. The intercept of the line is the coefficient $\hat{\alpha}$. The slope is the coefficient $\hat{\beta}$. The residual is the difference between the actual values and the fitted values.
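As a sketch of how these formulas translate into computation, the following Python snippet estimates $\hat{\alpha}$ and $\hat{\beta}$ from the toy height data above using the covariance and variance formulas (a minimal illustration, not a full econometrics routine):

```python
import numpy as np

y = np.array([1.82, 1.73, 1.78, 1.69, 1.80])
x = np.array([1.75, 1.79, 1.74, 1.70, 1.80])

# beta_hat = cov(x, y) / var(x);  alpha_hat = ybar - beta_hat * xbar
beta_hat = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
alpha_hat = y.mean() - beta_hat * x.mean()

y_fitted = alpha_hat + beta_hat * x   # fitted values
residuals = y - y_fitted              # residuals

print(alpha_hat, beta_hat)
print(residuals.sum())                # approximately 0 (first-order condition)
print((residuals * x).sum())          # approximately 0 (first-order condition)
```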
3.1.4 Multivariate regression
The world of univariate regressions (i.e. regressions with only one explanatory variable) is very simple. However, we rarely run regressions with only one regressor. Most of the time we use regressions with multiple regressors. They are known as multivariate regressions. Multivariate regression models are usually expressed using different letters as variables and different Greek letters as coefficients. For instance:
$$y_i = \alpha + \beta x_i + \gamma w_i + \delta z_i + \varepsilon_i$$
For simplicity, if we have $k$ explanatory variables, we denote them by $x_1, x_2, \ldots, x_k$. Notice that we have $k+1$ regressors:
$$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$
where $x_{0i} = 1$ for every $i$. In other words, $x_{0i}$ is constant and its coefficient is the intercept. We
can express the regression in matrices and vectors:
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} x_{01} & x_{11} & x_{21} & \cdots & x_{k1} \\ x_{02} & x_{12} & x_{22} & \cdots & x_{k2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{0N} & x_{1N} & x_{2N} & \cdots & x_{kN} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}$$
$$Y = X\beta + \varepsilon$$
In this case we have a cloud of $N$ points in a $(k+1)$-dimensional space. We want to find the plane or hyper-plane that minimizes the sum of the squared vertical distances to those points. Using matrix notation, we can write the minimization problem as:
$$\min_{\beta}\; (Y - X\beta)'(Y - X\beta)$$
Our $k+1$ first-order conditions can be expressed as:
$$\hat{\beta} = (X'X)^{-1}X'Y$$
This formula involves a series of simple mathematical operations with the data. Keep in mind that the vector $\hat{\beta}$ contains $k+1$ regression coefficients. Our results can be expressed as:
$$\hat{y}_i = \hat{\beta}_0 x_{0i} + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_k x_{ki}$$
We usually have $N > k$ by a lot. Think about what would happen if $k + 1 = N$. It’s very helpful to start with the case of $N = 2$. We would be trying to fit a cloud that consists of only two points with a straight line. The fit would be perfect, and the residuals would be zero. Now extend that idea to $N = 3$. We would try to fit a cloud of three points with a plane. Again, the fit would be perfect. This is a general result. As long as $k + 1 = N$, the fitted values would be equal to the actual values of $y$.
For illustrative purposes, we will use univariate or bivariate regression examples because we
can analyze them graphically. Their intuition extends to the case with more regressors. The graph
below shows a plane fitting a cloud of points in three dimensions (two explanatory variables and
one dependent variable). The plane cuts through the cloud, leaving some points above (in blue)
and other points below (in red).
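A minimal sketch of the matrix formula in Python, using simulated data with made-up coefficients (the data-generating process here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=N)   # known coefficients for the illustration

X = np.column_stack([np.ones(N), x1, x2])             # first column is the constant x_0 = 1
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)          # beta_hat = (X'X)^(-1) X'Y
print(beta_hat)                                        # close to [1.0, 2.0, -0.5]
```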
3.1.5 Goodness of fit
Remember that we are trying to fit a cloud of points with a linear structure (a line, a plane or a hyperplane). We can always measure how well we do that using the R-square ($R^2$), a measure of goodness of fit. The formula is very simple:
$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$
If our regression model fits the cloud perfectly, then all residuals are equal to zero and the R-square would be equal to one. If, on the contrary, the model is no better than using a flat line or plane with a value of $\bar{y}$, then our regression model would not explain any of the variation in the data, and the R-square would be equal to zero. As you can probably deduce, the R-square is
always between 0 and 1, with 0 representing the worst possible fit (none), and one representing
the perfect fit.
Why is the R-square called that way? One very intuitive way of measuring goodness of fit is to compute the correlation between $\hat{y}$ and $y$. If the model fits the data perfectly, the correlation should be 1. If the model has a very poor fit, the correlation would be close to zero (positive or negative). Let R stand for that correlation. How is the R-square related to R? Well, you probably guessed it by now. The R-square is simply the square of R, that is, the square of the correlation between $\hat{y}$ and $y$.
The R-square is a mathematical concept. It is not informative of the probabilistic or economic
aspects of our regression. High R-squares are not per se better than low R-squares. The relevance of a given degree of goodness of fit depends on the context. Later we will see some examples where the
R-square is not even mentioned (when we try to estimate causal effects) and other examples in
which the R-square is the most important aspect (when we try to predict). We will come back to
discuss goodness of fit as we advance in the course.
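As a quick sketch, the R-square can be computed directly from its formula or as the square of the correlation between $\hat{y}$ and $y$; the simulated data below are made up solely for the illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

r2_formula = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr = np.corrcoef(y_hat, y)[0, 1] ** 2   # square of the correlation between y_hat and y
print(r2_formula, r2_corr)                    # the two numbers coincide
```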
3.2 Intermediate concepts
So far, we can say the mechanics of regression are simple. In fact, they are so simple that one
could be tempted to deem regression analysis as “too simplistic.” However, that misses several
points. Here we will go over some of them to give you a taste of the power of regression.
3.2.1 Dummies
Dummy variables (also known as indicator or dichotomous variables) are a very useful type of regressors. A dummy takes a value of 1 if a condition holds true, and 0 if it doesn’t:
$$x_i = \begin{cases} 1 & \text{if the condition holds} \\ 0 & \text{otherwise} \end{cases}$$
Assume our regression model is $y_i = \alpha + \beta x_i + \varepsilon_i$. How do we interpret $\hat{\alpha}$ and $\hat{\beta}$ when $x_i$ is a dummy? The following chart provides some guidance. If $x_i$ is a dummy, then our cloud of points would consist of two columns of points. One would be located over the value $x_i = 0$ and the other would be located over the value $x_i = 1$. No points would lie between $x_i = 0$ and $x_i = 1$. Our regression line would cross both columns. The graph below presents an example.
The resulting intercept and slope can be interpreted in terms of conditional expectations:
$$E[\hat{y} \mid x = 0] = \hat{\alpha}$$
$$E[\hat{y} \mid x = 1] = \hat{\alpha} + \hat{\beta}$$
$$E[\hat{y} \mid x = 1] - E[\hat{y} \mid x = 0] = \hat{\beta}$$
This is a very useful feature. Let’s move on to the case with two independent dummies. You
can imagine one dummy indicates gender (zero for male and one for female) and the other
indicates minority status (zero for non-minority and one for minority). There are four possible combinations of $(x_1, x_2)$: $(0,0)$, $(1,0)$, $(0,1)$ and $(1,1)$. In this case, the cloud of points consists of four columns of points floating above or below those four coordinates. Our model is:
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i$$
Since we have two regressors, we can still get a visual interpretation of the plane that fits the cloud of points. The following chart illustrates the cloud of points and the regression plane. The height of the plane at each of the four coordinates $(0,0)$, $(1,0)$, $(0,1)$ and $(1,1)$ can be expressed in terms of the beta hats.
In this example, $\hat{\beta}_1 < 0$, $\hat{\beta}_2 > 0$, and $\hat{\beta}_1 + \hat{\beta}_2 < 0$. Let’s assume for illustrative purposes that $y$ is the wage and we are looking at a group of employees of a company. The regression coefficients tell us the expected value of $\hat{y}$ for each group:

The expected value of the wage for…    …is:
Non-minority males       $E[\hat{y} \mid x_1 = 0, x_2 = 0] = \hat{\beta}_0$
Non-minority females     $E[\hat{y} \mid x_1 = 1, x_2 = 0] = \hat{\beta}_0 + \hat{\beta}_1$
Minority males           $E[\hat{y} \mid x_1 = 0, x_2 = 1] = \hat{\beta}_0 + \hat{\beta}_2$
Minority females         $E[\hat{y} \mid x_1 = 1, x_2 = 1] = \hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2$
There are many more ways of using dummies. We will learn more about them later.
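A small sketch of the two-dummy wage example, with simulated data whose coefficient signs match the illustration above (all numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
female = rng.integers(0, 2, size=N)     # x1: 1 if female, 0 if male
minority = rng.integers(0, 2, size=N)   # x2: 1 if minority, 0 otherwise
wage = 20 - 3 * female + 1 * minority + rng.normal(scale=2, size=N)

X = np.column_stack([np.ones(N), female, minority])
b = np.linalg.solve(X.T @ X, X.T @ wage)   # [beta0_hat, beta1_hat, beta2_hat]

# Expected wages by group, as in the table above
print("non-minority males:  ", b[0])
print("non-minority females:", b[0] + b[1])
print("minority males:      ", b[0] + b[2])
print("minority females:    ", b[0] + b[1] + b[2])
```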
3.2.2 Splitting and warping
Sometimes our cloud of points doesn’t look linear. Can we still fit it with a linear structure? The answer is affirmative. Imagine that our cloud of points looks like $y$ is a polynomial in $x$. That is the case in the figure below.
Let’s start with a polynomial of degree $h$ in $x$:
$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \cdots + a_h x^h$$
If we define $x_{0i} = 1$, $x_{1i} = x_i$, $x_{2i} = x_i^2$, $x_{3i} = x_i^3$, …, $x_{hi} = x_i^h$, then we arrive at:
$$y_i = a_0 + a_1 x_{1i} + a_2 x_{2i} + a_3 x_{3i} + \cdots + a_h x_{hi}$$
This has a linear structure. All the regressors enter the model linearly (there aren’t any quadratic, cubic, or higher-degree terms in the regressors $x_1, \ldots, x_h$). Thus, although a linear structure sounds restrictive, it turns out that it isn’t. This is possible because we split and warp the regressors. In the case above, $x$ is split into $h$ regressors, and each of them is warped differently. Although the original relationship may be non-linear, we can find a specification with a linear relationship between $y$ and $x$, once we split it and warp it.
Notice that, in general, when we split and warp the regressors, the derivatives are no longer constant. In the case above, we have:
$$\frac{\partial y}{\partial x} = \sum_{j=1}^{h}\frac{\partial y}{\partial x_j}\frac{\partial x_j}{\partial x} = \sum_{j=1}^{h} a_j \frac{\partial x_j}{\partial x} = \sum_{j=1}^{h} a_j\, j\, x^{j-1}$$
Graphically, to understand what happens when we split and warp, we can focus on the case of a quadratic polynomial. Assume we have only one explanatory variable $x$, and the cloud of points looks like a parabola that opens upward. Let’s split $x$ into $x_1 = x$ and $x_2 = x^2$. The graph below shows the parabolic relationship between $y$ and $x$ as blue points on the wall at the left. On the floor you can see the relationship between $x_1$ and $x_2$ (the latter is the square of the former). When we run a regression of $y$ on $x_1$ and $x_2$, we are choosing the right height and tilt of the blue plane to fit the cloud of red points. The red points you see in the graph lie on the blue plane.
The takeaway is that, by splitting and warping our regressors, we can fit non-linear looking
clouds with linear structures.
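A brief sketch of splitting and warping with a quadratic, assuming a made-up data-generating process:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 300
x = rng.uniform(-3, 3, size=N)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(size=N)   # parabolic cloud

# Split and warp: x1 = x, x2 = x^2, then fit a *linear* structure in (x1, x2)
X = np.column_stack([np.ones(N), x, x**2])
a_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(a_hat)   # close to [1.0, 0.5, 2.0]

# The slope dy/dx = a1 + 2*a2*x is no longer constant; evaluate it at x = 0 and x = 1
print(a_hat[1] + 2 * a_hat[2] * 0, a_hat[1] + 2 * a_hat[2] * 1)
```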
3.2.3 Logarithms
The logarithm is a recurring tool in economics because of its nice properties. Sometimes we use logarithmic transformations of our regressors. For instance, we may be interested in the regression model:
$$y_i = \beta_0 + \beta_1 \ln(x_i) + \varepsilon_i$$
If we take the derivative of $y$ with respect to $x$ and multiply it by a change in $x$ equal to $dx$, we get:
$$\frac{\partial y}{\partial x}\, dx = \beta_1 \frac{dx}{x}$$
Assume $dx \approx 1\%$ of $x$, so that $dx/x \approx 0.01$. In this case, $\beta_1/100$ is interpreted as the change in $y$ associated with a one percent increase in $x$.
Sometimes we use the logarithm of the dependent variable:
$$\ln(y_i) = \beta_0 + \beta_1 x_i + \varepsilon_i$$
The interpretation differs from the one in the previous example. To show it, let’s apply the antilogarithm to the above expression, in other words, compute $e^{\ln(y)}$:
$$y = e^{\beta_0 + \beta_1 x_i + \varepsilon_i}$$
In the above expression, the derivative of $y$ with respect to $x$ is:
$$\frac{\partial y}{\partial x} = \beta_1 e^{\beta_0 + \beta_1 x_i + \varepsilon_i}$$
If we divide by $y$ we get:
$$\frac{1}{y}\frac{\partial y}{\partial x} = \beta_1$$
Thus, the coefficient $\beta_1$ can be interpreted as the change, expressed as a fraction of $y$, associated with a one-unit change in $x$. Notice that nothing prevents us from using this last model when $x$ is a dummy variable. The interpretation would be the same: $\beta_1$ is the change as a fraction of $y$ associated with “turning on” the dummy variable $x$.
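A short sketch contrasting the two log specifications, using simulated data with arbitrary coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
x = rng.uniform(1, 100, size=N)

# Model 1: y = b0 + b1*ln(x) + e  -> a 1% increase in x shifts y by about b1/100
y1 = 2.0 + 3.0 * np.log(x) + rng.normal(size=N)
X1 = np.column_stack([np.ones(N), np.log(x)])
print(np.linalg.solve(X1.T @ X1, X1.T @ y1))           # close to [2.0, 3.0]

# Model 2: ln(y) = b0 + b1*x + e -> a one-unit increase in x changes y by roughly 100*b1 percent
y2 = np.exp(0.5 + 0.02 * x + rng.normal(scale=0.1, size=N))
X2 = np.column_stack([np.ones(N), x])
print(np.linalg.solve(X2.T @ X2, X2.T @ np.log(y2)))   # close to [0.5, 0.02]
```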
3.2.4 Turning continuous variables into dummies
Sometimes it’s more convenient to define a group of dummies to represent different intervals of a continuous variable. For instance, instead of age or income, we may want to have age groups or income brackets. Assume $z$ is an independent variable. Let:
$$x_0 = \mathbf{1}(z \in [0, a))$$
$$x_1 = \mathbf{1}(z \in [a, b))$$
$$\vdots$$
$$x_k = \mathbf{1}(z \in [h, i))$$
We have a regression model with $k+1$ dummies (the first one playing the role of the constant), one for each interval of $z$:
$$y_i = \beta_0 x_{0i} + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$
Using dummies this way may help us fit complicated patterns in the data in a very simple manner. The graph below shows an example with an intercept and $k$ dummies:
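A brief sketch of how interval dummies like these can be created with pandas (the age variable and the cut points are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
age = rng.integers(18, 80, size=10)

# One dummy per age bracket; pd.get_dummies creates the indicator columns
brackets = pd.cut(age, bins=[0, 30, 45, 60, 100], right=False,
                  labels=["18-29", "30-44", "45-59", "60+"])
dummies = pd.get_dummies(brackets)
print(pd.DataFrame({"age": age}).join(dummies))
```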
3.2.5 Kinks and jumps
Sometimes we expect heterogeneity in the regression coefficients across subgroups. That heterogeneity may come in the form of kinks or jumps. Formally, we say that there are heterogeneous coefficients. The graphs below present some examples. If we simply use a model like $y_i = \alpha + \beta x_i + \varepsilon_i$, we would be missing the kink or the jump.
We can incorporate the possibility of kinks and jumps. To do that, let $d_i$ be such that:
$$d_i = \begin{cases} 1 & \text{if } x_i \ge a \\ 0 & \text{if } x_i < a \end{cases}$$
To include the possibility of heterogeneous coefficients based on the value of $a$, our model would become:
$$y_i = \alpha + \beta x_i + \gamma d_i + \delta d_i x_i + \varepsilon_i$$
In this case, we say “the variable $d_i$ interacts with $x_i$” or that “there are interaction terms of $x_i$ and $d_i$.” Notice that now we have two intercepts and two slopes. Which is applicable depends on whether $x_i \ge a$ or $x_i < a$. For $x_i < a$, the model is:
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
whereas for $x_i \ge a$, the model is:
$$y_i = \alpha + \beta x_i + \gamma + \delta x_i + \varepsilon_i$$
The intercept would be $\alpha + \gamma$ and the slope would be $\beta + \delta$. To show the heterogeneity more clearly, we can write the model for both cases as:
$$y_i = (\alpha + \gamma d_i) + (\beta + \delta d_i)\, x_i + \varepsilon_i$$
Assuming there is a kink and a jump at $a$, the graph below shows how our model would fit the cloud.
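A sketch of fitting a kink and a jump at a threshold $a$, with simulated data (the threshold and the coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 400
a = 5.0                                        # threshold where the kink/jump occurs
x = rng.uniform(0, 10, size=N)
d = (x >= a).astype(float)                     # d_i = 1 if x_i >= a, 0 otherwise
y = 1.0 + 0.5 * x + 2.0 * d + 1.5 * d * x + rng.normal(size=N)

# y_i = alpha + beta*x_i + gamma*d_i + delta*d_i*x_i + eps_i
X = np.column_stack([np.ones(N), x, d, d * x])
alpha, beta, gamma, delta = np.linalg.solve(X.T @ X, X.T @ y)
print("intercept/slope below a:", alpha, beta)
print("intercept/slope above a:", alpha + gamma, beta + delta)
```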
As an exercise, assume $y_i$ is wage, $x_i$ is years of schooling, $d_i = 1$ if individual $i$ is female and $d_i = 0$ otherwise. How would you interpret the coefficients in the following model?
๐‘ฆ๐‘– = ๐›ผ + ๐›ฝ๐‘ฅ๐‘– + ๐›พ๐‘‘๐‘– + ๐›ฟ๐‘‘๐‘– ๐‘ฅ๐‘– + ๐œ€๐‘–
3.2.6 Interactions
Some relationships between regressors and the dependent variable may be complicated. In the regression context, we call the product of two or more regressors an interaction. We can have interactions with dummies (as we saw before) or with any other regressors. Suppose we have two independent variables, $x_1$ and $x_2$. A model with an interaction between $x_1$ and $x_2$ is:
$$y_i = \alpha + \beta x_{1i} + \gamma x_{2i} + \delta x_{1i} x_{2i} + \varepsilon_i$$
If we take partial derivatives of the dependent variable with respect to each of the two regressors, we don’t get constant terms. Instead, we get values that vary:
$$\frac{\partial y_i}{\partial x_{1i}} = \beta + \delta x_{2i}$$
$$\frac{\partial y_i}{\partial x_{2i}} = \gamma + \delta x_{1i}$$
The slopes vary with the other explanatory variable. Like any derivative, the terms above can be evaluated at different values of $x_1$ and $x_2$. Since the slopes vary across observations, we say they are heterogeneous. As you can imagine, we can have many types of interactions. They may involve more than two independent variables. However, it is important to keep in mind that too many interactions may obscure the meaning of our regression.
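A minimal sketch of an interaction between two continuous regressors and the resulting heterogeneous slopes (simulated data, arbitrary coefficients):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x1, x2, x1 * x2])
alpha, beta, gamma, delta = np.linalg.solve(X.T @ X, X.T @ y)

# The slope with respect to x1 is beta + delta*x2: it varies with x2
for val in (-1.0, 0.0, 1.0):
    print(f"dy/dx1 at x2={val}: {beta + delta * val:.2f}")
```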
3.2.7 De-meaning or centering
In some circumstances we may find it convenient to de-mean the data, i.e. to center the data around its mean. What does that do to our estimates? Consider the model $y_i = \alpha + \beta x_i + \varepsilon_i$. As we saw before, the regression coefficients would be:
$$\hat{\beta} = \frac{cov(x,y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$
What if instead we use $x_i^* = (x_i - \bar{x})$ as our explanatory variable? The process of subtracting the mean to create a new variable is called de-meaning or centering. If we did so, our model would be $y_i = \alpha^* + \beta^* x_i^* + \varepsilon_i$. A natural question is, would $\hat{\beta}$ and $\hat{\beta}^*$ be the same? What about $\hat{\alpha}$ and $\hat{\alpha}^*$? Let’s compute them:
$$\hat{\beta}^* = \frac{cov(x^*, y)}{var(x^*)} = \frac{cov(x - \bar{x}, y)}{var(x - \bar{x})} = \frac{cov(x,y)}{var(x)} = \hat{\beta}$$
Thus, the slope is unchanged. But that’s not the case with the intercept:
$$\hat{\alpha}^* = \bar{y} - \hat{\beta}(0) = \bar{y}$$
If we de-mean the regressors, then we can interpret our estimates as “evaluated at the mean.”
This is particularly interesting for the intercept, since it becomes the average for the dependent
variable. As an exercise, think what would happen if we also de-meaned the dependent variable.
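A quick numerical check of the de-meaning result, using made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(10, 2, size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)

def ols(xvar, yvar):
    # Univariate OLS: slope = cov(x, y)/var(x), intercept = ybar - slope*xbar
    b = np.cov(xvar, yvar, ddof=0)[0, 1] / np.var(xvar)
    return yvar.mean() - b * xvar.mean(), b

a_hat, b_hat = ols(x, y)                 # original regressor
a_star, b_star = ols(x - x.mean(), y)    # de-meaned regressor
print(b_hat, b_star)                     # slopes are identical
print(a_star, y.mean())                  # intercept becomes the mean of y
```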
3.2.8 Hierarchical and rectangular forms
Imagine we have yearly data on sales for three sales representatives. The data covers the years
2015 through 2018. There are (at least) two ways of structuring the data into a table. The first,
shown below, is what is known as a rectangular form or wide shape.
Sales representative      Sales
                          2015    2016    2017    2018
Anne                      120     129     108     112
Bob                       98      92      105     121
Chris                     89      82      97      98
In this case, each row denotes a sales representative and the columns show the sales across
different years. Notice that, as we accumulate data for more years, the number of columns would
grow. A second way of presenting the same data is by using what is known as a hierarchical
form or long shape. Below is the same data but in hierarchical form. Notice that now each row is
a unique combination of sales representative and year, and there is only one column for sales. The first level of our hierarchy is given by the sales representative. The second level is given by the year. Adding more years in this case would increase the number of rows.
Sales representative    Year    Sales
Anne                    2015    120
Anne                    2016    129
Anne                    2017    108
Anne                    2018    112
Bob                     2015    98
Bob                     2016    92
Bob                     2017    105
Bob                     2018    121
Chris                   2015    89
Chris                   2016    82
Chris                   2017    97
Chris                   2018    98
The data can come to you in many different shapes. You must be able to arrange it so that you can analyze it any way you desire. To do that, it’s helpful to keep in mind these two general ways
of organizing a table. Of course, when we have more complex data (more hierarchies and more
variables), there are more ways to organize them. Some ways could be partly hierarchical and
partly rectangular.
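A short pandas sketch of moving between the two shapes, using the sales numbers from the tables above:

```python
import pandas as pd

# Rectangular (wide) form: one row per sales representative
wide = pd.DataFrame({
    "Sales representative": ["Anne", "Bob", "Chris"],
    "2015": [120, 98, 89], "2016": [129, 92, 82],
    "2017": [108, 105, 97], "2018": [112, 121, 98],
})

# Wide -> long (hierarchical): one row per representative-year combination
long = wide.melt(id_vars="Sales representative", var_name="Year", value_name="Sales")

# Long -> wide again
back_to_wide = long.pivot(index="Sales representative", columns="Year", values="Sales")
print(long)
print(back_to_wide)
```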
4 Probability and regression
At this point, we already have the first layer of our cake structure. We have a mathematical method (regression) to summarize the relationship between a dependent variable ($y$) and a group of independent variables ($x_1, x_2, \ldots, x_k$). We start with a cloud of points (our data), and we fit it with a linear structure. We can fit clouds with all kinds of shapes. They don’t have to look like lines or planes. They can be curvy, and they can have jumps and kinks. Now, we will proceed to the second layer, which incorporates probability.
4.1 Sampling and estimation
Let’s start with a silly example. Assume I measure your expenditures on entertainment over the last three months and plot them against the last two digits of your Social Security number. Would there be any correlation? You can correctly guess there should be no correlation. The graph below illustrates this example. The cloud represents different levels of expenditures along the vertical axis, and the last two digits of the Social Security number along the horizontal axis.
We know the actual value of the slope should be zero because there is no reason those two
variables should be connected. That’s represented by the blue line. However, what if we
randomly got two samples like the ones depicted in the graph below? If our sample consisted of
the observations denoted by triangles, a regression using that sample would produce a negative
slope (red line). In contrast, if our sample consisted of the observations denoted by squares, the
regression would produce a positive slope (green line).
When we use samples, by sheer luck we may get positive or negative slopes even if the actual value of the slope should be zero. Probability enters regressions through the notion of sampling.
4.2 Nomenclature
Let’s introduce some useful nomenclature. We use the term parameters for the regression coefficients we would get if we ran a regression using data for the entire population or universe. In contrast, we use the term estimates for the regression coefficients we get when we run a regression using a sample. Colloquially, parameters are sometimes referred to as the betas or the true betas, whereas estimates are referred to as the beta hats. The hat comes from the convention of adding the symbol ^ on top of the coefficient to distinguish it from the parameter. We hope the estimates are informative of the parameters. In fact, that’s the only reason we care about them.
In the real world, we don’t observe the population or the universe. We only observe samples.
Our challenge is to determine if our estimates are close to or far from the parameters. Notice something important and intuitive in the example about the Social Security numbers that is true more generally. First, the less $y$ varies (relative to $x$), the smaller the chances of getting very different regression coefficients across random samples—the beta hats would be more similar across samples. Second, the larger the sample, the smaller the chances the regression coefficients will differ by much from the population regression coefficient—the beta hat would be more similar to the true beta. Those are two general principles worth keeping always in mind.
4.3 The magic of the Central Limit Theorem
Imagine that, given a population of size $M$, we draw one million random samples of size $N < M$. For each sample, we run the regression $y_i = \alpha + \beta x_i + \varepsilon_i$ and get an estimate of beta (that is, we get a $\hat{\beta}$). We would have one million such beta hats. If we create a histogram with
all those values, how would it look? By the Central Limit Theorem, we know it would look like
a normal distribution centered at the true beta. This property is independent of anything else. It
only depends on the concept of random sampling. This is an awesome result and we get a lot of
mileage out of it. The graph below shows how the one-million beta hat histogram would look.
In our Social Security number example we have that $\beta = 0$ but, because of sampling, we would get $\hat{\beta} > 0$ half of the time and $\hat{\beta} < 0$ the other half. However, estimates close to the parameter are more likely than estimates far from it—look at the chart above. If we knew how much $\hat{\beta}$ varies, then we could calculate the probability of $\hat{\beta}$ (the estimate) being close to or far from $\beta$ (the parameter).
4.4 Standard error
We measure how much $\hat{\beta}$ varies using its standard deviation. We call the standard deviation of $\hat{\beta}$ the standard error. There are two ways to estimate the standard error of $\hat{\beta}$. One way is bootstrapping. It consists of treating our sample as the population, and then drawing many samples from it with replacement, so that every data point remains available on each draw. By taking samples with replacement we can get a very good idea of how much our estimate varies based exclusively on the luck of the draw. This is very easily done with today’s computers. Thus, there is no excuse not to do it. We can select the number of repetitions we want (100, 1,000, 10,000 or a million). Notice that, ceteris paribus, larger sample sizes mean smaller standard errors because larger sample sizes produce estimates closer to the parameter and therefore they vary less.
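A sketch of the bootstrap standard error for $\hat{\beta}$, resampling the data with replacement (the data and the number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 200
x = rng.normal(size=N)
y = 1.0 + 0.8 * x + rng.normal(size=N)

def slope(xs, ys):
    return np.cov(xs, ys, ddof=0)[0, 1] / np.var(xs)

reps = 1_000
boot = np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, N, size=N)      # sample N observations with replacement
    boot[r] = slope(x[idx], y[idx])

print("bootstrap standard error of beta_hat:", boot.std(ddof=1))
```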
We can also proceed in the classic way and estimate the standard error using the residuals of
our (only one) regression using the full original sample. Based on assumptions that we won’t
review here (some of which aren’t verifiable), you can approximate the standard error this way. It is important to know this method because most people use it. The standard error of $\hat{\beta}$ is estimated based on how large or small the residuals are, using the following formula:¹
$$S.E.(\hat{\beta}) = \sqrt{var(\hat{\beta})} = \sqrt{\hat{\sigma}^2 (X'X)^{-1}}$$
The above expression also decreases with the sample size through the term $(X'X)^{-1}$. To see it, notice that, in the case of a univariate regression, multiplying by the term $(X'X)^{-1}$ is equivalent to dividing by the term $\sum_{i=1}^{N}(x_i - \bar{x})^2$, which is increasing in $N$. You may notice that we introduced the term $\hat{\sigma}^2$:
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\hat{\varepsilon}_i^2$$
Whichever way we measure the standard error (bootstrapping using many regressions or based on the residuals of a single regression), the idea is that our $\hat{\beta}$ is normally distributed with mean $\beta$ and variance equal to the square of the standard error. We express that statement as:
$$\hat{\beta} \sim N\!\left(\beta,\, [S.E.(\hat{\beta})]^2\right)$$
Let’s assume the true beta is zero. This is an arbitrary but very useful assumption. Given a standard error, we can compute the probability of $\hat{\beta}$ being in any interval we want. Let’s focus on symmetric intervals around zero. The graph below shows the distribution of $\hat{\beta}$ assuming $\beta = 0$, and given a standard error of one (as an example). Given an interval $[-a, a]$, where $a$ is a positive number, we can easily calculate the probability of $\hat{\beta}$ being outside that interval. We could do that calculation in a spreadsheet or any statistical software.
¹ There are better and slightly more sophisticated formulas to estimate the standard error that account for some other factors. We will briefly discuss them later.
We can also proceed backwards. Start with a given probability. We can find the symmetric interval around zero such that $\hat{\beta}$ would fall outside of it with that given probability. The graph below illustrates that situation. If we start with a probability of, say, 0.05 of $\hat{\beta}$ falling outside of the interval $[-b, b]$, then we can determine the value of $b$.
To summarize, estimates are sample regression coefficients and parameters are population regression coefficients. Because of sampling, we think of estimates as random variables. Estimates are normally distributed, and their means are the parameters. The Central Limit Theorem doesn’t require any assumption on the distribution of $y$, $x$ or $\varepsilon$. Based on the Central Limit Theorem result, we can formulate and test hypotheses.
4.5 Significance
Once we know the shape of the distribution of the estimates (a normal distribution), we may find it useful to hypothesize that it is centered at zero. Another way to state the same hypothesis is that there is no relation between $x$ and $y$ in the population, or that the true beta is zero. However, as we saw before, even if the true beta is zero, there is a chance we could get a sample for which the beta hat is not zero. Thus, we never know with certainty if the hypothesis is true or false. But we can check whether the data lend little or a lot of support to it.
We can test the hypothesis $\beta = 0$ based on the distribution of $\hat{\beta}$ (under the assumption that it is centered at zero) and the actual sample coefficient we obtain. With those ingredients, we define a rejection region associated with a confidence level (as you previously saw in your Stats course). Keep in mind that, since there is uncertainty, the best we can do is to live with a level of confidence.
Sometimes it helps to understand these issues in terms of a coin. Suppose we’re interested in determining whether a coin is fair (i.e. it isn’t loaded). By tossing it one hundred times (i.e. by getting one sample of size $N = 100$) we’ll never know for sure if it’s fair or not. But we may get a
very good idea. If out of one hundred tosses we get ninety-five heads, we have good reasons to
believe the coin isn’t fair. Why? Because, assuming the coin is fair, getting ninety-five heads or
more is extremely unlikely. What about eighty or more heads? Seventy or more? As we approach
fifty heads (what we would expect with a fair coin), the probability gets closer to fifty percent.
For instance, the probability of observing sixty heads or more is one in thirty-five (0.0284). Still
small, but not microscopic anymore. Lastly, the probability of observing fifty-five heads or more
is close to one in five (0.1841).
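The coin probabilities above can be checked directly with the binomial distribution (a quick sketch using scipy):

```python
from scipy.stats import binom

n, p = 100, 0.5                      # one hundred tosses of a fair coin
for k in (95, 80, 70, 60, 55, 50):
    # Probability of observing k heads or more: P(X >= k) = P(X > k - 1)
    print(k, binom.sf(k - 1, n, p))
# P(X >= 60) is about 0.0284 and P(X >= 55) is about 0.1841, as stated above
```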
When we produce estimates using regressions, we have a similar situation. It’s hard to reconcile estimates that are far from zero with a true parameter equal to zero. Just as in the coin toss example, we can compute the probabilities associated with each value of $\hat{\beta}$. Remember that once we know the standard error of $\hat{\beta}$, we also know the hypothetical distribution of $\hat{\beta}$ assuming $\beta = 0$. The graph below shows such a distribution. Notice that the location of the distribution doesn’t depend on the value of $\hat{\beta}$—we assumed it’s centered at zero. What does the distribution mean intuitively? Given $\beta = 0$, values of $\hat{\beta}$ far from zero (be they positive or negative) are unlikely. Thus, if our $\hat{\beta}$ is very large or very small, then it is highly unlikely that it comes from a distribution centered at zero. Always keep in mind that the standard error is our measure of how far $\hat{\beta}$ is from 0, because it is the standard deviation of the distribution of $\hat{\beta}$.
Let’s revisit the normal distribution you’ve studied before. The graph below shows the
probability by intervals for a random variable that is normally distributed with mean zero and
standard deviation one (the horizontal axis is expressed in standard deviations). For instance, the
probability that such variable falls between 0 and 1 is 0.341. Since the distribution is symmetric,
the probability of the variable falling between −1 and 0 is also 0.341. Thus, the probability of
falling between −1 and 1 is 0.682, which is equal to 2 × 0.341. The probability of the variable falling
outside of the interval (−1,1) is 0.318, which is 1 – 0.682. More generally, we can compute the
probability of the variable falling inside or outside any interval we want.
We can also proceed the other way around. We can start with a probability, say 0.90 or 90%,
and find the symmetric interval that corresponds to that probability. An interval is defined by its
upper and lower bounds. The graph below shows the values of the upper and lower bounds
given three probabilities: 0.99, 0.95 and 0.90. As before, the horizontal axis is expressed in
standard deviations.
Given the value of an estimate $\hat{\beta}$ (which may be positive or negative), we can compute the probability of obtaining estimates (drawn from the same distribution centered at zero) that are greater than $|\hat{\beta}|$ or smaller than $-|\hat{\beta}|$. Such probability is known as the p-value associated with the estimate $\hat{\beta}$. The graph below presents an example with $\hat{\beta} = 1.405$ and a standard error of 1. The probability of obtaining an estimate above 1.405 is 0.08, and the probability of obtaining an estimate below −1.405 is also 0.08. Thus, the probability of obtaining an estimate that is farther away from zero than 1.405 is 0.16, which is 2 × 0.08. In other words, the p-value of the estimate 1.405 is 0.16. It should be clear that the probability of getting an estimate closer to zero than 1.405 is 0.84.
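The p-value computation from this example in code (a sketch using the normal distribution):

```python
from scipy.stats import norm

beta_hat, se = 1.405, 1.0
z = beta_hat / se                        # how many standard errors away from zero
p_two_sided = 2 * norm.sf(abs(z))        # probability of an estimate farther from zero than |beta_hat|
p_one_sided = norm.sf(abs(z))            # right-tail probability only
print(p_two_sided, p_one_sided)          # roughly 0.16 and 0.08
```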
The definition of p-value stated above corresponds to two-sided tests. We can also define the p-value for one-sided tests. In that case, we only care about either the probability of getting estimates that are larger than our estimate or smaller than our estimate. As you can see in the graph above, that’s equivalent to looking at only one of the tails of the distribution. The p-value in a right-side
test (which measures the probability of an estimate being greater than 1.405) is 0.08. Because the distribution is symmetric and centered at zero, that’s the same as the p-value in a left-side test (which measures the probability of an estimate being smaller than −1.405).
Now that we have reviewed some probability notions, we can introduce a crucial concept. If $\hat{\beta}$ falls outside of the 95% interval centered at 0, we say it is statistically significant (or statistically different from zero) at 95% confidence. If it falls inside, we say that it is statistically insignificant (or statistically not different from zero). We can use other levels of confidence. Traditionally, 95%
(or statistically not different from zero). We can use other levels of confidence. Traditionally, 95%
is the norm. However, with larger samples, we can be more demanding and use 99% or 99.9%.
Notice that the definition of statistical significance can also be expressed in terms of p-values. If
the p-value is below 0.05, then we say that the estimate is statistically significant at 95%
confidence. If the p-value is greater than 0.05, we say the estimate is insignificant or not
significant.
As you can imagine, the definition of significance can be adjusted to reflect one-sided tests if that’s what we need. Imagine that a regression produces an estimate $\hat{\beta} = 1.8$ and the associated standard error is 1. Is that estimate statistically significant? The answer depends on the level of confidence we use and whether we are performing a one- or two-sided test. If we use a confidence level of 95% (or higher) in a two-sided test, the estimate is not significant (see the chart above). But if we use 90%, it is significant, since it lies outside of the 90% interval that has bounds −1.64 and 1.64 (1.80 > 1.64). If we use a one-sided test, then the estimate would be significant at 95% confidence because the interval’s bounds are −∞ and 1.64. In sum, you cannot say a priori whether an estimate is significant or not just by looking at it. You need to know (1) whether we’re talking about a one- or two-sided test, (2) the p-value of the estimate, and (3) the confidence level.
Notice that, all else constant, significance is directly affected by the sample size. We mentioned that larger samples result in smaller standard errors. That means the distribution of the estimates is more narrowly concentrated around the assumed value of the parameter. Thus, any non-zero estimate will eventually become significant if we keep increasing the sample size.
So far, we’ve assumed $\beta = 0$. However, we could assume any other value for $\beta$ and test whether our estimate is likely to be coming from a distribution centered at that (non-zero) value. That would be similar to assuming a loaded coin that lands heads with a probability different from one half. Intuitively, given an estimate, some parameter values would be more “reasonable” than others. After all, it’s more believable that the estimate $\hat{\beta} = 21.3$ comes from a distribution centered at 20 than from a distribution centered at 100. We will talk about this in the next two sections.
4.6 Confidence intervals
Given a confidence level, what parameters would be consistent with our estimate? We have a Goldilocks situation. Some parameter values seem too big for our estimate, while others seem too small. The graph below illustrates this situation. Imagine our estimate is $\hat{\beta}$, and we consider two possible values of the true parameter, $\beta^*$ and $\beta^{**}$. If the distribution of $\hat{\beta}$ were centered at any of
those two values, it would be very unlikely to get $\hat{\beta}$ (just like it’d be very unlikely to get fifty heads in one hundred tosses using a coin heavily loaded in favor of heads, or using another coin heavily loaded against heads).
Which values of the parameter seem “right” given our estimate? The answer is very intuitive. It’d be the values that are close to our estimate. Closeness to the estimate makes those parameter values appear more reasonable. One simple way to measure how close a possible parameter value is to our (known) estimate is to look at the p-value we would get under the assumption that the true parameter takes that particular value.
Imagine we adopt the following rule. We pick a confidence level, say 95%. Then we determine all the values of $\beta$ for which the p-value of our estimate would be above the critical value, which is defined as one minus the confidence level we picked. In this case the critical value is 0.05. We would end up with an interval of possible values of $\beta$. Colloquially speaking, it wouldn’t surprise us if our estimate $\hat{\beta}$ came from a distribution centered anywhere within that interval—the probability of such an event wouldn’t be too small.
To make things easy, we can focus on the lower and upper bounds of the interval just described. If the critical value is 0.05, we need to find the values of the parameter such that the p-value of $\hat{\beta}$ is precisely 0.05. There are two such parameter values. One will be greater than $\hat{\beta}$ and the other will be smaller. The graph below illustrates this point. If we assume the parameter is equal to $\beta'$, the p-value of $\hat{\beta}$ is 0.05. Similarly, if we assume the parameter is equal to $\beta''$, then the p-value of $\hat{\beta}$ is also 0.05. For any parameter value between $\beta'$ and $\beta''$, the p-value of $\hat{\beta}$ is greater than 0.05.
All possible betas for which the p-value of β̂ is greater than (or equal to) 0.05 constitute the
95% confidence interval of our estimate. In the example above, the 95% CI (as we usually
abbreviate the confidence interval) is given by (β′, β′′). In layperson terms, the confidence interval
tells us which values of the parameter are consistent with our estimate. Our estimate is not
statistically different from those parameter values (at the given level of confidence). This is very
helpful in many contexts.
A common mistake is to say that the parameter falls inside our confidence interval with a 95%
probability. Why is this wrong? The parameter is fixed. It isn't a variable—let alone a random
one. Put differently, the parameter either is in the interval or it isn't. We cannot make probability
statements about it.
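To make the mechanics concrete, here is a minimal Python sketch that builds a 95% confidence interval around an estimate using the normal approximation. The estimate and standard error are hypothetical numbers, not taken from any regression in these notes.

    # Minimal sketch: 95% confidence interval from an estimate and its standard error
    from scipy.stats import norm

    beta_hat = 1.80   # hypothetical estimate
    se = 1.00         # hypothetical standard error
    z = norm.ppf(0.975)                          # critical value for 95% confidence (about 1.96)
    ci = (beta_hat - z * se, beta_hat + z * se)  # interval centered at the estimate
    print(ci)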
4.7 Hypothesis testing
Often we would like to make decisions based on β, but we don't observe it. We only
observe β̂. However, we know β̂ and β are related. First, β̂ comes from a normal distribution
centered at β. Second, we have a proxy for the standard deviation of that distribution—the
standard error of β̂. Thus, we can use β̂ as a piece of information about β the same way we use a
sample average to inform us of the population average. How do we do this? We use hypothesis
testing.
A very common hypothesis (usually denoted by H₀) is β = 0. If β̂ is far from 0, then we reject
that hypothesis. However, we don't reject it with certainty. We reject it with some level of
confidence picked a priori (usually 95%). When β̂ is close to 0, we don't reject the hypothesis.
However, not rejecting a hypothesis is different from accepting it. To illustrate that, imagine two
different hypotheses (e.g. β = 0 and β = 0.1) are tested using the same regression and neither is
rejected. They cannot both be accepted because they are different (0 ≠ 0.1).
The measure of how far β̂ is from β is given by the standard error. However, we don't know
the standard error with certainty. We estimate it based on our sample, through bootstrapping or
the classic way based on the residuals. For our hypothesis tests we use a t distribution in lieu of
a normal distribution because we only have a proxy for the standard deviation. The difference
between the estimate and the parameter, divided by the standard error, is a random variable
distributed t with N − k − 1 degrees of freedom:
(β̂ − β) / S.E. ~ t_{N−k−1}
where N is the number of observations and k is the number of explanatory variables. Whenever
we have more than one hundred degrees of freedom (which is almost always the case), the t
distribution is indistinguishable from a normal distribution. Hence the focus in these notes on the
latter. However, formally we use the t distribution for hypothesis testing and confidence
intervals. The ratio (β̂ − β)/S.E. is the t-statistic.
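As a rough illustration, the sketch below computes a t-statistic and its two-sided p-value for the hypothesis β = 0, using made-up values for the estimate, the standard error, the sample size and the number of regressors.

    # Sketch: t-statistic and two-sided p-value for the hypothesis beta = 0 (hypothetical numbers)
    from scipy.stats import t

    beta_hat, se = 1.80, 1.00      # hypothetical estimate and standard error
    N, k = 500, 3                  # hypothetical sample size and number of regressors
    t_stat = (beta_hat - 0) / se
    p_value = 2 * t.sf(abs(t_stat), df=N - k - 1)   # two-sided p-value
    print(t_stat, p_value)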
Knowing the distribution of the t-statistic allows us to formulate different hypothesis tests.
It's crucial to note that the same estimate may be significant in some regressions but not in others,
depending on the standard error. In other words, the same hypothesis may or may not be rejected
with the same estimate, depending on the standard errors. Remember that significance is the result
of comparing the magnitude of the estimate with how much we would expect it to vary across
samples. For instance, assume two different standard errors, 0.5 and 1, and the hypothesis β = 0.
The two distributions of β̂ are depicted in the graph below. For which values of β̂ do we reject
the hypothesis H₀: β = 0 in each case? The shaded areas denote the rejection regions at a 95%
confidence level. Try different estimate values and convince yourself of the different conclusions.
As the sample size increases, the size of the standard error decreases. To see this intuitively,
remember that estimates from larger samples more closely resemble the population parameter.
Therefore, there is less variation across estimates produced with larger samples. Thus, two
identical estimates may lead to different conclusions about the same hypothesis if those estimates
come from samples with different sizes.
With these tools we can test many hypotheses. We can hypothesize that β takes any particular
value of interest to us (1.5, 3, −1.2, etc.) and test it. In this context, confidence intervals are very
useful. Given a level of confidence and an estimate β̂, a confidence interval tells us all the values
for which we wouldn't reject the hypothesis that the parameter equals any of those values.
Colloquially, a confidence interval gives us a range of parameter values of distributions from
which our estimate is likely to come.
4.8 Joint-hypothesis tests
In the same regression, β̂₀, β̂₁, β̂₂, …, β̂_k aren't independent random variables. In general, they
are correlated. To see this, think of the original example using Social Security numbers and
expenditures on entertainment. We mentioned the possibility of getting samples for which the
estimate of the slope would be positive or negative. Greater positive slopes come accompanied
by lower intercepts, whereas greater negative slopes come accompanied by greater
intercepts. We can formulate hypothesis tests that involve more than one estimate at a time. For
instance, we can test whether the sum of two estimates is equal to one, whether the ratio of two
estimates is equal to two, etc. Statistical software does that for us in an incredibly easy way. The
underlying ideas about significance and confidence intervals are the same.
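For instance, a joint test of the hypothesis that two slopes sum to one can be sketched as follows. The variable names and the simulated data are hypothetical, and the f_test call is one way statistical software handles this kind of restriction.

    # Sketch: joint test that two slopes sum to one (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
    df["y"] = 1 + 0.6 * df["x1"] + 0.4 * df["x2"] + rng.normal(size=200)

    results = smf.ols("y ~ x1 + x2", data=df).fit()
    print(results.f_test("x1 + x2 = 1"))   # F test of the joint restriction on the two slopes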
5 The economics of regression
So far, we’ve discussed the mathematical and probability aspects of regressions. Now we are
moving on to the economics. Always keep in mind that regression is a tool. How we should use
it depends on the question we are trying to answer. We can broadly classify most questions into
three uses: descriptive, predictive and prescriptive.
5.1 Descriptive use
The greatest asymmetry between econometrics textbooks and the practice of econometrics is
in the emphasis on describing the data. Explicitly or implicitly, econometric textbooks focus on
causal relationships and assume we already have a theory. In practice, the way we model
phenomena (how we think the situation under analysis works) comes after observing the data.
That isn’t cheating, as some theoretical extremist may suggest. It’s the scientific method. We first
observe the world, then we come up with ideas about how it works.
In business, we start with the overall goal of improving the performance of the organization
(reducing churn, increasing loyalty, reducing employee turnover, decreasing unused
promotions, etc.). Then we look at the data to get ideas. What seems to be associated with what?
Is churn associated with gender? Are older customers more loyal? Is turnover associated with
personality traits measured by the human resources department? Is the rate of unopened
promotional emails related to the time of day they are sent? Exploring the world through the lens
of data allows us to find problems or areas of opportunity, and then come up with potential
solutions or ideas.
How do we explore the data? Visual inspection is usually insufficient or not feasible. We have
many variables and we cannot plot more than three dimensions at the same time. The analytical
“weapons of choice” for practitioners are partial correlation coefficients and conditional averages,
which are computed using regressions. The difference with regular or naïve correlations and
averages is that with partial correlation coefficients and conditional averages, we “hold everything
else constant,” “control for other factors,” or “adjust for other variables.”
Let's start with the use of regression coefficients as partial correlation coefficients. The idea
is closely related to the mathematical concept of a partial derivative. Partial correlation
coefficients offer numerical answers to the question: what is the relation between y and x holding
everything else constant? Think in terms of the regression model:
y_i = α + βx_i + γw_i + δz_i + ε_i
When we explore the relation between y and x, we want to hold constant w and z. If we simply
eyeball the data (say, with a scatter plot), we wouldn't be holding constant w and z. With a
regression, when we look at the coefficient β, by definition we are holding constant the other
regressors. Remember that:
β = ∂y/∂x
Suppose a restaurant chain is exploring the relationship between ticket size per customer (y)
and party size (x). The restaurant chain is entertaining the possibility of giving promotions to
increase party size because they believe larger parties spend more per customer. By running a
regression, they can get an estimate of β and test whether it is different from zero. The
regression may hold constant other variables (e.g. the day of the week, the time of the day, or
whether there was a special event like Monday Night Football). They can also test hypotheses
about the dollar value of the increase associated with one additional person in the party. For
instance, they could formally test whether the increase is $5 (i.e. β = 5).
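A sketch of how that test might look in Python is below. The column names (party_size, weekend, ticket_per_customer) and the simulated data are hypothetical; the point is the t_test of the hypothesis β = 5.

    # Sketch of the restaurant example (all data and names are made up)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 1000
    df = pd.DataFrame({
        "party_size": rng.integers(1, 8, size=n),
        "weekend": rng.integers(0, 2, size=n),
    })
    df["ticket_per_customer"] = 30 + 5 * df["party_size"] + 4 * df["weekend"] + rng.normal(0, 10, n)

    res = smf.ols("ticket_per_customer ~ party_size + weekend", data=df).fit()
    print(res.params)                       # estimated coefficients
    print(res.t_test("party_size = 5"))     # test whether the increase per additional person is $5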
The second workhorse of descriptive analysis is conditional averages. Regressions allow us
to calculate averages "adjusting for other factors" or "holding all else constant." To illustrate the
relevance of this, suppose a company is comparing the productivity of managers supervising
different groups of employees (perhaps the comparison will be used to pay bonuses). Let y_ji
represent the performance of employee i who works with manager j. For each manager, we
have a unique group of workers. Let ȳ_j represent the average performance of workers supervised
by manager j. What are some potential issues with simply comparing average worker
performance across managers? In the real world, not all workers are the same. Some are more
motivated or more skillful. A naïve comparison of average performance across managers may
lead to wrong decisions.
Assume an expert tells you that worker performance is affected by work experience. Thus,
it'd be better to think in terms of the model:
y_ji = θ_j + βx_ji + ε_ji
where θ_j is the productivity of manager j, and x_ji is the years of experience of worker i. By
comparing averages without any sort of adjustment, we would be missing the effect of
experience, x_ji, on the observed performance of each manager.
Take managers 1 and 2. If our model above is true, the naïve difference in average
performance is not θ₂ − θ₁. Rather, it is:
ȳ₂ − ȳ₁ = (θ₂ − θ₁) + β(x̄₂ − x̄₁) + (ε̄₂ − ε̄₁)
As you can see, the naïve approach involves differences in manager productivity but also in
worker experience. Unless x̄₂ = x̄₁, we would be omitting important information. If β > 0, then
there would be a bias in favor of the manager with more-experienced workers. Let’s look at a
graphical version of the same example.
The graph below presents a cloud of points denoting the performance of different workers.
The different colors of the points in the cloud denote the different managers supervising each
worker. The blue points correspond to manager 1, and the orange points correspond to manager
2. If we simply computed average worker performance by manager, the average for manager 1
would be lower than the average for manager 2. However, by looking at the experience of all
workers, it is clear that manager 1 supervises workers with less experience than manager 2. The
shape of the cloud also suggests there is a positive relationship between worker performance and
experience. The lines in the graph represent the results of fitting the model y_ji = θ_j + βx_ji + ε_ji,
which has a different intercept for each manager and the same slope for worker experience. The
estimate of the intercept for manager 1 is θ̂₁, and the estimate of the intercept for manager 2 is θ̂₂.
Remember that those estimates can be interpreted as managerial productivity. In the graph, θ̂₁ >
θ̂₂, which means that, holding worker experience constant, manager 1 is more productive than
manager 2. The result is the opposite of the naïve comparison.
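The sketch below mimics that situation with simulated data: manager 1 is more productive but supervises less-experienced workers, so the naïve comparison and the adjusted comparison point in opposite directions. All names and numbers are made up.

    # Sketch: naive versus adjusted comparison of two managers (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 300
    manager = rng.choice(["1", "2"], size=n)
    # Manager 1 gets less-experienced workers on average.
    experience = rng.normal(3, 1, n) + np.where(manager == "2", 4, 0)
    performance = 10 + np.where(manager == "1", 2, 0) + 1.5 * experience + rng.normal(0, 2, n)
    df = pd.DataFrame({"manager": manager, "experience": experience, "performance": performance})

    print(df.groupby("manager")["performance"].mean())   # naive comparison favors manager 2
    # Adjusted comparison: manager intercepts holding experience constant favor manager 1
    print(smf.ols("performance ~ C(manager) + experience", data=df).fit().params)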
Similar examples are given by performance comparisons in many occupations (doctors with
patients with different challenges, teachers with students of different backgrounds, lawyers with
cases with different difficulties) or in prices of goods with many attributes (insurance premiums
for people with different characteristics, prices of cars or computers with different features, wages
for workers with different sociodemographic traits).
Examples like those above can be grouped into what we call hedonic models. The name
comes from pricing models where "the price of the total is the sum of the prices of the parts,"
even if those parts' prices aren't observed in the market. For instance, think of house prices.
Being close to public transportation or having a backyard are valuable traits and certainly affect
the price of a house. However, you cannot buy those features in a market and add them to your
house. With hedonic models we can estimate the contribution of those traits to the total price as
if those traits could be added.
In sum, never use correlation or naïve comparisons of averages when you can use a
multivariate regression. Multivariate regressions allow you to control or adjust for other factors.
However, when running a regression, you must pay attention to what controls, covariates or
regressors are included in your analysis. It’s possible that sometimes you omit important
explanatory variables. Some other times you may be including too many. We’ll discuss those two
possibilities after we talk about fixed effects.
5.1.1 Fixed effects
In a regression model, fixed effects can be defined as different intercepts for different groups
of points. In our example of managers and workers above, we introduced manager-fixed effects.
All observations associated with one manager would share the same intercept, and those
intercepts could differ across managers. Fixed effects are estimates themselves. They are nothing
but coefficients on dummies.
Fixed effects can be used as controls (their value may be irrelevant to us) or as the subject of
our analysis (their value may be important to us). In the model above, we could be interested in
the relation of experience and worker performance. If we didn’t include manager fixed effects,
our fitted line would understate the actual relation. In that case, manager-fixed effects aren’t
interesting per se. We just use them to get the right estimate of a different parameter (β). If instead,
we are interested in measuring the difference managers make in worker performance, the
manager-fixed effects would be the most important result of the analysis.
To estimate fixed effects, our cloud of points must include several observations associated with
the same unit. For instance, to estimate manager-fixed effects, we need multiple workers
associated with each manager. We also need to know the identity of their managers—otherwise
we cannot group observations by manager.
In the example above, each resulting coefficient θ̂_j is interpreted as the "manager effect."
Depending on our subject of analysis, there may also be a "location effect," "holiday effect,"
"rush-hour effect," and a long et cetera (notice that, for brevity, we omitted the word "fixed").
In terms of notation, fixed effects can be written very concisely. Imagine that we have a day-of-the-week
fixed effect. We can denote it by η_d (the Greek letter eta with a subscript indicating
the day):
y_i = η_d + βx_i + ε_i
In that case, the subscript d would take seven possible values, from Sunday through Saturday.
Compare that to the equivalent dummy approach, where we would have one coefficient and one
dummy for each day of the week:
๐‘ฆ๐‘– = ๐œ‚๐‘†๐‘ข ๐‘‘๐‘–๐‘†๐‘ข + ๐œ‚๐‘€ ๐‘‘๐‘–๐‘€ + ๐œ‚ ๐‘‡๐‘ข ๐‘‘๐‘–๐‘‡๐‘ข + ๐œ‚๐‘Š ๐‘‘๐‘–๐‘Š + ๐œ‚ ๐‘‡โ„Ž ๐‘‘๐‘–๐‘‡โ„Ž + ๐œ‚๐น ๐‘‘๐‘–๐น + ๐œ‚๐‘†๐‘Ž ๐‘‘๐‘–๐‘†๐‘Ž + ๐›ฝ๐‘ฅ๐‘– + ๐œ€๐‘–
Clearly, it’s better to use the fixed effects notation rather than the dummy one, particularly
when we have large numbers of fixed effects.
Lastly, we can have two or more fixed effects in the same model. For instance, in the same
regression we can have fixed effects for day of the week and fixed effects for the hour of the day
(say, morning, midday, afternoon and evening). The model could be written as:
๐‘ฆ๐‘– = ๐œ‚๐‘‘ + ๐œƒโ„Ž + ๐›ฝ๐‘ฅ๐‘– + ๐œ€๐‘–
Fixed effects are very useful and not very well understood by many practitioners.
Paradoxically, they are incredibly easy to work with in practice. Also, they seem to create
information out of thin air. After all, without any direct information about managers, we are able
to measure (otherwise unobserved) differences in productivity. The intuition for this is that we
get indirect information through the multiple workers supervised by each manager.
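As a sketch of the notation in practice, the snippet below fits a model with day-of-the-week and hour-of-the-day fixed effects plus one continuous regressor. The data and variable names are hypothetical, and C( ) is just one common way software creates the underlying dummies.

    # Sketch: a regression with two sets of fixed effects (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 500
    df = pd.DataFrame({
        "day": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], size=n),
        "hour": rng.choice(["morning", "midday", "afternoon", "evening"], size=n),
        "x": rng.normal(size=n),
    })
    df["y"] = 2 * df["x"] + rng.normal(size=n)

    # One intercept shift per day and per hour block, plus a common slope on x
    res = smf.ols("y ~ C(day) + C(hour) + x", data=df).fit()
    print(res.params)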
5.1.2 Omitted variables
When a variable belongs in a model and we omit it, we create a bias. To show this, let's start by
assuming the correct model (without omissions) and contrast its results with what we get with
the omission. Suppose the correct model is y_i = α + βx_i + ε_i. Our estimate of β is:
β̂ = cov(x, y) / var(x)
If our model is correct, we can substitute y with α + βx + ε in the formula above. After some
algebra, and using the properties of covariance, we get that the expected value of our estimate is
the parameter:²
E[β̂] = cov(x, α + βx + ε) / var(x) = [cov(x, α) + cov(x, βx) + cov(x, ε)] / var(x) = cov(x, βx) / var(x) = β cov(x, x) / var(x) = β
² This equality holds with expected values. As you know, when we talk about a particular sample, the
estimate will likely not be equal to the parameter.
In this case, we say that the estimate is unbiased. The expectation of β̂ is β. In reality, we
cannot be certain about what the correct model is. But contemplating the possibility of omitting
relevant variables is important.
Let's go back to our example of party size and average ticket in a restaurant. What can be
missing from the analysis? We can think of many determinants of ticket size per customer besides
party size. An obvious one is socioeconomic status—there can be many others. Let's think of the
model y_i = α + βx_i + γz_i + ε_i, where y_i is the ticket size per customer of party i, x_i is party i's
size, and z_i is the socioeconomic status of the person paying the check (perhaps measured by the
type of payment). How is party size related to expenditure per customer?
Holding all else constant, ∂y/∂x = β. If we could run the regression y_i = α + βx_i + γz_i + ε_i,
we would obtain estimates for the three parameters. However, when we run a regression of y on
x alone (omitting z), what do we get? Let's look at the expected value of our estimate of β:
E[β̂] = cov(x, α + βx + γz + ε) / var(x) = β + γ cov(x, z) / var(x)
The term γ × cov(x, z)/var(x) is the omitted-variable bias. In words, by omitting z from the
regression, our estimate of β is biased. What can we say about the sign and magnitude of the
omitted-variable bias? The bias depends on: (i) the coefficient on the omitted variable, in this case
γ, and (ii) the covariance between included and omitted regressors, in this case x and z. The
following table goes over all the possibilities.

γ            cov(x, z)    Omitted-variable bias is…
0            Any value    Zero
Any value    0            Zero
> 0          > 0          Positive
< 0          > 0          Negative
> 0          < 0          Negative
< 0          < 0          Positive
What does the table imply for the analysis of ticket size? The analysis is omitting
socioeconomic status. The sign of the bias depends on whether socioeconomic status increases or
decreases average ticket size and on how it relates to party size. To develop your intuition, go over
several possibilities.
Economists always think of potential omitted-variable bias when they look at correlations or
regression coefficients. How does omitted-variable bias look in practice? Sometimes we have
other potential regressors. We simply include them and see what happens. If there is no omitted-variable
bias, adding regressors doesn't change our estimates in a meaningful way. If there is
omitted-variable bias, adding regressors changes our estimates. When we don't have other
regressors, economic theory may be informative about the sign or even the magnitude of the
omitted-variable bias.
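A quick simulation makes the bias visible. In the sketch below (all numbers made up), z is the omitted variable; the short regression that leaves it out recovers a slope far from the true value, while the long regression does not.

    # Sketch: omitted-variable bias in a simulation (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n = 5000
    z = rng.normal(size=n)               # omitted variable (e.g. socioeconomic status)
    x = 0.8 * z + rng.normal(size=n)     # included regressor, correlated with z
    y = 1 + 2 * x + 3 * z + rng.normal(size=n)

    short = sm.OLS(y, sm.add_constant(x)).fit()                         # omits z
    long = sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit()    # includes z
    print(short.params[1])   # biased upward: roughly 2 + 3 * cov(x, z) / var(x)
    print(long.params[1])    # close to the true slope of 2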
Let's revisit the manager productivity example above. Imagine we are interested in the
relation between worker performance and experience. If we use a model without manager-fixed
effects (i.e. with one intercept), the slope we would get would be smaller than if we use a model
with manager-fixed effects (i.e. with multiple intercepts, one for each manager). The omission of
manager dummies as explanatory variables biases downward the estimate of the relation between
worker performance and experience.
5.1.3 Redundant variables
In a regression, what happens if two regressors are very much measuring the same thing? This is
called collinearity. If collinearity is perfect, then two or more regressors have an exact linear
relationship. In the case of two regressors, we can represent perfect collinearity with the equality:
x_2i = δ₀ + δ₁x_1i
With perfect collinearity, we cannot include both x₁ and x₂ as regressors in our regression. Notice
what would happen if we did:
y_i = β₀ + β₁x_1i + β₂x_2i + ε_i
    = β₀ + β₁x_1i + β₂(δ₀ + δ₁x_1i) + ε_i
    = (β₀ + β₂δ₀) + (β₁ + β₂δ₁)x_1i + ε_i
    = γ₀ + γ₁x_1i + ε_i
This is equivalent to dropping x₂ from the regression (we could have dropped x₁ and kept only
x₂ instead). In fact, statistical software automatically does it for us. The question remains: can we
recover estimates of β₀ and β₁ from estimates of γ₀ and γ₁? The answer is no. To see why,
let's consider an example. Think of x₁ as temperature in degrees Celsius (°C) and x₂ as
temperature in degrees Fahrenheit (°F). Notice that there is perfect collinearity, since °F = 32 +
1.8°C. We can estimate the effect of temperature measured in degrees Celsius or Fahrenheit, but
we cannot estimate the effect of one holding the other constant—it'd be meaningless.
What about cases in which collinearity isn't perfect but high? Imagine that the correlation is
greater than 0.8. In this case, the coefficients on the collinear regressors "dilute." Jointly, they may
be significant, but each of them (or at least some of them) may not be. Think of income and wealth,
or grit and conscientiousness. To some extent, they measure the same thing. Coefficients become hard
to interpret as partial derivatives. If we include variables that measure similar things, we should
be explicit about this issue. Whenever possible, the inclusion of regressors must be informed by
theory. We should ask ourselves the question: do we truly believe these regressors belong in the
regression?
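The dilution can be seen in a small simulation like the one below (hypothetical data): when an almost identical copy of a regressor is added, the standard errors on the collinear pair blow up even though the overall fit is fine.

    # Sketch: near-perfect collinearity inflating standard errors (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    n = 200
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
    y = 1 + 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    one = sm.OLS(y, sm.add_constant(x1)).fit()
    both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(one.bse)    # standard errors with a single regressor
    print(both.bse)   # much larger standard errors on the collinear pair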
5.1.4 Dummies and redundancy
In many instances, dummy variables result in redundancy. In other words, they show perfect
collinearity. That’s not necessarily a problem. Consider the following example. In a questionnaire,
you are asked to mark yes or no:
                          Yes    No
Female                    [ ]    [ ]
Minority                  [ ]    [ ]
College degree            [ ]    [ ]
Over 65 years of age      [ ]    [ ]
The sum of these four dummies ranges between 0 and 4. Like switches, they can be turned on
(1) or off (0) independently of each other. We can imagine people in each of the 16 possible
combinations. These dummies are independent.
Now consider a questionnaire that includes the following questions:
                          Yes    No
Female                    [ ]    [ ]
Male                      [ ]    [ ]

                          Yes    No
White                     [ ]    [ ]
Black                     [ ]    [ ]
Hispanic                  [ ]    [ ]
Other                     [ ]    [ ]

                          Yes    No
High school or less       [ ]    [ ]
Some college              [ ]    [ ]
College degree            [ ]    [ ]

                          Yes    No
0 to 30 years of age      [ ]    [ ]
31 to 65 years of age     [ ]    [ ]
Over 65 years of age      [ ]    [ ]
Notice that you can only mark female or male, and therefore the sum of the first two dummies
is always equal to one. The sum of the next four dummies for race/ethnicity is always equal to
one. The same can be said about the dummies for educational attainment and age. That’s because
within each group those dummies are associated with mutually exclusive categories. Hence, they
are not independent. If you are in one category, you must not be in another. They are dependent.
Other categories may be nested. The following question asks you for your place of birth.
Whenever the dummy for Chicago is 1, the dummies for Illinois and U.S. are also 1, and the
dummies for Outside of Chicago, Outside of Illinois and Outside of the U.S. are 0. Clearly, those
dummies are not independent. They aren’t perfectly collinear either—their sum isn’t always the
same. However, there is redundant information. If you were born in Chicago, then you weren’t
born outside of Chicago, Illinois or the U.S. Within subsets, some dummies are perfectly collinear.
                          Yes    No
U.S.                      [ ]    [ ]
Outside of the U.S.       [ ]    [ ]
Illinois                  [ ]    [ ]
Outside of Illinois       [ ]    [ ]
Chicago                   [ ]    [ ]
Outside of Chicago        [ ]    [ ]
Consider the alternative version of the same question about your place of birth:
                          Yes    No
U.S.                      [ ]    [ ]
Illinois                  [ ]    [ ]
Chicago                   [ ]    [ ]
It should be clear that the information elicited is exactly the same. The second version of the
question eliminated all redundancy, but the dummies remain dependent. That’s a result of them
being nested. Chicago is in Illinois, and Illinois is in the U.S.
Regardless of whether we have dependent or independent dummies, we must pay attention
to perfect collinearity. Remember that when there is an intercept in our model, there is a regressor
x₀ equal to 1. Assume x₁ and x₂ are perfectly collinear dummies. Since x_1i + x_2i = 1, we have
that x_1i + x_2i = x_0i. Thus, we cannot estimate the model:
y_i = β₀x_0i + β₁x_1i + β₂x_2i + ⋯ + β_k x_ki + ε_i
We must drop either x₀, x₁ or x₂ from our regression. If we drop x₁, our model becomes:
y_i = γ₀ + γ₂x_2i + ε_i
Alternatively, by dropping the constant (x₀), our model becomes:
y_i = δ₁x_1i + δ₂x_2i + ε_i
But, since x_1i = 1 − x_2i, we can write this as:
y_i = δ₁(1 − x_2i) + δ₂x_2i + ε_i = δ₁ + (δ₂ − δ₁)x_2i + ε_i
The models above are equivalent. If we look at the conditional expected value of the dependent
variable, we get:
E[y | x₁ = 1, x₂ = 0] = γ₀ = δ₁
E[y | x₁ = 0, x₂ = 1] = γ₀ + γ₂ = δ₂
When we have perfectly collinear dummies, there are multiple equivalent ways to formulate
our model. The results are exactly the same, but they are stated differently. Think of the wage
gender gap. Assume our dependent variable is wage. We could include as regressors a constant
and a dummy for female. The intercept would tell us the average wage among males, and the
coefficient on the dummy would tell us the female minus male gender gap. Alternatively, we could
substitute the dummy for female with a dummy for male. In this case, the intercept would tell us
the average wage among females, and the coefficient on the dummy would tell us the male minus
female gender gap. Lastly, we could exclude the constant and include a dummy for male and a
dummy for female. Now the coefficients on the dummies would tell us the average wages for
males and females, respectively. The difference would be the gender gap. The information these
three models provide is exactly the same, just arranged differently.
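The three parameterizations can be verified with a short sketch like the following; the wage data and the female/male dummies are simulated and purely illustrative.

    # Sketch: three equivalent wage-gap specifications (hypothetical data and names)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    n = 400
    female = rng.integers(0, 2, size=n)
    wage = 20 + 3 * (1 - female) + rng.normal(0, 2, n)   # males average 23, females average 20
    df = pd.DataFrame({"wage": wage, "female": female, "male": 1 - female})

    print(smf.ols("wage ~ female", data=df).fit().params)            # intercept = male mean, slope = female-minus-male gap
    print(smf.ols("wage ~ male", data=df).fit().params)              # intercept = female mean, slope = male-minus-female gap
    print(smf.ols("wage ~ 0 + female + male", data=df).fit().params) # no constant: the two group means directly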
5.1.5 Measurement error
Frequently, our measurements aren’t accurate. In other words, we have measurement error.
Not every kind of measurement error is equally interesting or relevant. If the error is always equal
to a constant positive number, then it means we always overstate the true value of our variable
by that number. If the constant number is negative, then it means we understate the true value.
Those cases aren’t very interesting because measurement error simply shifts the cloud of points
up, down or sideways but it doesn’t affect its shape. The most interesting kind of measurement
error is the one that doesn’t systematically inflate or deflate our measurements, but it makes them
noisy. Sometimes it’s called classical measurement error. Height, income, IQ are examples of
variables that can have this type of noise. What are the effects of measurement error on our
estimates? Put simply, it depends. The most important lesson we’ll learn is that measurement
error in the regressors causes attenuation bias, which means that the regression coefficients are
biased towards zero.
Graphically, measurement error in the explanatory variable stretches horizontally our cloud of
points. The figure below shows a very simple example. Imagine we start with the cloud of points
given by the solid points. The regression in the absence of measurement error is denoted by
the solid line. With measurement error in x, the cloud would look like the hollow points. Given
the height in the cloud, some hollow points would be shifted to the right of the solid points while
others would be shifted to the left, but their average location would be given by the solid points.
The dashed line denotes the regression in presence of measurement error. Horizontally stretching
the cloud flattens the slope of the regression.
To show this algebraically, assume y_i = α + βx_i + ε_i. Instead of x_i, we observe x_i* = x_i + u_i,
where cov(x, u) = 0. This zero covariance means that measurement error (denoted by u) isn't
associated with x in any systematic way. The expected value of our estimate of β is:
E[β̂] = cov(x*, y) / var(x*) = cov(x + u, y) / var(x + u) = β ( var(x) / (var(x) + var(u)) )
Notice that the term in parentheses is always in the interval (0,1) because all terms inside are
positive. That means with measurement error we expect β̂ to be somewhere between 0 and β (i.e.
closer to zero). It's important to note that the sign of the attenuation bias depends on the sign of
the coefficient: it's negative when the parameter is positive and vice versa. The t-statistic of our
estimate is also biased towards zero, which means we would be less likely to find
significance.
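A simulation sketch of attenuation bias is below (all numbers made up): the slope estimated with the noisy regressor shrinks toward zero by roughly the factor var(x)/(var(x) + var(u)).

    # Sketch: attenuation bias from classical measurement error in the regressor (hypothetical data)
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    n = 10000
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)
    x_star = x + rng.normal(scale=1.0, size=n)   # observed regressor = true x plus noise u

    clean = sm.OLS(y, sm.add_constant(x)).fit()
    noisy = sm.OLS(y, sm.add_constant(x_star)).fit()
    print(clean.params[1])   # close to the true slope of 2
    print(noisy.params[1])   # close to 2 * var(x) / (var(x) + var(u)) = 1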
A different situation is when we have measurement error in the dependent variable. In this
case there is no bias, but we experience a loss of precision. Our cloud is vertically stretched, which
results in larger standard errors. The figure below provides a simple example. The solid points
show the situation without measurement error. The line represents the regression line in that
case. The hollow points constitute the cloud with measurement error in y. Some are shifted up
and some are shifted down relative to where they should be. Their average vertical position is
unaltered. Thus, the regression line is the same as without measurement error. However, it
should be borne in mind that, if we took many samples, we could now get different slopes, and
therefore the standard error is larger.
The fact that measurement error in the explanatory variable attenuates the estimates is
important. It means that, in the absence of measurement error, the estimates would have a greater
magnitude and smaller p-values. In other words, if you get significant coefficients in a regression
and someone has a hard time believing your results, arguing that there is measurement error
(perhaps not with those words), then you can reply that, if there is measurement error, getting rid
of it would only make your coefficients larger and more significant.
5.2 Predictive use
Before we get started, let's make a distinction between forecast and prediction. When we
forecast, we determine what the future will bring, conditional only on time passing. When we
make a prediction, we come up with what we expect y to be, assuming we know x₁, x₂, …, x_k.
Most businesses rarely make forecasts. They routinely make predictions—some good and some
bad. Let's imagine attendance at a chain of gyms can be modeled as:
y_i = β₀ + β₁x_1i + β₂x_2i + ⋯ + β_k x_ki + ε_i
Assume the dependent variable is defined as the number of days attended over the course of the 12
months after joining the gym. Assume the regressors x₁, x₂, …, x_k are observed or reported at the
moment of signing up—i.e. before attendance occurs. They include age, body mass index, gender,
marital status, educational attainment, etc. We can estimate β₁, β₂, …, β_k using all members
who signed up in January 2019. We would have 12 months of data for each of them (up to
December 2019).
Suppose a new member j signs up. We observe x_1j, x_2j, …, x_kj for her. Our prediction of
attendance given her age, body mass index, gender, and so on, is:
ŷ_j = β̂₀ + β̂₁x_1j + β̂₂x_2j + ⋯ + β̂_k x_kj
In other words, our predicted value or prediction is a fitted value for some values of the regressors.
When we try to predict, we pay little or no attention to each regression coefficient or to their
significance. We only care about the fit. How can we know if our predictions are good? A higher
R-square (our measure of goodness of fit) means a better prediction. Also, a narrower confidence
interval around the prediction means more accuracy. The graphs below show examples with
different R-squares and different confidence intervals. By definition, greater residuals mean
worse predictions. The R-square captures those larger residuals.
In practice, we usually start with two data sets. The first of them is retrospective. It consists
of the x₁, x₂, …, x_k observed before the fact we care about took place (e.g. gym member
characteristics at sign-up in January 2019, before attendance occurs), and the y observed after that
fact (e.g. gym attendance over the course of 2019). The second data set is prospective. We only
observe the x₁, x₂, …, x_k before the fact (e.g. characteristics of gym members signing up in January
2020, for whom we haven't observed attendance because it hasn't occurred yet). We pool the
two data sets together. Notice that the dependent variable y is missing for the prospective
observations. Then we run our regression, which will only include the retrospective data. Lastly, we
use the fitted model (i.e. β̂₁, β̂₂, …, β̂_k) to compute the gym attendance we expect for each new
member, given her values of x₁, x₂, …, x_k. In practice, this is very easy—it can be done in three
lines of code. The important part is understanding the logic.
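Here is one way that logic might look in Python, with hypothetical variable names and simulated data standing in for the gym example; the regression is fit on the retrospective data and the fitted model is then applied to the prospective rows.

    # Sketch: fit on retrospective data, predict for prospective members (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(8)
    retro = pd.DataFrame({"age": rng.normal(35, 10, 500), "bmi": rng.normal(26, 4, 500)})
    retro["attendance"] = 80 - 0.5 * retro["age"] - 1.0 * retro["bmi"] + rng.normal(0, 10, 500)
    prosp = pd.DataFrame({"age": rng.normal(35, 10, 100), "bmi": rng.normal(26, 4, 100)})  # attendance not observed yet

    fit = smf.ols("attendance ~ age + bmi", data=retro).fit()   # estimate on retrospective data only
    prosp["predicted_attendance"] = fit.predict(prosp)          # fitted values for the new members
    print(prosp["predicted_attendance"].mean())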
We can compute statistics using the predicted values for the prospective observations (mean,
variance, proportion greater than a threshold, etc.). Some examples where regressions are used
to predict important variables are: consumer lifetime contribution of newly acquired customers,
performance of potential new hires, credit scores, admissions, fraud detection, and consumer
behavior in platforms like Netflix, Amazon or Spotify. Can you imagine how?
When we are predicting, it's important to distinguish between two types of predictions. One
of them is within-sample predictions, which is when the values of the prospective x's fall inside
the range of the retrospective x's. The other type is out-of-sample predictions, which is when the
values of the prospective x's fall outside of the range of the retrospective x's. There isn't much cause
for concern when we make within-sample predictions. However, when we make out-of-sample
predictions, our model could be flat-out incorrect. To see it, imagine extrapolating any behavior
(drinking, dating, working) based on customer age when your retrospective data only includes
people between the ages of 15 and 25. What would happen if you tried to predict the same behavior
for five-year-olds? How about sixty-year-olds? Intuitively, predictions for prospective x's closer
to the average value of the retrospective x's are more accurate—they have narrower confidence
intervals. Keep in mind that confidence intervals look like bow ties. Can you say why?
We can use decoys to verify that our predictions make sense. For instance, we can use one
subset of the retrospective data to predict another subset of the same data. This type of exercise
is what is used in machine learning and artificial intelligence. The idea behind the notion of
"training an algorithm" is simply finding the β̂'s that produce predictions with higher R-squares
for whatever it is that we care about. For instance, think of speech or face recognition.
To construct the confidence interval for our predictions, we think of y as a parameter given our
β's and the x's:
y = β₀ + β₁x₁ + ⋯ + β_k x_k
What would be our estimate of the "parameter" y? The fitted value ŷ = β̂₀ + β̂₁x₁ + ⋯ + β̂_k x_k. In
this context, x₁, x₂, …, x_k are fixed numbers. Because of sampling, we think of β̂₁, β̂₂, …, β̂_k as
random variables. Thus, ŷ is also a random variable with a distribution centered at y and some
standard error derived from the standard errors of the β̂'s. For each possible x₁, x₂, …, x_k, statistical
software can give us ŷ and its standard error. To build the confidence interval we use both. For
instance, assume we choose a confidence level of 95%. Given some values of x₁, x₂, …, x_k, the 95%
confidence interval of our prediction is defined as:
95% C.I. = (ŷ − 1.96 × S.E., ŷ + 1.96 × S.E.)
We can look at a graphical example in two dimensions. Given a value of x, we build the
confidence interval of our prediction using the prediction itself (ŷ) plus/minus the critical value
corresponding to the confidence level we want (the value 1.96 corresponds to 95% confidence)
multiplied by the standard error of ŷ. By construction, the confidence interval is centered at the
prediction.
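A sketch of how software typically returns that interval is below; the data are simulated and the get_prediction call is just one convenient implementation.

    # Sketch: confidence interval around a prediction (hypothetical data)
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    df = pd.DataFrame({"x": rng.normal(size=300)})
    df["y"] = 1 + 2 * df["x"] + rng.normal(size=300)
    fit = smf.ols("y ~ x", data=df).fit()

    new_x = pd.DataFrame({"x": [0.5]})
    pred = fit.get_prediction(new_x)
    print(pred.predicted_mean)           # y-hat at x = 0.5
    print(pred.conf_int(alpha=0.05))     # 95% CI: roughly y-hat plus/minus 1.96 times S.E.(y-hat)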
For a given x, what is the interpretation of the 95% confidence interval around ŷ pictured
above? It's analogous to what we discussed before for the β̂'s. Make sure you can explain this
with your own words.
5.3 Prescriptive use
Whether in business, government or the not-for-profit sector, the ultimate goal of empirical
analysis is to produce recommendations. Based on evidence, we want to know what should be done
to improve the bottom line of a company or the results of a policy or program. Although the
descriptive and predictive uses of regression may shed some light on what could make sense to
do, they do not offer solid advice. For instance, think about the prices charged by a company. Are
the prices too high or too low relative to the profit maximizing level? You can imagine arguments
in favor of price increases, as well as arguments in favor of lower prices. In theory, it is unclear
whether the price is too high, too low or just right. In order to be able to make a recommendation
we need evidence. How would you determine whether a price increase would result in higher or
lower profits? This type of problem goes well beyond pricing decisions. It involves pretty much
every decision.
Imagine a retail company has 1,000 stores and 500 are upgraded with the intention of
improving customer satisfaction and boosting sales. The quarterly report is just in. Average sales
in stores without upgrade are 600 (to make the example more appealing, you can imagine sales
are expressed in thousands of dollars). Average sales in stores with upgrade are 550. Did the
upgrade cause a decrease in sales? What should the board of the company do? Expand the (costly)
upgrades to all remaining branches? There are several ways to answer this sort of questions.
That’s what we will learn in this section. Before discussing the empirical methods, we must
introduce a few concepts.
5.3.1 Causality and the Rubin model
We will start with the so-called Rubin Model. Let's focus on the store-upgrade example. Take
the case of store i. Without the upgrade, sales at that store would have been Y_0i. With the
upgrade, sales would have been Y_1i. The causal effect of the upgrade is the difference in sales
across the two situations, which is given by Y_1i − Y_0i. Notice that the causal effect is defined for
each store i. We are interested in the average causal effect across stores, which we call the Average
Treatment Effect or ATE, and it is defined in terms of expected values:
E[Y_1i − Y_0i] = E[Y_1i] − E[Y_0i]
However, for each store we only observe one situation (either it was upgraded or it wasn't),
not both. Let D_i denote a dummy indicating whether a store was upgraded. Thus, D_i = 0 means
the store wasn't upgraded and D_i = 1 means the store was upgraded. We only observe:
Y_i = Y_1i D_i + Y_0i (1 − D_i)
In words, for stores with D_i = 1 we observe Y_1i, and for stores with D_i = 0 we observe Y_0i. We
know the facts but not the counterfactuals. A counterfactual is what would have happened in an
alternative reality—e.g. think of where you would be now had you not enrolled at this university.
In the table below we explain the difference between what we observe and what we don't observe.
Stores are divided based on whether they were upgraded or not. The first column
corresponds to stores that weren't upgraded (D_i = 0). The second column corresponds to stores
that were upgraded (D_i = 1). The third column pools all stores, regardless of upgrade
status. We divide the world into two alternative realities for each store. The first row corresponds
to a reality without the upgrade (Y_0i), and the second row corresponds to a reality with the upgrade (Y_1i).
It should be clear that we only observe information for two cells of the table. We know the sales
without the upgrade for stores that weren't upgraded (600). We also know the sales with the upgrade for
stores that were upgraded (550). We don't observe the counterfactuals, that is, sales with the upgrade
for stores that weren't upgraded, and sales without the upgrade for stores that were upgraded.
Naturally, we don't know the average across all stores for each row. We don't know the difference
across alternative realities for each group of stores either. So, there is a lot we don't know.
๐ท๐‘– = 0
๐ท๐‘– = 1
All
600
?
?
Average sales with upgrade
?
550
?
Difference made by upgrade
?
?
?
Average sales without upgrade
However, at least conceptually, we can fill in the table with the correct notions even if we don't
observe them. That's what the formulas in the table below represent:

                                 D_i = 0                     D_i = 1                     All
Average sales without upgrade    E[Y_0i | D_i = 0]           E[Y_0i | D_i = 1]           E[Y_0i]
Average sales with upgrade       E[Y_1i | D_i = 0]           E[Y_1i | D_i = 1]           E[Y_1i]
Difference made by upgrade       E[Y_1i − Y_0i | D_i = 0]    E[Y_1i − Y_0i | D_i = 1]    E[Y_1i − Y_0i]
We can adopt some useful definitions for the formulas in the last row. Those formulas provide
causal effects. The Average Treatment on the Untreated (ATU) is the average difference the
upgrade would make among stores that weren't upgraded (i.e. the causal effect among untreated
stores):
ATU = E[Y_1i − Y_0i | D_i = 0]
The Average Treatment on the Treated (ATT) is the average difference the upgrade would
make among stores that were upgraded (i.e. the causal effect among treated stores):
ATT = E[Y_1i − Y_0i | D_i = 1]
Lastly, as we saw before, the ATE is the average difference the upgrade would make among
all stores (i.e. the causal effect among the whole group of stores):
ATE = E[Y_1i − Y_0i]
We can also express the ATE as the (weighted) average of the ATU and ATT. If we have the
same number of stores upgraded (500) and not upgraded (500), then ATE = (500/1000) ATU + (500/1000) ATT.
Going back to our problem, what do you think is more relevant to know for the company when
assessing the upgrades? The ATE, the ATT or the ATU? What is the economic relevance of each
of them? As an exercise, imagine situations in which each of them may matter most.
In a naïve comparison (i.e. a simple difference of observed average sales across the two groups
of stores) we get:
E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
How different is that naïve comparison from the ATE, ATU or ATT? To find out, let's add and
subtract the counterfactual E[Y_0i | D_i = 1] (which is the average sales without the upgrade among
stores that were upgraded):
Naïve comparison = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 1] + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]
                 = ATT + E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]
                 = ATT + Selection Bias in Y_0
The selection bias in Y_0 is defined as E[Y_0i | D_i = 1] − E[Y_0i | D_i = 0]. How would you explain it in
your own words? Please try until you come up with a simple version.
Instead, we could add and subtract E[Y_1i | D_i = 0] (i.e., the average sales with the upgrade among
stores that weren't upgraded):
Naïve comparison = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0] + E[Y_1i | D_i = 0] − E[Y_0i | D_i = 0]
                 = E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0] + ATU
                 = Selection Bias in Y_1 + ATU
The selection bias in Y_1 is defined as E[Y_1i | D_i = 1] − E[Y_1i | D_i = 0]. Explain in your own words
how it may be different from the selection bias in Y_0. It should be clear that the sign and the
magnitude of the selection biases depend on the determinants of the upgrade status. What
possible stories can you come up with for the biases to be positive or negative?
Notice that, if there are no selection biases, then the ATU and the ATT are equal to the naïve
comparison, and therefore the ATE is also equal to the naïve comparison. Make sure you can
express this powerful idea in terms of the formulas above.
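A toy numeric sketch may help fix ideas. The potential outcomes below are made up, and in real data we would never observe both columns for the same store; the point is that the naïve comparison reproduces ATT plus the selection bias in Y_0.

    # Toy sketch of the Rubin model (hypothetical potential outcomes for six stores)
    import numpy as np

    y0 = np.array([600.0, 610, 590, 700, 720, 710])   # sales without upgrade
    y1 = np.array([640.0, 650, 630, 745, 765, 755])   # sales with upgrade
    d = np.array([0, 0, 0, 1, 1, 1])                  # 1 = store was actually upgraded

    ate = (y1 - y0).mean()
    att = (y1 - y0)[d == 1].mean()
    atu = (y1 - y0)[d == 0].mean()
    naive = y1[d == 1].mean() - y0[d == 0].mean()
    selection_bias_y0 = y0[d == 1].mean() - y0[d == 0].mean()
    print(ate, att, atu)
    print(naive, att + selection_bias_y0)   # the naive comparison equals ATT + selection bias in Y_0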
There are several lessons stemming from the Rubin model. First, if used to infer causal effects,
naïve comparisons can be misleading when there are selection biases. Second, without
counterfactuals, we cannot know the ATE, the ATT or the ATU. Third, since people act with
purpose, selection is ubiquitous and, in general, correlation isn’t indicative of causation. This
framework distinguishes economists from most other professionals—think about news reports
on drinking wine as a healthy habit or the effects of doing yoga on productivity at work.
Keep in mind that selection may occur in unobservable traits, such as motivation, perceptions,
grit, opinions, etc. As a general rule, we cannot be sure we control for selection simply by adding
more regressors in our analysis.
Let’s look at an example about educational attainment and earnings. In this case, the
treatment is “getting a college degree.” When we compare earnings of college graduates versus
earnings of people without a college degree (i.e. non-college graduates), we only observe two of the
figures in the table below: the 45,000 earned by non-college graduates without a degree and the
99,000 earned by college graduates with a degree. We don't know the counterfactuals, e.g. how much
non-college graduates would earn had they gotten a college degree. Obviously, there may be selection
into college attendance. Imagine we were given believable estimates of the counterfactuals (the
remaining figures in the table). Based on those estimates, we can compute the causal effect of
attending college on earnings.
Given the figures in the table, what could you say about selection biases? Can you compute them?
You surely can. Selection bias in Y_0 is 30,000, whereas selection bias in Y_1 is 39,000. Make sure you
know how to interpret the entire table.
Annual earnings at age 40 …        Non-college graduates (2/3)    College graduates (1/3)    All
Without a college degree           45,000                         75,000                     55,000
With a college degree              60,000                         99,000                     73,000
Difference (with minus without)    15,000                         24,000                     18,000
We don’t have to restrict ourselves to binary comparisons. Let’s look at another example with
three alternatives. In this case, we compare earnings of college graduates who attended different
higher education institutions. One could argue those institutions may have different effects on
the earnings of their graduates. However, students purposely seek admission only to some
institutions, and admissions officers purposely reject some applicants and admit others. Based on
those purposive behaviors, we expect some selection biases. Instead of presenting amounts of
money, the table below simply presents placeholders. The diagonal ones (A, E and I) are observed.
The rest aren't. A comparison of I vs E, or I vs A, may not be informative of the difference
it makes to attend one university instead of the other. Appropriate comparisons require estimating
the counterfactuals (the off-diagonal letters).
                              Institution attended
Earnings if attended…         Chicago State    U. of I. at Chicago    U. of Chicago
Chicago State                 A                B                      C
U. of Illinois at Chicago     D                E                      F
U. of Chicago                 G                H                      I
Before we proceed, let's think of a last example, also related to educational attainment. In this
case, we look at years of schooling, which we can think of as a continuous variable. People who
attain more years of schooling on average have higher earnings. The graph below shows an
example of a cloud of points representing observed annual earnings for people with different
levels of schooling. We could run a regression of earnings (y) on years of schooling (x) in a model
such as y_i = α + βx_i + ε_i. Our regression would produce a slope of $12,000. We may be tempted
to conclude that, on average, attending college increases annual earnings by $48,000 (4 × $12,000).
In light of what we've discussed, do you agree with that conclusion? Is $12,000 a valid estimate
of the causal effect of a year of schooling?
Our estimate β̂ is naïve. In general, it shouldn't be interpreted as the causal effect of years of
schooling on earnings because there may be selection into the years of schooling attained. Perhaps
people who attain more years of schooling would make more money than those who attained
fewer years of schooling even if everyone had the same educational attainment. Think of the
appropriate counterfactuals. The graph below shows two examples. The red triangles show
counterfactual earnings across different levels of schooling for people who in fact attained twelve
years of schooling. Obviously, factual and counterfactual earnings coincide at twelve years of
schooling. The green diamonds show counterfactual earnings for people who attained sixteen
years of schooling. In this case, factual and counterfactual earnings coincide for sixteen years of
schooling. In this example, given equal schooling, people who attained sixteen years of schooling
(green markers) on average would make more money than people who attained twelve years (red
markers). That means there is a positive selection bias. Those who would earn more ceteris
paribus are also the ones who end up with more schooling.
If we could observe the counterfactuals in the graph above, we would run a regression and
adequately estimate the causal effect of one year of schooling on earnings. In that example, the
slope would be $6,000, which is half the naïve estimate (the other half is the selection bias). This
is just one example in which I assumed positive selection. As you can imagine, there are many
theoretical possibilities for the selection biases. Try to go over a few examples with different signs.
The example above illustrates that, as a general rule, we shouldn’t interpret regression
coefficients as estimates of causal effects. If we want to do that, we need good reasons why we
should believe there are no selection biases.
The group of techniques used to estimate counterfactuals and gauge causal effects is referred
to as impact evaluation. Next, we will learn three ways to estimate causal effects that avoid
selection and other biases.
5.3.2 Randomized-Control Trials
Organizations want to find what works best for them and do it. For instance, they may be
interested in contrasting the status quo of a program or policy versus a new idea or a set of
different new ideas. When trying to prescribe what to do, it’s helpful to think in terms of a medical
analogy. There is a diagnosis, i.e. the situation we believe is wrong or that could be improved.
There is also a treatment, i.e. the action that will cause the correction or improvement. It must be
something we can manipulate—academics may be interested in non-manipulated causes but
that’s not the case among practitioners. Lastly, there is an outcome or metric of interest, i.e. the
variable where success would be observed. In a nutshell, we attempt to estimate the causal effect
of the treatment on the outcome of interest, and decide whether the treatment should be
introduced, stopped or continued.
The key concept here is causality. There is no statistical test for causality. It is a logical—not
statistical—concept. The methods used for estimating causal effects are usually referred to as
impact evaluation. We will study three of the most common approaches used in impact
evaluation. Please look at the World Bank’s Impact Evaluation in Practice, which is freely available
online here.
Ideally, we would like to conduct an experiment or randomized-control trial (RCT), the gold
standard in impact evaluation. To see it in practice, imagine a company that delivers a newsletter
by email to a list of one million subscribers. The newsletter contains offers and is used as a sales
tool. A metric of success is email opening—it leads to sales. Someone in the company detects an
area of opportunity. The subject line in those emails is “impersonal and unappealing.” One idea
is to include the recipient’s given name in the subject line (e.g. Pablo, great deals just for you!). Other
people think such a strategy would become ineffective after a while, when the recipients get used
to seeing their name in the subject line. Someone suggests alternating subjects at different
frequencies. We can imagine ๐‘˜ different treatments in addition to the regular subject line (the
control group that represents the status quo). Treatment 1 would include the recipient’s name in
every newsletter. Treatment 2 would include the recipient’s name every other newsletter.
Treatment 3 would include the recipient’s name every three newsletters, and so on. In terms of Y_i
in the Rubin Model, we would have 1, 2, …, k alternative treatments and therefore k + 1 possible
outcomes Y_0i, Y_1i, Y_2i, …, Y_ki. What steps should we take?
Start with the list of all email recipients of the newsletter. Choose a random sample for the
experiment. Within that sample, randomize who gets which of the treatments and who doesn’t
get any. In other words, using a lottery we create the k treatment groups and a control group.
Different treatments are called arms of treatment. Apply the treatment and measure what
happens after time elapses (say, three months later). To determine what worked best, compare
the metrics of interest (the email opening rate). You could do this with a regression model:
y_i = β_0 + β_1 x_i1 + β_2 x_i2 + ⋯ + β_k x_ik + ε_i
where x_i1 is a dummy indicating whether recipient i received treatment 1, x_i2 is a dummy
indicating whether recipient i received treatment 2, and so on. Since each treated recipient
receives one treatment only, then x_i1 + x_i2 + ⋯ + x_ik = 1 among treated recipients. Among
non-treated recipients we have x_i1 + x_i2 + ⋯ + x_ik = 0. Notice that:
E[y_i | x_i1 = x_i2 = ⋯ = x_ik = 0] = β̂_0
E[y_i | x_i1 = 1] = β̂_0 + β̂_1
⋮
E[y_i | x_ik = 1] = β̂_0 + β̂_k
In words, β̂_0 is the average opening rate in the control group, and β̂_0 + β̂_j is the average opening
rate in the arm of treatment j, where j = 1, 2, …, k. Our estimate of the causal effect of receiving
treatment 1 is given by the difference with respect to the control group:
E[y_i | x_i1 = 1] − E[y_i | x_i1 = x_i2 = ⋯ = x_ik = 0] = (β̂_0 + β̂_1) − β̂_0 = β̂_1
In a similar way, we get estimates of the causal effect of the other treatments. We can also compare
causal effects across treatments. For instance, we may be interested in whether the causal effect
of treatment 3 is greater than the causal effect of treatment 2. Our estimate of the difference
between the causal effects of those two treatments is:
E[y_i | x_i3 = 1] − E[y_i | x_i2 = 1] = (β̂_0 + β̂_3) − (β̂_0 + β̂_2) = β̂_3 − β̂_2
We already know that our estimates β̂_0, β̂_1, …, β̂_k have standard errors associated with them.
Therefore, we can measure significance, build confidence intervals, and test (joint) hypotheses.
The crucial part is that we can causally attribute differences in the metric of interest (the opening
rate) to differences in the treatment received (subject lines). Once we determine which treatment
works best, we can implement it in the whole mailing list.
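As a sketch of how this looks in code (Python here for concreteness; the same regression can be run in Stata or any statistical package, and the opening probabilities below are invented), we can simulate a control group and three treatment arms and regress an opening indicator on the arm dummies. The intercept recovers the control group’s average opening rate and each coefficient recovers the difference with respect to the control.

# Sketch of an RCT analysis: opening indicator regressed on treatment-arm dummies.
# The opening probabilities below are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_per_group = 5_000
open_prob = {0: 0.20, 1: 0.26, 2: 0.24, 3: 0.22}   # arm 0 is the control (status quo subject line)

rows = []
for arm, p in open_prob.items():
    rows.append(pd.DataFrame({"opened": rng.binomial(1, p, n_per_group), "arm": arm}))
df = pd.concat(rows, ignore_index=True)

for j in (1, 2, 3):                                # one dummy per treatment arm; control is omitted
    df[f"x{j}"] = (df["arm"] == j).astype(int)

fit = smf.ols("opened ~ x1 + x2 + x3", data=df).fit()
print(fit.params)      # intercept near 0.20; x1, x2, x3 near the true effects 0.06, 0.04, 0.02
print(fit.conf_int())  # confidence intervals for each estimated effect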
To make perfectly clear why an RCT is the ideal way to measure causal effects, let’s revisit
our store upgrade example. Assume an RCT was performed (i.e. selection of stores for the
upgrade was random). Thus, since “coins were tossed” to decide which stores were upgraded,
we have that:
E[Y_0i | D_i = 1] = E[Y_0i | D_i = 0]
E[Y_1i | D_i = 1] = E[Y_1i | D_i = 0]
In words, we expect no selection bias of any kind (there may still be some difference due to the
luck of the draw, but the chances are tiny). Thus, the naïve comparison in an RCT produces the
ATE, which in turn is equal to the ATU and ATT:
ATE = ATT = ATU = E[Y_1i | D_i = 1] − E[Y_0i | D_i = 0]
Why are experiments so useful? The source of identification (the variation that allows us to
make causal claims) is the randomization of allocation into the treatment or control groups. That
means the only systematic difference across treatment and control groups (or between different
arms) is the treatment. Obviously, there is no selection bias. But it’s just as important to make
clear that there is no omitted variable bias either. That’s because the treatment status is not related
to any other determinant of the metric of interest. Last but not least, in experiments we usually
control sample size and therefore we have a handle on the precision of our estimates (remember
that sample size mechanically affects standard errors, which in turn determine significance and
confidence intervals).
At this point it’s crucial to establish the difference between two concepts: statistical significance
and economic relevance. In colloquial terms, statistical significance means our estimate is
different from zero, and it doesn’t look like that’s the result of chance—we have a small p-value,
regardless of its magnitude. In contrast, economic relevance means the magnitude of our
estimate indicates the causal effect makes an important difference—it has a considerable
magnitude, regardless of the p-value. They may or may not come together. Keep in mind they
are separate concepts. To see the difference, think of the following statements. Ceteris paribus,
sample size can always change the statistical significance of any (non-zero) estimate, even if it’s
small and therefore economically irrelevant. At the same time, ceteris paribus, a very high
opportunity cost can always make any estimate economically irrelevant.
When designing an experiment, we must consider both concepts to determine our sample
size. A sample size that is too large may result in a partial waste, but a sample that is too small
could result in a total waste. Let me explain both cases.
Having a sample that is too large is a concern because of costs. Think of out-of-pocket expenses
(e.g. data gathering, conducting surveys in the field, paying contractors to clean data, acquiring
the right hardware and software) as well as non-monetary costs (some organizations don’t like
the idea of fiddling with their operation at a large scale or with many of their customers). Given
those costs, having an unnecessarily large sample is a partial waste. We could do just as well with
a smaller, less costly sample. However, a sample size that is too small is worse. We may not be
able to tell whether whatever estimate we get is the result of luck or not (our standard errors
would be large). That would be a total waste.
How do we decide ex ante on the right sample size for an RCT? We need information on how
much the metric of interest varies. Intuitively, if the outcome of interest varies very little, then a
causal effect of a given magnitude is easier to identify than when the metric of interest varies a
lot. To illustrate this, let’s revisit the email opening example. Assume we know the standard
deviation of the number of newsletters opened is σ_Y. Assume also that the CEO of the company
considers appropriate a confidence level of 95%. Let’s focus on β̂_1, which is the causal effect of
treatment 1 relative to the control. If it’s close to zero, it means that treatment doesn’t work—it
doesn’t improve the opening rate. If it’s negative, it means it performs worse than the status quo.
By assuming β_1 = 0, we have a distribution of β̂_1 centered at zero with a standard deviation that
is an increasing function of the ratio σ_Y/√N, where N is the sample size. A larger N means a
narrower distribution of β̂_1. Thus, given the same estimate, larger samples would place that
estimate in the rejection region of the hypothesis β_1 = 0, and smaller samples would place it in
the no-rejection region. The graph below illustrates this idea.
The hypothesis tested is the same (β_1 = 0) and the estimate is also the same (β̂_1 = 1.5). The
difference in the distributions comes from different hypothetical sample sizes calculated a priori.
The orange distribution comes from a sample four times the size of the sample of the blue
distribution. In one case, the estimate would fall in the rejection region (orange distribution, with
larger sample size) and in the other it would fall in the no-rejection region (blue distribution, with
smaller sample size).
We know that a greater sample size results in a larger rejection region. At the same time, not
all magnitudes of causal effects are economically relevant. Why waste resources detecting the
magnitudes that are irrelevant? Based on economic criteria, we can define a priori the minimum
detectable effect or MDE we are interested in. Using σ_Y, we can compute the sample size
consistent with such MDE. For instance, imagine treatments are costly. We are only interested in
effects that surpass their costs, which means they are above a threshold B > 0. We can use that
threshold to reverse engineer the appropriate sample size. We start with the number B, which is
the minimum magnitude that is relevant to us. We calculate the “right” sample size given σ_Y and
a confidence level, so that the MDE is equal to B.
The table below shows different sample sizes as a function of the desired MDE given two
possible values of the standard deviation of the metric of interest (σ_Y = 1 and σ_Y = 2). We use
95% confidence, but we could pick any level we want. We assume a control group and just one
treatment group (of equal size). The sample size shown includes both groups (control and
treatment). The principle is simple. We find the sample size that (a priori) would lead us to reject
the hypothesis β = 0 if we had β̂ = B. Keep in mind that the sample size enters the calculation
through the standard error of β̂. As an example, assume we want an MDE of 0.4. With a standard
deviation of 1, the sample size should be 200. A larger sample size would allow us to detect
smaller causal effects, but those effects would be economically irrelevant. If the standard
deviation is 2, then we need a sample size of 788 to achieve the same MDE. Statistical software
does this for us very easily.
Minimum detectable effect      Sample size        Sample size
(at 95% confidence)            (σ_Y = 1.00)       (σ_Y = 2.00)
0.1                            3,142              12,562
0.2                              788               3,142
0.3                              352               1,398
0.4                              200                 788
0.5                              128                 506
0.6                               90                 352
0.7                               68                 260
0.8                               52                 200
0.9                               42                 158
1.0                               34                 128
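The calculation behind a table like this can be sketched in a few lines of Python. The sketch below assumes equal-sized control and treatment groups and the conventional 80% power; the text only mentions the 95% confidence level, so treat the power choice as an assumption, although with it the formula reproduces numbers very close to the table.

# Sketch: total sample size (control plus one equal-sized treatment arm) needed to detect
# a given MDE at 95% confidence. Assumes 80% power, which is not stated in the text.
from scipy.stats import norm

def total_sample_size(mde, sigma, confidence=0.95, power=0.80):
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)                       # 0.84 for 80% power
    n_per_group = 2 * ((z_alpha + z_beta) * sigma / mde) ** 2
    return 2 * n_per_group                         # both groups combined

for sigma in (1.0, 2.0):
    for mde in (0.1, 0.4, 1.0):
        print(f"sigma = {sigma}, MDE = {mde}: N of about {total_sample_size(mde, sigma):,.0f}")
# For example, sigma = 1 and MDE = 0.4 gives roughly 200 in total, in line with the table.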
Experiments may seem like a silver bullet, but they face important challenges. Some of those
challenges are technical (e.g. subjects knowing they are part of an experiment), ethical (e.g. when
the control group is excluded from a potentially beneficial treatment) or legal (e.g. price
discrimination). Sometimes it’s logistically impossible to run an experiment (e.g. when there is
no way to exclude the control group from the treatment). In those circumstances we must rely on
quasi-experiments, which are situations that, to some extent, resemble an experiment. The most
common quasi-experimental approaches are Regression Discontinuity Design and Difference-in-Differences.
5.3.3 Regression Discontinuity Design
In some situations, a treatment is assigned based on whether a variable (known as the
running variable or the assignment variable) passes a threshold (a cutoff point). If the treatment
has a causal effect, then we expect a discontinuity (a “jump”) in the outcome of interest at the
cutoff. As an example, think of the problem of determining whether attending a selective
enrollment school makes a difference in earnings in adulthood. Suppose all applicants must take
an admissions test. We denote the score on that test by the variable x. Only applicants with a
score equal to or above C are admitted. In other words, applicant i is admitted if and only if x_i ≥ C.
All applicants who don’t make the cutoff are rejected and attend a non-selective school (which
is believed to be worse). A few years later, when those subjects are thirty years old, we observe
their annual earnings, which we denote by y. If attending the selective school has an effect on
earnings down the road, we expect greater average earnings among its graduates in comparison
to the graduates of the non-selective option. However, a simple comparison of averages could be
misleading. After all, the fact that students are screened for admission into the selective option
creates a selection bias. In other words, even if the causal effect we are interested in is zero, we
may find a difference between the earnings of graduates of the two schools.
To avoid the selection bias, we can look at applicants with test scores in the vicinity of the cutoff
C. We could argue that, if we look close to the cutoff, whether an applicant was admitted or not
is a matter of luck. In fact, when we zoom in, it looks like a small experiment where, solely by
chance, some students got a few more points than others in the admissions test, but on average
they are similar in any other respect. Thus, any difference in earnings between applicants right
below the cutoff and applicants right above the cutoff must be the result of attending different schools.
The following graph illustrates this point. The cloud of points represents earnings at age thirty
and scores in the admissions test. The points to the left of the cutoff C correspond to applicants
who attended the non-selective school, whereas the points to the right correspond to applicants
who attended the selective school. Since earnings in adulthood and academic performance are
usually related, we expect the cloud to show a positive trend. But that trend should be smooth—
without jumps. If there is a jump at the cutoff, then we can attribute the difference in earnings to
the difference in schools. We can fit a model with a jump at C. Let d be a dummy such that d_i = 0
if x_i < C, and d_i = 1 if x_i ≥ C. In words, d represents the treatment (defined as attending the
selective school instead of the non-selective school). Our regression model is y_i = α + βx_i + γd_i + ε_i.
The coefficient γ̂ is our estimate of the causal effect at the discontinuity created by the cutoff C.
This setup is called Regression Discontinuity Design or RDD.
RDDs may come in different forms. For instance, we may want to use a polynomial or even
different polynomials at each side of the cutoff. If we truly have a discontinuity, we expect our
regression to catch it. To show this, let’s look at another example. Suppose a company is deploying
a training program for its workforce. They can only afford to train 600 of their 1500 workers. To
decide who is trained and who isn’t, the company gives priority to the youngest workers. All 1500
workers are sorted according to their age in months. The youngest 600 are sent to training. One
year later, the company measures the productivity of all 1500 workers. They want to determine
whether the training program made a difference or not in terms of worker performance. In this
case, our running variable x is age in months and our metric of interest y is worker performance
a year after the training program took place. The treatment was given only to the 600 youngest
workers. Let’s suppose that the oldest treated worker was A months old at the moment of the
selection. Then, we define d as a dummy such that d_i = 1 if x_i ≤ A, and d_i = 0 if x_i > A. The graph
below shows the cloud of points in terms of performance and age of the workers.
Since it doesn’t look linear, we introduce in our regression quadratic polynomials at each side
of the cutoff. Our model would be:
y_i = β_0 + β_1 x_i + β_2 x_i² + d_i (γ_0 + γ_1 x_i + γ_2 x_i²) + ε_i
    = β_0 + β_1 x_i + β_2 x_i² + γ_0 d_i + γ_1 d_i x_i + γ_2 d_i x_i² + ε_i
    = (β_0 + γ_0 d_i) + (β_1 + γ_1 d_i) x_i + (β_2 + γ_2 d_i) x_i² + ε_i
The estimate of the causal effect is the jump at the discontinuity A, which in this model equals
γ̂_0 + γ̂_1 A + γ̂_2 A² (it reduces to γ̂_0 if we re-center the running variable at the cutoff), and
it’s positive in the graph above. Notice that the slope of our fitted model could be different
around the discontinuity. At the cutoff, the slope approaching A from the right would be
β̂_1 + 2β̂_2 A, whereas approaching it from the left it would be β̂_1 + γ̂_1 + 2(β̂_2 + γ̂_2)A. If γ̂_1 ≠ 0 or γ̂_2 ≠ 0
then the slopes at the discontinuity would differ.
By now it should be clear that we can use polynomials in our RDD. We can also add covariates
and interactions. However, we must always verify compliance with the cutoff. We must make sure
there are no signs of manipulation or cheating of the running variable. The interpretation of the
magnitude of our estimate of the causal effect is valid close to the discontinuity, not far. Please
explain in your own words why.
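Here is a sketch of the worker-training RDD in Python. The data are simulated and every number in it (the cutoff of 420 months, the curvature, the true jump of 5 points) is invented just to show the mechanics. The running variable is re-centered at the cutoff, so the coefficient on the dummy is the estimated jump, and the interaction terms let the quadratic differ on each side.

# Sketch of the training-program RDD with quadratic trends on each side of the cutoff.
# All numbers (cutoff age, curvature, true jump of 5 points) are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1_500
age = rng.uniform(240, 660, n)                 # running variable: age in months
cutoff = 420                                   # assumed age of the oldest treated worker
d = (age <= cutoff).astype(int)                # treated: the youngest workers

# Smooth quadratic relation between age and performance, plus a jump of 5 for the treated.
performance = 60 + 0.08 * age - 0.0001 * age**2 + 5 * d + rng.normal(0, 3, n)

df = pd.DataFrame({"perf": performance, "xc": age - cutoff, "d": d})   # center x at the cutoff
fit = smf.ols("perf ~ xc + I(xc**2) + d + d:xc + d:I(xc**2)", data=df).fit()
print(fit.params["d"])   # estimated jump at the cutoff, close to the true 5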
When using an RDD, it’s inevitable to run into questions related to whether we should limit
our analysis to a small vicinity around the cutoff and whether we should use linear, quadratic,
cubic, or higher order polynomials to control for the smooth trend. There is no rule of thumb to
determine what should be done. Instead of choosing one particular model, it’s convenient to think
about trying different combinations of vicinities and trend controls as robustness checks.
Always remember that, when we use an RDD, our estimates are informative of causal effects
close to the cutoff, and their validity hinges on compliance with the cutoff.
5.3.4 Difference-in-Differences
In many instances, we observe the outcome of interest across periods for a control group and
a treatment group. If we assume that in absence of the treatment the trend would be the same across
the two groups, we can estimate the causal effect of the treatment. This assumption can be
referred to as parallelism, because it means we would observe parallel trends in the control and
treatment groups in the outcome of interest if there wasn’t a treatment. In its simplest form, what
we observe can be described by the following table. The value inside each cell represents the
observed average of the outcome we care about. In other words, the table depicts facts.
Observed averages     Before treatment     After treatment     Difference
Not treated           A                    C                   C − A
Treated               B                    D                   D − B
To determine the effect of the treatment, we must create a counterfactual, which in this case
is the average we would observe in absence of the treatment in the treated group. When we apply
a Difference in Differences or Diff-in-Diff approach, we assume a parallel trend in time across
groups. If the average grew from A to C among the non-treated and we assume a parallel trend
among the treated, then without the treatment the average among the treated would have grown
by C − A. Since the starting point is B, the ending point would be B + (C − A). That’s the
counterfactual we are looking for. The estimate of the causal effect of the treatment is the
difference between the observed average, D, and the counterfactual, B + (C − A).
Treatment effect = D − [B + (C − A)]
If we rearrange the expression, we get a more intuitive expression—a double difference:
Treatment effect = (D − B) − (C − A)
The first term at the right-hand side is the observed difference across periods among the treated
units. The second term is the observed difference across periods among the untreated units. Our
estimate of the treatment effect is the difference between the two differences—hence the name of the
method. Notice that the idea is general and doesn’t depend on the order of the differences.
Treatment effect = (D − B) − (C − A) = (D − C) − (B − A) = D − [C + (B − A)]
Implicitly, we are assuming the counterfactual is C + (B − A). In words, we could also say that
the counterfactual is built as the observed average after the treatment among the non-treated,
plus the difference across groups before the treatment took place. That’s just another way to
interpret the same assumption of parallel trends.
Let’s look at a graphic version of Diff-in-Diff. The horizontal axis represents the two periods.
The two solid dots denote observed averages. If we assume parallel trends absent the treatment,
then the change we would expect among the treated in absence of the treatment is C − A. To
preserve parallelism, we add that amount to the starting point for the treated, which is B. The
difference between D and the counterfactual average B + (C − A) is our estimate of the treatment
effect.
The Diff-in-Diff approach is very intuitive. How do we implement it in the regression context?
We start with two dummies. The first dummy, d^T, denotes whether an observation belongs to the
treated group (d^T = 1) or not (d^T = 0). The second dummy, d^A, indicates whether an observation
corresponds to the period before (d^A = 0) or after (d^A = 1) the treatment. Their interaction, d^T d^A,
indicates the situation where the treated group has been treated. The outcome of interest can be
expressed as:
y_i = α + β d_i^T + γ d_i^A + δ d_i^T d_i^A + ε_i
By substituting the four possible combinations of d_i^T and d_i^A we can clearly see the
correspondence between the coefficients in the regression and the observed averages.
Facts: observed averages          Not treated (d_i^T = 0)     Treated (d_i^T = 1)
Before treatment (d_i^A = 0)      A = α                       B = α + β
After treatment (d_i^A = 1)       C = α + γ                   D = α + β + γ + δ
Let’s make sense of the above equalities. The coefficient α is catching the pre-treatment level
among the non-treated. The coefficient β is catching the difference between treated and non-treated
in the absence of the treatment—bear in mind that pre-treatment averages don’t need to be
the same. The coefficient γ is catching the trend in absence of treatment among the non-treated.
Lastly, the coefficient δ is catching the trend among the treated that is above (or below) the trend
among the untreated, and that’s our estimate of the treatment effect. In other words, our Diff-in-Diff
estimate of the treatment effect is the coefficient on the interaction between the dummies:
Treatment effect = (D − B) − (C − A)
= ([α + β + γ + δ] − [α + β]) − ([α + γ] − [α])
= (γ + δ) − (γ)
= δ
If we estimate the regression y_i = α + β d_i^T + γ d_i^A + δ d_i^T d_i^A + ε_i, the equalities between
regression coefficients and averages displayed in the table above would necessarily hold. However,
if we include other regressors as controls, the equalities will not hold in general because our
regression coefficients would reflect averages adjusted for other factors. The interpretation is
similar, but the values wouldn’t be identical.
It’s important to note that we run regressions and not just create a table with observed
averages because of two reasons. First, a regression allows us to control for other variables—we
already discussed the benefits of this. Second, a regression allows us to make statements about
significance and test hypotheses. Anyone can compute a table like the one above. But it takes a good
understanding of econometrics to interpret it in a way that is helpful to make decisions.
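A minimal sketch of the regression version, with simulated observations (the dollar figures and the true effect of 30 are invented), looks like this. The coefficient on the interaction term is the Diff-in-Diff estimate of δ.

# Sketch of the two-dummy Diff-in-Diff regression on simulated data. The true
# treatment effect is set to 30 so we can check the interaction coefficient.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400
treated = rng.integers(0, 2, n)        # d_T: 1 = treated group
after = rng.integers(0, 2, n)          # d_A: 1 = post-treatment period

# Parallel trends by construction: both groups grow by 20, the treated start 50 higher,
# and the treatment adds another 30 only for treated units in the post period.
sales = 500 + 50 * treated + 20 * after + 30 * treated * after + rng.normal(0, 10, n)

df = pd.DataFrame({"sales": sales, "treated": treated, "after": after})
fit = smf.ols("sales ~ treated + after + treated:after", data=df).fit()
print(fit.params["treated:after"])     # the Diff-in-Diff estimate, close to 30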
Let’s consider a practical example. A company that owns a chain of retail stores has upgraded
a few of them that are in the urban area of Chicago. The rest, which are located in suburban areas,
weren’t upgraded. With data from two years and the two types of locations you can estimate a
causal effect. Assume the treatment took place at the end of 2018. Thus, you can consider 2018 the
pre-treatment period and 2019 the post-treatment period.
Year        Chicago stores
            Suburban     Urban
2018        A            B
2019        C            D
We could apply the regression model above and estimate δ. It should be obvious that our
assumption of parallel behavior may be off. Do we really expect the same trends in urban and
suburban locations? It’s a valid criticism but at this point there isn’t much you can do to address
it.
Now, imagine the company also has stores in the metropolitan area of Milwaukee. The good
news is, Milwaukee hasn’t been reached by this upgrade program. Could we use that additional
information? Yes. Milwaukee may provide information on whether sales trends in urban stores
are parallel to sales trends in suburban stores.
Year        Milwaukee stores             Chicago stores
            Suburban     Urban           Suburban     Urban
2018        E            F               A            B
2019        G            H               C            D
In Milwaukee, the trend in average urban sales is equal to H − F, whereas the trend in average
suburban sales is G − E. None of those stores has been upgraded. Thus, we can measure the
difference in trends in absence of treatment (something we cannot do for Chicago stores). The
difference in trends across urban and suburban stores absent the treatment is (H − F) − (G − E).
This is a crucial piece of information. If this difference in trends is nonzero, then our Diff-in-Diff
estimates for Chicago based on the formula (D − B) − (C − A) may be off. A natural idea is to
subtract the trend in Milwaukee from the Diff-in-Diff estimate for Chicago:
Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]
Intuitively, this is a triple difference. It is the difference between two Diff-in-Diff estimates. One
of them has the treatment, and the other doesn’t—that’s our decoy. The decoy allows us to account
for the trend. The assumption of parallelism was relaxed a little. It’s still there but in a subtler way.
Notice that, if the trends are truly parallel between urban and suburban stores in Milwaukee, then
(H − F) − (G − E) = 0 and the triple-difference estimate would be identical to the Diff-in-Diff estimate.
How do we implement the triple difference? We create a new dummy, denoted by d_i^C,
indicating Chicago stores (d_i^C = 1) or Milwaukee stores (d_i^C = 0). The role of the treated-group
dummy is now played by d_i^U, which indicates urban stores (d_i^U = 1) versus suburban stores
(d_i^U = 0). We interact the Chicago dummy with our previous model and add new coefficients
(knowing the Greek alphabet comes in handy):
y_i = η + θ d_i^U + λ d_i^A + μ d_i^U d_i^A + d_i^C × (ρ + τ d_i^U + φ d_i^A + ψ d_i^U d_i^A) + ε_i
If we rearrange the expression, we get:
y_i = η + θ d_i^U + λ d_i^A + μ d_i^U d_i^A + ρ d_i^C + τ d_i^U d_i^C + φ d_i^A d_i^C + ψ d_i^U d_i^A d_i^C + ε_i
If we go back to the table format and substitute all dummies by their respective values (0 or 1),
we obtain the expression for the average in each cell.
Year                  Milwaukee stores (d_i^C = 0)          Chicago stores (d_i^C = 1)
                      Suburban        Urban                 Suburban          Urban
                      (d_i^U = 0)     (d_i^U = 1)           (d_i^U = 0)       (d_i^U = 1)
2018 (d_i^A = 0)      η               η + θ                 η + ρ             η + θ + ρ + τ
2019 (d_i^A = 1)      η + λ           η + θ + λ + μ         η + λ + ρ + φ     η + θ + λ + μ + ρ + τ + φ + ψ
In this case, our triple-difference estimate of the treatment effect is:
Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]
= [(λ + μ + φ + ψ) − (λ + φ)] − [(λ + μ) − λ]
= [μ + ψ] − [μ]
= ψ
In words, the triple-difference estimate of the causal effect is the coefficient on the interaction
of the three dummies, i.e. d_i^U d_i^A d_i^C.
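In code, the triple difference is just one more layer of dummies and interactions. This sketch uses simulated data with invented magnitudes; the coefficient on the three-way interaction recovers ψ even though the urban-versus-suburban trend gap is not zero.

# Sketch of the triple-difference regression: urban, after, and Chicago dummies, all
# pairwise interactions, and the three-way interaction whose coefficient plays the role of psi.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2_000
urban = rng.integers(0, 2, n)      # d_U
after = rng.integers(0, 2, n)      # d_A
chicago = rng.integers(0, 2, n)    # d_C

# Invented components: group and period effects, a non-parallel urban trend (8) in both cities,
# and a true upgrade effect of 25 only for Chicago urban stores after the treatment.
sales = (500 + 40 * urban + 15 * after + 30 * chicago
         + 8 * urban * after + 25 * urban * after * chicago
         + rng.normal(0, 10, n))

df = pd.DataFrame({"sales": sales, "urban": urban, "after": after, "chicago": chicago})
fit = smf.ols("sales ~ urban * after * chicago", data=df).fit()
print(fit.params["urban:after:chicago"])   # the triple-difference estimate, close to 25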
Now, someone may question the validity of using Milwaukee as a comparison group for
Chicago. Some may argue that urban-vs-suburban trends in one city may not be parallel across
cities. Can we do anything else? It all depends on the availability of data. Imagine that we have
data from 2016 and 2017 for both cities.
Year        Milwaukee stores             Chicago stores
            Suburban     Urban           Suburban     Urban
2016        I            J               M            N
2017        K            L               O            P
The analogue of our triple-difference estimate for this decoy period, 2016-2017, is:
[(P − N) − (O − M)] − [(L − J) − (K − I)]
What does it mean? It’s the gap in trends in the urban-suburban differences between Chicago and
Milwaukee before the treatment took place. We can estimate the causal effect using a quadruple
difference, that is, the difference between triple differences:
Treatment effect = {[(D − B) − (C − A)] − [(H − F) − (G − E)]}
− {[(P − N) − (O − M)] − [(L − J) − (K − I)]}
Obviously, we can write the regression model equivalent to the expression above by simply
doubling the terms in our previous model. The logic is exactly the same as before. In cases like
this, writing the equation is more complicated than running the actual regression in Stata. As an
exercise, write the equation for the quadruple difference.
Every time we take an additional difference, we are purging out an additional trend and
making our estimates more believable. Think of having more and more sophisticated decoys with
every additional difference.
It’s important to say that the Diff-in-Diff approach doesn’t require us to have information
across periods. That is its most natural context, but we can think of the same type of estimation
using cross-sectional data—i.e. data for one period alone. Imagine we have the same problem as
before, but you only observe data for 2019. What could you do? Would you still be able to
compute differences? The answer is affirmative. You’d go back to the simpler double difference
model using Milwaukee and Chicago:
Location        Type of stores
                Suburban     Urban
Milwaukee       A            B
Chicago         C            D
As before, we can think of the difference between D and B + (C − A) as our estimate of the
treatment effect. How believable our estimates are hinges on the assumption of parallelism.
5.3.5 Other techniques
There are other quasi-experimental approaches that are less frequently used in practice, and
there is a good reason. Stakeholders (those who use the estimates) prefer to make decisions based
on what is more convincing and intuitive. Hence RDD and Diff-in-Diff are the more commonly
used quasi-experimental techniques. But there are other methods—mostly used in academia.
Here are three. If you want to learn more about them, please see the World Bank book Impact
Evaluation in Practice.
First, we have Instrumental Variables or Two-Stage Least Squares. This is a very specific
type of quasi-experiment in which we find a variable (the instrument) that induces a treatment in
an exogenous way and isn’t directly related to the outcome of interest. We exploit that exogenous
variation to measure the causal effect. In practice, it’s difficult to find an instrument that is both
exogenous and unrelated to the outcome of interest. And if you find one, it’s just as difficult to
convince people of its validity (i.e. that satisfies the two assumptions).
Second, we have matching. Basically, we find matches for the treated subjects among a group
of non-treated subjects. This is very intuitive but also very unconvincing. Think about the
following question. Why would there be apparently similar people some of whom took the
treatment while others didn’t? Unless we have an experiment, there are good reasons to be skeptical
about this method.
Third, we have propensity score matching. In a way, it’s similar to matching, but the
matching is done by groups in terms of their probability of being treated. This method is neither
intuitive nor convincing. That’s why it’s rarely used to make decisions in practice. Its use is mostly
confined to academic studies.
6 Additional topics
Even if you don’t run regressions for a living, you’re probably going to encounter them.
Perhaps someone will try to persuade you by showing you some regression results. Here are a
few concepts to keep in mind.
6.1 Robustness
We say a result is robust if it doesn’t change much when we consider reasonable variations
in our analysis. Those variations are called robustness checks, and here are some examples. First,
we can try different samples with similar information (e.g. other periods or regions, some
subgroups). We can include or exclude different sets of controls. We can try different functional
forms (e.g. linear vs quadratic, cubic, logarithmic, interactions).
There is no formal test for robustness. It’s a qualitative result based on common sense. Consider
the following examples of results that wouldn’t be robust. We use a different sample in the gym
attendance prediction, and the predicted values change when we use a previous cohort. We use
a different functional form in an RDD, and the discontinuity estimate changes when we fit a
quadratic polynomial in the running variable instead of a linear polynomial. In a Diff-in-Diff
model, the estimates of the causal effect change when we add a triple or a quadruple difference.
To show that a result is robust, practitioners should display (or at the very least mention)
results with different samples, sets of controls and functional forms. When they don’t, be
suspicious.
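One practical way to keep yourself honest is to run the same question through several reasonable specifications and display the estimates side by side. The sketch below is purely illustrative: the data and column names are invented, and the point is only the habit of looping over specifications (different controls, different samples) and reporting the coefficient of interest for each one.

# Sketch: a small robustness table, re-running a regression of interest under different
# sets of controls and on a different sample. Data and column names are invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 1_000
df = pd.DataFrame({
    "x": rng.normal(10, 2, n),
    "age": rng.integers(20, 60, n),
    "female": rng.integers(0, 2, n),
})
df["y"] = 5 + 2 * df["x"] + 0.1 * df["age"] + rng.normal(0, 3, n)

specs = {
    "baseline":        ("y ~ x", df),
    "with controls":   ("y ~ x + age + female", df),
    "ages 20-39 only": ("y ~ x + age + female", df[df["age"] < 40]),
}
for name, (formula, data) in specs.items():
    fit = smf.ols(formula, data=data).fit()
    lo, hi = fit.conf_int().loc["x"]
    print(f"{name:15s}  coefficient on x = {fit.params['x']:6.3f}   95% CI [{lo:.3f}, {hi:.3f}]")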
6.2 Forensic analysis
Professional life involves reviewing the empirical analyses of other people. It could be
colleagues, intellectual adversaries, or academic researchers. To understand those results and
determine how much we believe them, we should proceed by deconstructing the cake. The first step
is to determine what the analysis is trying to do. We gain a lot of mileage from having the goal
clear. Is it trying to produce a description, a prediction or a prescription? Remember that the way
we judge the quality of a description is different from the way we judge the quality of a prediction
or a prescription.
The second step is to try to prove the analysis wrong. Given its goal, is the empirical strategy
credible? For instance, if we are looking at an RDD, ask yourself if there was good compliance
with the cutoff. If we are looking at a Diff-in-Diff analysis, ask yourself if the parallelism
assumption is reasonable. Do the results seem robust? Sometimes people cherry-pick models to
get small p-values that favor their hypothesis (something known as p-hacking). Make sure you
look at, or at least ask about, other model specifications (polynomials of higher order, interactions,
etc.). You should also ask whether confidence and significance are properly calculated, reported,
and interpreted. Remember that people have a difficult time interpreting confidence intervals.
The third step is to get under the hood and look at the regression equation. There is nothing
as frustrating as discussing the results of a regression without looking at the actual model used.
We must also know the exact method. There are multiple ways to fit a cloud with a linear
structure. Although they all resemble what we saw, they are not identical. There are methods like
Probit, Logit, or Maximum Likelihood. If you run into them and you have a chance, ask the
authors of the analysis how they think the results would differ if they used the standard method
of minimizing the square of the vertical distance, which is known as Ordinary Least Squares.
Ask how the data were treated or manipulated. For instance, how are categorical variables and
missing values treated?
You aren’t mathematicians, computer scientists or statisticians. You are economists—use
economic thinking to judge what makes sense.
6.3 Artificial intelligence, machine learning and big data
In recent times, concepts like big data, artificial intelligence, and machine learning have
captured the imagination of many journalists and laypeople. Shouldn’t we analyze those concepts
instead of the method introduced by Galton over a century ago? The short answer is no. Those
concepts are building blocks that help—but don’t replace—the concepts you’ve learned in this
course. Think of the classic example of artificial intelligence in which a computer must distinguish
images of Chihuahua dogs and blueberry muffins. Computers are better than humans at telling
differences, or more generally, figuring out patterns (once they’ve been trained). But in the end,
they are nothing but fancy models of numerical prediction.
Additionally, there is the issue of what question we want to answer and whether we are
interpreting its probabilistic aspects correctly. Put simply, how would we use the ability to
tell Chihuahuas from blueberry muffins? I’m not trivializing the technological advance. But the
same can be said about previous advances—personal computers or the Internet. A lot of data and
huge computing power doesn’t necessarily mean better empirical analysis. If the analyst doesn’t
know what he or she is doing, it may be a bad thing. The methods and the data are the vehicle.
The most important aspect is knowing where we are going so that we can drive that vehicle to
our destination. It doesn’t really matter how nice the vehicle is, if we don’t know where we are
going, we might as well not go anywhere.
Machine learning is another concept that has caught on. It simply means that we substitute
what an analyst would do with an algorithm. The nice property is that the results produced feed
the algorithm. Imagine we are trying to maximize the opening rate of email newsletters. We can
pick the time of day for each person. How can we do that? We can design a sequence of RCTs
and measure what works best. An analyst would run each RCT, look at the results, and decide
whether to try a different time in the next RCT, and whether it should be earlier or later. By laying
out these steps in an algorithm, we would be collecting data and accumulating actionable
information. That process would be autonomous. It could even be a permanent process,
continually trying new times—in case the schedule of preferences of subscribers changed. A lot
of common sense goes into the use of artificial intelligence and machine learning. They are
complements, not substitutes of your skills.
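A toy version of that autonomous loop is an epsilon-greedy rule: most of the time send at the time slot that has worked best so far, and occasionally try a random slot to keep learning. The sketch below is purely illustrative (the opening probabilities per slot are invented), and a real system would add proper statistical tests and guardrails on top.

# Toy sketch of an autonomous sequence of experiments for choosing a send time.
# Epsilon-greedy rule: usually exploit the best slot so far, sometimes explore at random.
# The true opening probabilities per slot are invented for illustration.
import numpy as np

rng = np.random.default_rng(6)
true_open_prob = {"8am": 0.18, "noon": 0.22, "6pm": 0.25, "9pm": 0.20}
slots = list(true_open_prob)
opens = {s: 0 for s in slots}
sends = {s: 0 for s in slots}
epsilon = 0.10                    # share of traffic reserved for experimentation

for email in range(50_000):
    if email < len(slots) or rng.random() < epsilon:
        slot = slots[rng.integers(len(slots))]                          # explore
    else:
        slot = max(slots, key=lambda s: opens[s] / max(sends[s], 1))    # exploit best so far
    sends[slot] += 1
    opens[slot] += rng.binomial(1, true_open_prob[slot])

print({s: round(opens[s] / max(sends[s], 1), 3) for s in slots})   # estimated opening rates
print("most sends went to:", max(sends, key=sends.get))            # should settle on the 6pm slot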
Another concept that has received attention is big data. We could say that it refers to the wealth
of information created in day-to-day transactions. You wake up in the morning. As you grab your
phone, there is a record of when you started looking at it. If you went to Instagram or the New
York Times app, you also leave records. What posts you saw and which ones you liked, what
music you were listening to, which emails you answered first, which ones you discarded without
opening. Then you take public transportation or a Divvy bike. That information is also stored.
When you scan your ID at your office, there is a record of when you arrived. The same for every
time you go in or out. At lunch, you go to a restaurant and use a loyalty card. You go home and order
from Amazon. You also ask Alexa to play some music. You ask Waze for directions to a friend’s
house or take an Uber. You go home and stream a movie, leaving a record of the shows you browsed.
And so on. That is without counting your performance measures at work, your grades at school,
your travel records, etc. In addition to that, you can take DNA tests.
DNA tests are particularly interesting because they pose the problem of false positives.
Imagine that we ask a large group of people which statement they agree with more: “I deeply
dislike Tom Brady” or “I am a fan of Tom Brady.” Let’s code it so that 1 means being a fan of the
New England Patriots quarterback. Assume we have binary information about 100 genes.
Remember that if we have 100 regressors, by sheer luck we can expect 5 of them to be significant
at 95% confidence. Similarly, we can expect one coefficient to be significant at 99% confidence.
We would call this the Brady-fan gene. However, that wouldn’t be solid evidence—to say the least.
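It’s easy to see this mechanically with a simulation. In the sketch below, the outcome and the 100 binary “genes” are all independent noise by construction, and still a handful of coefficients typically come out “significant” at 95% confidence.

# Sketch: false positives from many regressors. By construction the outcome is unrelated
# to all 100 binary "genes", yet some p-values fall below 0.05 by luck alone.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n, k = 2_000, 100
genes = rng.integers(0, 2, size=(n, k))          # 100 unrelated binary regressors
brady_fan = rng.integers(0, 2, n)                # outcome: fan (1) or not (0), pure noise

fit = sm.OLS(brady_fan, sm.add_constant(genes)).fit()
significant = (fit.pvalues[1:] < 0.05).sum()     # skip the constant
print(f"{significant} of {k} coefficients look 'significant' at 95% confidence by chance")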
Having many regressors brings the potential problem of false positives. If you dig enough,
you are always going to find a pattern that looks highly unlikely. Think of sports broadcasters
when they say “this is the first time in major league baseball that a lefty rookie pitcher has struck
out three right-handed batters in a row after walking three left-handed batters in a post-season
game.” Did we just witness a highly unlikely, one-of-a-kind event? Or is it the case that, if we dig
enough, we’d find that in its own way everything is a first? Adding regressors to a model is like
dicing more finely the categories and therefore mechanically increasing the chances of finding
something “special.”
In sum, when you hear the terms artificial intelligence, machine learning or big data, keep in
mind that those concepts are not substitutes of the methods you learned in this course. Rather,
they are complements. If you have a good grasp of the concepts taught in this course, you will be
in a better position to make the most out of artificial intelligence, machine learning and big data.