ECON 11020 FALL 2020 INTRODUCTION TO ECONOMETRICS PABLO A. PEÑA UNIVERSITY OF CHICAGO pablo@uchicago.edu ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 1 1 Introduction This manuscript constitutes the notes for the course. Why not just use an Econometrics textbook? Because I want to make my teaching more effective and your learning more efficient. In my experience, textbooks put too much emphasis on how the methods work instead of on how those methods can be used. To use an analogy, textbooks are like a book about microwave ovens explaining how electricity is transformed into microwaves, and how microwaves excite water molecules inside food. A person using a microwave oven doesn’t need to know any of that to make a good use of it. If the person knows what materials not to put inside the oven, and the approximate relationship between cooking times and power, she can make the most out of the oven safely and efficiently. The point here is that econometrics textbooks explain way more about the methods than a regular practitioner needs to know in order to put those methods to good use. These notes emphasize how regressions can be used. There is a second argument against using econometric textbooks in this course. Since they are written by academics, textbooks are biased towards the types of problems academics care the most about—namely, establishing causal relationships. The type of questions academics passionately pursue are only a subset of the type of questions practitioners are interested in. In my experience, practitioners try to establish causal relationships less than 10% of the time. So, these notes give a more general perspective of what can be done with regressions. A third argument for an alternative look at how econometrics is taught for practitioners is the growth in computer capacity. Back in the day—so I am told—running a regression was costly. A person had to punch holes on a card to input data into a computer and wait to have a chance to use a computer. People thought hard before running a regression. The fact that now we can draw samples of our data thousands of times in a matter of minutes—if not seconds—has made feasible the use of newer methods that are free from many unverifiable assumptions made in classic theory. Computing power not only made empirical analysis easier. It also changed what we think are more appropriate methods in practice. These notes will discuss some of those newer methods. A fourth argument is the format. Textbooks require the reader to know matrix algebra and probability theory. That depletes the attention of students. Keeping an eye on matrices or random variable probability distributions prevents them from focusing on what really matters. Here we keep those definitions to a minimum. Lastly, the title of these notes could be “Empirical Analysis for Business Economics.” Each part of that alternative title would illustrate an important aspect of what you will learn. “Empirical” means we will refer to data collected in the real world. “Analysis” refers to the statistical tools we will use. “Business” means we will consider practical questions of the kind real organizations face out there. Lastly, “Economics” means that throughout we will keep our perspective as economists—we are neither mathematicians nor statisticians. In sum, you will learn how to apply statistical tools to data in order to answer relevant questions from an economic perspective. 
ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 2 1.1 Primacy of the question A frequent answer to many questions in this course is “it depends on the question you want to answer.” As a general rule, the methods we use must fit the question at hand. The question should be carefully examined before jumping to figuring out how to answer it. In my experience, pounding on a question is essential. Paraphrasing it multiple times and thinking what it is and what it isn’t may be of great help. That is what I mean by the primacy if the question. Assume there is arm-wrestling tournament with 149 participants. When two contestants face each other, the winner advances and the loser is out. What is the total number of matches in the tournament? Perhaps you had the impulse—like me—to create in your mind a bracket structure adding rounds to reach a number close but below to 149. You can proceed that way and find the solution. But it’ll take time and it won’t be general. What if the number of participants is 471 or 7,733? Carefully examining the setup of the question may give you the path of least resistance to the answer. Here is a crucial piece of information. The tournament ends when all but one of the participants are out. In every match, one participant is out. If there are N participants, there must be a total N − 1 matches. This is a quick and general answer. It may take time to figure it out, but once you do it, you can apply it more generally, to any number of participants. In our context, we will always refer back to the question we have in mind. Sometimes the question is impossible to answer. Some other times the answer is obvious, and no analysis is needed. Most frequently, the question requires some polishing to be properly (or at least reasonably) answered. 1.2 The cake structure Our departure point is the origin of the term regression and the method it represents. Once we establish the general idea of what a regression is, we will proceed according to what we can call the three-layer cake structure. The first layer is the mathematics of regression, and it is about how we compute a regression. It has the most formulas. The second layer is the probability content of the regression results. In this layer we will talk about why we expect results to vary and what information they convey. The third layer is the economics of regressions. We will learn the uses and interpretations of the results. The economics of regressions (the third layer) can be split into three slices that correspond to three distinct uses: descriptive, predictive and prescriptive. In the descriptive use, regressions are used to measure relationships accounting for other factors. This is useful when trying to judge to what extent two things move together or not, or when comparing averages in equality of circumstances. The predictive use is about knowing what to expect. We will explain how predictions are different from forecasts. The third use is prescription, and we will decompose it into the most common methods used by practitioners: randomized control trials, regression discontinuity designs, and difference-in-differences. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 3 2 Regression A fascinating question in natural and social sciences is the extent to which parental traits are transmitted to children. In the late 19th century, Francis Galton worked on this topic. 
He conducted surveys and collected data to analyze the relationship across many variables. One of them was height. Using families with adult children, Galton computed the “mid-parent height” (the average height of mother and father) and plotted it against the height of children. A stylized version of the chart he produced is below. The horizontal axis represents parental height. The vertical axis represents children height. The first thing to note is that there is a cloud of points. The second is that there seems to be a positive relation. On average, taller parents have taller children, and shorter parents have shorter children. How can we summarize this relationship? Galton modelled the height of children as a linear function of parental height plus an error term. If ๐ฆ๐ and ๐ฅ๐ denote the heights of person i and her parents, respectively, then Galton assumed ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . Galton came up with a straight line that best fitted the cloud of points. If the straight line has a positive slope, it means the relation is positive (tall parents have tall children). If the slope is negative it means the relation is negative (tall parents have short children). Lastly, if the slope is zero, then tall and short parents have children of the similar stature. The graph below depicts the line that best fits Galton’s data (in blue), and also a line with a slope of one as a benchmark. In the next section, we will discuss at length how Galton came up with that line. Put very simply, we pick the intercept and the slope that minimize the square of the vertical distance between the points in the cloud and the line. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 4 Galton found a positive relationship between the height of parents and children but didn’t stop there. After all, it is evident that tall parents tend to have tall children. What interested Galton the most was whether the slope of the line he produced was greater or smaller than one. If the slope is greater than one it means differences in height across parents become even larger differences in height across their children. Alternatively, if the slope is smaller than one, differences in height across parents become smaller differences in the next generation. Galton found that the slope is smaller than one. Therefore, having a tall ancestor doesn’t matter much for the descendants. After a few generations, we expect the height of any family to get closer to the mean. This process was called “regression to the mediocrity,” and now we call it “regression to the mean.” The term regression was originally the description of this particular result. It later became the name of the method used to find that result. Today, regression is the workhorse of empirical economists. There is a wide variety of regression models, but they all share the same essence. They all produce numbers that summarize the relationships among variables in the data. To analyze how regressions are used by practitioners, we will proceed according to our threelayer cake structure. The first layer is given by the mathematical aspects of regressions. Put bluntly, the mathematical layer has no probabilistic or economic contents. This is very much all algebra. But first, a short introduction to how regressions looks in practice. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 5 3 The mathematics of regression 3.1 The basics 3.1.1 A cloud of points Our starting point is data. 
Usually, we think of data in tables or spreadsheets. We can generally think of any data set as being organized in “variables” and “observations.” For instance, if we have the expenditures in a given month of a group of customers at an online store, the unit of observation may be the customer, and each customer would constitute one observation. In addition to the expenditures, the information we have for each customer may include number of purchases, number of returns, shipping costs, as well as age of the customer, gender, and zip code. Those features of behavior and customer traits would be our variables. Each variable could take different values, which could be continuous (like amounts in dollars or age in days) or categorical (like gender or age in five-year ranges). In the case of Galton, each adult child constitutes an observation, and his or her height together with the height of his or her parents constitute our two variables. For instance, if we have: Person Height of the person Height of the parents Robert 1.82 1.75 Anne 1.73 1.79 Cristopher 1.78 1.74 Laura 1.69 1.70 Charles 1.80 1.80 We can think more generally in terms of variables x and y, and the subscript i: ๐ ๐=1 ๐ฆ ๐ฅ ๐ฆ1 = 1.82 ๐ฅ1 = 1.75 ๐=2 ๐ฆ2 = 1.73 ๐ฅ2 = 1.79 ๐=3 ๐ฆ3 = 1.78 ๐ฅ3 = 1.74 ๐=4 ๐ฆ4 = 1.69 ๐ฅ4 = 1.70 ๐=5 ๐ฆ5 = 1.80 ๐ฅ5 = 1.80 The variables in our data are not a chaotic mass of information. We usually have something in mind that provides them with structure. We usually think that one variable is a function of the other variables. For instance, Galton thought of children height as a function of parental height, or ๐ฆ๐ = ๐(๐ฅ๐ ). In math, we usually plot in the vertical axis the value of the function and we plot in the horizontal axis the argument of the function. Thus, in the chart of height we saw before, parental height is in the horizontal axis and children height is on the vertical axis. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 6 For the purpose of understanding how regressions work, we will think of our data visually, as a cloud of points, like in Galton’s problem. The height of the points in the cloud represents the height of children whereas the location of the points in the horizontal axis represent parental height. The height of children is a function of the height of parents. 3.1.2 A model In general, in a regression model like: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . We refer to ๐ฆ๐ as the dependent variable or left-hand side variable. We refer to ๐ฅ๐ as the regressor, the explanatory variable, the independent variable, or the right-hand side variable. Lastly, we usually refer to ๐๐ (the Greek letter epsilon) as the error term or idiosyncratic shock. Notice the subscript ๐, which denotes the observation. In contrast, ๐ฅ and ๐ฆ denote variables, whereas ๐ฅ๐ and ๐ฆ๐ denote the values that those two variables take in the case of observation ๐. The error term catches anything else unaccounted for in the model. Imagine in the case of Galton’s analysis of height, it could include some phenotypic differences across individuals or malnourishment of some children or parents. In the model above, ๐ฆ๐ is expressed as a linear function of ๐ฅ๐ and an error term. Thus, ๐ผ (the Greek letter alpha) is the intercept and ๐ฝ (the Greek letter beta) is the slope. The interpretation of the slope is the expected change in ๐ฆ is associated with a change in ๐ฅ. 
Mathematically, the slope is the partial derivative of $y$ with respect to $x$:

$$\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}(\alpha + \beta x) = \beta$$

At the same time, we can interpret the model in terms of what we would expect. Suppose we are told the explanatory variable takes a value of $x'$. What is the corresponding value of the dependent variable that we should expect to observe? The answer is:

$$y' = \alpha + \beta x'$$

These two ways of interpreting the model (the partial derivative and the conditional expectation) are very useful and we will revisit them multiple times.

3.1.3 The minimization problem

In a nutshell, a regression simply fits a cloud of points with a straight line. To do that, it minimizes the square of the vertical distance between each point in the cloud and the line. Notice that it doesn't minimize the vertical distance itself (we use its square), nor the squared straight-line distance to the line (the vertical part is crucial). Mathematically, to find the straight line that minimizes the squared vertical distance to the points in the cloud, we set up the following problem:

$$\min_{\{a,b\}} \sum_{i=1}^{n}(y_i - a - b x_i)^2$$

We take first-order conditions with respect to $a$ and $b$:

$$\sum_{i=1}^{n}(y_i - a - b x_i) = 0$$

$$\sum_{i=1}^{n}(y_i - a - b x_i)x_i = 0$$

The first-order conditions above produce a linear system with two equations and two unknowns. We can easily find the solution. Let us introduce some useful nomenclature. The solutions are the regression coefficients, and we denote them with a "hat": $\hat{\alpha}$ and $\hat{\beta}$. The fitted value of $y_i$ is:

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$$

Notice the hat is also on $\hat{y}_i$. The difference between the actual and fitted values of $y_i$ is the residual, which is also denoted with a hat:

$$y_i - \hat{y}_i = \hat{\varepsilon}_i$$

The residual $\hat{\varepsilon}_i$ is an estimate of the error term $\varepsilon_i$. It is convenient to establish the following identities:

$$y_i = \alpha + \beta x_i + \varepsilon_i = \hat{y}_i + \hat{\varepsilon}_i = \hat{\alpha} + \hat{\beta} x_i + \hat{\varepsilon}_i$$

Going back to our minimization problem, it is easy to show that the first-order conditions imply that $\sum_{i=1}^{n}\hat{\varepsilon}_i = 0$ and $\sum_{i=1}^{n}\hat{\varepsilon}_i x_i = 0$. In words, the residuals average zero, and the covariance between the residuals and the explanatory variable is zero. The first-order conditions can be rearranged to provide formulas for the regression coefficients:

$$\hat{\beta} = \frac{cov(x,y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Where $cov(x,y)$ represents the covariance between $x$ and $y$, $var(x)$ represents the variance of $x$, and $\bar{x}$ and $\bar{y}$ are the averages of $x$ and $y$, respectively. Notice the similarity between the regression coefficient $\hat{\beta}$ and the correlation coefficient between $x$ and $y$, which is usually denoted by $\rho$:

$$\hat{\beta} = \rho\sqrt{\frac{var(y)}{var(x)}}$$

In other words, $\hat{\beta}$ is a re-scaled correlation coefficient. The re-scaling factor is a positive number equal to the ratio of the standard deviation of $y$ to the standard deviation of $x$. Notice that the fitted value $\hat{y}_i$ can be interpreted as the expected value of $y$ conditional on $x$ taking a particular value, say $x = x_i$:

$$E[y \mid x = x_i] = \hat{\alpha} + \hat{\beta}\,E[x_i] + E[\hat{\varepsilon}] = \hat{\alpha} + \hat{\beta} x_i = \hat{y}_i$$

In general, the regression coefficients can be interpreted as partial correlation coefficients (as in "partial" derivatives), and the fitted values can be interpreted as conditional expectations.
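To make these formulas concrete, here is a minimal sketch in Python (numpy is an assumption of this illustration, not something the notes rely on) that computes $\hat{\alpha}$ and $\hat{\beta}$ for the five-person height table shown earlier and verifies the two properties of the residuals:

```python
import numpy as np

# Toy data: the five adult children and their parents from the table above (meters)
y = np.array([1.82, 1.73, 1.78, 1.69, 1.80])   # child height
x = np.array([1.75, 1.79, 1.74, 1.70, 1.80])   # parental height

# Closed-form solution of the minimization problem
beta_hat = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # cov(x, y) / var(x)
alpha_hat = y.mean() - beta_hat * x.mean()           # y-bar minus beta-hat times x-bar

# Fitted values and residuals
y_fit = alpha_hat + beta_hat * x
resid = y - y_fit

print(alpha_hat, beta_hat)
print(resid.sum())            # ~0: residuals average zero
print(np.sum(resid * x))      # ~0: residuals uncorrelated with the regressor
print(np.polyfit(x, y, 1))    # cross-check: returns [slope, intercept]
```

Any regression routine returns the same numbers; the point of the sketch is only that the coefficients are simple arithmetic on the data.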
In the case of Galton’s regression, ๐ฝฬ is interpreted as the difference in child height given a difference of one unit in parental height, and ๐ฆฬ๐ = ๐ผฬ + ๐ฝฬ ๐ฅ๐ is interpreted as the expected height of a child with parents of height ๐ฅ๐ . The following chart summarizes these concepts. The orange points represent our cloud. The green points are the fitted values. They lie on the regression line. The intercept of the line is the coefficient ๐ผฬ. The slope is the coefficient ๐ฝฬ. The residual is the difference between the actual values and the fitted values. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 9 3.1.4 Multivariate regression The world of univariate regressions (i.e. regressions with only one explanatory variable) is very simple. However, we rarely run regressions with only one regressor. Most of the time we use regressions with multiple regressors. They are known as multivariate regressions. Multivariate regression models are usually expressed using different letters as variables and different Greek letters as coefficients. For instance: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐พ๐ค๐ + ๐ฟ๐ง๐ + ๐๐ For simplicity, if we have ๐ explanatory variables, we denote them by ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ . Notice that we have ๐ + 1 regressors: ๐ฆ๐ = ๐ฝ0 ๐ฅ0๐ + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + โฏ + ๐ฝ๐ ๐ฅ๐๐ + ๐๐ Where ๐ฅ0๐ = 1 for every ๐. In other words, ๐ฅ0๐ is constant and its coefficient is the intercept. We can express the regression in matrices and vectors: ๐ฝ0 ๐ฆ1 ๐ฅ01 ๐ฅ11 ๐ฅ21 โฏ ๐ฅ๐1 ๐1 ๐ฆ2 ๐ฅ02 ๐ฅ12 ๐ฅ22 โฏ ๐ฅ๐2 ๐ฝ1 ๐2 [ โฎ ] = [ โฎ โฎ โฎ โฑ โฎ ] ๐ฝ2 + [ โฎ ] ๐ฆ๐ ๐ฅ0๐ ๐ฅ1๐ ๐ฅ2๐ โฏ๐ฅ๐๐ โฎ ๐๐ [๐ฝ๐ ] ๐ = ๐๐ฝ + ๐ In this case we have a cloud of ๐ points in a k-dimensional space. We want to find the plane or hyper-plane that minimizes the square of the vertical distance to those points. Using matrix notation, we can write the minimization problem as: min๐ฝ (๐ − ๐๐ฝ)′(๐ − ๐๐ฝ) Our ๐ + 1 first-order conditions can be expressed as: ๐ฝฬ = (๐ ′ ๐)−1 ๐′๐ This formula involves a series of simple mathematical operations with the data. Keep in mind that the vector ๐ฝฬ contains ๐ + 1 regression coefficients. Our results can be expressed as: ๐ฆฬ = ๐ฝฬ0 ๐ฅ0๐ + ๐ฝฬ1 ๐ฅ1๐ + ๐ฝฬ2 ๐ฅ2๐ + โฏ + ๐ฝฬ๐ ๐ฅ๐๐ ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 10 We usually have ๐ > ๐ by a lot. Think about what would happen if ๐ + 1 = ๐. It’s very helpful to start with the case of ๐ = 2. We would be trying to fit a cloud that consists of only two points with a straight line. The fit would be perfect, and the residuals would be zero. Now extend that idea to ๐ = 3. We would try to fit a cloud of three points with a plane. Again, the fit would be perfect. This is a general result. As long as ๐ + 1 = ๐, the fitted values would be equal to the actual values of ๐ฆ. For illustrative purposes, we will use univariate or bivariate regression examples because we can analyze them graphically. Their intuition extends to the case with more regressors. The graph below shows a plane fitting a cloud of points in three dimensions (two explanatory variables and one dependent variable). The plane cuts through the cloud, leaving some points above (in blue) and other points below (in red). 3.1.5 Goodness of fit Remember that we are trying to fit a cloud of points with a linear structure (a line, a plane or a hyperplane). We can always measure how well we do that using the R-square (๐ 2), a measure of goodness of fit. 
The formula is very simple: ๐ 2 = 1 − ∑๐ ฬ๐ )2 ๐=1(๐ฆ๐ − ๐ฆ ∑๐ ฬ )2 ๐=1(๐ฆ๐ − ๐ฆ If our regression model fits the cloud perfectly, then all residuals are equal to zero and the Rsquare would be equal to one. If, on the contrary, the model is not better than using a flat line or plane with a value of ๐ฆฬ , then our regression model would not explain any of the variation in the data, and the R-square would be equal to zero. As you can probably deduce, the R-square is ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 11 always between 0 and 1, with 0 representing the worst possible fit (none), and one representing the perfect fit. Why is the R-square called that way? One very intuitive way of measuring goodness of fit is to compute the correlation between ๐ฆฬ and ๐ฆ. If the model fits the data perfectly, the correlation should be 1. If the model has a very poor fit, the correlation would be close to zero (positive or negative). Let R stand for that correlation. How is the R-square related to R? Well, you probably guessed it by now. The R-square is simply the square of R, that is, the square of the correlation between ๐ฆฬ and ๐ฆ. The R-square is a mathematical concept. It is not informative of the probabilistic or economic aspects of our regression. High R-squares are not per se better than low R-squares. The relevance of different goodness of fit depends on the context. Later we will see some examples where the R-square is not even mentioned (when we try to estimate causal effects) and other examples in which the R-square is the most important aspect (when we try to predict). We will come back to discuss goodness of fit as we advance in the course. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 12 3.2 Intermediate concepts So far, we can say the mechanics of regression are simple. In fact, they are so simple that one could be tempted to deem regression analysis as “too simplistic.” However, that misses several points. Here we will go over some of them to give you a taste of the power of regression. 3.2.1 Dummies Dummy variables (also known as indicator or dichotomic variables) are a very useful type of regressors. A dummy takes a value of 1 if a condition holds true, and 0 if it doesn’t: ๐ฅ๐ = { 1 0 if condition holds otherwise Assume our regression model is ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . How do we interpret ๐ผฬ and ๐ฝฬ when ๐ฅ๐ is a dummy? The following chart provides some guidance. If ๐ฅ๐ is a dummy, then our cloud of points would consist of two columns of points. One would be located over the value of ๐ฅ๐ = 0 and the other would be located over the value ๐ฅ๐ = 1. No points would lie between ๐ฅ๐ = 0 and ๐ฅ๐ = 1. Our regression line would cross both columns. The graph below presents an example. The resulting intercept and slope can be interpreted in terms of conditional expectations: ๐ธ[๐ฆฬ|๐ฅ = 0] = ๐ผฬ ๐ธ[๐ฆฬ|๐ฅ = 1] = ๐ผฬ + ๐ฝฬ ๐ธ[๐ฆฬ|๐ฅ = 1] − ๐ธ[๐ฆฬ|๐ฅ = 0] = ๐ฝฬ This is a very useful feature. Let’s move on to the case with two independent dummies. You can imagine one dummy indicates gender (zero for male and one for female) and the other ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 13 indicates minority status (zero for non-minority and one for minority). There are four possible combinations of (๐ฅ1 , ๐ฅ2 ): (0,0), (1,0), (0,1) and (1,1). In this case, the cloud of points consists of four columns of points floating above or below those four coordinates. 
Our model is: ๐ฆ๐ = ๐ฝ0 + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + ๐๐ Since we have two regressors, we can still get a visual interpretation of the plane that fits the cloud of points. The following chart illustrates the cloud of points and the regression plane. The height of the plane at each of the four coordinates (0,0), (1,0), (0,1) and (1,1) can be expressed in terms of the beta hats. In this example, ๐ฝฬ1 < 0, ๐ฝฬ2 > 0, and ๐ฝฬ1 + ๐ฝฬ2 < 0. Let’s assume for the illustrative purposes that ๐ฆ is wage and we are looking a group of employees of a company. The regression coefficients tell us what the expected value of ๐ฆฬ: The expected value the wage for… Non-minority males Non-minority females Minority males Minority females …is: ๐ธ[๐ฆฬ|๐ฅ1 = 0, ๐ฅ2 = 0] = ๐ฝฬ0 ๐ธ[๐ฆฬ|๐ฅ1 = 1, ๐ฅ2 = 0] = ๐ฝฬ0 + ๐ฝฬ1 ๐ธ[๐ฆฬ|๐ฅ1 = 0, ๐ฅ2 = 1] = ๐ฝฬ0 + ๐ฝฬ2 ๐ธ[๐ฆฬ|๐ฅ1 = 1, ๐ฅ2 = 1] = ๐ฝฬ0 + ๐ฝฬ1 + ๐ฝฬ2 There are many more ways of using dummies. We will learn more about them later. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 14 3.2.2 Splitting and warping Sometimes our cloud of points doesn’t look linear. Can we still fit it with a linear structure? The answer is affirmative. Imagine that our cloud of points looks like ๐ฆ is polynomial in ๐ฅ. That is the case in the figure below. Let’s start with a polynomial of degree โ in ๐ฅ: ๐ฆ = ๐0 + ๐1 ๐ฅ + ๐2 ๐ฅ 2 + ๐3 ๐ฅ 3 + โฏ + ๐โ ๐ฅ โ If we define ๐ฅ0๐ = 1, ๐ฅ1๐ = ๐ฅ๐ , ๐ฅ2๐ = ๐ฅ๐2 , ๐ฅ3๐ = ๐ฅ๐3 , … , ๐ฅโ๐ = ๐ฅ๐โ , then we arrive at: ๐ฆ๐ = ๐0 + ๐1 ๐ฅ1๐ + ๐2 ๐ฅ2๐ + ๐3 ๐ฅ3๐ + โฏ + ๐โ ๐ฅโ๐ Which has a linear structure. All the regressors enter the model linearly (there aren’t any quadratic, cubic, or higher degree terms). Thus, although a linear structure sounds restrictive, it turns out that it isn’t. This is possible because we split and warp the regressors. In the case above, ๐ฅ is split into โ regressors, and each of them is warped differently. Although the original relationship may be non-linear, we can find a specification with a linear relationship between ๐ฆ and ๐ฅ, once we split it and warp it. Notice that, in general, when we split and warp the regressors, the derivatives are no longer constant. In the case above, we have: โ โ โ ๐๐ฅ๐ ๐๐ฆ ๐๐ฆ ๐๐ฅ๐ =∑ = ∑ ๐๐ = ∑ ๐๐ ๐๐ฅ ๐−1 ๐๐ฅ ๐๐ฅ ๐=1 ๐๐ฅ๐ ๐๐ฅ ๐=1 ๐=1 ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 15 Graphically, to understand what happens when we split and warp, we can focus on the case of a quadratic polynomial. Assume we have only one explanatory variable ๐ฅ, and the cloud of points looks like a parabola that opens upward. Let’s split ๐ฅ into ๐ฅ1 = ๐ฅ and ๐ฅ2 = ๐ฅ 2 . The graph below shows the parabolic relationship between ๐ฆ and ๐ฅ as blue points on the wall at the left. On the floor you can see the relationship between ๐ฅ1 and ๐ฅ2 (the latter is the square of the former). When we run a regression of ๐ฆ on ๐ฅ, we are choosing the right height and tilt of the blue plane to fit the cloud of red points. The red points you see in the graph lie on the blue plane. The takeaway is that, by splitting and warping our regressors, we can fit non-linear looking clouds with linear structures. 3.2.3 Logarithms Logarithm is a recurrent tool in economics because of its nice properties. Sometimes we use logarithmic transformations of our regressors. 
For instance, we may be interested in the regression model: ๐ฆ๐ = ๐ฝ0 + ๐ฝ1 ln(๐ฅ๐ ) + ๐๐ If we take the derivative of ๐ฆ with respect to ๐ฅ and multiply it by a change in ๐ฅ equal to ๐๐ฅ, we get: ๐๐ฆ ๐๐ฅ ๐๐ฅ = ๐ฝ1 ๐๐ฅ ๐ฅ Assume ๐๐ฅ ≈ 1% of ๐ฅ. In this case, ๐ฝ1 is interpreted as the change in ๐ฆ associated with a one percent increase in ๐ฅ. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 16 Sometimes we use the logarithm of the dependent variable: ln (๐ฆ๐ ) = ๐ฝ0 + ๐ฝ1 ๐ฅ๐ + ๐๐ The interpretation differs from the one in the previous example. To show it, let’s apply the antilogarithm of the above expression, in other words, compute ๐ ln (๐ฆ) : ๐ฆ = ๐ ๐ฝ0 +๐ฝ1 ๐ฅ๐+๐๐ In the above expression, the derivative of ๐ฆ with respect to ๐ฅ is: ๐๐ฆ = ๐ฝ1 ๐ ๐ฝ0 +๐ฝ1 ๐ฅ๐+๐๐ ๐๐ฅ If we divide by ๐ฆ we get: 1 ๐๐ฆ = ๐ฝ1 ๐ฆ ๐๐ฅ Thus, the coefficient ๐ฝ1 can be interpreted as the change as a fraction of ๐ฆ associated with a change in ๐ฅ. Notice that nothing prevents us from using this last model when ๐ฅ is a dummy variable. The interpretation would be the same: ๐ฝ1 is the change as a fraction of ๐ฆ associated with “turning on” the dummy variable ๐ฅ. 3.2.4 Turning continues variables into dummies Sometimes it’s more convenient to define a group of dummies to represent different intervals of a continuous variable. For instance, instead age or income, we may want to have age groups or income brackets. Assume ๐ง is an independent variable. Let: ๐ฅ0 = ๐(๐ง ∈ [0, ๐)) ๐ฅ1 = ๐(๐ง ∈ [๐, ๐)) โฎ ๐ฅ๐ = ๐(๐ง ∈ [โ, ๐)) We have a regression model with a constant and ๐ dummies, one for each interval of ๐ง: ๐ฆ๐ = ๐ฝ0 ๐ฅ0๐ + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + โฏ + ๐ฝ๐ ๐ฅ๐๐ + ๐๐ Using dummies this way may help us fit complicated pattern in the data in a very simple manner. The graph below shows an example with an intercept and ๐ dummies: ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 17 3.2.5 Kinks and jumps Sometimes we expect heterogeneity in the regression coefficients across subgroups. That heterogeneity may come in the form of kinks or jumps. Formally, we say that there are heterogenous coefficients. The graphs below present some examples. If we simply use a model like ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ we would be missing the kink or the jump. We can incorporate the possibility of kinks and jumps. To do that, let ๐๐ be such that: 1 ๐๐ = { 0 if ๐ฅ๐ ≥ ๐ if ๐ฅ๐ < ๐ ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 18 To include the possibility of heterogenous coefficients based on the value of ๐, our model would become: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐พ๐๐ + ๐ฟ๐๐ ๐ฅ๐ + ๐๐ In this case, we say “the variable ๐ interacts with ๐ฅ๐ ” or that “there are interaction terms of ๐ฅ๐ and ๐๐ .” Notice that now we have two intercepts and two slopes. Which is applicable depends on whether ๐ฅ๐ ≥ ๐ or ๐ฅ๐ < ๐. For ๐ฅ๐ < ๐, the model is: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ Whereas for ๐ฅ๐ ≥ ๐, the model is: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐พ + ๐ฟ๐ฅ๐ + ๐๐ The intercept would be ๐ผ + ๐พ and the slope would be ๐ฝ + ๐ฟ. To show more clearly the heterogeneity, we can write the model for both cases as: ๐ฆ๐ = (๐ผ + ๐พ๐๐ ) + (๐ฝ + ๐ฟ๐๐ )๐ฅ๐ + ๐๐ Assuming there is a kink and a jump at ๐, the graph below shows how our model would fit the cloud. As an exercise, assume ๐ฆ๐ is wage, ๐ฅ๐ is years of schooling, ๐๐ = 1 if individual ๐ is female and ๐๐ = 0 otherwise. How would you interpret the coefficients in the following model? 
ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 19 ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐พ๐๐ + ๐ฟ๐๐ ๐ฅ๐ + ๐๐ 3.2.6 Interactions Some relationships between regressors and the dependent variable may be complicated. In the regression context, we call interactions to the product of two or more regressors. We can have interactions with dummies (as we saw before) or with any other regressors. Suppose we have two independent variables, ๐ฅ1 and ๐ฅ2 . A model with an interaction between ๐ฅ1 and ๐ฅ2 is: ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ1๐ + ๐พ๐ฅ2๐ + ๐ฟ๐ฅ1๐ ๐ฅ2๐ + ๐๐ If we take partial derivatives of the dependent variable with respect to each of the two regressors, we don’t get constants terms. Instead, we get values that vary: ๐๐ฆ๐ = ๐ฝ + ๐ฟ๐ฅ2๐ ๐๐ฅ1๐ ๐๐ฆ๐ = ๐พ + ๐ฟ๐ฅ1๐ ๐๐ฅ2๐ The slopes vary with the other explanatory variables. Like derivatives, the terms above can be evaluated at different values of ๐ฅ1 and ๐ฅ2 . Since slopes vary across observations, we say they are heterogenous. As you can imagine, we can have many types of interactions. They may involve more than two independent variables. However, it is important to keep in mind that too many interactions may obscure the meaning of our regression. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 20 3.2.7 De-meaning or centering In some circumstances we may find convenient to de-mean the data, i.e. to center the data around its mean. What does that do to our estimates? Consider the model ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . As we saw before, the regression coefficients would be: ๐๐๐ฃ(๐ฅ, ๐ฆ) ๐ฃ๐๐(๐ฅ) ๐ผฬ = ๐ฆฬ − ๐ฝฬ ๐ฅฬ ๐ฝฬ = What if instead we use ๐ฅ๐∗ = (๐ฅ๐ − ๐ฅฬ ) as our explanatory variable? The process of subtracting the mean to create a new variable is called de-meaning or centering. If we did so, our model would be ๐ฆ๐ = ๐ผ ∗ + ๐ฝ ∗ ๐ฅ๐∗ + ๐๐ . A natural question is, would ๐ฝฬ and ๐ฝฬ ∗ be the same? What about ๐ผฬ and ๐ผฬ ∗ ? Let’s compute them: ๐ฝฬ ∗ = ๐๐๐ฃ(๐ฅ ∗ , ๐ฆ) ๐๐๐ฃ(๐ฅ − ๐ฅฬ , ๐ฆ) ๐๐๐ฃ(๐ฅ, ๐ฆ) = = = ๐ฝฬ ๐ฃ๐๐(๐ฅ ∗ ) ๐ฃ๐๐(๐ฅ − ๐ฅฬ ) ๐ฃ๐๐(๐ฅ) Thus, the slope is unchanged. But that’s not the case with the intercept: ๐ผฬ ∗ = ๐ฆฬ − ๐ฝฬ (0) = ๐ฆฬ If we de-mean the regressors, then we can interpret our estimates as “evaluated at the mean.” This is particularly interesting for the intercept, since it becomes the average for the dependent variable. As an exercise, think what would happen if we also de-meaned the dependent variable. 3.2.8 Hierarchical and rectangular forms Imagine we have yearly data on sales for three sales representatives. The data covers the years 2015 through 2018. There are (at least) two ways of structuring the data into a table. The first, shown below, is what is known as a rectangular form or wide shape. Sales representative Sales 2015 Anne 2016 2017 2018 120 129 108 112 Bob 98 92 105 121 Chris 89 82 97 98 In this case, each row denotes a sales representative and the columns show the sales across different years. Notice that, as we accumulate data for more years, the number of columns would grow. A second way of presenting the same data is by using what is known as a hierarchical form or long shape. Below is the same data but in hierarchical form. Notice that now each row is ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 21 a unique combination of sales representative and year, and there is only one column for sales. the The first level of our hierarchy is given by the sales representative. The second level is given by the year. 
In this case, there is only one column displaying the sales information. Adding more years in this case would increase the number of rows. Sales representative Year Sales Anne 2015 120 Anne 2016 129 Anne 2017 108 Anne 2018 112 Bob 2015 98 Bob 2016 92 Bob 2017 105 Bob 2018 121 Chris 2015 89 Chris 2016 82 Chris 2017 97 Chris 2018 98 The data can come to you in many different shapes. You must be able to arrange it so that you can analyze any way you desire. To do that, it’s helpful to keep in mind these two general ways of organizing a table. Of course, when we have more complex data (more hierarchies and more variables), there are more ways to organize them. Some ways could be partly hierarchical and partly rectangular. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 22 4 Probability and regression At this point, we already have the first layer of our cake structure. We have a mathematical method (regression) to summarize the relationship between a dependent variable (๐ฆ) and a group of independent variables (๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ ). We start with a cloud of points (our data), and we fit it with a linear structure. We can fit clouds with all kinds of shapes. They don’t have to look like lines or planes. They can be curvy, and they can have jumps and kinks. Now, we will proceed to the second layer, which incorporates probability. 4.1 Sampling and estimation Let’s start with a silly example. Assume I measure your expenditures on entertainment over the last three months and plot it against the last two digits of your Social Security number. Would there be any correlation? You can correctly guess there should be no correlation. The graph below illustrates this example. The cloud represents different levels of expenditures along the vertical axis, and the last two digits of the Social Security number in the horizontal axis. We know the actual value of the slope should be zero because there is no reason those two variables should be connected. That’s represented by the blue line. However, what if we randomly got two samples like the ones depicted in the graph below? If our sample consisted of the observations denoted by triangles, a regression using that sample would produce a negative slope (red line). In contrast, if our sample consisted of the observations denoted by squares, the regression would produce a positive slope (green line). ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 23 When we use samples, by sheer luck we may get positive or negative slopes even if the actual value of the slope should be zero. Probability enters regressions trough the notion of sampling. 4.2 Nomenclature Let’s introduce some useful nomenclature. We call parameters to the regression coefficients we would get if we ran a regression using data for the entire population or universe. In contrast, we call estimates to the regression coefficients we get when we run a regression using a sample. Colloquially, parameters are sometimes are referred to as the betas or the true betas, whereas estimates are referred to as the beta hats. The hat comes from the convention of adding the symbol ^ on top of the coefficient to distinguish it from the parameter. We hope the estimates are informative of the parameters. In fact, that’s the only reason we care about them. In the real world, we don’t observe the population or the universe. We only observe samples. Our challenge is to determine if our estimates are close or far from the parameters. 
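A quick simulation makes the point concrete. The numbers below are made up for illustration (they are not from any real data set): we build a large "population" in which spending is unrelated to the last two digits of the Social Security number, draw many random samples, and look at how the estimated slope bounces around the true value of zero from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population in the spirit of the Social Security example:
# y (entertainment spending) is unrelated to x (last two digits),
# so the population slope is zero by construction.
N = 100_000
x_pop = rng.integers(0, 100, size=N).astype(float)
y_pop = 200 + rng.normal(0, 50, size=N)        # no dependence on x

def sample_slope(n):
    """Draw a random sample of size n and return the estimated slope."""
    idx = rng.choice(N, size=n, replace=False)
    x, y = x_pop[idx], y_pop[idx]
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slopes = np.array([sample_slope(100) for _ in range(2000)])
print(slopes.mean())   # close to the true slope of zero
print(slopes.std())    # sampling variation: some estimates positive, some negative
```

Re-running the sketch with a larger sample size, or with less noise in y, shrinks the spread of the estimates, which is exactly the point made next.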
Notice something important and intuitive in the example about the Social Security numbers that is true more generally. First, the less ๐ฆ varies (relative to ๐ฅ), the smaller the chances of getting very different regression coefficients across random samples—the beta hats would be more similar across samples. Second, the larger the sample, the smaller the chances the regression coefficients will differ by much from the population regression coefficient—the beta hat would be more similar to the true beta. Those are two general principles worth keeping always in mind. 4.3 The magic of the Central Limit Theorem Imagine that, given a population of size ๐, we draw one million random samples of size ๐ < ๐. For each sample, we run the regression ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ and get an estimate of beta (that is, we gate a ๐ฝฬ). We would have one million of such beta hats. If we create a histogram with ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 24 all those values, how would it look? By the Central Limit Theorem, we know it would look like a normal distribution centered at the true beta. This property is independent of anything else. It only depends on the concept of random sampling. This is an awesome result and we get a lot of mileage out of it. The graph below shows how the one-million beta hat histogram would look. In our Social Security number example we have that ๐ฝ = 0 but, because of sampling, we would get ๐ฝฬ > 0 half of the time and ๐ฝฬ < 0 the other half. However, estimates close to the parameter are more likely than estimates far from it—look at the chart above. If we knew how much ๐ฝฬ varies, then we could calculate the probability of ๐ฝฬ (the estimate) being close to or far from ๐ฝ (the parameter). 4.4 Standard error We measure how much ๐ฝฬ varies using its standard deviation. We call the standard deviation of ๐ฝฬ the standard error. There are two ways to estimate the standard error of ๐ฝฬ. One way is bootstrapping. It consists of treating our sample as the population, and then drawing many samples from it, replacing each time with draw and observation all the data points. By taking samples with replacement we can get a very good idea of how much our estimate varies based exclusively on the luck of the draw. This is very easily done with today’s computers. Thus, there is no excuse to not do it. We can select the number of repetitions we want (100, 1000, 10,000 or a million). Notice that ceteris paribus larger sample sizes mean smaller standard errors because larger samples sizes produce estimates closer to the parameter and therefore they vary less. We can also proceed in the classic way and estimate the standard error using the residuals of our (only one) regression using the full original sample. Based on assumptions that we won’t ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 25 review here (some of which aren’t verifiable), you can approximate the standard error this way. It is important to know this method because most people use it. The standard error of ๐ฝฬ is estimated based on how large or small the residuals are using the following formula:1 ๐. ๐ธ. (๐ฝฬ ) = √๐ฃ๐๐(๐ฝฬ ) = √๐ฬ 2 (๐ ′ ๐)−1 The above expression also decreases with sample size through the term (๐ ′ ๐)−1 . To see it, notice that, in the case of a univariate regression, multiplying by the term (๐ ′ ๐)−1 is equivalent to 2 dividing by the term ∑๐ ๐=1(๐ฅ๐ − ๐ฅฬ ) , which is increasing in ๐. 
You may notice that we introduced the term ๐ฬ 2 : ๐ 1 ๐ฬ = ∑ ๐ฬ๐2 ๐ 2 ๐=1 Whichever way we measure the standard error (bootstrapping using many regressions or based on the residuals of a single regression), the idea is that our ๐ฝฬ is normally distributed with mean ๐ฝ and variance equal to the square of the standard error. We express that statement as: ๐ฝฬ ~๐(๐ฝ, [๐. ๐ธ. (๐ฝฬ )]2 ) Let’s assume the true beta is zero. This is an arbitrary but very useful assumption. Given a standard error, we can compute the probability of ๐ฝฬ being in any interval we want. Let’s focus on symmetric intervals around zero. The graph below shows the distribution of ๐ฝฬ assuming ๐ฝ = 0, and given a standard error of one (as an example). Given an interval [−๐, ๐], where ๐ is a positive number, we can easily calculate the probability of ๐ฝฬ being outside that interval. We could do that calculation in a spreadsheet or any statistical software. There are better and slightly more sophisticated formulas to estimate the standard error that account for some other factors. We will briefly discuss them later. 1 ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 26 We can also proceed backwards. Start with a given probability. We can find the symmetric interval around zero such that ๐ฝฬ would fall outside with that given probability. The graph below illustrates that situation. If we start with a probability of, say, 0.05 of ๐ฝฬ falling outside of the interval [−๐, ๐], then we can determine the value of ๐. To summarize, estimates are sample regression coefficients and parameters are population regression coefficients. Because of sampling, we think of estimates as random variables. Estimates are normally distributed, and their mean are the parameters. The Central Limit Theorem doesn’t require any assumption on the distribution of ๐ฆ, ๐ฅ or ๐. Based on the Central Limit Theorem result, we can formulate and test hypotheses. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 27 4.5 Significance Once we know the shape of the distribution of the estimates (a normal distribution), we may find useful to hypothesize that it is centered at zero. Another way to state the same hypothesis is that there is no relation between ๐ฅ and ๐ฆ in the population or that the true beta is zero. However, as we saw before, even if the true beta is zero, there is a chance we could get a sample for which the beta hat is not zero. Thus, we never know with certainty if the hypothesis is true or false. But we can check whether the data lend little or a lot of support to it. We can test the hypothesis ๐ฝ = 0 based on the distribution of ๐ฝฬ (under the assumption that is centered at zero) and the actual sample coefficient we obtain. With those ingredients, we define a rejection region associated with a confidence level (as you previously saw in your Stats course). Keep in mind that, since there is uncertainty, the best we can do is to live with a level of confidence. Sometimes it helps to understand these issues in terms of a coin. Suppose we’re interested in determining whether a coin is fair (i.e. it isn’t loaded). By tossing it one hundred times (i.e. by getting one sample of size ๐ = 100) we’ll never know for sure if it’s fair or not. But we may get a very good idea. If out of one hundred tosses we get ninety-five heads, we have good reasons to believe the coin isn’t fair. Why? Because, assuming the coin is fair, getting ninety-five heads or more is extremely unlikely. 
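As an aside, these tail probabilities are easy to check directly. A minimal sketch using scipy (an assumption of this illustration; the notes do not depend on any particular software):

```python
from scipy.stats import binom

def prob_at_least(k, n=100, p=0.5):
    """P(number of heads >= k) in n tosses of a coin that lands heads with probability p."""
    return binom.sf(k - 1, n, p)   # sf(k-1) = P(X > k-1) = P(X >= k)

print(prob_at_least(95))   # astronomically small under a fair coin
print(prob_at_least(60))   # about 0.028, the "one in thirty-five" quoted below
print(prob_at_least(55))   # about 0.184, roughly one in five
```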
What about eighty or more heads? Seventy or more? As we approach fifty heads (what we would expect with a fair coin), the probability gets closer to fifty percent. For instance, the probability of observing sixty heads or more is one in thirty-five (0.0284). Still small, but not microscopic anymore. Lastly, the probability of observing fifty-five heads or more is close to one in five (0.1841). When we produce estimates using regressions, we have a similar situation. It’s hard to reconcile estimates that are far from zero with a true parameter equal to zero. Just as in the coin toss example, we can compute the probabilities associated with each value of ๐ฝฬ. Remember that once we know the standard error of ๐ฝฬ, we also know the hypothetical distribution of ๐ฝฬ assuming ๐ฝ = 0. The graph below shows such distribution. Notice that the location of the distribution doesn’t depend on the value of ๐ฝฬ—we assumed it’s centered at zero. What does the distribution mean intuitively? Given ๐ฝ = 0, values of ๐ฝฬ far from zero (be them positive or negative) are unlikely. Thus, if our ๐ฝฬ is very large or very small, then it is highly unlikely that it comes from a distribution centered at zero. Always keep in mind that the standard error is our measure of how far is ๐ฝฬ from 0, because it is the standard deviation of the distribution of ๐ฝฬ. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 28 Let’s revisit the normal distribution you’ve studied before. The graph below shows the probability by intervals for a random variable that is normally distributed with mean zero and standard deviation one (the horizontal axis is expressed in standard deviations). For instance, the probability that such variable falls between 0 and 1 is 0.341. Since the distribution is symmetric, the probability of the variable falling between −1 and 0 is also 0.341. Thus, the probability of falling between −1 and 1 is 0.682, which is equal to 2 × 0.341. The probability of the variable falling outside of the interval (−1,1) is 0.318, which is 1 – 0.682. More generally, we can compute the probability of the variable falling inside or outside any interval we want. We can also proceed the other way around. We can start with a probability, say 0.90 or 90%, and find the symmetric interval that corresponds to that probability. An interval is defined by its upper and lower bounds. The graph below shows the values of the upper and lower bounds given three probabilities: 0.99, 0.95 and 0.90. As before, the horizontal axis is expressed in standard deviations. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 29 Given the value of an estimate ๐ฝฬ (which may be positive or negative), we can compute the probability of obtaining estimates (drawn from the same distribution centered at zero) that are greater than |๐ฝฬ | or smaller than −|๐ฝฬ |. Such probability is known as the p-value associated with the estimate ๐ฝฬ. The graph below presents an example with ๐ฝฬ = 1.405 and a standard error of 1. The probability of obtaining an estimate above 1.405 is 0.08, and the probability of obtaining an estimate below −1.405 is also 0.08. Thus, the probability of obtaining an estimate that is farther away from zero than 1.405 is 0.16, which is 2 × 0.08. In other words, the p-value of the estimate 1.405 is 0.16. It should be clear that the probability of getting an estimate closer to zero than 1.405 is 0.84. The definition of p-value stated above corresponds to two-sided tests. 
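The two-sided p-value in the example above can be reproduced in a couple of lines. A sketch using scipy's normal distribution, with the hypothetical numbers from the example (an estimate of 1.405 and a standard error of 1), not output from any actual regression:

```python
from scipy.stats import norm

def two_sided_p_value(beta_hat, se):
    """p-value for H0: beta = 0, using the normal approximation described above."""
    z = abs(beta_hat) / se
    return 2 * norm.sf(z)              # area in both tails beyond |beta_hat|

print(two_sided_p_value(1.405, 1.0))   # about 0.16, as in the example above
print(norm.sf(1.405))                  # about 0.08, one tail only
```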
We can define the pvalue for one-sided tests. In that case, we only care about either the probability of getting estimates that are larger than our estimate or smaller than our estimate. As you can see in the graph above, that’s equivalent to looking at only one of the tails of the distribution. The p-value in a right-side ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 30 test (which measures the probability of an estimate being greater than 1.4095) is 0.08. Because the distribution is symmetric and centered at zero, that’s the same as the p-value in a left-side test (which measures the probability of an estimate being smaller than −1.4095). Now that we have reviewed some probability notions, we can introduce a crucial concept. If ฬ ๐ฝ falls outside of the 95% interval centered at 0, we say it is statistically significant (or statistically different from zero) at 95% confidence. If it falls inside, we say that it is statistically insignificant (or statistically not different from zero). We can use other levels of confidence. Traditionally, 95% is the norm. However, with larger samples, we can be more demanding and use 99% or 99.9%. Notice that the definition of statistical significance can also be expressed in terms of p-values. If the p-value is below 0.05, then we say that the estimate is statistically significant at 95% confidence. If the p-value is greater than 0.05, we say the estimate is insignificant or not significant. As you can imagine, the definition of significance can be adjusted to reflect one-side tests if that’s what we need. Imagine that a regression produces an estimate ๐ฝฬ = 1.8 and the standard error associated is 1. Is that estimate statistically significant? The answer depends on the level of confidence we use and whether we are performing a one- or two-sided test. If we use a confidence of 95% (or higher) in a two-sided test, the estimate is not significant (see the chart above). But if we use 90%, it is significant, since it lies outside of the 90% interval that has bounds −1.64 and 1.64 (1.80 > 1.64). If we use one-sided tests, then the estimate would be significant at 95% confidence because the interval’s bounds are −∞ and 1.64. In sum, you cannot say a priori whether an estimate is significant or not just by looking at it. You need to know (1) whether we’re talking about a one- or two-sided test, (2) the p-value of the estimate, and (3) the confidence level. Notice that, all else constant, significance is directly affected by the sample size. We mentioned that larger samples result in smaller standard errors. That means the distribution of the estimates is more narrowly concentrated around the assumed value of the parameter. Thus, any non-zero estimate will eventually result significant if we keep increasing the sample size. So far, we’ve assumed ๐ฝ = 0. However, we could assume any other value for ๐ฝ and test whether our estimate is likely to be coming from a distribution centered at that (non-zero) value. That would be similar to assuming a loaded coin that lands heads with a probability different from one half. Intuitively, given an estimate, some parameter values would be more “reasonable” than others. After all, it’s more believable that the estimate ๐ฝฬ = 21.3 comes from a distribution centered at 20 than from a distribution centered at 100. We will talk about this in the next two sections. 4.6 Confidence intervals Given a confidence level, what parameters would be consistent with our estimate? 
We have a Goldilocks situation. Some parameter values seem too big for our estimate, while others seem too small. The graph below illustrates this situation. Imagine our estimate is ๐ฝฬ, and we consider two possible values of the true parameter, ๐ฝ ∗ and ๐ฝ ∗∗ . If the distribution of ๐ฝฬ were centered at any of ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 31 those two values, it would be very unlikely to get ๐ฝฬ (just like it’d be very unlikely to get fifty heads in one-hundred tosses using a coin heavily loaded in favor of heads, or using another coin heavily loaded against heads). Which values of the parameters seem “right” given our estimate? The answer is very intuitive. It’d be the values that are close to our estimate. Closeness to the estimate makes those parameters appear more reasonable. One simple way to measure how close a possible parameter value is to our (known) estimate is to look at the p-value we would get under the assumption that the true parameter has any particular value. Imagine we can adopt the following rule. We pick a confidence level, say 95%. Then we determine all the values of ๐ฝ for which the p-value of our estimate would be above the critical value, which is defined as one minus the confidence level we picked. In this case the critical value is 0.05. We would end up with an interval of possible values of ๐ฝ. Colloquially speaking, it wouldn’t surprise us if our estimate ๐ฝฬ came from a distribution centered anywhere within that interval—the probability of such event wouldn’t be too small. To make things easy, we can focus on the lower and upper bounds of the interval just described. If the critical value is 0.05, we need to find the values of the parameter such that the pvalue of ๐ฝฬ is precisely 0.05. There are two such parameter values. One will be greater than ๐ฝฬ and the other will be smaller. The graph below illustrates this point. If we assume the parameter is equal to ๐ฝ′, the p-value of ๐ฝฬ is 0.05. Similarly, if we assume the parameter is equal to ๐ฝ′′, then the p-value of ๐ฝฬ is also 0.05. For any parameter value between ๐ฝ′ and ๐ฝ′′, the p-value of ๐ฝฬ is greater than 0.05. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 32 All possible betas for which the p-value of ๐ฝฬ is greater than (or equal to) 0.05 constitute the 95% confidence interval of our estimate. In the example above, the 95% CI (as we usually abbreviate the confidence interval) is given by (๐ฝ′, ๐ฝ′′). In layperson terms, the confidence interval tells us which values of the parameter are consistent with our estimate. Our estimate is not statistically different from those parameter values (at the given level of confidence). This is very helpful in many contexts. A common mistake is to say that the parameter falls inside our confidence interval with a 95% probability. Why is this wrong? The parameter is fixed. It isn’t a variable—let alone a random one. Put differently, the parameter is either is in the interval or not. We cannot make probability statements about it. 4.7 Hypothesis testing Often times we would like to make decisions based on ๐ฝ, but we don’t observe it. We only observe ๐ฝฬ. However, we know ๐ฝฬ and ๐ฝ are related. First, ๐ฝฬ comes from a normal distribution centered at ๐ฝ. Second, we have a proxy for the standard deviation of that distribution—the standard error of ๐ฝฬ. 
Thus, we can use ๐ฝฬ as a piece of information about ๐ฝ the same way we use a sample average to inform us of the population average. How do we do this? We use hypothesis testing. A very common hypothesis (usually denoted by ๐ป๐) is ๐ฝ = 0. If ๐ฝฬ is far from 0, then we reject that hypothesis. However, we don’t reject it with certainty. We reject it with some level of confidence picked a priori (usually 95%). When ๐ฝฬ is close to 0, we don’t reject the hypothesis. However, not rejecting a hypothesis is different from accepting it. To illustrate that, imagine two different hypotheses (e.g. ๐ฝ = 0 and ๐ฝ = 0.1) are tested using the same regression and aren’t rejected. They both cannot be accepted because they are different (0 ≠ 0.1). ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 33 The measure of how far ๐ฝฬ is from ๐ฝ is given by the standard error. However, we don’t know the standard error with certainty. We estimate it based on our sample through bootstrapping or the classic way based on the residuals. For our hypothesis tests we use a t distribution in lieu of a normal distribution because we have a proxy for the standard deviation. The difference between the estimate and the parameter, divided by the standard error, is a random variable distributed t with ๐ − ๐ − 1 degrees of freedom: ๐ฝฬ − ๐ฝ ∼ ๐ก๐−๐−1 ๐. ๐ธ. Where ๐ is the number of observations, and ๐ is the number of explanatory variables. Whenever we have more than one hundred degrees of freedom (which is almost always the case), the t distribution is indistinguishable from a normal distribution. Hence the focus in these notes on the latter. However, formally we use the t distribution for hypothesis testing and confidence intervals. The ratio (๐ฝฬ − ๐ฝ)/๐. ๐ธ. is the t-statistic. Knowing the distribution of the t-statistic allows us to formulate different hypothesis tests. It’s crucial to note that the same estimate may be significant in some regressions but not in others, depending on the standard error. In other words, the same hypothesis may be rejected with the same estimate depending on the standard errors. Remember that significance is the result of comparing the magnitude of the estimate with how much we would expect it to vary across samples. For instance, assume two different standard errors, 0.5 and 1, and the hypothesis ๐ฝ = 0. The two distributions of ๐ฝฬ are depicted in the graph below. For which values of ๐ฝฬ do we reject the hypothesis ๐ป0: ๐ฝ = 0 in each case? The shaded areas denote the rejection regions at a 95% confidence. Try different estimate values and convince yourself of the different conclusions. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 34 As the sample size increases, the size of the standard error decreases. To see this intuitively, remember that estimates from larger samples resemble more the population parameter. Therefore, there is less variation across estimates produced with larger samples. Thus, two identical estimates may lead to different conclusions about the same hypothesis if those estimates come from samples with different sizes. With these tools we can test many hypotheses. We can hypothesize that ๐ฝ takes any particular value of interest to us (1.5, 3, −1.2, etc.) and test it. In this context, confidence intervals are very useful. 
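To tie the pieces together, here is a hedged sketch of how such a test and the matching confidence interval can be computed with scipy's t distribution. The estimate, standard error, sample size, and number of regressors below are made-up numbers for illustration:

```python
from scipy.stats import t

def test_beta(beta_hat, se, beta_null, n, k, conf=0.95):
    """Test H0: beta = beta_null and report the confidence interval,
    using the t distribution with n - k - 1 degrees of freedom."""
    df = n - k - 1
    t_stat = (beta_hat - beta_null) / se
    p_value = 2 * t.sf(abs(t_stat), df)              # two-sided p-value
    half_width = t.ppf(1 - (1 - conf) / 2, df) * se  # critical value times S.E.
    ci = (beta_hat - half_width, beta_hat + half_width)
    return t_stat, p_value, ci

# Hypothetical numbers: estimate 1.8, standard error 1, 500 observations, 3 regressors
print(test_beta(1.8, 1.0, beta_null=0.0, n=500, k=3))
print(test_beta(1.8, 1.0, beta_null=1.5, n=500, k=3))
```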
Given a level of confidence and an estimate ๐ฝฬ, a confidence interval tells us all the values for which we wouldn’t reject the hypothesis that the parameter is equal to any of those values. Colloquially, a confidence interval gives us a range of parameter values of distributions from which our estimate is likely to come from. 4.8 Joint-hypothesis tests In the same regression, ๐ฝฬ0 , ๐ฝฬ1 , ๐ฝฬ2 , … , ๐ฝฬ๐ aren’t independent random variables. In general, they are correlated. To see this, think of the original example using Social Security numbers and expenditures on entertainment. We mentioned the possibilities of getting samples for which the estimate of the slope would be positive or negative. Greater positive slopes come accompanied with lower intercepts, whereas greater negative slopes come accompanied with greater intercepts. We can formulate hypothesis tests that involve more than one estimate at a time. For instance, we can test whether the sum of two estimates is equal to one, whether the ratio of two estimates is equal to two, etc. Statistical software does that for us in an incredibly easy way. The underlying ideas about significance and confidence intervals are the same. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 35 5 The economics of regression So far, we’ve discussed the mathematical and probability aspects of regressions. Now we are moving on to the economics. Always keep in mind that regression is a tool. How we should use it depends on the question we are trying to answer. We can broadly classify most questions into three uses: descriptive, predictive and prescriptive. 5.1 Descriptive use The greatest asymmetry between econometrics textbooks and the practice of econometrics is in the emphasis on describing the data. Explicitly or implicitly, econometric textbooks focus on causal relationships and assume we already have a theory. In practice, the way we model phenomena (how we think the situation under analysis works) comes after observing the data. That isn’t cheating, as some theoretical extremist may suggest. It’s the scientific method. We first observe the world, then we come up with ideas about how it works. In business, we start with the overall goal of improving the performance of the organization (reducing churn, increasing loyalty, reducing employee turnover, decreasing unused promotions, etc.). Then we look at the data to get ideas. What seems to be associated with what? Is churn associated with gender? Are older customers more loyal? Is turnover associated with personality traits measured by the human resources department? Is the rate of unopened promotional emails related to the time of day they are sent? The exploration of the world through the data lenses allows us to find problems or areas of opportunity, and then come up with potential solutions or ideas. How do we explore the data? Visual inspection is usually insufficient or not feasible. We have many variables and we cannot plot more than three dimensions at the same time. The analytical “weapons of choice” for practitioners are partial correlation coefficients and conditional averages, which are computed using regressions. The difference with regular or naïve correlations and averages is that with partial correlation coefficients and conditional averages, we “hold everything else constant,” “control for other factors,” or “adjust for other variables.” Let’s start with the use of regression coefficients as partial correlation coefficients. 
The idea is closely related to the mathematical concept of a partial derivative. Partial correlation coefficients offer numerical answers to the question: what is the relation between $y$ and $x$ holding everything else constant? Think in terms of the regression model:

$$y_i = \alpha + \beta x_i + \gamma w_i + \delta z_i + \varepsilon_i$$

When we explore the relation between $y$ and $x$, we want to hold $w$ and $z$ constant. If we simply eyeball the data (say, with a scatter plot), we wouldn't be holding $w$ and $z$ constant. With a regression, when we look at the coefficient $\beta$, by definition we are holding the other regressors constant. Remember that:

$$\beta = \frac{\partial y}{\partial x}$$

Suppose a restaurant chain is exploring the relationship between ticket size per customer ($y$) and party size ($x$). The restaurant chain is entertaining the possibility of giving promotions to increase party size because they believe larger parties spend more per customer. By running a regression, they can get an estimate of $\beta$ and test whether it is different from zero. The regression may hold other variables constant (e.g. the day of the week, the time of day, or whether there was a special event like Monday Night Football). They can also test hypotheses about the dollar value of the increase associated with one additional person in the party. For instance, they could formally test whether the increase is five dollars (i.e. $\beta = 5$).

The second workhorse of descriptive analysis is conditional averages. Regressions allow us to calculate averages "adjusting for other factors" or "holding all else constant." To illustrate the relevance of this, suppose a company is comparing the productivity of managers supervising different groups of employees (perhaps the comparison will be used to pay bonuses). Let $y_{ij}$ represent the performance of employee $i$ who works with manager $j$. For each manager, we have a unique group of workers. Let $\bar{y}_j$ represent the average performance of workers supervised by manager $j$.

What are some potential issues with simply comparing average worker performance across managers? In the real world, not all workers are the same. Some are more motivated or more skillful. A naïve comparison of average performance across managers may lead to wrong decisions. Assume an expert tells you that worker performance is affected by work experience. Thus, it'd be better to think in terms of the model:

$$y_{ij} = m_j + \beta x_{ij} + \varepsilon_{ij}$$

where $m_j$ is the productivity of manager $j$, and $x_{ij}$ is the years of experience of worker $i$. By comparing averages without any sort of adjustment, we would be missing the effect of experience, $x_{ij}$, on the observed performance of each manager. Take managers 1 and 2. If our model above is true, the naïve difference in average performance is not $m_2 - m_1$. Rather, it is:

$$\bar{y}_2 - \bar{y}_1 = m_2 - m_1 + \beta(\bar{x}_2 - \bar{x}_1) + (\bar{\varepsilon}_2 - \bar{\varepsilon}_1)$$

As you can see, the naïve approach mixes differences in manager productivity with differences in worker experience. Unless $\bar{x}_2 = \bar{x}_1$, we would be omitting important information. If $\beta > 0$, then there would be a bias in favor of the manager with more-experienced workers.

Let's look at a graphical version of the same example. The graph below presents a cloud of points denoting the performance of different workers. The different colors of the points in the cloud denote the different managers supervising each worker.
The blue points correspond to manager 1, and the orange points correspond to manager 2. If we simply computed average worker performance by manager, the average for manager 1 would be lower than the average for manager 2. However, by looking at the experience of all workers, it is clear that manager 1 supervises workers with less experience than manager 2. The shape of the cloud also suggests there is a positive relationship between worker performance and experience. The lines in the graph represent the results of fitting the model ๐ฆ๐๐ = ๐๐ + ๐ฝ๐ฅ๐๐ + ๐๐๐ , which has a different intercept for each manager and the same slope for worker experience. The estimate of the intercept for manager 1 is ๐ฬ1 , and the estimate of the intercept for manager 2 is ๐ฬ2 . Remember that those estimates can be interpreted as managerial productivity. In the graph, ๐ฬ1 > ๐ฬ2 , which means that, holding worker experience constant, manager 1 is more productive than manager 2. The result is the opposite to the naïve comparison. Similar examples are given by performance comparisons in many occupations (doctors with patients with different challenges, teachers with students of different backgrounds, lawyers with cases with different difficulties) or in prices of goods with many attributes (insurance premiums for people with different characteristics, prices of cars or computers with different features, wages for workers with different sociodemographic traits). Examples like those above can be grouped into what we call hedonic models. The name comes from pricing models where “the price of the total is the sum of the prices of the parts,” even if those parts’ prices aren’t observed in the market. For instance, think of houses prices. Being close to public transportation or having a backyard are valuable traits and certainly affect the price of a house. However, you cannot buy those features in a market and add them to your ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 38 house. With hedonic models we can estimate the contribution of those traits to the total price as if those traits could be added. In sum, never use correlation or naïve comparisons of averages when you can use a multivariate regression. Multivariate regressions allow you to control or adjust for other factors. However, when running a regression, you must pay attention to what controls, covariates or regressors are included in your analysis. It’s possible that sometimes you omit important explanatory variables. Some other times you may be including too many. We’ll discuss those two possibilities after we talk about fixed effects. 5.1.1 Fixed effects In a regression model, fixed effects can be defined as different intercepts for different groups of points. In our example of managers and workers above, we introduced manager-fixed effects. All observations associated with one manager would share the same intercept, and those intercepts could differ across managers. Fixed effects are estimates themselves. They are nothing but coefficients on dummies. Fixed effects can be used as controls (their value may be irrelevant to us) or as the subject of our analysis (their value may be important to us). In the model above, we could be interested in the relation of experience and worker performance. If we didn’t include manager fixed effects, our fitted line would understate the actual relation. In that case, manager-fixed effects aren’t interesting per se. 
We just use them to get the right estimate of a different parameter (๐ฝ). If instead, we are interested in measuring the difference managers make in worker performance, the manager-fixed effects would be the most important result of the analysis. To estimate fixed effects our cloud of points must include several observations associated to the same unit. For instance, to estimate manager-fixed effects, we need multiple workers associated with each manager. We also need to know the identity of their managers—otherwise we cannot group observations by manager. In the example above, each resulting coefficient ๐ฬ๐ is interpreted as the “manager effect.” Depending on our subject of analysis, there may be also a “location effect,” “Holiday effect,” “rush-hour effect,” and a long et cetera (notice that, for brevity, we omitted the word “fixed”) . In terms of notation, fixed effects can be written very concisely. Imagine that we have a dayof-the-week fixed effect. We can denote it by ๐๐ (the Greek letter eta with a subscript indicating the day): ๐ฆ๐ = ๐๐ + ๐ฝ๐ฅ๐ + ๐๐ In that case, the subscript ๐ would take seven possible values, from Sunday through Saturday. Compare that to the equivalent dummy approach, where we would have one coefficient and one dummy for each day of the week: ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 39 ๐ฆ๐ = ๐๐๐ข ๐๐๐๐ข + ๐๐ ๐๐๐ + ๐ ๐๐ข ๐๐๐๐ข + ๐๐ ๐๐๐ + ๐ ๐โ ๐๐๐โ + ๐๐น ๐๐๐น + ๐๐๐ ๐๐๐๐ + ๐ฝ๐ฅ๐ + ๐๐ Clearly, it’s better to use the fixed effects notation rather than the dummy one, particularly when we have large numbers of fixed effects. Lastly, we can have two or more fixed effects in the same model. For instance, in the same regression we can have fixed effects for day of the week and fixed effects for the hour of the day (say, morning, midday, afternoon and evening). The model could be written as: ๐ฆ๐ = ๐๐ + ๐โ + ๐ฝ๐ฅ๐ + ๐๐ Fixed effects are very useful and not very well understood by many practitioners. Paradoxically, they are incredibly easy to work with in practice. Also, they seem to create information out of thin air. After all, without any direct information about managers, we are able to measure (otherwise unobserved) differences in productivity. The intuition for this is that we get indirect information though the multiple workers supervised by each manager. 5.1.2 Omitted variables When a variable belongs in a model and we omit it, we create a bias. To show it, let’s start by assuming the correct model (without omissions) and contrast its results with what we get with the omission. Suppose the correct model is ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . Our estimate of ๐ฝ is: ๐ฝฬ = ๐๐๐ฃ(๐ฅ, ๐ฆ) ๐ฃ๐๐(๐ฅ) If our model is correct, we can substitute ๐ฆ with ๐ผ + ๐ฝ๐ฅ + ๐ in the formula above. After some algebra, and using the properties of covariance, we get that the expected value of our estimate is the parameter:2 ๐ธ[๐ฝฬ ] = ๐๐๐ฃ(๐ฅ, ๐ฆ) ๐๐๐ฃ(๐ฅ, ๐ผ + ๐ฝ๐ฅ + ๐) ๐๐๐ฃ(๐ฅ, ๐ผ) + ๐๐๐ฃ(๐ฅ, ๐ฝ๐ฅ) + ๐๐๐ฃ(๐ฅ, ๐) = = ๐ฃ๐๐(๐ฅ) ๐ฃ๐๐(๐ฅ) ๐ฃ๐๐(๐ฅ) ๐๐๐ฃ(๐ฅ, ๐ฝ๐ฅ) ๐๐๐ฃ(๐ฅ, ๐ฅ) ๐ฃ๐๐(๐ฅ) = =๐ฝ =๐ฝ =๐ฝ ๐ฃ๐๐(๐ฅ) ๐ฃ๐๐(๐ฅ) ๐ฃ๐๐(๐ฅ) This equality holds with expected values. As you know, when we talk about a particular sample, the estimate will likely not be equal to the parameter. 2 ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 40 In this case, we say that the estimate is unbiased. The expectation of ๐ฝฬ is ๐ฝ. In reality, we cannot be certain about what the correct model is. 
But contemplating the possibility of omitting relevant variables is important. Let's go back to our example of party size and average ticket in a restaurant. What can be missing from the analysis? We can think of many determinants of ticket size per customer besides party size. An obvious one is socioeconomic status (there can be many others). Let's think of the model $y_i = \alpha + \beta x_i + \gamma z_i + \varepsilon_i$, where $y_i$ is the ticket size per customer of party $i$, $x_i$ is party $i$'s size, and $z_i$ is the socioeconomic status of the person paying the check (perhaps measured by the type of payment). How is party size related to expenditure per customer? Holding all else constant, $\partial y / \partial x = \beta$.

If we could run the regression $y_i = \alpha + \beta x_i + \gamma z_i + \varepsilon_i$, we would obtain estimates for the three parameters. However, when we run a regression of $y$ on $x$ alone (omitting $z$), what do we get? Let's look at the expected value of our estimate of $\beta$:

$$E[\hat{\beta}] = \frac{cov(x, \alpha + \beta x + \gamma z + \varepsilon)}{var(x)} = \beta + \gamma \frac{cov(x, z)}{var(x)}$$

The term $\gamma \times cov(x, z)/var(x)$ is the omitted-variable bias. In words, by omitting $z$ from the regression, our estimate of $\beta$ is biased. What can we say about the sign and magnitude of the omitted-variable bias? The bias depends on: (i) the coefficient on the omitted variable, in this case $\gamma$, and (ii) the covariance between included and omitted regressors, in this case $x$ and $z$. The following table goes over all the possibilities.

    $\gamma$      $cov(x, z)$   Omitted-variable bias is...
    0             any value     zero
    any value     0             zero
    > 0           > 0           positive
    < 0           > 0           negative
    > 0           < 0           negative
    < 0           < 0           positive

What does the table imply for the analysis of ticket size? The analysis is omitting socioeconomic status. The sign of the bias depends on whether socioeconomic status increases or decreases the average ticket size and on how it relates to party size. To develop your intuition, go over several possibilities. Economists always think of potential omitted-variable bias when they look at correlations or regression coefficients.

How does omitted-variable bias look in practice? Sometimes we have other potential regressors. We simply include them and see what happens. If there is no omitted-variable bias, adding regressors doesn't change our estimates in a meaningful way. If there is omitted-variable bias, adding regressors changes our estimates. When we don't have other regressors, economic theory may be informative about the sign or even the magnitude of the omitted-variable bias.

Let's revisit the manager productivity example above. Imagine we are interested in the relation between worker performance and experience. If we use a model without manager fixed effects (i.e. with one intercept), the slope we get will be smaller than if we use a model with manager fixed effects (i.e. with multiple intercepts, one for each manager). The omission of the manager dummies as explanatory variables biases downward the estimate of the relation between worker performance and experience.

5.1.3 Redundant variables

In a regression, what happens if two regressors are very much measuring the same thing? This is called collinearity. If collinearity is perfect, then two or more regressors have an exact linear relationship. In the case of two regressors, we can represent perfect collinearity with the equality:

$$x_{2i} = \delta_0 + \delta_1 x_{1i}$$

With perfect collinearity, we cannot include both $x_1$ and $x_2$ as regressors in our regression.
Notice what would happen if we did: ๐ฆ๐ = ๐ฝ0 + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + ๐๐ = ๐ฝ0 + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 (๐ฟ0 + ๐ฟ1 ๐ฅ1๐ ) + ๐๐ = (๐ฝ0 + ๐ฝ2 ๐ฟ0 ) + (๐ฝ1 + ๐ฟ1 )๐ฅ1๐ + ๐๐ = ๐พ0 + ๐พ1 ๐ฅ1๐ + ๐๐ This is equivalent to dropping ๐ฅ2 from the regression (we could have dropped ๐ฅ1 and keep only ๐ฅ2 instead). In fact, statistical software automatically does it for us. The question remains, can we recover estimates of ๐ฝ0 and ๐ฝ1 from estimates of ๐พ0 and ๐พ1 ? The answers is negative. To see why, let’s consider an example. Think of ๐ฅ1 as temperature in degrees Celsius (°๐ถ) and ๐ฅ2 as temperature in degrees Fahrenheit (°๐น). Notice that there is perfect collinearity, since °๐น = 32 + 1.8°๐ถ. We can estimate the effect of temperature measured in degrees Celsius or Fahrenheit, but we cannot estimate the effect of one holding the other constant—it’d be meaningless. What about cases in which collinearity isn’t perfect but high? Imagine, that the correlation is greater than 0.8. In this case, the coefficients on the collinear regressors “dilute.” Jointly, they may be significant. But each of them (or at least some of them) may be not. Think of income and wealth, or grit and conscientiousness. To some extent, they measure the same. Coefficients become hard to interpret as partial derivatives. If we include variables that measure similar things, we should be explicit about this issue. Whenever possible, the inclusion of regressors must be informed by theory. We should ask ourselves the question, do we truly believe these regressors belong in the regression? ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 42 5.1.4 Dummies and redundancy In many instances, dummy variables result in redundancy. In other words, they show perfect collinearity. That’s not necessarily a problem. Consider the following example. In a questionnaire, you are asked to mark yes or no: Yes No ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ Female Minority College degree Over 65 years of age The sum of these four dummies ranges between 0 and 4. Like switches, they can be turned on (1) or off (0) independently of each other. We can imagine people in each of the 16 possible combinations. These dummies are independent. Now consider a questionnaire that includes the following questions: ๏ ๏ ๏ ๏ ๏ ๏ Yes No ๏ ๏ ๏ ๏ Yes No ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ Yes Female Male ๏ ๏ ๏ White Black Hispanic Other Yes ๏ ๏ ๏ No ๏ ๏ ๏ ๏ ๏ ๏ High school or less Some college College degree No ๏ ๏ ๏ ๏ ๏ ๏ 0 to 30 years of age 31 to 65 years of age Over 65 years of age Notice that you can only mark female or male, and therefore the sum of the first two dummies is always equal to one. The sum of the next four dummies for race/ethnicity is always equal to one. The same can be said about the dummies for educational attainment and age. That’s because within each group those dummies are associated with mutually exclusive categories. Hence, they are not independent. If you are in one category, you must not be in another. They are dependent. Other categories may be nested. The following question asks you for your place of birth. Whenever the dummy for Chicago is 1, the dummies for Illinois and U.S. are also 1, and the dummies for Outside of Chicago, Outside of Illinois and Outside of the U.S. are 0. Clearly, those dummies are not independent. They aren’t perfectly collinear either—their sum isn’t always the same. However, there is redundant information. If you were born in Chicago, then you weren’t born outside of Chicago, Illinois or the U.S. 
Within subsets, some dummies are perfectly collinear. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO ๏ ๏ ๏ ๏ ๏ ๏ Yes No ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ 43 Outside of the U.S. U.S. Illinois Outside of Illinois Chicago Outside of Chicago Consider the alternative version of the same question about your place of birth: Yes No ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ ๏ U.S. Illinois Chicago It should be clear that the information elicited is exactly the same. The second version of the question eliminated all redundancy, but the dummies remain dependent. That’s a result of them being nested. Chicago is in Illinois, and Illinois is in the U.S. Regardless of whether we have dependent or independent dummies, we must pay attention to perfectly collinearity. Remember that when there is an intercept in our model there is regressor ๐ฅ0 equal to 1. Assume ๐ฅ1 and ๐ฅ2 are perfectly collinear dummies. Since ๐ฅ1๐ + ๐ฅ2๐ = 1, then we have that ๐ฅ1๐ + ๐ฅ2๐ = ๐ฅ0๐ . Thus, we cannot estimate the model: ๐ฆ๐ = ๐ฝ0 ๐ฅ0๐ + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + โฏ + ๐ฝ๐ ๐ฅ๐๐ + ๐๐ We must drop either ๐ฅ0 , ๐ฅ1 or ๐ฅ2 from our regression. If we drop ๐ฅ1 , our model becomes: ๐ฆ๐ = ๐พ0 + ๐พ2 ๐ฅ2๐ + ๐๐ Alternatively, by dropping the constant (๐ฅ0 ), our model becomes: ๐ฆ๐ = ๐ฟ1 ๐ฅ1๐ + ๐ฟ2 ๐ฅ2๐ + ๐๐ But, since ๐ฅ1๐ = 1 − ๐ฅ2๐ , we can write as: ๐ฆ๐ = ๐ฟ1 (1 − ๐ฅ2๐ ) + ๐ฟ2 ๐ฅ2๐ + ๐๐ = ๐ฟ1 + (๐ฟ2 − ๐ฟ1 )๐ฅ2๐ + ๐๐ The models above are equivalent. If we look at the conditional expected value of the dependent variable, we get: ๐ธ[๐ฆ|๐ฅ1 = 1, ๐ฅ2 = 0] = ๐พ0 = ๐ฟ1 ๐ธ[๐ฆ|๐ฅ1 = 0, ๐ฅ2 = 1] = ๐พ0 + ๐พ2 = ๐ฟ2 ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 44 When we have perfectly collinear dummies, there are multiple equivalent ways to formulate our model. The results are exactly the same, but they are stated differently. Think of the wage gender gap. Assume our dependent variable is wage. We could include as regressors a constant and a dummy for female. The intercept would tell us the average wage among males, and the coefficient on the dummy would tell us the female minus male gender gap. Alternatively, we could substitute the dummy for female with a dummy for male. In this case, the intercept would tell us the average wage among females, and the coefficient on the dummy would tell us the male minus female gender gap. Lastly, we could exclude the constant and include a dummy for male and a dummy for female. Now the coefficient on the dummies would tell us the average wages for males and females, respectively. The difference would be the gender gap. The information these three models provide is exactly the same, just arranged differently. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 45 5.1.5 Measurement error Frequently, our measurements aren’t accurate. In other words, we have measurement error. Not every kind of measurement error is equally interesting or relevant. If the error is always equal to a constant positive number, then it means we always overstate the true value of our variable by that number. If the constant number is negative, then it means we understate the true value. Those cases aren’t very interesting because measurement error simply shifts the cloud of points up, down or sideways but it doesn’t affect its shape. The most interesting kind of measurement error is the one that doesn’t systematically inflate or deflate our measurements, but it makes them noisy. Sometimes it’s called classical measurement error. 
Height, income, and IQ are examples of variables that can carry this type of noise. What are the effects of measurement error on our estimates? Put simply, it depends. The most important lesson we'll learn is that measurement error in the regressors causes attenuation bias, which means that the regression coefficients are biased towards zero.

Graphically, measurement error in the explanatory variable stretches our cloud of points horizontally. The figure below shows a very simple example. Imagine we start with the cloud of points given by the solid points. The regression in the absence of measurement error is denoted by the solid line. With measurement error in $x$, the cloud would look like the hollow points. Given the height in the cloud, some hollow points would be shifted to the right of the solid points while others would be shifted to the left, but their average location would be given by the solid points. The dashed line denotes the regression in the presence of measurement error. Horizontally stretching the cloud flattens the slope of the regression.

To show this algebraically, assume $y_i = \alpha + \beta x_i + \varepsilon_i$. Instead of $x_i$, we observe $x_i^* = x_i + u_i$, where $cov(x, u) = 0$. This zero covariance means that measurement error (denoted by $u$) isn't associated with $x$ in any systematic way. The expected value of our estimate of $\beta$ is:

$$E[\hat{\beta}] = \frac{cov(x^*, y)}{var(x^*)} = \frac{cov(x + u, y)}{var(x + u)} = \beta \left( \frac{var(x)}{var(x) + var(u)} \right)$$

Notice that the term in parentheses is always in the interval (0, 1) because all the terms inside are positive. That means that with measurement error we expect $\hat{\beta}$ to be somewhere between 0 and $\beta$ (i.e. closer to zero). It's important to note that the sign of the attenuation bias depends on the sign of the coefficient. It's negative when the parameter is positive and vice versa. The t-statistic of our estimate is also biased towards zero, which means we would be less likely to find significance.

A different situation is when we have measurement error in the dependent variable. In this case there is no bias, but we experience a loss of precision. Our cloud is vertically stretched, which results in larger standard errors. The figure below provides a simple example. The solid points show the situation without measurement error. The line represents the regression line in that case. The hollow points constitute the cloud with measurement error in $y$. Some are shifted up and some are shifted down relative to where they should be. Their average vertical position is unaltered. Thus, the regression line is the same as without measurement error. However, it should be borne in mind that, if we took many samples, we could now get different slopes, and therefore the standard error is larger.

The fact that measurement error in the explanatory variable attenuates the estimates is important. It means that, in the absence of measurement error, the estimates would have a greater magnitude and smaller p-values. In other words, if you get significant coefficients in a regression and someone has a hard time believing your results, arguing that there is measurement error (perhaps not with those words), then you can reply that, if there is measurement error, getting rid of it would make your coefficients larger and more significant.
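The attenuation result is easy to see in a simulation. The following sketch (my own construction, not part of the notes) generates data with a known slope of 2, then regresses the outcome on the true regressor and on a noisy version of it; with var(x) = var(u) = 1, the attenuation factor is one half.

```python
# Simulation sketch: attenuation bias from classical measurement error in x.
# All numbers and names are illustrative choices, not from the notes.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 2.0

x = rng.normal(0, 1, n)            # true regressor, var(x) = 1
u = rng.normal(0, 1, n)            # classical measurement error, var(u) = 1
y = 1.0 + beta * x + rng.normal(0, 1, n)

x_star = x + u                     # the noisy version we actually observe

slope_clean = np.polyfit(x, y, 1)[0]       # close to 2.0
slope_noisy = np.polyfit(x_star, y, 1)[0]  # close to 2.0 * var(x)/(var(x)+var(u)) = 1.0
print(slope_clean, slope_noisy)
```

Adding the noise to $y$ instead leaves the slope roughly unchanged but widens its sampling variation, which you can verify by repeating the exercise across many simulated samples.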
ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 47 5.2 Predictive use Before we get started, let’s make a distinction between forecast and prediction. When we forecast, we determine what the future will bring, only conditional on time passing. When we make a prediction, we come up with what we expect ๐ฆ to be, assuming we know ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ . Most businesses rarely make forecasts. They routinely make predictions—some good and some bad. Let’s imagine attendance to a chain of gyms can be modeled as: ๐ฆ๐ = ๐ฝ0 + ๐ฝ1 ๐ฅ1๐ + ๐ฝ2 ๐ฅ2๐ + โฏ + ๐ฝ๐ ๐ฅ๐๐ + ๐๐ Assume the dependent variable is defined as the number of days attended over the course of 12 months after joining the gym. Assume the regressors ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ are observed or reported at the moment of signing up—i.e. before attendance occurs. They include age, body mass index, gender, marital status, educational attainment, etc. We can estimate the ๐ฝ1 , ๐ฝ2 , … , ๐ฝ๐ using all members who signed up in January 2019. We would have 12 months of data for each of them (up to December 2019). Suppose a new member ๐ signs up. We observe ๐ฅ1๐ , ๐ฅ2๐ , … , ๐ฅ๐๐ for her. Our prediction of attendance given her age, body mass index, gender, and so on, is: ๐ฆฬ๐ = ๐ฝฬ0 + ๐ฝฬ1 ๐ฅ1๐ + ๐ฝฬ2 ๐ฅ2๐ + โฏ + ๐ฝฬ๐ ๐ฅ๐๐ In other words, our predicted value or prediction is a fitted value for some values of the regressors. When we try to predict, we pay little or no attention to each regression coefficient or to their significance. We only care about the fit. How can we know if our predictions are good? A higher R-square (our measure of goodness of fit) means a better prediction. Also, a narrower confidence interval around the prediction means more accuracy. The graphs below show examples with different R-squares and different confidence intervals. By definition, greater residuals mean worse predictions. The R-square captures those larger residuals. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 48 In practice, we usually start with two data sets. The first of them is retrospective. It consists of the ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ observed before the fact we care about took place (e.g. gym member characteristics at sign up in January 2019, before attendance occur), and the ๐ฆ observed after that fact (e.g. gym attendance over the course of 2019). The second data set is prospective. We only observe the ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ before the fact (e.g. characteristics for gym members signing up in January 2020 and for which we haven’t observed attendance because it hasn’t occurred yet). We pool the two data sets together. Notice that the dependent variable ๐ฆ is missing for the prospective observations. Then, we run our regression, which will only include the retrospective data. Lastly, we use the fitted model (i.e. ๐ฝฬ1 , ๐ฝฬ2 , … , ๐ฝฬ๐ ) to compute the gym attendance we expect for each new member, given her values of ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ . In practice, this is very easy—it can be done in three lines of code. The important part is understanding the logic. We can compute statistics using the predicted values for the prospective observations (mean, variance, proportion greater than a threshold, etc.). Some examples where regressions are used to predict important variables are: consumer lifetime contribution of newly acquired customers, performance of potential new hires, credit scores, admissions, fraud detection, and consumer behavior in platforms like Netflix, Amazon or Spotify. Can you imagine how? 
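As an illustration of this workflow, the snippet below is a sketch under assumed names and made-up data (not the course's own code): it fits a model on hypothetical retrospective gym data and produces predicted attendance, with a confidence interval, for prospective members.

```python
# Sketch: fit on retrospective data (y observed), predict for prospective data
# (y not yet observed). Data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Hypothetical retrospective data: members who joined in January 2019
retro = pd.DataFrame({"age": rng.integers(18, 65, 500),
                      "bmi": rng.normal(26, 4, 500)})
retro["days_attended"] = (120 - 0.8 * retro["age"] - 1.5 * retro["bmi"]
                          + rng.normal(0, 15, 500)).clip(lower=0)

# Hypothetical prospective data: members who joined in January 2020
prosp = pd.DataFrame({"age": rng.integers(18, 65, 100),
                      "bmi": rng.normal(26, 4, 100)})

fit = smf.ols("days_attended ~ age + bmi", data=retro).fit()

pred = fit.get_prediction(prosp)
prosp["predicted_attendance"] = pred.predicted_mean
ci = np.asarray(pred.conf_int(alpha=0.05))       # 95% CI of each prediction
prosp["ci_low"] = ci[:, 0]
prosp["ci_high"] = ci[:, 1]

print(prosp.head())
print(prosp["predicted_attendance"].mean())      # expected average attendance in the new cohort
```

The retrospective data set is the only one used in the fit; the prospective rows only supply the regressors at which the fitted model is evaluated.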
When we are predicting, it’s important to distinguish between two types of predictions. One of them is within-sample predictions, which is when the values of the prospective ๐ฅ’s fall inside the range of the retrospective ๐ฅ’s. The other type is out-of-sample predictions, which is when the values of prospective ๐ฅ’s fall outside of the range of the retrospective ๐ฅ’s. There isn’t much cause for concern when we make within-sample predictions. However, when we make out-of-sample predictions, our model could be flat out incorrect. To see it, imagine extrapolating any behavior (drinking, dating, working) based on customer age when your retrospective data only includes people between the ages of 15 and 25. What would happen if you try to predict the same behavior for five-year-olds? How about sixty-year-olds? Intuitively, predictions for prospective ๐ฅ’s closer to the average value of the retrospective ๐ฅ’s are more accurate—they have narrower confidence intervals). Keep in mind that confidence intervals look like bow ties. Can you say why? We can use decoys to verify that our predictions make sense. For instance, we can use one subset of the retrospective data to predict another subset of the same data. This type of exercise is what is used for machine learning and artificial intelligence. The idea behind the notion of “training an algorithm” is simply finding better the ๐ฝฬ’s to produce predictions with higher Rsquares of whatever is it that we care about. For instance, think of speech or face recognition. To construct the confidence interval for our predictions, we think of ๐ฆ as a parameter given our ๐ฝ’s and the ๐ฅ’s: ๐ฆ = ๐ฝ0 + ๐ฝ1 ๐ฅ1 + โฏ + ๐ฝ๐ ๐ฅ๐ What would be our estimate of the “parameter” ๐ฆ? The fitted value ๐ฆฬ = ๐ฝฬ0 + ๐ฝฬ1 ๐ฅ1 + โฏ + ๐ฝฬ๐ ๐ฅ๐ . In this context, ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ are fixed numbers. Because of sampling, we think of the ๐ฝฬ1 , ๐ฝฬ2 , … , ๐ฝฬ๐ as random variables. Thus, ๐ฆฬ is also a random variable with a distribution centered at ๐ฆ and some ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 49 standard error derived from the standard error of the ๐ฝฬ’s. For each possible ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ , statistical software can give us ๐ฆฬ and its standard error. To build the confidence interval we use both. For instance, assume we choose a confidence level of 95%. Given some value of ๐ฅ1 , ๐ฅ2 , … , ๐ฅ๐ , the 95% confidence interval of our prediction is defined as: 95% ๐ถ. ๐ผ. = (๐ฆฬ − 1.96 × ๐. ๐ธ. , ๐ฆฬ + 1.96 × ๐. ๐ธ. ) We can look at a graphical example in two dimensions. Given a value of ๐ฅ, we build the confidence interval of our prediction using the prediction itself (๐ฆฬ) plus/minus the t-statistic corresponding to the confidence we want (the value 1.96 corresponds to 95% confidence) multiplied by the standard error of ๐ฆฬ. By construction, the confidence interval is centered at the prediction. For a given ๐ฅ, what is the interpretation of the 95% confidence interval around ๐ฆฬ pictured above? It’s analogous to what we discussed before for the ๐ฝฬ’s. Make sure you can explain this with your own words. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 50 5.3 Prescriptive use Whether in business, government or the not-for-profit sector, the ultimate goal of empirical analysis is to produce recommendations. Based on evidence, we want to know what should be done to improve the bottom line of a company or the results of a policy or program. 
Although the descriptive and predictive uses of regression may shed some light on what could make sense to do, they do not offer solid advice. For instance, think about the prices charged by a company. Are the prices too high or too low relative to the profit maximizing level? You can imagine arguments in favor of price increases, as well as arguments in favor or lower prices. In theory, it is unclear whether the price is too high, too low or just right. In order to be able to make a recommendation we need evidence. How would you determine whether a price increase would result in higher or lower profits? This type of problems goes well beyond pricing decisions. It involves pretty much every decision. Imagine a retail company has 1,000 stores and 500 are upgraded with the intention of improving customer satisfaction and boosting sales. The quarterly report is just in. Average sales in stores without upgrade are 600 (to make the example more appealing, you can imagine sales are expressed in thousands of dollars). Average sales in stores with upgrade are 550. Did the upgrade cause a decrease in sales? What should the board of the company do? Expand the (costly) upgrades to all remaining branches? There are several ways to answer this sort of questions. That’s what we will learn in this section. Before discussing the empirical methods, we must introduce a few concepts. 5.3.1 Causality and the Rubin model We will start with the so-called Rubin Model. Let’s focus on the store-upgrade example. Take the case of the store ๐. Without the upgrade, sales at that store would have been ๐0๐ . With the upgrade, sales would have been ๐1๐ . The causal effect of the upgrade is the difference in sales across the two situations, which is given by ๐1๐ − ๐0๐ . Notice that the causal effect is defined for each store ๐. We are interested in the average causal effect across stores, which we call Average Treatment Effect or ATE, and it is defined in terms of expected values: ๐ธ[๐1๐ − ๐0๐ ] = ๐ธ[๐1๐ ] − ๐ธ[๐0๐ ] However, for each store we only observe one situation (either it was upgraded or it wasn’t), not both. Let ๐ท๐ denote a dummy indicating whether a store was upgraded. Thus, ๐ท๐ = 0 means the store wasn’t upgraded and ๐ท๐ = 1 means the store was upgraded. We only observe: ๐๐ = ๐1๐ ๐ท๐ + ๐0๐ (1 − ๐ท๐ ) ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 51 In words, for stores with ๐ท๐ = 1 we observe ๐1๐ , and for stores with ๐ท๐ = 0 we observe ๐0๐ . We know facts but not the counterfactuals. A counterfactual is what would have happened in an alternative reality—e.g. think where you would be now had you not enrolled at this university. In the table below we explain the difference between what we observe and we don’t observe. Stores can be are divided based on whether they were upgraded or not. The first column corresponds to stores that weren’t upgraded (๐ท๐ = 0). The second column corresponds to stores that were upgraded (๐ท๐ = 1). The third column agglutinates all stores, regardless of upgrade status. We divide the world into two alternative realities for each store. The first row corresponds to a reality without upgrade (๐0๐ ), and the second row correspond to a reality with upgrade (๐1๐ ). It should be clear that we only observe information for two cells of the table. We know the sales without upgrade for stores that weren’t upgraded (600). We also know the sales with upgrade for stores that were upgraded (550). 
We don't observe the counterfactuals, that is, sales with the upgrade for stores that weren't upgraded, and sales without the upgrade for stores that were upgraded. Naturally, we don't know the average across all stores for each row. We don't know the difference across alternative realities for each group of stores either. So, there is a lot we don't know.

                                      $D_i = 0$    $D_i = 1$    All
    Average sales without upgrade     600          ?            ?
    Average sales with upgrade        ?            550          ?
    Difference made by upgrade        ?            ?            ?

However, at least conceptually, we can fill in the table with the correct notions even if we don't observe them. That's what the conditional expectations in the second version of the table represent:

                                      $D_i = 0$                  $D_i = 1$                  All
    Average sales without upgrade     E[Y0i | Di = 0]            E[Y0i | Di = 1]            E[Y0i]
    Average sales with upgrade        E[Y1i | Di = 0]            E[Y1i | Di = 1]            E[Y1i]
    Difference made by upgrade        E[Y1i - Y0i | Di = 0]      E[Y1i - Y0i | Di = 1]      E[Y1i - Y0i]

We can adopt some useful definitions for the formulas in the last row. Those formulas are causal effects. The Average Treatment on the Untreated (ATU) is the average difference the upgrade would make among stores that weren't upgraded (i.e. the causal effect among untreated stores):

$$ATU = E[Y_{1i} - Y_{0i} \mid D_i = 0]$$

The Average Treatment on the Treated (ATT) is the average difference the upgrade would make among stores that were upgraded (i.e. the causal effect among treated stores):

$$ATT = E[Y_{1i} - Y_{0i} \mid D_i = 1]$$

Lastly, as we saw before, the ATE is the average difference the upgrade would make among all stores (i.e. the causal effect for the whole group of stores):

$$ATE = E[Y_{1i} - Y_{0i}]$$

We can also express the ATE as a weighted average of the ATU and the ATT. If we have the same number of stores upgraded (500) and not upgraded (500), then $ATE = \frac{500}{1000}ATU + \frac{500}{1000}ATT$.

Going back to our problem, what do you think is more relevant for the company to know when assessing the upgrades? The ATE, the ATT or the ATU? What is the economic relevance of each of them? As an exercise, imagine situations in which each of them may matter most.

In a naïve comparison (i.e. a simple difference of observed average sales across the two groups of stores) we get:

$$E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$$

How different is that naïve comparison from the ATE, ATU or ATT? To find out, let's add and subtract the counterfactual $E[Y_{0i} \mid D_i = 1]$ (which is the average sales without the upgrade among stores that were upgraded):

$$\text{Naïve comparison} = E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$$
$$= E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 1] + E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$$
$$= ATT + E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$$
$$= ATT + \text{Selection bias in } Y_0$$

The selection bias in $Y_0$ is defined as $E[Y_{0i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$. How would you explain it in your own words? Please try until you come up with a simple version.

Instead, we could add and subtract $E[Y_{1i} \mid D_i = 0]$ (i.e. the average sales with the upgrade among stores that weren't upgraded):

$$\text{Naïve comparison} = E[Y_{1i} \mid D_i = 1] - E[Y_{0i} \mid D_i = 0]$$
$$= E[Y_{1i} \mid D_i = 1] - E[Y_{1i} \mid D_i = 0] + E[Y_{1i} \mid D_i = 0] - E[Y_{0i} \mid D_i = 0]$$
$$= E[Y_{1i} \mid D_i = 1] - E[Y_{1i} \mid D_i = 0] + ATU$$
$$= \text{Selection bias in } Y_1 + ATU$$

The selection bias in $Y_1$ is defined as $E[Y_{1i} \mid D_i = 1] - E[Y_{1i} \mid D_i = 0]$. Explain in your own words how it may be different from the selection bias in $Y_0$. It should be clear that the sign and the magnitude of the selection biases depend on the determinants of the upgrade status. The short simulation sketch below makes these decompositions concrete.
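The following minimal simulation (my own construction, with made-up numbers meant to echo the store example) generates both potential outcomes, so the ATE, ATT, ATU, the naïve comparison, and the selection bias can all be computed and compared directly.

```python
# Simulation sketch: potential outcomes, treatment effects, and selection bias.
# Numbers are made up to echo the store-upgrade example.
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

y0 = rng.normal(600, 50, n)               # sales without the upgrade
effect = rng.normal(20, 10, n)            # store-specific causal effect of the upgrade
y1 = y0 + effect                          # sales with the upgrade

# Suppose the chain upgraded its weaker stores first: treatment depends on y0
d = (y0 < np.quantile(y0, 0.5)).astype(int)

y_obs = d * y1 + (1 - d) * y0             # the only outcome we actually observe

ate = np.mean(y1 - y0)
att = np.mean((y1 - y0)[d == 1])
atu = np.mean((y1 - y0)[d == 0])
naive = y_obs[d == 1].mean() - y_obs[d == 0].mean()
selection_bias_y0 = y0[d == 1].mean() - y0[d == 0].mean()

print(ate, att, atu)                      # all close to 20 in this design
print(naive)                              # negative: upgraded stores look worse
print(att + selection_bias_y0)            # reproduces the naive comparison exactly
```

In this design the upgrade helps every store on average, yet the naïve comparison is negative because of the selection bias, which is exactly the pattern in the 600-versus-550 example above.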
What possible stories can you come up with for the biases to be positive or negative? ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 53 Notice that, if there are no selection biases, then the ATU and the ATT are equal to the naïve comparison, and therefore the ATE is also equal to the naïve comparison. Make sure you can express this powerful idea in terms of the formulas above. There are several lessons stemming from the Rubin model. First, if used to infer causal effects, naïve comparisons can be misleading when there are selection biases. Second, without counterfactuals, we cannot know the ATE, the ATT or the ATU. Third, since people act with purpose, selection is ubiquitous and, in general, correlation isn’t indicative of causation. This framework distinguishes economists from most other professionals—think about news reports on drinking wine as a healthy habit or the effects of doing yoga on productivity at work. Keep in mind that selection may occur in unobservable traits, such as motivation, perceptions, grit, opinions, etc. As a general rule, we cannot be sure we control for selection simply by adding more regressors in our analysis. Let’s look at an example about educational attainment and earnings. In this case, the treatment is “getting a college degree.” When we compare earnings of college graduates versus earnings of people without a college degree (i.e. non-college graduates), we only observe the figures in black in the table below. We don’t know the counterfactuals, e.g. how much would non-college graduates earn had they gotten a college degree. Obviously, there may be selection into college attendance. Imagine we were given believable estimates of the counterfactuals in red. Based on those estimates, we can compute the causal effect of attending college on earnings. Given the figures in the table, what could you say about selection biases? Can you compute them? You surely can. Selection bias in ๐0 is 30,000, whereas selection bias in ๐1 is 39,000. Make sure you know how to interpret the entire table. Non-college College graduates (2/3) graduates (1/3) Without a college degree 45,000 75,000 55,000 With a college degree 60,000 99,000 73,000 Difference (with minus without) 15,000 24,000 18,000 Annual earnings at age 40 … All We don’t have to restrict ourselves to binary comparisons. Let’s look at another example with three alternatives. In this case, we compare earnings of college graduates who attended different higher education institutions. One could argue those institutions may have different effects on the earnings of their graduates. However, students purposely seek admission only to some institutions, and admissions officers purposely reject some applicants and admit others. Based on those purposive behaviors, we expect some selection biases. Instead of presenting amounts of money, the table below simply presents placeholders. The ones in black (A, E and I) are observed. The rest (in blue) aren’t. A comparison of I vs E, or I vs A may not be informative of the difference it makes to attend one university instead of the other. Appropriate comparisons require estimating counterfactuals (the letters in blue). ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO Earnings if attended… 54 Institution attended Chicago State U. of I. at Chicago U. of Chicago Chicago State A B C U. of Illinois at Chicago D E F U. 
of Chicago G H I Before we proceed, let’s think of a last example also related to educational attainment. In this case, we look at years of schooling, which we can think of as a continuous variable. People who attain more years of schooling on average have higher earnings. The graph below shows an example of a cloud of points representing observed annual earnings and for people with different levels of schooling. We could run a regression of earnings (๐ฆ) on years of schooling (๐ฅ) in a model such as ๐ฆ๐ = ๐ผ + ๐ฝ๐ฅ๐ + ๐๐ . Our regression would produce a slope of $12,000. We may be tempted to conclude that, on average, attending college increases annual earnings by $48,000 (4 × $12,000). In light of what we’ve discussed, do you agree with that conclusion? Is $12,000 a valid estimate of the causal effect of a year of schooling? Our estimate ๐ฝฬ is naïve. In general, it shouldn’t be interpreted as the causal effect of years of schooling on earnings because there may be selection into years of schooling attained. Perhaps people who attain more years of schooling would make more money than those who attained fewer years of schooling even if everyone had the same educational attainment. Think of the appropriate counterfactuals. The graph below shows two examples. The red triangles show counterfactual earnings across different levels of schooling for people who in fact attained twelve years of schooling. Obviously, factual and counterfactual earnings coincide for twelve years of schooling. The green diamonds show counterfactual earnings for people who attained sixteen ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 55 years of schooling. In this case, factual and counterfactual earnings coincide for sixteen years of schooling. In this example, given equal schooling, people who attained sixteen years of schooling (green markers) on average would make more money than people who attained twelve years (red markers). That means there is a positive selection bias. Those who would earn more ceteris paribus are also the ones who end up with more schooling. If we could observe the counterfactuals in the graph above, we would run a regression and adequately estimate the causal effect of one year of schooling on earnings. In that example, the slope would be $6,000, which is half the naïve estimate (the other half is the selection bias). This is just one example in which I assumed positive selection. As you can imagine, there are many theoretical possibilities for the selection biases. Try go over a few examples with different signs. The example above illustrates that, as a general rule, we shouldn’t interpret regression coefficients as estimates of causal effects. If we want to do that, we need good reasons why we should believe there are no selection biases. The group of techniques used to estimate counterfactuals and gauge causal effects is referred to as impact evaluation. Next, we will learn three ways to estimate causal effects that avoid selection and other biases. ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 56 5.3.2 Randomized-Control Trials Organizations want to find what works best for them and do it. For instance, they may be interested in contrasting the status quo of a program or policy versus a new idea or a set of different new ideas. When trying to prescribe what to do, it’s helpful to think in terms of a medical analogy. There is a diagnosis, i.e. the situation we believe is wrong or that could be improved. 
There is also a treatment, i.e. the action that will cause the correction or improvement. It must be something we can manipulate—academics may be interested in non-manipulated causes but that’s not the case among practitioners. Lastly, there is an outcome or metric of interest, i.e. the variable where success would be observed. In a nutshell, we attempt to estimate the causal effect of the treatment on the outcome of interest, and decide whether the treatment should be introduced, stopped or continued. The key concept here is causality. There is no statistical test for causality. It is a logical—not statistical—concept. The methods used for estimating causal effects are usually referred to as impact evaluation. We will study three of the most common approaches used in impact evaluation. Please look at the World Bank’s Impact Evaluation in Practice, which is freely available online here. Ideally, we would like to conduct an experiment or randomized-control trial (RCT), the gold standard in impact evaluation. To see it in practice, imagine a company that delivers a newsletter by email to a list of one million subscribers. The newsletter contains offers and is used as a sales tool. A metric of success is email opening—it leads to sales. Someone in the company detects an area of opportunity. The subject line in those emails is “impersonal and unappealing.” One idea is to include the recipient’s given name in the subject line (e.g. Pablo, great deals just for you!). Other people think such strategy would become ineffective after a while, when the recipients get used to seeing their name in the subject line. Someone suggests alternating subjects at different frequencies. We can imagine ๐ different treatments in addition to the regular subject line (the control group that represents the status quo). Treatment 1 would include the recipient’s name in every newsletter. Treatment 2 would include the recipient’s name every other newsletter. Treatment 3 would include the recipient’s name every three newsletters, and so on. In terms of ๐๐ in the Rubin Model, we would have 1,2, … , ๐ alternative treatments and therefore ๐ + 1 possible outcomes ๐0๐ , ๐1๐ , ๐2๐ , … , ๐๐๐ . What steps should we take? Start with the list of all email recipients of the newsletter. Choose a random sample for the experiment. Within that sample, randomize who gets which of the treatments and who doesn’t get any. In other words, using a lottery we create the ๐ treatment groups and a control group. Different treatments are called arms of treatment. Apply the treatment and measure what happens after time elapses (say, three months later). To determine what worked best, compare the metrics of interest (the email opening rate). You could do this with a regression model: ๐ฆ๐ = ๐ฝ0 + ๐ฝ1 ๐ฅ๐1 + ๐ฝ2 ๐ฅ๐2 + โฏ + +๐ฝ๐ ๐ฅ๐๐ + ๐๐ ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 57 where ๐ฅ๐1 is a dummy indicating whether recipient ๐ received treatment 1, ๐ฅ๐2 is a dummy indicating whether recipient ๐ received treatment 2, and so on. Since each treated recipient receives one treatment only, then ∑๐๐=1 ๐ฅ๐๐ = 1 among treated patients. Among non-treated patients we have ∑๐๐=1 ๐ฅ๐๐ = 0. Notice that: ๐ธ[๐ฆ๐ |๐ฅ๐1 = ๐ฅ๐2 = โฏ = ๐ฅ๐๐ = 0] = ๐ฝฬ0 ๐ธ[๐ฆ๐ |๐ฅ๐1 = 1] = ๐ฝฬ0 + ๐ฝฬ1 โฎ ๐ธ[๐ฆ๐ |๐ฅ๐๐ = 1] = ๐ฝฬ0 + ๐ฝฬ๐ In words, ๐ฝฬ0 is the average opening rate in the control group, and ๐ฝฬ0 + ๐ฝฬ๐ is the average opening rate in the arm of treatment ๐, where ๐ = 1,2, . . , ๐. 
Our estimate of the causal effect of receiving treatment 1 is given by the difference with respect to the control group: ๐ธ[๐ฆ๐ |๐ฅ๐1 = 1] − ๐ธ[๐ฆ๐ |๐ฅ๐1 = ๐ฅ๐2 = โฏ = ๐ฅ๐๐ = 0] = (๐ฝฬ0 + ๐ฝฬ1 ) − ๐ฝฬ0 = ๐ฝฬ1 In a similar way, we get estimates of the causal effect of the other treatments. We can also compare causal effects across treatments. For instance, we may be interested in whether the causal effect of treatment 3 is greater than the causal effect of treatment 2. Our estimate of the difference between the causal effects of those two treatments is: ๐ธ[๐ฆ๐ |๐ฅ๐3 = 1] − ๐ธ[๐ฆ๐ |๐ฅ๐2 = 1] = (๐ฝฬ0 + ๐ฝฬ3 ) − (๐ฝฬ0 + ๐ฝฬ2 ) = ๐ฝฬ3 − ๐ฝฬ2 We already know that our estimates ๐ฝฬ0 , ๐ฝฬ1 , … , ๐ฝฬ๐ have standard errors associated to them. Therefore, we can measure significance, build confidence intervals, and test (joint) hypotheses. The crucial part is that we can causally attribute differences in the metric of interest (the opening rate) to differences in the treatment received (subject lines). Once we determine which treatment works best, we can implement it in the whole mailing list. To make perfectly clear why an RCT is the ideal way to measure causal effects, let’s revisit our store upgrade example. Assume an RCT was performed (i.e. selection of stores for the upgrade was random). Thus, since “coins were tossed” to decide which stores were upgraded, we have that: ๐ธ[๐0๐ |๐ท๐ = 1] = ๐ธ[๐0๐ |๐ท๐ = 0] ๐ธ[๐1๐ |๐ท๐ = 1] = ๐ธ[๐1๐ |๐ท๐ = 0] ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 58 In words, we expect no selection bias of any kind (there may still be some difference due to the luck of the draw, but the chances are tiny). Thus, the naïve comparison in an RCT produces the ATE, which in turn is equal to the ATU and ATT: ๐ด๐๐ธ = ๐ด๐๐ = ๐ด๐๐ = ๐ธ[๐1๐ |๐ท๐ = 1] − ๐ธ[๐0๐ |๐ท๐ = 0] Why are experiments so useful? The source of identification (the variation that allows us to make causal claims) is the randomization of allocation into the treatment or control groups. That means the only systematic difference across treatment and control groups (or between different arms) is the treatment. Obviously, there is no selection bias. But it’s just as important to make clear that there is no omitted variable bias either. That’s because the treatment status is not related to any other determinant of the metric of interest. Last but not least, in experiments we usually control sample size and therefore we have a handle on the precision of our estimates (remember that sample size mechanically affects standard errors, which in turn determine significance and confidence intervals). At this point it’s crucial to establish a difference between two concepts. Statistical significance and economic relevance. In colloquial terms, statistical significance means our estimate is different from zero, and it doesn’t look like that’s the result of chance—we have a small p-value, regardless of its magnitude. In contrast, economic relevance means the magnitude of our estimate indicates the causal effect makes an important difference—it has a considerable magnitude, regardless of the p-value. They may or may not come together. Keep in mind they are separate concepts. To see the difference, think of the following statements. Ceteris paribus, sample size can always change the statistical significance of any (non-zero) estimate, even if it’s small and therefore economically irrelevant. 
At the same time, ceteris paribus, a very high opportunity cost can always make any estimate economically irrelevant. When designing an experiment, we must consider both concepts to determine our sample size. A sample size that is too large may result in a partial waste, but a sample that is too small could result in a total waste. Let explain both cases. Having a sample that is too large is a concern because of costs. Think of out-of-pocket expenses (e.g. data gathering, conducting surveys on the field, paying contractors to clean data, acquiring the right hardware and software) as well as non-monetary costs (some organizations don’t like the idea of fiddling with their operation at a large scale or with many of their customers). Given those costs, having an unnecessarily large sample is a partial waste. We could do just as well with a smaller, less costly sample. However, a sample size that is too small is worse. We may not be able to tell whether whatever estimate we get is the result of luck or not (our standard errors would be large). That would be a total waste. How do we decide ex ante on the right sample size for an RCT? We need information on how much the metric of interest varies. Intuitively, if the outcome of interest varies very little, then a causal effect of a given magnitude is easier to identify than when the metric of interest varies a ECON 11020 INTRODUCTION TO ECONOMETRICS, PABLO PEÑA, FALL 2020, UNIVERSITY OF CHICAGO 59 lot. To illustrate this, let’s revisit the email opening example. Assume we know the standard deviation of the number of newsletters opened is ๐๐ . Assume also that the CEO of the company considers appropriate a confidence level of 95%. Let’s focus on ๐ฝฬ1 , which is the causal effect of treatment 1 relative to the control. If it’s close to zero, it means that treatment doesn’t work—it doesn’t improve the opening rate. If it’s negative, it means it performs worse than the status quo. By assuming ๐ฝ1 = 0, we have a distribution of ๐ฝฬ1 centered at zero with a standard deviation that is an increasing function of the ratio ๐๐ ⁄√๐, where ๐ is the sample size. A larger ๐ means a narrower distribution of ๐ฝฬ1 . Thus, given the same estimate, larger samples would place that estimate in the rejection region of the hypothesis ๐ฝ1 = 0, and smaller samples would place it in the no-rejection region. The graph below illustrates this idea. The hypothesis tested is the same (๐ฝ1 = 0) and the estimate is also the same (๐ฝฬ1 = 1.5). The difference in the distributions comes from different hypothetical sample sizes calculated a priori. The orange distribution comes from a sample four times the size of the sample of the blue distribution. In one case, the estimate would fall in the rejection region (orange distribution, with larger sample size) and in the other it would fall in the no-rejection region (blue distribution, with smaller sample size). We know that a greater sample size results in a larger rejection region. At the same time, not all magnitudes of causal effects are economically relevant. Why waste resources detecting the magnitudes that are irrelevant? Based on economic criteria, we can define a priori the minimum detectable effect or MDE we are interested in. Using ๐๐ , we can compute the sample size consistent with such MDE. For instance, imagine treatments are costly. We are only interested in effects that surpass their costs, which means they are above a threshold ๐ต > 0. We can use that threshold to reverse engineer the appropriate sample size. 
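Here is a hedged sketch of that reverse-engineering for an experiment with one control group and one treatment group of equal size. It assumes a two-sided test at 95% confidence and, additionally, the conventional 80% power; the power convention is my assumption rather than something stated in the notes, although it appears consistent with the figures in the table that follows.

```python
# Sketch (with stated assumptions): total sample size so that an effect of size B
# is the minimum detectable effect in a two-group comparison of means.
import math
from scipy.stats import norm

def total_sample_size(mde, sigma, confidence=0.95, power=0.80):
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # about 1.96 for 95% confidence
    z_power = norm.ppf(power)                      # about 0.84 for 80% power
    n_per_group = 2 * ((z_alpha + z_power) * sigma / mde) ** 2
    return 2 * math.ceil(n_per_group)              # control group + treatment group

print(total_sample_size(mde=0.4, sigma=1))   # roughly 200
print(total_sample_size(mde=0.4, sigma=2))   # roughly 788
```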
We start with the number B, which is the minimum magnitude that is relevant to us. We then calculate the "right" sample size given σ_y and a confidence level, so that the MDE is equal to B. The table below shows different sample sizes as a function of the desired MDE, given two possible values of the standard deviation of the metric of interest (σ_y = 1 and σ_y = 2). We use 95% confidence, but we could pick any level we want. We assume a control group and just one treatment group (of equal size). The sample size shown includes both groups (control and treatment). The principle is simple: we find the sample size that, a priori, would lead us to reject the hypothesis β = 0 if we had β̂ = B. Keep in mind that the sample size enters the calculation through the standard error of β̂. As an example, assume we want an MDE of 0.4. With a standard deviation of 1, the sample size should be 200. A larger sample would allow us to detect smaller causal effects, but those effects would be economically irrelevant. If the standard deviation is 2, then we need a sample size of 788 to achieve the same MDE. Statistical software does this for us very easily.

Minimum detectable effect     Sample size when     Sample size when
(at 95% confidence)           σ_y = 1.00           σ_y = 2.00
0.1                           3,142                12,562
0.2                           788                  3,142
0.3                           352                  1,398
0.4                           200                  788
0.5                           128                  506
0.6                           90                   352
0.7                           68                   260
0.8                           52                   200
0.9                           42                   158
1.0                           34                   128

Experiments may seem like a silver bullet, but they face important challenges. Some of those challenges are technical (e.g., subjects knowing they are part of an experiment), ethical (e.g., when the control group is excluded from a potentially beneficial treatment), or legal (e.g., price discrimination). Sometimes it's logistically impossible to run an experiment (e.g., when there is no way to exclude the control group from the treatment). In those circumstances we must rely on quasi-experiments, which are situations that, to some extent, resemble an experiment. The most common quasi-experimental approaches are Regression Discontinuity Design and Difference-in-Differences.

5.3.3 Regression Discontinuity Design

In some situations, a treatment is assigned based on whether a variable (known as the running variable or the assignment variable) passes a threshold (a cutoff point). If the treatment has a causal effect, then we expect a discontinuity (a "jump") in the outcome of interest at the cutoff. As an example, think of the problem of determining whether attending a selective enrollment school makes a difference in earnings in adulthood. Suppose all applicants must take an admissions test. We denote the score on that test by the variable x. Only applicants with a score equal to or above C are admitted. In other words, applicant i is admitted if and only if x_i ≥ C. All applicants who don't make the cutoff are rejected and attend a non-selective school (which is believed to be worse). A few years later, when those subjects are thirty years old, we observe their annual earnings, which we denote by y. If attending the selective school has an effect on earnings down the road, we expect greater average earnings among its graduates in comparison with the graduates of the non-selective option. However, a simple comparison of averages could be misleading.
After all, the fact that students are screened for admission into the selective option creates a selection bias. In other words, even if the causal effect we are interested in is zero, we may find a difference between the earnings of graduates of the two schools. To avoid the selection bias, we can look at applicants with test scores in the vicinity of the cutoff C. We could argue that, if we look close enough to the cutoff, whether an applicant was admitted or not is a matter of luck. In fact, when we zoom in, it looks like a small experiment where, solely by chance, some students got a few more points than others on the admissions test, but on average they are similar in every other respect. Thus, any difference in earnings between applicants right below the cutoff and applicants right above the cutoff must be the result of attending different schools.

The following graph illustrates this point. The cloud of points represents earnings at age thirty and scores on the admissions test. The points to the left of the cutoff C correspond to applicants who attended the non-selective school, whereas the points to the right correspond to applicants who attended the selective school. Since earnings in adulthood and academic performance are usually related, we expect the cloud to show a positive trend. But that trend should be smooth, without jumps. If there is a jump at the cutoff, then we can attribute the difference in earnings to the difference in schools.

We can fit a model with a jump at C. Let T be a dummy such that T_i = 0 if x_i < C, and T_i = 1 if x_i ≥ C. In words, T represents the treatment (defined as attending the selective school instead of the non-selective school). Our regression model is y_i = α + β x_i + γ T_i + ε_i. The coefficient γ̂ is our estimate of the causal effect at the discontinuity created by the cutoff C. This setup is called Regression Discontinuity Design, or RDD.

RDDs may come in different forms. For instance, we may want to use a polynomial, or even different polynomials on each side of the cutoff. If we truly have a discontinuity, we expect our regression to catch it.
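Here is a minimal simulated sketch of that basic specification, using invented numbers for the admissions example (the cutoff, the earnings process, and the size of the jump are all made up; numpy and statsmodels are assumed to be available).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, C = 2000, 70                               # hypothetical applicants and cutoff score
x = rng.uniform(40, 100, n)                   # admissions test score (running variable)
T = (x >= C).astype(float)                    # 1 if admitted to the selective school

# Hypothetical earnings at thirty: a smooth trend in the score plus a jump of 5 at the cutoff
y = 20 + 0.3 * x + 5 * T + rng.normal(0, 4, n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x, T]))).fit()
print(fit.params)          # roughly [20, 0.3, 5]; the last entry is gamma_hat, the jump at the cutoff
print(fit.conf_int()[2])   # a confidence interval for the jump
```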
To push this idea further, let's look at another example. Suppose a company is deploying a training program for its workforce. It can only afford to train 600 of its 1,500 workers. To decide who is trained and who isn't, the company gives priority to its younger workers. All 1,500 workers are sorted according to their age in months, and the youngest 600 are sent to training. One year later, the company measures the productivity of all 1,500 workers. It wants to determine whether the training program made a difference in terms of worker performance. In this case, our running variable x is age in months and our metric of interest y is worker performance a year after the training program took place. The treatment was given only to the 600 youngest workers. Let's suppose that the oldest treated worker was A months old at the moment of the selection. Then we define T as a dummy such that T_i = 1 if x_i ≤ A, and T_i = 0 if x_i > A. The graph below shows the cloud of points in terms of performance and age of the workers. Since it doesn't look linear, we introduce in our regression a quadratic polynomial on each side of the cutoff. Our model would be:

y_i = β_0 + β_1 x_i + β_2 x_i² + T_i (γ_0 + γ_1 x_i + γ_2 x_i²) + ε_i
    = β_0 + β_1 x_i + β_2 x_i² + γ_0 T_i + γ_1 T_i x_i + γ_2 T_i x_i² + ε_i
    = (β_0 + γ_0 T_i) + (β_1 + γ_1 T_i) x_i + (β_2 + γ_2 T_i) x_i² + ε_i

Our estimate of the causal effect is the jump at the discontinuity A, which in this specification equals γ̂_0 + γ̂_1 A + γ̂_2 A² (if we first re-center the running variable at the cutoff, replacing x_i with x_i − A, the jump is simply γ̂_0). The jump is positive in the graph above. Notice that the slope of our fitted model can be different on each side of the discontinuity. At the cutoff, the slope approaching A from the right is β̂_1 + 2β̂_2 A, whereas approaching it from the left it is β̂_1 + γ̂_1 + 2(β̂_2 + γ̂_2)A. If γ̂_1 ≠ 0 or γ̂_2 ≠ 0, the slopes at the discontinuity differ.

By now it should be clear that we can use polynomials in our RDD. We can also add covariates and interactions. However, we must always verify compliance with the cutoff: we must make sure there are no signs of manipulation or cheating of the running variable. The interpretation of the magnitude of our estimate of the causal effect is valid close to the discontinuity, not far from it. Please explain in your own words why.

When using an RDD, it's inevitable to run into questions about whether we should limit our analysis to a small vicinity around the cutoff, and whether we should use linear, quadratic, cubic, or higher-order polynomials to control for the smooth trend. There is no rule of thumb to determine what should be done. Instead of choosing one particular model, it's convenient to try different combinations of vicinities and trend controls as robustness checks. Always remember that, when we use an RDD, our estimates are informative of causal effects close to the cutoff, and their validity hinges on compliance with the cutoff.
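A minimal simulated version of the interacted specification above, applied to the training example, might look as follows. The ages, the curvature, and the size of the jump are invented, and the running variable is re-centered at the cutoff so that the jump is simply the coefficient on the treatment dummy, as noted above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, A = 1500, 420                               # hypothetical workers and cutoff age in months
age = rng.uniform(240, 720, n)                 # age in months (running variable)
T = (age <= A).astype(float)                   # 1 if sent to training (the youngest workers)
x = age - A                                    # re-centered running variable

# Hypothetical performance: a curved age profile plus a jump of 3 at the cutoff for the trained
y = 60 - 0.02 * x - 0.00005 * x**2 + T * (3 + 0.005 * x) + rng.normal(0, 2, n)

# Regressors: x, x^2, T, T*x, T*x^2 (a quadratic polynomial on each side of the cutoff)
X = sm.add_constant(np.column_stack([x, x**2, T, T * x, T * x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params[3])                           # gamma0_hat, the estimated jump at the cutoff, roughly 3
```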
5.3.4 Difference-in-Differences

In many instances, we observe the outcome of interest across periods for a control group and a treatment group. If we assume that, in the absence of the treatment, the trend would be the same across the two groups, we can estimate the causal effect of the treatment. This assumption can be referred to as parallelism, because it means we would observe parallel trends in the outcome of interest for the control and treatment groups if there were no treatment. In its simplest form, what we observe can be described by the following table. The value inside each cell represents the observed average of the outcome we care about. In other words, the table depicts facts.

Observed averages     Before treatment     After treatment     Difference
Not treated           A                    C                   C − A
Treated               B                    D                   D − B

To determine the effect of the treatment, we must create a counterfactual, which in this case is the average we would observe in the treated group in the absence of the treatment. When we apply a Difference-in-Differences (Diff-in-Diff) approach, we assume a parallel trend over time across groups. If the average grew from A to C among the non-treated and we assume a parallel trend among the treated, then without the treatment the average among the treated would have grown by C − A. Since the starting point is B, the ending point would be B + (C − A). That's the counterfactual we are looking for. The estimate of the causal effect of the treatment is the difference between the observed average, D, and the counterfactual, B + (C − A):

Treatment effect = D − [B + (C − A)]

If we rearrange the expression, we get a more intuitive expression, a double difference:

Treatment effect = (D − B) − (C − A)

The first term on the right-hand side is the observed difference across periods among the treated units. The second term is the observed difference across periods among the untreated units. Our estimate of the treatment effect is the difference between the two differences, hence the name of the method. Notice that the idea is general and doesn't depend on the order of the differences:

Treatment effect = (D − B) − (C − A) = (D − C) − (B − A) = D − [C + (B − A)]

Implicitly, we are assuming the counterfactual is C + (B − A). In words, we could also say that the counterfactual is built as the observed average after the treatment among the non-treated, plus the difference across groups before the treatment took place. That's just another way to interpret the same assumption of parallel trends.

Let's look at a graphic version of Diff-in-Diff. The horizontal axis represents the two periods. The two solid dots denote observed averages. If we assume parallel trends absent the treatment, then the change we would expect among the treated in the absence of the treatment is C − A. To preserve parallelism, we add that amount to the starting point for the treated, which is B. The difference between D and the counterfactual average B + (C − A) is our estimate of the treatment effect.

The Diff-in-Diff approach is very intuitive. How do we implement it in the regression context? We start with two dummies. The first dummy, T^T_i, denotes whether an observation belongs to the treated group (T^T_i = 1) or not (T^T_i = 0). The second dummy, T^A_i, indicates whether an observation corresponds to the period before (T^A_i = 0) or after (T^A_i = 1) the treatment. Their interaction, T^T_i T^A_i, indicates the situation where the treated group has been treated. The outcome of interest can be expressed as:

y_i = α + β T^T_i + γ T^A_i + δ T^T_i T^A_i + ε_i

By substituting the four possible combinations of T^T_i and T^A_i, we can clearly see the correspondence between the coefficients in the regression and the observed averages.

Facts: observed averages     Before treatment (T^A = 0)     After treatment (T^A = 1)
Not treated (T^T = 0)        A = α                          C = α + γ
Treated (T^T = 1)            B = α + β                      D = α + β + γ + δ

Let's make sense of the above equalities. The coefficient α is catching the pre-treatment level among the non-treated. The coefficient β is catching the difference between treated and non-treated in the absence of the treatment (bear in mind that pre-treatment averages don't need to be the same). The coefficient γ is catching the trend among the non-treated in the absence of treatment. Lastly, the coefficient δ is catching the trend among the treated that is above (or below) the trend among the untreated, and that's our estimate of the treatment effect. In other words, our Diff-in-Diff estimate of the treatment effect is the coefficient on the interaction between the dummies:

Treatment effect = (D − B) − (C − A)
                 = ([α + β + γ + δ] − [α + β]) − ([α + γ] − [α])
                 = (γ + δ) − (γ)
                 = δ

If we estimate the regression y_i = α + β T^T_i + γ T^A_i + δ T^T_i T^A_i + ε_i, the equalities between the regression coefficients and the averages displayed in the table above will necessarily hold.
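Here is a minimal simulated sketch of this two-dummy regression; the group sizes, the coefficient values, and the noise are all invented. It also recomputes the double difference directly from the four cell averages to show that, without additional controls, the two numbers coincide.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 4000
treated = rng.integers(0, 2, n).astype(float)   # T^T: 1 if in the treated group
after = rng.integers(0, 2, n).astype(float)     # T^A: 1 if observed after the treatment
# Hypothetical values: alpha = 10, beta = 2 (group gap), gamma = 1 (common trend), delta = 3 (effect)
y = 10 + 2 * treated + 1 * after + 3 * treated * after + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([treated, after, treated * after]))
fit = sm.OLS(y, X).fit()
print(fit.params)    # roughly [10, 2, 1, 3]; the last entry is delta_hat, the Diff-in-Diff estimate

# The same number from the table of cell averages: (D - B) - (C - A)
A_ = y[(treated == 0) & (after == 0)].mean()
B_ = y[(treated == 1) & (after == 0)].mean()
C_ = y[(treated == 0) & (after == 1)].mean()
D_ = y[(treated == 1) & (after == 1)].mean()
print((D_ - B_) - (C_ - A_))   # matches delta_hat (up to floating-point precision)
```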
However, if we include other regressors as controls, the equalities will not hold in general, because our regression coefficients would then reflect averages adjusted for other factors. The interpretation is similar, but the values wouldn't be identical. It's important to note that we run regressions, rather than just building a table of observed averages, for two reasons. First, a regression allows us to control for other variables (we already discussed the benefits of this). Second, a regression allows us to make statements about significance and to test hypotheses. Anyone can compute a table like the one above, but it takes a good understanding of econometrics to interpret it in a way that is helpful for making decisions.

Let's consider a practical example. A company that owns a chain of retail stores has upgraded a few of them, located in the urban area of Chicago. The rest, which are located in suburban areas, weren't upgraded. With data from two years and the two types of locations, you can estimate a causal effect. Assume the treatment took place at the end of 2018. Thus, you can consider 2018 the pre-treatment period and 2019 the post-treatment period.

          Chicago stores
Year      Suburban     Urban
2018      A            B
2019      C            D

We could apply the regression model above and estimate δ. It should be obvious that our assumption of parallel behavior may be off. Do we really expect the same trends in urban and suburban locations? It's a valid criticism, but at this point there isn't much we can do to address it. Now, imagine the company also has stores in the metropolitan area of Milwaukee. The good news is that Milwaukee hasn't been reached by the upgrade program. Could we use that additional information? Yes. Milwaukee may provide information on whether sales trends in urban stores are parallel to sales trends in suburban stores.

          Milwaukee stores        Chicago stores
Year      Suburban     Urban      Suburban     Urban
2018      E            F          A            B
2019      G            H          C            D

In Milwaukee, the trend in average urban sales is H − F, whereas the trend in average suburban sales is G − E. None of those stores was upgraded. Thus, we can measure the difference in trends in the absence of treatment (something we cannot do for the Chicago stores). The difference in trends between urban and suburban stores absent the treatment is (H − F) − (G − E). This is a crucial piece of information. If this difference in trends is nonzero, then our Diff-in-Diff estimate for Chicago based on the formula (D − B) − (C − A) may be off. A natural idea is to subtract the Milwaukee difference in trends from the Diff-in-Diff estimate for Chicago:

Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]

Intuitively, this is a triple difference. It is the difference between two Diff-in-Diff estimates. One of them has the treatment, and the other doesn't; the latter is our decoy. The decoy allows us to account for the difference in trends. The assumption of parallelism was relaxed a little. It's still there, but in a subtler way. Notice that, if the trends are truly parallel between urban and suburban stores in Milwaukee, then (H − F) − (G − E) = 0 and the triple-difference estimate would be identical to the Diff-in-Diff estimate. How do we implement the triple difference? We create a new dummy, denoted by T^C_i, indicating Chicago stores (T^C_i = 1) or Milwaukee stores (T^C_i = 0).
We interact this dummy with our previous model and add new coefficients (knowing the Greek alphabet comes in handy):

y_i = θ + κ T^T_i + λ T^A_i + μ T^T_i T^A_i + T^C_i × (ν + π T^T_i + ρ T^A_i + τ T^T_i T^A_i) + ε_i

If we rearrange the expression, we get:

y_i = θ + κ T^T_i + λ T^A_i + μ T^T_i T^A_i + ν T^C_i + π T^T_i T^C_i + ρ T^A_i T^C_i + τ T^T_i T^A_i T^C_i + ε_i

If we go back to the table format and substitute all dummies by their respective values (0 or 1), we obtain the equivalence for the average in each cell.

                   Milwaukee stores (T^C = 0)              Chicago stores (T^C = 1)
Year               Suburban (T^T = 0)   Urban (T^T = 1)    Suburban (T^T = 0)   Urban (T^T = 1)
2018 (T^A = 0)     θ                    θ + κ              θ + ν                θ + κ + ν + π
2019 (T^A = 1)     θ + λ                θ + κ + λ + μ      θ + λ + ν + ρ        θ + κ + λ + μ + ν + π + ρ + τ

In this case, our triple-difference estimate of the treatment effect is:

Treatment effect = [(D − B) − (C − A)] − [(H − F) − (G − E)]
                 = [(λ + μ + ρ + τ) − (λ + ρ)] − [(λ + μ) − (λ)]
                 = (μ + τ) − (μ)
                 = τ

In words, the triple-difference estimate of the causal effect is the coefficient on the interaction of the three dummies, i.e., on T^T_i T^A_i T^C_i.

Now, someone may question the validity of using Milwaukee as a comparison group for Chicago. Some may argue that urban-versus-suburban trends in one city may not be parallel to those in another city. Can we do anything else? It all depends on the availability of data. Imagine that we have data from 2016 and 2017 for both cities.

          Milwaukee stores        Chicago stores
Year      Suburban     Urban      Suburban     Urban
2016      I            J          M            N
2017      K            L          O            P

The analogue of our triple-difference estimate for this decoy period 2016-2017 is:

[(P − N) − (O − M)] − [(L − J) − (K − I)]

What does it mean? It's the gap between Chicago and Milwaukee in the urban-suburban difference in trends before the treatment took place. We can estimate the causal effect using a quadruple difference, that is, the difference between triple differences:

Treatment effect = {[(D − B) − (C − A)] − [(H − F) − (G − E)]} − {[(P − N) − (O − M)] − [(L − J) − (K − I)]}

Obviously, we can write the regression model equivalent to the expression above by simply doubling the terms in our previous model. The logic is exactly the same as before. In cases like this, writing the equation is more complicated than running the actual regression in Stata. As an exercise, write the equation for the quadruple difference. Every time we take an additional difference, we are purging an additional trend and making our estimates more believable. Think of having more and more sophisticated decoys with every additional difference.

It's important to note that the Diff-in-Diff approach doesn't strictly require information across periods, even though periods are its most natural context. We can apply the same type of estimation using cross-sectional data, i.e., data for one period alone. Imagine you have the same problem as before, but you only observe data for 2019. What could you do? Would you still be able to compute differences? The answer is affirmative. You'd go back to the simpler double-difference model using Milwaukee and Chicago (reusing the letters, now with new meanings):

              Type of stores
Location      Suburban     Urban
Milwaukee     A            B
Chicago       C            D

As before, we can think of the difference between D and B + (C − A) as our estimate of the treatment effect. How believable our estimates are hinges on the assumption of parallelism.
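A minimal simulated sketch of the triple-difference regression for the store example could look like this; the dummies, the effect sizes, and the non-parallel urban trend are all invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 8000
urban = rng.integers(0, 2, n).astype(float)     # T^T: 1 for urban stores
after = rng.integers(0, 2, n).astype(float)     # T^A: 1 for 2019 (post-upgrade year)
chicago = rng.integers(0, 2, n).astype(float)   # T^C: 1 for Chicago, 0 for Milwaukee
upgraded = urban * after * chicago              # only urban Chicago stores in 2019 are treated

# Hypothetical sales: city, location, and year effects, an urban-specific trend
# common to both cities (which the decoy nets out), and a true effect of 5
y = (100 + 8 * urban + 4 * after + 6 * chicago + 2 * urban * after
     + 3 * urban * chicago + 1 * after * chicago + 5 * upgraded + rng.normal(0, 3, n))

X = sm.add_constant(np.column_stack([
    urban, after, urban * after,
    chicago, urban * chicago, after * chicago, urban * after * chicago,
]))
fit = sm.OLS(y, X).fit()
print(fit.params[-1])   # coefficient on the triple interaction, roughly 5: the triple-difference estimate
```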
5.3.5 Other techniques

There are other quasi-experimental approaches that are less frequently used in practice, and there is a good reason: stakeholders (those who use the estimates) prefer to make decisions based on what is most convincing and intuitive. Hence RDD and Diff-in-Diff are the most commonly used quasi-experimental techniques. But there are other methods, mostly used in academia. Here are three. If you want to learn more about them, see the World Bank book Impact Evaluation in Practice.

First, we have Instrumental Variables, or Two-Stage Least Squares. This is a very specific type of quasi-experiment in which we find a variable (the instrument) that induces the treatment in an exogenous way and isn't directly related to the outcome of interest. We exploit that exogenous variation to measure the causal effect. In practice, it's difficult to find an instrument that is both exogenous and unrelated to the outcome of interest. And if you find one, it's just as difficult to convince people of its validity (i.e., that it satisfies the two assumptions).

Second, we have matching. Basically, we find matches for the treated subjects among a group of non-treated subjects. This is very intuitive but also very unconvincing. Think about the following question: why would there be apparently similar people, some of whom took the treatment while others didn't? Unless we have an experiment, there are good reasons to be skeptical about this method.

Third, we have propensity score matching. In a way, it's similar to matching, but the matching is done by grouping subjects in terms of their probability of being treated. This method is neither intuitive nor convincing. That's why it is rarely used to make decisions in practice. Its use is mostly confined to academic studies.

6 Additional topics

Even if you don't run regressions for a living, you're probably going to encounter them. Perhaps someone will try to persuade you by showing you some regression results. Here are a few concepts to keep in mind.

6.1 Robustness

We say a result is robust if it doesn't change much when we consider reasonable variations in our analysis. Those variations are called robustness checks. Here are some examples: we can try different samples with similar information (e.g., other periods or regions, or some subgroups); we can include or exclude different sets of controls; we can try different functional forms (e.g., linear vs. quadratic, cubic, logarithmic, interactions). There is no formal test for robustness. It's a qualitative judgment based on common sense.

Consider the following examples of results that wouldn't be robust. In the gym attendance prediction, the predicted values change when we use a previous cohort as the sample. In an RDD, the discontinuity estimate changes when we fit a quadratic polynomial in the running variable instead of a linear one. In a Diff-in-Diff model, the estimate of the causal effect changes when we add a triple or a quadruple difference. To show that a result is robust, practitioners should display (or at the very least mention) results with different samples, sets of controls, and functional forms. When they don't, be suspicious.
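As an illustration of what a battery of robustness checks might look like in practice, here is a minimal sketch that re-estimates a simulated RDD jump for several polynomial orders and several vicinities (bandwidths) around the cutoff. The data, the function name, and the grid of specifications are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 3000
x = rng.uniform(-200, 200, n)              # running variable, already centered at the cutoff
T = (x <= 0).astype(float)                 # hypothetical treatment below the cutoff
y = 50 - 0.03 * x - 0.00004 * x**2 + 3 * T + rng.normal(0, 2, n)   # true jump of 3

def rdd_jump(y, x, T, degree, bandwidth):
    """Estimated jump at the cutoff, using a polynomial of the given degree on
    each side and keeping only observations within +/- bandwidth of the cutoff."""
    keep = np.abs(x) <= bandwidth
    xs, Ts, ys = x[keep], T[keep], y[keep]
    cols = [xs**d for d in range(1, degree + 1)] + [Ts] + [Ts * xs**d for d in range(1, degree + 1)]
    fit = sm.OLS(ys, sm.add_constant(np.column_stack(cols))).fit()
    return fit.params[degree + 1]          # coefficient on T, after the constant and the polynomial terms

for degree in (1, 2, 3):
    for bandwidth in (50, 100, 200):
        print(degree, bandwidth, round(rdd_jump(y, x, T, degree, bandwidth), 2))
# If the estimated jumps stay close to each other across specifications, the result looks robust.
```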
6.2 Forensic analysis

Professional life involves reviewing the empirical analyses of other people: colleagues, intellectual adversaries, or academic researchers. To understand their results and determine how much we believe them, we should proceed by deconstructing the cake.

The first step is to determine what the analysis is trying to do. We gain a lot of mileage from having the goal clear. Is it trying to produce a description, a prediction, or a prescription? Remember that the way we judge the quality of a description is different from the way we judge the quality of a prediction or a prescription.

The second step is to try to prove the analysis wrong. Given its goal, is the empirical strategy credible? For instance, if we are looking at an RDD, ask yourself whether there was good compliance with the cutoff. If we are looking at a Diff-in-Diff analysis, ask yourself whether the parallelism assumption is reasonable. Do the results seem robust? Sometimes people cherry-pick models to get small p-values that favor their hypothesis (something known as p-hacking). Make sure you look at, or at least ask about, other model specifications (polynomials of higher order, interactions, etc.). You should also ask whether confidence and significance are properly calculated, reported, and interpreted. Remember that people have a difficult time interpreting confidence intervals.

The third step is to get under the hood and look at the regression equation. There is nothing as frustrating as discussing the results of a regression without looking at the actual model used. We must also know the exact method. There are multiple ways to fit a cloud of points with a linear structure. Although they all resemble what we saw, they are not identical. There are methods like Probit, Logit, or Maximum Likelihood. If you run into them and you have a chance, ask the authors of the analysis how they think the results would differ if they used the standard method of minimizing the sum of squared vertical distances, which is known as Ordinary Least Squares. Ask how the data were treated or manipulated. For instance, how are categorical variables and missing values handled? You aren't mathematicians, computer scientists, or statisticians. You are economists: use economic thinking to judge what makes sense.

6.3 Artificial intelligence, machine learning and big data

In recent times, concepts like big data, artificial intelligence, and machine learning have captured the imagination of many journalists and laypeople. Shouldn't we analyze those concepts instead of the method introduced by Galton over a century ago? The short answer is no. Those concepts are building blocks that help, but don't replace, the concepts you've learned in this course. Think of the classic example of artificial intelligence in which a computer must distinguish between images of Chihuahua dogs and blueberry muffins. Computers are better than humans at telling such differences or, more generally, at figuring out patterns (once they've been trained). But in the end, they are nothing but fancy models of numerical prediction. Additionally, there is the issue of what question we want to answer and whether we are interpreting its probabilistic aspects correctly. Put simply, how would we use the ability to tell Chihuahuas from blueberry muffins? I'm not trivializing the technological advance. But the same could be said about previous advances, such as personal computers or the Internet. A lot of data and huge computing power don't necessarily mean better empirical analysis.
In the hands of an analyst who doesn't know what he or she is doing, they may even be a bad thing. The methods and the data are the vehicle. The most important thing is knowing where we are going, so that we can drive that vehicle to our destination. It doesn't really matter how nice the vehicle is; if we don't know where we are going, we might as well not go anywhere.

Machine learning is another concept that has caught on. It simply means that we substitute an algorithm for what an analyst would do. The nice property is that the results produced feed back into the algorithm. Imagine we are trying to maximize the opening rate of email newsletters, and we can pick the time of day at which each person receives the email. How can we do that? We can design a sequence of RCTs and measure what works best. An analyst would run each RCT, look at the results, and decide whether to try a different time in the next RCT, and whether it should be earlier or later. By laying out these steps in an algorithm, we would be collecting and accumulating actionable information autonomously. It could even be a permanent process, continually trying new times, in case subscribers' preferred schedules change. A lot of common sense goes into the use of artificial intelligence and machine learning. They are complements to, not substitutes for, your skills.

Another concept that has received attention is big data. We could say that it refers to the wealth of information created in day-to-day transactions. You wake up in the morning. As you grab your phone, there is a record of when you started looking at it. If you went to Instagram or the New York Times app, you also leave records: which posts you saw and which ones you liked, what music you were listening to, which emails you answered first, which ones you discarded without opening. Then you take public transportation or a Divvy bike; that information is also stored. When you scan your ID at your office, there is a record of when you arrived, and the same goes for every time you go in or out. At lunch, you go to a restaurant and use a loyalty card. You order from Amazon. You ask Alexa to play some music. You ask Waze for directions to a friend's house, or you take an Uber. You go home and stream a movie, leaving a record of the shows you browsed. And so on. That is without counting your performance measures at work, your grades at school, your travel records, etc. In addition to all that, you can take DNA tests.

DNA tests are particularly interesting because they pose the problem of false positives. Imagine that we ask a large group of people which statement they agree with more: "I deeply dislike Tom Brady" or "I am a fan of Tom Brady." Let's code the answers so that 1 means being a fan of the former New England Patriots quarterback. Assume we also have binary information about 100 genes for each person. Remember that if we have 100 regressors, by sheer luck we can expect about 5 of them to be significant at 95% confidence. Similarly, we can expect about one coefficient to be significant at 99% confidence. We would call this the Brady-fan gene. However, that wouldn't be solid evidence, to say the least. Having many regressors brings the potential problem of false positives. If you dig enough, you are always going to find a pattern that looks highly unlikely.
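The 100-gene point can be checked with a small simulation; everything below is invented noise, so any "significant" gene is a false positive by construction.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n, k = 2000, 100
genes = rng.integers(0, 2, (n, k)).astype(float)   # 100 binary "genes", pure noise
fan = rng.integers(0, 2, n).astype(float)          # fan indicator, unrelated to the genes

fit = sm.OLS(fan, sm.add_constant(genes)).fit()
pvals = fit.pvalues[1:]                            # drop the constant
print((pvals < 0.05).sum())    # typically around 5 "significant" genes at 95% confidence
print((pvals < 0.01).sum())    # typically around 1 at 99% confidence
```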
Think of sports broadcasters when they say something like "this is the first time in Major League Baseball that a left-handed rookie pitcher has struck out three right-handed batters in a row after walking three left-handed batters in a postseason game." Did we just witness a highly unlikely, one-of-a-kind event? Or is it the case that, if we dig enough, we'd find that, in its own way, everything is a first? Adding regressors to a model is like slicing the categories more finely, which mechanically increases the chances of finding something "special."

In sum, when you hear the terms artificial intelligence, machine learning, or big data, keep in mind that those concepts are not substitutes for the methods you learned in this course. Rather, they are complements. If you have a good grasp of the concepts taught in this course, you will be in a better position to make the most out of artificial intelligence, machine learning, and big data.